Overview

The GeminiClient provides integration with Google’s Gemini language models, including the Gemini 3 (preview) and Gemini 2.5 series, with support for thinking configurations.

Installation

pip install graphiti-core[google-genai]

Basic Usage

from graphiti_core.llm_client import GeminiClient
from graphiti_core.llm_client.config import LLMConfig
from graphiti_core.prompts.models import Message
from pydantic import BaseModel

# Initialize client
client = GeminiClient(
    config=LLMConfig(
        api_key="your-google-api-key",
        model="gemini-3-flash-preview",
        temperature=1.0
    )
)

# Define response structure
class Summary(BaseModel):
    title: str
    main_points: list[str]
    word_count: int

# Generate structured response
messages = [
    Message(role="system", content="Summarize the following article."),
    Message(role="user", content="Long article text...")
]

response = await client.generate_response(
    messages=messages,
    response_model=Summary
)

Constructor

config (LLMConfig | None, default None)
Configuration object. If None, a default config is created.

cache (bool, default False)
Enable response caching (stored in ./llm_cache).

max_tokens (int | None, default None)
Maximum output tokens. If not set, uses model-specific defaults (see Supported Models below).

thinking_config (types.ThinkingConfig | None, default None)
Optional thinking configuration for Gemini 2.5+ models that support enhanced reasoning.

client (genai.Client | None, default None)
Optional pre-configured genai.Client instance. If not provided, one is created from config.

Supported Models

The client supports all Gemini models with model-specific max token limits:

Gemini 3 (Preview) - 64K output

  • gemini-3-pro-preview
  • gemini-3-flash-preview (default)

Gemini 2.5 - 64K output

  • gemini-2.5-pro
  • gemini-2.5-flash
  • gemini-2.5-flash-lite

Gemini 2.0 - 8K output

  • gemini-2.0-flash
  • gemini-2.0-flash-lite

Gemini 1.5 - 8K output

  • gemini-1.5-pro
  • gemini-1.5-flash
  • gemini-1.5-flash-8b

Max Tokens Resolution

As with the AnthropicClient, max tokens are resolved with the following precedence:
  1. Explicit parameter to generate_response()
  2. Instance max_tokens set during initialization
  3. Model-specific maximum from the mapping above
  4. Default fallback: 8192 tokens
# Automatic: uses 65536 for gemini-3-flash-preview
client = GeminiClient(
    config=LLMConfig(model="gemini-3-flash-preview")
)

# Override: use 32K for all requests
client = GeminiClient(
    config=LLMConfig(model="gemini-2.5-pro"),
    max_tokens=32000
)

# Per-request: 16K for this specific call
response = await client.generate_response(
    messages=messages,
    max_tokens=16384
)
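The precedence above can be sketched as a small helper. The mapping values mirror the limits listed in Supported Models, but the names here are illustrative, not graphiti-core's actual internals:

```python
# Hypothetical sketch of the documented max-token precedence.
MODEL_MAX_TOKENS = {
    'gemini-3-pro-preview': 65536,
    'gemini-3-flash-preview': 65536,
    'gemini-2.5-pro': 65536,
    'gemini-2.5-flash': 65536,
    'gemini-2.5-flash-lite': 65536,
    'gemini-2.0-flash': 8192,
    'gemini-1.5-pro': 8192,
}
DEFAULT_MAX_TOKENS = 8192  # final fallback

def resolve_max_tokens(request_max, instance_max, model):
    if request_max is not None:   # 1. explicit parameter to generate_response()
        return request_max
    if instance_max is not None:  # 2. instance max_tokens set at initialization
        return instance_max
    # 3. model-specific maximum, else 4. default fallback
    return MODEL_MAX_TOKENS.get(model, DEFAULT_MAX_TOKENS)
```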

Thinking Configuration

Gemini 2.5+ models support enhanced reasoning modes:
from google.genai import types

client = GeminiClient(
    config=LLMConfig(
        api_key="your-key",
        model="gemini-2.5-pro"
    ),
    thinking_config=types.ThinkingConfig(
        # Configure thinking depth and approach
    )
)
Only use thinking_config with Gemini 2.5+ models. Earlier models do not support this feature.
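Since earlier models reject the option, a version guard along these lines can help. This is a hypothetical helper, not part of graphiti-core:

```python
import re

def supports_thinking(model: str) -> bool:
    """Rough check: thinking configs apply to Gemini 2.5 and later."""
    m = re.match(r'gemini-(\d+)(?:\.(\d+))?', model)
    if not m:
        return False
    major = int(m.group(1))
    minor = int(m.group(2) or 0)
    return (major, minor) >= (2, 5)
```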

Structured Output via Schema

Gemini uses JSON schema for structured outputs:
class ExtractedData(BaseModel):
    """Schema for extraction"""
    entities: list[str]
    relationships: list[dict[str, str]]

# The client automatically:
# 1. Converts Pydantic model to JSON schema
# 2. Sets response_mime_type to 'application/json'
# 3. Validates response against schema

response = await client.generate_response(
    messages=messages,
    response_model=ExtractedData
)
Generation configuration:
generation_config = types.GenerateContentConfig(
    temperature=self.temperature,
    max_output_tokens=resolved_max_tokens,
    response_mime_type='application/json',
    response_schema=ExtractedData,  # Pydantic model
    system_instruction=system_prompt,
    thinking_config=self.thinking_config
)

Model Size Selection

Use the model_size parameter to select between the configured models:
from graphiti_core.llm_client.config import ModelSize

client = GeminiClient(
    config=LLMConfig(
        model="gemini-3-flash-preview",        # Medium
        small_model="gemini-2.5-flash-lite"    # Small
    )
)

# Uses gemini-2.5-flash-lite
response = await client.generate_response(
    messages=messages,
    model_size=ModelSize.small
)

# Uses gemini-3-flash-preview
response = await client.generate_response(
    messages=messages,
    model_size=ModelSize.medium
)

Error Handling

Safety Blocks

Gemini may block content for safety reasons:
try:
    response = await client.generate_response(messages=messages)
except Exception as e:
    if 'safety' in str(e).lower() or 'blocked' in str(e).lower():
        print(f"Content blocked by safety filters: {e}")
        # No retry - content was blocked
Safety information is extracted from the response:
# Example safety block details:
# "Response blocked by Gemini safety filters: HARM_CATEGORY_HARASSMENT: HIGH"
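The block message shown above could be assembled from the response's safety ratings roughly like this (a hypothetical helper, not the library's actual code):

```python
def format_safety_block(ratings):
    """Build a readable block message from (category, severity) pairs."""
    details = ', '.join(f'{category}: {severity}' for category, severity in ratings)
    return f'Response blocked by Gemini safety filters: {details}'
```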

Rate Limits

from graphiti_core.llm_client.errors import RateLimitError

try:
    response = await client.generate_response(messages=messages)
except RateLimitError as e:
    print(f"Rate limited: {e}")
    # No automatic retry - implement backoff
Rate limit detection checks for:
  • "rate limit" in the error message
  • "quota" in the error message
  • "resource_exhausted" in the error message
  • HTTP 429 status code
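These checks amount to a heuristic like the following sketch (illustrative only, not graphiti-core's actual detection code):

```python
def is_rate_limit_error(exc, status_code=None):
    """Heuristic rate-limit detection matching the rules above."""
    if status_code == 429:
        return True
    message = str(exc).lower()
    return any(
        marker in message
        for marker in ('rate limit', 'quota', 'resource_exhausted')
    )
```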

Automatic Retries

The client retries up to 2 times for:
  • JSON parsing errors
  • Validation errors
  • Transient API failures
Retry with error context:
# On validation error:
error_context = (
    f'The previous response attempt was invalid. '
    f'Error type: {e.__class__.__name__}. '
    f'Error details: {str(e)}. '
    f'Please try again with a valid response.'
)
messages.append(Message(role='user', content=error_context))
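Put together, the retry-with-context behavior looks roughly like this sketch (a synchronous stand-in where generate is any callable that may raise; plain dicts stand in for Message objects):

```python
def generate_with_retries(generate, messages, max_retries=2):
    """Call generate; on failure, append the error as a user message and retry."""
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return generate(messages)
        except Exception as e:
            last_error = e
            if attempt == max_retries:
                break  # retries exhausted
            error_context = (
                f'The previous response attempt was invalid. '
                f'Error type: {e.__class__.__name__}. '
                f'Error details: {e}. '
                f'Please try again with a valid response.'
            )
            messages.append({'role': 'user', 'content': error_context})
    raise last_error
```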

JSON Salvage

If output is truncated or malformed, the client attempts to salvage partial JSON:
# Check for a closing bracket at the end of the output (array first, then object)
array_match = re.search(r'\]\s*$', raw_output)
if array_match:
    return json.loads(raw_output[:array_match.end()])

obj_match = re.search(r'\}\s*$', raw_output)
if obj_match:
    return json.loads(raw_output[:obj_match.end()])
This is useful when responses are cut off due to max_tokens limits.

Token Usage Tracking

The client extracts token counts from Gemini’s response:
client = GeminiClient()

response = await client.generate_response(
    messages=messages,
    prompt_name="summarization"
)

# Token usage from response.usage_metadata
usage = client.token_tracker.get_usage()
print(f"Prompt tokens: {usage['input_tokens']}")
print(f"Candidate tokens: {usage['output_tokens']}")
print(f"Total: {usage['total_tokens']}")

System Instructions

System messages and schema instructions are combined:
messages = [
    Message(role="system", content="You are a data extraction assistant."),
    Message(role="user", content="Extract entities from: ...")
]

# System instruction includes:
# 1. Original system message
# 2. JSON schema output instructions (if response_model provided)
# 3. Formatting guidelines

system_prompt = (
    "You are a data extraction assistant.\n\n"
    "Output ONLY valid JSON matching this schema: {...}.\n"
    "Do not include any explanatory text before or after the JSON."
)

Example: Batch Processing

from graphiti_core.llm_client import GeminiClient
from graphiti_core.llm_client.config import LLMConfig, ModelSize
from graphiti_core.prompts.models import Message
from pydantic import BaseModel

class Classification(BaseModel):
    category: str
    confidence: float

client = GeminiClient(
    config=LLMConfig(
        model="gemini-3-flash-preview",
        small_model="gemini-2.5-flash-lite"
    )
)

items = ["text 1", "text 2", "text 3"]
results = []

for item in items:
    messages = [
        Message(role="system", content="Classify the text."),
        Message(role="user", content=item)
    ]
    
    result = await client.generate_response(
        messages=messages,
        response_model=Classification,
        model_size=ModelSize.small  # Use faster model
    )
    results.append(result)

Performance Tips

  1. Use Flash variants for speed: gemini-3-flash-preview is much faster than Pro
  2. Set appropriate max_tokens: Don’t request 64K if you only need 2K
  3. Use model_size=ModelSize.small for simple tasks
  4. Enable caching for repeated queries
  5. Monitor safety blocks: Adjust prompts if frequently blocked

Prompt Feedback

Check if your prompt was blocked:
# The client checks prompt_feedback.block_reason
if prompt_feedback and block_reason:
    raise Exception(f'Prompt blocked by Gemini: {block_reason}')
Common block reasons:
  • SAFETY: Content policy violation
  • OTHER: Other blocking reason
