Overview
The GeminiClient provides integration with Google’s Gemini language models, including the Gemini 3 Flash/Pro previews and the Gemini 2.5 series, with support for thinking configurations.
Installation
pip install graphiti-core[google-genai]
Basic Usage
from graphiti_core.llm_client import GeminiClient
from graphiti_core.llm_client.config import LLMConfig
from graphiti_core.prompts.models import Message
from pydantic import BaseModel
# Initialize client
client = GeminiClient(
    config=LLMConfig(
        api_key="your-google-api-key",
        model="gemini-3-flash-preview",
        temperature=1.0
    )
)

# Define response structure
class Summary(BaseModel):
    title: str
    main_points: list[str]
    word_count: int

# Generate structured response
messages = [
    Message(role="system", content="Summarize the following article."),
    Message(role="user", content="Long article text...")
]
response = await client.generate_response(
    messages=messages,
    response_model=Summary
)
Constructor
config
LLMConfig | None
default:"None"
Configuration object. If None, creates default config.
cache
bool
default:"False"
Enable response caching (stored in ./llm_cache).
max_tokens
int | None
default:"None"
Maximum output tokens. If not set, uses model-specific defaults (see table below).
thinking_config
types.ThinkingConfig | None
default:"None"
Optional thinking configuration for Gemini 2.5+ models that support enhanced reasoning.
client
genai.Client | None
default:"None"
Optional pre-configured genai.Client instance. If not provided, creates one from config.
Supported Models
The client supports all Gemini models with model-specific max token limits:
Gemini 3 (Preview) - 64K output
gemini-3-pro-preview
gemini-3-flash-preview (default)
Gemini 2.5 - 64K output
gemini-2.5-pro
gemini-2.5-flash
gemini-2.5-flash-lite
Gemini 2.0 - 8K output
gemini-2.0-flash
gemini-2.0-flash-lite
Gemini 1.5 - 8K output
gemini-1.5-pro
gemini-1.5-flash
gemini-1.5-flash-8b
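The limits above can be collapsed into a simple lookup. This dict is illustrative (graphiti-core keeps its own internal table); 64K and 8K correspond to 65536 and 8192 tokens, matching the resolution rules below:

```python
# Illustrative mapping of model name -> max output tokens, per the list above.
GEMINI_MAX_OUTPUT_TOKENS = {
    "gemini-3-pro-preview": 65536,
    "gemini-3-flash-preview": 65536,
    "gemini-2.5-pro": 65536,
    "gemini-2.5-flash": 65536,
    "gemini-2.5-flash-lite": 65536,
    "gemini-2.0-flash": 8192,
    "gemini-2.0-flash-lite": 8192,
    "gemini-1.5-pro": 8192,
    "gemini-1.5-flash": 8192,
    "gemini-1.5-flash-8b": 8192,
}

def max_output_tokens(model: str, default: int = 8192) -> int:
    """Look up a model's max output tokens, falling back to 8192 for unknown models."""
    return GEMINI_MAX_OUTPUT_TOKENS.get(model, default)
```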
Max Tokens Resolution
Similar to AnthropicClient, max tokens are resolved with the following precedence:
1. Explicit max_tokens parameter passed to generate_response()
2. Instance max_tokens set during initialization
3. Model-specific maximum from the mapping above
4. Default fallback: 8192 tokens
# Automatic: uses 65536 for gemini-3-flash-preview
client = GeminiClient(
    config=LLMConfig(model="gemini-3-flash-preview")
)

# Override: use 32K for all requests
client = GeminiClient(
    config=LLMConfig(model="gemini-2.5-pro"),
    max_tokens=32000
)

# Per-request: 16K for this specific call
response = await client.generate_response(
    messages=messages,
    max_tokens=16384
)
Thinking Configuration
Gemini 2.5+ models support enhanced reasoning modes:
from google.genai import types
client = GeminiClient(
    config=LLMConfig(
        api_key="your-key",
        model="gemini-2.5-pro"
    ),
    thinking_config=types.ThinkingConfig(
        # Configure thinking depth and approach
    )
)
Only use thinking_config with Gemini 2.5+ models. Earlier models do not support this feature.
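As a concrete sketch, the google-genai SDK's ThinkingConfig accepts fields such as thinking_budget and include_thoughts; field availability varies by SDK version, and the values below are illustrative:

```python
from google.genai import types

# Illustrative values - tune the budget to your workload
thinking = types.ThinkingConfig(
    thinking_budget=1024,    # max tokens the model may spend on reasoning
    include_thoughts=False,  # don't return thought summaries in the response
)
```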
Structured Output via Schema
Gemini uses JSON schema for structured outputs:
class ExtractedData(BaseModel):
    """Schema for extraction"""
    entities: list[str]
    relationships: list[dict[str, str]]

# The client automatically:
# 1. Converts the Pydantic model to a JSON schema
# 2. Sets response_mime_type to 'application/json'
# 3. Validates the response against the schema
response = await client.generate_response(
    messages=messages,
    response_model=ExtractedData
)
Generation configuration:
generation_config = types.GenerateContentConfig(
    temperature=self.temperature,
    max_output_tokens=resolved_max_tokens,
    response_mime_type='application/json',
    response_schema=ExtractedData,  # Pydantic model
    system_instruction=system_prompt,
    thinking_config=self.thinking_config
)
Model Size Selection
Use model_size parameter to automatically select between models:
client = GeminiClient(
    config=LLMConfig(
        model="gemini-3-flash-preview",       # Medium
        small_model="gemini-2.5-flash-lite"   # Small
    )
)

# Uses gemini-2.5-flash-lite
response = await client.generate_response(
    messages=messages,
    model_size=ModelSize.small
)

# Uses gemini-3-flash-preview
response = await client.generate_response(
    messages=messages,
    model_size=ModelSize.medium
)
Error Handling
Safety Blocks
Gemini may block content for safety reasons:
try:
    response = await client.generate_response(messages=messages)
except Exception as e:
    if 'safety' in str(e).lower() or 'blocked' in str(e).lower():
        print(f"Content blocked by safety filters: {e}")
        # No retry - content was blocked
Safety information is extracted from the response:
# Example safety block details:
# "Response blocked by Gemini safety filters: HARM_CATEGORY_HARASSMENT: HIGH"
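The string checks above can be wrapped in a tiny helper (illustrative, not part of the client API):

```python
def is_safety_block(error: Exception) -> bool:
    """Heuristic: treat errors mentioning safety or blocking as content blocks."""
    message = str(error).lower()
    return 'safety' in message or 'blocked' in message
```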
Rate Limits
from graphiti_core.llm_client.errors import RateLimitError

try:
    response = await client.generate_response(messages=messages)
except RateLimitError as e:
    print(f"Rate limited: {e}")
    # No automatic retry - implement backoff
Rate limit detection checks for:
- "rate limit" in the error message
- "quota" in the error message
- "resource_exhausted" in the error message
- HTTP 429 status code
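Since the client does not retry rate-limited requests, callers need their own backoff. A minimal sketch with exponential delays and full jitter (the wrapper and delay schedule are illustrative, not part of graphiti-core):

```python
import asyncio
import random

def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Exponential backoff schedule, capped at `cap` seconds."""
    return [min(cap, base * 2**attempt) for attempt in range(retries)]

async def with_backoff(call, retries: int = 4):
    """Retry an async call on RateLimitError, sleeping between attempts."""
    from graphiti_core.llm_client.errors import RateLimitError
    for delay in backoff_delays(retries):
        try:
            return await call()
        except RateLimitError:
            await asyncio.sleep(random.uniform(0, delay))  # full jitter
    return await call()  # final attempt propagates the error
```

Usage: `await with_backoff(lambda: client.generate_response(messages=messages))`.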
Automatic Retries
The client retries up to 2 times for:
- JSON parsing errors
- Validation errors
- Transient API failures
Retry with error context:
# On validation error:
error_context = (
    f'The previous response attempt was invalid. '
    f'Error type: {e.__class__.__name__}. '
    f'Error details: {str(e)}. '
    f'Please try again with a valid response.'
)
messages.append(Message(role='user', content=error_context))
JSON Salvage
If output is truncated or malformed, the client attempts to salvage partial JSON:
# Looks for last valid closing bracket
array_match = re.search(r'\]\s*$', raw_output)
if array_match:
    return json.loads(raw_output[:array_match.end()])
obj_match = re.search(r'\}\s*$', raw_output)
if obj_match:
    return json.loads(raw_output[:obj_match.end()])
This is useful when responses are cut off due to max_tokens limits.
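A slightly more forgiving standalone variant of the same idea, trimming to the last closing bracket and retrying (illustrative; the client's internal logic is the regex snippet above):

```python
import json

def salvage_json(raw_output: str):
    """Best-effort parse of possibly malformed JSON output: try as-is,
    then trim to the last closing bracket/brace and retry."""
    try:
        return json.loads(raw_output)
    except json.JSONDecodeError:
        pass
    for close_ch in (']', '}'):
        end = raw_output.rfind(close_ch)
        if end != -1:
            try:
                return json.loads(raw_output[:end + 1])
            except json.JSONDecodeError:
                continue
    return None
```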
Token Usage Tracking
The client extracts token counts from Gemini’s response:
client = GeminiClient()
response = await client.generate_response(
    messages=messages,
    prompt_name="summarization"
)

# Token usage from response.usage_metadata
usage = client.token_tracker.get_usage()
print(f"Prompt tokens: {usage['input_tokens']}")
print(f"Candidate tokens: {usage['output_tokens']}")
print(f"Total: {usage['total_tokens']}")
System Instructions
System messages and schema instructions are combined:
messages = [
    Message(role="system", content="You are a data extraction assistant."),
    Message(role="user", content="Extract entities from: ...")
]

# The combined system instruction includes:
# 1. The original system message
# 2. JSON schema output instructions (if response_model is provided)
# 3. Formatting guidelines
system_prompt = (
    "You are a data extraction assistant.\n\n"
    "Output ONLY valid JSON matching this schema: {...}.\n"
    "Do not include any explanatory text before or after the JSON."
)
Example: Batch Processing
from graphiti_core.llm_client import GeminiClient
from graphiti_core.llm_client.config import LLMConfig, ModelSize
from graphiti_core.prompts.models import Message
from pydantic import BaseModel

class Classification(BaseModel):
    category: str
    confidence: float

client = GeminiClient(
    config=LLMConfig(
        model="gemini-3-flash-preview",
        small_model="gemini-2.5-flash-lite"
    )
)

items = ["text 1", "text 2", "text 3"]
results = []

for item in items:
    messages = [
        Message(role="system", content="Classify the text."),
        Message(role="user", content=item)
    ]
    result = await client.generate_response(
        messages=messages,
        response_model=Classification,
        model_size=ModelSize.small  # Use faster model
    )
    results.append(result)
Best Practices
- Use Flash variants for speed: gemini-3-flash-preview is much faster than Pro
- Set appropriate max_tokens: don’t request 64K if you only need 2K
- Use model_size=ModelSize.small for simple tasks
- Enable caching for repeated queries
- Monitor safety blocks: adjust prompts if frequently blocked
Prompt Feedback
Check if your prompt was blocked:
# The client checks prompt_feedback.block_reason
if prompt_feedback and prompt_feedback.block_reason:
    raise Exception(f'Prompt blocked by Gemini: {prompt_feedback.block_reason}')
Common block reasons:
- SAFETY: Content policy violation
- OTHER: Other blocking reason