Overview

The GeminiClient provides integration with Google’s Gemini language models, including the Gemini 3 (preview) and Gemini 2.5 series, with support for thinking configurations.

Installation

pip install graphiti-core[google-genai]

Basic Usage

from graphiti_core.llm_client import GeminiClient
from graphiti_core.llm_client.config import LLMConfig
from graphiti_core.prompts.models import Message
from pydantic import BaseModel

# Initialize client
client = GeminiClient(
    config=LLMConfig(
        api_key="your-google-api-key",
        model="gemini-3-flash-preview",
        temperature=1.0
    )
)

# Define response structure
class Summary(BaseModel):
    title: str
    main_points: list[str]
    word_count: int

# Generate structured response
messages = [
    Message(role="system", content="Summarize the following article."),
    Message(role="user", content="Long article text...")
]

response = await client.generate_response(
    messages=messages,
    response_model=Summary
)

Constructor

config (LLMConfig | None, default None)
Configuration object. If None, a default config is created.

cache (bool, default False)
Enable response caching (stored in ./llm_cache).

max_tokens (int | None, default None)
Maximum output tokens. If not set, uses model-specific defaults (see Supported Models below).

thinking_config (types.ThinkingConfig | None, default None)
Optional thinking configuration for Gemini 2.5+ models that support enhanced reasoning.

client (genai.Client | None, default None)
Optional pre-configured genai.Client instance. If not provided, one is created from config.

Supported Models

The client supports all Gemini models with model-specific max token limits:

Gemini 3 (Preview) - 64K output

  • gemini-3-pro-preview
  • gemini-3-flash-preview (default)

Gemini 2.5 - 64K output

  • gemini-2.5-pro
  • gemini-2.5-flash
  • gemini-2.5-flash-lite

Gemini 2.0 - 8K output

  • gemini-2.0-flash
  • gemini-2.0-flash-lite

Gemini 1.5 - 8K output

  • gemini-1.5-pro
  • gemini-1.5-flash
  • gemini-1.5-flash-8b

Max Tokens Resolution

As with the AnthropicClient, max tokens are resolved with the following precedence:
  1. Explicit parameter to generate_response()
  2. Instance max_tokens set during initialization
  3. Model-specific maximum from the mapping above
  4. Default fallback: 8192 tokens
# Automatic: uses 65536 for gemini-3-flash-preview
client = GeminiClient(
    config=LLMConfig(model="gemini-3-flash-preview")
)

# Override: use 32K for all requests
client = GeminiClient(
    config=LLMConfig(model="gemini-2.5-pro"),
    max_tokens=32000
)

# Per-request: 16K for this specific call
response = await client.generate_response(
    messages=messages,
    max_tokens=16384
)
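The precedence above can be sketched as a small helper. The mapping values mirror the limits listed in Supported Models, but the names here are illustrative, not graphiti-core's actual internals:

```python
# Hypothetical sketch of the documented max-token precedence.
MODEL_MAX_TOKENS = {
    'gemini-3-pro-preview': 65536,
    'gemini-3-flash-preview': 65536,
    'gemini-2.5-pro': 65536,
    'gemini-2.5-flash': 65536,
    'gemini-2.5-flash-lite': 65536,
    'gemini-2.0-flash': 8192,
    'gemini-1.5-pro': 8192,
}
DEFAULT_MAX_TOKENS = 8192  # final fallback

def resolve_max_tokens(request_max, instance_max, model):
    if request_max is not None:   # 1. explicit parameter to generate_response()
        return request_max
    if instance_max is not None:  # 2. instance max_tokens set at initialization
        return instance_max
    # 3. model-specific maximum, else 4. default fallback
    return MODEL_MAX_TOKENS.get(model, DEFAULT_MAX_TOKENS)
```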

Thinking Configuration

Gemini 2.5+ models support enhanced reasoning modes:
from google.genai import types

client = GeminiClient(
    config=LLMConfig(
        api_key="your-key",
        model="gemini-2.5-pro"
    ),
    thinking_config=types.ThinkingConfig(
        # Configure thinking depth and approach
    )
)
Only use thinking_config with Gemini 2.5+ models. Earlier models do not support this feature.
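Since earlier models reject the option, a version guard along these lines can help. This is a hypothetical helper, not part of graphiti-core:

```python
import re

def supports_thinking(model: str) -> bool:
    """Rough check: thinking configs apply to Gemini 2.5 and later."""
    m = re.match(r'gemini-(\d+)(?:\.(\d+))?', model)
    if not m:
        return False
    major = int(m.group(1))
    minor = int(m.group(2) or 0)
    return (major, minor) >= (2, 5)
```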

Structured Output via Schema

Gemini uses JSON schema for structured outputs:
class ExtractedData(BaseModel):
    """Schema for extraction"""
    entities: list[str]
    relationships: list[dict[str, str]]

# The client automatically:
# 1. Converts Pydantic model to JSON schema
# 2. Sets response_mime_type to 'application/json'
# 3. Validates response against schema

response = await client.generate_response(
    messages=messages,
    response_model=ExtractedData
)
Generation configuration:
generation_config = types.GenerateContentConfig(
    temperature=self.temperature,
    max_output_tokens=resolved_max_tokens,
    response_mime_type='application/json',
    response_schema=ExtractedData,  # Pydantic model
    system_instruction=system_prompt,
    thinking_config=self.thinking_config
)

Model Size Selection

Use the model_size parameter to select between the configured models:
from graphiti_core.llm_client.config import ModelSize

client = GeminiClient(
    config=LLMConfig(
        model="gemini-3-flash-preview",        # Medium
        small_model="gemini-2.5-flash-lite"    # Small
    )
)

# Uses gemini-2.5-flash-lite
response = await client.generate_response(
    messages=messages,
    model_size=ModelSize.small
)

# Uses gemini-3-flash-preview
response = await client.generate_response(
    messages=messages,
    model_size=ModelSize.medium
)

Error Handling

Safety Blocks

Gemini may block content for safety reasons:
try:
    response = await client.generate_response(messages=messages)
except Exception as e:
    if 'safety' in str(e).lower() or 'blocked' in str(e).lower():
        print(f"Content blocked by safety filters: {e}")
        # No retry - content was blocked
Safety information is extracted from the response:
# Example safety block details:
# "Response blocked by Gemini safety filters: HARM_CATEGORY_HARASSMENT: HIGH"
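The block message shown above could be assembled from the response's safety ratings roughly like this (a hypothetical helper, not the library's actual code):

```python
def format_safety_block(ratings):
    """Build a readable block message from (category, severity) pairs."""
    details = ', '.join(f'{category}: {severity}' for category, severity in ratings)
    return f'Response blocked by Gemini safety filters: {details}'
```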

Rate Limits

from graphiti_core.llm_client.errors import RateLimitError

try:
    response = await client.generate_response(messages=messages)
except RateLimitError as e:
    print(f"Rate limited: {e}")
    # No automatic retry - implement backoff
Rate limit detection checks for:
  • "rate limit" in the error message
  • "quota" in the error message
  • "resource_exhausted" in the error message
  • HTTP 429 status code
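These checks amount to a heuristic like the following sketch (illustrative only, not graphiti-core's actual detection code):

```python
def is_rate_limit_error(exc, status_code=None):
    """Heuristic rate-limit detection matching the rules above."""
    if status_code == 429:
        return True
    message = str(exc).lower()
    return any(
        marker in message
        for marker in ('rate limit', 'quota', 'resource_exhausted')
    )
```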

Automatic Retries

The client retries up to 2 times for:
  • JSON parsing errors
  • Validation errors
  • Transient API failures
Retry with error context:
# On validation error:
error_context = (
    f'The previous response attempt was invalid. '
    f'Error type: {e.__class__.__name__}. '
    f'Error details: {str(e)}. '
    f'Please try again with a valid response.'
)
messages.append(Message(role='user', content=error_context))
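Put together, the retry-with-context behavior looks roughly like this sketch (a synchronous stand-in where generate is any callable that may raise; plain dicts stand in for Message objects):

```python
def generate_with_retries(generate, messages, max_retries=2):
    """Call generate; on failure, append the error as a user message and retry."""
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return generate(messages)
        except Exception as e:
            last_error = e
            if attempt == max_retries:
                break  # retries exhausted
            error_context = (
                f'The previous response attempt was invalid. '
                f'Error type: {e.__class__.__name__}. '
                f'Error details: {e}. '
                f'Please try again with a valid response.'
            )
            messages.append({'role': 'user', 'content': error_context})
    raise last_error
```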

JSON Salvage

If output is truncated or malformed, the client attempts to salvage partial JSON:
# Check for a closing bracket at the end of the output (array first, then object)
array_match = re.search(r'\]\s*$', raw_output)
if array_match:
    return json.loads(raw_output[:array_match.end()])

obj_match = re.search(r'\}\s*$', raw_output)
if obj_match:
    return json.loads(raw_output[:obj_match.end()])
This is useful when responses are cut off due to max_tokens limits.

Token Usage Tracking

The client extracts token counts from Gemini’s response:
client = GeminiClient()

response = await client.generate_response(
    messages=messages,
    prompt_name="summarization"
)

# Token usage from response.usage_metadata
usage = client.token_tracker.get_usage()
print(f"Prompt tokens: {usage['input_tokens']}")
print(f"Candidate tokens: {usage['output_tokens']}")
print(f"Total: {usage['total_tokens']}")

System Instructions

System messages and schema instructions are combined:
messages = [
    Message(role="system", content="You are a data extraction assistant."),
    Message(role="user", content="Extract entities from: ...")
]

# System instruction includes:
# 1. Original system message
# 2. JSON schema output instructions (if response_model provided)
# 3. Formatting guidelines

system_prompt = (
    "You are a data extraction assistant.\n\n"
    "Output ONLY valid JSON matching this schema: {...}.\n"
    "Do not include any explanatory text before or after the JSON."
)

Example: Batch Processing

from graphiti_core.llm_client import GeminiClient
from graphiti_core.llm_client.config import LLMConfig, ModelSize
from graphiti_core.prompts.models import Message
from pydantic import BaseModel

class Classification(BaseModel):
    category: str
    confidence: float

client = GeminiClient(
    config=LLMConfig(
        model="gemini-3-flash-preview",
        small_model="gemini-2.5-flash-lite"
    )
)

items = ["text 1", "text 2", "text 3"]
results = []

for item in items:
    messages = [
        Message(role="system", content="Classify the text."),
        Message(role="user", content=item)
    ]
    
    result = await client.generate_response(
        messages=messages,
        response_model=Classification,
        model_size=ModelSize.small  # Use faster model
    )
    results.append(result)

Performance Tips

  1. Use Flash variants for speed: gemini-3-flash-preview is much faster than Pro
  2. Set appropriate max_tokens: Don’t request 64K if you only need 2K
  3. Use model_size=ModelSize.small for simple tasks
  4. Enable caching for repeated queries
  5. Monitor safety blocks: Adjust prompts if frequently blocked

Prompt Feedback

Check if your prompt was blocked:
# The client checks prompt_feedback.block_reason
if prompt_feedback and block_reason:
    raise Exception(f'Prompt blocked by Gemini: {block_reason}')
Common block reasons:
  • SAFETY: Content policy violation
  • OTHER: Other blocking reason
