JSON indexing enables you to index structured JSON documents and query by nested field paths while leveraging vector similarity for text content.
Overview
Many documents contain structured data beyond simple text: product catalogs with specifications, user profiles with preferences, articles with nested metadata. JSON indexing lets you search both the semantic content and the structured fields.
It pairs vector search, which matches on meaning, with structured queries, which match exact field values.
How it works
1. Document ingestion: parse JSON documents and extract text content.
2. Vector embedding: generate embeddings for searchable text fields.
3. Metadata storage: store the full JSON structure in document metadata.
4. Query execution: search by vector similarity and filter by JSON paths.
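The four steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: `toy_embed`, `ingest`, and `search` are hypothetical names, the bag-of-words vector stands in for a real embedding model, and the structured filter is passed as a plain predicate over the stored metadata.

```python
import json
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def ingest(raw_json, text_fields):
    doc = json.loads(raw_json)                                   # 1. parse the JSON document
    text = " ".join(str(doc[f]) for f in text_fields if f in doc)
    return {"vector": toy_embed(text),                           # 2. embed the text fields
            "metadata": doc}                                     # 3. keep the full JSON as metadata

def search(index, query, where):
    query_vector = toy_embed(query)
    candidates = [r for r in index if where(r["metadata"])]      # 4a. structured filter
    return sorted(candidates,                                    # 4b. rank by similarity
                  key=lambda r: cosine(query_vector, r["vector"]),
                  reverse=True)

index = [
    ingest('{"title": "Fast laptop", "product": {"specs": {"ram": 32}}}', ["title"]),
    ingest('{"title": "Budget laptop", "product": {"specs": {"ram": 8}}}', ["title"]),
    ingest('{"title": "Fast desktop tower", "product": {"specs": {"ram": 64}}}', ["title"]),
]
hits = search(index, "fast laptop", where=lambda m: m["product"]["specs"]["ram"] >= 16)
print([h["metadata"]["title"] for h in hits])  # ['Fast laptop', 'Fast desktop tower']
```

The budget laptop is dropped by the structured filter before ranking, so similarity only orders the documents that already satisfy the field constraint.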
JSON path filtering
Use dot notation to filter by nested JSON fields:
from vectordb.langchain.utils.filters import DocumentFilter

# Filter by nested path
filtered_docs = DocumentFilter.filter_by_metadata_json(
    documents=results,
    json_path="author.name",
    value="John Doe",
    operator="equals",
)

# Filter by deeper nesting
filtered_docs = DocumentFilter.filter_by_metadata_json(
    documents=results,
    json_path="product.specs.cpu.cores",
    value=8,
    operator="gte",
)
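To make the dot notation concrete, here is a minimal sketch of how a path such as "product.specs.cpu.cores" could be resolved against nested metadata. `resolve_json_path` is a hypothetical helper written for illustration; the library's own resolution logic may differ, for example in how it treats missing keys.

```python
from typing import Any

def resolve_json_path(metadata: dict, json_path: str, default: Any = None) -> Any:
    """Walk a dot-notation path through nested dicts (hypothetical helper).

    Returns `default` as soon as a key is missing or a non-dict is reached.
    """
    current: Any = metadata
    for key in json_path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            return default
    return current

metadata = {
    "author": {"name": "John Doe"},
    "product": {"specs": {"cpu": {"cores": 8}}},
}

print(resolve_json_path(metadata, "author.name"))              # John Doe
print(resolve_json_path(metadata, "product.specs.cpu.cores"))  # 8
print(resolve_json_path(metadata, "product.specs.gpu"))        # None
```

Returning a default for a missing path, rather than raising, keeps a filter from failing on documents that lack an optional field.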
Supported operators
JSON path filtering supports all standard metadata operators:
| Operator | Description | Example |
| --- | --- | --- |
| equals | Exact match | author.country = "USA" |
| contains | Substring (case-insensitive) | author.bio contains "engineer" |
| startswith | Prefix (case-insensitive) | product.sku startswith "TECH" |
| gt / gte | Numeric comparison | specs.ram >= 16 |
| lt / lte | Numeric comparison | price < 1000 |
| in | Value in list | tags in ["featured", "new"] |
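The operator semantics listed above can be approximated with a small dispatch table. This is a sketch under assumptions, not the library's code: `OPERATORS` and `matches` are hypothetical names, `in` is assumed to test a scalar field against a list of allowed values, and a missing field or a type mismatch is assumed to count as a non-match.

```python
import operator as op

def _lower(value):
    return str(value).lower()

# Hypothetical mapping from operator names to comparison functions.
OPERATORS = {
    "equals": op.eq,
    "contains": lambda field, value: _lower(value) in _lower(field),
    "startswith": lambda field, value: _lower(field).startswith(_lower(value)),
    "gt": op.gt,
    "gte": op.ge,
    "lt": op.lt,
    "lte": op.le,
    "in": lambda field, value: field in value,
}

def matches(field_value, operator_name, value):
    """Return True if field_value satisfies the named operator against value."""
    if field_value is None:
        return False  # assumption: a missing field never matches
    try:
        return bool(OPERATORS[operator_name](field_value, value))
    except TypeError:
        return False  # assumption: mismatched types (e.g. str vs int) never match

print(matches("Senior Engineer at Acme", "contains", "engineer"))  # True
print(matches("tech-2024-001", "startswith", "TECH"))              # True
print(matches(16, "gte", 16))                                      # True
print(matches("featured", "in", ["featured", "new"]))              # True
print(matches("32", "gte", 16))                                    # False
```

The last call shows why consistent types matter: a RAM value stored as the string "32" cannot be compared numerically, so the filter silently excludes the document.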
Document structure
Structure your JSON documents with searchable text and filterable metadata:
{
  "title": "High-Performance Laptop",
  "description": "Professional workstation with advanced specifications",
  "author": {
    "name": "Tech Reviews Inc",
    "email": "[email protected]",
    "verified": true
  },
  "product": {
    "sku": "TECH-2024-001",
    "category": "electronics",
    "specs": {
      "cpu": {
        "model": "Intel i9",
        "cores": 12
      },
      "ram": 32,
      "storage": 1000
    }
  },
  "price": 2499,
  "tags": ["featured", "professional", "high-end"]
}
Configuration
Configure JSON indexing in your pipeline:
collection:
  name: "json_indexed"
embedder:
  model: "sentence-transformers/all-MiniLM-L6-v2"
schema:
  text_fields:
    - "title"
    - "description"
  metadata_fields:
    - "author"
    - "product"
    - "price"
    - "tags"
Usage example
from vectordb.langchain.json_indexing.search.weaviate import (
    WeaviateJsonSearchPipeline,
)

# `config` is the pipeline configuration described under Configuration
pipeline = WeaviateJsonSearchPipeline(config)

# Search with nested JSON filter
results = pipeline.search(
    query="high performance laptops",
    filters={
        "product.specs.cpu.cores": {"$gte": 8},
        "price": {"$lt": 3000},
    },
    top_k=10,
)

# Access nested metadata
for doc in results["documents"]:
    product = doc.metadata["product"]
    print(f"SKU: {product['sku']}")
    print(f"CPU: {product['specs']['cpu']['model']}")
    print(f"Price: ${doc.metadata['price']}")
Indexing JSON documents
Index JSON documents with automatic metadata extraction:
import json

from langchain_core.documents import Document
from vectordb.langchain.json_indexing.indexing.weaviate import (
    WeaviateJsonIndexingPipeline,
)

# `config` is the pipeline configuration described under Configuration
pipeline = WeaviateJsonIndexingPipeline(config)

# Load JSON documents (the file is expected to hold a list of objects)
with open("products.json") as f:
    products = json.load(f)

# Convert to LangChain documents: text fields become the content to embed,
# and the full JSON object is kept as metadata
documents = [
    Document(
        page_content=f"{p['title']} {p['description']}",
        metadata=p,
    )
    for p in products
]

# Index with embeddings
result = pipeline.index(documents)
print(f"Indexed {result['count']} documents")
Query patterns
Multi-condition filters
Combine multiple JSON path filters:
results = pipeline.search(
    query="professional workstation",
    filters={
        "$and": [
            {"author.verified": {"$eq": True}},
            {"product.specs.ram": {"$gte": 32}},
            {"product.category": {"$eq": "electronics"}},
        ]
    },
)
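To show what such a filter expresses, the sketch below evaluates a filter dictionary against one document's metadata in memory. It is an illustration only: `evaluate` and `lookup` are hypothetical helpers, only the operators that appear in this guide's search examples ($eq, $gte, $lt, $and) are covered, and sibling keys in one dictionary are assumed to be combined with AND. How a pipeline translates filters for a given database is not shown here.

```python
from functools import reduce

# Only the comparison operators used in this guide's search examples.
COMPARATORS = {
    "$eq": lambda actual, expected: actual == expected,
    "$gte": lambda actual, expected: actual >= expected,
    "$lt": lambda actual, expected: actual < expected,
}

def lookup(metadata, path):
    """Resolve a dot-notation path; returns None when any key is missing."""
    return reduce(
        lambda node, key: node.get(key) if isinstance(node, dict) else None,
        path.split("."),
        metadata,
    )

def evaluate(filters, metadata):
    """Check metadata against a filter dict (assumption: sibling keys are ANDed)."""
    for key, condition in filters.items():
        if key == "$and":
            if not all(evaluate(sub, metadata) for sub in condition):
                return False
            continue
        actual = lookup(metadata, key)
        if actual is None:
            return False  # assumption: a missing field fails the condition
        for name, expected in condition.items():
            if not COMPARATORS[name](actual, expected):
                return False
    return True

laptop = {
    "author": {"verified": True},
    "product": {"category": "electronics", "specs": {"ram": 32}},
    "price": 2499,
}
filters = {
    "$and": [
        {"author.verified": {"$eq": True}},
        {"product.specs.ram": {"$gte": 32}},
        {"product.category": {"$eq": "electronics"}},
    ]
}
print(evaluate(filters, laptop))                                   # True
print(evaluate({"product.specs.ram": {"$gte": 64}}, laptop))       # False
```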
Array field filtering
Filter documents by array membership:
# Find documents with specific tag
results = DocumentFilter.filter_by_metadata_json(
    documents=results,
    json_path="tags",
    value="featured",
    operator="in",
)
Nested object queries
Access deeply nested fields:
# Filter by CPU model
results = DocumentFilter.filter_by_metadata_json(
    documents=results,
    json_path="product.specs.cpu.model",
    value="Intel i9",
    operator="equals",
)
Database support
JSON indexing is available for each of the supported vector databases:
| Database | JSON Support | Nested Queries | Array Filters |
| --- | --- | --- | --- |
| Weaviate | ✓ | ✓ | ✓ |
| Qdrant | ✓ | ✓ | ✓ |
| Pinecone | ✓ | ✓ | ✓ |
| Milvus | ✓ | ✓ | ✓ |
| Chroma | ✓ | ✓ | ✓ |
Best practices
Design your schema
Separate searchable text fields from filterable metadata. Embed only the content that needs semantic search.
Index strategically
Most databases automatically index metadata fields. For large catalogs, confirm that the fields you filter on are actually indexed, since filtering on an unindexed field can slow queries.
Normalize data types
Use consistent types across documents (strings vs numbers, date formats) to avoid filter failures.
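A normalization pass before indexing is one way to enforce this. The sketch below is a minimal example under assumptions: `to_number`, `to_iso_date`, and `normalize` are hypothetical helpers, the `published` field is invented for illustration, and the accepted date formats are only examples.

```python
from datetime import datetime

def to_number(value):
    """Convert numeric strings such as "2499" or "12.5" to int or float."""
    if isinstance(value, str):
        text = value.strip()
        for cast in (int, float):
            try:
                return cast(text)
            except ValueError:
                continue
    return value  # already a number, or not numeric at all

def to_iso_date(value, formats=("%Y-%m-%d", "%d/%m/%Y")):
    """Rewrite a date string in any accepted format as ISO 8601 (YYYY-MM-DD)."""
    if isinstance(value, str):
        for fmt in formats:
            try:
                return datetime.strptime(value.strip(), fmt).date().isoformat()
            except ValueError:
                continue
    return value  # leave unrecognized values untouched

def normalize(doc, rules):
    """Return a copy of doc with each top-level field in rules converted."""
    return {key: rules[key](value) if key in rules else value
            for key, value in doc.items()}

# "published" is a hypothetical field used only for this example.
raw = {"title": "Laptop", "price": "2499", "published": "15/03/2024"}
clean = normalize(raw, {"price": to_number, "published": to_iso_date})
print(clean)  # {'title': 'Laptop', 'price': 2499, 'published': '2024-03-15'}
```

With prices stored as numbers everywhere, a filter such as price < 3000 behaves the same for every document instead of failing on the ones that stored a string.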
Limit nesting depth
While deeply nested paths work, flatter structures query faster and are easier to maintain.
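If a source document is nested more deeply than you want to query, one option is to flatten it before indexing. The sketch below assumes a hypothetical `flatten` helper and joins keys with an underscore so the flattened names are not confused with dot-notation paths; whether flattening suits a given database or schema is a design decision, not something the library prescribes.

```python
def flatten(data, parent="", sep="_"):
    """Collapse nested dicts into a single level with joined key names."""
    flat = {}
    for key, value in data.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value  # scalars and lists are kept as-is
    return flat

nested = {
    "product": {
        "sku": "TECH-2024-001",
        "specs": {"cpu": {"model": "Intel i9", "cores": 12}, "ram": 32},
    },
    "tags": ["featured", "professional"],
}
print(flatten(nested))
# {'product_sku': 'TECH-2024-001', 'product_specs_cpu_model': 'Intel i9',
#  'product_specs_cpu_cores': 12, 'product_specs_ram': 32,
#  'tags': ['featured', 'professional']}
```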