JSON indexing enables you to index structured JSON documents and query by nested field paths while leveraging vector similarity for text content.

Overview

Many documents contain structured data beyond simple text - product catalogs with specifications, user profiles with preferences, articles with nested metadata. JSON indexing lets you search both the semantic content and the structured fields.
It pairs vector search, which ranks documents by semantic similarity, with structured filters that match exact values at specific field paths.

How it works

  1. Document ingestion - Parse JSON documents and extract text content
  2. Vector embedding - Generate embeddings for searchable text fields
  3. Metadata storage - Store full JSON structure in document metadata
  4. Query execution - Search by vector similarity and filter by JSON paths
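
The four steps above can be sketched in plain Python. This is only an illustration of the flow, not the library's implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, and `ingest`, `search`, and the `where` predicate are hypothetical names introduced here.

```python
import json
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ingest(raw_json: str, text_fields: list) -> dict:
    doc = json.loads(raw_json)                          # 1. parse the JSON document
    text = " ".join(str(doc[f]) for f in text_fields if f in doc)
    return {"vector": embed(text),                      # 2. embed the text fields
            "metadata": doc}                            # 3. keep the full JSON as metadata


def search(index: list, query: str, where, top_k: int = 5) -> list:
    # 4. filter on metadata, then rank the survivors by similarity
    query_vector = embed(query)
    hits = [entry for entry in index if where(entry["metadata"])]
    hits.sort(key=lambda entry: cosine(query_vector, entry["vector"]), reverse=True)
    return hits[:top_k]
```

A call such as `search(index, "fast laptop", where=lambda m: m["author"]["name"] == "A")` returns only entries whose metadata passes the predicate, ordered by similarity to the query.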

JSON path filtering

Use dot notation to filter by nested JSON fields:
from vectordb.langchain.utils.filters import DocumentFilter

# Filter by nested path (`results` holds documents from an earlier search)
filtered_docs = DocumentFilter.filter_by_metadata_json(
    documents=results,
    json_path="author.name",
    value="John Doe",
    operator="equals"
)

# Filter by deeper nesting
filtered_docs = DocumentFilter.filter_by_metadata_json(
    documents=results,
    json_path="product.specs.cpu.cores",
    value=8,
    operator="gte"
)
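
To make the dot notation concrete, here is a minimal sketch of how a path such as "product.specs.cpu.cores" could be resolved against a metadata dictionary. `resolve_path` is a hypothetical helper written for this illustration; how `DocumentFilter` resolves paths internally is not shown on this page.

```python
from typing import Any

# Sentinel so that a stored None can be told apart from an absent key.
_MISSING = object()


def resolve_path(metadata: dict, json_path: str, default: Any = None) -> Any:
    """Walk a nested dict one dot-separated segment at a time."""
    current: Any = metadata
    for segment in json_path.split("."):
        if isinstance(current, dict):
            current = current.get(segment, _MISSING)
        else:
            # The path continues past a scalar or list, so it cannot match.
            current = _MISSING
        if current is _MISSING:
            return default
    return current
```

Returning a default instead of raising means a document that lacks the field simply fails the filter rather than aborting the whole query.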

Supported operators

JSON path filtering supports all standard metadata operators:
Operator | Description | Example
equals | Exact match | author.country = "USA"
contains | Substring (case-insensitive) | author.bio contains "engineer"
startswith | Prefix (case-insensitive) | product.sku startswith "TECH"
gt / gte | Numeric comparison | specs.ram >= 16
lt / lte | Numeric comparison | price < 1000
in | Value in list | tags in ["featured", "new"]
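
The operator semantics in the table can be expressed as a small dispatch table. The sketch below is an assumption about reasonable behavior, not the library's code: `OPERATORS` and `matches` are hypothetical names, and the choice to return False for non-numeric operands is a design decision made for this example.

```python
import operator
from typing import Any


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _numeric(compare):
    """Wrap a comparison so that non-numeric operands never match."""
    def check(actual: Any, expected: Any) -> bool:
        return _is_number(actual) and _is_number(expected) and compare(actual, expected)
    return check


OPERATORS = {
    "equals": lambda actual, expected: actual == expected,
    "contains": lambda actual, expected: isinstance(actual, str)
    and str(expected).lower() in actual.lower(),
    "startswith": lambda actual, expected: isinstance(actual, str)
    and actual.lower().startswith(str(expected).lower()),
    "gt": _numeric(operator.gt),
    "gte": _numeric(operator.ge),
    "lt": _numeric(operator.lt),
    "lte": _numeric(operator.le),
    # Follows the table: the field value must be a member of the supplied list.
    "in": lambda actual, expected: actual in expected,
}


def matches(actual: Any, op: str, expected: Any) -> bool:
    if op not in OPERATORS:
        raise ValueError(f"Unsupported operator: {op}")
    return OPERATORS[op](actual, expected)
```

Under these assumptions `matches("16", "gte", 16)` is False because the stored value is a string, which is one reason to keep field types consistent across documents.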

Document structure

Structure your JSON documents with searchable text and filterable metadata:
{
  "title": "High-Performance Laptop",
  "description": "Professional workstation with advanced specifications",
  "author": {
    "name": "Tech Reviews Inc",
    "email": "[email protected]",
    "verified": true
  },
  "product": {
    "sku": "TECH-2024-001",
    "category": "electronics",
    "specs": {
      "cpu": {
        "model": "Intel i9",
        "cores": 12
      },
      "ram": 32,
      "storage": 1000
    }
  },
  "price": 2499,
  "tags": ["featured", "professional", "high-end"]
}

Configuration

Configure JSON indexing in your pipeline:
collection:
  name: "json_indexed"
  
embedder:
  model: "sentence-transformers/all-MiniLM-L6-v2"

schema:
  text_fields:
    - "title"
    - "description"
  metadata_fields:
    - "author"
    - "product"
    - "price"
    - "tags"
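
As a rough illustration of what the `schema` section expresses, the sketch below splits one JSON document into the text to embed and the metadata to keep. `split_by_schema` is a hypothetical helper, and the schema is written as a Python dict mirroring the YAML above; whether the pipeline performs this split internally is not stated on this page.

```python
def split_by_schema(doc: dict, schema: dict) -> tuple:
    """Return (text_to_embed, metadata_to_store) for a single JSON document."""
    # Join the configured text fields, in order, skipping any that are absent.
    text = " ".join(str(doc[f]) for f in schema.get("text_fields", []) if f in doc)
    # Keep only the configured metadata fields, nested structure intact.
    metadata = {f: doc[f] for f in schema.get("metadata_fields", []) if f in doc}
    return text, metadata


schema = {
    "text_fields": ["title", "description"],
    "metadata_fields": ["author", "product", "price", "tags"],
}
```

Only the joined text would be sent to the embedder. The metadata keeps its nesting so that paths like "product.specs.ram" remain addressable.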

Usage example

from vectordb.langchain.json_indexing.search.weaviate import (
    WeaviateJsonSearchPipeline,
)

pipeline = WeaviateJsonSearchPipeline(config)

# Search with nested JSON filter
results = pipeline.search(
    query="high performance laptops",
    filters={
        "product.specs.cpu.cores": {"$gte": 8},
        "price": {"$lt": 3000}
    },
    top_k=10
)

# Access nested metadata
for doc in results["documents"]:
    product = doc.metadata["product"]
    print(f"SKU: {product['sku']}")
    print(f"CPU: {product['specs']['cpu']['model']}")
    print(f"Price: ${doc.metadata['price']}")

Indexing JSON documents

Index JSON documents with automatic metadata extraction:
from vectordb.langchain.json_indexing.indexing.weaviate import (
    WeaviateJsonIndexingPipeline,
)
from langchain_core.documents import Document
import json

pipeline = WeaviateJsonIndexingPipeline(config)

# Load JSON documents
with open("products.json") as f:
    products = json.load(f)

# Convert to LangChain documents
documents = [
    Document(
        page_content=f"{p['title']} {p['description']}",
        metadata=p
    )
    for p in products
]

# Index with embeddings
result = pipeline.index(documents)
print(f"Indexed {result['count']} documents")

Query patterns

Multi-condition filters

Combine multiple JSON path filters:
results = pipeline.search(
    query="professional workstation",
    filters={
        "$and": [
            {"author.verified": {"$eq": True}},
            {"product.specs.ram": {"$gte": 32}},
            {"product.category": {"$eq": "electronics"}}
        ]
    }
)
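
The filter dictionaries above can be read as predicates over a document's metadata. The following sketch evaluates only the forms that appear on this page (`$eq`, `$gt`, `$gte`, `$lt`, `$lte`, and `$and`). `matches_filter` and `lookup` are hypothetical names, and a real backend would normally translate such filters into its own query language rather than evaluate them in Python.

```python
import operator

COMPARATORS = {
    "$eq": operator.eq,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def lookup(metadata: dict, path: str):
    """Resolve a dot-notation path; return None if any segment is missing."""
    current = metadata
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def matches_filter(metadata: dict, filters: dict) -> bool:
    """Return True only if every condition in the filter dict holds."""
    for key, condition in filters.items():
        if key == "$and":
            if not all(matches_filter(metadata, sub) for sub in condition):
                return False
            continue
        actual = lookup(metadata, key)
        if actual is None:
            return False  # a missing field never matches
        for op, expected in condition.items():
            try:
                if not COMPARATORS[op](actual, expected):
                    return False
            except TypeError:
                return False  # incomparable types, e.g. str >= int
    return True
```

Top-level keys are combined with an implicit AND in this sketch, so the explicit `$and` list and a flat dictionary of conditions behave the same way.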

Array field filtering

Filter documents by array membership:
# Find documents with specific tag
results = DocumentFilter.filter_by_metadata_json(
    documents=results,
    json_path="tags",
    value="featured",
    operator="in"
)

Nested object queries

Access deeply nested fields:
# Filter by CPU model
results = DocumentFilter.filter_by_metadata_json(
    documents=results,
    json_path="product.specs.cpu.model",
    value="Intel i9",
    operator="equals"
)

Database support

JSON indexing is available for all vector databases:
Database | JSON Support | Nested Queries | Array Filters
Weaviate
Qdrant
Pinecone
Milvus
Chroma

Best practices

  1. Design your schema - Separate searchable text fields from filterable metadata. Embed only the content that needs semantic search.
  2. Index strategically - Most databases automatically index metadata fields. For large catalogs, verify field indexing for query performance.
  3. Normalize data types - Use consistent types across documents (strings vs numbers, date formats) to avoid filter failures.
  4. Limit nesting depth - While deeply nested paths work, flatter structures query faster and are easier to maintain.
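
One way to act on the advice about consistent types is to scan a batch of documents before indexing and report every field path that appears with more than one type. This is a minimal sketch; `inconsistent_paths` is a hypothetical helper and not part of the library.

```python
from collections import defaultdict


def _collect_types(value, prefix: str, seen: dict) -> None:
    """Record the type name observed at every leaf path of a nested dict."""
    if isinstance(value, dict):
        for key, child in value.items():
            _collect_types(child, f"{prefix}.{key}" if prefix else key, seen)
    else:
        seen[prefix].add(type(value).__name__)


def inconsistent_paths(documents: list) -> dict:
    """Map each dot-notation path to its type names when more than one occurs."""
    seen = defaultdict(set)
    for doc in documents:
        _collect_types(doc, "", seen)
    return {path: sorted(types) for path, types in seen.items() if len(types) > 1}
```

For example, a catalog where one record stores `"price": 2499` and another stores `"price": "2499"` would be reported under the path "price". A field that mixes int and float values is flagged as well, which may or may not matter for a given database.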

