Processing Overview
Content flows through multiple stages from source to searchable chunks. Each stage is specialized for a different content type while maintaining a consistent interface.

Strategy Pattern Architecture
The system uses the Strategy pattern to handle different content sources.

Code References:
- `src/scraper/types.ts` - ScraperStrategy interface
- `src/scraper/strategies/BaseScraperStrategy.ts` - Abstract base
- `src/scraper/strategies/*.ts` - Concrete implementations
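To make the pattern concrete, here is a minimal sketch of what such a strategy interface could look like. Only the type names come from the files above; the method signatures and the dispatcher are illustrative assumptions, not the project's actual API.

```typescript
// Hypothetical sketch of the Strategy pattern described above; the real
// ScraperStrategy interface in src/scraper/types.ts may differ.
interface ScrapedPage {
  url: string;
  content: string;
}

interface ScraperStrategy {
  // Can this strategy handle the given source URL or path?
  canHandle(source: string): boolean;
  // Fetch and process the source, emitting pages as they are discovered.
  scrape(source: string, onPage: (page: ScrapedPage) => void): Promise<void>;
}

// A dispatcher picks the first strategy that claims the source, which is
// what lets web, file, and registry sources share one entry point.
function selectStrategy(strategies: ScraperStrategy[], source: string): ScraperStrategy {
  const match = strategies.find((s) => s.canHandle(source));
  if (!match) throw new Error(`No strategy for source: ${source}`);
  return match;
}
```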
Content Sources
Web Scraper Strategy
Handles: HTTP/HTTPS URLs with JavaScript rendering support

Features:
- Playwright for JavaScript-heavy sites
- Automatic link discovery and crawling
- Scope filtering (same domain, path prefix; see the sketch below)
- Retry logic with exponential backoff
- Rate limiting and politeness delays
`src/scraper/strategies/WebScraperStrategy.ts`
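As an illustration of scope filtering, a same-domain, path-prefix check might look like the sketch below; `isInScope` is a hypothetical helper, not necessarily how WebScraperStrategy implements it.

```typescript
// Hypothetical scope filter: keep links on the same host, under the same
// path prefix as the starting URL (e.g. stay inside /docs/).
function isInScope(startUrl: string, candidate: string): boolean {
  const base = new URL(startUrl);
  const link = new URL(candidate, startUrl); // also resolves relative links
  if (link.hostname !== base.hostname) return false; // different domain
  const prefix = base.pathname.replace(/[^/]*$/, ""); // strip last segment
  return link.pathname.startsWith(prefix);
}
```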
Local File Strategy
Handles: Local filesystem access with directory traversal

Features:
- Recursive directory scanning
- File type filtering (`.md`, `.txt`, `.json`, etc.; sketched below)
- Git-aware (respects `.gitignore`)
- Symbolic link handling
- MIME type detection
`src/scraper/strategies/LocalFileStrategy.ts`
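A sketch of recursive scanning with extension filtering follows; the allowed-extension set and the generator shape are assumptions, and `.gitignore` and symlink handling are omitted for brevity.

```typescript
// Hypothetical recursive directory walk with file type filtering.
import { readdir } from "node:fs/promises";
import { extname, join } from "node:path";

const ALLOWED = new Set([".md", ".txt", ".json"]); // assumed filter list

async function* walk(dir: string): AsyncGenerator<string> {
  for (const entry of await readdir(dir, { withFileTypes: true })) {
    const path = join(dir, entry.name);
    if (entry.isDirectory()) yield* walk(path); // recurse into subdirectories
    else if (ALLOWED.has(extname(entry.name))) yield path; // keep known types
  }
}
```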
Package Registry Strategies
NPM Strategy:
- Fetches package documentation from the npm registry
- Extracts README and type definitions
- Version-specific documentation

PyPI Strategy:
- Fetches Python package documentation
- Extracts long description and metadata
- Support for Sphinx documentation links
`src/scraper/strategies/NpmScraperStrategy.ts`
`src/scraper/strategies/PyPiScraperStrategy.ts`
Content Fetchers
Fetchers abstract content retrieval across different sources.

HTTP Fetcher
Location: `src/scraper/fetcher/HttpFetcher.ts`
Capabilities:
- Standard HTTP requests with headers
- Playwright rendering for dynamic content
- Automatic retry with backoff
- Error classification and handling
- Response caching
The fetcher automatically detects when JavaScript rendering is needed based on content type and response headers.
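One plausible version of that heuristic is sketched below; the actual HttpFetcher may use different signals, so treat this as an assumption-laden illustration.

```typescript
// Hypothetical detection of client-rendered pages: an HTML response whose
// visible text is nearly empty but that ships script tags likely needs
// Playwright rendering before it can be processed.
function needsJsRendering(contentType: string, body: string): boolean {
  if (!contentType.includes("text/html")) return false; // only HTML qualifies
  const withoutScripts = body.replace(/<script[\s\S]*?<\/script>/gi, "");
  const visibleText = withoutScripts.replace(/<[^>]+>/g, "").trim();
  return visibleText.length < 200 && /<script/i.test(body);
}
```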
File Fetcher
Location: `src/scraper/fetcher/FileFetcher.ts`
Capabilities:
- Local filesystem access
- MIME type detection
- Character encoding resolution
- Binary file filtering (see the sketch below)
- Symbolic link dereferencing
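For binary file filtering, a common (if imperfect) heuristic is to treat any file whose first kilobyte contains a NUL byte as binary; the sketch below assumes that approach and is not necessarily what FileFetcher does.

```typescript
// Hypothetical binary detection: NUL bytes rarely appear in text files.
import { open } from "node:fs/promises";

async function looksBinary(path: string): Promise<boolean> {
  const handle = await open(path, "r");
  try {
    const buf = Buffer.alloc(1024);
    const { bytesRead } = await handle.read(buf, 0, buf.length, 0);
    return buf.subarray(0, bytesRead).includes(0); // NUL byte => binary
  } finally {
    await handle.close();
  }
}
```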
Processing Pipelines
Transform raw content using middleware chains and content-type-specific logic.

HTML Pipeline
Location: `src/scraper/pipelines/HtmlPipeline.ts`
Most extensive middleware pipeline for web content:
- Dynamic Content Rendering: Optional Playwright rendering for JavaScript
- DOM Parsing: Convert HTML string to manipulable DOM (Cheerio)
- Metadata Extraction: Extract title from `<title>` or `<h1>`
- Link Discovery: Gather all links for the crawler
- Content Sanitization: Remove navigation, footers, ads, boilerplate
- URL Normalization: Convert relative URLs to absolute, clean non-functional links
- Markdown Conversion: Convert clean HTML to Markdown
Middleware order is critical: sanitization happens before URL normalization to avoid processing irrelevant content.
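Structurally, such a pipeline can be expressed as an ordered list of middleware functions sharing one context object. The context shape and composition below are assumptions for illustration, not the project's actual types.

```typescript
// Hypothetical middleware chain: each stage mutates a shared context.
interface PipelineContext {
  html: string;        // raw, then progressively cleaned HTML
  links: string[];     // discovered by the link-discovery stage
  title?: string;      // filled by metadata extraction
  markdown?: string;   // produced by the final conversion stage
}

type Middleware = (ctx: PipelineContext) => Promise<void>;

async function runPipeline(ctx: PipelineContext, stages: Middleware[]): Promise<void> {
  // Stages run strictly in order, which is why sanitization can be
  // guaranteed to happen before URL normalization.
  for (const stage of stages) {
    await stage(ctx);
  }
}
```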
Markdown Pipeline
Location: `src/scraper/pipelines/MarkdownPipeline.ts`
Lighter processing for Markdown files:
- Front Matter Extraction: Parse YAML/TOML front matter (sketched below)
- Metadata Extraction: Extract title, description
- Link Processing: Resolve relative links
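Front matter extraction can be as simple as splitting on the `---` delimiters; the hand-rolled sketch below illustrates the idea, though the real pipeline may use a parsing library instead.

```typescript
// Hypothetical YAML front matter extraction (TOML and edge cases omitted).
function extractFrontMatter(src: string): { meta: string | null; body: string } {
  const match = /^---\n([\s\S]*?)\n---\n?/.exec(src);
  if (!match) return { meta: null, body: src };
  return { meta: match[1], body: src.slice(match[0].length) };
}
```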
JSON Pipeline
Location: `src/scraper/pipelines/JsonPipeline.ts`
Minimal middleware to preserve structure:
- Structure Validation: Ensure valid JSON
- Metadata Extraction: Extract schema information
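Both stages reduce to a parse attempt plus a shallow inspection, roughly as sketched below (an illustration, not the pipeline's actual code):

```typescript
// Hypothetical JSON validation plus shallow schema extraction.
function inspectJson(raw: string): { valid: boolean; topLevelKeys: string[] } {
  try {
    const parsed = JSON.parse(raw);
    const topLevelKeys =
      parsed !== null && typeof parsed === "object" && !Array.isArray(parsed)
        ? Object.keys(parsed)
        : [];
    return { valid: true, topLevelKeys };
  } catch {
    return { valid: false, topLevelKeys: [] };
  }
}
```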
Source Code Pipeline
Location: `src/scraper/pipelines/SourceCodePipeline.ts`
Language-aware processing:
- Language Detection: Identify programming language
- Syntax Validation: Check for parse errors
- Comment Extraction: Preserve documentation comments
Text Pipeline
Location: `src/scraper/pipelines/TextPipeline.ts`
Fallback for generic text:
- Encoding Detection: Ensure correct character encoding
- Basic Metadata: Extract filename and size
Document Splitting
Two-phase approach: semantic splitting preserves structure, then size optimization ensures embedding quality.

Phase 1: Semantic Splitting
Content-type-specific splitters preserve document structure.

Semantic Markdown Splitter
Location: `src/splitter/SemanticMarkdownSplitter.ts`
Strategy:
- Analyzes heading hierarchy (H1-H6)
- Creates hierarchical paths like `["Guide", "Installation", "Setup"]`
- Preserves code blocks, tables, and list structures
- Maintains parent-child relationships
["Guide", "Installation", "Setup"]
JSON Document Splitter
Location: `src/splitter/JsonDocumentSplitter.ts`
Strategy:
- Object and property-level splitting
- Hierarchical path construction
- Concatenation-friendly design
- Structural context preservation
["api", "endpoints", "users"]
Text Document Splitter
Location: `src/splitter/TextDocumentSplitter.ts`
Strategy:
- Line-based splitting with context
- Simple hierarchical structure
- Language-aware processing
- Fallback for unsupported content
This is a temporary splitter. A syntax-aware Tree-sitter implementation is planned for better semantic boundaries in source code.
Phase 2: Size Optimization
Location: `src/splitter/GreedySplitter.ts`
Universal optimization across all content types:
Optimization Process:
- Greedy Concatenation: Merge small chunks until minimum size (sketched below)
- Boundary Respect: Preserve major section breaks (H1/H2)
- Metadata Merging: Combine chunk metadata intelligently
- Context Preservation: Maintain hierarchical relationships
| Setting | Role | Default |
|---|---|---|
| `minChunkSize` | Floor for merging | 500 chars |
| `preferredChunkSize` | Soft target | 1500 chars |
| `maxChunkSize` | Hard ceiling | 3000 chars |
All sizes are measured in characters (`string.length`), not tokens. The actual token count depends on the embedding model's tokenizer.
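A stripped-down version of the greedy concatenation step might look like the following; boundary respect, `preferredChunkSize`, and metadata merging are omitted, so this is a sketch of the idea rather than GreedySplitter itself.

```typescript
// Hypothetical greedy merge: grow a buffer until it reaches the floor,
// flushing early if adding a chunk would exceed the hard ceiling.
function greedyMerge(chunks: string[], minChunkSize = 500, maxChunkSize = 3000): string[] {
  const merged: string[] = [];
  let current = "";
  for (const chunk of chunks) {
    if (current && current.length + chunk.length > maxChunkSize) {
      merged.push(current); // flush before exceeding the ceiling
      current = chunk;
    } else {
      current = current ? `${current}\n\n${chunk}` : chunk;
    }
    if (current.length >= minChunkSize) {
      merged.push(current); // reached the floor: emit and start fresh
      current = "";
    }
  }
  if (current) merged.push(current);
  return merged;
}
```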
Content Processing Flow
Complete flow from source to embedded chunks.

Content-Type-Specific Processing
Different content types follow specialized paths.

Processing Differences:
- HTML: Multi-stage middleware pipeline for cleaning and conversion
- JSON: Structural validation with hierarchical splitting
- Source Code: Language-aware splitting (Tree-sitter semantic boundary detection is planned)
- Markdown: Direct semantic splitting with metadata
All content types then converge on GreedySplitter for universal size optimization.
Chunk Structure
Hierarchical Organization
Chunks maintain hierarchical relationships through path-based organization:
- Parent: Path with one fewer element
- Children: Paths extending current by one level
- Siblings: Same path length with shared parent
- Context: Related chunks in search results
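These relationships fall directly out of path comparisons, as the sketch below shows (helper names are hypothetical):

```typescript
// Hypothetical path relations over hierarchical chunk paths.
type Path = string[];

const parentOf = (p: Path): Path => p.slice(0, -1);

const isChildOf = (child: Path, parent: Path): boolean =>
  child.length === parent.length + 1 &&
  parent.every((seg, i) => child[i] === seg);

const areSiblings = (a: Path, b: Path): boolean =>
  a.length > 0 &&
  a.length === b.length &&
  parentOf(a).every((seg, i) => b[i] === seg) &&
  a[a.length - 1] !== b[b.length - 1]; // a chunk is not its own sibling
```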
Search Context Retrieval
When returning search results, the system automatically includes contextual chunks for comprehensive understanding:
- The matching chunk itself
- Parent chunks for broader context
- Previous and following siblings for navigation
- Direct child chunks for deeper exploration
`src/store/DocumentRetrieverService.ts`
Error Handling
Content Filtering
Automatic filtering of low-quality content:
- Navigation menus and sidebars
- Advertisement content and widgets
- Boilerplate text and templates
- Duplicate content detection
- Minimum content length thresholds
`src/scraper/middleware/sanitization/`
Error Recovery
Graceful handling of processing errors:

| Error Type | Strategy |
|---|---|
| Recoverable | Retry with exponential backoff |
| Content | Skip page and continue processing |
| Fatal | Stop job with detailed error info |
| Warning | Log and continue |
`src/pipeline/PipelineWorker.ts`
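The "retry with exponential backoff" row typically reduces to a wrapper like the one below; the attempt count and delay values are assumptions, not the worker's actual configuration.

```typescript
// Hypothetical retry helper for recoverable errors: 1s, 2s delays.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        const delayMs = 1000 * 2 ** attempt; // exponential backoff
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError; // exhausted retries: escalate to fatal handling
}
```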
Progress Tracking
Real-time processing feedback:
- Page-level progress updates
- Processing rate metrics (pages/min)
- Error count and classification
- Memory usage monitoring
`src/pipeline/PipelineManager.ts`
System Integration
Content processing integrates with downstream components.

Integration Points:
- Embedding Generation: Consistent chunk formatting enables seamless vector generation
- Database Storage: Hierarchical paths and metadata support efficient indexing
- Search System: Context-aware results leverage chunk relationships
Performance Optimization
Parallel Processing
The system processes multiple pages concurrently while respecting rate limits and politeness delays:
- Configurable worker pool size
- Rate limiting per domain
- Memory-based backpressure
- Graceful degradation under load
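A bounded worker pool is the essential mechanism here; the sketch below shows the shape, while per-domain rate limiting and backpressure are left out since they depend on project specifics.

```typescript
// Hypothetical fixed-size worker pool draining a shared queue.
async function processAll<T>(
  items: T[],
  worker: (item: T) => Promise<void>,
  poolSize = 4,
): Promise<void> {
  const queue = [...items];
  const runners = Array.from({ length: poolSize }, async () => {
    for (let item = queue.shift(); item !== undefined; item = queue.shift()) {
      await worker(item); // each runner pulls the next item when free
    }
  });
  await Promise.all(runners);
}
```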
Caching Strategy
HTTP Response Caching:
- In-memory cache for recent fetches
- Respects `Cache-Control` headers
- Reduces redundant requests

Embedding Caching:
- Reuse embeddings for unchanged content
- Content hash-based cache keys (sketched below)
- Automatic invalidation on updates
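Content-hash cache keys can be derived as below; the hash choice is an assumption, but the principle is that identical content always maps to the same key, so unchanged chunks never trigger re-embedding.

```typescript
// Hypothetical embedding cache key: a stable hash of the chunk text.
import { createHash } from "node:crypto";

const embeddingCacheKey = (chunk: string): string =>
  createHash("sha256").update(chunk, "utf8").digest("hex");
```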
Resource Management
Memory Management:
- Streaming processing for large files
- Incremental chunk generation
- Automatic garbage collection hints

Browser Management:
- Browser instance pooling
- Page context reuse
- Automatic cleanup on errors
Next Steps
Pipeline System
Learn about job processing and workers
Architecture Overview
Understand overall system design
