
Overview

The Extractor class handles intelligent web content extraction with automatic fallback strategies. It uses direct HTTP requests for standard sites and falls back to Jina AI’s Reader API for single-page applications (SPAs), JavaScript-heavy sites, and protected content. The class includes special handling for Facebook pages and automatic content truncation.

Constructor

Extractor(timeout=5, max_chars=10000)

timeout
number
default:"5"
Request timeout in seconds. Applies to both direct fetches and Jina API requests.
max_chars
number
default:"10000"
Maximum characters to extract from a page. Content is truncated to this limit to optimize LLM processing.
from extractor import Extractor

# Use default settings
extractor = Extractor()

# Custom timeout and character limit
extractor = Extractor(timeout=10, max_chars=15000)

# Quick extraction with lower limits
extractor = Extractor(timeout=3, max_chars=5000)
The default max_chars=10000 is optimized for LLM context windows while capturing sufficient content for accurate evaluation.

Methods

process()

Main extraction method with intelligent routing and fallback logic.
url
string
required
The URL to extract content from. Supports HTTP, HTTPS, and special handling for Facebook URLs.
Returns:
result
dict
Extraction result with fields including url, text, char_count, and latency_fetch (see the usage examples below).
Raises:
  • Exception - If both local extraction and Jina fallback fail and no content is retrieved
For Facebook URLs, the method returns different fields including platform-specific metadata. Check for the platform field in the result.

fetch_url()

Low-level method to fetch raw HTML or Markdown from a URL.
url
string
required
The URL to fetch
use_jina
boolean
default:"false"
If true, uses Jina AI’s Reader API by prefixing the URL with https://r.jina.ai/
Returns:
response
tuple
A tuple of (content, latency): the raw HTML or Markdown string and the fetch time in seconds, as shown in the example below.
Raises:
  • Exception - If the HTTP request fails (network error, timeout, invalid response)
from extractor import Extractor

extractor = Extractor()

# Direct fetch
html, latency = extractor.fetch_url("https://example.com")
print(f"Fetched {len(html)} chars in {latency:.2f}s")

# Fetch via Jina for SPA
markdown, latency = extractor.fetch_url("https://spa-site.com", use_jina=True)
print(f"Jina returned {len(markdown)} chars in {latency:.2f}s")

clean_html()

Cleans and extracts text from HTML or Markdown content.
content
string
required
Raw HTML or Markdown content to clean
is_markdown
boolean
default:"false"
If true, treats content as Markdown and returns as-is (stripped). If false, parses as HTML.
Returns:
text
string
Cleaned text with whitespace normalized and unwanted elements removed
from extractor import Extractor

extractor = Extractor()

html = "<html><body><nav>Menu</nav><p>Main content here</p><script>alert('hi')</script></body></html>"
text = extractor.clean_html(html)
print(text)  # "Main content here" (nav and script removed)

# Markdown pass-through
markdown = "# Title\n\nContent here"
text = extractor.clean_html(markdown, is_markdown=True)
print(text)  # "# Title\n\nContent here" (stripped only)
The HTML cleaner removes <script>, <style>, <nav>, <footer>, and <header> tags to focus on main content.

Usage Examples

Basic Content Extraction

from extractor import Extractor
import json

extractor = Extractor()

try:
    result = extractor.process("https://example.com")
    
    print(f"URL: {result['url']}")
    print(f"Characters extracted: {result['char_count']}")
    print(f"Fetch latency: {result['latency_fetch']:.2f}s")
    print(f"\nContent preview:\n{result['text'][:500]}...")
    
except Exception as e:
    print(f"Extraction failed: {e}")

Batch URL Processing

from extractor import Extractor

extractor = Extractor(timeout=10)

urls = [
    "https://example1.com",
    "https://example2.com",
    "https://example3.com"
]

results = []

for url in urls:
    try:
        result = extractor.process(url)
        results.append({
            "url": url,
            "success": True,
            "chars": result['char_count'],
            "latency": result['latency_fetch']
        })
        print(f"✓ {url} - {result['char_count']} chars in {result['latency_fetch']:.2f}s")
    except Exception as e:
        results.append({
            "url": url,
            "success": False,
            "error": str(e)
        })
        print(f"✗ {url} - {e}")

print(f"\nSuccessfully extracted: {sum(1 for r in results if r['success'])}/{len(urls)}")

Handling Facebook URLs

from extractor import Extractor
import json

extractor = Extractor()

facebook_url = "https://www.facebook.com/BusinessPage"

try:
    result = extractor.process(facebook_url)
    
    if "error" in result:
        print(f"Facebook extraction failed: {result['error']}")
    else:
        print(f"Platform: {result.get('platform')}")
        print(f"Content: {result['text'][:200]}...")
        print(f"\nMetadata: {json.dumps(result.get('metadata'), indent=2)}")
        
except Exception as e:
    print(f"Error: {e}")

Custom Extraction Settings

from extractor import Extractor

# Fast extraction for quick scanning
fast_extractor = Extractor(timeout=3, max_chars=3000)

# Detailed extraction for comprehensive analysis
detailed_extractor = Extractor(timeout=15, max_chars=20000)

url = "https://long-article-site.com"

# Quick scan
quick_result = fast_extractor.process(url)
print(f"Quick: {quick_result['char_count']} chars")

# Detailed extraction
detailed_result = detailed_extractor.process(url)
print(f"Detailed: {detailed_result['char_count']} chars")

Manual Fetch and Clean

from extractor import Extractor

extractor = Extractor()

url = "https://example.com"

try:
    # Fetch raw HTML
    html, latency = extractor.fetch_url(url)
    print(f"Fetched {len(html)} chars in {latency:.2f}s")
    
    # Clean HTML
    text = extractor.clean_html(html)
    print(f"Extracted {len(text)} chars of clean text")
    
    # Truncate manually if needed
    max_chars = 5000
    truncated = text[:max_chars]
    print(f"Truncated to {len(truncated)} chars")
    
except Exception as e:
    print(f"Error: {e}")

Smart Fallback Strategy

The Extractor implements a two-tier extraction strategy:
  1. Primary (Direct HTTP): Attempts a direct HTTP fetch with custom headers that simulate a real browser
  2. Content Validation: Checks whether the extracted content is substantial (more than 200 characters)
  3. Fallback Trigger: If the content is under 200 characters or the primary fetch fails, triggers the Jina AI fallback
  4. Jina AI Reader: Uses Jina's headless browser service to render JavaScript and extract content
  5. Result Assembly: Returns the best available content with cumulative latency tracking
The 200-character threshold detects minimal pages (logo + title only) that indicate JavaScript rendering is needed.
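The steps above can be sketched as follows. This is a minimal sketch, not the actual implementation: `fetch_direct` and `fetch_via_jina` are hypothetical stand-ins for the direct HTTP fetch and the Jina Reader call, each assumed to return a `(text, latency_seconds)` tuple as `fetch_url()` does.

```python
MIN_CONTENT_CHARS = 200  # validation threshold from step 2 above

def extract_with_fallback(url, fetch_direct, fetch_via_jina):
    """Two-tier extraction: direct HTTP first, Jina Reader as fallback.

    Both fetchers are injected callables returning (text, latency_seconds);
    latency accumulates across every attempt, per the result-assembly step.
    """
    total_latency = 0.0
    text = ""

    # Tier 1: direct HTTP fetch
    try:
        text, latency = fetch_direct(url)
        total_latency += latency
    except Exception:
        text = ""  # a primary failure also triggers the fallback below

    # Tier 2: thin or missing content indicates JavaScript rendering is needed
    if len(text) < MIN_CONTENT_CHARS:
        try:
            text, latency = fetch_via_jina(url)
            total_latency += latency
        except Exception as exc:
            if not text:  # both tiers failed with nothing retrieved
                raise Exception(f"Extraction failed for {url}") from exc

    return {
        "url": url,
        "text": text,
        "char_count": len(text),
        "latency_fetch": total_latency,
    }
```

Note that the fallback's exception is swallowed when the primary fetch did return some content: the method returns the best available content rather than discarding it.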

Facebook Integration

For Facebook URLs, the Extractor routes to a specialized Facebook client:
from facebook_client import get_facebook_page_data

# Automatically called for facebook.com URLs
result = extractor.process("https://www.facebook.com/PageName")

if "platform" in result and result["platform"] == "facebook":
    # Facebook-specific handling
    metadata = result.get("metadata", {})
    print(f"Page info: {metadata}")
Facebook extraction requires the Graph API and may return an error structure if the API call fails. Always check for the error field in the result.

HTTP Headers

The Extractor sends browser-like headers to avoid bot detection:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Language": "en-US,en;q=0.9",
}
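As a sketch of how such headers can be attached to a request, here is an illustration using the standard library's `urllib`; the actual HTTP client used by the Extractor is an assumption not shown in this reference.

```python
import urllib.request

BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def build_request(url: str) -> urllib.request.Request:
    """Attach browser-like headers so the fetch is less likely to be blocked."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)

# Usage (performs a real network call, so commented out here):
# req = build_request("https://example.com")
# html = urllib.request.urlopen(req, timeout=5).read().decode("utf-8", "replace")
```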

Cleaning Process

The HTML cleaner performs the following operations:
  1. Parse HTML with BeautifulSoup
  2. Remove Elements: <script>, <style>, <nav>, <footer>, <header>
  3. Extract Text with space separators
  4. Normalize Whitespace: Strip lines and remove double spaces
  5. Join Lines with newline characters
  6. Return Clean Text
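As an illustration of steps 2-6, here is a minimal stand-alone approximation using only the standard library's html.parser; the real cleaner uses BeautifulSoup (step 1), so treat this as a sketch of the behavior, not the implementation.

```python
from html.parser import HTMLParser

REMOVE_TAGS = {"script", "style", "nav", "footer", "header"}

class _Cleaner(HTMLParser):
    """Collect text nodes while skipping content inside removed elements."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # nesting depth inside removed elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in REMOVE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in REMOVE_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)

def clean_html_sketch(content: str) -> str:
    parser = _Cleaner()
    parser.feed(content)
    # Normalize whitespace: collapse runs of spaces, drop empty lines,
    # then join the surviving lines with newlines
    lines = (" ".join(chunk.split()) for chunk in parser.chunks)
    return "\n".join(line for line in lines if line)
```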

Performance Characteristics

  • Local Fetch: 0.5-2 seconds typical
  • Jina Fallback: 2-5 seconds typical
  • Facebook API: 1-3 seconds typical
  • Memory Usage: Minimal (content truncated to max_chars)
Total latency is tracked across all attempts and fallbacks, providing accurate performance metrics.

Error Handling

  • Network errors trigger fallback before raising exceptions
  • Facebook API errors return error structures instead of raising
  • Both primary and fallback failures result in a descriptive exception
  • All errors are logged at WARNING level

Command-Line Usage

The Extractor can be run standalone for testing:
python extractor.py https://example.com
Output:
URL: https://example.com
Latency: 1.23s
Chars: 8543
--------------------
Example Domain This domain is for use in illustrative examples...
Related

  • LeadEngine - Uses Extractor as the first pipeline stage
  • Evaluator - Processes extracted content for AI analysis
