
Overview

The Extractor class handles intelligent web content extraction with automatic fallback strategies. It uses direct HTTP requests for standard sites and falls back to Jina AI’s Reader API for single-page applications (SPAs), JavaScript-heavy sites, and protected content. The class includes special handling for Facebook pages and automatic content truncation.

Constructor

Extractor(timeout=5, max_chars=10000)

timeout
number
default:"5"
Request timeout in seconds. Applies to both direct fetches and Jina API requests.
max_chars
number
default:"10000"
Maximum characters to extract from a page. Content is truncated to this limit to optimize LLM processing.
from extractor import Extractor

# Use default settings
extractor = Extractor()

# Custom timeout and character limit
extractor = Extractor(timeout=10, max_chars=15000)

# Quick extraction with lower limits
extractor = Extractor(timeout=3, max_chars=5000)
The default max_chars=10000 is optimized for LLM context windows while capturing sufficient content for accurate evaluation.

Methods

process()

Main extraction method with intelligent routing and fallback logic.
url
string
required
The URL to extract content from. Supports HTTP, HTTPS, and special handling for Facebook URLs.
Returns:
result
dict
Extraction result with fields including url, text, char_count, and latency_fetch (see the usage examples below).
Raises:
  • Exception - If both local extraction and Jina fallback fail and no content is retrieved
For Facebook URLs, the method returns different fields including platform-specific metadata. Check for the platform field in the result.

fetch_url()

Low-level method to fetch raw HTML or Markdown from a URL.
url
string
required
The URL to fetch
use_jina
boolean
default:"false"
If true, uses Jina AI’s Reader API by prefixing the URL with https://r.jina.ai/
Returns:
response
tuple
A tuple of (content, latency): the raw HTML or Markdown string and the fetch time in seconds, as shown in the example below.
Raises:
  • Exception - If the HTTP request fails (network error, timeout, invalid response)
from extractor import Extractor

extractor = Extractor()

# Direct fetch
html, latency = extractor.fetch_url("https://example.com")
print(f"Fetched {len(html)} chars in {latency:.2f}s")

# Fetch via Jina for SPA
markdown, latency = extractor.fetch_url("https://spa-site.com", use_jina=True)
print(f"Jina returned {len(markdown)} chars in {latency:.2f}s")

clean_html()

Cleans and extracts text from HTML or Markdown content.
content
string
required
Raw HTML or Markdown content to clean
is_markdown
boolean
default:"false"
If true, treats content as Markdown and returns as-is (stripped). If false, parses as HTML.
Returns:
text
string
Cleaned text with whitespace normalized and unwanted elements removed
from extractor import Extractor

extractor = Extractor()

html = "<html><body><nav>Menu</nav><p>Main content here</p><script>alert('hi')</script></body></html>"
text = extractor.clean_html(html)
print(text)  # "Main content here" (nav and script removed)

# Markdown pass-through
markdown = "# Title\n\nContent here"
text = extractor.clean_html(markdown, is_markdown=True)
print(text)  # "# Title\n\nContent here" (stripped only)
The HTML cleaner removes <script>, <style>, <nav>, <footer>, and <header> tags to focus on main content.

Usage Examples

Basic Content Extraction

from extractor import Extractor
import json

extractor = Extractor()

try:
    result = extractor.process("https://example.com")
    
    print(f"URL: {result['url']}")
    print(f"Characters extracted: {result['char_count']}")
    print(f"Fetch latency: {result['latency_fetch']:.2f}s")
    print(f"\nContent preview:\n{result['text'][:500]}...")
    
except Exception as e:
    print(f"Extraction failed: {e}")

Batch URL Processing

from extractor import Extractor

extractor = Extractor(timeout=10)

urls = [
    "https://example1.com",
    "https://example2.com",
    "https://example3.com"
]

results = []

for url in urls:
    try:
        result = extractor.process(url)
        results.append({
            "url": url,
            "success": True,
            "chars": result['char_count'],
            "latency": result['latency_fetch']
        })
        print(f"✓ {url} - {result['char_count']} chars in {result['latency_fetch']:.2f}s")
    except Exception as e:
        results.append({
            "url": url,
            "success": False,
            "error": str(e)
        })
        print(f"✗ {url} - {e}")

print(f"\nSuccessfully extracted: {sum(1 for r in results if r['success'])}/{len(urls)}")

Handling Facebook URLs

from extractor import Extractor
import json

extractor = Extractor()

facebook_url = "https://www.facebook.com/BusinessPage"

try:
    result = extractor.process(facebook_url)
    
    if "error" in result:
        print(f"Facebook extraction failed: {result['error']}")
    else:
        print(f"Platform: {result.get('platform')}")
        print(f"Content: {result['text'][:200]}...")
        print(f"\nMetadata: {json.dumps(result.get('metadata'), indent=2)}")
        
except Exception as e:
    print(f"Error: {e}")

Custom Extraction Settings

from extractor import Extractor

# Fast extraction for quick scanning
fast_extractor = Extractor(timeout=3, max_chars=3000)

# Detailed extraction for comprehensive analysis
detailed_extractor = Extractor(timeout=15, max_chars=20000)

url = "https://long-article-site.com"

# Quick scan
quick_result = fast_extractor.process(url)
print(f"Quick: {quick_result['char_count']} chars")

# Detailed extraction
detailed_result = detailed_extractor.process(url)
print(f"Detailed: {detailed_result['char_count']} chars")

Manual Fetch and Clean

from extractor import Extractor

extractor = Extractor()

url = "https://example.com"

try:
    # Fetch raw HTML
    html, latency = extractor.fetch_url(url)
    print(f"Fetched {len(html)} chars in {latency:.2f}s")
    
    # Clean HTML
    text = extractor.clean_html(html)
    print(f"Extracted {len(text)} chars of clean text")
    
    # Truncate manually if needed
    max_chars = 5000
    truncated = text[:max_chars]
    print(f"Truncated to {len(truncated)} chars")
    
except Exception as e:
    print(f"Error: {e}")

Smart Fallback Strategy

The Extractor implements a two-tier extraction strategy:
  1. Primary (Direct HTTP): Attempts a direct HTTP fetch with custom headers that simulate a real browser
  2. Content Validation: Checks whether the extracted content is substantial (more than 200 characters)
  3. Fallback Trigger: If the content is under 200 characters or the primary fetch fails, triggers the Jina AI fallback
  4. Jina AI Reader: Uses Jina's headless browser service to render JavaScript and extract content
  5. Result Assembly: Returns the best available content with cumulative latency tracking
The 200-character threshold detects minimal pages (logo + title only) that indicate JavaScript rendering is needed.
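The steps above can be sketched as follows. This is a minimal sketch, not the actual implementation: `fetch_direct` and `fetch_via_jina` are hypothetical stand-ins for the direct HTTP fetch and the Jina Reader call, each assumed to return a `(text, latency_seconds)` tuple as `fetch_url()` does.

```python
MIN_CONTENT_CHARS = 200  # validation threshold from step 2 above

def extract_with_fallback(url, fetch_direct, fetch_via_jina):
    """Two-tier extraction: direct HTTP first, Jina Reader as fallback.

    Both fetchers are injected callables returning (text, latency_seconds);
    latency accumulates across every attempt, per the result-assembly step.
    """
    total_latency = 0.0
    text = ""

    # Tier 1: direct HTTP fetch
    try:
        text, latency = fetch_direct(url)
        total_latency += latency
    except Exception:
        text = ""  # a primary failure also triggers the fallback below

    # Tier 2: thin or missing content indicates JavaScript rendering is needed
    if len(text) < MIN_CONTENT_CHARS:
        try:
            text, latency = fetch_via_jina(url)
            total_latency += latency
        except Exception as exc:
            if not text:  # both tiers failed with nothing retrieved
                raise Exception(f"Extraction failed for {url}") from exc

    return {
        "url": url,
        "text": text,
        "char_count": len(text),
        "latency_fetch": total_latency,
    }
```

Note that the fallback's exception is swallowed when the primary fetch did return some content: the method returns the best available content rather than discarding it.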

Facebook Integration

For Facebook URLs, the Extractor routes to a specialized Facebook client:
from facebook_client import get_facebook_page_data

# Automatically called for facebook.com URLs
result = extractor.process("https://www.facebook.com/PageName")

if "platform" in result and result["platform"] == "facebook":
    # Facebook-specific handling
    metadata = result.get("metadata", {})
    print(f"Page info: {metadata}")
Facebook extraction requires the Graph API and may return an error structure if the API call fails. Always check for the error field in the result.

HTTP Headers

The Extractor sends browser-like headers to avoid bot detection:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Language": "en-US,en;q=0.9",
}
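As a sketch of how such headers can be attached to a request, here is an illustration using the standard library's `urllib`; the actual HTTP client used by the Extractor is an assumption not shown in this reference.

```python
import urllib.request

BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def build_request(url: str) -> urllib.request.Request:
    """Attach browser-like headers so the fetch is less likely to be blocked."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)

# Usage (performs a real network call, so commented out here):
# req = build_request("https://example.com")
# html = urllib.request.urlopen(req, timeout=5).read().decode("utf-8", "replace")
```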

Cleaning Process

The HTML cleaner performs the following operations:
  1. Parse HTML with BeautifulSoup
  2. Remove Elements: <script>, <style>, <nav>, <footer>, <header>
  3. Extract Text with space separators
  4. Normalize Whitespace: Strip lines and remove double spaces
  5. Join Lines with newline characters
  6. Return Clean Text
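As an illustration of steps 2-6, here is a minimal stand-alone approximation using only the standard library's html.parser; the real cleaner uses BeautifulSoup (step 1), so treat this as a sketch of the behavior, not the implementation.

```python
from html.parser import HTMLParser

REMOVE_TAGS = {"script", "style", "nav", "footer", "header"}

class _Cleaner(HTMLParser):
    """Collect text nodes while skipping content inside removed elements."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # nesting depth inside removed elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in REMOVE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in REMOVE_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)

def clean_html_sketch(content: str) -> str:
    parser = _Cleaner()
    parser.feed(content)
    # Normalize whitespace: collapse runs of spaces, drop empty lines,
    # then join the surviving lines with newlines
    lines = (" ".join(chunk.split()) for chunk in parser.chunks)
    return "\n".join(line for line in lines if line)
```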

Performance Characteristics

  • Local Fetch: 0.5-2 seconds typical
  • Jina Fallback: 2-5 seconds typical
  • Facebook API: 1-3 seconds typical
  • Memory Usage: Minimal (content truncated to max_chars)
Total latency is tracked across all attempts and fallbacks, providing accurate performance metrics.

Error Handling

  • Network errors trigger fallback before raising exceptions
  • Facebook API errors return error structures instead of raising
  • Both primary and fallback failures result in a descriptive exception
  • All errors are logged at WARNING level

Command-Line Usage

The Extractor can be run standalone for testing:
python extractor.py https://example.com
Output:
URL: https://example.com
Latency: 1.23s
Chars: 8543
--------------------
Example Domain This domain is for use in illustrative examples...
Related

  • LeadEngine - Uses Extractor as the first pipeline stage
  • Evaluator - Processes extracted content for AI analysis
