Overview
The Extractor class handles intelligent web content extraction with automatic fallback strategies. It uses direct HTTP requests for standard sites and falls back to Jina AI’s Reader API for single-page applications (SPAs), JavaScript-heavy sites, and protected content. The class includes special handling for Facebook pages and automatic content truncation.
Constructor
timeout - Request timeout in seconds. Applies to both direct fetches and Jina API requests.
max_chars - Maximum characters to extract from a page. Content is truncated to this limit to optimize LLM processing.
from extractor import Extractor
# Use default settings
extractor = Extractor()
# Custom timeout and character limit
extractor = Extractor(timeout=10, max_chars=15000)
# Quick extraction with lower limits
extractor = Extractor(timeout=3, max_chars=5000)
The default max_chars=10000 is optimized for LLM context windows while capturing sufficient content for accurate evaluation.
Methods
process()
Main extraction method with intelligent routing and fallback logic.
url - The URL to extract content from. Supports HTTP, HTTPS, and special handling for Facebook URLs.
Returns:
Extraction result dictionary with the following fields:
url - The URL that was processed
text - Extracted and cleaned text content (truncated to max_chars)
latency_fetch - Total fetch time in seconds (includes retries and fallback)
char_count - Number of characters in the extracted text
For Facebook URLs, the result also includes:
platform - Set to "facebook" for Facebook URLs
metadata - Facebook-specific metadata from the Graph API
error - Error message if extraction failed completely
Raises:
Exception - If both local extraction and Jina fallback fail and no content is retrieved
For Facebook URLs, the method returns different fields including platform-specific metadata. Check for the platform field in the result.
fetch_url()
Low-level method to fetch raw HTML or Markdown from a URL.
url - The URL to fetch.
use_jina - If true, uses Jina AI's Reader API by prefixing the URL with https://r.jina.ai/
Returns:
A tuple of (content, latency): the raw HTML content (or Markdown if using Jina) and the fetch time in seconds
Raises:
Exception - If the HTTP request fails (network error, timeout, invalid response)
from extractor import Extractor
extractor = Extractor()
# Direct fetch
html, latency = extractor.fetch_url("https://example.com")
print(f"Fetched {len(html)} chars in {latency:.2f}s")
# Fetch via Jina for SPA
markdown, latency = extractor.fetch_url("https://spa-site.com", use_jina=True)
print(f"Jina returned {len(markdown)} chars in {latency:.2f}s")
clean_html()
Cleans and extracts text from HTML or Markdown content.
html - Raw HTML or Markdown content to clean
is_markdown - If true, treats the content as Markdown and returns it as-is (stripped). If false, parses it as HTML.
Returns:
Cleaned text with whitespace normalized and unwanted elements removed
from extractor import Extractor
extractor = Extractor()
html = "<html><body><nav>Menu</nav><p>Main content here</p><script>alert('hi')</script></body></html>"
text = extractor.clean_html(html)
print(text)  # "Main content here" (nav and script removed)
# Markdown pass-through
markdown = "# Title\n\nContent here"
text = extractor.clean_html(markdown, is_markdown=True)
print(text)  # "# Title\n\nContent here" (stripped only)
The HTML cleaner removes <script>, <style>, <nav>, <footer>, and <header> tags to focus on main content.
Usage Examples
Basic Extraction
from extractor import Extractor
import json
extractor = Extractor()
try:
    result = extractor.process("https://example.com")
    print(f"URL: {result['url']}")
    print(f"Characters extracted: {result['char_count']}")
    print(f"Fetch latency: {result['latency_fetch']:.2f}s")
    print(f"\nContent preview:\n{result['text'][:500]}...")
except Exception as e:
    print(f"Extraction failed: {e}")
Batch URL Processing
from extractor import Extractor
extractor = Extractor(timeout=10)
urls = [
    "https://example1.com",
    "https://example2.com",
    "https://example3.com",
]
results = []
for url in urls:
    try:
        result = extractor.process(url)
        results.append({
            "url": url,
            "success": True,
            "chars": result['char_count'],
            "latency": result['latency_fetch'],
        })
        print(f"✓ {url} - {result['char_count']} chars in {result['latency_fetch']:.2f}s")
    except Exception as e:
        results.append({
            "url": url,
            "success": False,
            "error": str(e),
        })
        print(f"✗ {url} - {e}")
print(f"\nSuccessfully extracted: {sum(1 for r in results if r['success'])}/{len(urls)}")
Handling Facebook URLs
from extractor import Extractor
import json
extractor = Extractor()
facebook_url = "https://www.facebook.com/BusinessPage"
try:
    result = extractor.process(facebook_url)
    if "error" in result:
        print(f"Facebook extraction failed: {result['error']}")
    else:
        print(f"Platform: {result.get('platform')}")
        print(f"Content: {result['text'][:200]}...")
        print(f"\nMetadata: {json.dumps(result.get('metadata'), indent=2)}")
except Exception as e:
    print(f"Error: {e}")
Fast vs. Detailed Extraction
from extractor import Extractor
# Fast extraction for quick scanning
fast_extractor = Extractor(timeout=3, max_chars=3000)
# Detailed extraction for comprehensive analysis
detailed_extractor = Extractor(timeout=15, max_chars=20000)
url = "https://long-article-site.com"
# Quick scan
quick_result = fast_extractor.process(url)
print(f"Quick: {quick_result['char_count']} chars")
# Detailed extraction
detailed_result = detailed_extractor.process(url)
print(f"Detailed: {detailed_result['char_count']} chars")
Manual Fetch and Clean
from extractor import Extractor
extractor = Extractor()
url = "https://example.com"
try:
    # Fetch raw HTML
    html, latency = extractor.fetch_url(url)
    print(f"Fetched {len(html)} chars in {latency:.2f}s")
    # Clean HTML
    text = extractor.clean_html(html)
    print(f"Extracted {len(text)} chars of clean text")
    # Truncate manually if needed
    max_chars = 5000
    truncated = text[:max_chars]
    print(f"Truncated to {len(truncated)} chars")
except Exception as e:
    print(f"Error: {e}")
Smart Fallback Strategy
The Extractor implements a two-tier extraction strategy:
1. Primary (direct HTTP): attempts a direct HTTP fetch with custom headers to simulate a real browser.
2. Content validation: checks whether the extracted content is substantial (more than 200 characters).
3. Fallback trigger: if the content is under 200 characters or the primary fetch fails, triggers the Jina AI fallback.
4. Jina AI Reader: uses Jina's headless browser service to render JavaScript and extract content.
5. Result assembly: returns the best available content with cumulative latency tracking.
The 200-character threshold detects minimal pages (logo + title only) that indicate JavaScript rendering is needed.
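The two-tier flow can be sketched with the documented fetch_url() and clean_html() methods. This is a simplified illustration, not the actual process() source: the helper name extract_with_fallback and the exact control flow are assumptions, and max_chars truncation is omitted.

```python
# Sketch of the two-tier strategy, built on the documented
# fetch_url()/clean_html() methods. extract_with_fallback is a
# hypothetical helper, not the Extractor's real process().
MIN_CONTENT_CHARS = 200  # threshold described in the docs

def extract_with_fallback(extractor, url):
    total_latency = 0.0
    text = ""
    try:
        # Tier 1: direct HTTP fetch, then clean the HTML
        html, latency = extractor.fetch_url(url)
        total_latency += latency
        text = extractor.clean_html(html)
    except Exception:
        pass  # primary failure falls through to the Jina tier

    if len(text) < MIN_CONTENT_CHARS:
        # Tier 2: Jina AI Reader returns Markdown for JS-heavy pages
        markdown, latency = extractor.fetch_url(url, use_jina=True)
        total_latency += latency
        text = extractor.clean_html(markdown, is_markdown=True)

    if not text:
        raise Exception(f"Extraction failed for {url}")
    # Cumulative latency across both tiers, as described above
    return {"url": url, "text": text,
            "latency_fetch": total_latency, "char_count": len(text)}
```

Because latency is accumulated across both tiers, a result that came from the Jina fallback reports the full time spent, including the failed direct attempt.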
Facebook Integration
For Facebook URLs, the Extractor routes to a specialized Facebook client:
from facebook_client import get_facebook_page_data
# Automatically called for facebook.com URLs
result = extractor.process("https://www.facebook.com/PageName")
if "platform" in result and result["platform"] == "facebook":
    # Facebook-specific handling
    metadata = result.get("metadata", {})
    print(f"Page info: {metadata}")
Facebook extraction requires the Graph API and may return an error structure if the API call fails. Always check for the error field in the result.
Custom Headers
The Extractor sends browser-like headers to avoid bot detection:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Language": "en-US,en;q=0.9",
}
Cleaning Process
The HTML cleaner performs the following operations:
1. Parse the HTML with BeautifulSoup
2. Remove elements: <script>, <style>, <nav>, <footer>, <header>
3. Extract text with space separators
4. Normalize whitespace: strip lines and remove double spaces
5. Join lines with newline characters
6. Return the clean text
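Assuming BeautifulSoup is installed, these steps can be sketched as follows; clean_html_sketch is an illustrative stand-in, not the class's actual method:

```python
# Illustrative sketch of the cleaning pipeline; clean_html_sketch is a
# hypothetical helper, not the Extractor's real implementation.
from bs4 import BeautifulSoup

def clean_html_sketch(html: str) -> str:
    # 1. Parse the HTML
    soup = BeautifulSoup(html, "html.parser")
    # 2. Remove non-content elements
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()
    # 3. Extract text with space separators
    text = soup.get_text(separator=" ")
    # 4. Strip each line and collapse repeated spaces
    lines = (" ".join(line.split()) for line in text.splitlines())
    # 5. Join non-empty lines with newlines
    return "\n".join(line for line in lines if line)
```

Calling decompose() removes the element and its children entirely, so script bodies and navigation text never reach the extracted output.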
Performance
Local Fetch: 0.5-2 seconds typical
Jina Fallback: 2-5 seconds typical
Facebook API: 1-3 seconds typical
Memory Usage: minimal (content truncated to max_chars)
Total latency is tracked across all attempts and fallbacks, providing accurate performance metrics.
Error Handling
Network errors trigger fallback before raising exceptions
Facebook API errors return error structures instead of raising
Both primary and fallback failures result in a descriptive exception
All errors are logged at WARNING level
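These rules suggest a defensive calling pattern; safe_extract below is a hypothetical wrapper, not part of the Extractor API:

```python
# Hypothetical wrapper reflecting the error-handling rules above:
# an exception means both tiers failed, while Facebook failures arrive
# as an "error" field in the result rather than an exception.
import logging

logger = logging.getLogger(__name__)

def safe_extract(extractor, url):
    """Return (result, error_message) instead of raising."""
    try:
        result = extractor.process(url)
    except Exception as exc:  # both primary and fallback failed
        logger.warning("Extraction failed for %s: %s", url, exc)
        return None, str(exc)
    if "error" in result:  # e.g. Facebook Graph API failure
        logger.warning("Platform error for %s: %s", url, result["error"])
        return None, result["error"]
    return result, None
```

Checking the error field after a successful call matters because Facebook failures never raise; without the check they would be silently treated as valid content.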
Command-Line Usage
The Extractor can be run standalone for testing:
python extractor.py https://example.com
Output:
URL: https://example.com
Latency: 1.23s
Chars: 8543
--------------------
Example Domain This domain is for use in illustrative examples...
Related
LeadEngine - Uses Extractor as the first pipeline stage
Evaluator - Processes extracted content for AI analysis