Overview
The Price Tracker Bot uses Playwright with Chromium to scrape Amazon product pages. The scraping logic handles multiple selector patterns, timeout management, and error recovery.
Playwright is a browser automation library that provides reliable, cross-browser web scraping with modern JavaScript support.
Playwright Configuration
Browser Launch Options
Chromium is launched in headless mode with the following flags:
- --no-sandbox: Required for running as root in containers
- --disable-dev-shm-usage: Uses /tmp instead of /dev/shm (prevents crashes in low-memory environments)
index.mjs:148, index.mjs:412, index.mjs:561
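As a rough sketch, the launch described above could keep its options in one shared object so that every call site passes the same flags. The names LAUNCH_OPTIONS and launchBrowser are hypothetical and do not come from index.mjs; the chromium argument is assumed to be Playwright's chromium export, supplied by the caller.

```javascript
// Hypothetical shared launch options (names are illustrative, not from index.mjs).
const LAUNCH_OPTIONS = Object.freeze({
  headless: true,
  args: [
    "--no-sandbox",            // needed when Chromium runs as root in a container
    "--disable-dev-shm-usage", // write shared-memory files to /tmp instead of /dev/shm
  ],
});

// `chromium` is Playwright's browser type, e.g. from: import { chromium } from "playwright";
async function launchBrowser(chromium) {
  return chromium.launch(LAUNCH_OPTIONS);
}
```

Passing chromium in as a parameter keeps the helper easy to exercise with a stub in tests.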
Different parts of the codebase launch browsers with slightly different configurations. The /add and /edit commands omit --disable-dev-shm-usage.
Browser Context
Each browser instance creates a context (isolated session):
- Isolated cookies and localStorage
- Prevents cross-contamination between products
- Allows parallel contexts (not currently used)
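A minimal sketch of context isolation, assuming a Playwright Browser object is already available. The helper name withIsolatedPage is hypothetical, and whether index.mjs structures its context handling this way is not shown in the source.

```javascript
// Run `task` with a page from a fresh context, then dispose of the context.
// `browser` is a Playwright Browser; `task` is any async function taking a Page.
async function withIsolatedPage(browser, task) {
  const context = await browser.newContext(); // own cookies and localStorage
  try {
    const page = await context.newPage();
    return await task(page);
  } finally {
    await context.close(); // also closes the pages opened in this context
  }
}
```

Because each call gets its own context, several calls could run in parallel against one browser without sharing cookies.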
Core Scraping Function
scrapeProduct(page, url)
The main scraping function accepts a Playwright page instance and product URL:
index.mjs:73-136
Return types:
- Success: { url, title, price, imageUrl }
- Failure: { error: string }
Page Navigation
Timeout Configuration
Navigation uses a 60-second timeout:
index.mjs:75
waitUntil options:
- domcontentloaded: Wait until DOM is ready (faster)
- load: Wait for all resources (slower, more reliable)
- networkidle: Wait until network is quiet (slowest)
The bot uses domcontentloaded for speed. This works for Amazon because product data is in the initial HTML, not loaded via JavaScript.
Post-Navigation Delay
An 800ms delay allows lazy-loaded content to render:
index.mjs:78
This is a “best effort” delay. Some lazy-loaded images may still not appear, which is why image selectors have fallbacks.
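The navigation and delay steps can be sketched together as follows. The constant and function names are illustrative; only the 60-second timeout, the domcontentloaded setting, and the 800ms pause come from the description above.

```javascript
const NAV_TIMEOUT_MS = 60_000; // 60-second navigation timeout
const SETTLE_DELAY_MS = 800;   // best-effort pause for lazy-loaded content

// `page` is a Playwright Page. Rejects if navigation exceeds the timeout.
async function openProductPage(page, url) {
  await page.goto(url, { waitUntil: "domcontentloaded", timeout: NAV_TIMEOUT_MS });
  await page.waitForTimeout(SETTLE_DELAY_MS);
}
```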
Selector Strategies
Title Extraction
Multiple selectors are tried in sequence:
index.mjs:80-93
Selector Priority
| Selector | Amazon Page Type | Notes |
|---|---|---|
| #productTitle | Most product pages | Primary selector |
| h1#title | Older layouts | Fallback |
| h1.a-size-large | Some mobile views | Broader match |
| h1[data-automation-id="product-title"] | Fresh/Whole Foods | Specialty stores |
| .product-title | Generic fallback | Class-based match |
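A minimal sketch of the try-in-sequence pattern, using the selectors from the table. The helper name firstText is hypothetical, and skipping matches with empty text is an assumption rather than confirmed behavior of index.mjs.

```javascript
// Selector list from the table above, in priority order.
const TITLE_SELECTORS = [
  "#productTitle",
  "h1#title",
  "h1.a-size-large",
  'h1[data-automation-id="product-title"]',
  ".product-title",
];

// Return the trimmed text of the first selector that matches a non-empty element, else null.
// `root` is anything with querySelector(), such as `document` inside page.evaluate().
function firstText(root, selectors) {
  for (const selector of selectors) {
    const text = root.querySelector(selector)?.textContent?.trim();
    if (text) return text;
  }
  return null;
}
```

When used with page.evaluate(), the helper has to be defined inside the evaluated callback, because functions from the Node.js scope are not available in the browser.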
Price Extraction
Price selectors target hidden “screenreader” elements:
index.mjs:95-109
Why .a-offscreen?
Amazon uses .a-offscreen elements for accessibility. Each one holds the full price as a single text string, whereas the visible price is split across separate elements for the currency symbol, whole part, and fraction. Reading .a-offscreen avoids parsing those split-up price components.
Legacy selectors like #priceblock_ourprice are included for older Amazon layouts that may still exist in some regions.
Image Extraction
Image selectors target high-resolution product images:
index.mjs:111-125
Selector Priority
| Selector | Purpose | Notes |
|---|---|---|
| #landingImage | Main product image | Most reliable |
| div#imgTagWrapperId img | Image container | Backup for lazy-loaded images |
| img[data-old-hires] | High-res image | Amazon’s data attribute |
| .a-dynamic-image | Dynamic images | Class-based fallback |
| img (first) | Generic fallback | May return logo/icon |
The final fallback (document.querySelector("img")) may return non-product images like the Amazon logo. This is acceptable since imageUrl is optional.
Data Parsing
Price Parsing
Price text is cleaned and converted to a number:
index.mjs:128
Examples:
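A minimal sketch of this cleaning step, assuming it amounts to a regex replace followed by parseFloat. The name cleanPrice and the sample strings are illustrative rather than taken from index.mjs.

```javascript
// Strip everything except digits and dots, then parse the remainder as a float.
function cleanPrice(text) {
  return parseFloat(String(text).replace(/[^0-9.]/g, ""));
}

console.log(cleanPrice("$45.99"));    // 45.99
console.log(cleanPrice("$1,299.00")); // 1299
console.log(cleanPrice("45,50 €"));   // 4550 (the comma decimal is lost)
console.log(cleanPrice("N/A"));       // NaN
```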
The regex /[^0-9.]/g removes all characters except digits and dots. This works for most currencies but may fail for comma-decimal formats (e.g., “45,50” → 4550).
Validation
Extracted data is validated before returning:
index.mjs:127-129
Validation rules:
- Title must be non-empty string
- Price must be a positive number
- Image URL is optional (may be null)
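The rules above can be sketched as a single function that returns either the success shape or the error shape. The name validateProduct and the priceText parameter are hypothetical; the error strings are the ones quoted under Common Scraping Issues.

```javascript
// Apply the validation rules to raw extracted fields.
// Returns { url, title, price, imageUrl } on success or { error } on failure.
function validateProduct({ url, title, priceText, imageUrl }) {
  const cleanTitle = typeof title === "string" ? title.trim() : "";
  if (!cleanTitle) return { error: "Título no encontrado" };

  const price = parseFloat(String(priceText ?? "").replace(/[^0-9.]/g, ""));
  if (!Number.isFinite(price) || price <= 0) {
    return { error: "Precio inválido o no encontrado" };
  }
  // A missing image is allowed and normalized to null.
  return { url, title: cleanTitle, price, imageUrl: imageUrl || null };
}
```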
Error Handling
Error Return Pattern
All errors are caught and returned as objects:
index.mjs:132-135
Caller Error Handling
Callers check for the error property:
index.mjs:165-172
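A rough sketch of how a caller might branch on the error property. The function name applyScrapeResult, the lastError field, and the ISO-string timestamp format are assumptions made for illustration; only the error check and the lastChecked field come from this document.

```javascript
// Merge a scrape result into a stored product record without mutating it.
// `lastError` is a hypothetical field; `lastChecked` is set in both branches.
function applyScrapeResult(product, result, now = new Date()) {
  const base = { ...product, lastChecked: now.toISOString() };
  if (result.error) {
    return { ...base, lastError: result.error };
  }
  return {
    ...base,
    title: result.title,
    price: result.price,
    imageUrl: result.imageUrl ?? product.imageUrl ?? null,
    lastError: null,
  };
}
```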
Even on error, lastChecked is updated to track scraping attempts.
URL Sanitization
sanitizeAmazonURL(url)
Removes tracking parameters and query strings:
index.mjs:53-61
Examples
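A minimal sketch of this kind of sanitization using the standard URL class. It is an assumption that sanitizeAmazonURL works this way; the helper name stripQueryAndHash and the placeholder product ID are made up for illustration.

```javascript
// Drop the query string and fragment; return null for unparseable input.
function stripQueryAndHash(rawUrl) {
  try {
    const url = new URL(rawUrl);
    url.search = "";
    url.hash = "";
    return url.toString();
  } catch {
    return null;
  }
}

console.log(stripQueryAndHash("https://www.amazon.com/dp/B000000000?tag=abc&ref_=xyz#reviews"));
// https://www.amazon.com/dp/B000000000
```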
Why sanitize? Amazon URLs contain session-specific tracking parameters. Sanitizing prevents duplicate products and allows consistent key-based lookups.
Browser Lifecycle
Shared Browser (Price Checks)
During scheduled price checks, a single browser is reused:
index.mjs:148-217
Benefits:
- Faster (no repeated browser launches)
- Lower memory usage
- Consistent session state
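The reuse pattern can be sketched as a loop that opens one page and visits every product with it. The function and parameter names are hypothetical, reusing a single page is an assumption, and the 900ms default mirrors the per-product delay listed under Rate Limiting.

```javascript
// Check every product with one shared browser and one reused page.
// `scrape(page, url)` and `pause(ms)` are injected so the loop stays testable.
async function checkAllProducts(browser, urls, scrape, pause, delayMs = 900) {
  const page = await browser.newPage();
  const results = [];
  try {
    for (const url of urls) {
      results.push(await scrape(page, url));
      await pause(delayMs); // spacing between products
    }
  } finally {
    await page.close();
  }
  return results;
}
```

The caller remains responsible for closing the browser itself once the loop has finished.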
Temporary Browsers (User Commands)
User-triggered commands launch temporary browsers:
index.mjs:412-417, index.mjs:561-566
Used by:
- /add [url] command
- /edit [old] [new] command
Each user command gets its own browser instance to avoid blocking scheduled price checks.
Rate Limiting
Delays Between Products
The price check loop includes delays to avoid overwhelming Amazon:
index.mjs:214
Delay durations:
- Between products: 900ms
- Between notifications: 500ms (index.mjs:242)
- On errors: 800ms (index.mjs:170)
These delays are conservative. Amazon’s actual rate limits are unknown, but the site likely tolerates a higher request rate.
Performance Considerations
Selector Efficiency
Selectors are ordered by frequency to minimize iteration. ID selectors (such as #productTitle) are fastest, so they’re tried first.
page.evaluate() Execution
All selector logic runs in the browser context:
- Fewer round-trips between Node.js and browser
- Faster DOM access
- Simpler error handling
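A minimal sketch of the single round-trip idea, assuming the selector lists are passed in as the argument to page.evaluate(). The function name extractFields and the shape of the returned object are illustrative, not taken from index.mjs.

```javascript
// Extract all fields with one page.evaluate() round-trip.
// The callback runs in the browser, so it can only use its argument and page globals.
async function extractFields(page, selectors) {
  return page.evaluate(({ title, price, image }) => {
    const firstMatch = (list) => {
      for (const sel of list) {
        const el = document.querySelector(sel);
        if (el) return el;
      }
      return null;
    };
    const titleEl = firstMatch(title);
    const priceEl = firstMatch(price);
    const imageEl = firstMatch(image);
    return {
      title: titleEl ? titleEl.textContent.trim() : null,
      priceText: priceEl ? priceEl.textContent.trim() : null,
      imageUrl: imageEl ? imageEl.getAttribute("src") : null,
    };
  }, selectors);
}
```

Only the plain result object crosses back to Node.js, which is what keeps the number of round-trips at one.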
Memory Management
Browsers are explicitly closed after use:
index.mjs:217, index.mjs:417, index.mjs:566
Failing to close browsers causes memory leaks. Chromium instances consume 100-300 MB of RAM each.
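One way to make the close unconditional is a try/finally wrapper, sketched below. The helper name withBrowser is hypothetical, and the source does not say whether index.mjs uses this pattern.

```javascript
// Launch a browser, run `task`, and always close the browser afterwards.
// `launch` is any async function that returns a Playwright Browser.
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    await browser.close(); // runs even when task throws
  }
}
```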
Common Scraping Issues
Issue: Timeout Errors
Symptom: Navigation timeout of 60000 ms exceeded
Causes:
- Slow network connection
- Amazon’s bot detection blocking the request
- Invalid URL (404 page)
Solutions:
- Increase timeout: { timeout: 120000 }
- Add retry logic
- Implement rotating user agents
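The retry suggestion could be sketched as follows. The function name gotoWithRetry and the default attempt count and backoff are illustrative choices, not values from index.mjs.

```javascript
// Retry page.goto() a few times before giving up.
// `attempts`, `timeout`, and `backoffMs` are illustrative defaults.
async function gotoWithRetry(page, url, { attempts = 3, timeout = 60_000, backoffMs = 2_000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await page.goto(url, { waitUntil: "domcontentloaded", timeout });
    } catch (err) {
      lastError = err;
      // Wait a little longer after each failure, but not after the final one.
      if (attempt < attempts) await page.waitForTimeout(backoffMs * attempt);
    }
  }
  throw lastError;
}
```

Retrying helps with transient network slowness; it will not help if the request is being blocked outright.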
Issue: Selector Not Found
Symptom: Returns { error: "Título no encontrado" } (“Title not found”) or { error: "Precio inválido" } (“Invalid price”)
Causes:
- Amazon changed page layout
- Product page is non-standard (e.g., Books, Digital)
- Geo-restricted page
Solutions:
- Inspect page HTML manually
- Add new selectors to the list
- Use network tab to check for API endpoints
Issue: Price Parsing Fails
Symptom: Returns { error: "Precio inválido o no encontrado" } (“Invalid or missing price”)
Causes:
- Price uses comma decimal separator (“45,50”)
- Price text includes words (“From $10”)
- Out of stock (no price displayed)
Solutions:
- Improve regex: /[^0-9.,]/g and handle commas
- Extract first number found
- Return null price for out-of-stock items
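A sketch of a more tolerant parser that combines these suggestions. It uses a heuristic that is an assumption, not the bot's current behavior: a final separator followed by one or two digits is read as the decimal point, and every other separator is treated as a thousands mark. The name parseLocalizedPrice is hypothetical.

```javascript
// Parse the first number in `text`, accepting "." or "," as the decimal separator.
// Returns null when no positive price can be found (e.g. out-of-stock pages).
function parseLocalizedPrice(text) {
  const match = String(text).match(/\d[\d.,]*/);
  if (!match) return null;
  const raw = match[0].replace(/[.,]+$/, ""); // drop trailing punctuation
  const lastSep = Math.max(raw.lastIndexOf("."), raw.lastIndexOf(","));
  let normalized;
  if (lastSep !== -1 && raw.length - lastSep - 1 <= 2) {
    // One or two digits after the last separator: treat it as the decimal point.
    const whole = raw.slice(0, lastSep).replace(/[.,]/g, "");
    normalized = `${whole}.${raw.slice(lastSep + 1)}`;
  } else {
    // Otherwise every separator is a thousands mark.
    normalized = raw.replace(/[.,]/g, "");
  }
  const value = Number.parseFloat(normalized);
  return Number.isFinite(value) && value > 0 ? value : null;
}

console.log(parseLocalizedPrice("45,50 €"));   // 45.5
console.log(parseLocalizedPrice("$1,299.99")); // 1299.99
console.log(parseLocalizedPrice("From $10"));  // 10
```

The heuristic misreads currencies that use three decimal places, so it is a starting point rather than a complete solution.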
Issue: Rate Limiting
Symptom: Amazon returns CAPTCHA or blocks requests
Causes:
- Too many requests in short period
- Missing or suspicious user agent
- Always same IP address
Solutions:
- Increase delays between requests
- Rotate user agents
- Use residential proxies
- Add cookies from manual session
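As one illustration of the user agent suggestion, a context can be created with a user agent picked from a small list. The strings, the list name, and both helper names are assumptions made for this sketch; real deployments would need to keep the strings current.

```javascript
// Illustrative user agent strings (assumed values, not taken from index.mjs).
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
];

// Pick one entry; `random` is injectable so the choice can be made deterministic.
function pickUserAgent(agents = USER_AGENTS, random = Math.random) {
  return agents[Math.floor(random() * agents.length)];
}

// Create a Playwright context that sends the chosen user agent on every request.
async function newContextWithAgent(browser, userAgent = pickUserAgent()) {
  return browser.newContext({ userAgent });
}
```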
Advanced Techniques
User Agent Rotation
Set a realistic user agent to avoid detection.
Cookie Persistence
Save and restore cookies between sessions.
Proxy Support
Rotate IP addresses using proxies.
API Scraping (Alternative)
Some Amazon pages expose JSON data in <script> tags.
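A minimal sketch of pulling embedded JSON out of page HTML. Which script types Amazon actually uses is not stated here, so matching application/json and application/ld+json is an assumption, and the function name extractJsonScripts is hypothetical.

```javascript
// Collect parsed JSON from <script type="application/json"> and
// <script type="application/ld+json"> blocks in an HTML string.
function extractJsonScripts(html) {
  const pattern = /<script\b[^>]*type=["']application\/(?:ld\+)?json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const found = [];
  for (const match of html.matchAll(pattern)) {
    try {
      found.push(JSON.parse(match[1]));
    } catch {
      // Skip blocks whose contents are not valid JSON.
    }
  }
  return found;
}
```

The HTML string could come from page.content() after navigation.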
Testing Scrapers
Manual Testing
Test scraping logic in isolation. Set headless: false to watch the browser navigate.
Automated Tests
Use mocked HTML pages.
Selector Maintenance
Monitoring Failures
Track scraping errors to detect selector breakage:
index.mjs:249
Adding New Selectors
When Amazon changes layouts, add new selectors.
Security Considerations
XSS Prevention
All scraped text is escaped before displaying in Telegram:
index.mjs:64-66
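A minimal sketch of such escaping, assuming messages are sent with Telegram's HTML parse mode. The function name escapeHtml is illustrative and may differ from the helper in index.mjs.

```javascript
// Escape the three characters that Telegram's HTML parse mode treats as markup.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;") // must run first so later entities are not double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

console.log(escapeHtml("<b>Tom & Jerry</b>"));
// &lt;b&gt;Tom &amp; Jerry&lt;/b&gt;
```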
URL Validation
URLs are validated before scraping:
index.mjs:401
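One possible validation approach is a host allowlist, sketched below. The list contents and the function name isAllowedAmazonUrl are assumptions; the source does not show which check index.mjs performs.

```javascript
// Illustrative allowlist; extend it with the marketplaces the bot should accept.
const AMAZON_HOSTS = new Set(["amazon.com", "amazon.co.uk", "amazon.de", "amazon.es", "amazon.com.mx"]);

// True only for http(s) URLs whose host is an allowlisted Amazon domain.
function isAllowedAmazonUrl(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // not a parseable URL
  }
  if (url.protocol !== "https:" && url.protocol !== "http:") return false;
  const host = url.hostname.toLowerCase().replace(/^www\./, "");
  return AMAZON_HOSTS.has(host);
}
```

Comparing the full hostname against a fixed set avoids the lookalike-domain problem that substring checks such as includes("amazon") have.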
Sandboxing
Chromium runs with the --no-sandbox flag. This is a security trade-off:
Risk: With the sandbox disabled, a malicious page that exploits a browser bug is no longer contained and can act with the privileges of the bot process.
Mitigation: Only scrape trusted domains (Amazon). For production, run in a containerized environment.