    # Connect to PostgreSQL
    sudo -u postgres psql

    # Create database and user
    CREATE DATABASE imdb_scraper;
    CREATE USER your_username WITH PASSWORD 'your_password';
    GRANT ALL PRIVILEGES ON DATABASE imdb_scraper TO your_username;
    \q
    # Install from requirements.txt
    pip install -r requirements.txt
Dependencies installed:
    requests           # HTTP client
    bs4                # HTML parsing
    psycopg2-binary    # PostgreSQL adapter
    python-dotenv      # Environment variable management
    stem               # TOR control library
    requests[socks]    # SOCKS proxy support
5. Configure Environment Variables
Create a .env file in the project root:
.env
    POSTGRES_DB=imdb_scraper
    POSTGRES_USER=your_username
    POSTGRES_PASSWORD=your_password
    POSTGRES_HOST=localhost
    POSTGRES_PORT=5432

    # Optional: Configure proxies/VPN if available
    PROXY_HOST=
    PROXY_PORT=
    PROXY_USER=
    PROXY_PASS=
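To illustrate how these variables might be consumed, here is a minimal sketch that collects the `POSTGRES_*` values into keyword arguments suitable for `psycopg2.connect()`. The function name `postgres_settings` is hypothetical and is not taken from the project; the fallback values for host and port are assumptions based on the example above.

```python
import os
from typing import Mapping


def postgres_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the POSTGRES_* variables into kwargs for psycopg2.connect().

    Hypothetical helper for illustration; the project may read its
    configuration differently.
    """
    return {
        # These three are required: a missing key raises KeyError early.
        "dbname": env["POSTGRES_DB"],
        "user": env["POSTGRES_USER"],
        "password": env["POSTGRES_PASSWORD"],
        # Assumed fallbacks, matching the values in the .env example.
        "host": env.get("POSTGRES_HOST", "localhost"),
        "port": int(env.get("POSTGRES_PORT", "5432")),
    }
```

With python-dotenv installed, calling `load_dotenv()` first copies the .env entries into `os.environ`, after which `psycopg2.connect(**postgres_settings())` would open a connection with these settings.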
6. Run the Scraper
    # Execute the scraper
    python presentation/cli/run_scraper.py
Monitor the output (the scraper's log messages are in Spanish):
    INFO - Inicializando contenedor de dependencias...
    INFO - Construyendo scraper...
    INFO - Iniciando proceso de scraping...
    INFO - [HTML] IDs obtenidos: 250
    INFO - Tráfico total usado: 15.42 MB
    INFO - Proceso de scraping finalizado exitosamente.
Error: Cannot connect to the Docker daemon

Solution:
    # Linux
    sudo systemctl start docker

    # macOS
    open /Applications/Docker.app

    # Windows
    # Start Docker Desktop from Start Menu
Permission denied errors (Docker)
Error: Got permission denied while trying to connect to the Docker daemon socket

Solution:
    # Add user to docker group
    sudo usermod -aG docker $USER

    # Log out and back in, or run:
    newgrp docker
PostgreSQL connection refused
Error: psql: error: connection to server at "localhost" port 5432 failed

Solution:
    # Check if PostgreSQL is running
    sudo systemctl status postgresql

    # Start if not running
    sudo systemctl start postgresql

    # Check port binding
    sudo netstat -plnt | grep 5432
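Where `netstat` is not available, the same port check can be approximated from Python. The sketch below only tests whether something accepts TCP connections on the port; it does not confirm that the listener is PostgreSQL or that the credentials work. The function name `port_is_open` is hypothetical.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 5432, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A result of `False` for `port_is_open("localhost", 5432)` is consistent with the "connection refused" error above and suggests the server is not running or is bound to a different port.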
Python module import errors
Error: ModuleNotFoundError: No module named 'bs4'

Solution:
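One way to see which dependencies are missing from the active environment is to look up each import name without importing it. This is a sketch, not part of the project: `REQUIRED_MODULES` and `missing_modules` are hypothetical names, and the list assumes the usual import names for the packages in requirements.txt.

```python
import importlib.util

# Import names for the packages listed in requirements.txt. The import name
# can differ from the pip name: python-dotenv -> dotenv,
# psycopg2-binary -> psycopg2, requests[socks] (PySocks) -> socks.
REQUIRED_MODULES = ["requests", "bs4", "psycopg2", "dotenv", "stem", "socks"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

If `missing_modules()` returns a non-empty list, re-running `pip install -r requirements.txt` with the same interpreter (and virtual environment, if one is used) that runs the scraper is the likely fix.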
Error: SOCKSHTTPConnectionPool: Max retries exceeded

Solution:
    # Restart TOR service
    sudo systemctl restart tor

    # Verify TOR is listening
    sudo netstat -plnt | grep 9050

    # Test TOR connectivity
    curl --socks5 localhost:9050 https://check.torproject.org/
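The `curl` test has a rough equivalent in Python using `requests` with SOCKS support. The sketch below only builds the proxy mapping; `tor_proxies` is a hypothetical helper, and how the scraper itself configures its proxies is not shown in this guide.

```python
def tor_proxies(host: str = "localhost", port: int = 9050) -> dict:
    """Build a requests-style proxies mapping that routes traffic through TOR's SOCKS port."""
    # socks5h (rather than socks5) asks the proxy to resolve hostnames,
    # so DNS lookups also go through TOR.
    url = f"socks5h://{host}:{port}"
    return {"http": url, "https": url}
```

With `requests[socks]` installed, `requests.get("https://check.torproject.org/", proxies=tor_proxies(), timeout=30)` should succeed once TOR is listening on port 9050; a repeat of the "Max retries exceeded" error points back at the TOR service rather than the scraper.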
CSV files empty or missing
Error: CSV files exist but have no data

Solution:
Check the scraper logs for validation errors.
IMDb may have changed its HTML structure; if so, update the selectors in config.py.
Network issues may have prevented data retrieval; check the proxy/VPN settings.
    # View detailed logs
    tail -f logs/scraper.log
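To confirm whether an output file really is empty, a short check like the following can help. It is a sketch under two assumptions that this guide does not state: that the CSV files start with a header row and that they are UTF-8 encoded. The name `count_data_rows` is hypothetical.

```python
import csv
from pathlib import Path


def count_data_rows(csv_path) -> int:
    """Count non-empty rows after the header; a missing file counts as 0 rows."""
    path = Path(csv_path)
    if not path.exists():
        return 0
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the (assumed) header row, if present
        # Ignore rows that are blank or contain only whitespace cells.
        return sum(1 for row in reader if any(cell.strip() for cell in row))
```

A count of 0 for a file that exists means only the header (or nothing) was written, which matches the symptom described above.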
Out of memory errors
Error: MemoryError or container crashes

Solution:
Reduce MAX_THREADS in shared/config/config.py (default: 50)
Increase Docker memory limit in Docker Desktop settings
Process movies in smaller batches by reducing NUM_MOVIES
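The effect of lowering the thread count and batch size can be sketched as follows. This illustrates the general technique only, not the scraper's actual implementation: `chunked` and `process_in_batches` are hypothetical names, and `max_threads` and `batch_size` stand in for whatever `MAX_THREADS` and `NUM_MOVIES` control in the project.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, Sequence


def chunked(items: Sequence, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def process_in_batches(items: Sequence, worker: Callable,
                       batch_size: int = 25, max_threads: int = 10) -> list:
    """Apply `worker` to every item, one batch at a time, with a bounded thread pool."""
    results = []
    for batch in chunked(items, batch_size):
        # At most `batch_size` tasks are queued and `max_threads` run concurrently,
        # which bounds how much intermediate data is held in memory at once.
        with ThreadPoolExecutor(max_workers=max_threads) as pool:
            results.extend(pool.map(worker, batch))  # map preserves input order
    return results
```

Smaller values for both parameters trade throughput for a lower peak memory footprint, which is the same trade-off the settings above make.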