blackweb.txt file is accurate, deduplicated, and optimized for Squid-Cache.
Processing Stages
The update process consists of several sequential stages:
1. Capture Public Blocklists
The first stage downloads domains from all public blocklists and unifies them into a single file.
2. Domains Debugging
Removes overlapping domains and normalizes the remaining entries to the Squid-Cache format. Excludes false positives using the allowlist (debugwl.txt).
Input:
How subdomain deduplication works
BlackWeb removes redundant subdomains because Squid-Cache matches all subdomains when a parent domain is blocked. For example:
- .domain.com blocks www.domain.com, api.domain.com, mail.domain.com, etc.
- Therefore, listing both .domain.com and .subdomain.domain.com is redundant
- Only .domain.com is kept to minimize file size
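The rule above can be sketched in a few lines of Python. This is only an illustration of the idea, not the script BlackWeb actually runs; the function name drop_redundant_subdomains is hypothetical.

```python
def drop_redundant_subdomains(domains):
    """Keep only entries that have no parent domain in the same list."""
    # Normalize to the dotted Squid style (".domain.com") and deduplicate.
    entries = {"." + d.strip().lstrip(".").lower() for d in domains if d.strip()}
    kept = []
    for entry in entries:
        labels = entry.lstrip(".").split(".")
        # Every proper parent: ".domain.com" for ".sub.domain.com", and so on.
        parents = ("." + ".".join(labels[i:]) for i in range(1, len(labels)))
        if not any(parent in entries for parent in parents):
            kept.append(entry)
    return sorted(kept)


print(drop_redundant_subdomains(
    [".domain.com", ".subdomain.domain.com", "www.domain.com", ".example.org"]
))
```

Looking up each parent in a set keeps the work proportional to the number of labels per entry, which matters when the list holds millions of domains.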
3. TLD Validation
Removes domains with invalid TLDs using a comprehensive list of Public and Private Suffix TLDs including:
- ccTLD (Country Code Top-Level Domains)
- ccSLD (Country Code Second-Level Domains)
- sTLD (Sponsored Top-Level Domains)
- uTLD (Unsponsored Top-Level Domains)
- gSLD (Generic Second-Level Domains)
- gTLD (Generic Top-Level Domains)
- eTLD (Effective Top-Level Domains)
- Up to 4th level domains (4LDs)
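As a rough sketch of how such a check could work, the Python below tests whether a domain ends in a known suffix. The VALID_SUFFIXES set is a tiny hypothetical stand-in for the real suffix list, and the function names are illustrative only.

```python
# Hypothetical, tiny stand-in for the real suffix list, which is far larger.
VALID_SUFFIXES = {"com", "org", "net", "co.uk", "com.ar"}


def has_valid_suffix(domain, suffixes=VALID_SUFFIXES):
    """Return True if some trailing run of labels is a known suffix."""
    labels = domain.strip().lstrip(".").lower().split(".")
    # Start at index 1 so a bare suffix such as "co.uk" is not accepted
    # as a domain (a choice made for this sketch).
    return any(".".join(labels[i:]) in suffixes for i in range(1, len(labels)))


def filter_valid(domains, suffixes=VALID_SUFFIXES):
    """Keep only the domains that end in a known suffix."""
    return [d for d in domains if has_valid_suffix(d, suffixes)]


print(filter_valid([".example.com", ".setup.exe", ".shop.co.uk"]))
```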
For example, a domain ending in .exe is removed because .exe is not a valid TLD. Only domains with valid, registered TLDs are kept.
4. Debugging Punycode-IDN
Converts international domain names (IDN) to Punycode/IDNA format and removes invalid entries:
- Removes hostnames larger than 63 characters (RFC 1035)
- Removes characters inadmissible by IDN standards
- Converts non-ASCII domains to Punycode format
What is Punycode?
Punycode is a way to represent Unicode characters in ASCII for domain names. It helps prevent IDN homograph attacks where attackers use visually similar characters from different alphabets to create fake domains. For example:
- google.com (ASCII)
- gооgle.com (Cyrillic ‘о’ characters) → xn--ggle-55da.com
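The conversion can be reproduced with Python's built-in idna codec, shown here as a minimal sketch rather than the tool BlackWeb itself uses (the dependency list names idn2). The helper name to_punycode is hypothetical. Note that the lookalike alphabet matters: the Cyrillic letter о (U+043E) and the Greek omicron ο (U+03BF) look alike but produce different xn-- labels.

```python
def to_punycode(domain):
    """Convert a domain to its ASCII (Punycode) form, or return None if invalid."""
    try:
        # Python's built-in "idna" codec implements IDNA 2003 (RFC 3490).
        # A Squid-style leading dot must be stripped before calling this.
        return domain.encode("idna").decode("ascii")
    except UnicodeError:
        # Raised for empty labels, labels longer than 63 characters, etc.
        return None


latin = "google.com"
cyrillic = "g\u043e\u043egle.com"  # two CYRILLIC SMALL LETTER O (U+043E)
greek = "g\u03bf\u03bfgle.com"     # two GREEK SMALL LETTER OMICRON (U+03BF)

for name in (latin, cyrillic, greek, "a" * 64 + ".com"):
    print(to_punycode(name))
```

The last entry has a 64-character label, so the codec rejects it and the helper returns None, which mirrors the length rule listed above.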
Blocking domains that begin with xn-- can prevent these attacks.
5. Debugging non-ASCII Characters
Removes entries with:
- Invalid encoding
- Non-printable characters
- Whitespace
- Disallowed symbols
- Corrupted UTF-8, CP1252, ISO-8859-1 encoding
The output is charset=us-ascii for a clean, standardized list.
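A minimal Python sketch of this kind of filter follows. The set of allowed characters (lowercase letters, digits, hyphen, and dot) is an assumption made for the example, and clean_ascii is a hypothetical name; the real stage may accept or reject a different set.

```python
import string

# Characters accepted by this sketch: letters, digits, hyphen and dot (an assumption).
ALLOWED = set(string.ascii_lowercase + string.digits + "-.")


def clean_ascii(lines):
    """Yield only entries made up entirely of allowed ASCII characters."""
    for raw in lines:
        try:
            text = raw.decode("ascii") if isinstance(raw, bytes) else raw
        except UnicodeDecodeError:
            continue  # bytes that are not valid US-ASCII
        entry = text.rstrip("\r\n").lower()
        # Spaces, tabs, control characters and symbols all fall outside ALLOWED.
        if entry and set(entry) <= ALLOWED:
            yield entry


sample = [b".example.com\n", b".caf\xe9.com\n", ".bad domain.com", ".ok-site.net"]
print(list(clean_ascii(sample)))
```

Punycode entries pass this filter because the xn-- prefix uses only letters, digits, and hyphens.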
Input:
6. DNS Lookup
Most public sources contain millions of invalid or nonexistent domains. Each domain is verified via DNS lookup in 2 steps to exclude nonexistent entries.
Performance Configuration: You can control DNS lookup concurrency with the PROCS variable.
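To illustrate how a concurrency setting like PROCS bounds parallel lookups, here is a single-pass Python sketch using a thread pool. It does not reproduce the script's two-step check, and reading PROCS from the environment is an assumption made for this example; resolves and filter_resolvable are hypothetical names.

```python
import os
import socket
from concurrent.futures import ThreadPoolExecutor


def resolves(domain):
    """Return True if the system resolver finds any address for the domain."""
    try:
        socket.getaddrinfo(domain.lstrip("."), None)
        return True
    except (socket.gaierror, UnicodeError):
        return False


def filter_resolvable(domains, procs=None, resolver=resolves):
    """Check domains in parallel and keep those the resolver accepts."""
    domains = list(domains)
    if procs is None:
        # Assumed convention: PROCS in the environment, else the CPU count.
        procs = int(os.environ.get("PROCS", os.cpu_count() or 1))
    with ThreadPoolExecutor(max_workers=procs) as pool:
        results = list(pool.map(resolver, domains))
    return [d for d, ok in zip(domains, results) if ok]
```

A call such as filter_resolvable(domains, procs=8) caps the pool at eight workers. The resolver argument can be swapped for a stub when testing without network access.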
7. Exclude Government TLDs
Removes government domains (.gov) and other related TLDs from BlackWeb to avoid blocking official government services.
Input:
8. Run Squid-Cache with BlackWeb
The final stage tests the blocklist with Squid-Cache. Any errors are sent to SquidErrors.txt for review.
Processing Timeline
Check /var/log/syslog for completion.
Performance Considerations
CPU Usage
The DNS lookup stage is CPU-intensive due to parallel processing. Monitor CPU temperature and adjust PROCS if needed.
Memory Usage
Processing 7+ million domains requires adequate RAM. Recommended minimum: 4GB RAM.
Network Bandwidth
- Downloading 100+ blocklists: ~500MB-1GB
- DNS lookups: Millions of queries
- Consider bandwidth caps on metered connections
Disk I/O
Temporary files can consume several GB during processing. Ensure adequate disk space in /tmp and working directories.
Resuming Interrupted Updates
If you interrupt the update process with Ctrl+C during the DNS Lookup stage, the script will resume from that point on the next run. If interrupted earlier, you’ll need to start from the beginning or modify the script manually.
Default Path
The default installation path for BlackWeb is /etc/acl. You can change this to your preferred location.
Dependencies
The update process requires:
- Python 3.x
- Bash 5.x
- System packages: wget, git, curl, libnotify-bin, perl, tar, rar, unrar, unzip, zip, gzip, python-is-python3, idn2, iconv
