DoFi

Domain Filtering - A collection of Python and Bash scripts for validating, filtering, and checking domain lists.

Overview

DoFi provides two main tools:

Domain Filter

Python script that validates TLDs, removes overlapping domains, and filters invalid entries

Domain Check

Bash script that verifies domain existence using the host command

Download Project

sudo apt install -y python-is-python3
wget -qO gitfolder.py https://raw.githubusercontent.com/maravento/vault/master/scripts/python/gitfolder.py
chmod +x gitfolder.py
python gitfolder.py https://github.com/maravento/vault/dofi

Requirements

  • Python: 3.12.3 or later
  • Bash: 5.2.21 or later
  • Tested on: Ubuntu 22.04/24.04 x64
Input Format: Ensure your domain list has no http://, https://, or www. prefixes before processing.
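
The input format requirement above can be met with a small pre-processing step. The following is a minimal sketch, not part of DoFi; the helper names `normalize_domain` and `normalize_lines` are hypothetical, and it assumes one entry per line.

```python
import re

# Hypothetical helper (not shipped with DoFi): matches an optional scheme
# followed by an optional "www." at the start of a line.
_PREFIX = re.compile(r"^(?:https?://)?(?:www\.)?", re.IGNORECASE)


def normalize_domain(line: str) -> str:
    """Return a bare, lowercase domain, or "" for a blank line."""
    host = _PREFIX.sub("", line.strip())
    # Drop any path, port, query, or fragment that followed the host name.
    host = re.split(r"[/:?#]", host, maxsplit=1)[0]
    return host.lower().rstrip(".")


def normalize_lines(lines):
    """Normalize every line, dropping blanks and duplicates, keeping order."""
    seen = set()
    result = []
    for line in lines:
        domain = normalize_domain(line)
        if domain and domain not in seen:
            seen.add(domain)
            result.append(domain)
    return result
```

Writing the result of `normalize_lines` back to a file, one domain per line, should give a list that is safe to pass to either tool.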

Domain Filter (Python)

Advanced domain filtering with TLD validation and overlap removal.

Features

The Python script performs comprehensive domain filtering:
1. TLD Collection

Downloads public suffix TLDs from multiple authoritative sources

2. TLD Validation

Removes invalid or duplicate TLDs from the collection

3. Domain Validation

Filters domains to ensure they end with a valid TLD

4. Overlap Removal

Removes overlapping domains (e.g., keeps example.com and removes sub.example.com if both exist)

5. Duplicate Exclusion

Excludes duplicates from previously validated domains

6. Output Generation

Saves cleaned domains and removed entries to separate files
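
The overlap removal step can be illustrated with a short sketch. This is an assumption about how such a step might work, not the actual code in domfilter.py; the function name `remove_overlaps` is hypothetical.

```python
def remove_overlaps(domains):
    """Split domains into (kept, removed).

    A domain is removed when one of its parent domains is also in the list,
    e.g. sub.example.com is removed if example.com is present.
    """
    unique = set(domains)
    kept, removed = [], []
    for domain in sorted(unique):
        labels = domain.split(".")
        # Every proper suffix of the name: a.b.com -> b.com, com
        parents = (".".join(labels[i:]) for i in range(1, len(labels)))
        if any(parent in unique for parent in parents):
            removed.append(domain)
        else:
            kept.append(domain)
    return kept, removed
```

Comparing whole labels rather than raw string endings keeps notexample.com from being treated as a subdomain of example.com.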

Basic Usage

# Download the script
wget -qO domfilter.py https://raw.githubusercontent.com/maravento/vault/master/dofi/domfilter.py

# Run with your domain list
python domfilter.py --input mylst.txt

Output Files

By default, the script creates:
  • output.txt - Validated and filtered domains
  • removed.txt - Domains that were filtered out

TLD Coverage

The filter validates against several categories of TLDs:
  • ccTLDs - Country code top-level domains
  • gTLDs - Generic top-level domains
  • sTLDs - Sponsored top-level domains
  • eTLDs - Effective top-level domains
  • 4LDs - Fourth-level domains
All TLDs are saved to tlds.txt during processing.
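
Because the list includes multi-label suffixes such as co.uk, validation has to compare whole trailing labels rather than only the last one. The sketch below shows one way to do that, assuming the suffixes have already been loaded into a Python set; the function names are hypothetical and the real script may work differently.

```python
def matching_suffix(domain, suffixes):
    """Return the longest entry of `suffixes` that `domain` ends with, or None."""
    labels = domain.lower().strip(".").split(".")
    # Start at index 1 so at least one label remains in front of the suffix;
    # walking left to right tries the longest candidate first.
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            return candidate
    return None


def split_by_tld(domains, suffixes):
    """Return (valid, invalid) lists, preserving input order."""
    valid, invalid = [], []
    for domain in domains:
        if matching_suffix(domain, suffixes) is None:
            invalid.append(domain)
        else:
            valid.append(domain)
    return valid, invalid
```

With a suffix set loaded from tlds.txt, `split_by_tld` would separate entries such as invalid-domain.xyz123 from entries that end in a known suffix.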

Example

# Input file: domains.txt
example.com
sub.example.com
invalid-domain.xyz123
test.org

# Run filter
python domfilter.py --input domains.txt

# Output (output.txt):
example.com
test.org

# Removed (removed.txt):
sub.example.com  # Overlapping with example.com
invalid-domain.xyz123  # Invalid TLD

Domain Check (Bash)

Verifies domain existence using DNS lookups.

Features

The Bash script checks whether each domain exists:
  • Uses host command for DNS verification
  • Parallel processing for speed
  • Separates valid and invalid domains
  • Generates difference report
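
For readers who prefer Python, the same idea (parallel lookups, then a split into existing and non-existent domains) can be sketched as follows. This is not a port of domcheck.sh: it swaps the host command for the system resolver via `socket.getaddrinfo`, which only looks for address records and also honors local hosts files, so results may differ from the Bash script. The function names and the `resolver` parameter are hypothetical.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolves(domain):
    """Return True if the system resolver finds an address for `domain`."""
    try:
        socket.getaddrinfo(domain, None)
        return True
    except (socket.gaierror, UnicodeError):
        # gaierror: lookup failed; UnicodeError: name is not encodable as IDNA.
        return False


def check_domains(domains, workers=100, resolver=resolves):
    """Check domains in parallel and return (hits, faults) in input order."""
    domains = list(domains)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(resolver, domains))
    hits = [d for d, ok in zip(domains, results) if ok]
    faults = [d for d, ok in zip(domains, results) if not ok]
    return hits, faults
```

The `resolver` argument exists so the lookup can be replaced, for example with a stub during testing or with a wrapper that shells out to host.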

Basic Usage

# Download the script
wget -qO domcheck.sh https://raw.githubusercontent.com/maravento/vault/master/dofi/domcheck.sh
chmod +x domcheck.sh

# Run with your domain list
./domcheck.sh my_domain_list.txt

Output Files

The script generates three output files:

hit.txt

Existing domains verified via DNS

fault.txt

Non-existent domains that failed verification

outdiff.txt

Difference between input and output

Parallel Processing

Control the number of parallel checks:
# Default: 100 parallel processes
./domcheck.sh my_domain_list.txt

# Custom: 50 parallel processes
./domcheck.sh my_domain_list.txt 50
Adjust the parallel process count to match your system resources and network capacity. Higher values finish faster but consume more resources.

Example

# Input file: check_domains.txt
google.com
example-does-not-exist-12345.com
github.com

# Run check
./domcheck.sh check_domains.txt

# Results:
# hit.txt:
google.com
github.com

# fault.txt:
example-does-not-exist-12345.com

# outdiff.txt:
# Shows detailed differences

Workflow Examples

Complete Domain Cleaning

Combine both tools for comprehensive cleaning:
# Step 1: Filter and validate TLDs
python domfilter.py --input raw_domains.txt --output filtered.txt

# Step 2: Verify domain existence
./domcheck.sh filtered.txt

# Final result in hit.txt - fully validated, existing domains

Large List Processing

# Process large lists efficiently
python domfilter.py --input huge_list.txt --output stage1.txt
./domcheck.sh stage1.txt 200  # Increase parallel processes

TLD Data Sources

DoFi pulls TLD data from authoritative sources:

IANA

Official IANA TLD list

Public Suffix

Mozilla’s public suffix list

WHOIS XML API

Supported gTLD database

Blackweb TLDs

Extended TLD appendix
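
Merging several sources into one list mostly comes down to skipping comments and removing duplicates. The sketch below is an illustration under stated assumptions, not DoFi's actual parser: it assumes Public Suffix List style input (comments start with //, rules may begin with *. or !) and IANA style input (comments start with #, entries in upper case). The function names are hypothetical.

```python
def parse_suffix_list(text):
    """Extract plain suffixes from suffix-list text (PSL or IANA style)."""
    suffixes = set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("//", "#")):
            continue  # blank line or comment
        rule = line.split()[0].lower()
        if rule.startswith("!"):
            continue  # exception rule: names a registrable domain, not a suffix
        if rule.startswith("*."):
            rule = rule[2:]  # simplification: keep only the wildcard's base
        suffixes.add(rule.lstrip("."))
    return suffixes


def merge_sources(*suffix_sets):
    """Union any number of suffix sets into one sorted, duplicate-free list."""
    return sorted(set().union(*suffix_sets))
```

Treating a wildcard rule as its base suffix is a deliberate simplification; a stricter parser would keep wildcard and exception rules and apply them during matching.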

Performance Tips

For large domain lists (millions of entries):
  • Run on systems with adequate RAM (4GB+ recommended)
  • Use SSD storage for faster I/O
  • TLD cache is created once and reused
For efficient domain checking:
  • Increase parallel processes on powerful systems
  • Use reliable DNS servers
  • Consider network bandwidth limits
  • Monitor system load during processing
# Monitor performance
./domcheck.sh large_list.txt 150 &
watch -n 1 'wc -l hit.txt fault.txt'
Optimize the full pipeline:
# Filter first (fast, local)
time python domfilter.py --input raw.txt --output filtered.txt

# Then check (slower, network-dependent)
time ./domcheck.sh filtered.txt 100
This approach minimizes network checks by filtering invalid entries first.

Use Cases

Blocklist Maintenance

Clean and validate domain blocklists before deployment

SEO Analysis

Verify domain lists for SEO tools and analysis

Security Research

Validate threat intelligence domain feeds

DNS Administration

Maintain clean DNS zone files and records

Troubleshooting

If TLD sources are unavailable:
  • Check internet connectivity
  • Verify firewall/proxy settings
  • The script uses cached TLDs if available
If domain checks are slow or timing out:
  • Reduce parallel processes
  • Check DNS server responsiveness
  • Consider using a local DNS cache
# Test DNS performance
time host google.com

# Use fewer parallel processes
./domcheck.sh domains.txt 25
For very large lists:
  • Split input files into smaller chunks
  • Process in batches
  • Increase system swap if needed
# Split large file
split -l 100000 huge_list.txt chunk_

# Process each chunk
# Quote the variable so file names with spaces or glob characters are handled safely
for file in chunk_*; do
  python domfilter.py --input "$file" --output "filtered_$file"
done
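
Where the split command is not available, the chunking step can be approximated in Python. This is a minimal sketch; `split_file` and its parameters are hypothetical names, and the chunk file naming (chunk_0000.txt, chunk_0001.txt, ...) differs from the chunk_aa style that split produces.

```python
from itertools import islice
from pathlib import Path


def split_file(path, lines_per_chunk=100_000, prefix="chunk_"):
    """Write consecutive slices of `path` to numbered files next to it.

    Returns the list of chunk paths in order.
    """
    source = Path(path)
    chunks = []
    with source.open(encoding="utf-8") as handle:
        index = 0
        while True:
            # Read at most `lines_per_chunk` lines without loading the whole file.
            block = list(islice(handle, lines_per_chunk))
            if not block:
                break
            chunk_path = source.with_name(f"{prefix}{index:04d}.txt")
            chunk_path.write_text("".join(block), encoding="utf-8")
            chunks.append(chunk_path)
            index += 1
    return chunks
```

Each returned path can then be passed to domfilter.py in turn, as in the shell loop above.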

Best Practices

1. Prepare Input

Clean input files before processing:
  • Remove http://, https://, www. prefixes
  • Convert to lowercase
  • Remove duplicates
  • One domain per line

2. Run Domain Filter

Use the Python script first to validate TLDs and remove overlaps

3. Run Domain Check

Verify that the filtered domains actually exist via DNS

4. Review Results

Check removed entries for false positives before finalizing

License

GPL-3.0 CC BY-SA 4.0
