Deduper Command

Overview

The deduper command removes from a child DAT file any ROM entries that already exist in a parent DAT file. This is useful for creating subset DATs that contain only the entries not found in a main collection.

Basic Usage

datoso deduper --input <input-dat> --parent <parent-dat>

The deduper:

Reads the input (child) DAT file
Reads the parent DAT file
Compares ROM entries by hash value (SHA1, MD5, CRC32)
Removes matching entries from the child DAT
Saves the deduplicated DAT file
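The steps above can be sketched in a few lines of Python. This is an illustration only, not Datoso's implementation: entries are modelled as plain dictionaries, the function name `dedupe_entries` and the keys `sha1`, `md5` and `crc` are assumptions, and an entry is treated as a duplicate when any of its hashes appears in the parent.

```python
HASH_KEYS = ("sha1", "md5", "crc")  # assumed key names, for illustration only


def dedupe_entries(child, parent):
    """Return the child entries whose hashes do not appear in the parent."""
    # Collect every (hash type, value) pair present in the parent once,
    # so each child entry can be checked with a cheap set operation.
    parent_hashes = {
        (key, entry[key].lower())
        for entry in parent
        for key in HASH_KEYS
        if entry.get(key)
    }
    unique = []
    for entry in child:
        entry_hashes = {
            (key, entry[key].lower()) for key in HASH_KEYS if entry.get(key)
        }
        # Keep the entry only if none of its hashes is known to the parent.
        # Entries that carry no hash at all are kept.
        if entry_hashes.isdisjoint(parent_hashes):
            unique.append(entry)
    return unique
```

Hash values are lowercased before comparison so that `AA` and `aa` count as the same digest.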

Required Arguments

Input DAT File

-i, --input

string

required

Input DAT file to deduplicate. Can be either a database reference (seed:name) or a file path.

Examples:

# Using database reference
datoso deduper --input "redump:Sony - PlayStation 2 (Demos)" --parent "redump:Sony - PlayStation 2"

# Using file path
datoso deduper --input "/path/to/ps2_demos.dat" --parent "/path/to/ps2_main.dat"

Parent DAT File

You must specify either --parent or --auto-merge (mutually exclusive).

-p, --parent

string

Parent DAT file to compare against. Can be either a database reference (seed:name) or a file path.

Examples:

# Database reference
datoso deduper --input "redump:PS2 Demos" --parent "redump:PlayStation 2"

# File path
datoso deduper --input "./demos.dat" --parent "/roms/main.dat"

When the input is a file path (.dat or .xml), the --parent argument is required unless using --auto-merge.
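The "either --parent or --auto-merge" rule is the kind of constraint a mutually exclusive argument group expresses. The sketch below uses Python's standard `argparse` to show the idea; it is not Datoso's actual parser, and the program name `deduper-sketch` is hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="deduper-sketch")
    parser.add_argument("-i", "--input", required=True)
    # Exactly one of --parent / --auto-merge must be given.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-p", "--parent")
    group.add_argument("-au", "--auto-merge", action="store_true")
    return parser
```

With this setup, passing both flags or neither makes `argparse` exit with a usage error.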

Auto-Merge Mode

-au, --auto-merge

flag

Automatically detect and use the parent DAT from the database.

Example:

datoso deduper --input "redump:PS2 Demos" --auto-merge

In auto-merge mode:

Datoso looks up the input DAT in the database
Reads the parent field from the DAT metadata
Uses that parent DAT for deduplication

Auto-merge requires that the input DAT has been processed and has a parent field set in the database.
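The lookup described above amounts to two checks: the input DAT must be known, and its record must carry a parent. A minimal sketch, assuming the database can be pictured as a dictionary keyed by `seed:name` (the function `resolve_parent` and the record layout are hypothetical, not Datoso's storage format):

```python
def resolve_parent(database, dat_ref):
    """Return the parent reference stored for dat_ref, or raise LookupError."""
    record = database.get(dat_ref)
    if record is None:
        raise LookupError(f"{dat_ref!r} is not in the database")
    parent = record.get("parent")
    if not parent:
        # The DAT exists but no parent relationship has been configured.
        raise LookupError(f"{dat_ref!r} has no parent field set")
    return parent
```

The two failure cases correspond to the two things to check when auto-merge does not work: whether the DAT was imported, and whether its parent field is set.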

Optional Arguments

Output File

-o, --output

string

Output file path. If not specified, overwrites the input file.

Examples:

# Specify output file
datoso deduper --input "redump:PS2 Demos" --parent "redump:PS2" --output "/roms/ps2_demos_deduped.dat"

# Overwrite input (default)
datoso deduper --input "redump:PS2 Demos" --parent "redump:PS2"

If no output is specified, the input DAT file will be overwritten with the deduplicated version. Make backups of important DAT files.

Dry Run

-dr, --dry-run

flag

Show what would be removed without writing the output file.

Example:

datoso deduper --input "redump:PS2 Demos" --parent "redump:PS2" --dry-run

In dry-run mode:

No files are written
Duplicate entries are identified and logged
Enables debug-level logging automatically
Useful for previewing changes before committing
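The dry-run behaviour listed above follows a common pattern: do all the work, report it, and skip only the write. A rough sketch of that pattern, assuming a hypothetical `write_deduped` helper and using JSON as a stand-in for the real DAT output format:

```python
import json
import logging
from pathlib import Path

log = logging.getLogger("deduper-sketch")  # hypothetical logger name


def write_deduped(entries, duplicates, output_path, dry_run=False):
    """Write the surviving entries unless dry_run is set; return True if written."""
    if dry_run:
        # Raise verbosity, report what would go, and write nothing.
        log.setLevel(logging.DEBUG)
        for entry in duplicates:
            log.debug("Would remove %s", entry["name"])
        return False
    # JSON is only a placeholder here; a real tool would emit a DAT file.
    Path(output_path).write_text(json.dumps(entries, indent=2))
    return True
```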

Input Format Options

Database Reference Format

Reference DAT files already imported into Datoso:

seed:name

Examples:

redump:Sony - PlayStation 2
nointro:Nintendo - Nintendo DS
tosec:Commodore 64

File Path Format

Reference DAT files directly from the filesystem:

/path/to/file.dat
/path/to/file.xml

Examples:

/home/user/roms/ps2.dat
./local/datfile.dat
/mnt/storage/collection.xml
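If you build these arguments in your own scripts, it can help to tell the two formats apart up front. The sketch below is one possible heuristic, not the rule Datoso itself applies: a value ending in `.dat` or `.xml` is treated as a file path, and anything else must look like `seed:name`. The function name `classify_input` is hypothetical.

```python
import os


def classify_input(value):
    """Return ("file", path) or ("database", (seed, name))."""
    extension = os.path.splitext(value)[1].lower()
    if extension in {".dat", ".xml"}:
        return "file", value
    # Split on the first colon only, since DAT names may contain more.
    seed, separator, name = value.partition(":")
    if separator and seed and name:
        return "database", (seed, name)
    raise ValueError(f"not a file path or seed:name reference: {value!r}")
```

Checking the extension first keeps paths that contain a colon, such as Windows drive paths ending in `.dat`, from being read as database references.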

Deduplication Logic

Hash Comparison

The deduper compares ROM entries using hash values in this priority order:

SHA1 (most reliable, if present)
MD5 (fallback)
CRC32 (fallback)

Matching Criteria

A ROM is considered a duplicate if:

At least one hash value matches between input and parent
The hash comparison is successful for the strongest available hash
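One way to read the priority order and matching criteria above is "compare the strongest hash that both entries carry". The sketch below implements that reading; it is an interpretation rather than Datoso's confirmed logic, and the dictionary keys and the name `is_duplicate` are assumptions.

```python
HASH_PRIORITY = ("sha1", "md5", "crc")  # strongest first


def is_duplicate(rom, parent_rom):
    """Compare two ROM entries using the strongest hash present in both."""
    for key in HASH_PRIORITY:
        ours, theirs = rom.get(key), parent_rom.get(key)
        if ours and theirs:
            # The strongest shared hash decides; weaker hashes are ignored.
            return ours.lower() == theirs.lower()
    # No hash type in common, so the entries cannot be compared.
    return False
```

Under this reading, two entries with the same CRC32 but different SHA1 values are not duplicates, which guards against CRC32 collisions.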

Preservation

The deduper preserves:

DAT metadata (header, description, version)
Game/ROM structure
Non-duplicate entries
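To make the preservation rules concrete, here is a small sketch that edits a Logiqx-style XML DAT in place: the `header` element is left alone and only matching `rom` elements are removed. It assumes SHA1-only matching and drops games left with no ROMs; both choices, and the function name `strip_duplicates`, are assumptions rather than documented Datoso behaviour.

```python
import xml.etree.ElementTree as ET


def strip_duplicates(xml_text, parent_sha1s):
    """Remove <rom> elements whose sha1 is in parent_sha1s (lowercase set)."""
    root = ET.fromstring(xml_text)
    # Iterate over copies of the child lists because we remove while looping.
    for game in list(root.findall("game")):
        for rom in list(game.findall("rom")):
            if rom.get("sha1", "").lower() in parent_sha1s:
                game.remove(rom)
        # Drop games left with no ROMs; the header element is never touched.
        if game.find("rom") is None:
            root.remove(game)
    return ET.tostring(root, encoding="unicode")
```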

Use Cases

Creating Subset DATs

Remove main collection ROMs from a demo/beta collection:

# Remove PS2 main collection ROMs from demos
datoso deduper \
  --input "redump:Sony - PlayStation 2 (Demos)" \
  --parent "redump:Sony - PlayStation 2" \
  --output "/dats/ps2_demos_unique.dat"

Avoiding Duplicate Storage

Before building a ROM set, deduplicate child DATs:

# Preview what would be removed
datoso deduper \
  --input "nointro:Nintendo DS (Demos)" \
  --parent "nointro:Nintendo DS" \
  --dry-run

# If satisfied, run without dry-run
datoso deduper \
  --input "nointro:Nintendo DS (Demos)" \
  --parent "nointro:Nintendo DS"

Auto-Merge Workflow

For DATs with parent relationships configured:

# Set parent relationship first
datoso dat --dat-name "redump:PS2 Demos" --set "parent=redump:PlayStation 2"

# Then use auto-merge
datoso deduper --input "redump:PS2 Demos" --auto-merge

Batch Deduplication

Deduplicate multiple DATs in a script:

#!/bin/bash
# dedupe_all_demos.sh

datoso deduper --input "redump:PS1 Demos" --auto-merge
datoso deduper --input "redump:PS2 Demos" --auto-merge
datoso deduper --input "redump:PSP Demos" --auto-merge
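The same batch can be driven from Python when you want the run to stop at the first failure. This sketch only uses the flags shown on this page; the helper names `build_commands` and `run_all` are hypothetical, and combining `--auto-merge` with `--dry-run` is assumed to be allowed.

```python
import subprocess


def build_commands(dat_names, dry_run=False):
    """Return one `datoso deduper` argument list per DAT name."""
    commands = []
    for name in dat_names:
        command = ["datoso", "deduper", "--input", name, "--auto-merge"]
        if dry_run:
            command.append("--dry-run")
        commands.append(command)
    return commands


def run_all(dat_names, dry_run=False):
    """Run each command in turn, stopping at the first one that fails."""
    for command in build_commands(dat_names, dry_run):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(command, check=True)
```

Passing arguments as a list avoids shell quoting problems with DAT names that contain spaces or parentheses.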

Workflow Examples

Complete Deduplication Workflow

# 1. Import DAT files
datoso import

# 2. Check available DATs
datoso dat --all --only-names

# 3. Preview deduplication
datoso deduper \
  --input "redump:Sony - PlayStation (Demos)" \
  --parent "redump:Sony - PlayStation" \
  --dry-run

# 4. Review output, then run for real
datoso deduper \
  --input "redump:Sony - PlayStation (Demos)" \
  --parent "redump:Sony - PlayStation" \
  --output "/dats/ps1_demos_deduped.dat"

# 5. Verify the result
ls -lh /dats/ps1_demos_deduped.dat

Using with File Paths

# Deduplicate local files that are not in the database
datoso deduper \
  --input "/roms/custom_collection.dat" \
  --parent "/roms/official_set.dat" \
  --output "/roms/custom_unique.dat"

Integration with Seed Processing

# 1. Fetch and process a seed
datoso redump --fetch --process

# 2. Configure parent relationships
datoso dat --dat-name "redump:PS2 Demos" --set "parent=redump:PlayStation 2"

# 3. Enable auto-deduplication in config
datoso config --set PROCESS.ParentMergeEnabled true

# 4. Future processing will auto-deduplicate
# Or manually dedupe existing DATs
datoso deduper --input "redump:PS2 Demos" --auto-merge

Output

Normal Mode

$ datoso deduper --input "redump:PS2 Demos" --parent "redump:PS2"
 File saved to redump:PS2 Demos

The output DAT contains only unique entries not found in the parent.

Dry Run Mode

$ datoso deduper --input "redump:PS2 Demos" --parent "redump:PS2" --dry-run
DEBUG: Comparing hashes...
DEBUG: Found 45 duplicate entries
DEBUG: Found 12 unique entries
DEBUG: Would remove 45 entries from output

Detailed debug information shows what would be removed without making changes.

Error Handling

Parent Required Error

Error:

Parent dat is required when input is a dat file

Solution: Specify a parent DAT or use --auto-merge:

datoso deduper --input "./file.dat" --parent "./parent.dat"
# or
datoso deduper --input "seed:name" --auto-merge

File Not Found

If input or parent DAT cannot be found:

Verify the database reference: datoso dat --all --only-names
Check file paths exist and are readable
Ensure proper permissions on files

Invalid DAT Format

If the DAT file cannot be parsed:

Verify that the file is in a valid ClrMamePro or XML format
Check for corruption
Try importing first: datoso import

Performance Considerations

Large DAT Files

For DAT files with thousands of entries:

Processing may take several minutes
Memory usage scales with DAT size
Use --dry-run first to estimate time
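If you inspect large XML DATs in your own scripts, streaming the file keeps memory use low compared with loading the whole tree. The sketch below collects SHA1 values from a Logiqx-style XML DAT with `iterparse`; it illustrates the general technique and says nothing about how Datoso parses files internally. The function name `collect_sha1s` is hypothetical.

```python
import xml.etree.ElementTree as ET


def collect_sha1s(source):
    """Stream an XML DAT (path or binary file object) and return its SHA1 set."""
    hashes = set()
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == "rom":
            sha1 = element.get("sha1")
            if sha1:
                hashes.add(sha1.lower())
        elif element.tag == "game":
            # Discard the finished game's children to release memory.
            element.clear()
    return hashes
```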

Hash Comparison Speed

Hash comparison is generally fast. Keep in mind:

SHA1 comparison is more reliable than CRC32
Multiple hash types increase comparison accuracy
First match found is used (optimization)

Best Practices

Always Backup

Before deduplicating important DATs:

cp important.dat important.dat.backup
datoso deduper --input important.dat --parent parent.dat

Use Dry Run First

Preview changes before committing:

# Check what will be removed
datoso deduper --input "seed:child" --parent "seed:parent" --dry-run

# If satisfied, run for real
datoso deduper --input "seed:child" --parent "seed:parent"

Configure Parent Relationships

For frequently used deduplication:

# Set parent once
datoso dat --dat-name "seed:child" --set "parent=seed:parent"

# Use auto-merge thereafter
datoso deduper --input "seed:child" --auto-merge

Integrate with Processing

Enable automatic deduplication during seed processing:

datoso config --set PROCESS.ParentMergeEnabled true

Troubleshooting

No Duplicates Found

If deduplication results in no changes:

Verify the parent DAT contains expected entries
Check that hash values exist in both DATs
Ensure correct parent DAT is specified
Use --dry-run with -v for details
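A quick way to carry out the "hash values exist in both DATs" check is to count which hash types each DAT actually carries. The sketch below works on entries already loaded as dictionaries; the helper names and key names are assumptions made for illustration.

```python
from collections import Counter

HASH_KEYS = ("sha1", "md5", "crc")  # assumed key names


def hash_coverage(entries):
    """Count how many entries carry each hash type."""
    counts = Counter()
    for entry in entries:
        for key in HASH_KEYS:
            if entry.get(key):
                counts[key] += 1
    return dict(counts)


def shared_hash_types(child, parent):
    """Return the hash types that appear in both DATs, sorted by name."""
    return sorted(set(hash_coverage(child)) & set(hash_coverage(parent)))
```

An empty result from `shared_hash_types` means the two DATs have no hash type in common, so no entry can match.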

All Entries Removed

If all entries are removed (empty output):

Verify you specified the correct parent
Check that input and parent aren’t reversed
Review with --dry-run first

Auto-Merge Fails

If --auto-merge doesn’t work:

Verify that the input DAT is in the database: datoso dat --dat-name "seed:name"
Check that the parent field is set: datoso dat --dat-name "seed:name" --fields parent
Set the parent if it is missing: datoso dat --dat-name "seed:name" --set "parent=seed:parent"

Next Steps

Learn about DAT commands for managing parent relationships
Use seed commands for processing with auto-deduplication
Configure auto-merge with config commands
Import DAT files with import commands
