Building the dictionary

meikipop uses a pre-processed dictionary format optimized for fast lookups. The dictionary is built from JMdict and KANJIDIC2 source data and stored as a pickled Python object so that it loads quickly at startup.

Most users don’t need to build the dictionary themselves: prebuilt dictionaries are included in all release packages. This guide is for users who want to customize the dictionary or build from source.

Prerequisites

Before building the dictionary, ensure you have:

Python 3.10 or later
lxml library for XML processing
Internet connection (to download source data)

pip install lxml

Build process overview

The dictionary build process consists of four stages:

Download JMdict

Downloads the latest JMdict dictionary from the Electronic Dictionary Research and Development Group (EDRDG).

import requests

download_url = 'http://ftp.edrdg.org/pub/Nihongo/JMdict.gz'
with open('JMdict', 'wb') as f:  # the with-block closes the file once the write finishes
    f.write(requests.get(download_url).content)

JMdict is the comprehensive Japanese-English dictionary that forms the foundation of meikipop’s lookup system.
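
The archive at that URL is gzip-compressed, so a parser that expects plain XML needs the payload decompressed first. The following is a minimal sketch of that step, assuming the bytes have already been fetched. `save_decompressed` is a hypothetical helper for illustration, not a function from the meikipop scripts.

```python
import gzip
from pathlib import Path


def save_decompressed(payload: bytes, dest: str) -> int:
    """Decompress a gzip payload and write the raw bytes to dest.

    Hypothetical helper for illustration; returns the number of bytes written.
    """
    xml_bytes = gzip.decompress(payload)
    Path(dest).write_bytes(xml_bytes)
    return len(xml_bytes)
```

It could be combined with the download above as `save_decompressed(requests.get(download_url).content, 'JMdict')`.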

Process JMdict to JSON

Converts the raw XML JMdict data into optimized JSON files using scripts/process.py.

python scripts/process.py

This script:

Parses the XML structure
Extracts kanji forms (kebs), readings (rebs), and sense definitions
Filters to English-only glosses
Splits the output into multiple JSON files (~18,000 entries each) for efficient processing
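
The steps above can be sketched roughly as follows. This is not the code in scripts/process.py: it uses the standard library's `xml.etree.ElementTree` instead of lxml to stay dependency-free, the function names are hypothetical, and the `JMdict_<n>.json` file naming is only an assumption chosen to match the `JMdict*.json` pattern the build script looks for.

```python
import json
import xml.etree.ElementTree as ET
from pathlib import Path

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'


def parse_entries(xml_text: str) -> list[dict]:
    """Extract kebs, rebs, and English glosses from JMdict-style XML."""
    entries = []
    for entry in ET.fromstring(xml_text).iter('entry'):
        senses = []
        for sense in entry.iter('sense'):
            # Assumption: a gloss without xml:lang is English ('eng').
            glosses = [g.text for g in sense.iter('gloss')
                       if g.get(XML_LANG, 'eng') == 'eng']
            if glosses:
                senses.append({'glosses': glosses,
                               'pos': [p.text for p in sense.iter('pos')]})
        entries.append({
            'id': int(entry.findtext('ent_seq')),
            'kebs': [k.text for k in entry.iter('keb')],
            'rebs': [r.text for r in entry.iter('reb')],
            'senses': senses,
        })
    return entries


def write_chunks(entries: list[dict], out_dir: str, size: int = 18000) -> list[Path]:
    """Write entries to JMdict_<n>.json files with at most `size` entries each."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, start in enumerate(range(0, len(entries), size)):
        path = out / f'JMdict_{n}.json'
        chunk = entries[start:start + size]
        path.write_text(json.dumps(chunk, ensure_ascii=False), encoding='utf-8')
        paths.append(path)
    return paths
```

Splitting the output keeps each JSON file small enough to load and inspect on its own. The chunk size is a parameter here so it can be tuned.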

Process KANJIDIC2

Downloads and processes kanji information including meanings, readings, and components.

python scripts/process_kanji.py

This creates kanjidic2.json with:

Character meanings from KANJIDIC2
Frequency-ranked readings (on’yomi and kun’yomi)
Component breakdowns from the CHISE IDS database
Example words for each reading

Build optimized dictionary

Combines all data sources into a single optimized pickle file for fast loading. The dictionary includes:

JMdict entries with kanji, readings, and definitions
Deconjugation rules for verb and adjective lookups
Priority rankings for common words
Kanji information with components and examples

Running the build

To build the dictionary, run the build script from the repository root:

python -m scripts.build_dictionary

The complete build process takes several minutes and prints progress messages similar to the following (timings and entry counts will vary):

downloading jmdict...
processing jmdict -> json...
processing kanjidic2...
Starting dictionary build process...
Loading dictionary data from JSON files...
All data imported and processed in 45.23 seconds.
Total entries processed: 183542
Saving processed dictionary to: jmdict_enhanced.pkl
Dictionary saved in 12.34 seconds.
Build complete.

Understanding the output

The build process creates jmdict_enhanced.pkl in your repository root.

Dictionary data structure

The pickled dictionary contains the following components:

{
    'entries': [],              # List of all JMdict entries
    'lookup_kan': {},           # Kanji form -> entry index mapping
    'lookup_kana': {},          # Kana reading -> entry index mapping
    'kanji_entries': {},        # Character -> kanji info mapping
    'deconjugator_rules': [],   # Verb/adjective deconjugation patterns
    'priority_map': {}          # (kanji, reading) -> priority score
}

Each entry in the entries list has this structure:

{
    'id': 1234567,              # JMdict sequence number
    'kebs': ['日本語'],           # Kanji forms
    'rebs': ['にほんご', 'にっぽんご'], # Readings
    'senses': [                 # Definitions
        {
            'glosses': ['Japanese language'],
            'pos': ['noun']
        }
    ]
}
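
To show how the lookup mappings relate to this entry structure, here is a minimal sketch that builds them from a list of entries. It assumes each key maps to a list of entry indexes, since several entries can share a reading. The actual value type in meikipop may differ, and `build_indexes` and the sample ids are hypothetical.

```python
from collections import defaultdict


def build_indexes(entries: list[dict]) -> tuple[dict, dict]:
    """Map each kanji form and each reading to the indexes of matching entries."""
    lookup_kan = defaultdict(list)
    lookup_kana = defaultdict(list)
    for index, entry in enumerate(entries):
        for keb in entry['kebs']:
            lookup_kan[keb].append(index)
        for reb in entry['rebs']:
            lookup_kana[reb].append(index)
    return lookup_kan, lookup_kana


# Illustrative entries only; the ids are made up.
entries = [
    {'id': 1, 'kebs': ['日本語'], 'rebs': ['にほんご', 'にっぽんご'], 'senses': []},
    {'id': 2, 'kebs': [], 'rebs': ['にほんご'], 'senses': []},
]
lookup_kan, lookup_kana = build_indexes(entries)
print(lookup_kana['にほんご'])  # [0, 1]
```

One caveat with `defaultdict`: indexing a missing key with `[]` inserts an empty list for it, so read-only lookups are better done with `.get(key)`.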

Required data files

The build script expects these files in the data/ directory:

JMdict*.json - Processed dictionary entries (auto-generated)
deconjugator.json - Deconjugation rules (included in repo)
priority.json - Word frequency data (included in repo)
kanjidic2.json - Kanji information (auto-generated)

If any required files are missing, the build will fail with an error message indicating which files need to be present.
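
A check along these lines could be run before a build to see which inputs are missing. It is a sketch based only on the file list above, not the validation code from the build script, and `find_missing` is a hypothetical name.

```python
import glob
import os

# Fixed file names from the list above; JMdict chunks are matched by pattern.
REQUIRED = ['deconjugator.json', 'priority.json', 'kanjidic2.json']


def find_missing(data_dir: str) -> list[str]:
    """Return the required inputs that are absent from data_dir."""
    missing = [name for name in REQUIRED
               if not os.path.isfile(os.path.join(data_dir, name))]
    if not glob.glob(os.path.join(data_dir, 'JMdict*.json')):
        missing.append('JMdict*.json')
    return missing
```

Calling `find_missing('data')` from the repository root returns an empty list when every input is in place.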

Optimization techniques

The dictionary is optimized for lookup performance through:

Hash-based indexing: Average-case O(1) lookup using kanji and kana as dictionary keys
Pickle serialization: Binary format that typically loads much faster than parsing the equivalent JSON (the exact speedup depends on the data and the machine)
Priority scoring: Common words appear first in results
Defaultdict usage: Efficient handling of missing entries
Pre-processed deconjugation: No runtime rule compilation needed
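
As an illustration of priority scoring, the sketch below orders candidate (kanji, reading) pairs using a map shaped like `priority_map`. It assumes that a higher score means a more common word and that unknown pairs score 0. Neither assumption is confirmed by the source, the scores are invented, and `rank_candidates` is a hypothetical function.

```python
def rank_candidates(candidates, priority_map):
    """Order (kanji, reading) pairs so higher-priority pairs come first."""
    # sorted() is stable, so pairs with equal scores keep their original order.
    return sorted(candidates,
                  key=lambda pair: priority_map.get(pair, 0),
                  reverse=True)


# Invented scores for illustration only.
priority_map = {('日本語', 'にほんご'): 50, ('日本語', 'にっぽんご'): 10}
candidates = [('日本語', 'にっぽんご'), ('二本', 'にほん'), ('日本語', 'にほんご')]
print(rank_candidates(candidates, priority_map)[0])  # ('日本語', 'にほんご')
```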

Build script source code

The main build logic from scripts/build_dictionary.py:

scripts/build_dictionary.py

import os

from src.dictionary.customdict import Dictionary

print("Starting dictionary build process...")
data_dir = 'data'
output_path = 'jmdict_enhanced.pkl'

# Collect every processed JMdict chunk (JMdict*.json) from the data directory
jmdict_files = [os.path.join(data_dir, f) for f in os.listdir(data_dir)
                if f.startswith('JMdict') and f.endswith('.json')]
deconjugator_path = os.path.join(data_dir, 'deconjugator.json')
priority_path = os.path.join(data_dir, 'priority.json')
kanjidic_path = os.path.join(data_dir, 'kanjidic2.json')

print("Loading dictionary data from JSON files...")
dictionary = Dictionary()

# Load and process all data
dictionary.import_jmdict_json(jmdict_files)
dictionary.import_deconjugator(deconjugator_path)
dictionary.import_priority(priority_path)
dictionary.import_kanjidic_json(kanjidic_path)

print(f"Total entries processed: {len(dictionary.entries)}")

print(f"Saving processed dictionary to: {output_path}")
dictionary.save_dictionary(output_path)
print("Build complete.")

See the full implementation at scripts/build_dictionary.py:30-64 in the source repository.

Using a custom dictionary

After building, switch to your custom dictionary as follows:

Replace the default dictionary: Copy jmdict_enhanced.pkl to your meikipop installation directory
Restart meikipop: The new dictionary will be loaded automatically
Verify loading: Check the console output for “Dictionary loaded in X.XX seconds”

The dictionary loads in 1-3 seconds on most systems. If loading takes significantly longer, ensure you’re using the pickled .pkl format rather than JSON.
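
To measure load time yourself, a pickle round trip can be timed as in the sketch below. It works on any picklable object and both function names are hypothetical. Loading the real jmdict_enhanced.pkl this way probably requires meikipop's own classes to be importable, which is an assumption based on the build script saving a `Dictionary` object.

```python
import pickle
import time


def save_pickle(obj, path: str) -> None:
    """Serialize obj with the newest pickle protocol available."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def timed_load(path: str):
    """Load a pickle and return (object, seconds elapsed)."""
    start = time.perf_counter()
    with open(path, 'rb') as f:
        obj = pickle.load(f)  # only unpickle files you trust
    return obj, time.perf_counter() - start
```

Unpickling can execute arbitrary code, so only load dictionary files that you built yourself or obtained from a source you trust.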

Troubleshooting

Build fails with 'lxml not found'

Install the lxml library:

pip install lxml

lxml is required for parsing the JMdict XML file but is not included in the standard requirements.

Download errors

If downloads fail:

Check your internet connection
Verify you can access http://ftp.edrdg.org
Try downloading the files manually and placing them in the repository root
Run the individual processing scripts instead of the full build

Missing data files error

If you see “Missing required dictionary files in ‘data’ folder”:

Ensure the data/ directory exists in your repository
Check that deconjugator.json and priority.json are present
Run the processing scripts to generate JMdict*.json and kanjidic2.json

Advanced customization

You can customize the dictionary build by modifying the processing scripts:

scripts/process.py: Adjust JMdict filtering, add support for other languages, or change JSON chunking size
scripts/process_kanji.py: Modify kanji reading prioritization, change example word selection, or adjust frequency calculations
src/dictionary/customdict.py: Add custom preprocessing, implement additional lookup indexes, or extend the data model

The dictionary structure is highly optimized for the JMdict format. If you want to use a different dictionary source, you’ll need to significantly modify the lookup logic in src/dictionary/lookup.py.
