meikipop uses a pre-processed dictionary format optimized for fast lookups. The dictionary is built from JMdict and KANJIDIC2 source data and stored as a pickled Python object, so it loads without re-parsing the source data at startup.
Most users don’t need to build the dictionary themselves: prebuilt dictionaries are included in all release packages. This guide is for users who want to customize the dictionary or build from source.

Prerequisites

Before building the dictionary, ensure you have:
  • Python 3.10 or later
  • lxml library for XML processing
  • Internet connection (to download source data)
pip install lxml

Build process overview

The dictionary build process consists of four stages:

1. Download JMdict

Downloads the latest JMdict dictionary from the Electronic Dictionary Research and Development Group (EDRDG).
import requests
download_url = 'http://ftp.edrdg.org/pub/Nihongo/JMdict.gz'
with open('JMdict', 'wb') as f:  # the with-block closes the file handle
    f.write(requests.get(download_url).content)
JMdict is the comprehensive Japanese-English dictionary that forms the foundation of meikipop’s lookup system.
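The file served at that URL is gzip-compressed, so it has to be decompressed before the XML can be parsed. The following is a minimal sketch of that step using only the standard library; the function name `decompress_gzip` and the file names in the usage note are illustrative assumptions, not taken from the meikipop scripts.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gzip(src: str, dst: str) -> int:
    """Decompress the gzip file at `src` into `dst` and return the bytes written.

    Hypothetical helper for illustration; the real build scripts may handle
    decompression differently.
    """
    with gzip.open(src, "rb") as compressed, open(dst, "wb") as plain:
        # Stream in chunks so the whole file is never held in memory at once.
        shutil.copyfileobj(compressed, plain)
    return Path(dst).stat().st_size
```

Under these assumptions, `decompress_gzip("JMdict.gz", "JMdict")` would turn a manually downloaded archive into the XML file that the processing step reads.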
2. Process JMdict to JSON

Converts the raw XML JMdict data into optimized JSON files using scripts/process.py.
python scripts/process.py
This script:
  • Parses the XML structure
  • Extracts kanji forms (kebs), readings (rebs), and sense definitions
  • Filters to English-only glosses
  • Splits the output into multiple JSON files (~18,000 entries each) for efficient processing
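The steps above can be sketched on a tiny hand-written sample. This is not the code in scripts/process.py: it swaps lxml for the standard library's `xml.etree.ElementTree` so it runs without extra dependencies, and the sample entry (including its sequence number and the plain-text `pos` value), the function names, and the output field names are assumptions chosen to mirror the entry structure described later in this guide.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up sample in the JMdict element layout (not a real entry).
SAMPLE = """<JMdict>
<entry>
<ent_seq>1000001</ent_seq>
<k_ele><keb>猫</keb></k_ele>
<r_ele><reb>ねこ</reb></r_ele>
<sense>
<pos>noun</pos>
<gloss>cat</gloss>
<gloss xml:lang="ger">Katze</gloss>
</sense>
</entry>
</JMdict>"""


def parse_entries(xml_text):
    """Extract kebs, rebs, and English-only senses from JMdict-style XML."""
    entries = []
    for entry in ET.fromstring(xml_text).iter("entry"):
        senses = []
        for sense in entry.iter("sense"):
            # A gloss without xml:lang is treated as English (the JMdict default).
            glosses = [g.text for g in sense.iter("gloss")
                       if g.get(XML_LANG, "eng") == "eng"]
            if glosses:
                senses.append({"glosses": glosses,
                               "pos": [p.text for p in sense.iter("pos")]})
        entries.append({
            "id": int(entry.findtext("ent_seq")),
            "kebs": [k.text for k in entry.iter("keb")],
            "rebs": [r.text for r in entry.iter("reb")],
            "senses": senses,
        })
    return entries


def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


entries = parse_entries(SAMPLE)
```

Each chunk returned by `chunk(entries, 18000)` could then be written to its own JSON file with `json.dump`, which is one way to arrive at the multi-file output described above.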
3. Process KANJIDIC2

Downloads and processes kanji information including meanings, readings, and components.
python scripts/process_kanji.py
This creates kanjidic2.json with:
  • Character meanings from KANJIDIC2
  • Frequency-ranked readings (on’yomi and kun’yomi)
  • Component breakdowns from CHISE IDS database
  • Example words for each reading
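As a rough illustration of the first part of this step, the sketch below pulls English meanings and on’yomi/kun’yomi readings out of a KANJIDIC2-style fragment. It uses the standard library's `xml.etree.ElementTree` rather than lxml, the sample data is hand-written, and the function name and output keys are assumptions; frequency ranking, CHISE IDS components, and example words are not covered here.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the KANJIDIC2 element layout.
SAMPLE = """<kanjidic2>
<character>
<literal>猫</literal>
<reading_meaning>
<rmgroup>
<reading r_type="pinyin">mao1</reading>
<reading r_type="ja_on">ビョウ</reading>
<reading r_type="ja_kun">ねこ</reading>
<meaning>cat</meaning>
<meaning m_lang="fr">chat</meaning>
</rmgroup>
</reading_meaning>
</character>
</kanjidic2>"""


def parse_kanji(xml_text):
    """Map each character to its English meanings and Japanese readings."""
    info = {}
    for char in ET.fromstring(xml_text).iter("character"):
        readings = {"ja_on": [], "ja_kun": []}
        for reading in char.iter("reading"):
            r_type = reading.get("r_type")
            if r_type in readings:  # skip non-Japanese readings such as pinyin
                readings[r_type].append(reading.text)
        info[char.findtext("literal")] = {
            # In KANJIDIC2, a <meaning> without m_lang is English.
            "meanings": [m.text for m in char.iter("meaning")
                         if m.get("m_lang") is None],
            "on": readings["ja_on"],
            "kun": readings["ja_kun"],
        }
    return info


kanji_info = parse_kanji(SAMPLE)
```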
4. Build optimized dictionary

Combines all data sources into a single optimized pickle file for instant loading. The dictionary includes:
  • JMdict entries with kanji, readings, and definitions
  • Deconjugation rules for verb and adjective lookups
  • Priority rankings for common words
  • Kanji information with components and examples

Running the build

To build the dictionary, run the build script from the repository root:
python -m scripts.build_dictionary
The complete build process takes several minutes and shows progress messages:
downloading jmdict...
processing jmdict -> json...
processing kanjidic2...
Starting dictionary build process...
Loading dictionary data from JSON files...
All data imported and processed in 45.23 seconds.
Total entries processed: 183542
Saving processed dictionary to: jmdict_enhanced.pkl
Dictionary saved in 12.34 seconds.
Build complete.

Understanding the output

The build process creates jmdict_enhanced.pkl in your repository root. The pickled dictionary contains the following components:
{
    'entries': [],              # List of all JMdict entries
    'lookup_kan': {},           # Kanji form -> entry index mapping
    'lookup_kana': {},          # Kana reading -> entry index mapping
    'kanji_entries': {},        # Character -> kanji info mapping
    'deconjugator_rules': [],   # Verb/adjective deconjugation patterns
    'priority_map': {}          # (kanji, reading) -> priority score
}
Each entry in the entries list has this structure:
{
    'id': 1234567,              # JMdict sequence number
    'kebs': ['日本語'],           # Kanji forms
    'rebs': ['にほんご', 'にっぽんご'], # Readings
    'senses': [                 # Definitions
        {
            'glosses': ['Japanese language'],
            'pos': ['noun']
        }
    ]
}
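To make the layout concrete, the sketch below builds a one-entry structure with the same keys, round-trips it through pickle, and resolves a reading to its entry. Two things are assumptions here: that the index values are lists of positions in `entries`, and that the pickle can be treated as a plain dict as depicted above. The `lookup` helper is hypothetical and is not meikipop's lookup code.

```python
import pickle

# Tiny stand-in mirroring the documented layout (not real build output).
dictionary = {
    "entries": [
        {
            "id": 1234567,
            "kebs": ["日本語"],
            "rebs": ["にほんご", "にっぽんご"],
            "senses": [{"glosses": ["Japanese language"], "pos": ["noun"]}],
        },
    ],
    # Assumption: each index maps a string to a list of entry positions.
    "lookup_kan": {"日本語": [0]},
    "lookup_kana": {"にほんご": [0], "にっぽんご": [0]},
}


def lookup(data, text):
    """Return entries whose kanji form or reading equals `text`."""
    indices = data["lookup_kan"].get(text, []) + data["lookup_kana"].get(text, [])
    # dict.fromkeys removes duplicate positions while keeping their order.
    return [data["entries"][i] for i in dict.fromkeys(indices)]


# Round-trip through pickle, as the build does when saving and loading.
restored = pickle.loads(pickle.dumps(dictionary))
glosses = [gloss
           for entry in lookup(restored, "にほんご")
           for sense in entry["senses"]
           for gloss in sense["glosses"]]
```

Because the indexes store positions rather than copies of entries, an entry reachable through several readings is stored only once.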
The build script expects these files in the data/ directory:
  • JMdict*.json - Processed dictionary entries (auto-generated)
  • deconjugator.json - Deconjugation rules (included in repo)
  • priority.json - Word frequency data (included in repo)
  • kanjidic2.json - Kanji information (auto-generated)
If any required files are missing, the build will fail with an error message indicating which files need to be present.
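If you want to check the inputs before starting a long build, a small pre-flight script can report what is missing. This is a sketch based only on the file list above; the function name `missing_build_inputs` is an assumption and the check is not part of the meikipop build script.

```python
from pathlib import Path

# Fixed file names taken from the list above.
REQUIRED_FILES = ("deconjugator.json", "priority.json", "kanjidic2.json")


def missing_build_inputs(data_dir="data"):
    """Return the names (or patterns) of build inputs not found in data_dir."""
    root = Path(data_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # At least one processed JMdict chunk must be present.
    if not any(root.glob("JMdict*.json")):
        missing.append("JMdict*.json")
    return missing
```

Calling `missing_build_inputs("data")` from the repository root returns an empty list when everything the build expects is in place.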
The dictionary is optimized for lookup performance through:
  1. Hash-based indexing: Average O(1) lookup using kanji and kana forms as dictionary keys
  2. Pickle serialization: Binary format loads 10-100x faster than JSON
  3. Priority scoring: Common words appear first in results
  4. defaultdict usage: Index lists are created on first insert, with no key-existence checks while building
  5. Pre-processed deconjugation: No runtime rule compilation needed
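The first, third, and fourth points can be illustrated together with a short sketch. It assumes that a higher priority score means a more common word and that pairs absent from the priority map score 0; neither is confirmed by this guide. The sample entries, the scores, and the helper names are made up for illustration.

```python
from collections import defaultdict

# Three homophones; the scores below are invented for this example.
entries = [
    {"kebs": ["端"], "rebs": ["はし"]},
    {"kebs": ["箸"], "rebs": ["はし"]},
    {"kebs": ["橋"], "rebs": ["はし"]},
]
priority_map = {("橋", "はし"): 30, ("箸", "はし"): 20}


def build_kana_index(items):
    """Build a reading -> [entry positions] index in a single pass."""
    index = defaultdict(list)
    for position, entry in enumerate(items):
        for reb in entry["rebs"]:
            index[reb].append(position)  # list is created on first use
    return index


def entry_priority(entry, reading):
    """Best score over the entry's kanji forms; 0 when none is listed."""
    return max((priority_map.get((keb, reading), 0) for keb in entry["kebs"]),
               default=0)


def ranked_lookup(index, reading):
    """Return matching entries, highest assumed priority first."""
    hits = index.get(reading, [])  # .get avoids inserting keys on a miss
    return sorted((entries[i] for i in hits),
                  key=lambda e: entry_priority(e, reading),
                  reverse=True)


kana_index = build_kana_index(entries)
ranked = [e["kebs"][0] for e in ranked_lookup(kana_index, "はし")]
```

Using `index.get` at query time rather than `index[reading]` is a deliberate choice: indexing a defaultdict with an unknown key would silently add an empty list for every failed lookup.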

Build script source code

The main build logic from scripts/build_dictionary.py:
import os

from src.dictionary.customdict import Dictionary

print("Starting dictionary build process...")
data_dir = 'data'
output_path = 'jmdict_enhanced.pkl'

jmdict_files = [os.path.join(data_dir, f) for f in os.listdir(data_dir) 
                if f.startswith('JMdict') and f.endswith('.json')]
deconjugator_path = os.path.join(data_dir, 'deconjugator.json')
priority_path = os.path.join(data_dir, 'priority.json')
kanjidic_path = os.path.join(data_dir, 'kanjidic2.json')

print("Loading dictionary data from JSON files...")
dictionary = Dictionary()

# Load and process all data
dictionary.import_jmdict_json(jmdict_files)
dictionary.import_deconjugator(deconjugator_path)
dictionary.import_priority(priority_path)
dictionary.import_kanjidic_json(kanjidic_path)

print(f"Total entries processed: {len(dictionary.entries)}")

print(f"Saving processed dictionary to: {output_path}")
dictionary.save_dictionary(output_path)
print("Build complete.")
See the full implementation at scripts/build_dictionary.py:30-64 in the source repository.

Using a custom dictionary

After building, use your custom dictionary as follows:
  1. Replace the default dictionary: Copy jmdict_enhanced.pkl to your meikipop installation directory
  2. Restart meikipop: The new dictionary will be loaded automatically
  3. Verify loading: Check the console output for “Dictionary loaded in X.XX seconds”
The dictionary loads in 1-3 seconds on most systems. If loading takes significantly longer, ensure you’re using the pickled .pkl format rather than JSON.
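Copying the new file over the default dictionary overwrites the bundled one, so it can be worth keeping a backup in case the custom build misbehaves. The sketch below is one way to do that with the standard library; the helper name `install_dictionary` and the `.bak` suffix are assumptions, and the installation directory is whatever applies on your system.

```python
import shutil
from pathlib import Path


def install_dictionary(built_file, install_dir, name="jmdict_enhanced.pkl"):
    """Copy a freshly built dictionary into install_dir, backing up any old one."""
    target = Path(install_dir) / name
    if target.exists():
        # Keep the previous dictionary next to the new one, e.g. *.pkl.bak.
        shutil.copy2(target, target.with_name(target.name + ".bak"))
    shutil.copy2(built_file, target)
    return target
```

To roll back, copy the `.bak` file over jmdict_enhanced.pkl and restart meikipop.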

Troubleshooting

If the build fails because the lxml module cannot be found, install the lxml library:
pip install lxml
lxml is required for parsing the JMdict XML file but is not included in the standard requirements.
If downloads fail:
  1. Check your internet connection
  2. Verify you can access http://ftp.edrdg.org
  3. Try downloading the files manually and placing them in the repository root
  4. Run the individual processing scripts instead of the full build
If you see “Missing required dictionary files in ‘data’ folder”:
  1. Ensure the data/ directory exists in your repository
  2. Check that deconjugator.json and priority.json are present
  3. Run the processing scripts to generate JMdict*.json and kanjidic2.json

Advanced customization

You can customize the dictionary build by modifying the processing scripts:
  • scripts/process.py: Adjust JMdict filtering, add support for other languages, or change JSON chunking size
  • scripts/process_kanji.py: Modify kanji reading prioritization, change example word selection, or adjust frequency calculations
  • src/dictionary/customdict.py: Add custom preprocessing, implement additional lookup indexes, or extend the data model
The dictionary structure is highly optimized for the JMdict format. If you want to use a different dictionary source, you’ll need to significantly modify the lookup logic in src/dictionary/lookup.py.
