Most users don’t need to build the dictionary themselves: prebuilt dictionaries are included in all release packages. This guide is for users who want to customize the dictionary or build from source.
Prerequisites
Before building the dictionary, ensure you have:
- Python 3.10 or later
- lxml library for XML processing
- Internet connection (to download source data)
Build process overview
The dictionary build process consists of several stages:
Download JMdict
Downloads the latest JMdict dictionary from the Electronic Dictionary Research and Development Group (EDRDG).
JMdict is the comprehensive Japanese-English dictionary that forms the foundation of meikipop’s lookup system.
Process JMdict to JSON
Converts the raw XML JMdict data into optimized JSON files using scripts/process.py. This script:
- Parses the XML structure
- Extracts kanji forms (kebs), readings (rebs), and sense definitions
- Filters to English-only glosses
- Splits the output into multiple JSON files (~18,000 entries each) for efficient processing
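The parsing steps above can be sketched roughly as follows. This is not the code in scripts/process.py: it uses the standard-library xml.etree.ElementTree in place of lxml so that it runs without extra installs, and the function names, the output keys (kebs, rebs, senses), and the small inline sample are assumptions made for illustration. It also assumes that a gloss without an xml:lang attribute is English, which is the default declared in the JMdict DTD.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny hand-written sample in the JMdict element layout (not real build input).
SAMPLE = """<JMdict>
<entry>
<k_ele><keb>猫</keb></k_ele>
<r_ele><reb>ねこ</reb></r_ele>
<sense><gloss>cat</gloss><gloss xml:lang="ger">Katze</gloss></sense>
</entry>
<entry>
<r_ele><reb>ここ</reb></r_ele>
<sense><gloss>here</gloss><gloss xml:lang="dut">hier</gloss></sense>
</entry>
</JMdict>"""


def parse_entries(xml_text):
    """Extract kanji forms, readings, and English glosses from JMdict-style XML."""
    entries = []
    for entry in ET.fromstring(xml_text).iter("entry"):
        senses = []
        for sense in entry.iter("sense"):
            # A gloss with no xml:lang attribute is treated as English.
            glosses = [g.text for g in sense.iter("gloss")
                       if g.get(XML_LANG, "eng") == "eng"]
            if glosses:
                senses.append(glosses)
        entries.append({
            "kebs": [k.text for k in entry.iter("keb")],
            "rebs": [r.text for r in entry.iter("reb")],
            "senses": senses,
        })
    return entries


def chunk_entries(entries, size=18000):
    """Split the entry list into consecutive chunks of at most `size` items."""
    return [entries[i:i + size] for i in range(0, len(entries), size)]


entries = parse_entries(SAMPLE)
chunks = chunk_entries(entries, size=1)  # size=1 only to show chunking on two entries
```

Each chunk could then be written to its own JSON file; the default size of 18000 mirrors the approximate chunk size mentioned above.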
Process KANJIDIC2
Downloads and processes kanji information including meanings, readings, and components.
This creates kanjidic2.json with:
- Character meanings from KANJIDIC2
- Frequency-ranked readings (on’yomi and kun’yomi)
- Component breakdowns from CHISE IDS database
- Example words for each reading
Build optimized dictionary
Combines all data sources into a single optimized pickle file for instant loading.
The dictionary includes:
- JMdict entries with kanji, readings, and definitions
- Deconjugation rules for verb and adjective lookups
- Priority rankings for common words
- Kanji information with components and examples
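As a rough illustration of this stage, the sketch below merges the JSON inputs into one pickle. The file names come from this guide; the function name build_pickle, the top-level keys of the bundle, and the assumption that every JMdict*.json chunk holds a JSON list are hypothetical, so the real scripts/build_dictionary.py may organize things differently.

```python
import glob
import json
import os
import pickle


def build_pickle(data_dir, out_path):
    """Merge the JSON inputs found in data_dir into a single pickle at out_path."""

    def load(name):
        with open(os.path.join(data_dir, name), encoding="utf-8") as f:
            return json.load(f)

    entries = []
    # sorted() keeps the chunk order stable from run to run.
    for path in sorted(glob.glob(os.path.join(data_dir, "JMdict*.json"))):
        with open(path, encoding="utf-8") as f:
            entries.extend(json.load(f))  # assumes each chunk is a JSON list

    bundle = {  # key names are assumptions, not the real layout
        "entries": entries,
        "deconjugator": load("deconjugator.json"),
        "priority": load("priority.json"),
        "kanji": load("kanjidic2.json"),
    }
    with open(out_path, "wb") as f:
        pickle.dump(bundle, f, protocol=pickle.HIGHEST_PROTOCOL)
    return bundle


# Example call, assuming the layout described in this guide:
# build_pickle("data", "jmdict_enhanced.pkl")
```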
Running the build
To build the dictionary, run the build script from the repository root.
Understanding the output
The build process creates jmdict_enhanced.pkl in your repository root. This file contains:
Dictionary data structure
The pickled dictionary contains several components, including an entries list whose items share a common structure.
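One way to see what a built pickle actually holds is to load it and summarize its top level. The helper names below are hypothetical, and the sketch assumes the top-level object is either a dict or an object with instance attributes. If the pickle stores instances of meikipop's own classes, the meikipop source must be importable when loading.

```python
import pickle


def summarize(obj):
    """Map each top-level key to (type name, length or None)."""
    fields = obj if isinstance(obj, dict) else vars(obj)
    summary = {}
    for key, value in fields.items():
        size = len(value) if hasattr(value, "__len__") else None
        summary[key] = (type(value).__name__, size)
    return summary


def inspect_pickle(path):
    # Only unpickle files you built yourself: loading a pickle can execute code.
    with open(path, "rb") as f:
        return summarize(pickle.load(f))


# Example call, assuming the file is in the current directory:
# print(inspect_pickle("jmdict_enhanced.pkl"))
```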
Required data files
The build script expects these files in the data/ directory:
- JMdict*.json - Processed dictionary entries (auto-generated)
- deconjugator.json - Deconjugation rules (included in repo)
- priority.json - Word frequency data (included in repo)
- kanjidic2.json - Kanji information (auto-generated)
Optimization techniques
The dictionary is optimized for lookup performance through:
- Hash-based indexing: Direct O(1) lookup using kanji and kana as keys
- Pickle serialization: Binary format loads 10-100x faster than JSON
- Priority scoring: Common words appear first in results
- Defaultdict usage: Efficient handling of missing entries
- Pre-processed deconjugation: No runtime rule compilation needed
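The hash-based indexing and priority ordering described above can be illustrated with a small sketch. The entry fields (id, kebs, rebs, gloss), the shape of the priority table, and the convention that a higher score means a more common word are all assumptions for this example; the real data model may differ.

```python
from collections import defaultdict

# Hypothetical entries and scores, for illustration only.
ENTRIES = [
    {"id": 1, "kebs": ["橋"], "rebs": ["はし"], "gloss": "bridge"},
    {"id": 2, "kebs": ["箸"], "rebs": ["はし"], "gloss": "chopsticks"},
]
PRIORITY = {1: 10, 2: 50}  # higher score = more common (assumed convention)


def build_index(entries, priority):
    """Index entries under every kanji form and reading, most common first."""
    index = defaultdict(list)
    for entry in entries:
        for key in entry["kebs"] + entry["rebs"]:
            index[key].append(entry)
    # Sort once at build time so lookups need no further work.
    for matches in index.values():
        matches.sort(key=lambda e: priority.get(e["id"], 0), reverse=True)
    return index


index = build_index(ENTRIES, PRIORITY)
print([e["gloss"] for e in index["はし"]])  # ['chopsticks', 'bridge']
print(index.get("ない", []))  # []
```

Using index.get(word, []) for lookups avoids a side effect of defaultdict: reading index[word] for a missing key would insert an empty list and slowly grow the index.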
Build script source code
The main build logic is at scripts/build_dictionary.py:30-64 in the source repository.
Using a custom dictionary
After building, use your custom dictionary as follows:
- Replace the default dictionary: Copy jmdict_enhanced.pkl to your meikipop installation directory
- Restart meikipop: The new dictionary will be loaded automatically
- Verify loading: Check the console output for “Dictionary loaded in X.XX seconds”
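If you replace the dictionary often, the copy step can be scripted. The sketch below keeps a backup of the existing file before overwriting it. The function name and the .bak suffix are hypothetical, and the installation directory is left as a parameter because its location depends on how you installed meikipop.

```python
import shutil
from pathlib import Path


def install_dictionary(built_file, install_dir):
    """Copy built_file into install_dir, backing up any existing copy first."""
    built_file = Path(built_file)
    target = Path(install_dir) / built_file.name
    if target.exists():
        # Keep the previous dictionary so it can be restored if needed.
        shutil.copy2(target, target.with_name(target.name + ".bak"))
    shutil.copy2(built_file, target)
    return target


# Example call; the second argument is a placeholder for your install path:
# install_dictionary("jmdict_enhanced.pkl", "/path/to/meikipop")
```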
Troubleshooting
Build fails with 'lxml not found'
Install the lxml library (for example, with pip install lxml).
lxml is required for parsing the JMdict XML file but is not included in the standard requirements.
Download errors
If downloads fail:
- Check your internet connection
- Verify you can access http://ftp.edrdg.org
- Try downloading the files manually and placing them in the repository root
- Run the individual processing scripts instead of the full build
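For a manual download, a small helper along these lines may be enough. EDRDG distributes its dictionary files gzip-compressed, so the sketch fetches a URL and decompresses it in one pass. The function name is hypothetical, and the exact file URL and the local file name the processing scripts expect are not stated in this guide, so both are left as parameters. Check the scripts for whether they want the compressed or the decompressed file.

```python
import gzip
import shutil
import urllib.request


def fetch_and_gunzip(url, dest):
    """Download a gzip-compressed file from url and write it decompressed to dest."""
    with urllib.request.urlopen(url, timeout=60) as response:
        # GzipFile reads the response as a stream, so nothing is buffered whole.
        with gzip.GzipFile(fileobj=response) as unzipped, open(dest, "wb") as out:
            shutil.copyfileobj(unzipped, out)
    return dest
```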
Missing data files error
If you see “Missing required dictionary files in ‘data’ folder”:
- Ensure the data/ directory exists in your repository
- Check that deconjugator.json and priority.json are present
- Run the processing scripts to generate JMdict*.json and kanjidic2.json
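To find out which inputs are missing before rerunning the build, a quick check like the following can help. It is a standalone sketch based on the file list in this guide, not the check the build script itself performs, and the function name is hypothetical.

```python
import glob
import os

# Fixed-name inputs listed in this guide; JMdict chunks are matched by pattern.
REQUIRED_FILES = ["deconjugator.json", "priority.json", "kanjidic2.json"]


def missing_data_files(data_dir="data"):
    """Return the names (or patterns) of expected inputs not found in data_dir."""
    missing = [name for name in REQUIRED_FILES
               if not os.path.isfile(os.path.join(data_dir, name))]
    if not glob.glob(os.path.join(data_dir, "JMdict*.json")):
        missing.append("JMdict*.json")
    return missing


# Example call from the repository root:
# print(missing_data_files("data"))
```

An empty list means every expected input is in place.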
Advanced customization
You can customize the dictionary build by modifying the processing scripts:
- scripts/process.py: Adjust JMdict filtering, add support for other languages, or change JSON chunking size
- scripts/process_kanji.py: Modify kanji reading prioritization, change example word selection, or adjust frequency calculations
- src/dictionary/customdict.py: Add custom preprocessing, implement additional lookup indexes, or extend the data model
The dictionary structure is highly optimized for the JMdict format. If you want to use a different dictionary source, you’ll need to significantly modify the lookup logic in src/dictionary/lookup.py.