Making a Japanese Dictionary Lookup Tool with Sudachi in Python

Mathew Chan (mathewthe2) ・3 min read

Background

I've been integrating a dictionary lookup tool into a recent project to help others learn words from games. These were my goals.

Goals

These goals define user input and the desired output.

1. Get the longest possible entry for the user input

Input: 自己紹介(じこしょうかい)
Desired Output: 自己紹介, not 自己 or 自

2. Context-unaware

Input: 牛 in 牛丼を食べてる
Desired Output: 牛(うし), not ぎゅう as in 牛丼(ぎゅうどん)

3. Parse conjugated verbs and adjectives

Input: 食べて in 牛丼を食べて
Desired Output: 食べる(たべる)

4. Keep the dictionary used by the parser independent of the dictionary used for the entry's glossary

(More later.)

Our desired output assumes a certain degree of Japanese grammatical knowledge from the user. When the user selects 折 from 折角, we assume they want to know what 折 means rather than 折角. When they select 外国人, they want to look up the entire compound noun rather than its individual characters.

Setting up

Dictionary - JMDict

We will be using JMDict, a freely available Japanese-to-English dictionary. You can find dictionaries from Japanese to other languages through the Yomichan project.

from pathlib import Path
import zipfile
import json

SCRIPT_DIR = Path(__file__).parent 
dictionary_map = {}

def load_dictionary(dictionary):
    output_map = {}
    archive = zipfile.ZipFile(dictionary, 'r')

    result = list()
    for file in archive.namelist():
        if file.startswith('term'):
            with archive.open(file) as f:
                data = f.read()  
                d = json.loads(data.decode("utf-8"))
                result.extend(d)

    for entry in result:
        if (entry[0] in output_map):
            output_map[entry[0]].append(entry) 
        else:
            output_map[entry[0]] = [entry] # Using headword as key for finding the dictionary entry
    return output_map

def setup():
    global dictionary_map
    # Store the loaded map in the module-level variable that look_up reads.
    dictionary_map = load_dictionary(str(Path(SCRIPT_DIR, 'dictionaries', 'jmdict_english.zip')))

To load the dictionary, we read every term file in the zip archive and build a map keyed by headword. When several entries share a headword, each one is appended to that headword's list.
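
To make the shape of the data concrete, here is a minimal sketch of the same grouping step run against a tiny in-memory archive. The file names and the three sample rows are made up for illustration; the sketch only assumes the row layout that the lookup code reads by index (headword at 0, reading at 1, tags at 2, glossary at 5, sequence at 6).

```python
import io
import json
import zipfile
from collections import defaultdict

def build_sample_archive():
    """Create an in-memory zip that mimics a term-bank archive (sample data only)."""
    rows = [
        ["食べる", "たべる", "v1 vt", "v1", 0, ["to eat"], 1358280, ""],
        ["食べる", "たべる", "v1 vt", "v1", 0, ["to live on (e.g. a salary)"], 1358280, ""],
        ["牛丼", "ぎゅうどん", "n", "", 0, ["rice covered with beef and vegetables"], 1845250, ""],
    ]
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("term_bank_1.json", json.dumps(rows, ensure_ascii=False))
        # A non-term file, to show that it is skipped by the name filter.
        archive.writestr("index.json", json.dumps({"title": "sample"}))
    buffer.seek(0)
    return buffer

def group_by_headword(archive_file):
    """Read every term* file in the archive and group rows by headword (column 0)."""
    grouped = defaultdict(list)
    with zipfile.ZipFile(archive_file, "r") as archive:
        for name in archive.namelist():
            if name.startswith("term"):
                for row in json.loads(archive.read(name).decode("utf-8")):
                    grouped[row[0]].append(row)
    return dict(grouped)

sample_map = group_by_headword(build_sample_archive())
print(len(sample_map["食べる"]))  # two rows share the headword 食べる
print(sample_map["牛丼"][0][1])   # reading of the first 牛丼 row
```

Using a defaultdict removes the need for the explicit "is the key already present" branch, but the resulting map has the same shape as the one built above.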

Parser - Sudachi

pip install sudachipy
pip install sudachidict_small

To save space, we will use Sudachi's small dictionary instead of its core dictionary (70 MB).

from sudachipy import tokenizer
from sudachipy import dictionary

tokenizer_obj = dictionary.Dictionary(dict_type='small').create()
mode = tokenizer.Tokenizer.SplitMode.A

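Sudachi has three split modes: A, B, and C. Mode A splits text into the shortest units and mode C into the longest, with B in between. We use mode A here: the lookup function checks the whole input against our own dictionary before parsing, so long compounds are already matched as a unit, and the parser is only needed to recover the dictionary form of the first token.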
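
A quick way to see what each mode does with your own input is to print the token surfaces side by side. The helper name `compare_split_modes` is hypothetical, and the sketch assumes that `sudachipy` and `sudachidict_small` are installed as shown above. No expected output is listed because the segmentation depends on the dictionary version.

```python
def compare_split_modes(text):
    """Return the surface forms Sudachi produces for `text` in each split mode."""
    # Imported lazily so this module still loads where sudachipy is not installed.
    from sudachipy import dictionary, tokenizer

    tokenizer_obj = dictionary.Dictionary(dict_type="small").create()
    modes = {
        "A": tokenizer.Tokenizer.SplitMode.A,
        "B": tokenizer.Tokenizer.SplitMode.B,
        "C": tokenizer.Tokenizer.SplitMode.C,
    }
    return {
        name: [m.surface() for m in tokenizer_obj.tokenize(text, mode)]
        for name, mode in modes.items()
    }

if __name__ == "__main__":
    try:
        for name, surfaces in compare_split_modes("自己紹介").items():
            print(name, surfaces)
    except ImportError as error:
        print("Sudachi is not available:", error)
```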

Putting it together

def look_up(word):
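    word = word.strip()
    if word not in dictionary_map:
        # Fall back to the dictionary form of the first token, e.g. 食べて -> 食べる.
        morphemes = tokenizer_obj.tokenize(word, mode)
        if len(morphemes) == 0:  # empty input yields no tokens
            return None
        word = morphemes[0].dictionary_form()
        if word not in dictionary_map:
            return None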
    result = [{
        'headword': entry[0],
        'reading': entry[1],
        'tags': entry[2],
        'glossary_list': entry[5],
        'sequence': entry[6]
    } for entry in dictionary_map[word]]
    return result

We first strip any surrounding whitespace from the word, then check whether it exists in our dictionary as-is. This way we get nouns like 牛丼 immediately, without having to parse them.

If there is no direct match, we parse the word with Sudachi in mode A, take the dictionary_form() of the first token, and look that up in our own dictionary instead of relying on the parser's dictionary.

The final result is reformatted and returned.
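
The two-stage flow can also be exercised without Sudachi by passing the "get dictionary form" step in as a function. The sketch below is an illustration under stated assumptions, not the code used in the project: `look_up_with`, `sample_entries`, and `fake_lemmatize` are hypothetical names, and the sample rows are hand-written.

```python
def look_up_with(word, entries_by_headword, lemmatize):
    """Try the exact headword first, then fall back to its dictionary form."""
    word = word.strip()
    if not word:
        return None
    if word not in entries_by_headword:
        word = lemmatize(word)  # e.g. 食べて -> 食べる
        if word not in entries_by_headword:
            return None
    return [
        {"headword": e[0], "reading": e[1], "tags": e[2],
         "glossary_list": e[5], "sequence": e[6]}
        for e in entries_by_headword[word]
    ]

# Sample data and a stand-in lemmatizer, for illustration only.
sample_entries = {
    "食べる": [["食べる", "たべる", "v1 vt", "v1", 0, ["to eat"], 1358280, ""]],
    "牛丼": [["牛丼", "ぎゅうどん", "n", "", 0,
             ["rice covered with beef and vegetables"], 1845250, ""]],
}
fake_lemmas = {"食べて": "食べる"}

def fake_lemmatize(word):
    """Pretend parser: map a known conjugated form to its dictionary form."""
    return fake_lemmas.get(word, word)

print(look_up_with(" 牛丼 ", sample_entries, fake_lemmatize)[0]["reading"])   # exact-match path
print(look_up_with("食べて", sample_entries, fake_lemmatize)[0]["headword"])  # fallback path
print(look_up_with("ねこ", sample_entries, fake_lemmatize))                   # no entry found
```

In the real tool, the `lemmatize` argument would be a small wrapper around `tokenizer_obj.tokenize(word, mode)[0].dictionary_form()`.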

(env) $ python
>>> setup()
>>> print(look_up('牛丼'))
[{'headword': '牛丼', 'reading': 'ぎゅうどん', 'tags': 'n', 'glossary_list': ['rice covered with beef and vegetables'], 'sequence': 1845250}]
>>> print(look_up('食べて'))
[{'headword': '食べる', 'reading': 'たべる', 'tags': 'v1 vt', 'glossary_list': ['to eat'], 'sequence': 1358280}, {'headword': '食べる', 'reading': 'たべる', 'tags': 'v1 vt', 'glossary_list': ['to live on (e.g. a salary)', 'to live off', 'to subsist on'], 'sequence': 1358280}]
>>> print(look_up('自己紹介'))
[{'headword': '自己紹介', 'reading': 'じこしょうかい', 'tags': 'n vs', 'glossary_list': ['self-introduction'], 'sequence': 1317650}]
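
Note that 食べる comes back as two results with the same sequence number, one per sense. If you would rather show a single entry per word, one optional post-processing step is to merge results that share a headword, reading, and sequence. This is a sketch, not part of the tool above; `merge_by_sequence` is a hypothetical helper, and it keeps the tags of the first result in each group.

```python
def merge_by_sequence(results):
    """Combine look-up results that share (headword, reading, sequence)."""
    merged = {}
    for item in results:
        key = (item["headword"], item["reading"], item["sequence"])
        if key not in merged:
            # Copy the glossary list so the input results are not modified.
            merged[key] = {**item, "glossary_list": list(item["glossary_list"])}
        else:
            merged[key]["glossary_list"].extend(item["glossary_list"])
    return list(merged.values())

# The two 食べる results from the session above.
taberu_results = [
    {"headword": "食べる", "reading": "たべる", "tags": "v1 vt",
     "glossary_list": ["to eat"], "sequence": 1358280},
    {"headword": "食べる", "reading": "たべる", "tags": "v1 vt",
     "glossary_list": ["to live on (e.g. a salary)", "to live off", "to subsist on"],
     "sequence": 1358280},
]

merged_results = merge_by_sequence(taberu_results)
print(len(merged_results))                  # one entry after merging
print(merged_results[0]["glossary_list"])   # all four glosses, in original order
```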

Let me know if this was helpful.
