In Natural Language Processing, identifying the words in an English (or other Latin-script) sentence is not too hard, because words are separated by spaces. Khmer text is different: the script puts no spaces between words, so we need to compare character sequences against existing words from a dictionary.
Dictionary Format:
You can structure your dictionary to include related words and explanatory phrases. Here's an example format:
Example:
khmer_dictionary = {
    'មាន': {'POS': 'Verb', 'Related': ['មានសៀវភៅ', 'មានទិន្នន័យ'], 'Explanation': 'to have'},
    'សៀវភៅ': {'POS': 'Noun', 'Related': [], 'Explanation': 'book'},
    'ច្រើន': {'POS': 'Adjective', 'Related': [], 'Explanation': 'many'},
    'ណាស់': {'POS': 'Adverb', 'Related': [], 'Explanation': 'very'},
    'នៅ': {'POS': 'Verb', 'Related': [], 'Explanation': 'to be at'},
    'ទីនេះ': {'POS': 'Noun', 'Related': [], 'Explanation': 'this place'}
}
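With this structure, each entry can be looked up directly by its Khmer key:

# Look up one entry from the dictionary above.
entry = khmer_dictionary['មាន']
print(entry['POS'])          # Verb
print(entry['Related'])      # ['មានសៀវភៅ', 'មានទិន្នន័យ']
print(entry['Explanation'])  # to have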
Improving the Tokenization Method:
To handle multi-word phrases and OOV (out-of-vocabulary) words better, you need to adjust your tokenization function. Here's a revised version:
def tokenize_with_dictionary(sentence):
    tokens = []
    current_word = ''
    for char in sentence:
        current_word += char
        if current_word in khmer_dictionary:
            # Exact match: emit the token with its dictionary entry.
            tokens.append((current_word, khmer_dictionary[current_word]))
            current_word = ''
        elif current_word[:-1] in khmer_dictionary:
            # The buffer minus the last character was a known word:
            # emit it and restart the buffer from the current character.
            tokens.append((current_word[:-1], khmer_dictionary[current_word[:-1]]))
            current_word = char
    if current_word:
        # Anything left over at the end is out-of-vocabulary.
        tokens.append((current_word, 'OOV'))
    return tokens
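Note that this version emits a token as soon as any dictionary entry matches, so a longer entry such as 'មានសៀវភៅ' can never win over its prefix 'មាន'. A longest-match (maximal matching) variant handles that case: at each position it tries the longest candidate substring first and emits unmatched characters as OOV immediately. The sketch below is one possible implementation, not from the original post; the function name and the max_len parameter (an assumed upper bound on word length in code points) are my own choices.

def tokenize_longest_match(sentence, dictionary, max_len=10):
    # Greedy longest-match segmentation: at each position, try the
    # longest candidate substring first, shrinking until a dictionary
    # entry is found; otherwise emit the single character as OOV.
    tokens = []
    i = 0
    while i < len(sentence):
        match = None
        # Try candidates from longest to shortest.
        for j in range(min(len(sentence), i + max_len), i, -1):
            candidate = sentence[i:j]
            if candidate in dictionary:
                match = candidate
                break
        if match is not None:
            tokens.append((match, dictionary[match]))
            i += len(match)
        else:
            tokens.append((sentence[i], 'OOV'))
            i += 1
    return tokens

With the dictionary above, tokenize_longest_match('មានសៀវភៅច្រើនណាស់នៅទីនេះ', khmer_dictionary) should segment into the six dictionary words; and if multi-word phrases like 'មានសៀវភៅ' are later added as keys in their own right, this variant will prefer them over the shorter prefix.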
Then you can save the tokens to a database.
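For example, here is a minimal sketch using Python's built-in sqlite3 module; the table name and the JSON-in-TEXT schema are assumptions, just one possible layout.

import json
import sqlite3

def save_tokens(tokens, db_path='tokens.db'):
    # Store each token with its dictionary entry (or 'OOV') as JSON text.
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS tokens (word TEXT, info TEXT)')
    conn.executemany(
        'INSERT INTO tokens (word, info) VALUES (?, ?)',
        [(word, json.dumps(info, ensure_ascii=False)) for word, info in tokens]
    )
    conn.commit()
    conn.close()

save_tokens(tokenize_with_dictionary('មានសៀវភៅច្រើនណាស់នៅទីនេះ'))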
If you have a better idea or suggestions for improvement, please comment below.