MeCab is a long-established Japanese morphological analyzer implemented in C++. Note that I am not very good at reading Japanese (documentation) myself.
Kagome is a more recently updated library implemented in Go.
However, the parsed output also depends on the training data. That's why I asked about the dictionary, e.g. UniDic or NEologd...
Top comments (1)
I think it will depend on your use case.
For example, if you want to extract the phonemes, with MeCab you need to use IPADIC and NOT UniDic (I made this mistake :), plus NEologd for newer words.
You can try it out with a Docker image someone built:
Then there is some boilerplate to make it work in Python.
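To give an idea of what that boilerplate looks like, here is a minimal sketch that parses MeCab's default output format. It assumes the IPADIC feature layout (nine comma-separated fields, with the katakana reading in the eighth position); the sample line is illustrative, not generated by MeCab here.

```python
def parse_mecab_line(line):
    """Split one MeCab output line into (surface, feature list)."""
    surface, features = line.split("\t", 1)
    return surface, features.split(",")

def reading_of(line):
    """Return the katakana reading (8th IPADIC feature), if present.

    Unknown words can have fewer feature fields, so guard the index.
    """
    _, feats = parse_mecab_line(line)
    return feats[7] if len(feats) > 7 else None

# Example line in MeCab's IPADIC output format for the word 東京:
sample = "東京\t名詞,固有名詞,地域,一般,*,*,東京,トウキョウ,トーキョー"
print(parse_mecab_line(sample)[0], reading_of(sample))  # 東京 トウキョウ
```

In practice you would feed MeCab's stdout (or the string returned by a binding like mecab-python3's `Tagger.parse`) through something like this, line by line, stopping at the `EOS` marker.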
For example, if you want to generate furigana for a sentence, check this out: github.com/itsupera/furigana
Now MeCab works pretty well for this, but it's not perfect.
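One small step in furigana generation can be sketched in plain Python: MeCab with IPADIC gives readings in katakana, while furigana is conventionally written in hiragana. In Unicode, katakana U+30A1..U+30F6 sits exactly 0x60 above the corresponding hiragana U+3041..U+3096, so a fixed offset converts between them (the prolonged sound mark "ー" has no hiragana counterpart and is kept as-is).

```python
def katakana_to_hiragana(text):
    """Convert katakana characters to hiragana via the 0x60 code-point
    offset; characters outside the mapped range pass through unchanged."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )

print(katakana_to_hiragana("トウキョウ"))  # とうきょう
```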
The best tokenizer I have found so far is ichiran (see ichi.moe for a demo), but it's written in Common Lisp and there is not much documentation available.
As for Kuromoji and Kagome, I have not tried them.