So you have followed the Deep Dive into Neo4j's Full Text Search tutorial, even learned how to create custom analyzers, and finally watched the Full Text Search tips and tricks talk at the Neo4j Nodes19 online conference?
Still, searching for boat does not yield results containing yacht or ship, and you're wondering how to make your search engine a bit more relevant for your users?
Look no further: you'll learn how to do it right now!
Synonyms
A synonym is a word or phrase that means exactly or nearly the same as another word or phrase.
Why synonyms?
It's all about recall! In other words, it's about helping your users find the content they're interested in without them having to know specific terms.
A user searching for coffee should probably be seeing results containing latte macchiato, espresso or even ristretto.
Lists of synonyms
You can find third-party word lists for synonyms, such as WordNet or ConceptNet5. However, appropriate word lists depend on the domain, application and use case, and the best fit is generally a self-curated synonyms list.
How to use them?
The first thing to do is to create a word list with the following format:
coffee,latte macchiato,espresso,ristretto
boat,yacht,sailing vessel,ship
fts,full text search, fulltext search
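To make the format concrete, here is a minimal sketch in plain Java of how such a file can be read: each line is one group of equivalent terms, separated by commas. The class and method names are hypothetical and are not part of Neo4j or Lucene. The sketch assumes the Solr-style conventions where whitespace around commas is ignored and lines starting with # are comments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; the real parsing is done by Lucene's synonym filter.
class SynonymListSketch {

    // Maps every term of a line to the full group of equivalent terms on that line.
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> groups = new HashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue; // assumption: blank lines and # comments are skipped
            }
            List<String> terms = new ArrayList<>();
            for (String term : line.split(",")) {
                terms.add(term.trim()); // assumption: whitespace around commas is ignored
            }
            for (String term : terms) {
                groups.put(term, terms);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<String, List<String>> groups = parse(Arrays.asList(
                "coffee,latte macchiato,espresso,ristretto",
                "boat,yacht,sailing vessel,ship",
                "fts,full text search, fulltext search"));
        System.out.println(groups.get("fts")); // prints [fts, full text search, fulltext search]
    }
}
```

Note that a multi-word entry such as sailing vessel is a single term of its group, which matters for the query-time discussion that follows.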
The next step is to create a custom analyzer using the synonym filter. Since we're using an analyzer, the first question that might come to mind is:
Do I have to reindex all the documents when my synonyms list changes?
The answer is yes. Using a query-time synonym filter is very bad(TM), for the following reasons:
- The QueryParser tokenizes on whitespace before giving the text to the analyzer, so if a user searches for sailing vessel, the analyzer will be given the words sailing and vessel separately, and will not know they match a synonym.
- Multi-word synonyms will also not work in phrase queries.
- The IDF (inverse document frequency) of rare synonyms will be boosted.
More information can be found in the Solr documentation.
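The first reason can be illustrated without any Lucene code. The sketch below is a deliberate simplification in plain Java, with hypothetical names: it mimics a parser that splits the query on whitespace and looks up each fragment on its own, and compares that with a lookup that sees the whole text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is not how Lucene is implemented.
class QueryTimeSynonymSketch {

    static final Set<String> SYNONYM_TERMS = new HashSet<>(
            Arrays.asList("boat", "yacht", "sailing vessel", "ship"));

    // Mimics a query parser that splits on whitespace before analysis:
    // every fragment is checked against the synonym terms in isolation.
    static List<String> matchedAfterWhitespaceSplit(String query) {
        List<String> matched = new ArrayList<>();
        for (String fragment : query.trim().split("\\s+")) {
            if (SYNONYM_TERMS.contains(fragment)) {
                matched.add(fragment);
            }
        }
        return matched;
    }

    // Mimics analysis that sees the whole text at once, as happens at index time.
    static boolean matchedAsWholeText(String text) {
        return SYNONYM_TERMS.contains(text.trim());
    }
}
```

With this sketch, matchedAfterWhitespaceSplit("sailing vessel") finds nothing because neither sailing nor vessel is a synonym term by itself, while matchedAsWholeText("sailing vessel") succeeds.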
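Let's create our custom analyzer for synonyms, then:
// Imports assume Neo4j 3.5 and the Lucene version it bundles;
// package names may differ in other versions.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.apache.lucene.analysis.synonym.SynonymFilterFactory;
import org.neo4j.graphdb.index.fulltext.AnalyzerProvider;
import org.neo4j.helpers.Service;

// Registers the analyzer so that it can be selected by name when creating an index.
@Service.Implementation(AnalyzerProvider.class)
public class SynonymAnalyzer extends AnalyzerProvider {

    public static final String ANALYZER_NAME = "synonym-custom";

    public SynonymAnalyzer() {
        super(ANALYZER_NAME);
    }

    @Override
    public Analyzer createAnalyzer() {
        try {
            String synFile = "synonyms.txt";
            // Filter order matters: synonyms are matched before tokens are lowercased.
            Analyzer analyzer = CustomAnalyzer.builder()
                    .withTokenizer(StandardTokenizerFactory.class)
                    .addTokenFilter(StandardFilterFactory.class)
                    .addTokenFilter(SynonymFilterFactory.class, "synonyms", synFile)
                    .addTokenFilter(LowerCaseFilterFactory.class)
                    .build();
            return analyzer;
        } catch (Exception e) {
            throw new RuntimeException("Unable to create analyzer", e);
        }
    }

    @Override
    public String description() {
        return "The default, standard analyzer with a synonyms file. This is an example analyzer for educational purposes.";
    }
}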
A very important note: the LowerCaseFilter comes after the SynonymFilter. The order matters, because lowercasing first can prevent synonyms from being recognized, for example with the following list:
GB,gigabyte
If the lowercase filter is applied before the synonym filter, the token gb will no longer match the GB entry.
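The effect of the filter order can be reproduced with a small plain-Java sketch. It is a simplification with hypothetical names: a real synonym filter can emit both the original token and its synonyms, whereas this sketch just replaces the token, which is enough to show why the order matters.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; the real filters operate on Lucene token streams.
class FilterOrderSketch {

    // Case-sensitive synonym table, as in a list containing the entry GB.
    static final Map<String, String> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("GB", "gigabyte");
    }

    // Synonym lookup first, lowercasing second: the order used by the analyzer above.
    static String synonymThenLowercase(String token) {
        String expanded = SYNONYMS.getOrDefault(token, token);
        return expanded.toLowerCase(Locale.ROOT);
    }

    // Lowercasing first: the lookup key becomes "gb" and misses the "GB" entry.
    static String lowercaseThenSynonym(String token) {
        String lowered = token.toLowerCase(Locale.ROOT);
        return SYNONYMS.getOrDefault(lowered, lowered);
    }
}
```

Here synonymThenLowercase("GB") yields gigabyte, while lowercaseThenSynonym("GB") yields gb and the synonym is lost.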
Create a synonyms.txt file with your synonyms list in the conf/ directory of your Neo4j instance:
conf/synonyms.txt
coffee,latte macchiato,espresso,ristretto
boat,yacht,sailing vessel,ship
fts,full text search, fulltext search
Build your analyzer jar, put it in the plugins directory of Neo4j, and restart the database if needed.
Create the Index
CALL db.index.fulltext.createNodeIndex(
'syndemo',
['Article'],
['text'],
{analyzer:'synonym-custom'}
)
Create an Article node with some text:
CREATE (n:Article {text: "This is an article about Full Text Search and Neo4j, let's go !"})
Query the index:
CALL db.index.fulltext.queryNodes('syndemo', 'fts')
╒══════════════════════════════════════════════════════════════════════╤══════════════════╕
│"node" │"score" │
╞══════════════════════════════════════════════════════════════════════╪══════════════════╡
│{"text":"This is an article about Full Text Search and Neo4j, let's go│1.2616268396377563│
│ !"} │ │
└──────────────────────────────────────────────────────────────────────┴──────────────────┘
Similarly, a search for fulltext will return the result as well. But let's get fancy, er, fuzzy:
CALL db.index.fulltext.queryNodes('syndemo', 'fullt*')
No results, no records
Prefix and synonyms?
There is one limitation: prefix, fuzzy and similar queries do not use the analyzer; they produce term or multi-term queries instead.
But there is a trick you can use: add an NGramFilter to your analyzer and use a phrase query, so that fts and its synonyms have their n-grams tokenized, stored in the index and retrieved from it:
Analyzer analyzer = CustomAnalyzer.builder()
        //...
        .addTokenFilter(NGramFilterFactory.class, "minGramSize", "3", "maxGramSize", "5")
        .build();
return analyzer;
The NGramTokenFilter will tokenize the input into n-grams of the given sizes, here min 3 and max 5. So for the following input:
fulltext search
the index will contain, among others, the n-grams ful, full, fullt, ull, ullt, ullte, llt, llte, lltex, lte, ltex, ltext.
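To see where these n-grams come from, here is a minimal plain-Java sketch that enumerates every substring of a token with a length between the minimum and maximum gram sizes. The class and method names are hypothetical; the sketch only illustrates the idea and does not reproduce Lucene's filter implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; in the analyzer this work is done by Lucene's n-gram filter.
class NGramSketch {

    // Returns all substrings of the token whose length is between minGram and maxGram,
    // ordered by start position, then by length.
    static List<String> ngrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < token.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= token.length(); len++) {
                grams.add(token.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [ful, full, fullt, ull, ullt, ullte, llt, llte, lltex, lte, ltex, ltext, tex, text, ext]
        System.out.println(ngrams("fulltext", 3, 5));
    }
}
```

A single eight-letter token already produces 15 n-grams with these sizes, so expect the index to grow noticeably when this filter is enabled.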
You can also use the EdgeNGramFilter, which produces n-grams only from the beginning of each token. For the same example as above, the n-grams for fulltext will be ful, full, fullt.
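The edge variant is even simpler to sketch in plain Java: only prefixes of the token are kept. As before, the names are hypothetical and the code illustrates the idea, not Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; in the analyzer this work is done by Lucene's edge n-gram filter.
class EdgeNGramSketch {

    // Returns the prefixes of the token whose length is between minGram and maxGram.
    static List<String> edgeNgrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int len = minGram; len <= maxGram && len <= token.length(); len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }
}
```

For example, edgeNgrams("fulltext", 3, 5) gives ful, full, fullt and edgeNgrams("search", 3, 5) gives sea, sear, searc. This produces far fewer terms than full n-grams, and it is a natural fit when users only type the beginning of a word.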
Re-deploy your plugin, restart the database, drop and recreate the index, then query again:
CALL db.index.fulltext.queryNodes('syndemo', '"fullt*"')
╒══════════════════════════════════════════════════════════════════════╤═══════════════════╕
│"node" │"score" │
╞══════════════════════════════════════════════════════════════════════╪═══════════════════╡
│{"text":"This is an article about Full Text Search and Neo4j, let's go│0.04872262850403786│
│ !"} │ │
└──────────────────────────────────────────────────────────────────────┴───────────────────┘
To finish, let's try another phrase query:
CALL db.index.fulltext.queryNodes('syndemo', '"article fullte*"~2')
╒══════════════════════════════════════════════════════════════════════╤══════════════════╕
│"node" │"score" │
╞══════════════════════════════════════════════════════════════════════╪══════════════════╡
│{"text":"This is an article about Full Text Search and Neo4j, let's go│2.3429081439971924│
│ !"} │ │
└──────────────────────────────────────────────────────────────────────┴──────────────────┘
Conclusion
Synonyms are a valuable asset when building search engines, offering a better recall and thus a better user experience.
You can find all the code from this blog post in this example GitHub repository.