With the exponential growth of information, finding the right information is like looking for a needle in a haystack. Bubbling the right information to the top of the search results is essential for efficiently working in the knowledge economy. Putting the best relevant results in the limited place of the first page is what distinguishes an excellent search engine from a good search engine. This is the challenge we are solving at Traindex.
One of the first challenges we are trying to solve is in patent space. Patent analysts might want to know what other patents exist in the same domain as a new patent being filed. They may want to find prior art to challenge a claim in an existing patent. There are numerous use cases that better patent search helps solve.
We have experimented with numerous approaches to retrieve the relevant information quickly and effectively. These approaches are centered around two fundamental techniques, the keyword search, and semantic search.
In this article, we will take a deep dive in what is the difference between semantic search and keyword search and which approach is better.
A keyword search is a simple keyword lookup in a corpus of documents against a query. The system will retrieve all the documents from the database which have any keyword present in the query. We can set constraints of whether all words in the query should be present in the retrieved results or any single word in the document would be sufficient to bring it up.
One drawback of this approach is that the retrieval system will not care what the meaning of the keyword is in the context of a document and query it will simply bring back all the documents which contain the keyword specified by the user. This type of search might return irrelevant results (false positives). To view how keyword search works take a look at the following diagram.
In the above diagram, each small box shows the documents that contain a term specified e.g “A”. The diagram shows a user-entered query “raining cats and dogs” and how the system has retrieved the relevant documents to the terms that they used. In this case, the system retrieved all the documents which contained “raining”, “cats”, “and” and “dogs” and showed them to the user. But “raining cats and dogs” is a phrase in English used to describe heavy rain. This system might also get some relevant results but those results would be very small in number and also could have been ranked randomly (depending upon the database structure). Moreover, each word in the query is contributing independently regardless of its meaning being governed by its neighbors. Scaling the keyword search is also a problem and can slow down the response time of the search engine if you have millions of documents. Keyword searches may also fail to retrieve related documents that don’t specifically use the search term (false negatives). Under these conditions, researchers can miss pertinent information. There is also the danger of making business decisions based on less than a comprehensive set of search results.
We use semantic search in Traindex. Unlike keyword search, the semantic search takes into account the meaning of the words according to their context. In semantic search, a latent vector representation of documents is inferred during the training process to project into the latent space. At the inference time, the incoming query is converted into the same latent space representation and projected into space in which the documents are already projected. The nearest points in the space to the query are retrieved as the similar documents to the query.
Take a look at the following illustration using a 3D latent space, although these latent spaces can be from hundreds to a few thousands. At Traindex, we use the latent space representations ranging from 200 to 600 dimensions for capturing better representations of the documents.
The red dot represents the query a user might have entered, whereas all the blue dots are the documents projected in 3D latent space. From here there are a number of ways to locate similar documents, for example, one can use Euclidean distance of these dots to calculate the most similar documents or a Cosine angle from the origin which is also known as cosine similarity, and the list of similarity matrices goes on.
Now we have a solid understanding of how both searches work, let's compare them side by side.
|Keyword Search||Semantic Search|
|Synonyms could be neglected during the search||Incorporate the meaning of words hence comprehends the synonyms as well|
|Need to carefully pick the keywords for search||The query is automatically enriched by the latent encoding|
|The information which is retrieved is dependent on keywords and page ranking algorithms that can produce spam results||The information retrieved is independent of keywords and page rank algorithms that produce exact results rather than any irrelevant results|
Semantic search seeks to improve search accuracy by understanding a searcher’s intent through the contextual meaning of the words and brings back the results the user intended to see.
In patents, technical and domain-specific terms are heavily used which might not be present in English dictionaries and hence can be missed by a patent analyst while they is searching for prior art. Semantic search in Traindex processes the whole patent as input for (prior art and similarity) search and uses vectorized representation of the text to find synonyms and same words in other patents during the search, which is impossible with a keyword search.
Moreover, keyword search is restricted to use up to a fixed number of words which is at most 50 words generally and reduces the response time as the number of keywords grow, whereas in Traindex you can search with any number of words including the whole patents which are hundreds of thousands of words and can get results in real time. Semantic search in Traindex converts the whole patent into a vector representation and matches the most similar documents in the database, hence overcomes the problem of limited keywords search.