DEV Community

marouane moutih

Semantic search with OpenAI, Spring Boot, Vaadin, and Elasticsearch

Semantic search

Semantic search is an information search approach that seeks to understand the meaning and intent behind a query rather than relying solely on the specific keywords used. Unlike traditional search based on exact textual matches, semantic search considers the context, meaning and relationship between words to provide more relevant and accurate results.

Semantic search uses language analysis, natural language processing (NLP) and artificial intelligence techniques to understand the meaning of words in a given context. It can take into account elements such as synonymy, antonymy, co-occurrence, semantic relationships and grammatical structures to improve the understanding of both queries and indexed documents.

One of the key aspects of semantic search is the ability to interpret the user's intent behind a query. For example, if you enter the query "Best sailing catamarans in Corsica", a semantic search engine can understand that you are looking for sailing catamarans in Corsica, rather than just matching the words "best", "catamaran", "sailing" and "Corsica" separately.

In short, semantic search goes beyond keyword matching to understand the overall meaning of a query and return more relevant results based on a semantic understanding of the language.

Giving meaning to a query (embedding)

Word embedding is a technique used in natural language processing (NLP) to represent words as numerical vectors in a multidimensional space. These word embeddings are usually learned from large amounts of text by machine learning models, such as neural networks, which are able to capture the semantic and syntactic relationships between words.

These vector representations are used in many NLP tasks, such as machine translation, information retrieval, sentiment analysis and text classification. Word embedding lets machine learning algorithms work with textual data by transforming words into a numerical form that is better suited to analysis and modeling.

Cosine similarity

In data analysis, cosine similarity is a measure of similarity between two non-zero vectors defined in an inner product space. Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided by the product of their lengths. It follows that the cosine similarity does not depend on the magnitudes of the vectors, but only on their angle.
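
To make the definition concrete, here is a minimal sketch of the formula in plain Java. The class and method names are only illustrative and are not part of the project; the vectors in main are made up to show that scaling a vector does not change the result.

```java
/** Minimal cosine similarity over two equal-length vectors (illustrative helper). */
final class CosineSimilarity {

    public static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];      // dot product
            normA += a[i] * a[i];    // squared length of a
            normB += b[i] * b[i];    // squared length of b
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("Cosine similarity is undefined for zero vectors");
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] v = {1.0, 2.0, 3.0};
        double[] scaled = {2.0, 4.0, 6.0};      // same direction, twice the length
        double[] orthogonal = {3.0, 0.0, -1.0}; // dot product with v is 0
        System.out.println(cosine(v, scaled));     // close to 1
        System.out.println(cosine(v, orthogonal)); // close to 0
    }
}
```

The result ranges from -1 (opposite directions) to 1 (same direction), which is why only the angle matters and not the length of the vectors.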

Elasticsearch

Elasticsearch is a distributed, open-source and highly scalable data search and analysis engine. It is designed to process and analyze large amounts of data in real time. Elasticsearch uses a data structure called "inverted index" to enable quick search and retrieval of information.
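
To give an idea of what an inverted index is, here is a toy sketch in plain Java: a map from each term to the ids of the documents that contain it. This is only an illustration of the idea, with hypothetical class and method names; the real structure inside Elasticsearch is far more elaborate.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index: term -> ids of the documents containing that term. */
final class InvertedIndex {

    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    /** Splits the text into lowercase tokens and records the document id under each token. */
    public void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Looks up a single term; no scan over the documents is needed. */
    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.add(1, "Sailing catamarans in Corsica");
        index.add(2, "Best sailing routes");
        System.out.println(index.search("sailing")); // prints [1, 2]
        System.out.println(index.search("corsica")); // prints [1]
    }
}
```

A lookup like this only finds exact terms, which is precisely the limitation that the embedding-based search in this article tries to overcome.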

Elasticsearch’s main goal is to provide fast, flexible and easy to implement data search and analysis. It is particularly suitable for indexing and searching text documents, but can also be used for log analysis, real-time monitoring, content recommendation, report generation, and many other use cases.

In addition to full-text search, Elasticsearch offers advanced features such as data aggregation, geospatial search, similarity search, term suggestion, automatic correction, etc. It also provides RESTful APIs to interact with the system and manage indexes and data.

How does it work

Our example combines the three concepts (semantic search, embedding, cosine similarity) to create a more powerful search system based on an Elasticsearch engine, using Java as the programming language, Spring Boot and Vaadin as the stack, and OpenAI to generate the embeddings.

The Model

Our model will be a simple post with a title, a content and our embedding vector.
Using spring-boot-starter-data-elasticsearch, we must tell Elasticsearch that our embedding field is a dense vector with 1536 dimensions, which matches the length of the embeddings returned by OpenAI.

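import java.util.Vector;

import lombok.Getter;
import lombok.Setter;
import lombok.ToString;
import lombok.experimental.Accessors;
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.annotations.Field;
import org.springframework.data.elasticsearch.annotations.FieldType;

// Assumption: the service calls post.title(), post.content() and post.embedding(...),
// so fluent Lombok accessors are declared here; the full example may declare them differently.
@Document(indexName = "posts")
@ToString
@Getter
@Setter
@Accessors(fluent = true)
public class Post {

    @Id
    private String id;

    @Field(type = FieldType.Text)
    private String title;

    @Field(type = FieldType.Text)
    private String content;

    // Mapped as a dense_vector with 1536 dimensions
    @Field(type = FieldType.Dense_Vector, dims = 1536, index = true)
    private Vector<Double> embedding;

    public Post(String title, String content) {
        this.title = title;
        this.content = content;
    }
}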

The Repository

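Our repository simply extends ElasticsearchRepository. We define a single method that runs a script_score query: match_all selects every document, and the script scores each one by the cosine similarity between its embedding vector and the vector of the search text. The ?0 placeholder receives that query vector as a JSON array. The constant added to the cosine keeps the score positive for typical embeddings, because Elasticsearch rejects negative scores; the Elasticsearch documentation adds 1.0 instead, which guarantees a non-negative score for any pair of vectors.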

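import java.util.List;

import org.springframework.data.elasticsearch.annotations.Query;
import org.springframework.data.elasticsearch.repository.ElasticsearchRepository;
import org.springframework.stereotype.Repository;

@Repository
public interface PostRepository extends ElasticsearchRepository<Post, String> {

    // Scores every post by cosine similarity between its embedding and the query vector
    @Query("{" +
            "    \"script_score\": {" +
            "       \"query\": {\"match_all\": {}}," +
            "       \"script\": {" +
            "           \"source\": \"cosineSimilarity(params.queryVector, 'embedding') +0.1\"," +
            "           \"params\": {\"queryVector\": ?0}" +
            "       }" +
            "   }" +
            "}")
    // queryVectorJson is the embedding of the search text, serialized as a JSON array
    List<Post> findBySimilar(String queryVectorJson);
}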
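
To see what this query computes, here is a rough in-memory equivalent in plain Java: every document is scored (the match_all part) with cosine similarity plus 0.1, then the results are sorted with the best score first. This is only a sketch of the logic, not how Elasticsearch executes it; the class name, the titles and the two-dimensional vectors are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory imitation of the script_score query, for illustration only. */
final class BruteForceRanking {

    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Scores every document and returns the titles ordered from most to least similar. */
    public static List<String> rank(Map<String, double[]> docs, double[] queryVector) {
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, double[]> doc : docs.entrySet()) {
            double score = cosine(queryVector, doc.getValue()) + 0.1; // same offset as the query
            scored.add(Map.entry(doc.getKey(), score));
        }
        scored.sort(Map.Entry.<String, Double>comparingByValue().reversed());

        List<String> titles = new ArrayList<>();
        for (Map.Entry<String, Double> entry : scored) {
            titles.add(entry.getKey());
        }
        return titles;
    }

    public static void main(String[] args) {
        // Made-up 2-dimensional "embeddings"; real ones have 1536 dimensions
        Map<String, double[]> docs = new LinkedHashMap<>();
        docs.put("Sailing in Corsica", new double[]{0.9, 0.1});
        docs.put("Elasticsearch tuning", new double[]{0.1, 0.9});
        docs.put("Catamaran rental", new double[]{0.8, 0.3});

        // prints [Sailing in Corsica, Catamaran rental, Elasticsearch tuning]
        System.out.println(rank(docs, new double[]{1.0, 0.0}));
    }
}
```

Scoring every document like this is a brute-force approach, so the cost grows with the number of indexed posts.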

The service

Our service must generate an embedding for each post when it is created, and for the text of each search request.

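// JsonData and JacksonJsonpMapper come from the Elasticsearch Java client
// (co.elastic.clients.json and co.elastic.clients.json.jackson).
// getEmbedding(...) is the helper that asks OpenAI for the embedding; it is not shown in this excerpt.

public void save(Post post) {
    // Embed the title and the content together, separated by a space so the words do not run together
    Vector<Double> vector = new Vector<>(getEmbedding(post.title() + " " + post.content()));
    post.embedding(vector);
    repository.save(post);
}

public List<Post> find(String text) {
    // Embed the search text, then pass the vector to the repository as a JSON array
    Vector<Double> vector = new Vector<>(getEmbedding(text));
    return repository.findBySimilar(JsonData.of(vector).toJson(new JacksonJsonpMapper()).toString());
}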
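
The getEmbedding helper is not shown above, so here is one possible sketch of it using only the JDK HTTP client and the OpenAI embeddings endpoint. Everything in it is an assumption for illustration: the class name, the choice of the text-embedding-ada-002 model, the way the API key is passed in, and the regex-based extraction of the vector (a real implementation would use a JSON parser or an OpenAI client library). The project linked at the end of this post may do it differently.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the getEmbedding helper used by the service. */
final class OpenAiEmbeddings {

    static final String ENDPOINT = "https://api.openai.com/v1/embeddings";
    static final String MODEL = "text-embedding-ada-002"; // assumed model, returns 1536 values

    private static final Pattern EMBEDDING =
            Pattern.compile("\"embedding\"\\s*:\\s*\\[([^\\]]*)\\]");

    /** Escapes the characters that cannot appear raw inside a JSON string. */
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Builds the JSON body expected by the embeddings endpoint. */
    public static String requestBody(String text) {
        return "{\"model\":\"" + MODEL + "\",\"input\":\"" + escapeJson(text) + "\"}";
    }

    /** Extracts the first "embedding" array from the response body. */
    public static List<Double> parseEmbedding(String responseJson) {
        Matcher matcher = EMBEDDING.matcher(responseJson);
        if (!matcher.find()) {
            throw new IllegalArgumentException("No embedding found in response");
        }
        List<Double> values = new ArrayList<>();
        for (String part : matcher.group(1).split(",")) {
            if (!part.isBlank()) {
                values.add(Double.parseDouble(part.trim()));
            }
        }
        return values;
    }

    /** Sends the text to OpenAI and returns its embedding; needs network access and a valid key. */
    public static List<Double> getEmbedding(String apiKey, String text) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(ENDPOINT))
                .header("Authorization", "Bearer " + apiKey)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(requestBody(text)))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("OpenAI returned HTTP " + response.statusCode());
        }
        return parseEmbedding(response.body());
    }
}
```

Since every save and every search triggers one call to OpenAI, it is worth keeping the API key out of the source code, for example in an environment variable.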


The full example

https://github.com/mmoutih/sementic-search

Follow me on Twitter
