Nhan Nguyen

Posted on May 10, 2023

Lucene.NET for search applications

#dotnetcore #searching #csharp

Introduction

Lucene.NET is an open-source search engine library, a .NET port of Apache Lucene, that lets developers add full-text search to their applications. It builds an inverted index over your data and queries that index instead of scanning the data itself, which keeps searches fast even on large datasets and makes it a popular choice for building search applications.

Pros and Cons

Pros of Lucene.NET:

Efficient indexing and searching: Lucene.NET's indexing and searching algorithms are highly optimized for performance, allowing for fast and efficient searches even with large datasets.
Flexible search capabilities: Lucene.NET offers a variety of advanced search techniques, such as Boolean searches, wildcard searches, and phrase searches, allowing developers to fine-tune their search results for optimal accuracy.
Open-source and free: Lucene.NET is an open-source library that can be used for free, making it an attractive option for developers who want to implement search functionality in their applications without incurring additional costs.
Language support: Lucene.NET ships analyzers for many languages, including English, Chinese, Japanese, and Korean, and lets you write your own analyzer for languages it does not cover, making it a versatile option for building search applications in different languages.

Cons of Lucene.NET:

Steep learning curve: Lucene.NET can be challenging to learn and use, especially for developers who are new to search engine development.
Complex API: Lucene.NET's API can be complex and difficult to navigate, especially when working with advanced search techniques.
Resource-intensive: Lucene.NET can be resource-intensive, especially when working with large datasets, requiring significant CPU and memory resources.
Limited analytics capabilities: Lucene.NET's focus is primarily on search functionality, and it may not provide sufficient analytics capabilities for some applications.
Maintenance and updates: Lucene.NET requires regular maintenance and updates to ensure that it remains compatible with other libraries and frameworks. Note that the 4.8.0 packages used in this post are prerelease (beta) builds.

Workflow

The Lucene.NET workflow typically involves the following steps:

Indexing: The first step is to create an index of the data you want to search. You create a Document for each piece of data and add a Field for each attribute you want to index.
Analysis: Before a document is written, its text fields are analyzed into searchable terms. Lucene.NET provides tokenizers and token filters that split and transform the text, for example by lowercasing it or removing stop words.
Index Writer: Documents are added to the index through an IndexWriter, which runs the analyzer, creates and manages the index, and provides methods for adding, updating, and deleting documents.
Querying: Once the data has been indexed, you search it with a Query object, which defines the search parameters such as the terms to match. A Query can be built directly or parsed from a string with a QueryParser.
Searcher: The Query is executed by an IndexSearcher, which returns a TopDocs object whose ScoreDocs array lists the best-matching documents (the hits) and their scores.
Displaying results: Finally, the search results can be displayed to the user. This can involve rendering the hits in a search results page, highlighting the search terms in the document, or displaying relevant metadata about each hit.

Overall, Lucene.NET provides a flexible framework for indexing, searching, and displaying data. The rest of this post follows the workflow above to build a small search service.
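The steps above can be sketched end to end in a few lines. This is a minimal illustration rather than the service built later in the post: the field names "id" and "content", the sample texts, and the use of an in-memory RAMDirectory are assumptions made for the example.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

const LuceneVersion version = LuceneVersion.LUCENE_48;

// In-memory directory so the sketch leaves no files behind.
using var directory = new RAMDirectory();
using var analyzer = new StandardAnalyzer(version);

// Indexing and analysis: the writer runs the analyzer over each TextField.
using var writer = new IndexWriter(directory, new IndexWriterConfig(version, analyzer));
writer.AddDocument(new Document
{
    new StringField("id", "1", Field.Store.YES), // stored as a single term, not tokenized
    new TextField("content", "Lucene.NET search library", Field.Store.YES)
});
writer.AddDocument(new Document
{
    new StringField("id", "2", Field.Store.YES),
    new TextField("content", "Entity Framework Core tutorial", Field.Store.YES)
});
writer.Commit();

// Querying: parse the user's text with the same analyzer used for indexing.
using var indexReader = DirectoryReader.Open(directory);
var searcher = new IndexSearcher(indexReader);
var parser = new QueryParser(version, "content", analyzer);
Query query = parser.Parse("search");

// Searching: TopDocs holds the best hits, ordered by score.
TopDocs topDocs = searcher.Search(query, 10);

// Displaying results: load the stored fields of each hit.
foreach (var hit in topDocs.ScoreDocs)
{
    Document doc = searcher.Doc(hit.Doc);
    Console.WriteLine($"{doc.Get("id")}: {doc.Get("content")}");
}
```

Using the same analyzer for indexing and for query parsing matters: the query terms only match if they are normalized the same way as the indexed terms.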

Installation

To install Lucene.NET, add the following package references inside an <ItemGroup> in your project's .csproj file. These are prerelease (beta) packages:



<PackageReference Include="Lucene.Net" Version="4.8.0-beta00016" />
<PackageReference Include="Lucene.Net.Analysis.Common" Version="4.8.0-beta00016" />
<PackageReference Include="Lucene.Net.QueryParser" Version="4.8.0-beta00016" />

Once you have added these package references, you can use Lucene.NET in your project.

Usage

Suppose you want to create a search application for finding employee names at a computer training center. The application needs to support fuzzy search, which allows users to find names that are similar to their search terms. In this context, the employees are all Vietnamese, so the names contain diacritics that users may not type.

Create analyzer for Vietnamese

Here is an example of an analyzer for Vietnamese text implemented with Lucene.NET:



using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

namespace Portal.Shared.Infrastructure.Search.Lucene.Internals;

public class VietnameseAnalyzer : global::Lucene.Net.Analysis.Analyzer
{
    private readonly LuceneVersion _version;

    public VietnameseAnalyzer(LuceneVersion version) => _version = version;

    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        // Split the text into word tokens.
        var tokenizer = new StandardTokenizer(_version, reader);
        TokenStream filter = new StandardFilter(_version, tokenizer);
        // Lowercase every token so that matching is case-insensitive.
        filter = new LowerCaseFilter(_version, filter);
        // STOP_WORDS_SET is the English stop-word list; pass a custom
        // CharArraySet here to remove Vietnamese stop words instead.
        filter = new StopFilter(_version, filter, StandardAnalyzer.STOP_WORDS_SET);
        // Fold diacritics to their ASCII equivalents, e.g. "nhân" becomes "nhan".
        filter = new ASCIIFoldingFilter(filter);
        return new TokenStreamComponents(tokenizer, filter);
    }
}

In this example, the VietnameseAnalyzer extends the Lucene.Net Analyzer class and overrides the CreateComponents method to create a customized token stream. The token stream starts with a StandardTokenizer to break up the text into tokens, and then applies a series of filters to process the tokens. These filters include:

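StandardFilter: normalizes the tokens produced by StandardTokenizer. With LuceneVersion.LUCENE_48 it passes tokens through unchanged; punctuation is already discarded by the tokenizer.
LowerCaseFilter: converts all tokens to lowercase for case-insensitive search.
StopFilter: removes stop words. The code above passes StandardAnalyzer.STOP_WORDS_SET, which is the English list ("a", "the", "of", and so on); to remove common Vietnamese stop words such as "và", "là", and "của", supply your own set.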
ASCIIFoldingFilter: converts Vietnamese diacritics to their closest ASCII equivalents, such as "đ" to "d" and "ô" to "o".

By creating a customized analyzer for Vietnamese text, you can improve the accuracy of fuzzy search in your application.
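To see what such an analyzer actually produces, you can print its tokens. The sketch below rebuilds a similar filter chain inline with Analyzer.NewAnonymous so that it stands alone, and it swaps in a small Vietnamese stop-word set; the three stop words, the field name "content", and the sample sentence are assumptions made for the example, not a complete Vietnamese configuration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

const LuceneVersion version = LuceneVersion.LUCENE_48;

// Hypothetical, deliberately tiny stop-word list (ignoreCase: true).
var stopWords = new CharArraySet(version, new[] { "và", "là", "của" }, true);

using Analyzer analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, textReader) =>
{
    var tokenizer = new StandardTokenizer(version, textReader);
    TokenStream stream = new LowerCaseFilter(version, tokenizer);
    // Stop words are removed before folding, while they still carry diacritics.
    stream = new StopFilter(version, stream, stopWords);
    stream = new ASCIIFoldingFilter(stream);
    return new TokenStreamComponents(tokenizer, stream);
});

// Collect the terms the analyzer emits for one sample sentence.
var tokens = new List<string>();
using (TokenStream ts = analyzer.GetTokenStream("content",
           new StringReader("Nguyễn Xuân Nhân và Trần Thị Mai")))
{
    var term = ts.AddAttribute<ICharTermAttribute>();
    ts.Reset();
    while (ts.IncrementToken())
        tokens.Add(term.ToString());
    ts.End();
}

Console.WriteLine(string.Join(" | ", tokens));
```

With precomposed (NFC) input, this should print the six name tokens in lowercase without diacritics, and "và" should be missing from the output.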

Define the interface

Here is the interface of the search service that wraps Lucene.NET:



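using System.Collections.Generic;
using Lucene.Net.Documents;

namespace Portal.Shared.Infrastructure.Search.Lucene;

public interface ILuceneService<T> where T : class
{
    // Returns true if the item is already in the index.
    bool IsExistIndex(T item);
    // Builds one single-field Document per property value, grouped by property name.
    Dictionary<string, List<Document>> GetData(List<T> data);
    // Returns up to maxResults documents that match the query.
    IEnumerable<Document> Search(string query, int maxResults);
    // Removes every document from the index.
    void ClearAll();
    // Applies the action named by options (Create, Update, or Delete) to the data.
    void Index(List<T> data, string options);
}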

The service is generic over the type T of the items being indexed. Its Index method takes a list of T and a string; the string names the action to perform, and the service adds, updates, or deletes the items accordingly. The remaining methods convert items to documents, check whether an item is indexed, search the index, and clear it.

Define the action

Here is the action type used by the service. It is written as a smart enum (using the Ardalis.SmartEnum package), but you could also define it as a plain class or an enum.



using Ardalis.SmartEnum;
namespace Portal.Shared.Infrastructure.Search.Lucene;
public sealed class LuceneAction : SmartEnum<LuceneAction>
{
    public static readonly LuceneAction Create = new(nameof(Create), 1);
    public static readonly LuceneAction Update = new(nameof(Update), 2);
    public static readonly LuceneAction Delete = new(nameof(Delete), 3);
    private LuceneAction(string name, int value) : base(name, value)
    {
    }
}

In the above example, we have three actions: Create, Update, and Delete. Each action has a name and an integer value. The name is the string that callers pass as the options argument of Index, which compares it against nameof(LuceneAction.Create) and so on. The value is an integer that identifies the action in code.

Implement methods in interface

Here is an example of the implementation of the ILuceneService interface:



using System;
using System.Collections.Generic;
using System.Linq;
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

namespace Portal.Shared.Infrastructure.Search.Lucene.Internals;

public class LuceneService<T> : ILuceneService<T> where T : class
{
    private readonly IndexWriter _indexWriter;
    private readonly Analyzer _analyzer;
    private readonly QueryParser _queryParser;

    public LuceneService()
    {
        // The index is stored in an "Index" folder under the working directory.
        var directory = FSDirectory.Open("Index");
        _analyzer = new VietnameseAnalyzer(LuceneVersion.LUCENE_48);
        var indexConfig = new IndexWriterConfig(LuceneVersion.LUCENE_48, _analyzer);
        _indexWriter = new IndexWriter(directory, indexConfig);
        _queryParser = new QueryParser(LuceneVersion.LUCENE_48, "content", _analyzer);
    }

    public Dictionary<string, List<Document>> GetData(List<T> data)
    {
        var propertyIndex = new Dictionary<string, List<Document>>();
        // Walk every item and read each public property of T by reflection.
        foreach (var item in data)
        {
            foreach (var property in typeof(T).GetProperties())
            {
                if (!propertyIndex.ContainsKey(property.Name))
                    propertyIndex.Add(property.Name, new List<Document>());
                var value = property.GetValue(item);
                if (value is null) continue;
                var document = new Document
                    { new StringField(property.Name, value.ToString(), Field.Store.YES) };
                propertyIndex[property.Name].Add(document);
            }
        }
        return propertyIndex;
    }

    public void Index(List<T> data, string options)
    {
        // Re-wrap each stored value in a TextField so that it is analyzed.
        // doc.Get reads the stored value; doc.ToString() would index debug text.
        var docs = GetData(data).SelectMany(item
                => item.Value.Select(doc
                    => new TextField(item.Key, doc.Get(item.Key), Field.Store.YES)))
            .Select(field => new Document { field })
            .ToList();
        switch (options)
        {
            case nameof(LuceneAction.Create):
                _indexWriter.AddDocuments(docs);
                break;
            case nameof(LuceneAction.Update):
                // The term ("id", "1") is a placeholder; a real implementation
                // would derive the id from the data being updated.
                _indexWriter.UpdateDocuments(new Term("id", "1"), docs);
                break;
            case nameof(LuceneAction.Delete):
                _indexWriter.DeleteDocuments(new Term("id", "1"));
                break;
            default:
                throw new ArgumentOutOfRangeException(nameof(options), options, null);
        }
        // Flush writes buffered changes to the directory; call Commit() as well
        // if the changes must survive a restart.
        _indexWriter.Flush(triggerMerge: false, applyAllDeletes: false);
    }

    public bool IsExistIndex(T item)
    {
        // Open a near-real-time reader so that recent changes are visible.
        using var reader = _indexWriter.GetReader(applyAllDeletes: true);
        var searcher = new IndexSearcher(reader);
        var parser = new QueryParser(LuceneVersion.LUCENE_48, "id", _analyzer);
        // Relies on T.ToString() returning the value indexed in the "id" field.
        var query = parser.Parse(item.ToString());
        var hits = searcher.Search(query, 1).ScoreDocs;
        return hits.Length > 0;
    }

    public IEnumerable<Document> Search(string query, int maxResults)
    {
        // A reader opened once in the constructor would never see later changes,
        // so open a near-real-time reader for each search.
        using var reader = _indexWriter.GetReader(applyAllDeletes: true);
        var searcher = new IndexSearcher(reader);
        // FuzzyQuery terms are not analyzed, so lowercase the input by hand.
        // Only the "content" field is searched, so the indexed items need a
        // field with exactly that name.
        var fuzzyQuery = new FuzzyQuery(new Term("content", query.ToLowerInvariant()), 2);
        var parsedQuery = _queryParser.Parse(query);
        var booleanQuery = new BooleanQuery
        {
            { parsedQuery, Occur.SHOULD },
            { fuzzyQuery, Occur.SHOULD }
        };
        var hits = searcher.Search(booleanQuery, maxResults).ScoreDocs;
        foreach (var hit in hits)
            yield return searcher.Doc(hit.Doc);
    }

    public void ClearAll() => _indexWriter.DeleteAll();
}

The class implements every method of the interface:

GetData: Converts the list of T to a dictionary of string to List<Document>. Each key is the name of a property of T, and each value is a list of Document objects. Each Document contains a single Field holding the property name and the property value.
Index: Indexes the list of T based on the options, which name one of the actions defined in the action class: Create, Update, or Delete. Create adds the items to the index, Update replaces indexed documents with the items, and Delete removes documents from the index. In this example, Update and Delete target a hard-coded term (id = 1).
IsExistIndex: Checks whether an item is in the index by parsing item.ToString() as a query against the id field.
Search: Searches the content field and returns at most maxResults documents. The query string is matched both as a parsed query and as a fuzzy query with an edit distance of up to 2, and a document is returned if either one matches.
ClearAll: Deletes every document from the index.

Finally, we have to register the service in the Program.cs file:



builder.Services.AddSingleton(typeof(ILuceneService<>), typeof(LuceneService<>));

Result

For example, I want to search for my name, "Nguyễn Xuân Nhân", but I don't want to type the Vietnamese diacritics, so I type "Nhan" instead. The search still finds the entry.
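A self-contained sketch of this scenario is shown below. It does not use the service from this post; it indexes two names in memory with a folding analyzer and runs a FuzzyQuery directly. The second name, the field name "content", the Find helper, and the edit distance of 1 are assumptions made for the example.

```csharp
using System;
using System.Linq;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

const LuceneVersion version = LuceneVersion.LUCENE_48;

// Lowercase and fold diacritics at index time, as the analyzer in this post does.
using Analyzer analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, textReader) =>
{
    var tokenizer = new StandardTokenizer(version, textReader);
    TokenStream stream = new LowerCaseFilter(version, tokenizer);
    stream = new ASCIIFoldingFilter(stream);
    return new TokenStreamComponents(tokenizer, stream);
});

using var directory = new RAMDirectory();
using var writer = new IndexWriter(directory, new IndexWriterConfig(version, analyzer));
foreach (var name in new[] { "Nguyễn Xuân Nhân", "Trần Thị Mai" }) // second name is made up
    writer.AddDocument(new Document { new TextField("content", name, Field.Store.YES) });
writer.Commit();

using var indexReader = DirectoryReader.Open(directory);
var searcher = new IndexSearcher(indexReader);

// Hypothetical helper: returns the stored names whose terms are within one edit of the input.
string[] Find(string input)
{
    // FuzzyQuery does not run the analyzer, so lowercase the input by hand.
    var query = new FuzzyQuery(new Term("content", input.ToLowerInvariant()), 1);
    return searcher.Search(query, 10).ScoreDocs
        .Select(hit => searcher.Doc(hit.Doc).Get("content"))
        .ToArray();
}

Console.WriteLine(string.Join(", ", Find("Nhan")));  // typed without diacritics
Console.WriteLine(string.Join(", ", Find("Nhann"))); // typed with one extra letter
```

Both lookups are expected to return only the first name: the indexed term is "nhan" after folding, so "Nhan" matches it exactly and "Nhann" is one edit away. A larger edit distance would find more misspellings but would also start matching unrelated short names.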

Conclusion

Overall, Lucene.NET is a powerful and flexible search engine library that offers a range of advanced search techniques and community support. While it may have a steep learning curve and be resource-intensive, Lucene.NET is a strong choice for developers who need to implement search functionality in their applications.

DEV Community
