Saurabh Rai

for SWIRL

Posted on Oct 11, 2023

Your full guide to contributing to SWIRL 🌌

#python #opensource #ai #beginners

Hello Devs,
The team at Swirl has created this amazing guide which contains all the relevant information for anyone who wants to extend Swirl by adding SearchProviders, Connectors, and Processors.

This makes it easy for you to contribute to Swirl and to get started with open source. We're also participating in Hacktoberfest and giving out swag to contributors. Swag is worth up to $100; please check the blog here for more information.

Learn more about Swirl by checking out this article below.

Creating an 👩‍💻 Open Source Search Platform: Search Engines with AI - Swirl 🌌

Saurabh Rai for SWIRL ・ Sep 11 '23

#opensource #ai #python #github

Give Swirl a 🌟 on GitHub.

Full guide to contributing to Swirl

Prerequisites

Python 3.11 or later, installed locally

% python
Python 3.11.1 (v3.11.1:a7a450f84a, Dec  6 2022, 15:24:06) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

Redis installed and running
Swirl installed locally (not in Docker) and running

% python swirl.py status
__S_W_I_R_L__2_._6__________________________________________________________

Service: redis...RUNNING, pid:31012
Service: django...RUNNING, pid:31014
Service: celery-worker...RUNNING, pid:31018

  PID TTY           TIME CMD
31012 ttys000    0:20.11 redis-server *:6379 
31014 ttys000    0:12.04 /Library/Frameworks/Python.framework/Versions/3.11/Resources/Python.app/Contents/MacOS/Python /Library/Frameworks/Python.framework/Versions/3.11/bin/daphne -b 0.0.0.0 -p 8000 swirl_server.asgi:application

Background: Understanding the Swirl Search Workflow

In a nutshell:

User creates a query - example: http://localhost:8000/swirl/search/?q=ai
Pre-query processing - example: SpellcheckQueryProcessor

🕐 Each Search Provider executes in parallel

Query processing - example: AdaptiveQueryProcessor
Connector - example: RequestsGet
Result processing - example: MappingResultProcessor

🕐 End parallel processing

Post-result processing - example: CosineRelevancyPostResultProcessor
Ranked results available via mixer - example: http://localhost:8000/swirl/results/?search_id=1

For more information, consult the Developer Guide Workflow Overview.
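To make the order of the stages concrete, here is a minimal sketch of the same flow in plain Python. Every function name in it is a stand-in made up for illustration, the connector is faked so nothing touches the network, and none of it is Swirl's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def pre_query(query):
    # Stand-in for pre-query processing: runs once, before any provider is called
    return " ".join(query.split())

def run_provider(name, query):
    # The three per-provider stages, run for each provider independently
    provider_query = query.lower()                               # query processing
    raw = [f"{name} hit for {provider_query}"]                   # connector (faked)
    return [{"title": text, "provider": name} for text in raw]   # result processing

def post_result(result_sets):
    # Stand-in for post-result processing: sees the results of all providers at once
    merged = [item for results in result_sets for item in results]
    return sorted(merged, key=lambda item: item["title"])

def search(query, providers):
    query = pre_query(query)
    with ThreadPoolExecutor() as pool:
        # Each provider executes in parallel
        result_sets = list(pool.map(lambda p: run_provider(p, query), providers))
    return post_result(result_sets)

results = search("  AI   search ", ["web", "news"])
for item in results:
    print(item["provider"], "->", item["title"])
```

The point of the sketch is only the shape: one stage before the fan-out, three stages inside it, and one stage after all providers have finished.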

Creating a SearchProvider

A SearchProvider is a configuration of a Connector. So, to connect to a given source, first, verify that it supports a Connector you already have. (See the next tutorial for information on creating new Connectors.)

For example, if trying to query a website using a URL like https://host.com/?q=my+query+here that returns JSON or XML, create a new SearchProvider configuring the RequestsGet connector as follows:

Copy any of the Google PSE SearchProviders

Modify the url and query_template to construct the query URL. Using the above example:

{
        "url": "https://host.com/",
        "query_template": "{url}?q={query_string}"
}

To learn more about query and URL parameters, refer to the Developer Guide.
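As an illustration of how the template and the url combine, the sketch below fills in the placeholders with Python's str.format. The helper name build_query_url is hypothetical, and the use of quote_plus for encoding is an assumption; Swirl's own substitution logic may differ in detail:

```python
from urllib.parse import quote_plus

def build_query_url(url, query_template, query_string):
    # URL-encode the user's query so spaces and punctuation are safe in a URL,
    # then substitute both placeholders into the template
    return query_template.format(url=url, query_string=quote_plus(query_string))

query_url = build_query_url("https://host.com/", "{url}?q={query_string}", "my query here")
print(query_url)
```

With the values from the example above, this produces the same https://host.com/?q=my+query+here URL shown earlier.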

If the website offers the ability to page through results, or sort results by date (as well as relevancy), use the PAGE= and DATE_SORT query mappings to add support for these features through Swirl.

For more information, refer to the Query Mappings section of the User Guide.

Open the query URL in a browser and look through the JSON response.

If using Visual Studio Code, right-click on the pasted JSON and select Format Document to make it easier to read.

Identify the results list and the number of results found and retrieved, and put these JSON paths in the response_mappings. Then, in the result_mappings, identify the JSON paths used to extract the Swirl default fields title, body, url, date_published and author from each item in the result list, with the Swirl field name on the left and the source JSON path on the right.
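To show what "name on the left, JSON path on the right" means in practice, here is a simplified sketch that applies such mappings to a small, made-up response. The helpers get_path and parse_mappings are hypothetical and only understand plain dotted keys and [n] indexes; entries without an equals sign are skipped, and wildcard paths are not supported. Swirl's real mapping code is more capable than this:

```python
import re

def get_path(data, path):
    # Walk a simple dotted path such as "queries.request[0].count"
    current = data
    for part in path.split("."):
        match = re.fullmatch(r"([^\[\]]+)((?:\[\d+\])*)", part)
        if match is None:
            raise ValueError(f"unsupported path segment: {part}")
        current = current[match.group(1)]
        for index in re.findall(r"\[(\d+)\]", match.group(2)):
            current = current[int(index)]
    return current

def parse_mappings(mappings):
    # "FOUND=a.b,RESULTS=items" -> {"FOUND": "a.b", "RESULTS": "items"}
    pairs = (entry.split("=", 1) for entry in mappings.split(",") if "=" in entry)
    return {left.strip(): right.strip() for left, right in pairs}

# Made-up response, shaped loosely like a search API reply
response = {
    "searchInformation": {"totalResults": "2"},
    "queries": {"request": [{"count": 2}]},
    "items": [
        {"link": "https://example.com/a", "snippet": "First", "displayLink": "example.com"},
        {"link": "https://example.com/b", "snippet": "Second", "displayLink": "example.com"},
    ],
}

response_mappings = parse_mappings(
    "FOUND=searchInformation.totalResults,RETRIEVED=queries.request[0].count,RESULTS=items"
)
result_mappings = parse_mappings("url=link,body=snippet,author=displayLink")

found = int(get_path(response, response_mappings["FOUND"]))
retrieved = int(get_path(response, response_mappings["RETRIEVED"]))
results = [
    {field: get_path(item, source) for field, source in result_mappings.items()}
    for item in get_path(response, response_mappings["RESULTS"])
]
```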

For example:


        "response_mappings": "FOUND=searchInformation.totalResults,RETRIEVED=queries.request[0].count,RESULTS=items",
        "result_mappings": "url=link,body=snippet,author=displayLink,cacheId,pagemap.metatags[*].['og:type'],pagemap.metatags[*].['og:site_name'],pagemap.metatags[*].['og:description'],NO_PAYLOAD",

Add credentials as required for the service.

The format to use depends on the type of credential. Details are here: User Guide Credentials Section

Add a suitable tag that can be used to describe the source or what it knows about.

Spaces are not permitted; good tags are clear and obvious when used in a query, like company:tesla or news:openai.

For more about tags, see: Organizing SearchProviders
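A quick way to sanity-check a tag before using it is sketched below. Both helpers are hypothetical, and rejecting empty tags is an extra assumption; how Swirl itself parses a tag prefix out of a query is not shown here:

```python
def is_valid_tag(tag):
    # The one hard rule from the guide: no spaces (empty tags are rejected too)
    return bool(tag) and not any(ch.isspace() for ch in tag)

def split_tag(query):
    # "company:tesla" -> ("company", "tesla"); no usable prefix -> (None, query)
    prefix, sep, rest = query.partition(":")
    if sep and is_valid_tag(prefix) and rest:
        return prefix, rest
    return None, query

print(split_tag("company:tesla"))
print(split_tag("electric cars"))
```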

Review the finished SearchProvider:


{
        "name": "My New SearchProvider",
        "connector": "RequestsGet",
        "url": "https://host.com/",
        "query_template": "{url}?q={query_string}",
        "query_processors": [
            "AdaptiveQueryProcessor"
        ],
        "query_mappings": "",
        "result_processors": [
            "MappingResultProcessor",
            "CosineRelevancyResultProcessor"
        ],
        "response_mappings": "FOUND=jsonpath.to.number.found,RETRIEVED=jsonpath.to.number.retrieved,RESULTS=jsonpath.to.result.list",
        "result_mappings": "url=link,body=snippet,author=displayLink,NO_PAYLOAD",
        "credentials": "bearer=your-bearer-token-here",
        "tags": [
            "MyTag"
        ]
}

Go to Swirl at localhost:8000/swirl/searchproviders/, logging in if necessary. Put the form at the bottom of the page into RAW mode, paste the SearchProvider in, then hit POST. The SearchProvider will reload.
Go to Galaxy at localhost:8000/galaxy/ and run a search using the tag you created earlier. Results should appear within a few seconds.

Creating a Connector

In Swirl, Connectors are responsible for loading a SearchProvider, then constructing and transmitting queries to a particular type of service, then saving the response - typically a result list.

Note: Consider using your favorite coding AI to generate a Connector by passing it the Connector base classes, and information about the API you are trying to query.

Note: If you are trying to send an HTTP/S request to an endpoint that returns JSON or XML, you don't need to create a Connector. Instead, create a SearchProvider that configures the RequestsGet connector included with Swirl.

To create a new Connector:

Create a new file, e.g. swirl/connectors/my_connector.py
Copy the style of the ChatGPT connector as a starting point, or the BigQuery connector if targeting a database.


class MyConnector(Connector):

    def __init__(self, provider_id, search_id, update, request_id=''):
        self.system_guide = MODEL_DEFAULT_SYSTEM_GUIDE
        super().__init__(provider_id, search_id, update, request_id)

In the __init__ method, load and persist anything that will be needed when connecting and querying the service. Use the ChatGPT Connector as a guide.

Import the Python package(s) to connect to the service. The ChatGPT connector uses the openai package, for example:

import openai

Modify the execute_search method to connect to the service.

As you can see from the ChatGPT Connector, it first loads the OpenAI credentials, then constructs a prompt, sends the prompt via openai.ChatCompletion.create(), then stores the response.


    def execute_search(self, session=None):

        logger.debug(f"{self}: execute_search()")

        if self.provider.credentials:
            openai.api_key = self.provider.credentials
        else:
            if getattr(settings, 'OPENAI_API_KEY', None):
                openai.api_key = settings.OPENAI_API_KEY
            else:
                self.status = "ERR_NO_CREDENTIALS"
                return

        prompted_query = ""
        if self.query_to_provider.endswith('?'):
            prompted_query = self.query_to_provider
        else:
            if 'PROMPT' in self.query_mappings:
                prompted_query = self.query_mappings['PROMPT'].format(query_to_provider=self.query_to_provider)
            else:
                prompted_query = self.query_to_provider
                self.warning(f'PROMPT not found in query_mappings!')

        if 'CHAT_QUERY_REWRITE_GUIDE' in self.query_mappings:
            self.system_guide = self.query_mappings['CHAT_QUERY_REWRITE_GUIDE'].format(query_to_provider=self.query_to_provider)

        if not prompted_query:
            self.found = 0
            self.retrieved = 0
            self.response = []
            self.status = "ERR_PROMPT_FAILED"
            return
        logger.info(f'CGPT completion system guide:{self.system_guide} query to provider : {self.query_to_provider}')
        self.query_to_provider = prompted_query
        completions = openai.ChatCompletion.create(
            model=MODEL,
            messages=[
                {"role": "system", "content": self.system_guide},
                {"role": "user", "content": self.query_to_provider},
            ],
            temperature=0,
        )
        message = completions['choices'][0]['message']['content'] # FROM API Doc

        self.found = 1
        self.retrieved = 1
        self.response = message.replace("\n\n", "")

        return

ChatGPT depends on the OpenAI API key, which is provided to Swirl via the .env file. To follow this pattern, create new values in .env then modify swirl_server/settings.py to load them as Django settings, and set a reasonable default.
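The "load it as a setting, with a reasonable default" pattern can be sketched with the standard library alone. The setting names and the load_setting helper are hypothetical, and this assumes the values from .env are already present in the process environment; Swirl's settings.py may use a different helper to read them:

```python
import os

def load_setting(name, default, environ=os.environ):
    # Return the value from the environment, or the default when it is not set
    return environ.get(name, default)

# Hypothetical setting names for a new connector; they are not part of Swirl
MY_SERVICE_API_KEY = load_setting("MY_SERVICE_API_KEY", "")
MY_SERVICE_TIMEOUT = int(load_setting("MY_SERVICE_TIMEOUT", "30"))
```

The Connector can then fall back to these values when the SearchProvider carries no credentials, in the same way the ChatGPT connector falls back to settings.OPENAI_API_KEY.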

Modify the normalize_response() method to store the raw response. This is literally no more (or less) than writing the result objects out as a Python list and storing that in self.results:


    def normalize_response(self):

        logger.debug(f"{self}: normalize_response()")

        self.results = [
            {
                'title': self.query_string_to_provider,
                'body': f'{self.response}',
                'author': 'CHATGPT',
                'date_published': str(datetime.now())
            }
        ]

        return

There's no need to do this if self.response is already a Python list.
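The rule above (wrap a single response, pass a list through untouched) can be sketched as a standalone function. The name normalize and its parameters are hypothetical; in a real Connector this logic would live inside normalize_response() and read from self.response:

```python
from datetime import datetime

def normalize(response, title, author):
    # Already a list of result objects: nothing to do
    if isinstance(response, list):
        return response
    # Otherwise wrap the single raw response in a one-item result list
    return [
        {
            'title': title,
            'body': str(response),
            'author': author,
            'date_published': str(datetime.now()),
        }
    ]

wrapped = normalize("Some generated answer", "my query", "MYSERVICE")
passthrough = normalize([{'title': 'already a result'}], "my query", "MYSERVICE")
```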

Add the new Connector to swirl/connectors/__init__.py

from swirl.connectors.my_connector import MyConnector

Restart Swirl

% python swirl.py restart core

Create a SearchProvider to configure the new Connector, then add it to the Swirl installation as noted in the Create a SearchProvider tutorial.

Don't forget a useful tag so you can easily target the new connector when ready to test.

To learn more about developing Connectors, refer to the Developer Guide.

Creating a QueryProcessor

A QueryProcessor is a stage executed either during Pre-Query or Query Processing. The difference between these is that Pre-Query processing is applied to all SearchProviders, and Query Processing is executed by each individual SearchProvider. In both cases, the goal is to modify the query sent to some or all SearchProviders.

Note: if you just want to rewrite the query using lookup tables or regular expressions, consider using QueryTransformations instead.

To create a new QueryProcessor:

Create a new file, e.g. swirl/processors/my_query_processor.py
Copy the GenericQueryProcessor class as a starting point, and rename it:


class MyQueryProcessor(QueryProcessor):

    type = 'MyQueryProcessor'

    def process(self):
        # TO DO: modify self.query_string, and return it 
        return self.query_string + ' modified'

Save the module.

Add the new module to swirl/processors/__init__.py


from swirl.processors.my_query_processor import MyQueryProcessor

Add the new module to the Search.pre_query_processing pipeline or at least one SearchProvider.query_processing pipeline:

SearchProvider:


        "query_processors": [
            "AdaptiveQueryProcessor",
            "MyQueryProcessor"
        ],

Search:


  {
        "query_string": "news:ai",
        "pre_query_processors": [
          "MyQueryProcessor"
        ]
  }

Restart Swirl

% python swirl.py restart core

Go to Galaxy http://localhost:8000/swirl/search/?q=some+query

Run a search; if you added the QueryProcessor to a SearchProvider, be sure to target that SearchProvider. For example, if you added a QueryProcessor to a SearchProvider query_processing pipeline with tag "news", the query would be http://localhost:8000/swirl/search/?q=news:some+query instead.

Results should appear in just a few seconds. A message indicating that the new QueryProcessor rewrote the query should appear in the messages block:

MyQueryProcessor rewrote Strategy Consulting - Google PSE's query to: <modified-query>

To learn more about writing Processors, refer to the Developer Guide.
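Before wiring a new QueryProcessor into Swirl, the rewrite logic in process() can be tried out on its own. The sketch below uses a tiny stand-in base class so it runs without Swirl installed; the stand-in's constructor is an assumption and does not match the real QueryProcessor signature, and the whitespace cleanup is just an example rewrite:

```python
class QueryProcessor:
    # Stand-in for Swirl's base class, for offline experiments only
    def __init__(self, query_string):
        self.query_string = query_string

class MyQueryProcessor(QueryProcessor):

    type = 'MyQueryProcessor'

    def process(self):
        # Collapse repeated whitespace, then drop a trailing question mark
        cleaned = ' '.join(self.query_string.split())
        return cleaned.rstrip('?').rstrip()

rewritten = MyQueryProcessor('  what is   metasearch ? ').process()
print(rewritten)
```

Once the rewrite behaves as expected, the same process() body can be moved into the class derived from the real base class.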

Creating a ResultProcessor

A ResultProcessor is a stage executed by each SearchProvider, after the Connector has retrieved results. ResultProcessors operate on results and transform them as needed for downstream consumption or presentation.

The GenericResultProcessor and MappingResultProcessor stages are intended to normalize JSON results. GenericResultProcessor searches for exact matches to the Swirl schema (as noted in the SearchProvider example) and copies them over. MappingResultProcessor applies result_mappings to normalize the results, again as shown in the SearchProvider example above. In general adding stages after these is a good idea, unless the SearchProvider is expected to respond in a Swirl schema format.

To create a new ResultProcessor:

Create a new file, e.g. swirl/processors/my_result_processor.py
Copy the GenericResultProcessor class as a starting point, and rename it. Don't forget the __init__ method.

class MyResultProcessor(ResultProcessor):

    def __init__(self, results, provider, query_string, request_id='', **kwargs):
        super().__init__(results, provider, query_string, request_id=request_id, **kwargs)

Implement the process() method. This is the only one required.

process() operates on self.results, which will contain all the results from a given SearchProvider, in Python list format. Modify items in the result list, and report the number updated.

    def process(self):

        if not self.results:
            return

        updated = 0
        for item in self.results:
            # TO DO: operate on each item and count number updated
            item['my_field1'] = 'test'
            updated = updated + 1

        # note: there is no need to call save() in this type of Processor;
        # storing the modified list on the instance is enough
        self.processed_results = self.results
        # record the number of items updated
        self.modified = updated

        return self.modified

Save the module.

Add the new module to swirl/processors/__init__.py

from swirl.processors.my_result_processor import MyResultProcessor

Add the new module to at least one SearchProvider.result_processing pipeline:

        "result_processors": [
            "MappingResultProcessor",
            "MyResultProcessor",
            "CosineRelevancyResultProcessor"
        ],
         ...etc...

Restart Swirl

% python swirl.py restart core

Go to Galaxy http://localhost:8000/swirl/search/?q=some+query

Run a search; be sure to target at least one SearchProvider that has the new ResultProcessor.

For example if you added a ResultProcessor to a SearchProvider result_processing pipeline with tag "news", the query would need to be http://localhost:8000/swirl/search/?q=news:some+query instead of the above.

Results should appear in just a few seconds. A message indicating that the new ResultProcessor updated a number of results should appear in the messages block, and the content should be modified as expected.

MyResultProcessor updated 5 results from: MyConnector

To learn more about writing Processors, refer to the Developer Guide.
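The process() logic of a ResultProcessor can also be exercised outside Swirl. In the sketch below, the base class is a stand-in that only stores its arguments (an assumption, not Swirl's real class), and the body truncation is an arbitrary example. Note that it counts only the items it actually changed:

```python
class ResultProcessor:
    # Stand-in for Swirl's base class, for offline experiments only
    def __init__(self, results, provider, query_string, request_id='', **kwargs):
        self.results = results
        self.provider = provider
        self.query_string = query_string
        self.request_id = request_id
        self.processed_results = None
        self.modified = 0

class MyResultProcessor(ResultProcessor):

    def process(self):
        if not self.results:
            return 0

        updated = 0
        for item in self.results:
            body = item.get('body', '')
            # Example transformation: shorten long bodies to 20 characters
            if len(body) > 20:
                item['body'] = body[:20].rstrip() + '...'
                updated += 1

        self.processed_results = self.results
        self.modified = updated
        return self.modified

sample = [{'body': 'short'}, {'body': 'a' * 30}]
count = MyResultProcessor(sample, provider=None, query_string='test').process()
```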

Creating a PostResultProcessor

A PostResultProcessor is a stage executed after all SearchProviders have returned results. PostResultProcessors operate on all the results for a given query.

To create a new PostResultProcessor:

Create a new file, e.g. swirl/processors/my_post_result_processor.py
Copy the template below as a starting point, and rename it:

class MyPostResultProcessor(PostResultProcessor):

    type = 'MyPostResultProcessor'

    ############################################

    def __init__(self, search_id, request_id=''):
        super().__init__(search_id, request_id=request_id)

    ############################################

    def process(self):

        updated = 0

        for results in self.results:
            if not results.json_results:
                continue
            for item in results.json_results:
                # TO DO: operate on each result item
                item['my_field2'] = "test"
                updated = updated + 1
            # end for
            # call results.save() if any results were modified
            if updated > 0:
                results.save()

        # end for
        ############################################

        self.results_updated = updated
        return self.results_updated

Modify the process() method, operating on the items and saving each result set as shown.

Add the new module to swirl/processors/__init__.py

from swirl.processors.my_post_result_processor import MyPostResultProcessor

Add the new module to the Search.post_result_processing pipeline:

  {
        "query_string": "news:ai",
        "post_result_processors": [
            "DedupeByFieldPostResultProcessor",
            "CosineRelevancyPostResultProcessor",
            "MyPostResultProcessor"
        ],
        ...etc...
    }

Restart Swirl

% python swirl.py restart core

Go to Galaxy http://localhost:8000/swirl/search/?q=some+query

Run a search; be sure to target at least one SearchProvider that has the new PostResultProcessor.

For example if you added a PostResultProcessor to a Search post_result_processing pipeline with tag "news", the query would need to be http://localhost:8000/swirl/search/?q=news:some+query instead of the above.

Results should appear in just a few seconds. A message indicating that the new PostResultProcessor updated a number of results should appear in the messages block, and the content should be modified as expected.

MyPostResultProcessor updated 10 results from: MySearchProvider

To learn more about writing Processors, refer to the Developer Guide.
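A PostResultProcessor's loop over result sets can be tested offline as well. Everything around the process() method below is a stand-in: FakeResult only records whether save() was called, and the base class takes the result sets directly instead of a search_id, which is an assumption and differs from the real constructor. This variant keeps a per-set counter so that only result sets that actually changed are saved:

```python
class FakeResult:
    # Stand-in for a stored result set; save() only records that it was called
    def __init__(self, json_results):
        self.json_results = json_results
        self.saved = False

    def save(self):
        self.saved = True

class PostResultProcessor:
    # Stand-in for Swirl's base class, for offline experiments only
    def __init__(self, results):
        self.results = results
        self.results_updated = 0

class MyPostResultProcessor(PostResultProcessor):

    type = 'MyPostResultProcessor'

    def process(self):
        total = 0
        for results in self.results:
            if not results.json_results:
                continue
            updated_in_set = 0
            for item in results.json_results:
                # Example change: add a field only where it is missing
                if 'my_field2' not in item:
                    item['my_field2'] = 'test'
                    updated_in_set += 1
            # Save this result set only if something in it changed
            if updated_in_set > 0:
                results.save()
            total += updated_in_set

        self.results_updated = total
        return self.results_updated

sets = [
    FakeResult([{'title': 'a'}]),
    FakeResult([]),
    FakeResult([{'title': 'b', 'my_field2': 'x'}]),
]
updated = MyPostResultProcessor(sets).process()
```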

Join the Community

Join the Swirl Metasearch Community on Slack!
Email: support@swirl.today with issues, requests, questions, etc - we'd love to hear from you!

Give Swirl a 🌟 on GitHub.

swirlai / swirl-search

SWIRL AI Connect: AI infrastructure software that powers your Search & Retrieval Augmented Generation (RAG) applications. Simplify and enhance your AI pipelines with seamless integration of large language models (LLMs) and data sources.
