Hello Devs,
The team at Swirl has created this guide, which contains the relevant information for anyone who wants to extend Swirl by adding SearchProviders, Connectors, and Processors.
This makes it easy for you to contribute to Swirl and get started with open source. We're also participating in Hacktoberfest and giving out swag worth up to $100 to contributors; please check the blog for more information.
Learn more about Swirl by checking out this article below.
Creating an 👩‍💻 Open Source Search Platform: Search Engines with AI - Swirl
Saurabh Rai for SWIRL ・ Sep 11 '23
Give Swirl a 🌟 on GitHub.
Prerequisites
- Python 3.11 or later, installed locally
% python
Python 3.11.1 (v3.11.1:a7a450f84a, Dec 6 2022, 15:24:06) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
- Redis installed and running
- Swirl installed locally (not in Docker) and running
% python swirl.py status
__S_W_I_R_L__2_._6__________________________________________________________
Service: redis...RUNNING, pid:31012
Service: django...RUNNING, pid:31014
Service: celery-worker...RUNNING, pid:31018
PID TTY TIME CMD
31012 ttys000 0:20.11 redis-server *:6379
31014 ttys000 0:12.04 /Library/Frameworks/Python.framework/Versions/3.11/Resources/Python.app/Contents/MacOS/Python /Library/Frameworks/Python.framework/Versions/3.11/bin/daphne -b 0.0.0.0 -p 8000 swirl_server.asgi:application
Background: Understanding the Swirl Search Workflow
In a nutshell:
- User creates a query - example: http://localhost:8000/swirl/search/?q=ai
- Pre-query processing - example: SpellcheckQueryProcessor
- Each SearchProvider then executes the following in parallel:
  - Query processing - example: AdaptiveQueryProcessor
  - Connector - example: RequestsGet
  - Result processing - example: MappingResultProcessor
- Post-result processing - example: CosineRelevancyPostResultProcessor
- Ranked results available via mixer - example: http://localhost:8000/swirl/results/?search_id=1
For more information, consult the Developer Guide Workflow Overview.
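To make the order of the stages concrete, here is a toy model of the workflow. It is only a sketch: every function name is made up for illustration, the connector is stubbed so nothing touches the network, and a thread pool stands in for whatever Swirl actually uses to run SearchProviders in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def pre_query(query: str) -> str:
    """Pre-query processing: runs once, before any provider is called."""
    return query.strip()


def run_provider(name: str, query: str) -> list[dict]:
    """One provider: query processing, then connector, then result processing."""
    provider_query = query.lower()                          # query processing
    raw_hits = [f"{name} hit for {provider_query}"]         # connector (stubbed)
    return [{"title": hit, "provider": name} for hit in raw_hits]  # result processing


def post_result(result_sets: list[list[dict]]) -> list[dict]:
    """Post-result processing: sees the results of all providers together."""
    merged = [item for results in result_sets for item in results]
    return sorted(merged, key=lambda item: item["title"])


def search(query: str, providers: list[str]) -> list[dict]:
    query = pre_query(query)
    # Each provider runs independently of the others.
    with ThreadPoolExecutor() as pool:
        result_sets = list(pool.map(lambda p: run_provider(p, query), providers))
    return post_result(result_sets)


ranked = search("  AI ", ["web", "news"])
print(ranked)
```

The point of the sketch is the shape: one shared step before the fan-out, three steps inside each provider, and one shared step after all providers have returned.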
Creating a SearchProvider
A SearchProvider is a configuration of a Connector. So, to connect to a given source, first, verify that it supports a Connector you already have. (See the next tutorial for information on creating new Connectors.)
For example, to query a website using a URL like https://host.com/?q=my+query+here that returns JSON or XML, create a new SearchProvider configuring the RequestsGet connector as follows:
- Copy any of the Google PSE SearchProviders.
- Modify the url and query_template to construct the query URL. Using the above example:
{
    "url": "https://host.com/",
    "query_template": "{url}?q={query_string}"
}
To learn more about query and URL parameters, refer to the Developer Guide.
- If the website offers the ability to page through results, or to sort results by date (as well as relevancy), use the PAGE= and DATE_SORT query mappings to add support for these features through Swirl.
For more information, refer to the Query Mappings section of the User Guide.
- Open the query URL in a browser and look through the JSON response.
If using Visual Studio Code, paste the JSON into a file, right-click on it and select Format Document to make it easier to read.
- Identify the results list and the number of results found and retrieved. Put these JSON paths in the response_mappings. Then identify the JSON paths to use to extract the Swirl default fields title, body, url, date_published and author from each item in the result list, and put them in the result_mappings, with the Swirl field name on the left and the source JSON path on the right.
For example:
"response_mappings": "FOUND=searchInformation.totalResults,RETRIEVED=queries.request[0].count,RESULTS=items",
"result_mappings": "url=link,body=snippet,author=displayLink,cacheId,pagemap.metatags[*].['og:type'],pagemap.metatags[*].['og:site_name'],pagemap.metatags[*].['og:description'],NO_PAYLOAD",
- Add credentials as required for the service.
The format to use depends on the type of credential. Details are here: User Guide Credentials Section
- Add a suitable tag that can be used to describe the source or what it knows about.
Spaces are not permitted; good tags are clear and obvious when used in a query, like company:tesla or news:openai.
For more about tags, see: Organizing SearchProviders
- Review the finished SearchProvider:
{
    "name": "My New SearchProvider",
    "connector": "RequestsGet",
    "url": "https://host.com/",
    "query_template": "{url}?q={query_string}",
    "query_processors": [
        "AdaptiveQueryProcessor"
    ],
    "query_mappings": "",
    "result_processors": [
        "MappingResultProcessor",
        "CosineRelevancyResultProcessor"
    ],
    "response_mappings": "FOUND=jsonpath.to.number.found,RETRIEVED=jsonpath.to.number.retrieved,RESULTS=jsonpath.to.result.list",
    "result_mappings": "url=link,body=snippet,author=displayLink,NO_PAYLOAD",
    "credentials": "bearer=your-bearer-token-here",
    "tags": [
        "MyTag"
    ]
}
- Go to Swirl at localhost:8000/swirl/searchproviders/, logging in if necessary. Put the form at the bottom of the page into RAW mode, and paste the SearchProvider in. Then hit POST. The SearchProvider will reload.
- Go to Galaxy at localhost:8000/galaxy/ and run a search using the tag you created earlier. Results should appear in roughly the same amount of time as before.
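If it helps to see what the url, query_template, response_mappings and result_mappings settings are doing, the sketch below imitates them in plain Python. It is not Swirl's implementation: the helper names are invented, the path resolver only understands simple dotted paths with an optional [n] index, and the sample response is made up to match the shape of the mappings shown earlier.

```python
import re
from urllib.parse import quote_plus


def build_query_url(url: str, query_template: str, query_string: str) -> str:
    """Fill the {url} and {query_string} placeholders of a query_template."""
    return query_template.format(url=url, query_string=quote_plus(query_string))


def parse_mappings(mappings: str) -> dict:
    """Split 'A=x,B=y' into {'A': 'x', 'B': 'y'}; entries without '=' are skipped."""
    return dict(entry.split("=", 1) for entry in mappings.split(",") if "=" in entry)


def get_path(data, path: str):
    """Resolve a simple dotted path such as 'queries.request[0].count'."""
    current = data
    for part in path.split("."):
        match = re.fullmatch(r"(\w+)(?:\[(\d+)\])?", part)
        if match is None:
            raise ValueError(f"unsupported path segment: {part!r}")
        key, index = match.groups()
        current = current[key]
        if index is not None:
            current = current[int(index)]
    return current


# Made-up response, shaped like the response_mappings example.
sample_response = {
    "searchInformation": {"totalResults": "2"},
    "queries": {"request": [{"count": 2}]},
    "items": [
        {"title": "A", "link": "https://example.com/a", "snippet": "First", "displayLink": "example.com"},
        {"title": "B", "link": "https://example.com/b", "snippet": "Second", "displayLink": "example.com"},
    ],
}

query_url = build_query_url("https://host.com/", "{url}?q={query_string}", "my query here")

response_mappings = parse_mappings(
    "FOUND=searchInformation.totalResults,RETRIEVED=queries.request[0].count,RESULTS=items"
)
result_mappings = parse_mappings("url=link,body=snippet,author=displayLink,NO_PAYLOAD")

found = get_path(sample_response, response_mappings["FOUND"])
retrieved = get_path(sample_response, response_mappings["RETRIEVED"])
# Swirl field name on the left, source path on the right.
normalized = [
    {field: get_path(item, source) for field, source in result_mappings.items()}
    for item in get_path(sample_response, response_mappings["RESULTS"])
]

print(query_url)
print(found, retrieved, normalized[0])
```

Working through a saved response this way, before pasting the SearchProvider into Swirl, is a quick check that the JSON paths point where you think they do.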
Creating a Connector
In Swirl, Connectors are responsible for loading a SearchProvider, then constructing and transmitting queries to a particular type of service, then saving the response - typically a result list.
Note: Consider using your favorite coding AI to generate a Connector by passing it the Connector base classes and information about the API you are trying to query.
Note: If you are trying to send an HTTP/S request to an endpoint that returns JSON or XML, you don't need to create a Connector. Instead, create a SearchProvider that configures the RequestsGet connector included with Swirl.
To create a new Connector:
- Create a new file, e.g. swirl/connectors/my_connector.py
- Copy the style of the ChatGPT connector as a starting point, or the BigQuery connector if targeting a database.
class MyConnector(Connector):

    def __init__(self, provider_id, search_id, update, request_id=''):
        self.system_guide = MODEL_DEFAULT_SYSTEM_GUIDE
        super().__init__(provider_id, search_id, update, request_id)

In the __init__ method, load and persist anything that will be needed when connecting and querying the service. Use the ChatGPT Connector as a guide.
- Import the python package(s) to connect to the service. The ChatGPT connector uses the openai package, for example:
import openai
- Modify the execute_search method to connect to the service.
As you can see from the ChatGPT Connector, it first loads the OpenAI credentials, then constructs a prompt, sends the prompt via openai.ChatCompletion.create() (the interface of openai package versions before 1.0), then stores the response.
def execute_search(self, session=None):
    logger.debug(f"{self}: execute_search()")

    if self.provider.credentials:
        openai.api_key = self.provider.credentials
    else:
        if getattr(settings, 'OPENAI_API_KEY', None):
            openai.api_key = settings.OPENAI_API_KEY
        else:
            self.status = "ERR_NO_CREDENTIALS"
            return

    prompted_query = ""
    if self.query_to_provider.endswith('?'):
        prompted_query = self.query_to_provider
    else:
        if 'PROMPT' in self.query_mappings:
            prompted_query = self.query_mappings['PROMPT'].format(query_to_provider=self.query_to_provider)
        else:
            prompted_query = self.query_to_provider
            self.warning(f'PROMPT not found in query_mappings!')

    if 'CHAT_QUERY_REWRITE_GUIDE' in self.query_mappings:
        self.system_guide = self.query_mappings['CHAT_QUERY_REWRITE_GUIDE'].format(query_to_provider=self.query_to_provider)

    if not prompted_query:
        self.found = 0
        self.retrieved = 0
        self.response = []
        self.status = "ERR_PROMPT_FAILED"
        return

    logger.info(f'CGPT completion system guide:{self.system_guide} query to provider : {self.query_to_provider}')
    self.query_to_provider = prompted_query
    completions = openai.ChatCompletion.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": self.system_guide},
            {"role": "user", "content": self.query_to_provider},
        ],
        temperature=0,
    )
    message = completions['choices'][0]['message']['content'] # FROM API Doc

    self.found = 1
    self.retrieved = 1
    self.response = message.replace("\n\n", "")
    return
ChatGPT depends on the OpenAI API key, which is provided to Swirl via the .env file. To follow this pattern, create new values in .env, then modify swirl_server/settings.py to load them as Django settings, and set a reasonable default.
- Modify the normalize_response() method to store the raw response. This means writing the result objects out as a Python list and storing that in self.results:
def normalize_response(self):
    logger.debug(f"{self}: normalize_response()")

    self.results = [
        {
            'title': self.query_string_to_provider,
            'body': f'{self.response}',
            'author': 'CHATGPT',
            'date_published': str(datetime.now())
        }
    ]
    return

There's no need to do this if self.response is already a Python list.
- Add the new Connector to
swirl/connectors/__init__.py
from swirl.connectors.my_connector import MyConnector
- Restart Swirl
% python swirl.py restart core
- Create a SearchProvider to configure the new Connector, then add it to the Swirl installation as noted in the Create a SearchProvider tutorial.
Don't forget a useful tag so you can easily target the new connector when ready to test.
To learn more about developing Connectors, refer to the Developer Guide.
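The steps above can be sketched without Swirl installed. Everything in the block below is a stand-in: StubConnector is a hypothetical replacement for Swirl's Connector base class, MY_SERVICE_API_KEY is an invented setting name, the "READY" status string is an assumption, and the service call is faked. What it does mirror from the ChatGPT example is the credential fallback (provider credentials first, then a global setting, otherwise an error status) and the shape of normalize_response().

```python
import os
from datetime import datetime

# Stand-in for a Django setting loaded from .env; the variable name is hypothetical.
MY_SERVICE_API_KEY = os.environ.get("MY_SERVICE_API_KEY", "")


class StubConnector:
    """Hypothetical stand-in for Swirl's Connector base class (not the real one)."""

    def __init__(self, credentials: str, query_to_provider: str):
        self.credentials = credentials
        self.query_to_provider = query_to_provider
        self.status = "INIT"
        self.found = 0
        self.retrieved = 0
        self.response = None
        self.results = []


class EchoConnector(StubConnector):
    """Toy connector: 'queries' a fake service that echoes the query back."""

    def execute_search(self):
        # Prefer per-provider credentials, then the global setting, else fail.
        api_key = self.credentials or MY_SERVICE_API_KEY
        if not api_key:
            self.status = "ERR_NO_CREDENTIALS"
            return
        # A real connector would call the service here; this one fakes a reply.
        self.response = f"echo: {self.query_to_provider}"
        self.found = 1
        self.retrieved = 1
        self.status = "READY"  # assumed status value, for illustration only

    def normalize_response(self):
        # Wrap the raw response in a list of result dicts.
        self.results = [
            {
                "title": self.query_to_provider,
                "body": self.response,
                "author": "ECHO",
                "date_published": str(datetime.now()),
            }
        ]


connector = EchoConnector(credentials="demo-key", query_to_provider="ai")
connector.execute_search()
connector.normalize_response()
print(connector.status, connector.results[0]["body"])
```

In a real Connector the base class supplies the provider, the query and the status handling; only the two methods need to be written.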
Creating a QueryProcessor
A QueryProcessor is a stage executed either during Pre-Query or Query Processing. The difference between these is that Pre-Query processing is applied to all SearchProviders, while Query Processing is executed by each individual SearchProvider. In both cases, the goal is to modify the query sent to some or all SearchProviders.
Note: if you just want to rewrite the query using lookup tables or regular expressions, consider using QueryTransformations instead.
To create a new QueryProcessor:
- Create a new file, e.g. swirl/processors/my_query_processor.py
- Copy the GenericQueryProcessor class as a starting point, and rename it:
class MyQueryProcessor(QueryProcessor):

    type = 'MyQueryProcessor'

    def process(self):
        # TO DO: modify self.query_string, and return it
        return self.query_string + ' modified'

Save the module.
- Add the new module to
swirl/processors/__init__.py
from swirl.processors.my_query_processor import MyQueryProcessor
- Add the new module to the Search.pre_query_processing pipeline or at least one SearchProvider.query_processing pipeline:
SearchProvider:
    "query_processors": [
        "AdaptiveQueryProcessor",
        "MyQueryProcessor"
    ],
Search:
{
    "query_string": "news:ai",
    "pre_query_processors": [
        "MyQueryProcessor"
    ]
}
- Restart Swirl
% python swirl.py restart core
- Run a search, in Galaxy or directly via the search URL: http://localhost:8000/swirl/search/?q=some+query
If you added the QueryProcessor to a SearchProvider, be sure to target that SearchProvider. For example, if you added it to the query_processing pipeline of a SearchProvider with the tag "news", the query would be http://localhost:8000/swirl/search/?q=news:some+query instead.
Results should appear in just a few seconds. In the messages block, a message should appear indicating that the new QueryProcessor rewrote the query:
MyQueryProcessor rewrote Strategy Consulting - Google PSE's query to: <modified-query>
To learn more about writing Processors, refer to the Developer Guide.
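As a slightly less trivial example than appending ' modified', here is a sketch of a processor that removes repeated terms from the query. StubQueryProcessor is a hypothetical stand-in so the block runs on its own; in Swirl you would subclass the real QueryProcessor instead, and its constructor may take different arguments.

```python
class StubQueryProcessor:
    """Hypothetical stand-in for Swirl's QueryProcessor base class."""

    type = "StubQueryProcessor"

    def __init__(self, query_string: str):
        self.query_string = query_string

    def process(self) -> str:
        return self.query_string


class DedupeTermsQueryProcessor(StubQueryProcessor):
    """Drops repeated terms (case-insensitive), keeping the first occurrence."""

    type = "DedupeTermsQueryProcessor"

    def process(self) -> str:
        seen = set()
        kept = []
        for term in self.query_string.split():
            key = term.lower()
            if key not in seen:
                seen.add(key)
                kept.append(term)
        # Joining with single spaces also collapses extra whitespace.
        return " ".join(kept)


rewritten = DedupeTermsQueryProcessor("open  source Open search").process()
print(rewritten)
```

The contract is the same as in the template: read self.query_string and return the query that should be sent on.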
Creating a ResultProcessor
A ResultProcessor is a stage executed by each SearchProvider, after the Connector has retrieved results. ResultProcessors operate on results and transform them as needed for downstream consumption or presentation.
The GenericResultProcessor and MappingResultProcessor stages are intended to normalize JSON results. GenericResultProcessor searches for exact matches to the Swirl schema (as noted in the SearchProvider example) and copies them over. MappingResultProcessor applies result_mappings to normalize the results, again as shown in the SearchProvider example above. In general adding stages after these is a good idea, unless the SearchProvider is expected to respond in a Swirl schema format.
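The difference between the two normalizing stages can be pictured with a toy model. This is an illustration of the idea only, not Swirl's code: the function names are invented, the field list is the set of Swirl default fields named in the SearchProvider section, and the mapping here handles flat keys rather than full JSON paths.

```python
# Swirl default fields, as listed in the SearchProvider section.
SWIRL_DEFAULT_FIELDS = ("title", "body", "url", "date_published", "author")


def copy_exact_matches(raw_item: dict) -> dict:
    """'Generic' idea: keep only keys that already match the Swirl schema."""
    return {field: raw_item[field] for field in SWIRL_DEFAULT_FIELDS if field in raw_item}


def apply_result_mappings(raw_item: dict, mappings: dict) -> dict:
    """'Mapping' idea: exact matches first, then Swirl field <- source key."""
    mapped = copy_exact_matches(raw_item)
    for swirl_field, source_key in mappings.items():
        if source_key in raw_item:
            mapped[swirl_field] = raw_item[source_key]
    return mapped


raw = {"title": "Hello", "link": "https://example.com", "snippet": "Some text", "score": 0.9}

generic = copy_exact_matches(raw)
mapped = apply_result_mappings(raw, {"url": "link", "body": "snippet"})
print(generic)
print(mapped)
```

A processor added after these stages can rely on the default field names being present, which is why later stages are usually placed behind them.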
To create a new ResultProcessor:
- Create a new file, e.g. swirl/processors/my_result_processor.py
- Copy the GenericResultProcessor class as a starting point, and rename it. Don't forget the __init__ method.
class MyResultProcessor(ResultProcessor):

    def __init__(self, results, provider, query_string, request_id='', **kwargs):
        super().__init__(results, provider, query_string, request_id=request_id, **kwargs)

- Implement the process() method. This is the only one required.
process() operates on self.results, which will contain all the results from a given SearchProvider, in Python list format. Modify items in the result list, and report the number updated.
def process(self):
    if not self.results:
        return

    updated = 0
    for item in self.results:
        # TO DO: operate on each item and count number updated
        item['my_field1'] = 'test'
        updated = updated + 1

    # note: there is no need to save in this type of Processor

    # save modified self.results
    self.processed_results = self.results
    # save number of updated
    self.modified = updated
    return self.modified
Save the module.
- Add the new module to
swirl/processors/__init__.py
from swirl.processors.my_result_processor import MyResultProcessor
- Add the new module to at least one SearchProvider.result_processing pipeline:
    "result_processors": [
        "MappingResultProcessor",
        "MyResultProcessor",
        "CosineRelevancyResultProcessor"
    ],
...etc...
- Restart Swirl
% python swirl.py restart core
- Run a search, in Galaxy or directly via the search URL: http://localhost:8000/swirl/search/?q=some+query
Be sure to target at least one SearchProvider that has the new ResultProcessor. For example, if you added the ResultProcessor to the result_processing pipeline of a SearchProvider with the tag "news", the query would need to be http://localhost:8000/swirl/search/?q=news:some+query instead of the above.
Results should appear in just a few seconds. In the messages block, a message should appear indicating that the new ResultProcessor updated a number of results, and the content should be modified as expected.
MyResultProcessor updated 5 results from: MyConnector
To learn more about writing Processors, refer to the Developer Guide.
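Here is a small, self-contained sketch of the same pattern doing something concrete: adding a word count to every result that has a body. StubResultProcessor is a hypothetical stand-in for Swirl's ResultProcessor base class, and body_word_count is an invented field name.

```python
class StubResultProcessor:
    """Hypothetical stand-in for Swirl's ResultProcessor base class."""

    def __init__(self, results, provider=None, query_string=""):
        self.results = results
        self.provider = provider
        self.query_string = query_string
        self.processed_results = None
        self.modified = 0


class WordCountResultProcessor(StubResultProcessor):
    """Adds 'body_word_count' to each result that has a non-empty body."""

    def process(self):
        if not self.results:
            return 0
        updated = 0
        for item in self.results:
            body = item.get("body", "")
            if body:
                item["body_word_count"] = len(body.split())
                updated += 1
        # Hand the (modified) list on and report how many items changed.
        self.processed_results = self.results
        self.modified = updated
        return self.modified


results = [
    {"title": "A", "body": "one two three"},
    {"title": "B", "body": ""},
]
processor = WordCountResultProcessor(results)
modified = processor.process()
print(modified, processor.processed_results)
```

Only items that were actually changed are counted, so the number reported in the messages block stays meaningful.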
Creating a PostResultProcessor
A PostResultProcessor is a stage executed after all SearchProviders have returned results. They operate on all the results for a given query.
To create a new PostResultProcessor:
- Create a new file, e.g. swirl/processors/my_post_result_processor.py
- Copy the template below as a starting point, and rename it:
class MyPostResultProcessor(PostResultProcessor):

    type = 'MyPostResultProcessor'

    ############################################

    def __init__(self, search_id, request_id = ''):
        return super().__init__(search_id, request_id=request_id)

    ############################################

    def process(self):
        updated = 0
        for results in self.results:
            if not results.json_results:
                continue
            for item in results.json_results:
                # TO DO: operate on each result item
                item['my_field2'] = "test"
                updated = updated + 1
            # end for
            # call results.save() if any results were modified
            if updated > 0:
                results.save()
        # end for

        ############################################

        self.results_updated = updated
        return self.results_updated

Modify the process() method, operating on the items and saving each result set as shown.
- Add the new module to
swirl/processors/__init__.py
from swirl.processors.my_post_result_processor import MyPostResultProcessor
- Add the new module to the Search.post_result_processing pipeline:
{
    "query_string": "news:ai",
    "post_result_processors": [
        "DedupeByFieldPostResultProcessor",
        "CosineRelevancyPostResultProcessor",
        "MyPostResultProcessor"
    ],
    ...etc...
}
- Restart Swirl
% python swirl.py restart core
- Run a search, in Galaxy or directly via the search URL: http://localhost:8000/swirl/search/?q=some+query
Be sure the Search uses a post_result_processing pipeline that includes the new PostResultProcessor. To target only SearchProviders with a given tag, such as "news", the query would need to be http://localhost:8000/swirl/search/?q=news:some+query instead of the above.
Results should appear in just a few seconds. In the messages block, a message should appear indicating that the new PostResultProcessor updated a number of results, and the content should be modified as expected.
MyPostResultProcessor updated 10 results from: MySearchProvider
To learn more about writing Processors, refer to the Developer Guide.
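Because a PostResultProcessor sees every result set for a search, it can do things a per-provider stage cannot. The sketch below flags URLs that were already returned by an earlier result set. It runs on its own with hypothetical stand-ins: StubResultSet imitates a stored result object with json_results and save(), the processor takes the result sets directly rather than a search_id, and the duplicate field name is invented.

```python
class StubResultSet:
    """Hypothetical stand-in for a stored Swirl result object."""

    def __init__(self, provider: str, json_results: list):
        self.provider = provider
        self.json_results = json_results
        self.saved = False

    def save(self):
        # The real object would persist itself; this one just records the call.
        self.saved = True


class DuplicateUrlPostResultProcessor:
    """Toy post-result stage: flags URLs already seen in an earlier result set."""

    def __init__(self, result_sets: list):
        self.results = result_sets
        self.results_updated = 0

    def process(self) -> int:
        seen_urls = set()
        total_updated = 0
        for results in self.results:
            if not results.json_results:
                continue
            updated_here = 0
            for item in results.json_results:
                url = item.get("url")
                if url in seen_urls:
                    item["duplicate"] = True
                    updated_here += 1
                elif url:
                    seen_urls.add(url)
            # Save only the result sets that actually changed.
            if updated_here > 0:
                results.save()
            total_updated += updated_here
        self.results_updated = total_updated
        return self.results_updated


web = StubResultSet("web", [{"url": "https://example.com/a"}])
news = StubResultSet("news", [{"url": "https://example.com/a"}, {"url": "https://example.com/b"}])
updated = DuplicateUrlPostResultProcessor([web, news]).process()
print(updated, web.saved, news.saved)
```

One design choice to note: the sketch counts changes per result set before deciding whether to call save(), so a set that was not modified is not written again.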
Join the Community
Email: support@swirl.today with issues, requests, questions, etc - we'd love to hear from you!
Give Swirl a 🌟 on GitHub.
swirlai / swirl-search