Abstract
It is now straightforward to replicate JSON data from MongoDB to SingleStore Kai using the new Change Data Capture (CDC) solution. This article describes two examples of how to do this. In the first example, we'll replicate a small retail dataset. In the second example, we'll replicate a small book dataset and create OpenAI embeddings for books. Finally, we'll build a simple Streamlit application to provide a GUI for a book recommender system.
The datasets, notebook files, SQL and Streamlit code are available on GitHub.
Introduction
The new CDC solution from SingleStore makes it very easy to replicate data from MongoDB to SingleStore Kai. We'll work through two different examples in this article to see how.
MongoDB Atlas
We'll use MongoDB Atlas in an M0 Sandbox. We'll configure an admin user with atlasAdmin privileges under Database Access. We'll temporarily allow access from anywhere (IP Address 0.0.0.0/0) under Network Access. We'll note down the username, password and host.
SingleStore Kai
A previous article showed the steps to create a free SingleStoreDB Cloud account. We'll use the following settings:
- Workspace Group Name: CDC Demo Group
- Cloud Provider: AWS
- Region: US East 1 (N. Virginia)
- Workspace Name: cdc-demo
- Size: S-00
- Settings: SingleStore Kai selected
Once the workspace is available, we'll make a note of our password and host. The host will be available from cdc-demo > Connect > MongoDB Client. We'll need this information later for Streamlit. We'll also temporarily allow access from anywhere by configuring the firewall under CDC Demo Group > Firewall.
1. Retail Sales Example
MongoDB Atlas
We'll create a new database called new_transactions
with three collections called custs
, items
and txs
. We'll load data into these collections. The data load can be accomplished in several ways, such as using MongoDB Compass or mongoimport
. The datasets can be found on GitHub.
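As another option, the collections can be loaded with a short pymongo script. The sketch below is illustrative only: the file names, and the assumption that each file holds a JSON array of documents, are not taken from the repository, and the connection string uses the same placeholders as the rest of this article.

```python
import json


def load_collection(db, name, path):
    # Read a JSON array of documents from `path` and insert them into
    # the collection `name`; returns the number of documents loaded.
    with open(path) as f:
        docs = json.load(f)
    if docs:
        db[name].insert_many(docs)
    return len(docs)


def load_retail_data():
    # pymongo is imported here so the helper above can be reused
    # (and tested) without the driver installed.
    import pymongo
    client = pymongo.MongoClient("mongodb+srv://<username>:<password>@<host>/")
    db = client["new_transactions"]
    for name in ("custs", "items", "txs"):
        print(name, load_collection(db, name, f"{name}.json"))

# To run against a live Atlas cluster, call load_retail_data()
# with real credentials substituted for the placeholders.
```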
SingleStore Kai
From the left navigation pane in SingleStoreDB Cloud, we'll select DEVELOP > Data Studio > Open SQL Editor to create a `new_transactions` database and link, as follows:
```sql
CREATE DATABASE IF NOT EXISTS new_transactions;
USE new_transactions;

DROP LINK new_transactions.link;

CREATE LINK new_transactions.link AS MONGODB
CONFIG '{"mongodb.hosts": "<primary>:27017, <secondary>:27017, <secondary>:27017",
"collection.include.list": "new_transactions.*",
"mongodb.ssl.enabled": "true",
"mongodb.authsource": "admin",
"mongodb.members.auto.discover": "false"}'
CREDENTIALS '{"mongodb.user": "<username>",
"mongodb.password": "<password>"}';

CREATE TABLES AS INFER PIPELINE AS LOAD DATA LINK new_transactions.link '*' FORMAT AVRO;
```
We'll replace `<username>` and `<password>` with the values that we saved earlier from MongoDB Atlas. We'll also need to replace `<primary>`, `<secondary>` and `<secondary>` with the full address of each from MongoDB Atlas.
Next, we'll start the pipelines to replicate the data, as follows:
```sql
START ALL PIPELINES;
```
Shortly afterwards, we'll check the tables and confirm that the data have been copied across, as follows:
```sql
SELECT COUNT(*) FROM custs;
SELECT COUNT(*) FROM items;
SELECT COUNT(*) FROM txs;
```
Load the Notebook
We'll use the notebook file available on GitHub. From the left navigation pane in SingleStoreDB Cloud, we'll select DEVELOP > Data Studio. In the top right of the web page is a New Notebook pulldown with two options:
- New Notebook
- Import From File
We'll select the second option, locate the notebook file we downloaded from GitHub and load it into SingleStoreDB Cloud.
Run the Notebook
Figure 1 shows the overall flow of this application.
In Figure 1,
- Step 1: Copy data from Atlas to SingleStoreDB was previously completed using the SingleStore CDC solution.
- Step 2: Run Queries against Atlas and SingleStore allows us to run various analytical queries against both systems.
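Since SingleStore Kai speaks the MongoDB wire protocol, the same pymongo aggregation can be pointed at either system. The sketch below illustrates the idea; the pipeline and the field names (`items`, `item_id`, `qty`) are assumptions for illustration, not the notebook's exact queries.

```python
def top_items_pipeline(limit=5):
    # Hypothetical aggregation: best-selling items by total quantity.
    return [
        {"$unwind": "$items"},
        {"$group": {"_id": "$items.item_id", "qty": {"$sum": "$items.qty"}}},
        {"$sort": {"qty": -1}},
        {"$limit": limit},
    ]


def top_items(db, limit=5):
    return list(db["txs"].aggregate(top_items_pipeline(limit)))


def compare_systems():
    # pymongo is imported here so the helpers above stay driver-free.
    import pymongo
    atlas = pymongo.MongoClient("mongodb+srv://<username>:<password>@<host>/")
    kai = pymongo.MongoClient(
        "mongodb://<username>:<password>@<host>:27017/"
        "?authMechanism=PLAIN&tls=true&loadBalanced=true")
    # The same pipeline runs unchanged on both systems.
    return top_items(atlas["new_transactions"]), top_items(kai["new_transactions"])
```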
We'll enter the MongoDB Atlas details in the notebook, as follows:
```python
myclientmongodb = pymongo.MongoClient("mongodb+srv://<username>:<password>@<host>/?retryWrites=true&w=majority")
```
We'll replace `<username>`, `<password>` and `<host>` with the values that we saved earlier from MongoDB Atlas.
We'll now run through the Retail Sales notebook to generate and view the analytics and visualisations.
2. Bookstore Example
MongoDB Atlas
We'll create a new database called `bookstore` with a collection called `books`. We'll load data into this collection. The data load can be accomplished in several ways, such as using MongoDB Compass or `mongoimport`. The dataset can be found on GitHub.
SingleStore Kai
From the left navigation pane in SingleStoreDB Cloud, we'll select DEVELOP > Data Studio > Open SQL Editor to create a `bookstore` database and link, as follows:
```sql
CREATE DATABASE IF NOT EXISTS bookstore;
USE bookstore;

DROP LINK bookstore.link;

CREATE LINK bookstore.link AS MONGODB
CONFIG '{"mongodb.hosts": "<primary>:27017, <secondary>:27017, <secondary>:27017",
"collection.include.list": "bookstore.*",
"mongodb.ssl.enabled": "true",
"mongodb.authsource": "admin",
"mongodb.members.auto.discover": "false"}'
CREDENTIALS '{"mongodb.user": "<username>",
"mongodb.password": "<password>"}';

CREATE TABLES AS INFER PIPELINE AS LOAD DATA LINK bookstore.link '*' FORMAT AVRO;
```
We'll replace `<username>` and `<password>` with the values that we saved earlier from MongoDB Atlas. We'll also need to replace `<primary>`, `<secondary>` and `<secondary>` with the full address of each from MongoDB Atlas.
Next, we'll start the pipeline to replicate the data, as follows:
```sql
START ALL PIPELINES;
```
Shortly afterwards, we'll check the table and confirm that the data have been copied across, as follows:
```sql
SELECT COUNT(*) FROM books;
```
Load the Notebook
We'll use the notebook file available on GitHub. From the left navigation pane in SingleStoreDB Cloud, we'll select DEVELOP > Data Studio. In the top right of the web page is a New Notebook pulldown with two options:
- New Notebook
- Import From File
We'll select the second option, locate the notebook file we downloaded from GitHub and load it into SingleStoreDB Cloud.
Run the Notebook
Figure 2 shows the overall flow of this application.
In Figure 2,
- The Book Data were previously loaded using the SingleStore CDC solution.
- The notebook generates data embeddings for books using OpenAI. The embeddings are stored in a new SingleStore Kai collection.
- The notebook obtains query embeddings using OpenAI and performs `$dotProduct` when we query the books. SingleStore Kai also supports `$euclideanDistance`, as discussed in a previous article.
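In pymongo terms, the query step can be sketched roughly as follows. The embedding model, the collection name (`book_embeddings`) and the field names are assumptions for illustration, not the notebook's exact code.

```python
def dot_product_pipeline(query_embedding, limit=5):
    # Rank books by the dot product of the stored embedding against
    # the query embedding, highest similarity first.
    return [
        {"$addFields": {
            "similarity": {"$dotProduct": ["$embedding", query_embedding]}}},
        {"$sort": {"similarity": -1}},
        {"$limit": limit},
        {"$project": {"title": 1, "similarity": 1, "_id": 0}},
    ]


def recommend(db, query_text, limit=5):
    # openai is imported here so the pipeline builder above stays
    # dependency-free; requires OPENAI_API_KEY in the environment.
    from openai import OpenAI
    emb = OpenAI().embeddings.create(
        model="text-embedding-ada-002",
        input=[query_text]).data[0].embedding
    return list(db["book_embeddings"].aggregate(dot_product_pipeline(emb, limit)))
```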
We'll now run through the Bookstore notebook to generate book recommendations. The query results will be output as JSON data.
Bonus: Book Recommender
For the Bookstore example, we'll build a small Streamlit application that provides a more interactive experience and outputs the results in a better format, rather than as JSON data. The Streamlit code can be found on GitHub.
We'll enter the SingleStore Kai MongoDB Client details in the Streamlit application, as follows:
```python
def init_connection():
    return pymongo.MongoClient("mongodb://<username>:<password>@<host>:27017/?authMechanism=PLAIN&tls=true&loadBalanced=true")
```
We'll replace `<username>`, `<password>` and `<host>` with the values that we saved earlier from SingleStoreDB Cloud.
We'll need to provide an OpenAI API Key in our environment. For example:

```bash
export OPENAI_API_KEY="<OpenAI API Key>"
```

Replace `<OpenAI API Key>` with your key.
Next, we'll run the Streamlit application, as follows:
```bash
streamlit run streamlit_app.py
```
The application presents an input box on the left side where the query can be typed, and the results are shown on the right side, formatted as a table. An example query and its output are shown in Figure 3.
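The overall shape of such an application can be sketched as below. This is a minimal sketch, not the repository's code: the field names (`title`, `similarity`) and the input prompt are assumptions, and the search step is passed in as any callable that maps a query string to result documents.

```python
def format_results(docs):
    # Flatten query result documents into rows suitable for st.dataframe,
    # giving a tabular display instead of raw JSON.
    return [{"Title": d.get("title", ""),
             "Similarity": round(d.get("similarity", 0.0), 4)}
            for d in docs]


def run_app(search):
    # `search` is any callable mapping a query string to a list of result
    # documents, e.g. an embedding-based lookup against SingleStore Kai.
    # streamlit is imported here so format_results stays testable without it.
    import streamlit as st
    left, right = st.columns(2)
    query = left.text_input("Describe the kind of book you'd like")
    if query:
        right.dataframe(format_results(search(query)))
```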
Summary
This article presented two examples of how easily we can use the SingleStore CDC solution to connect to MongoDB Atlas and replicate the data to SingleStore Kai. We also created embeddings from JSON book data and stored these embeddings in SingleStore Kai. Finally, we built a simple application to query the database system to make book recommendations.