If you are not already familiar with vector databases, they are specialized databases designed to efficiently store and query vector data. Each item is stored as a high-dimensional vector (an embedding), and each dimension of that vector captures some feature of the item.
For this article, I will be using JSON data describing individuals; you can assume it is the employee data of some company. The data you are working with may be different, but a similar process should apply, especially if you are using Pinecone.
Concept.
When working with data files such as texts and PDFs that contain free-flowing information (for example, an article about baking cookies), the go-to strategy is to split the file into smaller chunks and embed them before storing them in a database.
Our data (employee data) is discrete instead: employee A has their own attributes, employee B has their own attributes, and so on.
This is where vector databases differ from traditional databases: how the data will be used, and by whom, matters. We could chunk and embed our data, but it isn't really necessary. Pinecone lets us attach metadata to each vector when inserting it, which makes querying much easier.
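To make the chunking idea concrete, here is a minimal sketch of splitting text into overlapping character windows. The function name `split_into_chunks` and its `chunk_size` and `overlap` parameters are my own illustrative choices, not part of LangChain or Pinecone; real projects usually use a library text splitter that respects sentence or token boundaries.

```python
def split_into_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Made-up sample text standing in for a long article.
article = "Cream the butter and sugar. " * 20
chunks = split_into_chunks(article, chunk_size=100, overlap=20)
print(len(chunks), "chunks; first chunk starts with:", chunks[0][:27])
```

Each chunk would then be embedded and stored separately. The overlap keeps a sentence that straddles a boundary from being lost to both neighbours.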
Upserting data to Pinecone.
For an easier understanding, here is the link to the documentation. pinecone
The data I'm using is in this GitHub repo as a .json file. GitHub link
All the code can be found there as well.
If you are using a notebook, you can easily install all the dependencies:
!pip install langchain
!pip install openai
!pip install pinecone-client
!pip install jq
!pip install tiktoken
Importing them
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings
import pinecone
from langchain.document_loaders import JSONLoader
Your API keys and Pinecone environment go between the quotes:
openai_api_key = ''
PINECONE_API_KEY = ''
PINECONE_API_ENV = ''
Loading the JSON data
import json
from pathlib import Path
from pprint import pprint
file_path='/kaggle/input/json-dataset-of-people/Customer data.json'
data = json.loads(Path(file_path).read_text())
To perform the operation below, you need a Pinecone account, which allows you to create an index. There is a waitlist for the free plan, but approval mostly takes only about a day. For this project, you'll need to set the index metric to cosine similarity, a vector metric that you can learn more about here: Cosine Similarity. The other setting is the number of dimensions; since we are using OpenAI embeddings, it is set to 1536.
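For intuition about the metric, here is a small pure-Python sketch of cosine similarity. It is only an illustration of the formula, not how Pinecone computes scores internally, and the helper name `cosine_similarity` is my own.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Same direction, perpendicular, and opposite directions.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

The score depends only on direction, not on length, which is why a vector and a scaled copy of it count as maximally similar.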
Initializing Pinecone
pinecone.init(
api_key=PINECONE_API_KEY,
environment=PINECONE_API_ENV
)
index_name = "metadata-insert" # You use the index name you created in the Pinecone console.
Once you have confirmed that the data has loaded, you can use Pinecone's Python client to insert it into the index you created. The expected format is a list of (id, vector, metadata) tuples, whose data types are (string, list, dictionary), as shown below.
index.upsert([
("A", [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1], {"genre": "comedy", "year": 2020}),
("B", [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2], {"genre": "documentary", "year": 2019})
])
There are many ways to structure your data to fit the format above. For this project, the employee's name serves as the ID (even though our data also has an id field), the list is the embedding vector of that name, and the dictionary holds the remaining fields as key-value pairs (for example, "Occupation": "Engineer").
The entire restructuring and packaging step is done and explained in the same GitHub repo.
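As a rough illustration of that restructuring (this is not the repo's code), the sketch below turns plain records into (id, vector, metadata) tuples. Everything here is an assumption made for the example: the `name` key, the two sample records, and `toy_embed`, a hypothetical stand-in that derives a fake 8-dimensional vector from a hash so the snippet runs without an API key.

```python
import hashlib


def toy_embed(text, dim=8):
    """Hypothetical stand-in for a real embedding model (NOT semantically meaningful)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def to_upsert_tuple(record, id_field="name", embed=toy_embed):
    """Build one (id, vector, metadata) tuple from a flat record."""
    vector_id = str(record[id_field])  # the name becomes the vector ID
    vector = embed(vector_id)          # the vector is the embedding of the name
    metadata = {k: v for k, v in record.items() if k != id_field}  # everything else
    return (vector_id, vector, metadata)


# Invented sample records; the "name" key is assumed, not taken from the dataset.
records = [
    {"name": "Employee A", "id": 1, "Occupation": "Engineer", "Location": "City X"},
    {"name": "Employee B", "id": 2, "Occupation": "Analyst", "Location": "City Y"},
]
vectors = [to_upsert_tuple(r) for r in records]
print(vectors[0][0], vectors[0][2])
```

In the real workflow you would presumably pass a function such as `embeddings.embed_query` instead of `toy_embed` and hand the resulting list to `index.upsert(...)`, with the index dimension matching the embedding size.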
Instantiating the index.
index = pinecone.Index("metadata-insert")
Querying the data using metadata
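The text is our prompt:
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)  # assumed: the original snippet omits this definition
text = "Return anyone with id given by the metadata"
query_vector = embeddings.embed_query(text)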
Checking for the metadata we can use in our queries
print(data_dict[0].keys())
Output
dict_keys(['id', 'email', 'gender', 'ip_address', 'Location', 'Occupation', 'Ethnicity'])
Running the Query
The function index.query() takes in:
- The vector of your prompt, in our case the variable query_vector.
- A metadata filter. You can use any of the metadata keys above, but to be very specific we can use "id": only one employee has a given id, so the result is easy to confirm.
- A top_k value, the number of results you want returned. In our case it should return only one result; if it were set to 2, 3, and so on, it would return that many of the matching results with the closest cosine similarity to your query vector.
- The include_metadata parameter, which when set to True returns all the metadata that was stored with the entry, as below:
result= index.query(
vector=query_vector,
filter={
"id": 5
},
top_k=1,
include_metadata=True
)
Output
{'matches': [{'id': 'Beverie Frandsen',
'metadata': {'Ethnicity': 'Yakama',
'Location': 'Longwei',
'Occupation': 'Developer III',
'email': 'bfrandsen4@cargocollective.com',
'gender': 'Female',
'id': 5.0,
'ip_address': '235.124.253.241'},
'score': 0.680275083,
'values': []}],
'namespace': ''}
You can confirm from your original data if this is accurate.
Depending on your use case, there are many other techniques for querying with metadata; I will add them to the repo later.
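As one example of such techniques, Pinecone's filter language documents comparison and logical operators such as `$eq`, `$in`, `$gte`, `$and`, and `$or`. The sketch below is a small local emulation of a few of them, written only to show what the filter dictionaries mean. The `matches` function is my own, it simplifies edge cases (a missing key never matches), and the second record is invented for contrast.

```python
import operator

# Comparison operators emulated locally; the names follow Pinecone's filter syntax.
_OPS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
    "$in": lambda value, allowed: value in allowed,
    "$nin": lambda value, banned: value not in banned,
}


def matches(metadata, flt):
    """Return True if `metadata` satisfies the filter dict `flt` (simplified emulation)."""
    for key, condition in flt.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            if key not in metadata:
                return False  # simplification: a missing key never matches
            if not isinstance(condition, dict):
                condition = {"$eq": condition}  # {"id": 5} is shorthand for equality
            if not all(_OPS[op](metadata[key], target) for op, target in condition.items()):
                return False
    return True


# First record mirrors the query output above; the second is made up.
beverie = {"id": 5.0, "gender": "Female", "Occupation": "Developer III", "Location": "Longwei"}
other = {"id": 6.0, "gender": "Male", "Occupation": "Engineer", "Location": "Elsewhere"}

by_id = {"id": 5}
developers_or_engineers = {"Occupation": {"$in": ["Developer III", "Engineer"]}}
female_developers = {"$and": [{"gender": {"$eq": "Female"}},
                              {"Occupation": {"$in": ["Developer III"]}}]}

print([matches(m, female_developers) for m in (beverie, other)])
```

A dictionary like `female_developers` should be usable directly as the `filter=` argument of `index.query()`, though it is worth checking the behaviour against Pinecone's documentation for your client version.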
Enjoy Learning!
Top comments (2)
Hey @peterabel
Thanks for the article. It certainly helps to get started storing JSON in a vector database. My question is, how good can data analysis be using vector embeddings? For example, if I query something like "How many Americans are there in Croatia", that is a statistical data analysis question where pandas would produce an accurate result, unlike a vector database, which is designed to return a list of records limited by the k value.
That's right.