Vectors are used to represent quantities that have both magnitude and direction. They can be visualized as arrows in space. Now, let's dive into sparse and dense vectors.
A sparse vector is one that has mostly zero or empty values. In other words, it has very few non-zero elements compared to its total size. Imagine a long list of numbers where most of the entries are zero. For example, consider a vector representing the presence or absence of words in a document. In a large document with a vast vocabulary, only a few words will be present, and the rest will be zeros.
A dense vector is one that contains significant values in a high proportion of its elements. In a dense vector, most of the entries have non-zero values. Dense vectors can be thought of as vectors where every element carries meaningful information. For instance, consider a vector representing the intensity of different colors in an image. Each element of the vector corresponds to a specific color channel, and all the channels have non-zero values.
To summarize:
Sparse vectors have very few non-zero elements compared to their total size, with most of the entries being zero.
Dense vectors have a high proportion of non-zero values, with meaningful information in most of their elements.
Both sparse and dense vectors have their uses in different contexts. Sparse vectors are often utilized in situations where the data being represented has a lot of empty or zero values, such as text data or high-dimensional data where most elements are expected to be zero. On the other hand, dense vectors are commonly employed when there is meaningful information in every element, such as image data or numerical data.
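To make the distinction concrete, here is a minimal sketch of both representations in Python (the numbers are illustrative, not taken from any dataset):

# A dense vector stores every element explicitly; each position carries information
dense_vector = [0.9, 0.1, 0.4, 0.7]

# A sparse vector with nine slots and one non-zero entry can be stored compactly
# as two parallel lists: the positions and the values of the non-zero entries
sparse_vector = {'indices': [4], 'values': [1.0]}
# Equivalent to [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]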
Example with real data.
We will be using Pinecone as our vector database since it allows for both dense and sparse vectors.
Pinecone's Python client accepts records as dictionaries. Each record requires an id and dense values, and can optionally carry metadata and sparse values (a dictionary holding parallel indices and values lists).
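For example, a single record might look like the sketch below (the dense values are placeholder floats, not real embeddings):

record = {
    'id': '0',                                            # string ID
    'values': [0.013, -0.018, 0.028],                     # dense embedding (placeholder floats)
    'metadata': {'Name': 'Breaking Bad', 'Rating': 9.5},  # free-form key-value metadata
    'sparse_values': {'indices': [4], 'values': [1.0]}    # positions and values of non-zero entries
}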
About the Dataset.
The "Top 250 IMDb TV Shows" dataset comprises information on the highest-rated television shows according to IMDb ratings. This dataset contains 250 unique TV shows that have garnered critical acclaim and popularity among viewers. Each TV show is associated with essential details, including its name, release year, number of episodes, show type, IMDb rating, image source link, and a brief description. (source: IMDB Top 250 TV Shows | Kaggle.
Process.
- Simple EDA to identify the field types to use.
- Process the IDs.
- Process the metadata.
- Get the dense vectors.
- Get the sparse vectors.
- Combine them into a single list.
- Discussion and Conclusion.
1. Simple EDA to identify the field types to use.
Dependencies
!pip install openai
!pip install tiktoken
!pip install langchain
!pip install pinecone-client
import numpy as np
import pandas as pd
from langchain.embeddings.openai import OpenAIEmbeddings
import pinecone
My embedding engine.
openai_api_key = ''
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
Reading the data
data = pd.read_csv("/kaggle/input/imdb-top-250-tv-shows/IMDB.csv")
data.head(10)
Output
Name Year Episodes Type Rating Image-src Description Name-href
0 1. Breaking Bad 2008–2013 62 eps TV-MA 9.5 https://m.media-amazon.com/images/M/MV5BYmQ4YW... A chemistry teacher diagnosed with inoperable ... https://www.imdb.com/title/tt0903747/?ref_=cht...
1 2. Planet Earth II 2016 6 eps TV-G 9.5 https://m.media-amazon.com/images/M/MV5BMGZmYm... David Attenborough returns with a new wildlife... https://www.imdb.com/title/tt5491994/?ref_=cht...
2 3. Planet Earth 2006 11 eps TV-PG 9.4 https://m.media-amazon.com/images/M/MV5BMzMyYj... A documentary series on the wildlife found on ... https://www.imdb.com/title/tt0795176/?ref_=cht...
3 4. Band of Brothers 2001 10 eps TV-MA 9.4 https://m.media-amazon.com/images/M/MV5BMTI3OD... The story of Easy Company of the U.S. Army 101... https://www.imdb.com/title/tt0185906/?ref_=cht...
4 5. Chernobyl 2019 5 eps TV-MA 9.4 https://m.media-amazon.com/images/M/MV5BNTdkN2... In April 1986, an explosion at the Chernobyl n... https://www.imdb.com/title/tt7366338/?ref_=cht...
5 6. The Wire 2002–2008 60 eps TV-MA 9.3 https://m.media-amazon.com/images/M/MV5BNTllYz... The Baltimore drug scene, as seen through the ... https://www.imdb.com/title/tt0306414/?ref_=cht...
6 7. Avatar: The Last Airbender 2005–2008 62 eps TV-Y7-FV 9.3 https://m.media-amazon.com/images/M/MV5BODc5YT... In a war-torn world of elemental magic, a youn... https://www.imdb.com/title/tt0417299/?ref_=cht...
7 8. Blue Planet II 2017 7 eps TV-G 9.3 https://m.media-amazon.com/images/M/MV5BNDZiND... David Attenborough returns to the world's ocea... https://www.imdb.com/title/tt6769208/?ref_=cht...
8 9. The Sopranos 1999–2007 86 eps TV-MA 9.2 https://m.media-amazon.com/images/M/MV5BZGJjYz... New Jersey mob boss Tony Soprano deals with pe... https://www.imdb.com/title/tt0141842/?ref_=cht...
9 10. Cosmos: A Spacetime Odyssey 2014 13 eps TV-PG 9.3 https://m.media-amazon.com/images/M/MV5BZTk5OT... An exploration of our discovery of the laws of... https://www.imdb.com/title/tt2395695/?ref_=cht...
data.columns
Output
Index(['Name', 'Year', 'Episodes', 'Type', 'Rating', 'Image-src',
'Description', 'Name-href'],
dtype='object')
checking length.
len(data)
Output
250
dropping empty records.
data = data.dropna(subset=['Name', 'Year', 'Episodes', 'Type', 'Rating', 'Image-src',
'Description', 'Name-href'])
checking length.
len(data)
Output
245
data["Description"][0]
Output
"A chemistry teacher diagnosed with inoperable lung cancer turns to manufacturing and selling methamphetamine with a former student in order to secure his family's future."
We are going to use this Description column for our dense vectors.
2. Process the IDs.
Accessing the indices of the DataFrame
indices = data.index
# Convert the index labels to a list of strings, since Pinecone expects string IDs
indices_list = [str(i) for i in indices.tolist()]
The indices_list will represent the IDs of our records!
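A quick check of the result (the exact labels depend on which rows survive the dropna above):

print(indices_list[:5])  # e.g. ['0', '1', '2', '3', '4']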
3. Process the metadata.
The metadata comprises all the remaining non-categorical fields in our dataset; for this particular exercise these are "Name", "Year", "Episodes", and "Rating".
The data is organized in a list of dictionaries, where the keys are field types and the values are the actual records.
# List to store dictionaries for each row
metadata_list = []
# Iterate over the DataFrame rows
for index, row in data.iterrows():
    # Extract the desired columns for the current row
    name = row['Name']
    year = row['Year']
    episodes = row['Episodes']
    rating = row['Rating']
    # Create a dictionary for the current row and append it to metadata_list
    metadata_list.append({"Name": name, "Year": year, "Episodes": episodes, "Rating": rating})
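For reference, pandas can build the same list of dictionaries in one line; this is an equivalent alternative, assuming the same four columns:

metadata_list = data[['Name', 'Year', 'Episodes', 'Rating']].to_dict(orient='records')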
4. Get the dense vectors.
# Extract the descriptions from the DataFrame as a list
descriptions = data['Description'].tolist()
# Embed the list of descriptions
dense_vectors = embeddings.embed_documents(descriptions)
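A quick sanity check may be useful here. At the time of writing, OpenAIEmbeddings defaults to OpenAI's text-embedding-ada-002 model, which returns 1536-dimensional vectors:

print(len(dense_vectors), len(dense_vectors[0]))  # expected: 245 1536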
5. Get the sparse vectors.
The sparse values will be obtained from the Type column. Why?
data['Type'].unique()
Output
array(['TV-MA', 'TV-G', 'TV-PG', 'TV-Y7-FV', 'TV-14', 'TV-Y', 'PG-13',
'TV-Y7', 'Not Rated', nan], dtype=object)
This column holds categorical values, as shown above. These are usually recorded as a regular column in a traditional database, but in a vector database we can save space and computation by recording them as sparse values.
# Step 1: One-hot encode the 'Type' column
one_hot_encoded_df = pd.get_dummies(data['Type'])
# Step 2: Convert the one-hot encoded DataFrame to a list of lists (encodings for all records)
one_hot_encodings_list = one_hot_encoded_df.values.tolist()
# Step 3: For each record, collect the column names of its non-zero entries (useful for inspection)
non_zero_indices_list = [one_hot_encoded_df.columns[encoding.nonzero()[0]].tolist() for encoding in one_hot_encoded_df.to_numpy()]
# Print the results
print("One-hot encodings for all records:")
print(one_hot_encodings_list[:5])
Output
One-hot encodings for all records:
[[0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0]]
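The position of each 1 follows the column order of the one-hot DataFrame, which pd.get_dummies sorts alphabetically, so index 4 corresponds to 'TV-MA':

print(one_hot_encoded_df.columns.tolist())
# ['Not Rated', 'PG-13', 'TV-14', 'TV-G', 'TV-MA', 'TV-PG', 'TV-Y', 'TV-Y7', 'TV-Y7-FV']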
Collecting the sparse values in a list. Note that Pinecone expects indices and values to be parallel lists of the same length, holding only the non-zero entries:
sparse_values = []
for encoding in one_hot_encodings_list:
    # 0-based positions of the non-zero entries
    indices = [i for i, value in enumerate(encoding) if value == 1]
    # Keep only the non-zero values themselves, converted to floats
    values = [float(encoding[i]) for i in indices]
    sparse_values.append({
        'indices': indices,
        'values': values
    })
print(sparse_values[:5])
Output
[{'indices': [4], 'values': [1.0]}, {'indices': [3], 'values': [1.0]}, {'indices': [5], 'values': [1.0]}, {'indices': [4], 'values': [1.0]}, {'indices': [4], 'values': [1.0]}]
6. Combine them into a single list.
This creates a list of dictionaries that can be upserted into a Pinecone index, with the structure described earlier.
# Creating a list of dictionaries
vector_list = []
for i in range(len(indices_list)):
    data_dict = {
        'id': indices_list[i],
        'values': dense_vectors[i],
        'metadata': metadata_list[i],
        'sparse_values': sparse_values[i]
    }
    vector_list.append(data_dict)
vector_list[0]
Output
{'id': '0',
'values': [0.013431914869978481,
0.010376786649314377,
-0.018131088179204113,
-0.030511347095271996,
-0.010323538281951407,
0.028434657974148448,
-0.01690637479853322,
-0.006163504808691176,
-0.04060192342076444,
0.0028437985589655013,
0.005318186161896771,
0.03128344888769636,
0.00782751838425978,
0.018131088179204113,
0.012939367239040359,
-0.016786565739135895,
0.03775313064457141,
-0.02923338441591557,
-0.012187233002300513,
-0.024920262002902118,
-0.0171859280286969
...],
'metadata': {'Name': '1. Breaking Bad',
'Year': '2008–2013',
'Episodes': '62 eps',
'Rating': 9.5},
'sparse_values': {'indices': [4], 'values': [1.0]}}
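From here, the records can be upserted into a Pinecone index. Below is a minimal sketch; the index name, environment, and API key are placeholders, and note that sparse-dense vectors require an index created with the dotproduct metric:

pinecone.init(api_key='', environment='us-west1-gcp')  # placeholder credentials
# Sparse-dense vectors are only supported with the dotproduct metric
pinecone.create_index('imdb-shows', dimension=1536, metric='dotproduct')
index = pinecone.Index('imdb-shows')
index.upsert(vectors=vector_list)  # consider batching for larger datasets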
7. Discussion and Conclusion.
As shown above, the sparse values are derived from categorical fields that, in tabular data, would be stored as redundant one-hot columns. In a vector database we can save space by storing only the indices and the corresponding non-zero values. The indices mark the positions of the non-zero entries within the one-hot encoding, e.g.
'sparse_values': {'indices': [4], 'values': [1.0]}
Here a single index-value pair stands in for a nine-element one-hot vector.
By using a sparse representation, you can save memory and computational resources, especially when dealing with large datasets that have a significant number of zero elements. Sparse representations are commonly used in various fields such as natural language processing (NLP), machine learning, and data compression, where data sparsity is prevalent. They allow for more efficient storage and manipulation of sparse data structures.
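To illustrate how easily the compact form maps back to the full vector, here is a small helper that expands a sparse record into its dense equivalent (a sketch using the structure from this post):

def to_dense(sparse, size):
    # Rebuild the full vector from the index-value pairs
    dense = [0.0] * size
    for i, v in zip(sparse['indices'], sparse['values']):
        dense[i] = v
    return dense

print(to_dense({'indices': [4], 'values': [1.0]}, 9))
# [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]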
Good coding!