DEV Community

Aaron Decker

vector-2-trend easily show trending topics from collections of semantic text vectors

I have recently been working on a lot of semantic search problems at Bounty and kept running into the same problem.

TL;DR: the open source library is published on GitHub (https://github.com/a-r-d/vector-2-trend) and on npm as "vector-2-trend".

The problem

When you collect user feedback about a product, your boss will come to you and say:

"What feature are people requesting the most?"

But semantic search isn't really suited to this kind of quantitative question.

The new generative Q&A process people are doing with Pinecone + GPT will not give you this information.

So how do you take a bunch of unstructured text data and do some quantitative analysis on it: which topics come up again and again, and how strong is each of those topics?

The answer is to run clustering algorithms on the vectors (clustering is a form of unsupervised machine learning).

The process for generative Q&A

OK, hold up, maybe I lost you. What am I talking about? I am taking advantage of the ability of OpenAI's new models to make amazingly good semantic vectors. With those vectors you can build your own semantic search engine and do other things as well (here, we are clustering them).

This is the new hot thing, so it helps to understand why you would already be generating vectors.

The process looks like this for making semantic search:

  1. collect text-based data (in my case it's TikTok video transcripts)
  2. convert the transcripts to vectors (I used the OpenAI text-embedding-ada-002 API)
  3. feed these into Pinecone to create a vector db
  4. when somebody asks a question, convert it into a vector embedding using the same text-embedding-ada-002 API
  5. search this question vector against the vector database
  6. get back the closest results semantically (this is the key - if somebody has written "sucks", it understands that this is a complaint when you ask about complaints)
  7. feed the results into GPT to summarize or analyze, or show the direct results to the user
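
To make steps 5 and 6 concrete, here is a minimal sketch of what "closest results" means. It is not Pinecone's API and not code from the library: the `Doc` type, the `topK` helper, and the tiny 3-dimensional vectors are all made up for illustration, and I am assuming cosine similarity as the distance measure.

```typescript
// Hypothetical in-memory stand-in for a vector database such as Pinecone.
type Doc = { text: string; vector: number[] };

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k documents whose vectors are most similar to the query vector.
function topK(query: number[], docs: Doc[], k: number): Doc[] {
  return docs
    .map((doc) => ({ doc, score: cosineSimilarity(query, doc.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.doc);
}

// Toy 3-dimensional vectors (assumption); real ada-002 embeddings have 1536 dimensions.
const docs: Doc[] = [
  { text: "delivery took three weeks, this sucks", vector: [0.9, 0.1, 0.0] },
  { text: "love the new colors", vector: [0.0, 0.2, 0.9] },
  { text: "package arrived late again", vector: [0.8, 0.3, 0.1] },
];

// Pretend this is the embedding of "what are people complaining about?".
const questionVector = [1.0, 0.2, 0.0];
const closest = topK(questionVector, docs, 2);
```

With this toy data, `closest` holds the two delivery complaints and leaves out the comment about colors. Notice that it only ever gives you the nearest neighbors of one question, which is exactly why it can't tell you what comes up most often.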

OK got it?

So the point is, if you are doing this kind of stuff you already have semantic vectors!

The solution to generate trending topics

Once people start asking "what are people complaining about most?", you are going to need to generate a list of things.

But feeding your closest matches into GPT is not going to do this for you. How do you analyze the entire data set, or the most recent 100?

The way I dealt with this is to use clustering to create groups of semantically similar pieces of data.

It ended up working pretty well. In the dataset I used there was a group of people complaining about delivery times, and I ended up with a cluster containing all of the pieces of user feedback related to delivery times!
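
Here is a rough sketch of the grouping idea using plain k-means (Lloyd's algorithm). This is an illustration under my own assumptions, not the library's code: the `kmeans` function, the toy 2-dimensional vectors, and the example texts are all hypothetical, and the centroids start at the first k points only to keep the example deterministic.

```typescript
type Vec = number[];

// Squared Euclidean distance between two vectors of equal length.
function squaredDistance(a: Vec, b: Vec): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Index of the centroid closest to the point.
function nearestCentroid(point: Vec, centroids: Vec[]): number {
  let best = 0;
  for (let c = 1; c < centroids.length; c++) {
    if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
      best = c;
    }
  }
  return best;
}

// Plain Lloyd's k-means. Centroids start at the first k points (assumption);
// real implementations usually use random or k-means++ initialization.
function kmeans(points: Vec[], k: number, maxIterations = 50): number[] {
  let centroids = points.slice(0, k).map((p) => p.slice());
  let labels = points.map((p) => nearestCentroid(p, centroids));
  for (let iter = 0; iter < maxIterations; iter++) {
    // Move each centroid to the mean of the points assigned to it.
    centroids = centroids.map((old, c) => {
      const members = points.filter((_, i) => labels[i] === c);
      if (members.length === 0) return old; // leave an empty cluster's centroid in place
      return old.map((_, dim) => members.reduce((s, m) => s + m[dim], 0) / members.length);
    });
    const next = points.map((p) => nearestCentroid(p, centroids));
    const changed = next.some((label, i) => label !== labels[i]);
    labels = next;
    if (!changed) break; // assignments are stable, so the algorithm has converged
  }
  return labels;
}

// Toy 2-D "embeddings" (hypothetical): two delivery complaints, two pricing comments.
const feedback = [
  { text: "delivery was slow", vector: [0.9, 0.1] },
  { text: "too expensive", vector: [0.1, 0.9] },
  { text: "package arrived late", vector: [0.8, 0.2] },
  { text: "price went up again", vector: [0.2, 0.8] },
];
const clusterLabels = kmeans(feedback.map((f) => f.vector), 2);
```

In this toy run the two delivery complaints end up with the same label and the two pricing comments share the other one, which is the same effect I saw with the delivery-time feedback.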

If you do this for the entire dataset and rank the clusters by the strength of the trend, you get a trending list like this:

(Image: example trending list)

What are the steps

OK, next I will outline the steps the library performs.

  1. input the semantic vectors, along with their original text, to the clustering library
  2. run PCA on the vectors to reduce dimensionality (ada-002 outputs 1536 dimensions)
  3. run the k-means algorithm on the result (the library chooses a sane value for the number of clusters)
  4. calculate a silhouette score for each cluster
  5. create a "custom density" score for each cluster - which is a combo of silhouette score + number of results
  6. return all this data, ranked by density descending
  7. pass this to the classifier
  8. the classifier calls GPT-3.5-turbo and asks it to label each cluster with a descriptive name
  9. the output is a simple trending list like above!
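
Steps 4 to 6 can be sketched like this. The silhouette part follows the standard definition, (b - a) / max(a, b), averaged per cluster. The density formula is purely my assumption for the sake of the example (mean silhouette multiplied by cluster size); the post only says it is a combo of the two, so check the library source for the real calculation. The names `ClusterStats` and `rankClusters` are hypothetical as well.

```typescript
type Vec = number[];
type ClusterStats = { cluster: number; size: number; silhouette: number; density: number };

// Euclidean distance between two vectors of equal length.
function distance(a: Vec, b: Vec): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

// Mean distance from point i to every other point carrying the given label.
function meanDistanceTo(i: number, label: number, points: Vec[], labels: number[]): number {
  let total = 0;
  let count = 0;
  for (let j = 0; j < points.length; j++) {
    if (j === i || labels[j] !== label) continue;
    total += distance(points[i], points[j]);
    count++;
  }
  return count === 0 ? 0 : total / count;
}

// Distinct cluster labels, in order of first appearance.
function uniqueLabels(labels: number[]): number[] {
  return labels.filter((l, i) => labels.indexOf(l) === i);
}

// Standard silhouette value for one point: (b - a) / max(a, b).
function silhouette(i: number, points: Vec[], labels: number[]): number {
  const own = labels[i];
  const ownSize = labels.filter((l) => l === own).length;
  if (ownSize === 1) return 0; // by convention, a single-point cluster scores 0
  const others = uniqueLabels(labels).filter((l) => l !== own);
  if (others.length === 0) return 0; // undefined with only one cluster, so use 0
  const a = meanDistanceTo(i, own, points, labels); // cohesion within own cluster
  const b = Math.min(...others.map((l) => meanDistanceTo(i, l, points, labels))); // nearest other cluster
  const denom = Math.max(a, b);
  return denom === 0 ? 0 : (b - a) / denom;
}

// Score every cluster and sort by density, highest first.
function rankClusters(points: Vec[], labels: number[]): ClusterStats[] {
  return uniqueLabels(labels)
    .map((cluster) => {
      const members = labels.map((l, i) => (l === cluster ? i : -1)).filter((i) => i >= 0);
      const mean = members.reduce((s, i) => s + silhouette(i, points, labels), 0) / members.length;
      // ASSUMPTION: hypothetical density = mean silhouette * cluster size.
      return { cluster, size: members.length, silhouette: mean, density: mean * members.length };
    })
    .sort((x, y) => y.density - x.density);
}

// Toy data (hypothetical): cluster 0 is tight with three points, cluster 1 is looser with two.
const points: Vec[] = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 14]];
const labels = [0, 0, 0, 1, 1];
const ranked = rankClusters(points, labels);
```

Here the tight three-point cluster ranks above the looser two-point one, since it wins on both silhouette and size. Whatever the exact formula, the idea is the same: a trend is stronger when more pieces of feedback agree with each other more closely.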

Feedback & PRs

If you want to make a PR, I would be happy to accept it. This is something that I will continue to work on, as it is powering some of our features.

If you just want to use it and play around, feel free to reach out to me on Twitter (@ardninja) with questions.

You can find the project here: https://github.com/a-r-d/vector-2-trend
