I’m a founder at Potis AI, where we’re building an AI recruiter.
SEO has always been on my back burner, but I finally decided to tackle it. Since my own knowledge was surface-level at best, I hired an agency.
First thing they said? “We need to build a semantic core.” Seemed logical. They analyzed competitors and gathered a huge list of keywords.
Then they told me, “Now we’ll cluster these keywords.” Still sounded straightforward.
But here’s where the magic started.
The agency claimed it would take at least two weeks to cluster everything, even though there were only 500K records.
When I asked what algorithms or principles they used, they simply said: “We have a very big Windows server running a magical desktop program.”
Fair enough, I didn’t interfere.
But curiosity got the better of me. I decided to build my own pipeline to figure out how this works.
Filtering out noise:
I labeled 5K rows using GPT-4o-mini, marking examples of relevant and irrelevant words. Then I trained a mini-classifier. Surprise! 90% of the data was garbage.
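Roughly, the filter looked like this. A minimal sketch: the file names, the 0/1 label column, and the choice of TF-IDF character n-grams plus logistic regression are illustrative stand-ins for whatever mini-classifier you prefer.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# 5K rows labeled by GPT-4o-mini; assumed columns: keyword, is_relevant (0/1)
labeled = pd.read_csv("labeled_5k.csv")

X_train, X_test, y_train, y_test = train_test_split(
    labeled["keyword"], labeled["is_relevant"], test_size=0.2, random_state=42
)

# Character n-grams handle the typos and word forms common in search queries.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

# Score the full 500K list and keep only the rows predicted relevant.
keywords = pd.read_csv("keywords_500k.csv")
keywords = keywords[clf.predict(keywords["keyword"]) == 1]
```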
Building embeddings:
I used stella_en_1.5B_v5 to generate embeddings (1024 dimensions), then reduced them to 30 dimensions with LSA (Truncated SVD).
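In code this step is short. A sketch, assuming the Hugging Face model id dunzhang/stella_en_1.5B_v5 and an arbitrary batch size:

```python
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import TruncatedSVD

# stella_en_1.5B_v5 outputs 1024-dim vectors by default (per its model card).
model = SentenceTransformer("dunzhang/stella_en_1.5B_v5", trust_remote_code=True)
embeddings = model.encode(
    keywords["keyword"].tolist(),
    batch_size=256,
    show_progress_bar=True,
    normalize_embeddings=True,
)

# LSA here is plain Truncated SVD down to 30 components.
svd = TruncatedSVD(n_components=30, random_state=42)
reduced = svd.fit_transform(embeddings)
print(f"explained variance kept: {svd.explained_variance_ratio_.sum():.1%}")
```

Squeezing 1024 dimensions down to 30 throws away detail, but density-based clustering behaves far better in low-dimensional spaces.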
Clustering:
I ran multiple iterations of clustering with HDBSCAN. By the third iteration, ~70% of the data was neatly clustered. I calculated centroids and assigned the remaining elements to the nearest cluster.
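A single-pass sketch of the same idea (min_cluster_size is illustrative; in the real run, HDBSCAN was re-applied to the leftover noise a couple of times before falling back to centroids):

```python
import hdbscan
import numpy as np
from scipy.spatial.distance import cdist

# HDBSCAN labels dense groups and marks everything else as noise (-1).
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, metric="euclidean")
labels = clusterer.fit_predict(reduced)

# One centroid per cluster, then snap each leftover noise point
# to its nearest centroid.
cluster_ids = np.unique(labels[labels >= 0])
centroids = np.vstack([reduced[labels == c].mean(axis=0) for c in cluster_ids])

noise = np.where(labels == -1)[0]
labels[noise] = cluster_ids[cdist(reduced[noise], centroids).argmin(axis=1)]
```

Assigning the leftovers by centroid distance is a pragmatic trade-off: HDBSCAN's noise label is too conservative when every keyword ultimately needs a bucket.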
Naming clusters:
Too lazy to do it manually, I asked GPT-4o-mini to name the ~2K clusters. (Sarcasm incoming: of course, most clusters revolved around one theme.)
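The naming step is one API call per cluster. A hypothetical helper along these lines does it; the prompt wording is mine:

```python
from openai import OpenAI

client = OpenAI()

def name_cluster(sample_keywords: list[str]) -> str:
    """Ask GPT-4o-mini for a short topic label from a sample of a cluster."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Name the common topic of these search queries "
                       f"in 2-4 words: {', '.join(sample_keywords[:20])}",
        }],
    )
    return resp.choices[0].message.content.strip()

names = {c: name_cluster(keywords.loc[labels == c, "keyword"].tolist())
         for c in cluster_ids}
```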
The whole process took me half a day and $3 on Google Colab.
Fast forward two weeks.
The SEO agency came back with their results: a dirty dataset grouped only by root words. When I showed them my results, they were genuinely surprised.
Moral of the story
Not all “magic” is science. Sometimes it’s just a complete lack of understanding.