This article was written to design a solution for ordering articles on terravigil.com. It will start as a problem related to articles, and will end with the same problem as well. But in between...
Let's do some math:
- open your favourite site with some articles on the main page - bbc.com, reddit, or dev.to, it does not matter
- check what you see
- probably that's the latest and the best news or articles: the latest AND the best.
These are two different properties to order by. Simultaneously.
These are two different dimensions to discover.
💡 There is an easy way to "experience" this problem - open any table within some table processor (Excel) and try to sort by a field. Then by another field. Then by two fields simultaneously - 😢 but there is no such operation.
Speaking in a more common language (SQL):
- you can `ORDER BY time, rating LIMIT 10`
- but it will first sort by `time` and only then by `rating`.
Want to order by time AND by rating simultaneously? That is not possible - the ordering happens sequentially, or, better to say, in the 1D world.
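To see this sequential behaviour concretely, here is a small sketch using SQLite from Python. The table and data are invented for illustration:

```python
import sqlite3

# Toy dataset: articles with a publish time and a rating.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, time INTEGER, rating INTEGER)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [
        ("old-but-great", 1, 100),
        ("fresh-but-weak", 3, 10),
        ("fresh-and-good", 3, 50),
        ("middle", 2, 70),
    ],
)

# "ORDER BY time, rating" sorts by time first; rating only breaks ties.
rows = conn.execute(
    "SELECT title FROM articles ORDER BY time DESC, rating DESC LIMIT 2"
).fetchall()
result = [r[0] for r in rows]
print(result)  # → ['fresh-and-good', 'fresh-but-weak']
```

Note that "old-but-great" (rating 100) never makes the cut: `rating` only matters between rows with the same `time`.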
However, it's easy to start the story from another example - from the actual 2D world. Something we all have experienced and are able to understand.
From maps 🗺
Maps
🗺 Maps are 2D. Obviously.
Let's do some math:
- open your favourite site powered by some maps and displaying some stuff on the map (that's important!)
- check what you see
- probably there are some "dots" (Points Of Interest) on that map.
For the sake of simplicity - let's use domain.com.au as an example.
Let's check what we are actually looking at:
- ✅ it's the City Of Sydney
- ✅ the majority of POIs are located in the Downtown
- ✅ there are almost 9000 places to rent
- 🚩 just a few dozen are displayed
So, this is "our situation" - we have more data than we could, or should, display, so we have to apply some limits - we have to pick the "latest AND the best" items to display.
We have to pick the best items with a restriction by X and Y. That's already 1-dimensional ordering plus 2-dimensional filtering.
However, I would like to point out a strange fact: there is almost no information outside of the Downtown.
Probably nobody is selling or leasing their houses so far from the centre. People are not living in those wild areas! Sounds reasonable? 🤷‍♂️ Why not?
However, let's double-check it - zoo-oo-ooming in!
👆 I can assure you, there is life outside of Sydney's CBD - we just are not able to see it due to the limits applied to the 2D data.
I've got a video for you.
👆 In short: there are A LOT of "dots" outside the Downtown, but we don't see them.
The root cause of this effect: information tends to be displayed in the areas of higher density - and yes, it becomes worse and worse with scale. Every entity has an equal probability to be picked, and that's why higher "density" produces more "picks".
Actually, the root cause can also be observed in the overall "speed" of updating information on the map.
Every time I move the map, domain fetches new information from the backend and then displays it. Very often, especially within "areas of high density", the updated data does not represent the previous dataset (it might be very different), even if I am searching roughly the same area.
This causes heavy load on the backend, a very slow and unresponsive frontend, as well as annoying POI-flickering. Really VERY annoying. Especially when some point you had plans to inspect just disappears...
Living in 2-dimensions
We have two problems here:
Problem 1
Every request is unique. You can move the map as you want, and every time the request will be absolutely unique. You will have to reach the server, spend time, and return a unique result.
Two customers, looking at one area, are actually looking at ABSOLUTELY different ones.
There is one thing you have to understand - maps "displaying" some stuff have been here for some time. For "some time" - since IE6 and the pre-iPhone era. Since "before" AWS. Since the times when you had a single database for all 1,000,000 customers per day.
In other words - 10 years ago you were not able to "scale up", and had to provide a performant and efficient backend, or your frontend would DDoS your backend.
- Cartographical services of the past had to solve the problem. And they solved it.
- But cartographical services of the present do not give a crap, resulting in the issue mentioned above, even though the fix has been known for years. > The solution will be explained later
Problem 2
Problem number 2 consists of two other problems:
- `SELECT * WHERE X and Y` gives us 10,000 records. You can select much more information than you can display, so you have to `LIMIT` it. But limiting needs sorting to keep "the best" records.
- `ORDER BY rating` keeps records in the areas of higher density, while you want to display all possible opportunities. If all records have the same probability to be displayed, the result would just represent the information density.
And "information density" is the problem you have to fix first. Well, not to fix - but to neutralize.
Clustering
For example, you can do N smaller requests to "cover" the whole requested area:
`SELECT WHERE X and Y ORDER BY rating LIMIT 100`
⬇️⬇️⬇️
`100 times × (SELECT WHERE x and y ORDER BY rating LIMIT 1)`
This time, instead of one big select we are making a hundred smaller ones, each of which still tends to return the most "dense" part of its cell, but together they cover the whole block more or less evenly.
However - it is not efficient. There is a better way - GROUP BY:
`SELECT WHERE X and Y ORDER BY rating LIMIT 100`
⬇️⬇️⬇️
`SELECT WHERE X and Y GROUP BY (ROUND(x/10), ROUND(y/10)) ORDER BY rating`
In this case we have grouped all points in some "smaller area" together, resulting in potentially up to 100 rows (a 0.1×0.1 grouping).
This is, again, not efficient, but it helps us reduce the density of the data and make the coverage a bit more "flat".
We just made a complex task a bit simpler.
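A toy sketch of the effect (the schema and the 90/10 density split are invented for illustration): a naive top-N returns almost only "downtown" points, while a grid GROUP BY spreads the picks across the map:

```python
import random
import sqlite3

random.seed(42)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE poi (x REAL, y REAL, rating INTEGER)")

# 90% of points crammed into "downtown" (the 1x1 square at the origin),
# 10% spread over the whole 10x10 "suburbs".
pts = [(random.random(), random.random(), random.randrange(100)) for _ in range(900)]
pts += [(random.random() * 10, random.random() * 10, random.randrange(100))
        for _ in range(100)]
conn.executemany("INSERT INTO poi VALUES (?, ?, ?)", pts)

# Naive top-100: dominated by the dense downtown.
naive = conn.execute(
    "SELECT x, y FROM poi ORDER BY rating DESC LIMIT 100").fetchall()
naive_downtown = sum(1 for x, y in naive if x < 1 and y < 1)

# Grid grouping: the best point per 1x1 cell, then the top cells.
grouped = conn.execute(
    """SELECT x, y, MAX(rating) FROM poi
       GROUP BY CAST(x AS INT), CAST(y AS INT)
       ORDER BY MAX(rating) DESC LIMIT 100"""
).fetchall()
grouped_downtown = sum(1 for x, y, r in grouped if x < 1 and y < 1)

print(naive_downtown, grouped_downtown)  # grouping covers far more suburbs
```

The naive query returns roughly 90 downtown points out of 100; the grouped query returns at most one, because the whole downtown collapses into a single cell.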
However, this is not the clusterization we should use to properly display high-density information. All we did was aggregation, and that's enough! But we did it not in the best way.
There is a better one.
Problems with "random queries"
- `GROUP BY ROUND(x/10)` does not work as-is: the `/10` should be different for every zoom level. With every zoom the map becomes twice as big, so the multiplier should also adjust (×2 every zoom). > Plus grouping is actually slow. No matter how you do it - it will be slow.
- `WHERE x BETWEEN left AND right` would be unique for every user looking at your map. And that is a problem not only for the network cache, but for the database cache as well. > 👆 Two customers, looking at one area, are looking at ABSOLUTELY different ones.
Z for Z-code
The goal of aggregation is to collapse some records, which could be grouped somehow, into one record.
For 2D data the most common solution is known as the Z or Morton code, as well as geohash, Gray code or the Hilbert curve - they all are some sort of space-filling curves, literally a line which stitches N dimensions together.
It is called "Z" because at every iteration it draws a "Z" (and the Hilbert curve draws a "U").
It might take a while to understand how such curves work, but the essence of the idea is simple:
values which are "close" on the curve, are "close" in the N-dimension as well
In our case that would sound like:
`SELECT WHERE X and Y`
`GROUP BY zcode & 0xFFFF00 <---`
`ORDER BY rating LIMIT 100`
where `0xFFFF00` plays the role of a "mask" and groups 👆 values which are "close" on the curve, effectively clustering elements which are close to each other in 2D.
How this "clustering" works is best seen in Bing's Tile coordinate system - quadkeys (the same Z code), which use 2 bits (0-1-2-3) per one "division". Masking away the trailing bits "groups" all points with the same beginning of the code - and all points in one "rectangle" will have the same beginning.
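Here is a minimal sketch of a Z-code in Python. The bit-interleaving trick and the masking step are the standard Morton-code approach, not code from any particular service:

```python
def interleave16(n: int) -> int:
    """Spread the lower 16 bits of n so there is a zero bit between each."""
    n &= 0xFFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n

def zcode(x: int, y: int) -> int:
    """Morton / Z-code: interleave the bits of x and y."""
    return interleave16(x) | (interleave16(y) << 1)

# Two points close in 2D share a long common prefix of their Z-code,
# so masking off the low bits puts them into the same "cluster".
MASK = ~0xF  # drop the 4 lowest bits => 4x4 cells (2 bits per axis)
a, b, far = zcode(10, 20), zcode(11, 21), zcode(200, 300)
print(a & MASK == b & MASK)    # True: neighbours fall into the same cell
print(a & MASK == far & MASK)  # False: a distant point does not
```

`GROUP BY zcode & mask` then becomes a cheap integer operation, and widening or narrowing the mask by 2 bits corresponds exactly to one zoom level.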
Word-to-vec
A bit different approach is used in search engines, which also need to find and rank something among a billion documents.
"Similarity" search is one of the key algorithms there, and word2vec is a key technique.
As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen carefully such that a simple mathematical function (the cosine similarity between the vectors) indicates the level of semantic similarity between the words represented by those vectors.
Mathematically speaking it's the same case - "close" words or "documents" end up "close" to each other in the (very) multi-dimensional space (for reference, it's not 2D or 3D, but closer to 1000D).
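A minimal sketch of how cosine similarity compares such vectors. The 3D "embeddings" here are invented for illustration - real word2vec vectors are learned from data and are far longer:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings:
king  = [0.9, 0.8, 0.1]
queen = [0.85, 0.75, 0.2]
apple = [0.1, 0.2, 0.9]

print(cosine(king, queen) > cosine(king, apple))  # related words score higher
```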
Terravigil
However, let's get back to our articles case, about displaying "the best and the freshest" data.
What we actually want:
- we want to see "the best of the best" news longer - more people should read them
- we want to display hot news as soon as possible, so freshness is important
- but these two points contradict each other
There are two possible solutions here: buckets and score.
Buckets
Select news in predefined time intervals, and order inside each group separately.
Literally "display the top news of today" and then "display the top news of yesterday".
- the size of a bucket can be adjusted, like more than a day or less than a day
- buckets might overlap, like you can merge the top news today, and the top news from the past week.
- buckets can be fixed (in time), thus cacheable, or floating, changing intervals they represent every minute or hour.
Buckets first bucket by time, then sort by rating.
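A minimal sketch of the bucket ordering in Python (the article data is invented for illustration):

```python
# Hypothetical articles: (title, age_in_days, rating)
articles = [
    ("breaking", 0, 5),
    ("fresh-hit", 0, 90),
    ("yesterday-hit", 1, 95),
    ("old-classic", 6, 99),
]

# Bucket by day first, then sort by rating (descending) inside each bucket.
articles.sort(key=lambda a: (a[1], -a[2]))
feed = [title for title, day, rating in articles]
print(feed)  # → ['fresh-hit', 'breaking', 'yesterday-hit', 'old-classic']
```

Note how "old-classic" (rating 99) still lands last: buckets guarantee today's news always comes before yesterday's, no matter the rating.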
Score
Select news ordered by a single "score" - which is the main problem of this article.
While spatial curves can help with 2D and other geometrical cases, they are more or less useless for this particular case.
Why? Because some elements "close" on the 2D plot can be VERY far apart on the curve. Just think: the curve's "start" and "end" are maximally far apart along the curve, even when they land right next to each other in 2D.
The solution here is to use special "vector" data formats, or databases supporting KNN - k-nearest neighbours search (or nearest centroid) queries.
- Javascript implementation: https://github.com/mourner/rbush-knn
- ElasticSearch: https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/knn.html
- Postgres supports it as a part of PostGIS - remember that our problem can be seen as a geometrical problem
- MySQL does not have any efficient way to handle this, and you will probably kill your database if you try.
A sample data set of points considered as "neighbours"
Conclusion
Long story short - Terravigil uses buckets, since they are easy to design and to reason about, so you can produce a correct result.
It also uses score to derive a single metric from many different signals, to at least understand which article is "the best".
But all you have to do is look at the problem from another angle. If you were told to do a 2-dimensional sorting, think about the limits of the 2nd dimension.
And it's not just about how to do it, it's also about how to make it FAST.
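One common way to fold freshness and rating into a single score is a Hacker-News-style "gravity" formula - not necessarily what Terravigil does, and the constants here are arbitrary:

```python
def score(rating: float, age_hours: float, gravity: float = 1.8) -> float:
    # Rating decays with age; higher gravity makes freshness matter more.
    return rating / (age_hours + 2) ** gravity

# A mediocre fresh article can outrank a great but old one:
print(score(rating=10, age_hours=1) > score(rating=100, age_hours=48))  # True
```

Once every article has one such number, the 2D problem collapses back into a plain 1D `ORDER BY score LIMIT 10`.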