DEV Community

Apiumhub
Apiumhub

Posted on • Originally published at apiumhub.com on

Dimensionality reduction – TSNE

t-SNE | T-distributed stochastic neighbor embedding

The goal is to take a set of points in a high-dimensional space and find a faithful representation of those points in a lower-dimensional space, typically the 2D plane. The algorithm is non-linear and adapts to the underlying data, performing different transformations on different regions. Those differences can be a major source of confusion.

Previous Concepts

Softmax function

This function is related to Gaussian distribution which is named in the previous paper of t-SNE –> SNE.

WZDZNxdixavGjw4YufcHBgbZAmYevCV2ont2HpYwCJNJuZVwi mHzyR9Hb5p LcB53 xDUJvhTWQAluLLRVOF4 2V25Ad4Ou8rcy6gtDebznHtQut GhYX6bASXqC6v6dIUcdg

Kullback Leiber divergence

Kullback–Leibler divergence, also called relative entropy, is a measure of how one probability distribution is different from a second, reference probability distribution.

Disimilarity

eSyiFrbAsFbqKnhNOJWlIQbvoDN EVd7ucx56HWpoeDh94WHbUcpLhaLeTgZ1lhJHGfiebQXpD7jPbKH8e13qYLGlt7SffPGvciEektN PcHHQWytDWG2zbyRCsCq1lds85cQjA

Neighbor Probability

1EYVmLK1ykBrr1ng2vqfeaPV8lOL5r3Ll7vOmt zmr3tlN6xNs6GNhU0FpA CaXS9eXpSUsV p7jGikcSIiSml1AVQpmsZhIz1m31ogYFO 5JR zKC15cZ0gsgDczGIlZIX9f

A second feature of t-SNE is a tuneable parameter, “perplexity,” which says (loosely) how to balance attention between local and global aspects of your data. The parameter is, in a sense, a guess about the number of close neighbors each point has. The perplexity value has a complex effect on the resulting pictures. The original paper says, “The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50.” But the story is more nuanced than that. Getting the most from t-SNE may mean analyzing multiple plots with different perplexities.

That’s not the end of the complications. The t-SNE algorithm doesn’t always produce similar output on successive runs, for example, and there are additional hyperparameters related to the optimization process.

Embedding

An embedding is essentially a low-dimensional space into which a high dimensional vector can be translated. During translation, an embedding preserves the semantic relationship of the inputs by placing similar inputs close together in the embedding space. Let’s try and wrap our head around this concept with examples. Here is a grab from the creators of the Embedding projector, a tool which enables us to visualise high dimensional data easily.

https://projector.tensorflow.org/

Perplexity

s6bzajyl Sdseu H9lq1OCfAzffb4FyhJpFgUnrsBjapO1KRC262r tLB1Jxmp7IBbGgZWCuSBtZa ToZ7OHkJZ4xPJJpJvRBRkAraS7aV8DFxHDxZrmLjEubFrtPThIxa 3SjX9

Perplexity definition by Van der Maaten & Hinton can be interpreted as a smooth measure of the effective number of neighbors. The performance of t-SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50.

With perplexity values in the range (5 – 50) suggested by van der Maaten & Hinton, the diagrams do show these clusters, although with very different shapes. Outside that range, things get a little weird. With perplexity 2, local variations dominate. The image for perplexity 100, with merged clusters, illustrates a pitfall: for the algorithm to operate properly, the perplexity really should be smaller than the number of points. Implementations can give unexpected behavior otherwise.

Iteration

XHVpS2irqWOuXsYrt4vPrgIa7pt2TpLX2NPhOQ9OXOkiUynkSFKC s

The images above show five different runs at perplexity 30. The first four were stopped before stability. After 10, 20, 60, and 120 steps you can see layouts with seeming 1-dimensional and even pointlike images of the clusters.

If you see a t-SNE plot with strange “pinched” shapes, chances are the process was stopped too early. Unfortunately, there’s no fixed number of steps that yields a stable result. Different data sets can require different numbers of iterations to converge.

Another natural question is whether different runs with the same hyperparameters produce the same results. In this simple two-cluster example, and most of the others we discuss, multiple runs give the same global shape. Certain data sets, however, yield markedly different diagrams on different runs; we’ll give an example of one of these later.

Clusters with different interdistance

Surprisingly, the two clusters look about same size in the t-SNE plots. What’s going on? The t-SNE algorithm adapts its notion of “distance” to regional density variations in the data set. As a result, it naturally expands dense clusters, and contracts sparse ones, evening out cluster sizes. To be clear, this is a different effect than the run-of-the-mill fact that any dimensionality reduction technique will distort distances. (After all, in this example all data was two-dimensional to begin with.) Rather, density equalization happens by design and is a predictable feature of t-SNE.

The bottom line, however, is that you cannot see relative sizes of clusters in a t-SNE plot.

Top comments (0)