Data visualization, by definition, involves making a two- or three-dimensional picture of data, so when the data being visualized inherently has many more dimensions than two or three, a big component of data visualization is dimensionality reduction. Dimensionality reduction is also often the first step in a big-data machine-learning pipeline, because most machine-learning algorithms suffer from the Curse of Dimensionality: more dimensions in the input means you need exponentially more training data to create a good model. Datacratic’s products operate on billions of data points (big data) in tens of thousands of dimensions (big problem), and in this post, we show off a proof of concept for interactively visualizing this kind of data in a browser, in 3D (of course, the images on the screen are two-dimensional but we use interactivity, motion and perspective to evoke a third dimension).

For the TL;DR crowd, here’s a demo of what we came up with, the source code on GitHub, and a video:

The behavioural datasets on which Datacratic’s platform operates are basically very large, very sparse binary matrices: you can think of a grid with millions of users running down the side and tens of thousands of behaviours running across the top, with a 1 in each cell where user U engaged in behaviour B and 0s everywhere else. Each user record thus can be thought of as a point in a high-dimensional space. If we had only three behaviours, this space would be three-dimensional, and there would only be 8 possible points in this space, like the corners of a cube. Because we operate on tens of thousands of behaviours, each user sits in one corner of a ten-plus-thousand-dimensional space.
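To make the cube picture concrete, here is a minimal sketch in Python. The user names and behaviour indices are made up for illustration and are not Datacratic data. Each user is stored as the set of behaviours they engaged in (only the 1s), and each dense row lands on one of the 8 corners of the unit cube.

```python
from itertools import product

# Hypothetical toy data: each user maps to the set of behaviour indices
# they engaged in. Storing only the 1s keeps a very sparse matrix small.
NUM_BEHAVIOURS = 3
users = {
    "u1": {0},
    "u2": {0, 2},
    "u3": set(),
}

def to_dense_row(behaviours, num_behaviours):
    """Expand a sparse set of behaviour indices into a dense 0/1 row."""
    return tuple(1 if b in behaviours else 0 for b in range(num_behaviours))

dense_rows = {user: to_dense_row(bs, NUM_BEHAVIOURS) for user, bs in users.items()}

# Every possible user record is a corner of the unit cube: 2**3 = 8 corners.
corners = list(product((0, 1), repeat=NUM_BEHAVIOURS))

for user, row in dense_rows.items():
    print(user, row, "is a corner:", row in corners)
```

With tens of thousands of behaviours the number of corners is 2 raised to that many dimensions, which is why the dense form is never materialized in practice.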

This is hard to describe, hard to think about, very hard to picture, and very hard to efficiently run algorithms on, so one of the first steps in our machine-learning pipeline is to perform a Singular Value Decomposition, or SVD, on the data. The SVD helps us turn our ten-thousand-dimensional hypercube of corners into something a bit more manageable. After the SVD dimensionality reduction step, each user occupies a point in a two-hundred-dimensional continuous space (i.e. they’re not all in a corner), and users that behave similarly to each other have coordinates that are close to each other in this new space. That sounds slightly easier to think about, and it’s certainly easier to run algorithms on, but 200 dimensions is still at least 197 dimensions too many to actually make a picture.
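As a rough illustration of what this step does, here is a pure-Python sketch of a truncated SVD using power iteration with deflation. It is not our production code, which works on sparse matrices at a very different scale; the function names (`truncated_svd`, `embed`) and the toy matrix are assumptions made up for this example. Each user's reduced coordinates are the entries of the left singular vectors scaled by the singular values.

```python
import math
import random

def matvec(A, v):
    """Compute A v for a dense matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def rmatvec(A, u):
    """Compute A^T u."""
    return [sum(row[j] * ui for row, ui in zip(A, u)) for j in range(len(A[0]))]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def truncated_svd(A, k, iters=200, seed=0):
    """Return up to k (sigma, u, v) triplets, largest singular value first."""
    rng = random.Random(seed)
    residual = [list(map(float, row)) for row in A]  # work on a copy
    triplets = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in residual[0]]
        for _ in range(iters):
            # Power iteration on A^T A converges to the top right singular vector.
            w = rmatvec(residual, matvec(residual, v))
            length = norm(w)
            if length == 0.0:
                break
            v = [x / length for x in w]
        av = matvec(residual, v)
        sigma = norm(av)
        if sigma < 1e-12:
            break  # nothing left to extract
        u = [x / sigma for x in av]
        triplets.append((sigma, u, v))
        # Deflate: subtract the rank-one component just found.
        for i, row in enumerate(residual):
            for j in range(len(row)):
                row[j] -= sigma * u[i] * v[j]
    return triplets

def embed(triplets):
    """Coordinates of each row (user) in the reduced space: sigma * u."""
    n_rows = len(triplets[0][1])
    return [[sigma * u[i] for sigma, u, _v in triplets] for i in range(n_rows)]

# Toy user-by-behaviour matrix: two groups of users with identical behaviour.
X = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
triplets = truncated_svd(X, k=2)
coords = embed(triplets)
for user_coords in coords:
    print([round(c, 3) for c in user_coords])
```

In this toy case, users with identical behaviour end up at the same coordinates and the two groups end up far apart, which is the property the rest of the pipeline relies on.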

To reduce the dimensionality even further, down to something we can actually look at, we use another algorithm called t-distributed Stochastic Neighbour Embedding, or t-SNE, which was designed to do exactly this: take high-dimensional data and make low-dimensional pictures such that points close to each other in the high-dimensional space are also close to each other in the picture (check out our high-performance open-source implementation!). t-SNE can reduce the number of dimensions to two, so we could just make a scatter-plot of a sample of our users in any old tool, but we chose to reduce the number of dimensions to three instead, and used some exotic browser technology to make some fancy visuals. Relaxing the constraint of the algorithm from two dimensions to three should also help preserve more of the high-dimensional structure in the final output, so this wasn’t solely an aesthetic exercise.
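For a feel of how t-SNE measures "closeness" in the picture, here is a small sketch of the low-dimensional similarities it uses: a heavy-tailed Student-t kernel, normalized over all pairs of points. This is only one ingredient of the algorithm (the high-dimensional similarities and the optimization loop are left out), and the function name and toy coordinates are assumptions for illustration, not part of our implementation.

```python
def student_t_affinities(points):
    """Pairwise similarities q_ij proportional to 1 / (1 + squared distance)."""
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # a point is not its own neighbour
                sq = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                weights[i][j] = 1.0 / (1.0 + sq)
    total = sum(sum(row) for row in weights)
    # Normalize over all pairs so the similarities form a probability distribution.
    return [[w / total for w in row] for row in weights]

# Hypothetical 3D embedding: two nearby points and one far away.
embedding = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0)]
q = student_t_affinities(embedding)
print(q[0][1] > q[0][2])  # the nearby pair is the more similar one
```

t-SNE moves the low-dimensional points around until these similarities match the ones computed in the high-dimensional space as closely as it can.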

The proof of concept, which we ended up calling the Data Projector, was built to see if we could interactively visualize a few thousand sampled points of the output of a server-side SVD/t-SNE pipeline in the browser using WebGL via Three.js instead of something like SVG via D3.js, which doesn’t make use of hardware acceleration in the browser and hence would struggle to display so many points. As the code on GitHub, the demo, and the video above show, the answer is most definitely yes. The interactivity in this case is that you can drag the cube around to get a different perspective on the data, and you can shift-drag in the right-hand orthographic view to select a prism-shaped volume.
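The reason a rectangle dragged in an orthographic view selects a prism can be sketched in a few lines. This is an illustration of the geometry only, written in Python rather than taken from the Data Projector's JavaScript source, and it assumes the view looks straight down the z axis; the function name and sample points are hypothetical.

```python
def select_prism(points, x_range, y_range):
    """Return indices of 3D points whose orthographic projection onto the
    x-y plane falls inside the dragged rectangle. Depth (z) is ignored,
    so the selected region is a prism extending through the whole cube."""
    x_min, x_max = sorted(x_range)
    y_min, y_max = sorted(y_range)
    return [
        i
        for i, (x, y, _z) in enumerate(points)
        if x_min <= x <= x_max and y_min <= y <= y_max
    ]

# Hypothetical sample of embedded users.
points = [
    (0.2, 0.3, -0.9),  # inside the rectangle, near the back of the cube
    (0.2, 0.3, 0.9),   # same screen position, near the front
    (0.8, 0.8, 0.0),   # outside the rectangle
]
selected = select_prism(points, x_range=(0.0, 0.5), y_range=(0.5, 0.0))
print(selected)
```

The first two points are both selected even though they sit at opposite ends of the depth axis, which is exactly the prism-shaped volume described above.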

In the video above and in the demo, each point represents a user. Points close to each other represent users that are similar to each other, in the sense that they behaved similarly. The colour of the points represents the output of yet another machine-learning algorithm called k-means clustering, which is used to group similar data points into clusters. Here we ran k-means with k=10 in the high-dimensional space before running t-SNE, so we have grouped the users into 10 buckets based on similarity. You’ll notice that, broadly, similarly-coloured users end up close together, creating coloured clouds. This means that users that were close to each other in the high-dimensional space ended up close to each other in the three-dimensional visualization as well.
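For readers who haven't met k-means before, here is a minimal sketch of the standard Lloyd's iteration in plain Python. It is a toy version with k=2 on made-up two-dimensional points, not the code or the k=10 setup we ran; the function names and data are assumptions for illustration.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iters=50, seed=0):
    """Lloyd's algorithm: alternate between assigning points to the nearest
    centroid and moving each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical data: two well-separated blobs of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2)
print(labels)
```

The labels play the role of the colours in the demo: points that share a label get the same colour, whatever coordinates t-SNE later gives them.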

This proof of concept shows that it’s possible to interactively visualize the output of some heavy-duty server-side high-dimensional machine-learning algos in 3D in the browser. In this demo we’re manipulating a graph with thousands of points in real-time, leveraging hardware acceleration via WebGL. The performance is good enough that we can envision doing much more complicated operations which would permit us to interactively dig into the actual semantics of why one user is “similar” to another, all in the browser. This technique can also be applied to simpler two-dimensional scatterplots, while maintaining very good performance, far beyond what SVG- or canvas-based interactive visualization libraries can manage today.

## Comments

Very interesting and informative post!

A question: would it make sense to run the k-means after the dimensionality reduction of t-sne? i.e. is t-sne useful only for the visualisation of the high dimensional data?

From what you say it seems that the results of the t-sne are similar to the k-means ones in the original space, so if someone is interested in the clustering results only, k-means is enough (although a visualisation of the clustering is nice to have as well)

thanks,

Stelios

Thanks for your comment Stelios, and my apologies for the long delay in responding!

I don't think that it would be better to do clustering on the 3-d post-t-SNE representation, as by that point we have thrown away a lot of information, and also because t-SNE is such a non-linear transformation, you would get some strange results.

As to the point of t-SNE here, the idea was to get a visual picture of what was happening within the data. The reason you see these uniformly-coloured starbursts is because that's how t-SNE represents large numbers of very similar points. In a more linear dimensionality-reduction, you would see these points all one on top of another, which would hide how many there are. When there is a greater variety of points, you get more interesting, "galactic" looking visualizations.

Cheers,

Nicolas

First, I found this post extremely insightful and interesting. I used Data Projector in a project not that long ago and it was awesome!

Here it is running with a totally different, dimensionality-reduced set of data. In this case, it was PCA, then SVD I think. http://metasyn.pw/ted-talk-topics/data-projector/index.html

The data in that case is some strange clustering (or lack thereof) of TED talks in 2013...

I took your idea and tried to do something similar here:

http://metasyn.pw/dreda

http://github.com/metasyn/dreda

Best,

Xander

Very cool, thanks for sharing!

I am curious if the demo in the video has more features / is more code-complete than the code that is currently on GitHub. Two things I've noticed: 1) there seems to be a three.js positioning bug with the projections, as you can see in the live demo page, and 2) the [Shift+P]rint feature in the toolbar. I am doing some of my own t-SNE viz on doc2vec vectors from a corpus and would love to use the latest code if at all possible!