benank

Realtime Pose Comparison With TensorFlow.js

The Challenge

I'm creating a dance game in the browser that uses TensorFlow.js (specifically the MoveNet pose estimation model) to analyze a person's movements and compare them to the movements in the song that they're dancing to.

In the previous blog posts, I outlined a general plan and talked about how to use YouTube videos with TensorFlow.js. Now that we've got the video, we'll need to compare each frame of it with the webcam stream from the user, all in realtime. This way, the user can see how well they're doing at any given time as they play the song.

How do we compare the poses and dance moves between one person and another? How do we account for different body shapes and sizes?

The Plan

When you analyze an image (or frame of a video in my case), TensorFlow.js returns some data that looks a little like this:

    "keypoints": [
        {
            "y": 95.41931572589485,
            "x": 289.713457280619,
            "score": 0.8507946133613586,
            "name": "nose"
        },
        {
            "y": 87.39720528471378,
            "x": 299.0246599912063,
            "score": 0.8859434723854065,
            "name": "left_eye"
        },
        {
            "y": 89.00106838638418,
            "x": 279.21988732828237,
            "score": 0.7947761416435242,
            "name": "right_eye"
        },
        ... (and more, 17 keypoints total)

Each keypoint has an x and y position (where the keypoint is on the screen), a score (how confident TFJS is that this keypoint is correct), and a name (a label for the keypoint).
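To get a feel for working with this data shape, here's a tiny sketch (the helper names and the 0.3 score threshold are my own choices, not anything from TFJS) that filters out low-confidence keypoints and looks one up by name:

```javascript
// Sample MoveNet-style keypoints (same shape as the output above)
const keypoints = [
  { y: 95.42, x: 289.71, score: 0.85, name: 'nose' },
  { y: 87.4, x: 299.02, score: 0.89, name: 'left_eye' },
  { y: 89.0, x: 279.22, score: 0.18, name: 'right_eye' }, // low confidence
];

// Keep only keypoints TFJS is reasonably confident about
const confident = (kps, minScore = 0.3) =>
  kps.filter((kp) => kp.score >= minScore);

// Look up a keypoint by its name label
const byName = (kps, name) => kps.find((kp) => kp.name === name);

console.log(confident(keypoints).length); // → 2 (drops the low-score right_eye)
console.log(byName(keypoints, 'nose').x); // → 289.71
```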

Here is a diagram of all the keypoints on a human model (indices are simply the order of the keypoints returned):

Keypoint diagram
(More detailed info here about the keypoint diagram)

This is all the information we get from TensorFlow.js, and we need to somehow use this data to fit our needs. We are going to get two sets of this type of data: one for the dance video that we need to match, and one for our live webcam feed.

We need to use this data to give the player a score that tells them how well they're doing. How can we take raw 2D positional data and turn it into something useful? And once we turn it into something useful, how can we determine how well a person is performing the correct dance move?

Initial Thoughts

These were my initial, unsorted thoughts:

Base the keypoint data positions on a center, average position in the middle of the chest. This way, when the person moves around the screen, the center moves with them, so each keypoint's position relative to that center stays stable. By applying this to the live keypoint data as well, both data sets will be in a somewhat normalized space.
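Here's a rough sketch of what that centering filter could look like. I'm assuming here that the "chest center" is the average of the shoulder and hip keypoints; that's my own choice of reference points:

```javascript
// Center keypoints on an average "chest" position so that moving
// around the screen doesn't change the data we compare.
// Assumption: chest center = mean of the shoulder and hip keypoints.
const CHEST = ['left_shoulder', 'right_shoulder', 'left_hip', 'right_hip'];

function centerOnChest(keypoints) {
  const chest = keypoints.filter((kp) => CHEST.includes(kp.name));
  const cx = chest.reduce((sum, kp) => sum + kp.x, 0) / chest.length;
  const cy = chest.reduce((sum, kp) => sum + kp.y, 0) / chest.length;
  // Every keypoint becomes relative to the chest center
  return keypoints.map((kp) => ({ ...kp, x: kp.x - cx, y: kp.y - cy }));
}
```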

Next up is the problem of seeing how well the keypoint data sets match.

A person might be taller or shorter or have different body size or limb proportions than the dancer in the video, so how do we scale/transform them to match? It needs to be a connection/limb based scaling/transformation, because simply scaling someone on the y axis down won't always work. Someone might have a long torso and short arms, or a short torso and long arms. These need to be taken into account, so we need to transform the distances between each of the keypoints.

We will need to get measurements of a person before they begin. We'll have them do a T-pose and record the measurements of each limb.

But how can we get the measurements of the dancer that they are following in the video? That dancer isn't going to T-pose for you.

During the analysis of the dance with TFJS, we could also record the maximum length of each limb/connection. We use the maximum instead of an average because a person can't stretch past their maximum limb length - that's just their limb length.
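That max-tracking could look something like this (the limb list here is just a partial sketch of the connection pairs, not the full set):

```javascript
// Track the maximum observed length of each limb/connection across frames.
// LIMBS is a partial, illustrative list of (from, to) keypoint name pairs.
const LIMBS = [
  ['left_shoulder', 'left_elbow'],
  ['left_elbow', 'left_wrist'],
  ['left_shoulder', 'left_hip'],
  // ...and so on for the remaining connections
];

const maxLengths = {};

function recordMaxLengths(keypoints) {
  const byName = Object.fromEntries(keypoints.map((kp) => [kp.name, kp]));
  for (const [from, to] of LIMBS) {
    const a = byName[from];
    const b = byName[to];
    if (!a || !b) continue; // keypoint not detected this frame
    const length = Math.hypot(a.x - b.x, a.y - b.y);
    const key = `${from}-${to}`;
    // A person can't stretch past their real limb length,
    // so the max over all frames approximates it
    maxLengths[key] = Math.max(maxLengths[key] ?? 0, length);
  }
}
```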

Now that we have corresponding limb lengths of both dancers, how do we transform one to "fit" the other?

We need to scale each limb along its axis, taking all other connected points with it.

For example, if one dancer's shoulders are farther apart than the dancer we are comparing to, we need to shift those shoulders closer together. Shifting these shoulders closer together will also cause the arms to shift in closer, because otherwise we would have really long arms. And shifting the arms is shifting multiple, connected keypoints.

The General Plan

First, record the dance video keypoint data:

  1. Run the video through MoveNet and record all keypoint data at each frame in the video.
  2. Run this data through a filter to make each keypoint position based on the average chest position at that point.
  3. Convert keypoint positions and limb lengths from pixel values to another unit that's not based on how many pixels they take up. We can take the body length (torso length + leg length) and divide everything by it to get all measurements relative to the body length. For example, shoulder-to-elbow length might be 0.2 BLU, or body-length-units. The torso itself might be closer to 0.4 BLU.
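Step 3 can be sketched like this. I'm assuming torso length is measured between the shoulder midpoint and hip midpoint, and leg length down one side (hip to knee to ankle); those are my own choices of reference segments:

```javascript
// Compute the body length (torso + leg) in pixels, the BLU reference unit.
// Assumption: torso = shoulder-center to hip-center, leg = one side's
// hip -> knee -> ankle.
function bodyLengthPx(keypoints) {
  const get = (name) => keypoints.find((kp) => kp.name === name);
  const mid = (a, b) => ({ x: (a.x + b.x) / 2, y: (a.y + b.y) / 2 });
  const dist = (a, b) => Math.hypot(a.x - b.x, a.y - b.y);

  const torso = dist(
    mid(get('left_shoulder'), get('right_shoulder')),
    mid(get('left_hip'), get('right_hip'))
  );
  const leg =
    dist(get('left_hip'), get('left_knee')) +
    dist(get('left_knee'), get('left_ankle'));
  return torso + leg;
}

// Any pixel measurement divided by the body length is in BLU
const toBLU = (pixels, bodyLength) => pixels / bodyLength;
```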

Now we can take the live video and transform its keypoint data to the expected dance video keypoint data:

  1. Get the player's measurements by having them make a T-pose and running it through MoveNet. Get the measurements in BLU.
  2. Run the video through MoveNet and get the keypoint data for the current frame.
  3. Run this data through a filter to make each keypoint position based on the average chest position at that point.
  4. Convert keypoint positions and limb lengths from pixels to BLU.
  5. Transform player BLU keypoints and limb lengths to dancer BLU keypoints and limb lengths.
  6. Compare the distances of player vs dancer BLU keypoint positions to see how well the player is performing the dance.

Transforming the data in step 5 will be a difficult step. In BLU, every body part is relative to the body length, so we need to match up the body length, then match up each limb length.

Another issue that might come up is if the dancer in the video moves closer to or farther from the camera. This might throw off BLU measurements if BLU only uses the absolute maximum limb lengths, rather than limb lengths at the current point in time. This can probably be solved by detecting whether the dancer is moving closer or farther and then scaling the limb lengths accordingly, which will affect the BLU measurements.

How do we detect the approximate distance of a person from the camera, though? We can potentially use the side lengths of the abdomen since those won't change much, even when spinning or rotating. Those would only change if the person was lying on the ground and wasn't facing the camera. Or we could take the BLU reference unit (total body length in pixels) and divide that by the height of the video. It would still be skewed if the person rotated in a way that made them appear to have a shorter abdomen or legs, but it could work.
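The second idea is easy to sketch. The function name and the idea of normalizing against a reference frame's ratio are mine; this is just one way it might be wired up:

```javascript
// Rough depth proxy: the BLU reference unit (total body length in pixels)
// relative to the frame height. As the dancer moves toward the camera the
// ratio grows; dividing by a reference frame's ratio gives a scale factor
// we could use to correct limb lengths.
function depthScale(bodyLengthPx, videoHeightPx, referenceRatio) {
  const ratio = bodyLengthPx / videoHeightPx;
  return ratio / referenceRatio; // > 1 means closer than the reference frame
}

console.log(depthScale(200, 400, 0.25)); // → 2 (appears twice as close)
```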

Also, some dance videos zoom in/out. This must be taken into account somehow as well.

Scoring After Transforming

After applying the above transformation methods to make the keypoints as similar as possible, we need to figure out a scoring method to determine how similar the two data sets are.

We could use some sort of 2D distance formula combined with a threshold. Say, a distance of 5 units (I say units here because the measurements are currently arbitrary) is the maximum distance someone can be from the expected keypoint. That would be a score of 0, and a distance of 0 would be a score of 1. Anything in between would be on a sliding scale, but what kind of sliding scale? Linear, quadratic, cubic, or something different? It could be good to have a quadratic scale so it's easier to match to start, but gets more difficult as you get closer to matching it. Or, on the flip side, it could get easier as you get closer. This would help to account for errors within TensorFlow.js as well as stuttering or other issues.
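As a sketch, here's what the quadratic variant of that sliding scale might look like (the function shape and the default threshold of 5 units are arbitrary choices, since the scale is still an open tuning question):

```javascript
// Map a keypoint's distance from its expected position to a 0..1 score.
// At or beyond maxDist the score is 0; at distance 0 it's 1.
// The quadratic curve is one option among linear/quadratic/cubic.
function keypointScore(dx, dy, maxDist = 5) {
  const dist = Math.hypot(dx, dy);
  if (dist >= maxDist) return 0;
  const t = 1 - dist / maxDist; // 1 = perfect match, 0 = at threshold
  return t * t; // quadratic sliding scale
}

console.log(keypointScore(0, 0)); // → 1 (perfect)
console.log(keypointScore(3, 4)); // → 0 (distance 5, at the threshold)
```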

After Some Research

The above solution might have worked, but it's non-trivial to implement and might not work. I want guaranteed results, or at least guaranteed something. The proposed solution above doesn't guarantee that you'll get anywhere close to "good" results.

I did some more research and found this blog post from someone who had the exact same problem as me. They had keypoints from two different images of people that they wanted to compare to see how similar they were.

Perfect, I can just copy what this person did and I'll be done, right?

Nope. Not quite.

That's initially what I tried to do, at least. I read the blog post multiple times and learned a lot just from that, especially about body segmentation and checking for rotational outliers (which we'll get into in a bit). But the code snippets were written in Python, and I'm using JavaScript. It's non-trivial to convert from mostly numpy code to plain JavaScript, so I had to find a math library that would work similarly to numpy.

I first tried using the math.js library. It seemed to work alright, but I wasn't able to figure out how to solve for the affine matrix, and there weren't many examples online. In some of those examples though, I found a different math library: ml-matrix.

As you can probably tell by the name, this is a math library built specifically for operations that deal with matrices. That's exactly what I needed for this new affine matrix method. (Also, if you're confused about this affine matrix stuff, please read the blog post I linked! It gives a pretty good explanation of some of the terms I'll be using)

After much trial and error, I was able to get it "working" with the matrix library. The results were not good: the transformation didn't work at all. Something was very wrong with the math or the method, and I couldn't figure out why it wasn't working.

I continued to search and reread the aforementioned blog post, and decided that I would do more research on the transformation method used: Procrustes analysis.

It's a pretty interesting and fitting name if you read the first couple sentences in the Wikipedia article. I figured that I could learn the general algorithm for the method and write it in JavaScript myself.

That turned out to be really difficult! So I hit the drawing board again with another method that came to mind: searching the NPM site. There are tons and tons of packages available, so I figured that someone must have come before me and made something that uses the Procrustes analysis technique. I have two sets of 2D points and I just want to know how similar they are.

I searched for "procrustes" on the site, and there were three packages, to my surprise (I wasn't expecting any). One of them, curve-matcher, sounded exactly like what I wanted.

From the description, it states:

The core of curve-matcher is a function called shapeSimilarity which estimates how similar the shapes of 2 curves are to each other, returning a value between 0 and 1.

two images of matching and non-matching curves

This was exactly what I wanted. Simple, easy to use, and gives me all the information I need. Not to mention that it also has some nice customization options for fine-tuning later, such as setting a maximum rotation angle (which solves one of the issues from the pose comparison article earlier).

I tested it using my webcam versus a video, and it worked quite well. I used the 3-part body segmentation technique discussed in the pose comparison article, which splits all keypoints into three sets for the head, torso, and legs. Each segment is compared separately, so I got three different similarity scores.

If I was doing the movement just right, the score would be about 95%. If I was doing it wrong, it would be 80% or lower. Because the head is a separate segment, it even took head rotation into account! Simply rotating my head from the expected position dropped the similarity score greatly.

It wasn't perfect, but for a first test, the results were quite promising. And with this step done, the initial prototyping and tests are complete! The project is 100% feasible and all the pieces are in place. Now all we have to do is create a cool looking website and put everything in place.

Finalized Plan

The new and improved, final plan looks something like this:

  1. Run MoveNet on each frame of the video. Store that for later.
  2. Run MoveNet on each frame of the webcam stream.
  3. Compare the stored data from the video with the live data from the webcam stream, using the curve-matcher package.

And that's about it! There's a bunch of nuance to this and extra steps, but this is the general gist. This is the core of the entire game, and it works!

The next step is to actually create the game! This includes all of the UI and backend logic to help things flow smoothly. Stay tuned for updates on that!
