Part 2 — Recognizing Sign Language
In Part 1 we learned about Hidden Markov Models, a powerful tool to represent a partially observable world. Now we’ll see another HMM representation, one that will help us recognize American Sign Language.
In Part 1 our variables were binary, e.g. rain or no rain, umbrella or no umbrella. What if the data we need to model is continuous? Imagine we have a signal whose value is a function of time.
A signal through time
At $t=0$ its value is $-2$ . At $t=10$ it’s $-1$ . At $t=15$ it’s $0$ . At $t=35$ it’s $1$ . Finally, at $t=38$ it’s $2$ . It looks like there are four parts of this function so we’ll use four states in our HMM. The goal is to design a model that could have generated this signal. We’ll use a representation similar to the one we learned about in Part 1, except that states will include self-transitions, meaning that at each time frame you could either transition to another state or stay in the same state.
Remember the binary evidence variables from Part 1? Another name for them is emissions or outputs. Here they are no longer binary but continuous. Strictly speaking, these output distributions are probability densities, but for our purposes we can think of them as probabilities. So we need to figure out these output (evidence) probabilities.
First we need to figure out which values are allowable while we’re in a given state. We will be in the first state while the value of the signal is between $-2$ and $-1$ . In the second state the value is between $-1$ and $0$ . Third state is between $0$ and $1$ and the fourth state is when the signal is between $1$ and $2$ . In each of those intervals all the values are equally represented so a boxcar distribution works pretty well to describe how the value will fall in that range.
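As a rough illustration, the boxcar outputs for the four states could be sketched as below. The names `boxcar_pdf`, `STATE_RANGES` and `emission_densities` are made up for this example, and the ranges are simply the ones read off the signal above.

```python
def boxcar_pdf(x, low, high):
    """Density of a uniform (boxcar) distribution on [low, high]."""
    if low <= x <= high:
        return 1.0 / (high - low)
    return 0.0


# Value range covered by each state, as read off the signal.
STATE_RANGES = {1: (-2, -1), 2: (-1, 0), 3: (0, 1), 4: (1, 2)}


def emission_densities(x):
    """Density of observing the value x in each of the four states."""
    return {state: boxcar_pdf(x, low, high)
            for state, (low, high) in STATE_RANGES.items()}


print(emission_densities(-1.5))
```

Note that in this sketch a boundary value such as $-1$ falls inside two neighbouring ranges, so both states give it a non-zero density.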
Now we need the transition probabilities. Let’s start with the transition where we escape state 1. In state 1, when the signal is between $-2$ and $-1$ , we spend $10$ time frames ( $0$ to $10$ on the horizontal axis). If we expect to spend $10$ time frames in state 1, then about once every $10$ frames we’ll escape, so the probability of leaving the state is $1/10 = 0.1$ . Since there is only one other transition (the self-transition) and all transitions out of a state must sum to $1$ , the self-transition has a probability of $0.9$ . We repeat this for state 2, which lasts $5$ time frames (exit probability $1/5 = 0.2$ ), state 3 with $20$ frames ( $1/20 = 0.05$ ), and state 4 with $3$ frames ( $1/3 \approx 0.33$ ).
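The same rule of thumb (exit probability is one over the expected number of frames in a state) can be written as a small sketch. The function and variable names are hypothetical; the frame counts are the ones read off the signal.

```python
# Number of frames the signal spends in each state, read off the plot.
FRAMES_IN_STATE = {1: 10, 2: 5, 3: 20, 4: 3}


def transition_probabilities(expected_frames):
    """Exit and self-transition probability for a state that is
    expected to last `expected_frames` time frames."""
    exit_p = 1.0 / expected_frames
    return {"exit": exit_p, "stay": 1.0 - exit_p}


transitions = {state: transition_probabilities(frames)
               for state, frames in FRAMES_IN_STATE.items()}

for state, p in transitions.items():
    print(f"state {state}: stay={p['stay']:.3f}, exit={p['exit']:.3f}")
```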
What we’re doing is creating a model by inspection to represent a given signal. In reality we’ll want to have many examples of a signal and create a model that can accommodate all the examples. It will need to strike a balance between two forces:
- Being specific enough to recognize the signal.
- But not so specific that it would fit only the data it already knows and not any new, slightly different examples. This is called overfitting.
American Sign Language
To show how HMMs work we’ll use only one feature to represent the movements that constitute a sign in ASL. We’ll use the distance that the hand has traveled on the $y$ axis with respect to the previous frame. It’s a very limited model since we’re not tracking the movements on the $x$ axis or more than one hand or how the fingers are positioned. Still, HMMs are so powerful it will be enough to show how we can tell two symbols apart with just that information.
For the word “I” the symbol has three motions, giving us three states:
- Hand goes up to the chest.
- Hand stays there momentarily.
- Hand goes down.
This is how the change in the vertical axis behaves for these 3 movements.
- The hand goes up, accelerating up to the point where the hand position in a frame was 10 units higher than in the preceding frame, then slows down until it stops (the frame shows the hand in the same position as the last frame) showing a change of zero.
- The hand stays in the same position for a little while with zero movement.
- The hand goes down, accelerating until it peaks at 10 units lower than the previous frame, then slows down until it stops.
In our earlier example we used a boxcar distribution for our signal, where all the values in the interval were equally represented. For hand movements in sign language we’ll make a different assumption: that the changes in hand position follow a normal distribution, also known as a Gaussian, where most values fall around the average and fewer fall further away at both ends.
Furthermore, when the hand goes up or down we expect to see quite a bit of variability in the amount of movement between different people and different examples, while the part where the hand stays at chest level will have very little variation in the change of position across examples.
This means our Gaussian will be wider for the up and down movements and quite narrow for the part where the hand stays in place. Assuming a normal distribution of values ranging from $0$ to $10$ the average will be $5$ . Similarly, for values between $0$ and $-10$ the average will be $-5$ and our output probability density functions (PDFs) will look like this:
PDFs for “I”
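To make the shapes concrete, here is a minimal sketch of the three output densities for “I”. The means ( $5$ , $0$ , $-5$ ) come from the text; the standard deviations are assumptions picked only to make the middle Gaussian narrow and the other two wide, and `gaussian_pdf` and `I_MODEL` are names invented for this example.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# (mean, std) per state. Means are from the article; stds are ASSUMED.
I_MODEL = {"S1": (5.0, 2.5), "S2": (0.0, 0.5), "S3": (-5.0, 2.5)}

for state, (mean, std) in I_MODEL.items():
    peak = gaussian_pdf(mean, mean, std)
    print(f"{state}: mean={mean:+.1f}, std={std}, peak density={peak:.3f}")
```

With these assumed values the narrow Gaussian for the second state has a much taller peak than the wide ones, since the area under every density has to be $1$ .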
We now have our outputs! As a first pass at getting the transition probabilities we’ll pick something reasonable. For the three distinct motions of the symbol for “I”, which gives us $3$ states, it looks like we spend a lot of time in states 1 and 3 as the hand goes up and down and only a moment in state 2 when the hand stays at chest level. So let’s give states 1 and 3 a low exit probability meaning we’re likely to spend more frames there, and for state 2 a little higher to allow a higher probability of exiting that state so we don’t stay there too long.
Transition probabilities for “I”
Now let’s do the same thing for the symbol for “We”. It will have three motions too:
- Hand goes up to chest.
- Hand moves horizontally across chest.
- Hand goes down.
We could have had more states to depict pausing between these movements but to keep it simple we’ll stick to three states. Keeping a similar structure for “I” and “We” will also help us see how we can detect even subtle differences.
A few things to notice: The main difference in our models for “I” and “We” is the second movement. “We” has more variability in the feature we’re measuring (change in the $y$ axis) because as we’re moving the hand across the chest it might move up or down a little bit, while “I” keeps the hand in one place having very little variability around zero. This means our Gaussian for the second state in “We” will be wider and shorter. Another thing to notice about this second state is that “We” spends more time on it than “I” does so the transition probabilities will reflect that by giving a lower probability to the exit transition and therefore a higher one to the self-transition.
Feature: Change in y axis for “We”
States for “We”
Gaussians for “We”
We’re ready for some recognition! Suppose we have a set of observations that represent the samples we want to recognize, for example a bunch of videos of people signing, and we want to know which signs they are making. We’ll use a tool called a Viterbi trellis to see how likely it is that the samples were generated by the models we built. In other words, how likely it is that those signs are an “I” or a “We” in sign language according to our models. The model that gives us the highest probability will be considered the match.
So we have an observation $O$ with the values of the feature we’re tracking ( $\Delta y$ ) at each time frame. We want to find the probability of that observation given the model for “I”, which we denote with the Greek letter $\lambda$ . In other words, we want $P(O \mid \lambda)$ :
Probability of the observation given our model for “I”
We’ll start by laying out the trellis. For each of the seven time frames in our observation we’ll have the observed values and a node for each of the three states. Putting it all together:
Transition probabilities for “I”
Gaussians for “I”
Viterbi trellis
We’re calculating the probability that the observation would be produced by our model for “I”. We’ll trace the states that we could be in if the observation were to match the model. Follow along by reading each step and seeing how it plays out in the next figure.
- We have to start in state $S1$ at $t=1$ .
- We have to end in $S3$ at $t=7$ .
- At $t=1$ we can go from $S1$ to $S1$ (self-transition) or to $S2$ .
- At $t=2$ we can go from $S1$ to $S1$ or to $S2$ and from $S2$ to $S2$ or to $S3$ .
At this point we’ve “touched” all three states from the beginning on the left side of the trellis (blue arrows) so let’s now walk back from the end (green arrows).
- At $t=7$ the only way to get to $S3$ (remember we must end at $S3$ ) is from either $S2$ or a self-transition from $S3$ .
- At $t=6$ the only way to get to $S2$ is from $S1$ or $S2$ and the only way to get to $S3$ is from $S2$ or $S3$ .
- At $t=5$ you can get to $S1$ from $S1$ , to $S2$ from $S1$ or $S2$ , and to $S3$ from $S2$ or $S3$ .
- At $t=4$ you can get to $S1$ from $S1$ , to $S2$ from $S1$ or $S2$ , and to $S3$ from $S2$ .
Transition probabilities for “I”
Viterbi trellis for “I”
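The forward and backward walks above boil down to two constraints: a state can only be used once enough frames have passed to reach it from $S1$ , and only while enough frames remain to still reach $S3$ . A small sketch of that bookkeeping follows, assuming each step either stays in the same state or advances by exactly one; the function name `allowed_states` is made up for this example.

```python
def allowed_states(num_states, num_frames):
    """States that can be occupied at each frame of a left-to-right HMM
    that starts in state 1 at t=1 and ends in the last state at the
    final frame, moving at most one state forward per frame."""
    trellis = {}
    for t in range(1, num_frames + 1):
        # Reachable from the start: state s needs at least s - 1 steps.
        reachable_from_start = {s for s in range(1, num_states + 1) if s <= t}
        # Can still reach the end: needs num_states - s remaining steps.
        can_reach_end = {s for s in range(1, num_states + 1)
                         if num_states - s <= num_frames - t}
        trellis[t] = sorted(reachable_from_start & can_reach_end)
    return trellis


for t, states in allowed_states(3, 7).items():
    print(f"t={t}: " + ", ".join(f"S{s}" for s in states))
```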
Now we have to add the transition probabilities to our trellis. We get them from our model for “I”.
Viterbi trellis for “I” with transition probabilities
That takes care of the transitions but how do we get the overall probabilities? Since we’re building this by inspection just to show the general mechanism, we won’t be using very exact numbers. In reality we would have true probability density functions but in this high-level example we’re going to use approximate figures based on the assumption that the feature we’re tracking behaves a certain way, namely following a normal distribution.
We’ll start by looking at $t=1$ . Our observation was a $\Delta y$ of $3$ and at that time frame we can be only in State 1. State 1 has a mean of $5$ so it seems more or less reasonable to find a value of $3$ . It’s not too far from the mean. Again, in real life we would have a real distribution for these outputs and we’d be able to calculate the exact probability of getting a $3$ but for this example we’ll use our best guess. As long as we’re consistent, it will work. Let’s say our estimate of the probability that the Gaussian for State 1 generates a $3$ is $0.5$ . We’ll update the corresponding node with a $0.5$ .
Gaussians for “I”
Output probabilities for “I”
Now we move on to the nodes at $t=2$ . We have two nodes there, one for State 1 ( $\overline{S1}=5$ ) and one for State 2 ( $\overline{S2}=0$ ). What’s the probability of getting the observed value of $7$ in State 1? $7$ is $2$ units away from the mean of $5$ just like $3$ was $2$ units away from it so it seems reasonable to give it the same probability of $0.5$ . How about State 2? It has $\overline{S2}=0$ and a very small standard deviation, which we can tell by its narrow shape. The probability of getting a $7$ there is very small, almost zero. So let’s pick a very small number. Say, $10^{-7}$ .
Output probabilities for “I”
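If we did have actual Gaussians, the same reasoning could be checked numerically. The sketch below uses Python’s `statistics.NormalDist`; the standard deviations are assumptions, and the resulting densities are not the hand-picked $0.5$ and $10^{-7}$ used in this walkthrough. It only illustrates the two arguments made above: values equally far from the mean get the same density, and a narrow Gaussian gives almost nothing to a value far from its mean.

```python
from statistics import NormalDist

# Means are from the article; the standard deviations are ASSUMED.
state1 = NormalDist(mu=5.0, sigma=2.5)   # wide: hand moving up
state2 = NormalDist(mu=0.0, sigma=0.5)   # narrow: hand held in place

density_3_in_s1 = state1.pdf(3)
density_7_in_s1 = state1.pdf(7)
density_7_in_s2 = state2.pdf(7)

print("S1 at 3:", density_3_in_s1)
print("S1 at 7:", density_7_in_s1)   # same distance from the mean as 3
print("S2 at 7:", density_7_in_s2)   # far out in the tail of a narrow Gaussian
```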
Follow the same process to fill out the rest of the nodes in the trellis and you’ll have something similar to this.
Output probabilities for “I”
Now we need to find the most likely path. Along this path we’ll multiply the transition probability times the output probability. So what’s the path with the highest probability? Note that the highest probability path will not necessarily be the greedy one. In other words, picking the highest value at each transition may not lead to the highest value for the overall path.
So let’s consider the step from $t=1$ to $t=2$ . Our options are staying in State 1 or moving to State 2. The value of staying in State 1 is $.8(.5) = .4$ while the value of moving to State 2 is $.2 (10^{-7}) = 2\mathrm{e}{-8}$ . The greedy algorithm would choose to stay in State 1 since that value is bigger. But we need to keep track of all the possible sequences in the trellis so we can choose the path with the highest overall value, not the one where we end up by simply following what looks best at that moment without reconsidering our choices.
We’ll keep going through the time frames looking at all the possible paths, multiplying the transition and output probabilities from the start to the end of each path. At each time frame we’re going to keep the path with the maximum value for each state.
For example, at $t=3$ we have four possible paths. Two of them end in state 2 and we keep only the more probable one; states 1 and 3 have a single candidate each.
- $1 \times .5 \times .8 \times .5 \times .8 \times .6 = 0.096$ path to state 1 (best for state 1)
- $1 \times .5 \times .8 \times .5 \times .2 \times 10^{-5} = 4\mathrm{e}{-7}$ path to state 2 (best for state 2)
- $1 \times .5 \times .2 \times 10^{-7} \times .5 \times 10^{-5} = 5\mathrm{e}{-14}$ path to state 2 (discarded)
- $1 \times .5 \times .2 \times 10^{-7} \times .5 \times 10^{-4} = 5\mathrm{e}{-13}$ path to state 3 (best for state 3)
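The four products can be checked with a few lines of Python. The factors are copied from the list above; the path labels and variable names are just for this sketch.

```python
from math import prod

# Factors along each candidate path up to t=3, in the order they are
# multiplied in the text (start, output, transition, output, ...).
paths = {
    "S1 -> S1 -> S1": [1, .5, .8, .5, .8, .6],
    "S1 -> S1 -> S2": [1, .5, .8, .5, .2, 1e-5],
    "S1 -> S2 -> S2": [1, .5, .2, 1e-7, .5, 1e-5],
    "S1 -> S2 -> S3": [1, .5, .2, 1e-7, .5, 1e-4],
}

scores = {name: prod(factors) for name, factors in paths.items()}

# Keep only the best-scoring path for each end state.
best_per_state = {}
for name, score in scores.items():
    end_state = name.split(" -> ")[-1]
    if score > best_per_state.get(end_state, ("", 0.0))[1]:
        best_per_state[end_state] = (name, score)

for end_state, (name, score) in best_per_state.items():
    print(f"best path into {end_state}: {name} ({score:.3g})")
```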
At the end at $t=7$ we choose the most likely overall path. For our example it’s the one shown in the next diagram and it has a probability of $0.00035$ .
Viterbi path for “I”
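For readers who want to see the whole procedure in one place, here is a minimal sketch of the Viterbi pass for a left-to-right model like ours. It is not the exact computation behind the figures: the standard deviations, the self-transition of the third state, and every observation after the first two are assumptions, and it uses Gaussian densities instead of the hand-estimated output values, so its score will not match $0.00035$ . All names are invented for this example.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def viterbi_left_to_right(observations, means, stds, stay_probs):
    """Best path through a left-to-right HMM that must start in the first
    state and end in the last one. Returns (score, 1-based state path)."""
    n = len(means)
    # score[s] is the best path score ending in state s at the current frame.
    score = [0.0] * n
    score[0] = gaussian_pdf(observations[0], means[0], stds[0])
    backpointers = []
    for obs in observations[1:]:
        new_score = [0.0] * n
        pointers = [0] * n
        for s in range(n):
            stay = score[s] * stay_probs[s]
            advance = score[s - 1] * (1.0 - stay_probs[s - 1]) if s > 0 else 0.0
            if advance > stay:
                best, pointers[s] = advance, s - 1
            else:
                best, pointers[s] = stay, s
            new_score[s] = best * gaussian_pdf(obs, means[s], stds[s])
        backpointers.append(pointers)
        score = new_score
    # Trace back from the last state at the final frame.
    path = [n - 1]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[n - 1], [s + 1 for s in path]


MEANS = [5.0, 0.0, -5.0]   # from the article
STDS = [2.5, 0.5, 2.5]     # ASSUMED widths
STAY = [0.8, 0.5, 0.8]     # .8 and .5 appear in the text; the last is ASSUMED
# Only the first two values (3 and 7) come from the article; the rest are made up.
observations = [3, 7, 2, 0, -3, -7, -2]

score, path = viterbi_left_to_right(observations, MEANS, STDS, STAY)
print("most likely path:", path)
print("path score:", score)
```

Running the same function with the parameters of a second model (for example “We”) and comparing the two scores is how the match would be chosen.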
A few ways to express our interpretation of this result. These all mean the same thing:
- This is the probability that this observation sequence was generated by our model for “I”.
- This is the probability that this observation corresponds to the symbol for “I”.
- This is the probability that the person making the sign depicted by these observed data meant to say “I” in sign language.
Now we can do the same process for “We” and compare that result to this one. For example, if doing this for “We” results in a most likely path with probability $0.00016$ , that would mean it’s about twice as probable that the model for “I” generated these data.
This shows us how powerful HMMs can be for distinguishing which sign is the correct one even with relatively bad features. Remember that we chose $\Delta y$ even though it’s not the best way to tell “I” and “We” apart.
One last note about these probabilities. You might be wondering what happens as these observation sequences get longer, making the probability smaller and smaller until our computers run out of precision to represent these numbers correctly. This is called underflow. In practice, we wouldn’t use the raw probabilities as we did in this example but the log probabilities instead, adding them up rather than multiplying, to avoid this problem.
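A quick way to see the problem is to multiply many small factors together. In this sketch the per-frame factor and the number of frames are arbitrary values chosen for illustration.

```python
import math

step_probability = 1e-5   # hypothetical per-frame factor
num_frames = 100          # hypothetical sequence length

raw = 1.0
log_score = 0.0
for _ in range(num_frames):
    raw *= step_probability                   # product of raw probabilities
    log_score += math.log(step_probability)   # sum of log probabilities

print(raw)        # underflows to 0.0 in double precision
print(log_score)  # still a perfectly usable finite number
```

Because the logarithm is monotonic, comparing log scores picks the same winning path or model as comparing the raw products would.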
We’re done! You now have a good understanding of how Artificial Intelligence would go about interpreting sign language. This technique is used in a wide variety of classification tasks. Keep learning and come back soon for more AI content.