Amool-kk for GNU/Linux Users' Group, NIT Durgapur


Expressando (Part 2): A real-time sign language detection system

This is a follow-up article from our previous post.

In the previous post, we covered virtual environments, configuring webcam input with OpenCV, convex hulls and how to display them, and checking for convexity defects in the camera input. In this post, you will learn how to collect data through OpenCV, see a demonstration of data collection, get an introduction to TensorFlow and Convolutional Neural Networks (CNN), and finally run live prediction of your customised sign language.

Collecting data through OpenCV and labeling them

Create a file named collect-data.py inside the TDoC-2021 directory. As the name suggests, this script collects images from the webcam using the OpenCV library, creates one or more datasets depending on the input, and stores them in directories organised by class.

First, activate your virtual environment (this is important, as OpenCV is installed inside the virtual environment and not globally). Then import OpenCV as cv2 and the built-in os module:


import cv2

import os


We will be checking for the presence of the data directory, which will be used to store the train and test data. If the data folder is already created then, it will not create the folders, and proceed to store the data in the form of datasets, each representing a class.


if not os.path.exists("data"): #True

  os.makedirs("data")

  os.makedirs("data/train") 

  os.makedirs("data/test")

  os.makedirs("data/train/0") 

  os.makedirs("data/train/1")

  os.makedirs("data/train/2")

  os.makedirs("data/train/3")

  os.makedirs("data/train/4")

  os.makedirs("data/train/5")

  os.makedirs("data/test/0")

  os.makedirs("data/test/1")

  os.makedirs("data/test/2")

  os.makedirs("data/test/3")

  os.makedirs("data/test/4")

  os.makedirs("data/test/5")


Here, os.path.exists returns a Boolean indicating whether the data directory already exists, and the if not statement decides whether the following lines run. If os.path.exists returns False, the os.makedirs calls create the necessary directories.
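If you prefer a more compact version, the same directory tree can be created with a loop and exist_ok=True. This is an equivalent sketch, not the code used in the rest of this post:

import os

# Create data/train/0..5 and data/test/0..5; exist_ok=True skips folders
# that already exist, so the explicit if-check is no longer needed.
for split in ("train", "test"):
    for label in range(6):
        os.makedirs(os.path.join("data", split, str(label)), exist_ok=True)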

Now we will create a couple of variables holding the path of the folders where the images will be saved. After that, we collect the required images via the webcam, using the cv2.VideoCapture() method that we have already learned how to initialise.
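For reference, this setup appears in the full listing later in this section: mode decides whether images go to the train or test folders, and cv2.VideoCapture(0) opens the default webcam.

mode = 'train'                    # switch to 'test' to store images under data/test
directory = 'data/' + mode + '/'  # e.g. data/train/

cap = cv2.VideoCapture(0)         # 0 selects the default webcam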

To make data collection more interactive and visually appealing, we will display some statistics in the live view of the webcam. To know how many images have already been collected for each class, we keep the counts in a dictionary (count) with one entry per number and draw them on the frame, which OpenCV makes very easy. This overlay is achieved by the following lines of code:


cv2.putText(frame, "MODE : "+mode, (30, 50), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (225,255,255), 1)

cv2.putText(frame, "IMAGE COUNT", (10, 100), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (225,255,255), 1)

cv2.putText(frame, "ZERO : "+str(count['zero']), (10, 120), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (255,255,255), 1)

cv2.putText(frame, "ONE : "+str(count['one']), (10, 140), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (255,255,255), 1)

cv2.putText(frame, "TWO : "+str(count['two']), (10, 160), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (255,255,255), 1)

cv2.putText(frame, "THREE : "+str(count['three']), (10, 180), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (255,255,255), 1)

cv2.putText(frame, "FOUR : "+str(count['four']), (10, 200), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (255,255,255), 1)

cv2.putText(frame, "FIVE : "+str(count['five']), (10, 220), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (255,255,255), 1

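The count dictionary referenced above is built by counting the files already saved in each class folder with os.listdir, exactly as in the full listing further down:

count = {'zero':  len(os.listdir(directory + "/0")),
         'one':   len(os.listdir(directory + "/1")),
         'two':   len(os.listdir(directory + "/2")),
         'three': len(os.listdir(directory + "/3")),
         'four':  len(os.listdir(directory + "/4")),
         'five':  len(os.listdir(directory + "/5"))}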

Next, we will learn about the Region of Interest (R.O.I). A region of interest (ROI) is an area of an image defined for further analysis or processing. Here, the region of interest will contain the hand used for portraying the gesture. We will define it using the following code:


x1 = int(0.5*frame.shape[1])
y1 = 10
x2 = frame.shape[1]-10
y2 = int(0.5*frame.shape[1])

cv2.rectangle(frame, (x1-1, y1-1), (x2+1, y2+1), (255,0,0), 3)
roi = frame[y1:y2, x1:x2]
roi = cv2.resize(roi, (200, 200))
cv2.putText(frame, "R.O.I", (440, 350), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (0,225,0), 3)
cv2.imshow("Frame", frame)

roi = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
_, roi = cv2.threshold(roi, 120, 255, cv2.THRESH_BINARY)
cv2.imshow("ROI", roi)


Here, frame.shape[1] returns the width of the frame (the camera input). The variables x1, y1, x2, and y2 define the two opposite corners (x1, y1) and (x2, y2) of the rectangle. We then draw a rectangle around the ROI using the rectangle() function, so that we can record our gesture inside it. After extracting the ROI, we resize it using the resize() function.

Next, we convert the ROI to grayscale using the cv2.COLOR_BGR2GRAY flag. After the conversion, we apply a simple binary threshold with the cv2.threshold() function, using the cv2.THRESH_BINARY flag.

Next, we will record images into the dataset according to key bindings. We first initialise the interrupt variable with cv2.waitKey(10), which waits up to 10 milliseconds for a key press on every frame. Then we define the key bindings: whenever we press 0, an image for the 'zero' class is recorded, and similarly for the other classes, as shown in the snippet below.
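These bindings appear in the full listing below; the pattern for each key is the same, for example:

# Inside the while loop of collect-data.py:
interrupt = cv2.waitKey(10)          # wait up to 10 ms for a key press
if interrupt & 0xFF == 27:           # Esc closes the window
    break
if interrupt & 0xFF == ord('0'):     # '0' saves the current ROI to data/<mode>/0/
    cv2.imwrite(directory + '0/' + str(count['zero']) + '.jpg', roi)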

Now your collect-data.py should look like the following.


import cv2

import os 



if not os.path.exists("data"): #True

  os.makedirs("data")

  os.makedirs("data/train") 

  os.makedirs("data/test")

  os.makedirs("data/train/0") 

  os.makedirs("data/train/1")

  os.makedirs("data/train/2")

  os.makedirs("data/train/3")

  os.makedirs("data/train/4")

  os.makedirs("data/train/5")

  os.makedirs("data/test/0")

  os.makedirs("data/test/1")

  os.makedirs("data/test/2")

  os.makedirs("data/test/3")

  os.makedirs("data/test/4")

  os.makedirs("data/test/5")





mode = 'train' 

directory = 'data/'+mode+'/' #data/train/



cap=cv2.VideoCapture(0)



while True:

  _, frame = cap.read()

  frame = cv2.flip(frame, 1)



  cv2.putText(frame, "Expressando - TDOC 2021", (175, 450), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (225,255,255), 3)



  count = {'zero': len(os.listdir(directory+"/0")), 

       'one': len(os.listdir(directory+"/1")),

       'two': len(os.listdir(directory+"/2")),

       'three': len(os.listdir(directory+"/3")),

       'four': len(os.listdir(directory+"/4")),

       'five': len(os.listdir(directory+"/5"))} 



  cv2.putText(frame, "MODE : "+mode, (30, 50), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (225,255,255), 1)

  cv2.putText(frame, "IMAGE COUNT", (10, 100), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (225,255,255), 1)

  cv2.putText(frame, "ZERO : "+str(count['zero']), (10, 120), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (255,255,255), 1)

  cv2.putText(frame, "ONE : "+str(count['one']), (10, 140), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (255,255,255), 1)

  cv2.putText(frame, "TWO : "+str(count['two']), (10, 160), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (255,255,255), 1)

  cv2.putText(frame, "THREE : "+str(count['three']), (10, 180), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (255,255,255), 1)

  cv2.putText(frame, "FOUR : "+str(count['four']), (10, 200), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (255,255,255), 1)

  cv2.putText(frame, "FIVE : "+str(count['five']), (10, 220), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (255,255,255), 1)





  x1 = int(0.5*frame.shape[1])

  y1 = 10

  x2 = frame.shape[1]-10

  y2 = int(0.5*frame.shape[1])

  cv2.rectangle(frame, (x1-1, y1-1), (x2+1, y2+1), (255,0,0) ,3)

  roi = frame[y1:y2, x1:x2] 

  roi = cv2.resize(roi, (200, 200)) 

  cv2.putText(frame, "R.O.I", (440, 350), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (0,225,0), 3)

  cv2.imshow("Frame", frame)



  roi = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)

  _, roi = cv2.threshold(roi, 120, 255, cv2.THRESH_BINARY)

  cv2.imshow("ROI", roi)



  interrupt = cv2.waitKey(10) 

  if interrupt & 0xFF == 27:

    break

  if interrupt & 0xFF == ord('0'):

    cv2.imwrite(directory+'0/'+str(count['zero'])+'.jpg', roi)

  if interrupt & 0xFF == ord('1'):

    cv2.imwrite(directory+'1/'+str(count['one'])+'.jpg', roi)

  if interrupt & 0xFF == ord('2'):

    cv2.imwrite(directory+'2/'+str(count['two'])+'.jpg', roi)

  if interrupt & 0xFF == ord('3'):

    cv2.imwrite(directory+'3/'+str(count['three'])+'.jpg', roi)

  if interrupt & 0xFF == ord('4'):

    cv2.imwrite(directory+'4/'+str(count['four'])+'.jpg', roi)

  if interrupt & 0xFF == ord('5'):

    cv2.imwrite(directory+'5/'+str(count['five'])+'.jpg', roi)



cap.release()

cv2.destroyAllWindows()


Demonstration of Data-Collection and Introduction to TensorFlow

Now let us walk through how to collect data with good precision. This is a vital step, as the quality of the collected data governs everything that follows in building the model.

What is a 'Dataset'?

A Dataset is a collection of data that is treated as a single unit by a computer. This means that a dataset contains a lot of separate pieces of data/images but can be used to train an algorithm with the goal of finding predictable patterns inside the whole dataset.

Datasets are of three types:

  • Training Dataset - This dataset contains the data on which the model is trained. It holds the largest share of the data and is what the machine primarily learns from.

  • Validation Dataset - This dataset is used to validate the model while it is being trained. It is a subset of the training data, held out so the model can be checked during training without touching the test data.

  • Test Dataset - This dataset is used for testing the model and measuring the accuracy and loss after training. It is important because it lets the user verify the results on data the model has not seen.

alt Dataset

Now let us go through a quick example of how to collect data:

Run the collect-data.py file using the command:


python collect-data.py


The window will open, taking a reference input from the webcam. It will also display a thresholded image of the area under the rectangle (Region of Interest) in another window. This image will be recorded under the respective directories of the dataset.

  • Let us first begin with the "zero" gesture, which basically involves a closed fist. Bring the gesture inside the Region of Interest so that it is completely enclosed within the rectangle and covers most of its area.

  • Make sure you have a clear background and do not let unexpected disturbances appear inside the rectangle/ROI. Try to collect data that is as clean as possible, and make sure all your fingers are visible.

  • Then start collecting data by pressing '0'. Each press captures the current ROI and stores it in the dataset under the train directory, in the 0 folder.

  • Collect approximately 100-120 images per gesture to train the model. Vary the position and orientation of your hand within the rectangle while collecting, so that the model can recognise the gesture wherever it appears inside the ROI.

alt zero

Repeat the process for the other gestures. To populate the test dataset, change mode = 'train' to mode = 'test' in collect-data.py and collect a smaller set of images for each class.

Now that you have collected the data, let us see what TensorFlow has in store for us.

TensorFlow

Implementing machine learning is complex, especially when it comes to building real-time models. Hence we use TensorFlow, an end-to-end open-source framework, for preparing datasets, training models, and producing predictions from them.

TensorFlow is an open-source framework created by the Google Brain team. It executes machine learning and helps to build neural networks easily (along with Keras) in Python as well as in JavaScript. It provides an environment for experimenting with machine learning algorithms and visualises computation as dataflow graphs, where the nodes represent the operations in the model.

Why TensorFlow

TensorFlow provides pre-built functions and advanced operations APIs to ease the task of building different neural network models. It also provides the required infrastructure and hardware support, which makes it one of the leading libraries used extensively by researchers and students in the deep learning domain.

alt TensorFlow

To begin with TensorFlow, you can go through the following tutorials:

Convolutional Neural Networks (CNN)

Artificial Intelligence has been witnessing a monumental growth in bridging the gap between the capabilities of humans and machines. Researchers and enthusiasts alike, work on numerous aspects of the field to make amazing things happen. One of many such areas is the domain of Computer Vision.

The agenda for this field is to enable machines to view the world as humans do, perceive it in a similar manner, and even use the knowledge for a multitude of tasks such as Image & Video recognition, Image Analysis & Classification, Media Recreation, Recommendation Systems, Natural Language Processing, etc. The advancements in Computer Vision with Deep Learning have been constructed and perfected with time, primarily over one particular algorithm: the Convolutional Neural Network.


To know more about CNN go through the following link: https://deepai.org/machine-learning-glossary-and-terms/convolutional-neural-network

Now let us go through the code:

Create a file named train_model.py inside the TDoC-2021 directory. As the name suggests, it will train the model: a Convolutional Neural Network built with Keras on top of TensorFlow.

First, we import the model and layers we will be using from Keras. Keras uses TensorFlow as its backend, i.e., the functions in Keras make use of TensorFlow functions under the hood.

Here, we make use of the Sequential model. This model is used when you have exactly one input and one output at a time; it simply stacks layers one after another and serves as a container for the connected layers.

To know more about the Sequential model, go here:

The process of building a Convolutional Neural Network always involves four major steps.

  • Convolution

  • Pooling

  • Flattening

  • Full Connection

Now, we will create an object of the Sequential class below:


from keras.models import Sequential

from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense



classifier = Sequential()


Next, we define the first convolutional layer. It slides a set of learnable filters over the input image tensor and produces feature maps (convolved matrices) that highlight patterns such as edges and shapes; each filter effectively acts as a node whose behaviour depends on the image size, colour, and characteristics. After the convolution, we apply Max-Pooling, which reduces the size of each feature map and makes detection more efficient. This is achieved by the following code:


classifier.add(Convolution2D(32, (3, 3), input_shape=(64, 64, 1), activation='relu'))

classifier.add(MaxPooling2D(pool_size=(2, 2)))


Here, the add() function appends successive layers to the Convolutional Neural Network object.

We will be using two types of activation function:

  • relu: It stands for Rectified Linear Unit. It outputs the input directly if it is positive and zero otherwise, which lets the neurons/nodes in a layer activate independently and the network learn non-linear patterns.

  • softmax: It converts the raw outputs of the final layer into a probability distribution over the classes, which is what we want when the model has to choose among multiple gestures.

To know more about activation functions, go here: https://www.v7labs.com/blog/neural-networks-activation-functions
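As a rough intuition, both functions can be written in a few lines of NumPy. This sketch is only for illustration and is not part of the project code:

import numpy as np

def relu(x):
    # Negative values are zeroed out; positive values pass through unchanged.
    return np.maximum(0, x)

def softmax(x):
    # Exponentiate (shifted by the max for numerical stability) and normalise
    # so the outputs sum to 1: a probability distribution over the six gestures.
    e = np.exp(x - np.max(x))
    return e / e.sum()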

Similarly, we add a second convolutional layer to improve detection, as shown in the snippet below. However, we cannot keep adding layers indefinitely, as that slows down training and can also lead to wrong detections.
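The second block, as it appears in the full listing below, mirrors the first one; Keras infers the input shape from the previous layer, so input_shape is only needed on the first layer:

classifier.add(Convolution2D(32, (3, 3), activation='relu'))
classifier.add(MaxPooling2D(pool_size=(2, 2)))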

Next, we flatten the layers and connect them to build the rest of the neural network. Flattening is an important step to understand: we take the 2-D array of pooled image pixels and convert it into a single one-dimensional vector. We use the Flatten layer to do this; no special parameters are needed, since Keras knows that the classifier object already holds the pooled feature maps that need to be flattened.


classifier.add(Flatten())


Now we need to create a fully connected layer and feed it the set of nodes obtained from the flattening step; these nodes act as the input to the fully connected layer. Since this layer sits between the input layer and the output layer, we can refer to it as a hidden layer.

Dense() is the function that adds a fully connected layer. units defines the number of nodes in this hidden layer; the value usually lies between the number of input nodes and the number of output nodes, but choosing the optimal number is largely a matter of experimentation. It is common practice to use a power of 2, and the activation function here is the rectifier (relu).


classifier.add(Dense(units=128, activation='relu'))

classifier.add(Dense(units=6, activation='softmax'))


Now that the model has been defined, you need to compile it.


classifier.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])


To know more about Adam Optimisation, go here: https://www.geeksforgeeks.org/intuition-of-adam-optimizer/

Next, we will load our train and test datasets, pass them to the model, and ultimately fit the model on the images fed into it.


from keras.preprocessing.image import ImageDataGenerator



train_datagen = ImageDataGenerator(

    rescale=1./255,

    shear_range=0.2,

    zoom_range=0.2,

    horizontal_flip=True)



test_datagen = ImageDataGenerator(rescale=1./255)



training_set = train_datagen.flow_from_directory('data/train',

                         target_size=(64, 64),

                         batch_size=5,

                         color_mode='grayscale',

                         class_mode='categorical')



test_set = test_datagen.flow_from_directory('data/test',

                      target_size=(64, 64),

                      batch_size=5,

                      color_mode='grayscale',

                      class_mode='categorical') 


Here, we use ImageDataGenerator, which loads the images, augments them, and converts them into a format the CNN can consume. We first define separate data generators for the training and test data, named train_datagen and test_datagen. The rescale parameter scales the pixel values, while shear_range and zoom_range randomly shear and zoom the training images by up to a factor of 0.2, so that the model can still recognise the gesture under such distortions. Finally, horizontal_flip randomly flips images horizontally, adding more variation in orientation.

After the data generators are created, we generate the training_set and test_set used to build and evaluate the model. We define the directory, the colour mode (grayscale), and the class mode (binary or categorical; here categorical, since we have six classes). We also set the target image size to 64 x 64. The batch size, 5 here, is the number of images passed to the model at once during training.
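If you want to confirm how the six folders are mapped to output neurons, flow_from_directory exposes the mapping it inferred. This is an optional debugging line, not part of the original script:

print(training_set.class_indices)   # e.g. {'0': 0, '1': 1, '2': 2, '3': 3, '4': 4, '5': 5}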

Once the datasets are ready to feed into the model, we fit the model on them and let Keras compute the accuracy and loss metrics.


classifier.fit_generator(

    training_set,

    epochs=10,

    validation_data=test_set)


Here, we pass in the sets generated in the previous steps. The model is trained on training_set and evaluated against test_set, which acts as validation data for measuring accuracy and loss. We train for 10 epochs; an epoch is one complete pass of the entire training dataset through the learning algorithm, so 10 epochs means the model sees every training image 10 times while the accuracy and loss are tracked.

After the model has been trained and tested, we save its weights in the HDF5 (hierarchical data) format and its architecture in JSON (JavaScript Object Notation) format. Saving the architecture as JSON lets the user inspect the model's composition and characteristics, and lets us rebuild the model later for prediction.


model_json = classifier.to_json()

with open("model-bw.json", "w") as json_file:

  json_file.write(model_json)

classifier.save_weights('model-bw.h5')


Now your train_model.py should look like the following.


from keras.models import Sequential

from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense



classifier = Sequential()



classifier.add(Convolution2D(32, (3, 3), input_shape=(64, 64, 1), activation='relu'))

classifier.add(MaxPooling2D(pool_size=(2, 2)))



classifier.add(Convolution2D(32, (3, 3), activation='relu'))

classifier.add(MaxPooling2D(pool_size=(2, 2)))



classifier.add(Flatten())



classifier.add(Dense(units=128, activation='relu'))

classifier.add(Dense(units=6, activation='softmax'))



classifier.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])





from keras.preprocessing.image import ImageDataGenerator



train_datagen = ImageDataGenerator(

    rescale=1./255,

    shear_range=0.2,

    zoom_range=0.2,

    horizontal_flip=True)



test_datagen = ImageDataGenerator(rescale=1./255)



training_set = train_datagen.flow_from_directory('data/train',

                         target_size=(64, 64),

                         batch_size=5,

                         color_mode='grayscale',

                         class_mode='categorical')



test_set = test_datagen.flow_from_directory('data/test',

                      target_size=(64, 64),

                      batch_size=5,

                      color_mode='grayscale',

                      class_mode='categorical') 



classifier.fit_generator(

    training_set,

    epochs=10,

    validation_data=test_set)



model_json = classifier.to_json()

with open("model-bw.json", "w") as json_file:

  json_file.write(model_json)

classifier.save_weights('model-bw.h5')


Run the code in your PowerShell/terminal using:


python train_model.py


Live Prediction of Customised Sign Language

Create a file named prediction.py inside the TDoC-2021 directory. As the name suggests, it will predict the customised signs on which the model has been trained, whenever we show a gesture similar to the training data. First, activate your virtual environment. Also, make sure you have model-bw.h5 and model-bw.json prepared before running the code.

Next, we import cv2, operator (a module from the Python standard library), and model_from_json, which is used to rebuild the model from the JSON file.


from keras.models import model_from_json

import operator

import cv2


Here, we open model-bw.json (present in the same directory as prediction.py) as the file object json_file. We then use read() to store its contents in model_json and close() to close the file.

Next, we rebuild the model architecture with model_from_json() and load the trained weights with load_weights(). When the model has been loaded successfully, "Loaded model from disk" is printed in the terminal/PowerShell.

Next, we initialise a VideoCapture object called cap. Then we declare a dictionary called categories with integers as keys; it maps each class index to the string that will be shown as the prediction.


json_file = open("model-bw.json", "r")

model_json = json_file.read()

json_file.close()

loaded_model = model_from_json(model_json)

loaded_model.load_weights("model-bw.h5")

print("Loaded model from disk")

cap = cv2.VideoCapture(0) 



categories = {0: 'ZERO', 1: 'ONE', 2: 'TWO', 3: 'THREE', 4: 'FOUR', 5: 'FIVE'}


Then we start a while loop that runs as long as video is being captured, until a break statement is encountered. Inside it, we read each frame with the cap.read() function and mirror it horizontally using the flip() function.


while True:

  _, frame = cap.read()

  frame = cv2.flip(frame, 1)



  x1 = int(0.5*frame.shape[1])

  y1 = 10

  x2 = frame.shape[1]-10

  y2 = int(0.5*frame.shape[1])



  cv2.putText(frame, "Expressando - TDOC 2021", (175, 450), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (225,255,0), 3)

  cv2.rectangle(frame, (x1-1, y1-1), (x2+1, y2+1), (255,255,255) ,3)

  roi = frame[y1:y2, x1:x2]



  roi = cv2.resize(roi, (64, 64)) 

  roi = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)

  cv2.putText(frame, "R.O.I", (440, 350), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (0,225,0), 3)



  _, test_image = cv2.threshold(roi, 120, 255, cv2.THRESH_BINARY)

  cv2.imshow("ROI", test_image)


Next, we define the Region of Interest, where we will show our hand gestures; we discussed how to define it earlier in this documentation. Then we use the rectangle() function to enclose the ROI, so the user knows where the sign will be detected.


  result = loaded_model.predict(test_image.reshape(1, 64, 64, 1))

  prediction = {'ZERO': result[0][0], 

         'ONE': result[0][1], 

         'TWO': result[0][2],

         'THREE': result[0][3],

         'FOUR': result[0][4],

         'FIVE': result[0][5]} 

  prediction = sorted(prediction.items(), key=operator.itemgetter(1), reverse=True)

  cv2.putText(frame, "PREDICTION:", (30, 90), cv2.FONT_HERSHEY_SIMPLEX, 1, (255,255,255), 2)

  cv2.putText(frame, prediction[0][0], (80, 130), cv2.FONT_HERSHEY_SIMPLEX, 1, (255,255,255), 2)   

  cv2.imshow("Frame", frame)


We use the predict() function, which predicts labels for new data on the basis of the trained model. It accepts a single argument, the data to be tested, and returns the predicted label probabilities learned from the training data. We pass test_image reshaped to the network's input shape of 1 x 64 x 64 x 1, i.e. a batch containing one 64 x 64 grayscale image.

Then we declare a dictionary called prediction, whose keys are the class names ('ZERO' to 'FIVE') and whose values are the corresponding entries of result[0]. Here, result[0] is the first (and only) row of the model's output, containing one probability per class.

Next, we sort prediction.items(), the (class name, probability) pairs describing how closely the gesture shown in the Region of Interest (R.O.I) resembles each class in the training data. operator.itemgetter(1) tells sorted() to sort by the probability (the second element of each pair), and reverse=True sorts in descending order, so that prediction[0][0] holds the class name with the highest probability.
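As a side note, an equivalent and shorter way to pick the top class would be to take the arg-max of result[0] and look it up in the categories dictionary defined earlier. This is a sketch, not part of the original script:

import numpy as np

best = categories[int(np.argmax(result[0]))]   # index of the highest probability, e.g. 'THREE'
cv2.putText(frame, best, (80, 130), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 2)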

We then draw that string on the frame with the putText() function and display the frame in a window named Frame using the imshow() function.

Next, we initialise the interrupt variable with cv2.waitKey(10), which waits up to 10 milliseconds for a key press on every frame. Masking it with 0xFF keeps only the lower 8 bits of the key code, which we compare with 27, the Escape key. Finally, when detection is over, we release the webcam and destroy all the OpenCV windows, freeing the memory used for the image arrays, as explained in the previous documentation.

Now your prediction.py should look like the following.


from keras.models import model_from_json

import operator

import cv2



json_file = open("model-bw.json", "r")

model_json = json_file.read()

json_file.close()

loaded_model = model_from_json(model_json)

loaded_model.load_weights("model-bw.h5")

print("Loaded model from disk")



cap = cv2.VideoCapture(0) 



categories = {0: 'ZERO', 1: 'ONE', 2: 'TWO', 3: 'THREE', 4: 'FOUR', 5: 'FIVE'}



while True:

  _, frame = cap.read()

  frame = cv2.flip(frame, 1)



  x1 = int(0.5*frame.shape[1])

  y1 = 10

  x2 = frame.shape[1]-10

  y2 = int(0.5*frame.shape[1])



  cv2.putText(frame, "Expressando - TDOC 2021", (175, 450), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (225,255,0), 3)

  cv2.rectangle(frame, (x1-1, y1-1), (x2+1, y2+1), (255,255,255) ,3)

  roi = frame[y1:y2, x1:x2]



  roi = cv2.resize(roi, (64, 64)) 

  roi = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)

  cv2.putText(frame, "R.O.I", (440, 350), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (0,225,0), 3)



  _, test_image = cv2.threshold(roi, 120, 255, cv2.THRESH_BINARY)

  cv2.imshow("ROI", test_image)



  result = loaded_model.predict(test_image.reshape(1, 64, 64, 1))

  prediction = {'ZERO': result[0][0], 

         'ONE': result[0][1], 

         'TWO': result[0][2],

         'THREE': result[0][3],

         'FOUR': result[0][4],

         'FIVE': result[0][5]} 

  prediction = sorted(prediction.items(), key=operator.itemgetter(1), reverse=True) 

  cv2.putText(frame, "PREDICTION:", (30, 90), cv2.FONT_HERSHEY_SIMPLEX, 1, (255,255,255), 2)

  cv2.putText(frame, prediction[0][0], (80, 130), cv2.FONT_HERSHEY_SIMPLEX, 1, (255,255,255), 2)   

  cv2.imshow("Frame", frame)



  interrupt = cv2.waitKey(10)

  if interrupt & 0xFF == 27:

    break





cap.release()

cv2.destroyAllWindows()


Run the code in your PowerShell/terminal using:


python prediction.py


Now, see your real-time sign-language detection in action!!!

alt Prediction

Thank you for reading this article. Hope you had a great time building this project and have learned a lot more about Machine learning.

May the source be with you! 🐧❤️

Top comments (1)

SandyGun

How can I add more data to the prediction dictionary to predict more signs?
I'm getting this error "index 6 is out of bounds for axis 0 with size 6".