The mission is to develop an open source machine learning solution that uses computer vision to analyse (home-made) sports videos.
For starters I want to focus on basketball games, but the solution should also be applicable to any sport that has players and a court.
Further documentation, code examples and eventually a working open-source solution will get published on GitHub.
Feel free to contact me if you want to help out, have suggestions or know about existing open source projects that we can (re)use.
Project Goals
The first two goals below are short term; the last two are long term.
- Player tracking per team.
- Video mapping onto 2D basketball court.
- Game play action detection (with tagging) and analytics
- More advanced game analytics like lay-up, dunk, pick & roll, running distance, etc.
In short, similar to the football analytics video shown below, but for basketball and open source.
Machine Learning Models
Based on the Player Tracking and Analysis of Basketball Plays paper, the following machine learning models need to be created.
1) Court Detection - find lines of the court
2) Person Detection - detect individuals ✅
3) Player Detection and Color Classification - detect the players standing on the court and separate them into two teams
4) Player Tracking - keep position information frame by frame
5) Mapping via Homography - translate onto a court
Court Detection
Explained in Court Reconstruction for Camera Calibration in Broadcast Basketball Videos [2]
The video frames that we obtained from Youtube were initially converted from the BGR to the HSV (hue, saturation and value) color model. We then focused on the H-plane in order to create a binary model of the system. Then, we proceeded to perform erosion and dilation of the image in order to get rid of artifacts that were not related to the court. Subsequently, we made use of the Canny edge detector to detect the lines in our system. Finally, we performed the Hough transform in order to detect the straight lines in the system.
Court detection strategies
As you can see in the images above, these are not NBA courts. They are covered in criss-crossing lines from other sports, which makes the basketball court very hard to detect automatically.
Let's have a closer look at a few strategies.
Naive Court Detection
- Converting the image to HSV
- Isolating pixels within a given hue range
- Developing a bitwise-AND mask
- Using Canny edge detection
- Using Hough Transformation
The Python code "court_detection1.py" is included in this project.
A basketball court that carries too many extra lines will be very difficult to identify using the strategy above.
Binary segmentation using auto encoders
An autoencoder for sports field segmentation will be required, as explained in the Classification of Actions by Simone Francia (see section 3.2.2: Auto encoder model of the basketball court).
Field Segmentation Datasets
In order for training to work, a 100,000-frame dataset of basketball courts is required.
To do this, about 1000 frames need to be extracted from each game; these are then used to create the dataset.
The size of the dataset can be increased through simple data augmentation techniques.
Each court is annotated with n points {p1, p2, ..., pn} on the image plane. The OpenCV function cv2.polylines then connects these points to draw a polygon.
import cv2
import numpy as np


def draw_poly_box(frame, pts, color=(0, 255, 0)):
    """Draw polylines bounding box.

    Parameters
    ----------
    frame : OpenCV Mat
        A given frame with an object
    pts : numpy array
        consists of bounding box information with size (n points, 2)
    color : tuple
        color of the bounding box in BGR order, the default is green

    Returns
    -------
    new_frame : OpenCV Mat
        A frame with given bounding box.
    """
    # Work on a copy so the caller's frame is left untouched.
    new_frame = frame.copy()
    # cv2.polylines expects int32 points shaped (n, 1, 2).
    temp_pts = np.array(pts, np.int32)
    temp_pts = temp_pts.reshape((-1, 1, 2))
    # True closes the polygon by joining the last point to the first.
    cv2.polylines(new_frame, [temp_pts], True, color, thickness=2)
    return new_frame
This manually annotated polygon is interpreted as the field (basketball court): the inside is colored white and everything outside is black.
Data Augmentation for the Dataset Field
The field has been annotated for one frame every second; since the videos run at 25 fps, this is equivalent to annotating one frame in every 25. Annotating 1000 frames is not sufficient to create a robust autoencoder model. For this reason, some data augmentation techniques have been adopted in order to provide the autoencoder with a sufficient number of training examples.
Every court image can also be rotated by an angle between -15 and 15 degrees. From each original court image two additional images are created, each rotated by a random angle within that interval.
Persons Detection
Object detection locates an object in an image and draws a bounding box around it. In our case the object is a person.
Common Object Detection model architectures are :
- Region-based Convolutional Neural Networks or R-CNN
- Fast R-CNN
- Faster R-CNN
- Mask R-CNN
- SSD (Single Shot MultiBox Detector)
- YOLO (You Only Look Once)
- Objects as Points
- Data Augmentation Strategies for Object Detection
You can find a working example by Arun Ponnusamy using YOLO and OpenCV. The result image is shown below.
Another approach is to use convolutional neural networks built with a framework such as TensorFlow. More details on the different model architectures can be found in A 2019 Guide to Object Detection.
Mask R-CNN allows us to segment the foreground object from the background, as shown in this Mask R-CNN example and the image below. This will help in the next model, where we'll detect a player and link them to a team based on color classification.
Player Detection and Color Classification
Excerpt from Learning to Track and Identify Players from Broadcast Sports Videos [4].
In order to reduce the number of false positive detections, we use the fact that players of the same team wear jerseys whose colors are different from the spectators, referees, and the other team. Specifically, we train a logistic regression classifier [32] that maps image patches to team labels (Team A, Team B, and other), where image patches are represented by RGB color histograms. We can then filter out false positive detections (spectators and referees) and, at the same time, group detections into their respective teams. Notice that it is possible to add color features to the DPM detector and train a player detector for a specific team [33]. However, [33] requires a larger labeled training data, while the proposed method only needs a handful examples.
After performing this step, we significantly boost precision to 97% while retaining a recall level of 74%.
The Mask R-CNN application allows us to extract the segmented image of each identified person. Extracting the dominant colors per segmented image should allow us to classify the players by team. However, for a reason I have not found yet, the Python code used for this doesn't identify the yellow jersey color.
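As an illustration of the dominant-color idea only (this is not the code referred to above, and it makes no claim about the yellow jersey problem), the sketch below quantizes the pixels inside a person's segmentation mask into coarse color bins, takes the most populated bin, and assigns the nearest reference team color. The function names, the number of quantization levels and the reference colors are all assumptions. Note that OpenCV images are in BGR order, so reference colors must be given in BGR as well.

```python
import numpy as np


def dominant_color(image_bgr, mask, levels=4):
    """Mean BGR color of the most frequent coarse color bin inside the mask."""
    pixels = image_bgr[mask > 0].astype(np.int64)  # shape (N, 3)
    if len(pixels) == 0:
        raise ValueError("mask selects no pixels")
    # Quantize each channel into `levels` bins and build one key per pixel.
    step = 256 // levels
    bins = pixels // step
    keys = bins[:, 0] * levels * levels + bins[:, 1] * levels + bins[:, 2]
    best = np.bincount(keys, minlength=levels ** 3).argmax()
    return pixels[keys == best].mean(axis=0)


def nearest_team(color_bgr, team_colors):
    """Return the name of the reference color closest to color_bgr."""
    names = list(team_colors)
    distances = [np.linalg.norm(color_bgr - np.array(team_colors[name], dtype=float))
                 for name in names]
    return names[int(np.argmin(distances))]
```

A "none of the above" threshold on the distance would be needed in practice to reject referees and spectators, as the excerpt describes.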
See also Player Tracking and Analysis of Basketball Plays.
Players Tracking
Excerpt from Learning to Track and Identify Players from Broadcast Sports Videos [4].
Face recognition is infeasible in this domain, because image resolution is too low even for human to identify players. Recognising jersey numbers is possible, but still very challenging. We tried to use image thresholding to detect candidate regions of numbers, and run an OCR to recognise them. However, we got very poor results because image thresholding cannot reliably detect numbers, and the off-the-shelf OCR is unable to recognise numbers on deformed jerseys. Frequent pose and orientation changes of players further complicate the problem, because frontal views of faces or numbers are very rare from a single camera view. We adopt a different approach, ignoring face and number recognition, and instead focusing on identification of players as entities. We extract several visual features from the entire body of players. These features can be faces, numbers on the jersey, skin or hair colors. By combining all these weak features together into a novel Conditional Random Field (CRF), the system is able to automatically identify sports players, even in video frames taken from a single pan-tilt-zoom camera.
The open source Alpha Pose project can detect a human body within an image and provide a full description of a human pose.
Alpha Pose is the “first real-time multi-person system to jointly detect human body, hand, and facial key points on single images” using 130 key points.
Once we can identify a body pose, the direction a player is facing can be calculated, and both position and direction can be mapped onto a 2D playing plane/field/court as shown above.
Court Mapping via Homography
How can we map a player in a video onto a 2D court?
A homography is a perspective transformation of a plane (in our case a basketball court) from one camera view into another. In other words, a single transformation matrix maps points on the court as seen in the video frame onto a 2D top-down image of the court.
Knowing the dimensions of the court, we can compute a 3x3 homography matrix from at least four point correspondences between the video frame and the model court. Each player’s position, written in homogeneous coordinates, is then multiplied by the homography matrix, which projects it onto the model court.
See also this scientific paper on A Two-point Method for PTZ Camera Calibration in Sports [22]
A Python example of how to use the OpenCV homography functions can be seen below. It is based on an article by Satya Mallick.
#!/usr/bin/env python
import cv2
import numpy as np

if __name__ == '__main__':
    # Read source image.
    im_src = cv2.imread('book2.jpg')
    # Four corners of the book in source image
    pts_src = np.array([[141, 131], [480, 159], [493, 630], [64, 601]],
                       dtype=np.float32)

    # Read destination image.
    im_dst = cv2.imread('book1.jpg')
    # Four corners of the book in destination image.
    pts_dst = np.array([[318, 256], [534, 372], [316, 670], [73, 473]],
                       dtype=np.float32)

    # Calculate Homography
    h, status = cv2.findHomography(pts_src, pts_dst)

    # Warp source image to destination based on homography
    # (the output size is given as width, height).
    im_out = cv2.warpPerspective(im_src, h, (im_dst.shape[1], im_dst.shape[0]))

    # Display images
    cv2.imshow("Source Image", im_src)
    cv2.imshow("Destination Image", im_dst)
    cv2.imshow("Warped Source Image", im_out)
    cv2.waitKey(0)
More Advanced Machine Learning Models
In a future version of the project we could also consider adding Start of Game (SoG), Track Ball and Goal (score) machine learning models.
Start of Game (SoG)
If we want the solution to analyse a video automatically, we could also consider adding a Start of Game (SoG) model. It recognises when the players take up certain positions on the court, which flags the beginning of a game or quarter when doing basketball analysis.
Track Ball
Tracking the ball will be a requirement when we want to achieve scoring analytics. Some very interesting research studies have been published on this subject : A deep learning ball tracking system in soccer videos [5]
An example of tracking a large ball using OpenCV can be found here.
Pose estimator
Alpha Pose is the “first real-time multi-person system to jointly detect human body, hand, and facial key points (in total 130 key points) on single images”. The solution is capable of taking in an image and detecting key points (eyes, nose, various joints, etc.) on all human figures in the image. This allows the full description of a human pose in an image.
Alpha Pose can potentially be a building block to detect shots, layups, dunks etc.
A must read related document is Sports Analytics With Computer Vision [7].
And another great article giving an overview on the available human pose estimation solutions.
Shot Detection
Another very interesting machine learning model that we can use is the open source project on basketball shot detection and analysis shared by Rembert Daems (Thanks for the info).
This program is able to detect when a shot occurs and fill in the ball's flight from captured data. It calculates the ball's initial velocity and launch angle. It is able to estimate the ball's flight perpendicular to the camera plane (the z axis) using a single camera. The program is also able to detect when the ball's flight is interrupted by another object and will drop those data points.
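To show what estimating initial velocity and launch angle involves, here is a minimal sketch that fits simple projectile motion to tracked ball positions. It is not the method of the project mentioned above: the function name is hypothetical, positions are assumed to be already converted to metres with y pointing up, and air drag and the z axis are ignored.

```python
import numpy as np


def launch_parameters(t, x, y):
    """Fit projectile motion and return (speed, angle_degrees, gravity)."""
    t = np.asarray(t, dtype=float)
    t = t - t[0]  # measure time from the first sample (the release)
    # Horizontal motion is linear in time: x = vx * t + x0.
    vx, _x0 = np.polyfit(t, np.asarray(x, dtype=float), 1)
    # Vertical motion is quadratic: y = a * t^2 + vy * t + y0, with a = -g / 2.
    a, vy, _y0 = np.polyfit(t, np.asarray(y, dtype=float), 2)
    speed = float(np.hypot(vx, vy))
    angle = float(np.degrees(np.arctan2(vy, vx)))
    return speed, angle, float(-2.0 * a)
```

The fitted gravity value is a useful sanity check: if it is far from 9.81 m/s², the pixel-to-metre scale or the tracked points are probably wrong.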
Actions recognition
Simone Francia developed a basketball action recognition dataset as shown in the video below. I've contacted Simone to get more details on how his dataset can be used.
Goal (Score)
Eventually we also want to have a model which can identify when a player scores (in basketball a score is worth one point for a free throw, or two or three points for a field goal). As confirmed by the ML6 presentation, different goal models will need to be combined. For example, audio peaks could be an indication that a goal was made, and ball tracking towards the hoop and the ball "entering" the "ring" are events which could also identify a goal.
Another idea (when possible) is to do OCR of the score board and see when the score increases as a confirmation of a goal. Of course the majority of the home made videos do not include the score board.
Further research is required and suggestions are always welcome.
Ensemble Learning
Once we have the different models working we'll most likely need to "stack" them, that is, use the output of one model as the input for another, or combine similar models to obtain better predictive performance.
See also A Comprehensive Guide to Ensemble Learning (with Python codes)
One technique of ensemble learning is stacking.
Stacking is a way to ensemble multiple classifications. The point of stacking is to explore a space of different models for the same problem. The idea is that you can attack a learning problem with different types of models which are capable to learn some part of the problem, but not the whole space of the problem. So, you can build multiple different learners and you use them to build an intermediate prediction, one prediction for each learned model. Then you add a new model which learns from the intermediate predictions the same target.
This final model is said to be stacked on the top of the others, hence the name. Thus, you might improve your overall performance, and often you end up with a model which is better than any individual intermediate model.
https://www.geeksforgeeks.org/stacking-in-machine-learning/
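A minimal sketch of stacking with scikit-learn's StackingClassifier, on synthetic data. The choice of base models, the final estimator and the dataset are assumptions for illustration only and have nothing to do with the sports models yet; the point is the mechanics of base learners feeding a final model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class problem standing in for real features.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    # Intermediate models: each learns the problem in its own way.
    estimators=[
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    # Final model: learns the same target from the intermediate predictions.
    final_estimator=LogisticRegression(),
    # Intermediate predictions are produced out-of-fold to limit leakage.
    cv=5,
)
stack.fit(X_train, y_train)
accuracy = stack.score(X_test, y_test)
```

Comparing `accuracy` with the score of each base model on the same split shows whether stacking actually helps for a given problem.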
Training Data
Over the past 7 years I've recorded my son's basketball games and these are available on YouTube. We have thousands of hours of training data which we can use to test and train our machine learning models!
Machine Learning Hardware (on a budget)
This NVIDIA Jetson Nano board could be an interesting and inexpensive solution to train and deploy the Sports Analytics machine learning software.
According to NVIDIA this is a small, powerful computer that lets us run multiple neural networks in parallel for applications like "image classification, object detection, segmentation, and speech processing."
Architecture
When it comes to the architecture we're back on solid ground and have enough experience to make something beautiful.
The machine learning models can get integrated using an architecture explained by ML6 at Devoxx Belgium 2019 [3].
As you can see, the architecture is based on Google Cloud, but I expect it can also be built on Amazon, IBM or Microsoft cloud services.
Data crunching
As explained in [3], some models will only need a single video frame, while others will need multiple frames sorted by timestamp (a time series), for example to analyse player movement.
A possible approach could be accomplished as follows:
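For the multi-frame case, one simple building block is a sliding window over timestamped frames. This is a minimal sketch under my own assumptions, not the pipeline from [3]: frames are represented as hypothetical (timestamp, frame) pairs and the window size is arbitrary.

```python
from collections import deque


def sliding_windows(frames, size):
    """Yield lists of `size` consecutive (timestamp, frame) pairs.

    Frames may arrive out of order; they are sorted by timestamp first.
    """
    window = deque(maxlen=size)
    for item in sorted(frames, key=lambda pair: pair[0]):
        window.append(item)  # the oldest item drops out automatically
        if len(window) == size:
            yield list(window)
```

A single-frame model (for example person detection) would consume each frame on its own, while a movement model would consume each window.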
Commercial solutions
- Second Spectrum. See also this non-technical TED talk by Rajiv Maheswaran.
- PlaySight has an AI solution to analyse sport games.
- Any others?
ML Sporting references
- Player Tracking and Analysis of Basketball Plays [1]
- Court Reconstruction for Camera Calibration in Broadcast Basketball Videos [2]
- Video Analytics for Football games by Sven Degroote at Devoxx Belgium 2019 [3]
- Learning to Track and Identify Players from Broadcast Sports Videos [4]
- A deep learning ball tracking system in soccer videos [5]
- Shot Detection project on GitHub [6]
- Sports Analytics With Computer Vision [7]
- An accurate multi-person pose estimator [8]