Stephen-Kamau

Real-Time Object Detection with YOLO: A Step-by-Step Guide with Realtime Fire Detection Example.

PART I: INTRODUCTION TO YOLO AND DATA FORMAT.

A detailed tutorial explaining the ABCs of the YOLO model, dataset preparation, and how to efficiently train the YOLOv5 object detection algorithm on a custom dataset.

This blog post expands on a presentation given at DevFest22 Nairobi.

Introduction.

In recent years, advances in computer vision and machine learning have led to the development of more advanced object detection systems that can detect objects in real time from the video feeds of surveillance cameras or any other recording. One popular approach for this task is the YOLO (You Only Look Once) object detection algorithm.
YOLO, initially proposed by Redmon et al. [https://arxiv.org/abs/1506.02640], is one of the most widely used models for real-time object detection thanks to its combination of speed and accuracy. It has proven to be a valuable tool in a wide range of applications.

In this tutorial we'll explore how the YOLO model works and how it can be used for real-time fire detection using the implementation from Ultralytics [https://github.com/ultralytics/yolov5]. We will use transfer learning from the P5 models (the P5 models supported by Ultralytics differ from one another in architecture and parameter size) to train our own model, evaluate its performance and use it for inference.

This tutorial is designed for people with a theoretical background in object detection and computer vision who are looking for a practical implementation. An easy-to-use notebook with the full code implementation is provided for easier follow-through.


The ABCs Of YOLO Model.

The general idea behind the YOLO model is that it applies a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region; these bounding boxes are weighted by the predicted probabilities. Because it looks at the whole image at test time, its predictions are informed by the global context in the image. It also makes predictions with a single network evaluation, unlike other systems that require several evaluations. It relies on three techniques for detection, which include the following:

  1. Residual blocks (Creating an S * S grid from the image)
  2. Bounding box regression (predict the height, width, center, and class of objects)
  3. Intersection Over Union (IOU) - to check how bounding boxes overlap and select the best fit.

1. Residual Blocks

One of the key features of YOLO is its use of residual blocks to create an S * S grid from the input image. This grid is used to divide the image into a set of cells, each of which is responsible for predicting a fixed number of bounding boxes and class probabilities. The use of residual blocks allows YOLO to process the entire image in a single pass, making it well-suited for real-time object detection tasks.

Residual Blocks
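
To make the grid idea concrete, here is a minimal sketch (the grid size S and the helper function are illustrative, not taken from the YOLOv5 code) showing how the cell that contains an object's normalized center becomes the cell responsible for predicting that object.

# A minimal sketch of the S * S grid: the cell containing an object's
# (normalized) center is the one responsible for predicting that object.
S = 7  # illustrative grid size

def responsible_cell(x_center, y_center, S=S):
    # x_center and y_center are normalized to [0, 1]
    col = min(int(x_center * S), S - 1)
    row = min(int(y_center * S), S - 1)
    return row, col

print(responsible_cell(0.563462, 0.686216))  # -> (4, 3)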


2. Bounding Box Regression

In order to predict the locations of objects in the input image, YOLO uses bounding box regression. This technique involves predicting the height, width, center, and class of objects in the image. During training, the YOLO model learns to adjust the bounding boxes to better fit the objects in the training data. This allows the model to be more accurate at predicting the locations of objects in new images.

Bounding Box Regression

Every bounding box in the image consists of the following attributes: width (bw), height (bh), class label (c) and bounding box center (bx, by). A single bounding box regression is used to predict the height, width, center, and class of objects.

The model then uses these attribute scores to predict bounding boxes for each cell. The use of anchor boxes in the YOLO model allows it to predict the locations of objects in the input image. An anchor box is a predefined set of bounding box dimensions that serves as a reference for predicting the bounding boxes of the objects in the image.
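
As a rough illustration of how an anchor box acts as a reference, the sketch below decodes a box from offsets predicted relative to an anchor, using the YOLOv2/v3-style decoding (YOLOv5 uses a slightly different variant). All names and numbers here are purely illustrative.

import math

# A rough sketch of anchor-based box decoding (YOLOv2/v3-style).
# The anchor supplies reference width/height; the network predicts cell-relative
# offsets (tx, ty) and log-scale factors (tw, th).
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h, S=7):
    bx = (sigmoid(tx) + cell_x) / S   # center x, normalized by grid size
    by = (sigmoid(ty) + cell_y) / S   # center y, normalized by grid size
    bw = anchor_w * math.exp(tw)      # width as a multiple of the anchor width
    bh = anchor_h * math.exp(th)      # height as a multiple of the anchor height
    return bx, by, bw, bh

# illustrative values only
print(decode_box(0.2, -0.1, 0.3, 0.1, cell_x=3, cell_y=4, anchor_w=0.4, anchor_h=0.3))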


3. Intersection Over Union (IOU)

In addition to predicting bounding boxes, YOLO also uses intersection over union (IOU) to check how well the bounding boxes overlap with the ground truth boxes and select the best fit. IOU is calculated as the area of overlap between the predicted bounding box and the ground truth box, divided by the area of union between the two boxes. A high IOU score indicates a good overlap between the predicted and ground truth boxes, while a low IOU score indicates a poor overlap.

IOU Analysis description

Each grid cell is responsible for predicting bounding boxes and their confidence scores. The IOU is equal to 1 if the predicted bounding box matches the real box exactly, and this mechanism is used to eliminate predicted boxes that do not overlap the real box well.
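
As a quick illustration of the formula above, here is a minimal sketch of computing IOU between two boxes given as (x_min, y_min, x_max, y_max) corners; the example boxes are made-up values.

# A minimal sketch of IOU between two boxes in (x_min, y_min, x_max, y_max) form.
def iou(box_a, box_b):
    # corners of the intersection rectangle
    inter_x_min = max(box_a[0], box_b[0])
    inter_y_min = max(box_a[1], box_b[1])
    inter_x_max = min(box_a[2], box_b[2])
    inter_y_max = min(box_a[3], box_b[3])

    # intersection area is zero if the boxes do not overlap
    inter_w = max(0.0, inter_x_max - inter_x_min)
    inter_h = max(0.0, inter_y_max - inter_y_min)
    intersection = inter_w * inter_h

    # union = sum of both areas minus the shared area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0

# made-up predicted box vs ground-truth box
print(iou((376, 335, 428, 389), (380, 340, 430, 390)))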

By using these three techniques, YOLO is able to accurately detect objects in images and make predictions in real-time. This makes it a powerful tool for a wide range of object detection tasks, including real-time fire detection, pedestrian tracking, and more.


Real-Time Fire Detection With YOLOv5

Now that we've covered the basic working techniques of the YOLO model, let's look at how it can be used for real-time fire detection.
One use case for YOLO in fire detection is the monitoring of surveillance cameras. By training a YOLOv5 model on a large dataset of images and videos of fires, it is possible to build a model that can detect fires in real-time video streams. Training a YOLOv5 model is very easy; the bigger part of the work comes when the dataset is not in the required format, as in our case. YOLOv5 expects the labels (bounding box information) to be in a .txt file with the same name as the image.

For this demo, we walk through an end-to-end object detection project on a custom fire dataset, using the YOLOv5 implementation developed by Ultralytics. You can also check the same implementation using a later version (YOLOv7).


The Walkthrough.

1. Dataset Handling.

The dataset for YOLOv5 should follow the YOLO format and be organized into a hierarchy of folders with the following structure:

├── data.yaml
└── base_dir
    ├── images
    │   ├── train
    │   └── validation
    └── labels
        ├── train
        └── validation
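
For reference, a minimal data.yaml for this single-class fire dataset could look like the sketch below. The train/val paths are assumptions based on the folder layout above; adjust them to wherever your images actually live.

# A minimal sketch of a data.yaml for a one-class (fire) dataset.
# The train/val paths below are assumptions based on the folder layout above.
yaml_content = """
train: base_dir/images/train
val: base_dir/images/validation

nc: 1
names: ['fire']
"""

with open("data.yaml", "w") as f:
    f.write(yaml_content)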

Each image file should be accompanied by a corresponding text file with the same name that contains the annotation information for the objects in the image. The annotation file should contain one line for each object in the image, with each line having the following format:

class_id x_center y_center width height

where class_id is the integer id of the object's class, x_center and y_center are the coordinates of the center of the object's bounding box, and width and height are the dimensions of the bounding box. All of these values except class_id must be NORMALIZED between 0 and 1.

In our case, we have variables that can help us draw an image's bounding box. (xmin, ymin) and (xmax, ymax) are the two corners of a bbox, and the width and height columns are the dimensions of the image from which these annotations were extracted. A sample snippet of how our data looks is as follows:

     file_id     img_name        xmax  ymax  xmin  ymin  width  height
100  WEBFire977  WEBFire977.jpg   428   389   376   335   1280
101  WEBFire977  WEBFire977.jpg   764   474   462   368   1280
102  WEBFire977  WEBFire977.jpg  1173   495   791   387   1280
103  WEBFire977  WEBFire977.jpg  1293   522  1211   460   1280

Below is an example of the image with its annotation drawn.

Sample Image with its Bounding Box
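
For illustration, the drawing above can be reproduced with a short snippet like the following sketch. The image path is an assumption; point it to wherever the dataset images are stored.

import cv2

# A minimal sketch: draw one annotation on its image using the corner coordinates
# from the sample rows above (file WEBFire977.jpg, first object).
img = cv2.imread("images/WEBFire977.jpg")  # assumed path
xmin, ymin, xmax, ymax = 376, 335, 428, 389
cv2.rectangle(img, (xmin, ymin), (xmax, ymax), color=(0, 0, 255), thickness=2)
cv2.imwrite("WEBFire977_bbox.jpg", img)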


2. Labelling Format.

First, from the above dataframe, we need to extract the x and y centers, and also the height and width, of each object. Each text file should contain one bounding-box (BBox) annotation line for each of the objects in the image. The annotations are normalized to the image size and lie within the range of 0 to 1. They are represented in the following format:

<object-class-ID> <x-center> <y-center> <box-width> <box-height>
  • If there is more than one object in the image, the content of the YOLO annotation text file might look like this:
  0 0.563462 0.686216 0.462500 0.195205
  7 0.880769 0.796447 0.041346 0.112586
  2 0.880769 0.796447 0.041346 0.112586
  0 0.564663 0.679366 0.463942 0.181507
  0 0.566106 0.658390 0.469712 0.192637
  1 0.565144 0.359803 0.118750 0.107449

Each value is separated by a space, and the information for each object is on its own line. Since the annotations need to be normalized, let's normalize them and extract the center and dimensions of each fire object. For normalization, a bbox value is divided by the image height if it is a y coordinate, otherwise by the image width. To get the center, we take the sum of the two x (or y) values and divide by 2. To get the width and height, we subtract xmin/ymin from xmax/ymax. Below is a code snippet for the same.

# First, normalize the bbox information to lie in the range of 0 to 1.
# x values are divided by the image width, y values by the image height.
df['x_min'] = df.apply(lambda record: record.xmin / record.width, axis=1)
df['y_min'] = df.apply(lambda record: record.ymin / record.height, axis=1)
df['x_max'] = df.apply(lambda record: record.xmax / record.width, axis=1)
df['y_max'] = df.apply(lambda record: record.ymax / record.height, axis=1)

# Extract the mid-point (center) of each object
df['x_mid'] = df.apply(lambda record: (record.x_max + record.x_min) / 2, axis=1)
df['y_mid'] = df.apply(lambda record: (record.y_max + record.y_min) / 2, axis=1)

# Extract the width and height of each object
df['w'] = df.apply(lambda record: (record.x_max - record.x_min), axis=1)
df['h'] = df.apply(lambda record: (record.y_max - record.y_min), axis=1)

After applying the above transformation, and since a single image can have more than one object, all unique files with their objects will be placed on a single row for easier label creation, i.e. I will create a record of annotations for each image. The information regarding each object will sit inside a list, i.e. a list of dictionaries. Example for a single file:

  [
    {'x_min': 0.4677716390423573, 'y_min': 0.3788, "x_max":0.12435,"y_max":0.234352, "x_mid":0.8829343, "y_mid":0.23435, "w":0.23, "h":0.1234},
    {.....},
    {.....},
    ..........
  ]


Below is a code snippet that builds this structure for each image.

import pandas as pd
from tqdm import tqdm

# A list to hold the information of each unique file.
# It will make the conversion to a dataframe easier.
TRAIN = []
for img_id in tqdm(df['file_id'].unique()):
    # get all rows that have the current id
    curr_df = df[df['file_id'] == img_id].reset_index(drop=True)
    # get the information shared by all rows of this image
    base_details = dict(curr_df.loc[0][['file_id', 'img_name', 'width', 'height']])

    # a list to hold the bbox annotation information
    information = []

    # iterate through all records of the current id while extracting their annotation information
    for indx in range(curr_df.shape[0]):
        # get the object's information as a dict and add it to the information list above
        other_details = dict(curr_df.loc[indx][["x_min", "y_min", "x_max", "y_max", "x_mid", "y_mid", "w", "h"]])
        information.append(other_details)
    # append the information for the current file
    TRAIN.append([base_details['file_id'], base_details['img_name'], base_details['width'], base_details['height'], information])

# create a dataframe from the list built above
processed_df = pd.DataFrame(TRAIN, columns=['image_id', "img_name", "width", "height", "information"])
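
Although the dataset split and the actual label-file creation are covered in the next part, below is a rough sketch of how one row of processed_df could be written out as the YOLO .txt label described earlier. It assumes a single fire class (class id 0) and a hypothetical output folder labels/train.

import os

# A rough sketch: write one YOLO-format label file per image from processed_df.
# Single fire class -> class id 0; the output folder is an assumption.
os.makedirs("labels/train", exist_ok=True)

for _, row in processed_df.iterrows():
    label_path = os.path.join("labels/train", row["img_name"].replace(".jpg", ".txt"))
    with open(label_path, "w") as f:
        for obj in row["information"]:
            # <class-id> <x-center> <y-center> <width> <height>, one object per line
            f.write(f"0 {obj['x_mid']} {obj['y_mid']} {obj['w']} {obj['h']}\n")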

The next step after this will be splitting the dataset for both training and validation. This will follow in the next post.

Link to the Notebook GITHUB

3. Training and Validation (INCOMPLETE).

You can check the other part from this link: Link to Part II of the blog
