One of the Codia AI Technologies: 2023-24 Mainstream Object Detection Models

1. Introduction

Codia AI was introduced in the articles Codia AI: Shaping the Design and Code Revolution of 2024 and Codia AI: Shaping the Design and Code Revolution of 2024 - Part 2. Building Codia AI involved in-depth training, implementation, and optimization of object detection models, and this article focuses on those models.

Object detection is a fundamental task in computer vision, involving the identification and localization of objects within an image. Deep learning has revolutionized object detection, enabling more accurate and efficient detection of objects in images and videos. In 2023-24, several deep learning models are making significant strides in object detection. Here are the mainstream object detection deep learning models for 2023-24:

2. R-CNN Series

2.1. R-CNN (Regions with CNN features)

R-CNN, proposed by Ross Girshick et al. in 2014, is one of the pioneering works that applied deep learning to object detection. It first uses a selective search algorithm to extract candidate regions from an image, then uses a convolutional neural network (CNN) to extract features for each region, and finally classifies the regions using support vector machines (SVMs). Its workflow is as follows:

  1. Use a selective search algorithm to generate about 2000 region proposals in the input image.
  2. Scale each candidate region to fit the input size of the convolutional neural network (CNN).
  3. Extract features through a pre-trained CNN (such as AlexNet).
  4. Train multiple support vector machines (SVMs), each responsible for recognizing a category.
  5. Refine the bounding boxes of the candidate regions using a regression model.

Although R-CNN made significant progress in accuracy, it was slow because it had to run the CNN separately on each of the roughly 2000 candidate regions per image.
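
Step 5 of the workflow above can be made concrete with a small sketch. It assumes the center/size box parameterization described in the R-CNN paper, where the regressor predicts offsets relative to the proposal; the function names `encode_box` and `decode_box` are illustrative, not taken from any library.

```python
import math


def encode_box(proposal, target):
    """Regression targets that move `proposal` onto `target`.

    Both boxes are (center_x, center_y, width, height).
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = target
    # Center shifts are scaled by the proposal size; sizes are log ratios.
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))


def decode_box(proposal, deltas):
    """Apply predicted deltas (tx, ty, tw, th) to a proposal box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = deltas
    return (px + pw * tx, py + ph * ty,
            pw * math.exp(tw), ph * math.exp(th))
```

Encoding a target and then decoding it returns the original target box, which is a quick way to check that the two functions are inverses of each other.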

2.2. Fast R-CNN

To address the slow speed and multi-stage training issues of R-CNN, Girshick proposed Fast R-CNN in 2015. Fast R-CNN integrates CNN feature extraction and classifier training into one network and introduces an RoI (Region of Interest) Pooling layer to extract fixed-size feature vectors from shared feature maps, thereby improving efficiency. Its main improvements include:

  1. The entire image is processed through the CNN to extract features at once, rather than for each candidate region separately.
  2. The introduction of the RoI Pooling layer, which extracts fixed-size feature vectors from shared feature maps corresponding to each candidate region.
  3. The use of a multi-task loss function to train the network for classification and bounding box regression simultaneously.

These improvements significantly increased the speed of training and detection while simplifying the training process.
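
To illustrate the RoI Pooling idea, here is a minimal pure-Python sketch on a single-channel feature map. It assumes the RoI is already expressed in integer feature-map cells; `roi_max_pool` and its arguments are hypothetical names used only for this example.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive b."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool a region of a 2-D feature map to a fixed out_h x out_w grid.

    `roi` is (x0, y0, x1, y1) in feature-map cells, inclusive on both ends.
    """
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0 + 1, x1 - x0 + 1
    pooled = []
    for i in range(out_h):
        # Integer bin edges: every bin covers at least one row.
        r_start = y0 + (i * roi_h) // out_h
        r_end = y0 + ceil_div((i + 1) * roi_h, out_h)
        row = []
        for j in range(out_w):
            c_start = x0 + (j * roi_w) // out_w
            c_end = x0 + ceil_div((j + 1) * roi_w, out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r_start, r_end)
                           for c in range(c_start, c_end)))
        pooled.append(row)
    return pooled
```

Whatever the size of the RoI, the output always has `out_h` × `out_w` values, which is what lets a fixed-size classifier head follow the shared feature map.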

2.3. Faster R-CNN

Faster R-CNN, proposed by Ren et al. in 2015, further improved the speed and accuracy of object detection. On top of Fast R-CNN, it introduced a Region Proposal Network (RPN), a fully convolutional network that shares convolutional features with the detection network and generates high-quality region proposals in place of selective search. Because proposal generation is learned together with the detector, both detection speed and accuracy increased.
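
The RPN scores a fixed set of reference boxes (anchors) at every feature-map location. The sketch below generates them for one location, assuming the paper's default of 3 scales and 3 aspect ratios; the function name and the exact argument layout are illustrative.

```python
import math


def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x0, y0, x1, y1) anchors centered on (cx, cy).

    Each anchor has area scale**2 and height/width equal to `ratio`.
    """
    boxes = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes
```

With the defaults this yields 9 anchors per location; the RPN then predicts an objectness score and box offsets for each of them.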

2.4. Mask R-CNN

Mask R-CNN, proposed by Kaiming He et al. from Facebook AI Research in 2017, is an extension of Faster R-CNN that can perform not only object detection but also instance segmentation, i.e., pixel-level segmentation of each detected object. Its key features include:

  • RoIAlign: Mask R-CNN introduced the RoIAlign layer to replace the RoI Pooling layer that Faster R-CNN inherited from Fast R-CNN. RoIAlign keeps fractional coordinates and uses bilinear interpolation instead of rounding to integer cells, which avoids quantization errors and improves the accuracy of segmentation and detection.
  • Parallel Prediction: In each RoI, Mask R-CNN simultaneously predicts bounding boxes, categories, and object masks generated by a fully convolutional network (FCN), enabling the model to perform detection and segmentation at the same time.
  • Multi-task Loss: Mask R-CNN uses a multi-task loss function to optimize classification, localization, and segmentation tasks simultaneously.

Mask R-CNN achieved the best performance at the time for both object detection and instance segmentation tasks.
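
The bilinear interpolation at the heart of RoIAlign can be sketched in a few lines. This is a simplified single-channel version that samples one fractional point; a full RoIAlign layer would average several such samples per output bin. The name `bilinear_sample` is an assumption for illustration.

```python
def bilinear_sample(feature_map, x, y):
    """Sample a 2-D grid at fractional coordinates (x = column, y = row)."""
    h, w = len(feature_map), len(feature_map[0])
    # Clamp to the valid range so the four neighbours always exist.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    # Interpolate along x on the two rows, then along y between them.
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy
```

Because no coordinate is rounded, a small shift of the RoI produces a small change in the sampled value, which is the property that pixel-level masks benefit from.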

2.5. Cascade R-CNN

Cascade R-CNN, proposed by Zhaowei Cai and Nuno Vasconcelos in 2018, is a multi-stage object detection framework that progressively refines detection accuracy by cascading several detection heads. Its key ideas include:

  • Cascade Detection Heads: Cascade R-CNN contains multiple detection heads, each trained at different IoU thresholds. As the cascade progresses, the IoU threshold gradually increases, which can progressively improve the quality of detection.
  • Adaptive Training: Each cascade stage adaptively adjusts its training targets based on the output of the previous stage, allowing for more precise fitting of high-quality bounding boxes.
  • Multi-task Learning: Similar to Mask R-CNN, Cascade R-CNN can also be extended to multi-task learning, such as simultaneously performing bounding box regression and object segmentation.

Cascade R-CNN demonstrated superior performance in multiple benchmarks, especially in high-quality detection.
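
The effect of increasing IoU thresholds can be sketched as follows. The thresholds 0.5, 0.6, and 0.7 follow the three-stage setup described in the Cascade R-CNN paper; the helper names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


STAGE_THRESHOLDS = (0.5, 0.6, 0.7)  # one IoU threshold per cascade stage


def positive_per_stage(box, ground_truth, thresholds=STAGE_THRESHOLDS):
    """For each stage, tell whether `box` counts as a positive sample."""
    overlap = iou(box, ground_truth)
    return [overlap >= t for t in thresholds]
```

A box that overlaps the ground truth with IoU 0.65 is a positive for the first two stages but not the third, so it only becomes a positive there after earlier stages have refined it.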

3. YOLO Series (You Only Look Once)

3.1. YOLOv1 (2015)

YOLOv1 is the first model in the YOLO series, which frames object detection as a single regression problem. The model uses a convolutional neural network (CNN) to extract features from images and then uses fully connected layers to predict bounding boxes and class probabilities. YOLOv1 could process 45 images per second, which was a breakthrough speed at the time. YOLOv1's features include:

  1. Dividing the input image into an SxS grid, with each grid cell responsible for predicting objects whose center falls within that cell.
  2. Each grid cell predicts multiple bounding boxes and their confidence scores, as well as conditional class probabilities.
  3. The model makes predictions directly on the entire image, greatly increasing detection speed.
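
The grid formulation can be sketched as below, assuming the configuration reported in the YOLOv1 paper for PASCAL VOC (S = 7, B = 2 boxes per cell, C = 20 classes). The function names are illustrative.

```python
def responsible_cell(cx, cy, img_w, img_h, s=7):
    """Return (row, col) of the grid cell containing the object center."""
    col = min(int(cx / img_w * s), s - 1)
    row = min(int(cy / img_h * s), s - 1)
    return row, col


def output_size(s=7, b=2, c=20):
    """Number of values predicted for one image.

    Each of the b boxes has x, y, w, h and a confidence score; the c class
    probabilities are shared by all boxes of a cell.
    """
    return s * s * (b * 5 + c)
```

With the defaults, the network output is a 7 × 7 × 30 tensor, i.e. 1470 values for the whole image.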

3.2. YOLOv2 (2016)

YOLOv2 made several improvements to YOLOv1, including:

  • Batch Normalization: This helped stabilize the training process and improve the model's generalization ability.
  • Anchor Box Mechanism: This helped the model predict more accurate bounding boxes.
  • Deeper CNN Architecture (Darknet-19): This allowed the model to extract richer features from images.

These improvements raised accuracy while keeping YOLOv2 fast: the paper reports about 67 images per second at 416 × 416 input and about 90 images per second at the lower 288 × 288 resolution.
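
With anchor boxes, the network no longer predicts boxes directly but offsets relative to a grid cell and an anchor (prior) size. The following sketch uses the decoding equations from the YOLOv2 paper; the function and argument names are assumptions for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_yolo_box(raw, cell, prior):
    """Turn raw outputs (tx, ty, tw, th) into a box in grid units.

    `cell` is the (column, row) offset of the grid cell and `prior` is the
    (width, height) of the anchor box.
    """
    tx, ty, tw, th = raw
    cx, cy = cell
    pw, ph = prior
    # The sigmoid keeps the predicted center inside its own grid cell.
    return (cx + sigmoid(tx), cy + sigmoid(ty),
            pw * math.exp(tw), ph * math.exp(th))
```

Raw outputs of zero place the box at the middle of the cell with exactly the anchor's size, so the network only has to learn corrections to a reasonable starting shape.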

3.3. YOLOv3 (2018)

YOLOv3 further improved upon YOLOv2, including:

  • Deeper CNN Architecture (Darknet-53): This further enhanced the model's ability to extract features from images.
  • Multi-scale Prediction: The model makes predictions on feature maps at three different scales, improving the detection of small objects.
  • Loss Function Improvements: Class prediction uses independent logistic classifiers trained with binary cross-entropy instead of a softmax, which lets a single box carry overlapping labels.

These improvements significantly increased YOLOv3's accuracy while still maintaining a high inference speed.
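
A short calculation shows what multi-scale prediction means in practice. It assumes the commonly used 416 × 416 input, feature-map strides of 32, 16, and 8, and 3 anchors per scale, as in the YOLOv3 paper; the function names are illustrative.

```python
def prediction_grid_sizes(input_size=416, strides=(32, 16, 8)):
    """Side length of the prediction grid at each scale."""
    return [input_size // s for s in strides]


def total_predictions(input_size=416, strides=(32, 16, 8), anchors_per_scale=3):
    """Total number of candidate boxes predicted for one image."""
    return sum(g * g * anchors_per_scale
               for g in prediction_grid_sizes(input_size, strides))
```

The finest grid (stride 8) contributes most of the candidate boxes, which is where the gain on small objects comes from.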

3.4. YOLOv4 (2020)

YOLOv4 made significant improvements to YOLOv3, including:

  • New CNN Architecture (CSPDarknet53): This backbone applies Cross Stage Partial (CSP) connections to Darknet-53, reducing computation while preserving accuracy.
  • Attention Mechanism (SAM): A modified Spatial Attention Module helps the model focus on the most informative regions of the feature map.
  • Path Aggregation Network (PAN): This helped the model integrate features from different levels, improving the detection of small objects.

These improvements made YOLOv4 one of the most advanced object detection models at the time.

3.5. YOLOv5 (2020)

YOLOv5, released by Ultralytics in 2020 as a PyTorch implementation, is not a direct successor publication to YOLOv4; it focuses on usability and customizability. The project provides a unified API for training, evaluation, and deployment, making it easier for developers to use. Additionally, YOLOv5 offers pre-trained models in several sizes that can be fine-tuned for specific tasks.

3.6. YOLOv6 (2022)

YOLOv6, released by Meituan in 2022, targets industrial deployment. Its main features include:

  • New CNN Architecture (EfficientRep): This backbone is built from RepVGG-style re-parameterizable blocks, which are trained with multiple branches and fused into plain convolutions for inference to increase speed.
  • Data Augmentation Techniques: This helped the model better handle various image conditions.
  • Self-distillation: A student model learns from a pre-trained teacher of the same architecture, thereby improving accuracy without a larger teacher network.

These improvements made YOLOv6 one of the fastest object detection models at the time.

3.7. YOLOv7 (2022)

YOLOv7 (Wang, Bochkovskiy, and Liao, 2022) comes from the authors of YOLOv4 rather than building on YOLOv6, and focuses on efficiency. Its backbone is built from E-ELAN (Extended Efficient Layer Aggregation Network) blocks and uses re-parameterized convolutions, which reduce the model's size and computational cost while maintaining high accuracy. The paper reports a family of models covering roughly 5 to 160 images per second, which makes the smaller variants well suited to real-time object detection.

3.8. YOLOv8 (2023)

YOLOv8, released by Ultralytics in 2023, was the latest version in the YOLO series at the time of writing and aims at faster inference and higher accuracy across a range of model sizes. Compared with YOLOv5, it switches to an anchor-free, decoupled detection head and replaces the C3 blocks in the backbone with C2f blocks. Besides detection, the same framework supports instance segmentation, image classification, and pose estimation.

4. SSD (Single Shot MultiBox Detector)

SSD, proposed by Liu et al. in 2016, is also a single-stage detector. It achieves accuracy comparable to Faster R-CNN at a much higher speed. SSD predicts bounding boxes and class probabilities on feature maps of different scales, which allows it to effectively detect objects of varying sizes. Key features of SSD include:

  1. Using multi-scale feature maps to detect targets of different sizes.
  2. Predicting bounding boxes at each location on each feature map using multiple predefined anchor boxes.
  3. Using a single loss function, a weighted sum of a classification term and a bounding box regression term, to train both tasks simultaneously.

SSD achieved a good balance between speed and accuracy, making it suitable for real-time detection tasks.
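
How the default (anchor) boxes are sized across feature maps can be sketched with the linear scale rule from the SSD paper, assuming s_min = 0.2 and s_max = 0.9 over 6 feature maps. Released implementations deviate from this rule in places, so treat it as an approximation; the function names are illustrative.

```python
import math


def default_box_scales(num_maps=6, s_min=0.2, s_max=0.9):
    """Box scale (relative to image size) for each feature map, fine to coarse."""
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]


def default_box_shape(scale, aspect_ratio):
    """Relative (width, height) of a default box with the given aspect ratio."""
    return (scale * math.sqrt(aspect_ratio), scale / math.sqrt(aspect_ratio))
```

Early, high-resolution feature maps get small boxes and later, coarse feature maps get large ones, so each layer specializes in a range of object sizes.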

5. RetinaNet

RetinaNet is a single-stage object detection model proposed by Lin et al. in 2017, which introduced Focal Loss to address the extreme imbalance between the few positive (object) samples and the many negative (background) samples that a dense detector evaluates. Focal Loss reduces the loss contribution of easy-to-classify samples, allowing the model to focus more on difficult-to-classify samples. RetinaNet also uses a Feature Pyramid Network (FPN) to improve the detection capability for multi-scale targets.
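
A minimal sketch of the binary focal loss, FL(p_t) = -α_t (1 - p_t)^γ log(p_t), for a single prediction. The defaults γ = 2 and α = 0.25 are the values the paper reports as working best; the function name is illustrative.

```python
import math


def focal_loss(p, target, gamma=2.0, alpha=0.25):
    """Focal loss for one prediction.

    `p` is the predicted probability of the positive class (0 < p < 1) and
    `target` is 1 for a positive sample, 0 for a negative one.
    """
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With γ = 0 the modulating factor disappears and the expression reduces to α-weighted cross-entropy, which makes the role of γ easy to see by comparison.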

6. CenterNet

CenterNet is an object detection method based on keypoint estimation, proposed by Zhou et al. in 2019 in the paper "Objects as Points". It avoids the use of anchor boxes and instead directly predicts the center points of objects as peaks in a heatmap, regressing from each center point to the size of the bounding box. This approach simplifies the object detection process while maintaining high accuracy and speed.
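
Decoding can be sketched as finding local maxima in a center heatmap and expanding each peak with its regressed size. This pure-Python version handles one class and omits the sub-pixel offset head; the score threshold of 0.3 and the function names are assumptions for illustration.

```python
def find_centers(heatmap, threshold=0.3):
    """Return (row, col, score) for cells that are the max of their 3x3 window."""
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            score = heatmap[r][c]
            if score < threshold:
                continue
            window = [heatmap[rr][cc]
                      for rr in range(max(0, r - 1), min(h, r + 2))
                      for cc in range(max(0, c - 1), min(w, c + 2))]
            if score >= max(window):
                peaks.append((r, c, score))
    return peaks


def center_to_box(row, col, width, height):
    """Build an (x0, y0, x1, y1) box around a center from its regressed size."""
    return (col - width / 2, row - height / 2,
            col + width / 2, row + height / 2)
```

Keeping only local maxima plays the role that non-maximum suppression plays in anchor-based detectors.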

7. EfficientDet

EfficientDet is an efficient object detection model proposed by Mingxing Tan, Ruoming Pang, and Quoc V. Le from Google Research in 2020. It is based on EfficientNet, which is an efficiently designed network architecture. The core feature of EfficientDet is its scalability: a single compound coefficient uniformly adjusts the network's width, depth, and input resolution, so that a family of models can cover different resource constraints. Its key components include:

  • BiFPN (Bidirectional Feature Pyramid Network): EfficientDet introduced BiFPN, a new type of feature pyramid network that allows information to flow bidirectionally between feature layers of different scales, enhancing the expressiveness of features.
  • Compound Scaling: EfficientDet uses a compound scaling method to optimize the model's efficiency by balancing the network's depth, width, and resolution.
  • NAS-derived Backbone: EfficientDet reuses EfficientNet backbones, whose baseline architecture was found through Neural Architecture Search (NAS). The BiFPN and prediction heads are then scaled with the compound coefficient rather than searched separately for each model size.

EfficientDet achieves performance comparable to, or better than, that of more complex models while maintaining lower computational costs.
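
Compound scaling can be sketched with the scaling equations given in the EfficientDet paper, where a single coefficient φ drives every dimension. The published configurations round the BiFPN width to hardware-friendly values and treat the largest models specially, so this sketch only reproduces the raw formulas; the function name and dictionary keys are illustrative.

```python
def efficientdet_config(phi):
    """Raw compound-scaling values for coefficient `phi` (0, 1, 2, ...)."""
    return {
        "bifpn_width": 64 * (1.35 ** phi),     # channels, before rounding
        "bifpn_depth": 3 + phi,                # number of BiFPN layers
        "head_depth": 3 + phi // 3,            # layers in box/class heads
        "input_resolution": 512 + 128 * phi,   # square input side in pixels
    }
```

Raising φ by one grows width, depth, and resolution together instead of tuning each of them by hand.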

8. FCOS (Fully Convolutional One-Stage Object Detection)

FCOS, proposed by Tian et al. in 2019, is a single-stage, fully convolutional object detection model. Unlike anchor-based detectors, FCOS directly predicts the category and location of targets on the feature map without the need for predefined anchor boxes. Its key features include:

  • Anchor-free Design: FCOS dispenses with anchor boxes, simplifying the model design and training process.
  • Centerness Prediction: Locations far from an object's center tend to produce low-quality bounding boxes. FCOS therefore predicts a centerness score for each location and uses it to down-weight such detections, which helps suppress low-quality results.
  • Fully Convolutional Structure: FCOS uses a fully convolutional network to directly predict the bounding boxes and categories of targets at each pixel location, making the model more flexible in handling targets of different sizes.

FCOS maintains performance comparable to other single-stage detectors while simplifying the model design.
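
The centerness target from the FCOS paper can be written directly from the four distances a location regresses to the box sides. The function name is illustrative, and all four distances are assumed to be positive (the location lies strictly inside the box).

```python
import math


def centerness(left, top, right, bottom):
    """Centerness in (0, 1]: 1 at the box center, near 0 close to an edge."""
    horizontal = min(left, right) / max(left, right)
    vertical = min(top, bottom) / max(top, bottom)
    return math.sqrt(horizontal * vertical)
```

At inference time this score is multiplied with the classification score, so boxes predicted from off-center locations rank lower before non-maximum suppression.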

9. Transformer-based Models (e.g., DETR)

DETR (Detection Transformer), proposed by Carion et al. at Facebook AI Research in 2020, is a pioneer in applying the Transformer architecture to object detection. DETR uses the encoder-decoder structure of the Transformer to process image features and directly outputs a set of predictions, including bounding boxes and category labels. A key feature of DETR is that it does not rely on predefined anchor boxes or post-processing steps such as non-maximum suppression. Instead, it treats detection as set prediction: during training, the Hungarian algorithm finds a one-to-one matching between predictions and ground-truth objects, so each object is assigned to exactly one prediction and duplicate detections are penalized.
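
The one-to-one matching can be illustrated with a brute-force search over assignments, which is only practical for a handful of boxes. Real implementations use an efficient Hungarian solver such as SciPy's `linear_sum_assignment`, and DETR's actual cost also includes a generalized IoU term; here the cost is simplified to class probability plus L1 box distance, and all names are illustrative.

```python
from itertools import permutations


def match_cost(prediction, target):
    """Lower is better: reward the right class, penalize box distance."""
    class_probs, pred_box = prediction
    target_class, target_box = target
    class_cost = -class_probs[target_class]
    box_cost = sum(abs(p - t) for p, t in zip(pred_box, target_box))
    return class_cost + box_cost


def bipartite_match(predictions, targets):
    """Return (prediction_index, target_index) pairs with minimal total cost.

    Assumes there are at least as many predictions as targets.
    """
    best_cost, best = float("inf"), ()
    for chosen in permutations(range(len(predictions)), len(targets)):
        cost = sum(match_cost(predictions[p], targets[t])
                   for t, p in enumerate(chosen))
        if cost < best_cost:
            best_cost, best = cost, chosen
    return [(p, t) for t, p in enumerate(best)]
```

Predictions left unmatched are trained to output the "no object" class, which is how the model learns not to emit duplicates.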

10. Conclusion

Object detection is a core task of computer vision. The R-CNN series introduced deep learning into object detection and initiated a line of innovations, including Fast R-CNN, Faster R-CNN, Mask R-CNN, and Cascade R-CNN, that progressively improved detection speed and accuracy. The YOLO series, with its single-shot detection mechanism, has achieved fast and efficient detection, up to the latest developments of YOLOv8. SSD and RetinaNet addressed key issues in detection through multi-scale feature maps and Focal Loss, respectively. CenterNet, EfficientDet, and FCOS further optimized the detection process, improving efficiency and performance. Finally, Transformer-based models like DETR introduced a new architecture, bringing a fresh perspective to object detection. The development of these models not only drives technological progress but also lays a solid foundation for future research and applications. As technology continues to evolve, we can expect object detection to reach new heights in accuracy, speed, and applicability.