Object Detection

What is Object Detection?

Object detection is a subfield of computer vision that focuses on identifying and locating objects within an image or video. It not only classifies the objects present but also determines their precise positions by drawing bounding boxes around them. This dual capability of classification and localization makes object detection a powerful tool for a wide range of practical applications. Within the umbrella of object detection lie multi-class and multi-object detection, both of which are widely useful in practice.

Classification: Determines what objects are present in an image, and labels them as such. For instance, if an image contains a cat and a dog, classification labels it as "cat" and "dog."

Localization: Identifies where the objects are located within the image. This involves drawing a bounding box around each object and specifying the coordinates of the box. Bounding boxes pinpoint the exact location of objects, which is crucial for applications that require spatial information (a small sketch of common box-coordinate formats follows these definitions).

Multi-class Detection: The ability to detect and classify multiple types of objects within a single image. For example, identifying cars, pedestrians, and bicycles simultaneously in an image or video.

Multi-object Detection: The ability to detect multiple instances of the same or different classes of objects within an image. For instance, detecting several different cars or several different pedestrians in a crowded street.
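
To make box coordinates concrete, here is a small, library-agnostic sketch of two common ways a bounding box is stored and how to convert between them; the numbers are illustrative only:

```python
# Corner format (x_min, y_min, x_max, y_max) versus center format (x_center, y_center, width, height).
def corners_to_center(box):
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2, x_max - x_min, y_max - y_min)

def center_to_corners(box):
    x_c, y_c, w, h = box
    return (x_c - w / 2, y_c - h / 2, x_c + w / 2, y_c + h / 2)

print(corners_to_center((100, 120, 300, 340)))   # -> (200.0, 230.0, 200, 220)
print(center_to_corners((200, 230, 200, 220)))   # -> (100.0, 120.0, 300.0, 340.0)
```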

How it Works:

Annotation: Images need to be annotated with bounding boxes around objects of interest. This involves drawing rectangles around objects and labeling them. The annotated data is then split into training, testing, and validation sets.
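
As an illustration, one widely used annotation format (the YOLO text format) stores one line per object with normalized box coordinates; the label values and split sizes below are made up for the example:

```python
import random

# One annotation line: "class_id x_center y_center width height", box values normalized to [0, 1].
example_label = "0 0.512 0.430 0.210 0.180"
class_id, x_c, y_c, w, h = example_label.split()
print(int(class_id), float(x_c), float(y_c), float(w), float(h))

# Split the annotated images into training, validation, and test sets (e.g., roughly 80/10/10).
image_ids = [f"img_{i:04d}.jpg" for i in range(1000)]
random.shuffle(image_ids)
train_set = image_ids[:800]
val_set = image_ids[800:900]
test_set = image_ids[900:]
```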

Training the Model: The chosen model is trained on the training dataset. The process involves feeding the model images and their corresponding annotations, allowing it to learn to predict bounding boxes and labels. During training, a loss function measures the error between the predicted and actual bounding boxes and labels, and the model parameters are adjusted to minimize that loss. The separate validation set is used to tune hyperparameters and to check that the model generalizes beyond the training images.
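
The following sketch shows what a single training step can look like, using torchvision's off-the-shelf Faster R-CNN detector rather than any specific model discussed later; the image, boxes, and class count are dummy values:

```python
import torch
import torchvision

# Off-the-shelf detector; in training mode it returns the losses directly.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None, num_classes=3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
model.train()

# One dummy image with its annotation: boxes use (x_min, y_min, x_max, y_max).
images = [torch.rand(3, 480, 640)]
targets = [{
    "boxes": torch.tensor([[100.0, 120.0, 300.0, 340.0]]),
    "labels": torch.tensor([1]),
}]

loss_dict = model(images, targets)   # classification and box-regression losses
loss = sum(loss_dict.values())       # combine into a single scalar
optimizer.zero_grad()
loss.backward()                      # compute gradients of the loss
optimizer.step()                     # adjust parameters to reduce the loss
```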

Applying the Model: Once trained, the model can be used to detect objects in new images or videos. The model outputs predicted bounding boxes, confidence scores, and class labels for each detected object.
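
Continuing the same illustrative setup, applying a trained detector to a new image yields boxes, labels, and scores; the pretrained COCO weights here stand in for a model you trained yourself:

```python
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

with torch.no_grad():
    predictions = model([torch.rand(3, 480, 640)])   # one new, unseen image

# Each detection comes with a bounding box, a class label, and a confidence score.
for box, label, score in zip(predictions[0]["boxes"],
                             predictions[0]["labels"],
                             predictions[0]["scores"]):
    print(box.tolist(), int(label), float(score))
```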

Specific Steps Within The Model:

Preprocessing: The input image is converted to the format required by the neural network, resized, and color-adjusted.

Input: The detection process starts with the preprocessed image, which is fed into the neural network.

Feature Extraction: The input image passes through multiple convolutional layers that apply filters to detect various features like edges, textures, and patterns. Pooling layers reduce the spatial dimensions of the feature maps while retaining the most important information.

Classification and Localization: From the extracted features, the network proposes candidate regions (or, in single-stage detectors, predicts boxes directly over a grid of locations). Each candidate is further processed to refine the bounding box coordinates and classify the object within the box.

Post-Processing: Non-Maximum Suppression is applied to remove redundant bounding boxes, retaining only the most confident detections.
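
As a rough sketch of what Non-Maximum Suppression (NMS) does (real detectors use optimized library implementations), the logic boils down to keeping the highest-scoring box and discarding any boxes that overlap it too much:

```python
def iou(a, b):
    """Intersection over Union of two boxes in (x_min, y_min, x_max, y_max) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the most confident box and drop redundant boxes that overlap it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(100, 100, 200, 200), (105, 105, 205, 205), (400, 400, 500, 500)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))   # -> [0, 2]: the duplicate box 1 is suppressed
```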

Industry Use Cases:

Autonomous Vehicles

  • Identifying and tracking pedestrians to ensure their safety.
  • Recognizing other vehicles to avoid collisions and maintain proper distance.
  • Detecting traffic lights and signs.

Surveillance and Security

  • Detecting and identifying faces for security verification and monitoring.
  • Identifying unauthorized or dangerous objects.

Retail and Inventory Management

  • Detecting when products on shelves are running low or are out of stock.
  • Monitoring customer movements and interactions with products to optimize store layouts and marketing strategies.

Medical Imaging

  • Detecting tumors, fractures, or other anomalies in X-rays, MRIs, and CT scans.

Sports

  • Tracking players during games or practices to gather performance statistics.
  • Detecting suspicious activity in crowds at games.

State-of-the-art Models:

Here, we take a look at a few of the state-of-the-art models used for object detection. The following table lists five models along with statistics collected on the COCO test-dev dataset.

  • box mAP: measures the accuracy of object detection models by averaging precision across recall levels and Intersection over Union (IoU) thresholds, giving a single score that reflects both detection and localization performance
  • AP50: measures the average precision at a 50% IoU threshold, i.e., cases where approximate localization is sufficient
  • AP75: measures the average precision at a 75% IoU threshold, i.e., cases where more precise localization is necessary (a worked example follows this list)
  • APS: measures the average precision for small objects in an object detection model
  • APM: measures the average precision for medium objects in an object detection model
  • APL: measures the average precision for large objects in an object detection model
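
As a small worked example of the IoU thresholds behind AP50 and AP75, consider one predicted box compared against its ground-truth box; the coordinates are made up:

```python
predicted    = (100, 100, 200, 200)
ground_truth = (110, 110, 210, 210)

# Intersection over Union of the two boxes.
ix = max(0, min(predicted[2], ground_truth[2]) - max(predicted[0], ground_truth[0]))
iy = max(0, min(predicted[3], ground_truth[3]) - max(predicted[1], ground_truth[1]))
intersection = ix * iy
area_pred = (predicted[2] - predicted[0]) * (predicted[3] - predicted[1])
area_gt = (ground_truth[2] - ground_truth[0]) * (ground_truth[3] - ground_truth[1])
overlap = intersection / (area_pred + area_gt - intersection)

print(f"IoU = {overlap:.2f}")                            # IoU = 0.68
print("counts as correct for AP50:", overlap >= 0.50)    # True  (approximate localization)
print("counts as correct for AP75:", overlap >= 0.75)    # False (precise localization)
```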

Technical Application:

In this example, we explore the application of the YOLOv8 model from Ultralytics for object detection. YOLOv8, a high-performing publicly available model, comes in five sizes – nano, small, medium, large, and extra large, each offering a different balance between precision and computational time. For our task, we chose the medium model, as it provides a good trade-off between these factors. Our goal is to test this model on a dataset from Roboflow that contains images of shelves stocked with clothes. The model aims to identify sections of the shelves as either in stock, partially out of stock, or out of stock.
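
A sketch of how such a run can be set up with the Ultralytics Python API is shown below; the dataset YAML path is a placeholder for wherever the Roboflow export lives, and the exact arguments we used may differ:

```python
from ultralytics import YOLO

model = YOLO("yolov8m.pt")           # the medium-sized pretrained checkpoint
model.train(
    data="shelf-dataset/data.yaml",  # placeholder path to the Roboflow export
    epochs=20,
    imgsz=640,
)
```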

First, we trained the model on 292 images from our dataset over 20 epochs (training cycles). Each image shows shelves annotated with bounding boxes, and each box was labeled as in stock, partially out of stock, or out of stock. Running the model yielded the following results:

Let us focus on the mAP50 and mAP50-95 results, as they provide a measure of how precise our model is. The values obtained were 0.862 and 0.531 respectively, both solid numbers for these metrics. Both values trend upward as the number of epochs increases, showing that the model became more precise as it saw more images. The loss metrics, meanwhile, decrease as the number of epochs increases, showing that the model improved at localizing and classifying objects.

Next, we validated the model using a set of 29 images. The results remained consistent with the training set, with a mAP50 of 0.863 and a mAP50-95 of 0.531. This consistency reinforces the reliability of our model across different data sets.
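
Validation and prediction follow the same pattern; the weight and image paths below are placeholders (Ultralytics saves the best weights under a runs/ directory by default):

```python
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")   # weights saved by the training run (default location)
metrics = model.val()                               # evaluates on the validation split from data.yaml
print(metrics.box.map50, metrics.box.map)           # mAP50 and mAP50-95

results = model.predict("test_images/", save=True)  # annotated copies of the test images are saved
```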

Finally, we tested the model on a set of 14 images. The results showed that the model effectively identified sections of the shelves, performing well in real-world scenarios, as seen below:

Beyond static images, we also tested the YOLOv8 medium model on a short video clip featuring several people walking on a solid platform. We integrated the model with OpenCV, a computer vision library whose Python interface makes it easy to read video frames, draw bounding boxes, and display results, such as counting the number of specific objects in a video.
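
A simplified version of that integration looks roughly like the following; the video filename is a placeholder, and in practice you would load your own trained weights rather than the stock checkpoint:

```python
import cv2
from ultralytics import YOLO

model = YOLO("yolov8m.pt")                  # placeholder: swap in your trained weights
cap = cv2.VideoCapture("people.mp4")        # placeholder video file

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame)[0]               # run detection on one frame
    for box in results.boxes:
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        label = model.names[int(box.cls[0])]
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, f"{label} {float(box.conf[0]):.2f}", (x1, y1 - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    # Count a specific class of object (here, people) in the current frame.
    people = sum(1 for b in results.boxes if model.names[int(b.cls[0])] == "person")
    cv2.putText(frame, f"people: {people}", (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 0, 255), 2)
    cv2.imshow("detections", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```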

Looking at the frame above, we can see that the model accurately identified most objects and drew near-perfect bounding boxes around them. However, it did make a minor error by misidentifying a circle on the floor as a tennis racket. Despite this, the model proved to be quite effective for video object detection overall.

Conclusion:

Object detection is a critical aspect of computer vision, enabling the identification and localization of objects within images and videos. This technology involves steps like annotation, training, and application, where models are trained on labeled images to predict bounding boxes and classify objects. Evaluation metrics such as box mAP and AP50 provide comprehensive assessments of model performance.

The effectiveness of modern object detection models, like YOLOv8, can be demonstrated through practical examples. These models achieve high mAP scores and deliver consistent results across training, validation, and test sets. As object detection technology advances with improved neural network architectures and training techniques, its applications will expand, offering robust solutions to real-world challenges and enhancing various technological fields.
