Computer Vision, Deep Learning and Object Detection

Comprehensive Analysis of Object Detection Algorithms

Baris Cekic
24 min read · Jun 6, 2022
Image: https://unsplash.com/photos/dRNT_zPMZ6k

Overview

The human visual system is fascinating. The visual sensors perceive an image and convert it into electrical signals, which are passed on to the neural system. The brain then processes these signals, allowing humans not only to see but also to understand the context of an image: which objects are present, where they are, and how many there are. All of these complex processes happen almost instantly. If a person is given a pen and asked to draw a box around every visible object, the task is easily performed.

However, it is questionable whether a machine can perform this process as efficiently as humans. Convolutional Neural Networks (ConvNets or CNNs) are good at extracting features from a given image and classifying it, for example as a cat or a dog. This process is known as image classification, and it is an easy task if the objects are centred and only a few of them appear in the image. If the number of objects increases and the objects belong to different classes, they must be distinguished and localised within the image. This is known as object detection and localisation. Zhao et al. describe object detection as building a complete understanding of an image, including classifying objects and estimating their concepts and locations (Zhao, et al., 2019). Object detection also involves sub-tasks such as face detection, pedestrian detection, and keypoint detection. These sub-tasks power numerous applications, including human behaviour analysis, facial recognition, and autonomous driving (Zhao, et al., 2019).

In this article, I focus on the following object detection algorithms: the R-CNN family (R-CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN), the Single Shot MultiBox Detector (SSD), RetinaNet, and the YOLO family (YOLO, YOLO9000, YOLOv3, YOLOv4, and YOLOv5).

According to Bochkovskiy et al., object detectors have two primary parts: a backbone trained on ImageNet and a head used to predict classes and bounding boxes. Typically, VGG, ResNet, ResNeXt, or Darknet is used as the backbone on GPU platforms, while SqueezeNet, MobileNet, or ShuffleNet is used on CPU platforms. Most object detectors also insert layers between the backbone and the head to collect feature maps from different stages; this part is known as the 'neck'. Different necks can be used, such as Feature Pyramid Networks (FPN), PANet, or BiFPN. Depending on the object detector, various 'heads' can be used, including YOLO, SSD, or RetinaNet as one-stage detectors, or the R-CNN family as two-stage detectors (Bochkovskiy, et al., 2020).

Figure 1 Architecture of Object Detection Algorithms, (Bochkovskiy, et al., 2020)

R-CNNs — Region-based Convolutional Neural Networks

R-CNN and Fast R-CNN

In its first version, the Region-based Convolutional Neural Network (R-CNN) has three stages. The first stage is region proposal generation, which defines the set of candidate detections. The second stage is feature extraction for each region, and the final stage is classification (Girshick, et al., 2013). Girshick et al. use the Selective Search algorithm to generate region proposals and the AlexNet CNN architecture as a feature extractor. In the final stage, the extracted features are fed into linear Support Vector Machines (SVMs), optimised per class, to classify the presence of objects within each candidate region proposal. In addition to predicting the class of the region proposal, the algorithm also predicts four offset values that refine the bounding box. R-CNN has several drawbacks. First, it is slow to train, and classifying roughly 2,000 region proposals per image is costly. Second, it cannot be used in real-time scenarios, since testing a single image takes around 47 seconds. Third, the Selective Search algorithm is fixed: no learning happens at that stage, which may lead to bad candidate proposals.

In 2015, Girshick proposed an improved version of R-CNN, known as Fast R-CNN (Girshick, 2015). The approach is similar to the original, but instead of feeding region proposals to the convolutional layers, the original image is passed through the network once to generate a convolutional feature map. Region proposals are projected onto the feature map and then processed by an RoI (region of interest) pooling layer, which reshapes them into a fixed size before they are fed into fully connected layers. There are two sibling output layers: the first outputs a discrete probability distribution over categories for each RoI, and the second outputs the bounding box regression offsets. Both are trained jointly via a so-called multi-task loss.

Figure 2 Fast R-CNN Architecture, (Girshick, 2015)

As a result, from the RoI feature vector seen in Figure 2, the class and offset values are predicted for each proposed region. An important component in Fast R-CNN is RoIPooling, which allows for the reuse of the feature map from a previous convolution network; this technique improves training and testing time significantly and allows an object detection system to be trained end-to-end (Grel, 2017).

Figure 3 RoI Pooling Example (Grel, 2017)
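To make the operation concrete, here is a minimal sketch using torchvision's roi_pool; the feature map size, RoI coordinates, and spatial scale are made-up values for illustration only.

import torch
from torchvision.ops import roi_pool

# a toy 1 x 256 x 32 x 32 feature map (batch, channels, height, width)
feature_map = torch.randn(1, 256, 32, 32)

# one RoI in (x1, y1, x2, y2) coordinates of the original image
rois = [torch.tensor([[0.0, 0.0, 256.0, 256.0]])]

# spatial_scale maps image coordinates onto the feature map,
# e.g. 32 / 512 = 1/16 if the backbone downsamples a 512-pixel image by 16
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=32 / 512)
print(pooled.shape)  # torch.Size([1, 256, 7, 7]) -- fixed size regardless of RoI size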

The problem with Fast R-CNN is that it still relies on the Selective Search algorithm to generate region proposals. Because this process is costly and time-consuming, region proposal generation becomes the bottleneck of the algorithm.

Faster R-CNN

In 2015, Ren et al. proposed Faster R-CNN, which eliminates the selective search algorithm for region proposals and uses Region Proposal Networks (RPN) to learn the region proposals (Ren, et al., 2015).

In their original paper, Ren et al. use the Zeiler and Fergus model (5 sharable convolutional layers) and VGG-16 (13 sharable convolutional layers) as feature extractors, also known as backbones. In the PyTorch (torchvision) implementation, Faster R-CNN uses ResNet with FPN or MobileNetV3 with FPN as the feature extractor; the architecture is illustrated in Figure 4.

Figure 4 Faster RCNN Architecture
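As a quick illustration of the torchvision implementation mentioned above, the sketch below loads a pretrained Faster R-CNN with a ResNet-50 FPN backbone and runs it on a random tensor standing in for an image; the exact pretrained/weights argument may differ between torchvision versions.

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# COCO-pretrained detector (newer torchvision versions use a `weights=` argument)
model = fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# torchvision detection models take a list of 3 x H x W tensors scaled to [0, 1]
images = [torch.rand(3, 480, 640)]
with torch.no_grad():
    predictions = model(images)

# each prediction is a dict with 'boxes', 'labels' and 'scores'
print(predictions[0]['boxes'].shape, predictions[0]['scores'][:5])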

The Region Proposal Network (RPN) is a small network that slides over the convolutional feature map output by the last shared convolutional layer. It generates rectangular object proposals with objectness scores by taking a 3 x 3 spatial window of the input feature map. At each sliding-window location, the RPN predicts k possible proposals, which yield 2k objectness scores and 4k box coordinates. These k proposals are parameterised relative to k reference boxes, referred to as anchors, which come in different sizes and shapes (Ren, et al., 2015). Ren et al. use 3 scales and 3 aspect ratios, for a total of 9 anchors per location. RPNs are trained by assigning a binary class label to each anchor, indicating whether it contains an object. Positive labels are assigned to the anchor(s) with the highest IoU with a ground-truth box and to any anchor with an IoU overlap higher than 0.7 with a ground-truth box. Negative labels are assigned to non-positive anchors whose IoU is lower than 0.3 for all ground-truth boxes. Under these definitions, the loss function for the RPN is defined as shown in Equation 1:
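Reconstructed from the original paper (Ren, et al., 2015), Equation 1 is:

L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)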

where i is the index of an anchor in a mini-batch and pi is the predicted probability that anchor i contains an object. The ground-truth label p*i is 1 if the anchor is positive and 0 if the anchor is negative; ti is the vector of the 4 parameterised coordinates of the predicted bounding box, and t*i is that of the ground-truth box associated with a positive anchor. The classification loss is binary cross-entropy, and the regression loss is smooth L1; the regression loss is activated only for positive anchors, where p*i is not zero. The 4 bounding box coordinates are parameterised as defined in Equation 2:
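Again following the original paper, Equation 2 reads:

t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a, \quad t_w = \log(w/w_a), \quad t_h = \log(h/h_a)

t_x^* = (x^* - x_a)/w_a, \quad t_y^* = (y^* - y_a)/h_a, \quad t_w^* = \log(w^*/w_a), \quad t_h^* = \log(h^*/h_a)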

where x and y denote a box's centre coordinates and w and h its width and height; x, xa, and x* refer to the predicted box, the anchor box, and the ground-truth box, respectively (and likewise for y, w, and h).

After RPN generates region proposals, Faster R-CNN also uses ROI Pooling — as in Fast R-CNN — to combine regional proposal and feature maps for detection tasks.

Mask R-CNN

He et al. developed Mask R-CNN in 2017 by extending Faster R-CNN with a branch that predicts an object mask in parallel with the bounding box prediction (He, et al., 2017). Mask R-CNN runs at 5 FPS, and its primary goal is instance segmentation. The new branch predicts a pixel-by-pixel segmentation mask on each region of interest (RoI). RoIAlign is used instead of RoIPool because it is quantisation-free: it fixes the misalignment and preserves spatial locations.

In PyTorch implementation, Mask R-CNN uses ResNet with FPN or MobileNetV3 with FPN as feature extractors; the architecture is illustrated in Figure 5.

Figure 5: Mask RCNN Structure

As shown in Table 1, Mask R-CNN also outperforms Faster R-CNN in object detection tasks in mAP. He et al. suggest that this improvement is due to RoIAlign (+1.1 AP), multi-task training (+0.9 AP), and ResNeXt-101 (+1.6 AP) (He, et al., 2017).

Table 1: Comparison of Mask R-CNN and Faster R-CNN

He et al. propose RoIAlign to solve the quantisation problem caused by RoIPooling (RoIPool was introduced above in the Fast R-CNN section). RoIAlign simply avoids any quantisation of RoI boundaries; it uses bilinear interpolation to compute the exact values of the features in each RoI bin and aggregates the results. As shown in Figure 6, the dashed grid is the feature map, the solid lines mark a region of interest (RoI), and the four dots are the sampling points. RoIAlign computes each sampling point by bilinear interpolation from the nearby grid points on the feature map; therefore, no quantisation occurs during these operations.

Figure 6: RoIAlign (He, et al., 2017).
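A small sketch contrasting the two operators using torchvision's roi_pool and roi_align; the tiny feature map and the fractional RoI are toy values chosen so that the RoI boundaries do not fall on integer coordinates.

import torch
from torchvision.ops import roi_align, roi_pool

# a toy 4 x 4 single-channel feature map with easily readable values
feature_map = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

# an RoI whose boundaries fall between feature-map cells
roi = [torch.tensor([[0.5, 0.5, 2.5, 2.5]])]

# RoIPool snaps the region to the grid before pooling, while RoIAlign
# samples with bilinear interpolation and keeps the sub-pixel alignment
print(roi_pool(feature_map, roi, output_size=(2, 2)))
print(roi_align(feature_map, roi, output_size=(2, 2), sampling_ratio=2))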

Mask R-CNN uses the same loss function as Faster R-CNN. Additionally, it has a mask loss as defined by Equation 3:
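From the Mask R-CNN paper, the per-RoI loss in Equation 3 is:

L = L_{cls} + L_{box} + L_{mask}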

Lcls and Lbox are defined as in Faster R-CNN. The mask branch outputs K binary masks for each RoI, one per class, each at m x m resolution, giving a K·m² dimensional output per RoI. This lets the network generate a mask for every class without competition between classes. He et al. apply a per-pixel sigmoid and define Lmask as the average binary cross-entropy loss (He, et al., 2017).

Single Shot MultiBox Detector (SSD)

Region-based object detectors such as the R-CNN family are at least two-stage detectors: the first stage generates proposals, and the second stage performs object detection for each proposal. The Single Shot MultiBox Detector, also known as SSD, is a single-stage detector: both object localisation and classification are completed in a single feed-forward pass of the network, which produces a fixed-size collection of bounding boxes and scores for the presence of object class instances, followed by non-maximum suppression to remove duplicate detections of the same object (Liu, et al., 2016). SSD runs at 59 frames per second (FPS) with an mAP of 74.3% on the VOC2007 test set. By comparison, Faster R-CNN ran at 7 FPS with an mAP of 73.2%, while YOLO ran at 45 FPS with an mAP of 63.4% (Liu, et al., 2016).

According to Liu et al., the greatest improvement comes from eliminating bounding box proposals and feature resampling. The contributions of SSD are threefold. First, it is faster and significantly more accurate than the previous state-of-the-art single-shot detector (YOLO). Second, it predicts category scores and box offsets for a fixed set of default bounding boxes, using small convolutional filters applied to feature maps. Third, it generates predictions from feature maps at different scales and separates predictions by aspect ratio.

SSD’s architecture is based on VGG-16, with the fully connected layers removed. Additional convolutional layers are added to extract features at different scales, progressively decreasing the feature map size at each layer; the paper calls these multi-scale feature maps for detection. SSD computes location and class scores using small (3x3) convolutional filters, which produce either a score for a category or a shape offset relative to the default box coordinates for each cell (Liu, et al., 2016). These filters are known as convolutional predictors for detection. SSD uses default bounding boxes, similar to the anchors in Faster R-CNN.

Figure 7: SSD Architecture (Liu, et al., 2016)

Liu et al. also describe a technique called hard negative mining, which keeps only a subset of the negative examples during training. Because most default boxes have a low intersection-over-union (IoU) with any ground-truth box and are therefore negative examples, Liu et al. keep at most a 3:1 ratio of negatives to positives, selecting the negatives with the highest confidence loss. This balances the training examples and helps the network learn from its hardest incorrect detections (Liu, et al., 2016).
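A rough sketch of this selection; the function name, tensor shapes, and toy losses are illustrative and not the authors' implementation.

import torch

def hard_negative_mining(conf_loss, positive_mask, neg_pos_ratio=3):
    # keep all positives and only the hardest negatives (3:1 negatives to positives)
    num_neg = neg_pos_ratio * positive_mask.sum()

    # ignore positives when ranking negatives by their confidence loss
    neg_loss = conf_loss.clone()
    neg_loss[positive_mask] = 0.0

    # select the negatives with the highest loss
    _, idx = neg_loss.sort(descending=True)
    _, rank = idx.sort()
    negative_mask = rank < num_neg

    return positive_mask | negative_mask

# toy per-default-box confidence losses and positive matches
conf_loss = torch.rand(100)
positive_mask = torch.zeros(100, dtype=torch.bool)
positive_mask[:5] = True
keep = hard_negative_mining(conf_loss, positive_mask)
print(keep.sum())  # 20: the 5 positives plus the 15 hardest negatives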

Data augmentation techniques are also applied, such as flipping and patching, as in many other neural network applications. Liu et al. use horizontal flipping with a probability of 0.5 to ensure that potential objects appear on both the left and right with similar likelihood.

The loss function for the SSD model is a weighted sum of localisation loss (loc) and the confidence loss (conf), as described in Equation 4:
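As given in the SSD paper, Equation 4 is:

L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)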

where x is the indicator for matching a default box to a ground-truth box, l is the predicted box, g is the ground-truth box, and N is the number of default boxes matched to the ground truth. The confidence loss is the softmax loss over multiple class confidences (c), while the localisation loss is the Smooth L1 loss between the predicted box and the ground truth (Liu, et al., 2016).

RetinaNet

RetinaNet was proposed in 2018 by Lin et al. in the paper 'Focal Loss for Dense Object Detection'. According to Lin et al., the reason behind the lower accuracy of one-stage detectors is an extreme foreground-background class imbalance. Class imbalance is addressed in the R-CNN family and other two-stage detectors: the proposal stage (Selective Search proposals or an RPN) filters out most background samples by narrowing the candidate object locations down to a small number (1–2k), whereas one-stage detectors must process around 100k locations. In the second stage, sampling heuristics such as a fixed foreground-to-background ratio (1:3) or online hard example mining keep the foreground and background balanced. Although one-stage detectors apply bootstrapping or hard example mining, Lin et al. argue that these techniques are insufficient to deal with the imbalance. They therefore reshape the cross-entropy loss so that it down-weights the loss assigned to well-classified examples (Lin, et al., 2018).

Figure 8: RetinaNet Architecture (Lin, et al., 2018)

RetinaNet uses ResNet with a Feature Pyramid Network (FPN) as the backbone, which is responsible for extracting rich, multi-scale feature maps from the entire image. RetinaNet then relies on one subnetwork for class prediction and another for bounding box regression. As seen in Table 2, RetinaNet was the first one-stage detector to beat the two-stage detectors of its time, not only in inference speed (FPS) but also in accuracy.

Table 2: Comparison of One-Stage vs Two-Stage Detectors.

Focal loss is proposed to solve the imbalance between the foreground and background classes. As stated by Lin et al., a weighting factor is the common way to modify the cross-entropy loss (Lin, et al., 2018). They therefore add a modulating factor (1 − pt)^γ, where γ ≥ 0 is the focusing parameter:
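FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)

A minimal sketch of the binary (sigmoid) form of this loss follows; the toy logits and mostly-background targets are illustrative, and torchvision ships an equivalent sigmoid_focal_loss in torchvision.ops.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), binary form
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# toy example with mostly-background targets, as in dense detection
logits = torch.randn(1000)
targets = torch.zeros(1000)
targets[:10] = 1.0
print(focal_loss(logits, targets))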

Figure 9: Focal Loss vs Cross-Entropy (Lin, et al., 2018)

You Only Look Once (YOLO) Family

YOLO, V1

The YOLO algorithm was first proposed in 2015 by Redmon et al. It adopts a different approach from other object detection algorithms at that time. It frames object detection as a regression problem, whereas others use a classification approach. Redmon et al. argue that, because a single network predicts both bounding box and class probabilities in one pass, it can be optimised end-to-end (Redmon, et al., 2015).

YOLO (v1) can process images at 45 FPS, and a smaller version of YOLO can process images at 155 FPS while still achieving double the mAP of other real-time object detectors (Redmon, et al., 2015).

The idea behind the YOLO algorithm is to take an image as input and split it into an S x S grid of cells; if the centre of an object falls into a grid cell, that grid cell is responsible for detecting that object.

Figure 10: YOLO Model (Redmon, et al., 2015)

Each grid cell predicts B bounding boxes and a confidence score for each box, which reflects how confident the model is that the box contains an object. Redmon et al. formulate the confidence as in Formula 6 (Redmon, et al., 2016):
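From the original paper, the confidence is defined as:

C = \Pr(\text{Object}) \cdot \text{IoU}^{\text{truth}}_{\text{pred}}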

According to this formula, if no object is present in the cell, the confidence score should be zero. If an object is present, Pr(Object) is 1 and the confidence equals the IoU between the predicted bounding box and the ground truth.

As mentioned above, each cell predicts B bounding boxes, with 5 values per box: x, y, w, h, and the confidence score. (x, y) is the centre of the bounding box relative to the grid cell, while w and h are the width and height relative to the whole image. Along with the bounding boxes, each grid cell also predicts the C conditional class probabilities Pr(Classi | Object), i.e. the probability of each class given that an object is present; the formula is rendered as follows (Redmon, et al., 2015):
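From the paper, multiplying the conditional class probabilities by the box confidence gives the class-specific confidence at test time:

\Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \text{IoU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \cdot \text{IoU}^{\text{truth}}_{\text{pred}}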

The total predictions for an image are equal to S x S x (B * 5 + C). In their paper, Redmon et al. state that they used S=7 and B=2 on the PASCAL VOC Dataset, which has 20 classes; thus, the final prediction is a 7 x 7 x 30 tensor. At the final stage, YOLO applies non-max suppression to eliminate duplicates (Redmon, et al., 2016).

Figure 11: YOLO Architecture (Redmon, et al., 2016)

YOLO v1 has 24 convolutional layers and 2 fully connected layers, inspired by GoogLeNet. However, instead of inception modules, YOLO uses 1x1 reduction layers followed by 3x3 convolutional layers. The convolutional layers are pretrained on ImageNet at half resolution (224 x 224), and the resolution is then doubled to 448 x 448 for detection.

YOLO v1 had known limitations at the time. Since each grid cell predicts only two bounding boxes and one class, it struggles to detect groups of close objects, and small objects in general. The loss function also treats errors in large and small bounding boxes identically, yet a small error in a small box has a much greater effect on its IoU.

YOLO utilises a multi-part loss function, which is the sum of localisation loss, confidence loss, and classification loss. The localisation loss measures errors in the predicted bounding boxes' locations and sizes; a weighting term λcoord (set to 5 in the paper) places more emphasis on bounding box accuracy. The confidence loss measures whether there is an object in cell i, referred to as objectness. If an object is detected, the classification loss measures the squared error of the class-conditional probabilities for each class.
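For completeness, the multi-part loss from the YOLO v1 paper is:

\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right]
+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right]
+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} (C_i - \hat{C}_i)^2
+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{noobj}_{ij} (C_i - \hat{C}_i)^2
+ \sum_{i=0}^{S^2} \mathbb{1}^{obj}_{i} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2

with \lambda_{coord} = 5 and \lambda_{noobj} = 0.5.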

At the time YOLO was proposed, it outperformed other real-time detectors in mAP and two-stage detectors in FPS, as seen in Table 3.

Table 3: YOLO Experiments

YOLO v2 (YOLO 9000)

SSD was a strong competitor at the time YOLO v2 was proposed. YOLO made more localisation errors than Fast R-CNN, and its recall (a measure of how well it finds all objects) was lower than that of region-proposal-based methods. YOLO v2 therefore aims to improve recall and localisation while maintaining classification accuracy (Redmon & Farhadi, 2016).

Redmon and Farhadi use multiple techniques to improve YOLO v2, describing in their paper the path from YOLO to YOLO v2. The first of these techniques is batch normalisation, which leads to significant improvements in convergence while eliminating the need for other forms of regularisation. By applying batch normalisation, Redmon and Farhadi gained a 2% improvement in mAP and were able to remove dropout without overfitting. The second technique is a higher-resolution classifier. All state-of-the-art object detectors use classifiers pre-trained on ImageNet; in YOLO v1 this is done at 224 x 224, but in YOLO v2 the classification network is fine-tuned on ImageNet at 448 x 448 resolution for 10 epochs. The higher resolution yields an additional 4% increase in mAP. The third technique is convolution with anchor boxes, as in Faster R-CNN: the network predicts offsets for hand-picked anchor boxes (priors) instead of predicting coordinates directly. Redmon and Farhadi therefore remove the fully connected layers and use anchor boxes to predict the bounding boxes. One pooling layer is removed to increase resolution, and the input is shrunk to 416 x 416 instead of 448 x 448. The motivation for this shrinking is to obtain an odd number of locations in the grid, guaranteeing a single centre cell: YOLO down-samples the image by a factor of 32, so a 448 input yields a 14 x 14 feature map whereas 416 yields 13 x 13. Finally, Redmon and Farhadi note that they can achieve better results by starting with better anchor boxes, so they run k-means clustering on the training-set bounding boxes to find them (a sketch of this clustering follows Figure 12).

Figure 12: Clustering Box Dimensions on VOC and COCO (Redmon & Farhadi, 2016).
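The anchor clustering described above can be sketched as follows, using the IoU-based distance d(box, centroid) = 1 − IoU from the paper; the toy box dimensions and the mean-based centroid update are illustrative assumptions rather than the authors' exact implementation.

import numpy as np

def iou_wh(boxes, centroids):
    # IoU between boxes and centroids when both are centred at the origin,
    # i.e. only widths and heights matter
    w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = w * h
    union = boxes[:, 0:1] * boxes[:, 1:2] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    # k-means where the distance is 1 - IoU (minimising it = maximising IoU)
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assignment = np.argmax(iou_wh(boxes, centroids), axis=1)
        for j in range(k):
            if np.any(assignment == j):
                centroids[j] = boxes[assignment == j].mean(axis=0)
    return centroids

# toy (width, height) pairs standing in for ground-truth box dimensions
boxes = np.abs(np.random.default_rng(1).normal([80, 60], [40, 30], size=(500, 2))) + 1
print(kmeans_anchors(boxes, k=5))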

Instead of predicting offsets relative to anchor boxes, Redmon and Farhadi use direct location prediction, predicting the offsets relative to the grid cell. This helps solve model instability during early iterations. For each bounding box, the network predicts 5 values: tx, ty, tw, th, and to. If the cell is offset from the top-left corner of the image by (cx, cy) and the prior (anchor) box has width pw and height ph, then the bounding box and objectness predictions are as follows:
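From the YOLO9000 paper, the predictions are:

b_x = \sigma(t_x) + c_x
b_y = \sigma(t_y) + c_y
b_w = p_w e^{t_w}
b_h = p_h e^{t_h}
\Pr(\text{object}) \cdot \text{IoU}(b, \text{object}) = \sigma(t_o)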

Figure 13: Bounding Boxes with Anchor Boxes and Direct Location Prediction (Redmon & Farhadi, 2016).

Redmon and Farhadi also use fine-grained features, that is, features from earlier layers. Other object detectors run predictions on feature maps at different scales, so Redmon and Farhadi apply a similar idea: a passthrough layer brings in the 26x26 features from an earlier layer and stacks the high-resolution and low-resolution features, much like the identity mappings in ResNet. They also use multi-scale training, which means changing the input size during training. They report using a set of multiples of 32 ({320, 352, …, 608}), a regime that forces the network to learn to predict well across different input dimensions (Redmon & Farhadi, 2016).

Table 4: Path from YOLO to YOLO v2

Redmon and Farhadi also move away from VGG-16 as the base feature extractor. While VGG-16 is powerful and accurate, it is also heavy: it requires 30.69 billion FLOPs for a single pass, whereas the custom GoogLeNet-based network used by the original YOLO requires only 8.52 billion, at the cost of slightly lower accuracy (88% top-5 on ImageNet versus 90% for VGG-16) (Redmon & Farhadi, 2016). For YOLO v2 they design a new base network, called Darknet-19, with 19 convolutional layers and 5 max-pooling layers.

Redmon and Farhadi also use hierarchical classification, which enables classification datasets and detection datasets to be merged by exploiting image-level labels. Hierarchical classification and the combined dataset enable real-time object detection across more than 9,000 object categories.

YOLO v3

In 2018, Redmon and Farhadi proposed several updates to the YOLO algorithm. YOLO v3 has a new feature extractor called Darknet-53, a Darknet variant with 53 convolutional layers trained on ImageNet. For the detection task, 53 additional layers are stacked on top, giving a 106-layer fully convolutional architecture. Due to this heavier architecture, YOLO v3 is not faster than YOLO v2, although it is more accurate (Redmon & Farhadi, 2018). Redmon and Farhadi claim that Darknet-53 is better than ResNet-101 and 1.5x faster, with performance similar to ResNet-152 while being 2x faster.

Table 5: Architecture of Darknet-53 (Redmon & Farhadi, 2018)

With its new architecture, YOLO v3 with Darknet-53 is better than SSD and close to the state-of-the-art RetinaNet at AP50, while being roughly 3x faster.

Table 6: YOLOv3 Comparison to Other Object Detectors (Redmon & Farhadi, 2018)

For performance reasons, Redmon and Farhadi replace the softmax with independent logistic classifiers for class prediction. This allows multilabel classification and solves the overlapping-labels problem (for example, 'woman' and 'person'). They also propose predictions across scales: YOLO v3 generates bounding box predictions at 3 different scales, so the output tensor is S x S x [3 * (4 + 1 + 80)] for the COCO dataset at each scale (Redmon & Farhadi, 2018). As seen in Figure 14, multi-scale prediction helps detect objects of different sizes. They use k-means clustering on the COCO dataset to find the anchor boxes for each scale, resulting in 9 anchor boxes (3 per scale): (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), and (373×326) (Redmon & Farhadi, 2018).

Figure 14: YOLO v3 Architecture (Ahmad, 2020)
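A quick way to see the three output grids for a 416 x 416 input; the strides come from the paper, and the snippet is just illustrative arithmetic.

# YOLO v3 predicts at three scales with strides 32, 16 and 8; with 3 anchors
# per cell and 80 COCO classes each cell predicts 3 * (4 + 1 + 80) = 255 values
input_size, num_anchors, num_classes = 416, 3, 80
for stride in (32, 16, 8):
    s = input_size // stride
    print(f"{s} x {s} x {num_anchors * (4 + 1 + num_classes)}")
# 13 x 13 x 255
# 26 x 26 x 255
# 52 x 52 x 255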

Incidentally, after developing YOLO v3, Redmon decided in 2020 to stop his computer vision research due to concerns about its military applications and the related privacy implications.

Figure 15: Joseph Redmon's Tweet about Quitting CV and OD Research (Twitter, 2020)

YOLO v4

Bochkovskiy et al. continued researching the YOLO algorithm and proposed YOLO v4 in 2020. The contribution of YOLO v4 lies primarily in developing an efficient and powerful object detection model, verifying 'bag-of-freebies' and 'bag-of-specials' methods, and modifying state-of-the-art methods so that they can be trained on a single GPU, making them accessible to everyone. They achieved strong results, with 43.5% AP (65.7% AP50) on the MS COCO dataset at a real-time speed of 65 FPS (Bochkovskiy, et al., 2020), by combining techniques such as Weighted Residual Connections (WRC), Cross-Stage Partial connections (CSP), Cross mini-Batch Normalisation (CmBN), Self-Adversarial Training (SAT), Mish activation, Mosaic data augmentation, and DropBlock regularisation.

Conventional object detectors are trained offline, so researchers can develop training methods that improve model accuracy without increasing the inference cost. Such techniques are therefore called a 'bag of freebies'.

Figure 16: Bag of Freebies (Bochkovskiy, et al., 2020)

Data augmentation helps train a model on a wider variety of input images, increasing its robustness. It relies mostly on photometric distortions (changing the brightness, contrast, hue, saturation, and noise of an image) and geometric distortions (rotation, flipping, cropping, and random scaling); this is one of the techniques in the bag-of-freebies category (Bochkovskiy, et al., 2020). Beyond pixel-wise augmentation, some researchers propose processing multiple images at once: MixUp multiplies and superimposes two images at different ratios, CutMix covers parts of one image with patches of another, and Mosaic mixes 4 different training images. As mentioned earlier regarding RetinaNet, unbalanced or biased datasets lead to low-accuracy models. Labels can also be wrong: for a small dataset, manual checking may be an option, but for larger datasets label smoothing is a mathematical way to improve learning from wrongly labelled samples (Szegedy, et al., 2015). Finally, although mean squared error is commonly used as a loss function for regression, Bochkovskiy et al. note that treating the bounding box coordinates as independent variables ignores the integrity of the object itself (Bochkovskiy, et al., 2020).
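A minimal sketch of MixUp on two images; in object detection, the bounding boxes of both images are kept. The Beta parameter and image shapes are illustrative assumptions.

import numpy as np

def mixup(image_a, image_b, alpha=0.2):
    # blend two images with a weight drawn from a Beta distribution;
    # the labels (or boxes) of both images are kept, weighted accordingly
    lam = np.random.beta(alpha, alpha)
    mixed = lam * image_a.astype(np.float32) + (1 - lam) * image_b.astype(np.float32)
    return mixed, lam

# toy images standing in for two training samples
a = np.random.randint(0, 256, (416, 416, 3), dtype=np.uint8)
b = np.random.randint(0, 256, (416, 416, 3), dtype=np.uint8)
mixed, lam = mixup(a, b)
print(mixed.shape, round(float(lam), 3))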

Bochkovskiy et al. also propose a bag of specials, which marginally increases inference costs but significantly improves the accuracy of object detection.

Figure 17: Bag of Specials (Bochkovskiy, et al., 2020)

Modules that enlarge the receptive field include SPP (Spatial Pyramid Pooling), proposed by He et al. to eliminate the fixed-size input limitation by generating fixed-length representations of images (He, et al., 2015); ASPP (Atrous Spatial Pyramid Pooling), proposed by Chen et al. to effectively enlarge the field of view of filters and incorporate larger context without significantly increasing the number of parameters or the amount of computation (Chen, et al., 2017); and RFB (Receptive Field Block), proposed by Liu et al. and inspired by the human visual system, which considers the relationship between the size and eccentricity of receptive fields to enhance feature discriminability and robustness (Liu, et al., 2018).

Attention modules, primarily channel-wise and pixel-wise attention, are also used in object detection. Squeeze-and-Excitation, the representative channel-wise attention module, was proposed by Hu et al. to enable networks to build more informative features by combining spatial and channel-wise information within local receptive fields at each layer (Hu, et al., 2019). As Bochkovskiy et al. report, the SE module is relatively costly on GPU (around +10% inference cost), although it is affordable for CPU/mobile devices (around +2%) (Bochkovskiy, et al., 2020). SAM (Spatial Attention Module), the representative pixel-wise attention module, was proposed by Woo et al. as a building block of the Convolutional Block Attention Module (Woo, et al., 2018); SAM generates a mask that enhances the important features that define the object and refines the feature maps.
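A minimal PyTorch sketch of a Squeeze-and-Excitation block; the reduction ratio of 16 follows the original paper, while the class and variable names are just illustrative.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    # squeeze: global average pooling to one value per channel
    # excitation: a two-layer bottleneck producing per-channel weights
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # squeeze: B x C
        w = self.fc(w).view(b, c, 1, 1)  # excitation: channel weights
        return x * w                     # recalibrate the feature map

x = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(x).shape)  # torch.Size([2, 64, 32, 32])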

Feature integration modules, such as skip connections and FPN, combine low-level features with high-level features. Activation functions provide non-linearity, and the goal when selecting one is to let the gradient backpropagate efficiently. A further post-processing step is NMS (Non-Maximum Suppression), which removes redundant bounding boxes that overlap a higher-scoring detection of the same object.
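A small sketch of NMS using torchvision's nms; the boxes and scores are toy values, and the first two boxes overlap heavily.

import torch
from torchvision.ops import nms

# three boxes in (x1, y1, x2, y2) format; the first two overlap heavily
boxes = torch.tensor([[10., 10., 100., 100.],
                      [12., 12., 102., 102.],
                      [200., 200., 300., 300.]])
scores = torch.tensor([0.9, 0.8, 0.7])

# indices of the boxes that survive suppression at an IoU threshold of 0.5
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]) -- the lower-scoring overlapping box is removed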

Thus, the YOLO v4 architecture is as follows: the backbone is CSPDarknet53, the neck consists of SPP and PAN, and the head is the same as in YOLO v3. The bag-of-freebies techniques for the backbone are CutMix, Mosaic data augmentation, DropBlock regularisation, and label smoothing. The bag-of-specials techniques for the backbone are Mish activation, cross-stage partial connections (CSP), and multi-input weighted residual connections (MiWRC). The bag-of-freebies techniques for the detector are CIoU loss, CmBN, DropBlock regularisation, Mosaic data augmentation, self-adversarial training, eliminating grid sensitivity, using multiple anchors for a single ground truth, a cosine-annealing learning-rate scheduler, optimal hyperparameters, and random training shapes. The bag-of-specials techniques for the detector are Mish activation, the SPP block, the SAM block, the PAN path-aggregation block, and DIoU-NMS (Bochkovskiy, et al., 2020).

In a discussion on GitHub, Bochkovskiy quantified the mAP contributions of the bag-of-freebies and bag-of-specials techniques: SPP (+3%), CSP+PAN (+2%), SAM (+0.3%), CIoU+S (+1.5%), Mosaic and hyperparameter tuning (+2%), and scaled anchors (+1%), totalling approximately +10% (Bochkovskiy, 2020). Roughly 5% of the total improvement thus comes from the architecture and another 5% from the bag of freebies (Bochkovskiy, 2020).

YOLO v5

YOLO v4 constituted a major leap forward over YOLO v3. Just a few months later, on 9 June 2020, Glenn Jocher, who was credited in the YOLO v4 paper for Mosaic data augmentation and had contributed significantly to the YOLOv3 repository (over 2,000 commits, raising mAP from 33 to 45.6), released YOLO v5 without an official paper. He simply open-sourced YOLO v5 on GitHub (Jocher, 2020).

Figure 18: Acknowledgement of Glenn Jocher by Bochkovskiy et al.

YOLO v5 is not Darknet-based; it is implemented entirely in PyTorch. Its mAP values on the MS COCO dataset are nearly as high as those reported for YOLO v4, and the biggest model, YOLOv5x, reports a slightly higher mAP (Kin-Yiu, 2020).

Figure 19: YOLOv5 Model Comparison with EfficientDet (Kin-Yiu, 2020)
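Since YOLO v5 is distributed as a PyTorch repository rather than a paper, the quickest way to try it is via torch.hub, following the repository README; the model name and example image URL come from that README, and network access is required on first use.

import torch

# load a small pretrained model from the ultralytics/yolov5 hub repo
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

# run inference on an image URL or a local path
results = model('https://ultralytics.com/images/zidane.jpg')
results.print()             # class counts and timing
print(results.xyxy[0][:5])  # boxes as (x1, y1, x2, y2, confidence, class)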

Jocher also discussed training performance in the GitHub repository issues, stating that 'our smallest YOLOv5 trains on COCO in only 3 days on one 2080Ti and runs inference faster and more accurately than EfficientDet D0, which was trained on 32 TPUv3 cores by the Google Brain team. By extension, we aim to comparably exceed D1, D2 etc. with the rest of the YOLOv5 family' (Jocher, 2020).

In the next article, I will review training and inference performances on different hardware platforms.

Stay tuned!

References

Ahmad, R., 2020. All about YOLOs — Part4 — YOLOv3, an Incremental Improvement. [Online]
Available at: https://medium.com/analytics-vidhya/all-about-yolos-part4-yolov3-an-incremental-improvement-36b1eee463a2

Bochkovskiy, A., 2020. Github.com, YOLOv5 About reproduced results Discussion. [Online]
Available at: https://github.com/ultralytics/yolov5/issues/6#issuecomment-643644347

Bochkovskiy, A., Wang, C.-Y. & Liao, H.-Y. M., 2020. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv, Volume arXiv:2004.10934v1.

Chen, L.-C., Papandreou, G., Murphy, K. & Yuille, A. L., 2017. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv, Volume arXiv:1606.00915v2.

Girshick, R., 2015. Fast R-CNN. arXiv, Issue 1504.08083v2.

Girshick, R., Donahue, J., Darrell, T. & Malik, J., 2013. Rich feature hierarchies for accurate object detection and semantic segmentation. [Online]
Available at: https://arxiv.org/pdf/1311.2524.pdf

Grel, T., 2017. Region of interest pooling explained. [Online]
Available at: https://deepsense.ai/region-of-interest-pooling-explained/

He, K., Gkioxari, G., Dollar, P. & Girshick, R., 2017. Mask R-CNN. arXiv, Volume arXiv:1703.06870v3.

He, K., Zhang, X., Ren, S. & Sun, J., 2015. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition.

Hu, J. et al., 2019. Squeeze-and-Excitation Networks. arXiv, Volume arXiv:1709.01507v4.

Jocher, G., 2020. Github.com, Issues. [Online]
Available at: https://github.com/ultralytics/yolov5/issues/2#issuecomment-642425558

Jocher, G., 2020. YOLOv5. [Online]
Available at: https://github.com/ultralytics/yolov5

Kin-Yiu, W., 2020. Github.com. [Online]
Available at: https://github.com/ultralytics/yolov5/issues/6#issuecomment-647069454

Lin, T.-Y. et al., 2018. Focal Loss for Dense Object Detection. arXiv, Volume arXiv:1708.02002v2.

Liu, S., Huang, D. & Wang, Y., 2018. Receptive Field Block Net for Accurate and Fast Object Detection. arXiv, Volume arXiv:1711.07767v3.

Liu, W. et al., 2016. SSD: Single Shot MultiBox Detector.

Redmon, J., Divvala, S., Girshick, R. & Farhadi, A., 2015. You Only Look Once: Unified, Real-Time Object Detection. arXiv, Issue arXiv:1506.02640v5.

Redmon, J., Divvala, S., Girshick, R. & Farhadi, A., 2016. You Only Look Once: Unified, Real-Time Object Detection.

Redmon, J. & Farhadi, A., 2016. YOLO9000: Better, Faster, Stronger. arXiv, Issue arXiv:1612.08242v1.

Redmon, J. & Farhadi, A., 2018. YOLOv3: An Incremental Improvement. arXiv, Volume arXiv:1804.02767v1.

Ren, S., He, K., Girshick, R. & Sun, J., 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv, Volume 1506.01497v3.

Szegedy, C., Vanhoucke, V., Ioffe, S. & Shlens, J., 2015. Rethinking the Inception Architecture for Computer Vision. arXiv, Volume arXiv:1512.00567v3.

Twitter, 2020. Twitter. [Online]
Available at: https://twitter.com/pjreddie/status/1230524770350817280?s=20

Woo, S., Park, J., Lee, J.-Y. & Kweon, I. S., 2018. CBAM: Convolutional Block Attention Module. arXiv, Volume arXiv:1807.06521v2.

Zhao, Z.-Q., Zheng, P., Xu, S.-t. & Wu, X., 2019. Object Detection with Deep Learning: A Review. arXiv, Volume arXiv:1807.05511v2.
