YOLOX: Exceeding YOLO Series in 2021

https://arxiv.org/abs/2107.08430

How they improve the YOLO series

Decoupled Head

As the model architecture figure in the original paper shows, YOLOX uses a decoupled head to predict bounding box information such as class, box size, and location. YOLOv3, by contrast, has a single coupled head whose one stream outputs all of this information simultaneously at the last layer. Separating the prediction layers into one branch per task can be regarded as a way of avoiding the mixing of different types of information.
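
As a rough illustration of the difference, the sketch below applies both kinds of head to the feature vector of a single grid location. It is a minimal sketch, not the paper's implementation: convolutions are replaced by plain matrix products, the weights are random, and the names `CoupledHead`, `DecoupledHead`, `hidden_dim`, and the layer sizes are assumptions made up for this example.

```python
import random


def linear(weights, x):
    """Apply a weight matrix (list of rows) to a feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class CoupledHead:
    """One projection emits box, objectness, and class scores together."""

    def __init__(self, in_dim, num_classes, rng):
        self.w = random_matrix(4 + 1 + num_classes, in_dim, rng)

    def __call__(self, feat):
        out = linear(self.w, feat)
        return {"box": out[:4], "obj": out[4:5], "cls": out[5:]}


class DecoupledHead:
    """A shared stem feeds separate classification and regression branches."""

    def __init__(self, in_dim, hidden_dim, num_classes, rng):
        self.stem = random_matrix(hidden_dim, in_dim, rng)
        self.cls_branch = random_matrix(hidden_dim, hidden_dim, rng)
        self.reg_branch = random_matrix(hidden_dim, hidden_dim, rng)
        self.cls_out = random_matrix(num_classes, hidden_dim, rng)
        self.box_out = random_matrix(4, hidden_dim, rng)
        self.obj_out = random_matrix(1, hidden_dim, rng)

    def __call__(self, feat):
        shared = relu(linear(self.stem, feat))
        cls_feat = relu(linear(self.cls_branch, shared))  # class information only
        reg_feat = relu(linear(self.reg_branch, shared))  # location information only
        return {
            "box": linear(self.box_out, reg_feat),
            "obj": linear(self.obj_out, reg_feat),
            "cls": linear(self.cls_out, cls_feat),
        }


rng = random.Random(0)
feature = [rng.random() for _ in range(8)]  # feature vector of one grid cell
coupled = CoupledHead(8, 3, rng)(feature)
decoupled = DecoupledHead(8, 4, 3, rng)(feature)
```

Both heads return the same kinds of output, but in the decoupled version the class scores and the box values no longer share their last feature layers.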

multi-positive

In the ordinary YOLO series, one positive label is assigned per object. In YOLOX, multiple locations surrounding the target are also assigned as positive labels. Using only one positive per object causes a label imbalance between foreground and background, so assigning multiple positives per object should make training more stable.
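
The idea can be sketched on a feature-map grid as follows. This is a minimal sketch under assumptions: the function name `positive_cells` is hypothetical, and the neighbourhood radius of 1 cell (a 3x3 area around the object center) reflects my reading of the paper rather than anything stated in these notes.

```python
def positive_cells(cx, cy, grid_w, grid_h, radius=1):
    """Return the (row, col) grid cells treated as positives for one object.

    cx, cy are the object center in grid units. radius=0 reproduces the
    single-positive assignment; radius=1 also marks the surrounding cells.
    """
    col, row = int(cx), int(cy)
    cells = []
    for r in range(row - radius, row + radius + 1):
        for c in range(col - radius, col + radius + 1):
            if 0 <= r < grid_h and 0 <= c < grid_w:  # drop cells outside the grid
                cells.append((r, c))
    return cells


single = positive_cells(5.3, 4.7, grid_w=10, grid_h=10, radius=0)
multi = positive_cells(5.3, 4.7, grid_w=10, grid_h=10, radius=1)
print(len(single), len(multi))
```

On a 10x10 grid this raises the share of foreground cells for one object from 1 in 100 to 9 in 100, which is the imbalance the paragraph above refers to.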

anchor-free

YOLOv3 prepares 9 anchors to detect objects of different sizes, whereas YOLOX uses no anchors at all. The 3 data streams that YOLOv3 uses to handle anchors are therefore not required in YOLOX. Going anchor-free makes the model simpler, and choosing anchors in preprocessing is no longer required.
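
To make the simplification concrete, the sketch below counts the predictions a detector emits with and without anchors, and decodes one anchor-free prediction into a box. The 640 input size, the strides 8/16/32, the 3 anchors per cell, and the function names are assumptions chosen for illustration; the decoding formula follows my understanding of the anchor-free scheme and should be checked against the official code.

```python
import math


def num_predictions(input_size, strides=(8, 16, 32), anchors_per_cell=1):
    """Total predictions over all feature-map scales."""
    return sum((input_size // s) ** 2 * anchors_per_cell for s in strides)


def decode_anchor_free(tx, ty, tw, th, col, row, stride):
    """Turn raw outputs of one grid cell into a box (cx, cy, w, h) in pixels.

    No anchor size is involved: the offsets are relative to the grid cell
    and the size is scaled by the stride only.
    """
    cx = (col + tx) * stride
    cy = (row + ty) * stride
    w = math.exp(tw) * stride
    h = math.exp(th) * stride
    return cx, cy, w, h


anchor_based = num_predictions(640, anchors_per_cell=3)  # 3 anchors per cell
anchor_free = num_predictions(640, anchors_per_cell=1)   # 1 prediction per cell
box = decode_anchor_free(0.5, 0.5, 0.0, 0.0, col=2, row=3, stride=8)
```

With these assumed settings the anchor-free variant produces one third as many predictions, and no anchor sizes have to be chosen before training.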

SimOTA

strong augmentation

MixUp and Mosaic augmentations are introduced into the YOLOX training procedure. Both are also used in YOLOv4 and YOLOv5, and they are becoming common across detectors.

MixUp literally mixes 2 images, together with their targets, using an alpha value. Mosaic combines 3 or 4 images into a single image.
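
Both augmentations can be sketched in a few lines on tiny grayscale images stored as nested lists. This is a simplified illustration, not the YOLOX implementation: drawing the mixing weight from a Beta(alpha, alpha) distribution follows the generic MixUp recipe, the Mosaic here simply tiles 4 equally sized images in a 2x2 layout without random scaling or cropping, and the function names `mixup` and `mosaic4` are hypothetical.

```python
import random


def mixup(img_a, boxes_a, img_b, boxes_b, alpha=1.0, rng=random):
    """Blend two same-sized images pixel by pixel and keep both sets of boxes."""
    lam = rng.betavariate(alpha, alpha)  # mixing weight in (0, 1)
    mixed = [
        [lam * pa + (1.0 - lam) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]
    return mixed, boxes_a + boxes_b, lam


def mosaic4(images, boxes_per_image):
    """Tile 4 same-sized images into one 2x2 canvas and shift their boxes.

    Boxes are (x1, y1, x2, y2) in pixel coordinates of their own image.
    """
    h, w = len(images[0]), len(images[0][0])
    top = [ra + rb for ra, rb in zip(images[0], images[1])]
    bottom = [rc + rd for rc, rd in zip(images[2], images[3])]
    canvas = top + bottom
    offsets = [(0, 0), (w, 0), (0, h), (w, h)]  # top-left corner of each tile
    boxes = []
    for (dx, dy), tile_boxes in zip(offsets, boxes_per_image):
        for x1, y1, x2, y2 in tile_boxes:
            boxes.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy))
    return canvas, boxes


black = [[0.0, 0.0], [0.0, 0.0]]
white = [[1.0, 1.0], [1.0, 1.0]]
mixed, mixed_boxes, lam = mixup(
    black, [(0, 0, 1, 1)], white, [(1, 1, 2, 2)], rng=random.Random(0)
)
tiles = [[[float(k)] * 2 for _ in range(2)] for k in range(1, 5)]
canvas, canvas_boxes = mosaic4(tiles, [[], [], [], [(0, 0, 1, 1)]])
```

In both cases the targets are transformed together with the pixels: MixUp keeps the boxes of both source images, and Mosaic shifts each box by the offset of its tile.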

Differences and Commonalities with the former version

Number of parameters

YOLOv3 has about 65M parameters, which is not so different from YOLOX. Although YOLOX eliminates the anchor-processing flow, it adds a decoupled head, which increases the number of parameters.
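
The trade-off can be illustrated by counting convolution weights for the two kinds of head at a single scale. All layer widths here are hypothetical (256 channels, 80 classes, 3 anchors for the coupled head), batch-norm parameters are ignored, and the decoupled layout (a 1x1 stem followed by two branches of two 3x3 convolutions) is an assumption based on my reading of the paper, so the absolute numbers mean little; only the direction of the change matters.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of one convolution layer."""
    return kernel * kernel * c_in * c_out + c_out


def coupled_head_params(c_in=256, num_classes=80, num_anchors=3):
    # A single 1x1 conv predicts box (4), objectness (1), and classes per anchor.
    return conv_params(1, c_in, num_anchors * (4 + 1 + num_classes))


def decoupled_head_params(c_in=256, width=256, num_classes=80):
    stem = conv_params(1, c_in, width)
    branches = 2 * 2 * conv_params(3, width, width)  # 2 branches x 2 convs each
    outputs = (
        conv_params(1, width, num_classes)  # class scores
        + conv_params(1, width, 4)          # box
        + conv_params(1, width, 1)          # objectness
    )
    return stem + branches + outputs


coupled = coupled_head_params()
decoupled = decoupled_head_params()
print(coupled, decoupled)
```

Dropping the anchors shrinks the output layers, but the extra branch convolutions outweigh that saving, which is consistent with the total parameter count staying in the same range.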