YOLOv9: A Leap in Real-Time Object Detection

Object detection has seen rapid advancement in recent years thanks to deep learning algorithms like YOLO (You Only Look Once). The latest iteration, YOLOv9, brings major improvements in accuracy, efficiency and applicability over previous versions. In this post, we’ll dive into the innovations that make YOLOv9 a new state-of-the-art for real-time object detection.

A Quick Primer on Object Detection

Before getting into what’s new with YOLOv9, let’s briefly review how object detection works. The goal of object detection is to identify and locate objects within an image, like cars, people or animals. It’s a key capability for applications like self-driving cars, surveillance systems, and image search.

The detector takes an image as input and outputs bounding boxes around detected objects, each with an associated class label. Popular datasets like MS COCO provide thousands of labeled images to train and evaluate these models.
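
To make this concrete, here is a minimal Python sketch of what a detector's output typically looks like, along with the intersection-over-union (IoU) metric used to match predictions against ground-truth boxes on benchmarks like MS COCO. The `Detection` structure and the numbers are illustrative, not any particular library's API.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple      # (x1, y1, x2, y2) in pixel coordinates
    label: str      # predicted class name, e.g. "car"
    score: float    # confidence in [0, 1]

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

# A detector returns a list of detections per image:
preds = [Detection((48, 32, 310, 270), "car", 0.92)]
print(iou(preds[0].box, (50, 30, 300, 260)))  # overlap with a ground-truth box
```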

There are two main approaches to object detection:

  • Two-stage detectors like Faster R-CNN first generate region proposals, then classify and refine the boundaries of each region. They tend to be more accurate but slower.
  • Single-stage detectors like YOLO apply a model directly over the image in a single pass. They trade off some accuracy for very fast inference times.

YOLO pioneered the single-stage approach. Let’s look at how it has evolved over multiple versions to improve accuracy and efficiency.

Review of Previous YOLO Versions

The YOLO (You Only Look Once) family of models has been at the forefront of fast object detection since the original version was published in 2016. Here’s a quick overview of how YOLO has progressed over multiple iterations:

  • YOLOv1 proposed a unified model to predict bounding boxes and class probabilities directly from full images in a single pass. This made it extremely fast compared to previous two-stage models.
  • YOLOv2 improved upon the original with batch normalization for more stable training, anchor boxes at multiple scales and aspect ratios to detect objects of different sizes, and a variety of other optimizations.
  • YOLOv3 added a new feature extractor called Darknet-53, with more layers and shortcut connections between them, further improving accuracy.
  • YOLOv4 combined ideas from other object detectors and segmentation models to push accuracy even higher while maintaining fast inference.
  • YOLOv5 reimplemented the model in PyTorch with a CSPDarknet feature extraction backbone and several other enhancements.
  • YOLOv6 continued to optimize the architecture and training process, with some models pre-trained on large external datasets to boost performance further.
  • YOLOv7 introduced the Efficient Layer Aggregation Network (ELAN) backbone, which, as we'll see, is the direct ancestor of YOLOv9's architecture, and YOLOv8 further refined the architecture and training pipeline.

So in summary, previous YOLO versions achieved higher accuracy through improvements to model architecture, training techniques, and pre-training. But as models get bigger and more complex, speed and efficiency start to suffer.

The Need for Better Efficiency

Many applications require object detection to run in real-time on devices with limited compute resources. As models become larger and more computationally intensive, they become impractical to deploy.

For example, a self-driving car needs to detect objects at high frame rates using processors inside the vehicle. A security camera needs to run object detection on its video feed within its own embedded hardware. Phones and other consumer devices have very tight power and thermal constraints.

Recent YOLO versions obtain high accuracy by using large numbers of parameters and floating-point operations (FLOPs). But this comes at the cost of speed, size, and power efficiency.

For example, YOLOv5-L requires over 100 billion FLOPs to process a single 1280×1280 image, which is too slow for many real-time use cases. The trend toward ever-larger models also increases the risk of overfitting and makes generalization harder.
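
As a rough illustration of why compute grows so quickly, the FLOPs of one convolution layer scale with kernel size, channel counts, and output resolution. The sketch below uses the standard back-of-the-envelope formula; the layer shapes are made up for illustration.

```python
def conv_flops(h_out, w_out, c_in, c_out, k=3):
    """Rough FLOPs for one convolution layer, counting a
    multiply-add as two floating-point operations."""
    return 2 * h_out * w_out * c_in * c_out * k * k

# One 3x3 convolution, 256 -> 256 channels, on a 160x160 feature map:
print(f"{conv_flops(160, 160, 256, 256) / 1e9:.1f} GFLOPs")  # ~30.2
# Detection networks stack dozens of such layers, so input resolution
# and channel width drive total compute very sharply.
```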

So in order to expand the applicability of object detection, we need ways to improve efficiency – getting better accuracy with fewer parameters and less computation. Let's look at the techniques YOLOv9 uses to tackle this challenge.

YOLOv9 – Better Accuracy with Less Resources

The researchers behind YOLOv9 focused on improving efficiency in order to achieve real-time performance across a wider range of devices. They introduced two key innovations:

  1. A new model architecture called Generalized Efficient Layer Aggregation Network (GELAN) that maximizes accuracy while minimizing parameters and FLOPs.
  2. A training technique called Programmable Gradient Information (PGI) that provides more reliable learning gradients, especially for smaller models.

Let’s look at how each of these advancements helps improve efficiency.

More Efficient Architecture with GELAN

The model architecture itself is critical for balancing accuracy against speed and resource usage during inference. The neural network needs enough depth and width to capture relevant features from the input images. But too many layers or filters lead to slow and bloated models.

The authors designed GELAN specifically to squeeze the maximum accuracy out of the smallest possible architecture.

GELAN uses two main building blocks stacked together:

  • Efficient Layer Aggregation Blocks – These aggregate transformations across multiple network branches to capture multi-scale features efficiently.
  • Computational Blocks – By default these are CSPNet blocks, which help propagate information across layers; any computational block can be substituted to match compute constraints.

By carefully balancing and combining these blocks, GELAN hits a sweet spot between performance, parameters, and speed. The same modular architecture can scale up or down across different sizes of models and hardware.
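
To make the scaling idea concrete, a model family like this is usually expressed as one architecture plus a small table of width and depth multipliers. The multipliers below are invented placeholders for illustration, not YOLOv9's published configuration.

```python
# Hypothetical scaling table: one architecture, several size targets.
GELAN_CONFIGS = {
    "small":  {"width": 0.50, "depth": 0.33},   # edge devices
    "medium": {"width": 0.75, "depth": 0.67},   # desktop GPU
    "large":  {"width": 1.00, "depth": 1.00},   # server-class
}

def scaled(base_channels, base_blocks, size):
    """Scale a stage's channel count and block depth for a size target."""
    cfg = GELAN_CONFIGS[size]
    return (round(base_channels * cfg["width"]),
            max(1, round(base_blocks * cfg["depth"])))

print(scaled(512, 6, "small"))   # -> (256, 2): narrower and shallower
```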

Experiments showed GELAN fits more performance into smaller models compared to prior YOLO architectures. For example, GELAN-Small with 7M parameters outperformed the 11M parameter YOLOv7-Nano. And GELAN-Medium with 20M parameters performed on par with YOLOv7 medium models requiring 35-40M parameters.

So by designing a parameterized architecture specifically optimized for efficiency, GELAN allows models to run faster and on more resource-constrained devices. Next we'll see how PGI helps them train better too.

Better Training with Programmable Gradient Information (PGI)

Model training is just as important for maximizing accuracy with limited resources. The YOLOv9 authors identified training issues in smaller models caused by unreliable gradient information.

Gradients determine how much a model’s weights are updated during training. Noisy or misleading gradients lead to poor convergence. This issue becomes more pronounced for smaller networks.
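
To make the mechanics concrete, here is one standard PyTorch training step with a toy model: the backward pass computes the gradient of the loss with respect to every weight, and the optimizer moves each weight against its gradient. Any noise in those gradients feeds directly into the update.

```python
import torch

model = torch.nn.Linear(4, 1)                  # toy model
opt = torch.optim.SGD(model.parameters(), lr=0.01)

x, y = torch.randn(8, 4), torch.randn(8, 1)    # toy batch
loss = torch.nn.functional.mse_loss(model(x), y)

opt.zero_grad()
loss.backward()   # gradient of the loss w.r.t. every weight
opt.step()        # each weight moves opposite its gradient
```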

The technique of deep supervision addresses this by introducing additional side branches with losses to propagate a better gradient signal through the network. But it tends to break down and cause divergence for smaller, lightweight models.

The full details are in the paper "YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information" (https://arxiv.org/abs/2402.13616).

To overcome this limitation, YOLOv9 introduces Programmable Gradient Information (PGI). PGI has two main components:

  • Auxiliary reversible branches – These provide cleaner gradients by maintaining reversible connections to the input using blocks like RevCols.
  • Multi-level gradient integration – This avoids divergence from different side branches interfering. It combines gradients from all branches before feeding back to the main model.

By generating more reliable gradients, PGI helps smaller models train just as effectively as bigger ones.

Experiments showed PGI improved accuracy across all model sizes, especially smaller configurations. For example, it boosted the AP of YOLOv9-Small by 0.1-0.4% over the baseline GELAN-Small, and the gains were even larger for deeper models like YOLOv9-E, which reaches 55.6% AP.

So PGI enables smaller, efficient models to train to higher accuracy levels previously only achievable by over-parameterized models.

YOLOv9 Sets New State-of-the-Art for Efficiency

By combining the architectural advances of GELAN with the training improvements from PGI, YOLOv9 achieves unprecedented efficiency and performance:

  • Compared to prior YOLO versions, YOLOv9 obtains better accuracy with 10-15% fewer parameters and 25% fewer computations. This brings major improvements in speed and capability across model sizes.
  • YOLOv9 surpasses other real-time detectors like YOLO-MS and RT-DETR in terms of parameter efficiency and FLOPs. It requires far fewer resources to reach a given performance level.
  • YOLOv9 models even beat larger models that rely on pre-training, like RT-DETR-X: despite using fewer parameters, YOLOv9-E achieves a higher 55.6% AP through its more efficient architecture.

So by addressing efficiency at the architecture and training levels, YOLOv9 sets a new state-of-the-art for maximizing performance within constrained resources.

GELAN – Optimized Architecture for Efficiency

YOLOv9 introduces a new architecture called Generalized Efficient Layer Aggregation Network (GELAN) that maximizes accuracy within a minimum parameter budget. It builds on top of prior YOLO models but optimizes the various components specifically for efficiency.

Background on CSPNet and ELAN

Recent YOLO versions since v5 have utilized backbones based on Cross-Stage Partial Network (CSPNet) for improved efficiency. CSPNet allows feature maps to be aggregated across parallel network branches while adding minimal overhead.

This is more efficient than just stacking layers serially, which often leads to redundant computation and over-parameterization.
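
Here is a minimal PyTorch sketch of the cross-stage partial pattern, under the usual formulation: split the channels, route only one part through the heavy computation, and merge the untouched part back at the end. The layer sizes and activation choice are illustrative, not CSPNet's exact design.

```python
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    """Cross-stage partial block: only half the channels pass
    through the heavy path; the rest skip straight to the merge."""
    def __init__(self, channels, n_convs=2):
        super().__init__()
        half = channels // 2
        self.heavy = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(half, half, 3, padding=1),
                          nn.BatchNorm2d(half), nn.SiLU())
            for _ in range(n_convs)
        ])
        self.merge = nn.Conv2d(channels, channels, 1)  # fuse both parts

    def forward(self, x):
        a, b = x.chunk(2, dim=1)              # split along channels
        return self.merge(torch.cat([a, self.heavy(b)], dim=1))

y = CSPBlock(64)(torch.randn(1, 64, 32, 32))  # -> (1, 64, 32, 32)
```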

YOLOv7 upgraded CSPNet to the Efficient Layer Aggregation Network (ELAN), which simplified the block structure.

ELAN removed shortcut connections between layers in favor of an aggregation node at the output. This further improved parameter and FLOPs efficiency.
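
In code, the aggregation-at-the-output pattern can be sketched like this: intermediate features are collected as the main path deepens and fused once at the end, rather than through per-layer shortcuts. This is a simplified illustration, not YOLOv7's exact block; the stage count and layer choices are assumptions.

```python
import torch
import torch.nn as nn

class ELANLikeBlock(nn.Module):
    """Sequential stages whose intermediate outputs are all
    concatenated and fused at a single aggregation node."""
    def __init__(self, channels, n_stages=3):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.SiLU())
            for _ in range(n_stages)
        )
        self.aggregate = nn.Conv2d(channels * (n_stages + 1), channels, 1)

    def forward(self, x):
        feats = [x]
        for stage in self.stages:
            feats.append(stage(feats[-1]))     # keep every stage's output
        return self.aggregate(torch.cat(feats, dim=1))

y = ELANLikeBlock(32)(torch.randn(1, 32, 40, 40))  # -> (1, 32, 40, 40)
```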

Generalizing ELAN for Flexible Efficiency

The authors generalized ELAN even further to create GELAN, the backbone used in YOLOv9. GELAN made key modifications to improve flexibility and efficiency:

  • Interchangeable computational blocks – Previous ELAN used fixed convolutional layers. GELAN allows substituting any computational block, like ResNet or CSPNet blocks, providing more architectural options.
  • Depth-wise parametrization – Separate depth settings for the main branch and the aggregation branch make it easier to fine-tune resource usage.
  • Stable performance across configurations – GELAN maintains accuracy with different block types and depths, allowing flexible scaling.

These changes make GELAN a strong but configurable backbone for maximizing efficiency.
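
The interchangeability described above can be expressed by parameterizing the block over the module it stacks and over the two depths. A hedged sketch extending the ELAN-like block from earlier; the block factory, defaults, and names are invented for illustration.

```python
import torch
import torch.nn as nn

def conv_unit(c):
    """Default computational block; any module mapping c -> c channels works."""
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.SiLU())

class GELANLikeBlock(nn.Module):
    """ELAN-style aggregation with a pluggable inner block and
    independent depths for the main and aggregation paths."""
    def __init__(self, channels, make_block=conv_unit,
                 main_depth=3, agg_depth=1):
        super().__init__()
        self.stages = nn.ModuleList(make_block(channels)
                                    for _ in range(main_depth))
        fused = channels * (main_depth + 1)
        self.aggregate = nn.Sequential(
            nn.Conv2d(fused, channels, 1),
            *[make_block(channels) for _ in range(agg_depth)])

    def forward(self, x):
        feats = [x]
        for stage in self.stages:
            feats.append(stage(feats[-1]))
        return self.aggregate(torch.cat(feats, dim=1))

# Swapping the computational block or the depths is a one-line change:
block = GELANLikeBlock(32, make_block=conv_unit, main_depth=2, agg_depth=2)
y = block(torch.randn(1, 32, 40, 40))   # -> (1, 32, 40, 40)
```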

In experiments, GELAN models consistently outperformed prior YOLO architectures in accuracy per parameter:

  • GELAN-Small with 7M parameters beat YOLOv7-Nano’s 11M parameters
  • GELAN-Medium matched heavier YOLOv7 medium models

So GELAN provides an optimized backbone for scaling YOLO across different efficiency targets. Next, let's look at how PGI improves training.

PGI – Improved Training for All Model Sizes

While architecture choices impact efficiency at inference time, the training process also determines how much accuracy a model can extract from a given parameter budget. YOLOv9 uses a new technique called Programmable Gradient Information (PGI) to improve training across different model sizes and complexities.

The Problem of Unreliable Gradients

During training, a loss function compares model outputs to ground truth labels and computes an error gradient to update parameters. Noisy or misleading gradients lead to poor convergence and efficiency.

Very deep networks exacerbate this through the information bottleneck: as signals pass through many layers, information is lost or compressed, corrupting the gradients that propagate back from deep layers.

Deep supervision helps by introducing auxiliary side branches with losses to provide cleaner gradients. But it often breaks down for smaller models, causing interference and divergence between different branches.
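
In training code, deep supervision usually looks like the sketch below: an auxiliary head attached at an intermediate depth gets its own loss term, and the weighted sum is backpropagated. The toy model, head placement, and the 0.3 weight are all illustrative assumptions.

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU(),
                         nn.Linear(32, 32), nn.ReLU())
main_head = nn.Linear(32, 10)
aux_head = nn.Linear(32, 10)          # side branch at an intermediate depth

x, target = torch.randn(4, 16), torch.randint(0, 10, (4,))

mid = backbone[1](backbone[0](x))     # intermediate features
feat = backbone[3](backbone[2](mid))  # final features

# Each branch contributes a loss; the auxiliary term injects gradient
# signal directly into the earlier layers.
loss = (nn.functional.cross_entropy(main_head(feat), target)
        + 0.3 * nn.functional.cross_entropy(aux_head(mid), target))
loss.backward()
```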

So we need a way to provide reliable gradients that works across all model sizes, especially smaller ones.

Introducing Programmable Gradient Information (PGI)

To address unreliable gradients, YOLOv9 proposes Programmable Gradient Information (PGI). PGI has two main components designed to improve gradient quality:

1. Auxiliary reversible branches

Additional branches maintain reversible connections back to the input using blocks like RevCols. Because the input remains recoverable, these branches supply clean gradients and avoid the information bottleneck.
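
The key property of a reversible branch is that its input can be reconstructed exactly from its output, so no information is destroyed on the way through the network. Below is a minimal additive-coupling sketch of that idea; it is a drastic simplification of blocks like RevCols, intended only to show why reversibility avoids the bottleneck.

```python
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(8, 8), nn.Tanh())  # any sub-network

def rev_forward(x1, x2):
    # Additive coupling: the mapping (x1, x2) -> (y1, y2) is invertible,
    # so it loses no information about the input.
    y1 = x1 + f(x2)
    y2 = x2
    return y1, y2

def rev_inverse(y1, y2):
    # The input is exactly recoverable -- no information bottleneck.
    x2 = y2
    x1 = y1 - f(x2)
    return x1, x2

x1, x2 = torch.randn(2, 8), torch.randn(2, 8)
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_inverse(y1, y2)
print(torch.allclose(x1, r1, atol=1e-6), torch.equal(x2, r2))  # True True
```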

2. Multi-level gradient integration

A fusion block aggregates gradients from all branches before feeding back to the main model. This prevents divergence across branches.
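
One simple way to realize this integration in code is to accumulate every branch's loss into a single scalar before one backward pass, so shared parameters receive the summed gradient instead of a sequence of conflicting updates. The three heads and their weights below are placeholders; in PGI the branches sit at different network depths.

```python
import torch
import torch.nn as nn

trunk = nn.Linear(16, 32)
heads = nn.ModuleList(nn.Linear(32, 10) for _ in range(3))  # branch heads
opt = torch.optim.SGD(list(trunk.parameters()) + list(heads.parameters()),
                      lr=0.01)

x, target = torch.randn(4, 16), torch.randint(0, 10, (4,))
feat = torch.relu(trunk(x))

# Combine every branch's loss first; the shared trunk then receives one
# integrated gradient rather than several conflicting updates.
weights = [1.0, 0.5, 0.25]
total = sum(w * nn.functional.cross_entropy(h(feat), target)
            for w, h in zip(weights, heads))

opt.zero_grad()
total.backward()
opt.step()
```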

By generating more reliable gradients, PGI improves training convergence and efficiency across all model sizes:

  • Lightweight models benefit from deep supervision they couldn’t use before
  • Larger models get cleaner gradients enabling better generalization

Experiments showed PGI boosted accuracy for small and large YOLOv9 configurations over baseline GELAN:

  • +0.1-0.4% AP for YOLOv9-Small
  • +0.5-0.6% AP for larger YOLOv9 models

So PGI’s programmable gradients enable models big and small to train more efficiently.

YOLOv9 Sets New State-of-the-Art Accuracy

By combining architectural improvements from GELAN and training enhancements from PGI, YOLOv9 achieves new state-of-the-art results for real-time object detection.

Experiments on the COCO dataset show YOLOv9 surpassing prior YOLO versions, as well as other real-time detectors like YOLO-MS, in both accuracy and efficiency.

Some key highlights:

  • YOLOv9-Small exceeds YOLO-MS-Small with 10% fewer parameters and computations
  • YOLOv9-Medium matches heavier YOLOv7 models using less than half the resources
  • YOLOv9-Large outperforms YOLOv8-X with 15% fewer parameters and 25% fewer FLOPs

Remarkably, smaller YOLOv9 models even surpass heavier models from other detectors that rely on pre-training, like RT-DETR-X. Despite using fewer parameters, YOLOv9-E outperforms RT-DETR-X in accuracy.

These results demonstrate YOLOv9’s superior efficiency. The improvements enable high-accuracy object detection in more real-world use cases.
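
For readers who want to try this out, the Ultralytics package distributes pretrained YOLOv9 checkpoints. Assuming it is installed (`pip install ultralytics`) and the `yolov9c.pt` weight name is still current, inference takes a few lines:

```python
from ultralytics import YOLO

model = YOLO("yolov9c.pt")        # downloads pretrained COCO weights
results = model("street.jpg")     # accepts a path, URL, or numpy array

for box in results[0].boxes:      # one Results object per input image
    print(model.names[int(box.cls)], float(box.conf), box.xyxy.tolist())
```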

Key Takeaways on YOLOv9 Upgrades

Let’s quickly recap some of the key upgrades and innovations that enable YOLOv9’s new state-of-the-art performance:

  • GELAN optimized architecture – Improves parameter efficiency through flexible aggregation blocks. Allows scaling models for different targets.
  • Programmable gradient information – Provides reliable gradients through reversible connections and fusion. Improves training across model sizes.
  • Greater accuracy with fewer resources – Reduces parameters by 10-15% and computations by up to 25% compared to YOLOv8, while improving accuracy. Enables more efficient inference.
  • Superior results across model sizes – Sets new state-of-the-art for lightweight, medium, and large model configurations. Outperforms heavily pre-trained models.
  • Expanded applicability – Higher efficiency broadens viable use cases, like real-time detection on edge devices.

By directly addressing accuracy, efficiency, and applicability, YOLOv9 moves object detection forward to meet diverse real-world needs. The upgrades provide a strong foundation for future innovation in this critical computer vision capability.