Minority class achieves 99.2% mAP while majority class only gets 65.3% - Expected behavior? #22978

bee1409 · 2025-12-17T09:05:36Z

I'm seeing a counter-intuitive pattern where the rarest class in my dataset significantly outperforms the most common class. Wondering if this is expected YOLOv11 behavior or if I should adjust my approach.

Dataset Characteristics

Task: Multi-class object detection (rail defects)
Classes: 5
Extreme imbalance: about 104:1 (24,546 scars vs. 235 breaks)

Class Instances
breaks 235 ← Rarest
cracks 1,915
lightband 1,915
rails 2,063
scars 24,546 ← Most common
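As a quick check on the imbalance, the ratio and per-class shares can be computed directly from the counts listed above. This is a small sketch that assumes those counts are the full set of labeled instances for the split being described.

```python
# Instance counts copied from the table in the post.
counts = {
    "breaks": 235,
    "cracks": 1915,
    "lightband": 1915,
    "rails": 2063,
    "scars": 24546,
}

total = sum(counts.values())
# Ratio between the most common and the rarest class.
imbalance_ratio = max(counts.values()) / min(counts.values())
# Fraction of all instances that belong to each class.
shares = {name: n / total for name, n in counts.items()}

print(f"imbalance ratio: {imbalance_ratio:.1f}:1")
for name, share in shares.items():
    print(f"{name:<10} {share:6.1%}")
```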

model: yolov11s.pt
data: data.yaml
epochs: 100
imgsz: 1280
batch: 16
device: 0

All other params: default
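For reference, the settings above correspond roughly to the following Python call. This is a sketch, not the author's actual script: the post does not show the command used, and the weights name `yolo11s.pt` follows Ultralytics' published naming rather than the `yolov11s.pt` spelling listed above.

```python
# Sketch only: mirrors the settings listed in the post; the actual command was not shown.
TRAIN_ARGS = {
    "data": "data.yaml",
    "epochs": 100,
    "imgsz": 1280,
    "batch": 16,
    "device": 0,
}
WEIGHTS = "yolo11s.pt"  # assumption: Ultralytics' naming for the YOLO11 small checkpoint


def train():
    """Launch training with the settings above; everything else stays at its default."""
    # Imported lazily so the settings can be inspected without the package installed.
    from ultralytics import YOLO

    model = YOLO(WEIGHTS)
    return model.train(**TRAIN_ARGS)
```

Calling `train()` starts the run; it is left uncalled here so the snippet can be read or imported without a GPU or dataset.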

Class Instances Precision Recall mAP50 mAP50-95
breaks 33 0.939 1.000 0.992 0.705
cracks 127 0.943 0.976 0.986 0.710
lightband 195 0.983 0.954 0.981 0.919
rails 200 0.961 0.975 0.984 0.947
scars 2,385 0.779 0.479 0.653 0.302

all 0.921 0.877 0.919 0.717
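A quick arithmetic check on the table above: the "all" row appears to be the unweighted mean of the five per-class values, so the 0.919 headline gives scars the same weight as breaks even though scars make up most of the validation instances. The instance-weighted figure below is not something the validation output reports; it is computed here only to illustrate the gap.

```python
# (validation instances, mAP50) per class, copied from the results table.
per_class = {
    "breaks": (33, 0.992),
    "cracks": (127, 0.986),
    "lightband": (195, 0.981),
    "rails": (200, 0.984),
    "scars": (2385, 0.653),
}

# Unweighted (macro) mean: every class counts equally.
macro_map50 = sum(m for _, m in per_class.values()) / len(per_class)

# Instance-weighted mean: every labeled object counts equally (illustrative only).
total_instances = sum(n for n, _ in per_class.values())
weighted_map50 = sum(n * m for n, m in per_class.values()) / total_instances

print(f"macro mAP50:    {macro_map50:.3f}")
print(f"weighted mAP50: {weighted_map50:.3f}")
```

The macro mean reproduces the reported "all" value, while the weighted mean is dominated by scars, which is why the headline number looks healthier than the typical detection in this dataset.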

The Pattern
Unexpected observation:

Rarest class (breaks): Nearly perfect (0.992 mAP50, 1.0 recall)
Most common class (scars): Worst performance (0.653 mAP50, 0.479 recall)

This contradicts typical imbalance behavior where minority classes struggle.
Visual Characteristics:

Breaks (performing well):
Sharp edges, clear discontinuities
Visually very distinctive

Scars (performing poorly):
High intra-class variance (size, shape, appearance)

Is this expected YOLOv11 behavior?
Does YOLOv11's architecture/loss function handle well-separated minority classes better than previous versions?
Should I apply balancing strategies?
Given that 4/5 classes are already near-perfect, would typical balancing (oversample/undersample) help or hurt?
Any YOLOv11-specific recommendations?
Are there config tweaks that might specifically help the scars class without affecting others?
Adjust cls loss weight?
Use focal loss (if available)?
Different augmentation strategy?
Have others seen this pattern?
Is this a known behavior with highly imbalanced but separable data?

Context
This is for research/thesis work. The overall mAP50 of 91.9% is quite good, but I want to understand:

Whether this is fundamentally sound
If I should focus on improving scars specifically vs. global rebalancing
Whether this tells us something about YOLOv11's behavior with imbalanced data

Additional Info

Using imgsz=1280 (not 640) due to small/subtle defects
The test set is held out; these are validation results

Any insights appreciated! Especially from folks who've trained YOLOv11 on similarly imbalanced datasets.
Thanks!

UltralyticsAssistant · 2025-12-17T09:06:00Z

Maintainer

👋 Hello @bee1409, thank you for your interest in Ultralytics 🚀! We recommend a visit to the Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

Thanks for sharing the detailed class stats and metrics 🙌—this kind of “minority class wins / majority class struggles” pattern can be tricky to interpret, and to help the team give you the most accurate guidance we’ll need a bit more context about your setup and data split 🔎

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Please also include (as text) the exact command you used (CLI or Python), your data.yaml (with paths anonymized if needed), and the relevant parts of results.csv/training logs for the run you reported. If possible, add a few representative validation images showing “scars” false negatives/false positives and “breaks” true positives—this often clarifies whether it’s a labeling/definition/variance issue vs. a training configuration issue 📌

Join the Ultralytics community where it suits you best. For real-time chat, head to Discord 🎧. Prefer in-depth discussions? Check out Discourse. Or dive into threads on our Subreddit to share knowledge with the community.

Upgrade

Upgrade to the latest ultralytics package including all requirements in a Python>=3.8 environment with PyTorch>=1.8 to verify your issue is not already resolved in the latest version:

pip install -U ultralytics

Environments

YOLO may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Notebooks with free GPU:
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLO Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

This is an automated response 🤖—an Ultralytics engineer will also assist soon.

4 replies

glenn-jocher Dec 17, 2025
Maintainer

This pattern can be totally normal: per-class mAP is driven more by class separability, label consistency, and how “tight” the class definition is than by raw instance count, and with only 33 breaks instances in val your AP estimate for that class can also be high-variance/optimistic compared to scars. I’d start by opening the run’s confusion_matrix*.png, PR_curve.png, and the val_batch*_pred.jpg images to see whether scars errors are mainly false negatives (often caused by high intra-class variance or incomplete/ambiguous labeling) vs. false positives (often caused by background lookalikes); the visuals/curves are described in the YOLO performance metrics guide. If scars is genuinely heterogeneous, you’ll usually get more lift from tightening/splitting the label definition (or adding more representative “scar” edge-cases) than from global over/under-sampling, which can easily hurt your already-strong classes. YOLO11 doesn’t have a special “minority class booster” here—so I’d treat this as a data/definition problem first, not a loss-weighting problem, and only rebalance after you’ve confirmed the scars mistakes aren’t label-noise/definition-related.
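To put a rough number on the "high-variance" point, one option is a Wilson score interval on recall. This is only a proxy, since AP is not a simple proportion, and the hit counts are reconstructed from the reported recall and instance counts rather than taken from the logs.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# breaks: recall 1.000 on 33 validation instances.
breaks_lo, breaks_hi = wilson_interval(33, 33)
# scars: recall 0.479 on 2,385 instances; hit count reconstructed by rounding.
scars_lo, scars_hi = wilson_interval(round(0.479 * 2385), 2385)

print(f"breaks recall interval: {breaks_lo:.3f} to {breaks_hi:.3f}")
print(f"scars recall interval:  {scars_lo:.3f} to {scars_hi:.3f}")
```

The breaks interval is several times wider than the scars one, so the near-perfect breaks score is less certain than it looks, while the low scars recall is well established.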

bee1409 Dec 17, 2025
Author

Thanks Glenn, really appreciate the detailed guidance.

I followed your suggestion and inspected the PR curve, raw/normalized confusion matrices, and validation batch predictions. For scars, the dominant failure mode is clearly false negatives to background, not confusion with other defect classes. Cross-class confusion is minimal, while ~35–40% of true scars are missed entirely, which aligns with the class’s high intra-class variability and low visual contrast.
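The split between "missed to background" and "confused with another class" can be read off one column of the raw confusion matrix. The sketch below assumes the matrix is indexed as `matrix[predicted][true]` with the last row meaning "predicted background", which is how the Ultralytics plot appears to be laid out. The numbers are made up for illustration and are not taken from this run.

```python
from typing import List, Tuple


def miss_breakdown(matrix: List[List[int]], class_idx: int) -> Tuple[float, float]:
    """Return (share missed to background, share confused with other classes).

    Assumes matrix[predicted][true], with the last row meaning 'predicted background'.
    """
    n_classes = len(matrix) - 1
    column = [row[class_idx] for row in matrix]  # all outcomes for this true class
    total = sum(column)
    if total == 0:
        raise ValueError("no ground-truth instances for this class")
    to_background = column[n_classes]
    to_other = sum(v for i, v in enumerate(column[:n_classes]) if i != class_idx)
    return to_background / total, to_other / total


# Hypothetical 3-class example (A, B, scars) plus background; values are illustrative only.
toy_matrix = [
    [30, 1, 2, 3],     # predicted A
    [0, 40, 3, 1],     # predicted B
    [1, 0, 120, 10],   # predicted scars
    [2, 4, 75, 0],     # predicted background (missed objects)
]
missed, confused = miss_breakdown(toy_matrix, class_idx=2)
print(f"scars missed to background: {missed:.1%}, confused with other classes: {confused:.1%}")
```

A large first number with a small second one matches the failure mode described above: a detection/visibility problem rather than a class-confusion problem.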

Based on this, I agree this is primarily a data/definition and visibility issue, rather than something loss weighting would fix. I’m planning to evaluate image tiling (e.g., 512–640 px tiles with overlap) since scars are thin and subtle and may benefit from increased effective resolution and reduced background context.

If you have any rules-of-thumb on when tiling tends to help vs. hurt for YOLO-style detectors, I’d be very interested to hear your perspective.

glenn-jocher Dec 17, 2025
Maintainer

Tiling usually helps YOLO-style detectors when the target signal is getting “washed out” by resizing (i.e., the defect becomes only a few pixels wide at your chosen imgsz) and the scene has lots of background that competes for attention. It can hurt when the object’s identity depends on wider context, or when you frequently split instances across tile borders and introduce inconsistent/partial labels. So I’d use overlap (so the same scar appears fully in at least one tile), enforce a consistent label-inclusion rule (e.g., keep labels whose center falls inside the tile), and then compare scar recall on the same held-out val set before/after tiling to verify it’s a net win, as described in the docs’ tiling note under model evaluation and fine-tuning insights.
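A minimal sketch of the overlap and center-based inclusion rule described above, in plain Python. The function names, the pixel `(x1, y1, x2, y2)` box format, and the 640/128 tile and overlap sizes are assumptions for illustration, not part of any Ultralytics API.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image pixels


def tile_starts(length: int, tile: int, overlap: int) -> List[int]:
    """Start offsets along one axis so that overlapping tiles cover the full length."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than the tile size")
    if tile >= length:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # final tile sits flush with the image edge
    return starts


def boxes_for_tile(boxes: List[Box], x0: int, y0: int, tile: int) -> List[Box]:
    """Keep boxes whose center lies inside the tile; return them in tile coordinates."""
    kept = []
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        if x0 <= cx < x0 + tile and y0 <= cy < y0 + tile:
            # Shift into the tile's frame and clip to the tile borders.
            kept.append((max(x1 - x0, 0), max(y1 - y0, 0),
                         min(x2 - x0, tile), min(y2 - y0, tile)))
    return kept


# Example: one thin box straddling x=640 in a 1280x1280 image.
starts = tile_starts(1280, tile=640, overlap=128)
box = [(600, 100, 660, 120)]
for y0 in starts:
    for x0 in starts:
        labels = boxes_for_tile(box, x0, y0, tile=640)
        if labels:
            print((x0, y0), labels)
```

In this example the box is clipped in the tile starting at x=0 but appears whole in the tile starting at x=512, which is the effect the overlap is meant to guarantee. Whether to keep or drop the clipped copy is a separate choice that should be applied consistently.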

bee1409 Dec 18, 2025
Author

Thanks again for the clarification, this has been very helpful.

I’m trying to anchor it to prior work. My understanding is that this pattern (majority class underperforming due to high intra-class variability and low separability, while rare but visually distinctive classes perform well) is fairly common in defect detection and other real-world detection tasks.

Are there any papers or survey-style references you’d recommend that discuss this phenomenon (even broadly, e.g., in defect detection or imbalanced detection contexts)? I’m not looking for an exact match, just something that frames this behavior in the literature.
