Why your AI training methods might be mathematically guaranteeing fragile models.
For years, the machine learning community has treated adversarial vulnerability, texture bias, and spurious correlations as annoying engineering bugs. The prevailing belief? If we just throw more data at the problem, scale up our models, or use aggressive min-max adversarial training, we can patch these issues.
But the truth is, standard Empirical Risk Minimization (ERM), the bedrock of how we train AI, actually guarantees a geometric blind spot.
It isn’t a failure of the architecture; it’s a mathematical necessity of the objective itself. My team and I recently published a paper, which you can read in full on arXiv, proving that our standard training methods are the root cause of AI fragility.
Why Standard ERM Creates a Geometric Blind Spot
When you train a model via ERM, your goal is strictly to minimize the average loss on the training data. If your data includes a “nuisance feature” (a specific background in an image, say, or a particular sentence structure in a document) that happens to correlate with your target label, the model will latch onto it.
Mathematically, the model has no incentive to ignore these shortcuts. To achieve the lowest loss, it must encode those features.
This is where the geometric blind spot comes from. Because the encoder learns these spurious features, its internal representation is structurally forced to maintain a high sensitivity in those directions. If the model uses the background grass to identify a cow, the internal “mental model” of the AI must shift violently if the grass changes. The representation manifold simply cannot be smooth.
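To make that “sensitivity in those directions” concrete, here is a minimal PyTorch sketch of a finite-difference probe: it measures how far the encoder’s representation moves when the input is nudged along a chosen direction. The encoder and the nuisance_direction below are illustrative stand-ins, not anything from our paper or codebase.

```python
import torch

def directional_sensitivity(encoder, x, direction, eps=1e-3):
    """Finite-difference estimate of how far the representation z = encoder(x)
    moves when x is nudged by eps along a unit-length direction."""
    direction = direction / direction.norm()
    with torch.no_grad():
        z = encoder(x)
        z_shifted = encoder(x + eps * direction)
    return ((z_shifted - z).norm() / eps).item()

# Toy setup: a random encoder and a made-up "nuisance" direction standing in
# for something like a background-texture pattern.
encoder = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 16))
x = torch.randn(1, 32)
nuisance_direction = torch.randn(1, 32)
print(directional_sensitivity(encoder, x, nuisance_direction))
```

An ERM-trained, shortcut-dependent encoder will show a much larger score along the directions it relies on than along typical random directions.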
The “Squeezed Balloon” Illusion: Why PGD Fails
If we know the manifold is rough, why not just use adversarial training like Projected Gradient Descent (PGD)? It seems like the logical fix.
The reality is that PGD is mathematically flawed. Think of the model’s sensitivity like a balloon. PGD squeezes the balloon tightly in one specific direction to resist a known attack. But the sensitivity doesn’t vanish; it just rotates and piles up in other, orthogonal directions.
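For readers who haven’t implemented it, this is roughly what the standard PGD inner loop looks like; it is a generic PyTorch sketch, not our code. The point to notice is that the attack finds one worst-case perturbation inside an epsilon-ball, and adversarial training then minimizes the loss only at that point, which is the “squeeze” in the analogy.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Standard L-infinity PGD: step along the sign of the loss gradient,
    projecting back into the eps-ball around the clean input each time."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)  # project back into the eps-ball
    return x_adv.detach()

# Toy usage: adversarial training minimizes the loss on x_adv only, i.e. on the
# single worst-case direction the attack found.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
adv_loss = F.cross_entropy(model(pgd_attack(model, x, y)), y)
```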
We introduced the Trajectory Deviation Index (TDI) to track this. TDI measures how much a model’s internal representation distorts when hit with random, isotropic noise. Our research shows that while PGD reduces the “adversarial” loss, it actually results in a worse clean-input TDI than doing nothing at all. PGD doesn’t smooth the manifold; it makes it more anisotropic and fragile in every other direction.
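The precise TDI definition is in the paper; as a rough, assumed proxy you can think of it like the sketch below, which simply averages how far the representation drifts when the input is perturbed with isotropic Gaussian noise. The noise scale and sample count are arbitrary choices for illustration.

```python
import torch

def tdi_proxy(encoder, x, sigma=0.05, n_samples=32):
    """Assumed proxy for the Trajectory Deviation Index: mean displacement of the
    representation under isotropic Gaussian input noise, normalised by the norm
    of the clean representation. Not the paper's exact formula."""
    with torch.no_grad():
        z = encoder(x)
        deviations = []
        for _ in range(n_samples):
            z_noisy = encoder(x + sigma * torch.randn_like(x))
            deviations.append((z_noisy - z).norm(dim=-1) / z.norm(dim=-1).clamp_min(1e-8))
    return torch.stack(deviations).mean().item()
```

Running a proxy like this on the same clean batch for an ERM-trained and a PGD-trained encoder is a quick way to see the “squeezed balloon” effect for yourself.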
The Fix: Penalized Manifold Hardening (PMH)
We didn’t want to rely on heuristics, so we derived a new approach called Penalized Manifold Hardening (PMH).
Our derivation proved that simple Gaussian noise is the unique distribution that suppresses the encoder’s Jacobian uniformly. Unlike PGD, which squeezes the balloon, PMH shrinks it uniformly. By penalizing the displacement of the representation under Gaussian noise during training, we anchor the model’s geometry.
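In code, the idea looks roughly like the sketch below: a simplified PMH-style training loss that adds a displacement penalty under Gaussian input noise to the ordinary task loss. The hyperparameters sigma and lam and the function names are illustrative, not the actual recipe from the paper.

```python
import torch
import torch.nn.functional as F

def pmh_style_loss(encoder, head, x, y, sigma=0.05, lam=1.0):
    """Simplified PMH-style objective: the usual task loss plus a penalty on how
    far the representation moves when the input is perturbed with isotropic
    Gaussian noise. sigma and lam are illustrative hyperparameters."""
    z_clean = encoder(x)
    task_loss = F.cross_entropy(head(z_clean), y)

    z_noisy = encoder(x + sigma * torch.randn_like(x))
    displacement_penalty = (z_noisy - z_clean).pow(2).sum(dim=-1).mean()

    return task_loss + lam * displacement_penalty
```

Minimizing something like this during training, rather than the task loss alone, is what pulls the manifold toward uniform smoothness instead of squeezing it along a single attack direction.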
You can find the open-source codebase for PMH on GitHub if you want to test the TDI of your own models.
The Scaling Paradox and Fine-Tuning Trap
Perhaps the most alarming finding in our research is that these blind spots scale with capacity. Larger models have more “room” to encode every single spurious correlation, making them mathematically more fragile than their smaller counterparts.
Even worse, standard ERM fine-tuning actively breaks the geometry of pre-trained backbones. When you fine-tune, you inject new task labels with new spurious correlations, tearing up the smooth geometry established during pre-training.
Key Takeaways
- Fragility isn’t a patchable bug: It is an inherent mathematical outcome of how we use ERM.
- Avoid the PGD trap: Adversarial training often hides fragility by pushing it into unmeasured directions.
- Prioritize Manifold Smoothness: Methods like PMH prove that uniform shrinkage is the key to creating stable, less fragile representations.
- Re-think Alignment: RLHF and other alignment techniques rely on labels that likely inject new, hidden geometric blind spots into our best LLMs.
We need to stop playing “whack-a-mole” with adversarial attacks and start fixing the underlying geometry of our models. If we continue to rely on training objectives that force models to prioritize shortcuts, we will always be stuck with fragile, biased AI. The next thing you should do is audit your own model’s TDI—you might be surprised by how much “blind” sensitivity is hiding in your current architecture.