Benchmark Scores Are Statistically Fragile // AIDRAN

The Messing paper's most consequential finding is not that evaluation scores are noisy — that has been suspected — but that the noise is exploitable in a specific direction . When variance goes unreported, the entity best positioned to probe the measurement surface is the model developer, not the evaluator. Labs running repeated evals against the same benchmark can identify which prompt phrasings, judge configurations, and temperature settings favor their model's outputs, then ship the version that scores highest on those conditions. The benchmark community has no mechanism to detect…

The Benchmark Scores Deciding Model Deployments Are Statistically Fragile

Free reading limit reached

Frontier AI in Your Pocket: The Community That Stopped Being Impressed

A 397-Billion-Parameter Model on an iPhone, and Nobody Blinked

Running Local AI Has Become an Act of Political Dissent

The Pentagon's AI Supply Chain Quietly Rewrote Its Terms

The Weapons Transfer Nobody Noticed and Nobody Protested

PyTorch Foundation Bets on Infrastructure While Creators Distrust Open Licenses

PyTorch Foundation Fortifies Open Stack While Licensing Doubts Linger

Project Maven Is Selecting Targets in Iran, and the Ethics Conversation Has Caught Up

A Four-Layer Architecture for Neural Deception Detection Meets Hard Scrutiny

Two CVEs, One Bluesky Bot, and the Decentralized Security Beat