Every day, some tech outlet publishes a breathless review claiming the new shiny device (a phone, a processor, a graphics card) is “30% faster” or “crushes the competition”. Each of these reviews comes with a set of shiny comparison graphs. The video goes up on YouTube and the Reddit threads explode. And almost nobody asks the most important question: is this comparison meaningful? I would argue the same principles apply to shiny new programming frameworks. I don’t really care if your Rust-based library is 8% faster than my existing one if it comes with bugs, usability problems, an immature codebase, and no documentation.
What most tech reviews look at is statistical noise. Not because the writers are incompetent, but because they fundamentally don’t understand what makes a valid comparison. They’re comparing apples to oranges and calling it unbiased.
Checklist of solutions#
I thought I would give the solutions first and then talk about the problem afterwards.
- Take those benchmark charts with a very large grain of salt. Use them only to form a first impression.
- Go back to old-school forums.
- When you are looking at a benchmark, a control isn’t just something you vaguely think about. It’s a systematic elimination of variables that could skew your results. Every factor that isn’t what you’re specifically testing needs to be either held constant or accounted for statistically.
- Maybe ignore most of these tech outlets. Most tech channels now double as entertainment (clickbait and rage bait) channels with a dose of marketing and sales funnels.
- Build a checklist of what actually matters for your purchase.
- I use search engines combined with LLMs to retrieve information, structure it, and inform the buying decision. This will eventually stop working once LLM search engines are funded by ads, but it still works in 2025.
- A note about user reviews: 1-star and 5-star ratings are mostly meaningless. Try to find a balanced review instead.
- Half a decade ago, comments on video content often contained meaningful corrections and balanced opinions; today most are useless. From an NLP analysis of 5,000+ YouTube comments, I found that most fall into three types:
- They agree
- They disagree
- Spam
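To make the three buckets concrete, here is a toy keyword classifier. This is NOT the actual NLP pipeline used for the 5,000+ comment analysis (which I don't have access to); the marker lists and example comments are invented purely for illustration, and a real analysis would use embeddings or a trained model.

```python
# Toy illustration of the agree/disagree/spam split described above.
# The marker keywords and examples are hypothetical, not real data.

SPAM_MARKERS = {"subscribe", "giveaway", "check out my", "free crypto"}
DISAGREE_MARKERS = {"wrong", "disagree", "no way", "misleading"}

def classify(comment: str) -> str:
    text = comment.lower()
    if any(m in text for m in SPAM_MARKERS):
        return "spam"
    if any(m in text for m in DISAGREE_MARKERS):
        return "disagree"
    return "agree"  # default bucket: low-information agreement

comments = [
    "Great video, totally agree!",
    "This is just wrong, that phone throttles way harder.",
    "Free crypto giveaway, check out my channel!!!",
]
print([classify(c) for c in comments])  # → ['agree', 'disagree', 'spam']
```

Even this crude sketch shows how little signal survives: two of the three buckets carry no correction or argument at all.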
Why Tech Outlets Get It Wrong#
It’s not malicious. It’s structural.
- Too many things to review. The number of SKUs and the sheer volume of devices these companies pump out is staggering; they flood the market, and the reviewers along with it.
- Most manufacturers send reviewers golden samples, which often differ significantly from the units real customers receive.
- Not enough time to review each device properly, simply because of that volume.
- Most outlets get one review unit. Maybe two if they’re lucky. Population-level conclusions from n=1 samples are statistically meaningless.
- Misaligned incentives. Paying reviewers for favorable coverage appears to be common practice, and many outlets game the recommendation algorithm for engagement and views. They know from their analytics what is clickable, and they often go out of their way to promote bad products.
- Lack of statistical emphasis. Most tech journalists are journalists first, and neither they nor most of their audience care about controls or statistics.
An example of a bad test, and what controls mean#
Most tech channels contain gorgeous charts with precise looking numbers. Phone 1: 2,847 points. Phone 2: 2,756 points. Conclusion: Phone 1 wins. Except that comparison is probably meaningless.
Did they test both phones at the same ambient temperature? Same battery level? Same background processes running? Same thermal conditions after the same warm-up period? Same storage capacity and available space? Same network conditions for connectivity tests? They pulled both phones out of different boxes, ran some tests, and called it a day. That’s not benchmarking, that’s dice rolling.
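The dice-rolling claim is easy to demonstrate with a simulation. In this sketch the numbers are entirely made up: both “phones” have the same true score, and only a few percent of run-to-run noise (thermals, background tasks) separates them. A single head-to-head run still crowns a “winner” roughly half the time in either direction.

```python
# Hypothetical simulation: two phones with IDENTICAL true performance,
# compared by a single benchmark run each. The noise decides the winner.
import random

random.seed(42)

TRUE_SCORE = 2800   # same true performance for both phones (made up)
NOISE_SD = 80       # ~3% run-to-run variation (made up)

def single_run() -> float:
    """One benchmark run: true score plus random noise."""
    return random.gauss(TRUE_SCORE, NOISE_SD)

# Repeat the "one run each, compare" ritual 1000 times.
phone_a_wins = sum(single_run() > single_run() for _ in range(1000))
print(f"Phone A 'wins' {phone_a_wins} of 1000 single-run comparisons")
```

The win count hovers around 500, i.e. a coin flip, which is exactly what a review based on one run of each device is reporting.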
Proper control is also sometimes impossible because of the manufacturers themselves. Samsung TVs, for example, have been caught entering a special boost mode when they detect a benchmark (see “Samsung TV Benchmark Cheating” on Android Authority).
Testing CPU performance? Lock the clock speeds, disable boost algorithms, set identical thermal conditions, and run tests multiple times with statistical analysis of variance. Measure ambient temperature. Control for background processes. Use identical memory configurations. Often this is impossible to do with most of the hardware on phones these days.
Real benchmarking requires:#
- Multiple test runs: One data point isn’t data, it’s an anecdote. Run the same test 10, 20, 50 times. Calculate means, standard deviations, confidence intervals.
- Margin of error calculations: That 3% performance difference? Meaningless if your standard deviation is 8%.
- Statistical significance testing: Is the difference you measured actually meaningful, or could it have happened by random chance? There are actual mathematical tests for this. Use them.
- Sample size considerations: Testing one unit of each device tells you almost nothing about the broader population. Manufacturing variance alone can exceed the performance differences being measured.
Examples of good outlets#
AnandTech was always the gold standard for methodological rigor; now that it is gone, whatever is left in that space is not nearly as good. Some YouTube channels, like GamersNexus (see the GamersNexus YouTube channel), understand statistical significance and proper controls, implement standardized testing protocols, and are transparent about their methodology. I especially enjoy their testing deep dives.