When the Noise Is Not Where the Problem Is: Diagnostic Bias in the Age of AI

A roadside view of a small hill with green rushes under an expansive sky.
Looking beneath the surface.

Case of the Week  ·  Min Wu, PhD  ·  ai-public-health.com

For weeks, a sound came and went in our kitchen. My wife heard it first, and after a while she asked me to fix it. I treated it as detective work.

The most obvious suspect was the refrigerator. So I cut the power to the kitchen. The noise continued. I checked every appliance on that floor. Nothing. The kitchen had nothing left in it that could be making the sound.

So I went upstairs. Then I went down to the basement.

Underneath the kitchen, where a ventilation pipe runs through the basement ceiling, the noise was clearer. It was stronger in one direction. I followed it to the basement bathroom. Every few minutes, the toilet flushed by itself. The valve was old and leaking, and the refill cycle was sending water sounds up through the pipe, into the kitchen, where they had no obvious source.

I replaced the valve. The kitchen went quiet. Case closed.

I tell that story because the symptom and the source were in different rooms. The noise appeared in the kitchen. The cause lived in the basement. If I had kept investigating the refrigerator, I could have replaced every appliance in the kitchen and never solved anything.

There is a particular discipline in that kind of detective work. It is not finding the answer. It is refusing to keep looking in the room where the answer is not. That refusal is the hard part — because the room where the symptom appears is the room where attention is most naturally drawn.

I have been thinking about this discipline in the context of diagnostic bias in AI. The shape of the problem is the same. The symptom shows up in the model output. The source lives somewhere else.

AI did not invent diagnostic bias. It inherited it.

There is a tempting story about AI bias that goes like this: a technology company built a flawed model, the model produced biased outputs, and we need better engineers to fix it. That story is not wrong. But it is shallow. It locates the cause too close to the symptom. It investigates the kitchen.

The deeper story is about what the model was trained on, and what the training data was built from.

Twentieth-century medical research operated for decades on an implicit default: that a particular kind of patient (predominantly male, predominantly of European ancestry, predominantly from high-income countries) could serve as a universal reference for human physiology. This was not a conspiracy. It was the result of regulatory history, institutional convenience, and the simple fact that stratified analysis was expensive before compute became cheap. After thalidomide, women of childbearing potential were excluded from early-phase clinical trials, largely for liability reasons. Imaging libraries were built where the equipment and the funding were. Trial enrollment followed the geography of the institutions running the trials.

That default built the data pipelines. Clinical trial datasets, hospital records, dermatology image libraries, cardiovascular case files — all of them carry the demographic shape of who was studied, and the demographic absence of who was not. AI systems were then trained on those pipelines. The models did exactly what we asked them to: they learned the patterns in the data.

The bias was already in the basement. AI just amplified it through the kitchen pipe.

The bias was not introduced by AI. AI inherited it, then automated it, scaled it, and made it invisible.

This is the configuration I described in New Bottle, Old Wine: new technology running old logic, scaling unexamined assumptions faster than anyone can audit them. Diagnostic bias is that pattern made concrete. The data pipelines are the old wine; the model is the new bottle. Naming the source is the first half of the work. The harder half, whether the inherited logic can actually be unlearned once it is in the system, is where I want to go next, in a companion post on selective forgetting (coming next week).

This pattern has now been measured in peer-reviewed work across multiple clinical domains, with consistent findings: where the training data underrepresents a population, the model underperforms for that population. The detail of those measurements matters for clinicians and for regulators. For the diagnostic posture I am describing here, what matters is that the failure mode is the same in each case. The model is doing its job correctly. The job is the problem.
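For readers who want to see what "underperforms for that population" looks like in practice, here is a minimal sketch of a stratified check, written in Python. Everything in it is an assumption made for illustration: the group labels "A" and "B", the function names, and the toy records are invented, and none of it comes from a real clinical dataset or from any published study. It only shows how a single headline number can hide a gap that appears once the results are split by group.

```python
from collections import defaultdict


def overall_sensitivity(records):
    """Share of true cases the model flagged, ignoring group membership."""
    flags = [flagged for _, has_condition, flagged in records if has_condition]
    return sum(flags) / len(flags)


def sensitivity_by_group(records):
    """Same metric, computed separately for each group.

    Each record is a tuple: (group, has_condition, model_flagged).
    """
    true_positives = defaultdict(int)
    positives = defaultdict(int)
    for group, has_condition, model_flagged in records:
        if has_condition:
            positives[group] += 1
            if model_flagged:
                true_positives[group] += 1
    return {g: true_positives[g] / positives[g] for g in positives}


# Toy records, invented for illustration only; NOT real clinical data.
# Hypothetical group "A" is well represented; hypothetical group "B" is not.
toy_records = (
    [("A", True, True)] * 18 + [("A", True, False)] * 2
    + [("B", True, True)] * 2 + [("B", True, False)] * 2
)

print("overall:", overall_sensitivity(toy_records))
print("by group:", sensitivity_by_group(toy_records))
```

In this toy set the pooled figure is dominated by the larger group, so the smaller group's miss rate only becomes visible in the per-group breakdown. That is the kitchen-and-basement problem in miniature: the aggregate is where attention is drawn, and the stratified view is where the source shows up.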

Why this is harder than the human kind

Human bias in medicine is not new. Practitioners have been wrestling with it, imperfectly, for as long as medicine has existed. What is new about AI bias is the velocity, the scale, and the form of its operation.

"The deep-dive portion of this weekly case is reserved for members. Subscribe for free to unlock the full analysis instantly."