The bitter-ish lesson for clinical AI

General models may win the capability race. Clinical infrastructure still decides whether they help.

A short note on why a controversial clinical AI benchmark modestly supports the bitter lesson, and why that should focus attention on infrastructure rather than model loyalty.
Author: Mark Zobeck

Published: June 17, 2026

Introduction

[Image: Clinician and AI system fist bumping as collaborators.]

A new Nature Medicine paper compared two specialized clinical AI tools, OpenEvidence and UpToDate Expert AI, with several frontier general-purpose models (Vishwanath et al. 2026). The headline result was hard to miss: the frontier models performed better across medical benchmarks and blinded clinician ratings of real clinical queries.

[Figure: AI benchmark performance by model. A) MedQA, B) HealthBench, C) Real clinical queries.]

The reaction has been entertaining to watch. Eric Topol’s X post helped ignite the debate. OpenEvidence pushed back. Others pointed out real limitations in both directions: benchmark design is hard, proprietary systems are hard to evaluate, and clinical usefulness is not the same thing as a leaderboard score.

Now, I do not think the paper settles the future of clinical AI. Scoring the quality of model output is hard, and this paper evaluates an extremely narrow slice of the many ways people might use a clinical AI system. So it supports no definitive conclusions.

However, I do think the findings support a useful framing for those who are building clinical AI systems: the bitter lesson is true-ish for medicine.

The bitter lesson is true-ish for medicine

True

Richard Sutton’s bitter lesson is that general methods powered by computation tend to beat expert-crafted systems as scale increases (Sutton 2019). In clinical AI, that seems increasingly plausible. Frontier models did not become good at medical questions because they were engineered around a particular specialty workflow. They became good because broad training, scale, tool use, and general reasoning improved.

That is the “true” part.

Ish

The “ish” is more important.

The value of clinical AI is only realized when the model interacts with the real, messy world of healthcare. This interaction requires well-designed infrastructure to ensure it is safe, effective, and beneficial.

A powerful general model is just one component of this system. Medicine needs expert-crafted machinery around the models: governance, audit trails, workflow integration, retrieval, citation standards, monitoring, escalation rules, input controls, output controls, and a sharply limited action surface. The bitter lesson says general models may keep getting better. It does not say we can skip the work required to make them safe, useful, and accountable.

That should be encouraging for people building clinical AI infrastructure. The durable asset is not the model itself. The model will change. It may be OpenEvidence in one setting, Gemini in another, Claude in another, and an institutionally hosted model somewhere else. The valuable system is the harness: the clinical workflow that defines what the model can see, what it can do, how its claims are checked, when a human must intervene, and who is responsible for the result.

So the practical lesson is not “frontier models win, specialized tools lose.” That is too simple. The practical lesson is to build for model churn. Assume increasingly capable general models will keep arriving. Then build the clinical infrastructure that can use those models safely as references, swap them when better evidence appears, and keep the responsibility where it belongs.
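
To make "build for model churn" a little more concrete, here is a minimal sketch of a harness that depends only on a narrow model interface. Everything in it is an assumption for illustration: `ClinicalModel`, `Harness`, `Answer`, `StubModel`, the placeholder citation string, and the rule that uncited answers go to human review are hypothetical, not any vendor's API or a recommended policy.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Answer:
    text: str
    citations: list[str]


class ClinicalModel(Protocol):
    """Hypothetical interface: any backend that can answer a query."""

    name: str

    def answer(self, query: str) -> Answer: ...


@dataclass
class Harness:
    """Hypothetical harness: owns the checks and the audit trail, not the model."""

    model: ClinicalModel
    audit_log: list[dict] = field(default_factory=list)

    def ask(self, query: str) -> dict:
        answer = self.model.answer(query)
        # Output control (assumed rule): answers without citations are
        # flagged for human review instead of being trusted as-is.
        needs_review = len(answer.citations) == 0
        record = {
            "model": self.model.name,
            "query": query,
            "answer": answer.text,
            "citations": answer.citations,
            "needs_human_review": needs_review,
        }
        # Audit trail: every call is recorded with the model that served it.
        self.audit_log.append(record)
        return record


@dataclass
class StubModel:
    """Stand-in backend for this example; a real one would call a model API."""

    name: str
    citations: list[str]

    def answer(self, query: str) -> Answer:
        return Answer(
            text=f"[{self.name}] response to: {query}",
            citations=list(self.citations),
        )


# Same harness, two interchangeable backends.
harness = Harness(model=StubModel("model-a", ["placeholder-citation"]))
first = harness.ask("example question")

harness.model = StubModel("model-b", [])  # swap the model, keep the checks
second = harness.ask("example question")
```

The point of the design is that the review rule and the audit log live in the harness, so replacing the model changes neither of them.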

The bitter lesson may be true-ish. The boring infrastructure lesson is definitely true.

References

Sutton, Richard S. 2019. “The Bitter Lesson.” http://www.incompleteideas.net/IncIdeas/BitterLesson.html.
Vishwanath, Krithik, Anton Alyakin, Mrigayu Ghosh, et al. 2026. “General-Purpose Large Language Models Outperform Specialized Clinical AI Tools on Medical Benchmarks.” Nature Medicine, ahead of print, June 12. https://doi.org/10.1038/s41591-026-04431-5.