AI Overgeneralizes Science
Or: what counts as a mistake depends on what you think summaries are for
This post is part of the series Exploring AI and Its Intersection with Human Decision Making.
Large language models (LLMs) hold real promise as tools for working with scientific research, particularly when it comes to summarizing dense material quickly. But that doesn’t mean they do the job flawlessly.
I recently read a paper that takes a close look at one potential failure of LLMs: their tendency to overgeneralize scientific findings when producing summaries. Peters and Chin-Yee (2025) analyzed nearly 5,000 AI-generated summaries and concluded that many widely used models broaden claims beyond what appears in the original text, even when explicitly prompted to be accurate.
Newer models, they argue, appear especially prone to this problem, and AI summaries were substantially more likely to contain broad generalizations than a set of human-authored medical summaries used for comparison.
It’s a careful and interesting study. But there are also good reasons to take its conclusions with a grain of salt. That’s what I want to unpack here.
What the Authors Mean by “Overgeneralization”
When Peters and Chin-Yee talk about overgeneralization, they’re not using the term loosely. They give it a very specific, operational definition—which is good, but it also means it’s worth being clear about exactly what that definition covers before arguing with anything else in the paper.
In their framework, a summary overgeneralizes when it broadens the scope of a scientific claim relative to the original text. That broadening can take a few different forms (I sketch a toy version of this coding logic right after the list):
Quantified claims get rewritten as generic ones. For example, a finding reported as “23% of participants reported X” is rewritten as “people report X,” dropping both the quantification and the boundary of the original.
Past-tense descriptions of what happened in a particular study slide into present-tense statements that read as generally true. For example, a conclusion described as “participants reported lower anxiety after the intervention” becomes “the intervention reduces anxiety.”
Descriptive findings turn into action-guiding claims. For example, a finding described as “the results suggest relevance for clinical practice” becomes “clinicians should use this intervention.”
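To make the surface-level nature of this definition concrete, here is a toy sketch of what the coding logic might look like if you mechanized it. This is my illustration, not the authors’ actual annotation procedure, and the regexes and example sentences are placeholders. The point it illustrates is that broadening is detected from wording alone, independent of whether the broader claim is true.

```python
import re

# Toy illustration of the paper's coding logic, NOT the authors' actual
# procedure. Broadening is flagged from surface wording alone; whether
# the broader claim is true never enters into it.

QUANTIFIED = re.compile(r"\d+(\.\d+)?\s*%")                      # e.g. "23%"
GENERIC_PRESENT = re.compile(r"\b(reduces|improves|increases|prevents)\b")
ACTION_GUIDING = re.compile(r"\b(should|must|recommend(ed)?)\b")

def overgeneralization_flags(original: str, summary: str) -> list[str]:
    """Return the scope-broadening patterns a summary exhibits
    relative to the original text."""
    flags = []
    # 1. Quantified claim rewritten as a generic one
    if QUANTIFIED.search(original) and not QUANTIFIED.search(summary):
        flags.append("dropped quantification")
    # 2. Study-specific past tense rewritten as a timeless present claim
    if GENERIC_PRESENT.search(summary) and not GENERIC_PRESENT.search(original):
        flags.append("generic present-tense claim")
    # 3. Descriptive finding rewritten as an action-guiding claim
    if ACTION_GUIDING.search(summary) and not ACTION_GUIDING.search(original):
        flags.append("action-guiding language")
    return flags

original = "23% of participants reported lower anxiety after the intervention."
summary = "The intervention reduces anxiety, and clinicians should use it."
print(overgeneralization_flags(original, summary))
# -> ['dropped quantification', 'generic present-tense claim',
#     'action-guiding language']
```

Even this crude version trips all three flags on the example summary, despite that summary being a perfectly plausible gloss of the original. That asymmetry is the thing to keep in mind for the rest of this post.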
The authors are explicit that they’re not judging whether the broader claim is true or false—only whether it goes beyond what the original text explicitly said.
That operationalization is an important detail. Overgeneralization here isn’t about bad statistics, inflated effect sizes, or incorrect conclusions. It’s about scope drift at the level of language. A summary can be counted as an overgeneralization even if it faithfully reflects what many readers would reasonably infer from the study, as long as the inference isn’t spelled out in the original language of the abstract. That’s right, the abstract—not the Discussion, not the broader literature—is treated as the normative reference point.
This makes the analysis clean and tractable, but it also narrows what the results can mean. The paper isn’t showing that LLMs exaggerate evidence or mislead readers about magnitude. It’s showing that they often smooth, compress, or relax the linguistic constraints that scientific abstracts impose. Whether that counts as a serious epistemic failure, a genre mismatch, or something closer to normal scientific sense-making depends almost entirely on what you think abstracts are supposed to do—and what you think summaries are for.
Why the Comparison Point Matters
Once you’re clear on how the authors define overgeneralization, the next question is obvious: overgeneralization compared to what?
The study relies on two primary comparison points. In the first, LLM-generated summaries are compared directly to the abstracts they summarize. Those abstracts, drawn largely from elite medical journals, function as the normative baseline: any linguistic alteration that broadens the scope of a claim beyond the wording of the abstract—whether by dropping quantification, shifting tense, or introducing action-guiding language—counts as an overgeneralization.
The second comparison extends beyond abstracts. For a subset of analyses involving full-length medical articles, LLM-generated summaries are compared to human-authored summaries from NEJM¹ Journal Watch². Here, the baseline is no longer the original scientific text but another form of summary written by domain experts.
Both comparisons are reasonable within the logic of the study. They make the analysis feasible and internally consistent. But they also matter in ways that aren’t fully acknowledged in the paper’s conclusions.
In the abstract-based analyses, overgeneralization is defined relative to one of the most compressed and stylistically constrained parts of a scientific paper—all the more so given that most of these abstracts come from elite medical journals. Abstracts are designed to be brief, cautious, and defensible. Treating the abstract as the endpoint for appropriate scope means that any change beyond its original phrasing is, by definition, scored as error.
The comparison to NEJM Journal Watch introduces a different kind of baseline altogether. Instead of asking whether LLMs stay within the bounds of an abstract, the question becomes whether LLM summaries align with a particular set of expert-written summaries intended for clinicians. That shifts the normative reference point again, from scientific writing to a specific summary genre.
Taken together, these comparison choices anchor overgeneralization to two deliberately narrow reference points: abstract text on the one hand, and clinician-facing expert summaries on the other. That narrow anchoring means the findings support a much more specific claim than the paper’s broader framing sometimes suggests.
What the study shows is that many LLMs produce summaries that move beyond the linguistic constraints of abstracts and, in some cases, diverge from expert-written medical summaries intended for clinicians. But these departures are treated as error, without a principled way to distinguish unwarranted exaggeration from plausible inference.
These comparison choices make it harder to treat overgeneralization as a property of the models alone. At least some of what’s being detected here looks like a mismatch between genres: summaries behaving like summaries, evaluated against baselines that prioritize textual restraint. Whether that mismatch constitutes a meaningful epistemic problem depends less on the models themselves than on which comparison point one thinks ought to do the normative work.
When Generalization Is the Point, Not the Problem
The reason I’ve been so explicit about which comparison point we’re talking about is that what counts as a reasonable inference depends heavily on context. How far a claim can be extended, how cautiously it should be phrased, and what kinds of generalization are acceptable all vary with the purpose of the text and the audience it’s written for.
You can see this most clearly in the Discussion sections of research articles. Discussions are explicitly designed to move beyond what is stated in the abstract. Authors are expected to interpret their findings, connect them to prior work, speculate about mechanisms, and explain why results that are often small or context-bound might still matter. Those inferences routinely involve forms of generalization that go well beyond anything captured in an abstract.
In that light, it’s hard to avoid an uncomfortable comparison. Many of the linguistic alterations that are coded as overgeneralization in this study are far more restrained than what routinely appears in Discussion sections—especially in fields like the social sciences, where modest effects are often used to support broad theoretical narratives.
That doesn’t mean those Discussion-section generalizations are always justified. But it does mean they’re normal, expected, and in many cases necessary for scientific work to progress.
What matters here is that summaries are often intended to do a kind of work the abstract is explicitly not designed to do. Abstracts are intentionally constrained. They are optimized for compression, not for sense-making. Discussions are where authors take on the harder task of explaining what their results might mean, how they fit into a larger body of work, and why anyone should care. That inevitably involves generalization, and often more of it than the data, taken in isolation, can strictly support.
Summaries sit somewhere between those two genres. They’re not abstracts, but they’re not full Discussions either. Readers generally expect them to do some interpretive work—enough to make the findings intelligible and portable, without reproducing every caveat or boundary condition. From that perspective, some degree of linguistic extension isn’t just unsurprising; it’s arguably part of the job.
So are LLM-produced summaries biased? Sure—if the benchmark is the most conservative way scientists communicate what they did and found. When summaries are evaluated against abstracts or narrowly scoped, clinician-facing expert write-ups, any attempt to integrate, smooth, or extend the original claims will look like a deviation. But that says as much about the benchmark as it does about the summary. Summaries are typically read as sense-making tools, not as compressed replicas of primary texts, and readers often expect them to carry some of the interpretive burden that abstracts intentionally avoid. From that perspective, what’s being labeled as overgeneralization may simply reflect a different communicative goal rather than a systematic distortion of the evidence.
Without a clear account of what counts as a reasonable extension beyond an abstract, it’s hard to know how to interpret the results beyond the narrow comparisons being made. The study is very good at detecting departures from textual restraint. It’s much less informative about whether those departures reflect genuine exaggeration, routine scientific inference, or something in between.
What the Study Shows—and What It Doesn’t
Read narrowly, the study makes a real and defensible contribution. Across a large set of prompts and models, the authors show that LLM-generated summaries often relax the linguistic constraints imposed by scientific abstracts. Quantification drops out. Tense shifts. Descriptive statements turn into more general claims. Those patterns appear more often in some models than others, and they don’t reliably disappear when models are explicitly prompted for accuracy. As a demonstration of how easily summaries can drift from abstract-level phrasing, the evidence is solid³.
What the study doesn’t establish is which of those departures should count as a problem. Its design makes overgeneralization easy to detect, but it leaves unanswered a more basic question: which extensions are unreasonable, and which are simply part of normal scientific inference?
That ambiguity isn’t just theoretical either. It shows up even in how the authors interpret their own findings. After carefully operationalizing overgeneralization, the Discussion necessarily moves beyond that boundary when addressing what the results imply for model development, science communication, and downstream risk. Those extensions are cautious and qualified, but they still go beyond the narrow linguistic comparisons that anchor the analysis. That isn’t so much a flaw in the paper as a reminder of how difficult it is to talk about bias, inference, and risk without engaging in some of the very generalizing work the study sets out to measure.
Given all that, the study is better read as an analysis of textual fidelity rather than as a diagnosis of epistemic distortion. It tells us something important about how LLMs handle constrained scientific language. It tells us much less about whether those models are doing something categorically different—or worse—than scientists routinely do when they move from results to meaning.
That distinction is relevant because concerns about AI “overgeneralizing” scientific findings only become meaningful once we’re clear about what we expect summaries to do in the first place. This study offers a careful account of how summaries depart from abstracts. It does not, and arguably cannot on its own, tell us where reasonable inference ends and exaggeration begins.
1. New England Journal of Medicine.
2. This is now called NEJM Clinician.
3. Out of curiosity, I also tested the three prompts used in the paper on the authors’ own article. I ran separate instances using ChatGPT (5.2 Instant) and Claude (Sonnet 4.5). In this small, anecdotal test, the same kinds of generalizations flagged in the study did not reliably appear. That obviously doesn’t overturn the paper’s results, and it shouldn’t be read as anything more than an anecdotal sensitivity check. Still, it suggests that at least some of the effects documented here may depend on model version, prompting context, or timing—and may not be quite the same issue across current systems.
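For what it’s worth, here is roughly how that kind of spot check can be scripted. This is a sketch under stated assumptions: the model names and prompt text are placeholders (the paper’s actual three prompts aren’t reproduced here), and the calls use the standard OpenAI and Anthropic Python SDKs.

```python
# Sketch of a cross-model spot check. PAPER_TEXT and PROMPTS are
# placeholders -- substitute the article text and the paper's three prompts.
from openai import OpenAI
import anthropic

PAPER_TEXT = "..."  # text of the article to summarize (placeholder)
PROMPTS = ["..."]   # the paper's three summary prompts (not reproduced here)

openai_client = OpenAI()               # reads OPENAI_API_KEY from the environment
claude_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

for prompt in PROMPTS:
    full_prompt = f"{prompt}\n\n{PAPER_TEXT}"

    gpt = openai_client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": full_prompt}],
    )
    claude = claude_client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=1024,
        messages=[{"role": "user", "content": full_prompt}],
    )

    # Inspect each summary by hand for dropped quantifiers, generic
    # present-tense claims, and action-guiding language.
    print(gpt.choices[0].message.content)
    print(claude.content[0].text)
```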