The Judgment Problem Behind AI
Why the challenge isn’t using AI, but deciding what to do with what it gives you
Over at Psychology Today, I recently wrote about a study my colleagues and I ran on AI and decision making — and the full study is open access if you want to dig into it. The basic takeaway was simple enough: the problem isn’t using AI, it’s figuring out when its recommendations are actually worth following.
That piece is a discussion of our research, written as a concise post for an outlet that requires posts to be not much more than 1,000 words¹. But the issue becomes a lot more complicated once you start asking how people are deciding whether to follow an AI’s recommendations.
Because it turns out that this isn’t a single judgment people make and move on from. It’s a series of small decisions — whether to look at the recommendation, whether to take it seriously, how much weight to give it, and what to ignore. And those don’t always line up in a way that produces better outcomes.
Two Ways This Can Go Wrong
One of the things that stood out in the data is that people aren’t all making the same mistake when it comes to using AI.
It’s easy to treat this as a single problem — people either rely on it too much or not enough — but the pattern doesn’t really behave that way. It splits.
Some people lean into the AI. They look at the recommendations, take them seriously, and are more willing to incorporate them into their decisions. When those recommendations are good, that helps. When they’re not, it doesn’t just fail to help — it actively makes things worse.
Others go the opposite direction. They’re less inclined to use the AI at all. They stick with their initial judgment, even when the system is offering something that would improve their decision. That protects them from bad input, but it also means passing up useful information when it’s there.
At a glance, those might look like two sides of the same issue — just different points along a continuum of AI use. But they don’t quite cancel each other out. They create different kinds of errors.
And what pushes people in one direction or the other isn’t just the quality of the recommendations they’re seeing.
Part of it comes down to how trustworthy they consider the system.
Part of it comes down to how trustworthy they consider themselves.
Trustworthiness as a Decision Factor
Part of the decision to use an AI comes down to how trustworthy people think the system is.
When we talk about trustworthiness, we’re talking about beliefs about a system’s competence, reliability, and usefulness. Those beliefs don’t operate in isolation — they shape whether someone is willing to rely on the system at all.
In our study, that pattern showed up clearly. Participants who saw the AI as more trustworthy were more likely to use its recommendations. Those who viewed it less favorably were more likely to ignore it.
That part is straightforward.
The complication is that this tells us nothing about whether those judgments were well-placed. Trustworthiness, in this sense, is a belief — a more general orientation toward the quality of the system, not an evaluation of any specific recommendation. And in our study, that belief tended to create a bias toward consulting and adopting the AI’s recommendations, regardless of their quality.
Stronger beliefs about trustworthiness amplify that bias.
If that bias tracks the quality of the AI’s recommendations, it helps. If it doesn’t, it doesn’t just fail — it pushes decisions in the wrong direction.
The wrinkle is that participants didn’t really have a reliable way to judge AI quality in the first place. This wasn’t a task where they could easily verify whether a recommendation was good or bad. That gave the bias more room to operate.
Which means the impact of AI isn’t just about what it produces. It’s about how those outputs are filtered before they ever get used.
Sometimes that filter helps, and sometimes it doesn’t. And the difference isn’t coming from the AI. It’s coming from the judgment applied to it.
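To make the arithmetic behind that filter concrete, here is a minimal Python sketch. It is not the study's analysis: the function name, the accuracy values, and the adoption rates are all hypothetical, chosen only to illustrate the logic. It assumes a person adopts the AI's recommendation at a fixed rate set by their trust beliefs, regardless of whether any given recommendation is good.

```python
def expected_accuracy(adopt_rate: float, ai_accuracy: float, own_accuracy: float) -> float:
    """Expected share of correct decisions when AI advice is adopted at a
    fixed rate that does not depend on the quality of each recommendation."""
    return adopt_rate * ai_accuracy + (1 - adopt_rate) * own_accuracy


# All values below are illustrative assumptions, not results from the study.
OWN_ACCURACY = 0.5                  # how often a person's own answer is right
HIGH_TRUST, LOW_TRUST = 0.8, 0.2    # adoption rates driven by trust beliefs
GOOD_AI, BAD_AI = 0.8, 0.2          # how often the AI's advice is right

for ai_label, ai_accuracy in [("good AI", GOOD_AI), ("bad AI", BAD_AI)]:
    for trust_label, adopt_rate in [("high trust", HIGH_TRUST), ("low trust", LOW_TRUST)]:
        score = expected_accuracy(adopt_rate, ai_accuracy, OWN_ACCURACY)
        print(f"{ai_label}, {trust_label}: {score:.2f}")
```

With these made-up numbers, high trust beats low trust when the advice is good (0.74 versus 0.56) and loses to it when the advice is bad (0.26 versus 0.44). The same bias produces both outcomes; only the quality of the advice it happens to meet differs.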
When (Over-)Confidence Closes the Door
But beliefs about the system’s trustworthiness aren’t the only thing shaping these decisions. Another pattern of results had nothing to do with the AI itself.
Participants who reported higher confidence in their own judgment, measured as perceived expertise, were less likely to use the AI’s recommendations.
That would make sense — if they actually knew what they were doing.
But they didn’t.
This wasn’t a task where anyone had meaningful expertise. It was a lunar survival task, after all². The same thing that made it hard to tell whether recommendations were high or low quality also meant there wasn’t much basis for trusting your own answers.
But that didn’t stop people from acting as if they knew enough to rely on them. Participants who said they stuck with their original answers because they felt confident also rated themselves almost a full point higher on the expertise scale. The decision to reject the AI wasn’t strongly tied to judging it as unhelpful; it was tied to believing they didn’t need the help³.
That’s where the problem starts, because now the decision isn’t really about the AI at all. It’s about calibration: whether people can tell when their own judgment is worth trusting.
Sometimes they can. Even in unfamiliar domains, some people will reason better, notice more, or get closer to the right answer.
But the study doesn’t give us any reason to believe that the people who felt more knowledgeable were actually making better decisions. In fact, self-reported expertise was unrelated to performance. What it did shape was whether the AI recommendations were used or ignored.
And that’s where things start to get messy. If people defer to AI when they feel uncertain and ignore it when they feel confident, then the usefulness of the tool depends less on its quality and more on how well those feelings track reality⁴.
Sometimes they will, and sometimes they won’t. Which means the same system can be helpful in one moment and dismissed in the next — merely because of the user’s confidence in their own judgment.
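A small worked example can show how much rides on that tracking. The Python sketch below is an illustration under stated assumptions, not a model fitted to the study: the function name, the `calibration` parameter, and every number are hypothetical. It assumes a person keeps their own answer when they feel confident and defers to the AI otherwise, that the AI's correctness is independent of the person's, and that `calibration` is the probability that the feeling matches reality (confident when right, unsure when wrong).

```python
def accuracy_with_confidence_rule(own_accuracy: float,
                                  ai_accuracy: float,
                                  calibration: float) -> float:
    """Expected accuracy when a person keeps their own answer if they feel
    confident and defers to the AI if they do not.

    calibration = P(confident | own answer right)
                = P(not confident | own answer wrong)
    0.5 means confidence carries no information; 1.0 means it is perfect.
    """
    # Own answer right and confident: keeps it, so the decision is correct.
    keep_when_right = own_accuracy * calibration
    # Own answer right but unsure: defers, correct only if the AI is right.
    defer_when_right = own_accuracy * (1 - calibration) * ai_accuracy
    # Own answer wrong and unsure: defers, correct only if the AI is right.
    defer_when_wrong = (1 - own_accuracy) * calibration * ai_accuracy
    # Own answer wrong but confident: keeps it, so it contributes nothing.
    return keep_when_right + defer_when_right + defer_when_wrong


# Illustrative values only: the AI's quality is held fixed at 0.7 throughout.
for calibration in (0.0, 0.5, 1.0):
    score = accuracy_with_confidence_rule(0.5, 0.7, calibration)
    print(f"calibration={calibration:.1f}: accuracy={score:.2f}")
```

Under these assumptions, the same AI yields an accuracy of 0.35 when confidence is systematically wrong, 0.60 when confidence is uninformative, and 0.85 when confidence perfectly tracks reality. The tool never changes; only the user's calibration does.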
A Different Kind of Decision Problem
One way to think about this is that AI — at least in the form of large language models (LLMs) — changes the kind of decisions people have to make.
With most technologies, the interaction is relatively constrained. A television shows you what’s on. A spreadsheet does what you tell it to do. Even search engines, for all their complexity, return a set of sources. They don’t try to pull those together into a single answer.
LLMs work differently.
They generate a single, coherent response that sounds like an answer. It’s tailored to the prompt, shaped by how the question is asked, and often delivered with a level of fluency that makes it feel more definitive than it actually is. And that changes how the decision has to be made.
Instead of working through a set of options, you’re starting from something that already looks resolved.
So the question is how to treat the answer you’ve been given — whether to accept it, reject it, or refine it.
And that’s a harder challenge than it looks.
Because there isn’t always a clear way to verify whether the recommendation is good or bad. There isn’t always a second source to check against. A lot of the time, you’re making a judgment without a reliable reference point.
Which brings me back to the pattern in the data.
When people didn’t have a clear way to evaluate the output, they fell back on something else. Their general beliefs about how trustworthy the system is. Their confidence in their own judgment.
And those don’t always line up with what actually leads to better decisions.
That’s the calibration problem LLMs create.
¹ Which is reasonable, given PT’s broader audience. But with Substack, I have the capacity to expand beyond a simple, easy-to-digest narrative.
² And I doubt any NASA astronauts were among our participants.
³ I tested this directly for this post, so it’s not in the original study. Among participants who chose not to complete the task a second time, those who said they stuck with their original answers because they felt confident rated themselves about a full point higher on perceived expertise than those who did not.
⁴ This is similar to a pattern I’ve written about before, where people dismiss or reshape AI outputs when they conflict with what they expect rather than revisiting their own judgment.





MANY Thanks Matt & Ruv - so glad I follow you BOTH! 👍
I’m looking at these concepts for students / children - who (from memory?) have similar misconceptions about their own knowledge…
So - you’ve introduced 2 levels - rather than one for me - as I thought that having prior knowledge of a topic might help younger students - that was my first level (and starting point) for designing some instruction? WAS?
What you introduced was whether learners understand their own personal “confidence level” - and I think many younger (& maybe older) people /students have greater confidence than their actual knowledge?
Hope this makes sense? Thanks again! 👍❤️❤️
I came across this piece right after finishing my book The Judgment Behind AI, and it immediately caught my attention.
What I find especially valuable here is the focus on calibration: the problem is not simply whether people use AI or ignore it, but how they decide whether an AI recommendation is worth trusting.
That is very close to the problem I was trying to name in the book.
The difference, I think, is one of framing.
This piece looks at the judgment problem through trust, confidence, recommendation use, and calibration.
My book looks at it through a broader “judgment structure” lens: AI does not remove human judgment. It exposes and amplifies the judgment structure already behind the user.
So when someone has clear judgment, AI can become leverage.
When someone has confused judgment, AI may only help them produce more confusion faster.
In that sense, I see this article as touching the same underlying problem from a research and decision-making angle, while my book approaches it from a practical, structural, and operational angle.
Really glad to have found this.