Why Schools Don’t Report Uncertainty in Test Scores
Why “probably” is (probably) precise enough
A decade ago I got a single-line message from a primary headteacher. No explanation, just the following:
“113 ± 5. I guess I’m supposed to know what that means.”
I stared at it for a while, trying to work out what I was supposed to make of it. Eventually we got on the phone. We talked about averages. About standard deviations. About what sampling error actually means. We poked around the assessment provider’s website to see exactly what sort of uncertainty was, and wasn’t, wrapped into that little “±5”.
Did the call help him understand uncertainty in assessment? A little. Did it help him do his job as a headteacher? Not really - beyond the broad message that test scores are a bit noisy.
And that’s the heart of the problem. Psychometricians love and need confidence intervals. Teachers and leaders are much less sure what to do with them.
Why psychometricians love confidence intervals
Psychometricians, of course, see things differently. When they look at a score like “113”, their first instinct is to ask, how sure are we? No assessment is a perfect measure. Every test result is a blend of signal and noise: a glimpse of true attainment, clouded by all the unpredictable things that can distort performance on the day.
This is where the Standard Error of Measurement (SEM) and other statistical calculations come in. If you know how reliable a test is, you can estimate how much a score might wobble if the student sat it, or a slightly different version of the same test, again tomorrow. From that, you can build a confidence interval around the score. A mark of 69% might really mean “somewhere between 64% and 74%.” (Note: this isn’t trivial, because uncertainty can come from content and task sampling, marker variation, test-form and equating error, item-parameter estimation or model misfit, and administration or transient student factors like illness.)
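To make that arithmetic concrete, here is a minimal sketch in Python of the textbook approach: estimate the SEM as SD × √(1 − reliability), then build an approximate 95% interval as score ± 1.96 × SEM. The standard deviation and reliability figures below are assumptions chosen purely for illustration, not values from any real test or provider.

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard Error of Measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval(score: float, sd: float, reliability: float, z: float = 1.96):
    """Approximate 95% confidence interval around an observed score."""
    margin = z * sem(sd, reliability)
    return score - margin, score + margin

# Illustrative (assumed) figures: percentage marks with SD = 12 and reliability = 0.95.
low, high = confidence_interval(score=69, sd=12, reliability=0.95)
print(f"A mark of 69% might really mean somewhere between {low:.0f}% and {high:.0f}%")
# -> roughly 64% to 74% under these assumptions
```

The same two inputs (a spread and a reliability) drive the intervals that commercial providers publish; the difference is that they can actually estimate those inputs from large trials, whereas a department writing its own paper usually cannot.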
To a psychometrician, a score is not a fact but rather an estimate with uncertainty. To say that a student scored 113 asserts false precision. As we wrote in an earlier post, think of it like listening to a favourite song on the radio through patchy reception. The tune is there, but static makes it harder to hear clearly. The confidence interval is a way of saying, here’s how loud the static is.
So far, so sensible. But if confidence intervals are such a good idea in principle, why don’t schools use them in practice?
Why schools don’t (and can’t) report them
The simple answer to why schools don’t report confidence intervals is that they can’t. Most internal school assessments simply aren’t set up to support them.
Commercial test companies can publish SEMs because they run large-scale, standardised assessments, with item-level data, trialled questions, and statistical models behind the scenes. A Year 9 end-of-year history paper, written by a department one spring afternoon, is a very different beast.
In secondary schools at least, school tests are often:
One-off: written for a single year group, then put away or edited for next year.
Unmoderated: consistency between markers is assumed or only lightly checked, rather than measured and incorporated into the confidence interval.
Narrow: in humanities and English, questions are often sampled from just a couple of topics, and marks within a multi-part question or writing rubric are rarely independent of one another.
Open-ended: many subjects include essays, extended answers, and practical tasks that make confidence interval calculation complex.
Even in maths, where test formats look more standardised, the assessments used in most schools aren’t replicated, analysed, or scaled across schools in ways that would make a standard error defensible.
This is the first reason schools don’t report them: the nature of the assessments and the sample of students taking them don’t easily yield robust confidence intervals.
But there’s another reason. Even if schools could report confidence intervals, it’s not obvious it would make much difference…
Why even if they could, it wouldn’t help much
Suppose, for a moment, that every school test did come with a neat confidence interval. Amira’s 69% might be reported as “69 ±5,” Jack’s 65% as “65 ±5.” Would this really change anything?
I doubt it. Pupils and parents would still anchor themselves to the headline score. “I got 69” has more meaning than “I probably got somewhere between 64 and 74”, no matter how carefully we try to explain margins of error.
Teachers might make more cautious noises, but in practice they would still use the point estimate. (That said, at department or leadership level, confidence intervals can be useful for sanity-checking whether an apparent dip or gap is larger than normal wobble, e.g., before moving sets, changing a scheme of work, or declaring a class “off-track.”)
Confidence intervals would likely not change students’ own inferences about who did better or worse. Amira’s “69 ±5” still reads as better than Jack’s “65 ±5”, so the effect of the attainment feedback on their self-concept, and thus their future motivation to study, may well be unchanged.
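For readers who want the arithmetic behind that claim, here is a rough sketch, assuming the two “±5” margins are independent 95% intervals of equal width (itself a simplification). The uncertainty on the gap between two scores is wider than the uncertainty on either score alone, so Amira’s four-mark lead could easily be noise. This is also the kind of check a department might run before acting on an apparent dip or gap.

```python
import math

def gap_interval(score_a: float, score_b: float, margin_95: float = 5.0, z: float = 1.96):
    """Approximate 95% interval for the difference between two independently
    measured scores, each reported with the same 95% margin (e.g. '+/- 5')."""
    sem = margin_95 / z              # back out the SEM from the reported margin
    se_diff = math.sqrt(2) * sem     # standard error of the difference of two scores
    gap = score_a - score_b
    return gap - z * se_diff, gap + z * se_diff

# Amira's "69 +/- 5" versus Jack's "65 +/- 5" (illustrative figures from the text)
low, high = gap_interval(69, 65)
print(f"The 4-mark gap could plausibly be anywhere from {low:.0f} to {high:.0f}")
# -> roughly -3 to 11: the interval includes zero, so the gap may just be noise
```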
In the playground, the rankings would continue: winners and losers, top sets and bottom sets. A fuzzy interval does little to soften the competitive edge.
This is the second reason schools don’t report them: confidence intervals wouldn’t change behaviour.
Which brings us to the better question: if confidence intervals aren’t likely to be calculated or used, how should schools acknowledge uncertainty?
Better ways to acknowledge uncertainty
If confidence intervals aren’t the answer, what can schools do instead?
One option is to bake uncertainty into the reporting format and remove the point estimate, so that Amira is told they have a mark of 64-74% and Jack is told they have a mark of 60-70%. Personally, I still think Amira will believe they did better than Jack, but at least some of the ambiguity is revealed.
Another option is to show the range of grades that a student achieved in different sections of an exam or in class work over the weeks. So, as well as reporting that a student has achieved a Grade 6 in English at the end of the year, parents and the student will also see a range of grades from, say, 4 to 8, for individual pieces of work or test papers. Seeing the spread of grades should puncture ‘I’m on track’ complacency and, because they’ve already hit higher marks on individual papers, make further improvement feel attainable.
Another option is to acknowledge uncertainty narratively, so that teachers give their personal view of the range of attainment over which the student is currently working. They might write, for example, that most work sits between Grades 5 and 7 over the term, with the occasional 4 or 8 as an outlier. They can point out the student’s strongest and weakest areas, prompting a realistic sense of how future results could move either way according to effort.
These are the judgements that shape behaviour, motivation, and learning. And they can be communicated honestly, without the false comfort of numbers that look more rigorous than they really are.
Why “probably” is enough
So where does this leave us?
If one student scores 69% on an end-of-year test and another scores 65%, should we treat that as a meaningful difference? As we explored in The Discrimination Dilemma, the honest answer is often: probably. That word is unsatisfying to a psychometrician. It feels woolly, evasive, not quite scientific enough. But in the messy reality of school assessment, it’s often the right level of certainty.
Our goal at 100% Assessment is to give teachers tools for talking about attainment in a manner that is efficient and that promotes learning. They need ways to recognise the noise and uncertainty in what we know about a student’s true attainment without drowning in statistics. Confidence intervals are great where they are calculated by third parties in large-scale assessments, but the complexity of calculating them for ad-hoc school assessments usually outweighs any benefit they might have. A simple narrative — these students performed similarly; we shouldn’t overinterpret small gaps — usually serves us better.
When it comes to reporting attainment, “probably” is probably precise enough.



Adjacent to your second reason is that we typically use confidence intervals to be more honest about what we're reporting. But in the case of a test result, the most honest thing to say is "on this test, on this day, with this person marking, you got this score". Reporting a confidence interval implies that you have some clear model of how performance would vary across similar tasks, of exactly how the assessment proxies for the measured construct, and so on, which you clearly don't have.