How Reliable Does an Assessment Need to Be?
Making Sense of Noise, Judgement, and Purpose in Assessment
Every assessment handbook tells us that a good test must be both valid and reliable. Validity ensures we’re measuring what we think we are; reliability ensures we could measure it again and get a similar result. But perfect reliability is impossible, and chasing it can be costly. In some subjects it is harder to achieve than in others. So the real question for teachers isn’t ‘is this assessment reliable?’ but ‘reliable enough for what?’ In this post, we explore how to think pragmatically about reliability, and why, sometimes, a bit of noise is a price worth paying.
Assessment is inherently noisy
No assessment can perfectly capture what a student knows and can do. Every test result is a mix of signal and noise: a glimpse of true attainment, clouded by all the unpredictable things that distort performance on the day. A useful analogy is listening to music on a radio with patchy reception. The tune is there, but interference makes it harder to hear clearly.
So, what causes this interference?
Sampling limitations. In a short assessment, we can’t test everything. The particular questions chosen will inevitably favour some students more than others.
Marking variability. Open-ended responses often depend on subjective judgement, and even well-trained assessors can differ in how they interpret the same script.
Assessment conditions. Temperature, background noise, seating arrangements, distractions, student tiredness, stress or hunger can all influence how well a student performs.
Assessment design flaws. Ambiguous questions or confusing instructions can lead students to underperform through misunderstanding rather than lack of knowledge.
Format familiarity. A well-prepared student may still struggle if the test looks different to anything they’ve seen before.
Random guessing. Particularly in multiple-choice formats, an element of chance can lift or lower a score regardless of underlying understanding.
All of this adds up to uncertainty. However well an assessment is designed, it will never give a perfectly clean signal. The important question is how much noise we can live with, and whether we’re clear on the trade-offs.
Reliability has a cost
Improving reliability isn’t free. It usually means making assessments longer, more standardised, more controlled, or more heavily marked. Each of these carries a cost, as we discussed in our last post.
The most obvious cost is time. Reliable assessment often requires a broad sampling of knowledge and skills, which means more questions, longer papers, or multiple tasks. In some subjects, like history or English, you may need hours of writing to draw out consistent patterns in student performance. In others, like maths, a relatively short paper can already sample a wide range of small, distinct skills (which is why it is odd that such different subjects are given equal examination time at GCSE).
Reliability also demands standardisation: tight control over marking and assessment conditions. But this control can make the examination a poorer fit for the curriculum it is meant to assess.
A desire for reliable assessments is likely to drive us towards particular types of assessment task. Multiple-choice questions can be marked more reliably than essays, but they necessarily restrict the nature of the knowledge and skills we assess.
We sometimes talk as though reliability is a neutral virtue. But every attempt to boost it introduces a trade-off. And those trade-offs don’t fall evenly across subjects, or even across students. Reliability always comes at a cost: of time, of resources, and sometimes of what really matters.
How reliability is quantified (and what the numbers hide)
Measuring the reliability of an assessment is complex because the noise arises from many different sources. Commercial assessment companies tend to report a few different measures of reliability. They will assess the effect of sampling variability by calculating a test statistic called Cronbach’s alpha.1 The value tends to be high (around 0.9) where individual assessment items or tasks are measuring similar constructs and where the assessment has many items. Values will be lower where an assessment is made up of fewer tasks, particularly where those tasks cover disparate topics so that student performance varies across them.2
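For readers who like to see the mechanics, here is a minimal sketch of the calculation in Python, using a made-up score matrix of our own rather than data from any assessment provider:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a students-by-items matrix of scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of students' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Toy example: six students, four items each scored 0-3 (invented data)
scores = np.array([
    [3, 2, 3, 2],
    [1, 1, 0, 1],
    [2, 2, 2, 3],
    [0, 1, 1, 0],
    [3, 3, 2, 3],
    [2, 1, 2, 2],
])
print(f"Cronbach's alpha: {cronbach_alpha(scores):.2f}")  # about 0.91 here
```

With items that hang together like these, alpha lands around 0.9; items tapping very different topics would pull it down.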
Reliability estimates that try to account for the other sources of noise listed above tend to suggest much higher levels of uncertainty. For example, test-retest reliability, where a student sits two parallel forms of a test (e.g. two different SATs reading papers), might show quite low correlations.3 And correlations between two different assessments designed to test similar knowledge might be much lower still.4
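To make the contrast concrete, a test-retest (or parallel-forms) estimate is simply the correlation between the same students’ scores on two sittings. A hypothetical sketch, again with invented scores:

```python
import numpy as np

# Invented scores for the same ten students on two parallel papers
paper_a = np.array([55, 62, 48, 71, 66, 40, 58, 75, 52, 63])
paper_b = np.array([62, 55, 50, 65, 58, 52, 49, 70, 60, 71])

# Pearson correlation between the two sittings: a rough test-retest estimate
r = np.corrcoef(paper_a, paper_b)[0, 1]
print(f"Test-retest correlation: {r:.2f}")  # well below 0.9 for these made-up scores
```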
We know that the current Key Stage Two tests have high Cronbach’s alpha (0.9 and above), which means student performance will not be too sensitive to the luck of which particular questions were included in the assessment.5 The historic Key Stage Two writing tests had low marker consistency, which is one reason why they were scrapped in favour of teacher moderation of a writing portfolio (though that may be equally unreliable, of course!).6 The exams regulator, Ofqual, has published a large body of research showing that marker consistency is perhaps lower than teachers and students might realise, particularly in essay-based subjects such as history.7
Reliability must be judged against purpose
We call an assessment reliable when the noise is weak relative to the signal. But how clear does that signal need to be? If you’re listening to music on the radio, a bit of static might be fine if you’re just trying to recognise the tune. If you’re transcribing the lyrics line by line, it’s not. The same applies to assessment: ‘reliable enough’ always depends on ‘reliable enough for what?’
This is why reliability isn't an absolute property. It’s a judgement about fitness for purpose. An assessment used to inform teaching decisions might need to be very reliable, especially if it's shaping what happens in the next lesson or which students receive intervention. On the other hand, a low-stakes quiz that gets students thinking, or nudges them to revise, might do its job even if the scores aren’t all that precise.
In our earlier post on why we don’t use the terms summative and formative, we argued that all assessments serve multiple purposes: before, during, and after the test. That means reliability matters for all of them. But the level of reliability required will vary.
Some examples:
If you’re placing students in maths sets or groups for next year, you need a test that's robust enough to avoid obvious misclassifications, particularly where set placement informs GCSE tier entry.
If you’re giving feedback to students and parents, the reliability of the grade will shape their trust in it and their motivation to respond to it.
The danger comes when we apply the same reliability threshold to all assessments, expecting the same precision whether we’re making life-altering decisions or giving a student feedback on a practice essay. Not every assessment needs to transcribe the lyrics. Sometimes it’s enough to know that the melody is playing.
Unreliability affects motivation
When assessments are unreliable, they can really sap motivation.
I had piano lessons as a child with a teacher who was, in one particular way, unreliable. She never missed a lesson, but the feedback she gave me on my playing seemed entirely disconnected from how much I’d practised or how well I thought I’d performed. Some weeks she was full of praise; other weeks, oddly dismissive. Over time, I realised it didn’t seem to matter whether I’d worked hard or not because the response felt random. And so, I stopped practising.
Students make effort–reward calculations all the time. If their assessment experiences suggest that effort won’t lead to recognition or improvement, they will act accordingly. As we’ve written before, motivation depends on more than just the stakes of a task; it also depends on the perceived credibility of the judgement. Unreliable assessment risks breaking that link. And when it does, learning is the first casualty.
What teachers can do
Most teachers won’t be calculating Cronbach’s alpha on their Year 9 assessments, and nor should they. But that doesn’t mean we should ignore reliability altogether.
Instead, take a rough, pragmatic view. Ask yourself: if this student sat the same test again next week, how different might their result be? If the plausible range is wide, say so. Report a grade as “C, likely between B and D” rather than a precise score. Avoid over-interpreting minor differences between students, especially near cut points. And be wary of aggregating across noisy data. It’s easy to mask unreliability by giving many in the class an “Expected” standard, but doing so creates unreliable cliff edges where none really exist (we’ll return to these in a forthcoming post).
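As a rough illustration of what “say so” can look like in numbers (the figures below are entirely hypothetical), a test’s reliability coefficient can be turned into a plausible score band via the standard error of measurement described in the footnotes:

```python
import math

def score_band(observed: float, sd: float, reliability: float, z: float = 1.0):
    """Plausible range around an observed score, given the test's reliability.

    The standard error of measurement is sd * sqrt(1 - reliability);
    z = 1.0 gives roughly a 68% band, z = 1.645 roughly a 90% band.
    """
    sem = sd * math.sqrt(1 - reliability)
    return observed - z * sem, observed + z * sem

# Hypothetical figures: a mark of 62 on a test with spread (SD) 12 and reliability 0.9
low, high = score_band(62, sd=12, reliability=0.9, z=1.645)
print(f"Mark 62, 90% band: roughly {low:.0f} to {high:.0f}")  # about 56 to 68
```

Even with a reliability of 0.9, the 90% band spans about a dozen marks, which is exactly why “C, likely between B and D” is more honest than a single figure.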
You can also talk to students about this uncertainty. It helps them see that assessments aren’t fixed verdicts, but snapshots taken through a blurry lens. That understanding can reduce anxiety, encourage resilience, and support a healthier, more motivated approach to learning.
See, for example, GL Assessment (2024). Reliability of Exact, webpage available at: https://support.gl-assessment.co.uk/knowledge-base/assessments/exact-support/introduction/reliability-of-exact
Assessment companies usually convert this analysis into a more useful statistic that reports how large the fluctuations in a student’s score might typically be. This is either reported as a band within which the company estimates there is a 68% chance the student’s score falls (known as the standard error of measurement) or as a wider 90% band.
GL Assessment (2024). Reliability of Exact, webpage available at: https://support.gl-assessment.co.uk/knowledge-base/assessments/exact-support/introduction/reliability-of-exact
Allen, R., Jerrim, J., Parameshwaran, M. and Thomson, D. (2018). Properties of commercial tests in the EEF database, EEF Research Paper 001 (February 2018).
GL Assessment (2024). Validity of Exact, webpage available at: https://support.gl-assessment.co.uk/knowledge-base/assessments/exact-support/introduction/validity-of-exact
FFT Education Datalab (2019). How reliable are Key Stage 2 tests?, webpage available at: https://ffteducationdatalab.org.uk/2019/04/how-reliable-are-key-stage-2-tests/
Ofqual. Marking reliability of the Key Stage 2 National Curriculum English writing tests in England, report available at: https://assets.publishing.service.gov.uk/media/5a81d19de5274a2e87dbfa4d/0214_Ofqual-marking-reliability-of-the-ks-2-nc-english-writing-tests-in-england.pdf
Ofqual. Reliability of assessment compendium, available at: https://www.gov.uk/government/publications/reliability-of-assessment-compendium
Another great post both, thank you. I’m again reminded that these messages about the relative unreliability of assessment data should be known by governors and trustees, who tend to think that grades are absolute and unquestionable. As senior leaders, we should be reminding our boards of this instead of allowing them to continue with a misconception.
I’ve always felt comfortable with the unreliability of assessment, as long as everyone around me understands it too. It’s where people don’t understand the impossibility of reliability (and validity) that things are tricky (hence the comment above re governing and trust boards).
It’s been useful to me to consider what broad inferences I can take from assessment data, to make best bets on the next course of action.
For example, allocating interventions based on data (which is what a lot of trusts did with national tutoring funding post Covid) has always seemed a bit bonkers. I think it can guide the decision, but the data will mean that some kids getting the intervention won’t really need it, or won’t be interested in it, and some not getting it would have benefitted massively from it. Better to make decisions based on inferences drawn from triangulating with soft data: staff knowledge, for example.
It’s the point you’re making above: reliable enough ‘for what?’ Or perhaps we should ask: is assessment data ‘unreliable’ enough to avoid or be cautious about certain assumed next steps? :)
Thought provoking and robust as ever. Thanks.
The Ofqual research should be widely publicised - an English exam answer has only a 50% chance of receiving the ‘right’ grade. If teachers knew this we might teach proper writing rather than spurious exam skills.