Does Your Test Really Measure Attainment?
When summed scores mislead and how IRT helps
You finish writing the end-of-year paper. You set the marks for each question, invigilate, collect the scripts, and spend an evening marking. Then comes the tidy bit: add up the totals, sort the class from highest to lowest, and you have your picture of who’s strongest in the subject.
But does that ranking of scores really tell you what you think it does? Are the students at the top genuinely the most secure, or did the mix of questions flatter them? Do equal marks always mean equal attainment?
In this post we’ll look at why summed scores feel straightforward yet rest on fragile assumptions, how item response theory (IRT) offers a truer read of underlying understanding, and how you can use its insights to align your own classroom tests more closely with what you mean by attainment, even if you never touch an IRT model.
The hidden complexity of the simple summed score
Summed scores feel like the fairest way to judge performance: add up how many marks a pupil gets and assume the total reflects how good they are. This logic, rooted in classical test theory, underpins almost all classroom and exam grading. But beneath the simplicity sit several strong assumptions. It assumes every question contributes equally to the total: that a mark on an easy item carries the same meaning as a mark on a hard one, and that each question is equally good at separating stronger from weaker pupils. It also assumes the questions all tap the same underlying construct, and that errors from marking or guessing simply wash out. When these conditions hold (roughly, when all questions behave like miniature versions of each other), totals are a reasonable guide to attainment.
In practice, these conditions rarely hold. Real papers blend easy and hard items; some turn out to be poor discriminators; others drift into slightly different skills. In maths, we mix quick recall questions with complex multi-step problems on the hardest topics. In science, pupils can pick up marks by banking vocabulary definitions or by reasoning with data. In history, we might follow a bundle of short factual items with an open-ended explanation of causation.
Two pupils can therefore land on the same total by answering very different questions correctly, yet the ‘attainment’ implied by that total is not equivalent. Summed scores blur the line between what was asked and what was learnt: they are as sensitive to the particular mix of items as to the underlying understanding we hope to measure.
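To make that concrete, here is a toy illustration in R. The questions and marks are invented for the sake of the example: two pupils reach the same total from very different evidence.

```r
# A toy illustration with invented marks: two pupils reach the same total
# from very different questions.
marks <- data.frame(
  question  = c("Q1 recall", "Q2 recall", "Q3 recall", "Q4 multi-step", "Q5 multi-step"),
  max_marks = c(1, 1, 1, 3, 3),
  pupil_A   = c(1, 1, 1, 1, 0),   # sweeps the easy recall, one mark on a hard item
  pupil_B   = c(0, 1, 0, 0, 3)    # full marks on one demanding question
)
colSums(marks[, c("pupil_A", "pupil_B")])   # both total 4
```

So what would a scoring approach look like that accounts for item difficulty and for how informative each question is?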
Why item response theory gets closer to what we mean by attainment
IRT models each question’s difficulty and discrimination and uses those to estimate a pupil’s underlying ability from their pattern of responses. The estimate depends less on which particular questions they saw and more on how well their pattern of right and wrong answers fits the predictions of each item’s characteristic curve. The result is a latent ability score, not just a total: two pupils with the same raw mark can receive different estimates if one earned marks on harder or more informative items. It also lets you compare pupils fairly across different test forms, because the scaling adjusts for item difficulty. In short, IRT turns a raw total, which is necessarily tied to the quirks of a particular paper, into a scale that better reflects how far a pupil has mastered the construct.
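To show what that means in practice, here is a minimal two-parameter logistic (2PL) sketch in base R. The item parameters and response patterns are invented for illustration; in a real analysis you would calibrate them from pupil data, for example with a package such as mirt.

```r
# A minimal 2PL sketch in base R. Item parameters are invented for illustration;
# real values would be calibrated from response data.
p_correct <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))

a <- c(0.8, 1.0, 1.2, 1.5, 1.7)    # discrimination: how sharply each item separates pupils
b <- c(-1.5, -0.5, 0.0, 0.8, 1.5)  # difficulty on the latent (logit) scale

# Two hypothetical pupils, each with 3/5 correct but on different items
responses <- rbind(A = c(1, 1, 1, 0, 0),   # the three easiest items
                   B = c(0, 1, 0, 1, 1))   # mostly the harder, more informative items

# Maximum-likelihood ability estimate via a simple grid search over theta
estimate_theta <- function(x) {
  grid <- seq(-4, 4, by = 0.01)
  loglik <- sapply(grid, function(theta) {
    p <- p_correct(theta, a, b)
    sum(x * log(p) + (1 - x) * log(1 - p))
  })
  grid[which.max(loglik)]
}

apply(responses, 1, estimate_theta)   # equal raw totals, different ability estimates
```

Both pupils score three out of five, but the pupil who answered the harder, more discriminating items correctly comes out with a noticeably higher ability estimate.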
With IRT, equal scores are more likely to reflect equal attainment. By accounting for item difficulty and discrimination, it produces fairer, more comparable measures, so results depend less on which particular questions a pupil faced. This matters most when papers differ in difficulty or when comparing across years.1 That’s why programmes like PISA and TIMSS use IRT: it helps us measure learning rather than the luck of which questions happened to be on the paper.2
This isn’t just theory; it changes how we talk about who is doing well and who is struggling. One study found that summed scores can distort teacher value-added estimates, and that IRT-based scoring markedly reduces that bias.3 Summed totals can also embed group differences when items behave differently for different groups of pupils.4 And pass/fail decisions based on raw totals are less stable than decisions based on IRT-scaled scores.5 In short: how we aggregate marks affects fairness, accountability, and the credibility of our conclusions.
Why we don’t all use IRT (and how to do better without it)
So why don’t schools just switch to IRT? Mostly, practicalities. School datasets are often too small for stable item calibration; few classroom platforms make IRT straightforward; and, weighed against the hassle, a simple total usually orders pupils “well enough” for everyday purposes. The marginal gain from full IRT can feel slim compared with the effort.
That doesn’t mean we should ignore what IRT teaches us. You can bake its insights into ordinary test design and marking. Before you set a paper, do quick sense-checks: have you allocated marks to match what you value? Are the hinge ideas and demanding concepts rewarded properly, not buried under easy one-mark recalls? Is there a sensible spread of difficulty so higher attainers aren’t all bunched at the top and lower attainers can still show what they know?
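One way to make that sense-check concrete is to tally the planned marks by difficulty and by the skill each question targets. A rough sketch in R, using an invented, hypothetical blueprint:

```r
# A quick blueprint check before setting the paper (hypothetical plan, invented numbers)
plan <- data.frame(
  question   = paste0("Q", 1:6),
  skill      = c("recall", "recall", "application", "application", "reasoning", "reasoning"),
  difficulty = c("easy", "easy", "medium", "medium", "hard", "hard"),
  marks      = c(1, 1, 2, 3, 4, 5)
)

# Where do the marks actually sit?
tapply(plan$marks, plan$difficulty, sum)   # marks available at each difficulty
tapply(plan$marks, plan$skill, sum)        # marks available for each skill
```

If the tallies don’t match what you claim to value, adjust the mark allocation before the paper goes out, not after.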
After the test, spend ten minutes checking how each item behaved. Which questions actually discriminated (stronger pupils got them right more often than weaker ones)? Which were too easy, too hard, or answered oddly across the board (perhaps poorly worded or off-scope)? Use that audit to tweak future papers and to temper how confidently you read small mark differences, especially near grade boundaries or cut scores.
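If the marks are already in a spreadsheet, this audit takes a few lines of base R. A rough sketch, assuming a hypothetical file of 0/1 marks with one column per question and one row per pupil:

```r
# A quick post-test item audit (hypothetical file of 0/1 marks, one column per question)
scores <- read.csv("year10_endofyear_marks.csv")

# Facility: proportion of pupils answering each item correctly
facility <- colMeans(scores)

# Discrimination: correlation between each item and the total of the remaining items
discrimination <- sapply(seq_along(scores), function(i) {
  cor(scores[[i]], rowSums(scores[-i]))
})

round(data.frame(facility, discrimination, row.names = names(scores)), 2)
```

Items with facility near 1 or 0 tell you little about differences between pupils, and items whose discrimination sits near zero (or goes negative) deserve a second look at the wording and the mark scheme.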
If you must compare across question papers, avoid treating raw totals as equivalent. Anchor comparisons with a few well-behaved common items, or at least add a short note on form difficulty (“Paper B ran easier; expect slightly higher totals”).
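If both papers share a handful of anchor questions, a rough sense-check is possible even without formal equating. A sketch, assuming hypothetical files of 0/1 marks in which columns A1 to A3 are the shared anchors:

```r
# A rough sense-check with shared anchor items (hypothetical 0/1 data; not formal equating)
paperA <- read.csv("paperA_marks.csv")   # columns A1..A3 are the shared anchors
paperB <- read.csv("paperB_marks.csv")

anchor_cols <- c("A1", "A2", "A3")

# Performance on the shared anchors: a proxy for how the two cohorts compare
anchor_gap <- mean(rowMeans(paperB[, anchor_cols])) - mean(rowMeans(paperA[, anchor_cols]))

# Performance on each paper's own (non-anchor) items
rest_A <- mean(rowMeans(paperA[, setdiff(names(paperA), anchor_cols)]))
rest_B <- mean(rowMeans(paperB[, setdiff(names(paperB), anchor_cols)]))

# Similar anchor performance but a big gap on the rest suggests Paper B ran easier
c(anchor_gap = anchor_gap, rest_gap = rest_B - rest_A)
```

This is nowhere near the formal equating that Kolen and Brennan describe; it is simply a quick way to decide whether a caveat about paper difficulty belongs next to the results.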
To conclude, summed scores can approximate attainment well, but only under restrictive conditions. Even if you never run an IRT model, the IRT-informed mindset of thoughtful mark allocation, a planned mix of difficulty, quick item diagnostics, and humility about tiny gaps will make your assessments fairer, clearer, and more meaningful.
Hambleton, R. K., & Jones, R. W. (1993). Comparison of classical test theory and item response theory and their applications to test development. Educational Measurement: Issues and Practice, 12(3), 38–47.
Kolen, M. J., & Brennan, R. L. (2014). Test Equating, Scaling, and Linking (3rd ed.). Springer.
DeMars, C. (2010). Item Response Theory. Oxford University Press.
Jensen, N., Rice, A., & Soland, J. (2018). The influence of rapidly guessed item responses on teacher value-added estimates: Implications for policy and practice. Educational Evaluation and Policy Analysis, 40(2), 267–284.



So, I've just tried out IRT in R on a recent history MCQ (thanks, Gemini). Very interesting results. The item difficulty estimates will be good to discuss with the history team. The pupils' scores also shifted: by up to 5 points on a standardised scale, definitely enough to cross grade thresholds.