One student scored 69% on the end-of-topic test. Another got 65%. You know they'll both notice. You can already hear the questions: “Miss, did I do better than Jack?” or “Sir, was I close to the top?”
But how sure are you that Amira’s 69% really reflects higher attainment than Jack’s 65%? Are you confident enough to say so? Confident enough to use the distinction to decide who gets a reward, an email home, a stronger predicted grade, or a place in the top set?
This post is about how finely an assessment can—and should—discriminate between students. It follows closely from our June post on reliability, so if you haven’t read that one, we’d recommend starting there.
What is discrimination in assessment?
In everyday language, discrimination suggests unfairness, prejudice, bias, or unequal treatment. But in assessment, it means something more technical: how well a test distinguishes between students with high, medium, or low levels of attainment.
Imagine lining up your class from strongest to weakest in their understanding of a topic. Is this easy for you to do? Now glance up and down the line. How confident are you that the order is right - roughly, or even precisely? Where are you least sure? Is it in the middle, or near one end?
An assessment discriminates well when it consistently spreads students out according to their level of understanding. But perfect discrimination across the whole range isn’t usually the goal. In this post, we’ll explore how to match the level of discrimination we aim for to the purpose of the assessment.
The relationship between discrimination and reliability
Discrimination and reliability often go hand in hand, but they’re not the same thing. Discrimination refers to how well an assessment spreads out students by attainment. Reliability refers to how consistently it does so.
High-discrimination items help support reliability. In classical test theory, reliability improves when individual items correlate well with overall test performance, and correlating with the total is exactly what good discriminators do. That’s why subjects with lots of discrete, high-discrimination items (like maths) often find it easier to produce reliable assessments.
But discrimination alone doesn’t guarantee reliability. A task might separate students well, but if the test is too short, poorly constructed, or inconsistently marked, reliability will still suffer. In short: good discrimination helps, but it’s not enough on its own.
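For readers who like to see the mechanics, here is a minimal sketch of how classical test theory typically quantifies these two ideas: a corrected item-total correlation as each item’s discrimination index, and Cronbach’s alpha as a reliability estimate. The Python code and the five-item quiz data are invented purely for illustration; they are not taken from any real assessment.

```python
import numpy as np

# Invented results for six students on a five-item quiz (1 = correct, 0 = incorrect).
scores = np.array([
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
])
totals = scores.sum(axis=1)

# Discrimination: correlation between each item and the rest of the test
# (the item itself is removed from the total so it doesn't inflate the correlation).
for i in range(scores.shape[1]):
    rest = totals - scores[:, i]
    r = np.corrcoef(scores[:, i], rest)[0, 1]
    print(f"Item {i + 1}: corrected item-total correlation = {r:.2f}")

# Cronbach's alpha: a common reliability estimate, which rises when items
# correlate well with one another (and hence with the total score).
k = scores.shape[1]
alpha = (k / (k - 1)) * (1 - scores.var(axis=0, ddof=1).sum() / totals.var(ddof=1))
print(f"Cronbach's alpha = {alpha:.2f}")
```

Items with high item-total correlations are the good discriminators, and a test built from them will usually show a higher alpha, which is the link between the two ideas described above.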
Discrimination within a single task versus a series of tasks
When an assessment includes just one task, such as an essay, a performance or practical, it typically discriminates through quality: we judge how well the task is performed, not just whether it was completed.
This contrasts with assessments made up of many short, right-or-wrong items. These tend to discriminate through difficulty. A single item can only split students into those who get it right and those who don't, so its ability to discriminate depends on how many students fall into each group. An item that's too easy won't tell you much about top performers; one that's too hard won’t distinguish among lower attainers.
To achieve broader discrimination, we can combine tasks of varying difficulty. This allows different parts of the mark range to be stretched: easier questions help distinguish lower-performing students, while harder ones help at the top. Teachers can manipulate difficulty not just through content, but also by changing task format or cognitive demand.
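To make this concrete, here is a deliberately crude simulation. The attainment values, and the rule that a student answers an item correctly whenever their attainment exceeds its difficulty, are assumptions made purely for illustration, not a model we are proposing; the point is simply to show how the mix of difficulties changes where a test can separate students.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" attainment for a class of 30, on a 0-1 scale (invented).
attainment = rng.uniform(0, 1, size=30)

def total_scores(difficulties):
    """Crude model: a student answers an item correctly iff their
    attainment exceeds the item's difficulty."""
    correct = attainment[:, None] > np.array(difficulties)[None, :]
    return correct.sum(axis=1)

easy_only = total_scores([0.2] * 10)              # ten easy items
hard_only = total_scores([0.8] * 10)              # ten hard items
mixed = total_scores(np.linspace(0.1, 0.9, 10))   # a spread of difficulties

for name, totals in [("easy only", easy_only), ("hard only", hard_only), ("mixed", mixed)]:
    print(f"{name}: {len(set(totals))} distinct scores, range {totals.min()}-{totals.max()}")
```

The all-easy and all-hard tests each collapse the class into only a couple of score groups (students either clear the items or they don't), while the mixed test spreads students across many more distinct marks.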
Most good assessments combine both approaches. Multiple-choice or short-answer questions can reveal how far a student can climb through levels of difficulty. Extended responses, meanwhile, show the quality of their thinking or performance, relative to other respondents. (We’ll return to the art of combining these elements in a future post.)
Precision comes at a cost
As we explored in our post on reliability, precision is rarely free. The same applies to discrimination. If you want an assessment that finely distinguishes between students, expect to pay for it in time, in effort, or in both.
Precise ranking may seem ideal, but the resources required can be excessive. A PE teacher might spend 15 minutes sorting students into three broad netball groups, yet ranking them individually could take over three hours. An English teacher could categorise 30 essays into three bands in an hour, but ranking each student reliably would take much longer and might require them to review multiple pieces of work. High discrimination usually demands more assessment and marking time, so we need to ask: is the extra precision worth it, given how the results will influence learning? Often, moderate discrimination is enough.
This also helps us decide where in the attainment distribution we want precision. For instance, teachers might set short quizzes after homework, knowing most students will get full marks. These quizzes aren’t designed to separate high attainers. Instead, they’re meant to encourage homework completion, give students a sense of success, and flag those who need support. By contrast, tournaments - whether in football or maths Olympiads - require fine discrimination at the top, but not at the bottom, where early rounds eliminate weaker performers.
Even with unlimited time, high discrimination isn’t always desirable. We could improve the Year 1 phonics check by adding a reading fluency task to separate students scoring between 33 and 36. But doing so might shift teaching towards fluent readers and away from those still learning to decode, or prompt rehearsed word memorisation. Better discrimination could distort learning. Phonics is a clear case where the goal is class-wide mastery. In these cases, we design assessments that only award marks for demonstrating the skill. We hope they fail to discriminate at the top end, because all students have succeeded.
Different subjects, different needs
Discrimination isn’t just a question of how well a test works. It’s also a question of how much you’re willing to pay. And in some subjects, the price is simply higher than in others.
The cost depends, first and foremost, on the nature of the knowledge domain. In maths, we can write short, tightly defined questions that test discrete skills. This allows for high discrimination and high reliability with relatively low effort. We can automate the marking, standardise the scoring, and easily spot differences in attainment between students.
But in subjects like English literature or history, the most valid assessments are often extended written responses. We need students to construct arguments, weigh interpretations, or analyse texts in nuanced ways. These tasks offer quality-based discrimination, but marking is slower, less consistent, and more subjective. To increase reliability, we often sacrifice nuance by collapsing rich responses into broad bands. In some cases, any attempt to discriminate more finely comes at the cost of validity.
Then there are subjects where the problem isn’t marking but rather the task itself. A map annotation, a cooking practical, a science experiment: these are often great for telling you whether a student can do the thing. But once a student can do it, finer distinctions are hard to justify reliably. The assessment might discriminate well at one point in the distribution (e.g. pass/fail) but not others (e.g. distinguishing “very good” from “excellent”).
Over time, these trade-offs shape subject culture. Maths teachers may feel comfortable reporting fine-grained marks or scores out of 60. English teachers are more likely to report broad grades. Art and music teachers might prefer “expected/above/below” descriptors. These habits aren’t arbitrary. They reflect the underlying assessment economy of each subject: the tasks that are valid, the costs of discriminating, and the limits of precision that feel justifiable.
When (and where) precision matters
So when does precise discrimination actually matter?
Sometimes the answer is straightforward. If you're choosing who gets into the top set, who qualifies for an intervention, or who sits the Higher tier paper, you need an assessment that can make reliable, fine-grained distinctions, particularly around critical thresholds. Without this, your decisions risk being arbitrary or unfair.
But even here, we must be careful not to confuse necessity with tradition. Take maths, for example. It’s easy to argue that high-discrimination assessments are vital for effective setting. But the reverse is also true: the reason maths teachers are comfortable with fine-grained setting is because they have assessments that make that level of discrimination feel valid and defensible. The need arises, in part, from the possibility.
This feedback loop exists in every subject. Cultural practices around grading, grouping, and feedback reflect an interplay between what we need to do and what we can do. In English, where marking essays finely is time-consuming and inherently fuzzy, we rarely attempt to rank students 1 to 30. Instead, we report grades in broad bands, both because it’s sufficient and because pushing further feels unjustifiable. The practice sustains the reporting habit, and the reporting habit justifies the practice.
The same is true in music, PE, art, and drama. Teachers rarely report 61% or 73% on a sculpture or trampolining course, not because students don’t differ, but because those differences are hard to pinpoint consistently and reliably. A broad judgement (“secure”, “developing”, “excellent”) feels more meaningful.
This is why subject-specific reporting practices endure. They aren’t just preference or inertia. They’re practical responses to the kinds of discrimination that a subject’s valid tasks make possible. Precision isn’t always needed. But when we believe we can defend it, it quickly becomes part of what we do.
So, who did better?
Let’s return to Amira and Jack. One scored 69%, the other 65%. They both want to know who “did better”. Will you tell them?
By now, you should be able to see that it depends. It depends on what the assessment was designed to do. It depends on how confident you are that 69% really does reflect higher attainment than 65%. And it depends on whether telling them about the difference will help or hinder their learning.
If this was a maths assessment, with attainment built up from dozens of questions and little room for doubt in the marking, I suspect you will feel happy to hand back the percentage marks. But if it was an English essay, marked against broad criteria, with a bit of luck in the question choice and some fuzziness in the marking, then perhaps it’s better to give them the same graded feedback.
Discrimination matters and can be useful in giving informative and motivating feedback. But only when it is meaningful. So, before you separate Amira and Jack by a few percentage points, it’s worth asking: Was this assessment built to do that job? And if not, what kind of feedback would serve them better?