The Curriculum Ghost in the Data Machine
The hidden reason shared assessments fail in most secondary subjects
Fifteen school data leads walk into a room. They’ve brought their Heads of Science, English, History and Maths, and one very reasonable ambition: to write shared end-of-year assessments so they can compare how well year groups are doing across schools.
You can see the appeal. If you only ever assess within one school, nearly every comparison is compromised:
Comparing departments inside a school is meaningless if “73%” in English doesn’t represent anything like the same achievement as “73%” in History.
Comparing this year’s Year 8 with last year’s Year 8 runs straight into the “sawtooth effect”: marks dip when a new assessment is introduced, creep up as teachers learn it, then dip again when the next new one arrives.
Comparing classes within a year group only makes sense if pupils were randomly allocated, which they rarely are.
So, shared assessments across schools look like a neat solution: one common instrument, administered in similar conditions, producing a benchmark that feels fair. Done well, it could allow diagnostics at scale: spotting broad patterns, checking whether curriculum plans are landing, and noticing where cohorts may need extra attention.
The group begins.
And almost immediately, they discover that the hardest thing to standardise is not the test.
It’s the curriculum.
The real problem: shared assessments are a curriculum-alignment technology
There’s a familiar mantra I use to talk about “meaningful comparison”. All you need is…
standardised tests
of a standardised curriculum
sat in standardised conditions
with roughly standardised expectations of stakes
reported with standardised meaning
In practice, schools can usually negotiate most of those. The sticking point is the second bullet: if what’s been taught varies, what the test is sampling varies too — and then your comparisons wobble, even if everyone sits the same paper on the same day and marks it consistently.
To show why, here are four tales from those fifteen Heads of Department.
1) Science: the tyranny of timing
The science Heads arrive optimistic. Many are following the same programme of study. Surely this will be straightforward?
Then the first snag: topic sequence.
By June, one school has most recently taught Earth and Space. Another has just finished Sound and Light. A third is in the middle of something else entirely.
This matters because of the recency effect: pupils tend to do better on what they’ve encountered recently, particularly when topics don’t naturally interleave enough to keep earlier content warm.
You might argue the recency effect “balances out” across schools: every school is fresh on something, so it all comes out in the wash. In theory, perhaps. In practice, that would require an almost impossible level of precision in balancing (the toy simulation after this list shows why):
the proportion of marks by topic
the difficulty of questions within each topic
the way knowledge and skills are distributed across the paper
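To make that concrete, here is a toy simulation in Python. Every number in it is invented; it simply shows three schools with identical pupils and identical teaching quality whose scores diverge purely because of which topic each taught last:

```python
# A toy model of the recency effect: all numbers are invented for illustration.
TOPICS = ["earth_space", "sound_light", "forces"]

# Proportion of paper marks allocated to each topic:
WEIGHTS = {"earth_space": 0.40, "sound_light": 0.35, "forces": 0.25}

# Base probability a typical pupil scores a mark on each topic:
DIFFICULTY = {"earth_space": 0.55, "sound_light": 0.70, "forces": 0.60}

RECENCY_BOOST = 0.15  # uplift on the most recently taught topic

def expected_score(recent_topic: str) -> float:
    """Expected paper score for a school that finished the year on `recent_topic`."""
    total = 0.0
    for topic, weight in WEIGHTS.items():
        p = DIFFICULTY[topic] + (RECENCY_BOOST if topic == recent_topic else 0.0)
        total += weight * p
    return total

# Three schools, identical pupils and teaching quality, different sequencing:
for topic in TOPICS:
    print(f"finished on {topic:<12} -> expected score {expected_score(topic):.3f}")

# The gap between schools is pure timing: each school's uplift is
# RECENCY_BOOST x the weight of its most recent topic, so it only cancels
# if every topic carries identical weight. Let boosts or difficulties vary
# by topic, as they do in reality, and the balancing act gets harder still.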
Then a second snag arrives: pace.
One school is behind because Year 8 has fewer science hours (with more planned in Year 9). Should those pupils be assessed on content they simply haven’t been taught yet? If you remove the topic, you change the domain and create a new unfairness; worse, you risk incentivising curriculum delay if schools believe delay will be rewarded.
What the science group learns: even when everyone broadly agrees what “good science” looks like, the paper becomes a negotiation about sequencing and pace. You can write something shared, but you should not pretend it’s a clean comparison unless the curriculum timing is genuinely aligned.
2) English: the problem of multiple texts (and invisible bias)
The English Heads have strong shared instincts about what “good writing” looks like. They expect to disagree about fine points, not fundamentals.
And then they hit the wall: texts.
Across fifteen schools, pupils have studied a wide range of novels, plays and poems. There’s some overlap — perhaps two Shakespeare options — but beyond that you can quickly find yourself staring at ten different works.
A GCSE-style “optional paper” seems tempting: everyone answers the Shakespeare question, then each pupil answers a question on the text they studied.
But that raises two difficult questions:
Equivalent difficulty: how do you ensure questions across different texts are genuinely comparable?
Comparable marking: how do you moderate scripts written on different texts so the grade boundaries mean the same thing?
Exam boards can do this because they have thousands of scripts, statistical equating, trained examiners, and external reference points. A small cluster of schools cannot recreate that machinery. And there’s an additional hazard: teacher markers are far more likely to know “which school taught which text”, which invites all the usual marker bias problems (even with good intentions).
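To give a flavour of that machinery, here is a minimal sketch of mean-sigma linear equating, about the simplest member of the statistical-equating family. The scores below are invented, and real exam boards use far richer methods (equipercentile equating, IRT, anchor items):

```python
# A minimal sketch of mean-sigma linear equating. All scores are invented.
from statistics import mean, stdev

def mean_sigma_equate(x: float,
                      form_x_scores: list[float],
                      form_y_scores: list[float]) -> float:
    """Map a score on form X onto the form Y scale, assuming the groups who
    sat each form are equivalent (a big assumption for fifteen schools)."""
    mu_x, sd_x = mean(form_x_scores), stdev(form_x_scores)
    mu_y, sd_y = mean(form_y_scores), stdev(form_y_scores)
    return mu_y + sd_y * (x - mu_x) / sd_x

# Invented marks for two question options written on different texts:
macbeth_scores = [12, 15, 18, 20, 22, 25, 27]
tempest_scores = [10, 13, 15, 17, 19, 21, 24]
print(round(mean_sigma_equate(20, macbeth_scores, tempest_scores), 1))

# Even this crude version needs stable score distributions to estimate the
# means and spreads. With a handful of scripts per text per school, the
# estimates are far too noisy to defend -- the group's problem in miniature.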
So, the group ends up doing what is sensible: they write a shared component where overlap exists (for example, the Shakespeare texts), a shared assessment of transferable skills (unseen reading, SPaG, certain writing tasks), and they drop the rest.
What the English group learns: where curriculum content diverges sharply, a fully shared paper is either unmanageable or misleading unless you dramatically narrow what you assess.
3) History: when the “construct” won’t travel without its context
Then the history Heads sit down. They are quickly confronted by a brutal reality: minimal overlap in period study in any given year group.
Even when there is nominal overlap — say, “the British Empire in Year 8” — the lens varies: one school foregrounds the experiences of the colonised, another focuses on trade, economics and administration. Timelines and emphases differ too.
They consider an optional paper with enough periods to catch everyone. But the deeper problem appears as soon as they attempt to write and mark questions:
marking an answer in history depends heavily on what knowledge was taught and what interpretations were emphasised
“understanding” of abstract ideas (power, war, leadership, causation) is not easily separable from the concrete context through which those ideas were encountered
A pupil who can write sharply about power in Tudor England may not automatically be able to think clearly about power in the Civil Rights Movement. This is not because they are weaker historians, but because the knowledge base is different. The construct doesn’t travel cleanly without the context.
So the history group does something honest: they abandon the attempt.
What the history group learns: if a subject’s thinking is deeply tied to specific content, and that content differs across schools, the idea of a shared end-of-year paper can collapse.
4) Maths: when everything aligns, it suddenly becomes easy
Now the contrast case.
The maths Heads find common ground almost instantly. Why? Because they are doing something the other subjects are not doing in the same way: they are following the same purchased curriculum, taught in the same order, at broadly the same pace.
Maths is also a more obviously hierarchical domain in school: later topics genuinely depend on earlier ones, and there is often clearer agreement about what “prerequisite knowledge” means.
So they can quickly decide what the paper should contain, what proportion of marks should go where, and what kinds of questions best sample each area.
And it is no coincidence that the subject where this is easiest is also the subject where commercial organisations can most readily sell shared assessments at scale.
What the maths group demonstrates: when curriculum alignment is real, shared assessments become straightforward.
So what should schools do with all this?
The “Four Tales” leave us with a hard truth: the ghost in the machine is the curriculum.
Shared assessments mostly measure curriculum alignment.
If you use them as if they measure “department quality” in isolation, you will often reward the school that happened to have taught the sampled material most recently, most thoroughly, or most closely to the paper.
That does not make shared assessment worthless. It just tells you what it really is.
If you still want shared assessments, here is the least-bad way
If your goal is sensible benchmarking for curriculum evaluation (not high-stakes judgement), you can make shared assessments far more defensible by being explicit about design and limits.
1) Narrow the purpose
Write down, in advance, which decisions you will and won’t make from the results. If you can’t write that down without squirming, you’re about to misuse the data.
2) Agree the domain (properly)
Don’t just agree “Year 8 science”. Agree a blueprint: topic × skill, with weightings. What is in? What is out?
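One possible shape for such a blueprint, sketched as a checkable data structure (the topics, skills and weightings below are placeholders, not recommendations):

```python
# A blueprint as marks allocated per topic x skill cell, so "Year 8 science"
# becomes an explicit, checkable agreement rather than a vague label.
# Every topic, skill and weighting here is a placeholder.
BLUEPRINT = {
    # topic: {skill: proportion of total marks}
    "earth_space": {"recall": 0.10, "application": 0.10, "data_analysis": 0.05},
    "sound_light": {"recall": 0.10, "application": 0.15, "data_analysis": 0.10},
    "energy":      {"recall": 0.10, "application": 0.15, "data_analysis": 0.15},
}

def check_blueprint(blueprint: dict[str, dict[str, float]]) -> None:
    """Fail loudly if the agreed weightings don't account for the whole paper."""
    total = sum(w for skills in blueprint.values() for w in skills.values())
    assert abs(total - 1.0) < 1e-9, f"weightings sum to {total}, not 1.0"

check_blueprint(BLUEPRINT)  # anything not in this table is explicitly out
```

Writing it down this way forces the “what is in, what is out” argument to happen before the paper exists, not after the results arrive.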
3) Align timing
If schools are teaching units in different orders, accept that a single end-of-year paper will carry recency effects. Either align sequencing for the assessed content, or interpret results as “partly about timing”.
4) Build in moderation with anchors
Use shared mark schemes, standardisation meetings, and a small set of anchor scripts or exemplars. Treat grading as a collective craft, not an afterthought.
5) Make “stakes” genuinely comparable
Pupils need to treat the assessment as roughly the same kind of event across schools. Otherwise you are comparing student effort as much as attainment.



Really enjoyed this. On the following:
"One school is behind because Year 8 has fewer science hours (with more planned in Year 9). Should those pupils be assessed on content they simply haven’t been taught yet? If you remove the topic, you change the domain and create a new unfairness; worse, you risk incentivising curriculum delay if schools believe delay will be rewarded."
I don't really see this one as such a big deal. If everyone agrees on the content of a good science curriculum, and one school chooses to spend less time delivering it, then a shared assessment can still be revealing, as it will show the school the impact of that choice. Given the strong link between KS3 and KS4, it's then valid in the sense that it tells the school what is likely to occur further down the line as well?
Does this not perhaps call into question the whole purpose of assessment? As Wiliam says, we should be thinking of responsive teaching, not assessment. For too long assessment has been too focused on ranking, comparing and hierarchy rather than the development of the individual - why does it matter that John is better than Jane? Is it not more important how both John and Jane can develop? Standardised assessment assumes standardised children!