When the test takes over
The gravitational pull of examinations on schools and teachers
A favoured phrase of headteachers, heard at open events across the country, is the defiant pronouncement: “We are not an exam factory!”
If results are high, this claim defends against criticisms of hot-housing. If results are mediocre or low, the school points instead to the broader educational outcomes its students achieve.
The truth is that schools care deeply about ensuring students do well in exams and that they become well-rounded people with bright futures. These are not mutually exclusive goals. However, scrutinising the ways schools go about achieving better exam outcomes is wise because exams weigh upon schools and teachers as much as they do students. And when something weighs upon us, it can change our behaviour, for better or worse.
The gravity of exams
In a previous post, we made the case that the concept of ‘test stakes’ is limited in that it assumes that the weight of consequence is a property of the test itself rather than how it is perceived. What is high stakes to one student may matter little to another, and vice versa.
We prefer instead to think about the gravity of a test: the psychological pull an assessment has on an individual’s behaviour. This pull is a consequence both of the real-world implications of doing well or poorly in an assessment and of the extent to which these affect the student emotionally and socially.
School leaders and teachers are also subject to the gravitational pull of assessment. As with students, the pull of the assessment will depend on the interplay of various factors, including the real or perceived stakes of the assessment, beliefs about self and ability, and contextual factors. Focussing merely on the high stakes of the exam system ignores the fact that it is gravity that matters more: how the presence of the assessment weighs upon different parts of the system and those within it, and whether this pulls their behaviour in positive or negative ways.
The examinations system may exert a greater pull on behaviour where the school is performing poorly in performance tables, where competition between schools is high and there aren’t enough students to go around, where the school is down-at-heel in other ways (for example, the poor state of school buildings), where there is a legacy status (such as being the ‘old secondary modern’ to the ‘old grammar school’ down the road), and where the headteacher perceives that their job is insecure. However, high-performing schools can also feel the weight of parent and community expectations and have a reputation to protect.
Teachers within these schools will feel this institutional context day to day. However, the gravity of examinations will be tempered or magnified by their department’s relative performance and status, and by their own circumstances and mindset. Teachers who have been playing the game for many years, confident in their ability and towards the end of their career, will feel the pull of the examinations system very differently to less confident, early-career teachers. The profile of staff in a school can have a significant effect on behaviour.
High-gravity effects can be problematic if the behaviours that result are undesirable, but this need not be the case. Perhaps more problematic is a situation where school leaders and teachers feel apathetic or complacent about the game being played.
Feeling the pull
When a test exerts a strong gravitational pull on teachers and school leaders, they change their behaviour. In what ways?
The academic Daniel Koretz and colleagues identify seven responses: teaching more, working harder, working more effectively, reallocation, alignment, coaching and cheating.1
The first three are generally positioned as positive effects of an exams system that weighs on teachers. However, teaching more tips into being toxic when additional instruction eats away at teachers’ and students’ holidays and leisure time. At a school level, teaching more can also mean giving one subject more curriculum time than another. This may be justified by the relative importance of subjects or the amount of curriculum content to be taught. However, if the adjustment is merely a response to the weight given to qualifications in school progress measures, such as the double-weighting of maths and English in the Progress 8 measure, then the justification is weak. As a rule of thumb, we should be able to point to an educational benefit when making these trade-offs.
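To see why that weighting exerts such a pull, here is a simplified sketch of the arithmetic behind Progress 8. It glosses over details (such as which English qualification counts double and how the expected score is estimated from Key Stage 2 prior attainment), so treat it as an illustration rather than the exact DfE specification:

$$\text{Attainment 8} = 2\times\text{English} + 2\times\text{Maths} + \sum_{i=1}^{3}\text{EBacc}_i + \sum_{j=1}^{3}\text{Open}_j$$

$$\text{Progress 8} = \frac{\text{Attainment 8} - \text{Expected Attainment 8}}{10}$$

Because English and maths each fill two of the ten slots, a one-grade improvement in either moves the measure twice as far as the same improvement in an ‘open’ subject: exactly the incentive to reallocate curriculum time described above.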
Reallocation refers to shifting instructional time and study time towards what is expected to be tested, sometimes referred to (negatively) as ‘narrowing instruction’. However, this mechanism can be positive if attention is drawn towards the aspects of the curriculum that are central to what it means to be good at the subject. We wrote here about whether a subject has a unified or fractured knowledge domain, i.e. whether it is meaningful to say that someone is ‘good at’ a subject. Terminal assessments, such as GCSE exams, should seek to capture the extent to which the student has mastered a domain. If the assessment is designed well, it will focus on the performance indicators (or what Koretz calls ‘substantial inference weights’) which enable valid inferences to be made about subject mastery (see here for more about sampling the domain). Therefore, spending more time on what is to be tested will also mean spending more time on the most valuable aspects of the subject. However, where alignment is weak, reallocating instructional time towards what is expected to be tested may inflate scores at the expense of learning substantive content.
It is easy to downplay the problems associated with reallocation (narrowing instruction) by arguing that, if the test is designed well, it won’t matter. However, aligning tests to reflect what is valued in the curriculum is not straightforward. To begin with, these are value judgements, and what an exam board believes should be weighted more heavily may not reflect the beliefs of others in schools or in the discipline more widely. Then there are difficulties in encoding these beliefs in test design criteria. Lastly, there are inequalities created by how well teachers understand the customs of the examination. Some teachers will study examiners’ reports and perhaps even mark papers themselves, gaining additional insights into how exams are designed and where instructional time is best focussed.
Exam coaching refers to the practice of prepping students specifically for the question content and style they should expect. Koretz distinguishes between substantive coaching, whereby students are taught what questions typically focus on, and non-substantive coaching, which is about the form of the assessment items. For example, an English teacher may inform students that a question about character flaws is more likely to focus on Romeo than Juliet (substantive coaching), whereas an economics teacher may teach the technique of eliminating the deliberate distractor and other incorrect responses in a multiple-choice question to arrive at the correct answer (non-substantive coaching).
Coaching can remove construct-irrelevant impairments, thereby improving a test’s ability to reveal students’ mastery of a subject: making sure that a question doesn’t ‘trip up’ a student by the way it is structured, or eliminating the element of surprise. However, coaching can also inflate scores at the expense of subject mastery by limiting the student’s ability to handle novelty, apply knowledge flexibly, and acquire a breadth of knowledge.
And then there is cheating, which is always bad. For teachers, this ranges from excessive feedback on controlled assessments, to inflating scores, to steering students’ responses and violating controlled conditions.
The validity of inference
Koretz’s seven responses illustrate that the behavioural effects of testing are varied and mixed. The concept of gravity suggests that these effects may be inconsistent across the system.
Why does this matter?
It matters because many things depend on our ability to make valid inferences from examination results. An examination must tell us something meaningful about a student’s mastery of a subject for accreditation to be valid. It must tell us something about who is ready to progress to further study in the subject. At a system level, examination results are used to make inferences about whether schools are effective and whether educational standards are improving.
If the behaviours promoted by the examinations system improve learning and therefore improve exam performance, without harmful unintended consequences, then so much the better. However, if the result is inflated test scores which do not reflect an underlying improvement in subject mastery, then we should be concerned.
At 100% Assessment, we contend that the purpose of assessment is to promote learning. Gravity effects are useful to the extent that they do so. Where behaviours are directed towards raising test scores without improving subject mastery, something is going wrong.
For those of us claiming not to be an exam factory, what standards should we hold ourselves to?
Examinations should weigh upon us. We should allow them to motivate us towards more productive behaviours.
There should be safeguards against over-working for teachers and students.
Curriculum time should be apportioned by factors other than how heavily weighted the subject is in school performance measures.
We should select syllabuses according to how well the assessment approach reflects what it means to be good at the subject and guard against narrowing the scope of our instruction for the sake of test performance.
We should coach students to the extent that they can confidently show what they know, but not so that they can score marks merely through game play.
There should be no tolerance of, or complacency about, cheating.
If we can hold ourselves to these standards, examinations can be made to work in our favour as a powerful motivating force.
1. Koretz, D.M., McCaffrey, D.F. and Hamilton, L.S., 2001. Toward a framework for validating gains under high-stakes conditions. Center for the Study of Evaluation, National Center for Research on Evaluation, Standards, and Student Testing, Graduate School of Education & Information Studies, University of California, Los Angeles.


