School accountability: why multi-year measures would be fairer to everyone and how they could be implemented

John Jerrim.

School league tables produced every year by the Department for Education currently use a single year of test score data in the headline measure. Based upon how a school’s most recent cohort of Year 11s performed in their GCSEs, schools get labelled with terms such as “above” and “below” average.

This is not a smart thing to do. It would be much better to use the average performance of a school across multiple years instead.

This blog recaps the reasons why – and then suggests how this could be done.

Why is data from a single year problematic?

The Centre for Education and Youth’s election manifesto called for multi-year averages for three main reasons based on the 2017 report “Testing the Water”:

1. To provide a more reliable and valid picture of performance.

2. To offer school leaders respite from ultra-high-stakes and short-term accountability.

3. To reduce teacher workload and pupil stress.

A more meaningful measure

In a single GCSE cohort, a school can have fewer than 100 pupils. This is not a lot of data to be basing important judgements upon. On top of that, although the correlation between Progress 8 measures across two adjacent years is quite high (around 0.9), there will always be outliers where a school has had an academically weaker cohort due simply to chance.

Yet a single bad year of performance data has major consequences for teachers and schools, and is terrible news for parents who count on the government to provide them with reliable information about their children’s schools.

At a basic level, a three-year average means the sample size for each school increases; the ‘signal’ from the data improves and the ‘noise’ is reduced. This, in turn, means we have more certainty in the results – and can provide better information to the users of school data (e.g. parents when selecting their children’s school).
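The signal-and-noise point can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes pupils' scores are independent with a common spread, and the pupil-level standard deviation of 1.0 and the cohort size of 100 are hypothetical values, not figures for any real school.

```python
import math


def standard_error(pupil_sd: float, n_pupils: int) -> float:
    """Standard error of a school's mean score, assuming independent pupils."""
    return pupil_sd / math.sqrt(n_pupils)


# Hypothetical values, chosen for illustration only.
PUPIL_SD = 1.0
COHORT_SIZE = 100

one_year_se = standard_error(PUPIL_SD, COHORT_SIZE)
three_year_se = standard_error(PUPIL_SD, 3 * COHORT_SIZE)

# Pooling three equal-sized cohorts shrinks the standard error by 1/sqrt(3).
shrinkage = three_year_se / one_year_se
print(f"One year: {one_year_se:.3f}, three years: {three_year_se:.3f}, ratio: {shrinkage:.3f}")
```

Under these assumptions the uncertainty around a school's score falls by a little over 40% when three cohorts are pooled, whatever the cohort size.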

Respite for school leaders

A football manager culture has developed in schools, with Heads made or broken by a single set of results. This is in no-one’s interest and makes the job unappealing and soul-destroying for many.

Reduced teacher and pupil workload and stress

Pressure on school leaders to prove their worth with a single year’s results is frequently passed on to teachers of exam groups and, in turn, their pupils.

Official recognition that improving school results is more of a long game will discourage hot-housing of single year groups and encourage a longer-term approach.

How could this actually be done? An illustration

Since Testing the Water was published, extensive discussions about the three-year average idea have thrown up remarkably few objections, with the two predominant ones being:

1. Is it possible?

2. Is there a risk of not spotting a school in rapid decline?

This blog is intended to rebut both these objections.

How do we actually go about using data from multiple years to generate the headline performance measure of a school?

Let’s take three years of Progress 8 data from the Jubilee High School (URN = 139948). This school had a single difficult year in 2017/18 (P8 score of -0.35), but did much better in 2016/17 and 2018/19 (P8 scores of 0.23 in both).

What then is the best way to present this data to make the true performance of the school clear?

Figure 1 presents results from a statistical technique known as a meta-analysis. It is normally used to summarise results from across multiple studies – basically the methodology underpinning the Education Endowment Foundation’s toolkit.

Yet this technique can also be used to combine the information we have about schools from across multiple years – as shown in Figure 1. This presents the Progress 8 results of this school in a “Forest Plot”, neatly illustrating the trend in its P8 scores over time. Importantly, at the bottom it also provides our best guess of the overall performance of the school, combining information from across all three years.

The outlying year (2017/18) has some impact upon the overall result, but does not completely overwhelm it. This is fair given the school’s much stronger performance in the previous and subsequent academic years, and is a better reflection of the progress typically made by pupils in the school.

Our proposal is that, from next year, this becomes the headline measure used in the school performance tables.

Figure 1. Forest plot of the Progress 8 scores for Jubilee High School (URN = 139948)
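For readers who want to see the mechanics, here is a minimal sketch of one common way to pool results in a meta-analysis: a fixed-effect, inverse-variance average. The blog does not specify which meta-analytic model sits behind Figure 1, so treat this as an assumption rather than the exact method used. The three P8 scores are those quoted above; the per-year standard errors are not given in the blog, so the value of 0.12 is made up purely for illustration, as is the function name `pool_fixed_effect`.

```python
import math


def pool_fixed_effect(scores, std_errors):
    """Inverse-variance (fixed-effect) pooled estimate with a 95% confidence interval."""
    # Each year is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * s for w, s in zip(weights, scores)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    margin = 1.96 * pooled_se
    return pooled, (pooled - margin, pooled + margin)


# P8 scores quoted in the blog for 2016/17, 2017/18 and 2018/19.
jubilee_scores = [0.23, -0.35, 0.23]
# ASSUMPTION: the blog does not report standard errors; 0.12 is a made-up value.
assumed_std_errors = [0.12, 0.12, 0.12]

pooled, (lower, upper) = pool_fixed_effect(jubilee_scores, assumed_std_errors)
print(f"Pooled P8: {pooled:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

With equal standard errors the pooled score is simply the mean of the three years, so the difficult year pulls the headline figure down without dominating it. The width of the interval depends entirely on the assumed standard errors and should not be read as the result shown in Figure 1.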

Why do we believe that this is an improvement over the status quo?

First, the Forest Plot in Figure 1 clearly demonstrates where there has been a single “dodgy” year for a school in a trend of otherwise stellar results.

Second, because we are now relying upon a multi-year average, this below-par year has some impact on the headline measure of the school’s performance, but does not simply swamp it.

Finally, as we are now using much more data – and have a much larger sample size – the confidence interval accompanying our headline P8 result is much smaller. In other words, our headline measure provides more accurate information. This is great news for parents and policy makers trying to understand school performance.

Figure 2 helps to illustrate this final point by drawing upon data for another school (Looe Community Academy). The Progress 8 score of this school has been very stable over time: 0.22 in 2016/17, 0.26 in 2017/18 and 0.23 in 2018/19. But because, in each year, the lower bound of its P8 confidence interval has (just) fallen below 0, it has consistently been labelled as “average” by the Department for Education.

Yet this is simply a result of only looking at one year of data at a time!

By combining information across the three years, we can say with more confidence that this school is indeed above average. This is reflected by the narrower confidence interval for our suggested headline measure – which no longer crosses zero. (In our preferred headline measure, the P8 result for this school would be reported as 0.24 with a confidence interval of 0.09 to 0.38).

Figure 2. Forest plot of the Progress 8 scores for Looe Community Academy (URN = 137839)
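The same logic can be sketched in a few lines. The banding rule below is a deliberate simplification (a school is only called “above” or “below” average when its interval excludes zero) and is not the DfE's full banding scheme. The P8 scores are those quoted above, but the standard error of 0.135 is an assumption, chosen only so that each single-year interval just crosses zero; the pooled interval it produces will therefore not exactly match the one reported in the blog.

```python
import math


def confidence_interval(score, std_error, z=1.96):
    """Return the (lower, upper) bounds of a 95% confidence interval."""
    margin = z * std_error
    return score - margin, score + margin


def band(score, std_error):
    """Simplified banding: 'above' or 'below' only if the interval excludes zero."""
    lower, upper = confidence_interval(score, std_error)
    if lower > 0:
        return "above average"
    if upper < 0:
        return "below average"
    return "average"


# P8 scores quoted in the blog for Looe Community Academy.
looe_scores = {"2016/17": 0.22, "2017/18": 0.26, "2018/19": 0.23}
# ASSUMPTION: hypothetical per-year standard error, not taken from the blog.
ASSUMED_SE = 0.135

yearly_bands = {year: band(score, ASSUMED_SE) for year, score in looe_scores.items()}

# With equal standard errors, the pooled score is the mean of the yearly
# scores and the pooled standard error shrinks by sqrt(number of years).
pooled_score = sum(looe_scores.values()) / len(looe_scores)
pooled_se = ASSUMED_SE / math.sqrt(len(looe_scores))
pooled_band = band(pooled_score, pooled_se)

print(yearly_bands)
print(f"Pooled: {pooled_score:.2f} -> {pooled_band}")
```

Under these assumptions every individual year is banded as “average”, while the pooled three-year figure is banded as “above average”, which is the pattern described for this school.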

Spotting issues before it’s too late

During conversations about three-year averages with policy makers and civil servants, the most common objection has been that the DfE needs to be able to spot any dramatic shifts before it’s too late. The reply is, of course, that there’s nothing to stop them looking at the information internally, with due sensitivity to its provisionality and unreliability, without this being a headline, public measure that ends up emblazoned on every newspaper’s front page.

However, an added advantage of our proposed approach is that we can decide the contribution each academic year makes to the headline result.

In the examples above, each year was given equal weight. But perhaps more recent results should be given more prominence than results from previous years. After all, the most recent year of data may be a better predictor of future performance than slightly older data.

Figure 3 therefore provides the results for Jubilee High School again. But now we give more weight (50%) to the latest year (2018/19), 30% weight to the data from 2017/18 and 20% weight to that from 2016/17.

Figure 3. Forest plot of the Progress 8 scores for Jubilee High School (URN = 139948) with differential weights
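A weighted headline measure of this kind could be computed along the following lines. This is a sketch under stated assumptions, not necessarily how Figure 3 was produced: it treats the three years as independent, uses the 20/30/50 weights described above, and again relies on a made-up standard error of 0.12 per year because the blog does not report one. The function name `weighted_headline` is hypothetical.

```python
import math


def weighted_headline(scores, std_errors, weights):
    """Weighted average of yearly scores with a 95% CI, treating years as independent."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    estimate = sum(w * s for w, s in zip(weights, scores))
    # Variance of a weighted sum of independent estimates.
    std_error = math.sqrt(sum((w * se) ** 2 for w, se in zip(weights, std_errors)))
    margin = 1.96 * std_error
    return estimate, (estimate - margin, estimate + margin)


# P8 scores quoted in the blog for 2016/17, 2017/18 and 2018/19.
jubilee_scores = [0.23, -0.35, 0.23]
# ASSUMPTION: the blog does not report standard errors; 0.12 is a made-up value.
assumed_std_errors = [0.12, 0.12, 0.12]

# 20% / 30% / 50% weights, oldest year first.
recency_weighted, recency_ci = weighted_headline(
    jubilee_scores, assumed_std_errors, [0.2, 0.3, 0.5]
)
# Putting all the weight on the latest year gives a single-year measure.
latest_only, latest_ci = weighted_headline(
    jubilee_scores, assumed_std_errors, [0.0, 0.0, 1.0]
)

print(f"Recency-weighted P8: {recency_weighted:.3f}")
print(f"Latest year only:    {latest_only:.3f}")
```

Setting the weights to 0, 0 and 1 reproduces a headline measure based on the most recent year alone, with a wider interval than the recency-weighted version under these assumptions.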

Of course, the weights assigned to each year are somewhat arbitrary and open to debate. But the key point here is that we can use data from multiple years in different ways – based upon what various stakeholders (schools, unions, teachers, parents, students, government) feel is most appropriate.

Another way of thinking about this is that the DfE currently gives the most recent year 100% weight in their headline school performance measures. All other information from previous years is given no weight at all. There is simply no way this approach makes sense.

The question the DfE must now ask themselves is how an approach that not only lacks reliability, but also drives some of the most pernicious problems in our education system, can possibly be justified any longer.


3 comments on “School accountability: why multi-year measures would be fairer to everyone and how they could be implemented”
  1. Stephen Gorard says:

    Can’t have a confidence interval for non-randomised data, as here. They are even more meaningless than normal. There is no sampling distribution to assess. Value-added is not independent of raw scores. All else makes sense.

    • John J says:

      Hi Stephen

      Yes – not an unfair comment. And I almost added a sentence about this very point in the blog. It got left out as it was already getting too long, and didn’t feel the place for it.

      It is of course what the DfE have chosen to do to report confidence intervals for individual schools. Hence why they have been included / discussed in the blog

UCL Institute of Education

This blog is written by academics at the UCL Institute of Education.

Our blog is for anyone interested in current issues in education and related social sciences.