Measuring changes in pupil attainment is at the heart of our work as education researchers. It is a practice that is also routinely carried out in schools to monitor pupil progress and teaching quality. One means of doing this is through the purchase of standardised tests in core subjects such as maths and English that report a student’s performance relative to a national distribution.
The Education Endowment Foundation (EEF) regularly uses these standardised tests in their randomised controlled trials. When trials are completed, the data is archived in routinely and matched to administrative exams data as it becomes available. This presents a unique opportunity for independent researchers to analyse the statistical properties of these commercial tests, which will in turn inform trial design, but also has important implications for how they are used in schools.Earlier this year we analysed the student test data drawn from 14 of the largest trials run so far, all of which now are matched to Key Stage 2 data from government tests taken at age 11. One part of our analysis measures the predictive validity of the commercial assessments with respect to the Key Stage 2 test in the same subject (i.e. maths or English). Predictive validity is simply the extent to which a score on a test predicts scores on some other test. Where it is low, we worry about both reliabilities of one or other of the tests and/or whether the tests are measuring similar underlying constructs (validity).
In this blog post I raise several concerns arising from our findings. First, data from a single commercial test is unlikely to show whether a school is making good progress in a subject if test conditions are not standardised across schools. This means it will probably not be a good predictor of a school’s ‘value added’ between Key Stage 1 and Key Stage 2 SATs. Second, while short assessments designed to be completed within a lesson can help teachers monitor what students can do, they are rarely capable of telling us what sort of progress an individual student has made in the previous year.
We measured the correlation between pupil Key Stage 2 maths test results and their score in earlier tests that were administered as part of the EEF trials. Figure 1 plots the correlation between the test sat and the KS2 maths test on the vertical axis. The horizontal axis plots the time in months between the test and the KS2 maths test. The vertical red line illustrates the May year 6 date of the KS2 SATs tests. The bunching of points on the left hand side of the chart shows the Key Stage 1 assessments, which are taken approximately 48 months earlier.
The correlation between these standardised tests and the KS2 maths tests was generally not as high as we had expected. We had expected most trials to follow a standard pattern whereby, the closer in time the standardised test to KS2 SATs, the higher the correlation; and that the correlation would be much higher for these standardised tests than it would be for the KS1 maths test taken much earlier in the year, for which we do not possess fine grain test scores. One trial’s data clearly conforms to the expected pattern (marked as P48C1 on Figure 1); many more do not.
Figure 1: Pupil correlation between standardised tests and Key Stage 2 tests in maths
For reading and English tests, the correlations with the Key Stage 2 tests in English are generally lower overall, which is to be expected (see Figure 2). More significantly, it is rarely the case in these trials that the test becomes a better predictor of KS2 English SATs the closer in time it gets to the year 6 SATs.
Figure 2: Pupil correlation between standardised tests and Key Stage 2 tests in English
For assessments where the predictive validity in relation to Key Stage 2 is quite low we might worry about the reliability of one or other of the tests or we should question whether the tests are measuring similar underlying constructs. Our report recommends that the EEF pays more attention to the conditions under which these tests are being sat by students. One issue is that these standardised assessments will often be administered in a setting that is relatively low stakes for the pupil. Existing psychometric literature (reviewed by Wise and DeMars, 2005) suggests that variation in persistence during the standardised test is a prime candidate for the poor correlation with Key Stage 2 data. Student persistence during a test can be inferred through proxy measures such as item non-response, careless answering and performance decline over a paper (see, for example, analysis of PISA data). We know that student sensitivity to whether or not a test is high stakes varies across countries and other cultural settings (see, for example, experiments run in the US versus China).
Our concern, both for EEF trials and for schools using these standardised tests, is that we need to be sure that standardised tests are sat in standardised conditions where students and teachers have standardised perceptions of the importance of the test. For example, inference is potentially compromised where only some teachers know that these tests are used in their own performance management; where some teachers are able to view the test papers in advance to influence revision guidance and others not; where some parents and pupils believe the test has important consequences for class allocations and other not.
These concerns about achieving greater consistency in how these tests are used in schools, whether through high quality guidance or other means, applies as much to multi-academy trusts devising their own tests as it does to schools who choose to buy commercial tests. If the data that has been collected through the EEF trials is representative of the wider experience of using these tests in schools, then it has two major implications for what we are reasonably able to infer from this data.
Firstly, using data from a single commercial assessment to decide whether a school is making good progress in a subject might be quite unreliable if test conditions are not standardised across schools. This means it is unlikely to act as a good predictor of the progress a school will make between two government assessments at Key Stage 1 and Key Stage 2.
We illustrate this for one trial where 888 pupils in 24 schools took a commercially available maths test just one month before their Key Stage 2 tests. Figure 3 shows that the school value-added estimated using KS1 maths as a starting point and the standardised test as an outcome is very poorly correlated with the ‘official’ value-added or progress measure they will achieve using the KS2 maths tests just one month later.
Figure 3: Value-added estimates in maths across 24 schools
Secondly, short assessments that are designed to be completed within a lesson are rarely capable of telling us whether the progress an individual student has made in a year is noteworthy in any case. If we measure start-of-year and end-of-year attainment with a test which has an internal reliability of 0.9 (the level reported by most test companies in this situation) and if the correlation between test scores a year apart is over 0.8 (as many test companies report), then progress – i.e. the change in attainment – will be measured with a great deal of uncertainty. A school will typically find that only a minority of their pupils record a test score growth statistically significantly different from zero. Furthermore, chance alone will be responsible for a large proportion of those who appear to have noteworthy growth or decline in scores. Dylan Wiliam (2010) summarises that the challenge of measuring annual test score growth occurs because “the progress of individual students is slow compared to the variability of achievement within the age cohort”.
Tests, including standardised tests that are short enough to be sat in a lesson, are an excellent way for teachers to monitor what students can do. However, headteachers and others who are keen to measure whether pupils have learnt enough in a subject within a space of year should take care to understand that relatively short, low stakes tests necessarily provide limited inference.