Summative Testing

I recently read Daisy Christodoulou’s new book “Making Good Progress? The Future of Assessment for Learning”.
It was a good read and helped clarify several questions and ideas I’ve had about assessment in education.  In her book, Christodoulou discusses why formative assessment hasn’t delivered the goods, the pitfalls of invalid summative assessments, and how to improve both.  It’s worth noting that while the generalities of Christodoulou’s book apply to everyone, some things are specific to England’s education system.

The School Problem

One thing that keeps popping up in the schools I’ve taught at is teachers’ wish to administer in-house final exams or “cross-grade exams”.  I’ve always thought these were invalid tests and didn’t really want to participate in them.  However, I don’t think my co-workers understood why I didn’t like them.  They typically assumed that I just don’t like exams, and my half-hearted attempts to briefly explain my objections never got my point across.  Christodoulou writes about the problems with summative assessments, and her point (and mine) is basically this: it’s really hard to design a test instrument that is both sensitive and robust (I designed and implemented validation protocols and tests for 10 years as a mechanical engineer, prior to entering teaching).  We need to be confident that the exam measures what we intend it to, that the test results are reliable, and that it can differentiate between students.  In the context of final exams in Vancouver, where we operate on a linear system (all courses run for 9 months straight), creating an exam that satisfies these requirements is very, very difficult.

Reliability

When it comes to in-class exams, the idea that we can test a year’s worth of learning in one hour is ridiculous.  For example, if I wanted to test proportional reasoning in math 8, I would need at least 30 minutes for just that one topic.  Now multiply that by 4 or 5 topics…  The issue is that students would not be asked enough questions of increasing difficulty to find out what they understand, and the test would inherently be unfair.  This is a type of sampling error, since we can only ever ask a sample of all the questions we might ask.  Certain topics will carry more weight in the exam than others, and a student who understands less about the heavily weighted topics will be unfairly judged as doing worse.  If you still don’t follow me on what this is all about, you should read Christodoulou’s book.
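To make the sampling problem concrete, here is a toy simulation (my own sketch, not from Christodoulou’s book; the topics and mastery numbers are made up for illustration).  The same student sits two exams built from the same curriculum, and their score swings by roughly ten points depending only on which topics the exam happens to weight:

```python
import random

# Hypothetical student: probability of answering a question correctly
# in each of five math 8 topics (assumed values, for illustration only).
mastery = {
    "proportional reasoning": 0.9,
    "integers": 0.8,
    "surface area": 0.5,
    "linear equations": 0.4,
    "data analysis": 0.7,
}

def expected_score(weights, n_questions=20, trials=2000):
    """Average percent score on an exam whose questions are drawn
    from topics in proportion to `weights`."""
    topics = list(weights)
    w = [weights[t] for t in topics]
    total = 0.0
    for _ in range(trials):
        correct = sum(
            random.random() < mastery[random.choices(topics, weights=w)[0]]
            for _ in range(n_questions)
        )
        total += correct / n_questions
    return 100 * total / trials

exam_a = {t: 1 for t in mastery}   # even coverage of all five topics
exam_b = {t: 1 for t in mastery}   # same curriculum...
exam_b["linear equations"] = 4     # ...but heavy on the student's weak topic

print(f"Exam A: {expected_score(exam_a):.0f}%")  # ~66%, the student's true average
print(f"Exam B: {expected_score(exam_b):.0f}%")  # ~56%, same student, same year
```

Neither exam is “wrong”; both sample the curriculum.  But the student’s grade depends on which sample was drawn, which is exactly the unfairness described above.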

The above paragraph refers to exam-based summative assessments, and many people suggest we can use other test instruments for summative purposes.  For example, I’ve been experimenting with using “performance tasks” in lieu of tests.  I liked the idea that these tasks are things we want our students to be able to do, and they offer a decent snapshot of student understanding. I’ve already seen some significant problems with these types of assessments, though.  First, they typically don’t do a good job of differentiating between many different levels of understanding.  The tasks themselves may be multimodal, yet contain only a few places that actually differentiate understanding. Differentiation is also difficult because the teacher is required to judge the work the students complete.  A second problem is that judgements will differ from teacher to teacher.  Anyone who has graded student work using a rubric knows exactly what I mean.  The rubric is either so rigid that it fails to capture aspects of the work that aren’t explicitly listed on it, or it is so ‘soft’ that how to grade the work is left entirely to the teacher’s discretion.

Shared Meaning

The end result of these issues is that it is difficult to develop a “shared meaning” [1] of student learning when doing summative testing.  After all, I don’t need to produce a summative grade for myself or the student.  We produce a summative grade so that other people can understand how the student fits into the broader context of the education system.  If I give a student an 85% when another teacher would give them a 91% or a 78%, we have a problem with shared meaning. And if we can’t create a shared meaning, what is the point of any summative assessment? Its usefulness is pretty limited. Without a shared meaning of what 85% represents, we are only telling a very broad story of what the student knows or understands. In this example, the one thing everyone can be sure of is that the student is somewhere between 78% and 91%.
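To put a rough number on how quickly shared meaning evaporates, here is another toy simulation (again my own sketch; the ±6-point teacher-to-teacher spread is an assumption, not measured data).  The same piece of work is marked by 30 teachers, each applying a rubric with their own discretion:

```python
import random
import statistics

TRUE_QUALITY = 85  # the work's "true" mark, if such a thing exists
MARKER_SD = 6      # assumed teacher-to-teacher spread, in percentage points

random.seed(1)  # fixed seed so the example is repeatable
grades = [round(random.gauss(TRUE_QUALITY, MARKER_SD)) for _ in range(30)]

print(f"lowest grade:  {min(grades)}%")
print(f"highest grade: {max(grades)}%")
print(f"spread:        {max(grades) - min(grades)} points")
print(f"std dev:       {statistics.stdev(grades):.1f} points")
```

With a spread like that, an 85% from one teacher and a 78% from another may describe identical work, which is the shared-meaning problem in miniature.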

The Provincial Problem

So here is the question: who in BC actually thinks that teachers are producing reliable summative assessment data on students?  And if we’re not producing reliable data, then why are we producing overall grades?  Reliable here means that different people in different contexts (parents, school districts, the provincial Ministry of Education, universities) can look at different students’ grades and be confident that a student with a higher grade knows and understands more than a student with a lower grade.  With enough data, we can even say how a student performs against some accepted reference.

And just to be clear: I’m not writing this post because I don’t believe in final exams or overall grades.  I’m pointing out that what we’re doing in BC doesn’t make sense in this regard. I personally would advocate for an end-of-year, provincially managed final exam for summative assessment purposes.

[1] Wiliam, D. and Black, P. (1996). Meanings and Consequences: A Basis for Distinguishing Formative and Summative Functions of Assessment? British Educational Research Journal, 22(5), pp. 537–548.