I recently came across mathematical models that were proposed to measure teacher effectiveness in K-12 education. Since a large part of my work is to use (formulate, solve, analyze) mathematical models in business settings, I was naturally curious to see how mathematical models are expected to help school district superintendents and school principals identify and reward the best teachers.
An article by Michael Winerip in the New York Times earlier this month ("Evaluating New York Teachers, Perhaps the Numbers Do Lie") provides an example of such models and of a real-life teacher with stellar peer evaluations and strong performance who, according to the Department of Education's formula, is among New York's worst teachers. (93% of NY teachers are ranked higher than her. That bad.) This means she will not be receiving tenure this year - the decision is delayed, though, not denied - and, if layoffs become determined by scores rather than seniority, would be at risk of losing her job.
At a high level, the idea is to run a linear regression on students' test scores using various explanatory variables, such as the teachers whom students have had this year but also in earlier years for the same subject. I do admit I have some serious reservations about that sort of model (a good overview of its limitations can be found in a recent post on the "Schools Matter" blog; that post is a must-read for anyone interested in the topic and in the mistakes people make when trying to extract teacher quality from students' scores).
The recent financial crisis has shown the pitfalls of using pretty models with incorrect assumptions, and the use of linear regressions in such a critical application (rating teachers) raises many alarm bells in my mind. Linear regressions are certainly very easy to implement: MBA students are taught about them in their first year, and anyone can do them in Excel (if something is included in Microsoft Excel, it really has gone mainstream). Whether they give meaningful results is open to debate. The statistical formula that determines the NY teacher's future has thirty-two (32) variables; the Times has a small graph giving a glimpse of its complexity. I have only mistrust for linear models that require 32 inputs to sufficiently explain an output; this screams "over-specified" to me. (Over-specified basically means that you're trying so hard to explain the data at hand perfectly that the model ends up fitting noise.)
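To make the worry about over-specification concrete, here is a minimal sketch on purely synthetic data. It is not the Department of Education's formula, and the sample size of 40 is my own assumption, chosen to be deliberately small; the real model is presumably fit on far more students. All 32 explanatory variables and the outcome are random noise, so there is nothing to explain, yet the regression still appears to fit the data it was trained on.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    x = [[1.0] + row for row in rows]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def r_squared(coef, rows, y):
    """Share of the variance in y that the fitted model accounts for."""
    pred = [coef[0] + sum(c * v for c, v in zip(coef[1:], row)) for row in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

def make_noise_data(rng, n_rows, n_vars):
    """Explanatory variables and outcome are independent noise (hypothetical data)."""
    rows = [[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(n_rows)]
    y = [rng.gauss(0, 1) for _ in range(n_rows)]
    return rows, y

rng = random.Random(0)
train_rows, train_y = make_noise_data(rng, 40, 32)  # data used to fit the model
test_rows, test_y = make_noise_data(rng, 40, 32)    # fresh data from the same process

coef = fit_ols(train_rows, train_y)
r2_train = r_squared(coef, train_rows, train_y)
r2_test = r_squared(coef, test_rows, test_y)
print(f"in-sample R^2: {r2_train:.2f}, out-of-sample R^2: {r2_test:.2f}")
```

The in-sample fit looks respectable even though the variables carry no information, and it falls apart on fresh data. With many more observations than variables the effect shrinks, so this shows only the mechanism, not how badly the actual formula is affected.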
Besides, the fact that linear regressions are easy to do doesn't mean all quantities in the world have a linear relationship to the variables that are supposed to explain them. For instance, maybe high school students can get decent-but-not-stellar grades in spite of poor teachers because their parents can help them with the basics, but teacher quality might really make a difference in students' ability to reach the very top scores. In that case, teacher quality does not have a linear relationship to students' grades at all.
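A toy version of that scenario shows what a straight line does to it. The rule below is entirely hypothetical (a score stuck at 75 until teacher quality passes 0.7 on a made-up 0-to-1 scale, then climbing to 100), and the fit is a plain one-variable least-squares line, not the actual rating formula.

```python
def true_score(quality):
    """Hypothetical rule: scores sit at 75 until quality passes 0.7, then climb to 100."""
    return 75.0 + 25.0 * max(0.0, quality - 0.7) / 0.3

qualities = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00
scores = [true_score(q) for q in qualities]

# Ordinary least-squares line through the (quality, score) points.
mean_q = sum(qualities) / len(qualities)
mean_s = sum(scores) / len(scores)
s_qq = sum((q - mean_q) ** 2 for q in qualities)
s_qs = sum((q - mean_q) * (s - mean_s) for q, s in zip(qualities, scores))
slope = s_qs / s_qq
intercept = mean_s - slope * mean_q

def linear_fit(quality):
    """Score predicted by the fitted straight line."""
    return intercept + slope * quality

for q in (0.0, 0.5, 0.7, 1.0):
    print(f"quality {q:.1f}: true {true_score(q):6.1f}, linear fit {linear_fit(q):6.1f}")
```

Under these assumptions the line credits mid-range teachers with gains they did not produce and understates what the very best teachers add, which is exactly the kind of distortion a rating system should avoid.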
Another significant issue for me is the use, as other explanatory variables (things that are used in the regression to try to explain students' grades), of teachers' quality in previous years. Why in the world should the student's current grade be linear in the quality of a previous year's teacher? Wouldn't it make more sense to compound teachers' influence over time, since the earlier teachers are supposed to provide the basics the rest will stand on? In that case, it seems a multiplicative, rather than additive, model of teachers' influence over the years would be more appropriate. (For the math types: maybe the linear regression should be applied to the logarithm of the student grades rather than the student grades themselves.)
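For the math types, the parenthetical can be spelled out as follows. This is only a sketch with notation of my own choosing, not the formula used in New York: y_i is student i's score (assumed positive), the theta terms are the effects of the student's previous and current teachers, alpha is a baseline and epsilon_i is the error term.

```latex
% Additive (linear) specification: each teacher's effect is simply added on.
y_i = \alpha + \theta_{\text{prev}(i)} + \theta_{\text{curr}(i)} + \varepsilon_i

% Multiplicative specification: each teacher scales what came before.
y_i = e^{\alpha} \cdot e^{\theta_{\text{prev}(i)}} \cdot e^{\theta_{\text{curr}(i)}} \cdot e^{\varepsilon_i}

% Taking logarithms turns the product back into a sum that linear regression can handle.
\log y_i = \alpha + \theta_{\text{prev}(i)} + \theta_{\text{curr}(i)} + \varepsilon_i
```

In the multiplicative version a weak earlier teacher drags down the payoff of everything that follows, which is closer to the idea of later teachers building on earlier foundations.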
If we stay on this issue of including past teachers' impact just a moment longer, we can easily spot other problems. For instance, what if two thirds of this year's curriculum builds upon last year's material and the rest uses new concepts, or maybe material seen two years ago and not covered since? You have to account somehow for only two thirds of the student's grade being impacted by last year's teaching. At the same time, if this year's teacher tries to help students by reviewing last year's concepts, she loses time she could use covering her assigned material and might increase another teacher's score instead of her own, so it might seem counterproductive numbers-wise.
Many researchers have investigated such issues, and a comprehensive report by Florida State University researchers about the linear regressions used in value-added models and their outputs provides much (quantitative) information on the topic, including on p.32 a summary of the models used in statistical studies, with a seemingly infinite array of variations in the experimental setup. For instance, the issue of students forgetting what they have learned if they don't use it for a while is analyzed in terms of the "persistence of schooling inputs", with some studies adopting a decay model and others a no-decay one.
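One generic way to write the decay versus no-decay distinction is sketched below. The notation is my own, and this is not necessarily the specification used in the report or in any of the studies it surveys: A_{i,t} is student i's achievement in year t, theta_{i,s} is the contribution of the teacher the student had in year s, and lambda is a persistence parameter.

```latex
% Achievement in year t as a weighted sum of current and past teacher contributions.
A_{i,t} = \sum_{k=0}^{t-1} \lambda^{k} \, \theta_{i,t-k} + \varepsilon_{i,t}

% \lambda = 1       : no decay, past teaching counts as much as current teaching.
% 0 \le \lambda < 1 : geometric decay, a contribution made k years ago is weighted by \lambda^{k}.
```

The choice of lambda matters because it decides how much of a student's current score is credited to this year's teacher rather than to her predecessors.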
Each model generates different estimates of teacher quality; in fact, the Florida State University researchers mention in their abstract: "Relying on measurable characteristics of students, teachers and schools alone likely produces inconsistent estimates of the effect of teacher characteristics on student achievement... These findings suggest that many models currently employed to measure the impact of teachers are mis-specified."
Evaluating teachers and recognizing those who do the best job are important, worthy tasks. I doubt student-based tests provide the best avenue to achieve these goals. As a synthesis available on the website of the Center for Educator Compensation Reform points out, "research by Jacob and Lefgren (2005) also supports the assertion that evaluators' judgments of teacher performance can be predictive of student achievement." (They do add: "However, observation systems need to be designed and implemented carefully to combat tendencies to rate nearly all teachers to the same level.")
I am looking forward to the final report of the Measures of Effective Teaching (MET) project, which was launched in fall 2009 by the Gates Foundation and analyzes a broader set of potential indicators of teacher effectiveness, such as "videotaped observations of teachers in their classrooms, student and teacher surveys [and] pedagogical content knowledge tests." When I was in high school, I remember how an "inspector"/evaluator from the regional school system (somewhat similar to the state school system in the US) would occasionally come to the school and sit in on a few lectures of the teachers he was tasked with evaluating.
The teachers would know beforehand of his visit - not too much in advance, I think, but early enough to tell their students before he showed up - and they would explain to their students how important this was to their career because the inspector was going to rate them. (One of the rating criteria was the ability to engage students in discussions, I think, and an urban legend circulated about a teacher elsewhere telling students which questions they were supposed to ask, the inspector figuring it out and the teacher being transferred to the school system equivalent of the Siberian tundra. The evaluators were not that easy to game.) We were nice kids and did not want to get our teachers in trouble, but in some schools I can see how the mere presence of an inspector could completely change classroom dynamics and keep the evaluator from properly assessing the teacher, so videotaping the lectures seems a good way to make the inspector as inconspicuous as possible.
The Measures of Effective Teaching project has enrolled 3,000 teachers and has the potential to re-focus the debate in major ways; however, I was puzzled to learn it only runs over two academic years: in 2009-10, the focus was on teacher recruitment and data collection, while 2010-11 is supposed to focus on "validating promising measures". I just don't think one year's worth of data collection is enough to draw solid conclusions. As much as we would all like teachers to perform steadily, receiving similar ratings year after year after year (because if your kid doesn't learn what he's supposed to, the thought that "the teacher had an off year" is not going to provide much comfort), performance can be affected by factors that don't have anything to do with teachers' intrinsic abilities: a few students can drastically change classroom dynamics, for instance, or the previous year's teacher may have been ill most of the time, with a sub who did not cover as much of the material. It is worth noting that teachers volunteered for the project, so they might be teachers who are very consistent in their ability to instruct students. The CECR synthesis mentions: "There is a growing consensus that 1 year of value-added data may not be sufficiently reliable for making high-stakes decisions about individual teacher effectiveness." It also includes references to recent work documenting the "noise" in value-added models.
It turns out that the preliminary policy brief and research report of the MET project are online already. This is the first of four installments and focuses on mathematics and English language arts teachers in grades 4 to 8. Although the fact that the reports were published in December 2010 makes me wonder how much the 2010-11 year was used to "validate promising measures", I am glad that we have access to these results before policy-makers decide on the tools school districts should use. The early findings, however, fall far short of being earth-shattering:
- "a teacher's past success in raising student achievement on state tests (that is, his or her value added) is one of the strongest predictors of his or her ability to do so again." (The fact that value added seems to be defined as student achievement on state tests is interesting.)
- "the teachers with the highest value-added scores on state tests also tend to help students understand math concepts or demonstrate reading comprehension through writing." This is measured through supplemental tests (don't you like all that testing!) and seems "particularly true in mathematics".
- "the average student knows effective teaching when he or she experiences it." (Since The Chronicle of Higher Education published a short article a few months ago about college students lying on anonymous course evaluations, it is worth asking how many students would be truthful if student surveys are expanded.)
- "valid feedback need not be limited to test scores alone. By combining different sources of data, it is possible to provide diagnostic, targeted feedback to teachers who are eager to improve."
The MET policy brief also includes a timeline for the various stages of the project.
In my last point before I end this already long post, I'll echo a comment made in the "Schools Matter" blog post about the dangers of over-using value-added models. In particular, if you think about the laudable goal of merit-based pay, you have to decide whether adding value in junior-year history (no matter how it's measured) counts the same as adding value in freshman-year English, or math. Junior teachers might be assigned to introductory-level courses while their more senior colleagues teach more advanced courses; a junior teacher might teach the course no one else wanted and thus generate little "value added" because of the poor fit, but would be stellar in another course. If you use her poor performance to justify her lack of a raise, she might hold it against you that you did not assign her to a course where she could have truly shone. While it would be so much better if every teacher could be assigned to the course that best fits his or her abilities, this is not always the case. Test scores will not always measure teachers' intrinsic abilities.
It seems that people are trying to find a single, magic number to measure teacher effectiveness, in just about the same way that financiers tried to find a single, magic number to measure risk (and, as history showed, they failed miserably). The NY teacher, although liked by her students, might be underperforming. But she deserves to know what she is doing wrong (in actionable steps, not just "we want more of your students who scored 3's on the test to get 4's") instead of feeling at the mercy of a statistical model with unclear assumptions. This might start with rethinking the definition of "value added" - whatever positive experiences school provides, I doubt many students look back on receiving a passing grade on proficiency tests as one of the defining moments of their education. If teachers have to add value to their students' lives, surely this can be formulated in a way that inspires students and teachers alike.



