I recently came across mathematical models that were proposed to measure teacher effectiveness in K-12 education. Since a large part of my work is to use (formulate, solve, analyze) mathematical models in business settings, I was naturally curious to see how mathematical models are expected to help school district superintendents and school principals identify and reward the best teachers.
An article by Michael Winerip in the New York Times earlier this month ("Evaluating New York Teachers, Perhaps the Numbers Do Lie") provides an example of such models and of a real-life teacher with stellar peer evaluations and strong performance who, according to the Department of Education's formula, is among New York's worst teachers. (93% of NY teachers are ranked higher than her. That bad.) This means she will not be receiving tenure this year - the decision is delayed, though, not denied - and, if layoffs become determined by scores rather than seniority, would be at risk of losing her job.
At a high level, the idea is to run a linear regression on students' test scores using various explanatory variables, such as the teachers whom students have had this year but also in earlier years for the same subject. I do admit I have some serious reservations about that sort of model. (A good overview of its limitations can be found in a recent post on the "Schools Matter" blog; that post is a must-read for anyone interested in the topic and in the mistakes people make when trying to extract teacher quality from students' scores.)
The recent financial crisis has shown the pitfalls of using pretty models with incorrect assumptions, and the use of linear regressions in such a critical application (rating teachers) raises many alarm bells in my mind. Linear regressions are certainly very easy to implement: MBA students are taught about them in their first year, and anyone can do them in Excel (if something is included in Microsoft Excel, it really has gone mainstream). Whether they give meaningful results is open to debate. The statistical formula that determines the NY teacher's future has thirty-two (32) variables; the Times has a small graph giving a glimpse of its complexity. I have only mistrust for linear models that require 32 inputs to sufficiently explain an output; this screams "over-specified" to me. (Over-specified basically means that you're trying too hard to perfectly explain the data.)
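To illustrate the over-specification worry, here is a minimal sketch in Python. It is not the Department of Education's formula: the 60 students, the 32 predictors of pure noise and the helper names are all my own assumptions, chosen only to show that R-squared climbs as variables are added even when none of them explains anything.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)

def adjusted_r_squared(r2, n, k):
    """Adjusted R^2, which penalizes each of the k predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

rng = np.random.default_rng(0)
n = 60                          # hypothetical number of students
X = rng.normal(size=(n, 32))    # 32 predictors that are pure noise
y = rng.normal(size=n)          # scores unrelated to every predictor

r2_1 = r_squared(X[:, :1], y)   # fit with the first predictor only
r2_32 = r_squared(X, y)         # fit with all 32 predictors
adj_32 = adjusted_r_squared(r2_32, n, 32)

print(f"R^2 with 1 noise predictor:    {r2_1:.2f}")
print(f"R^2 with 32 noise predictors:  {r2_32:.2f}")
print(f"Adjusted R^2 with 32:          {adj_32:.2f}")
```

Because the one-predictor model is nested in the 32-predictor model, the second R-squared can never be lower than the first, whatever the data. The adjusted figure is one standard (and partial) correction for that effect.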
Besides, the fact that linear regressions are easy to do doesn't mean all quantities in the world have a linear relationship to the variables that are supposed to explain them. For instance, maybe high school students can get decent-but-not-stellar grades in spite of poor teachers because their parents can help them with the basics, but teacher quality might really make a difference in students' ability to reach the very top scores. In that case, teacher quality does not have a linear relationship to students' grades at all.
Another significant issue for me is the use, as other explanatory variables (things that are used in the regression to try to explain students' grades), of teachers' quality in previous years. Why in the world should the student's current grade be linear in the quality of a previous year's teacher? Wouldn't it make more sense to compound teachers' influence over time, since the earlier teachers are supposed to provide the basics the rest will stand on? In that case, it seems a multiplicative, rather than additive, model of teachers' influence over the years would be more appropriate. (For the math types: maybe the linear regression should be applied to the logarithm of the student grades rather than the student grades themselves.)
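For readers who want the two alternatives written out, here is one way to formalize them. The notation is my own assumption, not taken from the New York formula: y_t is a student's score in year t, q_s > 0 is the quality of the teacher the student had in year s, and the betas are regression coefficients.

```latex
% Additive model: each teacher contributes a separate term to the score.
y_t = \beta_0 + \sum_{s \le t} \beta_s \, q_s + \varepsilon_t

% Multiplicative model: teachers' influences compound over the years.
y_t = c \prod_{s \le t} q_s^{\beta_s} \, e^{\varepsilon_t}, \qquad c > 0

% Taking logarithms turns the multiplicative model into a linear regression
% on the log of the score (with the log of each teacher's quality as input).
\log y_t = \log c + \sum_{s \le t} \beta_s \log q_s + \varepsilon_t
```

The third line is an ordinary linear regression again, but on log y_t rather than y_t, and it only makes sense when scores and quality measures are strictly positive.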
If we stay on this issue of including past teachers' impact just a moment longer, we can easily spot other problems. For instance, what if two thirds of this year's curriculum builds upon last year's material and the rest uses new concepts, or maybe material seen two years ago and not covered since? You have to account somehow for only two thirds of the student's grade being impacted by last year's teaching. At the same time, if this year's teacher tries to help students by reviewing last year's concepts, she loses time she could use covering her assigned material and might increase another teacher's score instead of her own, so it might seem counterproductive numbers-wise.
Many researchers have investigated such issues, and a comprehensive report by Florida State University researchers about the linear regressions used in value-added models and their outputs provides much (quantitative) information on the topic, including on p.32 a summary of the models used in statistical studies, with a seemingly infinite array of variations in the experimental setup. For instance, the issue of students forgetting what they have learned if they don't use it for a while is analyzed in terms of the "persistence of schooling inputs", with some studies adopting a decay model and others a no-decay one.
Each model generates different estimates of teacher quality; in fact, the Florida State University researchers mention in their abstract: "Relying on measurable characteristics of students, teachers and schools alone likely produces inconsistent estimates of the effect of teacher characteristics on student achievement... These findings suggest that many models currently employed to measure the impact of teachers are mis-specified."
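To see how much the persistence assumption alone can matter, here is a small Python sketch. The function name, the three teacher effects and the 0.5 decay rate are hypothetical values of mine, not figures from the Florida State report; the geometric weighting is just one simple way to express "decay".

```python
def cumulative_input(teacher_effects, persistence):
    """Sum past teachers' effects, each weighted by persistence**(years elapsed).

    teacher_effects is ordered oldest first; persistence = 1.0 is the
    no-decay model, and values below 1.0 make older teaching fade.
    """
    last = len(teacher_effects) - 1
    return sum(effect * persistence ** (last - year)
               for year, effect in enumerate(teacher_effects))

effects = [0.3, -0.1, 0.2]      # hypothetical effects of three successive teachers

no_decay = cumulative_input(effects, 1.0)    # every year counts fully
with_decay = cumulative_input(effects, 0.5)  # each elapsed year halves the weight

print(f"no decay:   {no_decay:.3f}")
print(f"with decay: {with_decay:.3f}")
```

The same three teachers yield a noticeably different cumulative input under the two assumptions, so whatever gets attributed to this year's teacher shifts with a modeling choice that the data alone may not settle.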
Evaluating teachers and recognizing those who do the best job are important, worthy tasks. I doubt student-based tests provide the best avenue to achieve these goals. As a synthesis available on the website of the Center for Educator Compensation Reform points out, "research by Jacob and Lefgren (2005) also supports the assertion that evaluators' judgments of teacher performance can be predictive of student achievement." (They do add: "However, observation systems need to be designed and implemented carefully to combat tendencies to rate nearly all teachers to the same level.")
I am looking forward to the final report of the Measures of Effective Teaching (MET) project, which was launched in fall 2009 by the Gates Foundation and analyzes a broader set of potential indicators of teacher effectiveness, such as "videotaped observations of teachers in their classrooms, student and teacher surveys [and] pedagogical content knowledge tests." When I was in high school, I remember how an "inspector"/evaluator from the regional school system (somewhat similar to the state school system in the US) would occasionally come to the school and sit in on a few lectures of the teachers he was tasked with evaluating.
The teachers would know beforehand of his visit - not too much in advance, I think, but early enough to tell their students before he showed up - and they would explain to their students how important this was to their career because the inspector was going to rate them. (One of the rating criteria was the ability to engage students in discussions, I think, and an urban legend circulated about a teacher elsewhere telling students which questions they were supposed to ask, the inspector figuring it out and the teacher being transferred to the school system equivalent of the Siberian tundra. The evaluators were not that easy to game.) We were nice kids and did not want to get our teachers in trouble, but in some schools I can see how the mere presence of an inspector could completely change classroom dynamics and not allow the evaluator to properly assess the teacher, so videotaping the lectures seems a good way to make the inspector as inconspicuous as possible.
The Measures of Effective Teaching project has enrolled 3,000 teachers and has the potential to re-focus the debate in major ways; however, I was puzzled to learn it only runs over two academic years: in 2009-10, the focus was on teacher recruitment and data collection, while 2010-11 is supposed to focus on "validating promising measures". I just don't think one year's worth of data collection is enough to draw solid conclusions. As much as we would all like teachers to perform steadily, receiving similar ratings year after year after year (because if your kid doesn't learn what he's supposed to, the thought that "the teacher had an off year" is not going to provide much comfort), performance can be affected by factors that don't have anything to do with teachers' intrinsic abilities: a few students can drastically change classroom dynamics, for instance, or the previous year's teacher was ill most of the time and the sub did not cover as much of the material. It is worth noting that teachers volunteered for the project, so they might be teachers who are very consistent in their ability to instruct students. The CECR synthesis mentions: "There is a growing consensus that 1 year of value-added data may not be sufficiently reliable for making high-stakes decisions about individual teacher effectiveness." It also includes references to recent work documenting the "noise" in value-added models.
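A toy simulation gives a feel for how noisy a single year can be. Everything below is an assumption of mine for illustration (classes of 25 students, a teacher-effect spread of 0.2 against student-level noise of 1.0); none of it is MET or CECR data, and only the 3,000-teacher count echoes the project's enrollment.

```python
import numpy as np

rng = np.random.default_rng(1)
n_teachers, class_size = 3000, 25    # class size is a hypothetical choice

# Each teacher has a fixed "true" effect that does not change between years.
true_effect = rng.normal(0.0, 0.2, size=n_teachers)

def one_year_estimate():
    """Class-average score: true effect plus the mean of student-level noise."""
    noise = rng.normal(0.0, 1.0, size=(n_teachers, class_size)).mean(axis=1)
    return true_effect + noise

year1 = one_year_estimate()
year2 = one_year_estimate()

# Correlation between the same teachers' estimates in two different years.
corr = np.corrcoef(year1, year2)[0, 1]

# Share of year-1 bottom-quintile teachers who land in the bottom quintile again.
bottom1 = year1 <= np.quantile(year1, 0.2)
bottom2 = year2 <= np.quantile(year2, 0.2)
repeat_rate = (bottom1 & bottom2).sum() / bottom1.sum()

print(f"year-to-year correlation: {corr:.2f}")
print(f"bottom-quintile repeat rate: {repeat_rate:.2f}")
```

Under these assumptions the theoretical year-to-year correlation is 0.2^2 / (0.2^2 + 1/25) = 0.5, even though every simulated teacher's true quality is perfectly stable. A one-year ranking would therefore reshuffle a good share of teachers for reasons unrelated to their ability.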
It turns out that the preliminary policy brief and research report of the MET project are online already: this is the first of four installments and focuses on mathematics and English language arts teachers in grades 4 to 8, and although the fact that the reports were published in December 2010 makes me wonder how much the 2010-11 year was used to "validate promising measures", I am glad that we have access to these results before policy-makers decide on the tools school districts should use. The early findings, however, fall far short of being earth-shattering:
- "a teacher's past success in raising student achievement on state tests (that is, his or her value added) is one of the strongest predictors of his or her ability to do so again." (The fact that value added seems to be defined as student achievement on state tests is interesting.)
- "the teachers with the highest value-added scores on state tests also tend to help students understand math concepts or demonstrate reading comprehension through writing." This is measured through supplemental tests (don't you like all that testing!) and seems "particularly true in mathematics".
- "the average student knows effective teaching when he or she experiences it." (Since The Chronicle of Higher Education published a short article a few months ago about college students lying on anonymous course evaluations, it is worth asking how many students would be truthful if student surveys are expanded.)
- "valid feedback need not be limited to test scores alone. By combining different sources of data, it is possible to provide diagnostic, targeted feedback to teachers who are eager to improve."
The MET policy brief also includes a timeline for the various stages of the project.
In my last point before I end this already long post, I'll echo a comment made in the "Schools Matter" blog post about the dangers of over-using value-added models. In particular, if you think about the laudable goal of merit-based pay, you have to decide whether adding value in junior-year history (no matter how it's measured) counts the same as adding value in freshman-year English, or math. Junior teachers might be assigned to introductory-level courses while their more senior colleagues teach more advanced courses; a junior teacher might teach the course no one else wanted and thus generate little "value added" because of the poor fit, but would be stellar in another course. If you use her poor performance to justify her lack of a raise, she might hold it against you that you did not assign her to a course where she could have truly shone. While it would be so much better if every teacher could be assigned to the course that best fits his or her abilities, this is not always the case. Test scores will not always measure teachers' intrinsic abilities.
It seems that people are trying to find a single, magic number to value teacher effectiveness, in just about the same way that financiers tried to find a single, magic number to measure risk (and, as history showed, they failed miserably). The NY teacher, although liked by her students, might be underperforming. But she deserves to know what she is doing wrong (in actionable steps, not just "we want more of your students who scored 3's on the test to get 4's") instead of feeling at the mercy of a statistical model with unclear assumptions. This might start with rethinking the definition of "value added" - whatever positive experiences school provides, I doubt many students look back on receiving a passing grade on proficiency tests as one of the defining moments of their education. If teachers have to add value to their students' lives, surely this can be formulated in a way that inspires students and teachers alike.




Very interesting post. As you point out, pretty much anyone can be taught to fit a linear regression model. Fitting a proper model is much more difficult. It's always possible to criticize statistical models, but as I read your post three questions struck me. First, is the relationship between score (or "value added") and predictors really linear, or is this just another instance of "all regressions are linear" syndrome? Second, if students are being tracked over time, were appropriate time series methods employed? (Do the analysts even know about autocorrelation?) Third, was multilevel analysis used (my guess is no), and if not, should it have been (my guess is yes)? There's a book by Harvey Goldstein about multilevel analysis available online. In his first chapter, he cites the following example as a motivator:
"A well known and influential study of primary (elementary) school children carried out in the 1970's (Bennett, 1976) claimed that children exposed to so called 'formal' styles of teaching reading exhibited more progress than those who were not. The data were analysed using traditional multiple regression techniques which recognised only the individual children as the units of analysis and ignored their groupings within teachers and into classes. The results were statistically significant. Subsequently, Aitkin et al, (1981) demonstrated that when the analysis accounted properly for the grouping of children into classes, the significant differences disappeared and the 'formally' taught children could not be shown to differ from the others."
In the analysis you cite, the teacher factor is being taken into account (whether properly is another question), but what about class (if that's distinguishable from teacher), school, city, ...?
Posted by: Paul Rubin | March 29, 2011 at 09:57 AM
Oy vey. So much fail here. Do they not know about adjusted R-squared, which punishes just throwing in more and more variables?
However, this statement of yours:
"I doubt many students look back on receiving a passing grade on proficiency tests as one of the defining moments of their education."
I happen to partially disagree with. After all, I will always remember passing the physics anticipatory exam at Lehigh, the credit for which goes all to my physics teacher in high school, despite my poor performance in his class. My 740 on my SAT II writing which got me out of a year of college English is directly attributed to my 10th grade English teacher who was hell on earth as a grammar drill sergeant, but gave me a fantastic education (and would later help me prepare for the SAT verbal section also, on which I scored 670), and I also remember my calculus teachers and my 4s on the AP exams (AB and BC) which also got me out of a year of calculus.
I definitely agree that while the number itself might not have much significance, the memory of the teacher that helped earn that grade (especially when it may mean saving thousands of dollars on gen-ed courses in college) will certainly be remembered more strongly than other teachers.
In fact, I think that the effectiveness of a teacher can be measured exactly through that--her students' performances on difficult, higher-authority administered exams (college board, state, nationwide/international mathematical olympiads, etc...). Of course, this goes on the assumption that the concepts that the students study in order to perform well on these exams are comprehensive, and that if the teacher "teaches to the test" (not a very savory phrase, I know), that said teacher will cover the whole gamut of needed concepts anyway.
I know that a lot of experts say "hey, what if the child isn't a good test taker"? I'm of the opinion that a properly designed exam will properly identify whether an individual knows the concepts or not.
But yeah...linear regression with 32 variables? What...how...huh...FAIL.
But then again, those with the statistical talent find work at Google, the financial industry, as professors, but not analyzing teacher performance. So I suppose these evaluators got what they paid for and probably hired some less than apt individuals from the bottom of the barrel of their statistics class.
(Then again, it's rather easy for people like us to point out flaws in the quantitative methodologies of less quantitatively educated people :P)
Posted by: Ilyaquant.wordpress.com | April 03, 2011 at 05:46 PM