March 2011

Value-Added Models in Teaching

I recently came across mathematical models that were proposed to measure teacher effectiveness in K-12 education. Since a large part of my work is to use (formulate, solve, analyze) mathematical models in business settings, I was naturally curious to see how mathematical models are expected to help school district superintendents and school principals identify and reward the best teachers.

An article by Michael Winerip in the New York Times earlier this month ("Evaluating New York Teachers, Perhaps the Numbers Do Lie") provides an example of such models and of a real-life teacher with stellar peer evaluations and strong performance who, according to the Department of Education's formula, is among New York's worst teachers. (93% of NY teachers are ranked higher than her. That bad.) This means she will not be receiving tenure this year - the decision is delayed, though, not denied - and, if layoffs become determined by scores rather than seniority, would be at risk of losing her job. 

At a high level, the idea is to run a linear regression on students' test scores using various explanatory variables, such as the teachers whom students have had this year but also in earlier years for the same subject. I do admit I have some serious reservations about that sort of model. (A good overview of the limitations can be found in a recent post on the "Schools Matter" blog; that post is a must-read for anyone interested in the topic and in the mistakes people make in trying to extract teacher quality from students' scores.)

The recent financial crisis has shown the pitfalls of using pretty models with incorrect assumptions, and the use of linear regressions in such a critical application (rating teachers) raises many alarm bells in my mind. Linear regressions are certainly very easy to implement: MBA students are taught about them in their first year, and anyone can do them in Excel (if something is included in Microsoft Excel, it really has gone mainstream). Whether they give meaningful results is open to debate. The statistical formula that determines the NY teacher's future has thirty-two (32) variables; the Times has a small graph giving a glimpse of its complexity. I have only mistrust for linear models that require 32 inputs to sufficiently explain an output; this screams "over-specified" to me. (Over-specified basically means that you're trying too hard to perfectly explain the data.)
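To make the worry about 32 inputs concrete, here is a minimal sketch of what can go wrong. It is not the Department of Education's formula: the sample sizes, the random seed and the use of NumPy's least-squares solver are all my own assumptions. The outcome is pure noise, unrelated to any of the 32 explanatory variables, yet the regression appears to "explain" much of it on the data it was fitted to, and the apparent fit does not carry over to fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; only the count of 32 variables comes from the article.
n_train, n_test, n_vars = 40, 40, 32


def design(n):
    """Intercept column plus n_vars columns of pure noise."""
    return np.column_stack([np.ones(n), rng.normal(size=(n, n_vars))])


def r_squared(X, y, coef):
    """Share of the variance of y accounted for by the linear fit."""
    resid = y - X @ coef
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)


X_train, X_test = design(n_train), design(n_test)
# By construction the outcomes have nothing to do with the inputs.
y_train = rng.normal(size=n_train)
y_test = rng.normal(size=n_test)

# Ordinary least squares on the training sample only.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

r2_train = r_squared(X_train, y_train, coef)
r2_test = r_squared(X_test, y_test, coef)
print(f"in-sample R^2:     {r2_train:.2f}")
print(f"out-of-sample R^2: {r2_test:.2f}")
```

A real district has far more students than 40, so the effect would be less extreme there; the sketch only shows the direction of the problem when the number of inputs is large relative to the data.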

Besides, the fact that linear regressions are easy to do doesn't mean all quantities in the world have a linear relationship to the variables that are supposed to explain them. For instance, maybe high school students can get decent-but-not-stellar grades in spite of poor teachers because their parents can help them with the basics, but teacher quality might really make a difference in students' ability to reach the very top scores. In that case, teacher quality does not have a linear relationship to students' grades at all.

Another significant issue for me is the use, as other explanatory variables (things that are used in the regression to try to explain students' grades), of teachers' quality in previous years. Why in the world should the student's current grade be linear in the quality of a previous year's teacher? Wouldn't it make more sense to compound teachers' influence over time, since the earlier teachers are supposed to provide the basics the rest will stand on? In that case, it seems a multiplicative, rather than additive, model of teachers' influence over the years would be more appropriate. (For the math types: maybe the linear regression should be applied to the logarithm of the student grades rather than the student grades themselves.)
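For the math types, a small sketch of the logarithm remark. All the numbers are hypothetical: a baseline grade of 70, three teachers last year and three this year, each with a made-up multiplier, and one student per pair of teachers. Because that layout is balanced, the least-squares additive fit is simply row mean plus column mean minus grand mean, so no regression library is needed. The additive model misses on the raw grades but fits their logarithms exactly, since the log of a product is a sum.

```python
import math

base = 70.0                   # hypothetical baseline grade
prev_mult = [0.8, 1.0, 1.2]   # hypothetical multipliers, last year's teachers
cur_mult = [0.9, 1.0, 1.1]    # hypothetical multipliers, this year's teachers

# One student per (previous teacher, current teacher) pair; effects compound.
grades = [[base * p * c for c in cur_mult] for p in prev_mult]


def additive_residuals(table):
    """Residuals of the least-squares additive fit on a balanced two-way table.

    For a balanced layout that fit is row mean + column mean - grand mean.
    """
    rows, cols = len(table), len(table[0])
    grand = sum(map(sum, table)) / (rows * cols)
    row_mean = [sum(r) / cols for r in table]
    col_mean = [sum(table[i][j] for i in range(rows)) / rows
                for j in range(cols)]
    return [[table[i][j] - (row_mean[i] + col_mean[j] - grand)
             for j in range(cols)] for i in range(rows)]


def max_abs(table):
    """Largest absolute entry of a nested list."""
    return max(abs(x) for row in table for x in row)


resid_levels = additive_residuals(grades)
resid_logs = additive_residuals([[math.log(g) for g in row] for row in grades])

print(f"largest miss, additive model on grades:      {max_abs(resid_levels):.3f}")
print(f"largest miss, additive model on log(grades): {max_abs(resid_logs):.3f}")
```

Whether real teacher effects compound this cleanly is exactly the open question; the sketch only shows that if they do, the regression belongs on the log scale.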

If we stay on this issue of including past teachers' impact just a moment longer, we can easily spot other problems. For instance, what if two thirds of this year's curriculum builds upon last year's material and the rest uses new concepts, or maybe material seen two years ago and not covered since? You have to account somehow for only two thirds of the student's grade being impacted by last year's teaching. At the same time, if this year's teacher tries to help students by reviewing last year's concepts, she loses time she could use covering her assigned material and might increase another teacher's score instead of her own, so it might seem counterproductive numbers-wise.

Many researchers have investigated such issues, and a comprehensive report by Florida State University researchers about the linear regressions used in value-added models and their outputs provides much (quantitative) information on the topic, including on p.32 a summary of the models used in statistical studies, with a seemingly infinite array of variations in the experimental setup. For instance, the issue of students forgetting what they have learned if they don't use it for a while is analyzed in terms of the "persistence of schooling inputs", with some studies adopting a decay model and others a no-decay one.
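As a rough illustration of the decay versus no-decay distinction, here is a minimal sketch in which each year's schooling input is weighted by a persistence factor raised to the number of years elapsed. The geometric form, the function name and the numbers are my own assumptions for illustration, not the specification used in the report or in any of the studies it summarizes.

```python
def cumulative_achievement(inputs, persistence):
    """Weight each year's input by persistence ** (years elapsed since then).

    persistence = 1.0 is the no-decay case; values below 1.0 let earlier
    teachers' contributions fade.
    """
    last = len(inputs) - 1
    return sum(x * persistence ** (last - t) for t, x in enumerate(inputs))


# Hypothetical gains attributed to three successive teachers, oldest first.
yearly_inputs = [5.0, 3.0, 4.0]

no_decay = cumulative_achievement(yearly_inputs, persistence=1.0)
with_decay = cumulative_achievement(yearly_inputs, persistence=0.5)

print(f"no decay:   {no_decay}")
print(f"with decay: {with_decay}")
```

The same three teachers get very different shares of the credit depending on the persistence value, which is one reason the choice of model changes the resulting estimates.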

Each model generates different estimates of teacher quality; in fact, the Florida State University researchers mention in their abstract: "Relying on measurable characteristics of students, teachers and schools alone likely produces inconsistent estimates of the effect of teacher characteristics on student achievement... These findings suggest that many models currently employed to measure the impact of teachers are mis-specified."

Evaluating teachers and recognizing those who do the best job are important, worthy tasks. I doubt student-based tests provide the best avenue to achieve these goals. As a synthesis available on the website of the Center for Educator Compensation Reform points out, "research by Jacob and Lefgren (2005) also supports the assertion that evaluators' judgments of teacher performance can be predictive of student achievement." (They do add: "However, observation systems need to be designed and implemented carefully to combat tendencies to rate nearly all teachers to the same level.")

I am looking forward to the final report of the Measures of Effective Teaching (MET) project, which was launched in fall 2009 by the Gates Foundation and analyzes a broader set of potential indicators of teacher effectiveness, such as "videotaped observations of teachers in their classrooms, student and teacher surveys [and] pedagogical content knowledge tests." When I was in high school, I remember how an "inspector"/evaluator from the regional school system (somewhat similar to the state school system in the US) would occasionally come to the school and sit in on a few lectures of the teachers he was tasked with evaluating.

The teachers would know beforehand of his visit - not too much in advance, I think, but early enough to tell their students before he showed up - and they would explain to their students how important this was to their career because the inspector was going to rate them. (One of the criteria was the ability to engage students in discussions, I think, and an urban legend circulated about a teacher elsewhere telling students which questions they were supposed to ask, the inspector figuring it out and the teacher being transferred to the school system equivalent of the Siberian tundra. The evaluators were not that easy to game.) We were nice kids and did not want to get our teachers in trouble, but in some schools I can see how the mere presence of an inspector could completely change classroom dynamics and not allow the evaluator to properly assess the teacher, so videotaping the lectures seems a good way to make the inspector as inconspicuous as possible.

The Measures of Effective Teaching project has enrolled 3,000 teachers and has the potential to re-focus the debate in major ways; however, I was puzzled to learn it only runs over two academic years: in 2009-10, the focus was on teacher recruitment and data collection, while 2010-11 is supposed to focus on "validating promising measures". I just don't think one year's worth of data collection is enough to draw solid conclusions - as much as we would all like teachers to perform steadily, receiving similar ratings year after year after year (because if your kid doesn't learn what he's supposed to, the thought that "the teacher had an off year" is not going to provide much comfort), performance can be affected by factors that don't have anything to do with teachers' intrinsic abilities - a few students can drastically change classroom dynamics, for instance, or the previous year's teacher was ill most of the time and the sub did not cover as much of the material. It is worth noting that teachers volunteered for the project, so they might be teachers who are very consistent in their ability to instruct students. The CECR synthesis mentions: "There is a growing consensus that 1 year of value-added data may not be sufficiently reliable for making high-stakes decisions about individual teacher effectiveness." It also includes references to recent work documenting the "noise" in value-added models.
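A small simulation gives a feel for why a single year of scores can be unreliable. Everything here is an assumption made for illustration: each teacher has a fixed "true" effect, each year's score adds independent classroom noise with twice the spread of the true effects, and the only number borrowed from the project is the count of 3,000 teachers. Real value-added estimates are produced very differently; this only shows the averaging argument.

```python
import random

random.seed(1)


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


n_teachers = 3000   # enrollment figure quoted above
noise_sd = 2.0      # hypothetical year-to-year classroom noise
n_years = 3         # hypothetical number of years averaged

true_effect = [random.gauss(0, 1) for _ in range(n_teachers)]
# scores[i][k] is teacher i's noisy score in year k.
scores = [[e + random.gauss(0, noise_sd) for _ in range(n_years)]
          for e in true_effect]

one_year = [s[0] for s in scores]
multi_year = [sum(s) / n_years for s in scores]

corr_one = pearson(one_year, true_effect)
corr_multi = pearson(multi_year, true_effect)
year_to_year = pearson([s[0] for s in scores], [s[1] for s in scores])

print(f"one year of scores vs true effect:  {corr_one:.2f}")
print(f"{n_years}-year average vs true effect:     {corr_multi:.2f}")
print(f"year 1 scores vs year 2 scores:     {year_to_year:.2f}")
```

Under these assumptions the multi-year average tracks the true effect more closely than any single year does, and two single years agree with each other even less.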

It turns out that the preliminary policy brief and research report of the MET project are online already: this is the first of four installments and focuses on mathematics and English language arts teachers in grades 4 to 8, and although the fact that the reports were published in December 2010 makes me wonder how much the 2010-11 year was used to "validate promising measures", I am glad that we have access to these results before policy-makers decide on the tools school districts should use. The early findings, however, fall far short of being earth-shattering:

  1. "a teacher's past success in raising student achievement on state tests (that is, his or her value added) is one of the strongest predictors of his or her ability to do so again." (The fact that value added seems to be defined as student achievement on state tests is interesting.)
  2. "the teachers with the highest value-added scores on state tests also tend to help students understand math concepts or demonstrate reading comprehension through writing." This is measured through supplemental tests (don't you like all that testing!) and seems "particularly true in mathematics".
  3. "the average student knows effective teaching when he or she experiences it." (Since The Chronicle of Higher Education published a short article a few months ago about college students lying on anonymous course evaluations, it is worth asking how many students would be truthful if student surveys are expanded.)
  4. "valid feedback need not be limited to test scores alone. By combining different sources of data, it is possible to provide diagnostic, targeted feedback to teachers who are eager to improve."

The MET policy brief also includes a timeline for the various stages of the project.

In my last point before I end this already long post, I'll echo a comment made in the "Schools Matter" blog post, about the dangers of over-using value-added models. In particular if you think about the laudable goal of merit-based pay, you have to decide whether adding value in junior-year history (no matter how it's measured) counts the same as adding value in freshman-year English, or math. Junior teachers might be assigned to introductory-level courses while their more senior colleagues teach more advanced courses; a junior teacher might teach the course no one else wanted and thus generate little "value added" because of the poor fit, but would be stellar in another course. If you use her poor performance to justify her lack of raise, she might hold it against you that you did not assign her to a course where she could have truly shined. While it would be so much better if every teacher could be assigned to the course that best fits his or her abilities, this is not always the case. Test scores will not always measure teachers' intrinsic abilities.

It seems that people are trying to find a single, magic number to value teacher effectiveness, in just about the same way that financiers tried to find a single, magic number to measure risk (and, as history showed, they failed miserably). The NY teacher, although liked by her students, might be underperforming. But she deserves to know what she is doing wrong (in actionable steps, not just "we want more of your students who scored 3's on the test to get 4's") instead of feeling at the mercy of a statistical model with unclear assumptions. This might start with rethinking the definition of "value added" - whatever positive experiences school provides, I doubt many students look back on receiving a passing grade on proficiency tests as one of the defining moments of their education. If teachers have to add value to their students' lives, surely this can be formulated in a way that inspires students and teachers alike.

"The Numerati" by Stephen Baker

I read The Numerati some time ago and never got around to writing a post about it, so here is my long overdue summary. Stephen Baker is a former BusinessWeek writer with interests in technology; his latest book, Final Jeopardy: Man vs Machine and the Quest to Know Everything was published last month. The topic of The Numerati, according to the book jacket, is "a new math intelligentsia [who] is devising ways to dissect our every move [using the trail of data we leave on the Internet] and predict, with stunning accuracy, what we will do next, [in order] to manipulate our behavior."

Whoever wrote the book jacket got a bit carried away ("the mathematical modeling of humanity", really?) but the book itself makes an important contribution. It is divided into seven chapters: Worker, Shopper, Voter, Blogger, Terrorist, Patient and Lover; in each, Baker describes what he learned from extensive discussions with experts in the field. To be honest, I am not quite sure I belong to his intended audience (which seems to be the majority of the population, people who practice neither data-mining nor math modeling and need to be educated about the potential and pitfalls of data), although Baker did attend the INFORMS annual meeting two years ago and autographed some of his books for operations researchers. On the other hand, I don't know if the people who would benefit most from his research will be sufficiently interested in data-mining to buy a whole book on it - I can see how they would gain from an article in their favorite magazine, but a book is a tougher sell. Thankfully, The Numerati is now out in paperback and on Kindle, so people can get the book relatively cheaply.

I found myself a bit frustrated at times by the book's high-level descriptions, since I understand the technical part enough to want to know more about the complexities faced by the experts Baker interviewed, but the level of technicality was excellent for a layperson interested in learning more. I particularly enjoyed reading about the issues faced by Google's AdSense with respect to spam blogs, or splogs (in the "blogger" chapter); as an update, Google changed the way it ranks search results just last month to try to fight content farms.

Also, the "patient" chapter was fascinating from beginning to end; it focused on networked gadgets that can help hospital patients or people in poor health. A scientist at Intel Research Lab whom Baker interviewed "sees sensors eventually recording and building statistical models of almost every aspect of our behavior. They'll track our pathways in the house, the rhythm of our gait." But Baker also points out that taking advantage of this technology is not as easy as it sounds. In my favorite anecdote, on p.158 of the hardcover edition, he explains: "One woman, researchers were startled to see, gained eight pounds between bedtime and breakfast. A dangerous accumulation of fluids? Time to call an ambulance? No. Her little dog had jumped on the bed and slept with her."

However, the potential of data analysis is undeniable: according to the Intel scientist, "specialists studying the actor Michael J Fox in his old TV shows can detect the onset of Parkinson's years before Fox himself knew he had it." (p.165) In another startling analysis, described on p.177, researchers at University College London studied the manuscripts that prizewinning novelist Iris Murdoch left behind when she died of Alzheimer's, and were able to identify a curve followed by her use of language in her books, growing more complex until the height of her career and then falling off. While Baker sometimes oversells his case by picturing a distant future where our lives will be dominated by data-mining, rather than the more relevant (for readers) near- to medium-term, the studies he quotes are very interesting.

Finally, the "lover" chapter has an unexpected application to the resumes of job candidates: "according to BusinessWeek, 94 percent of US corporations ask for electronic resumes. They use software to sift through them, picking out a selection of "finalists" for human managers to consider." (p.195) Baker comments: "The point is that when we want to be found... we must make ourselves intelligible to machines. We need good page rank. We must fit ourselves to algorithms."

Lehigh Commencement Speaker Announced

The name of the 2011 Commencement Speaker at Lehigh University was announced about two weeks ago: we are honored to have Ellen Kullman, chair and chief executive officer at DuPont, address the graduating class in May. I was particularly excited to learn that she has a bachelor of science in mechanical engineering from Tufts, as well as an MBA from Northwestern; in 2009, Forbes ranked her No 7 in its list of the 100 most powerful women in the world, and No 4 in the US. Kullman was also No 8 in the Wall Street Journal's list of the 50 women to watch in 2008 and No 15 in Fortune's list of the 50 most powerful women in 2008. She was also No 5 in Fortune's list of businesspeople of the year (of any gender) in 2010.

I did not know DuPont had a female CEO and am grateful to whoever nominated her for bringing her to campus. With such an impressive list of accolades, it is clear that Kullman has a lot of knowledge and wisdom she can impart to graduates, both male and female, on Commencement Day. That is why I find the comments to the Brown and White (Lehigh's student newspaper) article that publicized the announcement particularly disappointing. Here are a few excerpts:

  • #1 comment: "Who is this lady, why do we care? Number 7, that's great, that's really great. Def[initely] gonna be bored to death by this Gast wannabe." (Alice Gast is Lehigh's current president.)
  • #2 comment: "I'm gonna give her the benefit of the doubt, but I'm fairly disappointed that this was the best that the selection committee could do... she seems fairly dry and very similar to Gast, although not of as high a background. I don't see her giving a memorable speech... I understand that she is a very successful woman who is the head of a major chemical company, but having a replica of your president speak isn't exactly thrilling."
  • #3 comment: "This reeks of a third or fourth choice by the selection committee. I feel cheated."

Of course, those represent only the views of three people (apparently seniors) at Lehigh, which graduates about 1,000 undergraduate students a year. They can hardly be interpreted as the prevalent views on campus. As an educator, though, I have to pause and ask myself where exactly the $52,000-a-year education (estimated cost of attendance) that these students received went off track if they feel "cheated" that the CEO of DuPont, who happens to be currently recognized as one of the most successful businesspeople in the world, will give the commencement speech at their graduation.

It is interesting that, among all the many successful female executives Kullman could remind people of, these students point out an alleged resemblance to our own female university president, although the DuPont CEO has much higher recognition in the business world. They didn't say: "I wish Lehigh had gotten Indra Nooyi instead" (for the "Who is this lady, why do we care?" crowd out there, Nooyi is the CEO of PepsiCo and No 3 on Forbes' powerful-women list). The fact that they can only think of our own president as a female executive is something I wish a well-rounded education had remedied a long time ago.

I've attended a lot of the Commencement exercises in recent years, so I'll make a few general comments about what I've witnessed. I have, indeed, heard some really big names speak at Commencement over the years. Yes, having a household name make a speech before you get your degree is flattering (they did make the trip, after all); however, some of those household names in my experience sounded like Lehigh was only one stop among many on the Commencement circuit, with no particular effort to tailor their speech to Lehigh or dispel any doubt that they were only repeating the same speech as the previous week, only to a different crowd. That, for me, should make students feel cheated.

The best Commencement speech I've ever heard was given in 2008 by Bill Amelio '79, then CEO of Lenovo; I encourage you to read the blog post I wrote about it here (starting with the third paragraph). Amelio gave an excellent speech where he shared lessons he learned the hard way. I also appreciated the Lehigh connection. Yet, when it was announced he would give the Commencement speech, students had mixed feelings about his being selected too.

What do you think makes a good Commencement speech? Here are some should-haves for me:

  • The speaker must make a clear, documented reference to Lehigh and its graduating class (by "documented" I mean one should not be able to use the Find&Replace function in the Word file of the speech, replacing Lehigh's name by that of another university, and still have the speech sound fine). For instance, the speaker may want to single out a few graduating seniors for their accomplishments - just that he/she took the time to learn about the university and some of its seniors speaks volumes in my book. It's the students' day, after all.
  • The speech has stories in it. It is so much easier to remember stories than generalizations and it humanizes the speaker. The stories should have some element of struggle (with hopefully a positive ending) and be about something students can relate to. No one wants to hear "well you might be confused as to what to do next but I knew my path since I was 12 and my life has been one giant smooth blissful ride." Besides, I don't think that sort of smooth sailing to success even happens - all meaningful achievements involve hard work, moments of doubt, setbacks, etc. Otherwise you're not striving hard enough, and then you probably don't reach a level of success in life that makes you Commencement-speaker-material.
  • The speech is relatively short, to keep the focus on the graduates. Again, it's their day more than the Commencement Speaker's day. Speakers shouldn't try to steal the spotlight by giving marathon speeches all about them, while students truly wait for the moment they can hold their diploma. This is not a filibuster on the Senate floor - the goal is not to stand in the way. Thankfully, most speakers realize that.

I can't guess whether Kullman will give a memorable speech or not. But we can all learn from someone who made it to the top. If (some) students are not willing to listen to her advice, then who will they listen to?