In 2004, the Chinese government decided there were too many accidental deaths. China’s safety record, it decreed, should be brought in line with those of other middle-income countries. The State Council set a target: a decline in accidental deaths of 2.5 percent per year.
Provincial authorities kicked into gear. Eventually, 20 out of a total of 31 provinces adopted “no safety, no promotion” policies, hitching bureaucrats’ fate to whether they met the death ceiling. The results rolled in: by 2012 recorded accidental deaths had almost halved.
It wasn’t, however, all about increased safety. For instance, officials could reduce recorded traffic deaths by keeping victims of severe accidents alive for eight days. A death counted as accidental only if the victim died within seven.
In a study of China’s declining deadly accidents, Raymond Fisman of Columbia University and Yongxiang Wang of the University of Southern California concluded that “manipulation played a dominant role.” Bureaucrats - no surprise - cheated.
Such gaming is hardly unusual. It is certainly not exclusive to China. These days, in fact, it has acquired particular importance in the debate over how to improve American education.
The question is, what will happen when teachers are systematically rewarded, or punished, based to some extent on standardized tests? If we really want our children to learn more, the design of any system must be carefully thought through, to avoid sending incentives astray.
“When you put a lot of weight on one measure, people will try to do well on that measure,” Jonah Rockoff of Columbia said. “Some things they do will be good, in line with the objectives. Others will amount to cheating or gaming the system.”
The phenomenon is best known as Goodhart’s Law, after the British economist Charles Goodhart. Luis Garicano at the London School of Economics calls it the Heisenberg Principle of incentive design, after the defining uncertainty of quantum physics: A performance metric is only useful as a performance metric as long as it isn’t used as a performance metric.
It shows up all over the place. Some hospitals in the United States, for example, will do whatever it takes to keep patients alive at least 31 days after an operation, to beat Medicare’s 30-day survival yardstick. Last year, Chicago magazine uncovered how the Chicago Police Department achieved declining crime rates simply by reclassifying incidents as noncriminal.
“We don’t know how big a deal this is,” said Jesse Rothstein, a professor at the University of California, Berkeley, who has criticized evaluation metrics based on test scores. “It is one of the main concerns.”
American education has embarked upon a nationwide experiment in incentive design. Prodded by the Education Department, most states have set up evaluation systems for teachers built on the gains of their students on standardized tests, alongside more traditional criteria like evaluations from principals.
Fourteen more states are expected to have fully developed systems this academic year, according to the National Council on Teacher Quality - an advocacy group that supports rigorous assessments. All but six states are expected to have one by the 2016-17 school year.
The assessments are backed by sophisticated research. An important study by Rockoff and two Harvard professors, Raj Chetty and John Friedman, found that teachers who improved students’ scores, termed high value-added teachers, raised the students’ chances of going to college as well as their salaries later in life.
But teachers - and parents - are up in arms. In New York, the teachers’ union strongly opposes Gov. Andrew Cuomo’s proposal to increase the weight of test scores to 50 percent of a teacher’s evaluation. The governor is being hammered over the issue in opinion polls.
“People who claim to be market-based reformers want to sell the theory that there is a direct correlation between test scores, the effort of teachers and the success of children,” said Randi Weingarten, who heads the American Federation of Teachers. “It just ignores everything else that goes into learning.”
Critics have questioned the Harvard scholars’ findings. Teachers argue there is no way the researchers could isolate the impact of teaching itself from other factors affecting children’s learning, such as the family background of the students, the impact of poverty, racial segregation, even class size.
Rothstein at Berkeley suggested that sorting plays a big role in their results: better-ranked teachers got better students. Other studies found teachers’ scores jump around a lot from year to year, putting their value into question. Rockoff, Chetty and Friedman have defended their results.
In this heated debate, however, it is important not to lose sight of Goodhart’s Law. Most of these studies measured the impact of test scores when tests carried little weight for teachers’ future careers. But what happens when tests determine whether a teacher gets a bonus or keeps his or her job?
From Atlanta to El Paso, school officials have been accused of cheating to improve their standing on test scores.
Fraud is not the only concern. In one study, schools forced by the No Child Left Behind law to improve grades were found to have focused on helping children who were on the cusp of proficiency. They had no incentive to address those comfortably above the cut or those with little hope of gaining enough in the short term.
A survey of teachers at a school district in the Southwest that awarded bonuses based on test scores found that many tried to avoid both gifted students and those not yet proficient in English whose grades were tough to improve. Others employed “drill and kill” strategies to ensure their students nailed the tests.
Education reformers acknowledge the challenge but argue that it should not stand in the way of rigorous assessments.
“Any time you perform an evaluation you must worry about unintended side effects,” said Joel Klein, former chancellor of New York City schools, who famously battled the teachers’ union. “But the absence of evaluation is totally unacceptable.”
High-stakes tests can encourage bad behavior. But they encourage good behavior, too. A study of public schools in Florida found that schools did focus on low-performing students, lengthened the time devoted to teaching, gave teachers more resources and tried to improve the learning environment.
Supporters like Klein point out that value-added scores should not only serve to penalize or reward teachers but also provide feedback to help them improve. Test-based metrics should be leavened with other inputs, from principals, peers and even students. And better tests under the Common Core education standards will lead to better assessments.
“If the evaluation is to have any meaning, it must have stakes,” said James Liebman of Columbia Law School, who served as the chief accountability officer of the New York Education Department under Klein. “But the stakes can be inflexible - like there’s a rigid line after which you’ll fall off - or flexible, as in, ‘We’ll see where you compare to others and have a conversation.’”
Will schools sink in a sea of unintended consequences? Rockoff doesn’t think so. He offers one precondition for success: “The obvious answer is, do not put too much weight on any single measure.”
At the Education Department, officials say the new measures are just about helping teachers and principals improve.
Brad Jupp, a special adviser to Secretary Arne Duncan, compares the anxiety about the adoption of new evaluation tools to the uncertainty in the 1940s over what would happen if the sound barrier were broken. Some people thought it would destroy the plane. Others thought the plane would accelerate to a million miles per hour. When Chuck Yeager finally broke it in 1947, neither happened.
“There is this experience of breaking the sound barrier together,” Jupp said, “that is valuable for both sides.”