“If a thing’s worth winning, it’s worth cheating for”

June 10th, 2010

That observation, made by the W.C. Fields character in “You Can’t Cheat An Honest Man,” points to one of the problems with trying to substitute high-stakes, low-quality standardized testing for actual educational reform. “Dukenfield’s Law,” as I’ve dubbed it after Fields’s birth name (hey, if your birth certificate read William Claude Dukenfield, you’d think about changing it, too), is still operating, with an estimated 1-3% of all teachers cheating to improve their students’ test scores, sometimes with the active encouragement of their principals. And that’s in a world where having students memorize questions and answers from previous tests doesn’t count as cheating.

28 Responses to ““If a thing’s worth winning, it’s worth cheating for””

  1. JMG says:

    Funny, I just came to your site after finishing Diane Ravitch’s outstanding book “The Death and Life of the Great American School System.” She has a great chapter on “The Problem with Accountability” (test cheating, teaching to the test, etc. … in other words, all the problems of our testmaniac “reform” culture led by the “Billionaire Boys Club,” which is the title of another great chapter, this one on the Richies (Gates, Waltons by the dozens, and Eli Broad) who have decided that THEY KNOW HOW TO FIX EDUCATION and if those damn teachers and educators will just get out of the way they’ll do it, by gum …).

    I’m old enough to remember a time when far more Americans had spent time in the military. All in all, I’m opposed to the draft, but one thing I have seen is that there is a huge cost to a culture where most of the intelligentsia have not done any time. One thing that you learn in the military is to beware the novice 2d Lt. who confidently asserts that you don’t really need to understand a subject or profession to be a leader, you just have to be declared “a leader” and to surround yourself with smart folks who can provide you with the advice you’ll need. Admiral Rickover would have nothing to do with anyone of this type and refused to let anyone into his program who had not mastered the details of it. “The Billionaire Boys Club” members all believe that their money gives them special insights about how education works and ought to work, and yet none of them bother with any empirical testing: it’s plow right on in, damn anyone who raises questions, and declare war on the unions and the teachers.

    Sam Smith published a fantastic essay from a NJ teacher on point:

    http://www.hpae.org/newsroom/articles/20100518_letter

  2. James Wimberley says:

    Isn’t there a law that any indicator used as the basis for a strong incentive becomes unreliable? First identified, I think, as a disease of Soviet central planning: if managers were rewarded for number of nails produced, you got lots of small nails; if by weight of nails, fewer big ones. But it applies just as much to capitalist managers, rewarded for profits or stock price, and - as here - to teachers and pupil performance. The only way to keep the measurements reasonably undistorted is to turn the gas down and reduce the incentives to gold stars and the like.

  3. Brett Bellmore says:

    Memorizing the answers to the tests at least involves the students learning SOMETHING, unless they’re memorizing the letters for multiple choice tests. The problem here is that the students learning IS the output we want, not just teacher hours. If we’re not going to measure it, and condition pay on it, we might as well throw up our hands and give up.

    In what other line of work isn’t pay or job retention conditioned on actually doing the work productively?

    Anyway, if only 1-3% of the teachers are cheating, that’s fairly good; it means the tests are honest in 97-99% of the cases. Just can the teachers who get caught cheating, so that number doesn’t creep up, and you’re good to go.

  4. J says:

    Brett Bellmore, I think everyone agrees that “the students learning IS the output we want”. The problem is how to measure that. If this is not done well, you end up with a situation where students are just trained to regurgitate a lowest-common-denominator set of facts that form the basis for every year’s test. Is that learning? Is it the kind of learning we want to foster?

    I think James Wimberley makes a good point about the distorting effects of overspecified incentives.

    I’m not super-concerned about 1-3% of teachers cheating, though obviously one should endeavor to get that down to 0%. However, there’s a big difference between a case where that 1-3% is distributed more or less randomly and a case where entire schools, school districts, or states are gaming the system to inflate their scores. That seems more problematic to me for a number of reasons.

  5. Seth Gordon says:

    “Memorizing the answers to the tests at least involves the students learning SOMETHING”

    Well, yes, but school is ostensibly preparation for real life. In real life I don’t get pieces of paper saying “234 + 182 = ?; 3% of 27,800 = ?”. Instead, I fill out tax returns and I look at competing pundits’ proposals to reduce Federal spending. If performance on the tests the kids take at age ten does not correlate with actual skills they need as adults, then the tests are more like hazing than evaluation.

  6. Ohio Mom says:

    Brett asks:

    “In what other line of work isn’t pay or job retention conditioned on actually doing the work productively?”

    To which I answer, this is too easy. Do you read the papers? Let’s start with Wall Street…

  7. Josh G. says:

    There may be some schools whose performance was so poor that even teaching to the test would provide better results than what was being done before. However, the vast majority of schools do not fall into this category. Overall, teaching to the test has degraded the quality of instruction. Many middle-class parents are furious about the way that teaching in their schools - which was adequate enough before the NCLB “reforms” - has been dumbed down and Taylorized.

  8. teacherken says:

    James, I think you are referring to what is known as Campbell’s Law, formulated by Donald T. Campbell in 1976: Campbell, Donald T., “Assessing the Impact of Planned Social Change,” The Public Affairs Center, Dartmouth College, Hanover, New Hampshire, USA, December 1976. It reads as follows:

    “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.”

    We see this in many arenas. For example, if you give police bonuses for # of arrests, arrests will go up but percentage of convictions will go down. If you give bonuses for % of convictions, police may choose not to arrest if they are not sure of a conviction, meaning some who would be convicted will never even be charged.

    This has been traced through the history, for example, of the Chinese Civil Service system. I suggest if you want to explore that, you look at

    Nichols, S. L., & Berliner, D. C. (2007). Collateral Damage: How high-stakes testing corrupts America’s schools. Cambridge, MA: Harvard Education Press.

    Peace

  9. paul says:

    James Wimberley is right. In theory the thing being measured correlates perfectly enough (ahem) with the thing that’s actually desired, but in practice lossy compression (which is what testing is) is not strong against deliberate attack.

    But it’s actually worse than this. Years ago I talked to economists at the Federal Reserve, who pointed out that as soon as any leading indicator (or combination thereof) became proven as the most accurate predictor of future economic conditions, it would become instead the most accurate predictor of Fed and other government actions to modify the path of the economy. And as a result its original statistical validity would disappear from future data series.

  10. Eli says:

    The problem with performance-pay is two-fold: First, an implementation model has yet to be shown that is both fair and effective. Second, to the extent that it assumes the achievement gap is going to be solved simply by better teaching, it ignores that SES demographics are the real driver, and that unequal levels of human and social capital across geographic regions need to be addressed before real change can occur. I go into this issue at length here.

  11. Great post, Mark, and a very important topic. Cheating is, of course, a major concern, but there is also a concern that both the absolute results and the more relational results of “value-added,” or “gain score” assessment (the current flavor of the month in measurement circles), are highly correlated with poverty and race and subject to massive statistical “noise,” and that they therefore are a poor tool for telling us whether teachers of poor and minority students are actually causing (or failing to cause) learning to occur. Of course, this means that they also have the potential to overestimate the quality of teachers in less poor and less diverse schools.

    We have been conducting a discussion of the value-added portion of this debate, and its legal implications where adverse employment actions are based on the results, at our education law and policy blog, The Edjurist: http://www.edjurist.com . For those of your readers interested in the technical (measurement and evaluation) issues, Bruce Baker of Rutgers has a magnificent blog on school finance issues where he shares his research on the effectiveness of all large-scale standardized testing in education: http://schoolfinance101.wordpress.com/category/race-to-the-top/value-added-teacher-evaluation/. Bruce started our ongoing legal discussion on his School Finance 101 blog, so I recommend reading his post first, and then checking the ongoing discussion at the Edjurist.

  12. marcel says:

    In re Wimberley’s question: Economists know this as Goodhart’s Law. According to the Wikipedia link, he has priority over Campbell.

  13. JMG says:

    I took a master’s degree in Engineering Management and did a lot of coursework in quality management in the program. One of the things that has stayed with me and most often proved useful was a prof’s comment that “Any one measurement is crap.” Meaning that, because of measurement’s tendency to distort the activity being measured, you need a carefully considered suite of measurements — preferably derived from customer utility — to measure anything in a meaningful, useful way.

    The idiots behind NCLB have ignored everything we’ve learned from the quality revolution and from Deming’s lifetime of work. (“Drive out fear”? HA! Fear’s all we’ve got!) With predictable results.

  14. Um, it’s actually very possible to cheat an honest person. Counterfeiters do it all the time. The entire industry of predatory loans was built on lying to people about the true costs of loans so that they would take loans they couldn’t actually afford. (Yes, a lot of it was also built around encouraging people to not admit to themselves that they couldn’t afford the loans they were taking, but outright lying was a large part of it.) A lot of scams work by lying to people about the services offered, or the fair value of them, and are particularly effective on the easily confused.

    I’ve seen the distinction drawn between “scams” and “confidence tricks”, the latter being the class of scams which rely on the mark wanting to make a dishonest buck. But even the classic pigeon drop has versions which rely on the *honesty* of the mark rather than the dishonesty.

    Sorry, I realize that’s all off the topic of the fundamental corruptness of using standardized testing as a surrogate for actual evaluation of learning. But “Dukenfield’s Law” blinds us to the real nature of cheating.

  15. Brett Bellmore says:

    “I took a master’s degree in Engineering Management and did a lot of coursework in quality management in the program. One of the things that has stayed with me and most often proved useful was a prof’s comment that ‘Any one measurement is crap.’ Meaning that, because of measurement’s tendency to distort the activity being measured, you need a carefully considered suite of measurements — preferably derived from customer utility — to measure anything in a meaningful, useful way.”

    That’s a fair point. The flip side is something my control systems prof used to regularly say: “If you don’t measure it, you can’t control it.”

    We need good measures of teaching effectiveness, and while we shouldn’t punish teachers for getting students whose parents don’t care if they get educated, we ultimately must compare teachers on the basis of how well they do their jobs.

  16. MobiusKlein says:

    In a real elementary school, the process of sorting kids into classes is political enough. Which teacher has the hard kids and easy ones.
    Once their compensation depends on it, the knives will come out. Add in the inclusion students with learning disabilities and it will be toxic - let’s give the kids with the most problems to the teacher lowest on the totem pole.

  17. marcel says:

    I don’t recall where I came across a link to this NBER WP; it may have been on this blog. It strikes me as relevant to MobiusKlein’s comment. The abstract is:

    Non-random assignment of students to teachers can bias value added estimates of teachers’ causal effects. Rothstein (2008a, b) shows that typical value added models indicate large counter-factual effects of 5th grade teachers on students’ 4th grade learning, indicating that classroom assignments are far from random. This paper quantifies the resulting biases in estimates of 5th grade teachers’ causal effects from several value added models, under varying assumptions about the assignment process. If assignments are assumed to depend only on observables, the most commonly used specifications are subject to important bias but other feasible specifications are nearly free of bias. I also consider the case where assignments depend on unobserved variables. I use the across-classroom variance of observables to calibrate several models of the sorting process. Results indicate that even the best feasible value added models may be substantially biased, with the magnitude of the bias depending on the amount of information available for use in classroom assignments.

  18. MobiusKlein says:

    Thanks marcel.
    Possible non-random factors also include classroom facilities for special ed. Some rooms are modified for hard of hearing kids, some rooms may not be accessible by wheelchair, and so on. It’s a thorny problem - we need some way to find out what teachers are effective, which teaching techniques are best, and all. Without data, we have nothing. I think low-stakes testing is going to be more useful until we find out WTF is going on.

  19. NCG says:

    Another problem is that multiple-choice tests are more easily scored than essay or reading tests, which to my mind are much more important. As Seth said, citizens face a difficult task of sorting through all the BS the highly-paid consultants shovel at them during elections. (Btw: I totally want to sue someone over Scantron sheets. It is emotional abuse of children. “Color in the bubble completely, but if you go outside the lines a little tiny bit, there goes your entry to Yale… Oh, and don’t forget to hurry…” But I digress.)

    I also think this merit thing overlooks many of the social characteristics of the people who go into teaching. If all they cared about were getting better stats than their friend in the next classroom, I guess they’d be on Wall Street, helping to &*!% everything up even more.

    Also, on the issue of selection bias, some kids are going to be more in synch with one teacher versus another. Only an idiot would ignore this in making assignments. So, there goes your random selection.

    Isn’t it funny that at the high SES schools, they’re trying to go the other direction? Exactly what jobs are we preparing these kids to do, anyway?

  20. JMG says:

    “The flip side is something my control systems prof used to regularly say: ‘If you don’t measure it, you can’t control it.’”

    True enough, except that the converse is not. That is, just because you have measured something (which may or may not be “it,” and in the case of teacher quality, student test scores are assuredly not “it”) doesn’t mean (a) that you can control it; or (b) that your measurement will help you do so; or (c) that your measurement is actually measuring the results of your “control” operations.

    The assumption that the Billionaire Boys Club members, MBAs, and some engineers make is that because we know teaching is important, it’s possible to isolate that variable retrospectively and distribute rewards/punishments accordingly in a useful way. Well, the data will confess to anything if you torture them enough, but in complex systems like schools, there are not simple relationships like hydraulic inputs = degrees of motion outputs.

  21. CharlesWT says:

    “Well, the data will confess to anything if you torture them enough, but in complex systems like schools, there are not simple relationships like hydraulic inputs = degrees of motion outputs.”

    Yes, deliver us from the technocrats who believe that anything that can be measured can be controlled whether it be schools, the economy or climate change.

  22. James Wimberley says:

    teacherken, Marcel:
    Thanks for the references. I had actually heard of Goodhart (I even met the guy once) but my memory misled me to Goodwin, a disgraced Scottish banker and dead end.

    JMG, Brett:
    Measuring everything that moves is good Deming QA. But how much of the educational outcome can you realistically measure or test for? I don’t think the testing lobby have come up with a reliable way of assessing more than organised knowledge and logical problem-solving, which are useful to be sure, but not everything we want from learning. Attempts are traditionally made to assess communication skills, but it’s pretty subjective once you get beyond counting mistakes. Teamwork, creativity, self-discipline, listening, persuasion, judgement in making individual and collective choices … Form-free education gives you the tongue-tied idiot savant; content-free education gives you the flack with an MBA.

    One idea that’s not perhaps been considered enough is intensive sampling, which would mitigate the problem of distortion at reasonable cost. You could afford to compensate the unlucky test subjects for their time and effort, though taking part would have to be compulsory. In engineering, you sometimes take a sample and test to destruction. In sociology and medicine, longitudinal surveys or samples of say 5,000 out of populations of millions have given good and statistically solid results. If the aim is to measure school or teacher performance, intensive sampling might work; but any longitudinal element (retesting the same children) would have to be secret, which looks impossible.

  23. Brett Bellmore says:

    You can test learning particular facts, and test reasoning skills. Which aren’t everything we’d like from education, but are certainly the least we should demand of it.

    I think we’re not going to make much progress in education until we automate most of it, with the live teachers dealing with exception handling, not routine. It’s just too labor intensive as is.

  24. JMG says:

    “I think we’re not going to make much progress in education until we automate most of it, with the live teachers dealing with exception handling, not routine. It’s just too labor intensive as is.”

    I’ve submitted that to the “We had to destroy the village in order to save it” Hall of Fame. Progress in education towards what, exactly?

  25. Barry says:

    Mark: “with an estimated 1-3% of all teachers cheating to improve their students’ test scores, sometimes with the active encouragement of their principals.”

    Is that evidence in favor of teachers being one of the most honest groups of people in our country?
    To me, this sounds like the Freakonomics guys being amazed that real estate agents can sell their own homes for 10% more than their clients’ homes.

  26. Ridge Runner says:

    The “cheating percents” cited are lower bounds, given that the cheating detection methods are not all-seeing. The biggest cheat in the system is the mythology that promotes the institution and its staff as the cause of students learning, rather than an obstacle to their ‘wising up’.

    “The shocking possibility that dumb people don’t exist in sufficient numbers to warrant the millions of careers devoted to tending them will seem incredible to you. Yet that is my central proposition: the mass dumbness which justifies official schooling first had to be dreamed of; it isn’t real.”
    - The Underground History of American Education -
    http://www.johntaylorgatto.com/underground/toc1.htm