Doctor Cleveland's picture

    Teaching by the Numbers

    Last week, New York City released Teacher Data Reports for every teacher in its system. This week, I got my own teaching numbers: last semester's teaching evaluation scores. Getting my numbers was a good thing for me personally; they were very high, and my bosses tend to reward that. Releasing the New York City numbers was a bad thing generally, which can only set education in the city back. But both sets of numbers are largely bullshit.

    Teaching quality is very difficult to quantify, and no method that currently exists comes anywhere close to doing it accurately. It's not impossible. But it is impossible with the current tools. It's like going to Mars: a good thing to do that is impossible this week but can be possible someday if we keep working. In fact, getting quantifiable teacher assessment right is probably a bigger technical challenge, in terms of how far the existing technology is from the ultimate goal, than a Mars mission. Measuring teacher quality accurately is something that we should do, and which will have important benefits for K-12 and university teaching alike. But it's important to be realistic about what we can actually achieve right now. And relying too quickly on the flawed early technology, like launching a mission to Mars in a rocket that can't make it, will only set back progress and cause real people pointless harm.

    These aren't sour grapes. The current method of quantifying university teaching treats me very favorably, and if more weight were given to those numbers it could only be to my benefit. My evaluation scores are reliably high, and last semester's were close to perfect. But that's no reason for me to believe them. Student evaluations, which are the only widespread quantitative measure of university teaching and by which some numbers-minded administrators set great store, measure student satisfaction, not student learning. These are obviously not the same thing. My numbers don't prove I'm effective. They prove that I'm well-liked. And the students have no way to know how much, or how little, they would have learned with a different teacher, because mine is the only version of that class they've taken. If, for example, I did not cover material that students at other universities would routinely learn in that course, my students would have no idea. And if my students learned significantly more than students in a similar course somewhere else, if they covered more material, understood it better, and developed stronger intellectual skills than students elsewhere, they couldn't know that either. They do have an intuitive sense of whether they're learning or not, and that intuitive sense is part of their satisfaction with the course and the teacher, but it's only part. These numbers are correlated with effective teaching, but the correlation isn't strong and (even worse) the correlation itself varies widely from class to class. Some popular teachers are very effective. Some are simply easy. Still others are merely likeable. Which am I? The numbers don't tell me.

    There are other ways to evaluate university teaching: peer observations, teaching portfolios, testimonial letters from former students, and so on. All of these methods are also imperfect. They work best when several methods are combined, which gives you a fuzzy but reasonably adequate picture of how someone is doing in the classroom. The fuzziness of those measures has fundamental consequences for the entire profession of college teaching. They're good enough to weed out flagrant incompetence, or at least to reassure the school that the flagrantly incompetent have been weeded out. But they're not focused enough to make fine distinctions between good teachers and very good ones, let alone distinctions between the very good and the truly excellent. This is why professors ultimately advance their careers either as researchers or as administrators; research talent and administrative skill are easier to measure, and make it easier to distinguish the excellent from the merely good and the very best from the merely excellent. There isn't a career path that rewards superb teachers for their teaching, because the available measurements can't reliably tell those people from the teachers who are only above-average.

    If college administrators tried to use the existing measurements to reward the best teachers by, say, promoting people whose teaching scores were 10% better than their colleagues', they'd be prospecting for fool's gold. Was I really 5% or 10% better last semester than I was the semester before? Hell, no. In fact, I was badly distracted last semester: I had an especially laborious and high-stakes administrative role to perform, I was making an inter-state commute almost every weekend, I got terrible medical news about people dear to me, and halfway through the semester I got married. I never slacked on my course prep, but I guarantee you that there was no extra time to put into it. The numbers, taken at face value, suggest that I should strive for that level of stress and distraction every semester, but the numbers should shut their damn mouths. And when this semester's numbers "show" that my effectiveness has "declined" 10% or 15% (because last semester's scores can only decline), that won't mean that I've actually become less effective. It will mean that the scores fluctuate widely from semester to semester because they are extremely imprecise.

    But my bullshit evaluation numbers are a masterpiece of statistical rigor compared to the numbers that New York City just published. That data is related to students' performance on a standardized test, which is already imperfectly correlated with how much students have learned. So from the start you've got a shaky correlation with a shaky correlation. Then the numbers are adjusted in various ways to make them more "meaningful," but considering how small the sample sizes are, all the extra variables and sub-tabs, and the addition of new correlation problems with each new variable, actually make the numbers murkier and more volatile. Even if you ignore all those problems, and you shouldn't, there's the problem of sample size itself. In some instances the number crunchers themselves admit that the margin of error for particular teachers hits 53%. This means that a teacher ranked in the 50th percentile might actually belong in the 103rd percentile and be a miracle worker, or belong in the -3rd percentile and have been dead since the 1990s.

    When I call these numbers bullshit, I don't mean that they serve no purpose at all. We will only get meaningful techniques of measurement by experimenting with different approaches. The numbers we have are not useful as actual measurements. They are useful as steps in the project of devising better measurements. Bullshit, put to the right use in your garden and combined with the right mix of water, seeds, and sunlight, will eventually yield a nutritious salad. But that doesn't mean you put the bullshit on a plate and call it lettuce. The New York City numbers are pretty obviously not ready for public consumption. Serving them up represents a health hazard.

    Bill Gates, a champion of number-driven education reform, published an op-ed in the Times opposing the release of the teacher numbers. By and large, Gates gets it: the numbers aren't ready for prime time and using them to publicly shame teachers will only cause harm. And Gates is right that using numbers punitively, especially when the numbers themselves aren't even half-baked, will only make teachers resist the whole project of numerical assessment. Of course it will.

    Finding ways to measure teaching quality would eventually benefit teachers enormously. Teachers don't oppose measurement and numerical assessment because they fear change, or don't want to be held accountable, or because they're union thugs. Teachers oppose these "reform" initiatives because the "reformers," sometimes with the best of intentions, often use badly flawed measurements as if they were self-evident facts. No one in their right mind would want to be evaluated that way, especially when "education reform" in its current form has no suggestions for helping "underperforming" teachers except firing them. Gates understands that education reform should ultimately aim to help teachers improve, rather than simply replacing them, but many "reformers" take a much cruder approach. Claiming that the teachers are just looking out for their self-interest doesn't cut it; you can ask teachers to put their own interests aside for the sake of the kids' education, but you can't ask them to put their interests aside for the sake of number-driven policies that don't help the kids' education and likely hurt it. Turning over K-12 education to a set of statistics that don't actually measure learning is not a worthwhile goal, period, let alone a goal worth getting fired for.

    People with the best intentions can do enormous damage to our education system by naively relying on numbers that are a long way from becoming reliable. These people are perfectly sincere. They really think that the bullshit is lettuce, and they will tell you at length how important leafy greens are to a good diet. If someone tells you that identifying the best teachers is perfectly simple, you're likely talking to one of these naive and disastrously well-meaning souls. They not only do damage to the current education system, but they set back reform, because peddling bullshit and calling it lettuce has the long-term effect of making teachers oppose lettuce on principle, and moves us further from the day that we can actually produce a healthy salad.

    And what Bill Gates does not get is that not everyone who advocates these number-driven policies is naive or well-intentioned. There are a number of people supporting numerical assessment who are not interested in improving education at all, but who are simply anti-teacher or even anti-education. Some are union-busters, some have ideological problems with public schools, some have other motives. But they are not interested in producing lettuce. They just want to see some teachers eating shit. This can be difficult for well-meaning "reformers" to see; when you understand yourself as crusading for the public good, you tend to see anyone who joins you as one of the good guys. But it is transparently and intuitively obvious to teachers. When the same politicians and interest groups who were down on teachers last year are suddenly talking about "assessment" and "reform" this year, it's obvious that those politicians and activists are just adopting a new name for the same old ends. And that leads many teachers to see all advocates of reform, no matter how well-intentioned, as part of an older anti-education agenda. When reformers talk about reform leading to higher pay for the best teachers while the "underperformers" are fired, it is very obvious to people who actually teach that no one is going to get much of a raise, but that the firings are at the top of the agenda. (Even when school systems follow through with merit pay, the increases are small, and in many systems the "best" teachers don't do any better financially under the "reformed" system.) The sincere reformers, such as Arne Duncan or Barack Obama, generally don't grasp this. Their opportunistic allies do.

    The genuine reformers damage their cause through their careless choice of allies, and by working with people who are operating in bad faith. They not only create resistance from the very people who should be their most important allies, the teachers themselves, but they ensure that any "reforms" enacted will be implemented abusively rather than productively: that flawed numbers will be treated as hard data, that results will be used to punish teachers and not to help them, that the promised raises never come but the threatened firings do. "School reform" will be a thin disguise for teacher-bashing as long as the "reformers" include education-bashers in their political coalition. That alliance will always provide enough political backing for new punishments, but not enough for the promised rewards. Bill Gates should be applauded for reminding the well-meaning readers of the New York Times what education reform is supposed to be. But his plans will never bear fruit until he comes to grips with what "education reform" actually is.


    Excellent piece all around. I think we'll be able to get a good measure of teacher performance when we make teaching reproducible. All we need to do that is to create an AI capable of teaching Early 20th Century British Literature, General Relativity, and all of the other courses being taught in colleges today. (That might sound daunting, and it is, but if you can cover those two, you're probably already a good ways towards covering the rest.)

    Excellent indeed!

    These are tough issues.

    What we are supposed to be doing is measuring the future abilities of the children and there lies chaos.

    In white suburbia five decades ago (and some more) we took tests.

    We took state wide tests, we took pretend state wide tests in preparation for state wide tests, we took our MMPI's (do you look in the toilet every time you do your business or just some of the time?), we took weekly tests and monthly tests and quarterly tests, we took open book tests and closed book tests, we took multiple choice tests and true false tests and essay tests....

    After a while, us white test taking suburbanites got pretty good at taking tests!

    And since, in my humble opinion, we had just about the best ed possible as far as middle class education, I am kind of on the side of testing. I mean, you spend six hours a day constantly being lectured to and reading many pages of material; the teacher has to figure out if the teaching method is working.

    What we are dealing with is chaos actually. Sometimes I feel that all I ever learned with 20 years of schooling was how to take tests.

    I would suggest a three pronged test for teachers (punny huh?)

    Use the test scores.

    Use peer review--how do other teachers perceive you?

    Use parent review--are the parents happy about the progress of their children?

    Use administrative review.

    We are stuck with analysts who offer bumper sticker phrases that amount to dog whistles to certain 'control' groups.

    The real message is that teachers receive too goddamn much money for only 9 months work!

    Well; do we pay our teachers 7 bucks an hour and hope for the best?

    The problem is that we have 6.2 million teachers so the repubs use an online calculator and figure out that a billion more dollars could be going into their pockets!

    I had to laugh when I first saw how much money Santorum was making whilst he home schooled his kids.

    Same result for Bachmann and many others of course.

    Did you notice lately that the repubs in states like Wisconsin and Ohio and Indiana and Maine went after the high volume governmental employees like teachers and firefighters and police?

    Hey, you want an extra billion bucks to give away to corporations so that they will hire folks at 8 bucks an hour? Just f*&k the teachers and the firefighters and...

    All I know is that if you went to school in white suburbia in the 50's & 60's you were much better at taking tests than if you were schooled in Watts.

    the end

    Except I am sorry to take your time by going on a nonsensical rant. ha

    What can I say? It is what I do!




    I would suggest a three pronged test for teachers (punny huh?)
    1. Use the test scores.
    2. Use peer review--how do other teachers perceive you?
    3. Use parent review--are the parents happy about the progress of their children?
    4. Use administrative review.

    Ironic on a piece titled "Teaching by the Numbers". ;)

    In all seriousness, while this is a step up from the current system, it still suffers from the popularity contest feature that Doc Cleveland refers to. It's possibly the best we can do, and arguably even better than nothing (unlike the current system), but it definitely should never be expected to be completely accurate.

    One thing I noted during my time teaching in public schools is that new teachers tend to get the worst classes. These students also tend to have the worst test scores, and although a good teacher might make them better than they would otherwise be, comparing two teachers based on how well they teach two different classes is not going to give you an accurate comparative measure.

    Well somebody got the joke. hahahahah

    And the firemen would get the worst jobs and the police would get the worst jobs!

    Sounds like human nature (a phrase I have despised for 5 decades) but there will always be a hazing process.

    I don't have any answers. You as a teacher would know that there can never be a three or four or sixteen pronged attack on a system that seems to be failing in 50%? of our school districts.

    I recall blogging on this issue several times before.

    I recall History 101: a while into the first lecture, the prof (Altof) asked for a show of hands as to how many students had had a history teacher who was also a coach.

    Almost everyone raised their hands.

    The professor continued:

    Well let us begin at the beginning. The 1700's involved the 18th century...hahahah


    All the repubs will ever ever ever do is to look at the numbers and see what class will involve the most money.

    Good point.

    And I would like to add at this point that those who dedicated years in service to being teachers should be admired and bonused and congratulated for their service.

    And I congratulate you!



    A few thoughts and questions raised by the original post:

    1) If you have the data, should you withhold it from parents who use the schools and may want to make their own choices? From taxpayers who are paying the teachers and are, essentially, their bosses one step removed?


    2) A few reformers pose the bigger question another way. Though value-added and test scores are imperfect ways of judging a teacher, are they more, or less, imperfect than the existing way? There is significant error already in a system that treats every teacher as equally effective and compensates them equally, other than automatically giving raises simply because they stuck it out another year, whether they tried to get better over that year or not. I'm not sure yet which way is more imperfect, but the difference in error between the two approaches may be the key, not whether either way has problems.


    3) Though you are right in noting that raises for high-performing teachers are small, that may be because unions won't agree to take from the bottom to give to the top. The discussion is always about raises for the top, which means greater costs, which means higher taxes. If you suggest going to a dollar-neutral system in which the same money is distributed differently, unions balk. They just want more money, which is just as natural as taxpayers wanting to pay less. You can take Republican cost-cutting and union money-grabbing accusations out of the picture by just shifting the same dollars. If the goal is to reward good teaching and give less reward to poor teaching, then maybe the distribution pattern should be changed. Yes? No?






    In no order:

    2) Every harmful pseudo-science makes the "better than nothing/better than what you're doing" argument. And it's always wrong.

    Astrology isn't perfect, but it's better than not knowing anything about the future. Subjecting job applicants to handwriting analysis isn't perfect, but it's better than not knowing anything about their psychology. Bleeding a patient doesn't always work, but it's better than not doing anything and hoping they get better on their own. Every one of those claims is wrong.

    When you don't know something, it is always better to accept your lack of knowledge rather than to embrace some pseudo-knowledge that helps you conceal your ignorance from yourself. Pseudo-knowledge makes you more confident in your ignorance and misunderstanding and thus makes your actions more damaging.

    In any case, you have no way of knowing whether the new numbers are more or less imperfect than the existing way, because you can't even assess the assessments.

    1) By this logic, obviously, releasing such badly flawed data causes foreseeable harm without any certainty of doing good. If we have astrological and phrenological charts on the teachers, should we release those to the parents and taxpayers, so they can make up their own minds? No. And your emphasis on the parents and taxpayers as "bosses" makes it clear that you think that the parents and taxpayers should use this gobbledygook "data" to demand high-stakes personnel actions. That's an excellent reason not to give them bad data, frankly.

    3) I can't help noticing that your first impulse is that the "underperforming" teachers need to be punished more. The idea of helping struggling teachers get better at teaching is not on the table. And that makes "reform" counterproductive. People won't become better teachers because they're afraid of punishment. Most underperforming teachers would prefer to be better teachers, but don't know how. And since many "reformers" and their sympathizers, such as yourself, have no idea how to make teaching better, you simply look for people to scapegoat as bad teachers. That isn't helping anybody.

    In my school, we had administrative review, where by "administrative", I mean a combination of the assistant principal in charge of education and the head of the department in which a teacher taught. All teachers were reviewed at least 3 times a year, by one or the other of the people I mentioned, and everyone knew ahead of time the list of criteria by which we were being judged. You were required to pass 3 reviews (out of 5, if necessary) and if you didn't, you were on notice. After 3 years of being on notice, you stopped receiving raises. Personally, I would've favored getting rid of teachers at that point, if not sooner, but as you say, teacher unions seem to fight that.

    That said, although this system worked fine at my school (as far as I knew), there are schools where the powers-that-be are part of the problem.

    One point is that there's not a one-size-fits-all solution. Another is that the method being used here is a one-size-fits-none. I.e., I wouldn't expect it to be valid in any school system.

    (For what it's worth, I passed every one of my reviews for the two years I taught in the public school system. However, I was talked to by the assistant principal in charge of education for being too harsh with my students. I was lucky enough, unlike most new teachers, to get several advanced classes of physics students and was told that, "these are A and B students and they should be making As and Bs." I responded that, "yes, they should, and they can if they want to." Most of them were making As and Bs, and I had a couple who later thanked me for giving them the tools to thrive at Georgia Tech.)

    In Florida, they have a new program to dissuade sixth-graders from developing an interest in ultimately pursuing a college liberal arts education.


    My wife just switched careers to become a teacher, so I've gotten to see the public system from a different perspective than I had before. We're in VA, a "right to work" state. Here are a few observations:

    • The people who have the biggest impact on the quality of education and the whole school experience are the administrators. And yet, in the national "debate," such as it is, the quality of administrators almost never comes up.

    In my very limited experience, administrators frequently haven't done any teaching in a long time and/or were bad teachers when they did teach. I've heard that in some cases, bad teachers are "promoted" to administrator instead of being let go.

    My wife has had HORRENDOUS experiences with administrators who, almost to a person, have a dramatically negative impact on the learning environment.

    • Much is made of teacher evaluations as a way of weeding out the bad and promoting the good. The underlying assumption is that this sort of weeding out goes on in other "industries" and professions--but teachers are shielded from it. But is this really the case? Are bad doctors, lawyers, machinists, salespeople, etc., culled by "the market," or do we have plenty of bad doctors without their bringing down the general state of doctor-dom the way bad but protected teachers are said to bring down the general state of education in this country?

    • Put another way, why isn't there a general call to improve the state of doctoring or lawyering through improved evaluations? Or is the assumption that these other professions are in good shape while education is uniquely in bad shape?
