Lies, Damn Lies, and Data Mining

    In the wild (i.e., when we're not talking about contrived examples), data mining involves significant amounts of statistics. There are two common quotes that come to mind when talking about statistics:

    There are three kinds of lies: lies, damned lies, and statistics. (Popularized by Mark Twain/Samuel Clemens, who attributed it to Benjamin Disraeli, but with uncertain provenance.)

    and

    The old saying is that “figures will not lie,” but a new saying is “liars will figure.” It is our duty, as practical statisticians, to prevent the liar from figuring; in other words, to prevent him from perverting the truth, in the interest of some theory he wishes to establish. (Carroll D. Wright, a prominent statistician employed by the U.S. government in 1889)

    Although it's true that there are many who would use statistics to obfuscate, the more common danger comes from people who don't intend to misuse statistics but use them without properly understanding them. Here are a couple of stories from Leonard Mlodinow's The Drunkard's Walk, the first of which concerned him personally:

    My most memorable encounter with the Reverend Bayes came one Friday afternoon in 1989, when my doctor told me by telephone that the chances were 999 out of 1,000 that I'd be dead within a decade. …


    The adventure started when my wife and I applied for life insurance. The application procedure involved a blood test. … [Mlodinow's HIV test] came back positive. Though I was too shocked initially to quiz him about the odds he quoted, I later learned that he had derived my 1-in-1,000 chance of being healthy from the following statistic: the HIV test produced a positive test when the blood was not infected with the AIDS virus in only 1 in 1,000 blood samples. That might sound like the same message he passed on, but it wasn't. My doctor had confused the chances that I would test positive if I was not HIV-positive with the chance that I would not be HIV-positive if I tested positive.

    He goes into more detail explaining it (and I highly recommend that book), but the take-away is that his chances of having HIV (and hence AIDS) were not 999 in 1,000. Using the a priori odds of a heterosexual, non-drug-abusing, white male American having HIV, which are 1 in 10,000, the actual odds that Mlodinow was infected were about 10%. (Consider those 10,000 people: about 10 of them would get a false-positive result on an HIV test, while only 1 of them would get a true-positive result.)
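    That 10% figure falls straight out of Bayes' theorem. Here is a minimal sketch of the arithmetic, using the numbers quoted above (a 1-in-10,000 prior and a 1-in-1,000 false-positive rate) and assuming, for simplicity, that the test catches every true infection:

```python
# Bayes' theorem applied to Mlodinow's HIV test result.
# Prior and false-positive rate are from the story; the
# sensitivity of 1.0 is an assumed simplification.
prior = 1 / 10_000          # P(infected), a priori for his demographic
false_positive = 1 / 1_000  # P(positive | not infected)
sensitivity = 1.0           # P(positive | infected), assumed

# Total probability of a positive test
p_positive = sensitivity * prior + false_positive * (1 - prior)

# Bayes' theorem: P(infected | positive)
posterior = sensitivity * prior / p_positive
print(f"P(infected | positive) = {posterior:.1%}")  # 9.1%
```

    The doctor, in effect, reported P(positive | not infected) when the question that mattered was P(infected | positive); the tiny prior is what drives the two numbers so far apart.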

    Another excellent example from his book, one that gets to the heart of the problem with over-relying on data mining, concerned Sally Clark, in Britain. Her first child died at 11 weeks, and the death was attributed to SIDS (sudden infant death syndrome). However, when her second baby also apparently died from SIDS (this time at 8 weeks), she was accused of smothering both children. The odds calculated for having two children die from SIDS were 1 in 73,000,000. Putting aside the fact that there are far more than 73 million people in the world (and almost that many in Britain alone), those odds were calculated by taking the odds that a single child would die from SIDS (which had been calculated at 1 in 8,543) and squaring them. The problem is that this process assumes statistical independence, which, when one considers the possible medical explanations for SIDS, seems even more unlikely than having two of your children in a row die from it. The good news is that after three and a half years in jail, Sally Clark was eventually released from prison when it was uncovered that the pathologist working for the prosecution had withheld the information that Clark's second child had been suffering from a bacterial infection at the time of her death.
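    The squaring step is easy to reproduce, and so is its failure mode. The sketch below uses the 1-in-8,543 figure from the case; the 1-in-100 conditional risk in the second calculation is a purely hypothetical number I've chosen to illustrate what a shared genetic or environmental cause would do to the arithmetic:

```python
# The prosecution's arithmetic: square the single-death rate,
# which is only valid if the two deaths are independent.
p_single = 1 / 8_543
p_both_if_independent = p_single ** 2
print(f"1 in {1 / p_both_if_independent:,.0f}")  # 1 in 72,982,849

# If a first SIDS death signals a shared underlying cause, the
# second child's risk is conditional, not independent. Suppose
# (hypothetically) it rises to 1 in 100:
p_both_if_dependent = p_single * (1 / 100)
print(f"1 in {1 / p_both_if_dependent:,.0f}")  # 1 in 854,300
```

    Under even a modest dependence assumption, the headline number shrinks by two orders of magnitude, which is the whole objection to the prosecution's figure.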

    So, other than statistics, what does this have to do with data mining, since neither of these stories involved data mining? Well, while I was working on my Ph.D., I was also working for a data mining facility, and my boss shared a story with us about an interesting statistical result that had been uncovered: average school SAT scores are negatively correlated with the amount of taxpayer dollars spent on those schools. This was surprising, so people dug deeper. It turns out that average school SAT scores are higher out west, where far fewer students per school take the SAT and where less money is spent per school on average. The reasonable explanation for the result was selection bias: students who take the SAT out west are usually interested in applying to out-of-state colleges, and those students (regardless of location) usually do better on the SAT.

    So what? Well, the real lesson our boss wanted us to learn from this was that if the result had been the reverse (i.e., that SAT scores were positively correlated with the amount spent on schools) it's less likely that people would've been considering alternate explanations.

    Think about that.

    Now, consider a hypothetical: data mining statistics come back suggesting that the odds that John Doe is guilty of X are 99.999%. John Doe is a minority with a juvenile record that includes crimes similar to X, and he happens to live within 20 miles of Alan Street and 1st Avenue, in New York City, where the crime happened. Although 99.999% seems pretty ironclad, it means that there are 1-in-100,000 odds that a non-guilty person would also be implicated. How many people live within 20 miles of Alan Street and 1st Avenue? How many of them have juvenile records? How many of those records might include similar crimes? Do we even know if the person actually guilty of the crime has a juvenile record? (No, we don't. This is my hypothetical, after all, and it would seem weird to assume otherwise regardless.)
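    To see how the base rate undercuts that 99.999%, here is a quick sketch. Only the 1-in-100,000 false-implication rate comes from the hypothetical above; the population and record counts are invented for illustration:

```python
# Base-rate sketch for the John Doe hypothetical.
false_rate = 1 / 100_000  # odds a non-guilty person is implicated

# Hypothetical: people within 20 miles of the scene in NYC
population = 10_000_000
false_matches = population * false_rate
print(int(false_matches))  # 100 innocent people implicated

# Even restricting to a (hypothetical) 1% with similar records,
# John Doe is still one of several candidates, not a certainty.
with_similar_records = population * 0.01
print(int(with_similar_records * false_rate))  # 1
```

    The point is the same as in the HIV story: a tiny error rate applied to a huge pool of non-guilty people still produces a crowd of false positives.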

    So, who here now thinks John Doe is most likely guilty? Who here thinks he would be found guilty in a court of law?

    Comments

    Given the context of this post, it would be interesting and important to know how the government is using the data it gathers. But I don't suppose that they'll tell us that.


    It would be just as interesting and important (if not more so) to know how the government is gathering the data it uses, but it's even less likely they'll tell us that!


    The statistical explanation for the "collateral damage" deaths with drone strikes.


    Your post reminded me of something I snipped from an article a few days ago.  Thought you might be interested.

    Graduate courses are springing up to meet the demand for analysts

    Obtaining data is easy; it can come from a huge variety of automated sources, including RFID tags, mouse clicks, or sales receipts. And the analytic software systems—such as SAS Institute’s eponymous SAS and IBM’s SPSS—that are required to work with this data are getting better, says Michael Hasler, director of a new M.S. in Business Analytics program at the University of Texas at Austin. But what’s missing are the people: “You need to take these large unstructured data sets, clean them up, and find insights, but there’s a shortage of talent to do that work,” says Hasler.



    Well I am going to start a company that, for a small fee, figures out what the government is figuring out about you.

    It's a free country, right?


    Ummm...too late?

    http://panopticon.com/



    Drat.

