Odds n’ DNA Databases

Sunday, May 4th, 2008

Steve Chapman had a column last week about the benefits of assembling large DNA databases of the populace for the purpose of solving crimes.

The L.A. Times has a story this weekend on why that creates some problems that might not be readily apparent.

The main problem is that the chance of a false match rises sharply when you’re running a DNA sample against a database of hundreds of thousands of people (in Britain, the number is well into the millions)–a Bayes’ Theorem problem. The problem is exacerbated when you’re dealing with degraded DNA from old “cold cases,” which yields fewer usable markers than a well-preserved sample.

Let’s say the U.S. adopts Great Britain’s policy on collecting DNA–basically a move toward, at some point in the future, having DNA on file for everyone in the country. Now the 1 in 1.1 million odds against the suspect in the L.A. Times case are being run against a database of 380 million people. The numbers say you’re going to pull up about 345 matches in the U.S. alone. In the California case, the database was obviously much smaller than the entire U.S. population, and only one of those 345 people showed up in the 330,000-person FBI DNA database–the (admittedly unsympathetic) subject of the article. But any of the other 344 potential matches in the U.S. (or the 2,200 matches worldwide) could have committed the crime. They just weren’t in the database.
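The arithmetic above is easy to sketch. A minimal calculation using the post’s figures (a 1 in 1.1 million random-match probability for the partial profile, run against populations of various sizes; the function name is mine):

```python
# Expected number of coincidental "cold hit" matches when a partial DNA
# profile is compared against every person in a database, using the
# 1-in-1.1-million random-match probability cited in the post.
RANDOM_MATCH_PROB = 1 / 1_100_000

def expected_matches(population: int) -> float:
    """Expected number of purely coincidental matches among
    `population` unrelated people."""
    return population * RANDOM_MATCH_PROB

print(expected_matches(380_000_000))  # the post's U.S. figure: ~345 matches
print(expected_matches(330_000))      # the FBI database searched: ~0.3
```

This is also where the 1 in 3 figure comes from: run against roughly 330,000 profiles, a 1-in-1.1-million profile is expected to turn up a coincidental match about a third of the time.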

DNA database searches are an excellent starting point for law enforcement. But given the odds of false matches when running DNA against an extensive database, we should be very careful about shifting the burden onto those who match to prove their innocence. It’s also unfortunate that the judge in the case profiled in the L.A. Times would only allow the prosecution’s miscalculated 1 in 1.1 million chance of a false match into evidence, and not the more statistically sound 1 in 3. Even if one were to accept the idea that the scientific community is divided over the proper way to calculate the possibility of a false match (and I’m not convinced there’s really that much of a debate), you’d think a judge should at least allow the jury to be made aware of that division of opinion, and of the serious statisticians and scientists who would put the chance of a false match much, much higher than the figure suggested by the prosecutors in the case.


55 Responses to “Odds n’ DNA Databases”

  1. #1 |  Lloyd Flack | 

    You cannot give a meaningful estimate of the probability that the match is the culprit without an estimate of the probability that the culprit is in the database. You can reliably place a floor under this probability, but that floor could, and usually will, be fairly low–at least any floor you could defend in court is likely to be.

  2. #2 |  Xrlq | 

    It’s also unfortunate that the judge in the case profiled in the L.A. Times would only allow the prosecution’s miscalculated 1 in 1.1 million chance of a false match into evidence, and not the more statistically sound 1 in 3.

    I don’t think it’s an either-or. One measures the odds that someone will be falsely identified, while the other measures the odds that any particular individual will. Which one is relevant depends on other factors. If the only reason to suspect Puckett is the partial DNA match, and we can only assume that the odds are roughly 50-50 that the true killer was in the database to begin with, then there’s about a 1 in 3 chance that they fingered the wrong guy. But if there’s enough other evidence out there to make him the chief suspect, as the story indicates, then it’s fair to say that the odds of the partial DNA match falsely pointing to the same individual are a million to one.

  3. #3 |  Pablo | 

    Yes, we have no way to compute the probability that the true assailant is in the offender database. So we have to make one of two mutually exclusive assumptions: (1) the true assailant is NOT in the offender database, which then predicts that there is approximately a 1 in 3 chance the evidence DNA profile will match someone who is innocent. Alternatively, (2) the true assailant is in the offender database. Under this second assumption there is a 100% certainty that his profile will be found to match the evidence.
    Since the database represents a relatively small sample of the general population, it seems to me that assumption (1) is more likely than assumption (2). The 1 in 3 chance is a result of the limited partial evidence DNA profile: there is simply not enough genetic information available to achieve maximum discrimination. If there were a full evidence profile, the chance that it matched someone innocent could be 1 in trillions–a probability sufficient to believe the defendant is guilty. A 1 in 3 probability that the evidence profile came from an innocent, falsely accused person, by contrast, leaves too much doubt. Without any further evidence I would vote not guilty.
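Pablo’s two assumptions are really the endpoints of one calculation: how the probability that a lone hit is innocent varies with the prior probability that the true assailant is in the database. A simplified sketch–the model and function name are mine, and it ignores relatives and duplicate profiles–using the article’s 330,000-profile database and 1-in-1.1-million match probability:

```python
# Simplified Bayesian sketch: q is the prior probability the true
# assailant is in the database, n the database size, p the per-person
# random-match probability. Given exactly one hit, the posterior odds
# (guilty hit : innocent hit) work out to roughly q : (1 - q) * n * p.

def prob_hit_is_innocent(q: float, n: int, p: float) -> float:
    """Approximate probability that a single database hit is a
    coincidental match to an innocent person."""
    expected_false_hits = n * p                      # ~1/3 with the article's numbers
    innocent_weight = (1 - q) * expected_false_hits
    return innocent_weight / (q + innocent_weight)

# How the answer moves with the prior:
for q in (0.1, 0.5, 0.9):
    print(q, prob_hit_is_innocent(q, 330_000, 1 / 1_100_000))
```

Under Xrlq’s 50-50 prior this gives about a 23% chance the lone hit is innocent–in the same rough ballpark as his 1 in 3–and the number climbs quickly as the prior that the culprit is in the database shrinks.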

  4. #4 |  Xrlq | 

    I just had a long conversation with Patterico on the topic, and it occurred to me that we should be able to compute a reasonable estimate of the odds of the killer being in the database, in one of two ways:

    1. Compute the percentage of rapist-murderers who have prior convictions for sex offenses that would be likely to land them in the sex offender database.
    2. Compute the percentage of cold database hits that result in 2 or more hits. Assuming a 1 in 3 chance of one false positive, a 1 in 9 chance of two, and so on, we should be able to compare the expected number of false hits to the number of total hits, and extrapolate that the difference represents the number of true hits.

    Once we know the relative odds that a true hit is in the database, we should be able to say with a good deal of confidence how reliable a single cold hit is.
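Xrlq’s second method can be sketched as a back-of-the-envelope calculation: if each cold search produces some expected number of coincidental hits, hits in excess of that expectation should be true matches, and their frequency estimates how often the true source is in the database. The search and hit counts below are hypothetical placeholders, not figures from any real database:

```python
# Back-of-the-envelope version of the second method: subtract the
# expected number of coincidental hits from the observed total and
# treat the remainder as true hits.

def estimate_true_hit_rate(searches: int, total_hits: int,
                           false_hits_per_search: float) -> float:
    """Estimated fraction of searches whose hits include the true source."""
    expected_false = searches * false_hits_per_search
    true_hits = max(total_hits - expected_false, 0.0)
    return true_hits / searches

# Hypothetical: 1,000 cold searches, 500 total hits, ~1/3 coincidental
# hits expected per search -> about 1 in 6 searches found the true source.
print(estimate_true_hit_rate(1_000, 500, 1 / 3))
```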

  5. #5 |  Lloyd Flack | 

    In principle, both of your ideas are good. It would, however, be more difficult to calculate these probabilities than you might think. The problem is mostly inhomogeneity in the data.

    In the first approach, different groups of criminals might have different probabilities of being in the database. Still, the group of interest might be homogeneous enough for this to work. However, we have no way of knowing the size of the group that never gets caught.

    In the second case, the probability of a false positive will vary from case to case, which could make the calculations messy.

    Still it’s worth trying.