Bayes’ Theorem and DNA Database Searches
Tuesday, May 6th, 2008
Via the comments to this Eugene Volokh post, it looks like the Ninth Circuit has just thrown out (pdf) a guilty verdict over precisely the problems with predicting odds when doing cold DNA database searches that we discussed earlier this week. Excerpt from the opinion:
Here, [DNA expert Renee] Romero initially testified that [defendant Troy Don Brown]’s DNA matched the DNA found in [rape victim Jane Doe]’s underwear, and that 1 in 3,000,000 people randomly selected from the population would also match the DNA found in Jane’s underwear (random match probability). After the prosecutor pressed her to put this another way, Romero testified that there was a 99.99967 percent chance that the DNA found in Jane’s underwear was from Troy’s blood (source probability). This testimony was misleading, as it improperly conflated random match probability with source probability. In fact, the former testimony (1 in 3,000,000) is the probability of a match between an innocent person selected randomly from the population; this is not the same as the probability that Troy’s DNA was the same as the DNA found in Jane’s underwear, which would prove his guilt. Statistically, the probability of guilt given a DNA match is based on a complicated formula known as Bayes’s Theorem, see id. at 170-71 n.2, and the 1 in 3,000,000 probability described by Romero is but one of the factors in this formula.
Once again, it’s worth noting that if other evidence points to a suspect, and you then get a match to your suspect after running the crime scene DNA against a database, you can be reasonably certain of guilt. I’m just wary of using cold matches as the starting point of an investigation. Precisely because many people misunderstand the fairly high odds of false matches with large databases, you run the risk of the investigation becoming more about finding proof that the match committed the crime than about investigating who committed the crime. The problem grows when you’re talking about decades-old cases where evidence has degenerated, witnesses have died, and records may or may not still be around.
TheAgitator.com

DNA evidence itself doesn’t necessarily prove anything, except in something very specific like child molestation, where it’s not possible to have a lawful explanation for why it is there. It is important for people to realize that all DNA evidence really means by itself is that someone has a connection with the crime scene, but it doesn’t itself explain what the connection is, and it could easily be harmless.
“finding proof that the match committed the crime than about investigating who committed the crime”
I think finding proof that a suspect did it and investigating the crime are the same thing and, also, that you’re creating a distinction without a difference between DNA evidence and any other evidence the cops may have.
Every clue that points towards a perpetrator has some odds of pointing to a random innocent person, only with most of those clues (a witness saw someone that looked like the suspect, there were partial fingerprints at the scene) we don’t know what the odds are. It doesn’t make a difference if the first clue found is DNA evidence or any other evidence that points to a specific person… after that first suspect is found the police are going to try to prove that he did it.
I have concerns about DNA databases, but this just isn’t one of them.
The point is that if the police really believe that a cold match means it’s a million to one or better certainty that they have the right guy, then they’re going to tailor their investigation to that guy, and look for evidence against that guy, and likely not bother with other leads. In other words, they’ll be investigating the match, and not necessarily the crime. If you so much as lived in the same state as the victim at the time the crime was committed, the burden of proof switches to you to come up with an alibi, or prove your innocence.
I’m not saying cold searches should never be used. Only that police, prosecutors, and the public need to be aware of the statistical probabilities, here, and they need to know that matches are nowhere near a certain indicator of guilt.
This didn’t post before, I don’t know if it had to do with the glitch or if I broke a rule or something, but I wanted to bring up chimerism. This is basically when one person has two or more sets of DNA in their body. Discovery did a show on it called “I am my own twin”. In it, a woman almost lost her child because a DNA test did not identify her as the mother. This has happened in transplant cases as well. Here is more info on that case: http://en.wikipedia.org/wiki/Lydia_Fairchild
While all the articles I’ve read say this is rare, I can’t find anything that estimates just how rare or how common it is. It just adds to the issues of errors you’ve already brought up.
Isn’t that an argument for big, extensive DNA databases, especially in large states like CA? If each DNA search turns up 30-odd cold matches, that pretty much means other evidence would be required to even arrest someone, much less convince a jury, right?
“I think finding proof that a suspect did it and investigating the crime are the same thing”
Sort of, but sometimes an investigator settles on a suspect too early and starts trying to build a case without considering possible alternatives. This is called “falling in love with your theory of the case” and it’s a well-known trap that investigators are normally taught to avoid.
I think the fear here is a repeat of what has been done with science and DUI. In that case a breathalyzer is supposed to be used as supporting evidence to build a case against a defendant (e.g. BAC +.08, impaired driving, failure of field sobriety tests, etc.). However a BAC +.08 alone is enough for a conviction even in the absence of any other evidence. Here the “science” of the breathalyzer trumps all other evidence, even in spite of the fact that a .08 BAC isn’t a certainty of impairment. Those who fear a similar single conviction element based on DNA alone are justified. After all, the courts are educated in the law, not science and statistics. And we know that Breathalyzers and DNA are a sure thing; haven’t you seen CSI?
Radley - Ok, I think we’re on the same page. As long as DNA evidence is not treated as infallible truth I don’t have a problem with its use.
Windypundit - Yes, but “falling in love” can happen with any first bit of evidence. DNA isn’t special in that regard.
Two things:
One, “99.99967” may be the technical accuracy of the test, but it doesn’t include any human sources of error.
Secondly, even if we ignore point number one, a database of all the people in the world (for example) would return a large number of results. The perpetrator would be, with 99.99967% certainty, on the list. But the list would also be longer than one name. So getting a positive result doesn’t mean your DNA was in the underwear, it means you’re one of the handful (or more) who might be the one. But, when the database is so small that you’re the only one who gets a positive flag, you look like the culprit.
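To put rough numbers on the database-size point, here is a minimal Python sketch. It assumes every profile in the database is an independent draw that matches by chance with the 1 in 3,000,000 random match probability quoted in the opinion, and that none of the profiles belongs to the true source. The function names and the database sizes are made up for illustration and do not describe any real system.

```python
# Hypothetical illustration: coincidental matches in a cold database search.
# Assumes independent profiles, each matching by chance with probability p.

RANDOM_MATCH_PROB = 1 / 3_000_000  # figure quoted in the opinion


def expected_false_matches(db_size, p=RANDOM_MATCH_PROB):
    """Expected number of innocent people who match by chance."""
    return db_size * p


def prob_at_least_one_false_match(db_size, p=RANDOM_MATCH_PROB):
    """Probability that at least one innocent person matches by chance."""
    return 1 - (1 - p) ** db_size


# Database sizes below are assumptions chosen for illustration only.
for db_size in (100_000, 3_000_000, 100_000_000):
    expected = expected_false_matches(db_size)
    at_least_one = prob_at_least_one_false_match(db_size)
    print(f"{db_size:>11,} profiles: expected {expected:.3f}, "
          f"P(at least one) {at_least_one:.3f}")
```

Under these assumptions, a database about as large as the inverse of the match probability is expected to contain roughly one coincidental match, and larger databases return lists longer than one name.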
But Radley, if we had a database with 150 trillion people in it, and the probability of somebody else having a match is equal to Planck’s constant then clearly we’d have a match that proved guilt…so clearly you are a fool and nobody should ever trust you again.
Okay, seriously though, yes, you absolutely need to do this via Bayes’ theorem. That is, you want to know the following probability:
Prob(Guilt|DNA Match).
This is not necessarily the same as Prob(DNA Match|Guilt).
To find the above though you would use the following equation:
Prob(Guilt|DNA Match) = Prob(DNA Match|Guilt)Prob(Guilt)/Prob(DNA Match).
The probability of DNA Match is the 1 in 3 million probability. The higher that number, all else held constant, the lower the Probability of Guilt given a DNA match.
Think of it this way, are the two probabilities equal:
Prob(Royal Flush|Winning Hand) vs. Prob(Winning Hand|Royal Flush).
Given that I have a royal flush, I’m pretty much going to win. However, one can have a winning hand and not have a royal flush (the first probability). Clearly the first probability is going to be smaller than the second probability.
Oh and the Prob(Guilt), that is the prior probability of guilt. This is tricky to handle, IMO. Depending on your flavor of Bayesian analysis it can be a subjective probability. Also, there is the “innocent until proven guilty” aspect as well. I’d say that a good starting point would be Prob(Guilt) = 0.5. This is often considered a “non-informative” prior. You can also calculate the above probabilities for different priors. Also, you’d eventually want to factor in other evidence as well such as eyewitness testimony, an alibi, etc.
Knowing nothing else, I’m sorry, but a P(guilt) of 0.5 isn’t an uninformative prior at all. Guilty and innocent isn’t a symmetry here. The symmetry for an uninformative prior is between the different people who could have done it, so P(guilt) is going to be 1/population. We do, of course, restrict it to the relevant population of men, in the area, consistent with the evidence. Letting X be all this other evidence, what we would like is
P(guilt | DNA match, X) = P(DNA match | guilt, X) * P(guilt | X) / P(DNA match | X).
It can be really helpful to include the X, as it helps solidify what you’re conditioning against, so that it doesn’t change in different expressions.
This is even better, but you should be careful in that I doubt a guy in Maine raped a woman in Hawaii. If we had good data on the ratio of rapists in the male population we could use that as well, or as you say, include it in X when we do our conditioning. In any event, using different priors is probably a good idea as it gives you an idea of how sensitive the posterior is to the specification of the prior.
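A rough sketch of that sensitivity check in Python follows. It assumes P(match | guilt) = 1 and P(match | innocent) = 1/3,000,000, and expands P(match) with the law of total probability. The function name and the list of priors are hypothetical choices for illustration, not anything taken from the case.

```python
# Sketch: how the posterior P(guilt | match) moves with the prior P(guilt).
# Assumed values: P(match | guilt) = 1 and P(match | innocent) = 1/3,000,000.


def posterior_guilt(prior, p_match_guilty=1.0, p_match_innocent=1 / 3_000_000):
    """Bayes' theorem with P(match) expanded by the law of total probability."""
    p_match = prior * p_match_guilty + (1 - prior) * p_match_innocent
    return prior * p_match_guilty / p_match


# Hypothetical priors: the 0.5 proposed above, then 1/N for several
# illustrative sizes N of the population that could have committed the crime.
priors = {
    "0.5": 0.5,
    "1 in 1,000": 1 / 1_000,
    "1 in 100,000": 1 / 100_000,
    "1 in 2,000,000": 1 / 2_000_000,
}

posteriors = {label: posterior_guilt(p) for label, p in priors.items()}
for label, post in posteriors.items():
    print(f"prior {label:>15}: posterior {post:.6f}")
```

With these assumed inputs the posterior stays close to 1 until the prior gets near the random match probability itself, and then it drops quickly.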
Two or more witnesses - nothing less. But then there is no justice among men.
Science lies. All science is opinion based upon some set of “facts” agreed to by those working in a specific area.
If you had no good suspects and no specific prior reason to suspect the person before searching the DNA database for a match, then the prior probability of guilt P(Guilty) could reasonably be set at 1/N, where N is the number of people who could possibly have committed the crime. Assuming that the semen found on the victim came from the assailant, then N could easily be in the millions if the crime occurred in a large metropolitan area.
As an example, suppose that P(Match | Guilty) = 1 (or nearly so), P(Match | not Guilty) = 1/3000000, and N = 2000000. Then
P(Match)
= P(Guilty) P(Match | Guilty) + P(not Guilty) P(Match | not Guilty)
= 1/2000000 + 1/3000000 (approximately)
= 5/6000000
P(Guilty|Match)
= P(Guilty) P(Match|Guilty) / P(Match)
= 1/2000000 * 1 / (5/6000000) (approximately)
= 3/5
= 60%
In this case, the DNA evidence is good reason to suspect the person, but it doesn’t conclusively prove his guilt.
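The arithmetic above can be checked without the approximation. This is a minimal sketch using Python's exact fractions and the same assumed inputs (P(Match | Guilty) = 1, P(Match | not Guilty) = 1/3000000, N = 2000000). The function name is just an illustrative choice.

```python
from fractions import Fraction


def posterior_guilty(prior, p_match_guilty, p_match_not_guilty):
    """Return P(Guilty | Match) using exact rational arithmetic."""
    # Law of total probability for P(Match).
    p_match = prior * p_match_guilty + (1 - prior) * p_match_not_guilty
    # Bayes' theorem.
    return prior * p_match_guilty / p_match


N = 2_000_000  # assumed number of people who could have committed the crime
post = posterior_guilty(
    prior=Fraction(1, N),
    p_match_guilty=Fraction(1),
    p_match_not_guilty=Fraction(1, 3_000_000),
)

print(post)         # exact value as a fraction
print(float(post))  # very close to 0.6
```

The exact result is 3000000/4999999, which differs from 3/5 only in the seventh decimal place, so the approximation in the worked example is harmless.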