## Odds n’ DNA Databases

Sunday, May 4th, 2008

Steve Chapman had a column last week about the benefits of assembling large DNA databases of the populace for the purpose of solving crimes.

The *L.A. Times* has a story this weekend on why that creates some problems that might not be readily apparent.

The main problem is that the odds of a false match increase exponentially when you’re running a DNA sample against a database of hundreds of thousands of people (in Britain, the number is well into the millions)–a Bayes’ Theorem problem. The problem is exacerbated when you’re dealing with degraded DNA from old “cold cases,” where you have even fewer markers than in well-preserved DNA samples.

Let’s say the U.S. adopts a Great Britain-style policy on collecting DNA–basically a move toward, at some point in the future, having DNA on file for everyone in the country. Now the 1 in 1.1 million odds against the suspect in the L.A. Times case are being run against a database of 380 million people. The numbers say that you’re going to pull up about 345 matches in the U.S. alone. In the California case, the database is obviously much smaller than the entire U.S. population, and only one of those 345 people showed up from the 330,000-person FBI DNA database–the (admittedly unsympathetic) subject of the article. But any of the other 344 potential matches in the U.S. (or the 2,200 matches worldwide) could have committed the crime. They just weren’t in the database.
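As a back-of-the-envelope check, the match counts above follow from multiplying the random match probability by the pool size; a minimal sketch using the post’s own figures (1 in 1.1 million, 380 million people, the 330,000-person database):

```python
# Expected coincidental matches: pool size times random match
# probability. All figures are the ones used in the post.
rmp = 1 / 1.1e6                  # 1-in-1.1-million random match probability

expected_us = 380e6 * rmp        # matches expected across 380M people
expected_db = 330_000 * rmp      # matches expected in the FBI database

print(round(expected_us))        # ~345, as the post says
print(round(expected_db, 2))     # ~0.3 expected hits in the database
```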

DNA database searches are an excellent starting point for law enforcement. But given the odds of false matches when running DNA against an extensive database, we should be very careful about shifting the burden of proof onto those who match to prove their innocence. It’s also unfortunate that the judge in the case profiled in the *L.A. Times* would only allow the prosecution’s miscalculated 1 in 1.1 million chance of a false match into evidence, and not the more statistically sound 1 in 3. Even if one were to accept the idea that the scientific community is divided over the proper way to calculate the possibility of a false match (and I’m not convinced there’s really that much of a debate), you’d think a judge would at least allow the jury to be made aware of that division of opinion, and of the fact that there are serious statisticians and scientists who would put the chance of a false match much, much higher than the figure suggested by the prosecutors in the case.

Two very good articles. I’m actually surprised at how well written the L.A. Times story was. It was full of information that a juror/potential juror should know but won’t be told during an actual trial. Thanks for the links.

In regards to Chapman’s article, I don’t think the government has any right to keep ANY identifying evidence on someone arrested but not convicted of a crime.

From DNA of Family, a Tool to Make Arrests

Privacy Advocates Say the Emerging Practice Turns Relatives Into Genetic Informants

http://www.washingtonpost.com/wp-dyn/content/article/2008/04/20/AR2008042002388.html

“The main problem is that the odds of a false match increase exponentially when you’re running a DNA sample against a database of hundreds of thousands of people (in Britain, the number is well into the millions)–a Bayes’ Theorem problem.”

I don’t claim to be a statistics expert, but I have posted on this as well, and I cite people who are statistics experts (including the head of the statistics department at Oxford) who disagree.

Let me throw out an example for you to illustrate the counterargument in a simple form. Assume that we have a DNA match, and the random match probability is “1 in x” where x is a number that exceeds the population of the Earth. Now assume that you have a database consisting of the population of the Earth — and you get one hit.

Common sense says that you are virtually certain that your one hit is a match to the person who left the DNA profile at the crime scene.

But judging from your statement quoted above, the chances of a *false* match increase exponentially the larger the database becomes. Presumably, with a database of the entire world population, the chances of a false match approach certainty, if one were to accept your quoted statement.

That can’t be.

These examples show that it makes no logical sense to say that the value of the evidence declines as small databases get larger, but increases greatly as they get bigger still, until the value approaches certainty with a fully comprehensive database.

The larger the database, the larger the chance of *a* match. But once you have a match, the original chances of that match being a match are unaffected by the size of the database — just as a room with 10,000 randomly tossed coins is virtually certain to have one that came up heads, but the chance of any *one* coin coming up heads is always 50%.
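The coin analogy above is easy to verify numerically; a minimal sketch:

```python
# With 10,000 fair coins, at least one head is a near-certainty,
# yet any individual coin is still exactly 50/50.
p_single = 0.5                        # chance any one coin shows heads
p_at_least_one = 1 - p_single ** 10_000

print(p_at_least_one)  # 1.0 to within floating-point precision
print(p_single)        # 0.5, unchanged by the size of the room
```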

I wrote a law professor about this who sent me an upcoming law review article about it that he has allowed me to quote. Here is a paragraph I found particularly illuminating:

“We can approach this question in two steps. First, we consider what the import of the DNA evidence would be if it consisted only of the one match between the defendant’s DNA and the crime-scene sample (because he was the only person tested). Then, we compare the impact of the match when the data from the trawl [of the database] are added to give the full picture. . . . In the database trawl case . . . [i]f anything, the omitted evidence makes it more probable that the defendant is the source. On reflection, this result is entirely natural. When there is a trawl, the DNA evidence is more complete. It includes not only the fact that the defendant matches, but also the fact that other people were tested and did not match. The more people who are excluded, the more probable it is that any one of the remaining individuals — including the defendant — is the source. Compared to testing only the defendant, trawling therefore increases the probability that the defendant is the source. A database search is more probative than a single-suspect search.”

Patterico,

It is true that Balko overstates the rate at which the error would grow: the growth would be linear, not exponential. The only things that would push it beyond linear would be incompetence, human error and corruption among those entering the data, barring some weird flaw in the code.

That said, a national database is probably not a very good idea until there are effective procedures in place to correct mistakes and to ensure that people take the proper care to insert data. Sex offender registries already show the potential of what can happen when the government keeps a lot of data, but doesn’t ensure that it is valid and accurate, not to mention have clear procedures for quickly correcting mistakes.

Of course, the elephant in the room here is just how effective the DNA biometrics software really is. Having worked a few government contracts myself, I can tell you from personal experience that the government frequently allows boneheaded choices to go through with the way that COTS (commercial, off-the-shelf) software is chosen for their systems. Before something of this level of impact on the criminal justice system goes through, there really needs to be a thorough investigation of the capabilities of all of the products in the field, one that is open to the public, and the government needs to go with the best, not the cheapest.

One other thing: this national DNA database will most likely (very, very high probability) be built with COTS, not custom code by high-priced experts, so you really do have to worry about the quality of the state of the art in DNA biometrics software available right now.

As my wife pointed out, even linear growth may very well be unacceptable.

//Common sense says that you are virtually certain that your one hit is a match to the person who left the DNA profile at the crime scene.//

“Common” sense may tell you that, but that doesn’t mean it’s right.

Suppose a million people choose numbers from 0 to 1,000,000 (i.e. 1,000,001 possibilities) with uniform probability. I know that Fred chose the number 123,456. How likely is it that Fred is the only person to have chosen that number? You’re saying that since the number of possibilities exceeds the number of people it should be pretty likely. It isn’t.

The probability that nobody other than Fred choose the number is (1,000,000/1,000,001)^999,999 which is about 37%; the probability of a duplicate is 63%. If there’s one other person and Fred was arbitrarily chosen because he happens to match, that means there’s about a 32% chance he was the wrong choice. I’d call that pretty ‘reasonable doubt’.

Even upping the number of possibilities to 2,000,000 doesn’t make things clear-cut. In that case, the probability that Fred is unique is (1,999,999/2,000,000)^999,999, or about 61%. So about a 20% chance of misidentification. For that matter, even 5,000,000 possibilities won’t make things perfect. That yields an 82% probability of uniqueness, or a 9% chance of a misidentification.
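For what it’s worth, the figures in the two comments above reproduce in a few lines. The misidentification numbers use the commenter’s simplification that a non-unique Fred means exactly one other match, chosen against 50/50:

```python
# Each of the 999,999 other people independently misses Fred's
# number with probability (x - 1)/x, where x is the number of
# possibilities.
def p_unique(x, others=999_999):
    """Probability that no one besides Fred picked his number."""
    return ((x - 1) / x) ** others

def p_misidentified(x):
    """Rough chance Fred is the wrong pick, approximating the
    duplicate case as one other match chosen at 50/50."""
    return (1 - p_unique(x)) / 2

for x in (1_000_001, 2_000_000, 5_000_000):
    print(x, round(p_unique(x), 2), round(p_misidentified(x), 2))
    # 1,000,001 -> ~0.37 unique, ~0.32 misidentified
    # 2,000,000 -> ~0.61 unique, ~0.20 misidentified
    # 5,000,000 -> ~0.82 unique, ~0.09 misidentified
```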

The probative value of a ‘hot match’, in a case where the suspect is identified BEFORE the database search, may be somewhat increased by the size of the database (assuming the match is unique within the database, of course). On the other hand, the probative value of a ‘cold match’, when the suspect is identified BY the database search, is reduced as the database becomes more inclusive.

supercat,

You say:

“The probative value of a ‘hot match’, in a case where the suspect is identified BEFORE the database search, may be somewhat increased by the size of the database (assuming the match is unique within the database, of course). On the other hand, the probative value of a ‘cold match’, when the suspect is identified BY the database search, is reduced as the database becomes more inclusive.”

I think it’s helpful to use mathematical terms when you’re discussing math, rather than using terms like “probative value.” When you talk about the “probative value” (or when, as the LAT does, you talk about matches hitting an “innocent person”) it confuses things. Because, as Radley would agree, innocence or guilt depends upon factors other than a DNA match — it depends on what the circumstances of that match are. Similarly, the “probative value” of evidence is also a subjective determination that hinges on all sorts of extraneous information.

But again, under any definition of “probative value,” I can’t agree with this: “the probative value of a ‘cold match’, when the suspect is identified BY the database search, is reduced as the database becomes more inclusive.” So if the database includes the entire population of the Earth, and you get one match, by your logic the “probative value” is at its LOWEST, when common sense tells you it is actually at its HIGHEST.

“Suppose a million people choose numbers from 0 to 1,000,000 (i.e. 1,000,001 possibilities) with uniform probability. I know that Fred chose the number 123,456. How likely is it that Fred is the only person to have chosen that number? You’re saying that since the number of possibilities exceeds the number of people it should be pretty likely.”

When did I say that? My example set forth a situation where the database consisted of the entire Earth’s population, and you got one hit, with a random match probability of “1 in x” where x is a number greatly exceeding the population of the Earth.

That’s not the same as what you characterized me as saying.

I think that all of this will be moot in the near (5-10 yrs) future. We won’t have DNA profiles of suspects. We’ll have complete DNA sequences, HGP style. Every one of 2.5 billion nucleotides will be known in the suspect’s DNA and the crime scene samples (if the evidence is freshly collected).

supercat,

Suppose a million people choose numbers from 0 to 1,000,000 (i.e. 1,000,001 possibilities) with uniform probability. I know the murderer chose 123,456, and I know he is among the million people. I now examine a database consisting of all million people to see if anyone chose the number 123,456, and I learn that one and only one person did: Fred. Common sense tells me Fred is the murderer.

THAT is the proper restatement of my analogy, not the one you were making. I was talking about a database of all 6.7 billion people (i.e. everyone in the world) yielding one hit, when we know the donor is one of those 6.7 billion people. Common sense says we have the guy. But the “chances of a single hit being a false positive increase as the database expands” theory advocated by you, Balko, and the LAT says the opposite: that it is virtually certain we *don’t* have the guy.

I see a lot of arguments being used that are based on probability. Whereas these are all more or less valid, I would like to point out that there are three parts to finding a suspect: he has to have a reason to commit the crime, an opportunity, and a means. This will further diminish the pool of candidates.

For the record, though: I am totally against a DB like that. I just feel uneasy about the government collecting too much data.

Patterico –

I’d agree with you on this point: If other evidence in a case points to a particular suspect, and you then run DNA taken from the victim against a database, and it returns your guy as a match, that’s a pretty good indication that you have the right person. And the bigger the database, the more confident you can be that you have your man.

The problem arises when you start running DNA taken from a victim against a database *cold* (as was the case in the L.A. Times story), and then *begin* your investigation based on the results. When you start doing that, you’re going to get false positives, and the bigger your database, the better your odds of getting false positives. Once you have your results, you also then run the risk of an investigation tainted by confirmation bias.

Look at it this way: If the odds of someone having the same set of markers as a given piece of DNA evidence are 150,000 to 1, and you run it against a database of 300,000 people, about half the time you’re going to get *two* matches–and at least one of them has to be innocent. Of course, there’s also a decent chance they could *both* be innocent, and the person who actually committed the crime isn’t in your database.

“the bigger your database, the better your odds of getting false positives.”

I’ve been troubled by this argument all morning. The odds of a false positive don’t change, just the odds of a hit in the database. I have an inkling that the smaller database would cause more problems for investigators. Wouldn’t getting one hit be much more likely to corrupt an investigation than getting three hundred?
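A sketch of the arithmetic in the exchange above, treating coincidental hits in the database as approximately Poisson with mean n·p (the Poisson approximation is my assumption, not the commenters’):

```python
import math

# Figures from the comment: 150,000-to-1 marker odds against a
# 300,000-person database.
p = 1 / 150_000
n = 300_000
lam = n * p                                   # expected coincidental matches

p_none = math.exp(-lam)                       # no coincidental hit at all
p_two_plus = 1 - math.exp(-lam) * (1 + lam)   # two or more coincidental hits

print(round(lam, 1))          # 2.0 expected matches
print(round(p_two_plus, 2))   # ~0.59: two-plus hits really is the common case
```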

You are not going to go anywhere, now, with fighting the sufficiency of the database and the validity of the markers in your own individual case. The standard is “a reasonable degree of scientific certainty.” You can have a battle of the experts, but it will be about methodology and interpretation, not about the established validity of the science. That might come down the road when the science changes. The odds are that your expert will be made to seem a total kook should he try to challenge the established science.

Radley,

You say:

“Once you have your results, you also then run the risk of an investigation tainted by confirmation bias.”

That is true, but I didn’t think it’s what the article was addressing. There are two issues here: 1) the statistical relevance of the DNA data, and 2) the reliability of any corroborating evidence. You make the argument that #2, the corroborating evidence, could be tainted by confirmation bias, and I understand the argument. But that shouldn’t change the question of the relevance of #1: the statistical relevance of the DNA data. And that’s what I thought the LAT article was about: what are juries being told about the statistical relevance of the DNA data?

Prof. Kaye, whom I quote above, puts it this way:

“Consider two cases. In Case I, the defendant was identified through a trawl, and further investigation produced confirmatory evidence. In Case II, the confirmatory evidence was known at the outset, making the defendant a suspect. The police did not bother to secure a DNA sample from him, however, because they knew that his thirteen-locus STR genotype was already included in the state’s convicted-offender database. To be on the safe side, rather than just compare the crime-scene genotypes to the defendant’s record in the database, they ordered a full search through the database. This search showed that the defendant matched and that no one else did. The only difference between the package of evidence in the two cases is the order in which it was uncovered. In Case I, the police trawled, then “confirmed.” In Case II, they “confirmed,” then trawled. It is hard to see why the evidence in Case I would be any less persuasive than that in Case II.”

Now, it seems to me that this conclusion can be attacked with respect to the confirmatory evidence, which *could* arguably be tainted by confirmation bias, depending on what the evidence is. (Some evidence is more susceptible to confirmation bias than other evidence.) But if you look *only* at the issue of what you know statistically speaking based on the match, it’s the same in both cases. Isn’t that clearly true?

Patterico,

Yes, in the particular circumstances of your hypothetical scenario, you would have very high confidence that you had the right guy. However, the higher the individual probability of a match, the more unlikely it is you actually would get <= 1 match in the entire world. If the individual probability is high enough, as in the case described in the L.A. Times article, indeed you do get less confidence as the database grows and becomes more representative of the general population (as opposed to the criminal population).

That’s actually an argument that database hits have *more* evidentiary value, because they eliminate a large number of suspects (criminals in the database) who otherwise would be more likely than members of the general public to commit crimes.

I’d have to think about that one more before reaching a conclusion as to how I feel about that argument, but I don’t think it affects the LAT article’s conclusions.

(Patterico) You specified that the “1 in x” exceeded the Earth’s population, but you did not indicate whether it exceeded it by a factor of 1.01, two, five, ten, or a million. Having the “x” exceed the population by a factor of five may seem like a large factor, but it’s not enough to make duplicates unlikely. If “x” is sufficiently large, duplicates become unlikely, but it has to become really big before they cease to be a factor.

Also, I tend to think confirmation bias is often insufficiently acknowledged. If police set out to “prove Mr. Z committed a crime” rather than “find out who committed a crime”, the resulting mindset may lead them to ignore evidence that would point to someone else.

(Jeff) The probability that a hit in the database is a false positive is affected by whether the database contains people who are more or less likely to have committed the crime. If I had a database which contained only the DNA of convicted-but-escaped murder-rapists who were not in prison two weeks ago, and I get a cold match in that database with DNA evidence from a bunch of rape-murders that occurred within the past couple weeks, I’d say the probability of a false positive is not terribly high. Conversely, if I were to get a cold match in a database consisting only of people who had never been convicted of any crime, I’d regard the probability of a false positive to be much higher.

You have two effects working in opposite directions. You have the effect of more and more of the population being included as the data base grows. You also have the probability of a false positive increasing as the data base grows.

Let N be our population size. Let n be our data base size. Let p be the probability of a random match. Let m be the number of matches found in the data base.

Now our expected total number of matches is m + (N-n)p. The expected value of m is np.

If m is 1, then the probability of it being a correct match is 1/(1 + (N-n)p). If it is the only match found in the data base, then the probability of it being a correct match does go up as the proportion of the population covered by the data base increases. Note, however, that this finite-population benefit is only strong when the data base covers most of the population. It is the fraction not covered that matters. Note also that if we would normally expect more than one match in the data base and by chance we have only one, the probability of it being a correct match will be lower than you might expect.

If there is more than one match found, the probability of any particular match being a correct match is 1/(m + (N-n)p). Since the expected value of m is np, the chance of getting a false match increases with the data base size. The benefit of increasing the data base size comes from reducing the term (N-n)p above. That is, the benefit depends on the probability of a mismatch being small or the data base covering most of the population.

I think the danger of large data bases comes in when p is not so small. In these cases it can be because the DNA has degraded and so there are fewer markers available. Alternatively, it can be because there are correlations between markers in the population, increasing the chances of patterns being duplicated. This will happen with relatives.
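The 1/(m + (N-n)p) formula above is easy to put into code. The 300-million population figure below is my own illustrative assumption, not a number from the thread:

```python
# N: total population, n: database size, p: random match
# probability, m: matches found in the database.
def p_correct(m, N, n, p):
    """Chance that one particular database match is the true
    source, per the 1/(m + (N - n)p) formula in the comment."""
    return 1 / (m + (N - n) * p)

# Full coverage: one hit in a database of everyone -> certainty.
print(p_correct(1, 6.7e9, 6.7e9, 1 / 1.1e6))   # 1.0

# One hit in a 330,000-person database, 1-in-1.1-million match
# probability, assumed 300-million-person population:
print(round(p_correct(1, 300e6, 330_000, 1 / 1.1e6), 4))  # ~0.0037
```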

You can calculate all of the odds you want, but in my opinion you miss the most important factors. “Government-run computer database” says it all.

Name one thing that the government runs competently.

When it comes to the computer part, I have two things to say: Microsoft operating systems and the Intel Pentium floating-point error.

As for the database part, look at the number of errors in the credit companies’ databases, look at the No-Fly database, and look at the number of lost or stolen government computers.

There was a case somewhat like this a few years ago in Michigan. A college co-ed had been raped and murdered in the early ’70s (might have been the late ’60s). Running the rape-kit semen sample through the database returned a 100% hit on a man who turned out to have been 4 years old at the time, so it could not have been him. Resubmitting the sample, they got another 100% hit, and since this man was an adult at the time of the murder, he was charged and convicted of her murder. Police, prosecutors and jurors paid no attention to the fact that if the first “cold hit” couldn’t have been correct, maybe the second one was wrong too. They explained the first away as a mistake, but the second had to have been true because he was an adult, and could not account for his time 30 years earlier.

You’re right, you are not an expert in statistics. You need to put the problem in terms of Bayes’ theorem, as Radley noted; failure to do so means you are taking the wrong approach.

Yes, that is fine as far as hypotheticals go, but the problem is that in reality you’ll often get lots of duplicates, which is what supercat is saying. As such, you have a list of people who chose the number 123,456, but you don’t know which of them is the murderer. The chance of choosing that number randomly is 1 in 1,000,000, but you can’t just tell the jury that. You need to tell the jury just exactly how many people you got on that list and why you singled out the person you are charging with the crime. Failure to do so means you aren’t doing your job.

Think of it this way: get 30 to 50 people together, and chances are that at least two of them share the same birthday. Is there something of cosmic significance there? No, just plain old boring statistics. The same thing is at work here: the larger the number of people, even with low probabilities, the higher the likelihood of a match.
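The birthday figures check out; a small sketch of the standard calculation:

```python
# Birthday problem: probability that at least two of n people
# share a birthday, assuming 365 equally likely birthdays.
def p_shared_birthday(n, days=365):
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (days - k) / days
    return 1 - p_all_distinct

print(round(p_shared_birthday(23), 2))  # ~0.51: better than even at 23
print(round(p_shared_birthday(30), 2))  # ~0.71
print(round(p_shared_birthday(50), 2))  # ~0.97
```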

Hmmm, this,

Should read as:

supercat says:

“(Patterico) You specified that the “1 in x” exceeded the Earth’s population, but you did not indicate whether it exceeded it by a factor of 1.01, two, five, ten, or a million. Having the “x” exceed the population by a factor of five may seem like a large factor, but it’s not enough to make duplicates unlikely. If “x” is sufficiently large, duplicates become unlikely, but it has to become really big before they cease to be a factor.”

Except that my hypothetical posited that there are no duplicates because we have a database of the whole world and we got only one match.

Steve Verdon quotes me as saying:

“Common sense says that you are virtually certain that your one hit is a match to the person who left the DNA profile at the crime scene.”

and responds:

“You’re right, you are not an expert in statistics. You need to put the problem in terms of bayes theorem, as Radley noted, failure to do so means you are taking the wrong approach.”

If you’re going to use an insulting tone like that, Steve Verdon, you’d think you’d take better care to be sure you’re right. Please go back and re-read my example, which contains the sentence: “Now assume that you have a database consisting of the population of the Earth — and you get one hit.”

Now explain to me how, absent human error, you can have a database consisting of the population of the world, get only one hit, and not have the right person.

I’ll be right here waiting.

And no, you don’t get to cite human error in your response because your snarky comment was an argument about statistics and not the possibility of human error. So any reference to human error is a copout.

Should you choose to admit that you didn’t read my example correctly, the polite thing to do would be to apologize for your sarcastic tone when the mistake was yours.

Perhaps your point was that it doesn’t really matter in my example what the RMP is — and that’s true. I just tried to pick a large number — the population of the Earth — so that I could make it clear that we’re not talking about RMP’s like 1 in a million. That way, when I say you only get 1 hit out of 6.7 billion, people won’t carp that my example is extraordinarily unlikely given the RMP posited.

But we are not talking about the cases where we would expect only one match in the world. The main concern is cold searches using degraded DNA. Here we have much higher match probabilities, and if we are searching a large data base then we can easily have false positives occurring by chance. If there is a 1 in 1.1 million probability of a match occurring by chance and you search a data base with 330,000 observations, none of whom are the culprit, then you have nearly a 26% chance of having at least one match occurring by chance. Allowing for multiple false positives, the expected number of false positives is 0.3. The figure the jury should have been given is 26%; 1 in 1.1 million is misleading.
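The 26% and 0.3 figures can be verified directly:

```python
# Chance of at least one coincidental hit when trawling a data
# base of innocent profiles. Figures from the comment above.
p = 1 / 1.1e6        # random match probability
n = 330_000          # data base size

p_at_least_one = 1 - (1 - p) ** n
expected_hits = n * p

print(round(p_at_least_one, 2))  # 0.26
print(round(expected_hits, 2))   # 0.3
```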

“the figure the jury should have been given is 26%”

26% what?

Note that your hypo assumes that “you search a data base with 330,000 observations none of whom are the culprit.” Which you don’t know.

And the jury heard nothing about a database anyway.

This all strikes me as similar to telling juries “x% of people who are arrested are later convicted/acquitted.” Juries are to judge each case on its facts and ignore how the person got in front of them unless that is otherwise relevant to innocence or guilt.

Patterico,

The point is that it does matter in these cases how the evidence was obtained. The example given above is one which gives a 26% chance of the data base giving at least one match to an innocent person. That it was obtained by such a search is extremely relevant in this case and the jury should have been told this.

There were probably others not covered by the data base who would have matched the profile. The jury should have been told this.

I’m a statistician, and I work on high-throughput data, including genetic microarray data. In this sort of data, false positives are a serious concern. The 1 in 1.1 million figure is highly misleading under the circumstances. 26%, and how it was derived, are what they should have been given.

“The example given above is one which gives a 26% chance of the data base giving at least one match to an innocent person.”

Assuming the database consists entirely of innocent people.

But why would you make that assumption?

Patterico,

Your example is actually irrelevant in that it:

1. Assumes away the problem.

2. Is taking precisely the polar opposite of the problem noted in the article.

Assuming a unique match in a database that covers everyone is the one case where there is no problem at all. The only logical conclusion is that the person is guilty. This basically sidesteps the issue. The recommendations in the article are for instances when the database is much smaller than total coverage and the match is less than perfect.

When you have two hits from a database, then you have a problem. If the crime in question was indeed committed by one person it must be that one is guilty and one is innocent. Therefore, it is relevant to condition on all the information contrary to what you are claiming. As such telling the jury about 1 in a million chances, and leaving out the 50-50 portion of the information is deceptive. In fact, I’d say such a prosecutor should be punished in some way.

Oh, and your source, Prof. Kaye, himself admits that as a database grows larger, it will lead to more false positives. Your examples of perfect matches and population-wide databases aren’t interesting in the context of this problem…because they aren’t a problem.

As for your specific example, there is the problem of the exact size of x. As has already been pointed out, even if x is very large, there can still be a problem with more than one hit. You want to restrict it to a very special and trivial case. Fine, but your logic only works in that case. A trivial result like that is…well, boring and not very useful.

No such assumption was made. If we assume there is only one guilty person, then the data base still had 330,000 innocent people. Even if there was a guilty person in the data base, there is still a 26% chance of there being a match to an innocent person in addition to the match to the guilty person. There will probably be several people not in the data base whose DNA is a match but who of course were not looked at.

A data base search like this is sufficient reason to suspect and investigate someone. It is not reason to convict. If the DNA evidence is presented in court how it was obtained should be as well and the appropriate probabilities. And those doing the investigation should be on the lookout for confirmation bias on their own part.

That’s about what I expected, Verdon. You falsely and snidely claimed my example would not result in getting the right person. When I point out that you got it wrong, and that it would indeed result in getting the right person — and that you were a jerk about it to boot, even as you were FLATLY wrong — I get no apology. No acknowledgement that you got it totally, 100 percent backwards.

Just your usual bluster and subject-changing. You simply pretend you were really trying to make a different point. And you hope nobody will notice the goalposts having been moved all the way down the field.

“No such assumption was made. If we assume there is only one guilty person, then the data base still had 330,000 innocent people. Even if there was a guilty person in the data base, there is still a 26% chance of there being a match to an innocent person in addition to the match to the guilty person.”

Mr. Statistician,

Does it change the percentages if the guilty person is in the database?

Noodle on that and get back to me.

But whether the guilty person is in the data base is precisely what you do not know.

Having the guilty person in the data base does not influence the chance of having a false positive there as well.

You have to consider the population outside the data base as well. If there is one hit in the data base and an estimated 9 matches in the population outside the data base, then the chance that the match represents the guilty person is 1 in 10.

“Having the guilty person in the data base does not influence the chance of having a false positive there as well.”

Really?

It’s a slight simplification, but really. If you know the guilty person is in the database, you can exclude him from it, and calculate the probability that an innocent person is in the other (N-1). For large N (say, greater than 100), this is virtually the same as the original chance.

I take it that when you say it’s a simplification, you actually mean “*not* really, but close.”

Yes, and I can quantify how close: to within one part in 330,000.

“He’s six feet tall.”

“Really?”

“Well, actually, six feet and 0.0002 inches.”

The difference is smaller than the margin of error on either calculation from not knowing the exact number of people in the database.
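A quick numerical check of this claim, using the thread's figures (a 1-in-1.1-million match probability and a 330,000-person database):

```python
# Checking the claim numerically with the thread's figures: excluding one
# known-guilty person from a 330,000-person database barely changes the
# probability of a false positive among the rest.
p = 1 / 1_100_000   # random match probability per innocent person
N = 330_000

p_full = 1 - (1 - p) ** N          # >=1 innocent match among all N
p_minus = 1 - (1 - p) ** (N - 1)   # same, with the guilty person excluded

print(p_full - p_minus)            # ~6.7e-7: negligible
```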

“If you know the guilty person is in the database, you can exclude him from it, and calculate the probability that an innocent person is in the other (N-1).”

How is that different from me saying that you have to start with a database of known innocent people to know the chances of a *false* positive? Your method gets you to the same place, by removing the guilty person.

You’re still assuming you know who the guilty person is, and the whole point is that in real life we don’t know that. So when we get a result we don’t know the chances that *that* result is a false positive — and *that’s* the question juries are interested in.

My analysis, which treats each of those cases separately, shows that whether the guilty person is in the database or not, the chance of a match to an innocent person in the database is roughly the same.

No, we are not. Aaron and I are both saying the same thing. If you add the guilty person to the database, it does not affect the chance of getting a false positive as well. In the example above, the expected number of matches is 1.3 if the guilty person is in the database and 0.3 if the guilty person is not.

OK, I get what you are saying. As you folks have worded your comments, I believe you are correct. But the issue relevant to the LAT article is: what is the significance of a *single* hit in a database? And the answer is: you can’t talk about the probability of a false positive unless you *assume* that the guilty person is *not* in the database.

Are we agreed on that?

Although certainly, not knowing whether the guilty person is there or not, a jury that heard there had been a database hit could be told this: “even in a situation where we assume that everyone in the database is innocent, there’s a 1/3 chance of a hit, which would be a false positive given the assumption.”

But in a case with only one hit, you can’t tell the jury there’s a 1/3 chance of a *false* positive because you’re thereby telling them there’s a 1/3 chance the guy sitting in front of them is not the donor — and you can’t know that unless you assume up front that the donor wasn’t in the database.

In which case it’s 100% that he’s not the donor.

If you have only one match in the database, what is the probability that it is the guilty person? Let pf be the probability of there being one false match in the database (in the case above this will be about 22%). Let pg be the probability that the guilty person is in the database. This is the hard one to estimate. If people were randomly placed in the database, then pg = n/N, where n is the database size and N is the population size. Since the database is largely composed of previous offenders, pg will actually be somewhat larger than this. But how much larger?

The probability that the match is the guilty person is pg x (1 − pf) / (pg x (1 − pf) + (1 − pg) x pf). Now let us assume pf = 22%. If pg is 50%, then the probability that we have the guilty person is 78%. If pg is 10%, the probability of guilt is 28%. If pg is 1%, then the probability that we have the guilty person is 3%.

Now, you are not supposed to use the criminal proclivities of a person as evidence, so by the rules of evidence you should use pg = n/N. If you disagree with this, so do I. However, the point is that even with realistically high probabilities of the database including the culprit, the DNA evidence might still be far from conclusive.
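The posterior formula above can be sketched as a small function; the pf = 22% figure and the pg values are the ones used in the comment:

```python
# The posterior formula from the comment, as a small function. pf is the
# chance of a false match in the database (~22% here) and pg is the prior
# chance that the guilty person is in the database at all.
def p_guilty_given_single_match(pg, pf=0.22):
    # Bayes' rule over the two ways to get exactly one match: a true hit
    # with no false hit, versus a false hit with no true hit.
    return pg * (1 - pf) / (pg * (1 - pf) + (1 - pg) * pf)

for pg in (0.50, 0.10, 0.01):
    print(pg, round(p_guilty_given_single_match(pg), 2))
# pg = 50% gives 0.78, pg = 10% gives 0.28, pg = 1% gives 0.03,
# matching the comment's figures.
```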

I don’t know. I’m not sure it’s possible to tell juries anything meaningful about the “probability” that the true donor is really in the database.

You would have to use the cautious estimates in this case. You would have to use n/N. The problem with an old case is that N, the population that potentially could be in the database, will be large, because people could have moved anywhere in the country. You would have to rely on the other evidence that turned up in the investigation. These cold database hits might only give probable cause. You have to be aware that this might be a false lead.

You cannot give meaningful estimates of the probability that the match is the culprit without giving an estimate of the probability of the culprit being in the database. You can reliably place a floor under this probability; however, this floor could, and usually will, be fairly low. At least, a floor that you could defend in court is likely to be.

I don’t think it’s an either-or. One measures the odds that *someone* will be falsely identified, while the other measures the odds that any particular individual will. Which one is relevant depends on other factors. If the only reason to suspect Puckett is the partial DNA match, and we can only assume that the odds are roughly 50-50 that the true killer was in the database to begin with, then there’s about a 1 in 3 chance that they fingered the wrong guy. But if there’s enough other evidence out there to make him the chief suspect, as the story indicates, then it’s fair to say that the odds of the partial DNA match falsely pointing to the *same* individual are a million to one.

Yes, we have no way to compute the probability that the true assailant is in the offender database. So we have to make one of two mutually exclusive assumptions: (1) the true assailant is NOT in the offender database, which then predicts that there is approximately a 1 in 3 chance the evidence DNA profile will match someone who is innocent. Alternatively, (2) the true assailant is in the offender database. Under this second assumption there is a 100% certainty that his profile will be found to match the evidence.

Since the database represents a relatively small sample of the general population, it seems to me that assumption (1) is more likely than assumption (2). The 1 in 3 chance is the result of the limited partial evidence DNA profile; there is simply not enough genetic information available to achieve maximum discrimination. If there were a full evidence profile, the chance could be 1 in trillions that the evidence profile would match someone who is innocent. That probability would be sufficient to believe the defendant is guilty. Whereas a probability of 1/3 that the evidence profile could originate from an innocent, falsely accused person leaves too much doubt that the defendant is guilty. Without any further evidence I would vote not guilty.

I just had a long conversation with Patterico on the topic, and it occurred to me that we should be able to compute a reasonable estimate of the odds of the killer being in the database, in one of two ways:

1. Compute the percentage of rapist-murderers who have prior convictions for sex offenses that would be likely to land them in the sex offender database.

2. Compute the percentage of cold database searches that result in 2 or more hits. Assuming a 1 in 3 chance of one false positive, a 1 in 9 chance of two, and so on, we should be able to compare the expected number of false hits to the number of total hits, and extrapolate that the difference represents the number of true hits.

Once we know the relative odds that a true hit is in the database, we should be able to say with a good deal of confidence how reliable a single cold hit is.
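Idea (2) could be sketched like this; every count below is invented purely for illustration, since the thread gives no real tallies of cold searches or hits:

```python
# A sketch of idea (2). Every count below is invented for illustration;
# the thread gives no real tallies of cold searches or hits.
searches = 900                 # hypothetical number of cold-case searches
total_hits = 500               # hypothetical total hits those searches produced
false_hits_per_search = 1 / 3  # expected false hits per search, per the thread

expected_false_hits = searches * false_hits_per_search   # 300
estimated_true_hits = total_hits - expected_false_hits   # ~200 true hits
p_true_hit_per_search = estimated_true_hits / searches   # ~0.22

print(round(p_true_hit_per_search, 2))   # 0.22
```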

Xrlq,

In principle both your ideas are good. It would however be more difficult to calculate these probabilities than you might think. The problem is mostly inhomogeneities in the data.

In the first approach, different groups of criminals might have different probabilities of being in the database. Still, the group of interest might be homogeneous enough for this to work. However, we have no way of knowing the size of the group that never gets caught.

In the second case the probability of a false positive will vary from case to case. This could make the calculations messy.

Still it’s worth trying.