Friday, January 01, 2010

Data Mining: The Needle in a Haystack Problem

Let us be fair to Obama. He does appear to be committed to ending his predecessor's policies of torture and indefinite detention without hearing. He has, admittedly, been rather timid and incomplete in his approach, often more interested in covering up his predecessor's crimes than rooting them out. And even this tepid approach has engendered considerable resistance, which can only escalate in the wake of the most recent attack. But there is a clear and identifiable difference here between the Obama and Bush approaches.

By contrast, Obama has not created any daylight between himself and Bush on wiretapping (as it was finally approved by Congress). Nor is that all. The Bush Administration's approach to wiretapping attracted the most attention because of its blatant illegality, but it was only a small part of a larger whole that we still know very little about, and that the Obama Administration has embraced, apparently without modification. Consider:

The Inspector General's Report on surveillance found no evidence of intentional misuse of the warrantless surveillance program (p. 13), but warned that in its current, legal form it involves "unprecedented collection activities" that must be closely monitored (p. 38). The IG report gave only the vaguest hints of what the warrantless surveillance consisted of, other than to quote NSA director Michael Hayden to the effect that the activities were "more aggressive" than FISA allowed, but "less intrusive" because the period of time was much shorter than authorized by a FISA warrant (p. 16). This appears to confirm reports by the Washington Post that computers were sifting "hundreds of thousands" of calls, faxes and e-mails into and out of the US. After various levels of screening, some agents were allowed to listen to some conversations -- about 5,000 people according to one source.

The Post article denied that any domestic calls had been subject to warrantless surveillance. But USA Today famously reported that the NSA was also keeping an immense database of all domestic phone calls, "the largest database ever assembled in the world," looking for suspicious patterns. The legality of this program has never been settled. The same goes for the less publicized Homeland Security program keeping score on international travelers to assess their threat risk.

There is no question about the legality of National Security Letters, which allow the FBI to command the production of a wide variety of information without having to resort to a subpoena, let alone a warrant. FBI use of NSL's has been extensive, with some 140,000 such letters issued from 2003 to 2006, an average of nearly 50,000 a year. Approximately half of those letters did not lead to any prosecution at all, and most others were used in immigration, money laundering or fraud cases. Very few were used to prosecute actual terrorists. The total number dropped to 16,000 once such abuses were revealed, but soon began edging up again afterward. Also legal is the ever-expanding terrorism watch list, along with the much smaller No-Fly list, which nonetheless contains many dubious entries and torments even more innocent people who happen to have the same name as a tangential terror suspect.

And then there were plans that were rejected at first, only to be adopted in other forms, such as TIPS, which sought to recruit mail carriers, meter readers, repairmen and so forth as spies and informants. Or Total Information Awareness, which was supposed to analyze patterns in everything. These did not so much disappear as mutate. These are the Bush era policies and programs that the Obama Administration is keeping intact. All fit under the broad rubric of data mining.* All seek to vacuum up huge quantities of data and analyze it for patterns indicating terrorist activity.

Data mining has its defenders. For instance, former libertarian Richard Posner argues that there is no danger to civil liberties because the initial scrutiny is done by machine, and the data are seen by human eyes (or heard by human ears) only if the program indicates a threat to national security. The only danger would be in abuse to blackmail political rivals. Others at the time of the USA Today article argued that because of the sheer volume of data, there could be no danger to privacy.

The basic problem with looking for terrorists by data mining is there just aren't that many terrorists out there. The estimated number of Al-Qaeda operatives in Yemen is 300. Another 200 are estimated to be in Pakistan. John Ashcroft's sweeping dragnet after 9-11 netted a grand total of one (Ali Saleh al-Marri). The sleeper cells predicted at the time never appeared. (I realize, of course, that Al-Qaeda is not the only terrorist organization in the world. But it is the only one that targets us). In short, we are looking for a needle in a haystack. Data mining in such an instance poses serious problems.

Security expert Bruce Schneier explains it well. When searching for a needle in a haystack, adding more "hay" does no good at all. Computers and data mining are useful only if they are looking for something relatively common compared to the database searched. For instance, out of 900 million credit cards in the US, about 1% are stolen or fraudulently used every year. One in a hundred is certainly the exception rather than the rule, but it is a common enough occurrence to be worth data mining for. By contrast, the 9-11 hijackers were a 19-man needle in a 300 million person haystack, beyond the ken of even the finest supercomputer to seek out. Even an extremely low rate of false alarms will swamp the system.
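To make the base-rate arithmetic concrete, here is a rough sketch in Python. The population figures (19 hijackers in 300 million people, 1% of 900 million credit cards) come from the paragraph above. The detection rate and false alarm rate are purely hypothetical numbers I picked for illustration, and they are deliberately generous to the data miners; nobody has published the real ones.

```python
def screening_outcome(population, true_targets, detection_rate, false_alarm_rate):
    """Return (expected true hits, expected false alarms, share of alarms that are real)."""
    innocents = population - true_targets
    true_hits = true_targets * detection_rate
    false_alarms = innocents * false_alarm_rate
    share_real = true_hits / (true_hits + false_alarms)
    return true_hits, false_alarms, share_real


# Hypothetical screener quality (assumptions, not reported figures):
# it flags 99% of real targets and wrongly flags 1 innocent in 1,000.
DETECTION_RATE = 0.99
FALSE_ALARM_RATE = 0.001

# Rare target: 19 hijackers in a population of 300 million.
terror_hits, terror_alarms, terror_share = screening_outcome(
    300_000_000, 19, DETECTION_RATE, FALSE_ALARM_RATE
)

# Common target: about 1% of 900 million credit cards.
card_hits, card_alarms, card_share = screening_outcome(
    900_000_000, 9_000_000, DETECTION_RATE, FALSE_ALARM_RATE
)

print(f"Terrorists: {terror_alarms:,.0f} false alarms, {terror_share:.4%} of alarms real")
print(f"Card fraud: {card_alarms:,.0f} false alarms, {card_share:.1%} of alarms real")
```

Under these assumed rates, the same screener that is right about nine times out of ten on credit card fraud produces roughly 300,000 false alarms while looking for 19 hijackers. The screener did not get worse; the target got rarer.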

And that does, in fact, appear to have happened. The FBI, frustrated with all the false leads generated, began referring to them as "calls to Pizza Hut." An NSA data miner acknowledged, "Frankly, we'll probably be wrong 99 percent of the time . . . but 1 percent is far better than 1 in 100 million times if you were just guessing at random."
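The data miner's two numbers can both be taken at face value and still add up to a problem. The short Python sketch below uses only the figures from the quote (right 1 percent of the time, versus 1 in 100 million for random guessing); the function name is my own invention for illustration.

```python
from fractions import Fraction


def leads_per_real_hit(share_correct):
    """How many leads must be investigated, on average, to find one real one."""
    return 1 / share_correct


data_mining = Fraction(1, 100)             # "wrong 99 percent of the time"
random_guess = Fraction(1, 100_000_000)    # "1 in 100 million"

# How much better than chance the data mining is.
improvement = data_mining / random_guess

# How many fruitless investigations accompany each real lead.
wasted_per_hit = leads_per_real_hit(data_mining) - 1

print(f"Improvement over random guessing: {int(improvement):,}x")
print(f"False leads investigated per real lead: {int(wasted_per_hit)}")
```

A million-fold improvement over chance is real, but each genuine lead still arrives with 99 calls to Pizza Hut attached.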

But there are obvious problems with generating so many false leads. The first is whether it is useful at all. The Inspector General's Report was unable to quantify its usefulness to any degree, other than to say that Hayden vouched for its usefulness and said that it would have captured two of the 9-11 hijackers. But has it thwarted any actual terrorist attacks? Most thwarted attacks have begun with a specific tip. (This is worth an entire post). Another, which Schneier focuses on, is the waste of manpower investigating false leads that might be put to other use.

But besides uselessness and the time and effort wasted on false leads, there are real civil libertarian dangers as well. Bush's defenders are quick to point out that none of the data mining led to COINTELPRO style abuses. So far as I know, this is true. But there are other kinds of dangers as well. When a system regularly generates false leads and forces police to investigate them, these fruitless investigations, too, are an infringement on the liberty of people senselessly investigated and expose everyone to the risk of such pointless investigation. Investigation of false alarms differs from COINTELPRO-style abuses in being mindless rather than malicious, but it infringes on liberty nonetheless.

The other danger is that in any police force tasked with looking for needles in a haystack, there will be strong institutional and bureaucratic pressure to find something. If no needles are turning up, the temptation will be to find a prickly piece of hay and try to convince people that it is sort of like a needle. This can, indeed, lead to COINTELPRO sorts of abuses. In Maryland, state police investigated everyone from anti-war groups to PETA to customers protesting a 72% rate increase as possible terrorists. (The article also cites abuses by city police and the FBI, but links are not functioning).

Obviously, this is not the best time politically to call for a cutback in data mining activities. Obama is already under sufficient attack for not torturing, for using civilian trials, and for releasing GTMO detainees determined not to be terrorists. For him to move away from data mining now would lead to a wingnut feeding frenzy. But what we need is not any more information to swamp the system, but better analysis of what we already have. And no ridiculous rules against leaving your seats for the last hour of flight.

__________________________________________
*Actually, NSL's quite probably are not a data mining tool so much as a method of streamlining data collection that is easy to get sloppy with and overuse.
