Well, the “blogosphere” (I hate using that word) and various online communities are abuzz with the news that AOL Research just released 20 million search queries from some 650,000 users, covering a three-month period from March to May of 2006. Though the data was released to provide “a real query log” to aid search engine research, the release constitutes a huge violation of privacy. Usernames have been removed, but they have been replaced by unique identifiers, which can be used to track an individual user’s searches, allowing information to be collected about them and a profile built.
The download was taken offline sometime yesterday (August 6th, 2006), but enough people had already downloaded it to guarantee that mirrors would quickly be set up, keeping it in circulation. While the intent behind this release was obviously not malicious, it was a poorly thought-out move that is sure to earn AOL more bad PR – especially in light of their recent troubles.
Good intent, bad execution
The dataset was first released sometime on or before August 4th, 2006, on AOL Research. Like Google, Yahoo! and Microsoft, AOL maintains a Research site to keep members of the academic and developer communities informed of what they’re up to, and to offer assistance to researchers. As Google’s cache of the former download page indicates, the intent of offering this download was to facilitate research into making search engine technologies better, by offering a look into real users’ search queries and behaviours.
Having a little bit of experience in the research field (I’m currently working as a summer undergraduate research student), I can say that it’s not unheard of to see industry researchers helping out the academic community, either through joint research projects or, as in this case, the release of datasets for testing some particular sort of algorithm.
For example, one of the projects of the researchers in my lab is speech quality assessment: developing an algorithm or system that can process a speech file and assess its quality as a human would. This has enormous benefits for telcos, since they are always looking for the speech codecs that are most efficient in terms of size and quality. In order to do this, one needs a vast quantity of voice files (and their associated Mean Opinion Scores, as rated by humans) to test against. These voice files and their data can only be obtained through trials that are costly in both time and money – we’re talking six figures and above here. Much of the data can thus only be obtained by telcos who have the money to conduct these trials. Fortunately, many of them have been gracious enough to release these datasets to our lab (and others) – but the data is only released after a strict NDA (Non-Disclosure Agreement) is signed, along with perhaps other agreements.
Some of you might be wondering why this matters at all. After all, it’s just a few million search queries that AOL users entered, containing no personally identifiable information. Well, that’s true. However, since the usernames were merely replaced by unique identifiers (e.g. a username is changed to a random, unique number, so “John Doe” is always mapped to “123456”), profiles of users’ searches can be built – opening so many cans of worms that I can’t even count that high.
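To make that concrete, here’s a minimal sketch of why a stable identifier doesn’t amount to anonymity. Everything in it is made up for illustration: the usernames, the queries, and the `pseudonymize` and `build_profiles` helpers are my own assumptions, not AOL’s actual process (and I hand out sequential numbers where AOL presumably used random ones, which makes no difference to the point).

```python
from collections import defaultdict
from itertools import count


def pseudonymize(log):
    """Replace each username with a stable numeric ID (same user -> same ID)."""
    ids = {}
    next_id = count(1)  # sequential here for simplicity; a random ID behaves the same
    out = []
    for username, query in log:
        if username not in ids:
            ids[username] = next(next_id)
        out.append((ids[username], query))
    return out


def build_profiles(pseudonymized_log):
    """Group queries by ID, which is all it takes to rebuild a per-user profile."""
    profiles = defaultdict(list)
    for anon_id, query in pseudonymized_log:
        profiles[anon_id].append(query)
    return dict(profiles)


# Hypothetical sample log: (username, query)
log = [
    ("john_doe", "pizza delivery springfield"),
    ("jane_roe", "cheap flights"),
    ("john_doe", "dr smith dentist springfield"),
    ("john_doe", "john doe high school reunion"),
]

profiles = build_profiles(pseudonymize(log))
for anon_id, queries in profiles.items():
    print(anon_id, queries)
```

The name is gone, but user 1’s queries still sit side by side, and between them they hint at a town, a dentist and (thanks to the vanity search) a name.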
Here’s a complete breakdown of what information is provided for each query:
- an anonymous user ID number.
- the query issued by the user, case shifted with most punctuation removed.
- the time at which the query was submitted for search.
- if the user clicked on a search result, the rank of the item on which they clicked is listed.
- if the user clicked on a search result, the domain portion of the URL in the clicked result is listed.
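For the programmers in the audience, one record with the five fields above could be modelled roughly like this. Treat it as a sketch under assumptions: I’m assuming one tab-separated line per query with the click fields left empty when nothing was clicked, and the field names, the `parse_line` helper and the sample lines are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryRecord:
    anon_id: int                 # anonymous user ID number
    query: str                   # query text, as logged
    query_time: str              # submission time, kept as a raw string here
    item_rank: Optional[int]     # rank of the clicked result, if any
    click_domain: Optional[str]  # domain of the clicked result, if any


def parse_line(line: str) -> QueryRecord:
    """Parse one tab-separated line; missing click fields become None."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (5 - len(fields))  # pad records that have no click
    anon_id, query, query_time, item_rank, click_domain = fields[:5]
    return QueryRecord(
        anon_id=int(anon_id),
        query=query,
        query_time=query_time,
        item_rank=int(item_rank) if item_rank else None,
        click_domain=click_domain or None,
    )


# Made-up sample lines, one with a click and one without
clicked = parse_line("123456\tcheap flights\t2006-03-01 10:15:00\t2\thttp://www.example.com")
no_click = parse_line("123456\tcheap flights\t2006-03-01 10:14:30")
print(clicked)
print(no_click)
```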
This is an absolute goldmine of information to anyone concerned with search engines, SEO or anything related to that. Not only do you get a complete profile of a user’s searches, but you get the exact time they conducted each search and which results, if any, they clicked on (and the rank of each result), along with the domain name of the site they clicked through to. Running data mining techniques on this information is fairly trivial, as it can easily be analyzed to find patterns of almost any sort.
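Just how trivial? Here’s a rough sketch using nothing but Python’s standard library. The records are made up, and the tuple layout (user ID, query, clicked rank, clicked domain) is my own simplification of the fields listed above, so read it as an illustration rather than an analysis of the real data.

```python
from collections import Counter

# Hypothetical records: (anon_id, query, item_rank, click_domain)
records = [
    (1, "cheap flights", 1, "example.com"),
    (1, "cheap flights", 3, "example.org"),
    (2, "cheap flights", 1, "example.com"),
    (2, "pizza", None, None),  # a search with no click
    (3, "pizza", 2, "example.net"),
]

# Which queries are most popular?
query_counts = Counter(query for _, query, _, _ in records)

# How do clicks spread across result ranks and domains?
clicks_by_rank = Counter(rank for _, _, rank, _ in records if rank is not None)
clicks_by_domain = Counter(dom for _, _, _, dom in records if dom is not None)

# Share of all clicks that went to the top-ranked result
total_clicks = sum(clicks_by_rank.values())
rank1_share = clicks_by_rank[1] / total_clicks

print(query_counts.most_common(1))
print(clicks_by_domain.most_common(1))
print(f"share of clicks on rank 1: {rank1_share:.0%}")
```

A handful of counters is enough to surface popular keywords, the domains that win the clicks, and how much a top ranking is worth, which is exactly what a marketer wants to know.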
Internet marketers would give an arm and a leg for this sort of data, and now they will have their hands on it – for free. In fact, the dataset has already been analyzed, with valuable results, and there are claims that Google will get spammed with made-for-AdSense sites (a clear violation of their TOS) targeting keywords for popular and profitable markets. This will further pollute search results and generate more noise, creating more headaches for search engines like Google, who will have to adapt to keep the signal-to-noise ratio from deteriorating further.
However, this dataset probably (or hopefully?) won’t have a lasting effect for spammers. As Thought Market points out, since everyone has access to this data, so does Google. And Google has proved themselves not to be slackers at developing data analysis techniques – after all, how did they become arguably the most successful search engine? You can bet they will be analyzing this dataset to see what results they can get, and hypothesizing about how these results might be used. If they see anything fishy going on, you can bet they’ll remove those sites from their index.
Justice is blind
But perhaps the most important, or most unnerving, aspect of the release of this dataset is its implications for legal matters. The data released is exactly the sort of data the US DOJ requested from the major search engines (Google, Microsoft, Yahoo! and AOL) earlier this year to “defend its argument that the Child Online Protection Act is constitutionally sound”. Microsoft, Yahoo! and AOL complied, while Google resisted the subpoena – and when the matter went to court, the judge ruled in Google’s favour.
Taken in this context, things do not look good for AOL – while Google’s fighting to protect your privacy, AOL’s actively working in the opposite direction, it seems. I realize it was not meant to be that way, but with something this serious, the effect is sometimes far more important than the intent. Looking at the raw search queries of some users reveals things ranging from embarrassing or weird searches to queries that may indicate they’re up to no good.
Let’s be clear here – simply searching for “how to kill your wife” is not tantamount to murder, nor is it even enough proof of conspiracy to commit murder. At least, that’s how I see things from my “not a lawyer” perspective. However, consider a situation where someone did kill their wife. If their search queries were made available, and this query turned up, how do you think lawmakers would respond? Or consider a sample of 1000 murderers and their search queries. If it were statistically established that a significant percentage of murderers search the Internet for some aspect of murder before the actual crime, how would lawmakers respond?
These are complex questions, and I’m not sure what I’d do with the data. Statistics is a complex topic, and the important fact is that statistics are often abused to support invalid and over-reaching actions. This obviously leads to questions and issues like “thought crime” and comparisons with Minority Report. I generally don’t like using slippery-slope arguments, but this is an area where it’s easy to see how abuse could happen, considering that the US DOJ has already expressed an interest in this sort of data, and with all the NSA wiretaps going on.
There were also many other weird, and often disgusting, search queries, often from the same set of users. I won’t repeat them here, but you can easily see for yourself what some AOL users were searching for. While these searches represent a very small percentage of all the searches conducted, they don’t paint a good picture of the Internet for lawmakers. (Keep in mind that these sorts of searches are to be expected – after all, even if the incidence rate for violent crime is something low like 27 per 100,000, that’s still 0.027%, and this search query dataset represents over 650,000 users.)
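To put a rough number on that parenthetical, here’s the arithmetic spelled out. It’s a back-of-the-envelope sketch: the 27 per 100,000 figure is only my illustrative rate from above, and applying it uniformly to the users in the dataset is an assumption.

```python
rate_per_100k = 27  # illustrative incidence rate, not a real statistic
users = 650_000     # approximate number of users in the dataset

rate = rate_per_100k / 100_000                   # rate as a fraction
expected_users = rate_per_100k * users / 100_000  # users expected at that rate

print(f"rate: {rate:.3%}")
print(f"expected users: {expected_users:.1f}")
```

Even at a rate that low, a sample this size would be expected to include well over a hundred such users, so finding some disturbing queries says very little about AOL users as a whole.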
Note that I haven’t even touched on the topic of false positives. What if someone was writing a paper on the effects of torture, or on any of the other examples of “man’s inhumanity to man”? They’d obviously have to search for some pretty disturbing stuff (I’m glad I’ve never had such a writing assignment), and then these searches would be tied together by the unique identifier. There are plenty of other examples, but the idea is that using someone’s search queries for legal purposes is an easy road to 1984 – while there obviously may be statistical connections between searches and actual crimes, the potential for “thought crime” is just too large.
Waves and reverberations
As far as I can tell, this story (and thus the wide release of the data) only started spreading sometime this weekend, probably on Friday or Saturday. But it’s already making its way around the non-traditional news outlets (blogs, social communities) like fire through a tissue factory. There’s been talk of how personally identifiable information may be available, and of the legal ramifications. It’s already in the top five among Technorati searches, so you can believe that anyone who wants this data will get it. If this doesn’t hit the mainstream news soon, I’ll be very surprised and disappointed – so far there’s only one result for it in Google News.
AOL has apparently responded to this privacy leak, in a comment available here.
Nothing really new or interesting – other than that the data represented only about 1.5% of all AOL users, and that those included were only US users who used AOL’s client software. As expected, Mr. Weinstein of AOL said that it was an “innocent enough attempt to reach out to the academic community with new research tools”, and, just as predictably, it didn’t go through official channels for approval. You can be sure that AOL’s going to change their policy on this sort of stuff – don’t expect any more releases from AOL Research for a while. (Someone, or several people, probably also got fired.)