<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>unitstep.net &#187; search engine</title>
	<atom:link href="http://unitstep.net/blog/category/search-engine/feed/" rel="self" type="application/rss+xml" />
	<link>http://unitstep.net</link>
	<description>the home of peter chng</description>
	<lastBuildDate>Mon, 01 Mar 2010 02:28:39 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Google&#8217;s SearchWiki: Promote Search Results!</title>
		<link>http://unitstep.net/blog/2008/11/21/googles-searchwiki-promote-search-results/</link>
		<comments>http://unitstep.net/blog/2008/11/21/googles-searchwiki-promote-search-results/#comments</comments>
		<pubDate>Sat, 22 Nov 2008 01:13:01 +0000</pubDate>
		<dc:creator>Peter Chng</dc:creator>
				<category><![CDATA[google]]></category>
		<category><![CDATA[search engine]]></category>
		<category><![CDATA[social]]></category>
		<category><![CDATA[usability]]></category>
		<category><![CDATA[digg]]></category>
		<category><![CDATA[promote]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[searchwiki]]></category>

		<guid isPermaLink="false">http://unitstep.net/?p=570</guid>
		<description><![CDATA[Yesterday, Google launched its SearchWiki tools, which allow registered users to promote or remove entries from a Google search to further personalize results. This will allow users to customize and tailor the results to what they&#8217;re interested in, but it&#8217;s worthwhile to note that Google has probably done something similar with their personalized search histories, [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday, Google launched its <a href="http://googleblog.blogspot.com/2008/11/searchwiki-make-search-your-own.html">SearchWiki tools</a>, which allow registered users to promote or remove entries from a Google search to further personalize results. This will allow users to customize and tailor the results to what they&#8217;re interested in, but it&#8217;s worthwhile to note that Google has probably done something similar with their personalized search histories, already offered to registered users.</p>
<p>A few things to note: Firstly, while the act of promoting or removing a search result seems very akin to Digg, the result is not the same.  The changes you make only affect your own search results, and Google is very clear on this.  However, it would be madness to believe that Google would not use the data gathered from this social experiment to further improve their algorithms.  You also have the option of adding your own results to further personalize your searches and there is an option for seeing what <em>others</em> have recommended/promoted or removed, providing for an interesting social experiment.</p>
<p class="image">
<a href="http://unitstep.net/wordpress/wp-content/uploads/2008/11/google-promote.jpg"><img src="http://unitstep.net/wordpress/wp-content/uploads/2008/11/google-promote.jpg" alt="" title="google-promote" width="417" height="288" class="alignnone size-full wp-image-571" /></a>
</p>
<p>Secondly, as this <a href="http://blogs.wsj.com/biztech/2008/11/21/google-no-longer-the-same-search-results/">WSJ blog notes</a>, this ability may annoy people who have used SEO tactics to improve their site&#8217;s placement in Google&#8217;s search rankings.  However, I find this complaint misses the point: Search is supposed to simplify people&#8217;s lives, and if they&#8217;ve promoted or removed a link it was because they found something to be more useful or irrelevant.  </p>
<p>This isn&#8217;t yet a &#8220;wisdom of the crowds&#8221; approach to search results, but it&#8217;s undoubtedly a step towards a hybrid approach that takes in more human input to determine the quality of results and their placement.  One can only hope it will improve with time!</p>
<hr/>Copyright &copy; 2010 <strong><a href="http://unitstep.net">unitstep.net</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator, the site you are looking at is guilty of copyright infringement. Please contact <strong><a href="mailto:webmaster@unitstep.net">webmaster@unitstep.net</a></strong> for more information.<br/><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span>]]></content:encoded>
			<wfw:commentRss>http://unitstep.net/blog/2008/11/21/googles-searchwiki-promote-search-results/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Google moves to add facial recognition to search?</title>
		<link>http://unitstep.net/blog/2006/08/20/google-moves-to-add-facial-recognition-to-search/</link>
		<comments>http://unitstep.net/blog/2006/08/20/google-moves-to-add-facial-recognition-to-search/#comments</comments>
		<pubDate>Sun, 20 Aug 2006 17:42:13 +0000</pubDate>
		<dc:creator>Peter Chng</dc:creator>
				<category><![CDATA[google]]></category>
		<category><![CDATA[image processing]]></category>
		<category><![CDATA[privacy]]></category>
		<category><![CDATA[search engine]]></category>

		<guid isPermaLink="false">http://unitstep.net/blog/2006/08/20/google-moves-to-add-facial-recognition-to-search/</guid>
		<description><![CDATA[Google recently acquired Neven Vision, a company with lots of technology related to image-processing, specifically, facial recognition.  The initial aim of this IP acquisition is to integrate it with Picasa, Google&#8217;s photo-organizing tool and online-photo sharing site.  According to Google:
&#8220;It could be as simple as detecting whether or not a photo contains a [...]]]></description>
			<content:encoded><![CDATA[<p>Google <a href="http://www.informationweek.com/news/showArticle.jhtml?articleID=192201732&#038;subSection=Breaking+News">recently acquired Neven Vision</a>, a company with lots of technology related to image-processing, specifically, facial recognition.  The initial aim of this IP acquisition is to integrate it with <a href="http://picasa.google.com/">Picasa</a>, Google&#8217;s photo-organizing tool and online-photo sharing site.  According to Google:</p>
<blockquote cite="http://www.informationweek.com/news/showArticle.jhtml?articleID=192201732&#038;subSection=Breaking+News"><p>&#8220;It could be as simple as detecting whether or not a photo contains a person, or, one day, as complex as recognizing people, places, and objects&#8230;&#8221;</p>
</blockquote>
<p>The article goes on to speculate that Google may use this technology to enable searching of people on the web by any photos of them that may be online.  Certainly, this wouldn&#8217;t be out of their league, and <a href="http://gigaom.com/2005/11/16/googles-riya-designs/">they&#8217;ve been interested</a> in this technology for some time.  Indeed, Sergey Brin, one of the co-founders of Google, has said that <a href="http://tailrank.com/posts/562949953800800/Google_Buys_Neven_Vision">image recognition</a> is something they&#8217;d like to do.  Despite the obvious usefulness of such a tool, it does raise privacy questions.</p>
<h3>Add it to the list</h3>
<p>Besides facial recognition, Google is also active in <a href="http://www.micropersuasion.com/2006/06/google_execs_hi.html">voice recognition</a>, and has hinted at <a href="http://news.bbc.co.uk/2/hi/technology/5084870.stm">using it to serve up web ads</a> based on what&#8217;s playing on your TV.  (This seems far-fetched at present.)  Their OCR technology is also pretty decent, as <a href="http://books.google.com/">Google Book Search</a> is able to search the full text of many books whose pages have been scanned in &#8211; the service will also highlight the relevant passages, a neat feature.  All of these services are part of Google&#8217;s future plans to catalogue, and make available for search, all forms of information, not just those that are presented as text on the web. </p>
<h3>Do no evil</h3>
<p>Of course, while Google has been getting attention from privacy advocates as of late, because of the huge amounts of information they collect from users, they&#8217;re one of the better companies when it comes to keeping your information private.  When the US DOJ tried to get search engine companies to turn over anonymized search queries for their users, all of them complied except for Google, which took the matter to court and won.  Thus, Google is probably one of the companies you should worry about less, as there are many others who have less or no respect for an individual&#8217;s privacy.  However, <a href="http://unitstep.net/blog/2006/08/07/aol-releases-search-queries-for-650000-users-in-blatant-disregard-for-privacy/">recent blunders by AOL</a> have done little to quell the fears of privacy advocates.</p>
<p>Facial recognition is almost certainly used by governments, as it&#8217;s an important tool for them.  So, what&#8217;s the big deal with Google getting it, then, if it&#8217;s already in use?  Well, the problem is that this signals the technology is becoming more available and pervasive, and indeed, as that happens, we will become more used to it.  The fact that anything you put on the Internet can, and will, be viewable by all is only driven home more by the use of this technology &#8211; some people still don&#8217;t realize this, though.</p>
<h3>Reasonable expectation of privacy</h3>
<p>While it&#8217;s probably true that when you go out in public, you have little to no expectation of privacy, I doubt that this is a good thing overall.  Facial recognition technology, when fully developed, and combined with a network of security cameras could enable <em>automated</em> tracking of a person&#8217;s whereabouts &#8211; no need for them to be carrying a cellphone or other tracking device.  Is this a good thing?  Some would argue it would better enable the tracking of criminals and terrorists, but I believe the potential for abuse is very real and something that needs to be addressed. </p>
<p>While I don&#8217;t think Google would abuse this technology, others certainly would, which is why I think people need to be aware of what&#8217;s going on, and why companies like Google need to be open about what they&#8217;re planning on doing with it.  </p>
]]></content:encoded>
			<wfw:commentRss>http://unitstep.net/blog/2006/08/20/google-moves-to-add-facial-recognition-to-search/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>AOL search query data release: The aftermath</title>
		<link>http://unitstep.net/blog/2006/08/08/aol-search-query-data-release-the-aftermath/</link>
		<comments>http://unitstep.net/blog/2006/08/08/aol-search-query-data-release-the-aftermath/#comments</comments>
		<pubDate>Wed, 09 Aug 2006 00:44:11 +0000</pubDate>
		<dc:creator>Peter Chng</dc:creator>
				<category><![CDATA[aol]]></category>
		<category><![CDATA[privacy]]></category>
		<category><![CDATA[search engine]]></category>

		<guid isPermaLink="false">http://unitstep.net/blog/2006/08/08/aol-search-query-data-release-the-aftermath/</guid>
		<description><![CDATA[As expected, there has been much discourse following AOL&#8217;s release of the search query records of 650,000 of their users over a three-month period.  After the publicly released data (which was originally put up for research purposes, with no ill intent) was discovered by a blogger, news subsequently spread throughout the online communities, prompting many [...]]]></description>
			<content:encoded><![CDATA[<p>As expected, there has been much discourse following <a href="http://unitstep.net/blog/2006/08/07/aol-releases-search-queries-for-650000-users-in-blatant-disregard-for-privacy/">AOL&#8217;s release of the search query records</a> of 650,000 of their users over a three-month period.  After the publicly released data (which was originally put up for research purposes, with no ill intent) was discovered by a blogger, news subsequently spread throughout the online communities, prompting many people to download the data and AOL to eventually pull the data from their site.  However, some of those who had already got the data began to offer it up for download, either from their servers or through BitTorrent, providing a very good example of &#8220;letting the cat outta the bag&#8221;. </p>
<h3>No end to the data</h3>
<p>Traditional news began covering the story today, with even <a href="http://www.time.com/time/business/article/0,8599,1224405,00.html">TIME magazine</a> penning a little piece about the incident and its relation with privacy rights, or the lack thereof.  Other people have been more enterprising, with <a href="http://www.techcrunch.com/2006/08/08/aol-data-first-web-interfaces-up/">someone setting up</a> <a href="http://www.aolsearchdatabase.com/">an online, searchable database</a> of the results, thus saving people the time of importing all the data into their own DBMS.  </p>
<p>After it was discovered that users were apparently searching for <a href="http://plentyoffish.wordpress.com/2006/08/07/aol-search-data-shows-users-planning-to-commit-murder/">&#8220;how to kill your wife&#8221;</a>, ValleyWag set up a <a href="http://valleywag.com/tech/aol/find-the-scariest-aol-user-search-record-192602.php">contest to find</a> the scariest search record.  Apparently, the online community&#8217;s appetite for the grotesque is insatiable.</p>
<p>Even more interesting were some of the <a href="http://www.valleywag.com/tech/aol/scariest-search-records-aol-saves-crew-of-oceanic-flight-815-192860.php">results from the contest</a>: someone apparently submitted a search query that looked very much like a &#8220;message in a bottle&#8221; from a castaway stranded on an island.  ValleyWag suspects a viral marketing campaign for the TV series <cite>Lost</cite>.  Or, maybe someone is actually stranded on an island in the South Pacific with a laptop and WiFi access &#8211; though that wouldn&#8217;t be such a bad life!</p>
<h3>Changes upcoming</h3>
<p>With all the bad publicity AOL has been getting as of late (problems with cancelling customer accounts, workforce reductions, etc.), this leak could not have come at a worse time.  Further complicating the situation is the fact that bloggers spread the information like wildfire, and helped the data get into wide release.  In fact, the previously mentioned <a href="http://www.aolsearchdatabase.com/">AOL Search Database</a> is in violation of the user agreement that came with the search data &#8211; it states that the information provided is only to be used for non-commercial (research) purposes.  The search database site is clearly displaying ads, and the owner is hoping to profit from it.</p>
<p>Companies are going to tighten up their policy on data release, and I wouldn&#8217;t expect any of them to be releasing data to the research community any time soon.  At the very least, the way they deal with search records will be kept even more discreet. </p>
]]></content:encoded>
			<wfw:commentRss>http://unitstep.net/blog/2006/08/08/aol-search-query-data-release-the-aftermath/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>AOL releases search queries for 650,000 users in blatant disregard for privacy</title>
		<link>http://unitstep.net/blog/2006/08/07/aol-releases-search-queries-for-650000-users-in-blatant-disregard-for-privacy/</link>
		<comments>http://unitstep.net/blog/2006/08/07/aol-releases-search-queries-for-650000-users-in-blatant-disregard-for-privacy/#comments</comments>
		<pubDate>Mon, 07 Aug 2006 17:33:59 +0000</pubDate>
		<dc:creator>Peter Chng</dc:creator>
				<category><![CDATA[aol]]></category>
		<category><![CDATA[privacy]]></category>
		<category><![CDATA[search engine]]></category>

		<guid isPermaLink="false">http://unitstep.net/blog/2006/08/07/aol-releases-search-queries-for-650000-users-in-blatant-disregard-for-privacy/</guid>
		<description><![CDATA[Well, the &#8220;blogosphere&#8221; (I hate using that word) and various online communities are abuzz with the news that AOL Research just released 20 million search queries of some 650,000 users over a three-month time period from March to May of 2006.  Though the data were released in order to provide &#8220;a real query log&#8221; [...]]]></description>
			<content:encoded><![CDATA[<p>Well, the &#8220;blogosphere&#8221; (I hate using that word) and various online communities are abuzz with the news that AOL Research just <a href="http://www.techcrunch.com/2006/08/06/aol-proudly-releases-massive-amounts-of-user-search-data/">released 20 million search queries</a> of some 650,000 users over a three-month time period from March to May of 2006.  Though the data were released in order to provide &#8220;a real query log&#8221; to aid in search engine research, the release constitutes a huge violation of privacy.  Although usernames have been removed, they have been replaced by unique identifiers, which can be used to track an individual user&#8217;s searches, allowing information to be collected about them and a profile built.  </p>
<p>The download was taken offline sometime yesterday (August 6th, 2006), but enough people had already downloaded it to ensure that <a href="http://www.gregsadetsky.com/aol-data/">mirrors</a> would quickly be set up, ensuring its continued spread.  While the intent of this release of data was obviously not malicious, it was a poorly thought-out move that AOL is sure to receive more bad PR for &#8211; especially in light of their <a href="http://www.courant.com/business/hc-aol0804.artaug04,0,3506664.story?coll=hc-headlines-business">recent troubles</a>.</p>
<h3>Good intent, bad execution</h3>
<p>The dataset was first released sometime on or before August 4th, 2006, on <a href="http://research.aol.com">AOL Research</a>.  Like Google, Yahoo! and Microsoft, AOL maintains a Research site to inform and help members of the academic and developer community of things they&#8217;re up to, and to offer assistance to researchers.  As <a href="http://72.14.207.104/search?q=cache:2Qvd2z9VbuIJ:research.aol.com/pmwiki/pmwiki.php%3Fn%3DResearch.500kUserQueriesSampledOver3Months+&#038;hl=en&#038;gl=us&#038;ct=clnk&#038;cd=1">Google&#8217;s cache</a> of the former download page indicates, the intent of offering this download was to facilitate research into making search engine technologies better, by offering a look into real users&#8217; search queries and behaviours.  </p>
<p>Having a little bit of experience in the research field (I&#8217;m currently working as a summer undergraduate research student), I can say that it&#8217;s not unheard of to see industry researchers helping out the academic community, either through joint research projects or, as in this case, the release of datasets for testing some particular sort of algorithm.</p>
<p>For example, one of the projects of the researchers in my lab is speech quality assessment, which is essentially developing an algorithm or system that can process a speech file and assess its quality as a human would.  This has enormous benefits for telcos, since they are always evaluating which speech codecs are most efficient in terms of size/quality.  In order to do this, one needs a vast quantity of voice files (and their associated Mean Opinion Scores &#8211; rated by humans) to test against.  These voice files and their data can only be obtained through time-consuming trials that cost <strong>lots</strong> of money &#8211; we&#8217;re talking six figures and above here.  Much of the data can thus only be obtained by telcos who have the money to conduct these trials.  Fortunately, many of them have been gracious enough to release these datasets to our lab (and others) &#8211; <strong>but the data is only released after a strict <abbr title="Non-Disclosure Agreement">NDA</abbr> (Non-Disclosure Agreement) is signed</strong>, along with perhaps other agreements.  </p>
<p>Thus AOL&#8217;s intent certainly was not bad, as they merely wanted to help out in the academic circles, and maybe get cited in a paper or two.  However, their execution of this release was poor.  They should not have released this data publicly, allowing just anyone to download it, but rather should have followed a pattern of <strong>releasing only under an <abbr title="Non-Disclosure Agreement">NDA</abbr></strong> and only to parties that they thought were reasonably going to use it for research purposes.  In many ways, it&#8217;s worse than releasing data they paid for, because much of the data constitutes the private data of regular AOL users, data that <em>perhaps</em> they agreed to protect under AOL&#8217;s privacy policy.  </p>
<h3>Privacy concerns?</h3>
<p>Some of you might be wondering why this matters at all.  After all, it&#8217;s just a few million search queries that AOL users entered &#8211; containing no personally identifiable information.  Well, that&#8217;s true &#8211; however, since the usernames were merely replaced by unique identifiers (e.g., a username is changed to a random, unique number, so &#8220;John Doe&#8221; is always mapped to &#8220;123456&#8221;), profiles of users&#8217; searches can be built &#8211; opening so many cans of worms that I can&#8217;t even count that high.</p>
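<p>To make the problem concrete, here is a minimal Python sketch of this kind of pseudonymization.  It is not AOL&#8217;s actual procedure: the function names, the six-digit ID range and the sample log are all assumptions made up for illustration.  The point is only that a stable ID keeps every query from one person linked together, so a profile falls out of a simple grouping step.</p>

```python
import random
from collections import defaultdict


def pseudonymize(log, seed=0):
    """Replace each username with a random but stable numeric ID.

    Hypothetical helper for illustration; `log` is a list of
    (username, query) pairs.
    """
    rng = random.Random(seed)
    ids = {}      # username -> assigned ID (this mapping is never released)
    used = set()  # IDs already handed out, to keep them unique
    released = []
    for username, query in log:
        if username not in ids:
            new_id = rng.randrange(100000, 1000000)
            while new_id in used:
                new_id = rng.randrange(100000, 1000000)
            used.add(new_id)
            ids[username] = new_id
        released.append((ids[username], query))
    return released


def build_profiles(released):
    """Group the released queries by anonymous ID."""
    profiles = defaultdict(list)
    for anon_id, query in released:
        profiles[anon_id].append(query)
    return dict(profiles)


# Made-up sample log; none of these queries come from the real dataset.
log = [
    ("john_doe", "pizza delivery springfield"),
    ("jane_roe", "cheap flights"),
    ("john_doe", "divorce lawyer springfield"),
]

profiles = build_profiles(pseudonymize(log))
for anon_id, queries in profiles.items():
    print(anon_id, queries)
```

<p>No username appears in the output, yet both of the first user&#8217;s queries end up under a single ID, which is exactly what makes profile building possible.</p>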
<p>Here&#8217;s a complete breakdown of what information is provided for each query:</p>
<dl>
<dt>AnonID</dt>
<dd>an anonymous user ID number.</dd>
<dt>Query</dt>
<dd>the query issued by the user, case shifted with most punctuation removed.</dd>
<dt>QueryTime</dt>
<dd>the time at which the query was submitted for search.</dd>
<dt>ItemRank</dt>
<dd>if the user clicked on a search result, the rank of the item on which they clicked is listed.</dd>
<dt>ClickURL</dt>
<dd>if the user clicked on a search result, the domain portion of the <acronym class="uttInitialism" title="Uniform Resource Locator">URL</acronym> in the clicked result is listed.</dd>
</dl>
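<p>As a rough sketch of how little work it takes to load records like these, the Python below parses the five fields listed above and groups them per user.  The tab-separated layout, the header row, the timestamp format and the sample rows are assumptions for illustration, not taken from the released files.</p>

```python
import csv
import io
from collections import defaultdict
from typing import NamedTuple, Optional


class QueryRecord(NamedTuple):
    anon_id: int
    query: str
    query_time: str
    item_rank: Optional[int]   # None when the user clicked nothing
    click_url: Optional[str]   # None when the user clicked nothing


def parse_log(text):
    """Parse tab-separated text with a header row into QueryRecord objects."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    records = []
    for row in reader:
        records.append(QueryRecord(
            anon_id=int(row["AnonID"]),
            query=row["Query"],
            query_time=row["QueryTime"],
            item_rank=int(row["ItemRank"]) if row["ItemRank"] else None,
            click_url=row["ClickURL"] or None,
        ))
    return records


def group_by_user(records):
    """Collect every record that shares the same AnonID."""
    by_user = defaultdict(list)
    for record in records:
        by_user[record.anon_id].append(record)
    return dict(by_user)


# Made-up rows in the assumed layout.
SAMPLE = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "123456\tcheap flights\t2006-03-01 10:15:00\t1\thttp://www.example.com\n"
    "123456\tdivorce lawyer\t2006-03-02 21:40:12\t\t\n"
    "654321\tpizza delivery\t2006-03-01 18:05:44\t3\thttp://www.example.org\n"
)

records = parse_log(SAMPLE)
for anon_id, user_records in group_by_user(records).items():
    print(anon_id, [r.query for r in user_records])
```

<p>Keeping <code>item_rank</code> and <code>click_url</code> optional reflects the field descriptions: both are only present when the user actually clicked a result.</p>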
<p>This is an absolute <em>goldmine</em> of information to anyone concerned with search engines, SEO or anything related to that.  Not only do you get a complete profile of a user&#8217;s searches, but you get the exact time they conducted the search, what, if any, results they clicked on (and the rank of that result), along with the domain name of the site they clicked through to.  Running datamining techniques on this information is somewhat trivial, as it can easily be analyzed to determine patterns of most any sort. </p>
<p>Internet marketers would give an arm and a leg for this sort of data, but they now will have their hands on it &#8211; <em>for free</em>.  In fact, there&#8217;s <a href="http://plentyoffish.wordpress.com/2006/08/06/aol-releases-googles-most-prized-keyword-list-google-is-gonna-get-mega-spammed/">already been</a> analysis on the dataset, producing valuable results, and claims that Google will get spammed with made-for-adsense sites (a clear violation of their TOS) targeting keywords for popular and profitable markets.  This will further pollute search results, and generate more noise, creating more headaches for search engines like Google, who will have to further adapt to prevent the <a href="http://unitstep.net/blog/2006/07/31/comment-spam-evolution/">signal-to-noise ratio from further deterioration</a>. </p>
<p>However, this dataset probably (or hopefully?) won&#8217;t have a lasting effect for spammers.  As <a href="http://www.thoughtmarket.com/blogarch/2006/08/how_useful_will_1.php">Thought Market points out</a>, since everyone has access to this data, so does Google.  And Google has proved themselves not to be slackers on developing data analysis techniques &#8211; after all, how did they become arguably the most successful search engine? You can bet they will be analyzing this dataset to see what results <em>they can get</em>, and hypothesizing about how these results might be used.  If they see anything fishy going on, you can bet they&#8217;ll remove those sites from their index. </p>
<h3>Justice is blind</h3>
<p>But perhaps the most important or unnerving aspect of the release of this dataset is the implications it has on legal matters.  The data released is <em>exactly the sort of data the <a href="http://www.boingboing.net/2006/01/19/_doj_search_requests.html">US DOJ requested</a></em> from the major search engines (Google, Microsoft, Yahoo! and AOL) earlier this year to &#8220;defend its argument that the Child Online Protection Act is constitutionally sound&#8221;.  Microsoft, Yahoo! and AOL complied, while Google resisted the subpoena &#8211; and when the matter went to court, the judge ruled <a href="http://googleblog.blogspot.com/2006/03/judge-tells-doj-no-on-search-queries.html">in Google&#8217;s favour</a>.</p>
<p>Taken in this context, things do not look good for AOL &#8211; while Google&#8217;s fighting to protect your privacy, AOL&#8217;s actively working in the opposite direction, it seems.  I realize it was not meant to be that way, but with something this serious, the effect is sometimes far more important than the intent.  Looking at the raw search queries of some users reveals things ranging from embarrassing or weird searches, to queries that may <a href="http://plentyoffish.wordpress.com/2006/08/07/aol-search-data-shows-users-planning-to-commit-murder/">indicate they&#8217;re up to no good</a>.</p>
<p>Let&#8217;s be clear here &#8211; simply searching for &#8220;how to kill your wife&#8221; is not tantamount to murder &#8211; nor is it even enough proof of conspiracy to commit murder.  At least, that&#8217;s how I see things from my &#8220;not a lawyer&#8221; perspective.  However, consider a situation where someone <em>did kill</em> their wife.  If their search queries were made available, and this query turned up, how do you think lawmakers would respond?  Or, consider a sample of 1000 murderers, and their search queries.  If it were statistically established that a significant percentage of murderers search on the Internet for some aspect of murder before the actual crime, how would lawmakers respond? </p>
<p>These are complex questions, and I&#8217;m not sure what I&#8217;d do with the data.  Statistics is a complex topic, but the important fact is they are often abused to support invalid and over-reaching actions.  This obviously leads to questions and issues like &#8220;thought crime&#8221; and comparisons with <cite>Minority Report</cite>.  I generally don&#8217;t like using slippery-slope arguments, but this is an area where it&#8217;s easy to see how abuse could happen, considering that the US DOJ has already expressed an interest in this sort of data, and with all the <a href="http://www.cnn.com/2005/POLITICS/12/17/bush.nsa/">NSA wiretaps</a> going on.  </p>
<p>There were also many other weird, and often disgusting, search queries, often by the same set of users.  I won&#8217;t repeat them here, but you can easily <a href="http://plentyoffish.wordpress.com/2006/08/07/aol-search-data-shows-users-planning-to-commit-murder/">see for yourself</a> what some AOL users were searching for.  While these searches represent a very small percentage of all the searches conducted, they don&#8217;t paint a good picture of the Internet for lawmakers.  (Keep in mind that these sorts of searches are to be expected &#8211; after all, even if the incidence rate for violent crime is something low like 27 per 100,000, that&#8217;s still 0.027%, and this search query dataset represents over 650,000 users.)</p>
<p>Note that I haven&#8217;t even touched on the topic of false positives.  What if someone was writing a paper on the effects of torture, or of any of the other examples of &#8220;man&#8217;s inhumanity to man&#8221;?  They&#8217;d obviously have to search for some pretty disturbing stuff (I&#8217;m glad I&#8217;ve never had such a writing assignment), and then these searches would be tied together by the unique identifier.  There are plenty of other examples, but the idea is that using someone&#8217;s search queries for legal purposes is an easy way to <cite>1984</cite> &#8211; while there obviously may be statistical connections between actual crimes, the potential for &#8220;thought crime&#8221; is just too large.</p>
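<p>The back-of-the-envelope arithmetic in that parenthetical can be spelled out in a few lines of Python.  This is only a sketch: the 27 per 100,000 figure is the illustrative rate used above, and applying it uniformly to the users in the dataset is an assumption, not a measured fact.</p>

```python
def as_percent(rate_per_100k):
    """Convert a rate per 100,000 people into a percentage."""
    return rate_per_100k / 1_000


def expected_cases(rate_per_100k, population):
    """Expected number of people affected in a population of the given size."""
    return rate_per_100k * population / 100_000


rate = 27          # illustrative rate per 100,000, as quoted in the text
users = 650_000    # approximate number of users in the dataset

print(f"{as_percent(rate)}% of users")
print(f"{expected_cases(rate, users):.1f} users expected")
```

<p>Under that assumption the rate works out to roughly 175 of the 650,000 users, so a handful of disturbing search histories is what you would expect to find rather than a surprise.</p>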
<h3>Waves and reverberations</h3>
<p>As far as I can tell, this story (and thus the wide release of the data) only started sometime this weekend, probably on Friday or Saturday.  But, it&#8217;s already making its way around the non-traditional news outlets (blogs, social communities) like fire through a tissue factory.  There&#8217;s been talk of how <a href="http://www.ugcs.caltech.edu/~dangelo/aol-search-query-logs/">personally identifiable information</a> may be available, and of the <a href="http://plentyoffish.wordpress.com/2006/08/07/aol-search-data-shows-users-planning-to-commit-murder/#comments">legal ramifications</a>.  It&#8217;s already in the top five among <a href="http://technorati.com/">Technorati</a> searches, so you can believe that anyone who wants this data will get it.  If this doesn&#8217;t hit the mainstream news soon, I&#8217;ll be very surprised and disappointed &#8211; so far there&#8217;s <a href="http://news.google.com/news?hl=en&#038;ned=us&#038;q=aol+data&#038;btnG=Search+News">only one result</a> for it in Google News. </p>
<h3>Update</h3>
<p>AOL has apparently responded to this privacy leak, in a comment <a href="http://plentyoffish.wordpress.com/2006/08/07/aol-search-data-shows-users-planning-to-commit-murder/">available here</a>.  </p>
<p>Nothing really new or interesting &#8211; other than that the data represented only about 1.5% of all AOL users, and those included were only US users who used AOL&#8217;s client software.  As expected, Mr. Weinstein of AOL said that it was an &#8220;innocent enough attempt to reach out to the academic community with new research tools&#8221;, but, predictably, it didn&#8217;t go through official channels for approval.  You can be sure that AOL&#8217;s going to change their policy on this sort of stuff &#8211; don&#8217;t expect any more stuff from AOL research for a while.  (Someone or some people probably also got fired.)</p>
]]></content:encoded>
			<wfw:commentRss>http://unitstep.net/blog/2006/08/07/aol-releases-search-queries-for-650000-users-in-blatant-disregard-for-privacy/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Search engine spam</title>
		<link>http://unitstep.net/blog/2006/06/19/search-engine-spam/</link>
		<comments>http://unitstep.net/blog/2006/06/19/search-engine-spam/#comments</comments>
		<pubDate>Tue, 20 Jun 2006 02:42:43 +0000</pubDate>
		<dc:creator>Peter Chng</dc:creator>
				<category><![CDATA[google]]></category>
		<category><![CDATA[search engine]]></category>
		<category><![CDATA[spam]]></category>

		<guid isPermaLink="false">http://www.unitstep.net/blog/2006/06/19/search-engine-spam/</guid>
		<description><![CDATA[I recently read about how Google was the latest victim of search engine spam, or the intentional creation of useless pages in order to get a high ranking or listing on a search engine results page.  The story was later Dugg, and you may have seen it on my &#8220;Recently visited&#8221; list.  While [...]]]></description>
			<content:encoded><![CDATA[<p>I recently <a href="http://googlesystem.blogspot.com/2006/06/billions-of-spam-pages-indexed-by.html">read about</a> how Google was the latest victim of search engine spam, or the intentional creation of useless pages in order to get a high ranking or listing on a search engine results page.  The story was later <a href="http://digg.com/technology/How_One_Spammer_Got_BILLIONS_of_Pages_into_Google_in_3_Weeks">Dugg</a>, and you may have seen it on my &#8220;Recently visited&#8221; list.  While Google has fixed this current problem, this type of Internet spam has been growing at a very fast pace for the past few years, for a few reasons, and will probably out-grow conventional e-mail spam in the future.  It presents its own set of unique problems, many of which have yet to be solved by Google, or, in my opinion, other search engines as well.</p>
<p>
During this latest round of spamming, which reached its peak this weekend, it appears that well over 5 billion spam pages were indexed by Google; while this by itself is a huge number, taken in context with the total number of pages that Google had indexed at the time, around 25 billion according to the source in the first link, it is simply astonishing.  What&#8217;s even more impressive, or scary, is the fact that the site was started less than a month ago, making this intrusion into the Google search indexes not only massive, but frighteningly fast as well.
</p>
<p>
From reading the posts at Digg, and from the <a href="http://merged.ca/monetize/flat/how-to-get-billions-of-pages-indexed-by-Google.html">resultant link</a>, it appears the spammer used a script in order to serve up articles based on keywords, and furthermore, utilized many topical subdomains in order to generate the content that would appear &#8220;high&#8221; on the keywords list, and thus be indexed by Google.  Comment-spam (on forums, blogs, and the like) may or may not have played a role in getting the pages ranked higher, but one thing is for certain &#8211; these useless pages made it <em>very</em> high onto the search results page, in many cases filling multiple spots in the top 10 results.  These searches were for common terms, such as &#8220;war on terror pros cons&#8221; and &#8220;pizza sauce recipe&#8221;.
</p>
<p>
But what&#8217;s the reason for this? Well, the same as for any spam marketing campaign &#8211; advertising.  Because of the currently huge market for Internet advertising, the potential for making lots of money off ads on popular sites is an opportunity many cannot turn down.  You&#8217;re likely to see these somewhat unobtrusive text ads on almost any popular site nowadays &#8211; in fact it was Google who first popularized them as a replacement for the annoying animated graphic banner ads and popups, which put off many viewers.  This form of advertising is, undoubtedly, the backbone of many web 2.0 companies.  Companies like <a href="http://digg.com">Digg</a>, <a href="http://flickr.com">Flickr</a> and even <a href="http://google.com">Google</a> rely on ads for nearly 100% of their revenue.
</p>
<p>
But this potential has turned many to the dark side of advertising &#8211; creating spam sites whose sole purpose is to attract viewers for increased ad viewing.  While successful web 2.0 companies may display ads on their site, they all offer some useful service that people return for.  These spam sites do not offer any useful service or information, but instead manipulate search engine results in order to trick users into visiting their site.  Once there, the user will find only semi-meaningful information intricately laced with ads, or perhaps, no information and only ads.  While this clearly violates many ad providers&#8217; terms of service (such as Google Adwords), most of these sites have no problem doing this or finding marketers who don&#8217;t care about such trivial things.
</p>
<p>
This is perhaps the other side of the double-edged sword that is the Internet.  On the one hand, forums, blogs, and other community-based sites offer an immense capacity for spreading useful information.  On the other hand, they also offer the ability to spread <em>useless</em> information as well, and in some cases, search engines cannot yet discriminate between the two as most humans would.  This can be seen in the huge amounts of comment spam and spam blogs that pervade the Internet.
</p>
<p>
All of this creates problems on many levels, and in many ways, is more damaging than e-mail spam.  While e-mail spam is annoying for the junk it creates in our inboxes, and the extra bandwidth it consumes, for the most part anti-spam tools have helped curb this influx.  However, search engine spam targets the most basic use of the Internet, and that is the ability to find useful information.  With all the spam sites out there, and the manipulation of search engine results that comes from this, the ability to conduct a search that returns useful information may be compromised in the future if proper countermeasures are not employed.
</p>
<p>
Furthermore, it creates a nightmare for the people who engineer the search engines, as they must find a way to tweak the algorithms to prevent this from happening again.  In the process, false positives may be generated, causing legitimate sites to be unintentionally delisted and creating further headaches.  Google has been having <a href="http://www.sitepoint.com/forums/showthread.php?t=388258">delisting problems</a> as of late, and one wonders if this is related to the recent spamming problem.
</p>
<p>
One also has to wonder how many of these problems may have come as a result of the upgrade Google did to its datacenters, <a href="http://en.wikipedia.org/wiki/Big_Daddy_Google">dubbed &#8220;Big Daddy&#8221;</a>.  Google did not seem to have these sorts of problems before this, so perhaps there is a correlation, though not necessarily a cause.  It&#8217;s interesting to note, however, that the aim of the &#8220;Big Daddy&#8221; updates was to <em>prevent</em> this sort of thing &#8211; and to keep meaningful sites in the index, which is exactly the opposite of what happened, since the spamming from these sites evidently bumped important sites out of the indexes.
</p>
<p>
This recent round of spamming was not unique to Google; it affected the Yahoo! and MSN search results as well, though not to the extent that it affected Google&#8217;s.  It was probably not the intention of the spammer to get so many pages indexed on Google, and things probably got &#8220;out of hand&#8221; quickly.  However, one has to wonder if this spam site directly targeted Google in its quest for search engine manipulation, or whether this was just a coincidence.  But the problem for search engines remains, and that is, how to effectively discriminate between meaningful and useless information, without making too many mistakes, one way or the other.
</p>
<p>
Thankfully, it appears that Google is working on fixing this problem, and not just by removing the most recent spam.  It will take some work, and I think they should eventually arrive at a solution, but as always, spammers will be working to gain the upper hand as well.  Let&#8217;s hope that the combined effort of companies like Google, Yahoo! and Microsoft can thwart them, or else the Internet may become awash in the useless garbage of spam.</p>
]]></content:encoded>
			<wfw:commentRss>http://unitstep.net/blog/2006/06/19/search-engine-spam/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
