<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>unitstep.net &#187; spam</title>
	<atom:link href="http://unitstep.net/blog/category/spam/feed/" rel="self" type="application/rss+xml" />
	<link>http://unitstep.net</link>
	<description>the home of peter chng</description>
	<pubDate>Sun, 30 Nov 2008 23:12:11 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.5</generator>
	<language>en</language>
			<item>
		<title>Passing the 100,000 mark</title>
		<link>http://unitstep.net/blog/2008/01/12/passing-the-100000-mark/</link>
		<comments>http://unitstep.net/blog/2008/01/12/passing-the-100000-mark/#comments</comments>
		<pubDate>Sun, 13 Jan 2008 02:10:16 +0000</pubDate>
		<dc:creator>Peter Chng</dc:creator>
		
		<category><![CDATA[akismet]]></category>

		<category><![CDATA[spam]]></category>

		<category><![CDATA[wordpress]]></category>

		<guid isPermaLink="false">http://unitstep.net/blog/2008/01/12/passing-the-100000-mark/</guid>
		<description><![CDATA[This week, Akismet reported that it had blocked its 100,000th spam comment on my site/blog.  While that&#8217;s not a remarkable number, in light of how little traffic my site gets that figure becomes somewhat more significant.  Since this site has only been around for just over one and a half years (19 months), [...]]]></description>
			<content:encoded><![CDATA[<p>This week, <a href="http://akismet.com/">Akismet</a> reported that it had blocked its 100,000th spam comment on my site/blog.  While that&#8217;s not a remarkable number, in light of how little traffic my site gets that figure becomes somewhat more significant.  Since this site has only been around for just over one and a half years (19 months), that works out to roughly 5,300 spam comments every month, or about 1,200 every week.  Note that the current averages are actually much higher, since in the beginning I got far less spam before the bots discovered my site.</p>
<p>Props definitely go out to <a href="http://automattic.com/">Automattic</a> for creating such a reliable and accurate service.  When I <a href="/blog/2006/07/31/comment-spam-evolution/">first wrote about it</a> over a year ago, I was very impressed with its precise filtering of spam and non-spam (aka <dfn>ham</dfn>) comments along with its unobtrusiveness.  Akismet truly makes spam filtering transparent to the end user, unlike other methods such as CAPTCHAs.</p>
<p>Of course, I can&#8217;t forget to thank the developers of <a href="http://wordpress.org/">WordPress</a> as well.  Without them, I would have no site to protect from spam. <img src='http://unitstep.net/wordpress/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p>Check back later when this site surpasses the 1,000,000 mark for spam.</p>
<hr/>Copyright &copy; 2008 <strong><a href="http://unitstep.net">unitstep.net</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator, the site you are looking at is guilty of copyright infringement. Please contact <strong><a href="mailto:webmaster@unitstep.net">webmaster@unitstep.net</a></strong> for more information.<br/><span style="float: right;font-size: 7pt"><a href="http://blog.taragana.com/index.php/archive/wordpress-plugins-provided-by-taraganacom/">Plugin</a> by <a href="http://www.taragana.com/">Taragana</a></span>]]></content:encoded>
			<wfw:commentRss>http://unitstep.net/blog/2008/01/12/passing-the-100000-mark/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Akismet problems</title>
		<link>http://unitstep.net/blog/2006/08/27/akismet-problems/</link>
		<comments>http://unitstep.net/blog/2006/08/27/akismet-problems/#comments</comments>
		<pubDate>Sun, 27 Aug 2006 15:23:48 +0000</pubDate>
		<dc:creator>Peter Chng</dc:creator>
		
		<category><![CDATA[akismet]]></category>

		<category><![CDATA[comment spam]]></category>

		<category><![CDATA[spam]]></category>

		<category><![CDATA[wordpress]]></category>

		<guid isPermaLink="false">http://unitstep.net/blog/2006/08/27/akismet-problems/</guid>
		<description><![CDATA[I&#8217;ve been using Akismet to control comment spam and so far, its performance has been excellent - out of close to 300 comments, it didn&#8217;t report a single false positive and only let through one or two cleverly-crafted comments.  However, for some reason, in the past day it&#8217;s started letting through comments that are [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been using <a href="http://akismet.com/">Akismet</a> to control <a href="http://unitstep.net/blog/2006/07/31/comment-spam-evolution/">comment spam</a> and so far, its performance has been excellent - out of close to 300 comments, it didn&#8217;t report a single false positive and only let through one or two cleverly-crafted comments.  However, for some reason, in the past day it&#8217;s started letting through comments that are clearly spam, and I&#8217;ve had to manually label them as such and then delete them.</p>
<p>I wonder if this is related to the <a href="http://akismet.com/blog/2006/08/better-stats/">recent update</a> to the system (as it&#8217;s a centralized service), or something else.  The update seemed to be only related to improving the statistics tracking of the service, and not something related to the spam-detection algorithm.  I&#8217;ve tried disabling/enabling the plugin, so we&#8217;ll see if that helps - has anyone else been having these problems?</p>
]]></content:encoded>
			<wfw:commentRss>http://unitstep.net/blog/2006/08/27/akismet-problems/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Comment spam evolution</title>
		<link>http://unitstep.net/blog/2006/07/31/comment-spam-evolution/</link>
		<comments>http://unitstep.net/blog/2006/07/31/comment-spam-evolution/#comments</comments>
		<pubDate>Tue, 01 Aug 2006 01:54:42 +0000</pubDate>
		<dc:creator>Peter Chng</dc:creator>
		
		<category><![CDATA[spam]]></category>

		<category><![CDATA[wordpress]]></category>

		<guid isPermaLink="false">http://unitstep.net/blog/2006/07/31/comment-spam-evolution/</guid>
		<description><![CDATA[Spam is pervasive; it is everywhere.  If Ben Franklin were alive today, he&#8217;d probably be quoted as saying that &#8220;In this world nothing is certain but death and spam&#8221;. In fact, it&#8217;s one of the major downsides of the web as we know it.  With increased availability of information comes the inevitability of [...]]]></description>
			<content:encoded><![CDATA[<p>Spam is pervasive; it is everywhere.  If Ben Franklin were alive today, he&#8217;d probably be <a href="http://www.brainyquote.com/quotes/authors/b/benjamin_franklin.html">quoted</a> as saying that <em>&#8220;In this world nothing is certain but death and <strong>spam</strong>&#8221;</em>. In fact, it&#8217;s one of the major downsides of the web as we know it.  With increased availability of information comes the inevitability of spam - direct consumer marketing thrown in alongside legitimate content that decreases the SNR (signal-to-noise ratio), effectively making it harder to find quality, real information on the Internet. </p>
<h3>In the old days&#8230;</h3>
<p>Back a few years ago, the big thing was e-mail spam.  It&#8217;s been around so long that almost anyone who&#8217;s used e-mail knows about it.  All those e-mails telling you how to re-finance your mortgage, get low-cost prescription drugs or how to grow a certain appendage longer tended to fill up one&#8217;s inbox, making it a pain to delete all of them and find the real e-mail that you needed to read.  </p>
<p>For those of you thinking that e-mail spam was just a big annoyance, <a href="http://www.dmnews.com/cms/dm-news/internet-marketing/35135.html">recent polls</a> have shown that people apparently do make purchases from spam e-mails, making it a viable direct marketing tactic.  However, server-side e-mail spam filters have greatly increased in their ability to weed out junk e-mails in the past few years, so for many people, spam is not as big a problem as it once was.  Unless you&#8217;re <a href="http://unitstep.net/blog/2006/07/26/windows-live-mail-slow-bloated-and-not-very-usable/">using Hotmail</a>.</p>
<h3>Adapt or die</h3>
<p>Faced with the prospect of decreased income thanks to e-mail spam filters (or just wanting to make more money), spammers opened a new front in the war of direct marketing.  They began to spam forums, guestbooks and, most recently, commenting systems on various sites in the hopes of drawing attention to the products they were advertising.  The purpose was the same - to encourage people to buy products, mostly the same ones they had been advertising in mass e-mails.  So, while e-mail spam has not gone away in recent times, alternative forms of spam have increased many times over.  </p>
<p>In fact, spamming online communities with messages may even be more effective, since a spammer needs only to get their message on one site in order for it to be viewed by many.  However, since spam tended to be easy to distinguish from comments posted by humans, it was relatively easy for developers to combat this by building in anti-spam features to weed it out.  Spammers responded by making their &#8220;bots&#8221; - the automated programs that sent out the spam messages - better and trying to make the &#8220;quality&#8221; of their messages seem more &#8220;human&#8221;.</p>
<h3>A personal experience</h3>
<p>My site, <a href="http://unitstep.net">unitstep.net</a>, is far from a popular site.  However, spam bots have still managed to find this site and I&#8217;ve logged 104 spam comments in just over two months&#8217; worth of operation.  That&#8217;s quite impressive, and shows that spammers are actively searching for new blogs to spam.  As <a href="http://unitstep.net/blog/2006/06/19/search-engine-spam/">I mentioned before</a>, some comment spam is aimed at promoting other websites in order to increase their ranking in SERPs (Search Engine Results Pages), thus drawing unsuspecting visitors to their sites, where they are served up advertising that looks like regular content.  (WordPress and most blogs add an attribute of <code>rel="nofollow"</code> to comment links to defeat this, but they still try.)  Most of the spam I&#8217;ve got has been of the direct marketing variety, though.</p>
<p>If you haven&#8217;t seen the comment spam though, it&#8217;s because of <a href="http://akismet.com/">Akismet</a>, a truly kick-ass plugin for WordPress, written by the same team.  It basically uses a central authority to check on every comment that&#8217;s submitted, and analyzes its content to determine if it&#8217;s likely to be spam, or likely to be real.  Comments that are marked as spam aren&#8217;t shown, and are instead put in a moderation queue, for me to look at and delete or, if it&#8217;s a false positive, allow through.  Akismet appears to learn as well, so its success rate increases the longer it&#8217;s been in use.  I apparently joined relatively late in the game, as I had yet to see Akismet report a false positive or let a spam comment through. </p>
<p>Until today.</p>
<p>However, I don&#8217;t blame Akismet for this one, as I was almost caught off guard.  Here was the content of the comment:</p>
<blockquote><p>Plato learning&#8230;</p>
<p>I am Karin, very interesting article that contained the information I was searching for in Google, thanks&#8230;.</p></blockquote>
<p>Upon closer inspection, the comment does look like a spammer&#8217;s, since it didn&#8217;t specifically relate to the post&#8217;s topic, instead using a generalized statement.  The first sentence also made no sense.  But what threw me off was the lack of links in the post - the spam bot had instead just used the regular &#8220;homepage&#8221; field to fill in the <acronym class="uttInitialism" title="Uniform Resource Locator">URL</acronym> of their spam blog - or splog - in the hopes that someone would visit it.</p>
<p>In fact, due to my curiosity, I had to visit the site to see what was on it.  (This was, of course, how I determined it to be a splog.)  The splog consisted of a load of nonsensical posts, and of course, lots of ads, by Google no less.  Since it&#8217;s against the terms-of-service of Google AdSense to make a site <em>for the direct purpose of displaying Google ads</em>, this spammer is obviously in clear violation of the program&#8217;s terms.  The spam blog wasn&#8217;t littered with links though (besides the ads), and the posts would look human after only a quick check - a closer inspection reveals sentences merged together into nonsensical, run-on paragraphs.</p>
<h3>The Turing test</h3>
<p>All of the efforts by current spammers reminded me of the <a href="http://en.wikipedia.org/wiki/Turing_Test">Turing Test</a>.  Basically, it&#8217;s the concept that if a person communicating with a computer cannot reliably tell if they are communicating with a computer or a human, then that computer system is said to have passed the test.  Turing thought it was a better way of evaluating a computer than asking the question &#8220;Can a computer think?&#8221;</p>
<p>In a way, spam and anti-spam techniques are engaged in some sort of Turing test, except that it&#8217;s a computer system (the anti-spam system) that is trying to determine if an entity is a computer (spambot) or a real human.  The anti-spam techniques described above have typically been widely used, along with <a href="http://en.wikipedia.org/wiki/Captcha">CAPTCHAs</a> or other &#8220;tests&#8221; that the typical human can easily do, but are relatively hard to design into a computer program or system.</p>
<p>However, as programming techniques and algorithms advance, the spam/anti-spam war is sure to heat up.  As seen by some of the comment spam I&#8217;ve started to get, spammers are getting better with the bots they produce.  They&#8217;re moving away from posting messages that are heavily laden with links to gambling/porn sites that are obviously spam, to posting messages that could have conceivably come from a human, with only the regular home page link that any curious visitor might click.  As these techniques advance, it&#8217;s entirely possible that a spam bot could read a blog post, analyze the content, and post a reasonably-specific message that contained a few links here and there to spam sites.  The same goes for CAPTCHAs as well - advances in image processing are making optical character recognition more and more accurate.</p>
<p>The end result is, as I mentioned at the beginning, a lower SNR and lower quality of information on the web.  Not only will more useless spam information be disseminated, making it more likely to find crap rather than helpful information when doing a search - but it&#8217;ll also be harder to post actual information that isn&#8217;t incorrectly labelled as spam.  For example, if spam bots were able to post vaguely post-specific comments to sites, then quick remarks by people, such as <em>&#8220;Thanks for the useful information!&#8221;</em>, might also get labelled as spam.  Same goes for CAPTCHAs - if we make the images more distorted and the characters harder to read, people might also have trouble - I know I do on some of them - and this creates another problem of accessibility for the disabled.</p>
<p>This isn&#8217;t to say I&#8217;ve lost hope.  Akismet, so far, has only missed one comment - and I would have missed it had I not visited the linked site.  So, it hasn&#8217;t really let me down.  Others have had <a href="http://www.darcynorman.net/2006/06/24/20-000-spam-attempts-per-day-and-counting">similar success</a>, so for now, I think anti-spam techniques have the upper hand.  Spam is merely an annoyance and a statistic currently, and I hope it stays that way.  What I&#8217;ve talked about in this entry could be viewed as a <a href="http://unitstep.net/blog/2006/07/22/black-dawn-the-next-pandemic-or-bird-flu-the-worst-case/">worst-case scenario</a> of sorts.</p>
<p>Now, excuse me while I press &#8220;Delete All&#8221; on the spam comment moderation queue.</p>
]]></content:encoded>
			<wfw:commentRss>http://unitstep.net/blog/2006/07/31/comment-spam-evolution/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Search engine spam</title>
		<link>http://unitstep.net/blog/2006/06/19/search-engine-spam/</link>
		<comments>http://unitstep.net/blog/2006/06/19/search-engine-spam/#comments</comments>
		<pubDate>Tue, 20 Jun 2006 02:42:43 +0000</pubDate>
		<dc:creator>Peter Chng</dc:creator>
		
		<category><![CDATA[google]]></category>

		<category><![CDATA[search engine]]></category>

		<category><![CDATA[spam]]></category>

		<guid isPermaLink="false">http://www.unitstep.net/blog/2006/06/19/search-engine-spam/</guid>
		<description><![CDATA[I recently read about how Google was the latest victim of search engine spam, or the intentional creation of useless pages in order to get a high ranking or listing on a search engine results page.  The story was later Dugg, and you may have seen it on my &#8220;Recently visited&#8221; list.  While [...]]]></description>
			<content:encoded><![CDATA[<p>I recently <a href="http://googlesystem.blogspot.com/2006/06/billions-of-spam-pages-indexed-by.html">read about</a> how Google was the latest victim of search engine spam, or the intentional creation of useless pages in order to get a high ranking or listing on a search engine results page.  The story was later <a href="http://digg.com/technology/How_One_Spammer_Got_BILLIONS_of_Pages_into_Google_in_3_Weeks">Dugg</a>, and you may have seen it on my &#8220;Recently visited&#8221; list.  While Google has fixed this current problem, this type of Internet spam has been growing at a very fast pace for the past few years, for a few reasons, and will probably out-grow conventional e-mail spam in the future.  It presents its own set of unique problems, many of which have yet to be solved by Google, or, in my opinion, other search engines as well.</p>
<p>
During this latest round of spamming, which reached its peak this weekend, it appears that well over 5 billion spam pages were indexed by Google; while this by itself is a huge number, taken in context with the total number of pages that Google had indexed at the time, around 25 billion according to the source in the first link, it is simply astonishing.  What&#8217;s even more impressive, or scary, is the fact that the site was started less than a month ago, making this intrusion into the Google search indexes not only massive, but frighteningly fast as well.
</p>
<p>
From reading the posts at Digg, and from the <a href="http://merged.ca/monetize/flat/how-to-get-billions-of-pages-indexed-by-Google.html">resultant link</a>, it appears the spammer used a script in order to serve up articles based on keywords, and furthermore, utilized many topical subdomains in order to generate the content that would appear &#8220;high&#8221; on the keywords list, and thus be indexed by Google.  Comment-spam (on forums, blogs, and the like) may or may not have played a role in getting the pages ranked higher, but one thing is for certain - these useless pages made it <em>very</em> high onto the search results page, in many cases filling multiple spots in the top 10 results.  These searches were for common terms, such as &#8220;war on terror pros cons&#8221; and &#8220;pizza sauce recipe&#8221;.
</p>
<p>
But what&#8217;s the reason for this? Well, the same as for any spam marketing campaign - advertising.  Because of the currently huge market for Internet advertising, the potential for making lots of money off ads on popular sites is an opportunity many cannot turn down.  You&#8217;re likely to see these somewhat unobtrusive text ads on most any popular site nowadays - in fact it was Google who first popularized them as a replacement for the annoying animated graphic banner ads and popups, which put off many viewers.  This form of advertising is, undoubtedly, the backbone of many web 2.0 companies.  Companies like <a href="http://digg.com">Digg</a>, <a href="http://flickr.com">Flickr</a> and even <a href="http://google.com">Google</a> rely on ads for nearly 100% of their revenue.
</p>
<p>
But this potential has turned many to the dark side of advertising - creating spam sites whose sole purpose is to attract viewers for increased ad viewing.  While successful web 2.0 companies may display ads on their site, they all offer some useful service that people return for.  These spam sites do not offer any useful service or information, but instead manipulate search engine results in order to trick users into visiting their site.  Once there, the user will find only semi-meaningful information intricately laced with ads, or perhaps, no information and only ads.  While this clearly violates many ad providers&#8217; terms-of-service (such as Google AdSense), most sites have no problem doing this or finding marketers who don&#8217;t care about such trivial things.
</p>
<p>
This is perhaps the other side of the double-edged sword that is the Internet.  On the one hand, forums, blogs, and other community-based sites offer an immense capacity for spreading useful information.  On the other hand, they also offer the ability to spread <em>useless</em> information as well, and in some cases, search engines cannot yet discriminate between the two as most humans would.  This can be seen in the huge amounts of comment spam, and spam blogs that pervade the Internet.
</p>
<p>
All of this creates problems on many levels, and in many ways, is more damaging than e-mail spam.  While e-mail spam is annoying for the junk it creates in our inboxes, and the extra bandwidth it consumes, for the most part anti-spam tools have helped curb this influx.  However, search engine spam targets the most basic use of the Internet, and that is the ability to find useful information.  With all the spam sites out there, and the manipulation of search engine results that comes from this, the ability to conduct a search that returns useful information may be compromised in the future if proper countermeasures are not employed.
</p>
<p>
Furthermore, it creates a nightmare for the people who engineer the search engines, as they must find a way to tweak the algorithms to prevent this from happening again.  In the process, false positives may be generated, causing legitimate sites to be unintentionally delisted and creating further headaches.  Google has been having <a href="http://www.sitepoint.com/forums/showthread.php?t=388258">delisting problems</a> as of late, and one wonders if this is related to the recent spamming problem.
</p>
<p>
One also has to wonder how many of these problems may have come as a result of the upgrade Google did to its datacenters, <a href="http://en.wikipedia.org/wiki/Big_Daddy_Google">dubbed &#8220;Big Daddy&#8221;</a>.  Google did not seem to have these sorts of problems before this, so perhaps there is a correlation, but maybe not a cause.  It&#8217;s interesting to note, however, that the aim of the &#8220;Big Daddy&#8221; updates was to <em>prevent</em> this sort of thing - and to keep meaningful sites in the index, which is exactly the opposite of what happened, since the spamming from these sites evidently bumped important sites out of the indexes.
</p>
<p>
This recent round of spamming was not unique to Google; it affected the Yahoo! and MSN search results as well, though not to the extent that it did Google&#8217;s.  It was probably not the spammer&#8217;s intention to get so many pages indexed on Google, and things probably got &#8220;out of hand&#8221; quickly.  However, one has to wonder if this spam site directly targeted Google in its quest for search engine manipulation, or whether this was just a coincidence.  But the problem for search engines remains, and that is, how to effectively discriminate between meaningful and useless information, without making too many mistakes, one way or the other.
</p>
<p>
Thankfully, it appears that Google is working on fixing this problem, and not just by removing the most recent spam.  It will take some work, and I think they will eventually arrive at a solution, but as always, spammers will be working to gain the upper hand as well.  Let&#8217;s hope that the combined effort of companies like Google, Yahoo! and Microsoft can thwart them, or else the Internet may become awash in the useless garbage of spam.</p>
]]></content:encoded>
			<wfw:commentRss>http://unitstep.net/blog/2006/06/19/search-engine-spam/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
