Search engine spam

I recently read about how Google was the latest victim of search engine spam: the intentional creation of useless pages in order to get a high ranking on a search engine results page. The story was later Dugg, and you may have seen it on my “Recently visited” list. While Google has fixed this particular problem, this type of Internet spam has been growing very quickly over the past few years, and it will probably outgrow conventional e-mail spam in the future. It presents its own set of problems, many of which have yet to be solved by Google or, in my opinion, by any other search engine.

During this latest round of spamming, which reached its peak this weekend, it appears that well over 5 billion spam pages were indexed by Google. While this is a huge number by itself, taken in context with the total number of pages that Google had indexed at the time (around 25 billion, according to the source in the first link), it is simply astonishing: the spam count is equal to a fifth or more of that figure. What’s even more impressive, or scary, is the fact that the site was started less than a month ago, making this intrusion into Google’s search index not only massive, but frighteningly fast as well.

From reading the posts at Digg and the article they link to, it appears the spammer used a script to serve up articles based on keywords, and spread them across many topical subdomains so that the content would score highly for those keywords and be indexed by Google. Comment spam (on forums, blogs, and the like) may or may not have played a role in getting the pages ranked higher, but one thing is certain: these useless pages made it very high onto the search results page, in many cases filling multiple spots in the top 10 results. These searches were for common terms, such as “war on terror pros cons” and “pizza sauce recipe”.
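
As a rough illustration of how this kind of crowding might be spotted, here is a minimal Python sketch that counts how many slots in a results list belong to the same parent domain. Everything in it is an assumption on my part: the function names, the hostnames (all made up), the threshold of two results per domain, and the naive "last two labels" rule for finding the parent domain. It is not how Google or any other search engine actually works.

```python
from collections import Counter
from urllib.parse import urlsplit


def registered_domain(url: str) -> str:
    """Naive guess at the parent domain: the last two labels of the host.

    Assumption: this ignores multi-part suffixes such as .co.uk; a real
    system would need a proper list of public suffixes instead.
    """
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:])


def crowded_domains(result_urls, max_per_domain=2):
    """Return the domains that hold more than max_per_domain result slots."""
    counts = Counter(registered_domain(u) for u in result_urls)
    return {domain: n for domain, n in counts.items() if n > max_per_domain}


# Hypothetical results list; every hostname here is invented.
top_results = [
    "http://pizza-sauce.spamfarm.example/recipe",
    "http://war-on-terror.spamfarm.example/pros-cons",
    "http://cheap-flights.spamfarm.example/",
    "http://www.cooking.example/pizza-sauce",
]

print(crowded_domains(top_results))
```

With the sample list above, the three topical subdomains all collapse to the same parent domain, so that domain is the only one flagged. A check this simple would also flag legitimate sites that happen to use many subdomains, which is exactly the kind of false positive that makes the real problem hard.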

But what’s the reason for this? The same as for any spam marketing campaign: advertising. Because of the currently huge market for Internet advertising, the potential for making lots of money off ads on popular sites is an opportunity many cannot turn down. You’re likely to see these somewhat unobtrusive text ads on almost any popular site nowadays. In fact, it was Google who first popularized them as a replacement for the annoying animated banner ads and popups, which put off many viewers. This form of advertising is, undoubtedly, the backbone of many web 2.0 companies. Companies like Digg, Flickr and even Google rely on ads for nearly 100% of their revenue.

But this potential has turned many to the dark side of advertising: creating spam sites whose sole purpose is to attract visitors and so increase ad views. While successful web 2.0 companies may display ads on their sites, they all offer some useful service that people return for. These spam sites do not offer any useful service or information, but instead manipulate search engine results in order to trick users into visiting them. Once there, the user will find only semi-meaningful information intricately laced with ads, or perhaps no information and only ads. While this clearly violates many ad providers’ terms of service (such as those of Google AdSense), most of these sites have no trouble doing it anyway, or finding marketers who don’t care about such trivial things.

This is perhaps the other side of the double-edged sword that is the Internet. On the one hand, forums, blogs, and other community-based sites offer an immense capacity for spreading useful information. On the other hand, they offer the same capacity for spreading useless information, and in some cases search engines cannot yet tell the two apart as most humans would. This can be seen in the huge amounts of comment spam and spam blogs that pervade the Internet.

All of this creates problems on many levels, and in many ways is more damaging than e-mail spam. While e-mail spam is annoying for the junk it creates in our inboxes and the extra bandwidth it consumes, anti-spam tools have for the most part helped curb this influx. Search engine spam, however, targets the most basic use of the Internet: the ability to find useful information. With all the spam sites out there, and the manipulation of search engine results that comes with them, the ability to run a search that returns useful information may be compromised in the future if proper countermeasures are not employed.

Furthermore, it creates a nightmare for the people who engineer the search engines, as they must find a way to tweak the algorithms to prevent this from happening again. In the process, false positives may be generated, and legitimate sites may be unintentionally delisted, causing further headaches. Google has been having delisting problems as of late, and one wonders if this is related to the recent spamming problem.

One also has to wonder how many of these problems may have come as a result of the upgrade Google did to its datacenters, dubbed “Big Daddy”. Google did not seem to have these sorts of problems before this, so perhaps there is a correlation, though not necessarily a cause. It’s interesting to note, however, that the aim of the “Big Daddy” updates was to prevent this sort of thing and to keep meaningful sites in the index, which is exactly the opposite of what happened, since the spam from these sites evidently bumped important sites out of the index.

This recent round of spamming was not unique to Google; it affected the Yahoo! and MSN search results as well, though not to the same extent. It was probably not the spammer’s intention to get so many pages indexed on Google, and things probably got “out of hand” quickly. Still, one has to wonder whether this spam site directly targeted Google in its quest for search engine manipulation, or whether this was just a coincidence. Either way, the problem for search engines remains: how to effectively discriminate between meaningful and useless information without making too many mistakes in either direction, that is, without letting spam through and without delisting legitimate sites.

Thankfully, it appears that Google is working on fixing this problem, and not just by removing the most recent spam. It will take some work, but I think they will eventually arrive at a solution. As always, though, spammers will be working to gain the upper hand as well. Let’s hope that the combined effort of companies like Google, Yahoo! and Microsoft can thwart them, or else the Internet may become awash in the useless garbage of spam.
