Spam is everywhere. If Ben Franklin were alive today, he’d probably be quoted as saying, “In this world nothing is certain but death and spam.” In fact, it’s one of the major downsides of the web as we know it. With the increased availability of information comes the inevitability of spam – direct consumer marketing thrown in alongside legitimate content. It lowers the signal-to-noise ratio (SNR), making it harder to find real, quality information on the Internet.
In the old days…
A few years ago, the big thing was e-mail spam. It’s been around so long that almost anyone who’s used e-mail knows about it. All those e-mails telling you how to refinance your mortgage, get low-cost prescription drugs or grow a certain appendage longer filled up your inbox, making it a pain to delete them all and find the real e-mail you needed to read.
If you think e-mail spam is nothing more than a big annoyance, consider that recent polls have shown that people apparently do make purchases from spam e-mails, which makes it a viable direct marketing tactic. However, server-side e-mail spam filters have become much better at weeding out junk e-mail in the past few years, so for many people spam is not as big a problem as it once was. Unless you’re using Hotmail.
Adapt or die
Faced with the prospect of decreased income thanks to e-mail spam filters (or just wanting to make more money), spammers opened a new front in the war of direct marketing. They began to spam forums, guestbooks and, most recently, the comment systems on various sites in the hopes of drawing attention to the products they were advertising. The purpose was the same – to encourage people to buy products, mostly the same ones they had been pushing in mass e-mails. So, while e-mail spam has not gone away, alternative forms of spam have multiplied.
In fact, spamming online communities with messages may be even more effective, since a spammer only needs to get a message onto one site for it to be viewed by many people. However, since spam tended to be easy to distinguish from comments posted by humans, it was relatively easy for developers to fight back by building in anti-spam features to weed it out. Spammers responded by improving their “bots” – the automated programs that send out the spam messages – and by trying to make their messages seem more “human”.
A personal experience
My site, unitstep.net, is far from a popular site. However, spam bots have still managed to find it, and I’ve logged 104 spam comments in just over two months of operation. That’s quite impressive, and shows that spammers are actively searching for new blogs to spam. As I mentioned before, some comment spam is aimed at promoting other websites in order to increase their ranking in SERPs (Search Engine Results Pages), thus drawing unsuspecting visitors to sites where they are served advertising that looks like regular content. (WordPress and most blogs add a rel="nofollow" attribute to comment links to defeat this, but the spammers still try.) Most of the spam I’ve received has been of the direct marketing variety, though.
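To make the rel="nofollow" idea concrete, here is a minimal Python sketch of tagging the links in a comment before it is displayed. It is only an illustration, not how WordPress actually does it: the function name add_nofollow is made up for this example, and a regular expression is a crude stand-in for a real HTML parser (it can be fooled by unusual markup).

```python
import re

# Matches an opening <a ...> tag and captures its attributes.
_ANCHOR = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)
# Detects an existing rel attribute so it is not added twice.
_REL = re.compile(r"\brel\s*=", re.IGNORECASE)


def add_nofollow(comment_html: str) -> str:
    """Add rel="nofollow" to every link that has no rel attribute yet.

    Hypothetical helper for illustration only.
    """
    def fix(match: re.Match) -> str:
        attrs = match.group(1)
        if _REL.search(attrs):
            return match.group(0)  # leave links that already set rel alone
        return f'<a{attrs} rel="nofollow">'

    return _ANCHOR.sub(fix, comment_html)


print(add_nofollow('Nice post! <a href="http://example.com">my site</a>'))
# prints: Nice post! <a href="http://example.com" rel="nofollow">my site</a>
```

The link still works for human visitors; the attribute only asks search engines not to count it as an endorsement, which is why it takes away the ranking incentive without removing the link.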
If you haven’t seen the comment spam, though, it’s because of Akismet, a truly kick-ass plugin for WordPress, written by the same team. It uses a central service to check every comment that’s submitted, analyzing its content to determine whether it’s likely to be spam or likely to be real. Comments that are marked as spam aren’t shown; instead they are put in a moderation queue for me to look at and delete or, if one is a false positive, allow through. Akismet appears to learn as well, so its success rate increases the longer it’s been in use. I apparently joined relatively late in the game, as until now I had yet to see Akismet report a false positive or let a spam comment through.
However, I don’t blame Akismet for this one, as I was almost caught off guard. Here was the content of the comment:
I am Karin, very interesting article that contained the information I was searching for in Google, thanks….
Upon closer inspection, the comment does look like a spammer’s, since it doesn’t relate specifically to the post’s topic and relies on a generalized statement instead. The opening also makes no sense. But what threw me off was the lack of links in the comment – the spam bot had simply used the regular “homepage” field to fill in the URL of its spam blog – or splog – in the hopes that someone would visit it.
In fact, out of curiosity, I had to visit the site to see what was on it. (This was, of course, how I determined it to be a splog.) The splog consisted of a load of nonsensical posts and, of course, lots of ads, by Google no less. Since it’s against the Google AdSense terms of service to make a site for the direct purpose of displaying Google ads, this spammer is in clear violation of the program’s terms. The spam blog wasn’t littered with links, though (besides the ads), and the posts would pass for human after only a quick check – a closer inspection reveals sentences merged together into nonsensical, run-on paragraphs.
The Turing test
All of these efforts by spammers reminded me of the Turing test. Basically, the idea is that if a person communicating with an unseen party cannot reliably tell whether that party is a computer or a human, then the computer system is said to have passed the test. Turing thought this was a better way of evaluating a computer than asking “Can a computer think?”
In a way, spam and anti-spam techniques are engaged in a sort of Turing test, except that it’s a computer system (the anti-spam system) that is trying to determine whether an entity is a computer (a spambot) or a real human. Content-analysis techniques like the ones described above are widely used, along with CAPTCHAs and other “tests” that a typical human can pass easily but that are relatively hard to build a computer program to solve.
However, as programming techniques and algorithms advance, the spam/anti-spam war is sure to heat up. As seen in some of the comment spam I’ve started to get, spammers are getting better with the bots they produce. They’re moving away from posting messages that are heavily laden with links to gambling or porn sites and are obviously spam, towards posting messages that could conceivably have come from a human, with only the regular home page link that any curious visitor might click. As these techniques advance, it’s entirely possible that a spam bot could read a blog post, analyze the content, and post a reasonably specific message that contains a few links here and there to spam sites. The same goes for CAPTCHAs – advances in image processing are making optical character recognition more and more accurate.
The end result is, as I mentioned at the beginning, a lower SNR and a lower quality of information on the web. Not only will more useless spam be disseminated, making it more likely that a search turns up crap rather than helpful information, but it’ll also be harder to post actual information without having it incorrectly labelled as spam. For example, if spam bots were able to post vaguely post-specific comments to sites, then quick remarks by people, such as “Thanks for the useful information!”, might also get labelled as spam. The same goes for CAPTCHAs – if we make the images more distorted and the characters harder to read, people might also have trouble (I know I do on some of them), and this creates another accessibility problem for the disabled.
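The false-positive problem is easy to see with even a toy filter. The Python sketch below scores a comment by how many of its longer words also appear in the post; it is a deliberately naive heuristic of my own for illustration, not what Akismet does, and the function names and the one-line sample post are made up.

```python
import re


def topical_words(text: str) -> set[str]:
    """Lower-case words longer than three letters, a crude proxy for topic words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}


def topic_overlap(comment: str, post: str) -> float:
    """Share of the comment's topic words that also appear in the post (0.0 to 1.0)."""
    comment_words = topical_words(comment)
    if not comment_words:
        return 0.0
    return len(comment_words & topical_words(post)) / len(comment_words)


# Hypothetical one-line stand-in for a blog post.
post = "Akismet filters comment spam on WordPress blogs"

spam = ("very interesting article that contained the information "
        "I was searching for in Google, thanks")
genuine = "Thanks for the useful information!"
specific = "Akismet caught that spam on my WordPress blog too"

print(topic_overlap(spam, post))      # prints 0.0
print(topic_overlap(genuine, post))   # prints 0.0
print(topic_overlap(specific, post))  # prints 0.5
```

The generic spam comment and the genuine quick thank-you both score zero, so any threshold that blocks the first also blocks the second. Only the comment that mentions the post’s actual subject stands out, and that is exactly the kind of comment a smarter bot would learn to imitate.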
This isn’t to say I’ve lost hope. Akismet, so far, has only missed one comment – and I would have missed it too had I not visited the linked site. So, it hasn’t really let me down. Others have had similar success, so for now, I think anti-spam techniques have the upper hand. Spam is currently just an annoyance and a statistic, and I hope it stays that way. What I’ve talked about in this entry could be viewed as a worst-case scenario of sorts.
Now, excuse me while I press “Delete All” on the spam comment moderation queue.