Sunday, December 11, 2005


An ounce of prevention

Nobody likes spam. It clogs up corporate e-mail servers and internet backbones. Hunting down valid e-mail messages in the flood of unsolicited junk mail becomes a daily chore. Some of it may even be e-mail that you or your users find horribly offensive. Modern spam filters have come a long way, but they aren't perfect. Important e-mail messages can be erroneously flagged as spam, and even the best filters let the occasional junk message through.

So how are these spammers getting your e-mail address? There are many different correct answers to that question. Some buy lists. Some collect e-mail addresses via worms, viruses, or spyware. You may have given your e-mail address to an unscrupulous site that collects registration information. There is, however, one method that seems to be more common than the others: web spiders.

Software programs run by spammers traverse the web, reading page after page by following link after link, much the way Google crawls pages to index them for searching. However, instead of indexing your page, this kind of program is only looking for e-mail addresses. When it finds one, it logs it in a database.
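To make that process concrete, here's a minimal sketch of the traversal. It assumes a tiny in-memory set of pages instead of real network requests; the `pages` object, its URLs, and the `crawl` function are all made up for illustration, and a real spider is surely more elaborate.

```javascript
// Hypothetical in-memory "web": URL -> page source. No network access involved.
var pages = {
  "/index": '<a href="/about">About</a> <a href="/contact">Contact</a>',
  "/about": '<a href="/index">Home</a>',
  "/contact": 'Write to somebody@somewhere.com <a href="/index">Home</a>'
};

// Visit every page reachable from startUrl and return the addresses found.
function crawl(startUrl) {
  var queue = [startUrl];
  var visited = new Set();
  var addressLog = []; // stands in for the spammer's database

  while (queue.length > 0) {
    var url = queue.shift();
    // Skip pages already read and links that lead outside our tiny "web".
    if (visited.has(url) || !Object.prototype.hasOwnProperty.call(pages, url)) {
      continue;
    }
    visited.add(url);
    var source = pages[url];

    // Log anything in the raw source that is shaped like an address.
    var addresses = source.match(/[\w.-]+@[\w.-]+\.\w+/g) || [];
    addressLog = addressLog.concat(addresses);

    // Queue every link on the page so it gets read later.
    var linkPattern = /href="([^"]+)"/g;
    var link;
    while ((link = linkPattern.exec(source)) !== null) {
      queue.push(link[1]);
    }
  }
  return addressLog;
}
```

Calling `crawl("/index")` walks all three pages and returns the one address it finds on the contact page.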

Is there anything we, as web developers, can do to truly prevent spammers from getting our users' e-mail addresses? Nope. Is there anything we can do to slow the spammers down a little? Maybe.

Obviously, we can ask our users to please refrain from posting their e-mail addresses to websites but that's not sufficient. First off, we're likely to be ignored. Secondly, there are times in which you need e-mail addresses to appear on a website. For an example, look at virtually any website's "Contact Us" page.

So, what can we try? First, we have to realize that a spider is not a person or a web browser. It doesn't "see" an e-mail address on a page the way our eyes do. An example of where this could be important is images. An e-mail address inside of a .gif or .jpg probably won't be picked up. How does a spider recognize and record an e-mail address? It surely varies from spider to spider but we can probably make a few assumptions.

1) The e-mail address has to appear in the source of the web page.
2) The spider probably looks for one or more characters followed by an @, followed by one or more characters, followed by a ".", followed by "com", "net", or "org".
3) The spider probably doesn't have a JavaScript interpreter built in.
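As a rough illustration of assumption 2, here is a minimal sketch of the kind of pattern match a simple spider might run over a page's source. The names `NAIVE_EMAIL_PATTERN` and `harvestAddresses` are invented for this example, and the exact pattern is a guess; real spiders surely vary.

```javascript
// Hypothetical pattern for assumption 2: one or more characters, "@",
// one or more characters, ".", then "com", "net", or "org".
// Whitespace, quotes, and angle brackets end a candidate address.
var NAIVE_EMAIL_PATTERN = /[^\s"'<>:@]+@[^\s"'<>@]+\.(?:com|net|org)\b/gi;

// Return every address-shaped string found in the raw page source.
function harvestAddresses(pageSource) {
  var matches = pageSource.match(NAIVE_EMAIL_PATTERN);
  return matches === null ? [] : matches;
}
```

Because it only reads raw text, a function like this never sees an address that exists solely inside an image or that is assembled at run time.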

Let's look at the HTML required to generate an e-mail link:

<a href="mailto:somebody@somewhere.com">somebody@somewhere.com</a>

It's pretty easy to see how a piece of software could grab the e-mail address from this sample. Anybody who's taken an intro level programming class could write the code to do it.

Let's compare it to the following:

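<script type="text/javascript">
// Keep the address in fragments so it never appears whole in the page source.
var sd = ".";
var sp1 = "body";
var sp2 = "some";
var sa = "@";
var ss1 = "where";
var st1 = "com";
var st2 = "org"; // unused here; kept as an alternate ending

// Assemble the address at run time and write the mailto link into the page.
var se = sp2 + sp1 + sa + sp2 + ss1 + sd + st1;
document.write("<a href=\"mailto:" + se + "\">" + se + "</a>");
</script>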

The end result that a visitor to the web page sees is exactly the same in both cases, but the second example would be much harder for a spider to interpret. Against a spider without a JavaScript interpreter, obfuscating an e-mail address so that it can't be recognized is trivial.
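Here's a quick way to check that claim, as a sketch only. It assumes the spider uses a simple text pattern like the one guessed at in assumption 2 and that the plain link is the ordinary mailto form of the same address; `SPIDER_PATTERN`, `plainSource`, `scriptedSource`, and `countAddresses` are names made up for this test.

```javascript
// Hypothetical spider pattern: text, "@", text, ".", then com/net/org.
var SPIDER_PATTERN = /[^\s"'<>:@]+@[^\s"'<>@]+\.(?:com|net|org)\b/gi;

// The plain link, as it would appear in the page source (assumed form).
var plainSource =
  '<a href="mailto:somebody@somewhere.com">somebody@somewhere.com</a>';

// The scripted version, as it would appear in the page source.
var scriptedSource = [
  'sd = ".";',
  'sp1 = "body";',
  'sp2 = "some";',
  'sa = "@";',
  'ss1 = "where";',
  'st1 = "com";',
  'se = sp2 + sp1 + sa + sp2 + ss1 + sd + st1;',
  'document.write("<a href=\\"mailto:" + se + "\\">" + se + "</a>");'
].join("\n");

// Count how many address-shaped strings the pattern finds in raw source.
function countAddresses(source) {
  var found = source.match(SPIDER_PATTERN);
  return found === null ? 0 : found.length;
}
```

With this pattern, `countAddresses(plainSource)` finds the address twice (once in the href, once in the link text), while `countAddresses(scriptedSource)` finds nothing, because the only @ in the script sits alone between quotes. A spider that actually executes scripts would not be fooled.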

But is it worth the hassle? Will it make a noticeable difference? I don't know. Let's find out. This link goes to a page that contains an e-mail address written out by JavaScript and an e-mail address embedded directly in the HTML. Eventually, spiders should find that page through this one. I'll post back periodically with the number of spam messages each address receives. If you'd like to help the experiment along, please link to either this article or directly to the page linked here.
