Statistics hold out hope for slamming spam

About 500 programmers, researchers, hackers and IT administrators gathered at the Massachusetts Institute of Technology earlier this month seeking not just to slow the relentless onslaught of spam but to destroy its business model completely.

They want to increase to almost 100% the proportion of spam mail that users’ filters reject. That would mean spammers would receive few, if any, responses, making sending unsolicited bulk email a financially prohibitive task.

Such an effective filter would demand radically new techniques, but programmer William Yerazunis and conference organiser Paul Graham see great hope for such development in Bayesian probability algorithms, which assess the odds that a mail is spam based not on the arbitrary assigning of scores to certain words and phrases but on the statistical measurement of the common features of real spam.

Graham’s paper describes how he counted words and other isolated character sequences (tokens) in a large sample of messages pre-identified as spam and not-spam, and built a hash table for each corpus recording how often each token occurred. Those frequencies give a measure of the likelihood that a message containing certain tokens is spam or non-spam.
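The corpus-building step can be sketched in a few lines of Python. This is a simplified illustration, not Graham's actual code: his tokenizer preserves more exotic tokens (IP addresses, prices), and his published formula also weights the non-spam counts and clamps extreme probabilities, refinements omitted here.

```python
import re
from collections import Counter

def tokenize(text):
    """Split a message into word-like tokens (a simplified stand-in
    for Graham's tokenizer, which also keeps tokens such as IP
    addresses and dollar amounts)."""
    return re.findall(r"[A-Za-z$][A-Za-z0-9$'!-]*", text.lower())

def build_probabilities(spam_msgs, ham_msgs):
    """Map each token seen in either corpus to an estimated
    probability that a message containing it is spam, based on
    how often the token occurs in each corpus."""
    spam_counts = Counter(t for m in spam_msgs for t in tokenize(m))
    ham_counts = Counter(t for m in ham_msgs for t in tokenize(m))
    nspam, nham = len(spam_msgs), len(ham_msgs)
    probs = {}
    for token in set(spam_counts) | set(ham_counts):
        s = spam_counts[token] / nspam   # spam frequency per message
        h = ham_counts[token] / nham     # ham frequency per message
        probs[token] = s / (s + h)
    return probs
```

A token that appears only in the spam corpus scores near 1.0, one that appears only in legitimate mail scores near 0.0, and common words land in between.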

“When new mail arrives, it is scanned into tokens, and the most interesting 15 tokens, where interesting is measured by how far their spam probability is from a neutral 0.5, are used to calculate the probability that the mail is spam,” says Graham in his paper.

Words not in either index are assigned a value of 0.4 — on the optimistic side. Other tweaks to the formula are also designed to bias towards optimism and hence avoid “false positives” — legitimate emails falsely identified as spam.
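The scoring step Graham describes can be sketched as follows. The combining formula is the standard naive Bayes combination his paper uses; the function name and the simplified token handling are illustrative assumptions.

```python
def spam_probability(tokens, probs, default=0.4, n=15):
    """Score a message's tokens against a token-probability table.
    Only the n most 'interesting' tokens (those farthest from a
    neutral 0.5) are combined, via the naive Bayes formula
    p1*...*pn / (p1*...*pn + (1-p1)*...*(1-pn))."""
    # Tokens absent from both corpora get the optimistic default 0.4.
    ps = [probs.get(t, default) for t in set(tokens)]
    interesting = sorted(ps, key=lambda p: abs(p - 0.5), reverse=True)[:n]
    prod, inv = 1.0, 1.0
    for p in interesting:
        prod *= p
        inv *= 1.0 - p
    return prod / (prod + inv)
```

Because tokens near 0.5 are ignored, a handful of strongly spammy or strongly legitimate words dominates the verdict, which is what makes the filter both accurate and cheap to run.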

Using similar techniques, Yerazunis has devised a language for writing spam filters, and presented evidence at the conference of a 99.915% success rate at identifying spam.

If such filters fulfil their promise and catch on with users “we might not need to have a [spam] conference next year”, Graham says.

But he acknowledges that there is a problem of user apathy, as there is with users who fail to implement security patches or act as unwitting hosts for DDoS attacks.

He refers Computerworld to his comments on the subject. There he points out that the “idiots” most likely to respond to spam are not the people for whom most filter software is designed.

“The person who responds to spam is a rare bird. Response rates can be as low as 15 per million. That’s the whole problem: spammers waste the time of a million people just to reach the 15 stupidest or most perverted. The great danger is that whatever filter is most widely deployed in the idiot market will require too much effort by the user.

“As long as the 15 idiots continue to see spams, we’re all going to be sent them. So whether filters put an end to spam depends on how the email software used by the idiots is designed. My guess is that idiots are pretty passive, so the key here is to make the default do the right thing.

“Hear me, O AOL and Microsoft: when you release Bayesian filters, don’t make all the users train their own filters from scratch. Use initial filters based on mail classified by all your users. That way, as long as the user just keeps blindly clicking, most email will end up in the right corpus (in the spam or not-spam bin as appropriate).

“Do that, and spam will decrease, which will mean lower infrastructure costs, and thus greater profits for you,” he says.
