FRAMINGHAM (09/25/2003) - We suspect that you had more than enough mathematics in the form of Bayes Theorem last week so this week we'll explain how it's used in what is called Bayesian filtering to remove spam (note that the same techniques also can be used for general text filtering).
One of the best and most accessible descriptions of Bayesian filtering is an article from August 2002 titled "A Plan for Spam" by Paul Graham.
The background to this work was an attempt to build a spam-proof Web-based mail reader written in a version of Lisp called Arc that Graham and his colleagues were developing (we have no idea what happened to this project).
Upfront Graham says that using the filter "we now miss less than 5 per 1,000 spams, with 0 false positives." Sounds good, and the amazing thing about the filter is that it really isn't that complicated. Actually, if you read Graham's first and second papers you'll see he added features to the second version that improved performance considerably.
To create the filter, sample sets of spam and non-spam mail were analyzed by breaking the text into tokens. Tokens are created from the text of a message using a few simple rules: case is preserved; exclamation marks are retained; periods and commas are retained if they occur between two digits (so that IP addresses and prices survive as single tokens); price ranges (such as US$20 to $25) are recognized and turned into two separate tokens (for example, $20 and $25); and tokens that occur within header fields such as To:, From: and Subject:, or within URLs, are prefaced with an identifier so they can be weighted separately from body tokens (for example, "Enlargement" in the Subject: becomes "Subject*Enlargement" - the asterisk can be any character that is ignored in messages).
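A minimal Python sketch of a tokenizer along these lines (the regular expression and function names are ours, not Graham's, and this covers only the rules described above):

```python
import re

# Case is preserved; '!' is kept as part of a word; '.' and ',' survive
# only between digits, so 192.168.0.1 and $19.95 remain single tokens.
TOKEN_RE = re.compile(r"\$?\d[\d.,]*\d|\$?\d|[A-Za-z']+!*")

def tokenize(text, prefix=""):
    """Split text into tokens, optionally tagging each with a field
    prefix (e.g. 'Subject*') so header tokens are counted separately
    from the same word appearing in the body."""
    return [prefix + t for t in TOKEN_RE.findall(text)]
```

So `tokenize("Enlargement", "Subject*")` yields `["Subject*Enlargement"]`, keeping the header occurrence distinct from a body occurrence of the same word.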
The frequency of all tokens in each set is counted, and Bayes Theorem is applied to derive a second table: for each token, the probability that a message containing it is spam. It is worth noting that Graham weights the non-spam tokens twice as heavily as spam tokens to minimize false positives - the extra weight actually makes the filter less sensitive to spam but safer for non-spam (on the other hand, the success Graham claims makes the decreased sensitivity irrelevant).
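Graham's paper gives this per-token calculation explicitly (he writes it in Lisp); a Python rendering might look like the following. Treating tokens seen fewer than five times as "unknown" and giving them Graham's 0.4 value is a simplification of ours - in his code such tokens simply get no table entry:

```python
def spam_probability(token, good, bad, ngood, nbad):
    """Per-token spam probability per Graham's 'A Plan for Spam'.
    good/bad map tokens to occurrence counts; ngood/nbad are the
    number of non-spam and spam messages in the training sets."""
    g = 2 * good.get(token, 0)          # non-spam occurrences doubled
    b = bad.get(token, 0)
    if g + b < 5:
        return 0.4                      # simplification: Graham's value
                                        # for words the filter hasn't seen
    p_bad = min(1.0, b / nbad)
    p_good = min(1.0, g / ngood)
    # Clip to [0.01, 0.99] so no single token is ever conclusive.
    return max(0.01, min(0.99, p_bad / (p_good + p_bad)))
```

Note how the doubled `g` is what biases the table toward non-spam, and how the clipping keeps any one token from deciding a message outright.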
When new mail arrives the tokens are extracted, and the 15 most "interesting" tokens - those whose spam probabilities lie furthest from a neutral 0.5 - are used to calculate the probability that the message is spam. If the spam probability exceeds the threshold value - Graham uses 0.9 - then the message is assumed to be spam.
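The individual token probabilities are combined with the standard naive-Bayes formula; a sketch (function names are ours):

```python
def interesting(probs, n=15):
    """The n probabilities furthest from a neutral 0.5 - the tokens
    that say the most, in either direction, about the message."""
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]

def combined_probability(probs):
    """Naive-Bayes combination of per-token spam probabilities."""
    prod = inv = 1.0
    for p in probs:
        prod *= p           # product of the probabilities
        inv *= 1.0 - p      # product of their complements
    return prod / (prod + inv)
```

A message whose most interesting tokens are all strongly spammy (say, several 0.99s) combines to very nearly 1.0, while evenly split evidence combines back toward 0.5.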
Graham's second filter had a number of improvements over the first version that broadened the range of tokens created and how they were compared with the table of tokens with spam probabilities. See Graham's second paper, "A Better Bayesian Filter", for the details.
One interesting issue is how to handle HTML mail - do you ignore the markup or analyze it? Graham chose the middle road, opting to analyze some HTML elements such as anchors and images, as these contain URLs that often include words or patterns indicating spam. Even font tags prove useful because spammers apparently have a predilection for specifying bright red text!
Does it work? Well, Graham says in his last article: "Between Dec. 10, 2002, and Jan. 10, 2003, I got about 1,750 spams. Of these, four got through. That's a filtering rate of about 99.75 percent." Impressive stuff.
So why does Bayesian filtering work, and can spammers get around it? Bayesian filtering works by pure statistics, which is far simpler and computationally more efficient than trying to have a filter that understands message syntax or content in any meaningful way. The use of tokens provides a broad and consistent range of data points and, if you tune the filter by adding or removing your own spam and non-spam words, you wind up with a filter that is unique to you to a greater or lesser degree. This means that a spammer can't make guesses about any given filter configuration accurately enough to defeat it except by blind luck!
But as effective as Bayesian filtering can be, it requires optimization to be really good. And combining it with whitelists to minimize false positives, blacklists to minimize false negatives and other techniques, such as source verification, makes for a more accurate spam filter that needs less fine-tuning.
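Such a layered arrangement is simple to express in code. A sketch, with the whitelist and blacklist as plain sets of sender addresses and the Bayesian score as the fallback (this structure is our illustration, not a particular product's design):

```python
def classify(sender, spam_score, whitelist, blacklist, threshold=0.9):
    """Layered decision: whitelist beats blacklist beats Bayes."""
    if sender in whitelist:
        return "ham"      # known correspondent: never filtered
    if sender in blacklist:
        return "spam"     # known spam source: always filtered
    return "spam" if spam_score > threshold else "ham"
```

The ordering matters: the whitelist is checked first precisely so that a false positive from the statistical filter can never lose mail from a known correspondent.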
See if we filter you at firstname.lastname@example.org.