These days, the choice of spam filters comes down to Bogofilter and SpamAssassin. Other choices, like DSPAM, are no longer in development. Although a few other choices (e.g., SpamBayes) are available, when an email reader offers a plugin, it is almost always for either Bogofilter or SpamAssassin. However, what is less often discussed is which filter is the best to use in which circumstances. Instead, most users simply nod solemnly when they read that both involve “Bayesian filtering.” Most of us, including many who use the phrase, have no idea what Bayesian filtering is, but it sounds scientific and reassures us that either choice is acceptable.

In fact, learning that Bogofilter and SpamAssassin are “Bayesian” is useless for choosing between them. To call them Bayesian means nothing more than that their structure is based on the 18th-century work of Thomas Bayes in statistics and probability. More specifically, both apply Bayes’ work by collecting words and assigning a probability that each word indicates spam. The more suspect words an email contains, the greater the chances it is spam. However, making an informed choice between spam filters requires considerably more detail.

Bogofilter has its roots in “A Plan for Spam,” a 2002 essay by English developer Paul Graham. After trying to develop filters based on the identifying characteristics of spam, Graham concluded that beyond a certain point, the more rules he added, the more false positives he obtained; that is, the more email messages were incorrectly identified as spam. Graham’s solution was to parse his samples of spam and non-spam into tokens, or individual words, and use Bayesian tools to assign each token a probability that it indicates spam, biasing the values slightly in favor of not being spam to minimize false positives. By examining the top 15 tokens in the header and body of each new email message, he calculated the probability that it was spam. If the probability was greater than 0.9, the message was considered spam.

According to Graham, the advantage of this statistical approach is that it refers to something real (the probability of being spam) and works with both neutral and spam-indicating words. However, he also recognized that the more personalized the filter was, the more accurate it would be. For this reason, he also included the possibility of using white lists to indicate non-spam, or “ham,” and black lists to indicate spam.

After reading Graham’s essay, Eric S. Raymond wrote Bogofilter. Today, Bogofilter is maintained by other developers, who have refined Graham’s calculations based on Gary Robinson’s suggestions. The modern refinements include recognizing MIME types, treating each hostname and IP address as a separate token (rather than dividing them up into separate words), and ignoring dates and Message-IDs as irrelevant. However, the basic approach remains that advocated by Graham. The mathematically inclined can learn more about how Bogofilter assigns the probability of an email being spam by following the links and reading the man page for the filter. However, the most important point for the average user is that Bogofilter relies on statistical probability, supplemented by each user’s list of spam and ham. Advocates of this approach emphasize its simplicity, as well as its lower number of false positives once it is trained, that is, once the white and black lists are produced. These lists are stored in the .bogofilter folder in your home directory.

SpamAssassin takes a different approach from Bogofilter. SpamAssassin’s main approach is to identify the characteristics of spam and then run tests to locate them. Many tests, although not all, rely heavily on regular expressions to catch variations of words and phrases. You can view the Perl scripts used by SpamAssassin in /usr/share/spamassassin. More than 50 are listed in my current installation of Debian Stable. From their number alone, you can tell they are a varied lot, but they include tests for the common indicators of spam in headers, in the bodies of email, and in HTML code, as well as tests for recognizing offers for anti-viruses, drugs, and pornography. In the English version, some basic tests for French, German, and Italian are also included. They also include a Bayesian probability test similar to Bogofilter’s, as well as white and black lists for individual customization.

Figure 2: Debian adds its own SpamAssassin tests to the already comprehensive list.

Each test assigns an email message a positive or negative value, which is added to the results of other tests to determine whether the email is ham or spam. Unlike with Bogofilter, exactly what these values represent is uncertain, although, considering that many users probably have no understanding of Bayesian analysis, much the same could be said of Bogofilter.
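
To make the contrast between the two approaches concrete, here is a minimal Python sketch. It is not code from Bogofilter or SpamAssassin. The first function follows the combining step Graham describes in “A Plan for Spam” (the 15 most telling tokens, a 0.9 cutoff); the 0.4 default for unseen tokens and the clamping to 0.01 and 0.99 are also taken from his essay rather than from Bogofilter itself. The second function imitates rule-based scoring by summing the values of whichever regular-expression tests match. Every token probability, rule name, pattern, score, and sample message below is invented for illustration, and the 5.0 cutoff is an assumption modeled on SpamAssassin's default required score.

```python
import math
import re

# Hypothetical per-token spam probabilities, as a trained filter might store them.
TOKEN_SPAM_PROB = {
    "free": 0.90, "viagra": 0.99, "click": 0.80,
    "meeting": 0.10, "agenda": 0.05,
}

def graham_probability(tokens, probs, unknown=0.4, top_n=15):
    """Combine per-token probabilities in the style of 'A Plan for Spam'."""
    # Unseen tokens get a mildly innocent default; clamp to avoid 0 and 1.
    ps = [min(0.99, max(0.01, probs.get(t, unknown))) for t in set(tokens)]
    # Keep the tokens whose probability is farthest from a neutral 0.5.
    ps.sort(key=lambda p: abs(p - 0.5), reverse=True)
    ps = ps[:top_n]
    spam = math.prod(ps)
    ham = math.prod(1.0 - p for p in ps)
    return spam / (spam + ham)

# Hypothetical rules: (name, pattern, score). Negative scores favor ham.
RULES = [
    ("FREE_OFFER", re.compile(r"\bfr[e3]{2}\b", re.I), 2.0),
    ("PHARMA", re.compile(r"\bv[i1]agra\b", re.I), 3.5),
    ("KNOWN_SENDER", re.compile(r"^From: .*@example\.org$", re.M), -3.0),
]

def rule_score(message, rules):
    """Add up the scores of every rule whose pattern matches the message."""
    return sum(score for _name, pattern, score in rules if pattern.search(message))

SPAM_MSG = "From: x@spam.test\nSubject: FR33 V1agra\n\nclick now"
HAM_MSG = "From: boss@example.org\nSubject: agenda\n\nmeeting at noon"

def is_spam_bayes(message, cutoff=0.9):
    # Naive whitespace tokenizer; real filters tokenize far more carefully.
    return graham_probability(message.lower().split(), TOKEN_SPAM_PROB) > cutoff

def is_spam_rules(message, required=5.0):
    return rule_score(message, RULES) >= required
```

The statistical function returns a number with a direct meaning, an estimated probability between 0 and 1. The rule-based total means something only relative to the chosen cutoff, which is the difference the article points to when it says that what the test values represent is uncertain.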