Bayes Filtering in SpamAssassin

The Bayesian classifier in SpamAssassin began tagging emails a few days ago. I found this out because, although messages were not marked as spam, my procmail rule started diverting all of them to my spam folder. The old rule was not particular about where the yes appeared in the X-Spam-Status header, and procmail ignores case by default, so the YES inside BAYES made every email look like spam. The new rule only matches a Yes at the beginning of the header value.

# Old Rule
:0 H
* ^X-Spam-Status:.*Yes
$MAIL/spam

# New Rule
:0 H
* ^X-Spam-Status: Yes
$MAIL/spam
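
The difference between the two patterns can be reproduced outside procmail. This is only a sketch: grep -i stands in for procmail's case-insensitive matching, and the header value is a made-up sample rather than one copied from my mail.

```shell
# Hypothetical header from a message that is NOT spam but has a Bayes hit.
header='X-Spam-Status: No, score=1.2 required=5.0 tests=BAYES_50 autolearn=no'

# Old pattern: "Yes" may appear anywhere after the header name.
old_rule_matches() {
    printf '%s\n' "$1" | grep -qi '^X-Spam-Status:.*Yes'
}

# New pattern: "Yes" must directly follow the header name.
new_rule_matches() {
    printf '%s\n' "$1" | grep -qi '^X-Spam-Status: Yes'
}

if old_rule_matches "$header"; then echo "old rule: match"; else echo "old rule: no match"; fi
if new_rule_matches "$header"; then echo "new rule: match"; else echo "new rule: no match"; fi
```

With this sample the old pattern matches on the "YES" in BAYES_50, while the new pattern does not match at all.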

Now incoming spam messages contain an additional score in the spam report.

X-Spam-Report:
        *  3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
        *      [score: 1.0000]

I was surprised that it took the Bayes filter three months to gather enough email to begin scoring incoming messages. It is a nice addition because it raises the scores of spam messages enough that more of them cross the threshold and get marked as spam.
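
The wait is probably explained by SpamAssassin's defaults: as far as I know, the Bayes rules stay silent until the database has learned at least 200 spam and 200 ham messages (the bayes_min_spam_num and bayes_min_ham_num settings). The sketch below checks those counts from the output of sa-learn --dump magic. The function name bayes_ready and the sample numbers are my own, and the sample input only imitates the layout I expect from that command.

```shell
# Read `sa-learn --dump magic` style output on stdin and report whether
# both message counts have reached the given minimum (default 200).
bayes_ready() {
    awk -v min="${1:-200}" '
        $NF == "nspam" { nspam = $3 }
        $NF == "nham"  { nham  = $3 }
        END {
            printf "nspam=%d nham=%d\n", nspam, nham
            exit !(nspam >= min && nham >= min)
        }'
}

# Illustrative input with made-up numbers; on the server this would be
#   sa-learn --dump magic | bayes_ready
# run as the user that owns the Bayes database.
sample='0.000          0          3          0  non-token data: bayes db version
0.000          0        412          0  non-token data: nspam
0.000          0        187          0  non-token data: nham'

if printf '%s\n' "$sample" | bayes_ready; then
    echo "Bayes scoring should be active"
else
    echo "Bayes is still collecting mail"
fi
```

In the sample the ham count is below 200, so the function reports that Bayes is still collecting mail.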

URIBL SpamAssassin Settings

I have been receiving a lot of legitimate emails that get marked as spam because of the web links they contain. According to URIBL.COM, the listed links are ones that appear in spam, not ones where the spam originates. Therefore, about all I can do is whitelist the senders or dial down the scores on the rules for these filters. After adding a handful of senders to the whitelist, I decided to alter the rules.

I found all of the URIBL rules in the /usr/share/spamassassin/50_scores.cf file. I copied them to the /etc/spamassassin/local.cf file where I could change the values to something more reasonable:

score URIBL_AB_SURBL 0 0.800 0 0.900 # n=0 n=2
score URIBL_JP_SURBL 0 1.400 0 0.700 # n=0 n=2
score URIBL_OB_SURBL 0 1.000 0 0.700 # n=0 n=2
score URIBL_PH_SURBL 0 1.000 0 0.800 # n=0 n=2
score URIBL_RHS_DOB 0 0.400 0 0.500 # n=0 n=2
score URIBL_SBL 0 1.200 0 0.700 # n=0 n=2
score URIBL_SC_SURBL 0 1.200 0 0.200 # n=0 n=2
score URIBL_WS_SURBL 0 1.000 0 0.700 # n=0 n=2
score URIBL_BLACK 0 0.900 0 0.900 # n=0 n=2

Hopefully decreasing the scores for these rules will reduce the number of false positives I have been seeing.
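
Each score line above carries four values, one per "score set": Bayes off and network tests off, Bayes off and network tests on, Bayes on and network tests off, and both on. Since I run with both Bayes and network checks enabled, only the fourth value matters on my server. The helper below, active_score, is a name I made up for illustration; it just picks the right column from a score line.

```shell
# Print the score that applies for a given mode.
# usage: active_score "<score line>" <bayes: 0|1> <net: 0|1>
active_score() {
    printf '%s\n' "$1" | awk -v bayes="$2" -v net="$3" '
        $1 == "score" {
            idx = bayes * 2 + net      # score set 0..3
            print $(3 + idx)           # fields 3..6 hold the four scores
        }'
}

line='score URIBL_BLACK 0 0.900 0 0.900 # n=0 n=2'
active_score "$line" 1 1   # Bayes and network tests on
active_score "$line" 1 0   # Bayes on, network tests off
```

The zeros in the first and third columns make sense because URIBL lookups are network tests, so they cannot fire when network tests are off.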

Configure SpamAssassin with Postfix on Ubuntu

I’ve been running a mail server for the last year and a half. When I initially set up my Postfix mail server on Ubuntu, I knew that eventually I would need to add a spam filter. I recently decided that SpamAssassin was the best choice to filter email on my mail server.

I now receive on average more than one spam message each day. Interestingly, all of my spam is sent to an email address that I have only given out to Marquette University. I guess that means they have either sold my email address or poorly secured it in their database. Neither would surprise me.

I used the content from two different tutorials to get SpamAssassin up and running on my server.

First, I installed SpamAssassin.

apt-get install spamassassin spamc

Next, I created the spamd user and group. You can pass an explicit uid and gid if you want.

groupadd spamd
useradd -g spamd -s /bin/false -d /var/log/spamassassin spamd

Then I created the spamd home directory and set the permissions.

mkdir /var/log/spamassassin
chown spamd:spamd /var/log/spamassassin

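Then I set up some configuration for SpamAssassin. You can edit the file directly, but I use sed so that I can automate the installation process in a script. The script enables the daemon (ENABLED=1) and the packaged cron job (CRON=1), and replaces the OPTIONS line so that spamd runs as the spamd user and writes its log under that user's home directory.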

DEFAULT_SPAMASSASSIN=/etc/default/spamassassin
mv $DEFAULT_SPAMASSASSIN $DEFAULT_SPAMASSASSIN.default
sed '
    s/ENABLED=0/ENABLED=1/
    s/CRON=0/CRON=1/
    s/^OPTIONS.*/SAHOME="\/var\/log\/spamassassin"\nOPTIONS="--create-prefs --max-children 5 --username spamd -H ${SAHOME} -s ${SAHOME}\/spamd.log"/
' $DEFAULT_SPAMASSASSIN.default > $DEFAULT_SPAMASSASSIN
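
Before touching anything under /etc, the sed script can be tried on a throwaway copy. The sketch below does that in a temporary directory. The sample file is a simplified stand-in I wrote for illustration, not the real stock file, and the \n in the replacement relies on GNU sed.

```shell
# Work on a throwaway copy so nothing under /etc is touched.
tmpdir=$(mktemp -d)
sample="$tmpdir/spamassassin.default"

# Hypothetical stand-in for the stock /etc/default/spamassassin.
cat > "$sample" <<'EOF'
ENABLED=0
OPTIONS="--create-prefs --max-children 5 --helper-home-dir"
CRON=0
EOF

# Same sed script as above, reading the sample instead of the real file.
result=$(sed '
    s/ENABLED=0/ENABLED=1/
    s/CRON=0/CRON=1/
    s/^OPTIONS.*/SAHOME="\/var\/log\/spamassassin"\nOPTIONS="--create-prefs --max-children 5 --username spamd -H ${SAHOME} -s ${SAHOME}\/spamd.log"/
' "$sample")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

The ${SAHOME} references stay literal because the sed script is single-quoted; they are only expanded later, when the init script sources the file.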

Then I set up the rest of the configuration for SpamAssassin. I initially set the required score to 2.0, but this caused a lot of legitimate emails (ham) to be marked as spam, so I raised it to 5.0. The following configuration also rewrites the subjects of spam messages to identify them as spam.

SA_LOCAL_CF=/etc/spamassassin/local.cf
mv $SA_LOCAL_CF $SA_LOCAL_CF.default
cat > "$SA_LOCAL_CF" <<'EOF'
rewrite_header Subject [***** SPAM _SCORE_ *****]
required_score           5.0
# to be able to use _SCORE_ we need report_safe set to 0
# If this option is set to 0, incoming spam is only
# modified by adding some "X-Spam-" headers and no
# changes will be made to the body.
report_safe     0

# Enable the Bayes system
use_bayes               1
use_bayes_rules         1
# Enable Bayes auto-learning
bayes_auto_learn        1

# Enable or disable network checks
skip_rbl_checks         0
use_razor2              0
use_dcc                 0
use_pyzor               0
EOF

Now that I have been running the spam filter for a couple of weeks, I have had to whitelist some senders whose emails have strange headers or come from “shady” IP addresses. The whitelist entries go into the same local.cf file.

whitelist_from *@hq.acm.org

I find it amusing that emails from the ACM keep getting marked as spam. Next I started SpamAssassin.

/etc/init.d/spamassassin start

Next, I modified Postfix to send emails through the SpamAssassin filter.

POSTFIX_MASTER_CF=/etc/postfix/master.cf
mv $POSTFIX_MASTER_CF $POSTFIX_MASTER_CF.default
sed 's/smtp      inet  n       -       -       -       -       smtpd/smtp      inet  n       -       -       -       -       smtpd\n\t-o content_filter=spamassassin/'  \
$POSTFIX_MASTER_CF.default > $POSTFIX_MASTER_CF
echo 'spamassassin unix -     n       n       -       -       pipe
  user=spamd argv=/usr/bin/spamc -f -e    
  /usr/sbin/sendmail -oi -f ${sender} ${recipient}' >> $POSTFIX_MASTER_CF
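
One caveat: the sed pattern above only matches if the smtp line in master.cf uses exactly that column spacing. A more forgiving variant might look like the sketch below. The function name add_content_filter and the sample lines are my own, and the \n and \t in the replacement assume GNU sed.

```shell
# Append the content_filter override after the smtp/inet/smtpd service line,
# whatever whitespace separates its columns. Reads master.cf text on stdin.
add_content_filter() {
    sed -E 's/^(smtp[[:space:]]+inet[[:space:]].*[[:space:]]smtpd)[[:space:]]*$/\1\n\t-o content_filter=spamassassin/'
}

# Hypothetical master.cf excerpt with a tab and uneven spacing.
printf 'smtp\tinet  n  -  -  -  -  smtpd\npickup    unix  n  -  -  60  1  pickup\n' |
    add_content_filter
```

Only the smtp service line gains the indented -o line; other services, such as pickup here, pass through unchanged.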

Next, reload Postfix so it will use SpamAssassin.

/etc/init.d/postfix reload

Once SpamAssassin is running, you can train it by passing it spam and ham emails.

sa-learn -u spamd --spam --mbox /path/to/spam_mbox
sa-learn -u spamd --ham --mbox /path/to/ham_mbox

After adjusting the spam threshold, training the filter with spam messages that I have acquired over the last year, and whitelisting a few problematic senders, my spam filter has been doing a good job of marking spam as spam. At this point it is easy enough to sort through the tagged messages manually and confirm that they are spam. In the future, if the volume ever gets bad enough, I will be able to automatically delete the messages or filter them into a different mailbox on delivery.
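
The effect of a threshold change can be checked against mail that has already been scored, because the X-Spam-Status header records the score. The sketch below assumes the header carries a score=N.N field, which may differ between SpamAssassin versions. The function name would_tag and the sample header are made up for illustration.

```shell
# Decide whether a message would be tagged at a given threshold, using the
# score recorded in its X-Spam-Status header (tagged when score >= threshold).
# usage: would_tag "<X-Spam-Status header>" <threshold>
would_tag() {
    printf '%s\n' "$1" | awk -v t="$2" '
        match($0, /score=-?[0-9.]+/) {
            score = substr($0, RSTART + 6, RLENGTH - 6)   # text after "score="
            tagged = (score + 0 >= t + 0)
        }
        END { exit !tagged }'
}

# Hypothetical ham message that scored 3.1.
header='X-Spam-Status: No, score=3.1 required=5.0 tests=HTML_MESSAGE'

for threshold in 2.0 5.0; do
    if would_tag "$header" "$threshold"; then
        echo "threshold $threshold: tagged as spam"
    else
        echo "threshold $threshold: delivered as ham"
    fi
done
```

A message like this one is the kind that was wrongly tagged at my original threshold of 2.0 and passes cleanly at 5.0.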