Integrating CRM114 into MIMEDefang

UPDATED 4/14/04 - The crmcheck and crmlearn scripts were updated to use a new Perl script crmstrip (below). The crmstrip script strips locally generated mail headers from messages to be learned so that those messages better resemble what they looked like when classified (which happens prior to the insertion of local headers).

Intro

Sounds like there aren't too many people doing this just yet, so I thought I'd whip up a quick howto, and I do mean quick. For instance, it's assumed you already have MIMEDefang installed.


Step One - Build/Install CRM114

This part is pretty easy and well documented, so I'm not going to waste time with it here. Go to the CRM114 project site, download, unpack and read the howto.

One thing that's not well covered in the CRM114 docs is installation. Sure, you do the standard make install thing, but that's just the binaries. What I did was create a /etc/mail/crm114 directory. Then I went about creating the .css files, copying over stuff from the source tree that looked important, editing config files, twiddling ownership, permissions and whatnot.
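
If you'd rather script the copying part than do it by hand, here's a rough sketch. It assumes the .crm, .cf and .mfp files sit at the top level of the unpacked source tree, and the function name and its two arguments are made up for illustration (they are not part of CRM114). It does not create the .css files; do that the way the CRM114 howto describes.

```shell
#!/bin/sh
# Sketch only: populate a CRM114 working directory from an unpacked source tree.
# crm_populate and its arguments are illustrative names, not part of CRM114.
crm_populate() (
    src=$1      # unpacked CRM114 source tree
    dest=$2     # working directory, e.g. /etc/mail/crm114
    mkdir -p "$dest" || exit 1

    # Copy the filter scripts, the config file and the pattern lists.
    for f in "$src"/*.crm "$src"/*.cf "$src"/*.mfp; do
        [ -f "$f" ] && cp -p "$f" "$dest"/
    done

    # The .crm scripts get run directly, so they need the execute bit.
    for f in "$dest"/*.crm; do
        [ -f "$f" ] && chmod 755 "$f"
    done
    exit 0
)
```

Run it as something like: crm_populate /path/to/crm114-source /etc/mail/crm114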

So now that most of the dust has settled and it's been running for a while, /etc/mail/crm114 looks like this:

-rw-rw-rw-    1 defang   defang    5925848 Mar  4 22:43 allmail.txt
-rw-r--r--    1 root     root            0 Mar  1 21:35 blacklist.mfp
-rw-r--r--    1 root     root         4647 Mar  1 21:35 blacklist.mfp.orig
-rwxr-xr-x    1 root     root         1543 Feb 28 21:24 classifymail.crm
-rwxr-xr-x    1 root     root         1214 Feb 28 21:24 mailexpand.crm
-rw-r--r--    1 root     root         5596 Feb 28 20:12 mailfilter.cf
-rwxr-xr-x    1 root     root        27675 Feb 28 21:23 mailfilter.crm
-rwxr-xr-x    1 root     root          257 Feb 29 13:55 mdcrm
-rw-rw-rw-    1 defang   defang   12582924 Feb 28 15:22 nonspam.css
-rw-rw-rw-    1 defang   defang     137653 Mar  4 22:17 nonspamtext.txt
-rwxr-xr-x    1 root     root         4884 Feb 28 21:24 pad.crm
-rw-r--r--    1 root     root            0 Mar  4 00:01 priolist.mfp
-rw-r--r--    1 root     root           49 Mar  4 00:01 priolist.mfp.orig
-rw-rw-rw-    1 defang   defang       4515 Feb 29 22:51 rejected_by_blacklist.txt
-rw-rw-rw-    1 defang   defang    1908106 Mar  4 22:17 rejected_by_css.txt
-rw-r--r--    1 root     root          163 Feb 28 21:30 rewrites.mfp
-rwxr-xr-x    1 root     root         1561 Feb 28 21:24 rewriteutil.crm
-rw-r--r--    1 root     root          267 Feb 28 21:24 scrub_mailfile_rewrites.mfp
-rwxr-xr-x    1 root     root          310 Feb 28 21:24 shroud.crm
-rw-rw-rw-    1 defang   defang   12582924 Feb 28 15:22 spam.css
-rw-rw-rw-    1 defang   defang      95993 Mar  4 21:58 spamtext.txt
-rw-r--r--    1 root     root           48 Mar  2 00:04 whitelist.mfp
-rw-r--r--    1 root     root           67 Mar  1 21:36 whitelist.mfp.orig

Obviously leaving files at 666 is going to be very non-smart on some boxes, but I'm the only user on mine. If you don't want to leave them wide open on your system, just keep in mind that defang needs write access to some files and whatever user account does the learning also needs write access to at least the .css files.
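
For a multi-user box, one possible way to tighten things up is sketched below. It assumes the writable files are the .css and .txt files (as in the listing above), that they stay owned by defang, and that whoever does the learning is in a shared group. The function name and the optional group argument are hypothetical.

```shell
#!/bin/sh
# Sketch only: drop world access on the files CRM114 writes to.
# crm_tighten is an illustrative name; the group argument is optional.
crm_tighten() (
    dir=$1
    group=$2    # e.g. a group shared by defang and the learning user
    for f in "$dir"/*.css "$dir"/*.txt; do
        [ -f "$f" ] || continue
        if [ -n "$group" ]; then
            chgrp "$group" "$f" || exit 1
        fi
        # Owner and group may read and write; everyone else gets nothing.
        chmod 660 "$f" || exit 1
    done
)
```

The config files, scripts and .mfp lists are left alone, since nothing needs to write to them at run time.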

Step Two - Build a wrapper to run CRM114

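While you could run CRM114 directly from MIMEDefang without a wrapper, I chose to use one. It keeps the long command line in one place and logs the start/end times, the working directory and anything CRM114 writes to stderr. Call me silly, but to me this is "better."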

/etc/mail/crm114/mdcrm:

#!/bin/sh

LOG=/tmp/mdcrm.log

echo "======================================" >> $LOG
echo "Start: `date`" >> $LOG
pwd >> $LOG

/etc/mail/crm114/mailfilter.crm --fileprefix=/etc/mail/crm114/ \
  --stats_only < ./INPUTMSG 2>> $LOG

echo "End: `date`" >> $LOG
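
Since the wrapper reads ./INPUTMSG from whatever directory it is started in, you can try it by hand before touching MIMEDefang. The sketch below copies a saved message into a scratch directory as INPUTMSG and runs the wrapper there. The function name is made up, and the second argument exists only so a different wrapper path can be substituted.

```shell
#!/bin/sh
# Sketch only: run the wrapper against a saved message, outside MIMEDefang.
# mdcrm_try is an illustrative name.
mdcrm_try() (
    msg=$1
    wrapper=${2:-/etc/mail/crm114/mdcrm}
    work=$(mktemp -d) || exit 1

    # The wrapper expects the message as ./INPUTMSG in its working directory.
    cp "$msg" "$work/INPUTMSG" || exit 1
    cd "$work" || exit 1
    "$wrapper"
    status=$?

    # Clean up the scratch directory and pass the wrapper's status along.
    cd / && rm -rf "$work"
    exit "$status"
)
```

Whatever the wrapper prints on stdout is what the MIMEDefang code will see, so this is a quick way to check that a score actually comes out.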

Step Three - Modify MIMEDefang

I added the following code to MD's filter_begin() function:

# Testing CRM114 here...
    if ((-s "./INPUTMSG") <= (500 * 1024)) {    # 500kB limit
        if (open(CRM, "/etc/mail/crm114/mdcrm |")) {
            my @result = <CRM>;
            close(CRM)
              or md_graphdefang_log("Failed closing mdcrm");

            # The first line of output is treated as the pR score;
            # skip tagging if the wrapper printed nothing.
            if (@result) {
                chomp $result[0];
                if ($result[0] < 0) {
                    action_add_header("X-CRM114-Status", "SPAM  ( pR: $result[0] )");
                } else {
                    action_add_header("X-CRM114-Status", "HAM  ( pR: $result[0] )");
                }
            }
        } else {
            md_graphdefang_log("Failed opening mdcrm");
        }
    }

Note that all this code does is tag. That's fine if you are relying on procmail to direct your spam/ham to their respective mailboxen. And since you will need to babysit the filter for some time (indefinitely?), that's probably the best approach. I may at some point add code to refuse mail that gets particularly low crm scores, stay tuned...
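
If you want to get a feel for where a refuse threshold might sit before changing the filter, something like the following sketch can be run over saved mail. It pulls the pR value out of the X-CRM114-Status header (in the format added above) and sorts it into three buckets. The function names and the cutoff value are hypothetical, and this is a tuning aid rather than the refusal code itself.

```shell
#!/bin/sh
# Sketch only: read a message on stdin, print the pR from its
# X-CRM114-Status header.  Stops at the end of the headers.
crm_pr() {
    awk '
        /^$/ { exit }
        /^X-CRM114-Status:/ {
            if (match($0, /pR:[ \t]*-?[0-9.]+/)) {
                s = substr($0, RSTART, RLENGTH)
                sub(/^pR:[ \t]*/, "", s)
                print s
                exit
            }
        }'
}

# $1 = pR value, $2 = reject cutoff (hypothetical; pick your own).
# Below the cutoff is "reject", below zero is "spam", otherwise "ham".
crm_verdict() {
    awk -v pr="$1" -v cut="$2" 'BEGIN {
        if (pr + 0 < cut + 0) print "reject"
        else if (pr + 0 < 0)  print "spam"
        else                  print "ham"
    }'
}
```

For example, crm_verdict "$(crm_pr < somemessage)" -100 shows which bucket a saved message would fall into with a cutoff of -100.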

Step Four - Checking/Training scripts

These are designed to save a little typing. Both require an email on stdin. Modify as you see fit.

/usr/local/bin/crmcheck:

#!/bin/sh

/usr/local/bin/crmstrip | /etc/mail/crm114/mailfilter.crm --fileprefix=/etc/mail/crm114/ | grep CRM

exit 0

/usr/local/bin/crmlearn:

#!/bin/sh

case "$1" in
	-s)
		/usr/local/bin/crmstrip | /etc/mail/crm114/mailfilter.crm --fileprefix=/etc/mail/crm114/ --learnspam | grep CRM
	;;
	-h|-n)
		/usr/local/bin/crmstrip | /etc/mail/crm114/mailfilter.crm --fileprefix=/etc/mail/crm114/ --learnnonspam | grep CRM
	;;
	*)
		cat >&2 <<EOF
Try -s for spam and -h or -n for ham.
EOF
		exit 1
	;;
esac

exit 0
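
If you collect misclassified messages in a folder, one file per message, a small loop can feed them to crmlearn one at a time. This is only a sketch: the function name is made up, the third argument is there so another learn command can be swapped in, and it will not work on an mbox file with many messages in it. Point it only at messages the filter got wrong.

```shell
#!/bin/sh
# Sketch only: learn every message file in a directory.
# crm_learn_dir is an illustrative name.
crm_learn_dir() (
    flag=$1                                 # -s for spam, -h or -n for ham
    dir=$2                                  # one message per file
    learn=${3:-/usr/local/bin/crmlearn}
    for msg in "$dir"/*; do
        [ -f "$msg" ] || continue
        # Each message goes to the learn script on stdin, as it expects.
        "$learn" "$flag" < "$msg" || echo "learn failed: $msg" >&2
    done
)
```

Usage would be along the lines of: crm_learn_dir -s /path/to/missed-spam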

Both crmcheck and crmlearn pipe the message through this helper first:

/usr/local/bin/crmstrip:

#!/usr/bin/perl

# Here's the story...

# CRM114 was giving me grief, saying it didn't need to learn messages that it
# may have misclassified only moments ago.  The only explanation I can come
# up with is that the messages being classified via milter have fewer
# headers than when they actually arrive in my mailbox, so basically CRM114
# is learning from a slightly different message than what was misclassified.
# My solution is to use Perl to strip out the added headers and return the
# message to the form it was in when originally classified.

# Using a handy example, we need to go from this:

# From r.ratliffuo@modsim.co.kr  Wed Apr 14 20:39:21 2004
# Return-Path: <r.ratliffuo@modsim.co.kr>
# Received: from embassi.de ([218.146.9.3])
#         by calvin.boinklabs.com (8.12.8/8.12.8) with SMTP id i3F0dCnI021372;
#         Wed, 14 Apr 2004 20:39:16 -0400
# Message-ID: <5d2b01c42282$a3d602b0$8bb757f1@embassi.de>
# From: "Reggie Ratliff" <r.ratliffuo@modsim.co.kr>
# To: cwilkins@boinklabs.com, cwilkins-web@boinklabs.com
# Subject: INC^R.EASE YOUR D'I^C,K  WEIGHT               ^    gjanjbzw
# Date: Thu, 15 Apr 2004 00:43:09 +0000
# MIME-Version: 1.0
# Content-Type: text/html;
#         charset="us-ascii"
# Content-Transfer-Encoding: 8bit
# X-CRM114-Status: SPAM  ( pR: -41.5376  )
# X-Scanned-By: MIMEDefang 2.37

# To this:

# Message-ID: <5d2b01c42282$a3d602b0$8bb757f1@embassi.de>
# From: "Reggie Ratliff" <r.ratliffuo@modsim.co.kr>
# To: cwilkins@boinklabs.com, cwilkins-web@boinklabs.com
# Subject: INC^R.EASE YOUR D'I^C,K  WEIGHT               ^    gjanjbzw
# Date: Thu, 15 Apr 2004 00:43:09 +0000
# MIME-Version: 1.0
# Content-Type: text/html;
#         charset="us-ascii"
# Content-Transfer-Encoding: 8bit

# Which means we need to lose:

#	From
#	Return-Path:
#	X-CRM114-Status:
#	X-Scanned-By:

# And of course we have a few special cases:

#	Only clobber the first (local) Received: header
#	Clobber locally generated Message-ID: headers

# Note that the X-CRM114-Status and X-Scanned-By headers were custom for my
# setup.  Yours may not have them and/or may feature other custom generated
# headers that need to be stripped.  Adjust the code below as needed.

# Also, if you are wondering where to find the two versions of a message
# to compare, as in the example above:
#
#	The as-classified version (the shorter one, without local headers)
#	comes from (in my case) /etc/mail/crm114/rejected_by_css.txt  (If
#	you don't have that file somewhere, you need to enable it in the
#	crm114 config file.)
#
#	The as-delivered version (the longer one, with local headers) comes
#	from your inbox, or incoming spam folder.

# Now, getting down to business...

# Set this to match the FQDN of locally generated message IDs.  In other
# words, the FQDN of your inbound mail server.
$localsrv = 'calvin.boinklabs.com';

# loop through the message line by line
$mode = 'head';
$gotrcvd = 0;
while( <STDIN> ){

	# just print to stdout and loop if we're no longer in header mode
	if( ($mode ne 'body' and /^$/) or $mode eq 'body' ){
		$mode = 'body';
		print;
		next;
	}

	# If we are here, we've got headers, or a header continuation to deal
	# with.

	# eat continuations for headers we wish to suppress
	# (a continuation line starts with a space or a tab)
	if( $mode eq 'eat' and /^[ \t]/ ){
		next;
	}

	# Here's where we look for headers to clobber
        if( /^From\s/i
         or /^Return-Path:\s/i
         or /^Message-ID:\s+.*\Q$localsrv\E/i
         or /^X-CRM114-Status:\s/i
         or /^X-Scanned-By:\s/i
         or /^Status:\s/i               # These 3 are added by Mutt
         or /^Content-Length:\s/i
         or /^Lines:\s/i ){
		$mode = 'eat';
		next;
	}

	# Special case - just the topmost (local) received header gets stripped
	# adjust the literal 1 if you need to strip more than one.
	if( /^Received:\s/ and $gotrcvd < 1 ){
		$gotrcvd++;
		$mode = 'eat';
		next;
	}

	# If we got this far, we should print.  So we will!
	print;

	# lastly, turn off 'eat' mode so we don't gobble up needed header lines
	$mode = 'head';
}

exit(0);
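
After adjusting the header list for your own setup, a quick check like the sketch below can confirm that nothing local slips through. It pipes a message from stdin through the strip command and reports any of the always-stripped headers that survive. The function name is made up, and it does not try to judge Received: or Message-ID: lines, since those depend on your server name.

```shell
#!/bin/sh
# Sketch only: check that a strip command removes the local headers.
# crm_strip_check is an illustrative name; the message comes in on stdin.
crm_strip_check() (
    strip=${1:-/usr/local/bin/crmstrip}
    # Collect any local headers left in the header block of the output.
    left=$("$strip" | awk '
        /^$/ { exit }
        /^(From |Return-Path:|X-CRM114-Status:|X-Scanned-By:)/ { print }')
    if [ -n "$left" ]; then
        printf 'still present:\n%s\n' "$left"
        exit 1
    fi
    echo "clean"
)
```

Feed it a message from your inbox, e.g. crm_strip_check < somemessage, and it prints "clean" or lists the leftovers.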

Well that's it for now. Happy spam stomping!

Please direct inquiries to: cwilkins@dtserv.com

All content ©2004 Dauntless Technical Services (except the stuff that isn't mine).