# $Id: README,v 1.1 2004/02/07 18:30:00 vanbaal Exp $

Congratulations!!!  You got this far.  First things first.

     THIS SOFTWARE IS LICENSED UNDER THE GNU GENERAL PUBLIC LICENSE

	  	     IT MAY BE POORLY TESTED.

  	  IT MAY CONTAIN VERY NASTY BUGS OR MISFEATURES.

 		      THERE IS NO WARRANTY.  

		THERE IS NO WARRANTY WHATSOEVER!

  	  A TOTAL, ALMOST KAFKA-ESQUE LACK OF WARRANTY.
	    
  	        Y O U   A R E   W A R N E D   ! ! !

Now that we're clear on that, let's begin.

              ----  What YOU Should Do Now  -----

Contents:

	1) "What Do You Want?"

	2) If you want to write programs...

	3) How to "make" CRM114



1) "What Do You Want?"

 *  If you just want to use CRM114 Mailfiltering, print out the 
   CRM114_Mailfilter HOWTO and read _THAT_.  Really; it will help a LOT.
   The instructions in the HOWTO are much more in-depth and up to
   date than whatever you can glean from here.


2) If you want to write programs, read the introduction file INTRO.txt.
   It'll get you oriented.  

   Remember, this is a weirdass language, you _don't_ understand it
   yet.  (okay, wiseguy, what does a "LIAF" statement do?  :-) )

   Then, print out and read the QUICKREF.txt (quick reference card).
   You'll want this by your side as you write code until you get
   used to the language.


3) CRM114 (as of this writing) does not have a fully functional
   .config file.  There is a beta version, but it doesn't work
   on all systems.

   Until that work is finished, you have a couple of recommended options:
  
	1) run the pre-built binary release, 

    or

        2) use the pre-built makefile to build from sources.


   Here are some useful Makefile targets:

	"make clean"  -- cleans up all of the binaries that you have 
			that may or may not be out of date.  DO NOT
			do a "make clean" if you're using a binary-only
			distribution, as you'll delete your binaries!

	"make all" -- makes all the utilities (both flavors of crm114,
 	                cssutil, cssdiff, cssmerge), leaving them in 
			the local directory.

	"make install" --  as root will build and install CRM114 with
			the TRE REGEX libraries as /usr/bin/crm .

			n.b. There is _no_ "make uninstall" at this point.

	"make install_gnu" -- as root will build and install CRM114 with
			the older GNU REGEX libraries.  This is
			obsolete but still provided for those of us
			with a good sense of paranoid self-preservation.

	"make install_binary_only" -- as root, if you have the binary-only
			tarball, will install the pre-built, statically
			linked CRM114 and utilities.  This is very handy if
			you are installing on a security-through-minimalism
			server that doesn't have a compiler installed.

	"make install_utils" -- will build the css utilities "cssutil", 
			"cssdiff", and "cssmerge". 

			cssutil gives you some insight into
			the state of a .css file, cssdiff lets you
			check the differences between two .css files,
			and cssmerge lets you merge two css files.

	"make cssfiles" - given the files "spamtext.txt" and
			"nonspamtext.txt", builds BRAND NEW spam.css
                        and nonspam.css files.

                Be patient: this can take about 30 seconds per 100 Kbytes
                of input text!  It's also destructive in a sense - repeating
		this command with the same .txt files will make the
		classifier a little "overconfident" (and runs the risk
		of maxing out the feature buckets in the .css files, which
		will definitely _decrease_ your accuracy).  If your .txt
		files are bigger than a megabyte, use the -w option to
		increase the window size to hold the entire input.
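
As a rough back-of-envelope sketch of the "30 seconds per 100 Kbytes"
figure above (this is Python, not part of CRM114; the function name is
made up for illustration, and "100 Kbytes" is assumed to mean 100,000
bytes):

```python
def estimated_training_seconds(n_bytes, seconds_per_100k=30.0):
    """Estimate "make cssfiles" run time from the README's rule of thumb."""
    # Assumption: "100 Kbytes" means 100,000 bytes.  Real speed depends
    # on your machine, so treat the result as an order of magnitude.
    return n_bytes / 100_000 * seconds_per_100k


if __name__ == "__main__":
    # Estimate for a 1,000,000-byte spamtext.txt.
    print(estimated_training_seconds(1_000_000))
```

By this estimate a one-megabyte .txt file takes about five minutes.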

*** Utilities for looking into .css files.

This release also contains the cssutil utility to take a look at 
and manage .css spectral files used in the mailfilter.

Section 8 of the CRM114_Mailfilter_HOWTO tells how to use these
utilities; you _should_ read that if you are going to use the 
CLASSIFY function in your own programs.



*** How to configure the mailfilter.crm mail filter:

   The instructions given here are just a synopsis; refer to the CRM114
   Mailfilter HOWTO, included in your distribution kit.

   You will need to edit mailfilter.cf , and perhaps a few other
   files.  The edits are quite simple, usually just inserting a username,
   a password, or choosing one of several given options.
	


***  The actual filtering pipeline:

 - If you have requested a safety copy file of all incoming mail, the
   safety copy is made.

 - An in-memory copy of the incoming mail is made; all mutilations
   below are performed on this copy (so you don't get a ravaged
   tattered sham of email, you get the real thing)

 - If you have specified BASE64 expansion (default ON), any base64 attachments
   are decoded.

 - If you have specified undo-interruptus, then HTML comments are
   removed.

 - The rewrites specified in "rewrites.mfp" get applied.  These
   are strictly "from>->to" rewrites, so that your mail headers 
   will look exactly like the "canonical" mail headers that were
   used when the distribution .css files were built.  If you build
   your own .css files from scratch, you can ignore this.  

 - Filtration itself starts with the file "priolist.mfp" .  Column 1
   is a '+' or '-' and indicates if the regex (which starts in column 2)
   should force 'accept' or 'reject' the email.

 - Whitelisting happens next, with "whitelist.mfp" .  No need for a + or
   a - here; every regex is on its own line and all are whitelisting.

 - Blacklisting happens next, with "blacklist.mfp" .  No need for + or -
   here either- if the regex matches, the mail is blacklisted.

 - Failing _that_, the sparse binary polynomial hash with bayesian
   chain rule (SBPH/BCR) matching system kicks in, and tries to figure out
   whether the mail is good or not.  SBPH/BCR matching can occasionally
   make mistakes, since it's statistical in nature.

 - The mailfilter can be remotely commanded.  Commands start in 
   column 1 and go like this (yes, command is just that- the letters
   c o m m a n d, right at the start of the line).  You mail a message
   with the word command, the command password, and then a command word
   with arguments, and the mailfilter does what you told it.

   command yourmailfilterpassword whitelist addr-or-string
	- auto-accepts mail containing the whitelist string.

   command yourmailfilterpassword blacklist addr-or-string
	- auto-rejects mail containing the blacklist string.

   command yourmailfilterpassword spam
	- "learns" all the text following this command line as spam, and will
	   reject anything it gets that is "like" it.  It doesn't
	   "learn" from anything above this command, so your headers
	   (and any incoming headers) above the command are not considered
	   part of the text learned.  It's up to your judgement what part
	   of that text you want to use or not.

   command yourmailfilterpassword nonspam
	- "learns" all the text following this line as NOT spam, and will
           accept any mail that it gets that is "like" it.  Like
	   learning spam, it excludes anything above it in the file
	   from learning.
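
  The order of the matching stages above (priolist, then whitelist,
  then blacklist, then the statistical classifier) can be sketched
  roughly as follows.  This is an illustrative Python sketch, not the
  code in mailfilter.crm: every function name here is hypothetical,
  the statistical stage is a stand-in callback, and matching priolist
  regexes one line at a time is an assumption.

```python
import re


def load_priolist(lines):
    """Parse priolist-style lines: column 1 is '+' or '-', regex starts in column 2."""
    rules = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line[0] not in "+-":
            continue  # assumption: blank or unmarked lines are skipped
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def load_patterns(lines):
    """Parse whitelist/blacklist-style lines: one regex per line, no marker."""
    return [re.compile(line.rstrip("\n")) for line in lines if line.strip()]


def matches_a_line(patterns, text):
    """True if any pattern matches within a single line of the text."""
    return any(p.search(line) for line in text.splitlines() for p in patterns)


def decide(text, priolist, whitelist, blacklist, statistical):
    """Apply the stages in the order described above; the first match wins."""
    for accept, regex in priolist:
        if matches_a_line([regex], text):
            return "accept" if accept else "reject"
    if matches_a_line(whitelist, text):
        return "accept"
    if matches_a_line(blacklist, text):
        return "reject"
    # Nothing matched: fall through to the statistical classifier (a stand-in here).
    return statistical(text)


if __name__ == "__main__":
    prio = load_priolist(["+^From: boss@example\\.com", "-^Subject: .*free money"])
    white = load_patterns(["friend@example\\.org"])
    black = load_patterns(["spammer@example\\.net"])
    mail = "From: friend@example.org\nSubject: lunch?"
    print(decide(mail, prio, white, black, lambda text: "unsure"))
```

  The point of the ordering is that a hand-written rule always beats
  the statistics: the classifier only sees mail that no list matched.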

  The five included files (priolist.mfp, whitelist.mfp, blacklist.mfp,
  spam.css and nonspam.css) are meant mostly as examples.

	- rewrites.mfp is a set of rewrites to be applied to the
		incoming mail to put it in "canonical" form.  
		You don't _need_ to edit this file to match your
		local system names, but your out-of-the-box
		accuracy will be improved greatly if you do.

	- priolist.mfp is a set of very specific regexes, prefixed by +
		or -.  These are done first, as highest priority.

	- whitelist.mfp is a set of mailfilter patterns that are "good".
		No line-spans allowed; the pattern must match on one line.

	- blacklist.mfp is a set of mailfilter patterns that are "bad".
		Likewise, line-spanning is not allowed (by default).  Entries in
		this file are all people who spam me so much I started to
		recognize their addresses... so I've black-holed them.
		If you like them, you might want to unblackhole them.

	- spam.css and nonspam.css:  These are large files and as of
		2003-09-20, are included only in the .css kits.  CRM
	        .css files are "Sparse Spectra" files and they
		contain "fingerprints" of phrases commonly seen in
		spam and nonspam mail.  The "fingerprint pipeline" is 
		currently configured at five words, so a little spam
		matches a whole lot of other spam.  It is difficult but
		not impossible to reverse-engineer the spam and nonspam
		phrases in these two files if you really want to know.

		To understand the sparse spectrum algorithm, read the
		source code (or the file "classify_details.txt"); 
		the basic principle is that each word is
		hashed, words are conglomerated into phrases, and 
		the hash values of these phrases are stored in the 
		css file.  Matching a hash means a word or phrase under
		consideration is "similar to" a message that has been
		previously hashed.  It's usually quite accurate, though
		not infallible.
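
The basic principle described above (hash words, combine them into
phrases, store the phrase hashes) can be sketched roughly as follows.
This is an illustrative Python sketch and NOT the CRM114 algorithm or
its .css file format: zlib.crc32 stands in for the real hash function,
a Python set stands in for the .css file, and the "<skip>" placeholder
and all function names are assumptions made up for this example.  The
five-word window follows the "fingerprint pipeline" size mentioned
above.

```python
import zlib

WINDOW = 5          # five-word "fingerprint pipeline", per the text above
SKIP = "<skip>"     # hypothetical placeholder for a word left out of a phrase


def phrase_hashes(text, window=WINDOW):
    """Hash every sparse phrase that ends at each word of the text."""
    words = text.split()
    hashes = []
    for i, newest in enumerate(words):
        older = words[max(0, i - window + 1):i]
        # Pad at the front so every window has window-1 older slots.
        older = [SKIP] * (window - 1 - len(older)) + older
        # Each bit of the mask keeps or skips one of the older words.
        for mask in range(1 << (window - 1)):
            phrase = [w if mask & (1 << k) else SKIP for k, w in enumerate(older)]
            phrase.append(newest)
            hashes.append(zlib.crc32(" ".join(phrase).encode("utf-8")))
    return hashes


def learn(store, text):
    """Add the phrase hashes of a known message to the store (a set)."""
    store.update(phrase_hashes(text))


def overlap(store, text):
    """Fraction of a message's phrase hashes already present in the store."""
    hashes = phrase_hashes(text)
    if not hashes:
        return 0.0
    return sum(h in store for h in hashes) / len(hashes)


if __name__ == "__main__":
    spam = set()
    learn(spam, "buy cheap pills now today")
    print(overlap(spam, "get cheap pills now please"))
```

Because phrases with skipped words are stored too, a new message that
shares only some words with learned text still matches part of the
store, which is why a little spam matches a whole lot of other spam.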

The filter also keeps three logs: one is "alltext.txt", containing a
complete transcript of all incoming mail, the others are spamtext.txt
and nonspamtext.txt; these contain all of the text learned as spam
and as nonspam, respectively (quite handy if you ever migrate between
versions, let me assure you).

Some users have asked why I don't distribute my learning text, just
the derivative .css files: it's because I don't own the copyright on
them!  They're all real mail messages, and the sender (whoever that
is) owns the copyright, not me (the recipient).  So, I can't publish
them.  But never fear, if you don't trust my .css files to be clean,
you can build your own with just a few days' spam and nonspam traffic.
Your .css files will be slightly different than mine, but they will
_precisely_ match your incoming message profile, and probably be 
more accurate for you too.

A few words on accuracy: there is no warranty- but I'm seeing typical
accuracies > 99% with only 12 hours worth of incoming mail as example
text.  With the old (weak, buggy, only 4 terms) polynomials, I got a 
best case of 99.87% accuracy over a one-week timespan.  I now see 
quality averaging > 99.9% accuracy (that is, in a week of ~ 3000 messages,
I will have 1 or 2 errors, usually none of them significant).

Of course, this is tuned to MY spam and non-spam email mixes; your
mileage will almost certainly be lower until you teach the system what
your mail stream looks like.
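
To check the accuracy figures above against your own mail stream, the
arithmetic is simply correct messages divided by total messages.  A
minimal Python sketch (the function name is made up for illustration):

```python
def accuracy_percent(messages, errors):
    """Percentage of messages that were classified correctly."""
    return 100.0 * (messages - errors) / messages


if __name__ == "__main__":
    # The week quoted above: about 3000 messages with 1 or 2 errors.
    for errors in (1, 2):
        print(errors, round(accuracy_percent(3000, errors), 2))
```

Both cases come out above 99.9%, consistent with the figure quoted.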


