	
	How CRM114's LEARN and CLASSIFY really work.

This document describes the internal workings of the CRM114 LEARN
and CLASSIFY functions.  You do _not_ need to know this to use CRM114
effectively; this is to satisfy the curiosity of those who really
want deep knowledge of the tools they use.

The general concept is this: break the incoming text into short
phrases of one to five words each.  A phrase can have words in
the middle of the phrase skipped (e.g. "BUY <skip_word> ONLINE NOW!!!"
is always a bad sign), and more than one phrase can use the same
word.  You can't change the order of the words, but you _can_ bridge
across newlines, punctuation, etc.  Make all the phrases you can make.

For each phrase you can make, keep track of how many times you 
see that phrase in both the spam and nonspam categories.  When you 
need to classify some text, make the phrases, and count up how many times
all of the phrases appear in the two different categories.  The 
category with the most phrase matches wins.

Note that you never have to cross-reference between the two category
phrase sets.  If a phrase appears in both categories an equal number
of times, then both categories get an equal score boost.  Since 
an equal score boost doesn't change which category will win, there's
no need to cross-reference category phrase counts.  

NB: This process is called "sparse binary polynomial hashing" because
it uses a set of polynomials to generate a hash-of-hashes; sparse because not
all words are represented by nonzero terms, binary because the
changing coefficient terms are always either 0 or 1, and a hash
because, well, it's a hash.  :)

(note: As of Nov 1, 2002, this has changed - and changed again
in November of 2003, see further below).  

Instead of simply comparing raw count scores, we now do a Bayesian
chain-rule calculation of the probability of "good" versus "evil".  The
Bayesian chain-rule formula is

	                      P(A|S) P(S)
	    P (S|A) =   -------------------------
	               P(A|S) P(S) + P(A|NS) P(NS)

which (in words) says: "The NEW chance of spam, given some feature A,
is equal to the chance of A given spam times the OLD chance that it's
spam, divided by the sum of the chance of A given spam times the old
chance it's spam plus the chance of A given nonspam times the old
chance it's nonspam".
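As a minimal sketch (Python, illustrative only), one chain-rule update
looks like:

```python
def chain_rule_update(p_s, p_a_given_s, p_a_given_ns):
    """One Bayesian chain-rule step: returns the new P(S), the
    probability of spam after seeing feature A.  P(NS) is 1 - P(S)."""
    p_ns = 1.0 - p_s
    numerator = p_a_given_s * p_s
    return numerator / (numerator + p_a_given_ns * p_ns)
```

Starting from a 50/50 prior, a feature twice as likely in spam
(say P(A|S) = 0.2 and P(A|NS) = 0.1) moves P(S) up to 2/3.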


We start assuming that the chance of spam is 50/50.

We count up the total number of features in the "good" versus "evil"
feature .css files.  We use these counts to normalize the chances of
good versus evil features, so if your training sets are mostly "good",
it doesn't predispose the filter to think that everything is good.

We repeatedly form a feature with the polynomials, check the .css files
to see what the counts of that feature are for spam and nonspam, and
use the counts to calculate P(A|S) and P(A|NS) [remember, we correct
for the fact that we may have different total counts in the spam and
nonspam categories].

We also bound P(A|S) and P(A|NS) to prevent any 0.0 or 1.0
probabilities from saturating the system.  If you allow even _one_ 0.0
or 1.0 into the chain rule, there's no way for the system to recover
even in the face of overwhelming evidence to the contrary.  The
actual bound in use depends on the total number of counts of the
feature A ever encountered, irrespective of their good/evil nature.

[additional note: versions from 20030630 to 20031200 used a 
fairly gentle method to generate the local probabilities from
the relative hit counts.  From 20031200 onward, this local probability
was modified by the number and sequence of the terms of the
polynomial.  The best model found so far is a set of coefficients that
model a Markov chain; polynomials that have a longer chain length
(and therefore a closer match) get a significantly higher boost.]

Once we have P(A|S) and P(A|NS), we can calculate the new P(S) and
P(NS).  Then we get the next feature out of the polynomial hash 
pipeline (each extra word makes 16 features) and repeat until we hit
the end of the text.  Whichever set has the greater probability wins.

We also take multiple files AS A GROUP, so it's as though we added
the corresponding hash buckets together for everything on the left
of the | and everything on the right.

-----


Now, on to the brutish details.

In terms of the actual implementation, LEARN and CLASSIFY are
pipelined operations.  The pipeline has these stages (as of the
2002-10-21 version) :
	
1) Tokenization.  The input text is tokenized with the supplied regex
   (usually [[:graph:]]+ ) into a series of disjoint word tokens.

2) Each word token is hashed separately.  The hash used is a "fast hash", 
   not particularly secure, but with reasonably good statistics.

3) Each hash is pushed into the end of a five-stage pipeline.  Each
   value previously pushed moves down one level in the pipeline.

4) The pipeline stages are tapped to supply values H0 through H4 that
   will be multiplied by the particular polynomial's coefficients. (H4
   being the newest value).

5) After each value is pushed into the hash pipeline, the full set of
   polynomials are evaluated.  These polynomials have changed over
   various releases, but as of 2002-10-23 the coefficients are:

   poly# \ for:  H4     H3     H2     H1     H0
    1             0      0      0      0      1
    2             0      0      0      3      1
    3             0      0      5      0      1
    4             0      0      5      3      1
    5             0      9      0      0      1
    6             0      9      0      3      1
    7             0      9      5      0      1
    8             0      9      5      3      1
    9            17      0      0      0      1
   10            17      0      0      3      1
   11            17      0      5      0      1
   12            17      0      5      3      1
   13            17      9      0      0      1
   14            17      9      0      3      1
   15            17      9      5      0      1
   16            17      9      5      3      1

  (yes, it's like counting in binary, but the low-order coefficient is
  always turned on so that the low-order bits of each polynomial result
  are always affected by all nonzero elements of the hash pipeline.
  "Skipped" words have a coefficient of zero, which zeroes their effect
  on the output of that polynomial, thereby "skipping" the word.)
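In Python (a sketch of the table above, not the actual C code),
evaluating all 16 polynomials on a full pipeline looks like:

```python
def superhashes(pipeline):
    """Evaluate the 16 polynomial combinations over a full 5-stage
    hash pipeline [H0, H1, H2, H3, H4] (H4 newest), using the
    coefficient table above; results are truncated to 32 bits."""
    h0, h1, h2, h3, h4 = pipeline
    results = []
    for c4 in (0, 17):            # H4 coefficient
        for c3 in (0, 9):         # H3 coefficient
            for c2 in (0, 5):     # H2 coefficient
                for c1 in (0, 3): # H1 coefficient; H0 is always 1
                    results.append(
                        (c4 * h4 + c3 * h3 + c2 * h2 + c1 * h1 + h0)
                        & 0xFFFFFFFF)
    return results
```

The loop nesting reproduces the table's order: result 1 uses only H0,
result 16 uses all five taps.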

6) These 16 results (call them "superhashes") reflect all phrases up to
   length 5 found in the input text.  Each is 32 bits long.

7) Each of the .css files is mmapped into virtual memory.  The default
   size of a .css file is one megabyte plus one byte, and each byte of
   a .css file is used as a single 8-bit unsigned integer.  Using the
   length of the .css file as a modulus, each superhash value maps
   into a particular byte of the .css file.  Each .css file also has a
   "score", initialized to zero.

8) if we're LEARNing, we increment the byte at that superhash index in
   the .css file (being careful not to overflow the 8-bit limit, so
   the maximum value is actually 255).
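A sketch of that LEARN step (Python, with a small bytearray standing
in for the mmapped .css file):

```python
def learn(css, superhash):
    """Bump the bucket the superhash maps to (index is the superhash
    modulo the file length), saturating at the 8-bit maximum of 255
    rather than wrapping around to 0."""
    i = superhash % len(css)
    if css[i] < 255:
        css[i] += 1
```

The real file is one megabyte plus one byte; any length works for the
modulus, though the actual sizes are chosen for good hash spread.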

9) (pre-Nov-2002 versions): if we're CLASSIFYing, we increment the
   per-.css-file score of that .css file by the number found in that
   superhash-indexed byte.

   (post-Oct-2002 versions): if we're CLASSIFYing, instead of just
   incrementing the per-.css-file scores, we (a) normalize the
   relative proportions of the .css files with respect to the total
   number of features in each .css file, (b) convert the bin values
   indexed by the superhash to a probability, (c) "clip" the
   probability values to reasonable values (there is no such thing as
   "certainty" with a finite sample of an infinite and nonstationary
   source such as human language), and (d) update the running
   probability using the Bayesian chain rule formula above.

10) repeat the above pipeline steps for each "word" in the text.

11) The .css file with the larger score (or probability) at the end
    "wins".


There you have it.  Previous polynomial sets (using only H0 through H3
of the hash pipeline, with prime-number coefficients) have reached over
99.87% accuracy.  My (unproven) suspicion is that the five-stage
pipeline can do even better.

n.b. slight error in edge effects - right now, we don't execute the
pipeline polynomial set until the pipeline is full, and conversely we
stop executing the polynomial set when we run out of tokens.  This means
that we don't give the first and last few tokens of the email the full
treatment; that's a bug that should be rectified.  The other side of the
problem is that filling and flushing the pipe gives worse results,
by putting too much emphasis on the "zero hash" and on the first and
last few words.


---More details on the post-Nov-2002 release:---

In releases after Nov 1 2002, instead of just comparing counts, we do
the true Bayesian chain rule to calculate the probabilities of pass
versus fail.  The bounding limits are first to bound within

   [ 1/(featurecount+2) , 1 - 1/(featurecount+2) ]

and then to add further uncertainty to that bound by a factor of
1/(featurecount+1).
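A sketch of that first bounding step (Python, illustrative only; the
additional 1/(featurecount+1) uncertainty blend is left out here):

```python
def clip_local_probability(p, featurecount):
    """Clip a raw local probability into the interval
    [1/(featurecount+2), 1 - 1/(featurecount+2)], so that no single
    feature can ever assert certainty.  The more often a feature has
    been seen, the tighter toward 0.0 or 1.0 it is allowed to go."""
    bound = 1.0 / (featurecount + 2)
    return max(bound, min(1.0 - bound, p))
```

For a feature seen 8 times, the probability is confined to [0.1, 0.9];
a feature seen 998 times may range over [0.001, 0.999].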

We do the chain rule calculation and then we clip the minimum
probability to MINDOUBLE, which is host specific but is a VERY small
number (on the order of 10^-300 for Intel boxes).  This further
prevents getting the chain rule stuck in a 0.0 / 1.0 state, from which
there is no recovery.

Lastly, because of underflow issues, we quickly lose significance in
the greater of the two probabilities.  For example, 1.0 - (10^-30) is
exactly equal to 1.00000; yet 10^-30 is easily achievable in the
first ten lines of text.  Therefore, we calculate the chain-rule
probabilities twice, using P(S) and P(NS) separately, and then use the
smaller one to recompute the larger one.  Thus, even if there's 
arithmetic underflow in computing the larger probability, we still 
retain the full information in the smaller probability.
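Putting those pieces together, a toy classify loop that tracks both
running probabilities might look like this (Python, illustrative;
1e-300 stands in for the host-specific MINDOUBLE):

```python
def classify(prior_s, features):
    """Run the chain rule over a sequence of (P(A|S), P(A|NS)) pairs,
    keeping P(S) and P(NS) as separate running values.  The side that
    shrinks toward zero keeps full floating-point precision even when
    the other side has rounded to exactly 1.0."""
    p_s, p_ns = prior_s, 1.0 - prior_s
    for p_a_s, p_a_ns in features:
        denom = p_a_s * p_s + p_a_ns * p_ns
        p_s, p_ns = (p_a_s * p_s) / denom, (p_a_ns * p_ns) / denom
        p_s = max(p_s, 1e-300)   # clip so we never get stuck at 0.0
        p_ns = max(p_ns, 1e-300)
    return p_s, p_ns
```

Feeding in 200 spam-flavored features drives P(NS) down toward 10^-190
while P(S) rounds to 1.0; the small side still carries the real
information.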





---  Yet More Details - for Post-200310xx Versions ----

During the summer and fall of 2003, I continued experimenting with 
improvements to SBPH/BCR as described above.  It became clear that
SBPH/BCR was _very_ good, but that it was still operating within the
limits of a linear classifier without hidden layers - that is, it was 
a perceptron (with all of the limitations that perceptron-based
classifiers have).

Luckily, the databases in CRM114 are more than adequate to support
a higher-level model than a simple linear perceptron classifier.
I tested a 5th order Markovian classifier, and found that it was
superior to any other classifier I had tried.

A Markovian classifier operates on the concept that _patterns_ of
words are far more important than individual words.  

For example, a Bayesian classifier encountering the phrase "the quick
brown fox jumped" would have five features: "the", "quick", "brown",
"fox", and "jumped".

A Sparse Binary Polynomial Hasher would have sixteen features:

 the
 the quick
 the <skip> brown
 the quick brown
 the <skip> <skip> fox
 the quick <skip> fox
 the <skip> brown fox
 the quick brown fox

... and so on.  But each of these features would receive the same
weighting in the Bayesian chain rule above.

The change to become a Markovian is simple: instead of giving each
Sparse Binary Polynomial Hash (SBPH) feature a weight of 1, give each 
feature a weight corresponding to how long a Markov chain it matches
in either of the archetype texts.

A simple way to do this would be to make the weight equal to the number
of words matched - in this case the weights would be:

 the				1
 the quick			2
 the <skip> brown		2
 the quick brown		3
 the <skip> <skip> fox		2
 the quick <skip> fox		3
 the <skip> brown fox		3
 the quick brown fox		4

and indeed, this gives some improvement over standard SBPH.
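In Python, that word-count weighting is simply (an illustrative
sketch):

```python
def chain_length_weight(feature):
    """Weight a feature by the number of real (non-<skip>) words it
    matches - the simple linear Markov-chain-length weighting."""
    return sum(1 for w in feature.split() if w != "<skip>")
```

Applied to the eight features listed above, this reproduces the
weights 1, 2, 2, 3, 2, 3, 3, 4.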

But there is room for further improvement.  The filter as stated above
is still a linear filter; it cannot learn (or even express!) anything
of the form:

	"A" or "B" but not both

This is a basic limit discovered by Minsky and Papert in 1969 and
published in _Perceptrons_.

In this particular case there is a convenient way to work around this
problem.  The solution is to make the weights of the terms
"superincreasing", such that long Markov chain features have so high a
weight that shorter chains are completely overruled.

For example, if we wanted to do "A or B but not both" in such a
superincreasing filter, the weights:

	"A" at 1
	"B" at 1
	"A B" at -4

will give the desired results.
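A quick check of those weights (Python; the score function here is a
hypothetical stand-in for summing feature weights over whatever
features a text fires):

```python
# Weights chosen so "A or B but not both" scores positive exactly
# when one of the two appears (a toy illustration, not CRM114 code).
weights = {"A": 1, "B": 1, "A B": -4}

def score(features):
    """Sum the weights of the features a text fires."""
    return sum(weights.get(f, 0) for f in features)
```

"A" alone or "B" alone scores +1, but a text containing both also
fires the "A B" chain feature, for a total of 1 + 1 - 4 = -2.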

For convenience in calculation, CRM114 uses the superincreasing
weights defined by the series 2^(2n) - that is, 

 the				1
 the quick			4
 the <skip> brown		4
 the quick brown		16
 the <skip> <skip> fox		4
 the quick <skip> fox		16
 the <skip> brown fox		16
 the quick brown fox		64

Note that with these weights, a chain of length N can override
all chains of length N-1, N-2, N-3... and so on.
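A sketch of that weighting (Python; here n counts the real words a
feature matches, so the weight is 4^(n-1), i.e. the 2^(2n) series
above starting from 1):

```python
def markov_weight(feature):
    """Superincreasing Markovian weight: 4^(n-1), where n is the
    number of real (non-<skip>) words the feature matches, giving
    1, 4, 16, 64 for chains of length 1 through 4."""
    n = sum(1 for w in feature.split() if w != "<skip>")
    return 1 << (2 * (n - 1))
```

For the eight "the quick brown fox" features above, the full-length
chain (weight 64) outweighs the other seven combined
(1 + 4 + 4 + 16 + 4 + 16 + 16 = 61), which is exactly what lets a
chain of length N override all shorter chains.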

This is particularly satisfying, because the standard .css files
already contain all of the information needed to do this more advanced
calculation.  The file format is not only compatible, it is _identical_
and so users don't have to re-start their training.

This Markovian matching gives a considerable increase in accuracy
over SBPH matching, and almost a factor of 2 improvement over Bayesian
matching.  It is now the default matching system in CRM114 as of 
version 200310xx.

    -Bill Yerazunis