crm114(1)                           CRM114                           crm114(1)



  NAME
      crm114 - The Controllable Regex Mutilator

  SYNOPSIS
      crm [flags] crmfile

      -d N       enter debugger after running N cycles; omitting N means N
                 equals 0
      -e         do not import any environment variables
      -h         print help text
      -p         generate an execution-time-spent profile on exit
      -P N       max program lines
      -q m       mathmode (0,1 = alg/RPN only in EVAL, 2,3 = alg/RPN
                 everywhere)
      -s N       new feature file (.css) size is N (default 1 meg+1
                 featureslots)
      -S N       new feature file (.css) size is N rounded to 2^I+1
                 featureslots
      -t         user trace output
      -T         implementors trace output (only for the masochistic!)
      -u dir     chdir to directory dir before starting execution
      -v         print CRM114 version identification and exit
      -w N       max data window (bytes, default 16 megs)
      --         signals the end of CRM114 flags; prior flags are not seen
                 by the user program; subsequent args are not processed by
                 CRM114
      --foo      creates the user variable :foo: with the value SET
      --x=y      creates the user variable :x: with the value y
      -{ stmts}  execute the statements inside the {} brackets
      crmfile    .crm file name

  DESCRIPTION
      CRM114  is a language designed to write filters in. It caters to filter-
      ing email, system log streams, html, and other marginally human-readable
      ASCII that may occasion to grace your computer.

      CRM114's unique strengths are its data structure (everything is a string
      and a string can overlap another string), its ability to work on truly
      infinitely long input streams, its ability to use extremely advanced
      classifiers to sort text, and its ability to do approximate regular
      expressions (that is, regexes that don't quite match) via the TRE regex
      library.

      CRM114 also sports a very powerful subprocess control facility, and a
      unique syntax and program structure that puts the fun back in
      programming (OK, you can run away screaming now). The syntax is
      declensional rather than positional; the type of quote marks around an
      argument determines what that argument will be used for.

      The typical CRM114 program uses regex operations more often  than  addi-
      tion  (in  fact,  math was only added to TRE in the waning days of 2003,
      well after CRM114 had been in daily use for over a year and a half).

      In other words, crm114 is a very very  powerful  mutagenic  filter  that
      happens to be a programming language as well.

      The  filtering  style  of the CRM-114 discriminator is based on the fact
      that most spam, normal log file messages, or other uninteresting data is
      easily categorized by a few characteristic patterns (such as "Mortgage
      leads", "advertise on the internet", and "mail-order toner cartridges").
      CRM114  may  also  be  useful  to folks who are on multiple interlocking
      mailing lists.

      In a bow to Unix-style flexibility, by default CRM114 reads its input
      from standard input, and by default sends its output to standard
      output. Note that the default action has a zero-length output.
      Redirection and use of other input or output files is possible, as well
      as the use of windowing, either delimiter-based or time-based, for
      real-time continuous applications.

      CRM114 can be used for more than mail filtering; consider it to be a
      version of grep with super powers. If Perl is a seventy-bladed Swiss
      Army knife, CRM114 is a razor-sharp katana that can talk.

  INVOCATION
      Absent the -{ program } flag, the first argument is taken to be the name
      of a file containing a crm114 program; subsequent arguments are merely
      supplied as :_argN: values. Use single quotes around command-line
      programs '-{ like this }' to prevent the shell from doing odd things to
      them.

      CRM114  can  be  directly invoked by the shell if the first line of your
      program file uses the shell standard, as in:

      #! /usr/bin/crm

      You can use CRM114 flags on the shell-standard invocation line, and hide
      them  with  '--' from the program itself; '--' incidentally prevents the
      invoking user from changing any CRM114 invocation flags.

      Flags should be located after any positional variables on the command
      line. Flags are visible as :_argN: variables, so you can create your own
      flags for your own programs (separate CRM114 and user flags with '--').
      Two examples of how to do this:

      ./foo.crm bar mugga < baz  -t -w 150000

      ./foo.crm -t -w 1500000 -- bar < baz mugga

      One example of how not to do this:

      ./foo.crm -t -w 150000 bar < baz mugga

      (That's WRONG!)

      You can put a list of user-settable vars on the #!/usr/bin/crm
      invocation line. CRM114 will print these out when a program is invoked
      directly (e.g. "./myprog.crm -h", not "crm myprog.crm -h") with the -h
      (for help) flag. (Note that this works ONLY with bash on Linux; the
      *BSDs have a different bash interpretation, so it doesn't work there.)

      Example:

      #!/usr/bin/crm  -( var1 var2=A var2=B var2=C )

      This allows only var1 and var2 to be set on the command line. If a
      variable is not assigned a value, the user can set any value desired. If
      the variable is equated to a set of values, those are the only values
      allowed.

      Another example:

      #!/usr/bin/crm  -( var1 var2=foo )  --

      This allows var1 to be set to any value; var2 may only be set to foo or
      left unset; and no other variables may be set, nor may invocation flags
      be changed (because of the trailing "--"). Since "--" also blocks '-h'
      for help, such programs should provide their own help facility.
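
      The allow-list rules above can be modeled in a few lines of Python. This
      is only a sketch of the behavior as described here, not CRM114's own
      code; the function names parse_allow_list and is_permitted are
      hypothetical and exist only for this illustration.

```python
def parse_allow_list(spec):
    """Turn 'var1 var2=A var2=B' into {'var1': None, 'var2': {'A', 'B'}}.

    A value of None means "any value is acceptable" (assumption: this is
    how an entry without '=' is modeled in this sketch).
    """
    allowed = {}
    for item in spec.split():
        name, sep, value = item.partition("=")
        if not sep:
            allowed[name] = None              # unrestricted variable
        else:
            values = allowed.setdefault(name, set())
            if values is not None:            # keep an unrestricted entry as-is
                values.add(value)
    return allowed


def is_permitted(allowed, name, value):
    """Return True if the user may set :name: to value."""
    if name not in allowed:
        return False                          # variable is not on the list
    values = allowed[name]
    return values is None or value in values


allowed = parse_allow_list("var1 var2=A var2=B var2=C")
print(is_permitted(allowed, "var1", "anything"))   # True
print(is_permitted(allowed, "var2", "B"))          # True
print(is_permitted(allowed, "var2", "D"))          # False
print(is_permitted(allowed, "var3", "x"))          # False
```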

  VARIABLES
      Variable  names  and  locations  start with a : , end with a : , and may
      contain only characters that have ink (i.e. the  [:graph:]  class)  with
      few exceptions.

      Examples: :here:, :ThErE:, :every-where_0123+45%6789:,
      :this_is_a_very_very_long_var_name_that_does_not_tell_us_much:.

      Builtin variables:

      :_nl:                newline
      :_ht:                horizontal tab
      :_bs:                backspace
      :_sl:                a slash
      :_sc:                a semicolon
      :_arg0: thru :_argN: command-line args, including all flags
      :_argc:              how many command line arguments there were
      :_pos0: thru :_posN: positional args ('-' or '--' args deleted)
      :_posc:              how many positional arguments there were
      :_pos_str:           all positional arguments concatenated
      :_env_whatever:      environment value 'whatever'
      :_env_string:        all environmental arguments concatenated
      :_crm_version:       the version of the CRM system
      :_dw:                the current data window contents

  VARIABLE EXPANSION
      Variables  are  expanded by the :*: var-expansion operator, e.g. :*:_nl:
      expands to a newline character. Uninitialized  vars  evaluate  to  their
      text name (and the colons stay).

      You can also use the standard C '\' escape constants, such as "\n" for
      newline, as well as escaped hexadecimal and octal characters like \xHH
      and \oOOO, but these are constants, not variables, and cannot be
      redefined.

      Depending on the value of "math mode" (flag -q), you can also use
      :#:string_or_var: to get the length of a string, and :@:string_or_var:
      to do basic mathematics and inequality testing, either only in EVALs or
      in all var-expanded expressions. See "Sequence of Evaluation" below for
      more details.
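
      As a rough illustration of the :*: operator, the Python sketch below
      substitutes variable references in a string and leaves an unset
      variable as its own name, colons included. It is a model of the
      behavior described above and nothing more: the pattern used for
      variable names (any run of non-space, non-colon characters) and the
      function name expand are assumptions made for this example.

```python
import re

# Assumed name syntax for this model: ':*:' + non-space, non-colon chars + ':'
VAR_REF = re.compile(r":\*:([^:\s]+):")


def expand(text, variables):
    """Replace each :*:name: reference with the value of :name:."""
    def substitute(match):
        name = ":" + match.group(1) + ":"
        # An unset variable expands to its own text name, colons and all.
        return variables.get(name, name)
    return VAR_REF.sub(substitute, text)


variables = {":_nl:": "\n", ":who:": "world"}
print(repr(expand("hello :*:who::*:_nl:", variables)))  # 'hello world\n'
print(expand("value is :*:unset:", variables))          # value is :unset:
```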

  PROGRAM BEHAVIOR
      Default behavior is to read all of standard input till EOF into the
      default data window (named :_dw:), then execute the program (this is
      overridden if the first executable statement is a WINDOW statement).

      Variables don't get their own storage unless you ISOLATE them (see
      below); instead, variables are start/length pairs indexing into the
      default data window. Thus, ALTERing an unISOLATEd variable changes the
      value of the default data buffer itself. This is a great power, so use
      it only for good, and never for evil.
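
      The start/length idea can be pictured with a small Python model. This
      is a sketch under stated assumptions, not a description of CRM114's
      internals: the class name DataWindow and its methods are hypothetical,
      and the handling of partially overlapping variables is deliberately
      left out because this page does not specify it.

```python
class DataWindow:
    """Toy model: one text buffer, variables stored as (start, length)."""

    def __init__(self, text):
        self.buf = text
        self.vars = {}                      # name -> (start, length)

    def bind(self, name, start, length):
        self.vars[name] = (start, length)

    def value(self, name):
        start, length = self.vars[name]
        return self.buf[start:start + length]

    def alter(self, name, new_text):
        """Rewrite the buffer in place and re-index the other variables."""
        start, length = self.vars[name]
        end = start + length
        delta = len(new_text) - length
        self.buf = self.buf[:start] + new_text + self.buf[end:]
        for other, (s, l) in list(self.vars.items()):
            if other == name:
                self.vars[other] = (start, len(new_text))
            elif s >= end:
                self.vars[other] = (s + delta, l)    # lies after: shift
            elif s <= start and s + l >= end:
                self.vars[other] = (s, l + delta)    # encloses: resize
            # partial overlaps are left untouched in this sketch


dw = DataWindow("Subject: hello world")
dw.bind(":_dw:", 0, 20)
dw.bind(":word:", 9, 5)      # "hello"
dw.bind(":tail:", 15, 5)     # "world"
dw.alter(":word:", "goodbye")
print(dw.value(":_dw:"))     # Subject: goodbye world
print(dw.value(":tail:"))    # world
```

      Note how altering :word: changed the text seen through :_dw:, which is
      the effect the paragraph above warns about.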

  STATEMENTS AND STUFF
      Statements are separated with a ';' or with a newline.

      \
              '\' is the string-text escape character. You only need to escape
              the  literal  representation  of  closing delimiters inside var-
              expanded arguments.

              You can use the classic C/C++ \-escapes, such as \n, \r, \t, \a,
              \b,  \v, \f, \0, and also \xHH and \oOOO for hex and octal char-
              acters, respectively.

              A '\' as the last character of a line means  the  next  line  is
              just a continuation of this one.

              A \-escape that isn't recognized as something special isn't an
              error; you may optionally escape any of the delimiters > ) ] }
              ; / # \ and get just that character.

              A  '\'  anywhere  else is just a literal backslash, so the regex
              ([abc])\1 is written just that way; there is no need to  double-
              backslash the \1 (although it will work if you do).
      # this is a comment
      # and this too \#
              A  comment  is  not  a  piece  of  preprocessor sugar -- it is a
              statement and ends at the newline or at "\#".
      insert filename
              inserts the file verbatim at this line at compile time.
      ;
              statement separator - must ALWAYS be escaped as \;  unless  it's
              inside delimiters or else it will mark the end of the statement.
      { and }
              start and end blocks of statements. Must always be  '\'  escaped
              or  inside  delimiters  or  these  will  mark the start/end of a
              block.
      noop
              no-op statement
      :label:
              define a GOTOable label
      accept
              writes the current data window  to  standard  output;  execution
              continues.
      alius
              if  the  last  bracket-group succeeded, ALIUS skips to end of {}
              block (a skip, not a FAIL); if the  prior  group  FAILed,  ALIUS
              does  nothing.  Thus,  ALIUS  is  both an ELSE clause and a CASE
              statement.
      alter (:var:) /new-val/
              destructively change value of var to newval; (:var:) is  var  to
              change  (var-expanded);  /new-val/  is  value to change to (var-
              expanded).
      classify <flags> (:c1:...|...:cN:) (:stats:) [:in:] /word-pat/
              compare the statistics of the current data  window  buffer  with
              classfiles c1...cN.

              <flags>          If <flags> is set to <nocase>, ignore case in
                               word-pat; this does not change case in the hash
                               (use tr() to do that on :in: if you want it).
              (:c1:  ...        file or files to consider "success" files. The
                               CLASSIFY succeeds if these  files  as  a  group
                               match best. If not, the CLASSIFY does a FAIL.
              |                optional  separator. Spaces on each side of the
                               " | " are required.
              .... :cN:)       optional files to the right of " | "  are  con-
                               sidered  as  a  group  to  "fail". If statement
                               fails, execution skips to end of enclosing {..}
                               block,  which  exits  with  a  FAIL status (see
                               ALIUS for why this is useful).
              (:stats:)        optional var that will  get  a  text  formatted
                               matching summary
              [:in:]           restrict  statistical  measure  to  the  string
                               inside :in:
              /word-pat/       regex to describe what a parseable word is.
      eval (:result:) /instring/
              repeatedly evaluates /instring/ until it ceases to change,  then
              places  that  result  as the value of :result: . EVAL uses smart
              (but foolable) heuristics to avoid infinite loops, like evaluat-
              ing  a  string  that  evaluates  to a request to evaluate itself
              again. The error rate is about 1 / 2^62 and  will  detect  chain
              groups of length 255 or less.  If the instring uses math evalua-
              tion (see section below on math operations) and  the  evaluation
              has  an  inequality  test,  (>,  <  or =) then if the inequality
              fails, the EVAL will FAIL to the end of block. If the evaluation
              has  a  numeric  fault  (e.g. divide-by-zero) the EVAL will do a
              TRAPpable FAULT.
      exit /:retval:/
              ends program execution. If supplied, the return  value  is  con-
              verted to an integer and returned as the exit code of the crm114
              program. If no retval is supplied, the return value is 0.
      fail
              skips down to end of the current { } block and causes that block
              to exit with a FAIL status (see ALIUS for why this is useful)
      fault /faultstr/
              forces a FAULT with the given string as the reason. The fault
              string is var-expanded.
      goto /:label:/
              unconditional branch (you can use a variable as the  goal,  e.g.
              /:*:there:/ )
      hash (:result:) /input/
              compute a fast 32-bit hash of the /input/, and ALTER :result: to
              the hexadecimal hash value. HASH is not warranted to be constant
              across  major  releases  of  CRM114, nor is it cryptographically
              secure.

              (:result:)       value that gets result.
              /input/          string  to  be  hashed  (can  contain  expanded
                               :*:vars:, defaults to the data window :_dw:)
      intersect (:out:) [:var1: :var2: ...]
              makes :out: contain the part of the data window that is the
              intersection of :var1: :var2: ... ISOLATEd vars are ignored.
              This only resets the value of the captured variable, and does
              NOT alter any text in the data window.
      isolate (:var:) /initial-value/
              puts :var: into a data area outside of the data  buffer;  subse-
              quent  changes  to this var don't change the data buffer (though
              they may change the value of any var subsequently set inside  of
              this var).  If the var already was ISOLATED, this is a noop.

              (:var:)          name of ISOLATEd var (var-expanded)
              /initial-value/  optional   initial   value   for   :var:  (var-
                               expanded). If no value is supplied, the  previ-
                               ous value is retained/copied.
      input <flags> (:result:) [:filename:]
              read  in  the  content  of  filename.  If no filename, then read
              stdin

              <byline>         read one line only
              (:result:)       var that gets the input value
              [:filename:]     the file to read
      learn <flags> (:class:) [:in:] /word-pat/
              learn the statistics of the :in: var (or the input window if  no
              var) as an example of class :class:

              <flags>          can be any of <nocase>, <refute> and
                               <microgroom>. <nocase>: ignore case in
                               matching word-pat (does not ignore case in the
                               hash; use tr() to do that on :in: if you want
                               it). <refute>: this is an anti-example of this
                               class; unlearn it! <microgroom>: enable the
                               microgroomer to purge less-important
                               information automatically whenever the
                               statistics file gets too crowded.
              (:class:)        name  of  file  holding hashed results; nominal
                               file extension is .css
              [:in:]           captured var containing the text to be  learned
                               (if omitted, the full contents of the data win-
                               dow is used)
              /word-pat/       regex that defines a "word". Things that aren't
                               "words" are ignored.
      liaf
              skips  UP to START of the current {} block (LIAF is FAIL spelled
              backwards)
      match <flags> (:var1: ...) [:in:] /regex/
              Attempt to match the given regex; if the match succeeds,
              variables are bound; if the match fails, the program skips to
              the closing '}' of this block.

              <flags>          flags can be any of

                               <absent>       statement succeeds if match not
                                              present
                               <nocase>       ignore case when matching
                               <fromstart>    start match at start of the
                                              [:in:] var
                               <fromcurrent>  start match at start of previous
                                              successful match on the [:in:]
                                              var
                               <fromnext>     start match at one character
                                              past the start of the previous
                                              successful match on the [:in:]
                                              var
                               <fromend>      start match at one character
                                              past the end of prev. match on
                                              this [:in:] var
                               <newend>       require match to end after end
                                              of prev. match on this [:in:]
                                              var
                               <backwards>    search backward in the [:in:]
                                              variable from the last
                                              successful match
                               <nomultiline>  don't allow this match to span
                                              lines
              (:var1: ...)     optional variables to bind to regex result  and
                               '(' ')' subregexes
              [:in:]           search only in the variable specified; if omit-
                               ted, :_dw: (the full input data window) is used
              /regex/          POSIX regex (with \ escapes as needed)
              If  you  build CRM114 to use the GNU regex library for MATCHing,
              be warned that GNU REGEX has numerous issues. See the KNOWN_BUGS
              file for a detailed listing.
      output <flags> [filename] /output-text/
              output an arbitrary string with captured values expanded.

              <flags>          <append>:  append to the file (otherwise, over-
                               writes)
              [filename]       filename to send output (var-expanded), default
                               output is to stdout
              /output-text/    string to output (var-expanded)
      syscall <flags> (:in:) (:out:) (:status:) /command/
              execute a shell command

              <flags>          can  be any of <keep> and <async>. <keep>: keep
                               this process around; if kept,  then  a  syscall
                               with  the same :keep: var will continue feeding
                               to and reading from  the  kept  proc.  <async>:
                               don't  wait  for  process  to send an EOF; just
                               grab what's available in the  process's  output
                               pipe  and  proceed  (limit  per  syscall is 256
                               Kbytes)
              (:in:)           var-expanded string to feed to command as input
                               (can be null if you don't want to send the pro-
                               cess something.) You must specify this  if  you
                               want to specify an :out: variable.
              (:out:)          var-expanded varname to place results into
                               (must pre-exist; can be null if you don't want
                               to read the process's output yet, or at all).
                               Limit per syscall is 256 Kbytes. You must
                               specify this if you want to use the :status:
                               variable.
              (:status:)       if you want to keep a minion proc around, or
                               catch the exit status of the process, specify a
                               var here. The minion process's PID and pipes
                               will be stored here. The program can access the
                               proc again with another syscall by using this
                               var again. When the process exits, its exit
                               code will be stored here.
      trap (:reason:) /trap_regex/
              traps faults from  both  FAULT  statements  and  program  errors
              occurring  anywhere  in the preceding bracket-block. If no fault
              exists, TRAP does a SKIP to end of block. If there  is  a  fault
              and the fault reason string matches the trap_regex, the fault is
              trapped, and execution continues with the line after  the  TRAP,
              otherwise the fault is passed up to the next surrounding trapped
              bracket block.

              (:reason:)       the fault message that caused this FAULT. If it
                               was  a  user  fault,  this is the text the user
                               supplied in the FAULT statement.
              /trap_regex/     the regex that determines what kind of faults
                               this TRAP will accept. Putting a wildcard here
                               (e.g. /.*/) means that ALL faults will be
                               trapped here.
      union (:out:) [:var1: :var2: ...]
              makes :out: contain the union of the data window segments that
              contain var1, var2... plus any intervening text as well. Any
              ISOLATEd var is ignored. This is non-surgical, and does not
              alter the data window.
      window <flags> (:w-var:) (:s-var:) /cut-regex/ /add-regex/
              window slider. This deletes up to and including the cut-regex
              from :w-var: (default: the data window), then reads additional
              text from standard input until add-regex (inclusive).

              <flags>          flags can be any of

                               <nocase>         ignore case when matching cut-
                                                and add- regexes
                               <bychar>         check   input   for  add-regex
                                                every character
                               <byline>         check  input   for   add-regex
                                                every line
                               <byeof>          wait for EOF to check for add-
                                                regex  (extra  characters  are
                                                kept around for later)
                               <eofends>        read  lots of input; the input
                                                is up to the  regex  match  OR
                                                the contents till EOF
              (:w-var:)        what var to window
              (:s-var:)        what var to use for source (defaults to stdin;
                               if you use a source var you must specify the
                               windowed var)
              /cut-regex/      var-expanded cut pattern
              /add-regex/      var-expanded  add pattern, if absent reads till
                               EOF
              If both cut-regex and add-regex are omitted, and this window
              statement is the first executable statement in the program, then
              CRM114 does not wait to read anything from standard input
              before starting program execution.
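
      Of the statements above, EVAL is perhaps the easiest to misread, so
      here is a rough Python model of its "evaluate until nothing changes"
      loop. Everything in it is an assumption made for illustration: the
      names expand_once and eval_fixed_point are hypothetical, only :*:
      expansion is modeled (no math), previously seen strings are stored in
      full rather than hashed, and raising an error on a detected loop is
      this sketch's choice, not documented CRM114 behavior.

```python
import re

# Assumed name syntax for this model: ':*:' + non-space, non-colon chars + ':'
VAR_REF = re.compile(r":\*:([^:\s]+):")


def expand_once(text, variables):
    """One pass of :*:name: substitution; unset names keep their colons."""
    def substitute(match):
        name = ":" + match.group(1) + ":"
        return variables.get(name, name)
    return VAR_REF.sub(substitute, text)


def eval_fixed_point(text, variables, max_rounds=255):
    """Re-expand text until it stops changing, refusing to loop forever."""
    seen = {text}
    for _ in range(max_rounds):
        expanded = expand_once(text, variables)
        if expanded == text:
            return text                   # nothing changed: fixed point
        if expanded in seen:
            raise RuntimeError("evaluation loop detected")
        seen.add(expanded)
        text = expanded
    raise RuntimeError("gave up after %d rounds" % max_rounds)


chain = {":a:": ":*:b:", ":b:": "42"}
print(eval_fixed_point(":*:a:", chain))   # 42
```

      A pair of variables that refer to each other, such as :a: holding
      ":*:b:" and :b: holding ":*:a:", makes this model raise instead of
      spinning.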

  A QUICK REGEX INTRO
      A regex is a pattern match. Do a "man 7 regex" for details.

      Matches are, by default, "first starting point that matches, then longest
      match possible that can fit".

      a through z
      A through Z
      0 through 9
              all match themselves.
      most punctuation
              matches itself, but check below!
      *
              repeat preceding 0 or more times
      +
              repeat preceding 1 or more times
      ?
              repeat preceding 0 or 1 time
      *?, +?, ??
              repeat preceding,  but  shortest  match  that  fits,  given  the
              already-selected  start  point  of the regex. (only supported by
              TRE regex, not GNU regex)
      [abcde]
              any one of the letters a, b, c, d, or e
      [a-q]
              the letters a through q (just one of them)
      {n,m}
              repetition count: match the preceding at least  n  and  no  more
              than m times (POSIX restricts this to a maximum of 255 repeats)
      [[:<:]]
              matches at the start of a word (GNU regex only)
      [[:>:]]
              matches the end of a word (GNU regex only)
      ^
              as first char of a match, matches the start of a line (ONLY in
              <nomultiline> matches)
      $
              as last char of a match, matches at the end of a line  (ONLY  in
              <nomultiline> matches)
      .
              (a period) matches any single character (except start-of-line or
              end of line "virtual characters", but it does match a  newline).
      a|b
              match a or b
      (match)
              the () go away, and the string that matched inside is available
              for capturing. Use \\( and \\) to match actual parentheses (the
              first '\' means "show the second '\' to the regex engine"; the
              second '\' forces a literalization onto the parenthesis
              character).
      \n
              matches the N'th parenthesized subexpression. Remember to
              backslash-escape the backslash (e.g. write this as \\1). This
              applies only if you're using TRE, not GNU regex.
      The  following  are  other POSIX expressions, which mostly do what you'd
      guess they'd do from their names.


        [[:alnum:]]
        [[:alpha:]]
        [[:blank:]]
        [[:cntrl:]]
        [[:digit:]]
        [[:lower:]]
        [[:upper:]]
        [[:graph:]]
        [[:print:]]
        [[:punct:]]
        [[:space:]]
        [[:xdigit:]]

      [[:graph:]] matches any character that puts ink on  paper  or  lights  a
      pixel.  [[:print:]] matches any character that moves the "print head" or
      cursor.
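
      A few of the constructs above can be tried out quickly in Python. Be
      aware that Python's re module is neither TRE nor a POSIX regex engine,
      so this is only an approximation: it does not accept the [[:class:]]
      expressions, its "." does not match a newline by default, and its
      alternation is not longest-match. The sample text and variable names
      are made up for this example, and only constructs that behave the same
      way on this input are shown.

```python
import re

text = "aaa <b>bold</b> and <i>italic</i> 2003"

# '*' takes the longest match from the first possible starting point.
greedy = re.search(r"<.*>", text).group(0)

# '*?' takes the shortest match from that same starting point.
lazy = re.search(r"<.*?>", text).group(0)

# '{n,m}' bounds the repeat count of the preceding bracket expression.
digits = re.search(r"[0-9]{2,4}", text).group(0)

# '\1' matches whatever the first parenthesized group matched.
doubled = re.search(r"([abc])\1", text).group(0)

# 'a|b' matches either alternative; the leftmost hit in the text wins.
either = re.search(r"bold|italic", text).group(0)

print(greedy)    # <b>bold</b> and <i>italic</i>
print(lazy)      # <b>
print(digits)    # 2003
print(doubled)   # aa
print(either)    # bold
```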

  NOTES ON SEQUENCE OF EVALUATION
      By default, CRM114 supports string length and mathematical evaluation
      only in an EVAL statement, although it can be set to allow these in any
      place where a var-expanded variable is allowed (see the -q flag). The
      default value (zero) allows stringlength and math evaluation only in
      EVAL statements, and uses non-precedence (that is, strict left-to-right
      unless parentheses are used) algebraic notation. -q 1 uses RPN instead
      of algebraic, again allowing stringlength and math evaluation only in
      EVAL expressions. Modes 2 and 3 allow stringlength and math evaluation
      in any var-expanded expression, with non-precedence algebraic notation
      and RPN notation respectively.

      Evaluation is always left-to-right; there is no precedence of operators
      beyond the sequential passes noted below. The evaluation is done in four
      sequential passes:

      1  \-constants like \n, \o377 and \x3F are substituted
      2  :*:var: variables are substituted (note the difference between a con-
         stant like '\n' and a variable like ":*:_nl:" here  -  constants  are
         substituted first, then variables are substituted.)
      3  :#:var: string-length operations are performed
      4  :@:expression: mathematical expressions are performed; syntax is
         either RPN or non-precedenced (parens required) algebraic notation.
         Embedding non-evaluated strings in a mathematical expression is
         currently a no-no.

         Allowed operators are: + - * / % > < = only.

         Only >, <, and = set logical results; they also evaluate to 1  and  0
         for continued chain operations - e.g.

         ((:*:a: > 3) + (:*:b: > 5) + (:*:c: > 9) > 2)

         is true IFF any of the following is true

