JunkMatcher Howto: Pattern Tests
- What is a typical workflow
to add a new pattern test?
- How do I re-target a pattern
test to a different message view?
- How do I spawn a pattern
test to target additional message views?
- How do I activate certain pattern tests
only on emails from a certain account?
- How do I activate certain pattern tests
only on emails written in certain languages?
- What is a user pattern, and how
do I change a non-user pattern into a user pattern?
- How do I import patterns from a JunkMatcher Pattern
Package (jpp) file?
- How do I export patterns to a JunkMatcher Pattern
Package (jpp) file?
- How do I get new pattern updates online?
- How do I publish my patterns online?
- By targeting a pattern to both of the
body view and the rendering view, aren't you double-penalizing
a matched email?
- Teach me regular expressions!
- Where can I learn more about Python regular
expressions?
- What are the meta patterns for?
- What are the reserved meta patterns
then?
- Help! I can't match `\', `.', `?', `+', `*',
`|', `[', `(', `)', '^', '$' etc. characters!
- How do I write more efficient regular
expressions?
- How do I match words longer than n characters?
- How do I match vowels and consonants?
- How do I match patterns spanning over
multiple lines? (example: filtering 419
scams)
What is a typical workflow to
add a new pattern test?
- You load a piece of spam into
the Analyzer window.
- You switch back and forth among different message views to
observe the most distinctive text that makes this message "spammy".
- You start typing your experimental pattern into the rounded text
field that's directly underneath the message content, and hit the "Match" button
to the right to test. The matched portion of the text will be highlighted
in red.
- When you're satisfied, hit the "Add" button and the pattern test
targeting the currently selected message view will be appended to your
list of tests in the lower half of the window.
- You may want to spawn the test to target
additional views.
- You may want to drag
to reorder some of these test instances to different positions
in the list.
How do I re-target a pattern
test to a different message view?
In the test table of the Analyzer
window (the lower-half), every pattern test displayed in a row
is given a little popup menu to the right side of the row, and you
can use that menu to change the targeting view for the test, like
this:
How do I spawn a pattern test
to target additional message views?
Easy: first select a pattern test in the Analyzer
window, then hit the "Spawn" button at the bottom of the window
- this will add an almost identical pattern test to the end of the
test table, with only one difference: the targeting message view is the first
view that has not been targeted by the same pattern. If that
view is not the one you want to target, you can always switch
the targeting view to something else.
How do I activate certain
pattern tests only on emails from a certain account?
The answer is almost identical to this
one.
How do I activate certain
pattern tests only on emails written in certain languages?
Bring up the Test Inspector window on the pattern (by double-clicking
on it in the Analyzer window). Enter a pattern that matches the charset
of target language into the "Encoding Pattern" field:
For example, to activate a pattern only on emails written in Traditional
Chinese, enter "^big5" in its Encoding Pattern field.
What is a user pattern,
and how do I change a non-user pattern into a user pattern?
The distinction between a user pattern and a non-user pattern
is only meaningful when you get new patterns from an outside source,
such as an imported jpp file or an online pattern update.
In that case a comparison is made between the set of new patterns and your set
of non-user patterns; i.e., only non-user patterns are candidates
for any kind of change: they could be removed or changed, or new non-user
patterns might be added. In other words, non-user patterns are
"managed" and are subject to any future updates.
Important corollary: if you want to make sure a certain pattern won't
be touched, you should change it into a user pattern. Take a look
at the screenshot here: see that little "user" checkbox?
Ticking it will change a pattern into a user pattern.
How do I import patterns from
a JunkMatcher Pattern Package (jpp) file?
To import patterns, you first need a JunkMatcher
Pattern Package file - this file has the file extension .jpp,
and it can be created by exporting patterns.
The jpp file contains both meta patterns and patterns,
and its icon looks like this:
You can start the importing process by double-clicking on a jpp file,
or drag the file and drop it on top of JunkMatcher.app. Alternatively,
you can choose the menu item "Import Patterns" under the File menu, and
then choose a jpp file to import.
After you open a jpp file, a dialog box pops up showing
some basic info about the package, such as who created it and when:
Clicking on "Show the Pattern Deltas" will then initiate a comparison
between the current set of non-user patterns
you have and the patterns contained in the file. If the comparison finds
no difference, you will be told so and the process ends. Otherwise,
you will be presented with a Pattern Delta window showing all the
changes contained in the jpp file:
(That's right - it is the same window you see when installing the factory
version of the patterns, if that finds any changes needed to revert
your non-user patterns back to the factory set.)
In the Pattern Delta window you can select the changes you want to accept,
so in effect you can selectively import patterns. Or you can hit "Accept
All" and then "Proceed" to accept all changes - you will then have exactly
the same set of non-user patterns contained in the jpp file.
One last word: any change won't be final until you hit the
"File -> Save Tests" menu item.
How do I export patterns to a
JunkMatcher Pattern Package (jpp) file?
To export patterns, choose the menu item "Export Patterns"
under the File menu. You will be presented with this window:
You have three options in terms of choosing what patterns to export:
you can export only the non-user patterns (maybe you want to publish only
these less personalized patterns), only the user patterns (for backup),
or all of the patterns. In any case, the following will also happen:
- All of your meta patterns will also be exported, regardless of whether
they are user or non-user meta patterns; and
- All of the patterns/meta patterns in the exported JunkMatcher Pattern
Package (jpp file) will be marked as non-user (so others
can import them to affect their non-user patterns).
The window above also allows you to enter some basic information that
will be shown to a user when she imports the resulting jpp file
- maybe indirectly from receiving a pattern
update online.
If you intend to publish the resulting jpp file, read on here.
Click on the "Proceed" button, choose a filename to save, and you're
all set.
How do I get new pattern
updates online?
You can do it via the menu item:
Or tell JunkMatcher to
check for pattern updates every time you launch JunkMatcher.app.
If a new pattern update is found, the new pattern package will be downloaded
and an importing process will then take place.
The only difference compared to importing from a jpp file
is that you don't actually see or open the file.
How do I publish my patterns
online?
The first step of publishing your patterns is of course exporting
your patterns. When the "Exporting Patterns" window shows up, make
sure you tick the box "Generate a pattern news file for publishing".
This will instruct JunkMatcher to generate a file PatternNews (no
file extension), which contains the necessary information telling a
user where to download your patterns, and what the MD5 checksum of
the jpp file is. The URL you will enter there has to be
a direct link leading to your jpp file, and it
doesn't need to be the same URL where you will put your PatternNews file.
(If you don't know what an MD5 checksum is, it's basically a "fingerprint" of
a file, so that we can compare the fingerprint with the actual file to
see if the file has been tampered with or corrupted.)
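To make the "fingerprint" idea concrete, here is a minimal sketch of computing the MD5 checksum of a file with Python's standard hashlib module. It is written for a current Python 3 interpreter, the helper name md5_of_file is made up for this illustration, and the temporary file merely stands in for a real jpp file; how JunkMatcher itself computes or stores the checksum in PatternNews is not shown here.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 checksum of a file as a hex string (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Stand-in for an exported jpp file: a temporary file with known content.
content = b"pretend this is a pattern package"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(content)
    package_path = tmp.name

checksum = md5_of_file(package_path)
os.remove(package_path)

# The fingerprint depends only on the bytes: same content, same checksum.
same_content = checksum == hashlib.md5(content).hexdigest()
changed_content = checksum == hashlib.md5(content + b"!").hexdigest()
print(checksum, same_content, changed_content)
```

If a single byte of the downloaded file differs from what was published, the two checksums no longer agree, which is how a corrupted or altered package can be detected.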
After you get the exported PatternNews file and jpp file,
you can then upload them to your server(s). When you announce the published
patterns, tell everyone the URL where you place the PatternNews file
so they can change
their pattern update preference accordingly. Just remember that the
URL actually leading to the jpp file has to match the URL
you entered in the Exporting Patterns window.
Let me reiterate: a jpp file and its PatternNews file
don't need to reside at the same location.
By targeting a pattern
to both of the body view and the rendering view, aren't you double-penalizing
a matched email?
Your reasoning is correct because the
body view and the rendering view are just two different views of the
same content.
That's exactly why there is a hidden provision: whenever a pattern matches
either one of the body/rendering views, it will not try to match against
the other.
However, to facilitate the process of building/testing patterns, when
you click on the "Match All" button in JunkMatcher.app, it will still
report matches from the same pattern in both views.
Teach me regular expressions!
In JunkMatcher you can write your own patterns using regular expressions,
a fairly flexible and compact representation for textual patterns.
In particular, the regular expressions used here are the ones
used in the Python programming language - but
you don't need to learn the language to start writing your own patterns! Don't
you worry - it's really not hard. I'll explain some basics here. If you
want to know more, you can read from here.
(To be more specific, the Python regular expression syntax used here
is from Python version 2.3.x - this is the version Apple shipped in OS
X 10.3.x Panther)
Let's take a look at some of the patterns designed to match a message's body
view (each line below is one pattern):
(?i)v\W?i\W?a\W?g\W?r\W?a
(?i)p\W?e\W?n\W?i\W?s
(?i)<\s*img[^>]+(?:low)?src\s*=\s*(?:'|")\s*http:
(?i)http://\S*\.biz
The first pattern matches variations of that powerful word in most of
the junk - viagra. The initial (?i) makes the pattern
case-insensitive (the default is the opposite), so we'll match vIagra, viAGra,
etc. The `\W' is a special sequence, representing
a single non-alphanumeric character, such as a punctuation mark
(alphanumeric characters are things like a...z, A...Z, 0...9,
etc.), so we'll be able to match vi!agra, etc. The trailing
`?' makes the pattern preceding it optional, so
`\W?' means the non-alphanumeric character is optional,
and we can match both viagra and vi!agra.
With this much knowledge, you already know what the 2nd pattern is about.
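If you have Python at hand, you can try the first pattern outside JunkMatcher. The following is only a sketch, written for a current Python 3 interpreter rather than the 2.3.x mentioned above; the sample strings are made up for illustration.

```python
import re

# The first example pattern, copied verbatim from the list above.
viagra = re.compile(r"(?i)v\W?i\W?a\W?g\W?r\W?a")

samples = ["viagra", "vIagra", "vi!agra", "v.i.a.g.r.a", "via gra", "v--iagra"]

# Keep only the samples the pattern finds a match in.
hits = [text for text in samples if viagra.search(text)]
print(hits)
```

Note that "v--iagra" is not matched: each `\W?' allows at most one non-alphanumeric character between two letters.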
The 3rd pattern is by far the most complex pattern: it matches any mention
of an HTML img tag whose source is an http: URL. Here is some of the text the pattern
will match:
<IMG SRC="http://blah blah blah" border=0>
<img border=0 lowsrc="http://blah blah blah">
<img src='http://blah blah blah'>
That's right, my friend - they are all capable of rendering images in
your HTML-enabled Mail.app! You'll be amazed by how much creativity the
spammers have these days to hide the real things they want to say. But
with regular expressions in our arsenal, we shall be afraid no more!
Some of the things required to understand the 3rd pattern are already
covered, so let's concentrate on the new stuff here. First, `\s'
is another special sequence, representing any whitespace character (such
as spaces, tabs, returns, etc.), and the trailing `*' repeats
the preceding pattern zero or more times. All in all, `\s*'
will match zero or more whitespace characters, and thus allow us to skip
an arbitrary number of them.
The second new thing here is `[^>]+'. The `[]'
construct lists all the possible characters, while `[^]'
lists all the invalid characters (for matching). So `[^>]'
can match anything other than `>'. How about the trailing
`+'? That's very similar to `*', but it matches one or
more times (i.e., at least one time). So here you go: `[^>]+'
matches one or more non-`>' characters - because we want
to skip anything other than the `src' part of the string, but
we don't want to wander out of the img tag yet (which ends
with `>').
The last thing you'll need to know in order to understand the pattern
is the meaning of `(?:'|")'. The construct `(?:)'
groups multiple patterns into one. But what about this vertical
bar? It simply signals alternatives - so the pattern `(?:'|")'
can match either ' or ".
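Here is a small sketch that runs the 3rd pattern against the three sample tags shown earlier, plus two made-up strings that should not match. It assumes a current Python 3 interpreter; the variable names are only for illustration.

```python
import re

# The 3rd example pattern, copied verbatim. A raw triple-quoted string
# lets us keep both ' and " inside the pattern without extra escaping.
img = re.compile(r"""(?i)<\s*img[^>]+(?:low)?src\s*=\s*(?:'|")\s*http:""")

should_match = [
    '<IMG SRC="http://blah blah blah" border=0>',
    '<img border=0 lowsrc="http://blah blah blah">',
    "<img src='http://blah blah blah'>",
]
should_not_match = [
    '<img src="local.gif">',          # the source is not an http: URL
    '<a href="http://example.com">',  # not an img tag at all
]

matched = [bool(img.search(text)) for text in should_match]
rejected = [bool(img.search(text)) for text in should_not_match]
print(matched, rejected)
```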
There is not much left to learn in order to understand the last pattern:
this one matches any URL link pointing to a .biz site; for
example, `http://www.spammer.biz' (for some reason, most
of the spam I've seen has something to do with a .biz site). One of the
new things here is how we match a single `.' (dot) character
- this deserves special mention because `.' alone means
an arbitrary character in regular expressions, namely, it can match any
character. To use the literal meaning of `.', we need to escape the
special meaning by putting a `\' before `.',
so `\.' will match a dot, which is what we need here (this
is called escaping). The last new thing
in this pattern is `\S' - yet another special sequence.
Note this is a capital S, and represents the opposite of what `\s'
(a small s) means: any non-whitespace character.
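The difference between `\.' and a bare `.' is easy to see in a quick experiment. This sketch assumes a current Python 3 interpreter; the "unescaped" variant is a deliberately wrong pattern made up to show what goes astray, and the URLs are invented.

```python
import re

biz = re.compile(r"(?i)http://\S*\.biz")        # the pattern from this howto
unescaped = re.compile(r"(?i)http://\S*.biz")   # same, but with the dot left unescaped

# A literal ".biz" is required by the escaped version.
hit_plain = bool(biz.search("visit http://www.spammer.biz today"))
hit_upper = bool(biz.search("HTTP://WWW.SPAMMER.BIZ/offer"))
miss_com = bool(biz.search("http://www.example.com"))

# Without the backslash, `.' matches any character, so "showbiz" slips through.
escaped_on_showbiz = bool(biz.search("http://my-showbiz"))
unescaped_on_showbiz = bool(unescaped.search("http://my-showbiz"))

print(hit_plain, hit_upper, miss_com, escaped_on_showbiz, unescaped_on_showbiz)
```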
Finally we'll learn the meaning of `^' and `$':
the former matches the beginning of a line, and the latter matches the
opposite - the end of a line. This is useful in specifying a safe IP
pattern so that IPs matched won't be sent off to a blacklist for examination
(saving time and bandwidth):
^127\.0
which makes any IP address starting with `127.0' safe (127.0.0.1 is
the default address of your Mac); and this is also useful in specifying
a filename to match a junk attachment:
(?i)\.pif$
which matches any name ending with the string `.pif' (case-insensitive).
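The two anchored patterns can be checked the same way. This is a sketch for a current Python 3 interpreter; the IP addresses and file names are made-up single-line samples, so the start and end of the text are also the start and end of the line.

```python
import re

safe_ip = re.compile(r"^127\.0")      # safe IP pattern from above
junk_file = re.compile(r"(?i)\.pif$")  # junk attachment name pattern from above

# `^' pins the match to the start, so 127.0 in the middle does not count.
ip_results = {ip: bool(safe_ip.search(ip)) for ip in ["127.0.0.1", "10.127.0.1"]}

# `$' pins the match to the end, so .pif followed by more text does not count.
file_results = {
    name: bool(junk_file.search(name))
    for name in ["invoice.PIF", "photo.pif", "invoice.pif.txt"]
}

print(ip_results)
print(file_results)
```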
Where can I learn more about
Python regular expressions?
Here is the section
in the Python documentation (version
2.3.x) describing the regular expressions usage.
What are the meta patterns
for?
In JunkMatcher I extended Python's regular
expressions a little bit with the addition of meta patterns: these
are patterns you can use in writing your patterns (a bit like macros).
I guess by now you have certainly noticed that spammers like to use different
characters of similar shape as substitutes for the original one; for
example, they write "v1@gra" for
"viagra" - this is to confuse statistical junk filters, since
they count word frequencies to figure out what is junk. Without meta
patterns you would have to write patterns like
"v[i1][a@]gra" to match variations, and this becomes
tedious considering the many other keywords waiting to be obfuscated ("c@$h",
anyone?). Enter meta patterns: you can define a meta pattern "[i1]" and
name it "i", and similarly define a meta pattern "a".
Now you can use
"v(?#i)(?#a)gra" to match many variations of this popular
word. Note the connection between meta pattern names and their usage:
for a meta pattern with name "name", you use it in this form:
"(?#name)" (that's right -
"(?#...)" is used for comments in Python regular expressions,
and I abused it for convenience). Managing your meta patterns can be
easily done in JunkMatcher.app. Only one restriction: you can't use a
meta pattern in defining another meta pattern (no recursion, that is)
- at least for now.
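To get a feel for how such macro-like substitution could work, here is a rough sketch in Python 3. It is not JunkMatcher's actual implementation: the function expand_meta and the table META_PATTERNS are hypothetical names invented for this example, and the definition of "a" as "[a@]" is an assumption based on the "v[i1][a@]gra" example above.

```python
import re

# Hypothetical meta pattern table: name -> definition.
META_PATTERNS = {"i": "[i1]", "a": "[a@]"}


def expand_meta(pattern, metas=META_PATTERNS):
    """Replace each (?#name) by its definition in a single pass.

    Unknown names are left untouched, so they remain ordinary regexp
    comments. A single pass also means no recursion between meta patterns.
    """
    def substitute(match):
        return metas.get(match.group(1), match.group(0))

    return re.sub(r"\(\?#(\w+)\)", substitute, pattern)


expanded = expand_meta("v(?#i)(?#a)gra")
print(expanded)  # v[i1][a@]gra

compiled = re.compile("(?i)" + expanded)
results = [bool(compiled.search(word)) for word in ["viagra", "v1@gra", "V1AGRA", "vxagra"]]
print(results)
```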
What are the reserved
meta patterns then?
A small set of meta patterns are special in that they cannot be altered.
These patterns are called reserved meta patterns, and are built
directly from your settings in Mail.app. For example, myEmails means
a pattern matching any one of your email addresses specified in the Account
Setting of Mail.app. If you have multiple accounts, individual reserved
meta patterns are also created, such as myEmail1, myEmail2,
etc. These are useful since spammers like to make their messages more "personalized",
but they only know your email addresses (and don't know your full name).
You can take a look at the available reserved meta patterns in the Meta
Patterns drawer in JunkMatcher.app.
Help! I can't match `\', `.',
`?', `+', `*', `|', `[', `(', `)', '^', '$' etc. characters!
Yes, you've been bitten by one of the thorniest problems in writing
regular expressions: escaping. You see, some characters
carry special meaning in regular expressions; for example, `.' is used
to match a single character (any character), and `?' signals that
a particular pattern is optional (can appear zero or one time), etc.
To use their original, literal meaning, you have to precede these
characters with a `\' (backslash) - to `escape' from their special meaning.
Here is a concrete example: to match the latest crop of mutant "\/iagra" (that's
right, the `v' is written as a backslash plus a forward slash), you need
to use this pattern: "\\/iagra" - note the extra backslash
for escaping? Better yet, create a meta
pattern for the mutant `v': "(?:v|\\/)" - this should
cover both the normal and the mutant `v'.
Unfortunately there is one exception to the escaping rule above: inside "[]" most special
characters lose their power, so you don't need to escape them. For example: "[s$]" matches
either `s' or `$' - note you don't add a backslash before `$'. (A few characters,
namely `\', `]', `^' and `-', can still be special inside "[]".) Programmers
are a weird bunch, aren't they? Details can be found, again, here.
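The escaping rules are easier to trust once you have seen them work. The sketch below assumes a current Python 3 interpreter; re.escape is a standard library helper that adds the backslashes for you when you want to match a piece of text literally, and the sample strings are made up.

```python
import re

# Escaped backslash: matches both the normal and the mutant `v'.
mutant = re.compile(r"(?i)(?:v|\\/)iagra")
mutant_hits = [bool(mutant.search(text)) for text in [r"cheap \/iagra", "cheap viagra"]]

# Inside [] the `$' needs no backslash.
cash = re.compile(r"(?i)c[a@][s$]h")
cash_hits = [bool(cash.search(text)) for text in ["cash", "c@$h", "CA$H"]]

# re.escape() escapes every special character in a literal string.
literal = "$9.99 (now!)"
text = "only $9.99 (now!) today"
found_escaped = re.search(re.escape(literal), text) is not None
found_raw = re.search(literal, text) is not None  # `$', `.', `(' keep their special meaning

print(mutant_hits, cash_hits, found_escaped, found_raw)
```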
How do I write more efficient
regular expressions?
JunkMatcher uses the regular expression module provided by Python,
which already does some optimization when compiling each expression into
its internal representation. But there are still some general rules of
thumb for crafting more efficient patterns:
- When grouping multiple patterns into one, use `(?:)'
instead of `()'. For example, to say "apple or banana",
write (?:apple|banana) instead of (apple|banana).
This is because the latter grouping instructs Python to remember the
content of the match (if any), which is not used in JunkMatcher anyway.
- Instead of writing multiple related patterns, merge them into one
using the alternative delimiter `|'. For example,
you can have two patterns to match either apple or banana,
but you can also write (?:apple|banana). But sometimes
you might value clarity more than efficiency and decide not to merge
patterns - it's up to you.
- Try to merge the common part of alternative patterns. For example,
you can write (?:refinance|refinancing), but refinanc(?:e|ing) can
give you better efficiency (Python probably already does this automatically,
but it never hurts to be a bit considerate). Again, use your judgment
between readability and efficiency.
- When listing multiple single-character alternatives, use `[]'
instead of `(?:)' plus `|'. For example,
spammers like to replace characters with other characters of
similar "shape" -
"interested" becomes "1nterested", etc. You can write [1i]nterested instead
of (?:1|i)nterested for better efficiency.
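The rewrites suggested in the list above change how a pattern is written, not what it matches. Here is a sketch (current Python 3 assumed, sample words made up) that checks this for each pair and shows what "remembering the content" means; it makes no claim about how much faster any variant actually is.

```python
import re

# () remembers what it matched as a group; (?:) does not.
capturing = re.compile(r"(apple|banana)")
non_capturing = re.compile(r"(?:apple|banana)")
group_counts = (capturing.groups, non_capturing.groups)


def same_matches(pattern_a, pattern_b, samples):
    """True if both patterns accept exactly the same samples (helper for this demo)."""
    return all(
        bool(re.search(pattern_a, s)) == bool(re.search(pattern_b, s)) for s in samples
    )


merged_ok = same_matches(
    r"(?:refinance|refinancing)", r"refinanc(?:e|ing)",
    ["refinance", "refinancing", "refinanced", "finance"],
)
char_class_ok = same_matches(
    r"(?:1|i)nterested", r"[1i]nterested",
    ["interested", "1nterested", "lnterested"],
)

print(group_counts, merged_ok, char_class_ok)
```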
How do I match words longer
than n characters?
This is useful in identifying spam since spammers tend to use really
long strings of characters as garbage to mislead statistically trained
filters. According to this
site, the longest English word contains 28 characters (letters),
but in normal usage we might just give ourselves a smaller number, say
20. We can use this pattern:
\w{20,}
to match any word consisting of 20 or more alphanumeric characters.
You can certainly juice this up based on your intuition.
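A quick way to see where the limit falls, as a sketch for a current Python 3 interpreter with made-up sample text. Keep in mind that `\w' also counts digits and the underscore as word characters.

```python
import re

long_word = re.compile(r"\w{20,}")

# 19 characters is too short, 20 is just enough.
at_19 = bool(long_word.search("a" * 19))
at_20 = bool(long_word.search("a" * 20))

# A 24-character garbage string hidden in otherwise normal text.
garbage = "qz" * 12
found = long_word.findall("cheap meds " + garbage + " click here")

print(at_19, at_20, found)
```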
How do I match vowels
and consonants?
Vowels are easy: "[aeiou]", and consonants are not difficult
either: "[^aeiou]". The `^' character inside a pair of square
brackets means "not", so what we're saying in the consonants
pattern is "any single character that is not a vowel" (which also
covers digits, punctuation and whitespace, not just consonants). Note,
however, that once outside the "[]", `^' means the start of a line.
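The sketch below (current Python 3 assumed) shows what the two character classes pick out of some made-up text. The last pattern is an additional, stricter alternative that is not from this howto: it lists the consonant letters explicitly as ranges, in case you do not want digits or punctuation to count.

```python
import re

vowels = re.findall(r"(?i)[aeiou]", "Viagra")

# "Not a vowel" also picks up the digit and the @ sign.
not_vowels = re.findall(r"[^aeiou]", "v1@gra")

# Stricter alternative (an assumption of this example): letters only.
strict_consonants = re.findall(r"(?i)[b-df-hj-np-tv-z]", "v1@gra")

print(vowels, not_vowels, strict_consonants)
```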
How do I match patterns
spanning over multiple lines? (example: filtering 419
scams)
(thanks to Jonathan Cardozo for asking this
question)
OK, say we want to filter junk mails that contain certain keywords, but
the words do not necessarily show up on the same line. For example,
one way to filter the (in-)famous 419
scam emails is to detect words like: Nigeria, fund, transfer,
etc. It should be easy, right? Our first try is:
(?i)Nigeria.+fund.+transfer
This usually won't work, since these 3 words might show up on separate
lines, and by default `.' will not match
the newline characters that break lines! The solution is
simple: we use an additional flag `s' to indicate we want
`.' to match newline characters as well (detailed here:
see the explanation for flag `S' - NOTE we use the lowercase
`s' inside a regexp instead of the capital `S'
used when writing the expression inside Python programs):
(?is)Nigeria.+fund.+transfer
There is still a small problem though: these 3 words can show up in
a different ordering! For 3 words we can have 6 different orderings,
so we'll just exhaust them all:
(?is)(?:Nigeria.+fund.+transfer|Nigeria.+transfer.+fund|...)
(Yeah I got lazy - the "..." above is left to you to fill
in the rest of the combinations)
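If typing out every ordering by hand feels error-prone, a few lines of Python can build the pattern text for you to paste into JunkMatcher. This is only a sketch for a current Python 3 interpreter; the helper any_order is a made-up name and not part of JunkMatcher, and the sample message is invented.

```python
import re
from itertools import permutations


def any_order(words):
    """Build a (?is) pattern that matches the words in any order (hypothetical helper)."""
    alternatives = [
        ".+".join(re.escape(word) for word in order) for order in permutations(words)
    ]
    return "(?is)(?:" + "|".join(alternatives) + ")"


pattern = any_order(["Nigeria", "fund", "transfer"])
print(pattern)

message = "Dear friend,\nI need to TRANSFER a large\nfund out of\nNigeria."

# The single-ordering, single-line first try misses this message...
first_try = bool(re.search(r"(?i)Nigeria.+fund.+transfer", message))
# ...while the generated pattern covers all 6 orderings across lines.
all_orders = bool(re.search(pattern, message))
print(first_try, all_orders)
```

The number of orderings grows very quickly (4 words give 24 alternatives, 5 words give 120), so this only stays practical for a handful of keywords.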
I know what you're thinking: this is brain-dead! Well, I'm afraid this
is how things have to be done using regular expressions. Most of the
time you don't have to enumerate all possible orderings - language is
not just an arbitrary juxtaposition of symbols after all. Also, choose your
keywords wisely so they are both general enough to get the bad guys,
and precise enough to avoid catching the good guys.