JunkMatcher Howto: Property Tests

How do I know what property X is for?

In the Test Inspector window there is a Description section that describes what each property test does.

I heard about this cool Bayesian filter being integrated into JunkMatcher - where is it?

It's right in the list of tests shown in the Analyzer window. You can get to it by using the search box in that window (just type "SpamBayes"), or you can choose the menu item "Inspect SpamBayes" from the View menu (keyboard shortcut opt-cmd-b). After opening its Test Inspector window, here is what you'll see - say hello to the SpamBayes property test:

Test Inspector on SpamBayes property test
SpamBayes is a great Bayesian filter developed by another open-source project - the author merely took their intellectual fruit and integrated it into JunkMatcher. Please do send your appreciation/donation etc. their way! You can also read more about Bayesian filtering on Wikipedia.

The Test Inspector window, in addition to the usual statistics, also shows you how many ham and spam messages SpamBayes has been trained on. It is important to keep these two numbers as close to each other as possible: if you keep feeding SpamBayes junk, soon it will think the world is filled with nothing but junk!

You can also set two cutoff values for SpamBayes here: the ham cutoff and the spam cutoff. SpamBayes detects junk mail by first computing each message's spam probability, a number between 0.0 and 1.0. If the spam probability is below the ham cutoff, the message is classified as ham; if it is greater than the spam cutoff, the message is classified as spam. Everything in between is classified as "unsure" - but currently JunkMatcher treats the unsure category as ham (this is pictorially illustrated in the screenshot above). Basically, lowering the spam cutoff increases SpamBayes' sensitivity to spam, at the risk of getting more false positives.
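If the cutoff logic is easier to read as code, here is a minimal sketch of it. The function names and the cutoff values 0.2 and 0.9 are made up for illustration; they are not taken from JunkMatcher's source or its default settings.

```python
def classify(spam_probability, ham_cutoff=0.2, spam_cutoff=0.9):
    """Map a spam probability in [0.0, 1.0] to 'ham', 'unsure' or 'spam'.

    The default cutoffs are illustrative values only.
    """
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam probability must be between 0.0 and 1.0")
    if spam_probability < ham_cutoff:
        return "ham"
    if spam_probability > spam_cutoff:
        return "spam"
    return "unsure"


def is_junk(spam_probability, ham_cutoff=0.2, spam_cutoff=0.9):
    """Only 'spam' counts as junk, because 'unsure' is currently treated as ham."""
    return classify(spam_probability, ham_cutoff, spam_cutoff) == "spam"
```

With these example values a message scoring 0.85 is "unsure" and therefore not junked; lowering the spam cutoff to 0.8 would turn the same message into spam, which is the sensitivity trade-off described above.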

Another important setting exposed here is the happy meal plan for SpamBayes (or the automatic balanced training). The idea is very simple: since JunkMatcher also runs a lot of other tests, it should be able to find interesting messages to train SpamBayes on automatically (instead of requiring us to train SpamBayes manually). These "yummy" ham/spam include the messages that SpamBayes made mistakes on, or nearly made mistakes on. Even when the SpamBayes property test is not turned on, or is turned on but does not run because a verdict was reached before it had a chance, JunkMatcher still feeds SpamBayes fresh messages as they arrive, as long as the balance between the number of trained ham and spam is kept.

So here you go: you can turn on the automatic training function, and set a "balance" ratio between the number of trained ham and spam. JunkMatcher will try very hard to keep feeding SpamBayes a balanced diet, but if SpamBayes keeps making mistakes (e.g., keeps missing spam), the actual ham vs. spam ratio could drift a bit from the designated value.
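To picture how a balance rule like this could behave, here is a rough sketch. It is not JunkMatcher's actual algorithm: the function `should_auto_train`, its parameters, and the rule that mistakes are always trained are assumptions made for illustration, with `target_ratio` meaning the desired number of ham per spam.

```python
def should_auto_train(is_spam, ham_count, spam_count,
                      target_ratio=1.0, was_mistake=False):
    """Decide whether to feed a new message to the filter (hypothetical rule).

    target_ratio is the desired ham:spam ratio of the training set.
    """
    # Assumption: messages the filter got wrong are always worth training on,
    # which is why the real ratio can drift away from the target.
    if was_mistake:
        return True
    if is_spam:
        # Accept more spam only while ham is at or above its target share.
        return ham_count >= target_ratio * spam_count
    # Accept more ham only while ham is at or below its target share.
    return ham_count <= target_ratio * spam_count
```

For example, with 10 ham and 20 spam already trained and a 1:1 target, a new ham message is accepted and a new spam message is skipped, unless SpamBayes misjudged that spam.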

Finally, in case you want to make SpamBayes forget all of the previous training, just click on the "Reset SpamBayes Training Data" button.

How do I train SpamBayes?

Sometimes you may want to train SpamBayes manually, either because the manual process can cram a lot of "spam experience" into SpamBayes much faster than automatic training, or because you just want SpamBayes to get familiar with certain types of emails. How do you proceed then?

Easy - in Mail.app, just select the emails you want to use for training, right-click (or control-click) to bring up the contextual menu, and then select either the item "Train as Ham" or "Train as Spam" to start training (remember, ham = good and spam = bad!).

Training SpamBayes in Mail.app

Don't worry about accidentally training on the same emails again: JunkMatcher remembers which messages have been used in training and will just skip those.

After the last step you will be asked for confirmation:

Training Confirmation

As the dialog box says - make sure you get the ham/spam difference right! After you click the Yes button, a progress window will pop up to tell you how many messages have been trained:

Training Progress

At the end of training, you will then be presented with a summary dialog:

Training Summary

In the Log window the messages that have been used for training will also be color coded (light blue in the Received Date column).

The important thing to note here, again, is the balance between the number of ham trained and the number of spam trained. You can read more tips about training at the SpamBayes FAQ site.

What are the property tests that require Internet connection to run?

Currently there are two:

  1. Open relay: This test queries a specified number of blacklists on the Internet (via DNS lookup) to check whether any IP address mentioned in the header of a message matches a known spamming IP address. Blacklists contain lists of "banned" IPs based on user reports. You can choose up to 3 blacklists to query, and define "safe IPs" that should not be checked.
  2. Domain name has no MX record or bogus: This test checks on the Internet (via DNS lookup) whether any MX (Mail Exchange) server is listed for the domain of the sender address; it is effective at catching bogus sender domains. In some cases it may take up to 10 seconds to complete.

There are three important characteristics of these tests:

  1. They both need to connect to the outside world via port 53 (DNS).
  2. It's fairly likely that they will give you different results when you run them at different times (the operators of blacklists keep updating their lists, and domain records can come and go).
  3. They usually take more time to run than the other tests. As mentioned, the MX test can wait up to 10 seconds for an answer - this can happen because some domain names are simply bogus (thus when a timeout happens the test result is considered positive).

This is why you might want to run only pattern tests when you're doing a "Match All" in the Analyzer window of JunkMatcher.app.
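For the curious, here is a small sketch of what a blacklist lookup over DNS looks like. It follows the usual DNS blacklist convention (reverse the IPv4 octets, append the blacklist zone, and see whether the name resolves); it is not JunkMatcher's own code, and the function names are made up for illustration. The MX check is not sketched because Python's standard library has no MX lookup.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a blacklist zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone):
    """Return True if the blacklist answers for this IP (needs network access).

    A listed address resolves to some record; an unlisted one fails to
    resolve. Lookup errors are treated as "not listed" in this sketch.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return True
```

For instance, checking the documentation address 192.0.2.99 against bl.spamcop.net means looking up the name 99.2.0.192.bl.spamcop.net. Because each check is a real DNS query, the result depends on the network and on the blacklist's current contents, which is exactly why these tests are slower and less repeatable than the others.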

How do I activate certain property tests only on emails from a certain account?

Bring up the Test Inspector window on the property (by double-clicking on it in the Analyzer window). Enter a pattern that matches the email address you use for the account into the "Recipient Pattern" field:

Test Inspector on a Property

For example, a pattern "\@mac\.com" will activate the property for all emails sent to your .Mac account.
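Here is a tiny sketch of how such a recipient pattern narrows a test down to one account. The helper `applies_to` is hypothetical, and it assumes the pattern may match anywhere in a recipient string (search semantics), which this page does not spell out.

```python
import re


def applies_to(recipient_pattern, recipients):
    """Return True if any recipient string matches the pattern.

    Assumption: the pattern is searched for anywhere in each recipient.
    """
    return any(re.search(recipient_pattern, r) for r in recipients)


MAC_ACCOUNT = r"\@mac\.com"
```

With this pattern, `applies_to(MAC_ACCOUNT, ["Joe <joe@mac.com>"])` is true, while a message addressed only to joe@mailhost.com would not activate the property.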

The same thing can be done for a pattern test too - just open the Test Inspector window on a pattern instead of a property, and enter the recipient pattern there.

How do I filter out blank junk? (And what did they possibly want to do by sending me those?)

By blank emails, we're talking about emails that show nothing in the message body; sometimes even the subject, sender, date, and/or recipient is missing.

You can turn on the property "Blank Rendering" in the Analyzer window to filter out the blank emails (or even make it a hard test).

As to why these emails were sent to you in the first place, I don't have a definitive answer to this one, but here comes my conspiracy theory:

  1. Targeting weaknesses in email clients/filters: If you look at these emails, you'll find that most of them come with something abnormal - the message ID may be malformed, or the MIME attachment may be foobared. These messages were possibly sent to you because spammers figured out problems in specific filters/mail clients and wanted to exploit them (e.g., Mail prior to 10.3.3 crashed when you marked this kind of message "as junk").
  2. Probing bounces: Spammers want to know which email addresses work and which don't - sending blank messages is one of the most cost-efficient ways to accomplish this.
  3. Misconfigured spam software: You know, spammers are users too. ;-)
  4. Disrupting statistical filters: Maybe blank mails can skew the statistical distribution of junk mails?

Insights are welcome!

I notice spammers use my correct email address in the recipients line, but with an incorrect name - how do I catch that?

(thanks to Al Heynneman for asking this question)

This is a good question, and it led to a solution that has proven to be fairly effective. The situation is like this: say you are Joe with an email address joe@mailhost.com, and you receive spam with the recipient line written as "Mary <joe@mailhost.com>". Looks like someone has your email address, but doesn't have your name right (hm I wonder who that is...).

In JunkMatcher we can catch this by setting the right patterns in the property "Recipient(s) mismatch". Bring up the Test Inspector on the property, and you can access the "Recipient Patterns" window by clicking on the "Edit Recipient Address Patterns" button:

Property "Recipient(s) mismatch"

Any legit email must match at least one of these patterns, otherwise it'll be considered junk. Adding patterns here doesn't automatically enable this functionality though - you still need to turn on the property "Recipient(s) mismatch" in the Analyzer window. For safety reasons, if no pattern is added here and the property is activated, no email will be junked by this test.

So let's add patterns that match all the possible ways people will address you in the recipient line - thus effectively ruling out the wrong ones. Here are the patterns for this example:

(?i)joe.*joe\@mailhost\.com
(?i)^<?joe\@mailhost\.com

First, note how we escape the `.' to match a literal dot. The initial "(?i)" tells JunkMatcher to ignore case when matching. The first pattern will match recipient lines like these:

Joe <joe@mailhost.com>
"Joe" <joe@mailhost.com>
joe <joe@mailhost.com>
"joe" <joe@mailhost.com>

And the second pattern will match these addresses:

<joe@mailhost.com>
joe@mailhost.com

So, the first pattern matches the correct pairings of the name and the address, while the second only matches the address. Any incorrect pairing of name and address, therefore, is not allowed.

If you can't figure out what the `^' in the second pattern does: it matches the beginning of a line - via this we make sure that nothing before the email address can match.
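If you want to try the two patterns yourself before typing them into JunkMatcher, here is a short sketch. The function `recipient_mismatch` is a made-up name, and the sketch assumes each pattern is searched for within the recipient line; it also mirrors the safety rule that an empty pattern list junks nothing.

```python
import re

# The two patterns from the example above.
RECIPIENT_PATTERNS = [
    r"(?i)joe.*joe\@mailhost\.com",
    r"(?i)^<?joe\@mailhost\.com",
]


def recipient_mismatch(recipient, patterns=RECIPIENT_PATTERNS):
    """Return True if the recipient line matches none of the patterns."""
    if not patterns:
        # Safety rule from the text: with no patterns, nothing is junked.
        return False
    return not any(re.search(p, recipient) for p in patterns)


if __name__ == "__main__":
    for line in ['"Joe" <joe@mailhost.com>',
                 "joe@mailhost.com",
                 "Mary <joe@mailhost.com>"]:
        print(line, "->", "mismatch" if recipient_mismatch(line) else "ok")
```

Only the "Mary" line comes out as a mismatch: it contains the right address but neither starts with it nor pairs it with the name "joe".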

How do I manage the list of bad sites JunkMatcher has collected from spam?

You can open the Sites window directly from its toolbar button, or by choosing the menu item "Open Sites" (key: opt-cmd-S) from the Window menu. The window shows you a list of bad sites collected so far from the matched emails.

Sites window

The rest should be pretty straightforward: you can add/remove a site from the list. You can also open the drawer for "Safe Sites": this is a list of patterns describing the sites that JunkMatcher should never regard as bad sites.

If you spot a site in the bad site table and you want JunkMatcher to never collect it again, you can select the site and click on the "Mark Safe" button - this will remove the site from the bad site list, and add a pattern matching the site to the safe site list.

If you are curious about who is behind a certain site, you can select the site and click on the "Whois" button - this might be useful if you are considering legal action.

Checking a site using Whois

What blacklists are you using?

As of March 19, 2005, I'm using (in this order) bl.spamcop.net, cbl.abuseat.org, and sbl-xbl.spamhaus.org. A good and up-to-date comparison among many blacklists can be found at Jeff Makey's "Blacklists Compared" page (accessible also in the Test Inspector window over the property "Open relay").