JunkMatcher: Pathological Study of Junk Mails

Ever heard of programming contests where people compete to write the most obfuscated code? Some programming languages make it far easier than others to write truly gibberish-looking code (see here and here), and I'm afraid HTML is one of them (well, arguably HTML is not a programming language in the traditional sense). That's right, my friends: the guys behind the junk mails are playing exactly this kind of game to evade the ever more powerful and robust statistical junk filters.

This page is devoted to the study of obfuscated HTML - in particular, its use in junk emails. My hope is that by spreading knowledge of the techniques the spammers use, we can better defend ourselves. You are welcome to send me new cases and your own studies - or even better, a solution to block them through decoding/pattern matching.

(Almost all of the tricks documented here have been addressed in the implementation of JunkMatcher)

Case 1: Hiding messages using tables
Case 2: Using <base ...> tag to hide external links
Case 3: Using white font color to hide texts (updated 20040202)
Case 4: Using <meta http-equiv="refresh"...> trick
Case 5: Abusing Google/Yahoo/MSN to redirect (updated 20040425)
Case 6: Obfuscating a URL (updated 20040128)
Case 7: Using tiny font to hide gibberish (updated 20040304)
Case 8: Web tracking tricks
Case 9: Vacuous tags (updated 20040425)
Case 10: Hidden URLs

Case 1: Hiding messages using tables

This is a perfect example showing how obfuscated HTML can be. First, here is the message you will see in any modern HTML-capable email client:

And you can click here to see the source code (the links are edited out, and it's reformatted for clarity and saved in text file format). Doesn't that look innocent? Now, look closer. There are multiple tables in this HTML message, and the cells are aligned such that the text is split and placed into separate cells; but when you put them together, the real message shows up. It is clearer if you change all border=0 to border=1 in the source, so the grid lines of the tables are shown. Now the message looks like this:

Solutions: This presents a difficult problem to both pattern-based and statistically trained filters, since spammers can split up the message any way they want. A hackish way is to filter out any message using a <table...> tag - effective, but it will probably filter out some benign emails too (but then, how often do your friends or colleagues use tables in their HTML-based messages?). A more accurate solution is to match against the final rendering of a message: for example, JunkMatcher provides a text-based rendering of an HTML-based email so you can build patterns for it.

Case 2: Using <base ...> tag to hide external links

You know your friends and colleagues rarely send you emails containing image links pointing to some random external sites (something like <img src="http:...">), so using JunkMatcher you can specify this regular expression to filter out all emails containing them:
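As a rough illustration of the idea (a sketch in Python's regular expression syntax, not necessarily the pattern JunkMatcher uses), an external image link can be recognized as an <img> tag whose src value starts with http: or https:. The names EXTERNAL_IMG and has_external_image are made up for this example.

```python
import re

# Hypothetical pattern: an <img> tag whose src value starts with http: or https:.
# [^>]* also spans newlines, so a tag broken across lines still matches.
EXTERNAL_IMG = re.compile(r'<img\b[^>]*\bsrc\s*=\s*["\']?\s*https?:', re.IGNORECASE)

def has_external_image(html: str) -> bool:
    """Return True if the HTML source contains an externally hosted image."""
    return EXTERNAL_IMG.search(html) is not None

print(has_external_image('<img\n        src="http://www.spammer.biz/a.gif">'))  # True
print(has_external_image('<img src="cid:part1.gif">'))                         # False
```

An attached image referenced by a cid: link or a relative path does not match, which is exactly the gap the next trick exploits.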

But then again we don't want to filter out the messages with "attached" images, because sometimes people do send HTML messages with internal image links (the <img...> links without a following http:...). Therein lies an opportunity waiting to be exploited. Look at the following edited junk mail:

Just by looking at the <img...> line itself the message seems fine, since this is an internal image link. BUT, when it's combined with a <base...> statement, it's no longer internal! This actually applies not only to image links, but to any kind of links!

Solutions: This one is easy - filter out all emails containing a <base...> tag! Who with a benign motive would send you a carefully disguised message using a <base...> tag to trick you into thinking that an external link is an internal one? A pattern achieving this is shown below:
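A minimal sketch of such a check in Python, assuming it is enough to detect any <base ...> tag in the source. It also shows, using the standard urllib.parse.urljoin function, how a base address turns an innocent-looking relative link into an external one. BASE_TAG, has_base_tag and the sample address are illustrative only.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: any <base ...> tag. The \b keeps <basefont> from matching.
BASE_TAG = re.compile(r'<base\b[^>]*>', re.IGNORECASE)

def has_base_tag(html: str) -> bool:
    """Return True if the HTML source contains a <base ...> tag."""
    return BASE_TAG.search(html) is not None

# How a browser resolves a relative image link once a base address is set:
resolved = urljoin("http://www.spammer.biz/", "images/a.gif")
print(resolved)  # http://www.spammer.biz/images/a.gif

print(has_base_tag('<BASE HREF="http://www.spammer.biz/">'))  # True
```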

Case 3: Using white font color to hide texts

Look at the screenshot of a junk message below - nothing special right? (except that it is a piece of spam email)

Now if you select all of the text so that it is highlighted, you'll see there is actually hidden text - you can't see it without highlighting the text because the font color is white:

One burning question: why did they go to so much effort to hide some seemingly off-topic text? In this case the hidden text reads like this: "whiff kelp arisen sumptuous mardi bicameral coxcomb ashland nab transferral poisson programmer cretin deciduous colatitude annihilate...". I can think of at least two reasons: (i) to feed garbage to statistically trained junk filters, and (ii) to hide some tracking information, so that by looking at the message they know who received it.

Solutions: Who in their right minds would send you white text on white background? So this one is easy - filter out all messages with white text! The following is the pattern written in regular expression:

The RGB code for white is #FFFFFF, but sometimes spammers use a "near-white" color; that's why I match only the first 4 hexadecimal digits, or match if there is an F in the odd-numbered place. Also note this assumes the HTML message has a white background, so that white text is meant to be hidden.
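One possible reading of that rule, sketched in Python: a color value counts as near-white if its first four hexadecimal digits are F, or if the first, third and fifth digits are all F. This interpretation, and the names WHITE_TEXT and has_white_text, are assumptions made for the example.

```python
import re

# Hypothetical pattern: color=#FFFFxx, or color=#FxFxFx (an F in every
# odd-numbered place). The lookbehind keeps bgcolor=... from matching.
WHITE_TEXT = re.compile(
    r'(?<![a-z])color\s*=\s*["\']?#(?:ffff[0-9a-f]{2}|(?:f[0-9a-f]){3})',
    re.IGNORECASE,
)

def has_white_text(html: str) -> bool:
    """Return True if a color attribute is set to white or near-white."""
    return WHITE_TEXT.search(html) is not None

print(has_white_text('<font color="#FFFFFE">whiff kelp arisen</font>'))  # True
print(has_white_text('<font color=#F8F9FA>sumptuous mardi</font>'))      # True
print(has_white_text('<body bgcolor="#FFFFFF">'))                        # False
```

Note that the first-four-digits rule also catches colors such as #FFFF00, which is yellow rather than near-white, so this sketch errs on the aggressive side.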

Case 4: Using <meta http-equiv="refresh"...> trick

OK, this one is obvious - HTML-based messages of this type usually arrive empty, but in the header section there is this sinister meta tag, like this:

If you open this mail in an HTML-enabled mail client, this will immediately direct you to another website. Think your friend would do this? If not, just use this pattern in JunkMatcher:
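A sketch of what such a pattern might look like in Python, assuming the tag takes the usual <meta http-equiv="refresh" ...> form; META_REFRESH and has_meta_refresh are names invented for the example.

```python
import re

# Hypothetical pattern: a <meta> tag whose http-equiv attribute is "refresh".
META_REFRESH = re.compile(
    r'<meta\b[^>]*\bhttp-equiv\s*=\s*["\']?\s*refresh', re.IGNORECASE
)

def has_meta_refresh(html: str) -> bool:
    """Return True if the HTML source contains a meta refresh tag."""
    return META_REFRESH.search(html) is not None

sample = '<META HTTP-EQUIV="refresh" CONTENT="0; URL=http://www.spammer.biz">'
print(has_meta_refresh(sample))  # True
```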

Case 5: Abusing Google/Yahoo/MSN to redirect

This must have been going on for a while, but I only picked it up lately from the JunkMatcher log. The trick is to obfuscate a real website link with a Google front, like this:

The real site they want you to visit is "http://www.spammer.biz". There could be fluff in between the `?' and "q=" but that doesn't matter. Also "www.google.com" can be just "google.com". To filter this, use the following pattern in JunkMatcher:
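A hedged sketch of such a pattern in Python. It assumes the redirect target appears unencoded after "q=" and accepts any path on google.com; the "/url" path in the sample links is a guess used only for illustration, as are the names GOOGLE_REDIRECT and is_google_redirect.

```python
import re

# Hypothetical pattern: a google.com link whose query string carries another
# http(s) URL in the q= parameter, with optional fluff between '?' and 'q='.
GOOGLE_REDIRECT = re.compile(
    r'https?://(?:www\.)?google\.com/[^\s"\'<>?]*\?[^\s"\'<>]*?q=https?:',
    re.IGNORECASE,
)

def is_google_redirect(text: str) -> bool:
    """Return True if the text contains a Google-fronted redirect link."""
    return GOOGLE_REDIRECT.search(text) is not None

print(is_google_redirect("http://www.google.com/url?q=http://www.spammer.biz"))   # True
print(is_google_redirect("http://google.com/url?sa=X&q=http://www.spammer.biz"))  # True
print(is_google_redirect("http://www.google.com/search?q=junkmatcher"))           # False
```

An ordinary search link is left alone because its q= value is a search term rather than a URL.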

I've also found more Yahoo sites: drs.yahoo.com, eur.rd.yahoo.com and srd.yahoo.com can be used to achieve the same effect. In fact the "eur" part of eur.rd.yahoo.com can be replaced by the names of some other geographic regions, such as "us". Taking the common denominator, this pattern should be sufficient:

In this case the real site is spammer.biz, and the SOME_FLUFF part cannot be some random string, which implies it might contain necessary tracking info. Filtering this is also simple:

Actually the pattern above can filter both Yahoo and MSN tricks. Updated 20040425: another MSN site used for redirect is ads.msn.com - the above pattern should work also.

JunkMatcher version 1.06+ introduced a new email property: "Has bad site(s)", which "incorporated" solutions to all of the tricks documented here.

Case 6: Obfuscating a URL

Again these tricks are old, but still worth mentioning. The first trick is to use `@' in a URL:

Everything up to the `@' does not contribute to the final destination you'll link to; the real site in this case, is "http://www.spammer.biz". Use this pattern in JunkMatcher to filter this out:
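For illustration, a simple Python sketch that flags any http(s) URL containing an `@' anywhere in it. This is an assumption about the shape of the pattern, chosen because it is broad enough to behave as described in the rest of this section; AT_IN_URL and has_at_in_url are made-up names.

```python
import re

# Hypothetical pattern: an http(s) URL with an '@' anywhere before the URL ends
# (whitespace, a quote or an angle bracket is taken as the end of the URL).
AT_IN_URL = re.compile(r'https?://[^\s"\'<>]*@', re.IGNORECASE)

def has_at_in_url(text: str) -> bool:
    """Return True if the text contains a URL with an '@' in it."""
    return AT_IN_URL.search(text) is not None

print(has_at_in_url("http://www.example.com@www.spammer.biz/"))  # True
print(has_at_in_url("http://www.cs.cmu.edu/"))                   # False
```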

According to this article (20040128), Microsoft plans to drop support for usernames in URLs in the upcoming Internet Explorer update. If the plan goes through, whether it will entirely eliminate the use of this trick remains to be seen.

If you look closer at the pattern above, this will also filter out messages containing links like this:

This is a URL containing your email address "me@mycompany.com". Supposedly if you follow the link your email address will be reported back. I consider this a bad practice in email correspondence, but you might want to allow this.

The second trick, rarely seen these days, is to use a really big decimal number to encode an IP address. For example, the URL "http://2147666867/" is really "http://128.2.203.179", which is "http://www.cs.cmu.edu". The encoding is done by using this formula:
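The example above is consistent with the usual base-256 expansion of a dotted address a.b.c.d, that is a*256^3 + b*256^2 + c*256 + d. A small Python sketch of the conversion in both directions (the function names are invented for this example):

```python
def ip_to_decimal(dotted: str) -> int:
    """Encode a dotted IPv4 address a.b.c.d as a single decimal number."""
    a, b, c, d = (int(part) for part in dotted.split("."))
    return a * 256**3 + b * 256**2 + c * 256 + d

def decimal_to_ip(number: int) -> str:
    """Decode the decimal number back into dotted form, one byte at a time."""
    return ".".join(str((number >> shift) & 0xFF) for shift in (24, 16, 8, 0))

print(ip_to_decimal("128.2.203.179"))  # 2147666867
print(decimal_to_ip(2147666867))       # 128.2.203.179
```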

How do we filter this? The smallest obfuscated and meaningful IP (from "1.0.0.0") according to the formula is 16777216, which has 8 digits, and any "bigger" IP converts to a larger number, so the encoded host always has at least 8 digits. The pattern is:
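A sketch of a pattern along these lines in Python: the host part must consist of 8 to 10 digits and nothing else (the largest possible value, for 255.255.255.255, is 4294967295, which has 10 digits). DECIMAL_HOST and has_decimal_host are names made up for the example.

```python
import re

# Hypothetical pattern: the host is a bare run of 8 to 10 digits. The lookahead
# rejects dotted addresses and hostnames that merely start with digits.
DECIMAL_HOST = re.compile(r'https?://\d{8,10}(?![\d.])', re.IGNORECASE)

def has_decimal_host(text: str) -> bool:
    """Return True if the text contains a URL whose host is one big number."""
    return DECIMAL_HOST.search(text) is not None

print(has_decimal_host("http://2147666867/"))     # True
print(has_decimal_host("http://128.2.203.179/"))  # False
```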

The third trick specifically exploits a flaw in Microsoft Internet Explorer. In short, by combining the non-printing character %01 with the `@' trick, a link can be obfuscated so that a bogus link address, instead of the real one, is displayed in both the address bar and the status bar of the browser. You can visit here to test your browser (specifically, IE version 6.0.2800.1106 on W2k is vulnerable; interestingly, IE on Mac is not vulnerable; Camino, Firefox and Mozilla are not vulnerable either). We don't need an additional pattern for this trick since we already filter out URLs containing `@'.

Case 7: Using tiny font to hide gibberish

Paul Maisano brought this to my attention: this case is similar to Case 3: they both try to hide the gibberish text used for confusing statistical filters. Look at the snapshot of this junk mail:

Nothing special right? Look closer at the "dashed" line beneath the texts - that's actually another line of text, with 1 pixel font size! Here is the part of the HTML code:

The ultimate solution, however, is again to match against the final rendering of a message, as already illustrated in Case 1. This requires not only the text aspect of the rendering, but also its font style and even color aspects (Case 3 can be generalized to an absurd complexity using CSS).
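Short of full rendering, a cheap source-level check can still catch the simplest form of this trick. The Python sketch below assumes the tiny font is set through a CSS font-size declaration of 1 pixel (or 1 point) or less; other ways of shrinking text would slip through. TINY_FONT and has_tiny_font are illustrative names.

```python
import re

# Hypothetical pattern: a CSS font-size whose value starts with 0 or 1,
# optionally followed by a fraction, in px or pt (e.g. 1px, 0.5pt).
TINY_FONT = re.compile(r'font-size\s*:\s*[01](?:\.\d+)?\s*(?:px|pt)', re.IGNORECASE)

def has_tiny_font(html: str) -> bool:
    """Return True if the HTML source declares a font size of about 1 pixel."""
    return TINY_FONT.search(html) is not None

print(has_tiny_font('<span style="font-size: 1px">whiff kelp arisen</span>'))  # True
print(has_tiny_font('<span style="font-size:12px">hello</span>'))              # False
```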

Case 8: Web tracking tricks

This practice has also existed for a long time now. If you don't expect people sending you mails will need to track your clicks, use this pattern to filter the tracking URLs out:

Admittedly this pattern works only when the variable name used to pass back tracking info is either id or ref. A more drastic pattern is:
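As a sketch only (these are not necessarily the patterns JunkMatcher ships with), the two levels of strictness could look like this in Python. The first flags URLs carrying an id= or ref= parameter; the second assumes that "more drastic" means flagging any URL with a query string at all. Both pattern names are invented for the example.

```python
import re

# Hypothetical pattern 1: a URL with an id= or ref= query parameter.
TRACKING_ID_REF = re.compile(r'https?://[^\s"\'<>]*[?&](?:id|ref)=', re.IGNORECASE)

# Hypothetical pattern 2 (assumed meaning of "more drastic"): any URL that
# carries a query string, whatever the parameter names are.
ANY_QUERY = re.compile(r'https?://[^\s"\'<>]*\?', re.IGNORECASE)

link = "http://www.spammer.biz/offer?x=1&ref=abc"
print(bool(TRACKING_ID_REF.search(link)))  # True
print(bool(ANY_QUERY.search(link)))        # True
```

The second pattern will also catch plenty of legitimate links, so it only makes sense if you really do not expect query strings in your mail.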

This practice has a name: "web bug". Since we already filter all external image links, this is not a problem.

This has indeed been addressed since earlier versions of JunkMatcher, using this pattern:

Case 9: Vacuous tags

This is yet another trick to stuff garbage within a word so that visually you can't see the garbage, but a text scanner looking for the word may have difficulty finding it. For example, to break up a keyword that every piece of anti-spam software loves to find, "viagra", it can be coded like this:

Since there's no text surrounded by the tags, this renders nothing in your browser/mail client, therefore you can't even click on it. So "why" you ask? Because some anti-spam tools collect URLs for analysis/filtering (JunkMatcher version 1.06 and later features a "Has bad site(s)" email property), and spammers want to stuff you with these "diverting" URLs either to blow up the size of your site collections, or simply to make lots of noise.

Another similar trick that's used more often is stuffing in entirely bogus HTML tags within words to achieve the same effect:

There is no such tag "<sdlk3sd,>", but browsers will happily skip these mutants and render the word correctly.

In JunkMatcher both of these tricks are defeated by property tests: emails with vacuous tags can be matched using property "Has at least n vacuous tags" (since version 1.14), and emails with bogus tags can be identified using property "Has at least n bad tags". Also, both of these types of tags are removed before matching begins - so you get a nice, sanitized HTML message to match against, and no diverting URLs will be collected.

Note these tricks won't have any effect when matching against the final rendering of messages. So in JunkMatcher you can just add your normal patterns in the Rendering section. But detecting the presence of these tricks at the code level can be easier to do, and can speed up the matching process.
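To make the code-level approach concrete, here is a rough Python sketch that counts and strips both kinds of tags. It is not JunkMatcher's implementation: the tag whitelist is a small illustrative subset, and sanitize, KNOWN_TAGS, VACUOUS and ANY_TAG are names made up for the example.

```python
import re

# Small illustrative whitelist; a real filter would need the full HTML tag set.
KNOWN_TAGS = {"a", "b", "i", "u", "p", "br", "font", "img", "table", "tr", "td",
              "html", "head", "body", "div", "span", "center"}

# An element that opens and closes immediately, with no text in between.
VACUOUS = re.compile(r'<([a-z][a-z0-9]*)\b[^>]*>\s*</\1\s*>', re.IGNORECASE)
# Any tag; group 1 captures whatever sits where the tag name should be.
ANY_TAG = re.compile(r'</?([^\s>/]+)[^>]*>')

def sanitize(html):
    """Strip vacuous and bogus tags; return (clean_html, n_vacuous, n_bogus)."""
    html, n_vacuous = VACUOUS.subn("", html)
    n_bogus = 0

    def drop_if_bogus(match):
        nonlocal n_bogus
        if match.group(1).lower() in KNOWN_TAGS:
            return match.group(0)   # a real tag: keep it untouched
        n_bogus += 1
        return ""                   # an unknown tag: remove it

    clean = ANY_TAG.sub(drop_if_bogus, html)
    return clean, n_vacuous, n_bogus

print(sanitize('vi<a href="http://www.spammer.biz"></a>ag<sdlk3sd,>ra'))
# ('viagra', 1, 1)
```

The two counts correspond loosely to the "Has at least n vacuous tags" and "Has at least n bad tags" properties, and the cleaned text is what ordinary patterns would then be matched against.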

Case 10: Hidden URLs

Compared to the actual text used in junk mails, the websites (URLs) mentioned in them have far more distinguishing power in detecting whether a message is spam or not. That's why in JunkMatcher version 1.06+ I have added a function to automatically collect "bad sites" for filtering more spam.

Collecting URLs must have done some damage to spammers, because they have started to stuff in garbage URLs just to overwhelm the collection process. Here is an example:

The "http://kingdom.com" is total garbage, since "hrefloadingshref" is not a valid attribute of the HTML tag "a" (anchor). Obviously the goal here is to hide a URL such that it won't show up in your mail client (to avoid confusing people), but can still be collected by anti-spam software.

Generalizing this trick a bit, a URL can be hidden in any possible HTML tag with a garbage attribute.

To defeat this, in JunkMatcher version 1.15 and later a new property "Has >= n hidden URLs" has been added. Also these hidden URLs will not be collected when a message is filtered.
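One way such a check could work, sketched in Python with the standard html.parser module: walk every start tag and report any http(s) URL that sits in an attribute not known to carry links. The attribute whitelist is a small illustrative subset, and HiddenURLFinder and hidden_urls are hypothetical names, not JunkMatcher's own code.

```python
from html.parser import HTMLParser

# Illustrative subset of attributes that legitimately carry a URL.
URL_ATTRS = {"href", "src", "action", "background"}

class HiddenURLFinder(HTMLParser):
    """Collect URLs stored in attributes that no browser would follow."""

    def __init__(self):
        super().__init__()
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names before passing them here.
        for name, value in attrs:
            is_url = bool(value) and value.lower().startswith(("http://", "https://"))
            if is_url and name not in URL_ATTRS:
                self.hidden.append(value)

def hidden_urls(html):
    """Return the list of hidden URLs found in the HTML source."""
    finder = HiddenURLFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden

print(hidden_urls('<a hrefloadingshref="http://kingdom.com">click</a>'))
# ['http://kingdom.com']
```

The length of the returned list is what a "Has >= n hidden URLs" style test would compare against n, and the same list tells the site collector which URLs to ignore.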
