Pathological Study of Junk Mails
Ever heard of programming contests where people compete to write the most obfuscated code? Some programming languages make it far easier than others to write gibberish-looking code (see here and here), and I'm afraid HTML is one of them (although arguably HTML is not a programming language in the traditional sense). That's right, my friends: the people behind junk mail are playing exactly this kind of game to evade the ever more powerful and robust statistical junk filters.
This page is devoted to the study of obfuscated HTML, in particular its use in junk email. My hope is that by spreading knowledge of the techniques spammers use, we can better defend ourselves. You are welcome to send me new cases and your own analysis or, even better, a solution that blocks them through decoding or pattern matching.
(Almost all of the tricks documented here have been addressed in the implementation of JunkMatcher)
This is a perfect example showing how obfuscated HTML can be. First, here is the message you will see in any modern HTML-capable email client:
And you can click here to see the source
code (the links are edited out, and it's reformatted for clarity and
saved in text file format). Doesn't that look innocent? Now, look closer.
There are multiple tables in this HTML message, and the cells are aligned
such that the text is split up and placed into separate cells; but when
you put them together, the real message shows up. It is clearer if you change
Solutions: This presents a difficult problem for both
pattern-based and statistically trained filters, since spammers can split
up the message any way they want. A hackish way is to filter out any message
You know your friends and colleagues rarely send you emails containing
image links pointing to some random external sites (something like
But then again we don't want to filter out the messages with "attached" images,
because sometimes people do send HTML messages with internal image
<IMG SRC="p1_01.gif" border=0>
Just by looking at the
Solutions: This one is easy - filter out all emails
Look at the screenshot of a junk message below - nothing special, right? (Except that it is a piece of spam.)
Now if you select all of the text so that it is highlighted, you'll see there is actually hidden text - you can't see it without highlighting the text because the font color is white:
One burning question: why did they go to so much effort to hide some seemingly off-topic text? In this case the hidden text reads like this: "whiff kelp arisen sumptuous mardi bicameral coxcomb ashland nab transferral poisson programmer cretin deciduous colatitude annihilate...". I can think of at least two reasons: (i) to feed garbage to statistically trained junk filters, and (ii) to hide tracking information, so that by looking at a message they know who received it.
Solutions: Who in their right mind would send you white text on a white background? So this one is easy - filter out all messages with white text! The following is the pattern written as a regular expression:
The RGB code for white color is
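As a rough illustration of the white-text check described above, here is a minimal Python sketch. It is not the pattern JunkMatcher uses; the name `has_white_text` and the regular expression are assumptions made for this example, and it only inspects the `color` attribute of `<font>` tags (it does not check the background color or CSS).

```python
import re

# Hypothetical pattern (not JunkMatcher's own): a <font> tag whose color
# attribute is white, written as "white", "#ffffff"/"ffffff" or "#fff".
# Quotes around the value are optional and case is ignored.
WHITE_FONT = re.compile(
    r"<font\b[^>]*\bcolor\s*=\s*[\"']?\s*(?:white|#?f{6}|#f{3})\b",
    re.IGNORECASE,
)


def has_white_text(html: str) -> bool:
    """Return True if the HTML contains a <font> tag with a white color."""
    return WHITE_FONT.search(html) is not None
```

The `\b` in front of `color` keeps `bgcolor="white"` from matching, and the trailing `\b` keeps colors such as `#ffff00` or `whitesmoke` from being mistaken for white.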
OK, this one is obvious: this type of HTML-based message usually comes
empty, but in the header section there is this sinister
<meta http-equiv="refresh" content="0;URL=http://I.am.a.spammer">
If you open this mail in an HTML-enabled mail client, it will immediately redirect you to another website. Think your friend would do this? If not, just use this pattern in JunkMatcher:
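For readers who want to experiment outside JunkMatcher, the same idea can be sketched in Python. This is only one possible way to express the check and is not necessarily the pattern JunkMatcher uses; the name `has_meta_refresh` is hypothetical.

```python
import re

# Hypothetical pattern: any <meta> tag whose http-equiv attribute is
# "refresh" (quotes optional, case ignored).
META_REFRESH = re.compile(
    r"<meta\b[^>]*\bhttp-equiv\s*=\s*[\"']?\s*refresh\b",
    re.IGNORECASE,
)


def has_meta_refresh(html: str) -> bool:
    """Return True if the HTML contains a meta refresh tag."""
    return META_REFRESH.search(html) is not None
```

Applied to the tag shown above, `has_meta_refresh` returns `True`, while an ordinary `<meta http-equiv="Content-Type" ...>` tag is left alone.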
This must have been going on for a while, but I only noticed it recently in the JunkMatcher log. The trick is to hide the real website link behind a Google front, like this:
The real site they want you to visit is
Paul Maisano wrote to me about a new sighting: using Yahoo.com to redirect:
Use this pattern to filter this out:
I've also found more Yahoo sites:
Here is another trick using MSN to redirect site traffic. An example:
In this case the real site is
Actually the pattern above can filter both Yahoo and MSN tricks. Updated
20040425: another MSN site used for redirect is
JunkMatcher version 1.06+ introduced a new email property: "Has bad site(s)", which "incorporated" solutions to all of the tricks documented here.
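The Google, Yahoo and MSN tricks share one trait: the real destination is embedded in the part of the link that follows the well-known host. A minimal Python sketch of a generic check for that trait follows. It is an illustration only, not how the "Has bad site(s)" property is implemented; the function name `embedded_target` and the `example.com`/`example.net` hosts in the usage note are made up for the example.

```python
import re
from urllib.parse import unquote, urlsplit

# A second URL scheme appearing after the host part is the tell-tale sign.
EMBEDDED_URL = re.compile(r"https?://", re.IGNORECASE)


def embedded_target(url: str):
    """Return the URL embedded in the path or query of `url`, or None."""
    parts = urlsplit(url)
    tail = parts.path + ("?" + parts.query if parts.query else "")
    tail = unquote(tail)  # undo %3A%2F%2F style encoding of "://"
    match = EMBEDDED_URL.search(tail)
    return tail[match.start():] if match else None
```

For instance, `embedded_target("http://redirector.example.com/url?q=http://spam.example.net/")` returns `"http://spam.example.net/"`, and an ordinary link with no embedded URL returns `None`. A check like this will also flag legitimate redirect links, so it is better suited to extracting the real site for further testing than to rejecting a message outright.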
Again these tricks are old, but still worth mentioning. The first trick is to use `@' in a URL:
Everything up to the `@' does not contribute to the final destination
you'll link to; the real site, in this case, is
According to this article (20040128), Microsoft plans to drop support for usernames in URLs in the upcoming Internet Explorer update. If the plan goes through, it remains to be seen whether this will eliminate the trick entirely.
If you look closer at the pattern above, this will also filter out messages containing links like this:
This is a URL containing your email address
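To see where a link containing `@' really leads, the URL can be split with Python's standard `urllib.parse` module. The sketch below is an illustration, not the JunkMatcher pattern; `real_host` is a hypothetical helper name and the hosts in the usage note are made up.

```python
from urllib.parse import urlsplit


def real_host(url: str):
    """Return (has_userinfo, host): whether the URL carries a user part
    before an `@', and the host the browser will actually contact."""
    parts = urlsplit(url)
    return ("@" in parts.netloc, parts.hostname)
```

For example, `real_host("http://www.trusted.example.com@spam.example.net/offer")` returns `(True, "spam.example.net")`: everything before the `@' is treated as a user name, not as the destination.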
The second trick, which is rarely seen these days, is to use really
big decimal numbers to encode an IP address. For example, the URL
obfuscated IP =
( (first octet * 2^24) + (second octet * 2^16) + (third octet * 2^8) + (fourth octet) )
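The formula above is easy to try out. Here is a small Python sketch that converts a dotted IP address to its single-number form and back; the function names `ip_to_decimal` and `decimal_to_ip` are chosen for this example only.

```python
def ip_to_decimal(dotted: str) -> int:
    """Encode a dotted-quad IP address as one decimal number."""
    first, second, third, fourth = (int(part) for part in dotted.split("."))
    # Same as first * 2^24 + second * 2^16 + third * 2^8 + fourth.
    return (first << 24) + (second << 16) + (third << 8) + fourth


def decimal_to_ip(number: int) -> str:
    """Decode the decimal form back into a dotted-quad IP address."""
    return ".".join(str((number >> shift) & 0xFF) for shift in (24, 16, 8, 0))
```

For example, `ip_to_decimal("192.168.0.1")` gives `3232235521`, and `decimal_to_ip(3232235521)` gives back `"192.168.0.1"`.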
How do we filter this? The minimal obfuscated and meaningful IP (from "
The third trick specifically exploits a flaw in Microsoft
Internet Explorer. In short, using a non-printing character
(If you know any other URL obfuscation tricks, please drop me a line)
Paul Maisano brought this to my attention: this case is similar to Case 3: they both try to hide the gibberish text used for confusing statistical filters. Look at the snapshot of this junk mail:
Nothing special, right? Look closer at the "dashed" line beneath the text - that's actually another line of text, with a 1-pixel font size! Here is part of the HTML code:
A solution is simply to look for this type of CSS code:
The ultimate solution, however, is again to match against the final rendering of a message, as already illustrated in Case 1. This requires information not only about the rendered text, but also about the rendered font style and even color (Case 3 can be generalized to an absurd complexity using CSS).
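For the simpler code-level check, a minimal Python sketch might look like the following. The regular expression and the name `has_tiny_font` are assumptions made for this example, not the JunkMatcher pattern; it flags any CSS `font-size` below 2px or 2pt.

```python
import re

# Hypothetical pattern: font-size values starting with 0 or 1 (optionally
# with a fraction) in px or pt, e.g. "font-size: 1px" or "font-size:0pt".
TINY_FONT = re.compile(
    r"font-size\s*:\s*[01](?:\.\d+)?\s*(?:px|pt)\b",
    re.IGNORECASE,
)


def has_tiny_font(html: str) -> bool:
    """Return True if the HTML sets a font size under 2px or 2pt."""
    return TINY_FONT.search(html) is not None
```

Sizes such as `10px` or `12pt` do not match, because the single leading digit must be followed directly by the unit or a decimal fraction.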
From Paul Maisano: Some spammers allow restricted redirection in order to track which particular promotion is causing the hit on their site. E.g. today I found that you can go to any site you like using something like:
You need a valid "promotion" id as the "cid" parameter or the page complains with an SQL error. We should publish URLs like this and flood them with useless data. That'll teach them. :-)
This practice has also existed for a long time now. If you don't expect the people who send you mail to need to track your clicks, use this pattern to filter the tracking URLs out:
Admittedly this pattern works only when the variable name used to pass
back tracking info is either
Paul went on to say this: Spammers are particularly nasty when they include tracking information inside
This practice has a name: "web bug". Since we already filter all external image links, this is not a problem. I also look for any "
This has indeed been addressed since earlier versions of JunkMatcher, using this pattern:
In fact we don't allow form method either for obvious reasons:
This is yet another trick: stuffing garbage inside a word so that you can't see the garbage visually, but a text scanner looking for the word may have difficulty finding it. For example, to break up a keyword that every piece of anti-spam software loves to find, "viagra", it can be coded like this:
Apparently vacuous tags are also used to hide URLs, like this:
Since there's no text surrounded by the tags, this renders nothing in your browser or mail client, so you can't even click on it. So "why?", you ask. Because some anti-spam tools collect URLs for analysis and filtering (JunkMatcher version 1.06 and later features a "Has bad site(s)" email property), and spammers want to stuff you with these "diverting" URLs either to blow up the size of your site collections, or simply to make lots of noise.
Another similar trick that's used more often is stuffing in entirely bogus HTML tags within words to achieve the same effect:
There is no such tag "
In JunkMatcher both of these tricks are defeated by property tests: emails with vacuous tags can be matched using property "Has at least n vacuous tags" (since version 1.14), and emails with bogus tags can be identified using property "Has at least n bad tags". Also, both of these types of tags are removed before matching begins - so you get a nice, sanitized HTML message to match against, and no diverting URLs will be collected.
Note these tricks won't have any effect when matching against the final rendering of messages. So in JunkMatcher you can just add your normal patterns in the Rendering section. But detecting the presence of these tricks at the code level can be easier to do, and can speed up the matching process.
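To make the code-level detection concrete, here is a minimal Python sketch that removes vacuous and bogus tags and counts them. It only illustrates the idea and is not JunkMatcher's implementation; the `KNOWN_TAGS` whitelist is deliberately tiny and, like the function name `sanitize`, is an assumption made for this example.

```python
import re

# Assumption: a small whitelist of real HTML tag names, enough for a demo.
KNOWN_TAGS = {
    "a", "b", "i", "u", "p", "br", "font", "span", "div", "img",
    "table", "tr", "td", "html", "head", "body", "meta",
}

# An opening tag immediately followed by its own closing tag, e.g. <b></b>.
VACUOUS = re.compile(r"<([a-z][a-z0-9]*)\b[^>]*>\s*</\1\s*>", re.IGNORECASE)
# Any opening or closing tag; group 1 is the tag name.
ANY_TAG = re.compile(r"</?([a-z][a-z0-9]*)\b[^>]*>", re.IGNORECASE)


def sanitize(html: str):
    """Return (clean_html, vacuous_count, bogus_count)."""
    vacuous = 0
    while True:  # repeat so nested cases like <b><i></i></b> are caught
        html, removed = VACUOUS.subn("", html)
        if removed == 0:
            break
        vacuous += removed

    bogus = 0

    def drop_bogus(match):
        nonlocal bogus
        if match.group(1).lower() in KNOWN_TAGS:
            return match.group(0)  # real tag: keep it untouched
        bogus += 1
        return ""  # unknown tag: strip it

    return ANY_TAG.sub(drop_bogus, html), vacuous, bogus
```

With this sketch, `sanitize("vi<b></b>ag<i></i>ra")` yields `("viagra", 2, 0)`, so an ordinary keyword pattern can match the cleaned text again.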
Compared to the actual text used in junk mails, the websites (URLs) mentioned in them have far more distinguishing power in detecting whether a message is spam or not. That's why in JunkMatcher version 1.06+ I have added the function to automatically collect "bad sites" for filtering more spam.
Collecting the URLs must have done some damage to spammers, because they have started to stuff in garbage URLs just to overwhelm the collection process. Here is an example:
<a hrefloadingshref=http://kingdom.com href= "http://fosraw.biz/OEx/?affiliate_id=x&campaign_id=x">Windows XP Professional 2002 </a>
Generalizing this trick a bit, a URL can be hidden in any possible HTML tag with a garbage attribute.
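One way to spot such hidden URLs is to walk over every tag attribute and flag URL-looking values that sit in attributes that do not normally carry a link. The Python sketch below uses the standard `html.parser` module; the `URL_ATTRS` set and the names `HiddenURLFinder` and `hidden_urls` are assumptions made for this illustration, not JunkMatcher's implementation.

```python
import re
from html.parser import HTMLParser

# Assumption: attributes that legitimately carry URLs. Anything else that
# holds a URL-looking value is treated as a hiding place.
URL_ATTRS = {"href", "src", "action", "background"}
URL_LIKE = re.compile(r"^\s*(?:https?|ftp)://", re.IGNORECASE)


class HiddenURLFinder(HTMLParser):
    """Collect URL-looking values found in non-URL (garbage) attributes."""

    def __init__(self):
        super().__init__()
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and name not in URL_ATTRS and URL_LIKE.match(value):
                self.hidden.append(value)


def hidden_urls(html: str):
    """Return the list of URLs hidden in garbage attributes."""
    finder = HiddenURLFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden
```

Run on the example above, `hidden_urls` reports only `http://kingdom.com`, the value of the garbage attribute `hrefloadingshref`, and leaves the URL in the real `href` alone.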
To defeat this, in JunkMatcher version 1.15 and later a new property "Has >= n hidden URLs" has been added. Also these hidden URLs will not be collected when a message is filtered.