Pathological Study of Junk Mails

Ever heard of programming contests where people compete to write the most obfuscated code? Some programming languages make it far easier than others to write gibberish-looking code (see here and here), and I'm afraid HTML is one of them (though arguably HTML is not a programming language in the traditional sense). That's right, my friends: the people behind junk mail are playing exactly this kind of game to evade ever more powerful and robust statistical junk filters.

This page is devoted to the study of obfuscated HTML, in particular its use in junk email. My hope is that by spreading knowledge of the techniques spammers use, we can better defend ourselves. You are welcome to send me new cases and your own analysis or, even better, a solution that blocks them through decoding or pattern matching.

(Almost all of the tricks documented here have been addressed in the implementation of JunkMatcher)

  • Case 1: Hiding messages using tables
  • Case 2: Using <base ...> tag to hide external links
  • Case 3: Using white font color to hide texts (updated 20040202)
  • Case 4: Using <meta http-equiv="refresh"...> trick
  • Case 5: Abusing Google/Yahoo/MSN to redirect (updated 20040425)
  • Case 6: Obfuscating a URL (updated 20040128)
  • Case 7: Using tiny font to hide gibberish (updated 20040304)
  • Case 8: Web tracking tricks
  • Case 9: Vacuous tags (updated 20040425)
  • Case 10: Hidden URLs

Case 1: Hiding messages using tables

This is a perfect example showing how obfuscated HTML can be. First, here is the message you will see in any modern HTML-capable email client:

[Screenshot: junk mail shown in a mail client]

And you can click here to see the source code (the links are edited out, and it has been reformatted for clarity and saved as a text file). Doesn't that look innocent? Now look closer. There are multiple tables in this HTML message, and the text is split across separate cells that are aligned so that the real message shows up only when they are put together. This is clearer if you change every border=0 to border=1 in the source, so that the table grid lines are shown. Now the message looks like this:

[Screenshot: junk mail shown in a mail client]

Solutions: This presents a difficult problem for both pattern-based and statistically trained filters, since spammers can split up the message any way they want. A hackish approach is to filter out any message that uses a <table...> tag - effective, but it will probably filter out some benign emails as well (then again, how often do your friends or colleagues use tables in their HTML-based messages?). A more accurate solution is to match against the final rendering of a message: for example, JunkMatcher provides a text-based rendering of each HTML-based email so you can build patterns against it.

Case 2: Using <base ...> tag to hide external links

You know your friends and colleagues rarely send you emails containing image links pointing to some random external sites (something like <img src="http:...">), so using JunkMatcher you can specify this regular expression to filter out all emails containing them:

(?i)<\s*img[^>]+(?:low)?src\s*=\s*(?:'|")?\s*http:

But then again we don't want to filter out the messages with "attached" images, because sometimes people do send HTML messages with internal image links (the <img...> links without a following http:...). Therein lies an opportunity waiting to be exploited. Look at the following edited junk mail:

<HTML>
<body>
<base href="http://some.address.planted.by/spammers/">
...
<IMG SRC="p1_01.gif" border=0>
...
</body>
</HTML>

Just by looking at the <img...> line itself the message seems fine, since this is an internal image link. BUT, when it's combined with a <base...> statement, it's no longer internal! This actually applies not only to image links, but to any kind of links!

Solutions: This one is easy - filter out all emails containing a <base...> tag! Who with a benign motive would send you a carefully disguised message that uses a <base...> tag to trick you into thinking that an external link is an internal one? A pattern achieving this is shown below:

(?i)<\s*base[^>]+href
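To see why the relative link stops being internal, here is a minimal Python sketch. It uses the standard library's urljoin to resolve the image name against the base address, roughly the way an HTML-capable mail client would. The helper name resolve_image is made up for illustration, and real clients may differ in detail.

```python
import re
from urllib.parse import urljoin

# The <base> pattern from this section.
BASE_TAG = re.compile(r'(?i)<\s*base[^>]+href')

html = '''<HTML>
<body>
<base href="http://some.address.planted.by/spammers/">
<IMG SRC="p1_01.gif" border=0>
</body>
</HTML>'''

def resolve_image(base_href, src):
    """Return the address a client would fetch for src under base_href."""
    return urljoin(base_href, src)

resolved = resolve_image("http://some.address.planted.by/spammers/", "p1_01.gif")
print(resolved)                      # the "internal" image is fetched from the spammer's site
print(bool(BASE_TAG.search(html)))   # the pattern flags the message
```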

Case 3: Using white font color to hide texts

Look at the screenshot of a junk message below - nothing special right? (except that it is a piece of spam email)

[Screenshot: a junk message]

Now if you select all of the text so that it is highlighted, you'll see there is actually hidden text - you can't see it without highlighting the text because the font color is white:

[Screenshot: a junk message with hidden texts revealed]

One burning question: why did they go to so much effort to hide some seemingly off-topic text? In this case the hidden text reads like this: "whiff kelp arisen sumptuous mardi bicameral coxcomb ashland nab transferral poisson programmer cretin deciduous colatitude annihilate...". I can think of at least two reasons: (i) to feed garbage to statistically trained junk filters, and (ii) to hide tracking information, so that by looking at the message they know who received it.

Solutions: Who in their right mind would send you white text on a white background? So this one is easy - filter out all messages with white text! The following pattern, written as a regular expression, does this:

(?i)(?#<_)font[^>]+(?:color="?(?:#?(?:FFFF|F.F.F.)|white))

The RGB code for white is #FFFFFF, but spammers sometimes use a "near-white" color; that's why I match on only the first 4 hexadecimal digits being F, or on an F in each odd-numbered position. Also note that this assumes the HTML message has a white background, so that white text is meant to be hidden.
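A quick way to get a feel for what this pattern does and does not catch is to run it over a few font tags. The sketch below uses Python's re module with the pattern copied unchanged; the sample tags are made up for illustration.

```python
import re

# The white-font pattern from this section, unchanged.
WHITE_FONT = re.compile(r'(?i)(?#<_)font[^>]+(?:color="?(?:#?(?:FFFF|F.F.F.)|white))')

# Hypothetical sample tags mapped to whether the pattern should match them.
samples = {
    '<font color="#FFFFFF">': True,     # pure white
    '<font color="#FEFDFC">': True,     # near-white: F in each odd-numbered position
    '<font size=2 color=white>': True,  # named color
    '<font color="#000000">': False,    # black
    '<font color="#FFFF00">': True,     # yellow: the first four digits are FFFF
}

for tag, expected in samples.items():
    assert bool(WHITE_FONT.search(tag)) == expected
```

As the last sample shows, checking only the first four digits also catches yellow (#FFFF00), so the pattern is best treated as a heuristic.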

Case 4: Using <meta http-equiv="refresh"...> trick

OK, this one is obvious - HTML-based messages of this type usually arrive empty, but in the header section there is a sinister meta tag like this:

<meta http-equiv="refresh" content="0;URL=http://I.am.a.spammer">

If you open this mail in an HTML-enabled mail client, this will immediately direct you to another website. Think your friend would do this? If not, just use this pattern in JunkMatcher:

(?i)<\s*meta\s+http-equiv\s*="?refresh

Case 5: Abusing Google/Yahoo/MSN to redirect

This must have been going on for a while, but I only picked it up recently from the JunkMatcher log. The trick is to disguise a real website link behind a Google front, like this:

http://www.google.com/url?q=http://www.spammer.biz

The real site they want you to visit is "http://www.spammer.biz". There could be fluff in between the `?' and "q=" but that doesn't matter. Also "www.google.com" can be just "google.com". To filter this, use the following pattern in JunkMatcher:

(?i)http://(?:www\.)?google\.com/url\?

Paul Maisano wrote to me about a new sighting: using Yahoo.com to redirect:

"http://rd.yahoo.com/some/junk/*http://my.real.site/hahaha.html" will take you to "http://my.real.site/hahaha.html" via redirection from yahoo.com.

Use this pattern to filter this out:

(?i)http://rd\.yahoo\.com/\S*\*http

I've also found more Yahoo sites: drs.yahoo.com, eur.rd.yahoo.com and srd.yahoo.com can be used to achieve the same effect. In fact, the "eur" part of eur.rd.yahoo.com can be replaced by the names of some other geographic regions, such as "us". Taking the common denominator, this pattern should be sufficient:

(?i)http://\S+\.yahoo\.com/\S*\*http

Here is another trick using MSN to redirect site traffic. An example:

http://g.msn.com/SOME_FLUFF?http://spammer.biz

In this case the real site is spammer.biz, and the SOME_FLUFF part cannot be some random string, which implies it might contain necessary tracking info. Filtering this is also simple:

(?i)http://\S+[*?]http

Actually the pattern above can filter both Yahoo and MSN tricks. Updated 20040425: another MSN site used for redirect is ads.msn.com - the above pattern should work also.
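The redirect patterns in this section can be tried out together in a few lines of Python. This is only a sketch: the function names is_redirect and real_target are invented here, and real_target simply guesses that the last embedded http:// address is the destination.

```python
import re

# Patterns quoted in this section.
GOOGLE = re.compile(r'(?i)http://(?:www\.)?google\.com/url\?')
YAHOO_MSN = re.compile(r'(?i)http://\S+[*?]http')

def is_redirect(url):
    """True if url looks like a Google, Yahoo or MSN redirect."""
    return bool(GOOGLE.search(url) or YAHOO_MSN.search(url))

def real_target(url):
    """Best-effort guess at the destination: the last embedded http:// address."""
    index = url.rfind("http://")
    return url[index:] if index > 0 else None

for url in (
    "http://www.google.com/url?q=http://www.spammer.biz",
    "http://rd.yahoo.com/some/junk/*http://my.real.site/hahaha.html",
    "http://g.msn.com/SOME_FLUFF?http://spammer.biz",
):
    print(is_redirect(url), real_target(url))
```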

JunkMatcher version 1.06+ introduced a new email property: "Has bad site(s)", which "incorporated" solutions to all of the tricks documented here.

Case 6: Obfuscating a URL

Again these tricks are old, but still worth mentioning. The first trick is to use `@' in a URL:

http://www.this-is-bogus.com@www.spammer.biz

Everything up to the `@' does not contribute to the final destination you'll link to; the real site in this case, is "http://www.spammer.biz". Use this pattern in JunkMatcher to filter this out:

http://[^@> ]*?@

According to this article (20040128), Microsoft plans to drop support for usernames in URLs in an upcoming Internet Explorer update. Even if the plan goes through, it remains to be seen whether this will entirely eliminate the use of this trick.

If you look closer at the pattern above, you'll see that it also filters out messages containing links like this:

http://www.somesite.com/something?email=me@mycompany.com

This is a URL containing your email address "me@mycompany.com". Supposedly if you follow the link your email address will be reported back. I consider this a bad practice in email correspondence, but you might want to allow this.
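The difference between the two kinds of `@' can be checked with Python's standard URL parser, which separates the username part of an address from the host actually contacted. This is just an illustrative sketch; both URLs are the examples from the text.

```python
import re
from urllib.parse import urlsplit

# The `@' pattern from this section.
AT_TRICK = re.compile(r'http://[^@> ]*?@')

disguised = "http://www.this-is-bogus.com@www.spammer.biz"
with_email = "http://www.somesite.com/something?email=me@mycompany.com"

# In the disguised URL, the text before `@' is only a username.
print(urlsplit(disguised).username)   # www.this-is-bogus.com
print(urlsplit(disguised).hostname)   # www.spammer.biz

# In the second URL, `@' sits in the query string and the host is unaffected.
print(urlsplit(with_email).hostname)  # www.somesite.com

# The pattern does not distinguish the two cases: both match.
print(bool(AT_TRICK.search(disguised)), bool(AT_TRICK.search(with_email)))
```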

The second trick, which is rarely seen these days, is to use a really big decimal number to encode an IP address. For example, the URL "http://2147666867/" is really "http://128.2.203.179", which is "http://www.cs.cmu.edu". The encoding is done using this formula:

obfuscated IP =
  ( (first octet * 2^24) + (second octet * 2^16) + (third octet * 2^8) + (fourth octet) )

How do we filter this? According to the formula, the smallest meaningful obfuscated IP (from "1.0.0.0") is 16777216, which has 8 digits, and any "bigger" IP is converted to a number with at least as many digits. The pattern is:

http://\d{8,}
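The formula is easy to verify in code. The following Python sketch applies it directly and uses the standard ipaddress module for the reverse direction; the function names obfuscate and deobfuscate are chosen here for illustration.

```python
import ipaddress
import re

# The pattern from this section.
DECIMAL_HOST = re.compile(r'http://\d{8,}')

def obfuscate(dotted):
    """Apply the formula above to a dotted-quad address."""
    a, b, c, d = (int(part) for part in dotted.split("."))
    return a * 2**24 + b * 2**16 + c * 2**8 + d

def deobfuscate(number):
    """Turn the big decimal number back into a dotted quad."""
    return str(ipaddress.IPv4Address(number))

print(obfuscate("128.2.203.179"))   # 2147666867
print(deobfuscate(2147666867))      # 128.2.203.179
print(obfuscate("1.0.0.0"))         # 16777216, the 8-digit minimum
print(bool(DECIMAL_HOST.search("http://2147666867/")))
```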

The third trick specifically exploits a flaw in Microsoft Internet Explorer. In short, by combining the non-printing character %01 with the `@' trick, a link can be obfuscated so that a bogus address, instead of the real one, is displayed in both the address bar and the status bar of the browser. You can visit here to test your browser (specifically IE version 6.0.2800.1106 on W2k is vulnerable; interestingly IE on Mac is not vulnerable; Camino, Firefox, Mozilla are not vulnerable either). We don't need an additional pattern for this trick since we already filter out URLs containing `@'.

(If you know any other URL obfuscation tricks, please drop me a line)

Case 7: Using tiny font to hide gibberish

Paul Maisano brought this to my attention: this case is similar to Case 3: they both try to hide the gibberish text used for confusing statistical filters. Look at the snapshot of this junk mail:

[Screenshot: spam using tiny fonts]

Nothing special, right? Look closer at the "dashed" line beneath the text - that's actually another line of text, in a 1-pixel font! Here is the relevant part of the HTML code:

<font style=font-size:1px>

A solution is simply to look for this type of CSS code:

(?i)font-size:[^"]*?(?:(?<!\d)[0-5](?![0-9])(?:px)?|(?<!\d)[1-5]?\d%)
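An alternative to encoding numeric ranges in a regular expression is to capture the number and compare it in code. The Python sketch below does that. The function name has_tiny_font and the thresholds (5 pixels, 59 percent) are assumptions chosen to mirror the ranges in the pattern above, not JunkMatcher's implementation.

```python
import re

# Capture the number and the unit (if any) that follow "font-size:".
FONT_SIZE = re.compile(r'(?i)font-size\s*:\s*(\d+(?:\.\d+)?)\s*([a-z%]*)')

def has_tiny_font(html, max_px=5, max_percent=59):
    """True if any font-size is at most max_px pixels or max_percent percent."""
    for match in FONT_SIZE.finditer(html):
        size = float(match.group(1))
        unit = match.group(2).lower()
        if unit in ("", "px") and size <= max_px:
            return True
        if unit == "%" and size <= max_percent:
            return True
    return False

print(has_tiny_font('<font style=font-size:1px>'))     # True
print(has_tiny_font('<font style="font-size:12px">'))  # False
print(has_tiny_font('<p style="font-size:100%">'))     # False
```

Other units such as em or pt are ignored in this sketch, so it would need extending before being relied on.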

The ultimate solution, however, is again to match against the final rendering of a message, as already illustrated in Case 1. This requires not only the text of the rendering, but also its font styles and even its colors (Case 3 can be generalized to an absurd complexity using CSS).

Case 8: Web tracking tricks

From Paul Maisano:

Some spammers allow restricted redirection in order to track which particular promotion is causing the hit on their site. E.g. today I found that you can go to any site you like using something like:

http://xrntq.track.coolpicsandmore.com/_r.jpegg?cid=123&url=http://www.msn.com

You need a valid "promotion" id as the "cid" parameter or the page complains with an SQL error. We should publish URLs like this and flood them with useless data. That'll teach them. :-)

This practice has also existed for a long time. If you don't expect the people who send you mail to need to track your clicks, use this pattern to filter out tracking URLs:

(?i)http://\S*\?\S*(?:id|ref)=

Admittedly this pattern works only when the variable name used to pass back tracking info ends in id or ref (such as cid in the URL above). A more drastic pattern is:

(?i)http://\S*\?\S+=
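The difference between the two patterns shows up when they are run against a few URLs. In the Python sketch below, the first URL is the one from Paul's note, while the example.com URLs are made up for comparison.

```python
import re

# The two tracking patterns from this section.
TRACKING_NAMED = re.compile(r'(?i)http://\S*\?\S*(?:id|ref)=')
TRACKING_ANY = re.compile(r'(?i)http://\S*\?\S+=')

tracked = "http://xrntq.track.coolpicsandmore.com/_r.jpegg?cid=123&url=http://www.msn.com"
search = "http://example.com/search?q=spam"   # hypothetical: has a query, but no id or ref
plain = "http://example.com/index.html"       # hypothetical: no query string at all

for url in (tracked, search, plain):
    print(bool(TRACKING_NAMED.search(url)), bool(TRACKING_ANY.search(url)))
```

The drastic pattern flags every URL that carries a query parameter, so it also catches the ordinary search link.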

Paul went on to say this:

Spammers are particularly nasty when they include tracking information inside <IMAGE SRC="...">. Just viewing the email (using a html browser) will inform the spammers that their promotion has reached its target!

This practice has a name: "web bug". Since we already filter all external image links, this is not a problem.

I also look for any "<script>" inside the email -- that's always a good sign of spam; especially if it contains "window.open" or "Math.random".

This has indeed been addressed since earlier versions of JunkMatcher, using this pattern:

(?i)(?#<_)script[^>]+language[^>]+=

In fact we don't allow forms with a method or action attribute either, for obvious reasons:

(?i)(?#<_)form[^>]+(?:method|action)

Case 9: Vacuous tags

This is yet another trick: stuffing garbage inside a word so that you can't see the garbage visually, but a text scanner looking for the word may fail to find it. For example, to break up a keyword that every piece of anti-spam software loves to find, "viagra", it can be coded like this:

via<font></font>gra

Apparently vacuous tags are also used to hide URLs, like this:

<a href="http://www.innocent.org"></a>

Since there's no text surrounded by the tags, this renders nothing in your browser/mail client, therefore you can't even click on it. So "why" you ask? Because some anti-spam tools collect URLs for analysis/filtering (JunkMatcher version 1.06 and later features a "Has bad site(s)" email property), and spammers want to stuff you with these "diverting" URLs either to blow up the size of your site collections, or simply to make lots of noise.

Another similar trick that's used more often is stuffing in entirely bogus HTML tags within words to achieve the same effect:

via<sdlk3sd,>gra

There is no such tag "<sdlk3sd,>", but browsers will happily skip these mutants and render the word correctly.

In JunkMatcher both of these tricks are defeated by property tests: emails with vacuous tags can be matched using property "Has at least n vacuous tags" (since version 1.14), and emails with bogus tags can be identified using property "Has at least n bad tags". Also, both of these types of tags are removed before matching begins - so you get a nice, sanitized HTML message to match against, and no diverting URLs will be collected.

Note these tricks won't have any effect when matching against the final rendering of messages. So in JunkMatcher you can just add your normal patterns in the Rendering section. But detecting the presence of these tricks at the code level can be easier to do, and can speed up the matching process.
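As a rough illustration of both ideas, the Python sketch below strips anything that looks like a tag (a crude stand-in for rendering) and counts empty open/close pairs. The names strip_tags and count_vacuous are invented here, and this is not how JunkMatcher implements its properties.

```python
import re

TAG = re.compile(r'<[^>]*>')
# An opening tag immediately followed by its own closing tag.
VACUOUS = re.compile(r'(?i)<(\w+)[^>]*>\s*</\1\s*>')

def strip_tags(html):
    """Crude stand-in for rendering: delete anything that looks like a tag."""
    return TAG.sub('', html)

def count_vacuous(html):
    """Count tag pairs that enclose no text."""
    return len(VACUOUS.findall(html))

print(strip_tags('via<font></font>gra'))   # viagra
print(strip_tags('via<sdlk3sd,>gra'))      # viagra
print(count_vacuous('via<font></font>gra <a href="http://www.innocent.org"></a>'))  # 2
```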

Case 10: Hidden URLs

Compared to the actual text used in junk mail, the websites (URLs) mentioned in it have far more power to distinguish spam from legitimate messages. That's why in JunkMatcher version 1.06+ I added a function to automatically collect "bad sites" for filtering more spam.

Collecting URLs must have done some damage to spammers, because they have started stuffing in garbage URLs just to overwhelm the collection process. Here is an example:

<a hrefloadingshref=http://kingdom.com href= "http://fosraw.biz/OEx/?affiliate_id=x&campaign_id=x">Windows XP Professional 2002 </a>

The "http://kingdom.com" is total garbage, since "hrefloadingshref" is not a valid attribute of the HTML tag "a" (anchor). Obviously the goal here is to hide a URL so that it won't show up in your mail client (to avoid confusing people), but can still be collected by anti-spam software.

Generalizing this trick a bit, a URL can be hidden in any HTML tag with a garbage attribute.

To defeat this, in JunkMatcher version 1.15 and later a new property "Has >= n hidden URLs" has been added. Also these hidden URLs will not be collected when a message is filtered.
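One way such hidden URLs might be detected is to parse each anchor tag and flag URL-like values sitting in attributes the tag does not define. The Python sketch below does this with the standard html.parser module. The class name, the function name and the short attribute list are assumptions made for illustration, not JunkMatcher's actual implementation.

```python
from html.parser import HTMLParser

# Assumed, deliberately incomplete list of attributes allowed on <a>.
KNOWN_A_ATTRS = {"href", "name", "target", "title", "rel", "id", "class", "style"}

class HiddenUrlFinder(HTMLParser):
    """Collect URL-like values found in unknown attributes of <a> tags."""

    def __init__(self):
        super().__init__()
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name not in KNOWN_A_ATTRS and value and value.lower().startswith("http"):
                self.hidden.append(value)

def hidden_urls(html):
    finder = HiddenUrlFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden

sample = ('<a hrefloadingshref=http://kingdom.com href= '
          '"http://fosraw.biz/OEx/?affiliate_id=x&campaign_id=x">'
          'Windows XP Professional 2002 </a>')
print(hidden_urls(sample))
```

Only the garbage attribute's value is reported; the real href is left alone, so it can still be collected as a "bad site".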