Pathological Study of Junk Mails
Ever heard of programming contests where people compete to write
the most obfuscated code? Some programming languages make it far
easier than others to write gibberish-looking code (see here and here),
and I'm afraid HTML is one of them (although arguably HTML is not a programming
language in the traditional sense). That's right, my friends: the
people behind junk mail are playing exactly this kind of game to
evade ever more powerful and robust statistical junk filters.
This page is devoted to the study of obfuscated HTML, in particular
its use in junk emails. My hope is that by spreading knowledge
of the techniques spammers use, we can better defend ourselves. You
are welcome to send me new cases and your own analysis, or even better, a solution
that blocks them through decoding or pattern matching.
(Almost all of the tricks documented here have been addressed in the
implementation of JunkMatcher)
- Case 1: Hiding messages using tables
- Case 2: Using <base ...> tag to hide external links
- Case 3: Using white font color to hide texts (updated 20040202)
- Case 4: Using <meta http-equiv="refresh"...> trick
- Case 5: Abusing Google/Yahoo/MSN to redirect (updated 20040425)
- Case 6: Obfuscating a URL (updated 20040128)
- Case 7: Using tiny font to hide gibberish (updated 20040304)
- Case 8: Web tracking tricks
- Case 9: Vacuous tags (updated 20040425)
- Case 10: Hidden URLs
Case 1: Hiding messages using tables
This is a perfect example showing how obfuscated HTML can be. First,
here is the message you will see in any modern HTML-capable email client:
And you can click here to see the source
code (the links are edited out, and it's reformatted for clarity and
saved in text file format). Doesn't that look innocent? Now, look closer.
There are multiple tables in this HTML message, and the cells are aligned
such that the text is split and placed into separate cells; only when you
put them together does the real message show up. This is clearer if you change
all border=0 to border=1 in the source, so that
the grid lines of the tables are shown. Now the message looks like this:
Solutions: This presents a difficult problem to both
pattern-based and statistically trained filters, since spammers can split
up the message any way they want. A hackish approach is to filter out any message
using a <table...> tag. That is effective, but it will probably
filter out some benign emails too (then again, how often do your friends
or colleagues use tables in their HTML-based messages?). A more accurate
solution is to match against the final rendering of a message: for example,
JunkMatcher provides a text-based rendering
of an HTML-based email so you can build patterns
for it.
Case 2: Using <base ...> tag to hide external links
You know your friends and colleagues rarely send you emails containing
image links pointing to some random external site (something like
<img src="http:...">), so using JunkMatcher you
can specify this regular expression to filter out all emails containing
them:
(?i)<\s*img[^>]+(?:low)?src\s*=\s*(?:'|")?\s*http:
But then again, we don't want to filter out messages with "attached" images,
because people do sometimes send HTML messages with internal image
links (<img...> links whose src does not start with http:...).
Therein lies an opportunity waiting to be exploited. Look at the following
edited junk mail:
<HTML>
<body>
<base href="http://some.address.planted.by/spammers/">
...
<IMG SRC="p1_01.gif" border=0>
...
</body>
</HTML>
Looking at the <img...> line by itself, the message
seems fine, since this is an internal image link. BUT when it's combined
with a <base...> statement, it's no longer internal!
This applies not only to image links, but to any kind of link.
Solutions: This one is easy: filter out all emails
containing a <base...> tag! Who with a benign motive
would send you a carefully disguised message using a <base...> tag
to trick you into thinking that an external link is an internal one? A pattern
achieving this is shown below:
(?i)<\s*base[^>]+href
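To make the effect of the base address concrete, here is a minimal Python sketch (Python is used purely for illustration; the helper name resolve_link is hypothetical and not part of JunkMatcher). It resolves the relative image link against the base address with the standard urllib.parse.urljoin, roughly the way a mail client would, and checks the two patterns given above against the sample message.

```python
import re
from urllib.parse import urljoin

# Both patterns are copied from the text above.
EXTERNAL_IMG = re.compile(r"""(?i)<\s*img[^>]+(?:low)?src\s*=\s*(?:'|")?\s*http:""")
BASE_TAG = re.compile(r"(?i)<\s*base[^>]+href")


def resolve_link(base_href, link):
    """Resolve a link against a <base href=...> value, as a renderer would."""
    return urljoin(base_href, link)


message = """<HTML><body>
<base href="http://some.address.planted.by/spammers/">
<IMG SRC="p1_01.gif" border=0>
</body></HTML>"""

# The external-image pattern alone does not fire on the relative link...
print(EXTERNAL_IMG.search(message) is None)
# ...but the base pattern does, and the resolved link is clearly external.
print(BASE_TAG.search(message) is not None)
print(resolve_link("http://some.address.planted.by/spammers/", "p1_01.gif"))
```

The same resolution applies to any relative href or src in the message, which is why the presence of the base tag itself is the thing to test for.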
Case 3: Using white font color to hide texts
Look at the screenshot of a junk message below - nothing special right?
(except that it is a piece of spam email)
Now if you select all of the text so that it is highlighted, you'll
see there is actually hidden text - you can't see it without highlighting
the text because the font color is white:
One burning question: why did they go to so much effort to hide some
seemingly off-topic text? In this case the hidden text reads like this: "whiff
kelp arisen sumptuous mardi bicameral coxcomb ashland nab transferral
poisson programmer cretin deciduous colatitude annihilate...". I
can think of at least two reasons: (i) to feed garbage to statistically
trained junk filters, and (ii) to hide tracking information, so that by
looking at a message they know who received it.
Solutions: Who in their right mind would send you
white text on a white background? So this one is easy: filter out all
messages with white text! The following is the pattern written as a regular
expression:
(?i)(?#<_)font[^>]+(?:color="?(?:#?(?:FFFF|F.F.F.)|white))
The RGB code for white is #FFFFFF, but spammers sometimes
use a "near-white" color. That's why I match only the first 4
hexadecimal digits, or match if there is an F in each odd-numbered position.
Also note this assumes the HTML message has a white background, so that white
text is meant to be hidden.
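As a quick check of what the pattern above accepts, here is a small Python sketch. The pattern is copied verbatim; the helper name has_white_font and the sample tags are my own illustrations. Python's re module treats the leading (?#<_) as a regex comment, so the pattern compiles unchanged.

```python
import re

# Pattern copied from the text above; (?#<_) is ignored by Python as a comment.
WHITE_FONT = re.compile(r'(?i)(?#<_)font[^>]+(?:color="?(?:#?(?:FFFF|F.F.F.)|white))')


def has_white_font(html):
    """Return True if a font tag sets a white or near-white color."""
    return WHITE_FONT.search(html) is not None


# Illustrative inputs: pure white, near-white, the color name, and two
# ordinary colors that should be left alone.
for tag in ('<font color="#FFFFFF">', "<font color=#FEFEFE>",
            '<font face="Arial" color=white>', '<font color="#000000">',
            '<font color="#FF0000">'):
    print(tag, has_white_font(tag))
```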
Case 4: Using <meta http-equiv="refresh"...> trick
OK, this one is obvious: this type of HTML-based message usually comes
empty, but in the header section there is a sinister meta tag
like this:
<meta http-equiv="refresh"
content="0;URL=http://I.am.a.spammer">
If you open this mail in an HTML-enabled mail client, this will immediately
direct you to another website. Think your friend would do this? If not,
just use this pattern in JunkMatcher:
(?i)<\s*meta\s+http-equiv\s*="?refresh
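If you also want to know where such a message would send you, the target can be pulled out of the content attribute. The sketch below is an assumption-laden illustration: META_REFRESH is the pattern given above, while REFRESH_TARGET and the function name refresh_target are hypothetical additions of mine that only handle the common "seconds;URL=..." form.

```python
import re

# Detection pattern copied from the text above.
META_REFRESH = re.compile(r'(?i)<\s*meta\s+http-equiv\s*="?refresh')
# Hypothetical helper pattern: captures the URL from content="N;URL=...".
REFRESH_TARGET = re.compile(
    r'(?i)content\s*=\s*["\']?\s*\d+\s*;\s*url\s*=\s*([^"\'>\s]+)')


def refresh_target(html):
    """Return the redirect target of a meta refresh tag, or None."""
    if not META_REFRESH.search(html):
        return None
    match = REFRESH_TARGET.search(html)
    return match.group(1) if match else None


sample = '<meta http-equiv="refresh"\ncontent="0;URL=http://I.am.a.spammer">'
print(refresh_target(sample))
```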
Case 5: Abusing Google/Yahoo/MSN to redirect
This must have been going on for a while, but I only picked it up
lately from the JunkMatcher log. The trick is to disguise a real website
link behind a Google front, like this:
http://www.google.com/url?q=http://www.spammer.biz
The real site they want you to visit is
"http://www.spammer.biz". There could be fluff in between
the `?' and "q=", but that doesn't matter. Also,
"www.google.com" can be just
"google.com". To filter this, use the following pattern
in JunkMatcher:
(?i)http://(?:www\.)?google\.com/url\?
Paul Maisano wrote to me about a new sighting: using Yahoo.com
to redirect:
"http://rd.yahoo.com/some/junk/*http://my.real.site/hahaha.html"
will take you to
"http://my.real.site/hahaha.html" via redirection from yahoo.com.
Use this pattern to filter this out:
(?i)http://rd\.yahoo\.com/\S*\*http
I've also found more Yahoo sites: drs.yahoo.com, eur.rd.yahoo.com and srd.yahoo.com can
be used to achieve the same effect. In fact the
"eur" part of eur.rd.yahoo.com can be replaced
by the names of some other geographic regions, such as "us".
Taking the common denominator, this pattern should be sufficient:
(?i)http://\S+\.yahoo\.com/\S*\*http
Here is another trick using MSN to redirect site traffic. An example:
http://g.msn.com/SOME_FLUFF?http://spammer.biz
In this case the real site is spammer.biz, and the SOME_FLUFF part
cannot be just any random string, which implies it might contain necessary
tracking info. Filtering this is also simple:
(?i)http://\S+[*?]http
Actually the pattern above can filter both Yahoo and MSN tricks. Updated
20040425: another MSN site used for redirect is ads.msn.com -
the above pattern should work also.
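The redirect tricks above can be unwrapped as well as detected. The following Python sketch reuses the Google pattern and the generic `*'/`?' pattern from the text; the EMBEDDED_TARGET pattern and the function name real_destination are hypothetical helpers of mine, and the sketch only covers the URL shapes shown in this section.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Patterns copied from the text above.
GOOGLE_REDIRECT = re.compile(r"(?i)http://(?:www\.)?google\.com/url\?")
GENERIC_REDIRECT = re.compile(r"(?i)http://\S+[*?]http")
# Hypothetical helper: grabs the URL that follows the first `*' or `?'.
EMBEDDED_TARGET = re.compile(r"(?i)[*?](http\S+)")


def real_destination(url):
    """Return the site a redirect link really leads to, or None."""
    if GOOGLE_REDIRECT.match(url):
        # Google carries the target in the "q" query parameter.
        targets = parse_qs(urlsplit(url).query).get("q")
        return targets[0] if targets else None
    if GENERIC_REDIRECT.match(url):
        # Yahoo and MSN append the target after `*' or `?'.
        return EMBEDDED_TARGET.search(url).group(1)
    return None


for link in ("http://www.google.com/url?q=http://www.spammer.biz",
             "http://rd.yahoo.com/some/junk/*http://my.real.site/hahaha.html",
             "http://g.msn.com/SOME_FLUFF?http://spammer.biz"):
    print(real_destination(link))
```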
JunkMatcher version 1.06+ introduced
a new email property: "Has bad site(s)", which
"incorporated" solutions to all of the tricks documented here.
Case 6: Obfuscating a URL
Again these tricks are old, but still worth mentioning. The first trick
is to use `@' in a URL:
http://www.this-is-bogus.com@www.spammer.biz
Everything up to the `@' does not contribute to the final destination
you'll be taken to; the real site in this case is
"http://www.spammer.biz". Use this pattern in JunkMatcher
to filter this out:
http://[^@> ]*?@
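To see which part of such a URL is the real host, Python's standard urllib.parse.urlsplit can be used, as in the sketch below. The function names real_host and has_userinfo are hypothetical. Note that the pattern above fires on any `@' after "http://", whereas checking the parsed username only flags an `@' in the host portion; whether that distinction is desirable for filtering is a judgment call.

```python
import re
from urllib.parse import urlsplit

# Pattern copied from the text above.
AT_TRICK = re.compile(r"http://[^@> ]*?@")


def real_host(url):
    """Return the host a browser would actually contact."""
    return urlsplit(url).hostname


def has_userinfo(url):
    """Return True only if the `@' sits in the host portion of the URL."""
    return urlsplit(url).username is not None


disguised = "http://www.this-is-bogus.com@www.spammer.biz"
with_email = "http://www.somesite.com/something?email=me@mycompany.com"

print(real_host(disguised))       # host after the `@'
print(has_userinfo(disguised))    # userinfo present
print(has_userinfo(with_email))   # `@' is only in the query string
```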
According to this article (20040128),
Microsoft plans to drop support for usernames in URLs in the upcoming
Internet Explorer update. Whether this will eliminate
this trick entirely, if the plan goes through, remains to be seen.
If you look closer at the pattern above, you'll see it also filters out messages
containing links like this:
http://www.somesite.com/something?email=me@mycompany.com
This is a URL containing your email address
"me@mycompany.com". Presumably, if you follow the link, your
email address will be reported back. I consider this a bad practice in
email correspondence, but you might want to allow it.
The second trick, rarely seen these days, is to use a really
big decimal number to encode an IP address. For example, the URL
"http://2147666867/" is really
"http://128.2.203.179", which is
"http://www.cs.cmu.edu". The encoding uses this
formula:
obfuscated IP =
( (first octet * 2^24) + (second octet * 2^16) + (third octet
* 2^8) + (fourth octet) )
How do we filter this? The smallest meaningful obfuscated IP (from "1.0.0.0")
according to the formula is 16777216, which has 8 digits, and any "bigger" IP
converts to a number with at least as many digits. The pattern is:
http://\d{8,}
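The formula is easy to verify in code. The Python sketch below implements it directly and uses the standard ipaddress module for the reverse direction; the function names encode_ip and decode_ip are hypothetical. For the example above, 128 * 2^24 + 2 * 2^16 + 203 * 2^8 + 179 = 2147483648 + 131072 + 51968 + 179 = 2147666867.

```python
import ipaddress
import re

# Pattern copied from the text above.
DECIMAL_HOST = re.compile(r"http://\d{8,}")


def encode_ip(dotted):
    """Apply the formula above to a dotted-quad address."""
    first, second, third, fourth = (int(part) for part in dotted.split("."))
    return first * 2**24 + second * 2**16 + third * 2**8 + fourth


def decode_ip(number):
    """Turn the big decimal number back into a dotted-quad address."""
    return str(ipaddress.IPv4Address(number))


print(encode_ip("128.2.203.179"))
print(decode_ip(2147666867))
print(encode_ip("1.0.0.0"))
```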
The third trick exploits a flaw specific to Microsoft
Internet Explorer. In short, by combining the non-printing character %01
with the `@' trick, a link can be obfuscated so that a
bogus link address, instead of the real one, is displayed in both the
address bar and the status bar of the browser. You can visit here to test
your browser (specifically, IE version 6.0.2800.1106 on W2k is vulnerable;
interestingly, IE on Mac is not vulnerable; Camino, Firefox, and Mozilla
are not vulnerable either). We don't need an additional pattern for
this trick since we already filter out URLs containing `@'.
(If you know any other URL obfuscation tricks, please drop me a line)
Case 7: Using tiny font to hide gibberish
Paul Maisano brought this to my attention. This case is similar
to Case 3: both try to hide the gibberish text
used for confusing statistical filters. Look at the snapshot of this
junk mail:
Nothing special, right? Look closer at the "dashed" line beneath the
text: that's actually another line of text, in a 1-pixel font!
Here is the relevant part of the HTML code:
<font style=font-size:1px>
A solution is simply to look for this type of CSS code:
(?i)font-size:[^"]*?(?:(?<![1-9])[0-5](?![0-9])(?:px)?|(?<![1-9])[1-5]?\d%)
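Here is a small Python sketch exercising the pattern above. TINY_FONT is copied verbatim. One thing I noticed while tracing it: because the lookbehind only excludes the digits 1 to 9, a value such as 100% is also flagged (its last 0 is preceded by another 0). TINY_FONT_STRICT is a hypothetical variant of mine, not JunkMatcher's, that widens the lookbehind to any digit or a decimal point; treat it as a suggestion under that assumption rather than a tested replacement.

```python
import re

# Pattern copied from the text above.
TINY_FONT = re.compile(
    r'(?i)font-size:[^"]*?(?:(?<![1-9])[0-5](?![0-9])(?:px)?|(?<![1-9])[1-5]?\d%)')
# Hypothetical stricter variant (an assumption, not the original pattern):
# the lookbehind rejects any preceding digit or decimal point.
TINY_FONT_STRICT = re.compile(
    r'(?i)font-size:[^"]*?(?:(?<![\d.])[0-5](?![0-9])(?:px)?|(?<![\d.])[1-5]?\d%)')

samples = ("<font style=font-size:1px>",     # the junk mail above
           '<span style="font-size:12px">',  # ordinary text size
           "font-size:10%",                  # tiny percentage
           "font-size:100%")                 # normal size, see the note above

for css in samples:
    print(css, bool(TINY_FONT.search(css)), bool(TINY_FONT_STRICT.search(css)))
```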
The ultimate solution, however, is again to match against the final rendering of
a message, as already illustrated in Case 1. This
requires information not only about the text of the rendering, but
also about its font style and even color (Case
3 can be generalized to an absurd complexity using CSS).
Case 8: Web tracking tricks
From Paul Maisano:
Some spammers allow restricted redirection in order to track which
particular promotion is causing the hit on their site. E.g. today I found
that you can go to any site you like using something like:
http://xrntq.track.coolpicsandmore.com/_r.jpegg?cid=123&url=http://www.msn.com
You need a valid "promotion" id as the "cid" parameter or the page
complains with an SQL error. We should publish URLs like this and flood
them with useless data. That'll teach them. :-)
This practice has also been around for a long time. If you don't expect
the people who send you mail to need to track your clicks, use this pattern
to filter the tracking URLs out:
(?i)http://\S*\?\S*(?:id|ref)=
Admittedly this pattern works only when the variable name used to pass
back tracking info ends in id or ref. A more
drastic pattern is:
(?i)http://\S*\?\S+=
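To look at what a tracking link actually carries, the query string can be split with Python's standard urllib.parse, as sketched below. Both patterns are copied from the text; the function name query_parameters is hypothetical, and the example.com link is an illustrative address of mine.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Patterns copied from the text above.
TRACKING_ID = re.compile(r"(?i)http://\S*\?\S*(?:id|ref)=")
ANY_QUERY = re.compile(r"(?i)http://\S*\?\S+=")


def query_parameters(url):
    """Return the query parameters of a URL as a dict of value lists."""
    return parse_qs(urlsplit(url).query)


tracked = ("http://xrntq.track.coolpicsandmore.com/_r.jpegg"
           "?cid=123&url=http://www.msn.com")
plain = "http://example.com/page?lang=en"   # illustrative, not from the source

print(query_parameters(tracked))
print(bool(TRACKING_ID.search(tracked)), bool(TRACKING_ID.search(plain)))
print(bool(ANY_QUERY.search(plain)))
```

The first pattern catches "cid=" because the name ends in "id"; the second catches any link with a query parameter at all, which is why it is the more drastic choice.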
Paul went on to say this:
Spammers are particularly nasty when they include tracking information
inside <IMAGE SRC="..."> . Just viewing the email (using
a html browser) will inform the spammers that their promotion has reached
its target!
This practice has a name: "web bug". Since we already
filter all external image links, this is not a problem.
I also look for any "<script> " inside the email
-- that's always a good sign of spam; especially if it contains "window.open" or "Math.random".
This has indeed been addressed since earlier versions of JunkMatcher,
using this pattern:
(?i)(?#<_)script[^>]+language[^>]+=
In fact we don't allow form method either for obvious reasons:
(?i)(?#<_)form[^>]+(?:method|action)
Case 9: Vacuous tags
This is yet another trick: stuffing garbage inside a word so that
you can't see the garbage visually, but a text scanner may have difficulty
finding the word. For example, to break up a keyword
that every piece of anti-spam software loves to find, "viagra", it can
be coded like this:
via<font></font>gra
Apparently vacuous tags are also used to hide URLs, like this:
<a href="http://www.innocent.org"></a>
Since there's no text surrounded by the tags, this renders nothing in
your browser/mail client, therefore you can't even click on it. So "why" you
ask? Because some anti-spam tools collect URLs for analysis/filtering
(JunkMatcher version 1.06 and later
features a "Has bad site(s)" email property), and spammers want to stuff
you with these "diverting" URLs either to blow up the size of your site
collections, or simply to make lots of noise.
Another similar trick that's used more often is stuffing in entirely
bogus HTML tags within words to achieve the same effect:
via<sdlk3sd,>gra
There is no such tag as "<sdlk3sd,>", but browsers will
happily skip these mutants and render the word correctly.
In JunkMatcher both of these tricks
are defeated by property tests: emails with vacuous tags can be matched
using property "Has at least n vacuous tags" (since version
1.14), and emails with bogus tags can be identified using property "Has
at least n bad tags". Also, both of these types of tags are
removed before matching begins - so you get a nice, sanitized HTML message
to match against, and no diverting URLs will be collected.
Note these tricks won't have any effect when matching against the final
rendering of messages. So in JunkMatcher you can just add your normal
patterns in the Rendering section. But detecting the presence of these
tricks at the code level can be easier to do, and can speed
up the matching process.
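The sanitizing step described above can be approximated in a few lines of Python. This is only a sketch of the idea, not JunkMatcher's implementation: the regular expressions, the function names, and especially the tiny KNOWN_TAGS set are my own assumptions (a real filter would need the full list of HTML tag names).

```python
import re

# Illustrative subset of valid tag names; an assumption for this sketch only.
KNOWN_TAGS = {"a", "b", "body", "br", "font", "html", "i", "img", "p",
              "span", "table", "td", "tr"}

# An opening tag immediately followed by its own closing tag.
EMPTY_PAIR = re.compile(r"(?i)<([a-z][a-z0-9]*)\b[^>]*>\s*</\1\s*>")
# Any tag; group 1 is whatever stands where the tag name should be.
ANY_TAG = re.compile(r"</?([^\s>/]+)[^>]*>")


def strip_vacuous_tags(html):
    """Remove tag pairs that enclose nothing, repeating for nested pairs."""
    previous = None
    while previous != html:
        previous = html
        html = EMPTY_PAIR.sub("", html)
    return html


def strip_bogus_tags(html):
    """Remove tags whose name is not a known HTML tag."""
    def keep_known(match):
        return match.group(0) if match.group(1).lower() in KNOWN_TAGS else ""
    return ANY_TAG.sub(keep_known, html)


def sanitize(html):
    return strip_bogus_tags(strip_vacuous_tags(html))


print(sanitize("via<font></font>gra"))
print(sanitize("via<sdlk3sd,>gra"))
print(sanitize('<a href="http://www.innocent.org"></a>'))
```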
Case 10: Hidden URLs
Compared to the actual text used in junk mails, the websites (URLs)
mentioned in them have far more distinguishing power in detecting whether
a message is spam. That's why in JunkMatcher version 1.06+ I have
added a function to automatically collect "bad sites" for filtering
more spam.
The URLs must have done some damage to spammers, because they have started
to stuff in garbage URLs just to overwhelm the collection process. Here
is an example:
<a hrefloadingshref=http://kingdom.com href=
"http://fosraw.biz/OEx/?affiliate_id=x&campaign_id=x">Windows
XP Professional 2002 </a>
The "http://kingdom.com" is total garbage, since "hrefloadingshref" is
not a valid attribute of the HTML tag "a" (anchor). Obviously
the goal here is to hide a URL such that it won't show up in your mail
client (to avoid confusing people), but can still be collected by anti-spam
software.
Generalizing this trick a bit, a URL can be hidden in any
HTML tag with a garbage attribute.
To defeat this, in JunkMatcher version 1.15 and later a new property "Has >= n hidden
URLs" has been added. Also these hidden URLs will not be collected when
a message is filtered.
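One way to tell real URLs from hidden ones is to look at which attribute carries them. The Python sketch below uses the standard html.parser module; it is an illustration under my own assumptions, not how JunkMatcher does it. The class name HiddenURLFinder, the function find_urls, and the URL_ATTRS set of attributes treated as legitimate are all hypothetical, and the sample tag is a shortened version of the example above.

```python
from html.parser import HTMLParser

# Attributes assumed to legitimately carry URLs; an illustrative list only.
URL_ATTRS = {"href", "src", "lowsrc", "action", "background"}


class HiddenURLFinder(HTMLParser):
    """Sort URLs found in tag attributes into real and hidden ones."""

    def __init__(self):
        super().__init__()
        self.real = []
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over attribute names already lowercased.
        for name, value in attrs:
            if value and value.lower().startswith(("http://", "https://")):
                target = self.real if name in URL_ATTRS else self.hidden
                target.append(value)


def find_urls(html):
    """Return (real_urls, hidden_urls) for an HTML fragment."""
    finder = HiddenURLFinder()
    finder.feed(html)
    finder.close()
    return finder.real, finder.hidden


sample = ('<a hrefloadingshref=http://kingdom.com '
          'href="http://fosraw.biz/OEx/">Windows XP Professional 2002 </a>')
real, hidden = find_urls(sample)
print(real)
print(hidden)
```

Only the URLs in the first list would be worth collecting as "bad sites"; the length of the second list is the kind of count a hidden-URL property could test against.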