Email Guides and Essays
by Kaitlin Duck Sherwood

2003 Anti-Spam Conference Trip Report

Kaitlin Duck Sherwood

Here are my observations of the 17 Jan 2003 (anti-)spam conference held at MIT. This is what I found interesting, anyway -- your mileage may vary.

Background
General
The ISP Point of View
Spammer Techniques
Legal
Spammer Profiles
Infrastructure
Beyond Spam
Personal Notes

Background

There are two main approaches to fighting spam these days:
- fuzzy-logic algorithms that score human-derived email features, e.g. SpamAssassin
- probabilistic learning algorithms that require training, e.g. Bayesian filters.
A word that has moved into the spam-fighter's vocabulary is "ham", meaning anything that is not spam.
My interpretations are in square brackets.
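
To make the contrast between the two approaches concrete, here is a minimal sketch in Python. It was not shown at the conference; the rules, weights, threshold, and function names are all invented for illustration, and real tools such as SpamAssassin use far larger rule sets and more careful statistics.

```python
import math
from collections import Counter

# Approach 1: human-written rules, each with a hand-chosen score.
# These rules and weights are made up for illustration only.
RULES = [
    (lambda msg: "make money fast" in msg.lower(), 2.5),
    (lambda msg: msg.isupper(), 1.0),
    (lambda msg: "unsubscribe" in msg.lower(), 0.5),
]
THRESHOLD = 2.0  # assumed cutoff: at or above this, call it spam


def rule_score(msg):
    """Add up the weights of every rule that fires on the message."""
    return sum(weight for test, weight in RULES if test(msg))


# Approach 2: a learning filter that must be trained on labelled mail.
def train(messages):
    """messages is an iterable of (label, text), label in {"spam", "ham"}."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for label, text in messages:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals


def spam_log_odds(text, counts, totals):
    """Naive Bayes log-odds with add-one smoothing; > 0 leans spam.

    Assumes at least one spam and one ham message were seen in training.
    """
    vocab = set(counts["spam"]) | set(counts["ham"])
    n_spam = sum(counts["spam"].values())
    n_ham = sum(counts["ham"].values())
    score = math.log(totals["spam"] / totals["ham"])
    for word in text.lower().split():
        p_spam = (counts["spam"][word] + 1) / (n_spam + len(vocab))
        p_ham = (counts["ham"][word] + 1) / (n_ham + len(vocab))
        score += math.log(p_spam / p_ham)
    return score
```

The first approach needs a person to write and weight each rule; the second needs no rules at all but is only as good as the mail it was trained on.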

General

There was general consensus that what is spam to one person might be interesting to another. This argues for user-specific anti-spam rules.
There was general consensus that a great diversity of anti-spam tools was a Good Thing as it made it more difficult for spammers to get around them all.
A number of speakers said that the exact algorithm that you use for adaptive filtering isn't nearly as important as the training data. Thus the biggest issue is deployment: how to make it easy for the casual user to train their filter.
David D. Lewis, a natural language text classification expert, said that understanding your text-classification algorithm was much more important than which text-classification algorithm you use.
David Berlind of CNet -- a legitimate bulk emailer -- wants to have a meeting in February with "all major players" to discuss what to do about spam. [I presume he's hoping for structural change at the protocol level.]
Several speakers made a point of saying that they respected spammers' intelligence, that spammers were very clever in getting around various anti-spam techniques.
Ken Schneider of Brightmail says that spam has gone up from 8% of email traffic in Sept 2001 to 38% in Sept 2002. He speculated that the increase was due in part (but only in part) to people's concern about anthrax in the postal mail.
Spam is an even bigger problem for consumers in Europe than in the US. It's not that there is more spam there, but that Europeans are generally billed by the minute for connection time. Europe (and Japan) also have wider use of email-enabled mobile devices, which are now getting spam. Again, the per-minute mobile costs make spam infuriating.
Many of the speakers felt that the spammers were not going away -- that anti-spammers would always have to react to what spammers did. [I do not agree with this sentiment.]

The ISP Point of View

Barry Shein, CEO of an ISP called TheWorld, pointed out that from 1970 to 1994 the Internet ran on tax money, and from 1994 to 2001 it was funded by "stupid investors". He's nervous about how it will continue, especially if ISPs have to keep spending energy dealing with spam.
Spam is souring his customers on the Internet: given a typical mailbox size and typical download rates, it would take two hours to download the whole mailbox.
Customers vent about spam at ISPs purely because the ISPs are the ones that customers can reach.
He has become suspicious of spam, wondering if it is just a very clever denial-of-service attack. (He thinks a lot of spam might be from script kiddies, given that the contact information given is frequently unreachable.)
He thinks that there needs to be structural reform of the Internet to fix the spam problem. He has little faith in technological solutions.

Spammer Techniques

Several speakers touched on techniques that spammers have started using in order to get around anti-spam programs, particularly ones that penalize "spammy" words but give a bonus to "hammy" words. Most of them relied on a difference between what was visible to humans ("eye-space") and what was visible to computers ("ASCII-space"). For example:

"spamus interruptus", where words are broken up by HTML tags, e.g.

	ma<!-- 2987 -->ke mon <!-- 9213 -->ey fast!
	ma<b></b>ke mon<i></i>ey fast!

"invisible ink", where hammy words are displayed in the same color as the background, e.g.
```
	<FONT COLOR="white">http://www.wecanstopspam.org</FONT>
```

"slice and dice", where each character in a message is a cell in a table, e.g.

	<TABLE><TD>M</TD><TD>a</TD><TD>k</TD><TD>e</TD><TD> </TD>
	<TD>m</TD><TD>o</TD><TD>n</TD><TD>e</TD><TD>y</TD><TD> </TD>
	<TD>f</TD><TD>a</TD><TD>s</TD><TD>t</TD><TD>!</TD></TABLE>

JavaScript encoding of the advertisement
HTML with no words, just images

Bogus HTML tags with lots of hammy words, e.g.

   <Now is the time for all good men to come to the aid of their country>

Spaces between letters in a word, e.g.
```
	M a k e  M o n e y  F a s t ! !
```
A message in multipart/alternative format where the text/plain piece has lots and lots of hammy words and the text/html has the payload. (Most people's mail readers will ignore the text/plain piece and display the text/html piece.)
Spammy URLs encoded in decimal, hex, or URL-encoding.
Using the numeral one for the letter i, zero for the letter o, the numeral four for the letter a, etc.:
```
	V1DE0 T4PE M0RTG4GE
```
Using accented letters in English:
```
	Mäkè Mönéy Fäst!
```

Legal

There was a really interesting, non-technical talk by Jon Praed of the Internet Law Group. This guy sues spammers.

While more laws against spam are nice, spam is already illegal under common-law "trespass to chattels". You can't hop your neighbor's fence and milk his cows or eat his apples; similarly, you can't use your neighbor's SMTP server to send your spam.
There has recently been a judgement that ignorance is no defense, that the spammers should know that it's wrong. (Praed calls this the "kindergartener rule", something "every kindergartener should know", e.g. "any kindergartener knows that you can't just slap the guy sitting next to you".)
Spammers can't spam for very long without committing fraud -- falsifying links, falsifying domain registration info, subcontracting to hide their identity, etc.
While the domains may be offshore, and the companies may be registered offshore, the spammers and their servers are almost uniformly in the US.
Spammers frequently use temporary sites -- ones that they will only turn DNS on for two hours or so -- to make it hard to track them down. "Temporary connectivity is the norm."
He won a recent case giving the precedent that you can sue in the jurisdiction where the spam was received. [This is great for spam-fighting but worrying for people who fight for the rights of sexual minorities!]
After his talk, Praed said that better spam filters make his job easier: the more contortions that a spammer goes through to make sure that the messages go through, the easier it is to convince a judge that the spammer knew it was wrong.
Paul Judge of CipherTrust pointed out that as you go from the casual spammer (chain letters) to the large-scale, well-funded, knowledgeable spammers, technological measures become less effective and legal action becomes more effective.

Spammer Profiles

Praed also had some interesting things to say about who is spamming.

Spammers are usually hackers gone bad or crooks gone geek. [I infer that thus it's rare that a law-abiding non-hacker becomes a spammer.]
Spammers have never been so successful at anything else. [I never heard a more polite description of "loser" before!] It's very hard for them to give up spamming because it's the only thing they've ever been good at.
Spammers tend to ignore court orders (perhaps because of the above); contempt-of-court judgements are common.
They tend to be big spenders -- there isn't much left in their bank accounts to cover court awards.
A lot of spam work gets subcontracted. You can get "spam kits" that help you do this. A lot of the subcontractors -- especially for porno-spam -- are kids. [Ulp!]

Infrastructure

Paul Judge from CipherTrust sees spam as a security problem (denial of service). He called for a foundation to provide infrastructure for fighting spam. In addition to large public archives of spam, he thinks we need

programs to anonymize the recipients of submitted spam (so people will submit spam),
measurement of global spam activity (and visualization tools)
automated testing (to figure out which algorithm truly is best)
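
As a rough idea of what the first item might involve, here is a small Python sketch that scrubs a submitter's own addresses from a message before it goes into a public archive. This is my assumption about what such a tool would do, not a description of any existing one; the function name and placeholder address are hypothetical.

```python
import re


def anonymize_recipient(raw_message, addresses,
                        placeholder="recipient@example.invalid"):
    """Replace every occurrence of the submitter's addresses.

    The match is case-insensitive and covers headers and body alike.
    It will not catch addresses hidden inside encoded tracking URLs.
    """
    scrubbed = raw_message
    for address in addresses:
        scrubbed = re.sub(re.escape(address), placeholder, scrubbed,
                          flags=re.IGNORECASE)
    return scrubbed
```

Working on the raw text rather than parsed headers means addresses in Received lines and in the body are scrubbed too, at the cost of knowing nothing about the message structure.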

Beyond Spam

John Graham-Cumming has found that Bayesian algorithms work well not just for separating spam from ham, but also for sorting messages into other categories, e.g. by project. He says that adding new categories for messages to go into doesn't degrade the anti-spam success rate of his program, POPFile.
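
To show why extra categories come almost for free with this kind of classifier, here is a minimal multi-category naive Bayes sketch in Python. It is not POPFile's actual code; the class name, the "bucket" terminology as used here, and the add-one smoothing are my own choices for illustration.

```python
import math
from collections import Counter, defaultdict


class BucketClassifier:
    """Naive Bayes over any number of user-defined buckets."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.message_counts = Counter()          # bucket -> messages seen

    def train(self, bucket, text):
        self.word_counts[bucket].update(text.lower().split())
        self.message_counts[bucket] += 1

    def classify(self, text):
        """Return the most probable bucket, or None if untrained."""
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_messages = sum(self.message_counts.values())
        best_bucket, best_score = None, -math.inf
        for bucket, counts in self.word_counts.items():
            n_words = sum(counts.values())
            # Log prior for the bucket plus smoothed log likelihoods.
            score = math.log(self.message_counts[bucket] / total_messages)
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / (n_words + len(vocab)))
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket
```

Nothing in the scoring loop is specific to spam: "spam" is just one more bucket name, which fits the observation that adding categories need not hurt spam detection.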

Personal Notes

It felt empowering and encouraging to be surrounded by so many people who wanted to kill spam.
It was also surprisingly funny! This is perhaps because the audience was so homogeneous: mostly white male American anti-spam programmers. This meant that there were rich veins of shared context that the speakers could mine for laughs.