When you publicly assert that somebody sent spam, you need to ensure that your data is accurate. Your process needs to be simple and verifiable, and to compensate for any errors, you want your process to be transparent to the public with clear points of contact and line of responsibility. Here are some pointers from the operator of the bsdly.net greytrap-based blacklist.
Regular readers will be aware that bsdly.net started sharing our locally generated blacklist of known spam senders back in July 2007, and that we've offered hourly updates for free since then.
The mechanics of maintaining a list boil down to a few simple steps, as described in the original article and the various web pages it references as well several followups, but the probably most informative recipe for how it's all done was this one, written in May 2012 in response to (as usual) a heated exchange on openbsd-misc.
As I've explained in earlier articles, once the basic spamd(8) setup is in place, maintaining the blacklist starts with defining your list of known bad, never to become deliverable adresses in domains you control. It is worth noting that you can run spamd on any OpenBSD computer even if you do not run a real mail service (several of my correspondents do, and do evil things like crank up the time between response bytes to 10 seconds for entertainment), but as it happens we have a few real mail servers behind our spamd equipped gateways, so it seemed natural to restrict our pool of trap addresses to the domains that are actually served by our kit here.
Collecting addresses for the spamtraps list started with a totally manual process of fishing out addresses from the mail server logs, greping for log entries for delivery attempts to non-existent addresses in our domains. Spammers would do (as they still do) Joe jobs on one or more of our domains, making up or generating fake addresses to use as From: or Reply-to: addresses on their spam messages, and messages that for one reason or the other were not deliverable would end up generating bounce messages that our mail service would need to deal with. But a manual process is error prone and we're bound to have missed a few, so not too long after I'd written the script that generates the downloadable blacklist, I had it checking the active greylist for any addresses not already in the pool of known bad addresses.
This is the process that has helped generate the current list of 'imaginary friends', now 24,324 entries long and with a growth of usually a handful per day (but there have been whole days without a single new entry) but up to a few hundred, in rare cases, whenever the script runs. I assume there will be more entries arriving as I write and post this article, but right now the latest entry so far, received 13 Apr 2013 15:10 CEST, was firstname.lastname@example.org (which mildly suggests that somebody is having a bit of fun with my address and obvious keywords -- if you get the trap address list, you'll see that grep email@example.com sortlist turns up close to a hundred entries, mostly combinations of well-known keywords and my email address).
You could argue that fishing out bounce-to addresses of the greylist quickly for trapping purposes runs the risk of unfairly penalizing innoncent third parties with badly configured mail services, and I must admit that risk exists. However, my experiment of planting my own made-up adresses in the spamtraps list reveals that the list is indeed read and used by spammers, and after all, early sufferers would be blacklisted from here only for 24 hours after their last attempt at bouncing back the worthless stuff to us.
And once an address is in the spamtrap list, attempts at delivering mail to that address turns up in logs something like this:
Apr 14 15:19:22 skapet spamd: (GREY) 22.214.171.124: <firstname.lastname@example.org> -> <email@example.com>
Apr 14 15:19:22 skapet spamd: Trapping 126.96.36.199 for tuple 188.8.131.52 pc-126-127-215-201.cm.vtr.net <firstname.lastname@example.org> <email@example.com>
Apr 14 15:19:22 skapet spamd: sync_trap 184.108.40.206
the sync_trap line indicates that this spamd is set up to synchronize with a sister site, like I described in the In The Name Of Sane Email... article. When the miscreant returns, it looks something like this:
Apr 14 15:28:01 skapet spamd: 220.127.116.11: connected (3/3), lists: spamd-greytrap
Apr 14 15:30:15 skapet spamd: 18.104.22.168: disconnected after 134 seconds. lists: spamd-greytrap
most likely with repeat attempts until the sender gives up.
That's the basic mechanism. Now for the principles. I outlined some of the operating principles in a kind of terms of service statement here, but I'll offer a rehash here with a tiny sprinkling of tweaks I've made to the process in order to make the quality of the data I offer better.
First, as I already pointed out in the ingress, you want your process to be simple and verifiable. Run of the mill spamd greytrapping passes the first test with flying colors; after all, any host that ends up in the blacklist verifiably tried to send mail to a known bad address. Keep your logs around for a while, and you should be in good shape to verify what happened.
You also want your data to be accurate, with each entry representing a host that verifiably sent spam. This means watching out for errors of any kind, including but not limited to finding and removing false positives. The automatic 24-hour expiry that's part of the whole greytrapping experience helps a lot here. Any perpetrator or unlucky victim will be out of harm's or blockage's way within 24 hours of the last undesired action we register from their side. There is no requirement that the system administrator track down a web form and swear on their grandmother's pituitary gland that they have 'cleaned up the system'. We (perhaps naively) assume that anyone we don't hear from is no longer our problem.
However, spamd was designed to be a solution to a relatively simple and limited set of problems. Every day some spam messages will manage to get past the outer defenses and face the content filtering that in most cases makes the right decision and drops any spam messages that reaches it on the floor. And there is a small, but not entirely non-existent body of messages that are spam of some kind that will end up in users' inboxes.
For the case where messages are dropped by the content filtering, I found that it was fairly simple to extract the IP addresses of the last hop before entering our network from the logs generated by the content filtering, and at regular intervals these IP addresses are collected from the mail servers with the content filtering in place, and fed into the local greytrap via spamdb(8). It took more than a few dry runs before I trusted the process, but setting up something similar for your environment should be within any sysadmin's scripting skills. We use spamassassin and clamav here, but you should be able to extract fairly easily the information you need to fit the behavior of your particular combination of software. We also offer our users the option of saving messages in spam and not-spam folders on a network drive to train spamassassin's Bayesian engine, indirectly helping the quality of the generated blacklist via more accurate detection of spam. In addition, a so-minded administrator can even extract IP addresses from any headers the user had a mind to conserve and use spamdb(8) to manually insert offending IP addresses in the local greytrap list.
And finally to compensate for any errors, you want your process to be transparent to the public with clear points of contact and line of responsibility. In other words, make sure that you have people in place who are indeed accessible and responsive when somebody tries to contact you via any of the RFC 2142 required addresses. And post something like this article to somewhere reachable. At bsdly.net and associated domains, it's a distinct advantage that contact attempts happen from hosts not currently in the blacklists, but as far as I am aware any errors in the published list have been dealt with before anybody else noticed, and we have avoided being party to the blocklist vendettas and web forum flame wars that have plagued other blacklist maintainers (it has been suggested that the December 2012 DDOS incident could have been part of somebody's revenge, but we do not have sufficient evidence to point any fingers).
In short, you need to keep things simple, act responsibly and be responsive to anyone contacting you about your (mostly automatically generated) work product.
Good night and good luck.
2013-04-15 update: Clarified that manual spamdb(8) manipulation can be used to insert IP addreses in the blacklist too.
2013-04-16 update: It is also possible to fetch the hourly dump from the NUUG mirror here: https://home.nuug.no/~peter/bsdly.net.traplist. In fact, fetching from there should under most circumstances be faster than getting it from the original location. The file is copied at 15 minutes past the hour, while the generating starts at 10 past the hour.
In addition to the techniques described here, it is useful to know that OpenBSD developer Peter Hessler is working on distributing spamd data via BGP, as described in his AsiaBSDCon 2012 paper. Not part of the base distribution yet, but work continues and could come in useful in addition to the batch import of exported lists like the bsdly.net hourly dump.
If you're interested in setting up your own spamd, your main source of information is included in your OpenBSD (or FreeBSD or NetBSD) installation: the man pages such as the one I refer to here. Recommended secondary sources include my own The Book of PF and the PF tutorial (or here) it grew out of. You can even support the OpenBSD project by buying the book from them at the same time you buy your CD set, see the OpenBSD Orders page for more information.
If you're interested in OpenBSD in general, you have a real treat coming up in the form of Michael W. Lucas' Absolute OpenBSD, 2nd edition, also available from the OpenBSD site, and for a few hours more the auction of the first copy printed is running. Surely you can top USD 1145? With your boss' credit card, perhaps?
Upcoming talks: I'll be speaking at BSDCan 2013, on The Hail Mary Cloud And The Lessons Learned, with a preview planned for the BLUG meeting a couple of weeks before the conference. There will be no PF tutorial at this year's BSDCan, fortunately my staple tutorial item was crowded out by new initiatives from some truly excellent people. But you can lobby other organizers to host one.