It dawned on me a couple of days back that finding the "Unknown user" entries in the mail server logs means I find only the backscatter bounces that have managed to clear greylisting, sent by real mail servers which are misconfigured to deliver spam to their users. Clearing greylisting may take a while, but once the IP address enters the whitelist, and as long as the machine does not try to send to any address which is already in the traplist, it will be able to deliver its spam or backscatter.
Harvest the noise while it's fresh

Fortunately it's very easy to harvest the noise data while it's fresh. You search the greylist instead. A simple
$ sudo spamdb | grep GREY
gives you a list of all currently greylisted entries at that spamd instance, in a format which is well documented in the spamdb man page:
GREY|200.170.143.41|smtp6.netsite.com.br|<tbento@acipatos.org.br>|<peter@bsdly.net>|1185386752|1185401152|1185401152|1|0
GREY|217.19.208.25|idknet.com|<>|<credulity093@datadok.no>|1185386865|1185401265|1185401265|1|0
GREY|85.249.128.205|neptune.usedns.com|<>|<credulity093@datadok.no>|1185387329|1185401729|1185401729|1|0
GREY|194.183.162.193|scelto.relc.com|<>|<bequeathpi@datadok.no>|1185387398|1185401798|1185401798|1|0
There will most likely be more than one entry, and in this format it's fairly easy to see at least two traplist candidates, credulity093@datadok.no and bequeathpi@datadok.no. I have no idea if 217.19.208.25, 85.249.128.205 or 194.183.162.193 would ever have cleared greylisting, but now that credulity093@datadok.no and bequeathpi@datadok.no are in my traplist (and yes, at least part of the process should be very easy to automate), those hosts will be stuttered at, starting with the next time they try to connect and most likely until they give up.
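As a rough illustration of the part that should be easy to automate, here is a minimal Python sketch that picks out traplist candidates from spamdb output. It rests on assumptions of mine rather than anything spamd does for you: it treats an entry with the null sender `<>` (a bounce) as interesting, and it skips any recipient you list in the hypothetical `known_good` parameter. The function names and dictionary keys are made up for this example; the field order follows the GREY lines shown above.

```python
def parse_grey(line):
    """Split one spamdb GREY line into a dict; return None for anything else."""
    fields = [f.strip() for f in line.strip().split("|")]
    if len(fields) != 10 or fields[0] != "GREY":
        return None
    # Key names are illustrative; the order matches the sample output above.
    keys = ("type", "ip", "helo", "sender", "recipient",
            "first", "passtime", "expire", "block", "passed")
    return dict(zip(keys, fields))


def trap_candidates(lines, known_good=()):
    """Return recipients of null-sender (bounce) entries, in first-seen order.

    Assumption: an address that only receives bounces and is not in
    known_good is a likely made-up address. A human should still check.
    """
    good = {addr.lower() for addr in known_good}
    seen = []
    for line in lines:
        entry = parse_grey(line)
        if entry is None or entry["sender"] != "<>":
            continue
        rcpt = entry["recipient"].strip("<>").lower()
        if rcpt not in good and rcpt not in seen:
            seen.append(rcpt)
    return seen


# The four entries quoted earlier in this post.
SAMPLE = """\
GREY|200.170.143.41|smtp6.netsite.com.br|<tbento@acipatos.org.br>|<peter@bsdly.net>|1185386752|1185401152|1185401152|1|0
GREY|217.19.208.25|idknet.com|<>|<credulity093@datadok.no>|1185386865|1185401265|1185401265|1|0
GREY|85.249.128.205|neptune.usedns.com|<>|<credulity093@datadok.no>|1185387329|1185401729|1185401729|1|0
GREY|194.183.162.193|scelto.relc.com|<>|<bequeathpi@datadok.no>|1185387398|1185401798|1185401798|1|0
""".splitlines()
```

Feeding it the output of `spamdb` should give you a short list of addresses to look over before you add any of them to the traplist by hand. I would not let a script like this add traps unattended, since a legitimate bounce to a real user looks exactly the same to it unless that user is in `known_good`.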
Now it's probably still useful to tail -f your spamd log anyway, but you can leave the harvesting off until you see a marked increase in simultaneous connections to spamd, as in when the first number in parentheses starts rising sharply. Here the number is low (the second number is the number of currently blacklisted hosts):
Jul 25 22:17:16 delilah spamd[11839]: 217.146.97.10: connected (12/12), lists: spamd-greytrap
Jul 25 22:17:35 delilah spamd[11839]: 213.177.120.98: connected (13/13), lists: spamd-greytrap
Jul 25 22:17:36 delilah spamd[11839]: 87.103.238.226: connected (14/14), lists: spamd-greytrap
When the first number rises sharply -- that's when the first wave of spam or backscatter hits, and you can harvest the noise while it's still fresh.
A good harvest means less work for your mail server.
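If you would rather not watch the log yourself, something like the following Python sketch could flag the moment the connection count takes off. The regular expression is written against the log lines quoted above and nothing else, so treat it as an assumption about your own log format. What counts as "rising sharply" is also my own arbitrary choice: the latest count being at least `factor` times the average of the previous `window` counts. Both parameter names are hypothetical.

```python
import re

# Matches lines like:
#   ... spamd[11839]: 217.146.97.10: connected (12/12), lists: spamd-greytrap
CONNECTED = re.compile(r"spamd\[\d+\]: (\S+): connected \((\d+)/(\d+)\)")


def connection_counts(lines):
    """Return the first number in parentheses from each 'connected' line."""
    counts = []
    for line in lines:
        match = CONNECTED.search(line)
        if match:
            counts.append(int(match.group(2)))
    return counts


def sharp_rise(counts, window=3, factor=2.0):
    """True if the latest count is >= factor times the mean of the
    preceding `window` counts. The thresholds are arbitrary defaults."""
    if len(counts) <= window:
        return False  # not enough history to compare against
    baseline = sum(counts[-window - 1:-1]) / window
    return baseline > 0 and counts[-1] >= factor * baseline


LOG = [
    "Jul 25 22:17:16 delilah spamd[11839]: 217.146.97.10: connected (12/12), lists: spamd-greytrap",
    "Jul 25 22:17:35 delilah spamd[11839]: 213.177.120.98: connected (13/13), lists: spamd-greytrap",
    "Jul 25 22:17:36 delilah spamd[11839]: 87.103.238.226: connected (14/14), lists: spamd-greytrap",
]
```

With the three quiet lines above, `sharp_rise(connection_counts(LOG))` stays false. It would only turn true once a later line reports a count around double the recent average, and that would be your cue to go looking in the greylist.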
SPF found potentially useful

One recurring theme in greylisting discussions is how to deal with sites which do not play nicely with greylisting, specifically sites with many outgoing SMTP servers and no guarantee that the retries will come from the same IP (you can find a rather informal discussion in the PF tutorial, for example). If you can't get those sites to do the retry magic, you probably need to whitelist them, but in the case of large sites like Google, how do you find out just which machines to whitelist?
For well run sites the answer is simple: if they publish SPF data, you use that. After all, that data is their own list of valid outgoing SMTP senders. The solution presented itself in a recent openbsd-misc post by Darrin Chandler. If you need to whitelist a site with many potential outgoing SMTP servers, the command is
$ host -ttxt example.com
That is, look up the text data in the domain's DNS data, which is where SPF data lives. The answer would typically be something like
example.com descriptive text "v=spf1 mx -all"
which essentially means, "for the example.com domain, only the mail exchangers are valid SMTP senders". The next step is easy: if the answer contained IP addresses or IP address ranges, you put those in your whitelist. In this case there are none, so a
$ dig example.com mx
would get you the data you need (possibly after a few more host commands).
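To make the "possibly after a few more host commands" part a little more concrete, here is a small Python sketch that sorts the terms of an SPF record into two piles: literal networks you could whitelist directly, and terms that need another DNS lookup first. It does no DNS queries itself and covers only the common mechanisms, so it is nowhere near a full SPF parser. The function name and the returned dictionary keys are my own invention, and the addresses in the usage example are documentation ranges, not anybody's real mail servers.

```python
def spf_mechanisms(txt):
    """Sort the terms of an SPF TXT record into literal networks and
    terms that need further DNS lookups. Returns None if txt is not SPF."""
    terms = txt.strip().strip('"').split()
    if not terms or terms[0].lower() != "v=spf1":
        return None
    networks, lookups = [], []
    for term in terms[1:]:
        # A leading qualifier (+ - ~ ?) says how a match is treated; only
        # passing terms (no qualifier, or +) describe valid senders.
        qualifier = "+"
        if term[0] in "+-~?":
            qualifier, term = term[0], term[1:]
        if qualifier != "+":
            continue
        name = term.lower()
        if name.startswith(("ip4:", "ip6:")):
            networks.append(term[4:])  # address or CIDR range, as published
        elif name.startswith("redirect="):
            lookups.append(term)       # the real policy lives in another record
        elif name.split(":")[0].split("/")[0] in ("a", "mx", "include"):
            lookups.append(term)       # resolve with further host/dig queries
        # Other terms (all, exists, ptr, exp=...) are ignored in this sketch.
    return {"networks": networks, "lookups": lookups}
```

For the record shown above, `spf_mechanisms('"v=spf1 mx -all"')` leaves `networks` empty and puts `mx` in `lookups`, which matches the advice to go and look up the MX records. A made-up record such as `v=spf1 ip4:192.0.2.0/24 include:_spf.example.net ~all` would put `192.0.2.0/24` in `networks` and leave the `include:` term for a follow-up lookup.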
Frankly it would be a lot better if those sites learned to play well with greylisting, but if you choose to whitelist them anyway, at least this way you take their word for what their valid senders are.