Friday, December 9, 2022

Harvesting the Noise While it's Fresh, Revisited

A year's worth of logs yields entertaining but unsurprising findings about spammer behavior.
[Illustration: Spam mail, masked but detected, from the archive]

Returning readers will be almost painfully aware that here at nxdomain.no (also known as bsdly.net) we host and maintain a blocklist, which in turn is the product of traffic that hits our mail system with attempts at delivery to one or more of the now more than three hundred thousand known bad addresses, also featured at the blocklist home page.

Note: This piece is also available without trackers, but with only basic formatting, here

When I first set up the greytrapping back in 2007, the initial spamtraps were non-deliverable addresses in our domains that I had extracted from mail server logs. I won't bore you with the details (which are in any case documented at length in earlier articles), but it was clear from those logs that the domains we hosted back then were more or less continuously subject to Joe jobs, as in somebody sending messages with a forged From: field containing a made-up address in our domains.

After a while I started extracting potential new spamtraps from the greylist, actually dumping data from there once per hour as part of the script that also generated the exported blocklist. The basic process is described in the July 25, 2007 article Harvesting the noise while it's still fresh; SPF found potentially useful (also available trackerless, but with links to tracked articles).
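
For those who want to play along at home, the hourly extraction can be approximated with something along these lines. This is a sketch only: the field positions follow the spamdb(8) dump format as I recall it (GREY entries carry the envelope-to in the fifth pipe-separated field), and the files ourdomains.txt (one hosted domain per line) and sorted-traps.txt (the existing traps, sorted) are invented names for illustration.

    # Candidate spamtraps from the current greylist, a sketch only.
    spamdb | grep '^GREY' | awk -F'|' '{print $5}' | tr -d '<>' |
        tr '[:upper:]' '[:lower:]' |
        grep -f ourdomains.txt |        # keep only domains we host
        sort -u |
        comm -23 - sorted-traps.txt     # drop addresses already trapped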

Then today it struck me that while that method is useful, by extracting only from the greylist we will only ever collect the addresses from the initial connections. Any addresses attempted after the miscreants enter the blocklist will simply not be recorded there.

This of course led to the question: What did we miss?

Fortunately I keep my logs around for a while; the most easily accessible log archive for my main spamd spans a little over a year. So I set about with some very basic grep and awk, which netted me this raw list of targeted addresses from the spamd logs.
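
If you want to reproduce something similar, a rough equivalent of that extraction could look like the sketch below. It assumes spamd messages have been shunted to their own log files (the path /var/log/spamd* is made up for the occasion, as is the output file name) and that the log lines contain the familiar "<from> -> <to>" pattern; adjust to match your own syslog setup.

    # Rough sketch: pull every targeted (envelope-to) address out of the
    # spamd log archive, lowercased and deduplicated.
    zcat -f /var/log/spamd* |
        grep ' -> <' |
        sed -e 's/.* -> <//' -e 's/>.*//' |
        tr '[:upper:]' '[:lower:]' |
        sort -u > raw-targeted-addresses.txt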

The list weighs in at a total of 269903 entries, as counted by wc -l.

Some of those addresses are valid, a small but actually significant number are in domains we do not serve here, and some entries do not look like mail addresses at all. The stranger ones could be strings encoded in a character set that spamd is not equipped to handle, or other binary data that might have been intended to trigger bugs in some of the variants of fully equipped SMTP servers out there. Or they could simply be noise of some other kind, including byproducts of the not very intelligent extraction one-liner I used.
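
A first rough triage of the raw list can be done along these lines; again only a sketch, with made-up file names (ourdomains.txt is the hypothetical domain list from the earlier sketch) and a deliberately naive idea of what "looks like a mail address" means:

    # Separate plausible-looking addresses from the obvious junk, then
    # split the plausible ones into our own domains versus foreign ones.
    grep -E '^[^@[:space:]]+@[^@[:space:]]+\.[^@[:space:]]+$' \
        raw-targeted-addresses.txt > looks-like-addresses.txt
    grep -vE '^[^@[:space:]]+@[^@[:space:]]+\.[^@[:space:]]+$' \
        raw-targeted-addresses.txt > noise.txt
    grep -f ourdomains.txt looks-like-addresses.txt > ours.txt
    grep -v -f ourdomains.txt looks-like-addresses.txt > foreign.txt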

The target addresses in foreign domains I take as a sign that at least some spamming operators mistake a reasonably configured spamd for an open relay, just like they did all those years ago when I started running the greytrapping.

Some things apparently stay the same no matter how the rest of the world has found a way to move forward.

While I did a few other tasks and finally started writing this article, the bulk of the processing that would answer the question posed earlier (What did we miss?) could fortunately run unattended in the background. After some manual massaging we are left with a results file of 1530 entries that were neither of the following (a sketch of the weeding follows the list):

  • actually useful deliverable addresses in our domains
  • existing spamtraps
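
The weeding itself does not need to be any more complicated than the following sketch, assuming deliverable.txt holds the valid, deliverable addresses in our domains and sorted-traps.txt the existing spamtraps, both sorted and lowercased; the file names are again invented for illustration, and ours.txt is the per-domain split from the previous sketch.

    # Keep only entries that are neither deliverable addresses nor
    # existing spamtraps; both reference files must be sorted.
    sort -u ours.txt |
        comm -23 - deliverable.txt |
        comm -23 - sorted-traps.txt > new-traps.txt
    wc -l new-traps.txt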

This means of course that the collection of imaginary friends expanded by the same number, and now stands at 304154 entries.

Which I suppose means that harvesting the noise even after a period of aging for refinement can be a good thing.

The entries added represent a wide variety of phenomena. Quite a few seem to be truncated versions of earlier spamtrap entries, and a fair number of the new entries look like they may have descended from artifacts of stupidity such as the products of SMTP callbacks. This proves mainly that in mail and spam handling, there is apparently still a place for the less intellectually astute.

With all of this said, the natural follow-up question is: given the modest net result, was this worth the effort?

Well, the raw output that yielded 269903 entries needed some manual operations to weed out the obvious noise (exact time used not recorded), followed by another background task that took, according to time(1):

    real    105m24.220s
    user    73m3.280s
    sys     29m14.930s

which yielded 1577 entries that were pared down to 1530 entries that met the criteria for inclusion in the circle of imaginary friends (also known as spamtraps).
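
For completeness: once you have a vetted list of new trap addresses, feeding them to spamd is a matter of spamdb(8)'s trap mode. The loop is just my sketch (new-traps.txt being the made-up file name from the earlier sketches); the spamdb -T -a invocation itself is the documented interface for adding trap addresses.

    # Add each vetted new address as a spamtrap.
    while read -r addr; do
        spamdb -T -a "$addr"
    done < new-traps.txt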

Before this experiment the spamtraps list numbered 302625 entries; after including the results here, the count stands at 304154, a gain of less than one percent of the previous total. Again, if you check the traplist home page now, the total number is likely to have increased further.

So was it worth the effort? I feel that as an experiment, it was worth doing.

Whether or not it is an experiment that is worth repeating is a question for another day.

If you have opinions on this, I would love to hear from you, in comments, via email or messages on whichever social media brought you the link to this article.

As always, parties interested in studying the data referenced in this article and other pieces I have written are welcome to contact me for arrangements. I can easily dig out more and rawer data than directly referenced here on request.

Stay safe out there.


As a side note, a slightly improved way of extracting useful data about other domains' mail service via SPF records can be found in the November 2018 article Goodness, Enumerated by Robots. Or, Handling Those Who Do Not Play Well With Greylisting.

That article (naturally) works from the premise that you are running a recent OpenBSD system.

