Sunday, December 25, 2022

The Despicable, No Good, Blackmail Campaign Targeting ... Imaginary Friends?

Natalia here speaks to our imaginary friend 185.150.184.92

In which we confront the pundits' assumption that the embarrassment-based extortion attempts would grow more “sophisticated and credible” over time with real data.

It's a problem that should not exist. 

It's a scam that's so obvious it should not work.

Yet we still see a stream of reports about people who have actually gone out and bought their first bitcoins (or more likely fractions of one) in order to pay off blackmailers who claim to have in their possession videos that record the victim while performing some autoerotic activity and the material they were supposedly viewing while performing that activity.

And occasionally one of those messages actually finds its way to some pundit's inbox (such as that of yours truly), and at times some of those pundits will say that those messages represent a real problem and will evolve to be ever more sophisticated.

Note: This piece is also available, with more basic formatting but with no trackers, here.

I am here to tell you that

  1. That incriminating video does not exist, and
  2. The pundits who predicted that those scams would evolve to become more sophisticated were wrong.

If you stumbled on this article because one of those messages reached you, it's safe to not read any further and please do ignore the extortion attempt.

In 2019 I wrote a piece, The 'sextortion' Scams: The Numbers Show That What We Have Is A Failure Of Education (also available without trackers), where the summary is:

Every time I see one of those messages reach a mailbox that is actually read by one or more persons, I also see delivery attempts for near identical messages aimed at a subset of my now more than three hundred thousand spamtraps, also known as imaginary friends.

Over the years since the piece was originally written, I have added several updates — generally when some of this nonsense reaches a mailbox I read — and while I have seen the messages in several languages, no real development beyond some variations in wording has happened.

Whenever one of those things does reach an inbox, my sequence of actions is generally to save the message and add it to the archive, then see if the sending IP address has already entered the blocklist that is later exported, and add it by hand if not. Then I check whether the number of trapped addresses has swelled recently by checking the log file from the export script

$ tail -n 96 /var/log/traplistcounts

See if there is a sharp increase since the last blocklist export

$ doas spamdb | grep -c TRAPPED

Then check for related activity in the log

$ tail -n 500 -f /var/log/spamd

Check for the full subject in the same log file

$ grep "You are in really big troubles therefore, you much better read" /var/log/spamd

Then check older, archived logs to see how long this campaign has been going on for

$ zgrep "You are in really big troubles therefore, you much better read" /var/log/spamd.0.gz

This time, the campaign had not gone on for long enough to show traces in the older archive, so I go on to extracting the sending IP addresses

$ grep "You are in really big troubles therefore, you much better read" /var/log/spamd | awk '{print $6}' | tr -d ':' | sort -u

Check for activity from one of the extracted addresses

$ grep 183.111.115.4 /var/log/spamd | tee wankstortion/20221123_trapped_183.111.115.4.txt

Extract the sender IP addresses to a shell variable to use in the next one-liner,

$ troubles=$(grep trouble /var/log/spamd | awk '{print $6}' | tr -d ':' | sort -u | grep -v BLACK | tee -a wankstortion/20221123_campaign_ip_addresses.txt)

which will record all activity involving those IP addresses since the last log rotation:

$ for foo in $troubles ; do grep $foo /var/log/spamd | tee -a wankstortion/20221123_campaign_log_extract.txt ; done
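The sequence above could also be bundled into one small shell function. What follows is a minimal sketch and not the commands actually used here: the function name campaign_extract, its arguments and the output file names are assumptions made for illustration, and the sample log lines are made up to resemble spamd output, using addresses from the documentation ranges.

```shell
#!/bin/sh
# campaign_extract LOGFILE PATTERN OUTDIR   (hypothetical helper)
# Finds the hosts that produced lines matching PATTERN, then saves
# every log line that belongs to one of those hosts.
campaign_extract() {
    log=$1
    pattern=$2
    outdir=$3
    mkdir -p "$outdir" || return 1
    # Field 6 is "address:" on lines logged for a host. On lines that
    # start with a (BLACK) or (GREY) tag field 6 is the tag itself,
    # so those are dropped after the field has been picked out.
    grep -F -- "$pattern" "$log" | awk '{print $6}' | sed 's/:$//' |
        grep -v -e BLACK -e GREY | sort -u > "$outdir/ip_addresses.txt"
    : > "$outdir/log_extract.txt"
    while read -r ip; do
        # Match " address:" so 192.0.2.1 does not also catch 192.0.2.11
        grep -F -- " $ip:" "$log" >> "$outdir/log_extract.txt"
    done < "$outdir/ip_addresses.txt"
}

# Made-up sample lines in the general shape of spamd log entries
sample=$(mktemp)
cat > "$sample" <<'EOF'
Nov 23 10:00:01 mx spamd[1234]: 192.0.2.11: connected (1/0)
Nov 23 10:00:09 mx spamd[1234]: (GREY) 192.0.2.11: <a@example.com> -> <trap@example.net>
Nov 23 10:00:10 mx spamd[1234]: 192.0.2.11: Subject: You are in really big troubles
Nov 23 10:00:11 mx spamd[1234]: 192.0.2.11: disconnected after 10 seconds.
Nov 23 10:05:00 mx spamd[1234]: 198.51.100.7: connected (1/0)
Nov 23 10:05:02 mx spamd[1234]: 198.51.100.7: disconnected after 2 seconds.
EOF
demo_out=$(mktemp -d)
campaign_extract "$sample" "really big troubles" "$demo_out"
cat "$demo_out/ip_addresses.txt"
```

Against the real log the call would look something like campaign_extract /var/log/spamd "You are in really big troubles" wankstortion/20221123, with the directory name being whatever suits your archive layout.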

You will find all those files, along with some earlier samples, and by the time you read this, possibly even newer samples, in the archive.

When something of the sort inboxes, I probably will go on adding to the archive, and if I have time on my hands, also run similar extraction activities as the ones I just described. But unless something unexpected such as actual development in the senders' methods occurs, I will not bother to write about it.

The subject is simply not worth attention past persuading supposed victims to not bother to get bitcoins or spend any they might have to hand. None of my imaginary friends have, and they are just as fine as they were before somebot tried to scam them.

Good night and good luck.


 

Friday, December 23, 2022

Can Your Spam-eater Manage to Catch Seventy-one Percent Like This Other Service?

Measuring the effect of what you do is important. Equally important is knowing what is the measure of your actions.

A question turned up on IRC that had me thinking.

Do you have a percentage of the spam traffic you catch on your MXes? The reason I ask is I just learned that fastmail.com claim they catch 71% of all incoming spam. Also a rate of false positives would be nice to have, but that's likely harder to measure.

My first impulse was that I would consider a seventy-one percent hit rate on the low side of what we are seeing here at bsdly.net and associated domains.

But getting actually useful data would require some thinking. That said, comparing a major mail operator that sells deliverability and promises a 71 percent catch rate for incoming spam and bsdly.net would be like comparing apples and oranges at best. 

While bsdly.net (which is also known under a few other domain names) is my main mail service for my personal use and for a very select number of other people, to the rest of the world it is primarily a honeypot that generates security relevant data that other sites use, and that contributes to IP reputation rankings.

The site has been in operation in those roles for a little more than 15 years, since shortly before the original announcement in the article Hey, spammer! Here's a list for you!. When we started using the greylisting and greytrapping based setup, we saw a sharp drop in undesirable messages actually reaching inboxes, and I observed a marked decrease in load on the mail servers that did the content filtering.

Not long after I had set up our early greylisting setup, a message turned up on the openbsd-misc mailing list that pretty much matched our experience — a 95% reduction in spam in line to be treated to content filtering — so setting up precise measuring became a thing to do when we could get around to it.

Now enough with the background. It is relatively easy to extract at least some data that would give us a rough picture of the relative effectiveness of the greylisting and greytrapping versus the content filtering on receipt. The setup is very similar to the one described in the practically-oriented parts of the Effective Spam and Malware Countermeasures - Network Noise Reduction Using Free Tools and is part of a synchronizing multi-domain setup roughly as described in the earlier article In The Name Of Sane Email: Setting Up OpenBSD's spamd(8) With Secondary MXes In Play - A Full Recipe.

Using only tools found in the OpenBSD base system, I went on to collect data.

Whenever spamd(8) closes a connection it logs a message to that effect, so

$ zgrep "Nov  1" /var/log/spamd.6.gz | grep -c disconnected

supplies the total number of connections closed by spamd(8) during November 1st, fetched from the archived log file.

Similarly

$ zgrep "Nov  1" /var/log/spamd.6.gz | grep -c BLACK

provides the number of connections during the same 24 hour period initiated by hosts that were already in one of the blocklists used.

The command to get the number of connections that had cleared the first hurdle and entered greylisted status would be

$ zgrep "Nov  1" /var/log/spamd.6.gz | grep -c GREY

And the number of hosts that had been well behaved enough to enter the whitelist and be allowed to talk to the real SMTP service comes out of

$ zgrep "Nov  1" /var/log/spamd.6.gz | grep -c whitelisting

For hosts that have reached this far and did not fail the content filtering we do during receipt, we get the number with

$ doas zgrep 2022-11-02 /var/spool/exim/logs/main.log.6.gz | grep -c Completed

It is however worth noting that our MTA exim apparently reports Completed for message deliveries in both directions, so the number of received messages, or messages that did inbox, is likely about thirty percent lower.

The number of messages rejected for one reason or the other, by being addressed to an undeliverable address or by failing content filtering we find with

$ doas zgrep 2022-11-02 /var/spool/exim/logs/main.log.6.gz | grep -c rejected

And finally, a side effect of a frequently run log reading script that adds hosts with certain kinds of characteristics such as not having a correct reverse DNS entry to a blocklist and kills all their connections will at times produce an unexpected disconnection while reading SMTP command message. We find those with

$ doas zgrep 2022-11-02 /var/spool/exim/logs/main.log.6.gz | grep -c unexpected

Those are hosts that somehow got past spamd(8) by behaving enough like a real SMTP server to clear greylisting. However spamd(8) does not have the ability to check for valid reverse, so that part is left in our case to check for by reading the log files at intervals.
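The four spamd counts for a day can also be collected in a single pass over the log. This is only a sketch: the function name spamd_day_counts and the comma separated output layout are assumptions for illustration, and the sample lines are made up to resemble spamd log entries. Like the separate grep -c runs, it counts matching lines, so a line containing two of the keywords is counted under both.

```shell
#!/bin/sh
# spamd_day_counts DAY   (hypothetical helper, reads log lines on stdin)
# DAY is the date exactly as it appears at the start of a log line,
# for example "Nov  1" with two spaces for single-digit days.
spamd_day_counts() {
    awk -v day="$1" '
        index($0, day) == 1 {
            if (index($0, "disconnected")) closed++
            if (index($0, "BLACK"))        black++
            if (index($0, "GREY"))         grey++
            if (index($0, "whitelisting")) white++
        }
        END { printf "%s,%d,%d,%d,%d\n", day, closed, black, grey, white }
    '
}

# Made-up sample input; prints: Nov  1,2,1,1,1
spamd_day_counts "Nov  1" <<'EOF'
Nov  1 00:00:01 mx spamd[1234]: 192.0.2.11: connected (1/0)
Nov  1 00:00:05 mx spamd[1234]: (GREY) 192.0.2.11: <a@example.com> -> <b@example.net>
Nov  1 00:00:06 mx spamd[1234]: 192.0.2.11: disconnected after 5 seconds.
Nov  1 00:10:00 mx spamd[1234]: (BLACK) 198.51.100.7: <c@example.com> -> <d@example.net>
Nov  1 00:10:09 mx spamd[1234]: 198.51.100.7: disconnected after 9 seconds.
Nov  1 00:30:00 mx spamd[1234]: whitelisting 192.0.2.11 in spamd-white
Nov  2 00:00:01 mx spamd[1234]: 203.0.113.5: disconnected after 3 seconds.
EOF
```

For an archived log the input would come from something like zcat /var/log/spamd.6.gz piped into the function.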

The following table has the data for November 2022 —

Date        Incoming SMTP  BLACK        GREY         New whitelist  Deliveries  Rejected  Unexpected
            connections    connections  connections  entries                              disconnect
2022-11-01  53303          38951        2580         54             1347        409       384
2022-11-02  55653          40467        2174         121            1297        549       330
2022-11-03  59658          43901        2086         85             1260        865       759
2022-11-04  57462          45674        1683         71             1270        30        0
2022-11-05  44993          43571        2146         105            1182        43        0
2022-11-06  36768          37802        2322         86             1366        184       0
2022-11-07  49464          44213        2398         182            1424        67        0
2022-11-08  52285          45904        2676         113            1513        69        3
2022-11-09  47652          47988        2085         105            1438        154       0
2022-11-10  57850          49875        2614         104            1435        192       2
2022-11-11  60269          56719        2355         99             1420        90        1
2022-11-12  46139          54073        1160         96             1182        29        0
2022-11-13  40497          40221        1777         70             1239        189       0
2022-11-14  59965          59951        2062         63             1382        145       73
2022-11-15  56265          32727        2304         113            1298        351       301
2022-11-16  77252          58029        1925         109            1340        282       33
2022-11-17  43107          30713        786          131            1250        215       17
2022-11-18  49448          48999        1590         96             1327        194       1
2022-11-19  42413          45927        973          92             1182        182       70
2022-11-20  50890          55318        1558         77             1203        358       33
2022-11-21  36601          35070        1707         125            1321        241       146
2022-11-22  37840          35499        2055         99             1359        142       17
2022-11-23  43186          34545        1314         114            1345        103       21
2022-11-24  46802          45765        1856         66             1269        729       52
2022-11-25  70911          52404        1315         89             1326        1488      395
2022-11-26  39780          32226        1500         77             1175        954       379
2022-11-27  67578          41581        1743         85             1231        523       315
2022-11-28  54688          37534        2433         77             1337        321       269
2022-11-29  70893          45917        2502         65             1248        87        39
2022-11-30  50280          35585        2567         67             1324        1293      1113

The table is also available as a comma separated (CSV) file.

As I mentioned earlier, the number of connections to the outer layer spamd(8) is likely higher than what would be expected on sites that are not considered a honeypot and home to in excess of three hundred thousand imaginary friends (see The Things Spammers Believe - A Tale of 300,000 Imaginary Friends or the trackerless version).

That said, I think the data shows that catching the unwanted traffic early, and discarding as much as possible of that traffic before it reaches the resource hungry content filtering is definitely beneficial. 
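To put a rough number on that, one day's figures can be turned into shares of the incoming connections. The sketch below uses the 2022-11-01 row from the table; the function name row_shares and the output wording are assumptions for illustration. The results are indicative only, since the counts are log line counts and the Completed figure includes deliveries in both directions.

```shell
#!/bin/sh
# row_shares   (hypothetical helper, reads table rows on stdin)
# Columns as in the table: date, incoming connections, BLACK, GREY,
# new whitelist entries, deliveries, rejected, unexpected disconnects.
row_shares() {
    awk '{ printf "%s black=%.1f%% delivered=%.1f%%\n", $1, 100 * $3 / $2, 100 * $6 / $2 }'
}

# 38951 of 53303 connections came from hosts already blocklisted
echo "2022-11-01 53303 38951 2580 54 1347 409 384" | row_shares
# prints: 2022-11-01 black=73.1% delivered=2.5%
```

On that day roughly three out of four connections never got past the blocklist stage, and the number of messages that went on to be handled by the content filtering was a small fraction of the total.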

Even sites that do not actively bait the baddies out there would likely see noticeable energy bill savings by having their mail servers run quieter and cooler, as they definitely will after getting a greylisting, and optionally greytrapping, setup in front of them. Those services have a truly low energy consumption profile.

If you found this article interesting, useful or just simply irritating, I would like to hear from you. Please use the comment field, or if you prefer, send email to nix at nxdomain dot no with a subject that at least tries to sound sensible and relevant.

As always, if you are interested in research on items mentioned in this article, I will be able to provide data for study. I will honor reasonable requests.


Friday, December 9, 2022

Harvesting the Noise While it's Fresh, Revisited

A year's worth of logs yields entertaining but unsurprising findings about spammer behavior.
Spam mail, masked but detected, from the archive

Returning readers will be almost painfully aware that here at nxdomain.no (also known as bsdly.net) we host and maintain a blocklist, which in turn is the product of traffic that hits our mail system with attempts at delivery to one or more of the now more than three hundred thousand known bad addresses, also featured at the blocklist home page.

Note: This piece is also available without trackers but only basic formatting here

When I first set up the greytrapping back in 2007, the initial spamtraps were non-deliverable addresses in our domains that I had extracted from mail server logs. I won't bore you with the details (which are anyway documented at length in earlier articles), but it was clear from those logs that the domains we hosted back then were more or less continuously subject to Joe jobs, as in somebody sending messages with a forged From: field with a made up address in our domains.

After a while I started extracting the potential new spamtraps from the greylist — actually dumping data from there once per hour as part of the script that also generated the exported blocklist. The basic process is described in the July 25 2007 article Harvesting the noise while it's still fresh; SPF found potentially useful (also available trackerless but with links to tracked articles).

Then today it struck me that while that method is useful, by extracting only from the greylist we will only ever collect the address from the initial connections. Any addresses attempted after the miscreants enter the blocklist will simply not be recorded there.

This of course led to the question: What did we miss?

Fortunately I keep my logs around for a while, and the most easily accessible log archive for my main spamd spans a little over a year. So I set about with some very basic grep and awk, which netted me this raw list of targeted addresses from the spamd logs.

The list weighs in at a total of 269903 entries, as counted by wc -l.

Some of those addresses are valid, and a small, but actually significant, number are in domains we do not actually serve here, and some entries do not look like mail addresses at all. The stranger ones could be strings encoded in a character set that spamd is not equipped to handle, or could be other binary data that might have been intended to trigger bugs in some of the variants of fully equipped SMTP servers that are out there. Or simply noise of any other kind, including a byproduct of the not very intelligent extraction one-liner I used.
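The extraction one-liner itself is not reproduced in this article. As a rough idea of what such an extraction can look like, here is a sketch that assumes the recipient shows up as the last <address> after an arrow on the (GREY) and (BLACK) lines. The function name spamd_targets and the sample lines are made up for illustration.

```shell
#!/bin/sh
# spamd_targets   (hypothetical helper, reads log lines on stdin)
# Prints each distinct recipient address once.
spamd_targets() {
    # Keep only lines with "-> <recipient>" and print the recipient
    sed -n 's/.*-> <\([^>]*\)>.*/\1/p' | sort -u
}

# Made-up sample input; prints b@example.net and d@example.net
spamd_targets <<'EOF'
Nov  1 00:00:05 mx spamd[1234]: (GREY) 192.0.2.11: <a@example.com> -> <b@example.net>
Nov  1 00:10:00 mx spamd[1234]: (BLACK) 198.51.100.7: <c@example.com> -> <d@example.net>
Nov  1 00:10:02 mx spamd[1234]: (BLACK) 198.51.100.7: <c@example.com> -> <b@example.net>
Nov  1 00:10:09 mx spamd[1234]: 198.51.100.7: disconnected after 9 seconds.
EOF
```

A pattern this simple will happily pass along whatever sits between the angle brackets, which is one way the stranger entries described above can end up in the raw list.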

The target addresses in foreign domains I take as a sign that at least some spamming operators mistake a reasonably configured spamd for an open relay, just like they did all those years ago when I started running the greytrapping.

Some things apparently stay the same no matter how the rest of the world has found a way to move forward.

While I did a few other tasks and finally started writing this article, the bulk of the processes that would answer the question posed earlier (What did we miss?) could fortunately run unattended in the background, and after some manual massaging we are left with a results file, with 1530 entries that were none of

  • actually useful deliverable addresses in our domains
  • existing spamtraps

This means of course that the collection of imaginary friends expanded by the same number, and now stands at 304154 entries.
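The filtering step amounts to a set difference: candidates minus valid addresses minus known spamtraps. A minimal sketch, assuming three plain text files with one address per line; the function name new_traps and the file layout are illustrative assumptions, not the processing actually run here.

```shell
#!/bin/sh
# new_traps CANDIDATES VALID TRAPS   (hypothetical helper)
# Prints candidates that are neither deliverable nor already spamtraps.
new_traps() {
    # -x whole-line match, -F fixed strings, -f patterns from a file
    sort -u "$1" | grep -v -x -F -f "$2" | grep -v -x -F -f "$3"
}

# Made-up sample data
work=$(mktemp -d)
printf '%s\n' newone@example.net alice@example.net trap1@example.net \
    newone@example.net other@example.org > "$work/candidates"
printf '%s\n' alice@example.net > "$work/valid"
printf '%s\n' trap1@example.net > "$work/traps"
new_traps "$work/candidates" "$work/valid" "$work/traps"
# prints newone@example.net and other@example.org
```

The comparison is case sensitive and matches whole lines only, so addresses would need to be normalized first if the lists differ in case or carry stray whitespace.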

Which I suppose means that harvesting the noise even after a period of aging for refinement can be a good thing.

The entries added represent a wide variety of phenomena. Quite a few seem to be truncated versions of earlier spamtrap entries, and a fair number of the new entries look like they may have descended from artifacts of stupidity such as products of SMTP callbacks. Proving mainly that in mail and spam handling, there appears to be a space still for the less intellectually astute.

With all of this said, the natural followup question is, given the modest net result, was this worth the effort?

Well, the raw output that yielded 269903 entries needed some manual operations in order to weed out the obvious noise (exact time used not recorded), followed by another background task that took, according to time(1)

    real        105m24.220s
    user        73m3.280s
    sys	        29m14.930s
    

which yielded 1577 entries that were pared down to 1530 entries that met the criteria for inclusion in the circle of imaginary friends (also known as spamtraps).

Before this experiment, the spamtraps list numbered 302625, after including the result here, the count stands at 304154, for a gain of less than one percent of the previous total. Again, if you check back at the traplist home page now, the total number is likely to have increased again.

So was it worth the effort? I feel that as an experiment, it was worth doing.

Whether or not it is an experiment that is worth repeating is a question for another day.

If you have opinions on this, I would love to hear from you, in comments, via email or messages on whichever social media brought you the link to this article.

As always, parties interested in studying the data referenced in this article and other pieces I have written are welcome to contact me for arrangements. I can easily dig out more and rawer data than directly referenced here on request.

Stay safe out there.


As a side note, a slightly improved way of extracting useful data about other domains' mail service via SPF records can be found in the November 2018 article Goodness, Enumerated by Robots. Or, Handling Those Who Do Not Play Well With Greylisting.

That article (naturally) works from the premise that you are running a recent OpenBSD system.


Addendum 2025-01-12

For those so inclined, it is perhaps worth noting that after a bit of pondering some time after writing this piece, I started looking at extracting other items from the spamd log entries.

I ended up with extracting the local parts for new spamtraps from the purported sender addresses of entries for trapped delivery attempts some time mid-2024. This made for a significant increase in the number of new imaginary friends, and by the final months of that year I had also started extracting similarly from the string offered by the spam senders as their host name in the EHLO/HELO exchange, which of course swelled the population further.
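As an illustration of the sender address variant only, a sketch could look like the following. It assumes that trapped delivery attempts are the lines tagged (BLACK) and that the sender is the first <address> on such a line; the function name sender_localparts, the domain argument and the sample lines are made up, and this is not the scriptery in use here.

```shell
#!/bin/sh
# sender_localparts DOMAIN   (hypothetical helper, reads log lines on stdin)
# Takes the local part of the purported sender on (BLACK) lines and
# prints it as a candidate spamtrap address in DOMAIN.
sender_localparts() {
    grep -F '(BLACK)' |
        sed -n 's/.*: <\([^@<>]*\)@[^>]*> -> <.*/\1/p' |
        awk -v d="$1" 'length($0) > 0 { print $0 "@" d }' | sort -u
}

# Made-up sample input; prints johnson@example.net
sender_localparts example.net <<'EOF'
Jan 18 03:00:00 mx spamd[1234]: (BLACK) 198.51.100.7: <johnson@sender.example> -> <trap@example.net>
Jan 18 03:00:01 mx spamd[1234]: (GREY) 192.0.2.11: <alice@sender.example> -> <bob@example.net>
EOF
```

Any output from a step like this would still need to be checked against the valid addresses in your own domains before it goes anywhere near the spamtrap list.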

The effect is clearly to be seen in the file that records the number of spamtraps added per year, updated via trivial scriptery roughly daily.

I hope this article and its addendum help inspire others in our efforts of green cybercrime prevention by giving the actually intelligent detection methods less work to do.


Addendum some more 2025-01-18

I suppose it had to happen sooner or later, but as commemorated in this toot, which said

Likely not blogworthy in itself, but #openbsd #spamd aficionados will get a light chuckle from hearing that some scraping and massaging relevant logs had the number of imaginary friends at https://nxdomain.no/~peter/traplist.shtml for our not-friends to play with roll past the one million mark in the early hours of today CET.

The recent update of https://nxdomain.no/~peter/harvesting_the_noise_revisited.html has links to more info. #spam #antispam #greytrapping #blocklists #cybercrime

Yes, that's right, after I turned to extracting vaguely relevant data from logs in order to salt the mine and poison the well further, the number of imaginary friends quickly grew past the one million mark.

And as if this particular Saturday morning was not already quite weird enough for most tastes, somebot produced another remarkable item that I just could not resist tooting about,

And ref previous toot, the 1006089th imaginary friend to join the collection at https://nxdomain.no/~peter/traplist.shtml is, mail.protection.outlook.com@bsdly.net following this sequence: https://nxdomain.no/~peter/blogpix/2025-01_18_johnson@vicglobalintelligence.com_to_mail.protection.outlook.com@bsdly.net.txt

The bots never cease to amaze #openbsd #spamd #greytrapping #antispam #cybercrime

And the two episodes combined proved addendum-worthy, at least, see https://nxdomain.no/~peter/harvesting_the_noise_revisited.html

Yes, you read that right: For reasons known only to the bots' herders (if that), the subdomain that houses mail services for a large number of Microsoft customers entered the lexicon of spammers' to: addresses. Only to be included at first sight in the herd of imaginary friends I hope will help poison the spammers' data further.

The activity here did of course not stop the bots from keeping on trying. A few minutes after the second addendum here was added and tooted out, my logs showed the following activity from the hosts involved in trying to spam mail.protection.outlook.com@bsdly.net: https://nxdomain.no/~peter/blogpix/2025-01-18_host_targeting_mail.protection.outlook.com@bsdly.net_all_spamd_log_entries.txt. And more likely than not, they will keep trying.

How was the start of your weekend?

Also worth noting is that if you do try to do this at home, please keep in mind that you will need to implement a scheme that keeps actually valid addresses in your domains out of the spamtrap pool. Otherwise regrettable episodes may arise.


Sunday, September 25, 2022

A Few of My Favorite Things About The OpenBSD Packet Filter Tools

The OpenBSD packet filter PF was introduced a little more than 20 years ago as part of OpenBSD 3.0. We'll take a short tour of PF features and tools that I have enjoyed using.



NOTE: If you are more of a slides person, the condensate for a SEMIBUG user group meeting is available here. A version without trackers but “classical” formatting is available here.

At the time the OpenBSD project introduced its new packet filter subsystem in 2001, I was nowhere near the essentially full time OpenBSD user I would soon become. I did however quickly recognize that even what was later dubbed “the working prototype” was reported to perform better in most contexts than the code it replaced.

The reason PF's predecessor needed to be replaced has been covered extensively by myself and others elsewhere, so I'll limit myself to noting that the reason was that several somebodies finally read and understood the code's license and decided that it was not in fact open source in any acceptable meaning of the term.

Anyway the initial PF release was very close in features and syntax to the code it replaced. And even at that time, the config syntax was a lot more human readable than the alternative I had been handling up to then, which was Linux' IPtables. The less is said about IPtables, the better.

But soon visible improvements in user friendliness, or at least admin friendliness, started turning up. With OpenBSD 3.2, the separate /etc/nat.conf network address translation configuration file moved to the attic and the NAT and redirection options moved into the main PF config file /etc/pf.conf.

The next version, OpenBSD 3.3, saw the ALTQ queueing configuration move into pf.conf as well, and the previously separate altq.conf file became obsolete. What did not change, however, was the syntax, which was to remain just bothersome enough that many of us put off playing with traffic shaping until some years later. Other PF news in that release included anchors, or named sub-rulesets, as well as tables, described as "a very efficient way for large address lists in rules" and the initial release of spamd(8), the spam deferral daemon.

More on all of these things later. I will not bore you with a detailed history of PF features introduced or changed in OpenBSD over the last twenty-some years.

PF Rulesets: The Basics

So how do we go about writing that perfect firewall config?

I could go on about that at length, and I have been known to on occasion, but let us start with the simplest possible, yet absolutely secure PF ruleset:

block

With that in place, you are totally secure. No traffic will pass.

Or as they say in the trade, you have virtually unplugged yourself from the rest of the world.

By way of getting ahead of ourselves, that particular ruleset will expand to the following:

block drop all

But we are getting ahead of ourselves.

To provide you with a few tools and some context, these are the basic building blocks of a PF rule:

verb criteria action ... options

Here are a few sample rules to put it into context, all lifted from configurations I have put into production:

pass in on egress proto tcp to egress port ssh

This first sample says that if a packet arrives on the egress — an interface belonging to the group of interfaces that has a default route — and that packet is a TCP packet with a destination service ssh, let the packet pass to the interfaces belonging to the egress interface group.

Yes, when you write PF rulesets, you do not necessarily need to write port numbers for services and memorize what services hide behind port 80, 53 or 443. The common or standard services are known to the rules parsing part of pfctl(8), generally with the service names you can look up in the /etc/services file.

The interface groups concept is as far as I know an OpenBSD innovation. You can put interfaces into logical groups and reference the group name in PF configurations. A few default interface groups exist without you doing anything, egress is one, another common one is wlan where all configured WiFi interfaces are members by default. Keep in mind that you can create your own interface groups — set them up using ifconfig(8) — and refer to them in your rules.

match out on egress nat-to egress

This one matches outbound traffic, again on egress (which in the simpler cases consists of one interface) and applies the nat-to action on the packets, transforming them so that the next hops all the way to the destination will see packets where the source address is equal to the egress interface's address. If your network runs IPv4 and you have only one routeable address assigned, you will more than likely have something like this configured on your Internet-facing gateway.

It is worth noting that early PF versions did not have the match verb. After a few years of PF practice, developers and practitioners alike saw the need for a way to apply actions such as nat-to or other transformations without making a decision on whether to pass or block the traffic. The match keyword arrived in OpenBSD 4.6 and in retrospect seems like a prelude to more extensive changes that followed over the next few releases.

Next up is a variation on the initial absolutely secure ruleset.

block all

I will tell you now so you will not be surprised later: If you had made a configuration with those three rules in that order, your configuration would be functionally the same as the one word one we started with. This is because in PF configurations, the rules are evaluated from top to bottom, and the last matching rule wins.

The only escape from this progression is to insert a quick modifier after the verb, as in

pass quick from (self)

which will stop evaluation when a packet matches the criteria in the quick rule. Please use sparingly if at all.
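The evaluation order can be illustrated with a toy model. To be clear, this is not PF and not PF syntax: the function name pf_toy and the stripped-down rule format (a verb, an optional quick, then a port number or any) are invented purely to show last-match-wins and the effect of quick.

```shell
#!/bin/sh
# pf_toy PORT   (toy model only, reads simplified rules on stdin)
# Each rule is "pass|block [quick] PORT|any".
pf_toy() {
    awk -v port="$1" '
        BEGIN { verdict = "pass" }     # PF passes packets that match no rule
        {
            verb = $1
            quick = ($2 == "quick")
            target = quick ? $3 : $2   # "any" or a port number
            if (target == "any" || target == port) {
                verdict = verb         # a later match overrides an earlier one
                if (quick) exit        # quick ends evaluation at this rule
            }
        }
        END { print verdict }
    '
}

printf '%s\n' 'pass 22' 'block any'       | pf_toy 22   # prints block
printf '%s\n' 'block any' 'pass 22'       | pf_toy 22   # prints pass
printf '%s\n' 'pass quick 22' 'block any' | pf_toy 22   # prints pass
```

The first two runs differ only in rule order, and the third shows how quick lets an early rule have the final word.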

There is a specific reason why PF behaves like this. The system that PF replaced in OpenBSD had the top to bottom, last match wins logic, and the developers did not want to break existing configurations too badly during the transition away from the old system.

So in practice you would put them in this order for a more functional setup,

  block all
  match out on egress nat-to egress
  pass in on egress proto tcp to egress port ssh
    

but likely supplemented by a few other items.

For those supplementing items, we can take a look at some of the PF features that can help you write readable and maintainable rulesets. And while a readable ruleset is not automatically a more secure one, readability certainly helps spot errors in your logic that could put the systems and users in your care in reach of potential threats.

To help that readability, it is important to be aware of these features:

Options: General configuration options that set the parameters for the ruleset, such as

  set limit states 100000
  set debug debug
  set loginterface dc0
  set timeout tcp.first 120 
  set timeout tcp.established 86400 
  set timeout { adaptive.start 6000, adaptive.end 12000 }
  

If the meaning of some of those does not seem terribly obvious to you at this point, that's fine. They are all extensively documented in the pf.conf man page.

Macros: Content that will expand in place, such as lists of services, interface names or other items you feel useful. Some examples along with rules that use them:

  ext_if = "kue0" 
  all_ifs = "{" $ext_if lo0 "}" 
  pass out on $ext_if from any to any 
  pass in  on $ext_if proto tcp from any to any port 25
  

Keep in mind that if your macros expand to lists of either ports or IP addresses, the macro expansion will create several rules to cover your definitions in the ruleset that is eventually loaded.

Tables: Data structures that are specifically designed to store IP addresses and networks. Originally devised to be a more efficient way to store IP addresses than macros that contained IP addresses and expanded to several rules that needed to be evaluated separately. Rules can refer to tables so the rule will match any member of the table.

  table <badhosts> persist counters file "/home/peter/badhosts"
  # ...
  block from <badhosts>
      

Here the table is loaded from a file. You can also initialize a table in pf.conf itself, and you can even manipulate table contents from the command line without reloading the rules:

$ doas pfctl -t badhosts -T add 192.0.2.11 2001:db8::dead:beef:baad:f00d
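If the table is maintained as a file, a small wrapper around pfctl's replace command can bring the loaded table in line with the file without reloading the ruleset. This is a sketch under assumptions: the function name sync_table and the PFCTL override variable are made up for illustration, and the function needs to run with root privileges on a system with PF.

```shell
#!/bin/sh
# PFCTL can be pointed at another command for dry runs (assumption)
PFCTL=${PFCTL:-pfctl}

# sync_table TABLE FILE   (hypothetical helper)
# Makes the contents of TABLE equal to the addresses listed in FILE.
sync_table() {
    [ -r "$2" ] || { echo "cannot read $2" >&2; return 1; }
    "$PFCTL" -t "$1" -T replace -f "$2"
}

# Example call, to be run as root:
#   sync_table badhosts /home/peter/badhosts
```

The file has one address or network per line, and lines starting with # are treated as comments.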

In addition, several of the daemons in the OpenBSD base system such as spamd, bgpd and dhcpd can be set up to interact with your PF rules.

Rules: The rules with the verbs, criteria and actions that determine how your system handles network traffic.

A very simple and reasonable baseline is one that blocks all incoming traffic but allows all traffic initiated on the local system:

  block
  pass from (self)
      

The pass rule lets our traffic pass to elsewhere, and since PF is a stateful firewall by default, return traffic for the connections the local system sends out will be allowed back.

You probably noticed the configuration here references something called (self).

The string self is a default macro which expands to all configured local interfaces on the host. Here, self is set inside parentheses () which indicates that one or more of the interfaces in self may have dynamically allocated addresses and that PF will detect any changes in the configured interface IP addresses.

This exact ruleset expanded to this on my laptop in my home network at one point:

 $ doas pfctl -vnf /etc/pf.conf
   block drop all
   pass inet6 from ::1 to any flags S/SA
   pass on lo0 inet6 from fe80::1 to any flags S/SA
   pass on iwm0 inet6 from fe80::a2a8:cdff:fe63:abb9 to any flags S/SA
   pass inet6 from 2001:470:28:658:a2a8:cdff:fe63:abb9 to any flags S/SA
   pass inet6 from 2001:470:28:658:8c43:4c81:e110:9d83 to any flags S/SA
   pass inet from 127.0.0.1 to any flags S/SA
   pass inet from 192.168.103.126 to any flags S/SA

The pfctl command here says to verbosely parse but do not load rules from the file /etc/pf.conf.

This shows what the loaded ruleset will be, after any macro expansions or optimizations.

For that exact reason, it is strongly recommended to review the output of pfctl -vnf on any configuration you write before loading it as your running configuration.
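Part of that routine can be automated by refusing to load a file that does not parse. A minimal sketch, with the function name load_checked and the PFCTL override variable being assumptions for illustration; it needs root privileges on a system with PF.

```shell
#!/bin/sh
# PFCTL can be pointed at another command for dry runs (assumption)
PFCTL=${PFCTL:-pfctl}

# load_checked FILE   (hypothetical helper)
# Parses FILE without loading it, and loads it only if parsing succeeded.
load_checked() {
    "$PFCTL" -nf "$1" || { echo "not loading $1: parse failed" >&2; return 1; }
    "$PFCTL" -f "$1"
}

# Example call, to be run as root:
#   load_checked /etc/pf.conf
```

A check like this only catches syntax errors. It says nothing about whether the rules do what you intended, so reading the expanded output of pfctl -vnf is still the part that matters.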

If you look closely at that command output, you will see both the inet and inet6 keywords. These designate IPv4 and IPv6 addresses respectively. PF since the earliest days has supported both, and if you do not specify which address family your rule applies to, it will apply to both.

But this has all been on a boring single host configuration. In my experience, the more interesting settings for PF use are those where the configuration is for a host that handles traffic for other hosts, as a gateway or other intermediate host.

To forward traffic to and from other hosts, you need to enable forwarding. You can do that from the command line:

 # sysctl net.inet.ip.forwarding=1 
 # sysctl net.inet6.ip6.forwarding=1
	

But you will want to make the change permanent by putting the following lines in your /etc/sysctl.conf so the change survives reboots.

  net.inet.ip.forwarding=1 
  net.inet6.ip6.forwarding=1
	

With these settings in place, a configuration (/etc/pf.conf) like this might make sense if your system has two network interfaces that are both of the bge kind:

  ext_if=bge0
  int_if=bge1
  client_out = "{ ftp-data ftp ssh domain pop3, imaps nntp https }"
  udp_services = "{ domain ntp }"
  icmp_types = "{ echoreq unreach }"
  match out on egress inet nat-to ($ext_if)
  block
  pass inet proto icmp all icmp-type $icmp_types keep state
  pass quick proto { tcp, udp } to port $udp_services keep state
  pass proto tcp from $int_if:network to port $client_out
  pass proto tcp to self port ssh
	

Your network likely differs in one or more ways from this example. See the references at the end for a more thorough treatment of all these options.

And once again, please do use the readability features of the PF syntax to keep you sane and safe.

A Configuration That Learns From Network Traffic Seen and Adapts To Conditions

With PF, you can create a network that learns. Fairly early in PF's history it occurred to the developers that the network stack collects and keeps track of information about the traffic it sees, which could then be acted upon if the software became able to actively monitor the data and act on specified changes. So the state tracking options entered the pf.conf repertoire in their initial form with the OpenBSD 3.7 release.

A common use case: when you run an SSH service, or really any kind of listening service with the option to log in, you will see some number of failed authentication attempts that generate noise in the logs. The password guessing, or as some of us say, password groping, can turn out to be pretty annoying even if the miscreants do not actually manage to compromise any of your systems. So to eliminate noise in our logs we turn to the data that is available anyway in the state table, which tracks the state of active connections, and act on limits you define, such as the number of connections from a single host over a set number of seconds.

The action could be to add the source IP that tripped the limit to a table. Additional rules could then subject the members of that table to special treatment. Since that time, my internet-facing rule sets have tended to include variations on

  table <bruteforce> persist
  block quick from <bruteforce>
  pass inet proto tcp from any to $localnet port $tcp_services \
        flags S/SA keep state \
	(max-src-conn 100, max-src-conn-rate 15/5, \
         overload <bruteforce> flush global)
	

which means that any host that tries more than 100 simultaneous connections or more than 15 new connections over 5 seconds is added to the table and blocked, with any existing connections terminated.
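To make the rate limit a little more concrete, here is a toy model in Python of what a rule with max-src-conn-rate 15/5 and an overload table does. This is only a sketch for illustration: the class and method names are made up, the sliding window is a simplification, and PF's own in-kernel bookkeeping works differently.

```python
from collections import defaultdict, deque


class OverloadTracker:
    """Toy model of max-src-conn-rate N/SECONDS with an overload table."""

    def __init__(self, max_conn_rate=15, window_seconds=5):
        self.max_conn_rate = max_conn_rate
        self.window_seconds = window_seconds
        self.recent = defaultdict(deque)   # source address -> times of recent new connections
        self.bruteforce = set()            # stands in for the <bruteforce> table

    def new_connection(self, src, now):
        """Return True if the connection passes, False if the source is blocked."""
        if src in self.bruteforce:
            return False                   # corresponds to: block quick from <bruteforce>
        stamps = self.recent[src]
        stamps.append(now)
        # Forget connection attempts that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) > self.max_conn_rate:
            self.bruteforce.add(src)       # corresponds to: overload <bruteforce>
            del self.recent[src]           # drop our bookkeeping; PF would also kill the host's states
            return False
        return True


tracker = OverloadTracker()
# A noisy host opens 20 connections within two seconds.
results = [tracker.new_connection("192.0.2.11", now=i * 0.1) for i in range(20)]
# A well-behaved host opens one connection every ten seconds.
polite = [tracker.new_connection("198.51.100.7", now=i * 10.0) for i in range(20)]
print(results.count(True), "192.0.2.11" in tracker.bruteforce, all(polite))
```

In this model the noisy host gets its first 15 connections through and is then shut out, while the polite host never comes near the limit.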

It is a good practice to let table entries in such setups expire eventually. How long entries stay is entirely up to you.

At first I set expiry at 24 hours, but with password gropers like those caught by this rule being what they are, I switched a few years ago to four weeks, then upped it again a few months later to six weeks. Groperbots tend to stay broken for that long. And since they target any service you may be running, state tracking options with overload tables can be useful in a lot of non-SSH contexts as well.
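One common way to expire table entries is to run pfctl with -T expire and an age in seconds at regular intervals, for example from cron with root privileges. The small Python helper below is a hypothetical convenience, not part of any PF tooling; it only does the arithmetic and builds the argument list, and does not run pfctl itself.

```python
def expire_command(table, weeks=0, days=0, hours=0):
    """Build a pfctl argument list that expires old entries from a table."""
    seconds = ((weeks * 7 + days) * 24 + hours) * 3600
    if seconds <= 0:
        raise ValueError("expiry age must be positive")
    return ["pfctl", "-t", table, "-T", "expire", str(seconds)]


# The 24 hour and six week expiry times mentioned above.
print(" ".join(expire_command("bruteforce", hours=24)))
print(" ".join(expire_command("bruteforce", weeks=6)))
```

The two lines printed are the commands for 86400 and 3628800 seconds respectively.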

A point that observers often miss is that with this configuration, you have a firewall that learns from the traffic it sees and adapts to network conditions.

It is also worth noting that state tracking actions can be applied to all TCP traffic and that they can be useful for essentially all services.

The buzzwordability potential in the learning configurations is enormous, and I for one fail to see how the big names have failed to copy or imitate this feature and greytrapping which we will look at later, and capitalize on products with those features.

The article Forcing the password gropers through a smaller hole with OpenBSD's PF queues has a few suggestions on how to handle noise sources with various other services. More on queues in a few moments.

The Adaptive Firewall and the Greytrapping Game

At the risk of showing my age, I must admit that I have more or less always run a mail service. Once TCP/IP networking became available in some form for even small businesses and individuals during the early 1990s, once you were connected, it was simply one of those things you would do. Setting up an SMTP service (initially wrestling with sendmail and its legendary sendmail.cf configuration file) with accompanying pop3 and/or imap service was the done thing.

Over time the choice of mail server software changed, and we introduced content filtering to beat the rise of the trashy, scammy spam mail and, since the majority of clients ran that operating system, mail-borne malware. But even with state of the art content filtering some unwanted messages would make it into users' inboxes often enough to be annoying.

So when OpenBSD 3.3 shipped with the initial version of spamd it was quite a relief for people of my job category, even if that only would load lists of known bad senders' IP addresses and stutter at them one byte per second until the other side gave up.

Later versions introduced greylisting — answering SMTP connections from previously unknown senders with a temporary local error code and only accepting delivery if the same host tried again — which reduced the load on the content filtering machines significantly, and the real fun started with the introduction of greytrapping in the version of spamd(8) that shipped with OpenBSD 3.7.

Greytrapping is yet another adaptive or learning feature. The system identifies bad actors by comparing the destination email address in incoming SMTP traffic from unknown or already greylisted hosts with a list of known invalid addresses in the domains the site serves. The spamdb(8) command was extended with options to add addresses to and delete addresses from the spamtrap list.
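The decision logic behind greylisting and greytrapping can be sketched in a few lines of Python. This is a deliberately simplified model, not how spamd(8) is implemented: the class name, the return values and the pass_time value are illustrative assumptions, and expiry of grey, white and trapped entries is left out entirely.

```python
class GreylistSketch:
    """Toy greylisting and greytrapping decision logic (not spamd's real code)."""

    def __init__(self, spamtraps, pass_time=25 * 60):
        self.spamtraps = set(spamtraps)   # known invalid addresses in our domains
        self.pass_time = pass_time        # seconds a sender must wait before a retry counts
        self.grey = {}                    # (ip, sender, recipient) -> time first seen
        self.white = set()                # IP addresses that retried properly
        self.trapped = set()              # IP addresses that tried to mail a spamtrap

    def rcpt(self, ip, sender, recipient, now):
        """Return 'accept', 'tempfail' or 'trapped' for one delivery attempt."""
        if ip in self.trapped:
            return "trapped"
        if ip in self.white:
            return "accept"
        if recipient in self.spamtraps:
            self.trapped.add(ip)          # unknown or greylisted host hit a spamtrap
            return "trapped"
        key = (ip, sender, recipient)
        first_seen = self.grey.setdefault(key, now)
        if now - first_seen >= self.pass_time:
            self.white.add(ip)            # same tuple retried after the waiting period
            return "accept"
        return "tempfail"                 # temporary local error, try again later


g = GreylistSketch(spamtraps={"no.such.user@example.com"})
first = g.rcpt("192.0.2.25", "a@example.net", "real.user@example.com", now=0)
retry = g.rcpt("192.0.2.25", "a@example.net", "real.user@example.com", now=30 * 60)
spam = g.rcpt("198.51.100.99", "x@example.org", "no.such.user@example.com", now=10)
print(first, retry, spam)
```

A well-behaved sender is told to come back later and is accepted on its retry, while a host that tries to deliver to a spamtrap address is trapped on the spot.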

Greytrapping was an extremely welcome new feature, and I adopted it eagerly. Soon after the feature became available, I set up for greytrapping. The spamtrap addresses were initially ones I fished out of my mail server logs, from entries produced by bounce messages that themselves turned out to be undeliverable at our end since the recipient did not exist. After a few weeks I started publishing both the list of spamtraps and an hourly dump of currently trapped IP addresses.

The setup is amazingly easy. On a typical gateway in front of a mail server you instrument your /etc/pf.conf with a few lines, usually at the top,

  table <spamd-white> persist
  table <nospamd> persist file "/etc/mail/nospamd"
  pass in on egress proto tcp to any port smtp \
        divert-to 127.0.0.1 port spamd
  pass in on egress proto tcp from <nospamd> to any port smtp
  pass in log on egress proto tcp from <spamd-white> to any port smtp
  pass out log on egress proto tcp to any port smtp
    

Here we even suck in a file that contains the IP addresses of hosts that should not be subjected to the spamd treatment.

In addition you will need to set up with the correct options for spamd(8) and spamdlogd(8) in your /etc/rc.conf.local:

  spamd_flags="-v -G 2:8:864 -n "mailwalla 17.25" -c 1200 -C /etc/mail/fullchain.pem -K /etc/mail/privkey.pem -w 1 -y em1 -Y em1 -Y 158.36.191.225"
  spamdlogd_flags="-i em1 -Y 158.36.191.225"
      

The IP address here designates a sync partner; check out the spamd(8) man page for the other options. If you're interested, you can get the gory details of running a setup with several mail exchangers in the In The Name Of Sane Email: Setting Up OpenBSD's spamd(8) With Secondary MXes In Play - A Full Recipe article.

You probably do not need to edit the configuration file /etc/mail/spamd.conf much, but do look up the man page and possibly references to the bsdly.net blocklist. Finally, reload your PF configuration, start the daemons spamd(8) and spamdlogd(8) using rcctl, and set up a crontab(5) line to run spamd-setup(8) at reasonable intervals to fetch updated blocklists.

The number of trapped addresses in the hourly dump has ranged from a few hundred in the earliest days to thousands later on, and at times even hundreds of thousands. For the last couple of years the number has generally been in the mid to low four digits, with each host typically hanging around longer to try delivery to an ever expanding number of invalid addresses in their database.

Just a few weeks ago, the list of “imaginary friends” rolled past 300,000 entries. The article The Things Spammers Believe - A Tale of 300,000 Imaginary Friends tells the story with copious links to earlier articles and other resources, while Maintaining A Publicly Available Blacklist - Mechanisms And Principles details the work involved in maintaining a blocklist that is offered to the public.

It's been good fun, with a liberal helping of bizarre as the number of spamtraps grew, sometimes with truly weird contents.

Traffic Shaping You Can Actually Understand

You've heard it before: Traffic shaping is hard. Hard to do and hard to understand.

Traditionally traffic shaping was available on all BSDs in the form of ALTQ, a codebase that its developers labeled experimental and that contained implementations of several different traffic shaping algorithms. One central problem was that the configuration syntax was inelegant at best, even after the system was merged into the PF configuration.

In OpenBSD, which runs development on a strict six month release cycle, the code that would eventually replace ALTQ was introduced gradually over several releases.

The first feature to be introduced was always-on, settable priorities with the keyword prio.

A random example shows that this configuration prioritises ssh traffic above most others (the default is 3):

pass proto tcp to port ssh set prio 6

While this configuration makes an attempt at speeding up TCP traffic by assigning a higher priority to lowdelay packets, typically ACKs:

  match out on $ext_if proto tcp from $ext_if set prio (3, 7)
  match in  on $ext_if proto tcp to $ext_if set prio (3, 7)
	

Next up, the newqueue code did away with the multiple algorithms approach and settled on the Hierarchical Fair Service Curve (HFSC) as the most flexible option, one that would even make it possible to emulate or imitate the alternative shaping algorithms from the ALTQ experiment.

HFSC queues are defined on an interface with a hierarchy of child queues, where only the “leaf” queues can be assigned traffic. We take a look at a static allocation first:

  queue main on $ext_if bandwidth 20M
    queue defq parent main bandwidth 3600K default
    queue ftp parent main bandwidth 2000K
    queue udp parent main bandwidth 6000K
    queue web parent main bandwidth 4000K
    queue ssh parent main bandwidth 4000K
      queue ssh_interactive parent ssh bandwidth 800K
      queue ssh_bulk parent ssh bandwidth 3200K
    queue icmp parent main bandwidth 400K
  

You then tie in the queue assignment, here with match rules

  match log quick on $ext_if proto tcp to port ssh \
        queue (ssh_bulk, ssh_interactive)
  match in quick on $ext_if proto tcp to port ftp queue ftp
  match in quick on $ext_if proto tcp to port www queue web
  match out on $ext_if proto udp queue udp
  match out on $ext_if proto icmp queue icmp
  

which is definitely the way to add queueing to an existing configuration, and in my view also a good practice for configuration structure reasons. But you can also tack on queue this_or_that_queue at the end of pass rules.
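When writing a static allocation like the one above, it can be useful to check that the child queues add up to what their parent offers. The Python sketch below does that arithmetic for the figures in the static example, assuming that 20M means 20000K (decimal multipliers). The dictionary layout and function name are made up for the illustration.

```python
# Bandwidth figures in kilobits per second, copied from the static example.
queues = {
    "main": {"parent": None, "kbit": 20_000},   # 20M, assuming M means 1000K
    "defq": {"parent": "main", "kbit": 3600},
    "ftp": {"parent": "main", "kbit": 2000},
    "udp": {"parent": "main", "kbit": 6000},
    "web": {"parent": "main", "kbit": 4000},
    "ssh": {"parent": "main", "kbit": 4000},
    "ssh_interactive": {"parent": "ssh", "kbit": 800},
    "ssh_bulk": {"parent": "ssh", "kbit": 3200},
    "icmp": {"parent": "main", "kbit": 400},
}


def child_totals(queues):
    """Sum the bandwidth of the direct children of every parent queue."""
    totals = {}
    for q in queues.values():
        if q["parent"] is not None:
            totals[q["parent"]] = totals.get(q["parent"], 0) + q["kbit"]
    return totals


totals = child_totals(queues)
for parent, total in totals.items():
    print(f"{parent}: children use {total}K of {queues[parent]['kbit']}K")
```

Here the children of main add up to exactly 20000K and the two ssh subqueues to exactly 4000K, so the static example allocates all of the available bandwidth and no more.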

There are two often forgotten facts about HFSC traffic shaping I would like to mention:

  1. Traffic shaping is more often than not a matter of prioritizing which traffic you drop packets for, and
  2. No shaping at all takes place before the traffic volume approaches one or more of the limits set by the queue definitions.

One of the beautiful things about modern HFSC queueing is that you can build in flexibility, like this:

  queue rootq on $ext_if bandwidth 20M
    queue main parent rootq bandwidth 20479K min 1M max 20479K qlimit 100
    queue qdef parent main bandwidth 9600K min 6000K max 18M default
    queue qweb parent main bandwidth 9600K min 6000K max 18M
    queue qpri parent main bandwidth 700K min 100K max 1200K
    queue qdns parent main bandwidth 200K min 12K burst 600K for 3000ms
    queue spamd parent rootq bandwidth 1K min 0K max 1K qlimit 300
  
The min and max values are core to that flexibility. Subordinate queues can 'borrow' bandwidth up to their own max values within the allocation of the parent queue. The combined max queue bandwidth can exceed the root queue's bandwidth and still be valid. However the allocation will always top out at the allocated or the actual physical limits of the interface the queue is configured on.

For bursty services such as DNS in our example you can allow burst for a specified time where the allocation can exceed the queue's max value, still within the limits set on the parent queue.

Finally, the qlimit sets the size of the queue's holding buffer. A larger buffer may lead to delays, since packets may be kept longer in the buffer before being sent on their way out to the world.
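A bit of back-of-the-envelope arithmetic shows how buffer size and bandwidth combine into delay. The Python sketch below estimates how long it takes to drain a completely full queue, assuming that qlimit counts packets, that every packet is 1500 bytes and that K means 1000 bits per second. The function name and the packet size are assumptions made for the illustration.

```python
def worst_case_delay(qlimit_packets, bandwidth_bits_per_s, packet_bytes=1500):
    """Seconds needed to drain a completely full queue at the given rate."""
    return qlimit_packets * packet_bytes * 8 / bandwidth_bits_per_s


main_delay = worst_case_delay(100, 20_479_000)   # queue main: qlimit 100, 20479K
spamd_delay = worst_case_delay(300, 1_000)       # queue spamd: qlimit 300, 1K
print(f"main:  {main_delay * 1000:.1f} ms")
print(f"spamd: {spamd_delay / 60:.0f} minutes")
```

Under these assumptions a full main queue drains in well under a tenth of a second, while a full spamd queue would need about an hour, which is rather the point of that queue.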

And if you noticed the name of that final, tiny queue, you probably have guessed correctly what it was for. The traffic from hosts that were caught in the spamd net was really horrible, as this systat queues display shows:

 1 users Load 2.56 2.27 2.28                                      skapet.bsdly.net 20:55:50
 QUEUE                BW SCH  PRI    PKTS   BYTES   DROP_P   DROP_B QLEN BOR SUS  P/S   B/S
 rootq on bge0       20M                0       0        0        0    0            0     0
  main               20M                0       0        0        0    0            0     0
   qdef               9M          6416363   2338M      136    15371    0          462 30733
   qweb               9M           431590 144565K        0        0    0          0.6   480
   qpri               2M          2854556 181684K        5      390    0           79  5243
   qdns             100K           802874  68379K        0        0    0          0.6    52
  spamd               1K           596022  36021K  1177533 72871514  299            2   136
	    

It was good, clean fun. And that display did give me a feeling of Mission accomplished.

There are several other tools in the PF toolset such as carp(4) based redundancy for highly available service, relayd(8) for load balancing, application delivery and general network trickery, PF logs and the fact that tcpdump(8) is your friend, and several others that I have enjoyed using but I decided to skip since this was supposed to be a user group talk and a somewhat dense article.

I would encourage you to explore those topics further via the literature listed under the Resources heading.

Who Else Uses PF Today?

PF originated in OpenBSD, but word of the new subsystem reached other projects quickly and there was considerable interest from the very start.  Over the years, PF has been ported from the original OpenBSD to the other BSDs and a few other systems, including

Other than Oracle with their port to Solaris, most ports of the PF subsystem happened before the OpenBSD 4.7 NAT rewrite, and for that reason they have kept the previous syntax intact.

There may very well be others. There is no duty to actually advertise the fact that you have incorporated BSD licensed code in your product.

If you find other products using PF or other OpenBSD code in the wild, I am interested in hearing from you about it. Please comment or send email to nix at nxdomain dot no.

Resources for Further Exploration

The PF User's Guide

The Book of PF by Peter N. M. Hansteen

Absolute OpenBSD by Michael Lucas

Network Management with the OpenBSD Packet Filter toolset, by Peter N. M. Hansteen, Massimiliano Stucchi and Tom Smyth (A PF tutorial, this is the BSDCan 2024 edition). An earlier, even more extensive set of slides can be found in the 2016-vintage PF tutorial.

That Grumpy BSD Guy Blog posts by Peter N. M. Hansteen

OpenBSD Journal News items about OpenBSD, generally short with references to material elsewhere.

Wednesday, September 14, 2022

Open Source in Enterprise Environments - Where Are We Now and What Is Our Way Forward?

We have been used to hearing that free and open source software and enterprise environments in Big Business are fundamentally opposed and do not mix well. Is that actually the case, or should we rather explore how business and free software can both benefit going forward?

Puffy, the OpenBSD mascot, shiny version

Free and Open Source vs Enterprise and Business: The Bad Old Days

Open source, free software and enterprise IT environments have all been around for quite a while. I'm old enough to remember when the general perception was that the free exchange of source code was merely a game for amateurs, or at best an academic exercise. In contrast, the proper business way of doing things was to perhaps learn general principles and ideas from the academics, but real products for business use would be built to be sold as binary only, with any source code to be kept locked away and secret.

Note: This piece is also available without trackers but with more basic formatting here.

If you're a little younger you may remember a time when “Windows NT is the future” was essentially gospel and all the business pundits were saying we would be seeing the last of Unix and mainframes both within only a handful of years.

Thinking back to the late 1980s and early 1990s it is hard to imagine now how clear the consensus seemed to be on the issue at that point. The PC architecture and a few other proprietary technologies were the way of business and the way forward.

No discussion or dissent seemed possible.

Then, The Internet Happened

Then the Internet happened. What few people outside some inner circles were aware of was that what actually made the Net work was code that came directly out of the Berkeley Software Distribution. BSD Unix, or simply BSD for short, was a freely licensed operating system that was the result of a rather informal cooperation of researchers in academia and business alike, originally derived from Unix source code.

When the United States Department of Defense wanted work done on resilient, device independent, distributed and autoconfiguring networks, the task of supplying the reference implementation for the TCP/IP stack, based on a stream of specifications dubbed Requests for Comments or RFCs, fell to the international group of developers coordinated by the Computer Systems Research Group at the University of California's Berkeley campus. In short, the Internet came from BSD, which thanks to a decision made by the Regents of the University of California, was freely licensed.

The BSD sourced TCP/IP stack was part of all Internet capable systems until around the turn of the century, when Linux developers and later Microsoft started working on their own independent implementations. By that time it had been forcefully demonstrated to the developer community at least that open source code was indeed capable of scaling to industrial scale and beyond.

Due to a handful of accidents of history, mainly involving imperfect communications between groups of developers and combined with a somewhat misguided lawsuit involving the BSD code, it was Linux that became the household term for free software in general and for the re-emergence of Unix-like systems in the Internet connected server market space. Linux distributions came with a largely GNU userland as well as generous helpings of BSD code.

At roughly the same time Linux emerged, the BSD code became generally available via the FreeBSD and NetBSD projects, and soon after the OpenBSD project, which forked from the NetBSD code base in the mid 1990s. For a more detailed history of these developments, see the three part series on the APNIC blog starting with this piece. If that piqued your interest, you may enjoy this piece about some incremental improvements over time in OpenBSD.

The War on Linux and the Proliferation of Open Source Tools

During the 1990s and early 2000s the Internet and services of all kinds that ran on top of it expanded in all directions. That expansion had the effect of advancing the free unixlike systems such as Linux and the BSDs, which would run quite comfortably on commonly available hardware, along with an ever expanding number of development tools and software of all kinds to new categories of users.

The success of the open source software led to what would be dubbed The War on Linux, a rather vicious defamation campaign executed in both PR campaigns and lawsuits, and driven mainly by the then-dominant desktop software vendor's ambition to dominate server space as well. One of the more bizarre sequences of Linux-targeting lawsuits was run by proxy, and is extensively documented at groklaw.net (Note: http-only site). It is worth noting that the process eventually led to bankruptcy for the litigant.

Over the years it became clear to essentially everyone in the industry that open source tools were essential to development, and several practical aspects of developer life led to ever increasing open source use. During the time of The War on Linux, the likes of Apple, Cisco, Netscaler (later acquired by Citrix) and Sun Microsystems (later acquired by Oracle) either incorporated open source code in their products and workflows, open sourced large parts of their own code or forked freely available code to base proprietary systems on. It may be worth discussing each of these approaches in detail later.

On to the Present: We All Use...

Fast forward to the present day, and I recently had colleagues sum up that in the enterprise environments we move in,

Software is developed on Macs,
deployed on a cloud somewhere,
which more likely than not runs on Linux.

And the software itself is likely built with open source tools and pulls in dependencies from open source projects, possibly hosted on Github or other public sites.

Your software in all probability uses some open source. And even if you are not a developer, you most likely use open source tools that are integrated in your operating system or common application software or web services.

On the client side of things, an ever increasing part of the volume comes from smartphones, tablets and the like, where the market share for open source based systems (Android and iOS) exceeds 90 percent. In a document we will come back to later, the Norwegian National Security Authority (NSM) estimates that approximately 90 to 98 percent of all software in use to some extent has dependencies on open source software. Other relevant statistics can be found here, here and here. Or, if you're in a bit of a hurry: It is estimated that some 3.1 billion Linux-based Android phones are currently in use. In addition, there is Apple, which we know has a significant amount of BSD code in their software.

It is of course worth noting that by now even the old open source arch-enemy Microsoft ships their offerings with what amounts to an almost complete Linux distribution as a subsystem. The same company regularly lobs cash over the wall to the likes of The OpenBSD Foundation and regularly contributes to other open source projects. Not to mention that much of what runs in their Azure cloud is one way or the other Linux based.

Security: QA Your Supply Chain, Exercise the Right to Repair

Back in the days of The War on Linux, and to some extent still, we have often been faced with claims that open source software could either never be as secure as proprietary software or that open source software was inherently more secure than the closed source kind, because "given enough eyes, all bugs are shallow".

Both assertions fail because even without access to source code, it is possible to probe running software for vulnerabilities, and on the other hand the shallowness of bugs depends critically on the eyes looking being attached to people with sufficient competence in the field.

The public reactions to a couple of security incidents during recent years that generated a flurry of largely uninformed punditry are worth revisiting for the lessons that can be learned.

The Solarwinds supply chain incident aka SUNBURST (2020) - One of the most widely publicized yet mostly quite poorly understood security incidents in recent years emerged when it was revealed that adversaries unknown had been able to compromise the build computers where the binaries for Solarwinds' widely used network management software were built for distribution.

The SANS institute has produced a fairly thorough writeup of the incident, which breaks down as follows: The first stage of a multi-stage compromise kit was included in binary distribution packages, complete with authentic signatures from the build system, that were largely put directly into production environments by network admins everywhere. The malware then went on to explore the networks it landed in, and through a process that made heavy use of crafted DNS queries and other non-obvious techniques, the miscreants were able to compromise several high security government and enterprise networks.

Several open source component supply chain incidents (2020 onwards) - Soon after the SUNBURST incident several incidents occurred where popular open source components that other systems pulled in as dependencies started malfunctioning or were suddenly unavailable, causing complete malfunctions or loss of functionality such as a web service suddenly refusing to interact with specific networks.

The sudden breakage in open source components caused quite a bit of uproar, and predictably the chattering subset of the consulting class set about churning out dire warnings about the risk of using open source of any kind.

Watching from the sidelines it struck many open source oriented professionals, myself included, that the combination of these incidents carries an important lesson. It is obvious that in a modern environment we suck in upgrades automatically and frequently, and that no untested code should ever be deployed directly to production.

Blind trust versus the right to read (and educate yourself) and the right to repair - In the case of proprietary, binary-only software, you have no choice but to trust your supplier and that they will address any defects in a timely manner. The upshot is that with proprietary, binary-only software you do not have access to two important features of open source software: The right to read and study the code, and the right to repair any defects you find, potentially saving yourself service shutdowns or workarounds while the secret parts of your system get fixed elsewhere.

The lesson to be learned is that you need to run quality assurance on your supply chain. You may choose to trust, but you still need to verify. That goes for open source and proprietary software both.

This Norwegian felt slightly elated when reading that the Norwegian National Security Authority (NSM) provides essentially the same assessments in their published recommendations.

Contributing - Cooperating on Maintenance

As with any product it is entirely possible to be a relatively passive consumer, just install and use, and build whatever you need on top, interacting with the community only via downloading as needed from the mirror sites. Communicating via online forums, mailing lists or other channels is entirely optional.

If you are a developer or integrator with an ambition to make one or more open source products central to your business either by using and contributing to an existing project or starting a new one, several approaches are possible.

Let's take a look at the strategies some big names adopted on open source in their products:

Grab and fork, sell hardware: The Netscaler load balancer and application delivery products were based on a fork of FreeBSD.

They appear to have rewritten large parts of the network stack and devised a multifunctional network product on top, which among other things features a slick web GUI for most if not all admin tasks.

If you look closely, Netscaler (since acquired and rebranded by Citrix) appear to cultivate a menagerie of open source projects to interface with their products.

However they appear not to be in particularly close contact with their main upstream. (It is worth noting that the BSD license does not require publishing changes to the code base.) When dropping to a shell on a Netscaler unit, last time I looked the output of uname -a seemed to indicate that their kernel was still based on FreeBSD 8.4, which the FreeBSD web site lists as End of Life by August 1, 2015.

Grab and fork, sell hardware, keep sync with your upstream: Starting with the initial release of macOS, Apple have maintained the software that drives their various devices, from phones to desktop computers and related services with generous helpings of open source code, along with what appears to be a general willingness to publish code and interact with upstream projects such as the FreeBSD project. Apple maintains the Open Source at Apple site for easy access to the open source components of their offerings.

This mode of open source interaction seems to be rather common, especially among network oriented suppliers of various specialty gear.

Open source everything, sell support: Despite early scepticism from business circles, several companies have built successful businesses on the model of participating in or even driving the development of open source systems or components, making support contracts (which may include early or privileged access to updates) as well as consulting services the main or sole source of company revenue.

Decide what code is both good enough to publish and useful elsewhere: Finally, for those of us in the services or consulting business who will occasionally write code that is not necessarily business specific, the reasonable middle ground is just that. Identify code that meets the following criteria:

  1. Was developed by yourself and cleared by your organization and other stakeholders such as your customer as such
  2. Is high enough quality that you dare show it to others
  3. Does not reveal core aspects of your clients' business
  4. Is likely to be useful elsewhere too
  5. Would be nice to have exposed to other sets of eyes in order to identify bugs and fix them

If you have code under your care in your organization that meets those criteria, you should in my opinion be seriously considering making that code open source.

Your next adventure will then be to pick an appropriate license.

Now for Policies and Processes - Do You Have Them?

If you have followed on this far, you probably caught on to the notion that it is wise to set up clear policies and procedures for handling code, open source or otherwise.

Keep in mind that

A license is an assertion of authority. A license is a creator's message to the world that states the conditions others must abide by when using or, if the creator allows it, changing and further developing the code.

Without a license the default regime is that only the person or persons who originated the code have the right to make changes or for that matter make further copies for redistribution.

For that reason it is important to ensure that every element of your project has a known copyright and license.

There have been quite a few instances of free software projects rewriting functionally equivalent, or hopefully better, versions of whole subsystems because of unacceptable or unclear licenses (see the OpenBSD articles in the Resources section for some examples).
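One way to check that every element of a project carries a known license is to scan the source files for a machine-readable license tag. The following is a minimal sketch, assuming the project marks files with `SPDX-License-Identifier:` comments near the top of each file. That convention is not mentioned in this article and is only one of several ways to record licensing. The function names, the list of file suffixes and the 20-line search window are hypothetical choices made for illustration.

```python
import re
from pathlib import Path

# Assumption: files declare their license with a comment such as
# "SPDX-License-Identifier: BSD-2-Clause" near the top of the file.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*(\S+)")


def license_inventory(root, suffixes=(".py", ".c", ".h", ".sh"), head_lines=20):
    """Map each source file under root to its SPDX identifier, or None if no tag is found."""
    root = Path(root)
    inventory = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            # Only look at the first few lines, where license headers normally live.
            head = "".join(line for _, line in zip(range(head_lines), handle))
        match = SPDX_TAG.search(head)
        inventory[path.relative_to(root).as_posix()] = match.group(1) if match else None
    return inventory


def unlicensed(inventory):
    """Return the files for which no license tag was found, sorted by name."""
    return sorted(name for name, ident in inventory.items() if ident is None)
```

Calling `unlicensed(license_inventory("path/to/project"))` lists the files that still need attention. A missing tag only means that the file needs a closer look by a human. It says nothing about who actually holds the copyright.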

Procedures and policies, you need them. A self employed developer working on their own project is usually free to choose whatever license they please. In a corporate environment, any code developed is likely tied to a contract of some sort, which may or may not set the parameters of who holds the copyright or what licenses may be acceptable. The exact parameters of what can be decided by contract and what follows from copyright law may vary according to what jurisdiction you are in. When considering whether to publish your own code under an open source license, make sure all stakeholders (and certainly any parties to any relevant contract) agree on the policies and procedures.

Keep it simple, for your own sake. There are supposedly several hundred licenses in existence that the Open Source Initiative considers to be open source. In the interest of making life easier for anyone who would be interested in working on your code, please consider adopting one of those well-known licenses.

They range from the simplest, BSD or MIT style ones that run a handful of sentences and can be condensed to "you can do whatever you like with this material except to claim that you made it all yourself", to elaborate documents (the GNU GPL v3 comes to mind) which set out detailed terms and conditions, may require republication of any changes under the same terms, and could set up a specific regime with respect to patent disputes.

It is also important to consider that components you use in your project may have specific license requirements and that different licenses may contain terms that make the licenses incompatible in practice.

My general advice here is, make it as simple as possible, but no simpler.

Or to rephrase slightly: The general advice for dealing with licenses echoes that of dealing with crypto code: Do not set out writing your own unless you know exactly what you are doing. Avoid that path if at all possible.

When in need, call in Legal (but make sure they understand the issues). Lawyers endure a lengthy education in order to pass the bar and turn to practicing law, but there is no guarantee that a person well versed in other business legalese has any competence at all when it comes to matters of copyright law. When you do turn to Legal for help, be very exacting and stern in insisting that they demonstrate a command of copyright basics and if at all possible have a reasonable real world understanding of how software is built.

As in, you really do not want to spend an entire afternoon or more explaining the difference between static and dynamic linking and why this matters in the face of a certain license, or that specific terms of different licenses deemed open source by the Open Source Initiative may in fact be incompatible in practice.

It is important to keep in mind that doing open source is about making our lives more productive and enjoyable by exchanging ideas between quality professionals, perhaps sharing the load of maintenance and leaving us all more resources to develop our competence and products further.

The Way Forward - The Work Goes On

So this is where we are today. Modern software development and indeed a goodly chunk of business and society in general depend critically on open source software.

If you enjoyed this piece (or became annoyed by any part of it) I would like to hear from you. I especially welcome comments from colleagues who have experience with open source use and/or development in enterprise settings. Of course if you are just curious about open source software in these settings, you are welcome to drop me a line too. I am most easily reachable via email nix at nxdomain dot no.


I want to extend thanks to Malin Bruland and Knut Yrvin for excellent comments and proofreading.

Resources

All things open source (including an almost encyclopedic collection of licenses) at The Open Source Initiative

Wikipedia: Berkeley Software Distribution about where the Internet came from

The GNU Operating System, supported by The Free Software Foundation

The FreeBSD operating system project

Open Source at Apple

Peter Hansteen: What every IT person needs to know about OpenBSD Part 1: How it all started,
What every IT person needs to know about OpenBSD Part 2: Why use OpenBSD?,
What every IT person needs to know about OpenBSD Part 3: That packet filter
(or the whole shebang in the raw at bsdly.blogspot.com)


Bradford Morgan White: The Berkeley Software Distribution

Nasjonal Sikkerhetsmyndighet (NSM): Åpen kildekode i den digitale leverandørkjeden (Norwegian only)

Business of Apps: Android Statistics (2023)

Bank My Cell: How Many Android Users Are There? Global and US Statistics (2023) (Source: https://www.bankmycell.com/blog/how-many-android-users-are-there)

Statista: Market share held by Apple iOS operating system of smartphone shipments from 1st quarter 2011 to 4th quarter 2022

Appendix: License Complexity Measured by Word Count

While presenting on free and open source software in enterprise environments, the topic of license complexity and how to handle licensing matters usually generates questions of the type,

"Does doing open source mean we need to staff an Open Source Program Office?

Does this not add a considerable measure of complexity to the development organization?

Do the open source licenses mean we have to hire even more lawyers?"

So I set out to do a little research. I figured that the number of words in a text is a useful, if not perfect, indicator of complexity, so we could use that measure as an easy to obtain proxy for how complex the licenses we are likely to encounter are in practice.

I headed over to the Open Source Initiative website and their excellent collection of open source licenses. I then picked out the more common open source licenses, and for each license I pasted the text into the word counter at wordcounter.net, which in addition to the word count provides an indication of likely target audience "reading level" and estimated reading time as well as a few other measures of the text characteristics.

The results are in the following table:


License complexity by word count

License                                                Word count  Reading level     Reading time
1-clause BSD License                                   160         College Graduate  35s
2-clause BSD License                                   191         College Graduate  42s
3-clause BSD License                                   220         College Graduate  48s
GNU GPL v2.0                                           2964        College Graduate  10m47s
GNU GPL v3.0                                           5608        College Graduate  20m30s
Apache License v2.0                                    1677        College Graduate  5m44s
Microsoft 365 Developer program license                4803        College Graduate  17m28s
Microsoft Windows 11 OS license terms                  5766        College Graduate  20m58s
Oracle End User License Agreement                      2554        College Graduate  9m17s
Adobe End-User License Agreement                       450         College Graduate  1m38s
Apple Licensed Application End User License Agreement  1524        College Graduate  5m32s

Once again, strict word count is not a perfect indicator of complexity — other measures such as sentence length and logical structure and interdependencies are likely to matter in real life scenarios.
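For readers who would rather repeat the measurement locally than paste text into a web form, the following is a rough sketch of the same idea. It counts whitespace-separated words and assumes a reading speed of 275 words per minute. That figure is an assumption which roughly matches most rows of the table above. The tool's exact counting and rounding rules may differ, so do not expect every row to be reproduced to the second. The function names are made up for this illustration.

```python
# Assumed reading speed; chosen because it roughly matches most rows in the table.
WORDS_PER_MINUTE = 275


def count_words(text):
    """Count whitespace-separated tokens in a license text."""
    return len(text.split())


def reading_time(word_count, wpm=WORDS_PER_MINUTE):
    """Format an estimated reading time in the table's style, e.g. '35s' or '10m47s'."""
    total_seconds = round(word_count * 60 / wpm)
    minutes, seconds = divmod(total_seconds, 60)
    if minutes:
        return f"{minutes}m{seconds:02d}s"
    return f"{seconds}s"


print(reading_time(160))   # 35s
print(reading_time(2964))  # 10m47s
```

To measure a license of your own choosing, save its text to a file, read the file into a string, and pass `count_words(text)` to `reading_time`.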