diary at Telent Netowrks

Beating the dead hobby horse I: structure, not strings

Fri, 22 Jan 2021 21:04:32 +0000

In the domain of "how to write computer programs" (or "how to solve problems using computers" if you prefer a more user-centred framing) there are two things I bang on about endlessly. Today I'm going to write the first of them down.

Process structured values, not serializations

Your program interfaces with the outside world, and most likely it (unavoidably) sends and receives streams of bytes across that interface. Internally though, your program should not be processing those byte streams anywhere but at its boundaries: on input, to read them into structured values, and on output, to serialize those values back into bytes.

That sounds ... obvious, so - why do we so often get it wrong? Let's look at an example. Suppose you're writing a web application. You have some user-supplied content and you want to display it in the browser. You might (but shouldn't) do something like this:

def greet(name)
  # Don't do this: name is interpolated straight into the markup
  puts "<html><head><title>Hi</title></head><body><h1>#{name}</h1></body></html>"
end

What's wrong with this? Well, suppose the value of name is <blink>HAHA</blink> or </body> or <script>window.alert('pwned')</script> ... bad things happen. We need to "escape" that value before we print it, so that it does not contain syntax that will be treated as instructions to the browser's document parser.
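For the element-content case only, a minimal sketch of that escaping might look like the following. It uses `ERB::Util.html_escape` from Ruby's standard library; the `greet_escaped` name is made up for this example, and it returns the string instead of printing it so the result is easy to inspect.

```ruby
require "erb"

# Hypothetical helper: same template as before, but the user-supplied
# value is escaped before being interpolated into element content.
def greet_escaped(name)
  safe = ERB::Util.html_escape(name)
  "<html><head><title>Hi</title></head><body><h1>#{safe}</h1></body></html>"
end

puts greet_escaped("<blink>HAHA</blink>")
```

The markup characters in the name come out as `&lt;` and `&gt;`, so the browser shows them as text instead of parsing them.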

It's reasonably straightforward to do so in that case, but now suppose that instead of replacing element content we want to replace an attribute value, or a class name, or a CSS style value or - oh my lord, the OWASP advice is hairy. Every time we have some variable content to interpolate into our template we need to figure out which context we're in and which rule or rules to apply. Whoever comes after us to review our code had better pay close attention too.

Is this the best we can do? No. Let me present to you another way of looking at this.

Instead of interpolating our user content directly into the serialization as we write it out, we're going to build a document object with our user-supplied content, and only when we need to are we going to serialize the whole shebang.

def greet(name)
  # Each node is [tag, attributes, *children]; strings are text nodes
  doc =
    [:html, {},
      [:head, {},
        [:title, {}, "Hi"]],
      [:body, {},
        [:h1, {}, name]]]
  # doc = transform_document_in_some_way(doc)
  doc
end

We've decoupled the document generation from the serialization.

We still need to do the serialization, of course. We still need something that understands the encoding rules so that it may encode the document safely, but that "something" is library code, it knows the context for each node and it can do the correct escaping to print the content of that node.
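To make that concrete, here is a rough sketch of what such a serializer could look like for `[tag, attributes, *children]` arrays like the ones above. The `render` function is hypothetical, not an existing library. It only knows about two contexts (text and attribute values) and ignores things a real library must handle, such as void elements, script and style content, and URL attributes.

```ruby
require "erb"

# Hypothetical serializer for [tag, attributes, *children] arrays.
# Strings are text nodes; arrays are elements.
def render(node)
  # Text context: escape markup characters in the leaf string.
  return ERB::Util.html_escape(node) if node.is_a?(String)

  tag, attrs, *children = node
  # Attribute context: values are escaped (including quotes) and quoted.
  attr_text = attrs.map { |k, v| " #{k}=\"#{ERB::Util.html_escape(v.to_s)}\"" }.join
  "<#{tag}#{attr_text}>#{children.map { |c| render(c) }.join}</#{tag}>"
end

doc = [:html, {},
        [:head, {}, [:title, {}, "Hi"]],
        [:body, {}, [:h1, {}, "<blink>HAHA</blink>"]]]
puts render(doc)
```

The code that builds `doc` never has to think about escaping. The only place that decision is made is inside `render`, where the node type says which context applies.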

This approach has other advantages, too - we have a tree structure, so we can do structural transformations by walking the tree. Maybe we need to add script nodes to the head so that we can add privacy-invading third party JS scripts. Maybe we need to put in a Covid19 banner at the top of the page. Maybe we need to find all the relative links on the page and add a prefix to their paths.
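The link-prefixing case can be sketched as a short recursive walk. This assumes the same `[tag, attributes, *children]` array shape, with hrefs stored under an `:href` key. The `prefix_links` name is invented for the example, and so is its rule for what counts as "relative" (anything that does not start with a scheme, a slash or a `#`).

```ruby
# Hypothetical tree walk over [tag, attributes, *children] arrays.
# Returns a new tree in which every relative href on an :a element has
# `prefix` prepended; the input tree is left unmodified.
def prefix_links(node, prefix)
  return node unless node.is_a?(Array) # text nodes pass through

  tag, attrs, *children = node
  href = attrs[:href]
  # Assumption: "relative" means no scheme, no leading slash, no fragment.
  if tag == :a && href && !href.match?(%r{\A(?:[a-z][a-z0-9+.-]*:|/|#)}i)
    attrs = attrs.merge(href: prefix + href)
  end
  [tag, attrs, *children.map { |c| prefix_links(c, prefix) }]
end

page = [:body, {},
         [:a, { href: "about.html" }, "About"],
         [:a, { href: "https://example.com/" }, "Elsewhere"]]
p prefix_links(page, "/blog/")
```

Doing the same thing to serialized HTML would mean regex surgery on a string, with all the quoting and context problems that implies.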

I concede that there are some circumstances - perhaps you're running on a microcontroller, you have huge amounts of HTML and no RAM in which to assemble a document - in which this approach is contraindicated, but to my mind these are special cases not default practices.

Not just HTML

Mistaking a serialized file format for an internal representation is by no means confined to HTML. At the time I write this, 4/10 of the OWASP Top Ten have the common symptom "you tried to insert data into the serialized form of a structured value without paying really close attention to the rules of the encoding, and your interpolated data itself contained serialized structure fragments, not just the flat value that you assumed". SQL injections, command injections, path traversal attacks. The commonly-touted remedies: use placeholders, use execve instead of system, use a Pathname or File class instead of a string where the "/" has special meaning.
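As one illustration of the last remedy, here is a sketch using Ruby's standard `Pathname` class to decide whether a user-supplied path stays beneath a root directory by looking at path components instead of raw string contents. The `resolve_under` helper and the `/srv/www` root are made up for the example. The check is purely lexical, so it says nothing about symlinks.

```ruby
require "pathname"

# Hypothetical helper: resolve a user-supplied path beneath `root`,
# comparing path components and not raw strings. Purely lexical: it
# never touches the filesystem, so symlinks are not considered.
def resolve_under(root, user_path)
  root = Pathname.new(root).cleanpath
  target = (root + user_path).cleanpath
  # If the route from root to target starts by going up, target is outside.
  escapes = target.relative_path_from(root).each_filename.first == ".."
  raise ArgumentError, "path escapes #{root}" if escapes

  target
end

puts resolve_under("/srv/www", "img/logo.png")
begin
  resolve_under("/srv/www", "../../etc/passwd")
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```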

At both ends

So don't serialize until you have to, but also can we talk about input? Deserialize (parse) what you get from the outside world as soon as you humanly can, and certainly before you start trying to make decisions based on it. Get those strings and turn them into structured values before you start doing anything else to or with them.
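A small sketch of what that can look like at the input boundary, assuming a web-style hash of string parameters. The `Signup` struct, the `parse_signup` function, the parameter names and the validation rules are all invented for illustration.

```ruby
# Hypothetical boundary parser: turn raw request parameters (strings)
# into a typed value once, and fail loudly if they do not parse.
Signup = Struct.new(:email, :age)

def parse_signup(params)
  email = params.fetch("email").strip
  # Deliberately crude shape check, for illustration only.
  raise ArgumentError, "bad email: #{email.inspect}" unless email.match?(/\A[^@\s]+@[^@\s]+\z/)

  age = Integer(params.fetch("age"), 10) # raises ArgumentError on "12abc"
  raise ArgumentError, "age out of range: #{age}" unless (0..150).cover?(age)

  Signup.new(email, age)
end

signup = parse_signup("email" => " user@example.org ", "age" => "42")
puts signup.age + 1
```

Everything past `parse_signup` works with an integer age and a trimmed email, so no later code needs to ask whether the string it was handed is really a number.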

This is not novel or original

I've been thinking in these terms for a long time, originally due to something Erik Naggum said:

the first tenet of information representation is that external and internal data formats are incommensurate concepts. there simply is no possible way they could be conflated conceptually. to move from external to internal representation, you have to go through a process of reading the data, and to move from internal to external representation, you have to go through a process of writing the data. these processes are non-trivial, programmatically, conceptually, and physically.

but more recently Language-theoretic security

LANGSEC posits that the only path to trustworthy software that takes untrusted inputs is treating all valid or expected inputs as a formal language, and the respective input-handling routines as a recognizer for that language. The recognition must be feasible, and the recognizer must match the language in required computation power.

and my favourite blog post of 2019, Parse, don't validate

The common theme between all these [ parsing ] libraries is that they sit on the boundary between your Haskell application and the external world. That world doesn’t speak in product and sum types, but in streams of bytes, so there’s no getting around a need to do some parsing. Doing that parsing up front, before acting on the data, can go a long way toward avoiding many classes of bugs, some of which might even be security vulnerabilities.

New Years Ruminations

Fri, 01 Jan 2021 16:13:10 +0000

2021 will be, I assert confidently, the year I get NixWRT running on my internet gateway at home. A short list of the yaks I need to shave to get there, which you will note is a lot more concrete at the front end than the back:

Twas the night before Christmas

Thu, 24 Dec 2020 23:00:48 +0000

... and I've not picked up a computer all day.

Merry Christmas to all who celebrate it, and culturally appropriate seasonal best wishes to all who don't.

Markdown my words

Wed, 23 Dec 2020 22:31:00 +0000

As a concept, this blog dates back to November 2001 - though I note now reviewing the first few entries, I was back then quite emphatic in my claim that it was not a blog. How times change.

In that time it has been implemented with

What's notable about Yablog is that it took a week in 2015 and has scarcely been touched since, until yesterday, when I decided to add Markdown support. For the last 15 years (ever since I switched to Soks, basically) I've been writing blog entries using Textile, and for the most recent ~ 10 of them, the blog entries are pretty much the only thing I've been writing in Textile, so this change is long overdue. I was getting a bit fed up of writing backticks and then publishing entries before remembering that Textile uses @ signs for that instead.

A year or so ago I suddenly realised that for the first time in a long time I no longer have a favourite programming language, but it's been apparent to me for a while before that that Clojure isn't it any longer anyway. That said, I must admit I'm quite happy that I can pick up a ~ 6 year old project, update the dependency versions and hack this in, all in the space of a couple of hours. I have trouble imagining that the same would be true in Ruby.

(Ruby isn't it either. But Ruby has never been it)

Shoutout to @yogthos@mastodon.social for markdown-clj which did all the hard work already.

Fail2ban or ban 2 fail

Tue, 22 Dec 2020 20:52:45 +0000

One of the things that made yesterday's "why does google hate me" (I suppose it is fair to say that the feeling is mutual) introspection so frustrating was the sheer volume of crap that cascades through my systemd journal, making it quite hard to see what's going on. Most of it seems to be bots trying not-very-hard to look for open SMTP relays, and a particularly tedious strain is the ones that try to authenticate as

Apr 04 12:46:20 vritual postfix/smtpd[5250]: warning:
 unknown[]: SASL LOGIN authentication failed: UGFzc3dvcmQ6

I really do literally mean UGFzc3dvcmQ6 there. A web search will confirm that I'm not the only one to see this, and it's not the string of random alphanumerics you might think it is on first glance:

$ echo -n Password: | base64
UGFzc3dvcmQ6

So I thought I might see about applying the banhammer, and as fail2ban is included in NixOS, let's set ourselves up a rule. This required a lot of futzing around in the "obvious in retrospect" space, so here is what I did that eventually worked.

  environment.etc."fail2ban/filter.d/postfix-login-failed.conf".text = ''
    [INCLUDES]
    before = common.conf

    [Definition]
    _daemon = postfix(-\w+)?/\w+(?:/smtp[ds])?
    failregex = warning:.*\[<HOST>\]: SASL LOGIN authentication failed: UGFzc3dvcmQ6$
    ignoreregex =

    [Init]
    journalmatch = _SYSTEMD_UNIT=postfix.service
  '';

  services.fail2ban.jails.postfix-login-failed = ''
    filter = postfix-login-failed
    enabled = true
    action = iptables-multiport[name=SMTP, port="smtp,submission,submissions,imap,imaps"]
  '';
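As a quick sanity check on the failregex, here is an approximation in Ruby. It is not how fail2ban itself evaluates the pattern: fail2ban expands `<HOST>` on its own, and here that tag is replaced with a simple IPv4 capture group. The log line is a made-up sample using an address from the documentation range. For testing against real logs, fail2ban ships its own `fail2ban-regex` tool.

```ruby
# Approximation only: <HOST> is replaced by a plain IPv4 capture group.
FAILREGEX = /warning:.*\[(?<host>\d{1,3}(?:\.\d{1,3}){3})\]: SASL LOGIN authentication failed: UGFzc3dvcmQ6$/

# Hypothetical log message: the address comes from the documentation
# range 203.0.113.0/24, not from a real journal.
line = "warning: unknown[203.0.113.7]: SASL LOGIN authentication failed: UGFzc3dvcmQ6"

match = FAILREGEX.match(line)
puts match[:host] if match
```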

Things to look out for:

Observationally, most of the hosts trying to log in to my server with the password Password: seem to be the same ones trying to send mail in six other ways, so this rule cuts out a lot of the other kinds of postfix lognoise too.