Beating the dead hobby horse I: structure, not strings#

Fri, 22 Jan 2021 21:04:32 +0000

In the domain of "how to write computer problems" (or "how to solve problems using computers" if you prefer a more user-centred framing) there are two things I bang on about endlessly. Today I'm going to write the first of them down.

Process structured values, not serializations

Your program interfaces with the outside world, and most likely it (unavoidably) sends and receives streams of bytes across that interface. Internally though, your program should not be processing those byte streams at anywhere but its boundaries. On input, to read them into structured values, and on output, to serialize those values back into bytes.

That sounds ... obvious, so - why do we so often get it wrong? Let's look at an example. Suppose you're writing a web application. You have some user-supplied content and you want to display it in the browser. You might (but shouldn't) do something like this:

def greet(name)
  puts "<html><head><title>Hi</title></head><body><h1>#{name}</h1></body></html>"
end

What's wrong with this? Well, suppose the value of name is <blink>HAHA</blink> or </body> or <script>window.alert('pwned')</script> ... bad things happen. We need to "escape" that value before we print it, so that it does not contain syntax that will be treated as instructions to the browser's document parser.

It's reasonably straightforward to do so in that case, but now suppose that instead of replacing element content we want to replace an attribute value, or a class name, or a CSS style value or - oh my lord, the OWASP advice is hairy. Every time we have some variable content to interpolate into our template we need to figure out which context we're in and which rule or rules to apply. Whoever comes after us to review our code had better pay close attention too.

Is this the best we can do? No. Let me present to you another way of looking at this. In this perspective

the primary representation of a web document is a Document Object (in the browser they even call it the DOM, which should be a clue)
the HTML-marked-up text is just a serialization that the browser parses to create this object.

Instead of interpolating our user content directly into the serialization as we write it out, it, we're going to build a document object with our user-supplied content and then only when we need to are we going to serialize the whole shebang.

def greet(name)
  doc = 
    [:html {}
      [:head {}
	[:title {} "Hi"]]
      [:body {}
	[:h1 {} [name]]]]
  # doc = transform_document_in_some_way(doc)
  serialize_to_html(doc)
end

We've decoupled the document generation from the serialization.

We still need to do the serialization, of course. We still need something that understands the encoding rules so that it may encode the document safely, but that "something" is library code, it knows the context for each node and it can do the correct escaping to print the content of that node.

This approach has other advantages, too - we have a tree structure, so we can do structural transformations by walking the tree. Maybe we need to add script nodes to the head so that we can add privacy=invading third party JS scripts. Maybe we need to put in a Covid19 banner at the top of the page. Maybe we need to find all the relative links on the page and add a prefix to their paths.

I concede that there are some circumstances - perhaps you're running on a microcontroller, you have huge amounts of HTML and no RAM in which to assemble a document - in which this approach is contraindicated, but to my mind these are special cases not default practices.

Not just HTML

Mistaking a serialized file format for an internal representation is by no means confined only to HTML. At the time I write this, 4/10 of the OWASP Top Ten have the common symptom "you tried to insert data into the serialized form of a structured value without paying really close attention to the rules of the encoding data, and your interpolated data itself contained serialised structure fragments, not just the flat value that you assumed". SQL injections, command injections, path traversal attacks. The commonly-touted remedies: use placeholders, use execve instead of system, use a Pathname or File class instead of a string where the "/" has special meaning.

At both ends

So don't serialize until you have to, but also can we talk about input? Deserialize (parse) what you get from the outside world soon as you humanly can, and certainly before you start trying to make decisions based on it. Get those strings and turn them into structured values before you start doing anything else to or with them.

9, 9e0, 0x9 and 011 are all different ways to write the same number - if you're shotgun parsing it every time you need it, how sure are you that you're using the same parsing rules each time?
If you're rejecting pathnames that contain the string /../ when building a URL fron an HTTP POST parameter, then you probably wanted to decode /%C0AE%C0AE%C0AF%C0AE%C0AE%C0AFetca%C0AFpasswd before you did that - and you would be even better served to parse the string into a structured pathname object `['..', '..', 'etc', 'passwd']`.
where you have a finite number of alternatives for a property, and you handle each of them differently (your systemd service type is one of simple, forking, oneshot, dbus, notify or idle) - these are not just strings. They're instances of a ServiceType class, or they're members of a union, or something that gives them meaning beyond being collections of characters. Why? All the usual arguments against Primitive Obsession apply, but for me what's fundamental is that once you've transformed the text string that purports to be a service type (but might not be) into a value that can only be a valid service type, all your other code that reads the service type can do so without having to handle the "not actually valid" case.

This is not novel or original

I've been thinking in these terms for a long time, originally due to something Erik Naggum said:

the first tenet of information representation is that external and internal data formats are incommensurate concepts. there simply is no possible way they could be conflated conceptually. to move from external to internal representation, you have to go through a process of reading the data, and to move from internal to external representation, you have to go through a process of writing the data. these processes are non-trivial, programmatically, conceptually, and physically.

but more recently Language-theoretic security

LANGSEC posits that the only path to trustworthy software that takes untrusted inputs is treating all valid or expected inputs as a formal language, and the respective input-handling routines as a recognizer for that language. The recognition must be feasible, and the recognizer must match the language in required computation power.

and my favourite blog post of 2019, Parse, don't validate

The common theme between all these [ parsing ] libraries is that they sit on the boundary between your Haskell application and the external world. That world doesn’t speak in product and sum types, but in streams of bytes, so there’s no getting around a need to do some parsing. Doing that parsing up front, before acting on the data, can go a long way toward avoiding many classes of bugs, some of which might even be security vulnerabilities.