diary at Telent Netowrks

Beating the dead hobby horse I: structure, not strings#

Fri, 22 Jan 2021 21:04:32 +0000

In the domain of "how to write computer problems" (or "how to solve problems using computers" if you prefer a more user-centred framing) there are two things I bang on about endlessly. Today I'm going to write the first of them down.

Process structured values, not serializations

Your program interfaces with the outside world, and most likely it (unavoidably) sends and receives streams of bytes across that interface. Internally though, your program should not be processing those byte streams at anywhere but its boundaries. On input, to read them into structured values, and on output, to serialize those values back into bytes.

That sounds ... obvious, so - why do we so often get it wrong? Let's look at an example. Suppose you're writing a web application. You have some user-supplied content and you want to display it in the browser. You might (but shouldn't) do something like this:

def greet(name)
  puts "<html><head><title>Hi</title></head><body><h1>#{name}</h1></body></html>"
end  

What's wrong with this? Well, suppose the value of name is <blink>HAHA</blink> or </body> or <script>window.alert('pwned')</script> ... bad things happen. We need to "escape" that value before we print it, so that it does not contain syntax that will be treated as instructions to the browser's document parser.

It's reasonably straightforward to do so in that case, but now suppose that instead of replacing element content we want to replace an attribute value, or a class name, or a CSS style value or - oh my lord, the OWASP advice is hairy. Every time we have some variable content to interpolate into our template we need to figure out which context we're in and which rule or rules to apply. Whoever comes after us to review our code had better pay close attention too.

Is this the best we can do? No. Let me present to you another way of looking at this. In this perspective

Instead of interpolating our user content directly into the serialization as we write it out, it, we're going to build a document object with our user-supplied content and then only when we need to are we going to serialize the whole shebang.

def greet(name)
  doc = 
    [:html {}
      [:head {}
	[:title {} "Hi"]]
      [:body {}
	[:h1 {} [name]]]]
  # doc = transform_document_in_some_way(doc)
  serialize_to_html(doc)
end

We've decoupled the document generation from the serialization.

We still need to do the serialization, of course. We still need something that understands the encoding rules so that it may encode the document safely, but that "something" is library code, it knows the context for each node and it can do the correct escaping to print the content of that node.

This approach has other advantages, too - we have a tree structure, so we can do structural transformations by walking the tree. Maybe we need to add script nodes to the head so that we can add privacy=invading third party JS scripts. Maybe we need to put in a Covid19 banner at the top of the page. Maybe we need to find all the relative links on the page and add a prefix to their paths.

I concede that there are some circumstances - perhaps you're running on a microcontroller, you have huge amounts of HTML and no RAM in which to assemble a document - in which this approach is contraindicated, but to my mind these are special cases not default practices.

Not just HTML

Mistaking a serialized file format for an internal representation is by no means confined only to HTML. At the time I write this, 4/10 of the OWASP Top Ten have the common symptom "you tried to insert data into the serialized form of a structured value without paying really close attention to the rules of the encoding data, and your interpolated data itself contained serialised structure fragments, not just the flat value that you assumed". SQL injections, command injections, path traversal attacks. The commonly-touted remedies: use placeholders, use execve instead of system, use a Pathname or File class instead of a string where the "/" has special meaning.

At both ends

So don't serialize until you have to, but also can we talk about input? Deserialize (parse) what you get from the outside world soon as you humanly can, and certainly before you start trying to make decisions based on it. Get those strings and turn them into structured values before you start doing anything else to or with them.

This is not novel or original

I've been thinking in these terms for a long time, originally due to something Erik Naggum said:

the first tenet of information representation is that external and internal data formats are incommensurate concepts. there simply is no possible way they could be conflated conceptually. to move from external to internal representation, you have to go through a process of reading the data, and to move from internal to external representation, you have to go through a process of writing the data. these processes are non-trivial, programmatically, conceptually, and physically.

but more recently Language-theoretic security

LANGSEC posits that the only path to trustworthy software that takes untrusted inputs is treating all valid or expected inputs as a formal language, and the respective input-handling routines as a recognizer for that language. The recognition must be feasible, and the recognizer must match the language in required computation power.

and my favourite blog post of 2019, Parse, don't validate

The common theme between all these [ parsing ] libraries is that they sit on the boundary between your Haskell application and the external world. That world doesn’t speak in product and sum types, but in streams of bytes, so there’s no getting around a need to do some parsing. Doing that parsing up front, before acting on the data, can go a long way toward avoiding many classes of bugs, some of which might even be security vulnerabilities.