New Years Ruminations#

Fri Jan 1 15:13:10 2021

2021 will be, I assert confidently, the year I get NixWRT running on my internet gateway at home. A short list of the yaks I need to shave to get there, which you will note is a lot more concrete at the front end than the back:

swarm: test harness for event system
swarm: a well-tested event system :-)
swarm: anything else that needs doing
see if swarm can implement the existing service abstraction (can it plugin to replace monit?), do whatever necessary if it can't
dnsmasq
routing and firewalling (and NAT for IPv4)
ntp
bonus points: a better story for firmware upgrades
bonus points: external "secrets" config so passwords aren't baked in

Beating the dead hobby horse I: structure, not strings#

Fri Jan 22 20:04:32 2021

Topics: software ruby

In the domain of "how to write computer problems" (or "how to solve problems using computers" if you prefer a more user-centred framing) there are two things I bang on about endlessly. Today I'm going to write the first of them down.

Process structured values, not serializations

Your program interfaces with the outside world, and most likely it (unavoidably) sends and receives streams of bytes across that interface. Internally though, your program should not be processing those byte streams at anywhere but its boundaries. On input, to read them into structured values, and on output, to serialize those values back into bytes.

That sounds ... obvious, so - why do we so often get it wrong? Let's look at an example. Suppose you're writing a web application. You have some user-supplied content and you want to display it in the browser. You might (but shouldn't) do something like this:

def greet(name)
  puts "<html><head><title>Hi</title></head><body><h1>#{name}</h1></body></html>"
end

What's wrong with this? Well, suppose the value of name is <blink>HAHA</blink> or </body> or <script>window.alert('pwned')</script> ... bad things happen. We need to "escape" that value before we print it, so that it does not contain syntax that will be treated as instructions to the browser's document parser.

It's reasonably straightforward to do so in that case, but now suppose that instead of replacing element content we want to replace an attribute value, or a class name, or a CSS style value or - oh my lord, the OWASP advice is hairy. Every time we have some variable content to interpolate into our template we need to figure out which context we're in and which rule or rules to apply. Whoever comes after us to review our code had better pay close attention too.

Is this the best we can do? No. Let me present to you another way of looking at this. In this perspective

the primary representation of a web document is a Document Object (in the browser they even call it the DOM, which should be a clue)
the HTML-marked-up text is just a serialization that the browser parses to create this object.

Instead of interpolating our user content directly into the serialization as we write it out, it, we're going to build a document object with our user-supplied content and then only when we need to are we going to serialize the whole shebang.

def greet(name)
  doc =
    [:html {}
      [:head {}
	[:title {} "Hi"]]
      [:body {}
	[:h1 {} [name]]]]
  # doc = transform_document_in_some_way(doc)
  serialize_to_html(doc)
end

We've decoupled the document generation from the serialization.

We still need to do the serialization, of course. We still need something that understands the encoding rules so that it may encode the document safely, but that "something" is library code, it knows the context for each node and it can do the correct escaping to print the content of that node.

This approach has other advantages, too - we have a tree structure, so we can do structural transformations by walking the tree. Maybe we need to add script nodes to the head so that we can add privacy=invading third party JS scripts. Maybe we need to put in a Covid19 banner at the top of the page. Maybe we need to find all the relative links on the page and add a prefix to their paths.

I concede that there are some circumstances - perhaps you're running on a microcontroller, you have huge amounts of HTML and no RAM in which to assemble a document - in which this approach is contraindicated, but to my mind these are special cases not default practices.

Not just HTML

Mistaking a serialized file format for an internal representation is by no means confined only to HTML. At the time I write this, 4/10 of the OWASP Top Ten have the common symptom "you tried to insert data into the serialized form of a structured value without paying really close attention to the rules of the encoding data, and your interpolated data itself contained serialised structure fragments, not just the flat value that you assumed". SQL injections, command injections, path traversal attacks. The commonly-touted remedies: use placeholders, use execve instead of system, use a Pathname or File class instead of a string where the "/" has special meaning.

At both ends

So don't serialize until you have to, but also can we talk about input? Deserialize (parse) what you get from the outside world soon as you humanly can, and certainly before you start trying to make decisions based on it. Get those strings and turn them into structured values before you start doing anything else to or with them.

9, 9e0, 0x9 and 011 are all different ways to write the same number - if you're shotgun parsing it every time you need it, how sure are you that you're using the same parsing rules each time?
If you're rejecting pathnames that contain the string /../ when building a URL fron an HTTP POST parameter, then you probably wanted to decode /%C0AE%C0AE%C0AF%C0AE%C0AE%C0AFetca%C0AFpasswd before you did that - and you would be even better served to parse the string into a structured pathname object ['..', '..', 'etc', 'passwd'].
where you have a finite number of alternatives for a property, and you handle each of them differently (your systemd service type is one of simple, forking, oneshot, dbus, notify or idle) - these are not just strings. They're instances of a ServiceType class, or they're members of a union, or something that gives them meaning beyond being collections of characters. Why? All the usual arguments against Primitive Obsession apply, but for me what's fundamental is that once you've transformed the text string that purports to be a service type (but might not be) into a value that can only be a valid service type, all your other code that reads the service type can do so without having to handle the "not actually valid" case.

This is not novel or original

I've been thinking in these terms for a long time, originally due to something Erik Naggum said:

the first tenet of information representation is that external and internal data formats are incommensurate concepts. there simply is no possible way they could be conflated conceptually. to move from external to internal representation, you have to go through a process of reading the data, and to move from internal to external representation, you have to go through a process of writing the data. these processes are non-trivial, programmatically, conceptually, and physically.

but more recently Language-theoretic security

LANGSEC posits that the only path to trustworthy software that takes untrusted inputs is treating all valid or expected inputs as a formal language, and the respective input-handling routines as a recognizer for that language. The recognition must be feasible, and the recognizer must match the language in required computation power.

and my favourite blog post of 2019, Parse, don't validate

The common theme between all these [ parsing ] libraries is that they sit on the boundary between your Haskell application and the external world. That world doesn’t speak in product and sum types, but in streams of bytes, so there’s no getting around a need to do some parsing. Doing that parsing up front, before acting on the data, can go a long way toward avoiding many classes of bugs, some of which might even be security vulnerabilities.

Led astray#

Sat Jan 30 12:46:06 2021

Topics: nix

There are probably better ways to satisfy the need for a wall light than

cutting up a 150 LED Neopixel strip, sticking the pieces to a piece of aluminum, then soldering wires from each row to the next
building a ESP8226 circuit on stripboard to drive it
sending 10x15 arrays of RGB over MQTT to specify colours

The led strip came from my Christmas lights, as did the circuit - now transferred from breadboard to a more permanent soldered stripboard. The Arduino sketch, known as Dolores ("do lo-res") is all new, though, and parts of it (the bits that are plain C++ not Arduino code) are even unit-tested.

Perhaps worth noting

10x15 pixels with a byte each for R G B is 450 bytes of data, which is more than the default max packet size of the Arduino MQTT library I'm using. Fix this with a call to client.setBufferSize with a more appropriate value
after the first time I'd put it all together and installed it on the wall, I realised that software updates were going to be considerable faff, given they require unplugging the Wemos D1 from its circuit and attaching it to USB. OTA updates to the rescue! This makes the device run a HTTP server which new images can be POSTed to. There are better ways to do this - I'd rather it were an HTTP client which fetched images from a known URL, instead of accepting anything from anyone that can reach its port 80 - but that can wait until I have some kind of CI setup.
the Nix environment I'm using for Arduino IDE only works with nix-shell not nix-build. I'd like it to behave as a regular derivation so I can have a CI setup

But now I can send it any picture I like. Here's a digital clock -

while sleep 2 ; do \
  ( convert -support .5 -depth 8 -fill red \
            -font DejaVu-Sans-Condensed -pointsize 7 -size 15x10! \
	    -draw "text 0,7 '$(date +%H%M)'" -sharpen 8 'xc:#333'  RGB:- | \
    tee out.rgb | \
    nix run nixos.mosquitto -c mosquitto_pub  -h mymqttserver.lan \
    	    		         -s --username username --pw password  \
				 -t effects/84F3EBE5E57C/image ) ; \
done

⟪ Dec 2020 Feb 2021 ⟫