diary @ telent

"sanitize" is a code smell

Sun Sep 7 08:46:27 2025

Topics: software rant

This is something of a hobby horse of mine, so forgive the rant: when I see something has been "sanitized" I treat it as a code smell (per Martin Fowler, "... a surface indication that usually corresponds to a deeper problem in the system"), and often find it reveals sloppy thinking, and "sanitizing" that may not even prevent the exploits it is supposed to guard against.

Each data item in your system is a value, which has a canonical representation inside your system but may be represented in multiple different external formats at the boundaries of your system.

When we say "sanitize" we imply that the input data was "insanitary" (or even "insane", same etymological root I think) but it probably wasn't - it just didn't conform to the rules of some particular representation you had in mind that you would later need to output. So why is that particular representation special? Should "sanitizing" strip out backticks (special in shell)? The semicolon (special in SQL)? The angle brackets (HTML)? The string +++ (Hayes modem commands)? The sequence .. (pathnames)? The dollar sign (bound to be used somewhere)? Non-ASCII Unicode characters (can't put those in a domain name)?

Don't "sanitize". Encode and decode between the canonical internal representation and the external representation you need to interface with. Mr O'Leary will be happy, Sigur Rós will appreciate you've spelled their name right, and Smith & Sons, Artisan Greengrocers won't have their ampersand dropped.
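As a rough sketch of what that can look like, assuming Python and nothing beyond its standard library: the value is kept as an ordinary Unicode string internally and only encoded at the point where it crosses into HTML, a shell command line, a URL or SQL. The helper names (to_html, to_shell_arg, to_url_path_segment, store_and_load) and the customer table are made up for illustration; the library calls underneath them are the real ones.

```python
import html
import shlex
import sqlite3
from urllib.parse import quote, unquote

# Canonical internal values: plain Unicode strings, kept exactly as entered.
NAMES = ["Mr O'Leary", "Sigur Rós", "Smith & Sons, Artisan Greengrocers"]


def to_html(value: str) -> str:
    """Encode for HTML text content or a quoted attribute value."""
    return html.escape(value, quote=True)


def to_shell_arg(value: str) -> str:
    """Encode as exactly one word on a POSIX shell command line."""
    return shlex.quote(value)


def to_url_path_segment(value: str) -> str:
    """Percent-encode for a single URL path segment ('/' is encoded too)."""
    return quote(value, safe="")


def store_and_load(names):
    """Round-trip values through SQL using bound parameters.

    The driver takes care of the SQL representation, so the values are
    never spliced into the statement text. (Hypothetical schema.)
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (name TEXT)")
    conn.executemany(
        "INSERT INTO customer (name) VALUES (?)", [(n,) for n in names]
    )
    rows = [r[0] for r in conn.execute("SELECT name FROM customer ORDER BY rowid")]
    conn.close()
    return rows
```

Each of these encodings has an inverse (html.unescape, shlex.split, urllib.parse.unquote, or simply reading the row back), so all three names survive the round trip unchanged and nothing needs stripping.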