Illogical Volume Management#
Sun, 16 Jul 2023 12:29:18 +0000
I bought a new SSD for my primary desktop system, because the spinning rust storage I originally built it with is not keeping up with all the new demands I'm making of it lately: when I sit down in front of it in the morning and wave the mouse around, I have to sit listening to rattly disk sounds for tens of seconds while it pages my desktop session back in. For reasons I can no longer remember, the primary system partition /dev/sda3
was originally set up as an LVM PV/VG/LV with a bcache layered on top of that. I took the small SSD out to put the big SSD in, so this seemed like a good time to straighten that all out.
Removing bcache
Happily, a bcache backing device is still readable even after the cache device has been removed.
echo eb99feda-fac7-43dc-b89d-18765e9febb6 > /sys/block/bcache0/bcache/detach
where the value of the uuid eb99...ebb6
is determined by looking in /sys/fs/bcache/
(h/t DanielSmedegaardBuus on Stack Overflow)
It took either a couple of attempts or some elapsed time for this to work, but eventually resulted in
# cat /sys/block/bcache0/bcache/state
no cache
so I was able to boot the computer from the old HDD without the old SSD present.
Where is my mind?
At this time I did a fresh barebones NixOS 23.05 install onto the new SSD from an ISO image on a USB stick. Then I tried mounting the old disk to copy user files across, but it wouldn't mount. Even, for some reason, after I did modprobe bcache
. Maybe weird implicit module dependencies?
The internet says that you can mount a bcache backing device even without bcache kernel support, using a loop device with an offset:
If bcache is not available in the kernel, a filesystem on the backing device is still available at an 8KiB offset.
... but, that didn't work either? binwalk will save us:
$ nix-shell -p binwalk --run "sudo binwalk /dev/backing/nixos"|head
DECIMAL HEXADECIMAL DESCRIPTION
--------------------------------------------------------------------------------
4194304 0x400000 Linux EXT filesystem, blocks count: 730466304, image size: 747997495296, rev 1.0, ext4 filesystem data, UUID=37659245-3dd8-4c60-8aec-cdbddcb4dcb4, volume name "nixos"
The offset is not 8K, it's 8K * 512: the documented 8KiB turns out here to be 8192 sectors of 512 bytes, i.e. 4MiB. Don't ask me why, I only work here. So we can get to the data using
$ sudo mount /dev/backing/nixos /mnt -o loop,offset=4194304
and copy across the important stuff like /home/dan/src
and my .emacs
. But I'd rather like a more permanent solution as I want to carry on using the HDD for archival (it's perfectly fast enough for my music, TV shows, Linux ISOs etc) and nixos-generate-config
gets confused by loop devices with offsets.
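As a quick sanity check on that number (a throwaway sketch; the figures are the ones quoted above):

```shell
# bcache documents an 8KiB offset to the backing filesystem, but what
# actually worked here is the same figure counted in 512-byte sectors.
documented_offset=$((8 * 1024))    # 8192 bytes
working_offset=$((8192 * 512))     # 4194304 bytes = 4MiB
echo "$documented_offset $working_offset"
```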
If it were an ordinary partition I'd simply edit the partition table to add 8192 sectors to the start address of sda3
, but I don't see a straightforward way to do the analogous thing with a logical volume.
Resolution
Courtesy of Andy Smith's helpful blog post (you should read it and not rely on my summary) and a large degree of luck, I was able to remove the LV completely and turn sda3
back into a plain ext4 partition. We follow the steps in his blog post to find out how many sectors at the start of sda3
are reserved for metadata (8192) and how big each extent is (8192 sectors again, or 4MiB). Then when I looked at the mappings:
sudo pvdisplay --maps /dev/sda3
--- Physical volume ---
PV Name /dev/sda3
VG Name backing
PV Size 2.72 TiB / not usable 7.44 MiB
Allocatable yes (but full)
PE Size 4.00 MiB
Total PE 713347
Free PE 0
Allocated PE 713347
PV UUID 7ec302-b413-8611-ea89-ed1c-1b0d-9c392d
--- Physical Segments ---
Physical extent 0 to 713344:
Logical volume /dev/backing/nixos
Logical extents 2 to 713346
Physical extent 713345 to 713346:
Logical volume /dev/backing/nixos
Logical extents 0 to 1
It's very nearly a contiguous run, except that the first two 4MiB chunks are at the end. But ... we know there's a 4MiB offset from the start of the LV to the ext4 filesystem (because of bcache). Do the numbers match up? Yes!
Physical extent 713345 to 713346 are the first two 4MiB chunks of /dev/backing/nixos
. 0-4MiB is bcache junk, 4-8MiB is the beginning of the ext4 filesystem, all we need to do is copy that chunk into the gap at the start of sda3 which was reserved for PV metadata:
# check we've done the calculation correctly
# (extent 713346 + 4MiB for PV metadata)
$ sudo dd if=/dev/sda3 bs=4M skip=713347 count=1 | file -
/dev/stdin: Linux rev 1.0 ext4 filesystem data, UUID=37659245-3dd8-4c60-8aec-cdbddcb4e3c8, volume name "nixos" (extents) (64bit) (large files) (huge files)
# save the data
$ sudo dd if=/dev/sda3 bs=4M skip=713347 count=1 of=ext4-header
# backup the start of the disk, in case we got it wrong
$ sudo dd if=/dev/sda3 bs=4M count=4 of=sda3-head
# deep breath, in through nose
# exhale
# at your own risk, don't try this at home, etc etc
$ sudo dd bs=4M count=1 conv=nocreat,notrunc,fsync if=ext4-header of=/dev/sda3
$
It remains only to fsck /dev/sda3
, just in case, and then it can be mounted somewhere useful.
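The skip value in those dd commands can be double-checked with shell arithmetic. This is a sketch using the figures from the pvdisplay output above; the pe_start of 8192 sectors comes from the metadata-sizing step in Andy Smith's post:

```shell
PE_START=$((8192 * 512))        # 4MiB of LVM metadata at the start of sda3
PE_SIZE=$((4 * 1024 * 1024))    # each physical extent is 4MiB
EXTENT=713346                   # the extent holding logical extent 1 (the ext4 header)
OFFSET=$((PE_START + EXTENT * PE_SIZE))
# dd with bs=4M skip=713347 starts reading at 713347 * 4MiB -- the same place
echo $((OFFSET == 713347 * PE_SIZE))
```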
With hindsight, the maths is too neat to be a coincidence, so I think I must have used some kind of "make-your-file-system-into-a-bcache-device tool" to set it all up in the first place. I have absolutely no recollection of doing any such thing, but Firefox does say I've visited that repo before ...
Turning the nftables#
Fri, 02 Jun 2023 23:16:32 +0000
In the course of Liminix hacking it has become apparent that I need to understand the new Linux packet filtering ("firewall") system known as nftables.
The introductory documentation for nftables is a textbook example of pattern 1 in Julia Evans' Patterns in confusing explanations document. I have, nevertheless, read enough of it that I now think I understand what is going on, and am ready to attempt the challenge of describing
nftables without comparing it to ip{tables,chains,fw}.
We start with a picture:

This picture shows the flow of a network packet through the Linux kernel. Incoming packets are received from the driver on the far left and flow up to the application layer at the top, or rightwards to be transmitted through the driver on the right. Locally generated packets start at the top and flow right.
The round-cornered rectangles depict hooks, which are the places where we can use nftables to intercept the flow and handle packets specially. For example:
- if we want to drop packets before they reach userspace (without affecting forwarding) we could do that in the "input" hook.
- if we want to do NAT - i.e. translate the network addresses embedded in packets from an internal 192.168.x.x (RFC 1918) network to a real internet address - we'd do that in the "postrouting" hook (and so that we get replies, we'd also do the opposite translation in the "prerouting" hook)
- if we're being DDoSed, maybe we want to drop packets in the "ingress" hook before they get any further.
The picture is actually part of the docs and I think it should be on the first page.
Chains and rules
A chain (more specifically, a "base chain") is registered with one of the hooks in the diagram, meaning that all the packets seen at that point will be sent to the chain. There may be multiple chains registered to the same hook: they get run in priority order (numerically lowest to highest), and packets accepted by an earlier chain are passed to the next one.
Each chain contains rules. A rule has a match - some criteria to decide which packets it applies to - and an action which says what should be done when the match succeeds.
A chain has a policy (accept
or drop
) which says what happens if a packet gets to the end of the chain without matching any rules.
You can also create chains which aren't registered with hooks, but are called by other chains that are. These are termed "regular chains" (as distinct from "base chains"). A rule with a jump
action will execute all the rules in the chain that's jumped to, then resume processing the calling chain. A rule with a goto
action will execute the new chain's rules in place of the rest of the current chain, and then the packet will be accepted or dropped as per the policy of the base chain.
[ Open question: the doc claims that a regular chain may also have a policy, but doesn't describe how/whether the policy applies when processing reaches the end of the called chain. I think this omission may be because it is incorrect in the first claim: a very sketchy reading of the source code suggests that you can't specify policy when creating a chain unless you also specify the hook. Also, it hurts my brain to think about it. ]
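To make base chains, regular chains and policies concrete, here's a minimal ruleset fragment in nft syntax. This is my own sketch, not from the nftables docs, and the table and chain names are invented:

```
table inet filter {
  chain services {          # regular chain: no hook, reached only via jump
    tcp dport 22 accept
  }
  chain input {             # base chain: registered with the input hook
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    jump services           # run "services", then resume here
  }
}
```

A packet arriving at the input hook runs through the base chain's rules in order; if nothing accepts it (including the jumped-to chain), the drop policy applies.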
Chain types
A chain has a type, which is one of filter
, nat
or route
.
- filter does as the name suggests: filters packets.
- nat is used for NAT - again, as the name suggests. It differs from filter in that only the first packet of a given flow hits this chain; subsequent packets bypass it.
- route allows changes to the content or metadata of the packet (e.g. setting the TTL, or packet mark/conntrack mark/priority) which can then be tested by policy-based routing (see ip-rule(8)) to send the packet somewhere non-usual. After the route chain runs, the kernel re-evaluates the packet routing decision - this doesn't happen for other chain types. route only works in the output hook.
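For instance, masquerading an RFC 1918 network would use a nat-type chain on the postrouting hook. Again a sketch with invented names; the interface name is an assumption:

```
table ip nat {
  chain postrouting {
    type nat hook postrouting priority 100;
    oifname "ppp0" masquerade    # rewrite source addresses on the way out
  }
}
```

Because the chain type is nat, only the first packet of each flow traverses it; conntrack applies the same translation to the rest of the flow.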
Tables
Chains are contained in tables, which also contain sets, maps, flowtables, and stateful objects. The things in a table must all be of the same family, which is one of
- ip - IPv4 traffic
- ip6 - IPv6 traffic
- inet - IPv4 and IPv6 traffic. Rules in an inet chain may match ipv4 or ipv6 or higher-level protocols: an ipv6 packet won't be tested against an ipv4 rule (or vice versa), but a rule for a layer 4 protocol (e.g. UDP) will be tried against both. (Some people [who?] claim this family is less useful than you might first think it would be, and in practice you just end up writing separate but similar chains for ip and ip6)
- arp - note per the diagram that there is a disjoint set of hooks for ARP traffic, which allow only chains in arp tables
- bridge - similarly, another set of hooks for bridge traffic
- netdev - for chains attached to the ingress and egress hooks, which are tied to a single network interface and see all traffic on that interface. This hook/chain type gives great power but correspondingly great faff levels, because the packets are still pretty raw. For example, the ingress chain runs before fragmented datagrams have been reassembled, so you can't match e.g. UDP destination port as it might not be present in the first fragment.
There's a handy summary in the docs describing which chains work with which families and which tables.
What next?
I hope that makes sense. I hope it's correct :-). I haven't explained anything about the syntax or CLI tools because there are perfectly good docs for that already which you now have the background to understand.
Now I'm going to read the script I cargo-culted when I wanted to see if Liminix packet forwarding was working, and replace/update it to perform as an adequate and actually useful firewall.
Self-ghosting email#
Tue, 21 Mar 2023 22:13:55 +0000
[ Reminder: more regular updates on what I'm spending most of my time on lately are at https://www.liminix.org/ ]
I had occasion recently to set up some mailing lists and although the subject matter for those lists is Liminix-relevant, the route to their existence really isn't. So, some notes before I forget:
Anyway, that's where we are. I'm quite certain I've done something wrong, but I'm yet to discover what.
Sub-liminix messaging#
Wed, 15 Feb 2023 22:23:48 +0000
I am restarting/rewriting NixWRT,
he said, a few months ago. This is a short follow-up announcement to say that
I am very stoked about this. I'm aiming for ~ weekly updates in that place.
Crossing the threshold - Liminix#
Wed, 19 Oct 2022 21:20:32 +0000
I am restarting/rewriting NixWRT, which has seen no real development in, erm, about four years (my, how the time has flown) and is showing its age and showing my Nix inexperience.
līmen (genitive līminis) (neut.)
- threshold, doorstep, sill (bottom-most part of a doorway)
- lintel
- threshold, entrance, doorway, approach; door
- house, home, abode, dwelling
- beginning, commencement
- end, termination

Thus: Liminix, which stands at the threshold of your home network. According to the commit history I've been playing around with it for about a month now (so, since shortly after I broke the family internet for most of a morning while trying to upgrade OpenWrt), so although it still doesn't actually do anything useful yet perhaps it's time to break cover.
The objectives are quite similar to the NixWRT objectives in that I want to have congruent configuration management on the "infrastructure" devices that make up my home network, and those devices are typically underpowered for running full-blown NixOS. I do though have a shopping list of things I want to do better/differently:
- a writable filesystem so that software updates or reconfiguration (e.g. changing passwords) don't require taking the device offline to reflash it.
- more flexible service management with dependencies, to allow configurations such as "route through PPPoE if it is healthy, with fallback to LTE"
- a spec for valid configuration options (a la NixOS module options) so that we can detect errors at evaluation time instead of producing a bad image.
- a network-based mechanism for secrets management so that changes can be pushed from a central location to several Liminix devices at once
- send device metrics and logs to a monitoring/alerting/o11y infrastructure
So far: we're using s6-rc for services, which seems to be quite nice and well-put together, but I haven't tried too hard to hurt it yet. We're using the NixOS module system infra for declaring configuration option types and merging logic. We have significantly more in the way of automated testing than NixWRT had - admittedly not a high bar - and an entirely unrealised/untested idea of how we might do secrets. And the "we" there is, yes, editorial.
We don't yet have: writable filesystem (ubifs?); anything o11y; more than one hardware device. And it's not yet at the point that I can dogfood it. Although technically it boots and runs on my spare GL-AR750, I haven't ported wifi across yet.
The primary repo is at https://gti.telent.net/dan/liminix because the older I get the more stubborn I become about free "if you're not paying for it you're the product" services, but there's a mirror on Github for everyone who's not me. Because federated Gitea is not yet an available thing, and I don't want to throw up all the barriers to contribution.