The symptom: we're getting a SIGSEGV due to writing into a write-enabled page

Fri, 09 Aug 2002 00:25:51 +0000

The symptom: we're getting a SIGSEGV which (from looking at the arguments to the handler) appears to be due to writing into a write-enabled page. Yes, I did say enabled.

The cause: I'd written (void *)foo-1 instead of (void **)foo-1
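For anyone who doesn't spot the difference at a glance, here is a minimal sketch of what the two casts do. The helper names and the `foo` parameter are hypothetical stand-ins, not the actual runtime code. GCC treats arithmetic on `void *` as byte arithmetic (a GNU extension), so the sketch spells the buggy version with `char *` to stay portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: the address the buggy expression produces.
 * Under GCC, (void *)foo - 1 steps back one BYTE, exactly like
 * (char *)foo - 1, which is what we write here for portability. */
uintptr_t buggy_start(void *foo) {
    assert(foo != NULL);
    return (uintptr_t)((char *)foo - 1);
}

/* Hypothetical helper: the intended expression.
 * (void **)foo - 1 steps back one pointer-sized SLOT. */
uintptr_t intended_start(void *foo) {
    assert(foo != NULL);
    return (uintptr_t)((void **)foo - 1);
}
```

Handed the same pointer, the first lands one byte back and the second lands sizeof(void *) bytes back (4 on 32-bit x86), so only the second is still pointing at a whole word.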

The intervening steps (in reverse order)

The SIGSEGV was actually from executing an iret instruction, and nothing (much) to do with write-enabled pages.
The iret was in a part of memory mostly filled with zeroes. In x86 assembler, zeroes disassemble to add %al,(%eax), which, while pointless, is basically harmless, so we really had no idea how long it had been dashing through snowfields by the time it got there.

So, perhaps we should have a look at the stack. Here is your five minute guide to interpreting sbcl x86 stack traces:

esp            0x403ff84c       0x403ff84c
ebp            0x403ff870       0x403ff870
0x403ff840:     0x00000008      0x00000008      0x0cafd99c      0x00000004
                                                                  ^ sp  
0x403ff850:     0x403ff850      0x0500000b      0x0d659a19      0x0caffeff
                                                                ^ return addr
0x403ff860:     0x0cacc14b      0x00000000      0x0a32fbc0      0x403ff890
                                                                ^ prev frame
0x403ff870:     0x0cace37f      0x0cacc0a3      0x0caa778b      0x00000008
                   ^ ebp
0x403ff880:     0x0a330f14      0x00000004      0x0a330f14      0x403ff8d0
0x403ff890:     0x403ff8b4      0x0500000b      0x09463627      0x403ff874
0x403ff8a0:     0x00000014      0x0b3effd1      0x0cacc0a3      0x0500000b

Start at ebp. The preceding word (address ebp-4) gives the ebp for the previous frame. Four words prior to that is the lisp return address (a raw untagged address: x86 insns aren't all the same length, after all).
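The same lookup, written as a small sketch over the words from the dump above. The array, DUMP_BASE and the function names are all made up for illustration, and the offsets are simply the ones described in this entry for this one frame. I make no claim that they hold for every frame layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Address of the first word in the dump (hypothetical name). */
#define DUMP_BASE 0x403ff840u

/* Words copied from the dump, four per row. */
static const uint32_t dump[] = {
    0x00000008, 0x00000008, 0x0cafd99c, 0x00000004,  /* 0x403ff840 */
    0x403ff850, 0x0500000b, 0x0d659a19, 0x0caffeff,  /* 0x403ff850 */
    0x0cacc14b, 0x00000000, 0x0a32fbc0, 0x403ff890,  /* 0x403ff860 */
    0x0cace37f, 0x0cacc0a3, 0x0caa778b, 0x00000008,  /* 0x403ff870 */
    0x0a330f14, 0x00000004, 0x0a330f14, 0x403ff8d0,  /* 0x403ff880 */
    0x403ff8b4, 0x0500000b, 0x09463627, 0x403ff874,  /* 0x403ff890 */
    0x00000014, 0x0b3effd1, 0x0cacc0a3, 0x0500000b,  /* 0x403ff8a0 */
};

/* Fetch the word stored at a word-aligned address inside the dump. */
uint32_t word_at(uint32_t addr) {
    size_t i = (addr - DUMP_BASE) / 4;
    assert(addr >= DUMP_BASE && addr % 4 == 0);
    assert(i < sizeof dump / sizeof dump[0]);
    return dump[i];
}

/* The word just before ebp holds the previous frame's ebp. */
uint32_t prev_frame(uint32_t ebp) {
    return word_at(ebp - 4);
}

/* Four words before that is the raw return address. */
uint32_t return_addr(uint32_t ebp) {
    return word_at(ebp - 4 - 4 * 4);
}
```

With ebp = 0x403ff870 this picks out 0x403ff890 as the previous frame and 0x0caffeff as the return address, matching the annotations in the dump.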

The memory at 0x0caffeff was full of zeroes. So was the memory at the code pointers for the preceding several frames.
Why do we have apparently correct control frames which contain such obviously bogus return addresses? Well, what if there were valid code there originally, which got moved by, say, GC? Like, the GC that just occurred a minute ago?
Oh look, we're scavenging the control stack from 0x403ffffe rather than 0x403ffffc as we should. See `The cause', above. Whatever values we look at from that angle, it's a pretty good bet they won't be recognisable lisp pointers. Duh.
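To see why a two-byte offset ruins everything, here is a rough sketch that reads 32-bit little-endian words (x86 byte order) from a byte buffer at an arbitrary offset. The function names and the eight-byte buffer are invented for the illustration. The real scavenger does not look like this, but the effect on the values it sees is the same.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit word in little-endian byte order at byte offset off. */
void write_le32(uint8_t *mem, size_t off, uint32_t v) {
    assert(mem != NULL);
    mem[off]     = (uint8_t)(v & 0xff);
    mem[off + 1] = (uint8_t)((v >> 8) & 0xff);
    mem[off + 2] = (uint8_t)((v >> 16) & 0xff);
    mem[off + 3] = (uint8_t)((v >> 24) & 0xff);
}

/* Read a 32-bit little-endian word starting at ANY byte offset,
 * aligned or not, which is what a misaligned scan effectively does. */
uint32_t read_le32(const uint8_t *mem, size_t off) {
    assert(mem != NULL);
    return (uint32_t)mem[off]
         | (uint32_t)mem[off + 1] << 8
         | (uint32_t)mem[off + 2] << 16
         | (uint32_t)mem[off + 3] << 24;
}
```

Store 0x0cacc0a3 and 0x0500000b (two adjacent words from the dump) at offsets 0 and 4, then read at offset 2: you get 0x000b0cac, the top half of one word glued to the bottom half of the next. Nothing scanning for tagged pointers is going to recognise that, so the frames' code pointers never get fixed up when the code moves.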

Now that I'm back to having a lisp which can actually build PCL and dump a core, I guess I should look at my new bind vop and see if it's doing anything yet.