The symptom: we're getting a SIGSEGV due to writing into a write-enabled page

Fri, 09 Aug 2002 00:25:51 +0000

The symptom: we’re getting a SIGSEGV which (from looking at the arguments to the handler) appears to be due to writing into a write-enabled page. Yes, I did say enabled.

The cause: I’d written (void *)foo-1 instead of (void **)foo-1
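
To see why one missing asterisk matters, here is a small Python model of the two expressions, with plain integers standing in for C pointers. It assumes 4-byte pointers (32-bit x86) and GCC's extension that makes arithmetic on `void *` step by single bytes. The value given to `foo` and the helper names are made up for illustration; the real variable's contents aren't shown in this post, so the addresses printed here are not the ones from the actual bug.

```python
import struct

POINTER_SIZE = 4  # assumption: 32-bit x86, so sizeof(void *) == 4


def void_star_minus_one(foo):
    """Model (void *)foo - 1: GCC steps void * arithmetic by one byte."""
    return foo - 1


def void_star_star_minus_one(foo):
    """Model (void **)foo - 1: steps back by one whole pointer."""
    return foo - POINTER_SIZE


def read_word(memory, offset):
    """Read a little-endian 32-bit word from a bytes object."""
    return struct.unpack_from("<I", memory, offset)[0]


foo = 0x1000  # hypothetical word-aligned address
wrong = void_star_minus_one(foo)
right = void_star_star_minus_one(foo)
print(hex(wrong), wrong % POINTER_SIZE)  # 0xfff 3  (not word-aligned)
print(hex(right), right % POINTER_SIZE)  # 0xffc 0  (word-aligned)

# Two adjacent stack words, read aligned and then from a misaligned offset.
memory = struct.pack("<II", 0x0cacc14b, 0x0a32fbc0)
print(hex(read_word(memory, 0)))  # 0xcacc14b
print(hex(read_word(memory, 2)))  # 0xfbc00cac (halves of two different words)
```

The last line is the interesting one: a word read from a misaligned start is stitched together from pieces of its neighbours, so nothing scanned from there looks like a real pointer.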

The intervening steps (in reverse order)

  1. The SIGSEGV was actually from executing an iret instruction, and nothing (much) to do with write-enabled pages.
  2. The iret was in a part of memory mostly filled with zeroes. In x86 assembler, zeroes disassemble to add %al,(%eax), which while pointless is basically harmless, so we really didn’t know how long execution had been dashing through snowfields by the time it got there.
  3. So, perhaps we should have a look at the stack. Here is your five minute guide to interpreting sbcl x86 stack traces:
    esp            0x403ff84c       0x403ff84c
    ebp            0x403ff870       0x403ff870
    
    0x403ff840:     0x00000008      0x00000008      0x0cafd99c      0x00000004
                                                                      ^ sp  
    0x403ff850:     0x403ff850      0x0500000b      0x0d659a19      0x0caffeff
                                                                    ^ return addr
    0x403ff860:     0x0cacc14b      0x00000000      0x0a32fbc0      0x403ff890
                                                                    ^ prev frame
    0x403ff870:     0x0cace37f      0x0cacc0a3      0x0caa778b      0x00000008
                       ^ ebp
    0x403ff880:     0x0a330f14      0x00000004      0x0a330f14      0x403ff8d0
    0x403ff890:     0x403ff8b4      0x0500000b      0x09463627      0x403ff874
    0x403ff8a0:     0x00000014      0x0b3effd1      0x0cacc0a3      0x0500000b
    
    Start at ebp. The preceding word (address ebp-4) gives the ebp for the previous frame. Four words prior to that (address ebp-20) is the lisp return address (a raw untagged address: x86 insns aren’t all the same length, after all).
  4. The memory at 0x0caffeff was full of zeroes. So was the memory that the code pointers for the preceding several frames pointed to.
  5. Why do we have apparently correct control frames which contain such obviously bogus return addresses? Well, what if there were valid code there originally, which got moved by, say, GC? Like, the GC that just occurred a minute ago?
  6. Oh look, we’re scavenging the control stack from 0x403ffffe rather than 0x403ffffc as we should. See ‘The cause’, above. Whatever values we look at from that angle, it’s a pretty good bet they won’t be recognisable lisp pointers. Duh.
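
The stack-reading recipe from step 3 can be sketched in a few lines of Python. This is only an illustration: the stack words are copied from the dump above, the offsets are the ones described in this post for this particular build (not checked against any other SBCL version), and `decode_frame` is a made-up helper name.

```python
WORD = 4  # bytes per stack word on 32-bit x86

# Stack words copied from the dump above, keyed by address.
STACK = {
    0x403ff850: 0x403ff850, 0x403ff854: 0x0500000b,
    0x403ff858: 0x0d659a19, 0x403ff85c: 0x0caffeff,
    0x403ff860: 0x0cacc14b, 0x403ff864: 0x00000000,
    0x403ff868: 0x0a32fbc0, 0x403ff86c: 0x403ff890,
    0x403ff870: 0x0cace37f,
}


def decode_frame(stack, ebp):
    """Return (previous ebp, return address) using the offsets from the post."""
    prev_slot = ebp - WORD              # the word just before ebp
    return_slot = prev_slot - 4 * WORD  # four words before that
    return stack[prev_slot], stack[return_slot]


prev_ebp, return_addr = decode_frame(STACK, 0x403ff870)
print(hex(prev_ebp))     # 0x403ff890
print(hex(return_addr))  # 0xcaffeff
```

Both values match the `prev frame` and `return addr` markers in the dump, and the second is the address that turned out to be full of zeroes.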

Now that I’m back to having a lisp which can actually build PCL and dump a core, I guess I should look at my new bind vop and see if it’s doing anything yet.