OK, it looks like that was half the problem
Tue, 30 Sep 2003 12:36:40 +0000
OK, it looks like that was half the problem. Or, at least, one of the two problems.
Allocation is done in pseudo-atomic sections. When alloc() decides that it's time to GC, it uses the same deferred handler mechanism as an interrupt received during pseudo-atomic to schedule a collection as soon as the allocation itself is done. Problem is that it doesn't (didn't, anyway) check whether there was already a deferred handler to run, so the message that said "stop for another thread to gc us" got whapped by a message saying "now do a gc". It wouldn't break us to call gc from two threads at once - appropriate locking mechanisms are in place - but it does hurt to not stop when people are waiting.
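To make the whapping concrete, here's a minimal sketch of a single-slot deferred handler, first without and then with the check. Every name in it (deferred_slot, defer_unchecked, defer_checked, run_deferred, stop_for_gc, do_gc) is made up for illustration and is not the runtime's actual code. Dropping the second request when the slot is already occupied is also my assumption about one reasonable policy, not necessarily what the fix does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the SBCL runtime's code. */
typedef void (*handler_fn)(void);

/* One pending-handler slot per thread, run when pseudo-atomic ends. */
struct deferred_slot {
    handler_fn pending;
};

int stops_honoured = 0;  /* times we stopped so another thread could GC */
int gcs_requested = 0;   /* times we started a collection ourselves */

void stop_for_gc(void) { stops_honoured++; }
void do_gc(void) { gcs_requested++; }

/* The old behaviour: overwrite whatever is already in the slot. */
void defer_unchecked(struct deferred_slot *s, handler_fn h)
{
    assert(h != NULL);
    s->pending = h;
}

/* The checked behaviour: refuse if a handler is already waiting.
 * Assumed policy: the earlier request wins and the caller gives up. */
bool defer_checked(struct deferred_slot *s, handler_fn h)
{
    assert(h != NULL);
    if (s->pending != NULL)
        return false;
    s->pending = h;
    return true;
}

/* Called at the end of the pseudo-atomic section. */
void run_deferred(struct deferred_slot *s)
{
    handler_fn h = s->pending;
    s->pending = NULL;   /* clear first, so the handler may defer again */
    if (h != NULL)
        h();
}
```

With defer_unchecked, deferring stop_for_gc and then do_gc runs only do_gc, and whoever asked us to stop keeps waiting. With defer_checked the second call returns false and the stop request survives.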
So, one down. The other one is that sometimes threads don't seem to wake up after gc, so after a few minutes of running, all our threads quietly come to rest waiting for a signal.
Earlier we asked "What would block SIGTRAP and SIGDEQUEUE?". wait-on-queue blocks SIGDEQUEUE temporarily while it frobs the waitqueue data before it can go to sleep. run_deferred_handler is called from the sigtrap_handler, and although we unblock the usual culprits before calling into Lisp, SIGTRAP is (along with SIGSEGV) not in that set, so it stays blocked for as long as the deferred handler runs.
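The mask arithmetic can be sketched with plain POSIX sigprocmask calls. This is an illustration only: the helper names are invented, SIGUSR1 stands in for SIGDEQUEUE, and the contents of the "usual culprits" set are a guess rather than the runtime's real list. It uses sigprocmask, so it only makes sense in a single-threaded test program.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helpers; SIGUSR1 is a stand-in for SIGDEQUEUE. */

/* Ask the kernel whether signo is blocked in the calling thread. */
bool signal_is_blocked(int signo)
{
    sigset_t current;
    sigemptyset(&current);
    sigprocmask(SIG_BLOCK, NULL, &current);  /* NULL set: query only */
    return sigismember(&current, signo) == 1;
}

/* Block one signal; the previous mask goes to *saved if non-NULL. */
void block_signal(int signo, sigset_t *saved)
{
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, signo);
    sigprocmask(SIG_BLOCK, &set, saved);
}

/* Put back a mask captured earlier by block_signal. */
void restore_mask(const sigset_t *saved)
{
    sigprocmask(SIG_SETMASK, saved, NULL);
}

/* Assumed stand-in for "unblock the usual culprits": note that
 * SIGTRAP and SIGSEGV are deliberately absent from the set. */
void unblock_usual_culprits(void)
{
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGINT);
    sigaddset(&set, SIGALRM);
    sigaddset(&set, SIGUSR1);
    sigprocmask(SIG_UNBLOCK, &set, NULL);
}
```

If SIGTRAP and SIGUSR1 are both blocked and unblock_usual_culprits is then called, SIGUSR1 becomes deliverable again but SIGTRAP is still blocked until something restores the saved mask.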
I've added the good parts of this experimentation (without, I hope, the debugging cruft) to CVS under the tag atropos-branch. If you can deal with the sheer abhorrence of all these signals, you're welcome to take a look.