[parisc-linux] [RFC] Clone and fork considered dangerous from 2.6.12 to 2.6.20, testers needed.

Carlos O'Donell carlos at systemhalted.org
Mon Feb 12 19:32:22 MST 2007


Sporadically, when running the glibc test "tst-fork1" under 2.6.20 on a
64-bit kernel, the parent process's threads stop making progress: the
CPU time accounted to each thread keeps incrementing, but stepping
instructions, e.g. with gdb "si", never causes any execution.
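
To watch the per-thread CPU time without attaching a debugger, something
like the following could be used. This is only a sketch: the helper name
thread_times is made up here and is not part of the test tarball. It
assumes a Linux /proc with the usual stat layout (state is field 3, utime
field 14, stime field 15).

```sh
#!/bin/sh
# thread_times PID  (hypothetical helper, not part of the tst-fork1 tarball)
# Prints "tid state cpu_ticks" for every thread of PID, read from
# /proc/PID/task/TID/stat. cpu_ticks is utime + stime in clock ticks.
thread_times() {
    local dir tid stat rest
    for dir in /proc/"$1"/task/*/; do
        # Threads can exit between the glob and the read; skip those.
        [ -r "${dir}stat" ] || continue
        tid=${dir%/}
        tid=${tid##*/}
        stat=$(cat "${dir}stat" 2>/dev/null) || continue
        # The comm field may contain spaces, so drop everything up to the
        # last ") ". What remains starts at field 3 (state).
        rest=${stat##*') '}
        set -- $rest
        # $1 = state, ${12} = utime (field 14), ${13} = stime (field 15)
        echo "$tid $1 $(( ${12} + ${13} ))"
    done
}

# Example: sample twice and compare the tick counts by eye.
#   thread_times "$pid"; sleep 5; thread_times "$pid"
```

A thread whose tick count grows between two samples while gdb shows the
same pc would match the symptom described above.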

So far, each thread has been stuck in the delay slot of a branch,
usually just after returning from the clone syscall. I see a couple of
places in the kernel where we clear the PSW_B bit, but none seem to
trigger this type of behaviour.

Luckily I can reproduce the behaviour at will with the following
testcase. This test causes all sorts of nastiness on every kernel I
could test, going back as far as 2.6.12, on both 32-bit and 64-bit.

Steps to reproduce:

A. Download http://www.parisc-linux.org/~carlos/tst-fork1.tar.bz2
B. Unpack tst-fork1.tar.bz2.

1. Boot a recent 2.6.20 kernel on a 32 or 64-bit box.
2. Recompile tst-fork1 with: gcc -lpthread -o tst-fork1 tst-fork1.c
3. Run "./tst-fork1 >& run.log" until the test blocks.
4. Try to use "kill -9" to remove all "tst-fork1" threads.
5. Watch your kernel die with a null pointer dereference in
__wake_up_common indicating that there was some process list
corruption. On older 32-bit kernels you may get "Slab corruption"
warnings.
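
Step 3 can be automated with a small loop. The following is a sketch
only: the function name run_until_blocked and the RUN_LOG variable are
invented for illustration, and "blocked" is approximated as "still
running after LIMIT seconds". The stuck process is deliberately left
alone so that steps 4 and 5 can still be carried out by hand.

```sh
#!/bin/sh
# run_until_blocked LIMIT MAX_RUNS CMD [ARGS...]   (hypothetical helper)
# Runs CMD up to MAX_RUNS times. Returns 0 as soon as one run is still
# alive after LIMIT seconds (and leaves that process running), or 1 if
# every run exited on its own. Output goes to $RUN_LOG (default run.log).
run_until_blocked() {
    local limit=$1 max_runs=$2 log=${RUN_LOG:-run.log}
    local i=1 pid waited
    shift 2
    while [ "$i" -le "$max_runs" ]; do
        "$@" > "$log" 2>&1 &
        pid=$!
        waited=0
        # Poll once per second until the process is gone or LIMIT is hit.
        while kill -0 "$pid" 2>/dev/null; do
            if [ "$waited" -ge "$limit" ]; then
                echo "run $i: pid $pid still running after ${limit}s"
                return 0
            fi
            sleep 1
            waited=$((waited + 1))
        done
        wait "$pid" 2>/dev/null || true
        i=$((i + 1))
    done
    echo "no hang seen in $max_runs runs"
    return 1
}

# Example: run_until_blocked 60 1000 ./tst-fork1
```

The 60 second limit in the example is a guess and should be set well
above the normal run time of the test on the machine in question.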

OR use a prebuilt static NPTL version:

1. Boot a recent 2.6.20 kernel on a 32 or 64-bit box.
2. Run "./tst-fork1-static >& run.log" until the test blocks.
3. Determine the parent pid e.g. "pgrep tst-fork1-static"
4. Determine the tid of any child e.g. "ls -alt /proc/$ppid/task/*"
5. Start gdb.
6. In gdb issue "attach $tid"
7. Issue a "bt" to view the current pc, then issue "si" to step.
8. Issue a "disassemble" to verify that pc is stuck in a branch delay slot.
9. Type CTRL+C to stop the process again, issue "bt" to see that nothing changed.
10. Force the pc to the next instruction e.g. "set $pcoqh = 0x????"
where ???? is the address of the next instruction.
11. Type "continue" and watch your kernel die, or lock up, always
printing "die_if_kernel() recursion detected." before dying.
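
For collecting the state of several stuck threads, the read-only part of
the gdb session (attach, backtrace, look at the pc) can be scripted. This
is a sketch under assumptions: gdb_snapshot_script is a made-up helper,
and it leaves out "si", "set $pcoqh" and "continue" on purpose, since
those are the steps that hang or bring the kernel down.

```sh
#!/bin/sh
# gdb_snapshot_script TID   (hypothetical helper)
# Prints a gdb command file that attaches to TID, shows the backtrace and
# the instructions around the pc, then detaches. It only inspects; it
# does not step or modify registers.
gdb_snapshot_script() {
    local tid=$1
    case $tid in
        ''|*[!0-9]*)
            echo "usage: gdb_snapshot_script TID" >&2
            return 2
            ;;
    esac
    # PA-RISC instructions are 4 bytes, so "$pc - 4" is the instruction
    # before the current one, i.e. the branch if pc is in a delay slot.
    printf '%s\n' \
        "attach $tid" \
        'bt' \
        'x/3i $pc - 4' \
        'p/x $pcoqh' \
        'detach'
}

# Example:
#   gdb_snapshot_script "$tid" > snap.gdb && gdb -batch -x snap.gdb
```

Running the snapshot twice a few seconds apart and comparing the output
gives the same information as steps 7 to 9, without the manual CTRL+C.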

If others could verify similar behaviour on different hardware that
would be great.

Cheers,
Carlos.


