[parisc-linux] Let's Try Something New(tm)

James Bottomley James.Bottomley at SteelEye.com
Sat Feb 17 07:02:01 MST 2007


On Sat, 2007-02-17 at 03:30 -0700, Grant Grundler wrote:
> And yes, the fact that SMP is broken bothers me too.
> I've spent some time looking at the cpu_possible_map patch pointed
> out by James Bottomley (IIRC) and I haven't figured out why that
> has the impact that it does.

Let me go over it one more time for the record, since the problem has
really only been outlined on IRC.

In the 2.6.18 timeframe, Nick Piggin introduced a bug into the SMP tuning
system with commit 5c1e176781f43bc902a51e5832f789756bff911b.

The commit tries to ensure /sbin/init doesn't start (or get
scheduled) on an isolated CPU.  However, he built this change on
cpu_online_map, which contains only the boot CPU at the point
sched_init_smp() is called.  The result was that init was restricted to
the boot CPU, as was everything that inherited its cpus_allowed from
init (i.e. every other process).  I think parisc may have been the only
architecture affected like this, since we hotplug boot our CPUs.
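
To make the mechanism concrete, here is a toy userspace model of the
mask arithmetic.  It is only a sketch, not kernel code: cpumask,
non_isolated(), fork_inherit() and mask_weight() are made-up names, the
4-CPU machine is an assumption, and it assumes the fix amounts to
building the mask from cpu_possible_map instead of cpu_online_map.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Toy stand-in for the kernel's cpumask_t (hypothetical type):
 * bit n set means CPU n is a member of the set. */
typedef uint32_t cpumask;

/* CPUs that init may run on: everything in `base` that is not isolated. */
cpumask non_isolated(cpumask base, cpumask isolated)
{
    return base & ~isolated;
}

/* A forked child starts out with a copy of its parent's cpus_allowed. */
cpumask fork_inherit(cpumask parent_allowed)
{
    return parent_allowed;
}

/* Number of CPUs a task with this mask may be scheduled on. */
int mask_weight(cpumask m)
{
    int n = 0;

    for (; m; m &= m - 1)   /* clear the lowest set bit each pass */
        n++;
    return n;
}

/* Compare the two choices of base mask on an assumed 4-CPU machine. */
void demo(void)
{
    cpumask possible = 0xF;             /* CPUs 0-3 may come up */
    cpumask online_at_sched_init = 0x1; /* only boot CPU 0 is up so far */
    cpumask isolated = 0x0;             /* nothing isolated */

    /* Mask built on cpu_online_map: init is pinned to the boot CPU. */
    cpumask init_online = non_isolated(online_at_sched_init, isolated);
    /* Mask built on cpu_possible_map: init may use every CPU. */
    cpumask init_possible = non_isolated(possible, isolated);

    assert(mask_weight(init_online) <= mask_weight(init_possible));

    printf("online-based:   init=0x%x child=0x%x (%d cpu)\n",
           (unsigned)init_online, (unsigned)fork_inherit(init_online),
           mask_weight(fork_inherit(init_online)));
    printf("possible-based: init=0x%x child=0x%x (%d cpus)\n",
           (unsigned)init_possible, (unsigned)fork_inherit(init_possible),
           mask_weight(fork_inherit(init_possible)));
}
```

Calling demo() from a main() prints both cases.  The point is only that
a mask computed while a single CPU is online stays at one bit in every
descendant of init unless something widens it explicitly.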

Post 2.6.19, the bug was spotted and fixed by Nathan Lynch with commit
e5e5673f828623e58a401862b33173591faaeaff.  When this commit was applied,
parisc began hanging in the SMP boot sequence.

However, this patch is not the *cause* of the problem.  If you backport
it to 2.6.19-pa0 (which has the Piggin bug), you'll see that it boots
SMP just fine and correctly schedules on all CPUs.

So, what happened is that the bug which is causing our SMP boot hangs
was introduced somewhere between 2.6.19 and the Lynch bug fix, but we
didn't spot it because parisc was only scheduling on the boot CPU, so
it didn't show up.

Analysis of my system shows that the hang occurs when a shell script
dots in (sources) another shell script, so it could be a fork or signal
bug.  Also, reverting all 30 commits affecting sched.c during the window
doesn't fix it.  Unfortunately, the tree isn't really in a good enough
state to try to bisect it.

> I also haven't had time to dig more into why the schedule is
> only running things on CPU 0.

Hopefully the above explains that.

James




