[parisc-linux] stalling system clues + parisc WCHAN hack
Paul Bame
bame@fc.hp.com
Tue, 21 May 2002 08:35:10 -0600
I doubt I'm the only one who sees parisc systems become unusably slow,
apparently because any command needing disk I/O has to wait a long time.
This isn't the same symptom as the traditional Linux problem where one
fills the buffer cache (say, by running a large tar) and then the first
interactive command is slow due to paging. In the traditional problem,
the system fairly quickly recovers normalcy; in our case it never does,
though processes eventually finish. It's as if a timeout is releasing a
needed lock or something.
FYI the load for reproducing this stalling behavior is to run several
network-based (haven't tried local) 'cvs update' runs on the Linux kernel
tree, mixed with some diffs. The load is running on a 50+ GB partition if
that matters, and I've seen problems with both ext2 and ext3.
It sounds like the disk is seeking in the pattern of a heartbeat, twice
a second. I think the front-panel has a heartbeat monitor with that rhythm.
So I did a quick, simple, ugly hack, mostly to arch-independent code, to
get WCHAN out of parisc (http://ftp.parisc-linux.org/patches/wchan.diff),
and ran a ps on a system which was stalling. The result is attached, as
is a copy of /proc/meminfo.
The interesting clue in the 'ps' output, to me, is the set of 'D'
processes (uninterruptible sleep), which I suspect are those that have
called down_uninterruptable. The most frequent WCHAN culprits are
wait_on_buffer/page. Where should I go next in solving this problem
(with the least effort too, unfortunately)?
Linux b2000 2.4.18-pa25 #22 Fri May 17 11:04:28 MDT 2002 parisc unknown
PID CMD S WCHAN
1 ini S pipe_poll
2 [keventd] S context_thread
3 [ksoftirqd_CPU0] S start_context_thread
4 [kswapd] S kswapd
5 [bdflush] S start_context_thread
6 [kupdated] S sync_supers
9 [mdrecoveryd] S md_thread
10 [kjournald] S wait_on_buffer
62 [kjournald] S wait_on_buffer
98 /sbin/dhclient-2 S datagram_poll
110 /sbin/portmap S tcp_poll
175 /sbin/syslogd D wait_on_buffer
178 /sbin/klogd S syslog
182 /sbin/rpc.statd S tcp_poll
190 /usr/sbin/inetd S tcp_poll
206 nmbd -a S pipe_poll
208 /usr/sbin/sshd S tcp_poll
213 /usr/bin/X11/xfs S unix_poll
215 /usr/sbin/ntpd S datagram_poll
219 /usr/sbin/atd S wait_on_buffer
222 /usr/sbin/cron S wait4
238 -bash S wait4
783 /usr/sbin/apache S wait4
2748 /usr/sbin/lpd S tcp_poll
4356 /usr/sbin/apache S wait_for_connect
4357 /usr/sbin/apache S wait_for_connect
4358 /usr/sbin/apache S wait_for_connect
4359 /usr/sbin/apache S wait_for_connect
4360 /usr/sbin/apache S wait_for_connect
4361 /usr/sbin/apache S wait_for_connect
4717 /usr/sbin/sshd S normal_poll
4718 -bash S read_chan
4794 /USR/SBIN/CRON S pipe_wait
4795 /usr/bin/perl -w S wait4
4797 /usr/bin/ssh b20 S tcp_poll
4799 /usr/sbin/sshd S unix_poll
4800 /usr/bin/perl -w S wait4
4802 /usr/sbin/sendma S pipe_wait
4824 /bin/sh -eux /pr S wait4
5088 /USR/SBIN/CRON S pipe_wait
5089 /bin/sh -c cd ia S wait4
5090 /bin/sh -uex ./b S wait4
5092 /usr/sbin/sendma S pipe_wait
5179 /bin/sh -uex ./b S wait4
5180 diff -urN --excl D wait_on_page
5209 /bin/sh -eux /pr S wait4
5210 cvs -Qfz4 -d:pse D wait_on_page
5291 /USR/SBIN/CRON S pipe_wait
5292 /bin/sh -c test S wait4
5293 run-parts --repo S pipe_poll
5296 /bin/sh /etc/cro S wait4
5297 /bin/sh /usr/bin S wait4
5311 /bin/sh /usr/bin S wait4
5312 sort -f S pipe_wait
5313 /usr/lib/locate/ S pipe_wait
5314 /usr/bin/find / D wait_on_buffer
5367 /bin/sh ./daemon S wait4
5368 setiathome -nice R wait_on_buffer
5381 ps -eo pid,cmd,s R wait_on_buffer
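
To make the "most frequent WCHAN culprits" observation repeatable, here
is a minimal Python sketch (not part of the original hack) that tallies
WCHAN values for 'D' processes in a listing shaped like the one above.
It assumes the last two whitespace-separated fields of each row are the
state and the WCHAN symbol; the function name tally_blocked is
hypothetical, and the sample rows are copied from the listing.

```python
from collections import Counter

def tally_blocked(ps_text):
    """Count WCHAN values for processes in state 'D' (uninterruptible sleep)."""
    counts = Counter()
    for line in ps_text.splitlines():
        fields = line.split()
        # Need at least PID, CMD, S, WCHAN; this also skips the header row.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        # Assumption: CMD may contain spaces, so read S and WCHAN from the end.
        state, wchan = fields[-2], fields[-1]
        if state == "D":
            counts[wchan] += 1
    return counts

# A few rows copied from the listing above.
sample = """\
PID CMD S WCHAN
175 /sbin/syslogd D wait_on_buffer
222 /usr/sbin/cron S wait4
5180 diff -urN --excl D wait_on_page
5210 cvs -Qfz4 -d:pse D wait_on_page
5314 /usr/bin/find / D wait_on_buffer
5368 setiathome -nice R wait_on_buffer
"""

for wchan, n in tally_blocked(sample).most_common():
    print(f"{n:3d} {wchan}")
```

Running it on several snapshots taken a few seconds apart would show
whether the same processes stay parked in wait_on_buffer/wait_on_page.
A row whose WCHAN column is empty would be misparsed by this sketch.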
         total:     used:     free: shared: buffers:   cached:
Mem:  525357056 521830400   3526656       0 70672384 356921344
Swap: 511696896   5632000 506064896
MemTotal: 513044 kB
MemFree: 3444 kB
MemShared: 0 kB
Buffers: 69016 kB
Cached: 347420 kB
SwapCached: 1136 kB
Active: 112296 kB
Inactive: 329744 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 513044 kB
LowFree: 3444 kB
SwapTotal: 499704 kB
SwapFree: 494204 kB
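
For comparison with the "traditional" buffer-cache problem mentioned
earlier, the following Python sketch (an illustration only, not
something run on the stalled system) works out how much of RAM is
sitting in buffers and page cache according to the figures above. The
function name parse_meminfo is hypothetical, and it only understands
the "Name: value kB" lines.

```python
def parse_meminfo(text):
    """Return {field: kB} for the 'Name: value kB' lines of /proc/meminfo."""
    info = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the older 'total: used: free:' summary rows and anything else
        # that is not exactly 'Name:', a number, and the unit 'kB'.
        if (len(parts) == 3 and parts[0].endswith(":")
                and parts[1].isdigit() and parts[2] == "kB"):
            info[parts[0][:-1]] = int(parts[1])
    return info

# Values copied from the /proc/meminfo snapshot above.
sample = """\
MemTotal: 513044 kB
MemFree: 3444 kB
Buffers: 69016 kB
Cached: 347420 kB
"""

m = parse_meminfo(sample)
cache_kb = m["Buffers"] + m["Cached"]
print(f"buffers+cached: {cache_kb} kB "
      f"({100 * cache_kb / m['MemTotal']:.1f}% of RAM)")
print(f"free: {m['MemFree']} kB "
      f"({100 * m['MemFree'] / m['MemTotal']:.1f}% of RAM)")
```

With these numbers, most of RAM is in buffers and cache and very little
is free, while swap is barely used. That fits processes waiting on
buffer and page I/O, but it does not by itself explain why the system
never recovers.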