[kernel] bug#161: optimize csum_partial_copy_from_user()

grundler@dsl2.external.hp.com (Grant Grundler), 161@bugs.parisc-linux.org


X-PA-RISC Linux-PR-Message: report 161
X-PA-RISC Linux-PR-Package: kernel
X-Loop: daniel_frazier@hp.com
Received: via spool by bugs@bugs.parisc-linux.org id=B.101612759020789
          (code B ref -1); Thu, 14 Mar 2002 17:48:01 GMT
To: submit@bugs.parisc-linux.org
Message-Id: <20020314173950.266AB488A@dsl2.external.hp.com>
Date: Thu, 14 Mar 2002 10:39:50 -0700 (MST)
From: grundler@dsl2.external.hp.com (Grant Grundler)


Package: kernel
Version: any
Severity: wishlist

from Documentation/parisc/unwritten

csum_partial_copy
csum_partial_copy_from_user
                arch/parisc/lib/checksum.c

We want optimized asm for both of those.
For PA2.0, we want a loop that can perform a load, add, and store per cycle.
This means interleaving registers in the loop and prefetching ~3 cachelines
ahead.  The main loop for the from_user flavor could look something like:

	ldd,ma	8(%s3,src), b		; ",ma" with 8 so src actually advances
	ldd,ma	8(%s3,src), c
	ldd,ma	8(%s3,src), d
big_loop:
	ldd	192(dst), %r0		; prefetch 3 cachelines ahead for write
	ldw	192(%s3,src), %r0	; prefetch 3 cachelines ahead for read
	ldd,ma	8(%s3,src), a
	addc	x,b,x
	std,ma	b,8(dst)
	ldd,ma	8(%s3,src), b
	addc	y,c,y
	std,ma	c,8(dst)
	ldd,ma	8(%s3,src), c
	addc	z,d,z
	std,ma	d,8(dst)
	ldd,ma	8(%s3,src), d
	addc	w,a,w
	std,ma	a,8(dst)
	/* pseudo C */
	if (src < (end & ~0x3fUL)) goto big_loop;

	/* close up shop, fold cksum */
	addc	x,b,x
	std,ma	b,8(dst)
	addc	w,x,w
	addc	y,c,y
	std,ma	c,8(dst)
	addc	w,y,w
	addc	z,d,z
	std,ma	d,8(dst)
	addc	w,z,w

	/* copy and fold in any trailing bytes while src < end */
	/* w needs to be folded smaller and returned */
	return (csum_fold(w));

The above code still needs lots of boundary condition checking:
	o verify src/dst are well aligned at start
	o verify len is > 32 bytes
	o handle misaligned leftovers