emacs-devel

Re: Untagging by subtraction instead of masking on USE_LSB_TAG


From: Thien-Thi Nguyen
Subject: Re: Untagging by subtraction instead of masking on USE_LSB_TAG
Date: Mon, 28 Jan 2008 04:52:22 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.0.50 (gnu/linux)

() YAMAMOTO Mitsuharu <address@hidden>
() Mon, 28 Jan 2008 11:07:28 +0900

     _cons_to_long:                  _cons_to_long:
             andi. r0,r3,7                   andi. r0,r3,7
             srawi r0,r3,3                   srawi r0,r3,3
             beq cr0,L592                    beq cr0,L592
             rlwinm r2,r3,0,0,28
A            lwz r9,4(r2)                    lwz r9,-1(r3)
B            lwz r3,0(r2)                    lwz r3,-5(r3)
             rlwinm r0,r9,0,29,31            rlwinm r0,r9,0,29,31
             cmpwi cr7,r0,5                  cmpwi cr7,r0,5
             bne cr7,L593                    bne cr7,L593
             rlwinm r2,r9,0,0,28
C            lwz r9,0(r2)                    lwz r9,-5(r9)
     L593:                           L593:
             rlwinm r2,r3,13,0,15            rlwinm r2,r3,13,0,15
             srawi r0,r9,3                   srawi r0,r9,3
             or r0,r2,r0                     or r0,r2,r0
     L592:                           L592:
             mr r3,r0                        mr r3,r0
             blr                             blr

   This would make sense if the latency of load/store does not
   depend on its displacement (I'm not sure if that is the case in
   general).  Comments?

For masking, I see offsets (lwz) of 4,0,0 (lines A,B,C).
For subtraction, -1,-5,-5.

It's very possible that the machine can handle 4,0,0 more
efficiently; those all are even (0, modulo 2) and in two cases
"nothing"!  Furthermore, the maximum absolute offset for the
subtraction method is 5, which is larger (faaarther away) than 4.
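To be fair to the subtraction method, the displacement is only half of the address.  Here is a minimal C sketch (my own illustration, not code from Emacs; the names and the 3-bit tag layout are assumptions read off the listing above, where tag 5 is what the cmpwi tests for) that computes the effective address each method ends up with:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, inferred from the listing: 3 low tag bits, tag 5 = cons. */
enum { TAG_BITS = 3, TAG_MASK = (1 << TAG_BITS) - 1, TAG_CONS = 5 };

/* Masking: clear the tag bits first (the rlwinm), then add the field offset. */
static inline uintptr_t ea_by_masking(uintptr_t tagged, uintptr_t field_offset)
{
    uintptr_t base = tagged & ~(uintptr_t)TAG_MASK;
    return base + field_offset;
}

/* Subtraction: leave the tag in the register and fold it into the
   displacement, e.g. field offset 4 with tag 5 gives displacement -1. */
static inline uintptr_t ea_by_subtraction(uintptr_t tagged, uintptr_t tag,
                                          uintptr_t field_offset)
{
    assert((tagged & TAG_MASK) == tag);   /* only valid if the tag is known */
    intptr_t displacement = (intptr_t)field_offset - (intptr_t)tag;
    return tagged + (uintptr_t)displacement;   /* unsigned wraparound is defined */
}
```

With a hypothetical cons at address 0x1000 (tagged word 0x1005), both functions return 0x1004 for the field at offset 4 and 0x1000 for the field at offset 0.  So if anything differs between the two columns, it is how the machine treats the displacement itself, not where the load lands.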

Anyway, here is an excerpt from p.532 of "PowerPC 405, Embedded
Processor Core, User's Manual":

| C.2.6     Alignment in Scalar Load and Store Instructions
| 
| The PPC405 requires an extra cycle to execute scalar loads and
| stores having unaligned big or little endian data (except for
| lwarx and stwcx., which require word-aligned operands). If the
| target data is not operand aligned, and the sum of the least two
| significant bits of the effective address (EA) and the byte count
| is greater than four, the PPC405 decomposes a load or store scalar
| into two load or store operations. That is, the PPC405 never
| presents the DCU with a request for a transfer that crosses a word
| boundary. For example, a lwz with an EA of 0b11 causes the PPC405
| to decompose the lwz into two load operations. The first load
| operation is for a byte at the starting effective address; the
| second load operation is for three bytes, starting at the next
| word address.
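The rule in that excerpt is easy to write down.  The following is a rough C sketch of my reading of it (the function names are made up, and it models only the quoted paragraph, nothing else about the PPC405):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if a scalar access of byte_count bytes at effective address ea
   would be decomposed into two operations, per the quoted rule: the two
   least significant bits of the EA plus the byte count exceed four. */
static inline bool ppc405_splits_access(uint32_t ea, unsigned byte_count)
{
    assert(byte_count >= 1 && byte_count <= 4);
    return (ea & 3u) + byte_count > 4u;
}

/* Number of bytes handled by the first operation: everything up to the
   next word boundary, or the whole access if it does not cross one. */
static inline unsigned ppc405_first_piece_bytes(uint32_t ea, unsigned byte_count)
{
    unsigned to_boundary = 4u - (ea & 3u);
    return byte_count < to_boundary ? byte_count : to_boundary;
}
```

For the manual's own example, a 4-byte lwz at an EA ending in 0b11 gives 3 + 4 = 7 > 4, so it splits into a 1-byte piece followed by a 3-byte piece.  A word-aligned lwz gives 0 + 4 = 4 and stays whole.  Note that the rule looks at the effective address, not at the displacement encoded in the instruction.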

But don't heed my (mostly) ignorant gut feelings!  Experience sez:
isolate the variable; build two versions; compare on "typical"
workload; if (dis)advantage is under some "wow!" threshold, write
down your findings in the notebook (for Emacs, comments would be
fine), but prioritize maintainability (i.e., refrain from
implementing).

I am interested in how you define "typical" and "wow!".

Seasons change, pipelines change.  Keep in mind that sometimes
optimization now translates to pessimization down the road.

thi



