From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.3 required=5.0 tests=BAYES_00,INVALID_MSGID
	autolearn=no autolearn_force=no version=3.4.4
X-Google-Language: ENGLISH,ASCII-7-bit
X-Google-Thread: 103376,fedc2d05e82c9174
X-Google-Attributes: gid103376,public
From: "Robert I. Eachus" <eachus@mitre.org>
Subject: Re: Calculating SQRT in ADA
Date: 1999/04/02
Message-ID: <3705555E.5572A782@mitre.org>#1/1
X-Deja-AN: 462108094
Content-Transfer-Encoding: 7bit
References: <7dbv6t$4u5$1@nnrp1.dejanews.com>
 <19990324201959.00800.00000708@ngol04.aol.com>
 <7dei9a$dvo$1@nnrp1.dejanews.com> <umqzp50qstq.fsf@maestro.clustra.com>
 <7dhjhi$27a$1@nnrp1.dejanews.com> <36FFF83A.BE789C93@mitre.org>
 <7dq5b2$2dk$1@nnrp1.dejanews.com>
X-Accept-Language: en
Content-Type: text/plain; charset=us-ascii
Organization: The MITRE Corporation
Mime-Version: 1.0
Newsgroups: comp.lang.ada
Date: 1999-04-02T00:00:00+00:00
List-Id: <comp.lang.ada>


robert_dewar@my-dejanews.com wrote:

> I am not sure what this refers to, what "hardware" IEEE
> instructions are you referring to. Certainly IEEE does not
> include elementary functions except for sqrt, and this is
> of course NOT hardware on most machines.

   News to me.  Many current processor architectures emulate some of
the trigonometric and transcendental functions with microcode or
hardware traps to library routines, but there are still members of
these families available that implement the instructions in hardware. 
For example, in the 68000 family, the earlier processors had all the
instructions in hardware in the 68881 and 68882 coprocessor chips. 
The 68040 implemented many floating point instructions in hardware and
emulated others; in the 68060, almost all of the instructions other than
the basic floating point operations are done in emulation libraries.

> Sure, a sqrt in hardware can be as fast as a divide, since
> a very similar algorithm can be used. But I challenge your
> initial statement here. Please cough up code on a specific
> machine to justify the statement that you can do a sqrt in
> floating-point divide time.

  Well, the first manual I grabbed off the shelf surprised me slightly:
on the 68881, the floating point square root takes two cycles more than
a divide, out of about 130. (The exact number of cycles depends on
register modes.)  Of course, loading the second operand for the divide
takes longer than two clocks: four to 40, depending on source and
memory speed.

  FSQRT has been part of the SPARC architecture since version 7; it
is implemented in hardware on almost all chipsets.   I don't have timing
tables handy, but I have tested several SPARC processors where FSQRT is
faster than FDIV.  As above, the speed advantage comes more from having
only one operand than from anything else.  Loading FP registers,
especially from memory, costs.  (Of course, YMMV, but I was more
concerned with cases where I was calculating for a large set of points,
so the high speed caches didn't much affect the data loading.)

  For the integer case, a div.l takes about 90 clocks on a 68020, while
the corresponding square root algorithm takes 16 iterations through a
loop:

     L0:   MOVE.L (operand),D5;
           TRAPMI                       ; Error if operand is negative
           MOVEQ #15,D4;
           MOVEQ #0,D6;
           MOVEQ #1,D7;
      L1:  ROR #1,D6                    ; First rotation has no effect.
           ROR #2,D7                    ; First rotation results in #8000000
           CMP.L D6,D5;
           BLT L2;
           ADD D7,D6;
           SUB D6,D5;
      L2:  DBF D4,L1;

     I'm not sure I have this correct from memory, but it is close.  The
version I used unwound the loop, used a 64-bit operand, and did a BFFFO
to skip leading zeros.  I needed to do SQRT(X*X+Y*Y) fast, again for
lots of points.
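
     For anyone who wants to experiment without a 68020 handy, here is a
minimal C sketch of the same general idea: one result bit per trip
through the loop, using only shifts, compares, adds and subtracts.  It
is not a transcription of the assembly above.  It is the usual
digit-by-digit formulation, and the names isqrt64 and ihypot are made up
for illustration.  The first while loop stands in for the BFFFO trick of
skipping leading zeros, and ihypot assumes 32-bit signed inputs so that
X*X+Y*Y fits in 64 bits.

```c
#include <stdint.h>

/* Floor of the square root of a 64-bit unsigned value.
   Produces one result bit per iteration of the main loop. */
static uint32_t isqrt64(uint64_t n)
{
    uint64_t root = 0;
    uint64_t bit = (uint64_t)1 << 62;   /* highest power of four in 64 bits */

    /* Skip leading zeros: start at the largest power of four <= n. */
    while (bit > n)
        bit >>= 2;

    while (bit != 0) {
        if (n >= root + bit) {          /* does this result bit fit? */
            n -= root + bit;
            root = (root >> 1) + bit;
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    return (uint32_t)root;
}

/* Integer SQRT(X*X+Y*Y), rounded down.  The squares are formed in
   64 bits, so the sum cannot overflow for any 32-bit inputs. */
static uint32_t ihypot(int32_t x, int32_t y)
{
    uint64_t xx = (uint64_t)((int64_t)x * x);
    uint64_t yy = (uint64_t)((int64_t)y * y);
    return isqrt64(xx + yy);
}
```

     With these definitions ihypot(3, 4) gives 5 and isqrt64(15) gives 3,
since the result is always rounded down.  Whether this beats a hardware
divide or FSQRT on any particular machine is something you would have
to measure.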