comp.lang.ada
From: already5chosen@yahoo.com
Subject: Re: Trigonometric operations on x86 and x64 CPUs
Date: Sun, 18 Dec 2016 02:09:54 -0800 (PST)
Message-ID: <a6734b29-5ffc-4938-bbc2-453f7ae92325@googlegroups.com>
In-Reply-To: <ce9bddec-64ba-40e9-8fc9-c70adc0555c1@googlegroups.com>

On Saturday, December 17, 2016 at 1:20:03 AM UTC+2, Robert Eachus wrote:
> On Friday, December 16, 2016 at 3:16:25 PM UTC-5, Randy Brukardt wrote:
>  
> > (1) We don't use it for Generic_Elementary_Functions because it's impossible 
> > to be sure that the built-in instructions meet that Annex G accuracy 
> > requirements.
> >
> Correct
> > 
> > (2) Intel's documentation way back when made it clear that you had to do 
> > argument reduction first (yourself). The instructions were not intended to 
> > be accurate for values outside of +/- 2*PI (or something like that, I'm 
> > writing this from memory).
> >
> Actually more like +/- Pi/4 for Cosine and +/- Pi/8 for tangent.
> > 
> > (3) Argument reduction is always going to lose a lot of precision for large 
> > values, when you start with a 64 bit value there isn't going to be much left 
> > if the value is large. Hard to blame that mathematical fact on the hardware.
> >
> The problem is the 66-bit value of Pi in the hardware. Look at the sine of a number close to Pi call it X.  The sine will be very close to X - Pi.  Assuming 64 bit (double) precision for X, the mantissa will be a couple bits, perhaps none, from X and the rest of the bits will come from bits 49-96 of Pi.  Use 80-bit extended which the x87 instructions support, and you will be taking bits 65 to 128 from the value of Pi.
> 
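The quoted observation about sin near Pi can be checked with exact rational arithmetic. What follows is only a minimal Python sketch (it needs Python 3.9 or later for `math.ulp`). The 50-digit decimal string for Pi and the names `PI_50_DIGITS` and `residual` are my own choices for illustration. It shows that when X is the double nearest to Pi, the quantity Pi - X, which sin(X) has to approximate, consists entirely of bits of Pi that lie below the last bit of X.

```python
import math
from fractions import Fraction

# Pi to 50 decimal places: far more than the 53 bits a double can hold.
PI_50_DIGITS = Fraction("3.14159265358979323846264338327950288419716939937510")

# X is the double nearest to Pi; Fraction(X) converts it exactly.
X = math.pi
residual = PI_50_DIGITS - Fraction(X)   # value of Pi - X, exact up to ~1e-50

# sin(X) = sin(Pi - X), and Pi - X is tiny, so sin(X) equals Pi - X to first order.
print("Pi - X      =", float(residual))
print("math.sin(X) =", math.sin(X))     # whatever the host math library returns

# Every significant bit of the correct result lies below the last bit of X.
print("residual < ulp(X):", residual < Fraction(math.ulp(X)))
```

How closely `math.sin(X)` matches the first printed line depends on how many bits of Pi the underlying implementation uses for range reduction.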

That part ignores the problem of GIGO.
When we say that the argument of sin is x, then 99.9999% of the time we do not mean that the argument is *exactly* x, but that it most probably lies in the range [x-0.5*ULP..x+0.5*ULP]. Or worse.
The x87 implementation of sin(x) for abs(x) > 2*pi always returns a value in the range [sin(x-0.5*ULP)..sin(x+0.5*ULP)] (actually, 2 times better than that for extended precision, or 4096 times better for double precision), so for 99.9999% of us it is more than good enough.
The remaining 0.0001% of us are either mathematicians, who hopefully know what they are doing, or sensationalists, who should be ignored by any sane designer of an RTL.
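To put rough numbers on the GIGO argument, here is a small Python sketch (Python 3.9 or later for `math.ulp`). The names `input_noise_bound` and `OUTPUT_HALF_ULP` are hypothetical, introduced only for this illustration. It computes an upper bound on how far sin(x) can move when x itself is uncertain by half an ULP, and compares that with half an ULP of a result in [0.5, 1). The actual shift is scaled by abs(cos(x)), so this is a worst case, not a measurement of any particular implementation.

```python
import math

def input_noise_bound(x: float) -> float:
    """Largest change in sin(x) that a half-ULP error in x can cause.

    |d/dx sin(x)| = |cos(x)| <= 1, so moving x by h moves sin(x) by at most h.
    """
    return 0.5 * math.ulp(x)

# Half an ULP of a double-precision result with magnitude in [0.5, 1).
OUTPUT_HALF_ULP = 0.5 * math.ulp(0.5)

for x in (1.0, 100.0, 1.0e5, 1.0e10):
    ratio = input_noise_bound(x) / OUTPUT_HALF_ULP
    print(f"x = {x:>14.1f}  input noise can reach {ratio:.3g} output half-ULPs")
```

For arguments of order 1 the two quantities are comparable; for large arguments the uncertainty already present in x swamps anything that a more careful range reduction could recover.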

> Could Intel have done the range reduction right?  Sure.  It would add a few instructions to the micro code, and require a longer value for Pi. 
> > 
> > In any case, in general, I'd trust the Ada implementer to have looked at the 
> > issues and having come up with the best possible implementation on the 
> > hardware. They have a lot more at stake than any individual user (and a lot 
> > more tools as well). If they're not using something, most likely it's 
> > because of a good reason or two or six. :-)
> >
> Creating a package which does the range reduction right, and passes small values through to the hardware instructions is not all that hard.  

Such implementations are common, due to the hype created by sensationalists.
But, for the reasons stated above, I personally consider them a disservice to the overwhelming majority of users, who would be served much better by using the x87 implementation "as is".

> However, FXSAVE and FXRSTOR do not save (and restore) the ST(x)/MMx registers unless they have been used. Other threads running at the same time are unlikely to be using these registers, but the OS will need to save and restore these registers when moving to and from your thread.
> 
> In other words, the actual user instructions executed for a x87 trig function may be fewer and faster than doing it all in 64/128 bit XMM/YMM registers, but the overhead on thread switches and interrupts will more than make up for it. 

It depends on the number of x87 instructions executed between task switches.
If there are more than a few dozen normal arithmetic instructions, or more than a couple of transcendental instructions, then the overhead per instruction will be negligible.

Also, as far as I am concerned, the best thing about the x87 sin instruction is not that it is faster than a competent AVX implementation (for vectorizable cases it is likely non-trivially slower than a competent AVX implementation, like the one in IPP), but that, for small arguments (range [-pi/2..pi/2]), it is more precise. I.e., from a competent AVX implementation one would expect correctly rounded results for 85-90% of small double-precision arguments. The x87 implementation, on the other hand, will produce correctly rounded results for something like 99.98% of small double-precision arguments.
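For anyone who wants to measure a correctly-rounded rate of this kind, here is one way to sketch it in Python. It tests whatever `math.sin` the host Python uses, which is an assumption of the sketch: it exercises neither the x87 instruction nor IPP, so its output says nothing about the percentages above. The helper names `sin_reference` and `correctly_rounded_rate` are hypothetical. The reference is a Taylor series evaluated in exact rational arithmetic, then rounded once to the nearest double.

```python
import math
import random
from fractions import Fraction

def sin_reference(x: float, terms: int = 25) -> Fraction:
    """Taylor series for sin(x), summed in exact rational arithmetic.

    For |x| <= pi/2 and 25 terms the truncation error is far below the
    spacing of doubles, so it could only matter in extremely rare near-ties.
    """
    xf = Fraction(x)
    x2 = xf * xf
    term = xf                      # x^(2n+1) / (2n+1)!, sign included
    total = Fraction(0)
    for n in range(terms):
        total += term
        term = -term * x2 / ((2 * n + 2) * (2 * n + 3))
    return total

def correctly_rounded_rate(samples: int = 200, seed: int = 2016) -> float:
    """Fraction of random arguments in [-pi/2, pi/2] for which the host
    math.sin returns the double nearest to the exact result."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x = rng.uniform(-math.pi / 2, math.pi / 2)
        # float(Fraction) rounds to the nearest double.
        if math.sin(x) == float(sin_reference(x)):
            hits += 1
    return hits / samples

print(f"correctly rounded: {100 * correctly_rounded_rate():.1f}% of samples")
```

A couple of hundred samples is only enough to see the order of magnitude; distinguishing 99.98% from 100% would need a far larger sample.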

> The Elementary_Functions package will have to run in a 32-bit thread, so unless your entire program is in a 32-bit mode, you will pay this cost on every call.

That part makes no sense to me whatsoever.
I think that you have some misconception about x87 in long mode.
FYI, as far as hardware and popular x64 OSes (Windows/Linux/BSD, and I suppose Mac OS X too, although I don't know that for sure) go, x87 in long mode just works.



