* Re: Interesting performance quirk.
2008-10-28 11:35 ` Peter C. Chapin
@ 2008-10-28 14:21 ` Robert A Duff
2008-10-29 1:42 ` Peter C. Chapin
2008-10-28 18:27 ` Jeffrey R. Carter
` (3 subsequent siblings)
4 siblings, 1 reply; 28+ messages in thread
From: Robert A Duff @ 2008-10-28 14:21 UTC (permalink / raw)
"Peter C. Chapin" <pcc482719@gmail.com> writes:
> I'm expecting my code to be slower because a) It is Ada and there are
> probably some extra run time checks, ...
You should also compile with all checks suppressed, to determine the
cost of the checks. This doesn't mean you have to suppress all checks
in order to get good performance. After this experiment, you might be
able to determine that suppressing checks in some inner loop (e.g. the
procedure you intend to inline, and its call site(s)) is good enough.
>... and b) I just wrote this thing and
> I haven't yet tried to tweak it for performance.
Sure.
> I tried running gprof over my code to see what I could learn, but I
> don't know how to use that tool very well yet so I didn't learn much.
It's worth learning. The gprof output can be confusing at first, but if
you puzzle over it for a while, it begins to make more sense. ;-)
>...I
> discovered that a certain procedure is called a HUGE number of times,
> but that's not unexpected considering where it's used. It's a small
> procedure so I might try inlining it.
It might also make sense to look at the machine code for that procedure.
And/or look at the expanded Ada code (using gnat's -gnatG switch).
You might also want to profile the C version, and look at whatever
tricks it uses in its inner loops, and perhaps duplicate them in
the Ada version.
Please continue to report your progress here. I, and I'm sure others,
are interested.
- Bob
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Interesting performance quirk.
2008-10-28 14:21 ` Robert A Duff
@ 2008-10-29 1:42 ` Peter C. Chapin
0 siblings, 0 replies; 28+ messages in thread
From: Peter C. Chapin @ 2008-10-29 1:42 UTC (permalink / raw)
Robert A Duff wrote:
> You should also...
Thanks for all the great suggestions. I have made a note of them for
future reference. And yes, I definitely intend to practice with gprof.
It looks useful.
Thanks also for your words of encouragement. Unfortunately progress on
this project tends to be spotty because I can only work on it here and
there between my "real" work. However, if I find out anything else
interesting or have any more questions I'll definitely post.
Peter
* Re: Interesting performance quirk.
2008-10-28 11:35 ` Peter C. Chapin
2008-10-28 14:21 ` Robert A Duff
@ 2008-10-28 18:27 ` Jeffrey R. Carter
2008-10-29 1:39 ` Peter C. Chapin
2008-10-28 23:22 ` Ludovic Brenta
` (2 subsequent siblings)
4 siblings, 1 reply; 28+ messages in thread
From: Jeffrey R. Carter @ 2008-10-28 18:27 UTC (permalink / raw)
Peter C. Chapin wrote:
>
> I'm expecting my code to be slower because a) It is Ada and there are
> probably some extra run time checks, and b) I just wrote this thing and
> I haven't yet tried to tweak it for performance. The performance ratio
> on the Linux system between my immature Ada code and the much more
> mature OpenSSL seems pretty reasonable to me. I'm less content with the
> ratio in the Window XP environment.
Have you verified that your implementation and the OpenSSL library give the same
encryption results? There's not much point in worrying about relative speeds if
they aren't doing the same thing.
--
Jeff Carter
"Now look, Col. Batguano, if that really is your name."
Dr. Strangelove
31
* Re: Interesting performance quirk.
2008-10-28 18:27 ` Jeffrey R. Carter
@ 2008-10-29 1:39 ` Peter C. Chapin
2008-10-29 5:27 ` Jeffrey R. Carter
0 siblings, 1 reply; 28+ messages in thread
From: Peter C. Chapin @ 2008-10-29 1:39 UTC (permalink / raw)
Jeffrey R. Carter wrote:
> Have you verified that your implementation and the OpenSSL library give
> the same encryption results? There's not much point in worrying about
> relative speeds if they aren't doing the same thing.
Well, I have unit tests for my code that are based on the Blowfish test
vectors on Schneier's page. My code passes those cases, so the
encryption results are apparently what Schneier thinks they should be. I
can only assume that OpenSSL can also pass those test cases (didn't try it).
I can see the sense in verifying interoperability between my code and
another major crypto library. On the other hand, when there are
published test cases it may not be worth it. For instance the Blowfish
ECB tests on Schneier's page all use 64 bit keys. Considering that the
algorithm can handle key lengths from 32 bits to 448 bits, those test
cases aren't as broad as they could be. Checking interoperability with OpenSSL
(say) for a range of key sizes might be worth doing.
But... this is getting off the topic of Ada a bit. I apologize for rambling.
Peter
* Re: Interesting performance quirk.
2008-10-29 1:39 ` Peter C. Chapin
@ 2008-10-29 5:27 ` Jeffrey R. Carter
0 siblings, 0 replies; 28+ messages in thread
From: Jeffrey R. Carter @ 2008-10-29 5:27 UTC (permalink / raw)
Peter C. Chapin wrote:
>
> Well, I have unit tests for my code that are based on the Blowfish test
> vectors on Schneier's page. My code passes those cases, so the
> encryption results are apparently what Schneier thinks they should be. I
> can only assume that OpenSSL can also pass those test cases (didn't try it).
That's an assumption you ought to test. OpenSSL might be very fast because it's
wrong.
Doesn't seem likely, though.
--
Jeff Carter
"Now look, Col. Batguano, if that really is your name."
Dr. Strangelove
31
* Re: Interesting performance quirk.
2008-10-28 11:35 ` Peter C. Chapin
2008-10-28 14:21 ` Robert A Duff
2008-10-28 18:27 ` Jeffrey R. Carter
@ 2008-10-28 23:22 ` Ludovic Brenta
2008-10-29 8:42 ` oenone
2008-10-29 9:59 ` Peter C. Chapin
2008-10-29 9:54 ` Alex R. Mosteo
2008-10-29 16:12 ` Colin Paul Gloster
4 siblings, 2 replies; 28+ messages in thread
From: Ludovic Brenta @ 2008-10-28 23:22 UTC (permalink / raw)
Peter C. Chapin writes:
> Windows XP Laptop (GNAT GPL 2008 for the Ada, Cygwin gcc for the C)
> -----
> My Library => 11 MB/s (with -O2 option and no debugging support)
> OpenSSL => 65 MB/s (Wow!)
>
> SUSE Linux in a VM on the same box
> -----
> My Library => 25 MB/s (odd)
> OpenSSL => 65 MB/s (now this makes sense at least)
I believe OpenSSL uses hand-written and carefully optimised assembly
for its inner loops.
Besides that, I was wondering if by any chance you were using one 32-
bit and one 64-bit operating system; if so, which one is Linux and
which one is Windows?
--
Ludovic Brenta.
* Re: Interesting performance quirk.
2008-10-28 23:22 ` Ludovic Brenta
@ 2008-10-29 8:42 ` oenone
2008-10-29 9:59 ` Peter C. Chapin
1 sibling, 0 replies; 28+ messages in thread
From: oenone @ 2008-10-29 8:42 UTC (permalink / raw)
On 29 Okt., 00:22, Ludovic Brenta <ludo...@ludovic-brenta.org> wrote:
> Besides that, I was wondering if by any chance you were using one 32-
> bit and one 64-bit operating system; if so, which one is Linux and
> which one is Windows?
And maybe check if there is a difference in static/dynamic linking...
The VM might apply some JIT optimizations to the executing binaries. You
could try running a Linux Live-CD with the same GNAT version.
* Re: Interesting performance quirk.
2008-10-28 23:22 ` Ludovic Brenta
2008-10-29 8:42 ` oenone
@ 2008-10-29 9:59 ` Peter C. Chapin
2008-10-29 10:19 ` Martin
2008-11-17 6:31 ` David Thompson
1 sibling, 2 replies; 28+ messages in thread
From: Peter C. Chapin @ 2008-10-29 9:59 UTC (permalink / raw)
Ludovic Brenta wrote:
> I believe OpenSSL uses hand-written and carefully optimised assembly
> for its inner loops.
That's interesting. It would help to explain the speed difference...
although as I said I'm sure I can do better, at least a little better,
with the compiled Ada than I am currently.
In any event I don't want to "resort" to assembly language or to
removing the Ada mandated checks (at least not in the final version). My
assumption is that the sort of person who would be interested in using a
crypto library in Ada would have that interest precisely because of the
safety afforded by the Ada language. This is why I'm content with the
library being a bit slower than the competitors that are throwing safety
to the wind by, for example, using assembly language.
Considering that cryptography is, almost by definition, used in security
sensitive programs it seems like issues of safety and correctness should
take priority over raw efficiency. Fast is good, of course, but it's
even more important to be both right and secure. Hence Ada and not C
(and definitely not assembly language).
> Besides that, I was wondering if by any chance you were using one 32-
> bit and one 64-bit operating system; if so, which one is Linux and
> which one is Windows?
No, they are both 32-bit systems. The Linux VM is on the same hardware
as the Windows system.
Peter
* Re: Interesting performance quirk.
2008-10-29 9:59 ` Peter C. Chapin
@ 2008-10-29 10:19 ` Martin
2008-11-17 6:31 ` David Thompson
1 sibling, 0 replies; 28+ messages in thread
From: Martin @ 2008-10-29 10:19 UTC (permalink / raw)
On Oct 29, 9:59 am, "Peter C. Chapin" <pcc482...@gmail.com> wrote:
> In any event I don't want to "resort" to assembly language or to
> removing the Ada mandated checks (at least not in the final version).
With a bit of care a lot of checks can be avoided or removed by the
compiler. A fairly common example of this wrt loops is:
   type Array_Index_Type is range ...;
   type Array_Type is array (Array_Index_Type) of ...;

   procedure Example (A : Array_Type) is
   begin
      for I in Array_Index_Type loop
         --  Do something with A (I) - a check is often included
      end loop;
   end Example;

is better written as:

   procedure Example (A : Array_Type) is
   begin
      for I in A'Range loop
         --  Do something with A (I) - no need for a check
      end loop;
   end Example;
I'm not a language lawyer, so I can't tell if that's compiler
inefficiencies or what the language requires but this was certainly
the behaviour of several compilers I've used***.
Cheers
-- Martin
*** I haven't checked this recently!!! :-)
* Re: Interesting performance quirk.
2008-10-29 9:59 ` Peter C. Chapin
2008-10-29 10:19 ` Martin
@ 2008-11-17 6:31 ` David Thompson
2008-11-17 11:51 ` Peter C. Chapin
1 sibling, 1 reply; 28+ messages in thread
From: David Thompson @ 2008-11-17 6:31 UTC (permalink / raw)
On Wed, 29 Oct 2008 05:59:24 -0400, "Peter C. Chapin"
<pcc482719@gmail.com> wrote:
> Ludovic Brenta wrote:
>
> > I believe OpenSSL uses hand-written and carefully optimised assembly
> > for its inner loops.
>
For some algorithms and some targets, including x86 Blowfish.
Unless deselected at build (specifically configuration) time.
(The actual assembler source is generated by Perl scripts acting
effectively as macros, so it might be more exact to call it hand-designed.)
> That's interesting. It would help to explain the speed difference...
> although as I said I'm sure I can do better, at least a little better,
> with the compiled Ada than I am currently.
>
> In any event I don't want to "resort" to assembly language or to
> removing the Ada mandated checks (at least not in the final version). My
> assumption is the the sort of person who would be interested in using a
> crypto library in Ada would have that interest precisely because of the
> safety afforded by the Ada language. This is why I'm content with the
> library being a bit slower than the competitors that are throwing safety
> to the wind by, for example, using assembly language.
>
> Considering that cryptography is, almost by definition, used in security
> sensitive programs it seems like issues of safety and correctness should
> take priority over raw efficiency. Fast is good, of course, but it's
> even more important to be both right and secure. Hence Ada and not C
> (and definitely not assembly language).
>
Correctness is certainly vital, but the extreme nonlinearity of any
good crypto algorithm means that even a small functional error will
almost certainly show up very quickly and obviously. However, in
modern systems it has sometimes become important to control exactly
which instructions are executed, and how the CPU executes them
(typically using assembler-dependent special operations), because of
'side-channel' (mostly timing and power) attacks that have been
developed to take advantage of things like the scheduling and caching
behavior of the crypto machine code. So carefully written assembler
can actually be more secure! (But I don't know whether all of
OpenSSL's assembler is.)
- formerly david.thompson1 || achar(64) || worldnet.att.net
* Re: Interesting performance quirk.
2008-11-17 6:31 ` David Thompson
@ 2008-11-17 11:51 ` Peter C. Chapin
0 siblings, 0 replies; 28+ messages in thread
From: Peter C. Chapin @ 2008-11-17 11:51 UTC (permalink / raw)
David Thompson wrote:
>> Considering that cryptography is, almost by definition, used in security
>> sensitive programs it seems like issues of safety and correctness should
>> take priority over raw efficiency. Fast is good, of course, but it's
>> even more important to be both right and secure. Hence Ada and not C
>> (and definitely not assembly language).
>>
> Correctness is certainly vital, but the extreme nonlinearity of any
> good crypto algorithm means that even a small functional error will
> almost certainly show up very quickly and obviously.
I understand your point, but it's not necessarily that simple. Take a look at
http://www.schneier.com/blowfish-bug.txt
which describes a "common" implementation bug where the code
inadvertently overwrites a large number of key bits with 1 bits. Because
encryption and decryption process the key in the same way, this bug does
not reveal itself as a problem with round trip processing.
Now in this case one would expect any reasonable degree of testing to
show the problem... but that assumes the test cases were generated with
an implementation that is free of this issue. If multiple carefully
written implementations all agree that increases one's confidence in the
result, but what does that really prove? They could all be wrong in
exactly the same way. Here "being wrong" might well mean "weakened
security relative to published theoretical results."
For example, consider Blowfish again... the algorithm requires a large
number of digits from the hexadecimal expansion of pi during its
initialization. Where did I get those initial values? I copied them from
Schneier's reference implementation. That's probably what a lot
of Blowfish implementors do. Yet what if they are wrong? My test cases
based on Schneier's published test vectors (on the site above) wouldn't
show the problem because we'd both be using the wrong initial values.
Should I believe those are really the digits from the hexadecimal
expansion of pi just because Schneier says so? No... more checking needs
to be done.
What I *should* do is write my own program to compute those values, use
a tool like SPARK to prove my program correct, and then cross check its
output with several other independently computed values for pi. In fact,
it's on my to-do list to do exactly this (but I admit it's fairly low in
priority right now). I'd be very surprised, of course, if Schneier's
value of pi is wrong but, hey, you never know.
> However, in
> modern systems it has sometimes become important to control exactly
> which instructions are executed, and how the CPU executes them
> (typically using assembler-dependent special operations), because of
> 'side-channel' (mostly timing and power) attacks that have been
> developed to take advantage of things like the scheduling and caching
> behavior of the crypto machine code. So carefully written assembler
> can actually be more secure!
That's interesting. I've heard of these kinds of attacks before but I
admit that I don't know much about them. I'll have to look into it.
Thanks for the heads-up.
Peter
* Re: Interesting performance quirk.
2008-10-28 11:35 ` Peter C. Chapin
` (2 preceding siblings ...)
2008-10-28 23:22 ` Ludovic Brenta
@ 2008-10-29 9:54 ` Alex R. Mosteo
2008-10-30 11:16 ` Peter C. Chapin
2008-10-29 16:12 ` Colin Paul Gloster
4 siblings, 1 reply; 28+ messages in thread
From: Alex R. Mosteo @ 2008-10-29 9:54 UTC (permalink / raw)
Peter C. Chapin wrote:
> I tried running gprof over my code to see what I could learn, but I
> don't know how to use that tool very well yet so I didn't learn much. I
> discovered that a certain procedure is called a HUGE number of times,
> but that's not unexpected considering where it's used. It's a small
> procedure so I might try inlining it.
I'd also recommend valgrind with, e.g., kcachegrind to visualize the
results. It might be easier to get into than gprof, and doesn't require
special parameters to build the executable.
* Re: Interesting performance quirk.
2008-10-29 9:54 ` Alex R. Mosteo
@ 2008-10-30 11:16 ` Peter C. Chapin
0 siblings, 0 replies; 28+ messages in thread
From: Peter C. Chapin @ 2008-10-30 11:16 UTC (permalink / raw)
Alex R. Mosteo wrote:
> I'd also recommend valgrind with, e.g., kcachegrind to visualize the
> results. It might be easier to get into than gprof, and doesn't require
> special parameters to build the executable.
Thanks for the suggestion. I've never used valgrind before, but it is
actually on my to-do list to figure it out in connection with an
unrelated project (that actually pertains to my day job).
Peter
* Re: Interesting performance quirk.
2008-10-28 11:35 ` Peter C. Chapin
` (3 preceding siblings ...)
2008-10-29 9:54 ` Alex R. Mosteo
@ 2008-10-29 16:12 ` Colin Paul Gloster
2008-10-30 11:23 ` Peter C. Chapin
4 siblings, 1 reply; 28+ messages in thread
From: Colin Paul Gloster @ 2008-10-29 16:12 UTC (permalink / raw)
In
news:4904516f$0$28741$4d3efbfe@news.sover.net
on October 26th, 2008, Peter C. Chapin submitted:
|------------------------------------------------------------------------|
|"[..] |
| |
|I am also planning to write a similar program in C using OpenSSL's |
|Blowfish implementation. I want to get a point of comparison to see if |
|my implementation has "reasonable" performance or not (here I'm assuming|
|that OpenSSL is reasonable). I'm curious now if the C/OpenSSL version |
|will also demonstrate this same performance quirk." |
|------------------------------------------------------------------------|
I had intended to reply with a claim that any language would be sped
up in the emulated system, however instead in
news:4906f908$0$5781$4d3efbfe@news.sover.net
on Tue, 28 Oct 2008, Peter C. Chapin submitted:
|------------------------------------------------------------------------|
|"[..] |
| |
|In further developments... |
| |
|I wrote a version of the benchmark program in C that uses OpenSSL's |
|Blowfish implementation. The program is the "same" in that it creates |
|the same amount of data and processes it in the same way. The only |
|significant difference is that it calls into OpenSSL. It's quite a bit |
|faster. Right now I get |
| |
|Windows XP Laptop (GNAT GPL 2008 for the Ada, Cygwin gcc for the C) |
|----- |
|My Library => 11 MB/s (with -O2 option and no debugging support) |
|OpenSSL => 65 MB/s (Wow!) |
| |
|SUSE Linux in a VM on the same box |
|----- |
|My Library => 25 MB/s (odd) |
|OpenSSL => 65 MB/s (now this makes sense at least) |
| |
|[..]" |
|------------------------------------------------------------------------|
Earlier this year I had used QEMU on Windows (possibly not Windows XP)
to have a GNU/Linux distribution (possibly RedHat) emulated. I ran a
Bourne shell script or a Bourne Again SHell script in the emulated
system which made thousands of fairly short I/O transactions. The
emulated system including its pretend harddisk were kept small enough
(no more than a few hundred megabytes) to be kept solely in the real
physical primary memory instead of relying on virtual memory.
It was faster than running the same script on Cygwin on the same
machine. (The throughput of my shell script was higher in the emulated
system, but the emulated system's human interface, whose feel is
dominated by how quickly it starts to respond rather than by
throughput, suffered from sluggish latencies.) Conversely, on
WWW.Debian.org/intro/why_debian
it was claimed:
!-------------------------------------------------------------------!
!"If you are not already a GNU/Linux user, you may also enjoy the !
!following benefits: !
! !
![..] !
!Fast and easy on memory !
! Other operating systems may be as fast in one or two areas, but!
! being based on GNU/Linux, Debian is lean and mean. Windows !
! software run from GNU/Linux using an emulator sometimes runs !
! faster than when run in the native environment. !
![..]" !
!-------------------------------------------------------------------!
That Debian webpage lacks important information for a fair comparison
with Windows.
However, I do not know why Peter C. Chapin's Ada code is sped up but
OpenSSL is not.
Yours sincerely,
Colin Paul Gloster
* Re: Interesting performance quirk.
2008-10-29 16:12 ` Colin Paul Gloster
@ 2008-10-30 11:23 ` Peter C. Chapin
2008-10-31 13:41 ` Colin Paul Gloster
0 siblings, 1 reply; 28+ messages in thread
From: Peter C. Chapin @ 2008-10-30 11:23 UTC (permalink / raw)
Colin Paul Gloster wrote:
> Earlier this year I had used QEMU on Windows (possibly not Windows XP)
> to have a GNU/Linux distribution (possibly RedHat) emulated. I ran a
> Bourne shell script or a Bourne Again Shell script in the emulated
> system which made thousands of fairly short I/O transactions. The
> emulated system including its pretend harddisk were kept small enough
> (no more than a few hundred megabytes) to be kept solely in the real
> physical primary memory instead of relying on virtual memory.
>
> It was faster than running the same script on Cygwin on the same
> machine.
That's interesting. I think it's probably conventional wisdom that doing
I/O in a VM would be slower than outside the virtual machine. I'm sure
that's true in many cases, although the situation you described shows
that it's not always true.
My program doesn't do any I/O during its main loop. Also the memory
block I work over is only 1 MB long so I don't think paging would be an
issue (there is no disk activity when I run it). In some respects the
program is ideal for performance analysis in that there are relatively
few complicating factors involved. In fact, that was my intention when I
wrote it.
One complicating issue that remains is the behavior of the memory cache.
I wonder if one of the programs is missing the cache more than the
other. I'm not clear on why it would do that, however. The same hardware
is being used after all. Perhaps the Windows compiler has organized the
executable in some cache un-friendly way.
Peter
* Re: Interesting performance quirk.
2008-10-30 11:23 ` Peter C. Chapin
@ 2008-10-31 13:41 ` Colin Paul Gloster
2008-11-01 15:41 ` Gene
0 siblings, 1 reply; 28+ messages in thread
From: Colin Paul Gloster @ 2008-10-31 13:41 UTC (permalink / raw)
On Thu, 30 Oct 2008, Peter C. Chapin wrote:
|------------------------------------------------------------------------|
|"Colin Paul Gloster wrote: |
| |
|> Earlier this year I had used QEMU on Windows (possibly not Windows XP)|
|> to have a GNU/Linux distribution (possibly RedHat) emulated. I ran a |
|> Bourne shell script or a Bourne Again Shell script in the emulated |
|> system which made thousands of fairly short I/O transactions. The |
|> emulated system including its pretend harddisk were kept small enough |
|> (no more than a few hundred megabytes) to be kept solely in the real |
|> physical primary memory instead of relying on virtual memory. |
|> |
|> It was faster than running the same script on Cygwin on the same |
|> machine. |
| |
|That's interesting. I think it's probably conventional wisdom that doing|
|I/O in a VM would be slower than outside the virtual machine. I'm sure |
|that's true in many cases, although the situation you described shows |
|that it's not always true." |
|------------------------------------------------------------------------|
For clarity, I explain that the I/O of the virtual machine which I
referred to was merely I/O to its emulated filesystems, all of which
together plus the emulated memory were small enough to fit into the
genuine physical memory of the host operating system.
|------------------------------------------------------------------------|
|"My program [..] the memory |
|block I work over is only 1 MB long so I don't think paging would be an |
|issue (there is no disk activity when I run it). [..] |
| |
|[..]" |
|------------------------------------------------------------------------|
Your program's memory block's size might be of the order of one
megabyte, but I do not know whether the emulated filesystems which you
used were also small enough to fit into emulated memory. However, this
does not explain why one program you have tried has been sped up by
emulation whereas another has not been sped up.
* Re: Interesting performance quirk.
2008-10-31 13:41 ` Colin Paul Gloster
@ 2008-11-01 15:41 ` Gene
0 siblings, 0 replies; 28+ messages in thread
From: Gene @ 2008-11-01 15:41 UTC (permalink / raw)
On Oct 31, 9:41 am, Colin Paul Gloster <Colin_Paul_Glos...@ACM.org>
wrote:
> On Thu, 30 Oct 2008, Peter C. Chapin wrote:
>
> |------------------------------------------------------------------------|
> |"Colin Paul Gloster wrote: |
>
> | |
> |> Earlier this year I had used QEMU on Windows (possibly not Windows XP)|
> |> to have a GNU/Linux distribution (possibly RedHat) emulated. I ran a |
> |> Bourne shell script or a Bourne Again Shell script in the emulated |
> |> system which made thousands of fairly short I/O transactions. The |
> |> emulated system including its pretend harddisk were kept small enough |
> |> (no more than a few hundred megabytes) to be kept solely in the real |
> |> physical primary memory instead of relying on virtual memory. |
> |> |
> |> It was faster than running the same script on Cygwin on the same |
> |> machine. |
> | |
> |That's interesting. I think it's probably conventional wisdom that doing|
> |I/O in a VM would be slower than outside the virtual machine. I'm sure |
> |that's true in many cases, although the situation you described shows |
> |that it's not always true." |
> |------------------------------------------------------------------------|
>
> For clarity, I explain that the I/O of the virtual machine which I
> referred to was merely I/O to its emulated filesystems, all of which
> together plus the emulated memory were small enough to fit into the
> genuine physical memory of the host operating system.
>
> |------------------------------------------------------------------------|
> |"My program [..] the memory |
> |block I work over is only 1 MB long so I don't think paging would be an |
> |issue (there is no disk activity when I run it). [..] |
> | |
> |[..]" |
> |------------------------------------------------------------------------|
>
> Your program's memory block's size might be of the order of one
> megabyte, but I do not know whether the emulated filesystems which you
> used were also small enough to fit into emulated memory. However, this
> does not explain why one program you have tried has been sped up by
> emulation whereas another has not been sped up.
Exactly right. The most obvious explanation is that system-dependent
code or build conventions have led to some important difference in the
run-time support. Detailed profiling is probably the only way to
figure out where. FWIW, I remember a similar situation that finally
turned out to be explained by compilation with the Windows
multithreaded debugging libraries. When we switched to production,
single-threaded libraries, the differences vanished or went in favor
of Windows.