comp.lang.ada
* GNAT (GCC) Profile Guided Compilation
@ 2012-06-29  9:17 Keean Schupke
  2012-06-29  9:34 ` Dmitry A. Kazakov
                   ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Keean Schupke @ 2012-06-29  9:17 UTC (permalink / raw)


I am developing a Monte-Carlo simulation engine, and I have implementations in C++ and Ada. All performance figures come from the same reference hardware. Both are built using GCC, and similar optimisation is applied to each (same optimisation level, cpu arch, inlining etc).

My C++ implementation achieves 40,000 simulations per second.

My Ada port initially achieves 32,000 iterations per second.

I refactored the Ada port to use System.Address to access the main data structure (in the C++ version, benchmarking showed a clear performance benefit for *ptr_x compared to array[x]).

The Ada port using System.Address achieves 40,000 simulations per second - so this is looking good.

However, when I use profile guided compilation, the C++ code performance leaps to 56,000 simulations per second, whereas the Ada only achieves 44,000 simulations per second.

Anyone have any ideas why profile guided compilation performs so poorly with Ada compared to C++? The initial results with normal compilation seemed very promising.


Cheers,
Keean.




* Re: GNAT (GCC) Profile Guided Compilation
  2012-06-29  9:17 GNAT (GCC) Profile Guided Compilation Keean Schupke
@ 2012-06-29  9:34 ` Dmitry A. Kazakov
  2012-06-29 10:01   ` Keean Schupke
  2012-06-29 10:48 ` Simon Wright
  2012-06-29 12:39 ` gautier_niouzes
  2 siblings, 1 reply; 37+ messages in thread
From: Dmitry A. Kazakov @ 2012-06-29  9:34 UTC (permalink / raw)


On Fri, 29 Jun 2012 02:17:19 -0700 (PDT), Keean Schupke wrote:

> Anyone have any ideas why profile guided compilation performs so poorly
> with Ada compared to C++?

Probably because you programmed it using low level C-ish stuff like
addresses instead of accessing data directly. In general, if you write a C
program in Ada (or any other language), you cannot expect it to do better
than in C. To beat C, the precondition is to design it the Ada way.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de




* Re: GNAT (GCC) Profile Guided Compilation
  2012-06-29  9:34 ` Dmitry A. Kazakov
@ 2012-06-29 10:01   ` Keean Schupke
  2012-06-29 10:24     ` Keean Schupke
  2012-06-29 12:05     ` Dmitry A. Kazakov
  0 siblings, 2 replies; 37+ messages in thread
From: Keean Schupke @ 2012-06-29 10:01 UTC (permalink / raw)
  Cc: mailbox

On Friday, 29 June 2012 10:34:19 UTC+1, Dmitry A. Kazakov  wrote:
> On Fri, 29 Jun 2012 02:17:19 -0700 (PDT), Keean Schupke wrote:
> 
> > Anyone have any ideas why profile guided compilation performs so poorly
> > with Ada compared to C++?
> 
> Probably because you programmed it using low level C-ish stuff like
> addresses instead of accessing data directly. In general, if you write a C
> program in Ada (or any other language), you cannot expect it to do better
> than in C. To beat C, the precondition is to design it the Ada way.
> 
> -- 
> Regards,
> Dmitry A. Kazakov
> http://www.dmitry-kazakov.de

I don't think you are right here, but I would be happy to be proved wrong. 

In general I find the choice of algorithm dominates performance. In this case most of the performance is dominated by a disjoint-set (union-find) algorithm operating on an array of data. The data nominally represents a 2D space (although I am using a 1D array to avoid the cost of multiplication on every access).

Most operations access the neighbouring cells, hence require +/- one cell or +/- one row steps.
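
Concretely, the layout is along these lines (a minimal sketch with invented sizes, not the real board package): a Width x Height grid stored row-major in a 1D array, so the four neighbours of cell I are reached with +/- 1 and +/- Width, and no multiplication is needed once I is known.

    package Grid is
        Width  : constant := 64;   -- invented for the sketch
        Height : constant := 64;
        subtype Cell_Index is Integer range 0 .. Width * Height - 1;
        type Cell_Array is array (Cell_Index) of Natural;
        procedure Sum_Neighbours(
            A : in Cell_Array;
            I : in Cell_Index;
            S : out Natural
        );
    end Grid;

    package body Grid is
        procedure Sum_Neighbours(
            A : in Cell_Array;
            I : in Cell_Index;
            S : out Natural
        ) is begin
            -- Assumes I is an interior cell, as the simulation guarantees.
            S := A(I - 1) + A(I + 1) + A(I - Width) + A(I + Width);
        end Sum_Neighbours;
    end Grid;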

Maintaining the address of the current cell is faster than using indexed addressing (mainly because of the extra parameter that needs to be passed around increasing the pressure on register allocation).

I would be interested to understand how an algorithm can be implemented in a more Ada way?


Cheers,
Keean.




* Re: GNAT (GCC) Profile Guided Compilation
  2012-06-29 10:01   ` Keean Schupke
@ 2012-06-29 10:24     ` Keean Schupke
  2012-06-29 12:26       ` stefan-lucks
  2012-06-29 12:05     ` Dmitry A. Kazakov
  1 sibling, 1 reply; 37+ messages in thread
From: Keean Schupke @ 2012-06-29 10:24 UTC (permalink / raw)
  Cc: mailbox

On Friday, 29 June 2012 11:01:30 UTC+1, Keean Schupke  wrote:
> On Friday, 29 June 2012 10:34:19 UTC+1, Dmitry A. Kazakov  wrote:
> > On Fri, 29 Jun 2012 02:17:19 -0700 (PDT), Keean Schupke wrote:
> > 
> > > Anyone have any ideas why profile guided compilation performs so poorly
> > > with Ada compared to C++?
> > 
> > Probably because you programmed it using low level C-ish stuff like
> > addresses instead of accessing data directly. In general, if you write a C
> > program in Ada (or any other language), you cannot expect it to do better
> > than in C. To beat C, the precondition is to design it the Ada way.
> > 
> > -- 
> > Regards,
> > Dmitry A. Kazakov
> > http://www.dmitry-kazakov.de
> 
> I don't think you are right here, but I would be happy to be proved wrong. 
> 
> In general I find the choice of algorithm dominates performance. In this case most of the performance is dominated by a disjoint-set (union-find) algorithm operating on an array of data. The data nominally represents a 2D space (although I am using a 1D array to avoid the cost of multiplication on every access).
> 
> Most operations access the neighbouring cells, hence require +/- one cell or +/- one row steps.
> 
> Maintaining the address of the current cell is faster than using indexed addressing (mainly because of the extra parameter that needs to be passed around increasing the pressure on register allocation).
> 
> I would be interested to understand how an algorithm can be implemented in a more Ada way?
> 
> 
> Cheers,
> Keean.

To facilitate the discussion I will post the core packages from the implementation. First, the slower version using array indexing (this is the 32k simulations per second version). As my workload is lookup-dominated, this eager disjoint-set algorithm is faster for me than WQUPC (Weighted Quick Union with Path Compression).



    -- (C)2011 Keean Schupke, all rights reserved.
    generic
        type Element_Type is limited private;
        type Element_Access is not null access all Element_Type;
        type Set_Index is range <>;
    package Eager_Disjointsets is
        type Set_Type is limited private;
        type Set_Access is not null access all Set_Type;
        function Find(
            Set : in Set_Access;
            Index : in Set_Index
        ) return Set_Index;
        pragma Inline_Always(Find);

        procedure Makeset(
            Set : in Set_Access;
            Index : in Set_Index
        );
        pragma Inline_Always(Makeset);

        procedure Link(
            Set : in Set_Access;
            Left, Right : in out Set_Index
        );
        pragma Inline_Always(Link);

        function Get(
            Set : in Set_Access;
            Index : in Set_Index
        ) return Element_Access;
        pragma Inline_Always(Get);

        function Next(
            Set : in Set_Access;
            Index : in Set_Index
        ) return Set_Index;
        pragma Inline_Always(Next);

        procedure Set_Next(
            Set : in Set_Access;
            Index : in Set_Index;
            Next : in Set_Index
        );
        pragma Inline_Always(Set_Next);

    private
        type Node_Type is limited record
            Canonical : Set_Index;
            Successor : Set_Index;
            Size : Natural;
            Value : aliased Element_Type;
        end record;
        type Set_Type is array (Set_Index) of Node_Type;
    end Eager_Disjointsets;

    package body Eager_Disjointsets is
        procedure Makeset(
            Set : in Set_Access;
            Index : in Set_Index
        ) is begin
            Set(Index).Canonical := Index;
            Set(Index).Size := 1;
        end Makeset;

        function Find(
            Set : in Set_Access;
            Index : in Set_Index
        ) return Set_Index is begin
            return Set(Index).Canonical;
        end Find;

        procedure Swap_Indexes is new Swap(Set_Index);

        procedure Link(
            Set : in Set_Access;
            Left, Right : in out Set_Index) is
        begin
            if Set(Left).Size <= Set(Right).Size then
                Swap_Indexes(Left, Right);
            end if;

            declare
                Index : Set_Index := Right;
            begin
                Link_Loop : loop
                    Set(Index).Canonical := Left;
                    Index := Set(Index).Successor;
                    exit Link_Loop when Index = Right;
                end loop Link_Loop;
            end;

            Add(Set(Left).Size, Set(Right).Size);

            Swap_Indexes(Set(Left).Successor, Set(Right).Successor);
        end Link;

        function Get(
            Set : in Set_Access;
            Index : in Set_Index
        ) return Element_Access is begin
            return Set(Index).Value'Access;
        end Get;

        function Next(
            Set : in Set_Access;
            Index : in Set_Index
        ) return Set_Index is begin
            return Set(Index).Successor;
        end Next;

        procedure Set_Next(
            Set : in Set_Access;
            Index : in Set_Index;
            Next : in Set_Index
        ) is begin
            Set(Index).Successor := Next;
        end Set_Next;
    end Eager_Disjointsets;
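
For reference, this is why the eager scheme wins here: a WQUPC-style Find has to chase parent links and compress the path on every lookup. A rough sketch (invented types, not code I actually use):

    package WQUPC is
        type Parent_Array is array (Positive range <>) of Positive;
        procedure Find(
            Parents : in out Parent_Array;
            Index : in Positive;
            Root : out Positive
        );
    end WQUPC;

    package body WQUPC is
        procedure Find(
            Parents : in out Parent_Array;
            Index : in Positive;
            Root : out Positive
        ) is
            Current, Parent : Positive;
        begin
            Root := Index;
            while Parents(Root) /= Root loop        -- walk up to the root
                Root := Parents(Root);
            end loop;
            Current := Index;
            while Parents(Current) /= Current loop  -- second pass: compress the path
                Parent := Parents(Current);
                Parents(Current) := Root;
                Current := Parent;
            end loop;
        end Find;
    end WQUPC;

In the eager version above, Find is a single record read, at the price of Link relabelling every member of the smaller set.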



And this is the faster 40k simulations per second version:



    -- (C)2011 Keean Schupke, all rights reserved.
    generic
        Size : Int;
        type Element_Type is limited private;
        type Element_Access is not null access all Element_Type;
    package Eager_Disjointsets is
        type Set_Type is limited private;
        type Set_Access is not null access all Set_Type;
        type Node_Type is limited private;
        type Cursor is access all Node_Type;

        function Find(Node : in Cursor) return Cursor;
        pragma Inline_Always(Find);
        procedure Makeset(Node : in Cursor);
        pragma Inline_Always(Makeset);
        procedure Link(Left, Right : in out Cursor);
        pragma Inline_Always(Link);

        function Succ(Node : in Cursor) return Cursor;
        pragma Inline_Always(Succ);
        procedure Set_Succ(Node, Succ : in Cursor);
        pragma Inline_Always(Set_Succ);

        function First(Set : in Set_Access) return Cursor;
        pragma Inline_Always(First);
        function Element(Node : in Cursor) return Element_Access;
        pragma Inline_Always(Element);
        function Index(Set : in Set_Access; Node : in Cursor) return Int;
        pragma Inline_Always(Index);
        procedure Next(Node : in out Cursor);
        pragma Inline_Always(Next);

        function Step return Int;
        pragma Inline_Always(Step);
        function "+" (Left : in Cursor; Right : in Int) return Cursor;
        pragma Inline_Always("+");
        function "-" (Left : in Cursor; Right : in Int) return Cursor;
        pragma Inline_Always("-");
    private
        type Node_Type is limited record
            Canonical : Cursor;
            Successor : Cursor;
            Size : Int;
            Value : aliased Element_Type;
        end record;
        type Set_Type is array (0 .. Size - 1) of aliased Node_Type;

        package Node_Address is new System.Address_To_Access_Conversions(Node_Type);

        Node_Size : Storage_Offset := Node_Type'Object_Size
            / System.Storage_Elements.Storage_Element'Size;
    end Eager_Disjointsets;

    package body Eager_Disjointsets is
        function To_Cursor(A : System.Address) return Cursor is
        begin
            return Cursor(Node_Address.To_Pointer(A));
        end To_Cursor;
        pragma Inline_Always(To_Cursor);

        function To_Address(C : Cursor) return System.Address is
        begin
            return Node_Address.To_Address(Node_Address.Object_Pointer(C));
        end To_Address;
        pragma Inline_Always(To_Address);

        procedure Makeset(
            Node : in Cursor
        ) is begin
            Node.Canonical := Node;
            Node.Size := 1;
        end Makeset;

        function Find(
            Node : in Cursor
        ) return Cursor is begin
            return Cursor(Node.Canonical);
        end Find;

        procedure Swap_Cursors is new Swap(Cursor);

        procedure Link(
            Left, Right : in out Cursor) is
        begin
            if Left.Size <= Right.Size then
                Swap_Cursors(Cursor(Left), Cursor(Right));
            end if;

            declare
                Node : Cursor := Right;
            begin
                Link_Loop : loop
                    Node.Canonical := Left;
                    Node := Cursor(Node.Successor);
                    exit Link_Loop when Node = Right;
                end loop Link_Loop;
            end;

            Add(Left.Size, Right.Size);
            Swap_Cursors(Cursor(Left.Successor), Cursor(Right.Successor));
        end Link;

        function Succ(
            Node : in Cursor
        ) return Cursor is begin
            return Cursor(Node.Successor);
        end Succ;
        procedure Set_Succ(
            Node, Succ : in Cursor
        ) is begin
            Node.Successor := Succ;
        end Set_Succ;

        function First(
            Set : in Set_Access
        ) return Cursor is begin
            return Set(Set_Type'First)'Access;
        end First;

        function Element(
            Node : in Cursor
        ) return Element_Access is begin
            return Node.Value'Access;
        end Element;

        function Index(
            Set : in Set_Access;
            Node : in Cursor
        ) return Int is begin
            return Int((To_Address(Node) - To_Address(First(Set))) / Node_Size);
        end Index;

        procedure Next(
            Node : in out Cursor
        ) is begin
            Node := To_Cursor(To_Address(Node) + Node_Size);
        end Next;

        function Step return Int is
        begin
            return Int(Node_Size);
        end Step;

        function "+" (Left : in Cursor; Right : in Int) return Cursor is
        begin
            return To_Cursor(To_Address(Left) + Storage_Offset(Right));
        end "+";

        function "-" (Left : in Cursor; Right : in Int) return Cursor is
        begin
            return To_Cursor(To_Address(Left) - Storage_Offset(Right));
        end "-";
    end Eager_Disjointsets;



I would appreciate any advice on how this could be implemented in a more Ada-friendly way, or any general performance tips.



Cheers,
Keean.




* Re: GNAT (GCC) Profile Guided Compilation
  2012-06-29  9:17 GNAT (GCC) Profile Guided Compilation Keean Schupke
  2012-06-29  9:34 ` Dmitry A. Kazakov
@ 2012-06-29 10:48 ` Simon Wright
  2012-06-29 11:14   ` Keean Schupke
  2012-06-29 12:39 ` gautier_niouzes
  2 siblings, 1 reply; 37+ messages in thread
From: Simon Wright @ 2012-06-29 10:48 UTC (permalink / raw)


Keean Schupke <keean.schupke@googlemail.com> writes:

> I am developing a Monte-Carlo simulation engine, and I have
> implementations in C++ and Ada. All performance figures come from the
> same reference hardware. Both are built using GCC, and similar
> optimisation is applied to each (same optimisation level, cpu arch,
> inlining etc).
>
> My C++ implementation achieves 40,000 simulations per second.
>
> My Ada port initially achieves 32,000 iterations per second.
>
> I refactored the Ada port to use System.Address to access the main
> data structure (in the C++ version, benchmarking showed a clear
> performance benefit for *ptr_x compared to array[x]).
>
> The Ada port using System.Address achieves 40,000 simulations per
> second - so this is looking good.

Did you try -gnatp (suppress all checks)? (only to be used when you're
sure the code is correct!)

You could also use pragma Suppress at the hot spots.
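
For example (a sketch with invented names), Suppress can be scoped to a single hot subprogram instead of dropping every check in the program with -gnatp:

    package Hot is
       type Index is range 1 .. 1_000;
       type Data is array (Index) of Natural;
       procedure Bump (A : in out Data; I : Index);
    end Hot;

    package body Hot is
       procedure Bump (A : in out Data; I : Index) is
          pragma Suppress (Index_Check);     --  no index checks in this body
          pragma Suppress (Overflow_Check);  --  nor integer overflow checks
       begin
          A (I) := A (I) + 1;
       end Bump;
    end Hot;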




* Re: GNAT (GCC) Profile Guided Compilation
  2012-06-29 10:48 ` Simon Wright
@ 2012-06-29 11:14   ` Keean Schupke
  0 siblings, 0 replies; 37+ messages in thread
From: Keean Schupke @ 2012-06-29 11:14 UTC (permalink / raw)


On Friday, 29 June 2012 11:48:10 UTC+1, Simon Wright  wrote:
> Keean Schupke:
> 
> > I am developing a Monte-Carlo simulation engine, and I have
> > implementations in C++ and Ada. All performance figures come from the
> > same reference hardware. Both are built using GCC, and similar
> > optimisation is applied to each (same optimisation level, cpu arch,
> > inlining etc).
> >
> > My C++ implementation achieves 40,000 simulations per second.
> >
> > My Ada port initially achieves 32,000 iterations per second.
> >
> > I refactored the Ada port to use System.Address to access the main
> > data structure (in the C++ version, benchmarking showed a clear
> > performance benefit for *ptr_x compared to array[x]).
> >
> > The Ada port using System.Address achieves 40,000 simulations per
> > second - so this is looking good.
> 
> Did you try -gnatp (suppress all checks)? (only to be used when you're
> sure the code is correct!)
> 
> You could also use pragma Suppress at the hot spots.



Here are the build scripts for the C++ and Ada versions:



For normal compilation I am using:

gnatmake -march=native -O3 -gnatp -gnatN xxx
g++ -o yyy -std=gnu++0x -march=native -O3 yyy.cpp

Compiled like this, the performance of the Ada and C++ versions is virtually identical (if anything, Ada is slightly faster) at about 40k simulations per second.



For profile guided compilation I am using:

xxx: xxx.adb
	rm -f *.gcda
	gnatmake -march=native -O3 -fprofile-generate -gnatp -gnatN xxx
	./xxx
	touch xxx.adb
	gnatmake -march=native -O3 -fprofile-use -gnatp -gnatN xxx

yyy : yyy.cpp
    rm -f *.gcda
    g++ -o yyy -std=gnu++0x -march=native -O3 -fprofile-generate yyy.cpp
    ./yyy
    touch yyy.cpp
    g++ -o yyy -std=gnu++0x -march=native -O3 -fprofile-use yyy.cpp

With profile guided compilation, Ada gets 44.5k simulations per second versus ~55k per second for C++, which makes the C++ about 25% faster.



Cheers,
Keean.




* Re: GNAT (GCC) Profile Guided Compilation
  2012-06-29 10:01   ` Keean Schupke
  2012-06-29 10:24     ` Keean Schupke
@ 2012-06-29 12:05     ` Dmitry A. Kazakov
  1 sibling, 0 replies; 37+ messages in thread
From: Dmitry A. Kazakov @ 2012-06-29 12:05 UTC (permalink / raw)


On Fri, 29 Jun 2012 03:01:30 -0700 (PDT), Keean Schupke wrote:

> On Friday, 29 June 2012 10:34:19 UTC+1, Dmitry A. Kazakov  wrote:
>> On Fri, 29 Jun 2012 02:17:19 -0700 (PDT), Keean Schupke wrote:
>> 
>>> Anyone have any ideas why profile guided compilation performs so poorly
>>> with Ada compared to C++?
>> 
>> Probably because you programmed it using low level C-ish stuff like
>> addresses instead of accessing data directly. In general, if you write a C
>> program in Ada (or any other language), you cannot expect it to do better
>> than in C. To beat C, the precondition is to design it the Ada way.

> In general I find the choice of algorithm dominates performance. In this
> case most of the performance is dominated by a disjoint-set (union-find)
> algorithm operating on an array of data. The data nominally represents a
> 2D space (although I am using a 1D array to avoid the cost of
> multiplication on every access).

That still does not justify machine addresses or pointers to limited
targets.

I would not be so sure that access to a 2D array is slower when the
indexing is done by the compiler. That depends on too many factors.

I also do not know what effect profiling has on inlining and on the
optimization of generic instances in the presence of numerous very short
subprograms. I guess that could severely distort the picture, maybe beyond
recognition. I must admit I have never used gcc profiling seriously;
instead, I always take direct time measurements.

If performance is OK without profiling, why do you care?

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de




* Re: GNAT (GCC) Profile Guided Compilation
  2012-06-29 10:24     ` Keean Schupke
@ 2012-06-29 12:26       ` stefan-lucks
  2012-06-29 12:51         ` Keean Schupke
  0 siblings, 1 reply; 37+ messages in thread
From: stefan-lucks @ 2012-06-29 12:26 UTC (permalink / raw)


On Fri, 29 Jun 2012, Keean Schupke wrote:

> On Friday, 29 June 2012 11:01:30 UTC+1, Keean Schupke  wrote:

> > Maintaining the address of the current cell is faster than using 
> > indexed addressing (mainly because of the extra parameter that needs 
> > to be passed around increasing the pressure on register allocation).

I guess you did some benchmarking, regarding that issue?

> > I would be interested to understand how an algorithm can be 
> > implemented in a more Ada way?

Faster or slower, you are extensively using access types (Element_Access 
and Set_Access). This is something Ada programmers usually try to avoid, 
except when there is an urgent need for some dynamic memory management 
(e.g., implementing one's own Container library). 

>     generic
>         type Element_Type is limited private;
>         type Element_Access is not null access all Element_Type;
>         type Set_Index is range <>;
>     package Eager_Disjointsets is
>         type Set_Type is limited private;
>         type Set_Access is not null access all Set_Type;

What, exactly, do you use Element_Access for? And why, exactly, do you 
bother to define Set_Access?

>         function Find(
>             Set : in Set_Access;
>             Index : in Set_Index
>         ) return Set_Index;
>         pragma Inline_Always(Find);
 
Why don't you just write the following? (Same for your other operations.)

	function find
	  (Set   : in Set_Type;
	   Index : in Set_Index) return Set_Index;

[...]

>     private
>         type Node_Type is limited record
>             Canonical : Set_Index;
>             Successor : Set_Index;
>             Size : Natural;
>             Value : aliased Element_Type;
>         end record;
>         type Set_Type is array (Set_Index) of Node_Type;
>     end Eager_Disjointsets;

Why, exactly, do you make Value aliased? Is this related to your "faster 
than indexed addressing" point above?

This may be hard for the compiler to properly optimize: An aliased 
component Element_Type in a record ... and then there is an array of such 
records, and all is inside a generic package. 

If you really need Value to be aliased (do you?), then you can try out 
rewriting Set_Type:

  type Set_Index_Array is array(Set_Index) of Set_Index;
  type Size_Array      is array(Set_Index) of Natural;
  type Value_Array     is array(Set_Index) of aliased Element_Type;

  type Set_Type is record
    Canonical, Successor: Set_Index_Array;
    Size:                 Size_Array;
    Value:                Value_Array;
  end record;

I have no idea if the above modification helps to speed up the 
computation, or not. It may backfire, I don't know. 

>     package body Eager_Disjointsets is
>         function Find(
>             Set : in Set_Access;
>             Index : in Set_Index
>         ) return Set_Index is begin
>             return Set(Index).Canonical;
>         end Find;

Now, here you would have to write "return Set.Canonical(Index);"

>     -- (C)2011 Keean Schupke, all rights reserved.
>     generic
>         Size : Int;
>         type Element_Type is limited private;
>         type Element_Access is not null access all Element_Type;
>     package Eager_Disjointsets is
>         type Set_Type is limited private;
>         type Set_Access is not null access all Set_Type;
>         type Node_Type is limited private;
>         type Cursor is access all Node_Type;

Is Int another name for Integer?

I understand that for performance reasons you introduced Cursor as access
all Node_Type and later Set_Type as an array of aliased Node_Type.

But again, what do you need Element_Access and Set_Access for?

>         package Node_Address is new 
>         System.Address_To_Access_Conversions(Node_Type);

Probably, the compiler finds that out by itself -- but you may want to 
make it explicit that Node_Size doesn't change:

>         Node_Size : constant Storage_Offset := Node_Type'Object_Size
>             / System.Storage_Elements.Storage_Element'Size;

Also, if Node_Size isn't a power of 2, it may be difficult for the 
compiler to decide on the alignment of objects of type Node_Type -- with 
possible effects on caching. On one hand, it may be good to pack the 
objects closely together, to make as many fit into a single cache line as 
possible. On the other hand, it is bad if a part of a certain node is in 
one cache line, another part in another cache line.
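
A quick way to see what the compiler actually chose (concrete stand-ins for the generic formals, names invented; Object_Size is the GNAT-specific attribute you already use):

    with Ada.Text_IO; use Ada.Text_IO;
    with Interfaces;
    procedure Show_Layout is
       type Int32 is new Interfaces.Integer_32;
       type Node is record
          Canonical : Int32;
          Successor : Int32;
          Size      : Int32;
          A, B, C   : Int32;   --  stand-in for the 12-byte payload
       end record;
    begin
       Put_Line ("Node'Object_Size (bits) =" & Integer'Image (Node'Object_Size));
       Put_Line ("Node'Alignment  (bytes) =" & Integer'Image (Node'Alignment));
    end Show_Layout;

If the numbers come out awkward, an explicit filler component or a Size/Alignment clause can round the node up to a power of two, at the cost of some wasted space.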

>         function Find(
>             Node : in Cursor
>         ) return Cursor is begin
>             return Cursor(Node.Canonical);
>         end Find;

Not related to performance, but why don't you just write 
"return Node.Canonical;" as Node.Canonical is of type Cursor, anyway?


-- 
---- Stefan.Lucks (at) uni-weimar.de, University of Weimar, Germany  ----
    <http://www.uni-weimar.de/cms/medien/mediensicherheit/home.html>
------  I  love  the  taste  of  Cryptanalysis  in  the  morning!  ------





* Re: GNAT (GCC) Profile Guided Compilation
  2012-06-29  9:17 GNAT (GCC) Profile Guided Compilation Keean Schupke
  2012-06-29  9:34 ` Dmitry A. Kazakov
  2012-06-29 10:48 ` Simon Wright
@ 2012-06-29 12:39 ` gautier_niouzes
  2012-06-29 12:52   ` Keean Schupke
  2 siblings, 1 reply; 37+ messages in thread
From: gautier_niouzes @ 2012-06-29 12:39 UTC (permalink / raw)


Hello,

Did you try with integer array indices instead of Access or System.Address?
It is possible that you are making some counter-productive manual optimizations.

If you give some reference for the algorithm I am tempted to help; it seems a very exciting area to learn...

Cheers
_________________________ 
Gautier's Ada programming 
http://gautiersblog.blogspot.com/search/label/Ada 
NB: follow the above link for a valid e-mail address 




* Re: GNAT (GCC) Profile Guided Compilation
  2012-06-29 12:26       ` stefan-lucks
@ 2012-06-29 12:51         ` Keean Schupke
  0 siblings, 0 replies; 37+ messages in thread
From: Keean Schupke @ 2012-06-29 12:51 UTC (permalink / raw)
  Cc: stefan-lucks

On Friday, 29 June 2012 13:26:23 UTC+1, (unknown)  wrote:
> On Fri, 29 Jun 2012, Keean Schupke wrote:
> 
> > On Friday, 29 June 2012 11:01:30 UTC+1, Keean Schupke  wrote:
> 
> > > Maintaining the address of the current cell is faster than using 
> > > indexed addressing (mainly because of the extra parameter that needs 
> > > to be passed around increasing the pressure on register allocation).
> 
> I guess you did some benchmarking, regarding that issue?
> 
> > > I would be interested to understand how an algorithm can be 
> > > implemented in a more Ada way?
> 
> Faster or slower, you are extensively using access types (Element_Access 
> and Set_Access). This is something Ada programmers usually try to avoid, 
> except when there is an urgent need for some dynamic memory management 
> (e.g., implementing one's own Container library). 
> 
> >     generic
> >         type Element_Type is limited private;
> >         type Element_Access is not null access all Element_Type;
> >         type Set_Index is range <>;
> >     package Eager_Disjointsets is
> >         type Set_Type is limited private;
> >         type Set_Access is not null access all Set_Type;
> 
> What, exactly, do you use Element_Access for? And why, exactly, do you 
> bother to define Set_Access?

I use Element_Access because the element is going to be a record that needs to be updated in place; copying the record is too slow. If you look at the Get function, what we want to do is:

Get(Set, Index).Something := xxx

The record stored in each node of the set datatype will be limited, so you cannot return it from a function; you can only return an access to it.

Why Set_Access? Because, to avoid an accessibility-level problem, you cannot define the element access type at a different block level from the set access type, so in order to safely return an Element_Access we must pass a Set_Access in.

> 
> >         function Find(
> >             Set : in Set_Access;
> >             Index : in Set_Index
> >         ) return Set_Index;
> >         pragma Inline_Always(Find);
>  
> Why don't you just write the following? (Same for your other operations.)
> 
> 	function find
> 	  (Set   : in Set_Type;
> 	   Index : in Set_Index) return Set_Index;

Because of the above requirements of the Get function, it is better to make all the operations take a Set_Access for consistency. Then we only need to pass a Set_Access around, never a Set_Type, forcing pass-by-reference and avoiding copying.

> 
> [...]
> 
> >     private
> >         type Node_Type is limited record
> >             Canonical : Set_Index;
> >             Successor : Set_Index;
> >             Size : Natural;
> >             Value : aliased Element_Type;
> >         end record;
> >         type Set_Type is array (Set_Index) of Node_Type;
> >     end Eager_Disjointsets;
> 
> Why, exactly, do you make Value aliased? Is this related to your "faster 
> than indexed addressing" point above?

This is because we need to return an access to the value from 'Get' so that it can be updated in place. Remember 'Element_Type' will be limited.

> 
> This may be hard for the compiler to properly optimize: An aliased 
> component Element_Type in a record ... and then there is an array of such 
> records, and all is inside a generic package. 

Actually the compiler does pretty well: at 40k per second it matches the C++ version. It is only when using profile-guided optimisation that Ada does not improve as much.

> 
> If you really need Value to be aliased (do you?), then you can try out 
> rewriting Set_Type:
> 
>   type Set_Index_Array is array(Set_Index) of Set_Index;
>   type Size_Array      is array(Set_Index) of Natural;
>   type Value_Array     is array(Set_Index) of aliased Element_Type;
> 
>   type Set_Type is record
>     Canonical, Successor: Set_Index_Array;
>     Size:                 Size_Array;
>     Value:                Value_Array;
>   end record;

I don't think this will be faster, as we now have to compute an array index offset for accessing each property. Happy to benchmark it for you...

> 
> I have no idea if the above modification helps to speed up the 
> computation, or not. It may backfire, I don't know. 
> 
> >     package body Eager_Disjointsets is
> >         function Find(
> >             Set : in Set_Access;
> >             Index : in Set_Index
> >         ) return Set_Index is begin
> >             return Set(Index).Canonical;
> >         end Find;
> 
> Now, here you would have to write "return Set.Canonical(Index);"
> 
> >     -- (C)2011 Keean Schupke, all rights reserved.
> >     generic
> >         Size : Int;
> >         type Element_Type is limited private;
> >         type Element_Access is not null access all Element_Type;
> >     package Eager_Disjointsets is
> >         type Set_Type is limited private;
> >         type Set_Access is not null access all Set_Type;
> >         type Node_Type is limited private;
> >         type Cursor is access all Node_Type;
> 
> Is Int another name for Integer?

Sorry, it's a type name for the best-performing local integer type, which is Integer_64 in my case.
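
Roughly this, assuming it sits in its own package (the package name is invented):

    with Interfaces;
    package Local_Types is
        subtype Int is Interfaces.Integer_64;   -- the best performing integer on this target
    end Local_Types;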

> 
> I understand that for performance reasons you introduced Cursor as access
> all Node_Type and later Set_Type as an array of aliased Node_Type.
> 
> But again, what do you need Element_Access and Set_Access for?

See above.

> 
> >         package Node_Address is new 
> >         System.Address_To_Access_Conversions(Node_Type);
> 
> Probably, the compiler finds that out by itself -- but you may want to 
> make it explicit that Node_Size doesn't change:
> 
> >         Node_Size : constant Storage_Offset := Node_Type'Object_Size
> >             / System.Storage_Elements.Storage_Element'Size;

Yes, that's a good spot - I missed that.

> 
> Also, if Node_Size isn't a power of 2, it may be difficult for the 
> compiler to decide on the alignment of objects of type Node_Type -- with 
> possible effects on caching. On one hand, it may be good to pack the 
> objects closely together, to make as many fit into a single cache line as 
> possible. On the other hand, it is bad if a part of a certain node is in 
> one cache line, another part in another cache line.

Packing is slower; the whole data set is small enough to fit in the cache (<5k), so alignment wins over packing.

> 
> >         function Find(
> >             Node : in Cursor
> >         ) return Cursor is begin
> >             return Cursor(Node.Canonical);
> >         end Find;
> 
> Not related to performance, but why don't you just write 
> "return Node.Canonical;" as Node.Canonical is of type Cursor, anyway?

Again, good spot; Canonical was not always a Cursor, so I missed it.

> 
> 
> -- 
> ---- Stefan.Lucks (at) uni-weimar.de, University of Weimar, Germany  ----
>     <http://www.uni-weimar.de/cms/medien/mediensicherheit/home.html>
> ------  I  love  the  taste  of  Cryptanalysis  in  the  morning!  ------


Cheers,
Keean.




* Re: GNAT (GCC) Profile Guided Compilation
  2012-06-29 12:39 ` gautier_niouzes
@ 2012-06-29 12:52   ` Keean Schupke
  2012-06-29 14:14     ` gautier_niouzes
  0 siblings, 1 reply; 37+ messages in thread
From: Keean Schupke @ 2012-06-29 12:52 UTC (permalink / raw)


On Friday, 29 June 2012 13:39:47 UTC+1, (unknown)  wrote:
> Hello,
> 
> Did you try with integer array indices instead of Access or System.Address?
> It is possible that you are making some counter-productive manual optimizations.
> 
> If you give some reference for the algorithm I am tempted to help; it seems a very exciting area to learn...
> 
> Cheers
> _________________________ 
> Gautier's Ada programming 
> http://gautiersblog.blogspot.com/search/label/Ada 
> NB: follow the above link for a valid e-mail address

First version I posted uses Integer indexes.

Cheers, 
Keean.




* Re: GNAT (GCC) Profile Guided Compilation
  2012-06-29 12:52   ` Keean Schupke
@ 2012-06-29 14:14     ` gautier_niouzes
  2012-06-29 15:05       ` gautier_niouzes
  0 siblings, 1 reply; 37+ messages in thread
From: gautier_niouzes @ 2012-06-29 14:14 UTC (permalink / raw)


> First version I posted uses Integer indexes.

OK... I meant something more radical (was actually discussed in another thread): getting rid of all pointers. I.e. (I borrow from the thread in question ;-) ):
        function find 
           (Set   : in Set_Type; 
            Index : in Set_Index) return Set_Index; 
GNAT is smart enough to select the pass-by-reference method.
In a similar experiment (a BZip2 decompressor) the pointer-free version was the fastest. Of course, no guarantee it will happen in your case... But it is worth a try.
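
In fact, since Set_Type is limited, by-reference passing is required by the language (RM 6.2), not merely chosen by GNAT. A quick sketch to convince oneself (an invented demo, not from the package above):

    with Ada.Text_IO;
    with System; use type System.Address;
    procedure By_Ref_Demo is
       type Big is limited record
          Data : String (1 .. 1_000);
       end record;
       X : Big := (Data => (others => ' '));
       procedure Probe (B : in Big) is
       begin
          --  Prints TRUE: the formal denotes the caller's object, nothing is copied
          Ada.Text_IO.Put_Line (Boolean'Image (B'Address = X'Address));
       end Probe;
    begin
       Probe (X);
    end By_Ref_Demo;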
G. 




* Re: GNAT (GCC) Profile Guided Compilation
  2012-06-29 14:14     ` gautier_niouzes
@ 2012-06-29 15:05       ` gautier_niouzes
  2012-06-29 17:03         ` Keean Schupke
  0 siblings, 1 reply; 37+ messages in thread
From: gautier_niouzes @ 2012-06-29 15:05 UTC (permalink / raw)


More specifically, this:

--8<------8<------8<----
-- (C)2011 Keean Schupke, all rights reserved. 
generic 
  type Element_Type is private; 
  -- Element_Type is a small object; could be also eventually an index
  -- for an array of larger objects, or an access type to a larger object
  type Set_Index is range <>; 
package Eager_Disjointsets is 
  type Set_Type is limited private; 
  function Find( 
                Set : in Set_Type; 
                Index : in Set_Index 
               ) return Set_Index; 
  pragma Inline_Always(Find); 

  procedure Makeset( 
                    Set : in out Set_Type; 
                    Index : in Set_Index 
                   ); 
  pragma Inline_Always(Makeset); 

  procedure Link( 
                 Set : in out Set_Type; 
                 Left, Right : in out Set_Index 
                ); 
  pragma Inline_Always(Link); 

  function Get( 
               Set : in Set_Type; 
               Index : in Set_Index 
              ) return Element_Type; 
  pragma Inline_Always(Get); 

  function Next( 
                Set : in Set_Type; 
                Index : in Set_Index 
               ) return Set_Index; 
  pragma Inline_Always(Next); 

  procedure Set_Next( 
                     Set : in out Set_Type; 
                     Index : in Set_Index; 
                     Next : in Set_Index 
                    ); 
  pragma Inline_Always(Set_Next); 

private 
  type Node_Type is limited record 
    Canonical : Set_Index; 
    Successor : Set_Index; 
    Size : Natural; 
    Value : aliased Element_Type; 
  end record; 
  type Set_Type is array (Set_Index) of Node_Type; 
end Eager_Disjointsets; 

package body Eager_Disjointsets is

  -- Ad-hoc added by GdM for compiling the package
  generic
    type Element is private;
  procedure Swap(a, b: in out Element);
  pragma Inline(Swap);
  procedure Swap(a, b: in out Element) is
    c: Element;
  begin
    c:= a;
    a:= b;
    b:= c;
  end Swap;

  -- Ad-hoc added by GdM for compiling the package
  procedure Add(to: in out Natural; what: Natural) is
    pragma Inline(Add);
  begin
    to:= to + what;
  end Add;

  procedure Makeset(
                    Set : in out Set_Type;
                    Index : in Set_Index
                   ) is 
  begin
    Set(Index).Canonical := Index;
    Set(Index).Size := 1;
  end Makeset;

  function Find(
                Set : in Set_Type;
                Index : in Set_Index
               ) return Set_Index is begin
    return Set(Index).Canonical;
  end Find;

  procedure Swap_Indexes is new Swap(Set_Index);

  procedure Link(
                 Set : in out Set_Type;
                 Left, Right : in out Set_Index) is
  begin
    if Set(Left).Size <= Set(Right).Size then
      Swap_Indexes(Left, Right);
    end if;

    declare
      Index : Set_Index := Right;
    begin
      Link_Loop : loop
        Set(Index).Canonical := Left;
        Index := Set(Index).Successor;
        exit Link_Loop when Index = Right;
      end loop Link_Loop;
    end;

    Add(Set(Left).Size, Set(Right).Size);

    Swap_Indexes(Set(Left).Successor, Set(Right).Successor);
  end Link;

  function Get(
               Set : in Set_Type;
               Index : in Set_Index
              ) return Element_Type is begin
    return Set(Index).Value;
  end Get;

  function Next(
                Set : in Set_Type;
                Index : in Set_Index
               ) return Set_Index is begin
    return Set(Index).Successor;
  end Next;

  procedure Set_Next(
                     Set : in out Set_Type;
                     Index : in Set_Index;
                     Next : in Set_Index
                    ) is begin
    Set(Index).Successor := Next;
  end Set_Next;
end Eager_Disjointsets;




* Re: GNAT (GCC) Profile Guided Compilation
  2012-06-29 15:05       ` gautier_niouzes
@ 2012-06-29 17:03         ` Keean Schupke
  2012-07-01  9:29           ` Georg Bauhaus
  2012-07-01 17:45           ` Georg Bauhaus
  0 siblings, 2 replies; 37+ messages in thread
From: Keean Schupke @ 2012-06-29 17:03 UTC (permalink / raw)


On Friday, 29 June 2012 16:05:07 UTC+1, (unknown)  wrote:
> More specifically, this:
> 
> --8<------8<------8<----
> -- (C)2011 Keean Schupke, all rights reserved. 
> generic 
>   type Element_Type is private; 
>   -- Element_Type is a small object; could be also eventually an index
>   -- for an array of larger objects, or an access type to a larger object
>   type Set_Index is range <>; 
> package Eager_Disjointsets is 
>   type Set_Type is limited private; 
>   function Find( 
>                 Set : in Set_Type; 
>                 Index : in Set_Index 
>                ) return Set_Index; 
>   pragma Inline_Always(Find); 
> 
>   procedure Makeset( 
>                     Set : in out Set_Type; 
>                     Index : in Set_Index 
>                    ); 
>   pragma Inline_Always(Makeset); 
> 
>   procedure Link( 
>                  Set : in out Set_Type; 
>                  Left, Right : in out Set_Index 
>                 ); 
>   pragma Inline_Always(Link); 
> 
>   function Get( 
>                Set : in Set_Type; 
>                Index : in Set_Index 
>               ) return Element_Type; 
>   pragma Inline_Always(Get); 
> 
>   function Next( 
>                 Set : in Set_Type; 
>                 Index : in Set_Index 
>                ) return Set_Index; 
>   pragma Inline_Always(Next); 
> 
>   procedure Set_Next( 
>                      Set : in out Set_Type; 
>                      Index : in Set_Index; 
>                      Next : in Set_Index 
>                     ); 
>   pragma Inline_Always(Set_Next); 
> 
> private 
>   type Node_Type is limited record 
>     Canonical : Set_Index; 
>     Successor : Set_Index; 
>     Size : Natural; 
>     Value : aliased Element_Type; 
>   end record; 
>   type Set_Type is array (Set_Index) of Node_Type; 
> end Eager_Disjointsets; 
> 
> package body Eager_Disjointsets is
> 
>   -- Ad-hoc added by GdM for compiling the package
>   generic
>     type Element is private;
>   procedure Swap(a, b: in out Element);
>   pragma Inline(Swap);
>   procedure Swap(a, b: in out Element) is
>     c: Element;
>   begin
>     c:= a;
>     a:= b;
>     b:= c;
>   end Swap;
> 
>   -- Ad-hoc added by GdM for compiling the package
>   procedure Add(to: in out Natural; what: Natural) is
>     pragma Inline(Add);
>   begin
>     to:= to + what;
>   end Add;
> 
>   procedure Makeset(
>                     Set : in out Set_Type;
>                     Index : in Set_Index
>                    ) is 
>   begin
>     Set(Index).Canonical := Index;
>     Set(Index).Size := 1;
>   end Makeset;
> 
>   function Find(
>                 Set : in Set_Type;
>                 Index : in Set_Index
>                ) return Set_Index is begin
>     return Set(Index).Canonical;
>   end Find;
> 
>   procedure Swap_Indexes is new Swap(Set_Index);
> 
>   procedure Link(
>                  Set : in out Set_Type;
>                  Left, Right : in out Set_Index) is
>   begin
>     if Set(Left).Size <= Set(Right).Size then
>       Swap_Indexes(Left, Right);
>     end if;
> 
>     declare
>       Index : Set_Index := Right;
>     begin
>       Link_Loop : loop
>         Set(Index).Canonical := Left;
>         Index := Set(Index).Successor;
>         exit Link_Loop when Index = Right;
>       end loop Link_Loop;
>     end;
> 
>     Add(Set(Left).Size, Set(Right).Size);
> 
>     Swap_Indexes(Set(Left).Successor, Set(Right).Successor);
>   end Link;
> 
>   function Get(
>                Set : in Set_Type;
>                Index : in Set_Index
>               ) return Element_Type is begin
>     return Set(Index).Value;
>   end Get;
> 
>   function Next(
>                 Set : in Set_Type;
>                 Index : in Set_Index
>                ) return Set_Index is begin
>     return Set(Index).Successor;
>   end Next;
> 
>   procedure Set_Next(
>                      Set : in out Set_Type;
>                      Index : in Set_Index;
>                      Next : in Set_Index
>                     ) is begin
>     Set(Index).Successor := Next;
>   end Set_Next;
> end Eager_Disjointsets;

I will give it a go, but note that my implementation puts the Element_Type into the Node structure itself rather than having a pointer (Access) to a separate object. In C++ this node is 20 bytes (8 each for the canonical and successor pointers, and 4 for the subset size). The 'Element' I am using is 12 bytes (a record of two integers and an array of 4 bytes). In my C++ tests a single array of 32-byte nodes, packing the payload into the node using templates (or generics in Ada), outperformed two separate lists, as there was one less pointer dereference. Here I think the performance is dominated by the cache behaviour, so I don't expect this change to be faster, but I will give it a go.
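
Roughly this layout, with concrete stand-ins for the generic payload (sizes as above; access values assumed to be 8 bytes):

    with Interfaces;
    package Packed_Node is
        type Byte_Quad is array (1 .. 4) of Interfaces.Unsigned_8;
        type Payload is record                  -- 12 bytes: two integers plus 4 bytes
            A, B : Interfaces.Integer_32;
            Tags : Byte_Quad;
        end record;
        type Node;
        type Node_Ptr is access all Node;
        type Node is record                     -- 8 + 8 + 4 + 12 = 32 bytes
            Canonical : Node_Ptr;
            Successor : Node_Ptr;
            Size      : Interfaces.Integer_32;
            Value     : aliased Payload;        -- stored in-line, no extra dereference
        end record;
    end Packed_Node;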

Note: this is not really answering my question about the lack of improvement due to profile-guided compilation - but perhaps that side of things is more a compiler issue. Does anyone have any ideas about that side of things?


Cheers,
Keean.




* Re: GNAT (GCC) Profile Guided Compilation
  2012-06-29 17:03         ` Keean Schupke
@ 2012-07-01  9:29           ` Georg Bauhaus
  2012-07-01 17:45           ` Georg Bauhaus
  1 sibling, 0 replies; 37+ messages in thread
From: Georg Bauhaus @ 2012-07-01  9:29 UTC (permalink / raw)


On 29.06.12 19:03, Keean Schupke wrote:
> Note: this is not really answering my question about the lack of improvement due to profile-guided compilation - but perhaps that side of things is more a compiler issue. Does anyone have any ideas about that side of things?


In the few cases I have tried -fprofile-generate and -fprofile-use,
there were more or less moderate improvements.

For example, a program that writes a Mandelbrot set executing on
4 cores runs in (user/real) 1.3s/10.0s and in 1.2s/9.3s, respectively.

Other things of interest included (perhaps nothing new here)
preventing inlining for some subprograms where inlining them would
have had adverse effects. Maybe it was steering the optimizers in
the wrong direction, preventing some better uses of registers.
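
For instance, something along these lines (invented names); I believe the GCC attribute can be passed through with GNAT's Machine_Attribute pragma, and newer GNATs may also have a No_Inline pragma:

    package Mixed_Inlining is
       procedure Small_Hot_Path;
       pragma Inline (Small_Hot_Path);
       procedure Big_Cold_Path;
       --  Ask the back end never to inline this one (GCC's "noinline"):
       pragma Machine_Attribute (Big_Cold_Path, "noinline");
    end Mixed_Inlining;

    package body Mixed_Inlining is
       procedure Small_Hot_Path is
       begin
          null;
       end Small_Hot_Path;

       procedure Big_Cold_Path is
       begin
          null;
       end Big_Cold_Path;
    end Mixed_Inlining;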





* Re: GNAT (GCC) Profile Guided Compilation
  2012-06-29 17:03         ` Keean Schupke
  2012-07-01  9:29           ` Georg Bauhaus
@ 2012-07-01 17:45           ` Georg Bauhaus
  2012-07-01 22:57             ` Keean Schupke
  1 sibling, 1 reply; 37+ messages in thread
From: Georg Bauhaus @ 2012-07-01 17:45 UTC (permalink / raw)


On 29.06.12 19:03, Keean Schupke wrote:
> Note: this is not really answering my question about the lack of improvement due to profile-guided compilation - but perhaps that side of things is more a compiler issue. Does anyone have any ideas about that side of things?

Some more observations, collected with the help of the two simple
example programs appended below.

For Ada and GNAT at least, any advantages to be had from profile-guided
compilation seem to vary with optimization options and sizes of data.
The same is true about a 1D approach vs a 2D approach.

Among the various arrangements, one of the best I have got
from both GNAT GPL 2012, and from an FSF GNAT as of June 2012
uses
*) -O2 -funroll-loops -gnatp
*)  the 2D approach

(If it means anything, adding -ftree-vectorize to the -O2 set
produces only the second best Ada result, the same as the -O3 times
listed below. I had chosen -ftree-vectorize as this switch is in the
-O3 set.) Trying profile guided compilation at higher optimization
levels consistently slowed the Ada program down (other Ada programs
were faster after profile guided compilation, as mentioned elsewhere).

The result at -O2 seems nice, though, in that the 2D approach is
natural, the compiler switches are the ones that the manual recommends,
and this combination produces the fastest Ada program.

In short, from worst to best 1D / 2D, Runs = 100:

Ada profile guided -> .88 / .80   (this program only!)
Ada at -O3         -> .68 / .66
C++ at -O3         -> .66
Ada at -O2 -funr.. -> .68 / .47
C++ profile guided -> .31


Some differences seem to vary with hardware, too.
On one computer I have seen a minor speedup, on another
a very small slowdown, and a different ratio of 1D / 2D.

For the C++ case I have tested a 1D approach only, not
knowing how to write 2D arrays in C++ using C style arrays.
I'm not a C++ programmer, apologies for my C++.


Ada at -O2 -funroll-loops -gnatp -mtune=native
1D:  0.680750000 27303
2D:  0.469094000 27303  <-- Best for Ada!

Ada at -O3 -gnatNp -fomit-frame-pointer -mtune=native
1D:  0.676616000 27303
2D:  0.664194000 27303

previous with profile-guided compilation
1D:  0.888206000 27303
2D:  0.806196000 27303

C++ at -O3 -mtune=native
1D: 0.681721 0

previous with profile-guided compilation
1D: 0.31165 0

Clang++
1D: 0.611522 0

The GNU C++ compiler is from the GNAT GPL 2012 distribution for Mac OS X,
Clang++ is Apple's v 3.1.


=== 8< === 8< === 8< === 8< ===

package Fringes is

    pragma Pure (Fringes);

    type Full_Index is mod 2**32;
    subtype Num is Full_Index Range 0 .. 100_000;
    subtype Index is Full_Index range 0 .. 1000;
    Len : constant Full_Index := Index'Last - Index'First + 1;

    type Matrix_1D is
      array (Index'First .. Index'First + Len * Len - 1) of Num;
    type Matrix_2 is array (Index, Index) of Num;

    procedure Compute_1D (A : in out Matrix_1D);
    procedure Compute_2  (A : in out Matrix_2);

end Fringes;

package body Fringes is

    --
    --  Each Compute procedure has A pointless loop that assigns each
    --  inner field the sum of non-diagonal neighbors modulo Num'Last
    --

    procedure NOT_USED_Compute_1D_Slow (A : in out Matrix_1D) is
       J : Full_Index;
    begin
       J := A'First + Len;
       loop
         exit when J > A'Last - Len;
         for K in J + 1 .. J + Len - 1 - 1 loop
            A (K) := (A(K + 1)
                      + A(K - Len)
                      + A(K - 1)
                      + A(K + Len)) mod Num'Last;
         end loop;
         J := J + Len;
       end loop;
    end NOT_USED_Compute_1D_Slow;

    procedure Compute_1D (A : in out Matrix_1D) is
    begin
       for K in A'First + Len + 1 .. A'Last - Len - 1 loop
          case K mod Len is
          when 0 | Len - 1 => null;
          when others =>
             A (K) := (A(K + 1)
                       + A(K - Len)
                       + A(K - 1)
                       + A(K + Len)) mod Num'Last;
          end case;
       end loop;
    end Compute_1D;

    procedure Compute_2 (A : in out Matrix_2) is
    begin
       for J in A'First(1) + 1 .. A'Last(1) - 1 loop
          for K in A'First(2) + 1 .. A'Last(2) - 1 loop
             A (J, K) := (A(J, K + 1)
                          + A(j - 1, K)
                          + A(J, K - 1)
                          + A(j + 1, K)) mod Num'Last;
          end loop;
       end loop;
    end Compute_2;

end Fringes;


with Fringes;
with Ada.Command_Line, Ada.Real_Time;
procedure Test_Fringes is

    use Ada.Real_Time;

    Runs : constant Natural :=
      Natural'Value (Ada.Command_Line.Argument(1));

    Start, Stop: Time;

    package Os_Or_Gnat_Stacksize_Nuisance is
       use Fringes;
       type P1 is access Matrix_1D;
       type P2 is access Matrix_2;
       M1P : constant P1 := new Matrix_1D'(others => 123);
       M2p : constant P2 := new Matrix_2'(Index => (Index => 123));
       M1D : Matrix_1D renames M1P.all;
       M2 : Matrix_2 renames M2P.all;
    end Os_Or_Gnat_Stacksize_Nuisance;
    use Os_Or_Gnat_Stacksize_Nuisance;

    procedure Print_Timing (Part : String; N : Fringes.Num) is separate;
    use type Fringes.Full_Index;
begin
    Start := Clock;
    for Run in 1 .. Runs loop
       Fringes.Compute_1D (M1D);
    end loop;
    Stop := Clock;
    Print_Timing ("1D", M1D ((M1D'First + 1 * Fringes.Len) + (2)));

    Start := Clock;
    for Run in 1 .. Runs loop
       Fringes.Compute_2 (M2);
    end loop;
    Stop := Clock;
    Print_Timing ("2D", M2 (1, 2));
end Test_Fringes;

with Ada.Text_IO;
separate (Test_Fringes)
procedure Print_Timing (Part : String; N : Fringes.Num) is
    use Ada.Text_IO;
begin
    Put (Part);
    Put (": ");
    Put (Duration'Image (To_Duration(Stop - Start)));
    Put (Fringes.Num'Image (N));
    New_Line;
end print_timing;


=== 8< === 8< === 8< === 8< ===

#include <stdint.h>

namespace Fringes {

     typedef  uint32_t Full_Index;
#define Num_Last 100000
     typedef Full_Index Num;
#define Index_Last 1000
     typedef Full_Index Index;
     const Full_Index Len = Index_Last + 1;

#define A_Last ((Len * Len) - 1)
     typedef Num Matrix_1D[A_Last + 1];

     void Compute_1D (Matrix_1D&  A);
     //  Each Compute procedure has a pointless loop that assigns each
     //  inner field the sum of non-diagonal neighbors modulo Num_Last
     void Compute_1D (Matrix_1D&  A) {

         for (Full_Index K = Len + 1; K < A_Last - Len - 1; ++K) {
             switch (K % Len) {
             case 0: case Len - 1: break;
             default:
                 A [K] = (A[K + 1]
                           + A[K - Len]
                           + A[K - 1]
                           + A[K + Len]) % Num_Last;
             }
         }
     }
}

#include <sys/time.h>
#include <string>

class Reporter {
     struct timeval Start, Stop;
public:
     void logStart();
     void logStop();
     void Print_Timing (const std::string Part, const Fringes::Num N);
};


#include <sstream>

int main(int argc, char* argv[]) {
     using namespace Fringes;

     int Runs;
     Matrix_1D* M1D = new Matrix_1D[Len];
     Reporter history;

     if (argc < 2) {
         throw "usage";
     } else {
         std::istringstream converter(argv[1]);

         if (! ((converter >> Runs) && (Runs >= 0))) {
             throw "argument error?";
         }
     }
     
     history.logStart();
     for (int Run = 1; Run <= Runs; ++Run) {
         Compute_1D (*M1D);
     }
     history.logStop();
     history.Print_Timing ("1D", (*M1D) [(1 * Len) + (2)]);
}


#include <iostream>
#include <sys/time.h>

void Reporter::Print_Timing (const std::string Part, const Fringes::Num N) {
     double difference = (Stop.tv_sec - Start.tv_sec) * 1000000
             + (Stop.tv_usec - Start.tv_usec);
     std::cout << Part << ": ";
     std::cout << (difference / 1000000.0) << " " << N;
     std::cout << '\n';
}

void Reporter::logStart() {
     gettimeofday(&this->Start, 0);
}

void Reporter::logStop() {
     gettimeofday(&this->Stop, 0);
}






* Re: GNAT (GCC) Profile Guided Compilation
  2012-07-01 17:45           ` Georg Bauhaus
@ 2012-07-01 22:57             ` Keean Schupke
  2012-07-02 17:15               ` Georg Bauhaus
  0 siblings, 1 reply; 37+ messages in thread
From: Keean Schupke @ 2012-07-01 22:57 UTC (permalink / raw)


On Sunday, 1 July 2012 18:45:22 UTC+1, Georg Bauhaus  wrote:
> On 29.06.12 19:03, Keean Schupke wrote:
> > Note: this is not really answering my question about the lack of improvement due to profile-guided compilation - but perhaps that side of things is more a compiler issue. Does anyone have any ideas about that side of things?
> 
> Some more observations, collected with the help of the two simple
> example programs appended below.
> 
> For Ada and GNAT at least, any advantages to be had from profile-guided
> compilation seem to vary with optimization options and sizes of data.
> The same is true about a 1D approach vs a 2D approach.
> 
> Among the various arrangements, one of the best I have got
> from both GNAT GPL 2012, and from an FSF GNAT as of June 2012
> uses
> *) -O2 -funroll-loops -gnatp
> *)  the 2D approach
> 
> (If it means anything, adding -ftree-vectorize to the -O2 set
> produces only the second best Ada result, the same as the -O3 times
> listed below. I had chosen -ftree-vectorize as this switch is in the
> -O3 set.) Trying profile guided compilation at higher optimization
> levels consistently slowed the Ada program down (other Ada programs
> were faster after profile guided compilation, as mentioned elsewhere).
> 
> The result at -O2 seems nice, though, in that the 2D approach is
> natural, the compiler switches are the ones that the manual recommends,
> and this combination produces the fastest Ada program.
> 
> In short, from worst to best 1D / 2D, Runs = 100:
> 
> Ada profile guided -> .88 / .80   (this program only!)
> Ada at -O3         -> .68 / .66
> C++ at -O3         -> .66
> Ada at -O2 -funr.. -> .68 / .47
> C++ profile guided -> .31
> 
> 
> Some differences seem to vary with hardware, too.
> On one computer I have seen a minor speedup, on another
> a very small slowdown, and a different ratio of 1D / 2D.
> 
> For the C++ case I have tested a 1D approach only, not
> knowing how to write 2D arrays in C++ using C style arrays.
> I'm not a C++ programmer, apologies for my C++.
> 
> 
> Ada at -O2 -funroll-loops -gnatp -mtune=native
> 1D:  0.680750000 27303
> 2D:  0.469094000 27303  <-- Best for Ada!
> 
> Ada at -O3 -gnatNp -fomit-frame-pointer -mtune=native
> 1D:  0.676616000 27303
> 2D:  0.664194000 27303
> 
> previous with profile-guided compilation
> 1D:  0.888206000 27303
> 2D:  0.806196000 27303
> 
> C++ at -O3 -mtune=native
> 1D: 0.681721 0
> 
> previous with profile-guided compilation
> 1D: 0.31165 0
> 
> Clang++
> 1D: 0.611522 0
> 
> The GNU C++ compiler is from the GNAT GPL 2012 distribution for Mac OS X,
> Clang++ is Apple's v 3.1.
> 
> 
> === 8< === 8< === 8< === 8< ===
> 
> package Fringes is
> 
>     pragma Pure (Fringes);
> 
>     type Full_Index is mod 2**32;
>     subtype Num is Full_Index Range 0 .. 100_000;
>     subtype Index is Full_Index range 0 .. 1000;
>     Len : constant Full_Index := Index'Last - Index'First + 1;
> 
>     type Matrix_1D is
>       array (Index'First .. Index'First + Len * Len - 1) of Num;
>     type Matrix_2 is array (Index, Index) of Num;
> 
>     procedure Compute_1D (A : in out Matrix_1D);
>     procedure Compute_2  (A : in out Matrix_2);
> 
> end Fringes;
> 
> package body Fringes is
> 
>     --
>     --  Each Compute procedure has A pointless loop that assigns each
>     --  inner field the sum of non-diagonal neighbors modulo Num'Last
>     --
> 
>     procedure NOT_USED_Compute_1D_Slow (A : in out Matrix_1D) is
>        J : Full_Index;
>     begin
>        J := A'First + Len;
>        loop
>          exit when J > A'Last - Len;
>          for K in J + 1 .. J + Len - 1 - 1 loop
>             A (K) := (A(K + 1)
>                       + A(K - Len)
>                       + A(K - 1)
>                       + A(K + Len)) mod Num'Last;
>          end loop;
>          J := J + Len;
>        end loop;
>     end NOT_USED_Compute_1D_Slow;
> 
>     procedure Compute_1D (A : in out Matrix_1D) is
>     begin
>        for K in A'First + Len + 1 .. A'Last - Len - 1 loop
>           case K mod Len is
>           when 0 | Len - 1 => null;
>           when others =>
>              A (K) := (A(K + 1)
>                        + A(K - Len)
>                        + A(K - 1)
>                        + A(K + Len)) mod Num'Last;
>           end case;
>        end loop;
>     end Compute_1D;
> 
>     procedure Compute_2 (A : in out Matrix_2) is
>     begin
>        for J in A'First(1) + 1 .. A'Last(1) - 1 loop
>           for K in A'First(2) + 1 .. A'Last(2) - 1 loop
>              A (J, K) := (A(J, K + 1)
>                           + A(j - 1, K)
>                           + A(J, K - 1)
>                           + A(j + 1, K)) mod Num'Last;
>           end loop;
>        end loop;
>     end Compute_2;
> 
> end Fringes;
> 
> 
> with Fringes;
> with Ada.Command_Line, Ada.Real_Time;
> procedure Test_Fringes is
> 
>     use Ada.Real_Time;
> 
>     Runs : constant Natural :=
>       Natural'Value (Ada.Command_Line.Argument(1));
> 
>     Start, Stop: Time;
> 
>     package Os_Or_Gnat_Stacksize_Nuisance is
>        use Fringes;
>        type P1 is access Matrix_1D;
>        type P2 is access Matrix_2;
>        M1P : constant P1 := new Matrix_1D'(others => 123);
>        M2p : constant P2 := new Matrix_2'(Index => (Index => 123));
>        M1D : Matrix_1D renames M1P.all;
>        M2 : Matrix_2 renames M2P.all;
>     end Os_Or_Gnat_Stacksize_Nuisance;
>     use Os_Or_Gnat_Stacksize_Nuisance;
> 
>     procedure Print_Timing (Part : String; N : Fringes.Num) is separate;
>     use type Fringes.Full_Index;
> begin
>     Start := Clock;
>     for Run in 1 .. Runs loop
>        Fringes.Compute_1D (M1D);
>     end loop;
>     Stop := Clock;
>     Print_Timing ("1D", M1D ((M1D'First + 1 * Fringes.Len) + (2)));
> 
>     Start := Clock;
>     for Run in 1 .. Runs loop
>        Fringes.Compute_2 (M2);
>     end loop;
>     Stop := Clock;
>     Print_Timing ("2D", M2 (1, 2));
> end Test_Fringes;
> 
> with Ada.Text_IO;
> separate (Test_Fringes)
> procedure Print_Timing (Part : String; N : Fringes.Num) is
>     use Ada.Text_IO;
> begin
>     Put (Part);
>     Put (": ");
>     Put (Duration'Image (To_Duration(Stop - Start)));
>     Put (Fringes.Num'Image (N));
>     New_Line;
> end print_timing;
> 
> 
> === 8< === 8< === 8< === 8< ===
> 
> #include <stdint.h>
> 
> namespace Fringes {
> 
>      typedef  uint32_t Full_Index;
> #define Num_Last 100000
>      typedef Full_Index Num;
> #define Index_Last 1000
>      typedef Full_Index Index;
>      const Full_Index Len = Index_Last + 1;
> 
> #define A_Last ((Len * Len) - 1)
>      typedef Num Matrix_1D[A_Last + 1];
> 
>      void Compute_1D (Matrix_1D&  A);
>      //  Each Compute procedure has a pointless loop that assigns each
>      //  inner field the sum of non-diagonal neighbors modulo Num_Last
>      void Compute_1D (Matrix_1D&  A) {
> 
>          for (Full_Index K = Len + 1; K < A_Last - Len - 1; ++K) {
>              switch (K % Len) {
>              case 0: case Len - 1: break;
>              default:
>                  A [K] = (A[K + 1]
>                            + A[K - Len]
>                            + A[K - 1]
>                            + A[K + Len]) % Num_Last;
>              }
>          }
>      }
> }
> 
> #include <sys/time.h>
> #include <string>
> 
> class Reporter {
>      struct timeval Start, Stop;
> public:
>      void logStart();
>      void logStop();
>      void Print_Timing (const std::string Part, const Fringes::Num N);
> };
> 
> 
> #include <sstream>
> 
> int main(int argc, char* argv[]) {
>      using namespace Fringes;
> 
>      int Runs;
>      Matrix_1D* M1D = new Matrix_1D[Len];
>      Reporter history;
> 
>      if (argc == 0) {
>          throw "usage";
>      } else {
>          std::istringstream converter(argv[1]);
> 
>          if (! ((converter >> Runs) && (Runs >= 0))) {
>              throw "argument error?";
>          }
>      }
>      
>      history.logStart();
>      for (int Run = 1; Run <= Runs; ++Run) {
>          Compute_1D (*M1D);
>      }
>      history.logStop();
>      history.Print_Timing ("1D", (*M1D) [(1 * Len) + (2)]);
> }
> 
> 
> #include <iostream>
> #include <sys/time.h>
> 
> void Reporter::Print_Timing (const std::string Part, const Fringes::Num N) {
>      double difference = (Stop.tv_sec - Start.tv_sec) * 1000000
>              + (Stop.tv_usec - Start.tv_usec);
>      std::cout << Part << ": ";
>      std::cout << (difference / 1000000.0) << " " << N;
>      std::cout << '\n';
> }
> 
> void Reporter::logStart() {
>      gettimeofday(&this->Start, 0);
> }
> 
> void Reporter::logStop() {
>      gettimeofday(&this->Stop, 0);
> }


Interesting stuff; I need a bit more time to read and digest it, but I did notice one thing that I think is worth pointing out.

The real benefit (and the performance gains) of profile-guided compilation comes from correcting branch prediction. As such, the gains will be most apparent when there is an 'if' statement in the inner loop of the code. Try something where you take the sign of an int in the formula and have three cases: < 0, = 0 and > 0.
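
For illustration, a minimal, self-contained sketch of that kind of three-way, data-dependent branch (the unit name, array size and fill pattern are invented for the example, not taken from either benchmark program):

with Ada.Text_IO;
procedure Sign_Branch_Demo is
   --  Hypothetical example: the branch taken depends on the sign of the
   --  data, which is the situation profile data can help the compiler
   --  lay out well.
   type Int_Array is array (1 .. 1_000) of Integer;
   A : Int_Array;
   Y : Integer := 0;
begin
   for I in A'Range loop
      A (I) := (I * 7) mod 11 - 5;   --  deterministic mix of signs
   end loop;
   for K in A'Range loop
      if A (K) > 0 then
         Y := Y + 1;
      elsif A (K) < 0 then
         Y := Y - 1;
      else
         Y := Y + 2;
      end if;
   end loop;
   Ada.Text_IO.Put_Line (Integer'Image (Y));
end Sign_Branch_Demo;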

The other point is that for any single branch there is at least a 50% chance it is already optimal, for example:

if x > 0 then
    y := 1 + y;
else
    y := 1 - y;
end if;

will get implemented in assembly as something like:

 tst rx
 ble xyz
 ld #1, r0
 sub r0, ry
 bra zyx
.xyz
 add #1, ry
.zyx

OR

 tst rx
 bgt xyz
 add #1, ry
 bra zyx
.xyz
 ld #1, r0
 sub r0, ry
.zyx

Whether this is optimal depends on how often x <= 0 and on what the CPU's branch predictor guesses for 'ble xyz'. Most branch predictors start with the assumption that a backward branch will be taken (looping) and a forward branch will not be taken, but they also build statistical tables keyed on the address of the branch instruction to improve that guess on later iterations.

The speed difference can be large because a branch-prediction failure has a high penalty - a CPU pipeline stall. Modern CPUs use speculative execution (they keep the pipeline filled by letting the branch predictor assume whether the branch will be taken), so when the prediction is correct there is no stall and the CPU runs through the code at full speed. A stall can cost tens or hundreds of clock cycles each time it happens.

I initially expected the dynamic branch predictor to do a good job, so that changing the order of the 'if' statements would have no effect, but benchmarking on 64-bit x86 hardware showed a strong effect from changing the order of branches.

Anyway, the point is that even if you have an 'if' statement in the inner loop, you may be lucky and already have it the 'faster' way round.

So you need to benchmark:

if x > 0 then
    y := 1 + y;
else
    y := 1 - y;
end if;

AND

if x <= 0 then
    y := 1 - y;
else
    y := 1 + y;
end if;

Then you can confirm that profile-guided compilation gives you the faster of the two benchmark times, no matter which way round the if statement is written in the code.

Obviously, for a single if statement in the inner loop you can practically benchmark both alternatives and choose the faster one.

With my Monte-Carlo simulation, the inner-loop logic is a complex state machine, and it would be impractical to try all possible combinations (the number of combinations grows as 2^N: just 8 branches would require 256 code-change-and-rebuild cycles, 16 branches 65,536, and so on).

This is where profile-guided compilation is a valuable tool. Both the Ada code and the C++ code share the same branching in the state machine, and both now have the same initial performance (approx 40k simulations per second). However, the C++ compiler gains 25% from profile-guided compilation (by changing branch orders and re-ordering functions). This gain should be available to the Ada compiler too, as the same branches appear in the output assembly - but it's just not working as well.

My latest results are better: I now have both the C++ and the Ada versions built with the same GCC 4.7.1 compiler, and the latest figures are below.

Just so it's clear, you first build with '-fprofile-generate', then you run the program to produce the .gcda profile files, then you build again with '-fprofile-use' to get the 'profile-guided' version. My code runs 1,000,000 iterations between the 'generate' and the 'use' steps.
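
A sketch of that two-step build on the Ada side, assuming a single main unit built with gnatmake (the unit name, run argument and optimisation switches are placeholders, not the actual project settings):

gnatmake -O2 -gnatp -fprofile-generate sim_main.adb -largs -fprofile-generate
./sim_main 1000000     # training run; writes the *.gcda profile files
gnatmake -f -O2 -gnatp -fprofile-use sim_main.adb -largs -fprofile-use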


C++ first compile: 39k
C++ profile guided: 55k

Ada first compile: 41k
Ada profile guided: 46k


So gcc-4.7.1 seems to improve the results slightly for Ada, but it still falls short of the potential improvement. I am becoming more convinced that there is nothing in the Ada code itself causing this; more likely the Ada GCC front end is not using the profiling information as effectively as the C++ one.


Cheers,
Keean.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: GNAT (GCC) Profile Guided Compilation
  2012-07-01 22:57             ` Keean Schupke
@ 2012-07-02 17:15               ` Georg Bauhaus
  2012-07-02 17:26                 ` Keean Schupke
  0 siblings, 1 reply; 37+ messages in thread
From: Georg Bauhaus @ 2012-07-02 17:15 UTC (permalink / raw)


On 02.07.12 00:57, Keean Schupke wrote:
> The real benefit (and performance gains) from profile guided compilation come from correcting branch prediction. As such the gains will be most apparent when there is an 'if' statement in the inner loop of the code. Try something where you are taking the sign of an int in the formula and have three cases <0 =0 >0.


Thanks for your lucid words; I was mostly guessing at what
profile-guided compilation might actually do. Indeed, now that I have
started playing with conditionals, the translations already show very
different effects for variations of the procedure below,

   procedure Compute_1D (A : in out Matrix_1D) is
   begin
      for K in A'First + Len + 1 .. A'Last - Len - 1 loop
         case K mod Len is
         when 0 | Len - 1 => null;
         when others =>
            A (K) := (A(K + 1)
                        + A(K - Len)
                        + A(K - 1)
                        + A(K + Len)) mod Num'Last;
         end case;
         if A (K) mod 6 = 0 then
            A (K) := (A (K) - 1) mod Num'Last;
         else
            A (K) := K mod Num'Last;
         end if;
      end loop;
   end Compute_1D;

Ada and C++ are mostly on a par without help from a profile
(the 2D approach is still better in the Ada case; perhaps A (K) mod 6 = 0
doesn't hold for that many K). C++ gains 8%, Ada only 4%, though.


Cheers,
Georg



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: GNAT (GCC) Profile Guided Compilation
  2012-07-02 17:15               ` Georg Bauhaus
@ 2012-07-02 17:26                 ` Keean Schupke
  2012-07-02 23:48                   ` Keean Schupke
  0 siblings, 1 reply; 37+ messages in thread
From: Keean Schupke @ 2012-07-02 17:26 UTC (permalink / raw)


On Monday, 2 July 2012 18:15:28 UTC+1, Georg Bauhaus  wrote:
> On 02.07.12 00:57, Keean Schupke wrote:
> > The real benefit (and performance gains) from profile guided compilation come from correcting branch prediction. As such the gains will be most apparent when there is an 'if' statement in the inner loop of the code. Try something where you are taking the sign of an int in the formula and have three cases <0 =0 >0.
> 
> 
> Thanks for your lucid words, I was mostly guessing at what profile
> guided compilation might actually do. Indeed, now that I have started
> playing with conditionals, the translations show very different effects
> already, for variations of the procedure below,
> 
>    procedure Compute_1D (A : in out Matrix_1D) is
>    begin
>       for K in A'First + Len + 1 .. A'Last - Len - 1 loop
>          case K mod Len is
>          when 0 | Len - 1 => null;
>          when others =>
>             A (K) := (A(K + 1)
>                         + A(K - Len)
>                         + A(K - 1)
>                         + A(K + Len)) mod Num'Last;
>          end case;
>          if A (K) mod 6 = 0 then
>             A (K) := (A (K) - 1) mod Num'Last;
>          else
>             A (K) := K mod Num'Last;
>          end if;
>       end loop;
>    end Compute_1D;
> 
> Ada and C++ are mostly on a par without help from a profile
> (the 2D approach is still better in the Ada case; perhaps mod 6
> isn't true for that many K). C++ gains 8%, Ada only 4%, though.
> 
> 
> Cheers,
> Georg


As it happens, the branch predictor is quite good at predicting regular 'mod' patterns. See:

http://en.wikipedia.org/wiki/Branch_predictor

And look for the section on the two-level adaptive predictor.

I think Monte-Carlo techniques must be particularly sensitive to branch-predictor error, as on each iteration the branching is controlled by a pseudo-random number (and we hope the branch predictor cannot predict that).

So if on each iteration you pick a random number, and that number controls your branch pattern in the inner loop, you should see a stronger effect from the profile-guided optimisation.
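
A minimal sketch of that kind of loop (the bias, seed and iteration count are arbitrary, chosen only for illustration):

with Ada.Text_IO;
with Ada.Numerics.Float_Random;
procedure Random_Branch_Demo is
   --  The branch taken on each iteration depends on a pseudo-random draw,
   --  so the hardware predictor cannot learn a repeating pattern; only the
   --  overall taken/not-taken bias helps, which is what profile data records.
   use Ada.Numerics.Float_Random;
   G : Generator;
   Y : Integer := 0;
begin
   Reset (G, Initiator => 42);        --  fixed seed for repeatability
   for Run in 1 .. 1_000_000 loop
      if Random (G) < 0.9 then        --  heavily biased, hard-to-predict branch
         Y := Y + 1;
      else
         Y := Y - 3;
      end if;
   end loop;
   Ada.Text_IO.Put_Line (Integer'Image (Y));
end Random_Branch_Demo;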


Cheers,
Keean.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: GNAT (GCC) Profile Guided Compilation
  2012-07-02 17:26                 ` Keean Schupke
@ 2012-07-02 23:48                   ` Keean Schupke
  2012-07-04 10:38                     ` Georg Bauhaus
  0 siblings, 1 reply; 37+ messages in thread
From: Keean Schupke @ 2012-07-02 23:48 UTC (permalink / raw)


On Monday, 2 July 2012 18:26:58 UTC+1, Keean Schupke  wrote:
> On Monday, 2 July 2012 18:15:28 UTC+1, Georg Bauhaus  wrote:
> > On 02.07.12 00:57, Keean Schupke wrote:
> > > The real benefit (and performance gains) from profile guided compilation come from correcting branch prediction. As such the gains will be most apparent when there is an 'if' statement in the inner loop of the code. Try something where you are taking the sign of an int in the formula and have three cases <0 =0 >0.
> > 
> > 
> > Thanks for your lucid words, I was mostly guessing at what profile
> > guided compilation might actually do. Indeed, now that I have started
> > playing with conditionals, the translations show very different effects
> > already, for variations of the procedure below,
> > 
> >    procedure Compute_1D (A : in out Matrix_1D) is
> >    begin
> >       for K in A'First + Len + 1 .. A'Last - Len - 1 loop
> >          case K mod Len is
> >          when 0 | Len - 1 => null;
> >          when others =>
> >             A (K) := (A(K + 1)
> >                         + A(K - Len)
> >                         + A(K - 1)
> >                         + A(K + Len)) mod Num'Last;
> >          end case;
> >          if A (K) mod 6 = 0 then
> >             A (K) := (A (K) - 1) mod Num'Last;
> >          else
> >             A (K) := K mod Num'Last;
> >          end if;
> >       end loop;
> >    end Compute_1D;
> > 
> > Ada and C++ are mostly on a par without help from a profile
> > (the 2D approach is still better in the Ada case; perhaps mod 6
> > isn't true for that many K). C++ gains 8%, Ada only 4%, though.
> > 
> > 
> > Cheers,
> > Georg
> 
> 
> As it happens, the branch predictor is quite good at predicting regular 'mod' patterns. See:
> 
> http://en.wikipedia.org/wiki/Branch_predictor
> 
> And look for the section on the two level adaptive predictor.
> 
> I think Monte-Carlo techniques must be particularly sensitive to branch predictor error, as each iteration the branching is controlled by a pseudo random number (and we hope the branch predictor cannot predict that).
> 
> So if for each iteration you pick a random number, and that controls your branch pattern in the inner loop, you should see a stronger effect from the profile-guided optimisation.
> 
> 
> Cheers,
> Keean.


I have done some testing with the linux "perf" tool. These are some figures for the Ada version:

         1,014,900 l1-dcache-load-misses     #    0.01% of all L1-dcache hits
    12,462,973,199 l1-dcache-loads
         7,311,495 cache-references
            38,804 cache-misses              #    0.531 % of all cache refs
     2,588,686,069 branch-instructions
       388,460,030 branch-misses             #   15.01% of all branches
      21.885512117 seconds time elapsed

And here are the results for the C++ version:

           840,245 l1-dcache-load-misses     #    0.01% of all L1-dcache hits
    11,140,761,995 l1-dcache-loads
         6,019,321 cache-references
            27,584 cache-misses              #    0.458 % of all cache refs
     3,049,597,029 branch-instructions
       560,173,316 branch-misses             #   18.37% of all branches
      17.823476294 seconds time elapsed


So the interesting thing is that the Ada version has fewer branches overall and fewer branch misses than the C++ version, so it seems the profile-guided compilation has done as much as it can. Some other factor is limiting performance, and the figure that stands out is the cache misses.

So it would appear I need to focus on the cache utilisation of the Ada code.
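
For reference, counters like the ones above can be gathered with an invocation along these lines (the executable name and argument are placeholders; the event list matches the counters shown):

perf stat -e L1-dcache-loads,L1-dcache-load-misses,cache-references,cache-misses,branch-instructions,branch-misses ./simulator 1000000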


Cheers,
Keean.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: GNAT (GCC) Profile Guided Compilation
  2012-07-02 23:48                   ` Keean Schupke
@ 2012-07-04 10:38                     ` Georg Bauhaus
  2012-07-04 10:57                       ` Keean Schupke
  0 siblings, 1 reply; 37+ messages in thread
From: Georg Bauhaus @ 2012-07-04 10:38 UTC (permalink / raw)


On 03.07.12 01:48, Keean Schupke wrote:
> I have done some testing with the linux "perf" tool. These are some figures for the Ada version:
>
>           1,014,900 l1-dcache-load-misses     #    0.01% of all L1-dcache hits
>      12,462,973,199 l1-dcache-loads
>           7,311,495 cache-references
>              38,804 cache-misses              #    0.531 % of all cache refs
>       2,588,686,069 branch-instructions
>         388,460,030 branch-misses             #   15.01% of all branches
>        21.885512117 seconds time elapsed
>
> And here are the results for the C++ version:
>
>             840,245 l1-dcache-load-misses     #    0.01% of all L1-dcache hits
>      11,140,761,995 l1-dcache-loads
>           6,019,321 cache-references
>              27,584 cache-misses              #    0.458 % of all cache refs
>       3,049,597,029 branch-instructions
>         560,173,316 branch-misses             #   18.37% of all branches
>        17.823476294 seconds time elapsed
>
>
> So the interesting thing is that the Ada version has less overall branches and less branch misses than the C++ version, so it seems the profile-guided compilation has achieved as much. There is another factor limiting performance. The interesting figure would appear to be the cache-misses.
>
> So it would appear I need to focus on the cache utilisation of the Ada code.

FWIW, looking at the 1D vs 2D subprograms in order to learn
about a (dis)advantage of writing 2D arrays, I found some
potentially interesting things.

When there is no additional test in the loops,
Apple's Instruments shows between two and three orders of magnitude
fewer branch instructions executed by the 2D subprogram than by the
1D subprogram, roughly 5M vs 2G. This seems huge to me,
but it is reproducible. A naive look at the assembly listing offers
some confirmation, mentioned below, though not on the same scale.

With the "mod" based test added to the respective loops the number
of branch instructions executed by the 2D subprogram increases
to about one half of that of the 1D subprogram's. Still better.

The assembly listing of the subprograms without tests added has

- [compute_1d] 3 pairs of forward je and 1 backward jne near
   the end

- [compute_2] 1 pair of backward jne near the end,

It appears that unrolling yields two somewhat differently
structured lists of instructions, but I'm drifting away
from Ada.

Compiling with profile data rearranges the jumps for 1D, adds jumps to 2D,
and shortens both procedures. However, this slows both down using the latest
GNAT GPL on Core i7; there is some speed-up of the 1D procedure with
Debian's GNAT 4.4.5 on Xeon E5645, though. (-O2 -funroll-loops -gnatp)

All of this breaks once I turn on -O3.
Not sure whether this is a lottery or a mine field. ;-)

Cheers,
Georg



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: GNAT (GCC) Profile Guided Compilation
  2012-07-04 10:38                     ` Georg Bauhaus
@ 2012-07-04 10:57                       ` Keean Schupke
  2012-07-04 12:36                         ` Mark Lorenzen
  2012-07-04 12:38                         ` Georg Bauhaus
  0 siblings, 2 replies; 37+ messages in thread
From: Keean Schupke @ 2012-07-04 10:57 UTC (permalink / raw)


On Wednesday, 4 July 2012 11:38:57 UTC+1, Georg Bauhaus  wrote:
> On 03.07.12 01:48, Keean Schupke wrote:
> > I have done some testing with the linux "perf" tool. These are some figures for the Ada version:
> >
> >           1,014,900 l1-dcache-load-misses     #    0.01% of all L1-dcache hits
> >      12,462,973,199 l1-dcache-loads
> >           7,311,495 cache-references
> >              38,804 cache-misses              #    0.531 % of all cache refs
> >       2,588,686,069 branch-instructions
> >         388,460,030 branch-misses             #   15.01% of all branches
> >        21.885512117 seconds time elapsed
> >
> > And here are the results for the C++ version:
> >
> >             840,245 l1-dcache-load-misses     #    0.01% of all L1-dcache hits
> >      11,140,761,995 l1-dcache-loads
> >           6,019,321 cache-references
> >              27,584 cache-misses              #    0.458 % of all cache refs
> >       3,049,597,029 branch-instructions
> >         560,173,316 branch-misses             #   18.37% of all branches
> >        17.823476294 seconds time elapsed
> >
> >
> > So the interesting thing is that the Ada version has less overall branches and less branch misses than the C++ version, so it seems the profile-guided compilation has achieved as much. There is another factor limiting performance. The interesting figure would appear to be the cache-misses.
> >
> > So it would appear I need to focus on the cache utilisation of the Ada code.
> 
> FWIW, looking at the 1D vs 2D subprograms in order to learn
> about a (dis)advantage of writing 2D arrays,I found some
> things potentially interesting.
> 
> When there is no additional test in the loops,
> Apple's Instruments shows two orders of magnitude fewer
> branch instructions executed by the 2D subprogram
> compared to the 1D subprogram, 5M : 2G. This seems huge to me,
> but is reproducible. A naive look at the assembly listing offers
> some confirmation, mentioned below, though not on the same order.
> 
> With the "mod" based test added to the respective loops the number
> of branch instructions executed by the 2D subprogram increases
> to about one half of that of the 1D subprogram's. Still better.
> 
> The assembly listing of the subprograms without tests added has
> 
> - [compute_1d] 3 pairs of forward je and 1 backward jne near
>    the end
> 
> - [compute_2] 1 pair of backward jne near the end,
> 
> It appears that unrolling yields two somewhat differently
> structured lists of instructions, but I'm drifting away
> from Ada.
> 
> Compiling with profile data rearranges the jumps for 1D, adds jumps to 2D,
> and shortens both procedures. However, this slows both down using the latest
> GNAT GPL on Core i7; there is some speed-up of the 1D procedure with
> Debian's GNAT 4.4.5 on Xeon E5645, though. (-O2 -funroll-loops -gnatp)
> 
> All of this breaks once I turn on -O3.
> Not sure whether this is a lottery or a mine field. ;-)
> 
> Cheers,
> Georg


How can I turn off inlining for a function in GNAT?

GNAT seems to be automatically inlining some functions even though -gnatn is not enabled and there is no pragma Inline for them.


On the profile-guided stuff, I think you have to benchmark with your real application. I get a consistent improvement of 25% for C++ and 15% for Ada; I just can't work out at the moment why the Ada version is slower.


Cheers,
Keean.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: GNAT (GCC) Profile Guided Compilation
  2012-07-04 10:57                       ` Keean Schupke
@ 2012-07-04 12:36                         ` Mark Lorenzen
  2012-07-04 12:38                         ` Georg Bauhaus
  1 sibling, 0 replies; 37+ messages in thread
From: Mark Lorenzen @ 2012-07-04 12:36 UTC (permalink / raw)


On 4 Jul., 12:57, Keean Schupke <keean.schu...@googlemail.com> wrote:
>
> How can I turn off inlining for a function in GNAT?
>
> GNAT seems to be automatically inlining some functions despite not having -gnatn enabled nor having a pragma Inline for the function.
>

Option "-fno-inline"?

Regards,

Mark L



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: GNAT (GCC) Profile Guided Compilation
  2012-07-04 10:57                       ` Keean Schupke
  2012-07-04 12:36                         ` Mark Lorenzen
@ 2012-07-04 12:38                         ` Georg Bauhaus
  2012-07-14 20:17                           ` Keean Schupke
  1 sibling, 1 reply; 37+ messages in thread
From: Georg Bauhaus @ 2012-07-04 12:38 UTC (permalink / raw)


On 04.07.12 12:57, Keean Schupke wrote:
> On Wednesday, 4 July 2012 11:38:57 UTC+1, Georg Bauhaus  wrote:
>> On 03.07.12 01:48, Keean Schupke wrote:
>>> I have done some testing with the linux "perf" tool. These are some figures for the Ada version:
>>>
>>>           1,014,900 l1-dcache-load-misses     #    0.01% of all L1-dcache hits
>>>      12,462,973,199 l1-dcache-loads
>>>           7,311,495 cache-references
>>>              38,804 cache-misses              #    0.531 % of all cache refs
>>>       2,588,686,069 branch-instructions
>>>         388,460,030 branch-misses             #   15.01% of all branches
>>>        21.885512117 seconds time elapsed
>>>
>>> And here are the results for the C++ version:
>>>
>>>             840,245 l1-dcache-load-misses     #    0.01% of all L1-dcache hits
>>>      11,140,761,995 l1-dcache-loads
>>>           6,019,321 cache-references
>>>              27,584 cache-misses              #    0.458 % of all cache refs
>>>       3,049,597,029 branch-instructions
>>>         560,173,316 branch-misses             #   18.37% of all branches
>>>        17.823476294 seconds time elapsed
>>>
>>>
>>> So the interesting thing is that the Ada version has less overall branches and less branch misses than the C++ version, so it seems the profile-guided compilation has achieved as much. There is another factor limiting performance. The interesting figure would appear to be the cache-misses.
>>>
>>> So it would appear I need to focus on the cache utilisation of the Ada code.
>>
>> FWIW, looking at the 1D vs 2D subprograms in order to learn
>> about a (dis)advantage of writing 2D arrays,I found some
>> things potentially interesting.
>>
>> When there is no additional test in the loops,
>> Apple's Instruments shows two orders of magnitude fewer
>> branch instructions executed by the 2D subprogram
>> compared to the 1D subprogram, 5M : 2G. This seems huge to me,
>> but is reproducible. A naive look at the assembly listing offers
>> some confirmation, mentioned below, though not on the same order.
>>
>> With the "mod" based test added to the respective loops the number
>> of branch instructions executed by the 2D subprogram increases
>> to about one half of that of the 1D subprogram's. Still better.
>>
>> The assembly listing of the subprograms without tests added has
>>
>> - [compute_1d] 3 pairs of forward je and 1 backward jne near
>>    the end
>>
>> - [compute_2] 1 pair of backward jne near the end,
>>
>> It appears that unrolling yields two somewhat differently
>> structured lists of instructions, but I'm drifting away
>> from Ada.
>>
>> Compiling with profile data rearranges the jumps for 1D, adds jumps to 2D,
>> and shortens both procedures. However, this slows both down using the latest
>> GNAT GPL on Core i7; there is some speed-up of the 1D procedure with
>> Debian's GNAT 4.4.5 on Xeon E5645, though. (-O2 -funroll-loops -gnatp)
>>
>> All of this breaks once I turn on -O3.
>> Not sure whether this is a lottery or a mine field. ;-)
>>
>> Cheers,
>> Georg
> 
> 
> How can I turn off inlining for a function in GNAT?

Sometimes by reordering code, making sure the body hasn't
been seen when the compiler sees the call statement.
Or try separate compilation.  The following arrangement
appears to prevent inline expansion of Inc, even when
just the main unit is fed to gnatmake -O3 -gnatNp, so that
GNAT translates everything else automatically, using the
same switches.

-fno-inline is another switch to consider. However, it
appears to be interfering with other optimizations (loop
unrolling, vectorizer, from what I can guess).

package Prevent_Inline is
   type List is array (Positive range <>) of Integer;
   procedure Inc (X : in out Integer);
   procedure Inc_All (A : in out List);
end Prevent_Inline;

with Prevent_Inline.Aux;
package body Prevent_Inline is

   procedure Inc (X : in out Integer) is
   begin
      X := X + 1;
   end Inc;

   procedure Inc_All (A : in out List)
     renames Prevent_Inline.Aux;

end Prevent_Inline;

procedure Prevent_Inline.Aux (A : in out List) is
begin
   for X of A loop
      Inc (X);
   end loop;
end Prevent_Inline.Aux;

with Prevent_Inline;    use Prevent_Inline;
procedure Test_Prevent_Inline is
   X : List (1 .. 10);
begin
   Inc_All (X);
end Test_Prevent_Inline;



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: GNAT (GCC) Profile Guided Compilation
  2012-07-04 12:38                         ` Georg Bauhaus
@ 2012-07-14 20:17                           ` Keean Schupke
  2012-07-14 20:33                             ` Keean Schupke
  2012-07-14 20:43                             ` Niklas Holsti
  0 siblings, 2 replies; 37+ messages in thread
From: Keean Schupke @ 2012-07-14 20:17 UTC (permalink / raw)



Okay, I think I have tracked down the performance problem, but I am not sure how to fix it. It would appear that C++ code returning a boolean from a function generates a decision tree using tests and branches, whereas the Ada code sets the result into a Boolean variable. The result is that C++ bails out of the evaluation as soon as it can (i.e. if one side of an 'and' is false, or one side of an 'or' is true), but Ada always evaluates all parts of the expression.

Is this a difference in language semantics, and what is the best way to deal with it? Do I need to rewrite all 'and' and 'or' expressions in conditionals as nested if statements to get the evaluate-only-as-far-as-necessary semantics of C/C++?


Cheers,
Keean.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: GNAT (GCC) Profile Guided Compilation
  2012-07-14 20:17                           ` Keean Schupke
@ 2012-07-14 20:33                             ` Keean Schupke
  2012-07-14 20:43                             ` Niklas Holsti
  1 sibling, 0 replies; 37+ messages in thread
From: Keean Schupke @ 2012-07-14 20:33 UTC (permalink / raw)


On Saturday, 14 July 2012 21:17:27 UTC+1, Keean Schupke  wrote:
> 
> Okay, I think I have tracked down the performance problem, but I am not sure how to fix it. It would appear that C++ code that returns a boolean from a function, generates a decision tree using tests and branches, whereas Ada is setting the result into a Boolean variable. This has the result that C++ is bailing out of the evaluation as soon as it can (IE if one side of an and is false, or one side of an or is true), but Ada is always evaluating all parts of the expressions.
> 
> Is this a difference in language semantics, and what is the best way to deal with it? Do I need to rewrite all 'and' and 'or' statements in conditionals as nested if statements to get the evaluate only as far as necessary semantics like C/C++?
> 
> 
> Cheers,
> Keean.

Just answering my own question: it looks like I should be using "and then" and "or else" instead of "and" and "or" in Boolean expressions.
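
A minimal sketch of the difference (the function names and the dummy tests are invented for illustration):

with Ada.Text_IO;
procedure Short_Circuit_Demo is
   --  Plain "and" always evaluates both operands; "and then" skips the
   --  right operand whenever the left one is already False.
   function Positive_Side (X : Integer) return Boolean is
   begin
      return X > 0;
   end Positive_Side;

   function Costly_Test (X : Integer) return Boolean is
   begin
      return (X * X) mod 7 = 1;   --  stands in for an expensive check
   end Costly_Test;

   Hits : Natural := 0;
begin
   for X in -1_000 .. 1_000 loop
      --  With "and", Costly_Test would run even when X <= 0;
      --  with "and then", it runs only when Positive_Side (X) is True.
      if Positive_Side (X) and then Costly_Test (X) then
         Hits := Hits + 1;
      end if;
   end loop;
   Ada.Text_IO.Put_Line (Natural'Image (Hits));
end Short_Circuit_Demo;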

Cheers,
Keean.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: GNAT (GCC) Profile Guided Compilation
  2012-07-14 20:17                           ` Keean Schupke
  2012-07-14 20:33                             ` Keean Schupke
@ 2012-07-14 20:43                             ` Niklas Holsti
  2012-07-14 22:32                               ` Keean Schupke
  1 sibling, 1 reply; 37+ messages in thread
From: Niklas Holsti @ 2012-07-14 20:43 UTC (permalink / raw)


On 12-07-14 23:17 , Keean Schupke wrote:
>
> Okay, I think I have tracked down the performance problem, but I am
> not sure how to fix it. It would appear that C++ code that returns a
> boolean from a function, generates a decision tree using tests and
> branches, whereas Ada is setting the result into a Boolean variable.
> This has the result that C++ is bailing out of the evaluation as soon
> as it can (IE if one side of an and is false, or one side of an or is
> true), but Ada is always evaluating all parts of the expressions.
>
> Is this a difference in language semantics, and what is the best way
> to deal with it? Do I need to rewrite all 'and' and 'or' statements
> in conditionals as nested if statements to get the evaluate only as
> far as necessary semantics like C/C++?

Use the short-circuit forms of "and" and "or": "and then" and "or else".

-- 
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
       .      @       .





^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: GNAT (GCC) Profile Guided Compilation
  2012-07-14 20:43                             ` Niklas Holsti
@ 2012-07-14 22:32                               ` Keean Schupke
  2012-07-14 23:40                                 ` Keean Schupke
  0 siblings, 1 reply; 37+ messages in thread
From: Keean Schupke @ 2012-07-14 22:32 UTC (permalink / raw)


On Saturday, 14 July 2012 21:43:20 UTC+1, Niklas Holsti  wrote:
> On 12-07-14 23:17 , Keean Schupke wrote:
> >
> > Okay, I think I have tracked down the performance problem, but I am
> > not sure how to fix it. It would appear that C++ code that returns a
> > boolean from a function, generates a decision tree using tests and
> > branches, whereas Ada is setting the result into a Boolean variable.
> > This has the result that C++ is bailing out of the evaluation as soon
> > as it can (IE if one side of an and is false, or one side of an or is
> > true), but Ada is always evaluating all parts of the expressions.
> >
> > Is this a difference in language semantics, and what is the best way
> > to deal with it? Do I need to rewrite all 'and' and 'or' statements
> > in conditionals as nested if statements to get the evaluate only as
> > far as necessary semantics like C/C++?
> 
> Use the short-circuit forms of "and" and "or": "and then" and "or else".
> 
> -- 
> Niklas Holsti
> Tidorum Ltd
> niklas holsti tidorum fi
>        .      @       .

Yes, this appears to be the biggest cause of profile-guided compilation not working as well for Ada as for C++. With the short-circuit forms I get the following speeds (thousands of simulations per second).

C++: 40k, profile-guided: 56k
Ada: 40k, profile-guided: 54k

So that's a huge improvement (up from 46k to 54k).

Cheers,
Keean.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: GNAT (GCC) Profile Guided Compilation
  2012-07-14 22:32                               ` Keean Schupke
@ 2012-07-14 23:40                                 ` Keean Schupke
  2012-07-15  7:15                                   ` Niklas Holsti
  0 siblings, 1 reply; 37+ messages in thread
From: Keean Schupke @ 2012-07-14 23:40 UTC (permalink / raw)


On Saturday, 14 July 2012 23:32:49 UTC+1, Keean Schupke  wrote:
> On Saturday, 14 July 2012 21:43:20 UTC+1, Niklas Holsti  wrote:
> > On 12-07-14 23:17 , Keean Schupke wrote:
> > >
> > > Okay, I think I have tracked down the performance problem, but I am
> > > not sure how to fix it. It would appear that C++ code that returns a
> > > boolean from a function, generates a decision tree using tests and
> > > branches, whereas Ada is setting the result into a Boolean variable.
> > > This has the result that C++ is bailing out of the evaluation as soon
> > > as it can (IE if one side of an and is false, or one side of an or is
> > > true), but Ada is always evaluating all parts of the expressions.
> > >
> > > Is this a difference in language semantics, and what is the best way
> > > to deal with it? Do I need to rewrite all 'and' and 'or' statements
> > > in conditionals as nested if statements to get the evaluate only as
> > > far as necessary semantics like C/C++?
> > 
> > Use the short-circuit forms of "and" and "or": "and then" and "or else".
> > 
> > -- 
> > Niklas Holsti
> > Tidorum Ltd
> > niklas holsti tidorum fi
> >        .      @       .
> 
> Yes, this appears to be the biggest cause of the profile guided compilation not working as well for Ada as C++. With the short-circuit forms I get the following speeds (thousands of simulations per second).
> 
> C++: 40k, profile-guided: 56k
> Ada: 40k, profile-guided: 54k
> 
> So thats a huge improvement (up from 46k to 54k).
> 
> Cheers,
> Keean.

After a bit of checking line-by-line to make sure I am using 'and then', 'or else' and 'constant' everywhere I can in the code, Ada is outperforming C++ under profile-guided compilation for the first time. Both C++ and Ada get about 40k simulations per second with normal compilation; profile-guided, C++ achieves 56k simulations per second and Ada 57k.


Cheers,
Keean.





^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: GNAT (GCC) Profile Guided Compilation
  2012-07-14 23:40                                 ` Keean Schupke
@ 2012-07-15  7:15                                   ` Niklas Holsti
  2012-07-15  8:27                                     ` Keean Schupke
  2012-07-15 11:02                                     ` Niklas Holsti
  0 siblings, 2 replies; 37+ messages in thread
From: Niklas Holsti @ 2012-07-15  7:15 UTC (permalink / raw)


On 12-07-15 02:40 , Keean Schupke wrote:
> After a bit of checking line-by-line to make sure I am using 'and
> then', 'or else' and 'constant' everywhere I can in the code, Ada is
> outperforming C++ when using profile-guided compilation for the first
> time. both C++ and Ada are getting about 40k simulations per second
> with normal compilation, C++ is achieving 56k simulations per second
> profile-guided, and Ada 57k per second.

Good news.

An Ada compiler is of course free to implement the ordinary 
"long-circuit" operators "and"/"or" using short-circuit code, if the 
evaluation of the operands has no side effects. What are the operands 
typically like in your program? Are they function calls, or simple 
variables?

It seems to me that the general belief, regarding the expected relative 
speeds of the short-circuit code versus the long-circuit code for 
Boolean expressions with simple operands, is that the branch penalties 
on modern processors are so large that the short-circuit form is not 
obviously faster. This may explain why the Ada compiler is not using the 
short-circuit code automatically.

Clearly, if the expression is "(simple operand likely to be True) and 
(longer and longer expression)", at some point the short-circuit code 
(or changing to "and then") will become faster than the long-circuit 
code, whatever the branch penalty. This point will come sooner if 
profile-guidance is used to reduce the branch penalty.

-- 
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
       .      @       .





^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: GNAT (GCC) Profile Guided Compilation
  2012-07-15  7:15                                   ` Niklas Holsti
@ 2012-07-15  8:27                                     ` Keean Schupke
  2012-07-18 10:01                                       ` Georg Bauhaus
  2012-07-15 11:02                                     ` Niklas Holsti
  1 sibling, 1 reply; 37+ messages in thread
From: Keean Schupke @ 2012-07-15  8:27 UTC (permalink / raw)


On Sunday, 15 July 2012 08:15:35 UTC+1, Niklas Holsti  wrote:
> On 12-07-15 02:40 , Keean Schupke wrote:
> > After a bit of checking line-by-line to make sure I am using 'and
> > then', 'or else' and 'constant' everywhere I can in the code, Ada is
> > outperforming C++ when using profile-guided compilation for the first
> > time. both C++ and Ada are getting about 40k simulations per second
> > with normal compilation, C++ is achieving 56k simulations per second
> > profile-guided, and Ada 57k per second.
> 
> Good news.
> 
> An Ada compiler is of course free to implement the ordinary 
> "long-circuit" operators "and"/"or" using short-circuit code, if the 
> evaluation of the operands has no side effects. What are the operands 
> typically like in your program? Are they function calls, or simple 
> variables?
> 
> It seems to me that the general belief, regarding the expected relative 
> speeds of the short-circuit code versus the long-circuit code for 
> Boolean expressions with simple operands, is that the branch penalties 
> on modern processors are so large that the short-circuit form is not 
> obviously faster. This may explain why the Ada compiler is not using the 
> short-circuit code automatically.
> 
> Clearly, if the expression is &quot;(simple operand likely to be True) and 
> (longer and longer expression)&quot;, at some point the short-circuit code 
> (or changing to &quot;and then&quot;) will become faster than the long-circuit 
> code, whatever the branch penalty. This point will come sooner if 
> profile-guidance is used to reduce the branch penalty.
> 
> -- 
> Niklas Holsti
> Tidorum Ltd
> niklas holsti tidorum fi
>        .      @       .


The code is side-effect free. I wonder if using "pragma Pure_Function(X)" would help here?
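
(Concretely, I mean something like this next to F's declaration in the spec -- a sketch only, not tested; pragma Pure_Function is GNAT-specific and is purely an assertion to the optimiser that the function has no side effects, so calls may be combined or hoisted, but nothing is checked.)

   function F (N : Node; V : Value) return Boolean;
   pragma Pure_Function (F);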

Actually I think the assumption that branch misprediction has a high cost is wrong on modern speculative processors. In my tests on different Core 2 processors, simple branching beats all the branchless methods. This is probably because they can decode multiple instructions per cycle and speculate well past the branch, so a misprediction just throws away the micro-ops from the wrongly predicted path, and the cost is fairly low. I realise this somewhat contradicts what I said about profile-guided compilation, but there is still a higher cost to misprediction, and for my code that was worth about 6k simulations per second with profile-guided compilation. However, short-circuit evaluation plus profile-guided compilation was worth another 10k simulations per second. The biggest benefit appears to come from rearranging the clauses of the Boolean logic so that the most often taken short-circuit path comes first, although the reordering of general branches still has a significant effect.
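
As a self-contained sketch of the reordering idea (invented names and made-up tests, not the simulator code):

procedure Reorder_Demo is

   Count : Natural := 0;

   function Likely (X : Integer) return Boolean is
   begin
      return X mod 16 /= 0;              -- True for most X: the common case
   end Likely;

   function Costly (X : Integer) return Boolean is
   begin
      return X mod 1_000_003 = 0;        -- stands in for an expensive test
   end Costly;

begin
   for X in 1 .. 1_000 loop
      if Likely (X) or else Costly (X) then   -- Costly is skipped for most X
         Count := Count + 1;
      end if;
   end loop;
end Reorder_Demo;

Putting Costly first instead would make the expensive test run on every iteration; ordering the clauses by how often they decide the result is where most of my gain came from.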

The most critical Boolean expression has a format like this:



function F (N : Node; V : Value) return Boolean is
begin
    return (N.Enum = Const) or else ((N.Enum = V) = (N.Number = 0));
end F;

B : constant Boolean := F (N1, V)
    and then F (N2, V)
    and then F (N3, V)
    and then F (N4, V);



Cheers,
Keean.




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: GNAT (GCC) Profile Guided Compilation
  2012-07-15  7:15                                   ` Niklas Holsti
  2012-07-15  8:27                                     ` Keean Schupke
@ 2012-07-15 11:02                                     ` Niklas Holsti
  2012-07-15 12:48                                       ` Keean Schupke
  1 sibling, 1 reply; 37+ messages in thread
From: Niklas Holsti @ 2012-07-15 11:02 UTC (permalink / raw)


On 12-07-15 10:15 , Niklas Holsti wrote:

> It seems to me that the general belief, regarding the expected relative
> speeds of the short-circuit code versus the long-circuit code for
> Boolean expressions with simple operands, is that the branch penalties
> on modern processors are so large that the short-circuit form is not
> obviously faster. This may explain why the Ada compiler is not using the
> short-circuit code automatically.
>
> Clearly, if the expression is "(simple operand likely to be True) and

Oops, my brain fart: change that to "likely to be False".

> (longer and longer expression)", at some point the short-circuit code
> (or changing to "and then") will become faster than the long-circuit
> code, whatever the branch penalty. This point will come sooner if
> profile-guidance is used to reduce the branch penalty.

-- 
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
       .      @       .





^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: GNAT (GCC) Profile Guided Compilation
  2012-07-15 11:02                                     ` Niklas Holsti
@ 2012-07-15 12:48                                       ` Keean Schupke
  0 siblings, 0 replies; 37+ messages in thread
From: Keean Schupke @ 2012-07-15 12:48 UTC (permalink / raw)


On Sunday, 15 July 2012 12:02:31 UTC+1, Niklas Holsti  wrote:
> On 12-07-15 10:15 , Niklas Holsti wrote:
> 
> > It seems to me that the general belief, regarding the expected relative
> > speeds of the short-circuit code versus the long-circuit code for
> > Boolean expressions with simple operands, is that the branch penalties
> > on modern processors are so large that the short-circuit form is not
> > obviously faster. This may explain why the Ada compiler is not using the
> > short-circuit code automatically.
> >
> > Clearly, if the expression is "(simple operand likely to be True) and
> 
> Oops, my brain fart: change that to "likely to be False".
> 
> > (longer and longer expression)", at some point the short-circuit code
> > (or changing to "and then") will become faster than the long-circuit
> > code, whatever the branch penalty. This point will come sooner if
> > profile-guidance is used to reduce the branch penalty.
> 
> -- 
> Niklas Holsti
> Tidorum Ltd
> niklas holsti tidorum fi
>        .      @       .

Surely True for "or else" (as soon as one operand is True, the other need not be evaluated) and False for "and then" (as soon as one operand is False, the other need not be evaluated).

Cheers,
Keean.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: GNAT (GCC) Profile Guided Compilation
  2012-07-15  8:27                                     ` Keean Schupke
@ 2012-07-18 10:01                                       ` Georg Bauhaus
  2012-07-18 17:36                                         ` Keean Schupke
  0 siblings, 1 reply; 37+ messages in thread
From: Georg Bauhaus @ 2012-07-18 10:01 UTC (permalink / raw)


On 15.07.12 10:27, Keean Schupke wrote:
> function F (N : Node; V : Value) return Boolean is
> begin
>     return (N.Enum = Const) or else ((N.Enum = V) = (N.Number = 0));
> end F;
> 
> B : constant Boolean := F (N1, V)
>     and then F (N2, V)
>     and then F (N3, V)
>     and then F (N4, V);
> 

FWIW, I have two observations after playing with the above function:

Using different ways of supplying the variables to F and to functions
like it, two things seemed to have a noticeable influence:

1) making the Node record limited (good)
2) supplying the values, not the record, to F (possibly good)

The results have varied a lot with everything (CPU, compiler, switches, ...),
and I haven't checked the indirections for correctness; in any case,
plain F (record components) did well.

with System;
package News_23.Choice is

   function F  (N : Node; V : Value) return Boolean;
   function FA (N : Node; V : Value) return Boolean;
   function FP1 (p : System.Address; V : Value) return Boolean;
   function FP2 (p : System.Address; V : Value) return Boolean;
   function FP3 (p : System.Address; V : Value) return Boolean;
   function F_3_Args (Enum   : Value;
		      Number : Numeric;
		      V	     : Value) return Boolean;
private
   Const : constant Value := Two;
end News_23.choice;

with System.Address_To_Access_Conversions, System.Storage_Elements;
with Ada.Unchecked_Conversion;
package body News_23.Choice is

   use System.Storage_Elements;
   Enum_Offset : constant := 4 * Storage_Element'Size;
   Number_Offset : constant := 8 * Storage_Element'Size;

   function F(N : Node; V : Value) return Boolean is begin
     return (N.Enum = Const) or else ((N.Enum = V) = (N.Number = 0));
   end;

   package Value_P is new System.Address_To_Access_Conversions
     (Object => Value);
   package Numeric_P is new System.Address_To_Access_Conversions
     (Object => Numeric);

   function FA (N : Node; V : Value) return Boolean is
   begin
      declare
--       Enm : Value_P.Object_Pointer renames Value_P.To_Pointer (N'Address + N.Enum'Position);
--       Num : Numeric_P.Object_Pointer renames Numeric_P.To_Pointer (N'Address + N.Number'Position);
         Enm : Value_P.Object_Pointer renames Value_P.To_Pointer (N'Address + Enum_Offset);
         Num : Numeric_P.Object_Pointer renames Numeric_P.To_Pointer (N'Address + Number_Offset);
      begin
         return (Enm.all = Const) or else ((Enm.all = V) = (Num.all = 0));
      end;
   end FA;

   function FP1 (P : System.Address; V : Value) return Boolean is
      Enm : Value;
      pragma Import (Ada, Enm);
      for Enm'Address use P + Enum_Offset;
      Num : Numeric;
      pragma Import (Ada, Num);
      for Num'Address use P + Number_Offset;
   begin
      pragma Inspection_Point (P);
      return (Enm = Const) or else ((Enm = V) = (Num = 0));
   end FP1;

   function FP2 (P : System.Address; V : Value) return Boolean is
      Enm : Value;
      pragma Import (Ada, Enm);
      for Enm'Address use To_Address (To_Integer (P) + Enum_Offset);
      Num : Numeric;
      pragma Import (Ada, Num);
      for Num'Address use To_Address (To_Integer (P) + Number_Offset);
   begin
      pragma Inspection_Point (P);
      return (Enm = Const) or else ((Enm = V) = (Num = 0));
   end FP2;

   type Adr is mod 2**Standard'Address_Size;
   function To_N is new Ada.Unchecked_Conversion (System.Address, Adr);
   function To_Adr is new Ada.Unchecked_Conversion (Adr, System.Address);

   function FP3 (P : System.Address; V : Value) return Boolean is
      Enm : Value;
      pragma Import (Ada, Enm);
      for Enm'Address use To_Adr (To_N (P) + Enum_Offset);
      Num : Numeric;
      pragma Import (Ada, Num);
      for Num'Address use To_Adr (To_N (P) + Number_Offset);
   begin
      pragma Inspection_Point (P);
      return (Enm = Const) or else ((Enm = V) = (Num = 0));
   end FP3;

   function F_3_Args (Enum : Value; Number : Numeric; V : Value) return Boolean is
   begin
      return (Enum = Const) or else ((Enum = V) = (Number = 0));
   end F_3_Args;

end News_23.Choice;
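
(For completeness: the parent package News_23, which declares Node, Value and Numeric, was not posted. A placeholder along these lines -- the type definitions are my guess, not the real ones -- is enough to compile the units above, although the address-based variants FA and FP1..FP3 of course also depend on the real record layout matching Enum_Offset and Number_Offset.)

package News_23 is
   type Value is (Zero, One, Two, Three);   -- guessed
   type Numeric is range 0 .. 1_000_000;    -- guessed
   type Node is record                      -- observation (1) above: declaring this limited helped
      Enum   : Value;
      Number : Numeric;
   end record;
end News_23;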





^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: GNAT (GCC) Profile Guided Compilation
  2012-07-18 10:01                                       ` Georg Bauhaus
@ 2012-07-18 17:36                                         ` Keean Schupke
  2012-07-19  5:42                                           ` Georg Bauhaus
  0 siblings, 1 reply; 37+ messages in thread
From: Keean Schupke @ 2012-07-18 17:36 UTC (permalink / raw)


On Wednesday, 18 July 2012 11:01:11 UTC+1, Georg Bauhaus  wrote:
> On 15.07.12 10:27, Keean Schupke wrote:
> > function F (N : Node; V : Value) return Boolean is
> > begin
> >     return (N.Enum = Const) or else ((N.Enum = V) = (N.Number = 0));
> > end F;
> > 
> > B : constant Boolean := F (N1, V)
> >     and then F (N2, V)
> >     and then F (N3, V)
> >     and then F (N4, V);
> > 
> 
> FWIW, I have two observations after playing with the above function:
> 
> Using different ways of supplying the variables to F and to functions
> like it, two things seemed to have noticeable influence:
> 
> 1) making the Node record limited (good)
> 2) supplying the values, not the record, to F (possibly good)
> 
> The results have varied a lot with everything (CPU, compiler, switches, ...),
> and I haven't checked the indirections for correctness; in any case,
> plain F (record components) did well.
> 
> with System;
> package News_23.Choice is
> 
>    function F  (N : Node; V : Value) return Boolean;
>    function FA (N : Node; V : Value) return Boolean;
>    function FP1 (p : System.Address; V : Value) return Boolean;
>    function FP2 (p : System.Address; V : Value) return Boolean;
>    function FP3 (p : System.Address; V : Value) return Boolean;
>    function F_3_Args (Enum   : Value;
> 		      Number : Numeric;
> 		      V	     : Value) return Boolean;
> private
>    Const : constant Value := Two;
> end News_23.choice;
> 
> with System.Address_To_Access_Conversions, System.Storage_Elements;
> with Ada.Unchecked_Conversion;
> package body News_23.Choice is
> 
>    use System.Storage_Elements;
>    Enum_Offset : constant := 4 * Storage_Element'Size;
>    Number_Offset : constant := 8 * Storage_Element'Size;
> 
>    function F(N : Node; V : Value) return Boolean is begin
>      return (N.Enum = Const) or else ((N.Enum = V) = (N.Number = 0));
>    end;
> 
>    package Value_P is new System.Address_To_Access_Conversions
>      (Object => Value);
>    package Numeric_P is new System.Address_To_Access_Conversions
>      (Object => Numeric);
> 
>    function FA(N : Node; V : Value) return Boolean is
>    begin
>       declare
> --       Enm : Value_P.Object_Pointer renames Value_P.To_Pointer (N'Address + N.Enum'Position);
> --       Num : Numeric_P.Object_Pointer renames Numeric_P.To_Pointer (N'Address + N.Number'Position);
>          Enm : Value_P.Object_Pointer renames Value_P.To_Pointer (N'Address + Enum_Offset);
>          Num : Numeric_P.Object_Pointer renames Numeric_P.To_Pointer (N'Address + Number_Offset);
>       begin
> 	 return (Enm.all = Const) or else ((Enm.all = V) = (Num.all = 0));
>       end;
>    end FA;
> 
>    function FP1 (P : System.Address; V : Value) return Boolean is
>       Enm : Value;
>       pragma Import (Ada, Enm);
>       for Enm'Address use P + Enum_Offset;
>       Num : Numeric;
>       pragma Import (Ada, Num);
>       for Num'Address use P + Number_Offset;
>    begin
>       pragma Inspection_Point (P);
>       return (Enm = Const) or else ((Enm = V) = (Num = 0));
>    end FP1;
> 
>    function FP2 (P : System.Address; V : Value) return Boolean is
>       Enm : Value;
>       pragma Import (Ada, Enm);
>       for Enm'Address use To_Address (To_Integer (P) + Enum_Offset);
>       Num : Numeric;
>       pragma Import (Ada, Num);
>       for Num'Address use To_Address (To_Integer (P) + Number_Offset);
>    begin
>       pragma Inspection_Point (P);
>       return (Enm = Const) or else ((Enm = V) = (Num = 0));
>    end FP2;
> 
>    type Adr is mod 2**Standard'Address_Size;
>    function To_N is new Ada.Unchecked_Conversion (System.Address, Adr);
>    function To_Adr is new Ada.Unchecked_Conversion (Adr, System.Address);
> 
>    function FP3 (P : System.Address; V : Value) return Boolean is
>       Enm : Value;
>       pragma Import (Ada, Enm);
>       for Enm'Address use To_Adr (To_N (P) + Enum_Offset);
>       Num : Numeric;
>       pragma Import (Ada, Num);
>       for Num'Address use To_Adr (To_N (P) + Number_Offset);
>    begin
>       pragma Inspection_Point (P);
>       return (Enm = Const) or else ((Enm = V) = (Num = 0));
>    end FP3;
> 
>    function F_3_Args (Enum : Value; Number : Numeric; V : Value) return Boolean is
>    begin
>      return (Enum = Const) or else ((Enum = V) = (Number = 0));
>    end F_3_Args;
> 
> end News_23.Choice;

I think if you use -O3 -gnatn (and pragma Inline(X)) the function will be inlined. Does it still make a difference then?
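
(That is, roughly this in the spec -- a sketch; with -O3 -gnatn the back end can then inline F across units, while -gnatN asks for front-end inlining instead:)

   function F (N : Node; V : Value) return Boolean;
   pragma Inline (F);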

Cheers,
Keean.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: GNAT (GCC) Profile Guided Compilation
  2012-07-18 17:36                                         ` Keean Schupke
@ 2012-07-19  5:42                                           ` Georg Bauhaus
  2012-07-19 10:18                                             ` Keean Schupke
  0 siblings, 1 reply; 37+ messages in thread
From: Georg Bauhaus @ 2012-07-19  5:42 UTC (permalink / raw)


On 18.07.12 19:36, Keean Schupke wrote:
> I think if you use -O3 -gnatn (and pragma Inline(X)) the function will be inlined. Does it still make a difference then?

It did. In particular, A_T_A_Conversion seemed to need -gnatN.
The differences were small, in a range between .67s and .73s
for 50_000_000 iterations, trying a number of combinations of
switches. I haven't tried profile guided optimization, though.






^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: GNAT (GCC) Profile Guided Compilation
  2012-07-19  5:42                                           ` Georg Bauhaus
@ 2012-07-19 10:18                                             ` Keean Schupke
  0 siblings, 0 replies; 37+ messages in thread
From: Keean Schupke @ 2012-07-19 10:18 UTC (permalink / raw)


On Thursday, 19 July 2012 06:42:05 UTC+1, Georg Bauhaus  wrote:
> On 18.07.12 19:36, Keean Schupke wrote:
> > I think if you use -O3 -gnatn (and pragma Inline(X)) the function will be inlined. Does it still make a difference then?
> 
> It did. In particular, A_T_A_Conversion seemed to need -gnatN.
> The differences were small, in a range between .67s and .73s
> for 50_000_000 iterations, trying a number of combinations of
> switches. I haven't tried profile guided optimization, though.

Here's my makefile if it's useful:


test: test.adb
    rm -f *.gcda
    gnatmake -march=native -O3 -fprofile-generate -ftest-coverage -gnatp -gnatn test -bargs -shared -cargs -fdata-sections -ffunction-sections -largs -Wl,--gc-sections
    ./test
    touch test.adb
    gnatmake -march=native -O3 -fprofile-use -gnatp -gnatn test -bargs -shared -cargs -fdata-sections -ffunction-sections -largs -Wl,--gc-sections


Cheers,
Keean.



^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2012-07-26 14:53 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-29  9:17 GNAT (GCC) Profile Guided Compilation Keean Schupke
2012-06-29  9:34 ` Dmitry A. Kazakov
2012-06-29 10:01   ` Keean Schupke
2012-06-29 10:24     ` Keean Schupke
2012-06-29 12:26       ` stefan-lucks
2012-06-29 12:51         ` Keean Schupke
2012-06-29 12:05     ` Dmitry A. Kazakov
2012-06-29 10:48 ` Simon Wright
2012-06-29 11:14   ` Keean Schupke
2012-06-29 12:39 ` gautier_niouzes
2012-06-29 12:52   ` Keean Schupke
2012-06-29 14:14     ` gautier_niouzes
2012-06-29 15:05       ` gautier_niouzes
2012-06-29 17:03         ` Keean Schupke
2012-07-01  9:29           ` Georg Bauhaus
2012-07-01 17:45           ` Georg Bauhaus
2012-07-01 22:57             ` Keean Schupke
2012-07-02 17:15               ` Georg Bauhaus
2012-07-02 17:26                 ` Keean Schupke
2012-07-02 23:48                   ` Keean Schupke
2012-07-04 10:38                     ` Georg Bauhaus
2012-07-04 10:57                       ` Keean Schupke
2012-07-04 12:36                         ` Mark Lorenzen
2012-07-04 12:38                         ` Georg Bauhaus
2012-07-14 20:17                           ` Keean Schupke
2012-07-14 20:33                             ` Keean Schupke
2012-07-14 20:43                             ` Niklas Holsti
2012-07-14 22:32                               ` Keean Schupke
2012-07-14 23:40                                 ` Keean Schupke
2012-07-15  7:15                                   ` Niklas Holsti
2012-07-15  8:27                                     ` Keean Schupke
2012-07-18 10:01                                       ` Georg Bauhaus
2012-07-18 17:36                                         ` Keean Schupke
2012-07-19  5:42                                           ` Georg Bauhaus
2012-07-19 10:18                                             ` Keean Schupke
2012-07-15 11:02                                     ` Niklas Holsti
2012-07-15 12:48                                       ` Keean Schupke

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox