* sub-optimal code for packed boolean arrays -- bug or inherent limitation @ 2007-07-02 19:34 Alinabi 2007-07-02 20:08 ` Ludovic Brenta ` (6 more replies) 0 siblings, 7 replies; 29+ messages in thread From: Alinabi @ 2007-07-02 19:34 UTC (permalink / raw) Hello, everyone. I am writing a chess program in Ada, using bitboards for position representation. Bitboards are just 64 bit integers in which each bit represents a predicate about a square of the board. For example, you would have a bitboard called WhitePawns, which has a one for every square that contains a white pawn, and zero for every other square. In Ada, there are two possible implementations of a bitboard: using Unsigned_64, which would closely mirror the way things are done in C, and using packed arrays of booleans. The C-like version, using Unsigned_64, leads to some rather obfuscated code, so I decided to investigate the efficiency of the code that gnat generates for some basic operations on packed arrays, which I implemented in the following package: package Test is type Index_T is new Integer range 0..63; type Bitboard_T is private; procedure Set (B : in out Bitboard_T; I : in Index_T); procedure Clear (B : in out Bitboard_T; I : in Index_T); procedure Clear (B : in out Bitboard_T); procedure Flip (B : in out Bitboard_T; I : in Index_T); function Is_Set(B : in Bitboard_T; I : in Index_T) return Boolean ; private type Bitboard_T is array(Index_T) of Boolean; pragma Pack(Bitboard_T); pragma Inline_Always(Set, Clear, Flip, Is_Set); end Test; package body Test is pragma Optimize(Time); procedure Set(B : in out Bitboard_T; I : in Index_T) is begin B(I) := True; end; procedure Clear(B : in out Bitboard_T; I : in Index_T) is begin B(I) := False; end; procedure Clear(B : in out Bitboard_T) is begin B := B xor B; end; procedure Flip(B : in out Bitboard_T; I : in Index_T) is begin B(i) := not B(i); end; function Is_Set(B : in Bitboard_T; I : in Index_T) return Boolean is begin return B(I); end; end Test; When compiled with gnatmake -g -gnatp -gnatn -march=opteron -O3 -fdump-tree-optimized test.adb using the ada compiler in gcc 4.1.2, the code generated is generally optimal, or close to optimal. For example, the code generated for Set is procedure Set(B : in out Bitboard_T; I : in Index_T) is 0: 89 f1 mov %esi,%ecx begin B(I) := True; 2: b8 01 00 00 00 mov $0x1,%eax 7: 48 d3 e0 shl %cl,%rax a: 48 09 f8 or %rdi,%rax end; d: c3 retq The only notable exception is procedure Flip, which becomes procedure Flip(B : in out Bitboard_T; I : in Index_T) is 30: 89 f1 mov %esi,%ecx begin B(i) := not B(i); 32: 48 c7 c0 fe ff ff ff mov $0xfffffffffffffffe,%rax 39: 48 d3 c0 rol %cl,%rax 3c: 48 21 f8 and %rdi,%rax 3f: 48 d3 ef shr %cl,%rdi 42: 83 f7 01 xor $0x1,%edi 45: 83 e7 01 and $0x1,%edi 48: 48 d3 e7 shl %cl,%rdi 4b: 48 09 f8 or %rdi,%rax end; 4e: c3 retq instead of the shorter mov %esi, %ecx mov 0x1, %rax shl %cl, %rax xor % rax, %rdx retq I don't know much (if anything) about compiler internals, so I was wondering if I should file a bug report. Is there some good theoretical justification for all that extraneous shifts and rotations? I would think that if the compiler can figure out the best way to set or clear a bit in a register using shift and logical ops, than it should also be able to negate it efficiently. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-02 19:34 sub-optimal code for packed boolean arrays -- bug or inherent limitation Alinabi @ 2007-07-02 20:08 ` Ludovic Brenta 2007-07-03 1:01 ` Jeffrey R. Carter ` (5 subsequent siblings) 6 siblings, 0 replies; 29+ messages in thread From: Ludovic Brenta @ 2007-07-02 20:08 UTC (permalink / raw) Alinabi <alexander.the.average@gmail.com> writes: > Hello, everyone. Unfortunately I don't know enough of GCC's optimisers to answer. I think you should ask your (interesting) question on the GCC mailing list, where not only the GNAT developers but the optimiser developers dwell. The list is gcc@gcc.gnu.org. -- Ludovic Brenta. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-02 19:34 sub-optimal code for packed boolean arrays -- bug or inherent limitation Alinabi 2007-07-02 20:08 ` Ludovic Brenta @ 2007-07-03 1:01 ` Jeffrey R. Carter 2007-07-03 7:22 ` Harald Korneliussen 2007-07-03 7:59 ` gautier_niouzes ` (4 subsequent siblings) 6 siblings, 1 reply; 29+ messages in thread From: Jeffrey R. Carter @ 2007-07-03 1:01 UTC (permalink / raw) Alinabi wrote: > > The only notable exception is procedure Flip, which becomes > > procedure Flip(B : in out Bitboard_T; I : in Index_T) is > 30: 89 f1 mov %esi,%ecx > begin > B(i) := not B(i); > 32: 48 c7 c0 fe ff ff ff mov $0xfffffffffffffffe,%rax > 39: 48 d3 c0 rol %cl,%rax > 3c: 48 21 f8 and %rdi,%rax > 3f: 48 d3 ef shr %cl,%rdi > 42: 83 f7 01 xor $0x1,%edi > 45: 83 e7 01 and $0x1,%edi > 48: 48 d3 e7 shl %cl,%rdi > 4b: 48 09 f8 or %rdi,%rax > end; > 4e: c3 retq > > instead of the shorter > > mov %esi, %ecx > mov 0x1, %rax > shl %cl, %rax > xor % rax, %rdx > retq > > I don't know much (if anything) about compiler internals, so I was > wondering if I should file a bug report. Is there some good > theoretical justification for all that extraneous shifts and > rotations? I would think that if the compiler can figure out the best > way to set or clear a bit in a register using shift and logical ops, > than it should also be able to negate it efficiently. The code gives the correct results, so it can't be a compiler bug. Sometimes a compiler will do better than hand-coded assembler; sometimes it will do worse. I doubt the compiler's code will make your chess program fail to meet its timing requirements; I doubt if you'll notice the difference at all, so I don't see why you care. If you must have a specific sequence of machine code operations, you should use a machine-code insertion. -- Jeff Carter "Strange women lying in ponds distributing swords is no basis for a system of government." Monty Python & the Holy Grail 66 ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-03 1:01 ` Jeffrey R. Carter @ 2007-07-03 7:22 ` Harald Korneliussen 2007-07-03 8:37 ` Georg Bauhaus 0 siblings, 1 reply; 29+ messages in thread From: Harald Korneliussen @ 2007-07-03 7:22 UTC (permalink / raw) On Jul 3, 3:01 am, "Jeffrey R. Carter" <spam.jrcarter....@acm.nospam.org> wrote: > I doubt the compiler's code will make your chess > program fail to meet its timing requirements; I doubt if you'll notice > the difference at all, so I don't see why you care. If you must have a > specific sequence of machine code operations, you should use a > machine-code insertion. > Now now, chess programs written to compete with other chess programs need all the performance they can get. You could argue that he should be using C with inline assembly (and probably another compiler than GCC), but wouldn't it be kind of neat if a competitive and readable chess program was written in Ada? It would be a feather in the hat for Ada's low-level features, which is what this is all about. I would ask the GCC team, after looking at the intermediate code. I'm sure they're not content with merely having the compiler generate correct code. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-03 7:22 ` Harald Korneliussen @ 2007-07-03 8:37 ` Georg Bauhaus 0 siblings, 0 replies; 29+ messages in thread From: Georg Bauhaus @ 2007-07-03 8:37 UTC (permalink / raw) On Tue, 2007-07-03 at 00:22 -0700, Harald Korneliussen wrote: > On Jul 3, 3:01 am, "Jeffrey R. Carter" > <spam.jrcarter....@acm.nospam.org> wrote: > > I doubt the compiler's code will make your chess > > program fail to meet its timing requirements; I doubt if you'll notice > > the difference at all, so I don't see why you care. If you must have a > > specific sequence of machine code operations, you should use a > > machine-code insertion. > > > Now now, chess programs written to compete with other chess programs > need all the performance they can get. You could argue that he should > be using C with inline assembly (and probably another compiler than > GCC), but wouldn't it be kind of neat if a competitive and readable > chess program was written in Ada? Clear Ada is the point of this Ada program as the OP has explained on the GCC mailing list. > I would ask the GCC team, after looking at the intermediate code. I'm > sure they're not content with merely having the compiler generate > correct code. The suggestion was different. Still, the problem is identified as non-optimal code. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-02 19:34 sub-optimal code for packed boolean arrays -- bug or inherent limitation Alinabi 2007-07-02 20:08 ` Ludovic Brenta 2007-07-03 1:01 ` Jeffrey R. Carter @ 2007-07-03 7:59 ` gautier_niouzes 2007-07-03 9:25 ` Stefan Lucks ` (3 subsequent siblings) 6 siblings, 0 replies; 29+ messages in thread From: gautier_niouzes @ 2007-07-03 7:59 UTC (permalink / raw) ... > The only notable exception is procedure Flip, which becomes > > procedure Flip(B : in out Bitboard_T; I : in Index_T) is > 30: 89 f1 mov %esi,%ecx > begin > B(i) := not B(i); > 32: 48 c7 c0 fe ff ff ff mov $0xfffffffffffffffe,%rax > 39: 48 d3 c0 rol %cl,%rax > 3c: 48 21 f8 and %rdi,%rax > 3f: 48 d3 ef shr %cl,%rdi > 42: 83 f7 01 xor $0x1,%edi > 45: 83 e7 01 and $0x1,%edi > 48: 48 d3 e7 shl %cl,%rdi > 4b: 48 09 f8 or %rdi,%rax > end; > 4e: c3 retq > > instead of the shorter > > mov %esi, %ecx > mov 0x1, %rax > shl %cl, %rax > xor % rax, %rdx > retq > > I don't know much (if anything) about compiler internals, so I was > wondering if I should file a bug report. Is there some good > theoretical justification for all that extraneous shifts and > rotations? I would think that if the compiler can figure out the best > way to set or clear a bit in a register using shift and logical ops, > than it should also be able to negate it efficiently. Did you time both codes ? Often a much shorter code is not faster at all. BTW, you can also do assembler insertions with Ada (although it would be a pity if you had to). G. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-02 19:34 sub-optimal code for packed boolean arrays -- bug or inherent limitation Alinabi ` (2 preceding siblings ...) 2007-07-03 7:59 ` gautier_niouzes @ 2007-07-03 9:25 ` Stefan Lucks 2007-07-03 12:40 ` Stefan Lucks 2007-07-03 15:42 ` Adam Beneschan ` (2 subsequent siblings) 6 siblings, 1 reply; 29+ messages in thread From: Stefan Lucks @ 2007-07-03 9:25 UTC (permalink / raw) As Clear, internally using XOR, is optimised well, have you tried to rewrite Flip using XOR for bit-flipping, e.g., as follows: procedure Flip(B : in out Bitboard_T; I : in Index_T) is One: constant Bitborard := (others => True); begin B(i) := B(i) xor One; end; (Even though the optimiser appararently could be improved here -- I would not consider this a bug, it is more like a missing feature -- this might perhaps solve your current problem without having to use inline-assembly.) -- Stefan Lucks (moved to Bauhaus-University Weimar, Germany) <Stefan.Lucks at medien.uni-weimar.de> ------ I love the taste of Cryptanalysis in the morning! ------ ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-03 9:25 ` Stefan Lucks @ 2007-07-03 12:40 ` Stefan Lucks 0 siblings, 0 replies; 29+ messages in thread From: Stefan Lucks @ 2007-07-03 12:40 UTC (permalink / raw) On Tue, 3 Jul 2007, Stefan Lucks wrote: > As Clear, internally using XOR, is optimised well, have you tried to rewrite > Flip using XOR for bit-flipping, e.g., as follows: sorry, I meant procedure Flip(B : in out Bitboard_T; I : in Index_T) is begin B(I) := B(I) xor True; end; -- Stefan Lucks (moved to Bauhaus-University Weimar, Germany) <Stefan.Lucks at medien.uni-weimar.de> ------ I love the taste of Cryptanalysis in the morning! ------ ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-02 19:34 sub-optimal code for packed boolean arrays -- bug or inherent limitation Alinabi ` (3 preceding siblings ...) 2007-07-03 9:25 ` Stefan Lucks @ 2007-07-03 15:42 ` Adam Beneschan 2007-07-03 18:04 ` Alinabi 2007-07-08 22:58 ` Robert A Duff 6 siblings, 0 replies; 29+ messages in thread From: Adam Beneschan @ 2007-07-03 15:42 UTC (permalink / raw) On Jul 2, 12:34 pm, Alinabi <alexander.the.aver...@gmail.com> wrote: > procedure Flip(B : in out Bitboard_T; I : in Index_T) is > begin > B(i) := not B(i); > end; Just as an experiment, I wonder what the following does. I realize that this code isn't useful. procedure Sideways_Flip(B : in out Bitboard_T; I, J : in Index_T) is begin B(i) := not B(j); end; If the assembly code looks substantially the same as your example, that's an indication that your compiler isn't recognizing that the B(i) on the left and right sides of the assignment are the same. (It makes perfect sense to me that the compiler might be able to recognize a simple variable that's the same on both sides, but not an indexed variable.) I don't know that this helps solve the problem, except to indicate that most likely any other "solution" that has B(i) on both sides of := isn't going to be any better. Hmmm... it just occurred to me that if the compiler can figure out to optimize simple names on both sides of := but not indexed variables, maybe this will work: procedure Flip(B : in out Bitboard_T; I : Index_T) is The_Bit : Boolean renames B(I); begin The_Bit := not The_Bit; end; It's worth a try. Sometimes doing stuff like this is enough to confuse a compiler into doing the right thing. If nothing else works, here's a possible workaround, but be warned that I haven't tried this: procedure Flip(B : in out Bitboard_T; I : in Index_T) is type Int_64 is mod 2**64; B_Overlay : Int_64; for B_Overlay'Address use B'Address; begin B_Overlay := B_Overlay xor 2**Integer(I); --or B_Overlay := B_Overlay xor 2**Integer(63-I); end; I don't know which of the xor's will work, or if either of them work; it depends on how the elements of Bitboard_T are ordered. The above code is evil, and it's the kind of thing I'd do only out of desperation---but at least the obfuscation in your program will be limited to one place. -- Adam ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-02 19:34 sub-optimal code for packed boolean arrays -- bug or inherent limitation Alinabi ` (4 preceding siblings ...) 2007-07-03 15:42 ` Adam Beneschan @ 2007-07-03 18:04 ` Alinabi 2007-07-03 18:09 ` Alinabi ` (2 more replies) 2007-07-08 22:58 ` Robert A Duff 6 siblings, 3 replies; 29+ messages in thread From: Alinabi @ 2007-07-03 18:04 UTC (permalink / raw) First, let me thank all of you for your replies. I feel that I have to clarify a few things about this pet project of mine. I used to play chess at a competition level and I always wanted to write a chess program, but part of the motivation for starting the project is to learn the "dark corners" of Ada. So, I am not interested in switching to C. As I said, the bitboard operations can be implemented in C-like fashion, using Unsigned_64, masks and shifts, and that's how I have implemented them so far. This makes the static evaluation function hard to understand for someone who is not necessarily an experienced programmer. While I used to be a candidate master once upon a time, I do not play at that strength anymore and I could benefit from the advice of a strong player (who might not be a programmer). For such a person B(i) := not B(i) is much easier to understand than B := B xor Bitborad_T(Shift_Left(1, i)). I asked my question on the gcc mailing list and the answer was that I should not bother with a bug report. Their point of view, which I can understand, is that since the code generated is correct and this is such a niche feature, nobody will be in any hurry to work on it. They did suggest that I try to fix it myself, and they even pointed me at the file of interest in the gcc source. I barely have free time for my chess program, so I don't think I'll take them up on that any time soon, but if anybody here wants to give it a try, the relevant file is exp_pakd.adb Now, concerning efficiency: these basic operations on bitboards are used 1e8 to 1e9 times every time the program tries to decide what to move next. Even the smallest improvement in speed can mean the difference between searching 10 or 11 ply deep, which can mean an improvement of over 50 Elo points in strength. For Stefan Lucks: I tried both procedure Flip(B : in out Bitboard_T; I : in Index_T) is begin B(I) := B(I) xor True; end; and procedure Flip(B : in out Bitboard_T; I : in Index_T) is One: constant Bitborard := (others => False); begin One(i) := True; B(i) := B(i) xor One; end; The first one results in the same code as the original (B(i) := not B(i)) version, while the second one generates the optimal code, but if I am to explicitly use masks, I might as well go with the Unsigned_64 version. The whole point of using packed arrays is to write easy to read code and let the compiler take care of the masks and shifts behind the scene. For Adam Beneschan: procedure Flip(B : in out Bitboard_T; I : Index_T) is The_Bit : Boolean renames B(I); begin The_Bit := not The_Bit; end; results in the same code as my initial version. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-03 18:04 ` Alinabi @ 2007-07-03 18:09 ` Alinabi 2007-07-03 18:17 ` Alinabi 2007-07-03 18:36 ` Jeffrey R. Carter 2007-07-04 9:00 ` Jean-Pierre Rosen 2 siblings, 1 reply; 29+ messages in thread From: Alinabi @ 2007-07-03 18:09 UTC (permalink / raw) Sorry, I meant to say procedure Flip(B : in out Bitboard_T; I : in Index_T) is One: constant Bitborard := (others => False); begin One(i) := True; B := B xor One; end; On Jul 3, 2:04 pm, Alinabi <alexander.the.aver...@gmail.com> wrote: > > For Stefan Lucks: I tried both > > procedure Flip(B : in out Bitboard_T; I : in Index_T) is > begin > B(I) := B(I) xor True; > end; > > and > > procedure Flip(B : in out Bitboard_T; I : in Index_T) is > One: constant Bitborard := (others => False); > begin > One(i) := True; > B(i) := B(i) xor One; > end; > ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-03 18:09 ` Alinabi @ 2007-07-03 18:17 ` Alinabi 2007-07-10 2:06 ` Randy Brukardt 0 siblings, 1 reply; 29+ messages in thread From: Alinabi @ 2007-07-03 18:17 UTC (permalink / raw) One more thing: some inline assembly is unavoidable. There are two other bitboard operations that are required: First_One(Bitboard_T) returns Natural and Last_One(Bitboard_T) returns Natural; They return the position of the least significant and most significant set bit respectively. On x86 processors you can do this with one instruction BSF (bit scan forward) and BSR (bit scan reverse), and you cannot expect a compiler to generate them. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-03 18:17 ` Alinabi @ 2007-07-10 2:06 ` Randy Brukardt 0 siblings, 0 replies; 29+ messages in thread From: Randy Brukardt @ 2007-07-10 2:06 UTC (permalink / raw) "Alinabi" <alexander.the.average@gmail.com> wrote in message news:1183486662.806585.71680@q69g2000hsb.googlegroups.com... > One more thing: some inline assembly is unavoidable. There are two > other bitboard operations that are required: First_One(Bitboard_T) > returns Natural and Last_One(Bitboard_T) returns Natural; They return > the position of the least significant and most significant set bit > respectively. On x86 processors you can do this with one instruction > BSF (bit scan forward) and BSR (bit scan reverse), and you cannot > expect a compiler to generate them. There's a reason for that: Intel recommended to compiler writers to not use "complex" instructions, as sets of simple instructions are often supposed to be faster. The "complex" instructions were implemented essentially as "macros" of simple instructions. (This is information is somewhat old, so it may vary on the most recent processors.) My point is that it might not actually be faster to use those instructions than to use a loop of your own design (and there is a small chance that they'd be slower). As suggested elsewhere, you need to test that in a suitable benchmark. The number of instructions has had no real bearing on the execution time since the 486 came out (and indeed, given that some instructions are much slower than others, it never had much bearing on Intel's x86 processors). Randy. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-03 18:04 ` Alinabi 2007-07-03 18:09 ` Alinabi @ 2007-07-03 18:36 ` Jeffrey R. Carter 2007-07-03 19:42 ` Alinabi 2007-07-08 22:53 ` Robert A Duff 2007-07-04 9:00 ` Jean-Pierre Rosen 2 siblings, 2 replies; 29+ messages in thread From: Jeffrey R. Carter @ 2007-07-03 18:36 UTC (permalink / raw) Alinabi wrote: > > Now, concerning efficiency: these basic operations on bitboards > are used 1e8 to 1e9 times every time the program tries to decide > what to move next. Even the smallest improvement in speed can > mean the difference between searching 10 or 11 ply deep, which > can mean an improvement of over 50 Elo points in strength. I'm sure that's true. However, you have not demonstrated that there's a speed difference between the 2 versions. Even if there is, it doesn't necessarily mean that there will be a difference in the # of plys that can be searched. The old rule, "1st get it right, then make it fast," still applies. Once you have it finished you can easily see if modifying this single procedure actually makes a difference. Until then, you're wasting a lot of effort on something you don't know is important. You should also be aware that you're sending the compiler conflicting messages. Packing the array indicates that you want the compiler to minimize storage, even at the expense of speed, while "pragma Optimize (Time);" indicates that you want to minimize time, even at the expense of storage. Obtaining speed often requires wasting storage; the fastest implementation might be to not pack the array. I don't know how many of these you have, but with the large memories on modern machines, you might be better off with the additional 56 bytes per array. Again, you won't know for sure until you have a working version you can measure. -- Jeff Carter "Gentlemen, you can't fight in here. This is the War Room!" Dr. Strangelove 30 ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-03 18:36 ` Jeffrey R. Carter @ 2007-07-03 19:42 ` Alinabi 2007-07-04 1:12 ` Jeffrey R. Carter 2007-07-04 3:22 ` Steve 2007-07-08 22:53 ` Robert A Duff 1 sibling, 2 replies; 29+ messages in thread From: Alinabi @ 2007-07-03 19:42 UTC (permalink / raw) The bitboard representation does not to save space. In fact it is quite wasteful, which is why it is not used in embedded devices, like cell phones. If you wonder how can this save time, here is an example: suppose that you have a white pawn on c4 and you want to determine if it is a passed pawn. That means that the pawn can no longer be blocked or attacked by a black pawn, and therefore it threatens to advance and get promoted. You would have to check if there are any black pawns at b5, b6, b7, c5, c6, c7, d5, d6, d7. In the bitboard representation, you would have a bitboard called BlackPawns which has a one for every square occupied by a black pawn and than you would keep an array of bitboard masks BlackBlockers : array(a1..h8) of Bitboard, such that, for example, BlackBlockers(c4) has 1's at b5, b6, b7, c5, c6, c7, d5, d6, d7, and 0's everywhere else. Then the white pawn at c4 is passed if BlackPawns and BlackBlockers(c4) = 0, which is two machine instruction, one for indexing and one for the logical operator. As for the proof that the hand coded version is faster, here it is: Hand coded: 2 MOV, 1 SHL, 1 XOR Compiler: 2 MOV, 1 SHL, 1 XOR, 1 SHR, 1 ROL, 2 AND, 1 OR --------------------- 5 extraneous instructions, Q.E.D. As I mentioned in a previous post, I already implemented all these bitboard operations the usual way, using Unsigned_64, shifts, masks and even some inline assembly. I am not opposed to that. I was just wondering if I could use packed arrays of booleans to make the code more legible and delegate the low level work to the compiler. I did not expect the compiler to perform as well as it did. Since it only failed far off the mark in one case, I was wondering if that would be considered a bug. That question was answered in the negative. You are getting way to defensive about this. I don't have a gnat bashing agenda. On Jul 3, 2:36 pm, "Jeffrey R. Carter" <spam.jrcarter....@acm.nospam.org> wrote: > Alinabi wrote: > > > Now, concerning efficiency: these basic operations on bitboards > > are used 1e8 to 1e9 times every time the program tries to decide > > what to move next. Even the smallest improvement in speed can > > mean the difference between searching 10 or 11 ply deep, which > > can mean an improvement of over 50 Elo points in strength. > > I'm sure that's true. However, you have not demonstrated that there's a > speed difference between the 2 versions. Even if there is, it doesn't > necessarily mean that there will be a difference in the # of plys that > can be searched. The old rule, "1st get it right, then make it fast," > still applies. Once you have it finished you can easily see if modifying > this single procedure actually makes a difference. Until then, you're > wasting a lot of effort on something you don't know is important. > > You should also be aware that you're sending the compiler conflicting > messages. Packing the array indicates that you want the compiler to > minimize storage, even at the expense of speed, while "pragma Optimize > (Time);" indicates that you want to minimize time, even at the expense > of storage. Obtaining speed often requires wasting storage; the fastest > implementation might be to not pack the array. I don't know how many of > these you have, but with the large memories on modern machines, you > might be better off with the additional 56 bytes per array. Again, you > won't know for sure until you have a working version you can measure. > > -- > Jeff Carter > "Gentlemen, you can't fight in here. This is the War Room!" > Dr. Strangelove > 30 ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-03 19:42 ` Alinabi @ 2007-07-04 1:12 ` Jeffrey R. Carter 2007-07-04 10:15 ` Jeffrey Creem 2007-07-04 3:22 ` Steve 1 sibling, 1 reply; 29+ messages in thread From: Jeffrey R. Carter @ 2007-07-04 1:12 UTC (permalink / raw) Alinabi wrote: > The bitboard representation does not to save space. In fact it is > quite wasteful, which is why it is not used in embedded devices, like > cell phones. If you wonder how can this save time, here is an example: > suppose that you have a white pawn on c4 and you want to determine if > it is a passed pawn. That means that the pawn can no longer be blocked > or attacked by a black pawn, and therefore it threatens to advance and > get promoted. You would have to check if there are any black pawns at > b5, b6, b7, c5, c6, c7, d5, d6, d7. In the bitboard representation, > you would have a bitboard called BlackPawns which has a one for every > square occupied by a black pawn and than you would keep an array of > bitboard masks BlackBlockers : array(a1..h8) of Bitboard, such that, > for example, BlackBlockers(c4) has 1's at b5, b6, b7, c5, c6, c7, d5, > d6, d7, and 0's everywhere else. Then the white pawn at c4 is passed > if BlackPawns and BlackBlockers(c4) = 0, which is two machine > instruction, one for indexing and one for the logical operator. Zero is not an array of Boolean. You'd have to compare against (others => False). That the compiler recognizes this as 0 might be something else you'll want to check. > As for the proof that the hand coded version is faster, here it is: > > Hand coded: 2 MOV, 1 SHL, 1 XOR > Compiler: 2 MOV, 1 SHL, 1 XOR, 1 SHR, 1 ROL, 2 AND, 1 OR > --------------------- > 5 extraneous instructions, Q.E.D. This demonstrates only that the hand-coded version has fewer instructions. To prove that it's faster, you have to time them both. However, granting that the hand-coded version is faster, you haven't demonstrated that it makes an actual difference. > As I mentioned in a previous post, I already implemented all these > bitboard operations the usual way, using Unsigned_64, shifts, masks > and even some inline assembly. I am not opposed to that. I was just > wondering if I could use packed arrays of booleans to make the code > more legible and delegate the low level work to the compiler. I did > not expect the compiler to perform as well as it did. Since it only > failed far off the mark in one case, I was wondering if that would be > considered a bug. That question was answered in the negative. You are > getting way to defensive about this. I don't have a gnat bashing > agenda. My reaction is due to an allergy to premature optimization, the root of all evil. -- Jeff Carter "Gentlemen, you can't fight in here. This is the War Room!" Dr. Strangelove 30 ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-04 1:12 ` Jeffrey R. Carter @ 2007-07-04 10:15 ` Jeffrey Creem 2007-07-04 18:28 ` Jeffrey R. Carter 0 siblings, 1 reply; 29+ messages in thread From: Jeffrey Creem @ 2007-07-04 10:15 UTC (permalink / raw) Jeffrey R. Carter wrote: > My reaction is due to an allergy to premature optimization, the root of > all evil. > "Premature Optimization" is one of those pieces of conventional wisdom that gets repeated often but which is in fact not the problem at all..Not really. In any real-world problem, you rarely get to spend 6 months making it faster because you did such a poor job up front in designing the problem. Often, by the time one has created a design mess that is slow everywhere, spending the short time optimizing the scary loops is not enough because the design itself is broken. The real problem is not so much premature optimization but crazy micro-optimizations done early that also hurt maintainabilty/clarity of the code and often do little if anything to actually make the code faster. Spending time up front thinking about the overall design and even worrying a little about specific performance details when not taken to the extreme and when done with the benefit of metrics backed experience (v.s. I heard one time that compiler X was slow on this) is actually not all that bad. This is of course especially true when one considers that by the end of the project when you go to start profiling the code to find the hotspots and find out that your lousy tool set does not support profiling with programs with tasks, or does not support profiling of non-trivial programs at all. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-04 10:15 ` Jeffrey Creem @ 2007-07-04 18:28 ` Jeffrey R. Carter 0 siblings, 0 replies; 29+ messages in thread From: Jeffrey R. Carter @ 2007-07-04 18:28 UTC (permalink / raw) Jeffrey Creem wrote: > > In any real-world problem, you rarely get to spend 6 months making it > faster because you did such a poor job up front in designing the problem. The rule is, "1st get it right, then make it fast." If you did a poor job up front, you didn't get it right 1st. And usually no amount of small, local optimization such as the OP is discussing will help you. > The real problem is not so much premature optimization but crazy > micro-optimizations done early that also hurt maintainabilty/clarity of > the code and often do little if anything to actually make the code faster. That's pretty much the definition of premature optimization: extra (sometimes substantial) effort spent on obfuscation "for speed", done early, without any measurements to indicate that it will help. -- Jeff Carter "Hello! Smelly English K...niggets." Monty Python & the Holy Grail 08 ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-03 19:42 ` Alinabi 2007-07-04 1:12 ` Jeffrey R. Carter @ 2007-07-04 3:22 ` Steve 2007-07-04 6:31 ` Harald Korneliussen 1 sibling, 1 reply; 29+ messages in thread From: Steve @ 2007-07-04 3:22 UTC (permalink / raw) "Alinabi" <alexander.the.average@gmail.com> wrote in message news:1183491750.177186.154490@k29g2000hsd.googlegroups.com... [snip] > > As for the proof that the hand coded version is faster, here it is: > > Hand coded: 2 MOV, 1 SHL, 1 XOR > Compiler: 2 MOV, 1 SHL, 1 XOR, 1 SHR, 1 ROL, 2 AND, 1 OR > --------------------- > 5 extraneous instructions, Q.E.D. > Twenty years ago I might ave agreed with this logic, but not today. In the good ole' days you could associate an amount of time with each instructions, add them up and get a total amount of execution time. This hasn't been possible for a long time. Ever heard of "instruction scheduling" and "concurrent execution"? Todays CPU's contain "pipelines" that sometimes merge several operations into a single clock. In some cases you will find that NOP's are added to increase the speed of execution based on a detailed knowledge of the underlying processor. I agree with Jeff's assesment. You must benchmark to measure and compare performance. I work frequently with time intensive code where simulating more permutations translate to more value recovered. These small differences in code generation seldom have a significant impact on the overall performance. Greater benefits are found by changes to algorithms or approaches to a problem. Regards, Steve (The Duck) ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-04 3:22 ` Steve @ 2007-07-04 6:31 ` Harald Korneliussen 0 siblings, 0 replies; 29+ messages in thread From: Harald Korneliussen @ 2007-07-04 6:31 UTC (permalink / raw) On Jul 4, 5:22 am, "Steve" <nospam_steve...@comcast.net> wrote: > I work frequently with time intensive code where simulating more > permutations translate to more value recovered. These small differences in > code generation seldom have a significant impact on the overall performance. > Greater benefits are found by changes to algorithms or approaches to a > problem. > If Alinabi went looking for a better chess playing algorithm, he'd have to be either a seriously good SC researcher, or wasting his time. It's well-trodden ground. Arguably, the opitimization isn't premature either, since it seems this is very much what writing a competitive chess program is about. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-03 18:36 ` Jeffrey R. Carter 2007-07-03 19:42 ` Alinabi @ 2007-07-08 22:53 ` Robert A Duff 2007-07-09 6:09 ` tmoran 1 sibling, 1 reply; 29+ messages in thread From: Robert A Duff @ 2007-07-08 22:53 UTC (permalink / raw) "Jeffrey R. Carter" <spam.jrcarter.not@acm.nospam.org> writes: > You should also be aware that you're sending the compiler conflicting > messages. Packing the array indicates that you want the compiler to > minimize storage, even at the expense of speed,... Well, sort of. Pack means to minimize storage even at the (possible) expense of speed of accessing the components. Speed overall can be improved by packing. For example, speed of copying and comparing the whole packed object, and passing it around as parameters, will typically be improved by packing. And of course using less memory will typically improve cache and paging performance. So on a desktop computer with "plenty" of memory, packing is mainly used to improve speed -- there's no "time/space" tradeoff. A small embedded system might have such tradeoffs. >... while "pragma Optimize > (Time);" indicates that you want to minimize time, even at the expense > of storage. Obtaining speed often requires wasting storage; the fastest ^^^^^ I'd say, "sometimes", nor "often". Usually, wasting storage means wasting time. > implementation might be to not pack the array. I don't know how many of > these you have, but with the large memories on modern machines, you > might be better off with the additional 56 bytes per array. Again, you > won't know for sure until you have a working version you can measure. Measurement is a Good Thing. - Bob ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-08 22:53 ` Robert A Duff @ 2007-07-09 6:09 ` tmoran 0 siblings, 0 replies; 29+ messages in thread From: tmoran @ 2007-07-09 6:09 UTC (permalink / raw) > Well, sort of. Pack means to minimize storage even at the (possible) > expense of speed of accessing the components. Speed overall can be > improved by packing. For example, speed of copying and comparing the > ... > Measurement is a Good Thing. In my benchmark runs, the B(I) := not B(I); and B(I) := B(I) xor True; versions went from 10 down to 7 nanoseconds when I removed the pragma pack, but the version with One(i) := True; B := B xor One; went from 5 to 700 nanoseconds. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-03 18:04 ` Alinabi 2007-07-03 18:09 ` Alinabi 2007-07-03 18:36 ` Jeffrey R. Carter @ 2007-07-04 9:00 ` Jean-Pierre Rosen 2007-07-04 18:27 ` tmoran 2007-07-04 18:51 ` tmoran 2 siblings, 2 replies; 29+ messages in thread From: Jean-Pierre Rosen @ 2007-07-04 9:00 UTC (permalink / raw) Alinabi a �crit : > While I used to be a candidate master > once upon a time, I do not play at that strength anymore and I > could benefit from the advice of a strong player (who might not > be a programmer). For such a person B(i) := not B(i) is much > easier to understand than B := B xor Bitborad_T(Shift_Left(1, > i)). Hmmm... that person will look at "Flip (B(I))", not at the details of the Flip function. Therefore, you should use whatever implementation is most efficient. But do not forget to make measurements, efficiency issues are often... surprising to say the least. -- --------------------------------------------------------- J-P. Rosen (rosen@adalog.fr) Visit Adalog's web site at http://www.adalog.fr ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-04 9:00 ` Jean-Pierre Rosen @ 2007-07-04 18:27 ` tmoran 2007-07-04 19:16 ` Pascal Obry 2007-07-05 4:53 ` Jeffrey R. Carter 2007-07-04 18:51 ` tmoran 1 sibling, 2 replies; 29+ messages in thread From: tmoran @ 2007-07-04 18:27 UTC (permalink / raw) Using -O2 with the old gnat 3.15p on a 3GHz dual core Pentium (but only programming to use one core) under W2k, I find that the versions with B(I) := not B(I); and B(I) := B(I) xor True; take about 14 nanoseconds, while the version with One(i) := True; B := B xor One; takes about 6. With 1e9 calls, that's an 8 second difference. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-04 18:27 ` tmoran @ 2007-07-04 19:16 ` Pascal Obry 2007-07-05 1:45 ` tmoran 2007-07-05 4:53 ` Jeffrey R. Carter 1 sibling, 1 reply; 29+ messages in thread From: Pascal Obry @ 2007-07-04 19:16 UTC (permalink / raw) To: tmoran tmoran@acm.org a �crit : > Using -O2 with the old gnat 3.15p on a 3GHz dual core Pentium (but only > programming to use one core) under W2k, I find that the versions with > B(I) := not B(I); > and > B(I) := B(I) xor True; > take about 14 nanoseconds, while the version with > One(i) := True; > B := B xor One; > takes about 6. With 1e9 calls, that's an 8 second difference. That's not a proper timing. 3.15p is using a very old backend (2.x if my memory is good). Since then there have been two major versions of the GCC backend, both bringing better optimizations. It would be better to do the measurement with GCC/FSF 4.2 or GNAT GPL 2007. Pascal. -- --|------------------------------------------------------ --| Pascal Obry Team-Ada Member --| 45, rue Gabriel Peri - 78114 Magny Les Hameaux FRANCE --|------------------------------------------------------ --| http://www.obry.net --| "The best way to travel is by means of imagination" --| --| gpg --keyserver wwwkeys.pgp.net --recv-key C1082595 ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-04 19:16 ` Pascal Obry @ 2007-07-05 1:45 ` tmoran 0 siblings, 0 replies; 29+ messages in thread From: tmoran @ 2007-07-05 1:45 UTC (permalink / raw) > That's not a proper timing. 3.15p is using a very old backend (2.x if my > memory is good). Since then there have been two major versions of the > GCC backend, both bringing better optimizations. It would be better to > do the measurement with GCC/FSF 4.2 or GNAT GPL 2007. Gnat 3.15p was handy. Please post your results on timing the different codes with a different Ada compiler. Since my machine does have two cores, I tried it with a second task running the benchmark subroutine, ie two copies of the benchmark were running simultaneously. Elapsed time till they were both done was only 20% longer than running a single copy. So it might be worth looking into a multi-tasking design of the chess program. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-04 18:27 ` tmoran 2007-07-04 19:16 ` Pascal Obry @ 2007-07-05 4:53 ` Jeffrey R. Carter 1 sibling, 0 replies; 29+ messages in thread From: Jeffrey R. Carter @ 2007-07-05 4:53 UTC (permalink / raw) tmoran@acm.org wrote: > Using -O2 with the old gnat 3.15p on a 3GHz dual core Pentium (but only > programming to use one core) under W2k, I find that the versions with > B(I) := not B(I); > and > B(I) := B(I) xor True; > take about 14 nanoseconds, while the version with > One(i) := True; > B := B xor One; > takes about 6. With 1e9 calls, that's an 8 second difference. Yes. But at 6 mins/move, is 8 secs enough to search another ply? That's the important time. Since the OP said it might be the difference between searching 10 and 11 plys, that implies 36 secs/ply, so the answer would seem to be no. Actually, Flip is only 1 of the 5 basic operations, and it's all the basic operations that are called 1e8 to 1e9 times per move, so the savings are more like 1.6 secs. A real improvement would be to parallelize the algorithm and run it on a multiprocessor. -- Jeff Carter "Hello! Smelly English K...niggets." Monty Python & the Holy Grail 08 ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-04 9:00 ` Jean-Pierre Rosen 2007-07-04 18:27 ` tmoran @ 2007-07-04 18:51 ` tmoran 1 sibling, 0 replies; 29+ messages in thread From: tmoran @ 2007-07-04 18:51 UTC (permalink / raw) > Using -O2 with the old gnat 3.15p on a 3GHz dual core Pentium (but only Sorry, that was -O3 ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: sub-optimal code for packed boolean arrays -- bug or inherent limitation 2007-07-02 19:34 sub-optimal code for packed boolean arrays -- bug or inherent limitation Alinabi ` (5 preceding siblings ...) 2007-07-03 18:04 ` Alinabi @ 2007-07-08 22:58 ` Robert A Duff 6 siblings, 0 replies; 29+ messages in thread From: Robert A Duff @ 2007-07-08 22:58 UTC (permalink / raw) Alinabi <alexander.the.average@gmail.com> writes: > I don't know much (if anything) about compiler internals, so I was > wondering if I should file a bug report. Well, it's not quite a bug, but there's no harm in sending an "enhancement request" to AdaCore. We might do something about it. I would like to see operations on packed arrays of Booleans be as efficient as similar operations on modular/unsigned integers. >... Is there some good > theoretical justification for all that extraneous shifts and > rotations? I would think that if the compiler can figure out the best > way to set or clear a bit in a register using shift and logical ops, > than it should also be able to negate it efficiently. Yes, it could. - Bob ^ permalink raw reply [flat|nested] 29+ messages in thread
end of thread, other threads:[~2007-07-10 2:06 UTC | newest] Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2007-07-02 19:34 sub-optimal code for packed boolean arrays -- bug or inherent limitation Alinabi 2007-07-02 20:08 ` Ludovic Brenta 2007-07-03 1:01 ` Jeffrey R. Carter 2007-07-03 7:22 ` Harald Korneliussen 2007-07-03 8:37 ` Georg Bauhaus 2007-07-03 7:59 ` gautier_niouzes 2007-07-03 9:25 ` Stefan Lucks 2007-07-03 12:40 ` Stefan Lucks 2007-07-03 15:42 ` Adam Beneschan 2007-07-03 18:04 ` Alinabi 2007-07-03 18:09 ` Alinabi 2007-07-03 18:17 ` Alinabi 2007-07-10 2:06 ` Randy Brukardt 2007-07-03 18:36 ` Jeffrey R. Carter 2007-07-03 19:42 ` Alinabi 2007-07-04 1:12 ` Jeffrey R. Carter 2007-07-04 10:15 ` Jeffrey Creem 2007-07-04 18:28 ` Jeffrey R. Carter 2007-07-04 3:22 ` Steve 2007-07-04 6:31 ` Harald Korneliussen 2007-07-08 22:53 ` Robert A Duff 2007-07-09 6:09 ` tmoran 2007-07-04 9:00 ` Jean-Pierre Rosen 2007-07-04 18:27 ` tmoran 2007-07-04 19:16 ` Pascal Obry 2007-07-05 1:45 ` tmoran 2007-07-05 4:53 ` Jeffrey R. Carter 2007-07-04 18:51 ` tmoran 2007-07-08 22:58 ` Robert A Duff
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox