comp.lang.ada
* B-tree performance
@ 2013-08-31  7:14 Peter Brooks
  2013-08-31 15:26 ` Shark8
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Peter Brooks @ 2013-08-31  7:14 UTC (permalink / raw)


I'm interested to know if anybody has any real-world evidence for this hypothesis, or if they agree with my gut feel.

If I build a B-tree, then the conventional means to search it would be to start in the middle, then cascade left and right following the pointer structure. That's slow because most searches won't turn up a value in the first N levels of the B-tree.

So it'd make sense to have an array which contains the address values of the B-tree down to a certain level. That way, for a look up, you search the array, which is nice and quick, and this leads you to the correct sub-tree and you then continue as usual, from the top of that sub-tree.

The question is how many levels would it make sense to keep in the array. Too many and the array itself consumes too many resources and slows things down. Too few and you don't gain much, particularly if the B-tree is sparse. The sizes of the buckets, or sub-b-trees would be:

Level      Buckets
1                 18446744073709551616
3                 9223372036854775808
5                 4611686018427387904
9                 2305843009213693952
17                1152921504606846976
33                576460752303423488
65                288230376151711744
129               144115188075855872
257               72057594037927936
513               36028797018963968
1025              18014398509481984
2049              9007199254740992
4097              4503599627370496
8193              2251799813685248
16385             1125899906842624
32769             562949953421312
65537             281474976710656
131073            140737488355328
262145            70368744177664
524289            35184372088832
1048577           17592186044416
2097153           8796093022208
4194305           4398046511104
8388609           2199023255552
16777217          1099511627776
33554433          549755813888
67108865          274877906944
134217729         137438953472
268435457         68719476736
536870913         34359738368
1073741825        17179869184
2147483649        8589934592
4294967297        4294967296
8589934593        2147483648

13 levels would, I think, speed things up considerably, with only about 4K's worth of overhead. How much deeper would it make sense to go before you get diminishing returns?
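As a sketch of the array idea (in Python, since the post names no language; K, NUM_ROOTS and the function names are all illustrative, not from the post), the array simply folds the top K levels of the tree into a direct lookup on the top K bits of the 64-bit hash. Note that 4096 entries of 8-byte pointers is 32 KB, which is relevant to the overhead question above.

```python
# Sketch (illustrative names): fold the top K levels of the tree into
# a flat array indexed by the top K bits of the 64-bit hash value.
K = 12                        # levels folded into the array
NUM_ROOTS = 1 << K            # 4096 subtree roots; 8-byte pointers -> 32 KB

def subtree_index(hash64: int) -> int:
    """The top K bits of the hash pick the subtree root directly."""
    return hash64 >> (64 - K)

BUCKET_SPAN = 1 << (64 - K)   # hash values covered by each subtree: 2**52
```

Searching then proceeds as usual from the chosen subtree root; only the first K comparisons are replaced by one shift.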



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: B-tree performance
  2013-08-31  7:14 B-tree performance Peter Brooks
@ 2013-08-31 15:26 ` Shark8
  2013-08-31 15:47   ` Peter Brooks
  2013-09-01  4:57 ` Niklas Holsti
  2013-09-03 16:14 ` B-tree performance Dan'l Miller
  2 siblings, 1 reply; 13+ messages in thread
From: Shark8 @ 2013-08-31 15:26 UTC (permalink / raw)


I think that the correct optimization here would be highly dependent on the information you're dealing with. If, for example, you're handling a dictionary or phone-book, you could index off of the first letter [or two] and use that as your starting-range's endpoint.
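A minimal sketch of that first-letter index (my illustration in Python, not Shark8's code): precompute where each initial letter starts in the sorted list, then search only from that point.

```python
# Sketch: bucket a sorted word list by first letter so a lookup starts
# at the right range instead of at the root of the whole structure.
import bisect

words = sorted(["ant", "apple", "bear", "cat", "cord", "dog"])

starts = {}
for i, w in enumerate(words):
    starts.setdefault(w[0], i)   # first index for each initial letter

def find(word: str) -> bool:
    i = bisect.bisect_left(words, word, lo=starts.get(word[0], 0))
    return i < len(words) and words[i] == word
```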



* Re: B-tree performance
  2013-08-31 15:26 ` Shark8
@ 2013-08-31 15:47   ` Peter Brooks
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Brooks @ 2013-08-31 15:47 UTC (permalink / raw)


On Saturday, 31 August 2013 17:26:41 UTC+2, Shark8  wrote:
> I think that the correct optimization here would be highly dependent on the information you're dealing with. If, for example, you're handling a dictionary or phone-book, you could index off of the first letter [or two] and use that as your starting-range's endpoint.
>
Yes, that's true.

I'll be working with the hash codes of RDF URIs, which should be uniformly dense - well, actually, they will be because I'll be experimenting with hash functions to find one that is.

I suppose that I'll just have to try a few options and time the results.




* Re: B-tree performance
  2013-08-31  7:14 B-tree performance Peter Brooks
  2013-08-31 15:26 ` Shark8
@ 2013-09-01  4:57 ` Niklas Holsti
  2013-09-01  6:05   ` Peter Brooks
  2013-09-03 16:14 ` B-tree performance Dan'l Miller
  2 siblings, 1 reply; 13+ messages in thread
From: Niklas Holsti @ 2013-09-01  4:57 UTC (permalink / raw)


On 13-08-31 10:14 , Peter Brooks wrote:
> I'm interested to know if anybody has any real-world evidence for
> this hypothesis, or if they agree with my gut feel.
> 
> If I build a B-tree, then the conventional means to search it would
> be to start in the middle, then cascade left and right following the
> pointer structure. That's slow because most searches won't turn up a
> value in the first N levels of the b-tree.

I find your question hard to understand. Are you really talking about
B-trees (http://en.wikipedia.org/wiki/B-tree), with fan-out typically
much larger than 2? Or are you talking about binary trees with 2-way
fan-out? Your phrase "left and right" suggests the latter.

> So it'd make sense to have an array which contains the address values
> of the B-tree down to a certain level. That way, for a look up, you
> search the array, which is nice and quick, and this leads you to the
> correct sub-tree and you then continue as usual, from the top of that
> sub-tree.

If you really mean B-trees, it seems to me that you could just increase
the fan-out of the B-tree to get the same effect. With a fan-out of say
4000, the root of the B-tree consists of a sorted array of 4000 (or
3999) key values, which is nice and quick to search.
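A sketch of that wide-fanout root (Python, with illustrative names and an evenly divided key space, which matches the uniform-hash assumption elsewhere in the thread): the root is one sorted array of separator keys, and a single binary search picks the child subtree.

```python
# Sketch: a B-tree root with fan-out 4096 is just a sorted array of
# 4095 separator keys; one binary search selects the child subtree.
import bisect

FANOUT = 4096
STEP = 2**64 // FANOUT
separators = [i * STEP for i in range(1, FANOUT)]   # 4095 sorted keys

def child_of(key: int) -> int:
    """Index of the child whose key range contains `key`."""
    return bisect.bisect_right(separators, key)
```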

-- 
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
      .      @       .



* Re: B-tree performance
  2013-09-01  4:57 ` Niklas Holsti
@ 2013-09-01  6:05   ` Peter Brooks
  2013-09-01  8:05     ` Niklas Holsti
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Brooks @ 2013-09-01  6:05 UTC (permalink / raw)


On Sunday, 1 September 2013 06:57:00 UTC+2, Niklas Holsti  wrote:
> 
> 
> I find your question hard to understand. Are you really talking about
> B-trees (http://en.wikipedia.org/wiki/B-tree), with fan-out typically
> much larger than 2? Or are you talking about binary trees with 2-way
> fan-out? Your phrase "left and right" suggests the latter.
> 
Yes, I think you're right. Though, because of clashes, leaf nodes can have many members, so it isn't quite a binary tree.
> 
> > So it'd make sense to have an array which contains the address values
> > of the B-tree down to a certain level. That way, for a look up, you
> 
> > search the array, which is nice and quick, and this leads you to the
> 
> > correct sub-tree and you then continue as usual, from the top of that
> 
> > sub-tree.
>  
> If you really mean B-trees, it seems to me that you could just increase
> the fan-out of the B-tree to get the same effect. With a fan-out of say
> 4000, the root of the B-tree consists of a sorted array of 4000 (or
> 3999) key values, which is nice and quick to search.
> 
Yes, that is really my question. Since the hash algorithm will produce a uniformly dense distribution, the array of key values that fits best will be one that divides the 64-bit space up evenly.

My question is when you get diminishing returns from using a sorted array. Or, equivalently, what would be the optimal size of the array of key values?


* Re: B-tree performance
  2013-09-01  6:05   ` Peter Brooks
@ 2013-09-01  8:05     ` Niklas Holsti
  2013-09-01  8:38       ` Peter Brooks
  0 siblings, 1 reply; 13+ messages in thread
From: Niklas Holsti @ 2013-09-01  8:05 UTC (permalink / raw)


On 13-09-01 09:05 , Peter Brooks wrote:
> On Sunday, 1 September 2013 06:57:00 UTC+2, Niklas Holsti  wrote:
>> 
>> 
>> I find your question hard to understand. Are you really talking
>> about B-trees (http://en.wikipedia.org/wiki/B-tree), with fan-out
>> typically much larger than 2? Or are you talking about binary trees
>> with 2-way fan-out? Your phrase "left and right" suggests the
>> latter.
>> 
> Yes, I think you're right. Though, because of clashes, leaf nodes can
> have many members, so it isn't quite a binary tree.

Ah... you are first computing a hash-value from the actual key, and then
looking up the hash-value in a tree? That is an unusual way to use hashing.

Usually, the hash-value is arithmetically converted into a direct "hash
index" of the hash-table (for example, using "hash_value mod
hash_table_size"), and the hash-table entry at that index contains all
objects that have this hash-index (often as a linked list, but other
ways exist). It is this direct computation of an array index that makes
hashing fast, compared with containers that need some kind of search.
See http://en.wikipedia.org/wiki/Hash_table.

(It is also the reason why there is little need to go beyond 32 bits for
the hash-value size, since the hash-value will be folded into a much
smaller array index anyway.)

If the average number of collisions (= objects with the same hash-index)
is small, a linear traversal of the collision set is fast enough to
locate the object with the desired actual key.

If the number of collisions is large, it means that the hash-table is
too small (or the hash function is bad). Hashed containers typically
increase the hash-table size automatically when the frequency of
collisions exceeds some threshold.
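The structure described in the last few paragraphs can be sketched as follows (Python; the class and threshold are my illustration, not a specific library): hash modulo table size, a per-slot list for collisions, and a doubling step when the average chain length passes a threshold.

```python
# Minimal chained hash table: direct index computation, linked
# collision sets, and automatic growth when the load factor is high.
class Chained:
    def __init__(self, size=8, max_load=2.0):
        self.slots = [[] for _ in range(size)]
        self.count = 0
        self.max_load = max_load

    def _slot(self, key):
        return self.slots[hash(key) % len(self.slots)]

    def insert(self, key, value):
        slot = self._slot(key)
        for i, (k, _) in enumerate(slot):
            if k == key:
                slot[i] = (key, value)      # update in place
                return
        slot.append((key, value))
        self.count += 1
        if self.count / len(self.slots) > self.max_load:
            self._grow()

    def find(self, key):
        for k, v in self._slot(key):        # linear scan of collision set
            if k == key:
                return v
        return None

    def _grow(self):
        old = [pair for slot in self.slots for pair in slot]
        self.slots = [[] for _ in range(2 * len(self.slots))]
        self.count = 0
        for k, v in old:
            self.insert(k, v)
```

With the table size tracking the element count like this, insertion, deletion and lookup stay O(1) on average, as described above.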

If for some reason you want to keep the hash-table small, although
this leads to many collisions, I suppose you could organize the
colliding object sets into some kind of keyed containers, for example
B-trees, but then you should use the actual key to index these
containers, not the hash-value.

I assume your data structures will be in main memory. If some parts are
in secondary memory (disk files) there are more trade-offs to consider.

> My question is when you get diminishing returns from using a sorted
> array. Or, equivalently, what would be the optimal size of the array
> of key values?

Either make up an analytical model (which will of course depend on lots
of parameters of the problem and processor) or experiment and measure. I
would do the latter. But it will probably also depend on the pattern of
accesses. For example, if the accesses have temporal locality, using
some kind of "move to front" heuristic on the lists of colliding objects
can help.
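The move-to-front heuristic mentioned here is small enough to show directly (a sketch; the function name is mine): a hit is promoted to the head of its collision chain, so with temporal locality, hot keys become cheap to find again.

```python
# Move-to-front on a collision chain: promote a hit to the head of the
# list so recently accessed keys are found first next time.
def mtf_find(chain: list, key):
    """chain is a list of (key, value) pairs; returns value or None."""
    for i, (k, v) in enumerate(chain):
        if k == key:
            if i:                            # promote a non-head hit
                chain.insert(0, chain.pop(i))
            return v
    return None
```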

-- 
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
      .      @       .


* Re: B-tree performance
  2013-09-01  8:05     ` Niklas Holsti
@ 2013-09-01  8:38       ` Peter Brooks
  2013-09-02  4:28         ` Niklas Holsti
  2013-09-05  8:25         ` Memory mapping and network file systems (Was: B-tree performance) Jacob Sparre Andersen
  0 siblings, 2 replies; 13+ messages in thread
From: Peter Brooks @ 2013-09-01  8:38 UTC (permalink / raw)


On Sunday, 1 September 2013 10:05:27 UTC+2, Niklas Holsti  wrote:
>  
> 
> If for some reason you want to keep the hash-table small, although
> this leads to many collisions, I suppose you could organize the
> colliding object sets into some kind of keyed containers, for example
> B-trees, but then you should use the actual key to index these
> containers, not the hash-value.
> 
I'm using a tree structure to represent a sparse array because it makes inserts and deletes easier and quicker, and cleaner. I'm using the large hash index to minimise the number of collisions.

If there were a better sparse-array algorithm, that would probably suit me. Then I'd simply hash everything to a 32-bit (or even larger) array index and be done with it.

With dynamic allocation, memory is used more efficiently. If you use a sparse array, it can grow slowly in size (needing periodic cleaning) because deletions don't delete the location but simply set the contents to null, and such null entries accumulate.
> 
> 
> I assume your data structures will be in main memory. If some parts are
> in secondary memory (disk files) there are more trade-offs to consider.
> 
I'm hoping to use persistent objects to reduce the extra complications there - using OS paging makes more sense, from a performance point of view, than using files. 

If the persistent objects are shadowed in RAM disc and that's shared over a network, it can make the solution nicely scalable across multiple networked machines.
> 
> Either make up an analytical model (which will of course depend on lots
> of parameters of the problem and processor) or experiment and measure. I
> would do the latter. But it will probably also depend on the pattern of
> accesses. For example, if the accesses have temporal locality, using
> some kind of "move to front" heuristic on the lists of colliding objects
> can help.
> 
Yes, some sort of caching might help, particularly for predicates.


* Re: B-tree performance
  2013-09-01  8:38       ` Peter Brooks
@ 2013-09-02  4:28         ` Niklas Holsti
  2013-09-03  2:15           ` Peter Brooks
  2013-09-05  8:25         ` Memory mapping and network file systems (Was: B-tree performance) Jacob Sparre Andersen
  1 sibling, 1 reply; 13+ messages in thread
From: Niklas Holsti @ 2013-09-02  4:28 UTC (permalink / raw)


On 13-09-01 11:38 , Peter Brooks wrote:
> On Sunday, 1 September 2013 10:05:27 UTC+2, Niklas Holsti  wrote:
>>  
>>
>> If for some reason you want to keep the hash-table small, although
>> this leads to many collisions, I suppose you could organize the
>> colliding object sets into some kind of keyed containers, for example
>> B-trees, but then you should use the actual key to index these
>> containers, not the hash-value.
>>
> I'm using a tree structure to represent a sparse array because it
> makes inserts and deletes easier and quicker, and cleaner.

But insertion and deletion are simple and fast in the basic hash-table
structure, too, as long as the collision sets are small.

In the traditional hash-table structure, the collision sets are
implemented as unordered linked lists of objects. You can insert or
delete any list entry with a few link changes. If you adjust the size of
the hash table to keep the collision set size bounded by some constant,
insertion and deletions are O(1) on average.

-- 
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
      .      @       .



* Re: B-tree performance
  2013-09-02  4:28         ` Niklas Holsti
@ 2013-09-03  2:15           ` Peter Brooks
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Brooks @ 2013-09-03  2:15 UTC (permalink / raw)


On Monday, 2 September 2013 06:28:57 UTC+2, Niklas Holsti  wrote:
> 
> 
> But insertion and deletion are simple and fast in the basic hash-table
> structure, too, as long as the collision sets are small.
> 
Only if the hash-table is large relative to the data. The test data I'm using contains 13,225,167 triples - a real triplestore might have ten times that. The file is nearly 2 GB in size.
> 
> In the traditional hash-table structure, the collision sets are
> implemented as unordered linked lists of objects. You can insert or
> delete any list entry with a few link changes. If you adjust the size of
> the hash table to keep the collision set size bounded by some constant,
> 
I mean, as I said, to use a binary tree - a linked structure. So collisions will just be further links.

The problem I see is the tree getting too deep, so that it takes too long to find items (as usual, the time to add something is less critical than the time to find something). The fewer the collisions, the shorter the lookup time; and the more items in the top-level hash table, the shallower the overall structure.
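The depth side of this tradeoff is easy to estimate (the triple count is from the post; the arithmetic and slot sizes are my illustration): n keys over m top-level slots leaves about n/m keys per slot, and a balanced binary tree over those is roughly log2(n/m) deep.

```python
# Back-of-envelope: expected collision-tree depth for n keys spread
# uniformly over m top-level hash slots.
import math

n = 13_225_167                 # triples in the test data (from the post)

def expected_depth(m: int) -> float:
    """Approximate balanced-tree depth of each slot's collision set."""
    return math.log2(n / m)
```

So going from 4096 to a million slots cuts the expected per-slot tree from roughly 12 levels to under 4, under the uniform-hash assumption.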



* Re: B-tree performance
  2013-08-31  7:14 B-tree performance Peter Brooks
  2013-08-31 15:26 ` Shark8
  2013-09-01  4:57 ` Niklas Holsti
@ 2013-09-03 16:14 ` Dan'l Miller
  2013-09-03 17:43   ` Peter Brooks
  2 siblings, 1 reply; 13+ messages in thread
From: Dan'l Miller @ 2013-09-03 16:14 UTC (permalink / raw)


On Saturday, August 31, 2013 2:14:22 AM UTC-5, Peter Brooks wrote:
> I'm interested to know if anybody has any real-world evidence for this hypothesis, or if they agree with my gut feel.

I disagree with the direction that your gut feel took you.  I will diverge from your original posting at the appropriate points below.

> If I build a B-tree, then the conventional means to search it would be to start in the middle, then cascade left and right following the pointer structure. That's slow because most searches won't turn up a value in the first N levels of the b-tree.
> 
> So it'd make sense to have [...snip...]

So it would make sense to not use a B-tree.  Indeed, it would make sense to not restart the comparison again at each node, because that wastes time and bloats memory-space.  Hence, it would make sense to use a radix trie.  http://en.wikipedia.org/wiki/Radix_tree

Because a URI is a sequence of ISO 10646/Unicode characters (i.e., a textual string), PATRICIA tries (i.e., radix tries with radix = 2) might be an option to minimize sparseness. Too large a radix (too large a power of two) tends to hurt space utilization by producing the sparsely-branching nodes of an excessively n-ary trie.

> The question is how many levels would it make sense to keep in the array.  [...snip...]

Hence, it would make sense to use a radix trie with a small radix, say 2 or 4, so that a node fits into your processor's level-1 or level-2 cache-line of 32 B or 64 B. Optimizing for 64-bit addresses (i.e., 8 B) and a 32 B level-1 cache-line, 4 of those addresses fit into one cache-line, so radix = 4. That means comparing the key 2 bits at a time for a 4-ary branching at each node, giving an O(log-base-4 n) search, insertion, and deletion, where n is the maximum quantity of bit-pairs in the URI string (4 times the byte length if the URI is in UTF-8 encoding).
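A toy version of the radix-4 idea (Python rather than a cache-tuned C layout, so the class and node representation are illustrative only): each node branches on 2 bits of the key, so a lookup walks at most 4 nodes per key byte and never re-compares the whole key.

```python
# Sketch of a radix-4 trie over byte strings: branch on 2 bits of the
# key at each node; a '$' entry marks a stored value at that node.
class Radix4:
    def __init__(self):
        self.root = {}

    @staticmethod
    def _pairs(key: bytes):
        for byte in key:                     # 4 bit-pairs per byte
            for shift in (6, 4, 2, 0):
                yield (byte >> shift) & 0b11

    def insert(self, key: bytes, value):
        node = self.root
        for p in self._pairs(key):
            node = node.setdefault(p, {})
        node['$'] = value

    def find(self, key: bytes):
        node = self.root
        for p in self._pairs(key):
            node = node.get(p)
            if node is None:
                return None
        return node.get('$')
```

A pleasant property, noted later in the thread, is that URIs sharing a prefix share a path in the trie.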

(Btw, yes, I know that I should eliminate the base from the logarithm in Bachmann-Landau big-omicron notation, but I am emphasizing that radix = 4 typically does half the work of radix = 2 while not increasing the quantity of cache-lines per node. And yes, I know that many people were taught that the upper-bound growth rate is the Latin capital letter O rather than the Greek capital letter Omicron, but I am loyal to Landau's original seminal references of 1909 through 1914.)  https://en.wikipedia.org/wiki/Bachmann-Landau_notation#Family_of_Bachmann.E2.80.93Landau_notations



* Re: B-tree performance
  2013-09-03 16:14 ` B-tree performance Dan'l Miller
@ 2013-09-03 17:43   ` Peter Brooks
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Brooks @ 2013-09-03 17:43 UTC (permalink / raw)


On Tuesday, 3 September 2013 18:14:19 UTC+2, Dan'l Miller  wrote:
> 
> 
> So it would make sense to not use a B-tree.  Indeed, it would make sense to not restart the comparison again at each node, because it wastes time and bloats memory-space.  Hence, it would make sense to use a radix trie.  http://en.wikipedia.org/wiki/Radix_tree
> 
Yes, that might make sense. I did something similar many years ago when I produced correlation matrices of letter sequences in words. It wouldn't work as a linked list, but a multi-dimensional associative array might be an option - thank you, I'll look into that.

One pleasant side-effect of this method would be that it'd be much easier to search for the same entry in different URIs.

I'll have to experiment a bit on workable representations of this.

The funny thing about this is that it might be very easy to implement in awk. Not what I was looking for or thinking of. I'm not sure how awk behaves with very large arrays - I'll find out quite soon.




* Memory mapping and network file systems (Was: B-tree performance)
  2013-09-01  8:38       ` Peter Brooks
  2013-09-02  4:28         ` Niklas Holsti
@ 2013-09-05  8:25         ` Jacob Sparre Andersen
  2013-09-05 10:12           ` Peter Brooks
  1 sibling, 1 reply; 13+ messages in thread
From: Jacob Sparre Andersen @ 2013-09-05  8:25 UTC (permalink / raw)


Peter Brooks wrote:

> If the persistent objects are shadowed in RAM disc and that's shared
> over a network, it can make the solution nicely scalable across
> multiple networked machines.

That sounds rather dangerous to me.  I would expect either very low
performance or lots of data corruption.

Greetings,

Jacob
-- 
»In Ada you model the problem space, not the solution space.«
                                                     -- Robert I. Eachus


* Re: Memory mapping and network file systems (Was: B-tree performance)
  2013-09-05  8:25         ` Memory mapping and network file systems (Was: B-tree performance) Jacob Sparre Andersen
@ 2013-09-05 10:12           ` Peter Brooks
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Brooks @ 2013-09-05 10:12 UTC (permalink / raw)


On Thursday, 5 September 2013 10:25:08 UTC+2, Jacob Sparre Andersen  wrote:
> > If the persistent objects are shadowed in RAM disc and that's shared
> > over a network, it can make the solution nicely scalable across 
> > multiple networked machines.
>  
> That sounds rather dangerous to me.  I would expect either very low
> performance or lots of data corruption.
>
So would I. Except that, as is the case here, triplestores are read far more often than they are written, so only one process needs to write; that removes the main source of corruption and the need for lots of locking.




end of thread, other threads:[~2013-09-05 10:12 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-31  7:14 B-tree performance Peter Brooks
2013-08-31 15:26 ` Shark8
2013-08-31 15:47   ` Peter Brooks
2013-09-01  4:57 ` Niklas Holsti
2013-09-01  6:05   ` Peter Brooks
2013-09-01  8:05     ` Niklas Holsti
2013-09-01  8:38       ` Peter Brooks
2013-09-02  4:28         ` Niklas Holsti
2013-09-03  2:15           ` Peter Brooks
2013-09-05  8:25         ` Memory mapping and network file systems (Was: B-tree performance) Jacob Sparre Andersen
2013-09-05 10:12           ` Peter Brooks
2013-09-03 16:14 ` B-tree performance Dan'l Miller
2013-09-03 17:43   ` Peter Brooks
