comp.lang.ada
 help / color / mirror / Atom feed
From: Niklas Holsti <niklas.holsti@tidorum.invalid>
Subject: Re: B-tree performance
Date: Sun, 01 Sep 2013 11:05:27 +0300
Date: 2013-09-01T11:05:27+03:00	[thread overview]
Message-ID: <b8gaq8FfqgkU1@mid.individual.net> (raw)
In-Reply-To: <2383e534-646c-4500-ad05-40caf9460c35@googlegroups.com>

On 13-09-01 09:05 , Peter Brooks wrote:
> On Sunday, 1 September 2013 06:57:00 UTC+2, Niklas Holsti  wrote:
>> 
>> 
>> I find your question hard to understand. Are you really talking
>> about B-trees (http://en.wikipedia.org/wiki/B-tree), with fan-out
>> typically much larger than 2? Or are you talking about binary trees
>> with 2-way fan-out? Your phrase "left and right" suggests the
>> latter.
>> 
> Yes, I think you're right. Though, because of clashes, leaf nodes can
> have many members, so it isn't quite a binary tree.

Ah... you are first computing a hash-value from the actual key, and then
looking up the hash-value in a tree? That is an unusual way to use hashing.

Usually, the hash-value is arithmetically converted into a direct "hash
index" of the hash-table (for example, using "hash_value mod
hash_table_size"), and the hash-table entry at that index contains all
objects that have this hash-index (often as a linked list, but other
ways exist). It is this direct computation of an array index that makes
hashing fast, compared with containers that need some kind of search.
See http://en.wikipedia.org/wiki/Hash_table.

(It is also the reason why there is little need to go beyond 32 bits for
the hash-value size, since the hash-value will be folded into a much
smaller array index anyway.)

If the average number of collisions (= objects with the same hash-index)
is small, a linear traversal of the collision set is fast enough to
locate the object with the desired actual key.

If the number of collisions is large, it means that the hash-table is
too small (or the hash function is bad). Hashed containers typically
increase the hash-table size automatically when the frequency of
collisions exceeds some threshold.

If for some reason you want to the keep the hash-table small, although
this leads to many collisions, I suppose you could organize the
colliding object sets into some kind of keyed containers, for example
B-trees, but then you should use the actual key to index these
containers, not the hash-value.

I assume your data structures will be in main memory. If some parts are
in secondary memory (disk files) there are more trade-offs to consider.

> My question is when you get diminishing returns from using a sorted
> array. Or, equivalently, what would be the optimal size of the array
> of key values?

Either make up an analytical model (which will of course depend on lots
of parameters of the problem and processor) or experiment and measure. I
would do the latter. But it will probably also depend on the pattern of
accesses. For example, if the accesses have temporal locality, using
some kind of "move to front" heuristic on the lists of colliding objects
can help.

-- 
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
      .      @       .

  reply	other threads:[~2013-09-01  8:05 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-08-31  7:14 B-tree performance Peter Brooks
2013-08-31 15:26 ` Shark8
2013-08-31 15:47   ` Peter Brooks
2013-09-01  4:57 ` Niklas Holsti
2013-09-01  6:05   ` Peter Brooks
2013-09-01  8:05     ` Niklas Holsti [this message]
2013-09-01  8:38       ` Peter Brooks
2013-09-02  4:28         ` Niklas Holsti
2013-09-03  2:15           ` Peter Brooks
2013-09-05  8:25         ` Memory mapping and network file systems (Was: B-tree performance) Jacob Sparre Andersen
2013-09-05 10:12           ` Peter Brooks
2013-09-03 16:14 ` B-tree performance Dan'l Miller
2013-09-03 17:43   ` Peter Brooks
replies disabled

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox