* Hash Type Size
@ 2013-08-18 21:05 sbelmont700
  2013-08-19  1:03 ` AdaMagica
  2013-08-19 22:12 ` Randy Brukardt
  0 siblings, 2 replies; 15+ messages in thread

From: sbelmont700 @ 2013-08-18 21:05 UTC (permalink / raw)

Hi,

What is the size of Ada.Containers.Hash_Type intended to be? The LRM advises that "Hash_Type'Modulus should be at least 2**32", but this makes hashing a value on a 64-bit machine particularly unpleasant, since my implementation (MinGW-64) still defines it as 32-bit. Is the intent that Hash_Type should be the native size of the machine (and the implementation is wrong/advice is vague), or that it actually should be 32-bit if possible?

-sb

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Hash Type Size
  2013-08-18 21:05 Hash Type Size sbelmont700
@ 2013-08-19  1:03 ` AdaMagica
  2013-08-19 22:21 ` Randy Brukardt
  2013-08-19 22:12 ` Randy Brukardt
  1 sibling, 1 reply; 15+ messages in thread

From: AdaMagica @ 2013-08-19 1:03 UTC (permalink / raw)

On Sunday, August 18, 2013 11:05:47 PM UTC+2, sbelm...@gmail.com wrote:
> What is the size of Ada.Containers.Hash_Type intended to be? ... Is the
> intent that Hash_Type should be the native size of the machine (and the
> implementation is wrong/advice is vague), or that it actually should be
> 32-bit if possible?

See AARM A.18.1(8.b/2). Looks like it should be the native size. But it's only advice.

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Hash Type Size
  2013-08-19  1:03 ` AdaMagica
@ 2013-08-19 22:21 ` Randy Brukardt
  2013-08-19 22:29 ` Randy Brukardt
  0 siblings, 1 reply; 15+ messages in thread

From: Randy Brukardt @ 2013-08-19 22:21 UTC (permalink / raw)

"AdaMagica" <christ-usch.grein@t-online.de> wrote in message news:8e573dcd-e84d-423b-b319-d2224a1ce9ae@googlegroups.com...
> On Sunday, August 18, 2013 11:05:47 PM UTC+2, sbelm...@gmail.com wrote:
>> What is the size of Ada.Containers.Hash_Type intended to be? ... Is the
>> intent that Hash_Type should be the native size of the machine (and the
>> implementation is wrong/advice is vague), or that it actually should be
>> 32-bit if possible?
>
> See AARM A.18.1(8.b/2). Looks like it should be the native size. But it's
> only advice.

My recollection is that we said that with the intent that we didn't want it to be smaller than 32 bits unless there was a good reason (like no 32-bit math on the target). I don't think there was any intent to suggest that it ought to be larger (it's not practical to store more than 2**32 items in an Ada container).

There's not much point anyway; you can't directly hash 32-bit signed integers into a 32-bit hash value without resorting to Unchecked_Conversion, and then you've already completely left the realm of safe and portable (neither Integer nor Hash_Type has a defined size by the RM).

Randy.

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Hash Type Size
  2013-08-19 22:21 ` Randy Brukardt
@ 2013-08-19 22:29 ` Randy Brukardt
  0 siblings, 0 replies; 15+ messages in thread

From: Randy Brukardt @ 2013-08-19 22:29 UTC (permalink / raw)

I said:
...
> There's not much point anyway; you can't directly hash 32-bit signed
> integers in a 32-bit hash value without resorting to Unchecked_Conversion,
> and then you've already completely left the realm of safe and portable
> (neither Integer nor Hash_Type has a defined size by the RM).

I just remembered that you could use the 'Mod attribute function to do this conversion. But of course that doesn't depend on the size of the type of its argument, so it will work on Integer (for instance) no matter what size Integer has. Doing so will eliminate the uppermost bits from the hash in some cases, but that's usually not a problem, as values rarely get anywhere near as large as the maximum possible. (Hopefully, your types have ranges so you know for sure.)

If you're actually using all 64 bits of an integer (and hopefully you are declaring that explicitly, so compilers that don't support it can reject it), then you'll have to use Unchecked_Conversion or streaming to convert it into something hashable. Which I still think is the best idea in 99% of the cases.

Randy.

^ permalink raw reply [flat|nested] 15+ messages in thread
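[Editor's note: the arithmetic that Randy's 'Mod suggestion performs can be sketched outside Ada. The Python fragment below is an illustration only — real code would write Hash_Type'Mod (X) in Ada, and the 2**32 modulus is just the LRM's advised minimum, not a guaranteed value.]

```python
# Illustrative sketch (Python, not Ada) of what Ada's 'Mod attribute
# computes: for a modular type with modulus 2**32, Hash_Type'Mod (X)
# is X mod 2**32, mathematically, so negative values wrap around.
MODULUS = 2 ** 32  # assumed Hash_Type'Modulus, per the LRM's advice

def hash_type_mod(x: int) -> int:
    """Reduce a (possibly negative, possibly 64-bit) integer into
    the range 0 .. MODULUS - 1, as 'Mod would."""
    return x % MODULUS  # Python's % is a mathematical mod

# A small value maps directly; a negative one wraps:
print(hash_type_mod(42))         # 42
print(hash_type_mod(-1))         # 4294967295
# Upper bits of a genuinely 64-bit value are discarded, as noted above:
print(hash_type_mod(2**40 + 7))  # 7
```

The last line shows the "eliminate the uppermost bits" effect: values differing only above bit 31 collide.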
* Re: Hash Type Size
  2013-08-18 21:05 Hash Type Size sbelmont700
  2013-08-19  1:03 ` AdaMagica
@ 2013-08-19 22:12 ` Randy Brukardt
  2013-08-31  6:22 ` Peter Brooks
  1 sibling, 1 reply; 15+ messages in thread

From: Randy Brukardt @ 2013-08-19 22:12 UTC (permalink / raw)

<sbelmont700@gmail.com> wrote in message news:1679ec49-424b-43bd-8f35-a5f69e658112@googlegroups.com...
> What is the size of Ada.Containers.Hash_Type intended to be?

It's implementation-defined, of course. The formal language makes no requirement on the size.

> The LRM advises that "Hash_Type'Modulus should be at least 2**32", but this
> makes hashing a value on a 64-bit machine particularly unpleasant, since my
> implementation (MinGW-64) still defines it as 32-bit. Is the intent that
> Hash_Type should be the native size of the machine (and the implementation
> is wrong/advice is vague), or that it actually should be 32-bit if possible?

Like Integer, the advice is that it should be at least a particular size, and nothing more is said. I'm pretty sure there is no intent at all (so far as I recall, we never discussed anything about that). The reason for the 32-bit advice is to ensure that there are plenty of values available (16-bit wouldn't be enough). If you want truly portable code, you have to hash a stream element representation of your data anyway, so there doesn't seem to be much advantage to requiring more than that.

So many Ada programmers make unwarranted assumptions about the sizes of things. Ada has no notion of "native size", and writing code that depends upon that is just not going to be portable. I've seen many pieces of code that assume that Integer has 32 bits (not true in Janus/Ada), that Long_Integer has 64 bits (not true in Janus/Ada), that the size of System.Address is the same as an access type (not true in 16-bit versions of Janus/Ada, and probably some others as well). Almost everyone assumes that integers have a two's complement representation (that caused all kinds of problems for our U2200 compiler, which of course is a one's complement, 36-bit machine).

Moral: unless the RM *requires* something, don't write code that depends upon it.

Randy.

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Hash Type Size
  2013-08-19 22:12 ` Randy Brukardt
@ 2013-08-31  6:22 ` Peter Brooks
  2013-08-31 15:57 ` sbelmont700
  0 siblings, 1 reply; 15+ messages in thread

From: Peter Brooks @ 2013-08-31 6:22 UTC (permalink / raw)

On Tuesday, 20 August 2013 00:12:13 UTC+2, Randy Brukardt wrote:
>
> Like Integer, the advice is that it should be at least a particular size,
> and nothing more is said. I'm pretty sure there is no intent at all (so far
> as I recall, we never discussed anything about that). The reason for the
> 32-bit advice is to ensure that there are plenty of values available (16-bit
> wouldn't be enough).
>
I'd be quite happy to make it explicit. It'd be good to have an Ada.Containers.Hash_Type64 to make it clear that that is what is being used. Even though there is no way that you'd ever have that number of items in a container, that isn't the point. The larger the size of the hash, the smaller the chance of collisions and, if the native machine size is 64 bits (or, in future, maybe 128 bits), then machine operations will be no slower in 64 than in 32 or 16 bits, so the reduction in collisions will make the application run much faster for large amounts of data.

You might say that this affects portability, which is, on the face of it, true. However, if you're using a 64-bit hash, then you're working with big data, so you'll not be using machines with a word size less than 64 bits, making portability to smaller word sizes irrelevant.

I agree completely that it should be explicit, though! It'd be good if the Ada manual recommended that the bit size was 'at least' 32 bits and that it matched the natural word size 'where possible'. So, on a 16-bit machine, hashing would have a performance penalty - so, for that case, it'd be useful to have an Ada.Containers.Hash_Type16, because that would probably be good enough for a real-time application that used hashing, and would be twice as fast.

^ permalink raw reply [flat|nested] 15+ messages in thread
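[Editor's note: the claim that a wider hash reduces collisions can be quantified with the standard birthday-bound estimate. The sketch below is Python for illustration, and the one-million-key figure is an invented example, not anything measured in the thread.]

```python
# For an ideal ("random") hash, the expected number of colliding pairs
# among n keys over m possible hash values is about n*(n-1)/(2*m).
# Widening the hash from 32 to 64 bits therefore shrinks the expected
# collision count by a factor of roughly 2**32.  These are analytic
# estimates, not measurements of any Ada implementation.

def expected_collisions(n: int, bits: int) -> float:
    m = 2 ** bits
    return n * (n - 1) / (2 * m)

for bits in (16, 32, 64):
    # one million keys: an invented, plausible "big data" workload
    print(bits, expected_collisions(1_000_000, bits))
```

At 16 bits the table is hopelessly overloaded, at 32 bits a few hundred pairs collide, and at 64 bits collisions are essentially impossible — which is the trade-off being argued here.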
* Re: Hash Type Size
  2013-08-31  6:22 ` Peter Brooks
@ 2013-08-31 15:57 ` sbelmont700
  2013-09-03  1:47 ` Randy Brukardt
  0 siblings, 1 reply; 15+ messages in thread

From: sbelmont700 @ 2013-08-31 15:57 UTC (permalink / raw)

To be honest, this was mostly just laziness on my part; there is no good (read: easy) way to hash an opaque data type, and so a quick and dirty (and admittedly non-ideal) way was to simply use the pointer as the key itself. But this is a lot tougher to rationalize now that they are different sizes, so I guess the real question is "what's a good way to hash a private type/access value/System.Address?"

It would be nice if there were an interface exclusively for doing this; something like a 'Hash attribute that could be applied to any type to return either an implementation-provided hash (like for strings) or that could be overridden by a programmer in fancier cases. You can already sort of hack it together now using Ada.Strings.Hash(T'Image(o)).

^ permalink raw reply [flat|nested] 15+ messages in thread
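[Editor's note: the Ada.Strings.Hash(T'Image(o)) hack can be mimicked in any language: render the value as text, then hash the text. The Python sketch below uses FNV-1a as an arbitrary stand-in string hash — the RM does not specify which algorithm Ada.Strings.Hash uses.]

```python
# Sketch of the "hash the 'Image of the value" trick.  FNV-1a is an
# assumed algorithm chosen for illustration only.

def fnv1a_32(s: str) -> int:
    """32-bit FNV-1a over the UTF-8 bytes of a string."""
    h = 0x811C9DC5                      # FNV offset basis
    for b in s.encode("utf-8"):
        h = ((h ^ b) * 0x01000193) % 2**32  # xor, then multiply by FNV prime
    return h

def image_hash(value) -> int:
    # T'Image(o) analogue: hash the value's textual representation
    return fnv1a_32(str(value))

# Works uniformly for anything with a printable image:
print(image_hash(42), image_hash((1, "abc")))
```

The catch, of course, is that equal hashes require equal images, so the image must be a deterministic function of the value — which is exactly why it is only a "sort of" hack for private types.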
* Re: Hash Type Size
  2013-08-31 15:57 ` sbelmont700
@ 2013-09-03  1:47 ` Randy Brukardt
  2013-09-03  2:31 ` Peter Brooks
  0 siblings, 1 reply; 15+ messages in thread

From: Randy Brukardt @ 2013-09-03 1:47 UTC (permalink / raw)

<sbelmont700@gmail.com> wrote in message news:782ef090-7299-4164-b4e5-14a06d1c1a44@googlegroups.com...
> To be honest, this was mostly just laziness on my part; there is no good
> (read: easy) way to hash an opaque data type, and so a quick and dirty
> (and admittedly non-ideal) way was to simply use the pointer as the key
> itself. But this is a lot tougher to rationalize now that they are
> different sizes, so I guess the real question is "what's a good way to
> hash a private type/access value/System.Address?"
>
> It would be nice if there were an interface exclusively for doing this;
> something like a 'Hash attribute that could be applied to any type to
> return either an implementation-provided hash (like for strings) or that
> could be overridden by a programmer in fancier cases. You can already
> sort of hack it together now using Ada.Strings.Hash(T'Image(o)).

We talked a bit about having some such thing, but the truth is that there isn't any obvious algorithm that would work well for all kinds of types. It would be very hard to write some sort of hash function that would work well on completely unknown data. We didn't think it was a good idea to make it easy to use a function that might not work on your actual data, as that would lead to piles of bug reports for implementers (and reports that they could not possibly do anything useful with).

So my answer to "what's a good way to hash a private type/access value/System.Address?" is that no such way exists. The best you can do is convert the data to a stream-element array and hash that somehow, but that most likely would be a dubious hash (there are a lot of unused bits in a typical record, and it would be really easy for those bits to end up being most of the result). The whole point is that a hash function needs to be tailored to the data that it is hashing.

Randy.

^ permalink raw reply [flat|nested] 15+ messages in thread
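[Editor's note: the "unused bits" caveat is easy to demonstrate. In the Python sketch below, the record layout — one meaningful byte padded out to eight — is invented for illustration; in Ada the bytes would come from a stream-element array. Two records whose single meaningful field is identical hash differently once their padding bytes differ.]

```python
# Hash the serialized bytes of a record, padding and all.
import struct

def bytes_hash(data: bytes) -> int:
    h = 0
    for b in data:
        h = (h * 31 + b) % 2**32   # simple polynomial hash over the bytes
    return h

# '<Bxxxxxxx' = one meaningful byte plus seven pad bytes (struct zeroes them)
rec = struct.pack("<Bxxxxxxx", 200)
print(bytes_hash(rec))

# If the pad bytes held junk instead of zeros, the hash would change even
# though the one meaningful field is identical:
junk = bytes([200]) + b"\x99" * 7
print(bytes_hash(rec) == bytes_hash(junk))   # False
```

Seven of the eight bytes hashed here carry no information at all, so whatever happens to be in them dominates the result — the "dubious hash" described above.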
* Re: Hash Type Size
  2013-09-03  1:47 ` Randy Brukardt
@ 2013-09-03  2:31 ` Peter Brooks
  2013-09-03 10:50 ` John B. Matthews
  0 siblings, 1 reply; 15+ messages in thread

From: Peter Brooks @ 2013-09-03 2:31 UTC (permalink / raw)

On Tuesday, 3 September 2013 03:47:52 UTC+2, Randy Brukardt wrote:
>
> So my answer to "what's a good way to hash a private type/access
> value/system.address?" is that no such way exists. The best you can do is
> convert the data to a stream-element array and hash that somehow, but that
> most likely would be a dubious hash (there is a lot of unused bits in a
> typical record, and it would be really easy for those bits to end up being
> most of the result). The whole point is that a hash function needs to be
> tailored to the data that it is hashing.
>
This looks like a good topic for somebody's Ph.D! I'd imagine it possible to have some broad applicability for certain algorithms. In the case of triplestores, the ones used by Wikipedia contain, mainly, English words and phrases, with an increasing number of entries in other languages. It'd be useful to know if a hash function that works for Wikipedia data in English is more or less effective when used on German or Korean. The large triplestores of genetic data are clearly a different matter, and I can see tailoring the algorithm there would be important - what, though, do you actually need to know about the data to design a hash algorithm?

With RDF, which all triplestores are made of, the question that's occupying me is how much benefit there is to encoding the different parts of the URI. The triples are of the form:

<http://dbpedia.org/resource/Alabama>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://schema.org/AdministrativeArea>

subject - predicate - object

Clearly the '<http://' and '>' don't need to be part of the hash. I'll need a mapping both ways, from subject -> predicate -> object and from object -> predicate -> subject; that's what'll be in the binary trees.

The predicates will be a much smaller set, so an associative array will probably be best - probably a three-way associative array, one with the source of the predicate 'www.w3.org/1999/02/22-rdf-syntax-ns', another with the particular predicate 'type', and the third with the whole URI - but the same hash key for each, as they're the same relation.

This is a bit of a simplification, as the triples are actually quads (subject, predicate, object, graph), but that doesn't affect the hashing question.

^ permalink raw reply [flat|nested] 15+ messages in thread
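[Editor's note: the predicate-interning idea above can be sketched briefly. The class below is Python and an assumed design for illustration, not Peter's actual one: each distinct predicate URI gets a small integer ID, so triples can store the ID instead of rehashing the full URI every time.]

```python
# Because predicates form a small set, intern each full predicate URI to
# a small integer ID once, and store triples as
# (subject_hash, predicate_id, object_hash).

class PredicateTable:
    def __init__(self):
        self._ids = {}    # stripped predicate URI -> small ID
        self._uris = []   # small ID -> stripped predicate URI

    def intern(self, uri: str) -> int:
        # strip the '<http://' and '>' wrapper, as suggested above
        key = uri
        if key.startswith("<http://"):
            key = key[len("<http://"):]
        if key.endswith(">"):
            key = key[:-1]
        if key not in self._ids:
            self._ids[key] = len(self._uris)
            self._uris.append(key)
        return self._ids[key]

preds = PredicateTable()
a = preds.intern("<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>")
b = preds.intern("<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>")
print(a == b)   # the same predicate interns to the same ID -> True
```

Looking a predicate up then costs one dictionary probe, and the three "views" of a predicate discussed above can all map to this one ID.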
* Re: Hash Type Size
  2013-09-03  2:31 ` Peter Brooks
@ 2013-09-03 10:50 ` John B. Matthews
  2013-09-03 17:18 ` Peter Brooks
  0 siblings, 1 reply; 15+ messages in thread

From: John B. Matthews @ 2013-09-03 10:50 UTC (permalink / raw)

In article <8268e85c-e372-4883-8449-ef5253e2c77e@googlegroups.com>, Peter Brooks <peter.h.m.brooks@gmail.com> wrote:

> The triples are of the form:
>
> <http://dbpedia.org/resource/Alabama>
> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
> <http://schema.org/AdministrativeArea>
>
> subject - predicate - object
>
> Clearly the '<http://' and '>' don't need to be part of the hash,
> I'll need a mapping both ways, from subject -> predicate ->
> object and from object -> predicate -> subject, that's what'll be
> in the binary trees.

As a concrete example, this program [1] tallies collisions using an instance of Ada.Strings.Bounded.Hash on words found in a dictionary. While it's specific to GNAT, it may suggest how to obtain a baseline against which to compare your implementation(s).

[1] <http://home.roadrunner.com/~jbmatthews/jumble.html#sec2>

-- 
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Hash Type Size
  2013-09-03 10:50 ` John B. Matthews
@ 2013-09-03 17:18 ` Peter Brooks
  2013-09-03 21:21 ` John B. Matthews
  2013-09-04  4:50 ` Paul Rubin
  0 siblings, 2 replies; 15+ messages in thread

From: Peter Brooks @ 2013-09-03 17:18 UTC (permalink / raw)

On Tuesday, 3 September 2013 12:50:04 UTC+2, John B. Matthews wrote:
>
> As a concrete example, this program [1] tallies collisions using an
> instance of Ada.Strings.Bounded.Hash on words found in a dictionary.
> While it's specific to GNAT, it may suggest how to obtain a baseline
> against which to compare your implementation(s).
>
> [1] <http://home.roadrunner.com/~jbmatthews/jumble.html#sec2>
>
Thank you for that - it's extremely useful. 32% collisions is far too high for me, so I certainly need a better hash function.

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Hash Type Size
  2013-09-03 17:18 ` Peter Brooks
@ 2013-09-03 21:21 ` John B. Matthews
  2013-09-04  4:50 ` Paul Rubin
  1 sibling, 0 replies; 15+ messages in thread

From: John B. Matthews @ 2013-09-03 21:21 UTC (permalink / raw)

In article <c6c24f9a-c5ba-4068-be58-c7d1cff4889b@googlegroups.com>, Peter Brooks <peter.h.m.brooks@gmail.com> wrote:

> On Tuesday, 3 September 2013 12:50:04 UTC+2, John B. Matthews wrote:
> >
> > As a concrete example, this program [1] tallies collisions
> > using an instance of Ada.Strings.Bounded.Hash on words found
> > in a dictionary. While it's specific to GNAT, it may suggest
> > how to obtain a baseline against which to compare your
> > implementation(s).
> >
> > [1] <http://home.roadrunner.com/~jbmatthews/jumble.html#sec2>
> >
> Thank you for that - it's extremely useful. 32% collisions is far
> too high for me, so I certainly need a better hash function.

You are welcome. That 2009 data was obtained using GNAT 4.3.4; newer versions are typically better. For example, the GNAT 4.6 implementation yields the following result:

./collisions
0: 305643 (77.72%)
1: 76947 (19.57%)
2: 9790 (2.49%)
3: 805 (0.20%)
4: 56 (0.01%)
5: 1 (0.00%)

-- 
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Hash Type Size
  2013-09-03 17:18 ` Peter Brooks
  2013-09-03 21:21 ` John B. Matthews
@ 2013-09-04  4:50 ` Paul Rubin
  2013-09-04  4:54 ` Paul Rubin
  1 sibling, 1 reply; 15+ messages in thread

From: Paul Rubin @ 2013-09-04 4:50 UTC (permalink / raw)

Peter Brooks <peter.h.m.brooks@gmail.com> writes:
>> [1] <http://home.roadrunner.com/~jbmatthews/jumble.html#sec2>
> Thank you for that - it's extremely useful. 32% collisions is far too
> high for me, so I certainly need a better hash function.

It doesn't say how many words are in the dictionary, so it doesn't tell you anything. A hash function should approximate a random function, which means you get collisions starting around sqrt(number of slots in the table).

Re "tailoring the hash function to the data", this is called perfect hashing and there's lots of stuff online about how to do it (see Google). It's useful in some situations, but not that many. What is your actual application?

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Hash Type Size
  2013-09-04  4:50 ` Paul Rubin
@ 2013-09-04  4:54 ` Paul Rubin
  2013-09-05 19:30 ` John B. Matthews
  0 siblings, 1 reply; 15+ messages in thread

From: Paul Rubin @ 2013-09-04 4:54 UTC (permalink / raw)

Paul Rubin <no.email@nospam.invalid> writes:
> It doesn't say how many words are in the dictionary,

Wait, it gives the numbers:

0: 305643 (77.72%)
1: 76947 (19.57%)
2: 9790 (2.49%)
3: 805 (0.20%)
4: 56 (0.01%)
5: 1 (0.00%)

i.e. there are around 492K words in a 250K slot hash table. Of course there will be a lot of collisions. If you want fewer collisions, you need more slots than words.

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Hash Type Size
  2013-09-04  4:54 ` Paul Rubin
@ 2013-09-05 19:30 ` John B. Matthews
  0 siblings, 0 replies; 15+ messages in thread

From: John B. Matthews @ 2013-09-05 19:30 UTC (permalink / raw)

In article <7x7gexqj9d.fsf@ruckus.brouhaha.com>, Paul Rubin <no.email@nospam.invalid> wrote:

> Paul Rubin <no.email@nospam.invalid> writes:
> > It doesn't say how many words are in the dictionary,
> Wait, it gives the numbers:
>
> 0: 305643 (77.72%)
> 1: 76947 (19.57%)
> 2: 9790 (2.49%)
> 3: 805 (0.20%)
> 4: 56 (0.01%)
> 5: 1 (0.00%)
>
> i.e. there are around 492K words in a 250K slot hash table. Of
> course there will be a lot of collisions. If you want fewer
> collisions, you need more slots than words.

Thank you for commenting on this. I see now that the raw percentages are misleading. The number of slots reserved in the Table array is equal to the sum of the elements in the Counts Map. The number of dictionary words hashed is the sum of the products of each key/element pair in the Counts Map. In the example above, the 393241 slots reserved are much larger than required for the 99171 words in the Ubuntu dictionary; the load factor is just 25%.

Focusing on the Counts Map, node 0 is the remaining empty slots in the Table array. Node 1 implies that 1 * 76947 / 99171 = 78% of the words can be retrieved directly, while node 5 implies that just 5 * 1 / 99171 << 1% of the words may require up to four additional operations to retrieve.

I've updated the program to adapt the Table size and show a few details.

<http://home.roadrunner.com/~jbmatthews/jumble.html#sec2>

Next to each count is the percentage of words in that occupancy class. Note that node zero is unused table entries; node one is table entries occupied by a single word; node two is table entries occupied by two words; etc. Hashing the same dictionary as above shows the following results.

./collisions /usr/share/dict/words
Word count: 99171
Table size: 196613
Load factor: 50.44%
0: 118762 (0.00%)
1: 59770 (60.27%)
2: 15218 (30.69%)
3: 2534 (7.67%)
4: 288 (1.16%)
5: 36 (0.18%)
6: 4 (0.02%)
7: 1 (0.01%)

These results were obtained using the larger dictionary on Mac OS X:

./collisions /usr/share/dict/words
Word count: 234936
Table size: 393241
Load factor: 59.74%
0: 218461 (0.00%)
1: 126393 (53.80%)
2: 38524 (32.80%)
3: 8228 (10.51%)
4: 1400 (2.38%)
5: 203 (0.43%)
6: 29 (0.07%)
7: 2 (0.01%)
8: 1 (0.00%)

-- 
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>

^ permalink raw reply [flat|nested] 15+ messages in thread
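[Editor's note: John's occupancy counts can be checked against what an ideal hash — a "random function", in Paul's sense — would predict. With n keys thrown into m slots, the number of slots holding exactly k keys is approximately Poisson-distributed with mean n/m. A Python check under that assumption:]

```python
# Expected number of slots holding exactly k keys for an ideal random
# hash: m * exp(-lam) * lam**k / k!, where lam = n/m (Poisson approximation).
import math

def expected_slots(n: int, m: int, k: int) -> float:
    lam = n / m
    return m * math.exp(-lam) * lam**k / math.factorial(k)

n, m = 99171, 196613          # John's GNAT word count and table size
for k in range(4):
    print(k, round(expected_slots(n, m, k)))
```

For k = 0..3 this lands within about one percent of the GNAT figures above (118762, 59770, 15218, 2534), which suggests that hash is behaving essentially randomly on this dictionary — the collisions observed are just what the load factor dictates, not a defect in the hash function.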