* Hash Type Size
@ 2013-08-18 21:05 sbelmont700
2013-08-19 1:03 ` AdaMagica
2013-08-19 22:12 ` Randy Brukardt
0 siblings, 2 replies; 15+ messages in thread
From: sbelmont700 @ 2013-08-18 21:05 UTC (permalink / raw)
Hi,
What is the size of Ada.Containers.Hash_Type intended to be? The LRM advises that "Hash_Type'Modulus should be at least 2**32", but this makes hashing a value on a 64-bit machine particularly unpleasant, since my implementation (MinGW-64) still defines it as 32-bit. Is the intent that Hash_Type should be the native size of the machine (and the implementation is wrong/advice is vague), or that it actually should be 32-bit if possible?
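For reference, what a given compiler actually defines can be printed directly (a trivial check program):

```ada
with Ada.Text_IO;    use Ada.Text_IO;
with Ada.Containers; use Ada.Containers;

procedure Show_Hash_Size is
begin
   --  Both values are implementation-defined; this only reports what
   --  the compiler at hand chose.
   Put_Line ("Hash_Type'Size =" & Integer'Image (Hash_Type'Size));
   Put_Line ("Hash_Type'Last =" & Hash_Type'Image (Hash_Type'Last));
end Show_Hash_Size;
```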
-sb
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Hash Type Size
2013-08-18 21:05 Hash Type Size sbelmont700
@ 2013-08-19 1:03 ` AdaMagica
2013-08-19 22:21 ` Randy Brukardt
2013-08-19 22:12 ` Randy Brukardt
1 sibling, 1 reply; 15+ messages in thread
From: AdaMagica @ 2013-08-19 1:03 UTC (permalink / raw)
On Sunday, August 18, 2013 11:05:47 PM UTC+2, sbelm...@gmail.com wrote:
> What is the size of Ada.Containers.Hash_Type intended to be? ... Is the intent that Hash_Type should be the native size of the machine (and the implementation is wrong/advice is vague), or that it actually should be 32-bit if possible?
See AARM A.18.1(8.b/2). Looks like it should be the native size. But it's only advice.
* Re: Hash Type Size
2013-08-18 21:05 Hash Type Size sbelmont700
2013-08-19 1:03 ` AdaMagica
@ 2013-08-19 22:12 ` Randy Brukardt
2013-08-31 6:22 ` Peter Brooks
1 sibling, 1 reply; 15+ messages in thread
From: Randy Brukardt @ 2013-08-19 22:12 UTC (permalink / raw)
<sbelmont700@gmail.com> wrote in message
news:1679ec49-424b-43bd-8f35-a5f69e658112@googlegroups.com...
>What is the size of Ada.Containers.Hash_Type intended to be?
It's implementation-defined, of course. The formal language makes no
requirement on the size.
> The LRM advises that "Hash_Type'Modulus should be at least 2**32", but
> this
> makes hashing a value on a 64-bit machine particularly unpleasant, since
> my
> implementation (MinGW-64) still defines it as 32-bit. Is the intent that
> Hash_Type
> should be the native size of the machine (and the implementation is
> wrong/advice is
> vague), or that it actually should be 32-bit if possible?
Like Integer, the advice is that it should be at least a particular size,
and nothing more is said. I'm pretty sure there is no intent at all (so far
as I recall, we never discussed anything about that). The reason for the
32-bit advice is to ensure that there are plenty of values available (16-bit
wouldn't be enough).
If you want truly portable code, you have to hash a stream element
representation of your data anyway, so there doesn't seem to be much
advantage to requiring more than that.
So many Ada programmers make unwarranted assumptions about the sizes of
things. Ada has no notion of "native size" and writing code that depends
upon that is just not going to be portable.
I've seen many pieces of code that assume that Integer has 32 bits (not true
in Janus/Ada), that Long_Integer has 64 bits (not true in Janus/Ada), or that
the size of System.Address is the same as that of an access type (not true in
16-bit versions of Janus/Ada, and probably some others as well). Almost everyone
assumes that integers have a two's complement representation (that caused
all kinds of problems for our U2200 compiler, which of course is a one's
complement, 36-bit machine).
Moral: unless the RM *requires* something, don't write code that depends
upon it.
Randy.
* Re: Hash Type Size
2013-08-19 1:03 ` AdaMagica
@ 2013-08-19 22:21 ` Randy Brukardt
2013-08-19 22:29 ` Randy Brukardt
0 siblings, 1 reply; 15+ messages in thread
From: Randy Brukardt @ 2013-08-19 22:21 UTC (permalink / raw)
"AdaMagica" <christ-usch.grein@t-online.de> wrote in message
news:8e573dcd-e84d-423b-b319-d2224a1ce9ae@googlegroups.com...
> On Sunday, August 18, 2013 11:05:47 PM UTC+2, sbelm...@gmail.com wrote:
>> What is the size of Ada.Containers.Hash_Type intended to be? ... Is the
>> intent that Hash_Type should be the native size of the machine (and the
>> implementation is wrong/advice is vague), or that it actually should be
>> 32-bit if possible?
>
> See AARM A.18.1(8.b/2). Looks like it should be the native size. But it's
> only advice.
My recollection is that we said that with the intent that we didn't want it
to be smaller than 32-bits unless there was a good reason (like no 32-bit
math on the target). I don't think there was any intent to suggest that it
ought to be larger (it's not practical to store more than 2**32 items in an
Ada container).
There's not much point anyway; you can't directly hash 32-bit signed
integers in a 32-bit hash value without resorting to Unchecked_Conversion,
and then you've already completely left the realm of safe and portable
(neither Integer nor Hash_Type has a defined size by the RM).
Randy.
* Re: Hash Type Size
2013-08-19 22:21 ` Randy Brukardt
@ 2013-08-19 22:29 ` Randy Brukardt
0 siblings, 0 replies; 15+ messages in thread
From: Randy Brukardt @ 2013-08-19 22:29 UTC (permalink / raw)
I said:
...
> There's not much point anyway; you can't directly hash 32-bit signed
> integers in a 32-bit hash value without resorting to Unchecked_Conversion,
> and then you've already completely left the realm of safe and portable
> (neither Integer nor Hash_Type has a defined size by the RM).
I just remembered that you could use the 'Mod attribute function to do this
conversion. But of course that doesn't depend on the size of the type of
its argument, so it will work on Integer (for instance) no matter what size
Integer has.
Doing so will eliminate the uppermost bits from the hash in some cases, but
that's usually not a problem as values rarely get anywhere near as large as
the maximum possible. (Hopefully, your types have ranges so you know for
sure.)
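A minimal sketch of the 'Mod approach:

```ada
with Ada.Containers; use Ada.Containers;

function Integer_Hash (Value : Integer) return Hash_Type is
begin
   --  Hash_Type'Mod accepts a value of any integer type and wraps it
   --  into 0 .. Hash_Type'Last, so this compiles and works no matter
   --  what sizes Integer and Hash_Type have on a given compiler.
   return Hash_Type'Mod (Value);
end Integer_Hash;
```

Negative values simply wrap around; for hashing purposes that's harmless.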
If you're actually using all 64 bits of an integer (and hopefully you are
declaring that explicitly, so compilers that don't support it can reject
it), then you'll have to use Unchecked_Conversion or streaming to convert it
into something hashable. Which I still think is the best idea in 99% of the
cases.
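A sketch of the Unchecked_Conversion route for an explicitly 64-bit key (the names are made up, and the XOR-fold is one illustrative way to keep the upper bits relevant when Hash_Type happens to be 32 bits):

```ada
with Ada.Containers;            use Ada.Containers;
with Interfaces;                use Interfaces;
with Ada.Unchecked_Conversion;

package Key_Hashing is
   --  Declared explicitly as 64-bit, so a compiler without 64-bit
   --  support must reject it rather than silently truncate.
   type Key_Type is range -2 ** 63 .. 2 ** 63 - 1;
   function Key_Hash (Key : Key_Type) return Hash_Type;
end Key_Hashing;

package body Key_Hashing is
   function To_U64 is new Ada.Unchecked_Conversion (Key_Type, Unsigned_64);

   function Key_Hash (Key : Key_Type) return Hash_Type is
      U : constant Unsigned_64 := To_U64 (Key);
   begin
      --  Fold the high half into the low half so the upper 32 bits
      --  still influence the result when Hash_Type is 32 bits wide.
      return Hash_Type'Mod (U xor Shift_Right (U, 32));
   end Key_Hash;
end Key_Hashing;
```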
Randy.
* Re: Hash Type Size
2013-08-19 22:12 ` Randy Brukardt
@ 2013-08-31 6:22 ` Peter Brooks
2013-08-31 15:57 ` sbelmont700
0 siblings, 1 reply; 15+ messages in thread
From: Peter Brooks @ 2013-08-31 6:22 UTC (permalink / raw)
On Tuesday, 20 August 2013 00:12:13 UTC+2, Randy Brukardt wrote:
>
> Like Integer, the advice is that it should be at least a particular size,
> and nothing more is said. I'm pretty sure there is no intent at all (so far
> as I recall, we never discussed anything about that). The reason for the
> 32-bit advice is to ensure that there are plenty of values available (16-bit
> wouldn't be enough).
>
I'd be quite happy to make it explicit. It'd be good to have an Ada.Containers.Hash_Type64 to make it clear that that is what is being used.
Even though there is no way that you'd ever have that number of items in a container, that isn't the point. The larger the hash, the smaller the chance of collisions; and if the native machine size is 64 bits (or, in future, maybe 128 bits), then machine operations will be no slower in 64 bits than in 32 or 16, so the reduction in collisions will make the application run much faster for large amounts of data.
You might say that this affects portability, which is, on the face of it, true. However, if you're using a 64-bit hash, then you're working with big data, so you'll not be using machines with a word size less than 64 bits, making portability to smaller word sizes irrelevant.
I agree completely that it should be explicit, though! It'd be good if the Ada manual recommended that the hash size was 'at least' 32 bits and that it matched the natural word size 'where possible'. On a 16-bit machine, hashing would then carry a performance penalty - so, for that case, it'd be useful to have an Ada.Containers.Hash_Type16, because that would probably be good enough for a real-time application that used hashing, and would be twice as fast.
* Re: Hash Type Size
2013-08-31 6:22 ` Peter Brooks
@ 2013-08-31 15:57 ` sbelmont700
2013-09-03 1:47 ` Randy Brukardt
0 siblings, 1 reply; 15+ messages in thread
From: sbelmont700 @ 2013-08-31 15:57 UTC (permalink / raw)
To be honest, this was mostly just laziness on my part; there is no good (read: easy) way to hash an opaque data type, so a quick and dirty (and admittedly non-ideal) approach was to simply use the pointer as the key itself. But this is a lot tougher to rationalize now that they are different sizes, so I guess the real question is "what's a good way to hash a private type/access value/System.Address?"
It would be nice if there were an interface exclusively for doing this; something like a 'Hash attribute that could be applied to any type to return either an implementation-provided hash (as for strings) or one that could be overridden by a programmer in fancier cases. You can already sort of hack it together now using Ada.Strings.Hash (T'Image (O)).
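That hack can at least be packaged as a generic, so each instantiation reads like the built-in hashes (this sketch covers discrete types, where 'Image is defined pre-Ada 2022):

```ada
with Ada.Containers;

generic
   type T is (<>);   --  any discrete type
function Image_Hash (Item : T) return Ada.Containers.Hash_Type;

with Ada.Strings.Hash;

function Image_Hash (Item : T) return Ada.Containers.Hash_Type is
begin
   --  Slow (builds a string per call), but fully portable.
   return Ada.Strings.Hash (T'Image (Item));
end Image_Hash;
```

An instance such as `function Hash is new Image_Hash (My_Enum);` can then be passed straight to a Hashed_Maps instantiation.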
* Re: Hash Type Size
2013-08-31 15:57 ` sbelmont700
@ 2013-09-03 1:47 ` Randy Brukardt
2013-09-03 2:31 ` Peter Brooks
0 siblings, 1 reply; 15+ messages in thread
From: Randy Brukardt @ 2013-09-03 1:47 UTC (permalink / raw)
<sbelmont700@gmail.com> wrote in message
news:782ef090-7299-4164-b4e5-14a06d1c1a44@googlegroups.com...
>To be honest, this was mostly just laziness on my part; there is no good
>(read: easy)
>way to hash an opaque data type, and so a quick and dirty (and admittedly
>non-ideal) way to simply use the pointer as the key itself. But this is a
>lot tougher
>to rationalize now that they are different sizes, so I guess the real
>question is
>"what's a good way to hash a private type/access value/system.address?"
>
>It would be nice if there was a interface exclusively for doing this;
>something like
>an 'Hash attribute that could be applied to any type to return either a
>implementation-provided hash (like for strings) or that could be overridden
>by
>a programmer in fancier cases. You can already sort of hack it together
>now using
>Ada.Strings.Hash(T'Image(o)).
We talked a bit about having some such thing, but the truth is that there
isn't any obvious algorithm that would work well for all kinds of types. It
would be very hard to write some sort of hash function that would work well
on completely unknown data. We didn't think it was a good idea to make it
easy to use a function that might not work on your actual data, as that
would lead to piles of bug reports for implementers (and reports that they
could not possibly do anything useful with).
So my answer to "what's a good way to hash a private type/access
value/system.address?" is that no such way exists. The best you can do is
convert the data to a stream-element array and hash that somehow, but that
most likely would be a dubious hash (there are a lot of unused bits in a
typical record, and it would be really easy for those bits to end up being
most of the result). The whole point is that a hash function needs to be
tailored to the data that it is hashing.
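One way to sketch the stream-element approach is a stream type that hashes whatever is written to it; the FNV-1a-style constants below are illustrative, and the caveat about unused bits applies in full:

```ada
with Ada.Containers; use Ada.Containers;
with Ada.Streams;    use Ada.Streams;

package Stream_Hashing is
   type Hashing_Stream is new Root_Stream_Type with record
      Hash : Hash_Type := Hash_Type'Mod (2166136261);
   end record;

   overriding procedure Write
     (Stream : in out Hashing_Stream;
      Item   : Stream_Element_Array);

   --  Nothing is ever read back from this stream.
   overriding procedure Read
     (Stream : in out Hashing_Stream;
      Item   : out Stream_Element_Array;
      Last   : out Stream_Element_Offset) is null;
end Stream_Hashing;

package body Stream_Hashing is
   overriding procedure Write
     (Stream : in out Hashing_Stream;
      Item   : Stream_Element_Array) is
   begin
      --  Fold each stream element into the running hash.
      for E of Item loop
         Stream.Hash :=
           (Stream.Hash xor Hash_Type (E)) * Hash_Type'Mod (16777619);
      end loop;
   end Write;
end Stream_Hashing;
```

Given an aliased S : Hashing_Stream, calling My_Type'Write (S'Access, X) leaves a hash of X's stream representation in S.Hash. As noted above, unused bits in the representation can still dominate the result, so this is a fallback, not a good general-purpose hash.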
Randy.
* Re: Hash Type Size
2013-09-03 1:47 ` Randy Brukardt
@ 2013-09-03 2:31 ` Peter Brooks
2013-09-03 10:50 ` John B. Matthews
0 siblings, 1 reply; 15+ messages in thread
From: Peter Brooks @ 2013-09-03 2:31 UTC (permalink / raw)
On Tuesday, 3 September 2013 03:47:52 UTC+2, Randy Brukardt wrote:
>
> So my answer to "what's a good way to hash a private type/access
> value/system.address?" is that no such way exists. The best you can do is
> convert the data to a stream-element array and hash that somehow, but that
> most likely would be a dubious hash (there is a lot of unused bits in a
> typical record, and it would be really easy for those bits to end up being
> most of the result). The whole point is that a hash function needs to be
> tailored to the data that it is hashing.
>
This looks like a good topic for somebody's Ph.D.!
I'd imagine certain algorithms could have some broad applicability. In the case of triplestores, the ones used by Wikipedia contain mainly English words and phrases, with an increasing number of entries in other languages. It'd be useful to know whether a hash function that works for Wikipedia data in English is more or less effective when used on German or Korean.
The large triplestores of genetic data are clearly a different matter, and I can see that tailoring the algorithm there would be important - what, though, do you actually need to know about the data to design a hash algorithm?
With RDF, which all triplestores are made of, the question that's occupying me is how much benefit there is to encoding the different parts of the URI.
The triples are of the form:
<http://dbpedia.org/resource/Alabama> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/AdministrativeArea>
subject - predicate - object
Clearly the '<http://' and '>' don't need to be part of the hash. I'll need a mapping both ways, from subject -> predicate -> object and from object -> predicate -> subject; that's what'll be in the binary trees.
The predicates will be a much smaller set, so an associative array will probably be best - probably a three-way associative array: one keyed on the source of the predicate 'www.w3.org/1999/02/22-rdf-syntax-ns', another on the particular predicate 'type', and the third on the whole URI - but with the same hash key for each, as they're the same relation.
This is a bit of a simplification, as the triples are actually quads (subject, predicate, object, graph), but that doesn't affect the hashing question.
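A sketch of stripping the constant URI framing before hashing (Ada.Strings.Hash stands in for whatever hash function ends up being chosen):

```ada
with Ada.Containers; use Ada.Containers;
with Ada.Strings.Hash;

function URI_Hash (URI : String) return Hash_Type is
   First : Positive := URI'First;
   Last  : Natural  := URI'Last;
begin
   --  Drop the '<http://' prefix and '>' suffix, which are the same
   --  for every term and so contribute nothing to the hash.
   if URI'Length > 8
     and then URI (URI'First .. URI'First + 7) = "<http://"
   then
      First := URI'First + 8;
   end if;
   if Last >= First and then URI (Last) = '>' then
      Last := Last - 1;
   end if;
   return Ada.Strings.Hash (URI (First .. Last));
end URI_Hash;
```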
* Re: Hash Type Size
2013-09-03 2:31 ` Peter Brooks
@ 2013-09-03 10:50 ` John B. Matthews
2013-09-03 17:18 ` Peter Brooks
0 siblings, 1 reply; 15+ messages in thread
From: John B. Matthews @ 2013-09-03 10:50 UTC (permalink / raw)
In article <8268e85c-e372-4883-8449-ef5253e2c77e@googlegroups.com>,
Peter Brooks <peter.h.m.brooks@gmail.com> wrote:
> The triples are of the form:
>
> <http://dbpedia.org/resource/Alabama>
> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
> <http://schema.org/AdministrativeArea>
>
> subject - predicate - object
>
> Clearly the '<http://' and '>' don't need to be part of the hash,
> I'll need a mapping both ways, from subject -> predicate ->
> object and from object -> predicate -> subject, that's what'll be
> in the binary trees.
As a concrete example, this program [1] tallies collisions using an
instance of Ada.Strings.Bounded.Hash on words found in a dictionary.
While it's specific to GNAT, it may suggest how to obtain a baseline
against which to compare your implementation(s).
[1] <http://home.roadrunner.com/~jbmatthews/jumble.html#sec2>
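The tallying idea, reduced to a minimal sketch (not the linked program itself; the table size here is just an illustrative prime):

```ada
with Ada.Containers; use Ada.Containers;
with Ada.Strings.Hash;
with Ada.Text_IO;    use Ada.Text_IO;

procedure Collisions is
   Table_Size : constant := 393241;   --  illustrative prime
   Counts : array (Hash_Type range 0 .. Table_Size - 1) of Natural :=
     (others => 0);
   --  Occupancy (K) = number of slots holding exactly K words
   Occupancy : array (0 .. 20) of Natural := (others => 0);
   File : File_Type;
   Line : String (1 .. 256);
   Last : Natural;
begin
   Open (File, In_File, "/usr/share/dict/words");
   while not End_Of_File (File) loop
      Get_Line (File, Line, Last);
      declare
         Slot : constant Hash_Type :=
           Ada.Strings.Hash (Line (1 .. Last)) mod Table_Size;
      begin
         Counts (Slot) := Counts (Slot) + 1;
      end;
   end loop;
   Close (File);
   for C of Counts loop
      Occupancy (Natural'Min (C, 20)) := Occupancy (Natural'Min (C, 20)) + 1;
   end loop;
   for K in Occupancy'Range loop
      if Occupancy (K) > 0 then
         Put_Line (Integer'Image (K) & ":" & Integer'Image (Occupancy (K)));
      end if;
   end loop;
end Collisions;
```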
--
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>
* Re: Hash Type Size
2013-09-03 10:50 ` John B. Matthews
@ 2013-09-03 17:18 ` Peter Brooks
2013-09-03 21:21 ` John B. Matthews
2013-09-04 4:50 ` Paul Rubin
0 siblings, 2 replies; 15+ messages in thread
From: Peter Brooks @ 2013-09-03 17:18 UTC (permalink / raw)
On Tuesday, 3 September 2013 12:50:04 UTC+2, John B. Matthews wrote:
>
> As a concrete example, this program [1] tallies collisions using an
> instance of Ada.Strings.Bounded.Hash on words found in a dictionary.
> While it's specific to GNAT, it may suggest how to obtain a baseline
> against which to compare your implementation(s).
>
> [1] <http://home.roadrunner.com/~jbmatthews/jumble.html#sec2>
>
Thank you for that - it's extremely useful. 32% collisions is far too high for me, so I certainly need a better hash function.
* Re: Hash Type Size
2013-09-03 17:18 ` Peter Brooks
@ 2013-09-03 21:21 ` John B. Matthews
2013-09-04 4:50 ` Paul Rubin
1 sibling, 0 replies; 15+ messages in thread
From: John B. Matthews @ 2013-09-03 21:21 UTC (permalink / raw)
In article <c6c24f9a-c5ba-4068-be58-c7d1cff4889b@googlegroups.com>,
Peter Brooks <peter.h.m.brooks@gmail.com> wrote:
> On Tuesday, 3 September 2013 12:50:04 UTC+2, John B. Matthews wrote:
> >
> > As a concrete example, this program [1] tallies collisions
> > using an instance of Ada.Strings.Bounded.Hash on words found
> > in a dictionary. While it's specific to GNAT, it may suggest
> > how to obtain a baseline against which to compare your
> > implementation(s).
> >
> > [1] <http://home.roadrunner.com/~jbmatthews/jumble.html#sec2>
> >
> Thank you for that -it's extremely useful. 32% collisions is far
> too high for me, so I certainly need a better hash function.
You are welcome. That 2009 data was obtained using GNAT 4.3.4;
newer versions are typically better. For example, the GNAT 4.6
implementation yields the following result:
./collisions
0: 305643 (77.72%)
1: 76947 (19.57%)
2: 9790 (2.49%)
3: 805 (0.20%)
4: 56 (0.01%)
5: 1 (0.00%)
--
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>
* Re: Hash Type Size
2013-09-03 17:18 ` Peter Brooks
2013-09-03 21:21 ` John B. Matthews
@ 2013-09-04 4:50 ` Paul Rubin
2013-09-04 4:54 ` Paul Rubin
1 sibling, 1 reply; 15+ messages in thread
From: Paul Rubin @ 2013-09-04 4:50 UTC (permalink / raw)
Peter Brooks <peter.h.m.brooks@gmail.com> writes:
>> [1] <http://home.roadrunner.com/~jbmatthews/jumble.html#sec2>
> Thank you for that -it's extremely useful. 32% collisions is far too
> high for me, so I certainly need a better hash function.
It doesn't say how many words are in the dictionary, so it doesn't tell
you anything. A hash function should approximate a random function,
which means you get collisions starting around sqrt(number of slots in
the table). Re "tailoring the hash function to the data", this is
called perfect hashing and there's lots of stuff online about how to do
it (see google). It's useful in some situations, but not that many.
What is your actual application?
* Re: Hash Type Size
2013-09-04 4:50 ` Paul Rubin
@ 2013-09-04 4:54 ` Paul Rubin
2013-09-05 19:30 ` John B. Matthews
0 siblings, 1 reply; 15+ messages in thread
From: Paul Rubin @ 2013-09-04 4:54 UTC (permalink / raw)
Paul Rubin <no.email@nospam.invalid> writes:
> It doesn't say how many words are in the dictionary,
Wait, it gives the numbers:
0: 305643 (77.72%)
1: 76947 (19.57%)
2: 9790 (2.49%)
3: 805 (0.20%)
4: 56 (0.01%)
5: 1 (0.00%)
i.e. there are around 492K words in a 250K slot hash table. Of course
there will be a lot of collisions. If you want fewer collisions, you
need more slots than words.
* Re: Hash Type Size
2013-09-04 4:54 ` Paul Rubin
@ 2013-09-05 19:30 ` John B. Matthews
0 siblings, 0 replies; 15+ messages in thread
From: John B. Matthews @ 2013-09-05 19:30 UTC (permalink / raw)
In article <7x7gexqj9d.fsf@ruckus.brouhaha.com>,
Paul Rubin <no.email@nospam.invalid> wrote:
> Paul Rubin <no.email@nospam.invalid> writes:
> > It doesn't say how many words are in the dictionary,
> Wait, it gives the numbers
>
> 0: 305643 (77.72%)
> 1: 76947 (19.57%)
> 2: 9790 (2.49%)
> 3: 805 (0.20%)
> 4: 56 (0.01%)
> 5: 1 (0.00%)
>
> i.e. there are around 492K words in a 250K slot hash table. Of
> course there will be a lot of collisions. If you want fewer
> collisions, you need more slots than words.
Thank you for commenting on this. I see now that the raw percentages are
misleading. The number of slots reserved in the Table array is equal to
the sum of the elements in the Counts Map. The number of dictionary
words hashed is the sum of the products of the key/element pairs in the
Counts Map.
In the example above, the 393241 slots reserved are far more than
required for the 99171 words in the Ubuntu dictionary; the load factor
is just 25%. Focusing on the Counts Map, node 0 is the remaining empty
slots in the Table array. Node 1 implies that 1 * 76947 / 99171 = 78% of
the words can be retrieved directly, while node 5 implies that just 5 *
1 / 99171 << 1% of the words may require up to four additional
operations to retrieve.
I've updated the program to adapt the Table size and show a few details.
<http://home.roadrunner.com/~jbmatthews/jumble.html#sec2>
Next to each count is the percentage of words in that occupancy class.
Note that node zero is unused table entries; node one is table entries
occupied by a single word; node two is table entries occupied by two
words; etc. Hashing the same dictionary as above shows the following
results.
./collisions
/usr/share/dict/words
Word count: 99171
Table size: 196613
Load factor: 50.44%
0: 118762 (0.00%)
1: 59770 (60.27%)
2: 15218 (30.69%)
3: 2534 (7.67%)
4: 288 (1.16%)
5: 36 (0.18%)
6: 4 (0.02%)
7: 1 (0.01%)
These results were obtained using the larger dictionary on Mac OS X:
./collisions
/usr/share/dict/words
Word count: 234936
Table size: 393241
Load factor: 59.74%
0: 218461 (0.00%)
1: 126393 (53.80%)
2: 38524 (32.80%)
3: 8228 (10.51%)
4: 1400 (2.38%)
5: 203 (0.43%)
6: 29 (0.07%)
7: 2 (0.01%)
8: 1 (0.00%)
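For what it's worth, these occupancy counts are close to what a random hash function should give (Paul's point upthread): with load factor lambda = words/slots, the expected number of slots holding K words is slots * exp (-lambda) * lambda**K / K factorial. A quick check against the 50.44% run above:

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Numerics.Elementary_Functions;
use  Ada.Numerics.Elementary_Functions;

procedure Poisson_Check is
   Slots  : constant Float := 196613.0;
   Words  : constant Float := 99171.0;
   Lambda : constant Float := Words / Slots;   --  load factor, ~0.5044
   P      : Float := Exp (-Lambda);            --  P (slot holds 0 words)
begin
   for K in 0 .. 7 loop
      Put_Line (Integer'Image (K) & ":" &
                Integer'Image (Integer (Slots * P)));
      P := P * Lambda / Float (K + 1);         --  Poisson recurrence
   end loop;
end Poisson_Check;
```

This predicts roughly 118,700 empty slots and 59,900 singletons against the observed 118762 and 59770, suggesting the hash is behaving essentially like a random function on this data.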
--
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>