* Hash Type Size
@ 2013-08-18 21:05 sbelmont700
  2013-08-19  1:03 ` AdaMagica
  2013-08-19 22:12 ` Randy Brukardt
  0 siblings, 2 replies; 15+ messages in thread

From: sbelmont700 @ 2013-08-18 21:05 UTC (permalink / raw)

Hi,

What is the size of Ada.Containers.Hash_Type intended to be? The LRM advises that "Hash_Type'Modulus should be at least 2**32", but this makes hashing a value on a 64-bit machine particularly unpleasant, since my implementation (MinGW-64) still defines it as 32-bit. Is the intent that Hash_Type should be the native size of the machine (and the implementation is wrong/advice is vague), or that it actually should be 32-bit if possible?

-sb

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Hash Type Size
  2013-08-18 21:05 Hash Type Size sbelmont700
@ 2013-08-19  1:03 ` AdaMagica
  2013-08-19 22:21 ` Randy Brukardt
  2013-08-19 22:12 ` Randy Brukardt
  1 sibling, 1 reply; 15+ messages in thread

From: AdaMagica @ 2013-08-19 1:03 UTC (permalink / raw)

On Sunday, August 18, 2013 11:05:47 PM UTC+2, sbelm...@gmail.com wrote:
> What is the size of Ada.Containers.Hash_Type intended to be? ... Is the
> intent that Hash_Type should be the native size of the machine (and the
> implementation is wrong/advice is vague), or that it actually should be
> 32-bit if possible?

See AARM A.18.1(8.b/2). Looks like it should be the native size. But it's only advice.

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Hash Type Size
  2013-08-19  1:03 ` AdaMagica
@ 2013-08-19 22:21 ` Randy Brukardt
  2013-08-19 22:29 ` Randy Brukardt
  0 siblings, 1 reply; 15+ messages in thread

From: Randy Brukardt @ 2013-08-19 22:21 UTC (permalink / raw)

"AdaMagica" <christ-usch.grein@t-online.de> wrote in message news:8e573dcd-e84d-423b-b319-d2224a1ce9ae@googlegroups.com...
> On Sunday, August 18, 2013 11:05:47 PM UTC+2, sbelm...@gmail.com wrote:
>> What is the size of Ada.Containers.Hash_Type intended to be? ... Is the
>> intent that Hash_Type should be the native size of the machine (and the
>> implementation is wrong/advice is vague), or that it actually should be
>> 32-bit if possible?
>
> See AARM A.18.1(8.b/2). Looks like it should be the native size. But it's
> only advice.

My recollection is that we said that with the intent that we didn't want it to be smaller than 32 bits unless there was a good reason (like no 32-bit math on the target). I don't think there was any intent to suggest that it ought to be larger (it's not practical to store more than 2**32 items in an Ada container).

There's not much point anyway; you can't directly hash 32-bit signed integers into a 32-bit hash value without resorting to Unchecked_Conversion, and then you've already completely left the realm of safe and portable (neither Integer nor Hash_Type has a defined size by the RM).

Randy.

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Hash Type Size
  2013-08-19 22:21 ` Randy Brukardt
@ 2013-08-19 22:29 ` Randy Brukardt
  0 siblings, 0 replies; 15+ messages in thread

From: Randy Brukardt @ 2013-08-19 22:29 UTC (permalink / raw)

I said:
...
> There's not much point anyway; you can't directly hash 32-bit signed
> integers in a 32-bit hash value without resorting to Unchecked_Conversion,
> and then you've already completely left the realm of safe and portable
> (neither Integer nor Hash_Type has a defined size by the RM).

I just remembered that you could use the 'Mod attribute function to do this conversion. But of course that doesn't depend on the size of the type of its argument, so it will work on Integer (for instance) no matter what size Integer has. Doing so will eliminate the uppermost bits from the hash in some cases, but that's usually not a problem, as values rarely get anywhere near as large as the maximum possible. (Hopefully, your types have ranges so you know for sure.)

If you're actually using all 64 bits of an integer (and hopefully you are declaring that explicitly, so compilers that don't support it can reject it), then you'll have to use Unchecked_Conversion or streaming to convert it into something hashable. Which I still think is the best idea in 99% of the cases.

Randy.

^ permalink raw reply [flat|nested] 15+ messages in thread
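[Editor's note: the arithmetic that Randy's 'Mod suggestion performs can be sketched outside Ada. The Python fragment below is an illustration only — real code would write Hash_Type'Mod (X) in Ada, and the 2**32 modulus is just the LRM's advised minimum, not a guaranteed value.]

```python
# Illustrative sketch (Python, not Ada) of what Ada's 'Mod attribute
# computes: for a modular type with modulus 2**32, Hash_Type'Mod (X)
# is X mod 2**32, mathematically, so negative values wrap around.
MODULUS = 2 ** 32  # assumed Hash_Type'Modulus, per the LRM's advice

def hash_type_mod(x: int) -> int:
    """Reduce a (possibly negative, possibly 64-bit) integer into
    the range 0 .. MODULUS - 1, as 'Mod would."""
    return x % MODULUS  # Python's % is a mathematical mod

# A small value maps directly; a negative one wraps:
print(hash_type_mod(42))         # 42
print(hash_type_mod(-1))         # 4294967295
# Upper bits of a genuinely 64-bit value are discarded, as noted above:
print(hash_type_mod(2**40 + 7))  # 7
```

The last line shows the "eliminate the uppermost bits" effect: values differing only above bit 31 collide.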
* Re: Hash Type Size
  2013-08-18 21:05 Hash Type Size sbelmont700
  2013-08-19  1:03 ` AdaMagica
@ 2013-08-19 22:12 ` Randy Brukardt
  2013-08-31  6:22 ` Peter Brooks
  1 sibling, 1 reply; 15+ messages in thread

From: Randy Brukardt @ 2013-08-19 22:12 UTC (permalink / raw)

<sbelmont700@gmail.com> wrote in message news:1679ec49-424b-43bd-8f35-a5f69e658112@googlegroups.com...
> What is the size of Ada.Containers.Hash_Type intended to be?

It's implementation-defined, of course. The formal language makes no requirement on the size.

> The LRM advises that "Hash_Type'Modulus should be at least 2**32", but this
> makes hashing a value on a 64-bit machine particularly unpleasant, since my
> implementation (MinGW-64) still defines it as 32-bit. Is the intent that
> Hash_Type should be the native size of the machine (and the implementation
> is wrong/advice is vague), or that it actually should be 32-bit if possible?

Like Integer, the advice is that it should be at least a particular size, and nothing more is said. I'm pretty sure there is no intent at all (so far as I recall, we never discussed anything about that). The reason for the 32-bit advice is to ensure that there are plenty of values available (16-bit wouldn't be enough). If you want truly portable code, you have to hash a stream element representation of your data anyway, so there doesn't seem to be much advantage to requiring more than that.

So many Ada programmers make unwarranted assumptions about the sizes of things. Ada has no notion of "native size", and writing code that depends upon that is just not going to be portable. I've seen many pieces of code that assume that Integer has 32 bits (not true in Janus/Ada), that Long_Integer has 64 bits (not true in Janus/Ada), that the size of System.Address is the same as an access type (not true in 16-bit versions of Janus/Ada, and probably some others as well). Almost everyone assumes that integers have a two's complement representation (that caused all kinds of problems for our U2200 compiler, which of course is a one's complement, 36-bit machine).

Moral: unless the RM *requires* something, don't write code that depends upon it.

Randy.

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Hash Type Size
  2013-08-19 22:12 ` Randy Brukardt
@ 2013-08-31  6:22 ` Peter Brooks
  2013-08-31 15:57 ` sbelmont700
  0 siblings, 1 reply; 15+ messages in thread

From: Peter Brooks @ 2013-08-31 6:22 UTC (permalink / raw)

On Tuesday, 20 August 2013 00:12:13 UTC+2, Randy Brukardt wrote:
>
> Like Integer, the advice is that it should be at least a particular size,
> and nothing more is said. I'm pretty sure there is no intent at all (so far
> as I recall, we never discussed anything about that). The reason for the
> 32-bit advice is to ensure that there are plenty of values available (16-bit
> wouldn't be enough).
>
I'd be quite happy to make it explicit. It'd be good to have an Ada.Containers.Hash_Type64 to make it clear that that is what is being used. Even though there is no way that you'd ever have that number of items in a container, that isn't the point. The larger the size of the hash, the smaller the chance of collisions and, if the native machine size is 64 bits (or, in future, maybe 128 bits), then machine operations will be no slower in 64 than in 32 or 16 bits, so the reduction in collisions will make the application run much faster for large amounts of data.

You might say that this affects portability, which is, on the face of it, true. However, if you're using a 64-bit hash, then you're working with big data, so you'll not be using machines with a word size less than 64 bits, making portability to smaller word sizes irrelevant.

I agree completely that it should be explicit, though! It'd be good if the Ada manual recommended that the bit size was 'at least' 32 bits and that it matched the natural word size 'where possible'. So, on a 16-bit machine, hashing would have a performance penalty - so, for that case, it'd be useful to have an Ada.Containers.Hash_Type16, because that would probably be good enough for a real-time application that used hashing, and would be twice as fast.

^ permalink raw reply [flat|nested] 15+ messages in thread
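[Editor's note: the claim that a wider hash reduces collisions can be quantified with the standard birthday-bound estimate. The sketch below is Python for illustration, and the one-million-key figure is an invented example, not anything measured in the thread.]

```python
# For an ideal ("random") hash, the expected number of colliding pairs
# among n keys over m possible hash values is about n*(n-1)/(2*m).
# Widening the hash from 32 to 64 bits therefore shrinks the expected
# collision count by a factor of roughly 2**32.  These are analytic
# estimates, not measurements of any Ada implementation.

def expected_collisions(n: int, bits: int) -> float:
    m = 2 ** bits
    return n * (n - 1) / (2 * m)

for bits in (16, 32, 64):
    # one million keys: an invented, plausible "big data" workload
    print(bits, expected_collisions(1_000_000, bits))
```

At 16 bits the table is hopelessly overloaded, at 32 bits a few hundred pairs collide, and at 64 bits collisions are essentially impossible — which is the trade-off being argued here.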
* Re: Hash Type Size
  2013-08-31  6:22 ` Peter Brooks
@ 2013-08-31 15:57 ` sbelmont700
  2013-09-03  1:47 ` Randy Brukardt
  0 siblings, 1 reply; 15+ messages in thread

From: sbelmont700 @ 2013-08-31 15:57 UTC (permalink / raw)

To be honest, this was mostly just laziness on my part; there is no good (read: easy) way to hash an opaque data type, and so a quick and dirty (and admittedly non-ideal) way was to simply use the pointer as the key itself. But this is a lot tougher to rationalize now that they are different sizes, so I guess the real question is "what's a good way to hash a private type/access value/System.Address?"

It would be nice if there were an interface exclusively for doing this; something like a 'Hash attribute that could be applied to any type to return either an implementation-provided hash (like for strings) or that could be overridden by a programmer in fancier cases. You can already sort of hack it together now using Ada.Strings.Hash(T'Image(o)).

^ permalink raw reply [flat|nested] 15+ messages in thread
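[Editor's note: the Ada.Strings.Hash(T'Image(o)) hack can be mimicked in any language: render the value as text, then hash the text. The Python sketch below uses FNV-1a as an arbitrary stand-in string hash — the RM does not specify which algorithm Ada.Strings.Hash uses.]

```python
# Sketch of the "hash the 'Image of the value" trick.  FNV-1a is an
# assumed algorithm chosen for illustration only.

def fnv1a_32(s: str) -> int:
    """32-bit FNV-1a over the UTF-8 bytes of a string."""
    h = 0x811C9DC5                      # FNV offset basis
    for b in s.encode("utf-8"):
        h = ((h ^ b) * 0x01000193) % 2**32  # xor, then multiply by FNV prime
    return h

def image_hash(value) -> int:
    # T'Image(o) analogue: hash the value's textual representation
    return fnv1a_32(str(value))

# Works uniformly for anything with a printable image:
print(image_hash(42), image_hash((1, "abc")))
```

The catch, of course, is that equal hashes require equal images, so the image must be a deterministic function of the value — which is exactly why it is only a "sort of" hack for private types.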
* Re: Hash Type Size
  2013-08-31 15:57 ` sbelmont700
@ 2013-09-03  1:47 ` Randy Brukardt
  2013-09-03  2:31 ` Peter Brooks
  0 siblings, 1 reply; 15+ messages in thread

From: Randy Brukardt @ 2013-09-03 1:47 UTC (permalink / raw)

<sbelmont700@gmail.com> wrote in message news:782ef090-7299-4164-b4e5-14a06d1c1a44@googlegroups.com...
> To be honest, this was mostly just laziness on my part; there is no good
> (read: easy) way to hash an opaque data type, and so a quick and dirty
> (and admittedly non-ideal) way was to simply use the pointer as the key
> itself. But this is a lot tougher to rationalize now that they are
> different sizes, so I guess the real question is "what's a good way to
> hash a private type/access value/System.Address?"
>
> It would be nice if there were an interface exclusively for doing this;
> something like a 'Hash attribute that could be applied to any type to
> return either an implementation-provided hash (like for strings) or that
> could be overridden by a programmer in fancier cases. You can already
> sort of hack it together now using Ada.Strings.Hash(T'Image(o)).

We talked a bit about having some such thing, but the truth is that there isn't any obvious algorithm that would work well for all kinds of types. It would be very hard to write some sort of hash function that would work well on completely unknown data. We didn't think it was a good idea to make it easy to use a function that might not work on your actual data, as that would lead to piles of bug reports for implementers (and reports that they could not possibly do anything useful with).

So my answer to "what's a good way to hash a private type/access value/System.Address?" is that no such way exists. The best you can do is convert the data to a stream-element array and hash that somehow, but that most likely would be a dubious hash (there are a lot of unused bits in a typical record, and it would be really easy for those bits to end up being most of the result). The whole point is that a hash function needs to be tailored to the data that it is hashing.

Randy.

^ permalink raw reply [flat|nested] 15+ messages in thread
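[Editor's note: the "unused bits" caveat is easy to demonstrate. In the Python sketch below, the record layout — one meaningful byte padded out to eight — is invented for illustration; in Ada the bytes would come from a stream-element array. Two records whose single meaningful field is identical hash differently once their padding bytes differ.]

```python
# Hash the serialized bytes of a record, padding and all.
import struct

def bytes_hash(data: bytes) -> int:
    h = 0
    for b in data:
        h = (h * 31 + b) % 2**32   # simple polynomial hash over the bytes
    return h

# '<Bxxxxxxx' = one meaningful byte plus seven pad bytes (struct zeroes them)
rec = struct.pack("<Bxxxxxxx", 200)
print(bytes_hash(rec))

# If the pad bytes held junk instead of zeros, the hash would change even
# though the one meaningful field is identical:
junk = bytes([200]) + b"\x99" * 7
print(bytes_hash(rec) == bytes_hash(junk))   # False
```

Seven of the eight bytes hashed here carry no information at all, so whatever happens to be in them dominates the result — the "dubious hash" described above.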
* Re: Hash Type Size
  2013-09-03  1:47 ` Randy Brukardt
@ 2013-09-03  2:31 ` Peter Brooks
  2013-09-03 10:50 ` John B. Matthews
  0 siblings, 1 reply; 15+ messages in thread

From: Peter Brooks @ 2013-09-03 2:31 UTC (permalink / raw)

On Tuesday, 3 September 2013 03:47:52 UTC+2, Randy Brukardt wrote:
>
> So my answer to "what's a good way to hash a private type/access
> value/system.address?" is that no such way exists. The best you can do is
> convert the data to a stream-element array and hash that somehow, but that
> most likely would be a dubious hash (there is a lot of unused bits in a
> typical record, and it would be really easy for those bits to end up being
> most of the result). The whole point is that a hash function needs to be
> tailored to the data that it is hashing.
>
This looks like a good topic for somebody's Ph.D! I'd imagine it possible to have some broad applicability for certain algorithms. In the case of triplestores, the ones used by Wikipedia contain, mainly, English words and phrases, with an increasing number of entries in other languages. It'd be useful to know if a hash function that works for Wikipedia data in English is more or less effective when used on German or Korean. The large triplestores of genetic data are clearly a different matter, and I can see tailoring the algorithm there would be important - what, though, do you actually need to know about the data to design a hash algorithm?

With RDF, which all triplestores are made of, the question that's occupying me is how much benefit there is to encoding the different parts of the URI. The triples are of the form:

<http://dbpedia.org/resource/Alabama>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://schema.org/AdministrativeArea>

subject - predicate - object

Clearly the '<http://' and '>' don't need to be part of the hash. I'll need a mapping both ways, from subject -> predicate -> object and from object -> predicate -> subject; that's what'll be in the binary trees.

The predicates will be a much smaller set, so an associative array will probably be best - probably a three-way associative array, one with the source of the predicate 'www.w3.org/1999/02/22-rdf-syntax-ns', another with the particular predicate 'type', and the third with the whole URI - but the same hash key for each, as they're the same relation.

This is a bit of a simplification, as the triples are actually quads (subject, predicate, object, graph), but that doesn't affect the hashing question.

^ permalink raw reply [flat|nested] 15+ messages in thread
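[Editor's note: the predicate-interning idea above can be sketched briefly. The class below is Python and an assumed design for illustration, not Peter's actual one: each distinct predicate URI gets a small integer ID, so triples can store the ID instead of rehashing the full URI every time.]

```python
# Because predicates form a small set, intern each full predicate URI to
# a small integer ID once, and store triples as
# (subject_hash, predicate_id, object_hash).

class PredicateTable:
    def __init__(self):
        self._ids = {}    # stripped predicate URI -> small ID
        self._uris = []   # small ID -> stripped predicate URI

    def intern(self, uri: str) -> int:
        # strip the '<http://' and '>' wrapper, as suggested above
        key = uri
        if key.startswith("<http://"):
            key = key[len("<http://"):]
        if key.endswith(">"):
            key = key[:-1]
        if key not in self._ids:
            self._ids[key] = len(self._uris)
            self._uris.append(key)
        return self._ids[key]

preds = PredicateTable()
a = preds.intern("<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>")
b = preds.intern("<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>")
print(a == b)   # the same predicate interns to the same ID -> True
```

Looking a predicate up then costs one dictionary probe, and the three "views" of a predicate discussed above can all map to this one ID.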
* Re: Hash Type Size
  2013-09-03  2:31 ` Peter Brooks
@ 2013-09-03 10:50 ` John B. Matthews
  2013-09-03 17:18 ` Peter Brooks
  0 siblings, 1 reply; 15+ messages in thread

From: John B. Matthews @ 2013-09-03 10:50 UTC (permalink / raw)

In article <8268e85c-e372-4883-8449-ef5253e2c77e@googlegroups.com>, Peter Brooks <peter.h.m.brooks@gmail.com> wrote:

> The triples are of the form:
>
> <http://dbpedia.org/resource/Alabama>
> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
> <http://schema.org/AdministrativeArea>
>
> subject - predicate - object
>
> Clearly the '<http://' and '>' don't need to be part of the hash,
> I'll need a mapping both ways, from subject -> predicate ->
> object and from object -> predicate -> subject, that's what'll be
> in the binary trees.

As a concrete example, this program [1] tallies collisions using an instance of Ada.Strings.Bounded.Hash on words found in a dictionary. While it's specific to GNAT, it may suggest how to obtain a baseline against which to compare your implementation(s).

[1] <http://home.roadrunner.com/~jbmatthews/jumble.html#sec2>

-- 
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Hash Type Size
  2013-09-03 10:50 ` John B. Matthews
@ 2013-09-03 17:18 ` Peter Brooks
  2013-09-03 21:21 ` John B. Matthews
  2013-09-04  4:50 ` Paul Rubin
  0 siblings, 2 replies; 15+ messages in thread

From: Peter Brooks @ 2013-09-03 17:18 UTC (permalink / raw)

On Tuesday, 3 September 2013 12:50:04 UTC+2, John B. Matthews wrote:
>
> As a concrete example, this program [1] tallies collisions using an
> instance of Ada.Strings.Bounded.Hash on words found in a dictionary.
> While it's specific to GNAT, it may suggest how to obtain a baseline
> against which to compare your implementation(s).
>
> [1] <http://home.roadrunner.com/~jbmatthews/jumble.html#sec2>
>
Thank you for that - it's extremely useful. 32% collisions is far too high for me, so I certainly need a better hash function.

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Hash Type Size
  2013-09-03 17:18 ` Peter Brooks
@ 2013-09-03 21:21 ` John B. Matthews
  2013-09-04  4:50 ` Paul Rubin
  1 sibling, 0 replies; 15+ messages in thread

From: John B. Matthews @ 2013-09-03 21:21 UTC (permalink / raw)

In article <c6c24f9a-c5ba-4068-be58-c7d1cff4889b@googlegroups.com>, Peter Brooks <peter.h.m.brooks@gmail.com> wrote:

> On Tuesday, 3 September 2013 12:50:04 UTC+2, John B. Matthews wrote:
> >
> > As a concrete example, this program [1] tallies collisions
> > using an instance of Ada.Strings.Bounded.Hash on words found
> > in a dictionary. While it's specific to GNAT, it may suggest
> > how to obtain a baseline against which to compare your
> > implementation(s).
> >
> > [1] <http://home.roadrunner.com/~jbmatthews/jumble.html#sec2>
> >
> Thank you for that - it's extremely useful. 32% collisions is far
> too high for me, so I certainly need a better hash function.

You are welcome. That 2009 data was obtained using GNAT 4.3.4; newer versions are typically better. For example, the GNAT 4.6 implementation yields the following result:

./collisions
0: 305643 (77.72%)
1: 76947 (19.57%)
2: 9790 (2.49%)
3: 805 (0.20%)
4: 56 (0.01%)
5: 1 (0.00%)

-- 
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Hash Type Size
  2013-09-03 17:18 ` Peter Brooks
  2013-09-03 21:21 ` John B. Matthews
@ 2013-09-04  4:50 ` Paul Rubin
  2013-09-04  4:54 ` Paul Rubin
  1 sibling, 1 reply; 15+ messages in thread

From: Paul Rubin @ 2013-09-04 4:50 UTC (permalink / raw)

Peter Brooks <peter.h.m.brooks@gmail.com> writes:
>> [1] <http://home.roadrunner.com/~jbmatthews/jumble.html#sec2>
> Thank you for that - it's extremely useful. 32% collisions is far too
> high for me, so I certainly need a better hash function.

It doesn't say how many words are in the dictionary, so it doesn't tell you anything. A hash function should approximate a random function, which means you get collisions starting around sqrt(number of slots in the table).

Re "tailoring the hash function to the data", this is called perfect hashing and there's lots of stuff online about how to do it (see Google). It's useful in some situations, but not that many. What is your actual application?

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Hash Type Size
  2013-09-04  4:50 ` Paul Rubin
@ 2013-09-04  4:54 ` Paul Rubin
  2013-09-05 19:30 ` John B. Matthews
  0 siblings, 1 reply; 15+ messages in thread

From: Paul Rubin @ 2013-09-04 4:54 UTC (permalink / raw)

Paul Rubin <no.email@nospam.invalid> writes:
> It doesn't say how many words are in the dictionary,

Wait, it gives the numbers:

0: 305643 (77.72%)
1: 76947 (19.57%)
2: 9790 (2.49%)
3: 805 (0.20%)
4: 56 (0.01%)
5: 1 (0.00%)

i.e. there are around 492K words in a 250K slot hash table. Of course there will be a lot of collisions. If you want fewer collisions, you need more slots than words.

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Hash Type Size
  2013-09-04  4:54 ` Paul Rubin
@ 2013-09-05 19:30 ` John B. Matthews
  0 siblings, 0 replies; 15+ messages in thread

From: John B. Matthews @ 2013-09-05 19:30 UTC (permalink / raw)

In article <7x7gexqj9d.fsf@ruckus.brouhaha.com>, Paul Rubin <no.email@nospam.invalid> wrote:

> Paul Rubin <no.email@nospam.invalid> writes:
> > It doesn't say how many words are in the dictionary,
> Wait, it gives the numbers:
>
> 0: 305643 (77.72%)
> 1: 76947 (19.57%)
> 2: 9790 (2.49%)
> 3: 805 (0.20%)
> 4: 56 (0.01%)
> 5: 1 (0.00%)
>
> i.e. there are around 492K words in a 250K slot hash table. Of
> course there will be a lot of collisions. If you want fewer
> collisions, you need more slots than words.

Thank you for commenting on this. I see now that the raw percentages are misleading. The number of slots reserved in the Table array is equal to the sum of the elements in the Counts Map. The number of dictionary words hashed is the sum of the products of each key/element pair in the Counts Map. In the example above, the 393241 slots reserved are much larger than required for the 99171 words in the Ubuntu dictionary; the load factor is just 25%.

Focusing on the Counts Map, node 0 is the remaining empty slots in the Table array. Node 1 implies that 1 * 76947 / 99171 = 78% of the words can be retrieved directly, while node 5 implies that just 5 * 1 / 99171 << 1% of the words may require up to four additional operations to retrieve.

I've updated the program to adapt the Table size and show a few details.

<http://home.roadrunner.com/~jbmatthews/jumble.html#sec2>

Next to each count is the percentage of words in that occupancy class. Note that node zero is unused table entries; node one is table entries occupied by a single word; node two is table entries occupied by two words; etc. Hashing the same dictionary as above shows the following results.

./collisions /usr/share/dict/words
Word count: 99171
Table size: 196613
Load factor: 50.44%
0: 118762 (0.00%)
1: 59770 (60.27%)
2: 15218 (30.69%)
3: 2534 (7.67%)
4: 288 (1.16%)
5: 36 (0.18%)
6: 4 (0.02%)
7: 1 (0.01%)

These results were obtained using the larger dictionary on Mac OS X:

./collisions /usr/share/dict/words
Word count: 234936
Table size: 393241
Load factor: 59.74%
0: 218461 (0.00%)
1: 126393 (53.80%)
2: 38524 (32.80%)
3: 8228 (10.51%)
4: 1400 (2.38%)
5: 203 (0.43%)
6: 29 (0.07%)
7: 2 (0.01%)
8: 1 (0.00%)

-- 
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>

^ permalink raw reply [flat|nested] 15+ messages in thread
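[Editor's note: John's occupancy counts can be checked against what an ideal hash — a "random function", in Paul's sense — would predict. With n keys thrown into m slots, the number of slots holding exactly k keys is approximately Poisson-distributed with mean n/m. A Python check under that assumption:]

```python
# Expected number of slots holding exactly k keys for an ideal random
# hash: m * exp(-lam) * lam**k / k!, where lam = n/m (Poisson approximation).
import math

def expected_slots(n: int, m: int, k: int) -> float:
    lam = n / m
    return m * math.exp(-lam) * lam**k / math.factorial(k)

n, m = 99171, 196613          # John's GNAT word count and table size
for k in range(4):
    print(k, round(expected_slots(n, m, k)))
```

For k = 0..3 this lands within about one percent of the GNAT figures above (118762, 59770, 15218, 2534), which suggests that hash is behaving essentially randomly on this dictionary — the collisions observed are just what the load factor dictates, not a defect in the hash function.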