comp.lang.ada
 help / color / mirror / Atom feed
From: "John B. Matthews" <nospam@nospam.invalid>
Subject: Re: Hash Type Size
Date: Thu, 05 Sep 2013 15:30:14 -0400
Date: 2013-09-05T15:30:14-04:00	[thread overview]
Message-ID: <nospam-0D8F9C.15301405092013@news.aioe.org> (raw)
In-Reply-To: 7x7gexqj9d.fsf@ruckus.brouhaha.com

In article <7x7gexqj9d.fsf@ruckus.brouhaha.com>,
 Paul Rubin <no.email@nospam.invalid> wrote:

> Paul Rubin <no.email@nospam.invalid> writes:
> > It doesn't say how many words are in the dictionary,
> Wait, it gives the numbers
> 
>  0: 305643 (77.72%)
>  1: 76947 (19.57%)
>  2: 9790 (2.49%)
>  3: 805 (0.20%)
>  4: 56 (0.01%)
>  5: 1 (0.00%)
> 
> i.e. there are around 492K words in a 250K slot hash table.  Of 
> course there will be a lot of collisions.  If you want fewer 
> collisions, you need more slots than words.

Thank you for commenting on this. I see now that the raw percentages are 
misleading. The number of slots reserved in the Table array is equal to 
the sum of the elements in the Counts Map. The number of dictionary 
words hashed is the sum of the product of each key/element pair in the 
Counts Map.

In the example above, the 393241 slots reserved is much larger than 
required for the 99171 words in the Ubuntu dictionary; the load factor 
is just 25%. Focusing on the Counts Map, node 0 is the remaining empty 
slots in the Table array. Node 1 implies that 1 * 76947 / 99171 = 78% of 
the words can be retrieved directly, while node 5 implies that just 5 * 
1 / 99171 << 1% of the words may require up to four additional 
operations to retrieve.

I've updated the program to adapt the Table size and show a few details.

<http://home.roadrunner.com/~jbmatthews/jumble.html#sec2>

Next to each count is the percentage of words in that occupancy class. 
Note that node zero is unused table entries; node one is table entries 
occupied by a single word; node two is table entries occupied by two 
words; etc. Hashing the same dictionary as above shows the following 
results. 

./collisions
/usr/share/dict/words
Word count: 99171
Table size: 196613
Load factor: 50.44%
 0: 118762 (0.00%)
 1: 59770 (60.27%)
 2: 15218 (30.69%)
 3: 2534 (7.67%)
 4: 288 (1.16%)
 5: 36 (0.18%)
 6: 4 (0.02%)
 7: 1 (0.01%)

These results were obtained using the larger dictionary on Mac OS X:

./collisions
/usr/share/dict/words
Word count: 234936
Table size: 393241
Load factor: 59.74%
 0: 218461 (0.00%)
 1: 126393 (53.80%)
 2: 38524 (32.80%)
 3: 8228 (10.51%)
 4: 1400 (2.38%)
 5: 203 (0.43%)
 6: 29 (0.07%)
 7: 2 (0.01%)
 8: 1 (0.00%)

-- 
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>


      reply	other threads:[~2013-09-05 19:30 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-08-18 21:05 Hash Type Size sbelmont700
2013-08-19  1:03 ` AdaMagica
2013-08-19 22:21   ` Randy Brukardt
2013-08-19 22:29     ` Randy Brukardt
2013-08-19 22:12 ` Randy Brukardt
2013-08-31  6:22   ` Peter Brooks
2013-08-31 15:57     ` sbelmont700
2013-09-03  1:47       ` Randy Brukardt
2013-09-03  2:31         ` Peter Brooks
2013-09-03 10:50           ` John B. Matthews
2013-09-03 17:18             ` Peter Brooks
2013-09-03 21:21               ` John B. Matthews
2013-09-04  4:50               ` Paul Rubin
2013-09-04  4:54                 ` Paul Rubin
2013-09-05 19:30                   ` John B. Matthews [this message]
replies disabled

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox