From: "John B. Matthews" <nospam@nospam.invalid>
Subject: Re: Hash Type Size
Date: Thu, 05 Sep 2013 15:30:14 -0400
Date: 2013-09-05T15:30:14-04:00 [thread overview]
Message-ID: <nospam-0D8F9C.15301405092013@news.aioe.org> (raw)
In-Reply-To: 7x7gexqj9d.fsf@ruckus.brouhaha.com
In article <7x7gexqj9d.fsf@ruckus.brouhaha.com>,
Paul Rubin <no.email@nospam.invalid> wrote:
> Paul Rubin <no.email@nospam.invalid> writes:
> > It doesn't say how many words are in the dictionary,
> Wait, it gives the numbers
>
> 0: 305643 (77.72%)
> 1: 76947 (19.57%)
> 2: 9790 (2.49%)
> 3: 805 (0.20%)
> 4: 56 (0.01%)
> 5: 1 (0.00%)
>
> i.e. there are around 492K words in a 250K slot hash table. Of
> course there will be a lot of collisions. If you want fewer
> collisions, you need more slots than words.
Thank you for commenting on this. I see now that the raw percentages are
misleading. The number of slots reserved in the Table array is equal to
the sum of the elements in the Counts Map. The number of dictionary
words hashed is the sum of the product of each key/element pair in the
Counts Map.
In the example above, the 393241 slots reserved is much larger than
required for the 99171 words in the Ubuntu dictionary; the load factor
is just 25%. Focusing on the Counts Map, node 0 is the remaining empty
slots in the Table array. Node 1 implies that 1 * 76947 / 99171 = 78% of
the words can be retrieved directly, while node 5 implies that just 5 *
1 / 99171 << 1% of the words may require up to four additional
operations to retrieve.
I've updated the program to adapt the Table size and show a few details.
<http://home.roadrunner.com/~jbmatthews/jumble.html#sec2>
Next to each count is the percentage of words in that occupancy class.
Note that node zero is unused table entries; node one is table entries
occupied by a single word; node two is table entries occupied by two
words; etc. Hashing the same dictionary as above shows the following
results.
./collisions
/usr/share/dict/words
Word count: 99171
Table size: 196613
Load factor: 50.44%
0: 118762 (0.00%)
1: 59770 (60.27%)
2: 15218 (30.69%)
3: 2534 (7.67%)
4: 288 (1.16%)
5: 36 (0.18%)
6: 4 (0.02%)
7: 1 (0.01%)
These results were obtained using the larger dictionary on Mac OS X:
./collisions
/usr/share/dict/words
Word count: 234936
Table size: 393241
Load factor: 59.74%
0: 218461 (0.00%)
1: 126393 (53.80%)
2: 38524 (32.80%)
3: 8228 (10.51%)
4: 1400 (2.38%)
5: 203 (0.43%)
6: 29 (0.07%)
7: 2 (0.01%)
8: 1 (0.00%)
--
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>
prev parent reply other threads:[~2013-09-05 19:30 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-08-18 21:05 Hash Type Size sbelmont700
2013-08-19 1:03 ` AdaMagica
2013-08-19 22:21 ` Randy Brukardt
2013-08-19 22:29 ` Randy Brukardt
2013-08-19 22:12 ` Randy Brukardt
2013-08-31 6:22 ` Peter Brooks
2013-08-31 15:57 ` sbelmont700
2013-09-03 1:47 ` Randy Brukardt
2013-09-03 2:31 ` Peter Brooks
2013-09-03 10:50 ` John B. Matthews
2013-09-03 17:18 ` Peter Brooks
2013-09-03 21:21 ` John B. Matthews
2013-09-04 4:50 ` Paul Rubin
2013-09-04 4:54 ` Paul Rubin
2013-09-05 19:30 ` John B. Matthews [this message]
replies disabled
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox