Re: Hash Type Size - Peter Brooks

comp.lang.ada

From: Peter Brooks <peter.h.m.brooks@gmail.com>
Subject: Re: Hash Type Size
Date: Mon, 2 Sep 2013 19:31:56 -0700 (PDT)
Message-ID: <8268e85c-e372-4883-8449-ef5253e2c77e@googlegroups.com>
In-Reply-To: <l03f49$8db$1@loke.gir.dk>

On Tuesday, 3 September 2013 03:47:52 UTC+2, Randy Brukardt  wrote:
> 
> So my answer to "what's a good way to hash a private type/access 
> value/system.address?" is that no such way exists. The best you can do is 
> convert the data to a stream-element array and hash that somehow, but that 
> most likely would be a dubious hash (there is a lot of unused bits in a 
> typical record, and it would be really easy for those bits to end up being 
> most of the result). The whole point is that a hash function needs to be 
> tailored to the data that it is hashing.
> 
This looks like a good topic for somebody's Ph.D!

I'd imagine some algorithms could have broad applicability. In the case of triplestores, the ones used by Wikipedia contain mainly English words and phrases, with an increasing number of entries in other languages. It'd be useful to know whether a hash function that works well on Wikipedia data in English is more or less effective when used on German or Korean.
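One way to put numbers on that question is to hash a sample of keys from each language into a fixed number of buckets and compare how evenly they spread. The following is only a sketch under assumptions: it uses 32-bit FNV-1a over the UTF-8 bytes as a stand-in hash (not anything proposed in this thread), a chi-squared statistic as the evenness measure, and tiny made-up word lists in place of real triplestore keys.

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a, used here only as an example hash function."""
    h = 0x811C9DC5                          # FNV offset basis
    for b in data:
        h ^= b
        h = (h * 0x01000193) & 0xFFFFFFFF   # FNV prime, kept to 32 bits
    return h


def bucket_counts(keys, n_buckets):
    """Count how many keys land in each of n_buckets buckets."""
    counts = [0] * n_buckets
    for key in keys:
        counts[fnv1a_32(key.encode("utf-8")) % n_buckets] += 1
    return counts


def chi_squared(counts):
    """Deviation from a perfectly even spread; smaller means more even."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


# Hypothetical sample keys; a real test would use keys drawn from the data.
english = ["Alabama", "river", "county", "state", "capital", "mountain"]
german = ["Fluss", "Landkreis", "Bundesland", "Hauptstadt", "Gebirge", "Straße"]

for label, keys in (("English", english), ("German", german)):
    print(label, chi_squared(bucket_counts(keys, 4)))
```

With samples this small the figures mean nothing; the comparison only becomes informative with many thousands of real keys per language and a bucket count close to the intended table size.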

The large triplestores of genetic data are clearly a different matter and I can see tailoring the algorithm there would be important - what, though, do you actually need to know about the data to design a hash algorithm?

With RDF, which all triplestores are made of, the question that's occupying me is how much benefit there is to encoding the different parts of the URI.

The triples are of the form:

<http://dbpedia.org/resource/Alabama> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/AdministrativeArea>

subject - predicate - object

Clearly the '<http://' and '>' don't need to be part of the hash. I'll need a mapping both ways, from subject -> predicate -> object and from object -> predicate -> subject; that's what'll be in the binary trees.
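To make the two-way mapping concrete, here is a rough sketch. It assumes every term is a plain URI in angle brackets (no literals or blank nodes), and it uses nested dictionaries purely for brevity where the plan above is binary trees. The names strip_uri, parse_triple, add_triple, forward and reverse are invented for the illustration.

```python
from collections import defaultdict


def strip_uri(term):
    """Drop the enclosing '<' and '>' and a leading 'http://' from a term."""
    term = term.strip()
    if term.startswith("<") and term.endswith(">"):
        term = term[1:-1]
    if term.startswith("http://"):
        term = term[len("http://"):]
    return term


def parse_triple(line):
    """Split one line into (subject, predicate, object), all stripped.

    Assumes three whitespace-separated URI terms; anything after the
    third term (such as a trailing '.') is ignored.
    """
    subject, predicate, obj = line.split()[:3]
    return strip_uri(subject), strip_uri(predicate), strip_uri(obj)


# subject -> predicate -> set of objects, and the reverse direction.
forward = defaultdict(lambda: defaultdict(set))
reverse = defaultdict(lambda: defaultdict(set))


def add_triple(line):
    """Record one triple in both directions."""
    s, p, o = parse_triple(line)
    forward[s][p].add(o)
    reverse[o][p].add(s)
    return s, p, o


add_triple("<http://dbpedia.org/resource/Alabama> "
           "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
           "<http://schema.org/AdministrativeArea>")
```

After that call, forward can be walked from 'dbpedia.org/resource/Alabama' to its type, and reverse from 'schema.org/AdministrativeArea' back to Alabama.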

The predicates will be a much smaller set, so an associative array will probably be best. Probably a three-way associative array: one keyed on the source of the predicate ('www.w3.org/1999/02/22-rdf-syntax-ns'), another on the particular predicate ('type') and the third on the whole URI, but with the same hash-key for each, as they're the same relation.
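A sketch of how that three-way lookup might hang together, on the assumption that "the same hash-key" means one small integer per distinct predicate URI. Since a source or a local name on its own can belong to several predicates, those two tables map to sets of keys here. PredicateTable and split_predicate are hypothetical names, and splitting at the last '#' or '/' is my guess at where the source ends.

```python
def split_predicate(uri):
    """Split a predicate URI into (source, local name).

    Splits at the last '#' if there is one, otherwise at the last '/'.
    """
    for sep in ("#", "/"):
        if sep in uri:
            source, name = uri.rsplit(sep, 1)
            return source, name
    return uri, ""


class PredicateTable:
    """Three associative arrays that share one key per predicate."""

    def __init__(self):
        self.by_uri = {}      # whole URI  -> key
        self.by_source = {}   # source     -> set of keys
        self.by_name = {}     # local name -> set of keys

    def add(self, uri):
        """Return the key for uri, allocating a new one if needed."""
        if uri in self.by_uri:
            return self.by_uri[uri]
        key = len(self.by_uri)
        source, name = split_predicate(uri)
        self.by_uri[uri] = key
        self.by_source.setdefault(source, set()).add(key)
        self.by_name.setdefault(name, set()).add(key)
        return key


table = PredicateTable()
rdf_type = table.add("www.w3.org/1999/02/22-rdf-syntax-ns#type")
```

Looking up 'type' in by_name, the namespace in by_source, or the whole URI in by_uri then leads to the same key, rdf_type.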

This is a bit of a simplification, as the triples are actually quads (subject, predicate, object, graph), but that doesn't affect the hashing question.


