Whatâ€™ s the best general implementation of custom #hash methods?

Shot_SPiotr_S.S · September 10, 2009, 5:56pm

Hello, ruby-talk.

Whatâ€™s the best (fast, yet with relatively few cases when equal hashes
are generated for non-eql objects) way to implement a hash method?

Is XOR-ing the hashes of an objectâ€™s instance variables (assuming that
two objects are eql if their ivars are eql) a good general approach?

Iâ€™m asking because I use this pattern in my code, and it seems quite
fast; I needed to work around a bug in MRIâ€™s Set#hash and this solutionÂ¹
seems about as fast as the original MRIâ€™s Set#hash. Iâ€™m just not sure
whether itâ€™s hashy enough (so to speak) in that two un-eql objects
hashed in this way produce different hashes often enough.

Â¹

â€” Shot

Shot_SPiotr_S.S · September 10, 2009, 6:53pm

On Thu, Sep 10, 2009 at 11:56 AM, Shot (Piotr S.)[email protected]
wrote:

seems about as fast as the original MRI’s Set#hash. I’m just not sure
whether it’s hashy enough (so to speak) in that two un-eql objects
hashed in this way produce different hashes often enough.

Well, let’s look at how Array#hash works here’s the 1.8 C code:

static VALUE
rb_ary_hash(ary)
VALUE ary;
{
long i, h;
VALUE n;

h = RARRAY(ary)->len;
for (i=0; i<RARRAY(ary)->len; i++) {

h = (h << 1) | (h<0 ? 1 : 0);
n = rb_hash(RARRAY(ary)->ptr[i]);
h ^= NUM2LONG(n);
}
return LONG2FIX(h);
}

This looks like a typical cryptographic hash algorithm pattern
combining bit rotation and xor. It seeds the value with the length of
the array, then for each element, rotates the value 1 bit to the left,
then xors in the elements hash.

This scrambles the bits more than just a straight xor

–
Rick DeNatale

Blog: http://talklikeaduck.denhaven2.com/
Twitter: http://twitter.com/RickDeNatale
WWR: http://www.workingwithrails.com/person/9021-rick-denatale
LinkedIn: http://www.linkedin.com/in/rickdenatale

Shot_SPiotr_S.S · September 10, 2009, 8:04pm

Shot (Piotr S.) wrote:

Hello, ruby-talk.

What’s the best (fast, yet with relatively few cases when equal hashes
are generated for non-eql objects) way to implement a hash method?

One way is to just put everything you care about into an array and hash
that, as in:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/230421

But it might not qualify as “fast”, I suppose.

Shot_SPiotr_S.S · September 11, 2009, 2:10am

You could check out some of these (in several languages including Ruby):

Shot_SPiotr_S.S · November 5, 2009, 9:50pm

On Sep 10, 10:49 am, Rick DeNatale [email protected] wrote:

fast; I needed to work around a bug in MRI’s Set#hash and this solution¹
long i, h;
}

This looks like a typical cryptographic hash algorithm pattern
combining bit rotation and xor. It seeds the value with the length of
the array, then for each element, rotates the value 1 bit to the left,
then xors in the elements hash.

This scrambles the bits more than just a straight xor

–
Rick DeNatale

I must take STRONG exception to your characterization of this
algorithm. This method performs parallel xors of the data. Change one
bit of one of the values, and you change one bit of the hash. A
fundamental property of cryptographic hashes is that changing one bit
of the input changes many bits of output. This is even true of crcs.
Cryptographic hashes do rotate & xor (sometimes), but only as part of
a larger operation.

Shot_SPiotr_S.S · September 11, 2009, 2:16am

On 9/10/09, Shot (Piotr S.) [email protected] wrote:

Hello, ruby-talk.

What’s the best (fast, yet with relatively few cases when equal hashes
are generated for non-eql objects) way to implement a hash method?

Is XOR-ing the hashes of an object’s instance variables (assuming that
two objects are eql if their ivars are eql) a good general approach?

The drawback to this algorithm would be that if you have the same
values but in different variables, they’ll end up hashing the same.
For example:

class Pair
def initialize(a,b)
@a,@b=a,b
end
def hash
@a.hash^@b.hash
end
end

Pair.new(“foo”,“bar”).hash==Pair.new(“bar”,“foo”).hash #=>true

As a practical matter, this often doesn’t matter. It depends on the
application and if you think this kind of situation is likely. For
general purpose use, it’s better to have some way of encoding the
‘location’ of each sub-value in the hash as well.