What’s the standard way of implementing #hash for value objects in Ruby?

Nikolai_W · December 31, 2011, 3:16pm

On Sat, Dec 31, 2011 at 13:58, Robert K.
[email protected] wrote:

On Fri, Dec 30, 2011 at 3:43 PM, Nikolai W. [email protected] wrote:

But still, I don’t see the need. Note also that a proper Hash key
usually should be immutable because changing them causes all sorts of
trouble if not done carefully.

Hence the use of “value object” in my question.

I can’t see that “value object” implies “immutable”.

http://c2.com/cgi/wiki?ValueObject
Value object - Wikipedia

Continue reading:

http://c2.com/cgi/wiki?ValueObjectsShouldBeImmutable

Second, let me rephrase my
question and add some additional context and examples:

What algorithm should one employ in the calculation of the hash value
of an arbitrary value object?

There is no single standard (or best) way. The fact that different
languages (Java, Ruby…) have different means to calculate combined
hash values which all seem to work pretty well indicates this IMHO.

Really? Everyone seems to use the XOR method (with good cause).

As I’ve already pointed out, internally, Ruby does something
completely different.

I would claim that the algorithm should take the class of the object
into account as well, both for consistency with #== (which should
check equality of the classes of the objects being compared) and for
added entropy.

You pay a price for additional calculation though.

Since the fields are immutable, the result of the calculation can be
cached, so that’s not a valid reason to exclude it.

end
end

Might it be useful to have Ruby expose a way to perform this
calculation from the Ruby realm so that other classes may employ this
algorithm?

Not sure whether we would really gain that much. Those calls are
efficient in C but if you provide that mechanism in Ruby land you will
have multiple calls, e.g.

def hash
h = Fixnum::HASH_START
h = h.combine_hash(@a)
h = h.combine_hash(@b)
h = h.combine_hash(@c)
end

I don’t understand what you’re getting at with this example. It
doesn’t seem to add anything to the discussion. My example code,
which shows how Ruby does it internally for Struct, makes multiple
calls. Since these methods would be simple wrappers of the C
functions, the hash calculation would (almost) be as fast as it would
be for Struct.

Nikolai_W · December 31, 2011, 1:58pm

On Fri, Dec 30, 2011 at 3:43 PM, Nikolai W. [email protected] wrote:

identical influence on the hash code. Consequently likelihood of
collisions increases and thus performance of Hash storage and
retrieval may decrease.

This is exactly what you suggest in your article, with the addition
(or rather XOR) of the hash value of the objects class.

Right. I should probably have given more detail in this thread with
regard to what goals this is “worse” than the simple XOR of member
hash values.

But still, I don’t see the need. Note also that a proper Hash key
usually should be immutable because changing them causes all sorts of
trouble if not done carefully.

Hence the use of value object in my question.

I can’t see that “value object” implies “immutable”.

http://c2.com/cgi/wiki?ValueObject

And most of the time Hash keys are
String, Symbol, Fixnum and the like - which all do have their #hash
implementation already.

But how do you combine them? Thats what this whole thread has been
about. I realize now that I made quite a few assumptions about what I
thought readers would understand from my original question that may
not have been as obvious to them as they were to me.

There you have learned something about communication.

First, let me apologize for this lack of clarity.

No sweat.

Second, let me rephrase my
question and add some additional context and examples:

What algorithm should one employ in the calculation of the hash value
of an arbitrary value object?

There is no single standard (or best) way. The fact that different
languages (Java, Ruby…) have different means to calculate combined
hash values which all seem to work pretty well indicates this IMHO.

As an example, what algorithm should one employ in the calculation of
the hash value of an immutable class A containing three immutable
instance variables @a, @b, @c that contain, respectively, a String, a
Fixnum, and a Symbol and that are all used in the calculation of #==?

In this case I would choose the most straightforward solution: XOR of
all member hash codes. This is sufficient since you indicated that no
value can show up in two different fields.

class A
def hash
[@a, @b, @c].hash
end
end

These are the two main implementations in the Standard Library,
anyway. The first is also the solution proposed by Robert K. in

http://blog.rubybestpractices.com/posts/rklemme/018-Complete_Class.html

And it does work indeed. But keep in mind that the article is not a
specific article about calculating hash codes but rather about what
has to be done to make a class “complete”. Hash codes are just one
aspect.

I would claim that the algorithm should take the class of the object
into account as well, both for consistency with #== (which should
check equality of the classes of the objects being compared) and for
added entropy.

Considering the most common case that all keys in a Hash are of the
same class it is questionable whether this will really add entropy
since then for all instances the class’s hash will be the same. You
pay a price for additional calculation though.

end
end

Might it be useful to have Ruby expose a way to perform this
calculation from the Ruby realm so that other classes may employ this
algorithm?

Not sure whether we would really gain that much. Those calls are
efficient in C but if you provide that mechanism in Ruby land you will
have multiple calls, e.g.

def hash
h = Fixnum::HASH_START
h = h.combine_hash(@a)
h = h.combine_hash(@b)
h = h.combine_hash(@c)
end

It may turn out to be more efficient to just do

def hash
[@a, @b, @c].hash
end

or simply use instances of Struct generated classes and reuse their
implementation.

Requirements for a hash function are pretty clear (and pretty much
standard):

It should make it as unlikely as possible that two objects which
are not equivalent have the same hash value.
Calculation should be as cheap (in terms of CPU cycles) as possible.

Hash function - Wikipedia has more detail

As you can see these are no hard numeric or formal requirements and
they both have potential to contradict each other. So it depends on
the situation what algorithm one chooses. It is always a trade off
between these two goals and there is no single solution which is best
under all circumstances.

I have attached a stats generator for existing Hashes. So anybody
interested can play with it and find out the distribution of hash
values for a given Hash.

Kind regards

robert

Nikolai_W · January 2, 2012, 6:16pm

On Sat, Dec 31, 2011 at 3:16 PM, Nikolai W. [email protected] wrote:

http://c2.com/cgi/wiki?ValueObject
Value object - Wikipedia

Continue reading:

http://c2.com/cgi/wiki?ValueObjectsShouldBeImmutable

Ah, OK. Still it’s not a “must”.

Really? Everyone seems to use the XOR method (with good cause).
No, not the simple “XOR all hash codes” method. Consider
java.util.AbstractList

public int hashCode() {

int hashCode = 1;
Iterator i = iterator();
while (i.hasNext()) {
E obj = i.next();
hashCode = 31*hashCode + (obj==null ? 0 : obj.hashCode());
}
return hashCode;
}

As Ive already pointed out, internally, Ruby does something
completely different.

It’s bit manipulations as well as far as I can see - but more complex
than the simple XOR all or the Java version.

I would claim that the algorithm should take the class of the object
into account as well, both for consistency with #== (which should
check equality of the classes of the objects being compared) and for
added entropy.

You pay a price for additional calculation though.

Since the fields are immutable, the result of the calculation can be
cached, so thats not a valid reason to exclude it.

You can only cache it if the object is frozen. And the question still
remains to be answered whether there is a significant gain by having
the class’s hash included.

end
def hash
h = Fixnum::HASH_START
h = h.combine_hash(@a)
h = h.combine_hash(@b)
h = h.combine_hash(@c)
end

I dont understand what youre getting at with this example. It
doesnt seem to add anything to the discussion.

It demonstrates that whatever would be exposed to Ruby land would need
multiple method calls to avoid object creation overhead. If you are
willing to pay that price you can just use [@a,@b,@c].hash.

My example code,
which shows how Ruby does it internally for Struct, makes multiple
calls. Since these methods would be simple wrappers of the C
functions, the hash calculation would (almost) be as fast as it would
be for Struct.

You still have some overhead. And then it might be more efficient to
use Array’s implementation. It’s certainly simpler.

I still cannot see why there should be a “standard way of implementing
#hash for value objects”. We have Struct’s way and Array’s way and we
can use other approaches like XOR all. Why would a standard make
things better?

Cheers

robert