On Wed, Jul 26, 2006 at 05:03:50AM +0900, Charles O Nutter wrote:
> And even in the TreeMap case, you sacrifice performance for features.
Very true. While a peek at the source code will reveal that the
Red-Black tree it uses is well-implemented, nothing comes for free.
> An unordered hash will be the fastest for lookup and insertion.
That depends, though it is generally true. However, your data
structures should suit the problem domain and the constraints you’re
working under.
> Adding any additional features will slow it down out of the necessity
> of imposing an order on either keys or values. With TreeMap, I think
> it’s something like O(log n) lookup times rather than the theoretical
> O(1) with a straight hash, and that’s not counting any tree
> rebalancing needed after future insertions.
Theoretically. However, while looking at speed through the lens of time
complexity can be a good rule of thumb, it can also be misleading.
For instance, say you had a dataset you had to do lookups upon, and
you’d two choices as to the method to use. The first is a linear scan
of the unordered data, and the other is some keyed lookup. The former
would give you O(n) performance, while the latter would give O(1).
If you didn’t have any more information, you’d be forgiven for choosing
the latter method. But say scanning each element takes only 0.01ms,
whereas the calculation required for each keyed lookup takes 2s: it’s a
no-brainer to pick the O(n) method over the O(1) method if your dataset
is of the right size.
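To put rough numbers on that, here’s a hypothetical sketch using the
figures above (0.01ms per scanned element, 2s per keyed lookup):

```ruby
SCAN_COST_MS  = 0.01    # per element scanned (figure from the text)
KEYED_COST_MS = 2000.0  # per keyed lookup (figure from the text)

# Expected cost of a single lookup on a dataset of n elements.
scan_cost  = ->(n) { (n / 2.0) * SCAN_COST_MS } # scan half the data on average
keyed_cost = ->(_) { KEYED_COST_MS }            # flat cost, regardless of n

# Below this size, the "slower" O(n) scan beats the O(1) keyed lookup.
break_even = (2 * KEYED_COST_MS / SCAN_COST_MS).to_i
puts break_even             # => 400000
puts scan_cost.call(10_000) # => 50.0 (ms), vs. 2000.0 for the keyed lookup
```

So with those costs, the linear scan wins for anything under about
400,000 elements, despite its worse asymptotic complexity.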
In the case of hashes, you only get O(1) if (a) the hash can be
calculated in constant time and (b) you’re using a perfect hash
function. Strictly speaking, just as Quicksort’s worst-case performance
is O(n^2), insertion and lookup for hashtables degrade in the worst
case to O(n): when every key collides, the table degenerates into
something resembling a bunch of keyed linked lists, and each operation
costs the time to convert the key to an index plus a linear scan of the
appropriate linked list. Such cases are rare, but they’re inherent in
the way hashes are implemented.
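You can provoke that worst case in Ruby by giving every key the same
hash code (BadKey below is a made-up class for illustration): with all
keys landing in one bucket, every lookup becomes a linear eql? scan of
a single chain.

```ruby
class BadKey
  attr_reader :id

  def initialize(id)
    @id = id
  end

  # Every instance hashes to the same bucket, defeating the hash function.
  def hash
    0
  end

  def eql?(other)
    other.is_a?(BadKey) && id == other.id
  end
end

h = {}
500.times { |i| h[BadKey.new(i)] = i }

# Still correct, but each lookup now walks the one chain: O(n), not O(1).
puts h[BadKey.new(499)] # => 499
```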
Big-O notation is only meant as a guide to algorithm choice and is no
substitute for an understanding of the tradeoffs involved with each
potential algorithm and of how they actually perform when used with
your datasets.
Trees are good when you (a) want to guarantee how long it will take to
add, look up, or remove an element, or (b) need the keys to be ordered
by some property. Most of the time, we don’t need these properties.
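In Ruby terms, if ordered traversal is only an occasional need, sorting
a plain Hash’s keys on demand (O(n log n) per traversal) is often a
better trade than paying a tree’s O(log n) on every insert and lookup.
A minimal sketch:

```ruby
stock = { "pear" => 3, "apple" => 1, "mango" => 2 }

# Inserts and lookups stay O(1); we only pay for ordering when we ask for it.
stock["kiwi"] = 5

ordered = stock.keys.sort
puts ordered.inspect # => ["apple", "kiwi", "mango", "pear"]
```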
> At minimum, you’ll double up memory storing an array and a hash
> together, or you’ll sacrifice performance maintaining and traversing
> a balanced tree.
> …
> Leave hash alone; custom data structures are fine when the peculiar
> beast known as an “ordered hash” is actually useful.
Agreed. There’s no sense in adding complexity to the language to cope
with something most of us use only rarely, if ever. That is, after all,
what libraries are for.
K.