A few months ago I talked about hash tables, known in TR1 as unordered associative containers, and hinted that’s what really interesting about them is how they work under the covers. Of course, neither TR1 or the C++ Standard describe how to implement containers. These documents only describe how the containers must behave.
When I analyzed unordered associative containers (UACs), I looked at one specific implementation, the Dinkumware version. The requirements on UACs are simple. Insertion must be average time complexity O(1) and worst-case complexity O(n). Search performance must be the same: constant time on average, and linear in the worst case. This is the true distinction of hash tables. Unlike associative containers, which have O(log n) complexity for insertion and find, UACs can do much better: constant time insertion/find. The drawback of UACs is that performance can degrade to worse than log n: linear.
Hash tables work by hashing elements into buckets. The perfect hash table has one element per bucket. A good hash table has a handful of elements in each and every bucket.
The Dinkumware implementation of UACs is simple but clever. Elements T are stored in a single std::list<T>. Buckets each contain a single iterator that points somewhere into the list. Hence, buckets is defined as vector of iterators, a std::vector< std::list<T>::iterator >. When a new element is added to the UAC, it is hashed to determine its bucket. If the bucket is empty, the element itself is added to the list and the bucket iterator is changed to "point" to the element in the list. If the bucket already points into the list, the element is inserted into the list. The clever part of the implementation is the use of a single std::list data structure to represent the "mini-lists" of each bucket. There’s a picture on slide 20 here that gives you a visualization. Unfortunately the PDF doesn’t animate, but you’ll get the general idea. Notice that int elements 2 and 7 both hashed to the same bucket. The bucket points to the 2, which is followed by the 7. The next element in the list is 4, but because the iterator to element 4 is in the next non-empty bucket, we know that 7 is the last element in the bucket. The Dinkumware implementation uses this little trick to avoid having a unique list per bucket. The single list represents all the bucket lists.
I haven’t had the chance to look at other implementations, but I’m sure it would prove a very interesting exercise.