Section Notes for Week 13: Hashing

Story thus far

  1. Conversion of Java into C++
  2. C++ I/O
  3. C++ Preprocessor
  4. Pointers, local variables, and the stack
  5. new and delete
  6. Code Review
  7. More Example Code
  8. Reference Variables and Method invocation
  9. Const keyword
  10. Constructors, copy constructors, and destructors
  11. Operator Overloading and the string class
  12. Big 3
  13. Templates and the array class
  14. Discussion of the List ADT
  15. A look at an array list implementation
  16. A look at the Linked List implementation
  17. Brief Discussion of Doubly Linked Lists
  18. Iterators
  19. Function Objects
  20. Generic Programming
  21. Trees
  22. Binary Search Trees
  23. Tree Traversal
  24. Trees with Back Pointers
  25. Exam Review
  26. Motivation
  27. Cases
  28. Rotations
  29. The Big Picture
  30. Iterators in AVL

Corrections:

  1. Patricia tree need not start at level 0

Today:

  1. Hashing
  2. Open Hashing
  3. Closed Hashing
  4. Hash Table Source Code

Hashing

Ok. So we've taught you all sorts of data structures. Wouldn't it be nice if we could just do everything in constant time? Insert, remove, find, etc.? This is surely a pipe dream, right? Or is it...

A hash table is a data structure that works just like an array, except instead of forcing you to use ints as your index, you can use any arbitrary data type you choose!

How the hell is this done? Well, we basically need a function that maps a key K to an integer index i of the hash table (which has indices 0..m-1), ideally with each index equally likely (probability 1/m). Such a function is called a hashing function.

The only one you need to worry about works on string input and creates an integer key: i = (K[0] + K[1]*256 + K[2]*256^2 + K[3]*256^3 + ...) % m.

There are a couple of subtleties to the above approach. First, since 256^n exceeds 2^(sizeof(int)*8) (about 4 billion, for a 32-bit int) as soon as n = 4, we really should factor and then distribute the % m at each step, giving us something like: (K[0] + 256*(K[1] + 256*(K[2] + 256*(K[3] + 256*(K[4] ...) % m) % m) % m) % m) % m. This prevents the key from ever overflowing integer range in a single operation.
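In C++, this works out to a short loop. Here's a minimal sketch (the name Hash and the exact types are just illustrative):

    #include <string>

    // Hash a string key into 0..m-1, distributing % m at each step
    // (Horner's method) so intermediate values never overflow. Assumes
    // m is well below 2^24, so h * 256 + 255 still fits in 32 bits.
    unsigned int Hash(const std::string &key, unsigned int m) {
        unsigned int h = 0;
        // Working from the last character inward computes
        // (K[0] + 256*(K[1] + 256*(K[2] + ...) % m) % m) % m.
        for (int i = (int)key.size() - 1; i >= 0; --i)
            h = (h * 256 + (unsigned char)key[i]) % m;
        return h;
    }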

Furthermore, your table size m should be prime. There are some justifications for this, but none of them are very satisfying. The main one to think about is what happens if m is a power of 2: then % m just keeps the low-order bits, so with the base-256 scheme above only the first character or two of the key would ever affect the index.

Open Hashing

Ok, so we can now index into a hash table. How do we build one, then? Well, the most straightforward way is to make an array of linked lists, one for each bucket. Insert and remove simply do normal linked list operations. You may sort the linked lists, but you don't have to.

This method is also known as Separate Chaining, because you have a bunch of separate chains in your table.
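Here's a minimal sketch of what a chained table might look like, reusing the Hash() function from above (the class and method names are mine, not the course code's):

    #include <list>
    #include <string>
    #include <vector>

    unsigned int Hash(const std::string &key, unsigned int m);  // as above

    class ChainedHashTable {
      public:
        explicit ChainedHashTable(unsigned int m) : buckets_(m) {}

        void Insert(const std::string &key) {
            // Each bucket is its own (unsorted) linked list: just push on.
            buckets_[Hash(key, buckets_.size())].push_front(key);
        }

        bool Find(const std::string &key) const {
            const std::list<std::string> &chain =
                buckets_[Hash(key, buckets_.size())];
            for (std::list<std::string>::const_iterator it = chain.begin();
                 it != chain.end(); ++it)
                if (*it == key) return true;
            return false;
        }

        void Remove(const std::string &key) {
            // std::list::remove erases every node equal to key.
            buckets_[Hash(key, buckets_.size())].remove(key);
        }

      private:
        std::vector<std::list<std::string> > buckets_;
    };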

  • Quick sketch of this on the board.

So what are the running times for this? Well, if the hashing function is good, each bucket should have about n/m entries in it. Doesn't look so good, huh? Sure looks like if m is a constant, we're still at O(n).

Well, m doesn't have to be a constant. You can expand your hash table size (m) every time you insert into a chain that has more than k entries. If k is a constant, your access time is bounded at O(k) = O(1).

Question: What happens to your running times if you expand whenever n/m exceeds, say, 0.5?

Closed Hashing

The above method is my personal favorite. I think it's nice and clean and simple. However, as it turns out, you can actually make things run a little faster as far as constant factors are concerned. How? Well, instead of making a separate chain, which requires many different memory allocations, we store all our data in a single ("closed") table.

This is also known, confusingly enough, as Open Addressing.

So how do we resolve conflicts in this approach? Well, the methods all revolve around finding a suitable open index, so that your hash function becomes h(k) = (h1(k) + f(i)) % m, where h1(k) is the original function mentioned above, i is the number of attempts you have made so far (and failed), and f is some function.

There are 3 main methods of doing this probing (and thus providing the function f). Each one is a little bit better than the one before it. Why teach you all 3 then? I don't know... Maybe it's so you know what NOT to do.

Linear Probing

Linear probing is the most naive of all the approaches. If a cell is occupied, it just puts the entry in the next open cell. Formally, f(i) = i.

  • Example with some integer keys

A find operation will enter at an index, and continue searching until it hits an open index. (Why open? Why not just one that contains a key that doesn't match the index value?)

How do you delete? Well, rather than shift every element in your chain up, you can simply mark a cell as deleted, so that find knows to continue its search, and so that insert knows it can throw a value there.
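Here's a rough sketch of how find, insert, and remove fit together under this scheme. The names (Cell, kEmpty, and so on) are mine, not the course code's, and I'm assuming non-negative integer keys and a prime table size:

    #include <vector>

    enum CellState { kEmpty, kFull, kDeleted };

    struct Cell {
        int key;
        CellState state;
        Cell() : key(0), state(kEmpty) {}
    };

    class LinearProbeTable {
      public:
        explicit LinearProbeTable(unsigned int m) : table_(m) {}

        // True (and index = the key's cell) if found; otherwise false,
        // with index set to the cell where the key should be inserted.
        bool Find(int key, unsigned int &index) const {
            unsigned int pos = key % table_.size();   // h1(k)
            int reusable = -1;
            // Stop only at an EMPTY cell; DELETED cells keep the search
            // going. (Assumes we rehash before the table fills up.)
            while (table_[pos].state != kEmpty) {
                if (table_[pos].state == kFull && table_[pos].key == key) {
                    index = pos;
                    return true;
                }
                if (table_[pos].state == kDeleted && reusable < 0)
                    reusable = pos;              // insert may reuse this cell
                pos = (pos + 1) % table_.size(); // f(i) = i: try the next cell
            }
            index = (reusable >= 0) ? (unsigned int)reusable : pos;
            return false;
        }

        void Insert(int key) {
            unsigned int index;
            if (!Find(key, index)) {
                table_[index].key = key;
                table_[index].state = kFull;
            }
        }

        void Remove(int key) {
            unsigned int index;
            if (Find(key, index))
                table_[index].state = kDeleted;  // lazy delete: just mark it
        }

      private:
        std::vector<Cell> table_;
    };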

  • Example after a delete.

Running times? Well, you can try to do the same trick as before to keep your chains bounded to some length, but you have even bigger problems.

The Primary Clustering Problem: when the table starts to get full, clusters will develop, and the larger a cluster gets, the higher the probability that an element will need to be inserted into some cell of the cluster, thus expanding it. This is not so nice, and leads to O(n) worst-case find and insert running times.

Quadratic Probing

Ok, so having f(i) be linear was clearly no good. What if we tried to avoid creating those clusters by making f(i) "space out" a bit? Say f(i) = i^2?
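As a sketch, here's one way the probe loop might look, reusing the Cell type from the linear-probing sketch above. The identity i^2 - (i-1)^2 = 2i - 1 lets us advance to the next probe without a multiplication:

    // Hypothetical helper: index of the first non-FULL cell on the
    // quadratic probe sequence starting from hash value h1.
    unsigned int QuadraticProbe(const std::vector<Cell> &table,
                                unsigned int h1) {
        unsigned int m = table.size();
        unsigned int pos = h1 % m;
        // Give up after m/2 probes: beyond that the sequence repeats.
        for (unsigned int i = 1; table[pos].state == kFull && i <= m / 2; ++i)
            pos = (pos + 2 * i - 1) % m;  // now at (h1 + i^2) % m
        return pos;
    }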

  • Example with quadratic probing

As it turns out, this WILL eliminate primary clustering, but it introduces another problem: secondary clustering!

Secondary clustering is when elements with the same hash value form chains. As it turns out, secondary clustering prevents us from guaranteeing an insert if the table is >= half full! (In fact, a probe sequence can only ever reach about m/2 distinct cells.)

  • Example with failure case

Double Hashing

So this is it: if we use TWO hashing functions, so that f(i) = h2(k)*i, then we have a unique probe sequence for every key. This makes our overall hash function h(k) = (h1(k) + h2(k)*i) % m.

h2(k) should be relatively prime to m for all k, should not share very many hash values with h1(k), and should never evaluate to 0 (or the probe sequence would never move off h1(k)).
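A hypothetical probe helper, again reusing the Cell type from the linear-probing sketch; h1 and h2 stand for the two hash values already computed for the key:

    // Index of the first non-FULL cell on the double-hashing probe
    // sequence. Requires 0 < h2 < m; with m prime, the sequence then
    // visits every cell in the table before repeating.
    unsigned int DoubleHashProbe(const std::vector<Cell> &table,
                                 unsigned int h1, unsigned int h2) {
        unsigned int m = table.size();
        unsigned int pos = h1 % m;
        for (unsigned int i = 1; table[pos].state == kFull; ++i)
            pos = (h1 + i * h2) % m;  // i-th probe: (h1(k) + i*h2(k)) % m
        return pos;
    }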

  • Example with h1(k) = x mod 9 and h2(k) = (7 - x) mod 7 + 1

Rehashing

There's not much to rehashing. Basically you just create a new table, and re-insert every value into the new table. This takes O(n), but you won't do it until after something like n/2 insertions, so if you average it out, your individual ops still take O(1). Usually you pick the next largest prime that's roughly twice your current table size.
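As a sketch, rehashing the chained table from earlier might look like this (NextPrime is an assumed helper, not shown; a real version would also decide when to trigger this):

    unsigned int NextPrime(unsigned int n);  // assumed: smallest prime >= n

    void ChainedHashTable::Rehash() {
        std::vector<std::list<std::string> > old;
        old.swap(buckets_);  // set the old chains aside
        // New table: next prime roughly twice the old size, all empty.
        buckets_.resize(NextPrime(2 * old.size()));
        // Re-insert every value; each key rehashes to its new bucket.
        for (size_t b = 0; b < old.size(); ++b)
            for (std::list<std::string>::const_iterator it = old[b].begin();
                 it != old[b].end(); ++it)
                Insert(*it);
    }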

Source Code

So my primary job with the source code today is to show you guys pure virtual functions. These are virtual functions that are not implemented in the base class, and they provide behavior similar to abstract classes in Java. Much like in Java, you can't instantiate a class that has unimplemented pure virtual functions, either directly or through inheritance. You can, however, have pointers of the abstract class's type, and have them point to subclasses that actually implement these virtual functions.
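Here's a toy illustration of the idea (not the course's actual header; ProbingTable is a made-up subclass):

    // The "= 0" makes Find pure virtual, so HashTable is abstract.
    class HashTable {
      public:
        virtual ~HashTable() {}
        virtual bool Find(int key) const = 0;  // no body in the base class
    };

    class ProbingTable : public HashTable {
      public:
        virtual bool Find(int key) const {
            // ... probe for the key here ...
            return false;
        }
    };

    int main() {
        // HashTable t;                     // compile error: abstract class
        HashTable *t = new ProbingTable();  // base pointer, subclass object
        t->Find(42);                        // dispatches to ProbingTable::Find
        delete t;
        return 0;
    }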

Ok, so that's that. Let's have a look at how the hash table implementation makes use of these guys.

There are TWO .h files in the directory. Which one should we look at first?

Some things to notice:

  1. Find() is pure virtual, yet Insert and Remove are not.
  2. We have a TableNode member class, and an array of pointers to this class, much like in the Trie.
  3. We have a deletion status flag, but it's magic! Bad!
  4. Notice our pure virtual hash function.
  5. What type of hashing technique is employed? Can we pin it down specifically?
  6. So what types of subclasses should we expect?

So now let's have a quick look at insert and delete. Much like in every other data structure that supports these ops, Insert and Delete actually Find the element first. So we use the pure virtual Find, which will be implemented by the subclasses, and then do a standard insert. How does this work? Well, the key is that Find is well specified in what it should do. It needs to return a true/false value indicating whether the key was found, and update the current index to where the key should be. This allows insert to work pretty much right off the bat.

So let's have a look at the subclasses now. Notice we have all three of them in the .h file. Other than that, things aren't very interesting, so let's look real quick at a couple of finds (they all actually work the same anyway), and call it a day.