## Section Notes for Week 13: Hashing

### Corrections:

1. A Patricia tree need not start at level 0

### Hashing

Ok. So we've taught you all sorts of data structures. Wouldn't it be nice if we could just do everything in constant time? Insert, remove, find, etc.? This is surely a pipe dream, right? Or is it...

A hash table is a data structure that works just like an array, except instead of forcing you to use ints as your index, you can use any arbitrary data type you choose!

How the hell is this done? Well, we basically need a function that maps a key K to an integer index i of the hash table (which has indices 0..m-1), with each index having probability 1/m. Such a function is called a hash function.

The only one you need to worry about works on string input and creates an integer index: i = (K[0] + K[1]*256 + K[2]*256^2 + K[3]*256^3 + ...) % m, where K[j] is the j-th character of the key.

There are a couple of subtleties to the above approach. First, since 256^n exceeds 2^(sizeof(int)*8) (about 4 billion for 32-bit ints) as soon as n = 4, we really should factor and then distribute the % m at each step (Horner's rule), giving us something like: (K[0] + 256*(K[1] + 256*(K[2] + 256*(K[3] + ...) % m) % m) % m) % m. This prevents the value from ever overflowing integer range in a single operation.
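The scheme above can be sketched as a small function. This is only a sketch — the function name and the use of `unsigned int` are my choices, not the course code:

```cpp
#include <cassert>
#include <string>

// String hash using Horner's rule: the running value is reduced mod m at
// every step, so it never overflows in a single operation.
unsigned int StringHash(const std::string& key, unsigned int m) {
    unsigned int h = 0;
    // Walk from the last character to the first, so that K[0] ends up
    // multiplied by the highest power of 256, matching the formula above.
    for (int i = static_cast<int>(key.size()) - 1; i >= 0; --i) {
        h = (h * 256 + static_cast<unsigned char>(key[i])) % m;
    }
    return h;
}
```

For a two-character key "ab" with a large m, this computes 'a' + 'b'*256 = 97 + 98*256 = 25185, exactly as the expanded formula would.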

### Open Hashing

Ok, so we can now index into a hash table. How do we build one then? Well, the most straightforward way is to make an array of linked lists, one for each bucket. Insert and remove simply do normal linked-list operations. You may sort the linked lists, but you don't have to.

This method is also known as Separate Chaining, because you have a bunch of separate chains in your table.

• Quick sketch of this on the board.

So what are the running times for this? Well, if the hash function is good, each bucket should have about n/m entries in it. Doesn't look so good, huh? Sure looks like if m is a constant, we're still at O(n).

Well m doesn't have to be a constant. You can expand your hash table size (m) every time you insert into a chain that has more than k entries. If k is a constant, your access time is bounded at O(k) = O(1).
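A separate-chaining table along these lines might look like the sketch below. This is a minimal version for int keys; the class and method names are illustrative, not the course's actual code, and it omits the table-growing step just described:

```cpp
#include <cassert>
#include <list>
#include <vector>

// Minimal separate-chaining hash table (a set of ints).
class ChainedSet {
public:
    explicit ChainedSet(std::size_t m) : buckets_(m) {}

    void Insert(int key) {
        if (!Find(key)) buckets_[Hash(key)].push_back(key);
    }

    void Remove(int key) { buckets_[Hash(key)].remove(key); }

    bool Find(int key) const {
        for (int k : buckets_[Hash(key)])
            if (k == key) return true;
        return false;
    }

private:
    std::size_t Hash(int key) const {
        return static_cast<std::size_t>(key) % buckets_.size();
    }
    std::vector<std::list<int>> buckets_;  // one chain per bucket
};
```

Note that keys 3 and 10 collide in a table of size 7 and simply share a chain — no special collision handling is needed.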

### Closed Hashing

The above method is my personal favorite. I think it's nice and clean and simple. However, as it turns out, you can actually make things run a little faster as far as constant factors are concerned. How? Well instead of making a separate chain, which requires many different memory allocations, we store all our data in a single ("closed") table.

This is also known, confusingly enough, as Open Addressing.

So how do we resolve collisions in this approach? Well, the schemes all revolve around finding a suitable open index, so that your hash function becomes h(k, i) = (h1(k) + f(i)) % m, where h1(k) is the original function mentioned above, i is the number of attempts you have made (and failed) so far, and f is some probing function.

There are 3 main methods of doing this probing (and thus choosing the function f). Each one is a little bit better than the one before it. Why teach you all 3 then? I don't know... Maybe it's so you know what NOT to do.

#### Linear probing

Linear probing is the most naive of all the approaches. If a cell is occupied, it just puts the entry in the next open cell. Formally, f(i) = i.

• Example with some integer keys

A find operation will enter at an index, and continue searching until it hits an open index. (Why open? Why not just one that contains a key that doesn't match the index value?)

How do you delete? Well, rather than shift every element in your chain up, you can simply mark a cell as deleted, so that find knows to continue its search, and so that insert knows it can throw a value there.

• Example after a delete.
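Putting linear probing and tombstone deletion together, a sketch might look like this. Again, the class name and the Empty/Occupied/Deleted states are my own labels, not the course code:

```cpp
#include <cassert>
#include <vector>

// Linear-probing hash set with tombstones for deletion.
class ProbedSet {
public:
    explicit ProbedSet(std::size_t m) : state_(m, Empty), key_(m) {}

    bool Insert(int key) {
        if (Find(key)) return false;                    // already present
        std::size_t h = static_cast<std::size_t>(key) % state_.size();
        for (std::size_t i = 0; i < state_.size(); ++i) {
            std::size_t idx = (h + i) % state_.size();  // f(i) = i
            if (state_[idx] != Occupied) {              // Empty OR Deleted: reuse
                state_[idx] = Occupied;
                key_[idx] = key;
                return true;
            }
        }
        return false;  // table full
    }

    bool Find(int key) const {
        std::size_t h = static_cast<std::size_t>(key) % state_.size();
        for (std::size_t i = 0; i < state_.size(); ++i) {
            std::size_t idx = (h + i) % state_.size();
            if (state_[idx] == Empty) return false;     // stop ONLY at Empty
            if (state_[idx] == Occupied && key_[idx] == key) return true;
        }
        return false;
    }

    void Remove(int key) {
        std::size_t h = static_cast<std::size_t>(key) % state_.size();
        for (std::size_t i = 0; i < state_.size(); ++i) {
            std::size_t idx = (h + i) % state_.size();
            if (state_[idx] == Empty) return;
            if (state_[idx] == Occupied && key_[idx] == key) {
                state_[idx] = Deleted;                  // tombstone, not Empty
                return;
            }
        }
    }

private:
    enum State { Empty, Occupied, Deleted };
    std::vector<State> state_;
    std::vector<int> key_;
};
```

Note that Find skips over Deleted cells (so later entries in the chain stay reachable), while Insert happily reuses them.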

Running times? Well, you can try to do the same trick as before to keep your chains bounded to some length, but you have even bigger problems.

The Primary Clustering Problem: when the table starts to get full, clusters will develop, and the larger a cluster gets, the higher the probability that an element will need to be inserted in some cell of the cluster, thus expanding it. This is not so nice, and leads to O(n) worst case find and insert running times.

#### Quadratic Probing

Ok, so having f(i) be linear was clearly no good. What if we tried to avoid creating those clusters by making f(i) "space out" a bit. Say f(i) = i^2?

• Example with quadratic probing

As it turns out, this WILL eliminate primary clustering, but it introduces another problem. Secondary clustering!

Secondary clustering is when elements with the same hash value form chains, since they all follow exactly the same probe sequence. As it turns out, secondary clustering prevents us from guaranteeing an insert if the table is >= half full! (In fact, for a prime table size m, the probe sequence only ever visits about m/2 distinct cells.)

• Example with failure case
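One way to see the failure is to count how many distinct cells the probe sequence f(i) = i^2 can ever reach. A small demonstration, assuming a prime table size m (the function name is mine):

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Count the distinct cells reachable by quadratic probing offsets
// i^2 mod m, for i = 0..m-1. For prime m, this comes out to (m+1)/2:
// half the table is simply unreachable from a given home index.
std::size_t DistinctProbeCells(std::size_t m) {
    std::set<std::size_t> cells;
    for (std::size_t i = 0; i < m; ++i)
        cells.insert((i * i) % m);
    return cells.size();
}
```

For m = 7 the offsets are {0, 1, 4, 2, 2, 4, 1} — only 4 distinct values — so once more than half the table is full, an insert can fail even though free cells remain.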

#### Double hashing

So here's the fix: if we use TWO hash functions, so that f(i) = h2(k)*i, then keys that collide under h1 will (almost always) follow different probe sequences. This makes our overall hash function h(k, i) = (h1(k) + h2(k)*i) % m.

h2(k) should be relatively prime to m for all k, should not share very many hash values with h1(k), and should never evaluate to 0 (or the probe would never move).

• Example with h1(k) = x mod 9 and h2(k) = (7 - x) mod 7 + 1
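With those example functions, one probe step might be sketched as follows. The helper names are mine, and the table size m passed in is an assumption (whatever size the table actually uses):

```cpp
#include <cassert>
#include <cstddef>

// The board example's two hash functions.
std::size_t H1(std::size_t x) { return x % 9; }
std::size_t H2(std::size_t x) { return (7 - x % 7) % 7 + 1; }  // in 1..7, never 0

// One step of the double-hashing probe sequence:
// h(k, i) = (h1(k) + h2(k)*i) % m.
std::size_t Probe(std::size_t x, std::size_t i, std::size_t m) {
    return (H1(x) + H2(x) * i) % m;
}
```

For example, with x = 5 in a table of size 9: H1(5) = 5 and H2(5) = 3, so successive probes land at 5, 8, 2, ...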

### Source Code

So my primary job with the source code today is to show you guys pure virtual functions. These are virtual functions that are not implemented in the base class, and they provide behavior similar to abstract classes in Java. Much like in Java, you can't instantiate a class that has unimplemented pure virtual functions, either directly or through inheritance. You can, however, have pointers of the abstract class's type, and have them point to subclasses that actually implement these virtual functions.
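Here's a tiny sketch of the pattern. The class names are made up for illustration, not the course's actual classes:

```cpp
#include <cassert>
#include <cstddef>

class AbstractTable {
public:
    virtual ~AbstractTable() {}
    // Pure virtual: declared "= 0", with no body in the base class.
    virtual std::size_t Hash(int key) const = 0;

    // A non-virtual operation can still be written in terms of it.
    std::size_t Bucket(int key) const { return Hash(key); }
};

class ModTable : public AbstractTable {
public:
    explicit ModTable(std::size_t m) : m_(m) {}
    std::size_t Hash(int key) const override {
        return static_cast<std::size_t>(key) % m_;
    }
private:
    std::size_t m_;
};

// AbstractTable t;                     // error: can't instantiate an abstract class
// AbstractTable* p = new ModTable(7);  // fine: pointer to the abstract type
```

The two commented lines at the bottom show both halves of the rule: no instantiation of the abstract class, but pointers to it are fine.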

Ok, so that's that. Let's have a look at how the hash table implementation makes use of these guys.

Some things to notice:

1. Find() is pure virtual, yet Insert and Remove are not.
2. We have a TableNode member class, and an array of pointers to this class, much like in the Trie.
3. We have a deletion status flag, but it's magic! Bad!
4. Notice our pure virtual hash function.
5. What type of hashing technique is employed? Can we pin it down specifically?
6. So what types of subclasses should we expect?

So now let's have a quick look at insert and delete. Much like in every other data structure that supports these ops, Insert and Delete actually Find the element first. So we use the pure virtual Find, which will be implemented by the subclasses, and then do a standard insert. How does this work? Well, the key is that Find is well specified in what it should do. It needs to return true/false depending on whether the key was found, and update the current index to where the key is (or should go). This allows insert to work pretty much right off the bat.

So let's have a look at the subclasses now. Notice we have all three of them in the .h file. Other than that, things aren't very interesting, so let's look real quick at a couple of Finds (they all actually work the same anyway), and call it a day.