Section notes for Week 7 (More Trees!)

Story thus far

  1. Conversion of Java into C++
  2. C++ I/O
  3. C++ Preprocessor
  4. Pointers, local variables, and the stack
  5. new and delete
  6. Code Review
  7. More Example Code
  8. Reference Variables and Method invocation
  9. Const keyword
  10. Constructors, copy constructors, and destructors
  11. Operator Overloading and the string class
  12. Big 3
  13. Templates and the array class
  14. Discussion of the List ADT
  15. A look at an array list implementation
  16. A look at the Linked List implementation
  17. Brief Discussion of Doubly Linked Lists
  18. Iterators
  19. Function Objects
  20. Generic Programming
  21. Trees
  22. Binary Search Trees
  23. Tree Traversal
  24. Trees with Back Pointers
  25. Exam Review
  26. Motivation
  27. Cases
  28. Rotations
  29. The Big Picture
  30. Iterators in AVL

Today:

  1. B-Trees
  2. B-Tree Operations
  3. RedBlack-Trees
  4. RedBlack Operations

B-Trees

Ok. So if binary search trees can give us O(lg(n)) performance, can we make a k-ary tree (where each node has k children) to give us O(log_k(n)) performance?

The answer is yes! Such a tree is called a B-Tree. In a B-Tree, each node may have as many as B children, and all the data is stored only in the leaves.

B-trees are actually a specific kind of a more general tree structure known as (a, b)-trees, where the number of children of a node is between a and b, inclusive. B-Trees are (a, b)-trees with b = 2a - 1. So the minimum number of children in a B-Tree node is (B+1)/2, and B is always odd. (The exception is the root, which may have between 2 and B children, inclusive.)

So let's have a look at the B-Tree on page 238 of the Green Book. Notice that the internal nodes have values in them. These values are used for searching in the normal binary-tree comparison sense, except that the left pointer is for values <= the key. This is because the actual data is stored only at the leaf nodes; the values in internal nodes are just indices for routing the search, and a value equal to the key is found down the left pointer.

From this, you should start to get an idea of how the Find operation on the tree is going to run, and from some handwaving, we can see that the height of the B-tree is going to be roughly O(log_{(B+1)/2}(n)).


Find(X)
{
   if at leaf
   {
      if value == X
         return 1
      else 
         return 0
   }
   else  // at internal node, which might also be the root node
   {
      search node for appropriate pointer, based on ranges
      Repeat Find(X) on that subtree, return result
   }
}
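
The Find pseudocode above might look like the following in C++. The node layout (index keys routing the search, data stored only at the leaves, left child for values <= the key) follows the notes, but the struct itself is just a sketch, not the lecture's code:

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct BNode {
    bool leaf = false;
    std::vector<int> keys;                         // search indices (internal nodes)
    std::vector<std::unique_ptr<BNode>> children;  // keys.size() + 1 children
    std::vector<int> values;                       // actual data, leaves only
};

bool find(const BNode* node, int x) {
    if (node->leaf) {
        // At a leaf: scan the stored values for a match.
        for (int v : node->values)
            if (v == x) return true;
        return false;
    }
    // Internal node: child i covers values <= keys[i]; the last child
    // covers everything greater than the last key.
    std::size_t i = 0;
    while (i < node->keys.size() && x > node->keys[i]) ++i;
    return find(node->children[i].get(), x);
}
```

A real implementation would binary-search `keys` instead of scanning linearly, which is where the lg(B) factor discussed below comes from.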

So why don't we want to use B-Trees all the time, with something like B = 199 to make the height really small and searching really fast?

Well, when we traverse each node, we still have to find out which pointer to take. So if we have B = 199, we may have to search through up to 198 keys in each node. Even with a binary search, this still takes O(lg(B)) time, putting the actual running time for a B-Tree find at O(lg(B) * log_{(B+1)/2}(n)). As Jason mentioned in lecture, B-Trees aren't very good as main-memory data structures if B exceeds 5.
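
To see why a huge B buys so little in main memory, we can plug numbers into that cost formula. This is just back-of-the-envelope arithmetic, not a benchmark:

```cpp
#include <cassert>
#include <cmath>

// A find costs about lg(B) comparisons per node (binary search inside the
// node) times log_{(B+1)/2}(n) levels of the tree.
double find_cost(double B, double n) {
    double per_node = std::log2(B);
    double height   = std::log(n) / std::log((B + 1) / 2.0);
    return per_node * height;
}
```

For n = 1,000,000, B = 3 costs about 32 comparisons while B = 199 costs about 23: a 66x fatter node barely moves the total. Big B only pays off when each *level* saved is a slow disk access, as described below.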

So why do they exist? Well, as it turns out, modern journaling filesystems such as ReiserFS make use of a certain type of B-Tree (B*-Trees or B+-Trees), because it takes much longer to do the I/O to fetch a node of any size into memory than it does to scan a node of any size once it is there. (Disk access times are really slow compared to main-memory operations, and even compared to disk read times.) This way, they effectively make use of fast memory operations to minimize slow disk operations.

B-Tree Operations

Insert

Insert, just like in the BST and AVL trees, is preceded by a Find() to get us to the point in the tree where the new leaf should go, and we place it there. A new index is then placed in the parent node according to the following rules, which ensure the left child <= key < right child property.

First, find the sibling next to the position where the new node will go. From here, we have two cases:

  1. The new node is less than the node in this position
    Then its key goes up, and a pointer to the new node is placed to the left of this key.
  2. The new node is greater than the node in this position
    The node in the found position's key goes up, and a pointer to the new node is placed to the right of this key.

  • Example of this on a B-Tree node with only 2 children

But this isn't all there is to it. We also have to preserve the B-Tree property mentioned above ((B+1)/2 <= children <= B). So the cases we need to think about all depend on the number of children in the parent node.

  1. The parent node has B children or less.
    B-Tree property holds. STOP.

  2. The parent node now has more than B children.
    This requires us to "split" the parent node. The split of the children will be even, because B+1 is even, but the number of index keys will be odd. So the parent node becomes two separate children of its own parent, with the now-unused center key being moved up and placed in the position to the right of the pointer to the formerly unsplit node. A new pointer to the right half of the split is then created to the right of this key. You then must CONTINUE up the tree to where the key from the split was inserted, to make sure the B-Tree property holds there as well. If it does not, repeat this case.

    There is one subcase to this. If the root node has more than B children, you must split it, and use the middle key to create a new root. Same idea, but just be sure to update the root pointer.

  • Example of Cases 1 and 2
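
The key-splitting step in case 2 can be sketched in C++. This assumes a node keeps its index keys in a sorted std::vector (my layout, not the lecture's); after an overflow the key count is odd, so the middle key moves up to the parent and the two halves come out even:

```cpp
#include <cassert>
#include <vector>

struct Split {
    std::vector<int> left;   // keys staying in the left half
    int middle;              // key that moves up into the parent
    std::vector<int> right;  // keys going to the new right sibling
};

Split split_keys(const std::vector<int>& keys) {
    // keys.size() is odd after an overflow (B keys, B odd), so the middle
    // element is well defined and the halves have equal size.
    std::size_t mid = keys.size() / 2;
    return { {keys.begin(), keys.begin() + mid},
             keys[mid],
             {keys.begin() + mid + 1, keys.end()} };
}
```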

Remove

Remove is going to do a find on the element, and then remove the leaf with the corresponding key.

Let's consider what happens when a leaf suddenly disappears.

  1. If the removed node was the least child of its parent:
    Remove the leftmost index of the parent.
  2. If the removed node was a middle child of its parent:
    Remove the index in the parent whose left child was this node. This is technically the same case as #1.
  3. If the removed node was the greatest child of its parent:
    Remove the index whose right child was this node.

So after this is done, we need to check to make sure the B-Tree property is preserved after the index removal. Again, consider the two cases:

  1. If the number of indices in the parent node does not drop below (B+1)/2:
    The B-Tree property has not been violated. STOP.

  2. If the number of indices in the parent node DOES drop below (B+1)/2:
    Now we need to consider how to get the number of indices back into this range. So we look to the siblings of the parent (NOT the COUSINS of the parent. Why?). This leads us to the following subcases, depending on the B-Tree properties of the siblings.

    1. If the sibling has more than (B+1)/2 children:
      We can then safely "steal" a node from the sibling. Jason defines stealing as swapping the stolen node's index with the parent index, and then dropping the parent index down to the left or right of the stolen node (depending on whether this is the right or left sibling). The B-Tree property is then preserved. STOP, we are done.

    2. If all adjacent siblings have exactly (B+1)/2 children:
      We cannot steal, so we must perform a merge. The merge will cause the parent to lose a child, but it also makes the index that separated the two merged nodes unnecessary. So this index drops down to fill in the merge. But the parent may have underflowed, so CONTINUE up and check the cases on it.

    3. If the underflowed node was the root, it has one child. Eliminate the root, and the new child is the root.
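
The merge in subcase 2 can be sketched the same way, again assuming each node keeps its indices in a sorted std::vector (my assumption): the separating index from the parent drops down to glue the two siblings' indices together.

```cpp
#include <cassert>
#include <vector>

std::vector<int> merge_keys(const std::vector<int>& left, int separator,
                            const std::vector<int>& right) {
    std::vector<int> merged(left);
    merged.push_back(separator);  // the parent's index fills the gap
    merged.insert(merged.end(), right.begin(), right.end());
    return merged;
}
```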

Red-Black Trees

Alright, so last week we covered AVL trees as a solution to the balancing problem. As it turns out, there are alternate solutions as well. CLR (the CS273 book, aka "The Tome") describes a balancing system called the Red-Black Tree. A Red-Black tree maintains balance through the following properties:

  1. Every node is either red or black.
  2. NULLs are considered black nodes.
  3. If a node is red, both its children must be black.
  4. Every path from a node to NULL contains the same number of black nodes.
  5. The root node will always be black.

Properties 3 and 4 ensure tree balance by keeping the length of the longest path no more than 2 times the length of the shortest path in the tree. (Since the paths from root to leaf have the same number of black nodes, and no two red nodes may be in a row.)

Much like AVL trees, these properties are maintained through rotations. Red-black trees actually only make use of single-left and single-right rotations, so you don't have to learn anything new in that respect. However, you will need to know the rules for red-black trees, and the cases involved in maintaining them. So let's ramble about that.
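
Properties 2-4 are mechanical enough to check in a few lines of C++. The node layout here is mine, not from the notes; the checker returns the black height of a subtree, or -1 if a red node has a red child or two paths disagree on black count:

```cpp
#include <cassert>

enum Color { RED, BLACK };
struct RBNode { Color color; RBNode* left; RBNode* right; };

int black_height(const RBNode* n) {
    if (!n) return 1;  // NULLs count as black nodes (property 2)
    if (n->color == RED) {
        // Property 3: a red node's children must both be black.
        if ((n->left && n->left->color == RED) ||
            (n->right && n->right->color == RED)) return -1;
    }
    int l = black_height(n->left);
    int r = black_height(n->right);
    if (l < 0 || r < 0 || l != r) return -1;  // property 4
    return l + (n->color == BLACK ? 1 : 0);
}
```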

Red-Black Tree Operations

Find

Duh.

Insert

Insert is an interesting beast. Once again, it starts off just like the BST insert, with an initial node coloring of red. (Why not an initial coloring of black?)

From this point, we must consider the red-black properties and how to preserve them. Since Jason does an excellent job of developing the rationale behind the cases in his notes, I'm just going to dump them here. If you haven't been to lecture, do read over those notes in detail. (That is, if you concern yourself with trivial details such as passing this class.)

  1. If the parent of the new node is black, tree properties are maintained. STOP.
  2. If the color of the parent is red, we break rule 3. Consider some subcases:
    1. If the color of the parent's sibling is red (i.e. the uncle is red):
      We can safely change the parent and the uncle to black, and change the grandparent (the parent of the parent of the inserted node) to red. This preserves rule 4, so it is OK, but we must then CONTINUE up the tree from the grandparent, to see if THAT node's parent is red.

    2. If the color of the uncle is black:
      Now we can't simply change our parent's color to black, because that would break rule 4. What we can do is the following:
      1. If we are closer to the uncle than our sibling:
        Rotate on our parent away from the uncle, and set our pointer to our old parent (so we are now far from our uncle), and drop down to the following case.
      2. If we are farther from the uncle than our sibling:
        Now we can rotate towards the uncle at the uncle's parent (our grandparent). We can then safely change the old grandparent's color to red and our parent's color to black to preserve rule 4. STOP.
    Lastly, ALWAYS recolor the root black.

  • Go through the example in CLR with these rules.

Remove

Red-Black remove is going to work just like BST (and AVL) removal to get rid of the node we specified. After that, we have to check to make sure the Red-Black properties hold, and if not, do corrections.

So what are those cases? Well, first off, if the deleted node was red, the black height hasn't changed, and the removal of a red node won't do anything to put two reds in a row. STOP.

If the deleted node was black, we have problems. So we need to consider how the tree has changed: some node (either NULL, a child, or an IOS) is now in the place of the deleted node.

From here, we have 4 cases:

Case 1: The deleted node's sibling is red (fallthrough)
Case 2: The deleted node's sibling is black and has at least 1 red child
  Case 2a: The farthest child is black (make it red, fallthrough)
  Case 2b: The farthest child is red (stop case)
Case 3: X's black sibling has no red children
    
    
    RBRemove(value)
    {
      // Pretend BST returns the place in the tree where a node was actually
      // deleted.
      // x may be NULL (black), or it may be the IOS of the deleted value
      // This is an important detail glossed over by Jason!
      x = BST_remove(value);
      
      if(deleted node was red)
        return; // STOP
      else
      { 
        // go from here up the tree until we see a red node 
        while(x != root and x is black)
        {
          // Since the black height of this side has decreased, we must 
          // increase the black height of the current side.
          
          // case 1 is that X's sibling is red. 
          if(x's sibling is red)
          {
            color x's sibling black
            color x's parent red
            rotate toward x
            
            // This operation has NOT changed the black heights at all.
            // It simply makes the next operation (which does) possible.
            // x remains the same
          }
    
          // now the sibling is black. From here, we can actually reduce the
          // black height on the opposite side by a recoloring
    
          if(at least one of x's sibling's children is red) // case 2
          {
            if(x's sibling's child furthest from x is black) // case 2a
            {
              color x's sibling red
              color child of sibling closest to x black
              single rotation on x's sibling away from x.
              // case 2a is always followed by 2b, because
              // 2a only gets us into the position where 2b can do its
              // work (See pg 36 of jason's notes).
              // If it helps, you can consider 2a + 2b a "double
              // rotation".
            }
            // case 2b
            // now x's sibling's child furthest from x is red.
            // This is the only operation that actually fixes 
            // the black height
            color child of x's sibling furthest from x black
            color x's sibling to whatever color x's parent is.
            color x's parent black
            single rotation on x's parent towards x
    
            // the black height is now repaired
            break  // out of the loop
          }
          else // case 3: x's black sibling has no red children
          {
            // we must move up the tree, as the balancing problem
            // is up there
            color x's sibling red
            x = x->parent
          }
    
        }
        color x black  
      }
    }
    
    

More Info

As mentioned above, an EXCELLENT description of Red-Black trees can be found in the CS273 textbook. Lots of diagrams for every case, etc. Red-Black trees are described in our text, but the diagrams aren't as nice and the text is not as thorough. B-Trees are also described in CLR, but I didn't read that in detail. It looks good though. CLR are pretty smart guys. That book got me through 273 and 373 without ever needing to set foot in a classroom.

In his notes, Jason describes a type of B-Tree that is related to a Red-Black tree. That material is optional, but may help you understand things. If you're feeling adventurous, check it out. I find it confusing myself, but to each their own. :)

FYI: Red-black trees are the data structure used in the STL <map> class.