Section notes for Week 12: Tries

Section Notes for Week 12: Tries

Story thus far

  1. Conversion of Java into C++
  2. C++ I/O
  3. C++ Preprocessor
  4. Pointers, local variables, and the stack
  5. new and delete
  6. Code Review
  7. More Example Code
  8. Reference Variables and Method invocation
  9. Const keyword
  10. Constructors, copy constructors, and destructors
  11. Operator Overloading and the string class
  12. Big 3
  13. Templates and the array class
  14. Discussion of the List ADT
  15. A look at an array list implementation
  16. A look at the Linked List implementation
  17. Brief Discussion of Doubly Linked Lists
  18. Iterators
  19. Function Objects
  20. Generic Programming
  21. Trees
  22. Binary Search Trees
  23. Tree Traversal
  24. Trees with Back Pointers
  25. Exam Review
  26. Motivation
  27. Cases
  28. Rotations
  29. The Big Picture
  30. Iterators in AVL FIXME: Broken

Today:

  1. Tries
  2. Patricia Tries
  3. de la Briandais Tries
  4. Trie Sourcecode

Tries

So today is going to be a short discussion section (hopefully). All we have to do is talk about these wondrous data structures known as tries (pronounced like try), and tries are not all that complicated.

Sure sure, but what the hell is a trie? Well, it's like a tree, except more. Recall how our comparison-based tree structures are pretty much limited to O(lg n) (sure, we have B-tree's, but these aren't so hot for a fanout of more than 3 or 5 in main memory). Tries are the solution to this. Instead of using comparisons to move left and right, we can use a character of a string at each level to choose the appropriate child.

This means that our tree effectively has a fanout of 27 and at each level (A-Z and the null char '\0'), we can do a constant time array access to get a pointer to the next level.

  • Example with Euler, Einstein, Euclid, Euclidean, Eulers

    So what's the running time on one of these guys? Is it dependent on the number of strings in the Trie?

    So can anyone point out a problem with this approach?

    Patricia Tries

    Patricia (Practical Algorithm To Retrieve Information Coded in Alphanumeric) tries attempt to solve the problem of wasted levels in the trie. They are basically compressed tries, where a child is only created when and if a string string differs from its neighboring strings. If it has no neighbors, a pointer directly to the leaf is created.

    The when and if is italicized for a reason. When I took this class, I missed the when part. This means that shared characters between two (or more) strings are simply not represented in the trie. It just skips over those characters and places an entry in the child array at the first place they differ from one another. Note that this CAN BE THE NULL CHARACTER.

    In the book's implementation, the array is also marked with the position of the character used to differentiate the strings. This is important for dynamic updates. You should also consider including a pointer to the missing characters to make your life easier as well.

    Example with Euler, Einstein, Euclid, Euclidean, Eulers

  • Example of find.

    de la Briandais Tries

    de la Braindais tries try to solve an orthogonal problem: wasted indices on each level of the trie. Instead of having a whole array at each level, we can have a linked list!

  • Example with Euler, Einstein, Euclid, Euclidean, Eulers in a plain Braindais trie.

    If you were paying attention (and you know what orthogonal means), you may be wondering if we can combine de la Briandais and Patricia trees together. Well it turns out we can!

  • Example with Euler, Einstein, Euclid, Euclidean, Eulers in a hybrid trie.

    Trie Sourcecode

    Aight, lets look at an actual implementation of this thing. As usual, a gander at the .h file will give us a good overview of how this thing is working.

    Things to note:

    1. Only two ADT ops: Insert and Find
    2. Notice the layout of the TreeNode. Can we say what type of trie this is yet?
    3. How do we tell when we've reached data? Is is a good idea? What can we do instead?

    Ok, since I'm not allowed to show you anything other than insert, that's what we'll look at. As we're going through the code, see if you can pinpoint what type of trie this is.

  • Insert.

    So how would remove work on a normal trie? Well basically we need to check at every level to see if we have more children than just this string. Unfortunately, Jason's implementation doesn't support such a check in constant time. But luckily adding such a field to the TrieNode class is a simple task. (A good, quick excersize is to think about the running time without such a field). Anyways, here's some pseudocode.

    PseudoCode sketch of remove for vanilla Trie:

    
    bool Remove(String string, TrieNode *& node = root, int prevLevel = -1)
    {
      if(node == NULL)
        return false;
    
      if(node is leaf)
        if(prevLevel == string.length())
        {
          delete node
          node = NULL;
          return true;
        }
        else
          return false; // string not in trie
    
      if(Remove(string, node->children[string[prevLevel+1]], prevLevel+1))
        decrement number of children of node by 1; // not supported by our class
    
      if(node has no children)
      {
        delete node;
        node = NULL;
        return true;
      }
    
      return false;
    }
    
    

    Some things to consider (that I'm not allowed to talk about because they will be showing up on MP7):

    1. What do traversals on a trie mean?
    2. How would insert/remove work on a Patricia tree?

      Less important:

    3. Can this data structure be used for fast sorting?
    4. How can you avoid all the memory overhead if you just want to sort?

    More information on tries can be found on pages 260-263 in the puke green book.