CS321 Lecture: Advanced Binary Search Trees Last revised 10/19/99
Materials: Transparencies from Pascal Edition of Horowitz pp. 428, 431, 433
I. Introduction
- ------------
A. In CS122, we spent some time on the subject of search structures: Data
structures that can be used to store and retrieve information associated
with a certain key. The operations on such a structure can be pictured
as follows:
Insertion:
__________________
Key, value | Search |
----------> | structure |
| (key,value pairs)|
|__________________|
Lookup:
___________
Key | Search | Value
----------> | structure | ------->
|___________|
Deletion ____________________________
| Search |
----------> | structure |
| (key and its value removed)|
|____________________________|
B. Let's briefly review the structures we considered in CS122:
Structure Insert Lookup Delete
"Pile" O(1) O(n) O(1) if we know where victim is
Ordered array O(n) O(logn) O(n)
Linked list O(1) if we know O(n) O(1) if we know where victim is
where it goes
Binary search
tree O(logn) - O(n) O(logn) - O(n) O(logn) - O(n)
Hash table O(1) - O(n) O(1) - O(n) O(1) - O(n)
C. Clearly, for tables of any significant size the best candidates are either
a binary search tree or a hash table.
1. Unfortunately, while both of these offer the potential for excellent
performance, both have the potential to degenerate to very poor
performance, depending on the actual keys that occur. In principle,
we could apply probabilistic techniques to assess the probability of
good performance, but we cannot know for certain that performance will
not turn out to be bad.
2. In this course, we want to look at some more advanced versions of the
binary search tree and the hash table. We will see that
a. In the case of the binary search tree, there are methods for
guaranteeing O(log n) performance
b. For hash tables where the set of keys is static and known in
advance (e.g. the reserved words of a programming language), we can
guarantee O(1) performance.
c. For dynamic hash tables we can increase the likelihood of O(1)
performance, but cannot guarantee it.
3. Thus, for dynamic search tables of large size, we will have a choice of
an O(log n) guaranteed performance, or a likely O(1) performance with
some small risk of degeneration to O(n). Which to use for a given
problem depends on the consequences of such degeneration.
a. If a system has tight performance requirements with severe
consequences for not meeting them (e.g. life-critical systems),
then the guaranteed O(log n) performance is clearly to be
preferred.
b. In many other cases, the possibility of O(1) performance may turn
out to be worth the risk.
II. Introduction to Balanced Binary Search Trees
-- ------------ -- -------- ------ ------ -----
A. We have seen that all operations on a binary search tree (locate,
insert, and delete) can be done in O(h) time, where h is the height of
the tree. However, depending on the order in which insertions are done,
h can vary from a minimum of log n to a maximum of n, where n is the
number of nodes. Optimum performance requires that we somehow ensure
that the tree is balanced, or nearly so.
B. Actually, there are two different ways of approaching balancing a
tree:
1. Height balancing attempts to make both subtrees of each node have
equal - or nearly equal - height. It results, then, in a tree whose
height differs only slightly from the optimal value of log n.
2. Weight balancing takes into account the PROBABILITIES of accessing
the different nodes - assuming such information is known to us.
a. For example, suppose we had to build a binary search tree consisting
of the following Pascal reserved words. Suppose further that we had
data available to us as to the relative frequency of usage of each
(expressed as a percentage of all uses of words in the group), as
shown:
begin 55% Note: No claim is made that these
case 25% represent actual frequencies for typical
for 11% Pascal code. In fact, the numbers are
forward 5% contrived to illustrate a point.
otherwise 2%
packed 1%
varying 1%
b. Suppose we constructed a height-balanced tree, as shown:
forward
/ \
case packed
/ \ / \
begin for otherwise varying
- 5% of the lookups would access just 1 node (forward)
- 25% + 1% = 26% would access 2 nodes (case, packed)
- 55% + 11% + 2% + 1% = 69% would access 3 nodes (the rest)
Therefore, the average number of accesses would be
(.05 * 1) + (.26 * 2) + (.69 * 3) = 2.64 nodes accessed per lookup
c. Now suppose, instead, we constructed the following search tree
begin
\
case
\
for
\
forward
\
otherwise
\
packed
\
varying
The average number of nodes visited by lookup is now
- 55% access 1 node (begin)
- 25% access 2 nodes (case)
- 11% access 3 nodes (for)
- 5% access 4 nodes (forward)
- 2% access 5 nodes (otherwise)
- 1% access 6 nodes (packed)
- 1% access 7 nodes (varying)
(.55 * 1) + (.25 * 2) + (.11 * 3) + (.05 * 4) + (.02 * 5) +
(.01 * 6) + (.01 * 7) = average 1.81 nodes accessed
This represents over a 30% savings in average lookup time
d. Interestingly, for the particular distribution of probability
values we have used, this tree is actually optimal. To see
that, consider what would happen if we rotated the tree about
one of the nodes - e.g. around the root:
case
/ \
begin for
\
forward
\
otherwise
\
packed
\
varying
We have now reduced the number of nodes accessed for lookups in
every case, save 1. But since begin is accessed 55% of the
time, the net change in average number of accesses is
(.55 * +1) + ((1 - .55) * - 1) = .55 - .45 = +.10. Thus, this
change makes the performance worse. The same phenomenon would
arise with other potential improvements.
3. In general, weight balancing is an appropriate optimization only
for static trees - i.e. trees in which the only operations performed
after initial construction are lookups (no inserts, deletes.) Such
search trees are common, though, since programming languages, command
interpreters and the like have lists of reserved or predefined words
that need to be searched regularly. Of course, weight balancing
also requires advance knowledge of probability distributions for
the accesses.
C. We will first consider various structures and algorithms for
height-balanced trees. Then we will return to look at weight-balanced
trees.
III. Height-Balanced Binary Search Trees: AVL Trees
--- --------------- ------ ------ ----- --- -----
A. We have seen that operations on a binary search tree take time
proportional to the height of the tree, which is, in the best case,
O(log n), but in the worst case O(n). We now consider a method for
maintaining a binary search tree with height O(log n), regardless of
the order in which the keys are inserted. The type of tree we will
consider is called an AVL tree, after its two inventors: Adelson-Velski
and Landis.
B. First, we begin with a preliminary definition. We say that a binary
tree is HB(k) for some integer k >= 1 iff it is empty, or
- its two subtrees differ in height by no more than k and
- its two subtrees are HB(k)
Note: every HB(k) tree is also HB(k+1), HB(k+2) ... - but to describe a
given tree we choose the smallest k for which the definition is
satisfied.
Example: an HB(1) tree:
()
/ \
() ()
/
()
an HB(2) tree:
()
/ \
() ()
/ \ \
() () ()
/ /
() ()
(though subtree heights are equal, right subtree is HB(2).)
C. We now define an AVL tree: an AVL tree is an HB(1) binary search tree.
D. Before we talk about how to maintain AVL trees, we need to consider
why they are of interest. We will therefore prove the following
theorem: The maximum height of an HB(1) tree containing n nodes is less
than 1.44 log n. (Thus, an AVL tree has height that is O(log n), though
up to 44% higher than the best-case complete tree.)
1. We approach our proof by asking a related question - what is the
MINIMUM number of nodes in an HB(1) tree of height h. (Clearly,
for a given number of nodes, such a tree will have maximal height).
2. We define the function minnodes(h) = minimum number of nodes in an
HB(1) tree of height h.
a. There are two trivial cases
minnodes(0) = 0
minnodes(1) = 1
b. For h > 1, we have a tree with a root and two subtrees.
i. One must have height h-1 for the overall tree to have height h.
ii. The other may have height either h-1 or h-2 (since the tree
is HB(1)). However, since minnodes(h) increases with h we will
get a smaller value if it has height h-2.
iii. Further, to get the minimum number of nodes overall, we want the
subtrees to be minimal.
iv. Thus, we get the recurrence
for h > 1, minnodes(h) = 1 + minnodes(h-1) + minnodes(h-2)
c. The solution to this recurrence is
minnodes(h) = Fib(h+2) - 1
To see this, observe that
minnodes(0) = Fib(2) - 1 = 1 - 1 = 0 (as required)
minnodes(1) = Fib(3) - 1 = 2 - 1 = 1 (as required)
For h > 1
minnodes(h) = Fib(h+2) - 1 = Fib(h+1) + Fib(h) - 1
= 1 + (Fib(h+1) - 1) + (Fib(h) - 1)
= 1 + minnodes(h-1) + minnodes(h-2)
(as required)
d. But Fib(x) has a closed form (Binet's formula):
Fib(x) = (1/sqrt(5)) * ((1 + sqrt(5))/2)**x - (1/sqrt(5)) * ((1 - sqrt(5))/2)**x
Using sqrt(5) = 2.24, we approximate
Fib(x) = .45*(1.62)**x - .45*(-.62)**x
But since the second term approaches zero rapidly for large x, we
can use
Fib(x) ~= .45*(1.62)**x for large x
e. This gives
minnodes(h) ~= .45*(1.62)**(h+2) - 1 for large h
For large h, the -1 becomes insignificant, so we can use
.45*(1.62)**(h+2)
f. Thus, for an HB(1) tree with n nodes, we have
n >= .45*(1.62)**(h+2)
Taking logs base 1.62:
log_1.62 n >= (h+2) + log_1.62 .45 = (h+2) - 1.66
h <= log_1.62 n - 0.34
But since log_1.62 n = 1.44 log_2 n, we have
h <= 1.44 log_2 n - 0.34
i.e. h = O(log n)
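As a quick check of the recurrence against the closed form, here is a small,
purely illustrative C++ program (the function names are our own); the two
columns it prints should agree:

    #include <iostream>

    // minnodes(h) computed directly from the recurrence above
    long minnodes(int h) {
        if (h <= 1) return h;                  // minnodes(0) = 0, minnodes(1) = 1
        return 1 + minnodes(h-1) + minnodes(h-2);
    }

    // Fibonacci numbers, with Fib(1) = Fib(2) = 1
    long fib(int x) {
        long a = 0, b = 1;                     // Fib(0) = 0, Fib(1) = 1
        for (int i = 0; i < x; i++) { long t = a + b; a = b; b = t; }
        return a;
    }

    int main() {
        for (int h = 0; h <= 20; h++)
            std::cout << h << ": " << minnodes(h) << "  " << fib(h+2) - 1 << "\n";
    }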
E. Maintaining AVL trees when inserting new nodes
1. We have seen that an AVL tree has height at worst 44% greater than
a complete binary tree. Moreover, while maintaining a binary search
tree as a complete tree under insertions would be costly, it
turns out that maintaining an AVL tree under insertions and deletions
is not. In fact, all operations on an AVL tree can be done in O(h)
time = O(log n).
2. The key idea is this: we associate with each node in the tree a field
recording its BALANCE.
a. The balance of a node is defined as follows:
height(left subtree) - height(right subtree).
b. Example: in the following tree the balances are recorded on each
node.
(0)
/ \
(1) (-1)
/ \ / \
(1) (0) (0) (-1)
/ \
(0) (0)
c. Clearly in an HB(1) tree the only possible balance values are -1,
0, and +1. Thus, if space is at a premium, we may be able to avoid
allocating a separate field for the balance by using one bit in each
child pointer to tag the node as "heavy on the left" (left child
tag bit set), "heavy on the right" (right child tag bit set) or
"balanced" (no tag bits set.) However, in our examples we will work
with a separate tag field.
3. In the insertion algorithm, we add new nodes at the bottom of the
tree as always. But after we do so, we work our way back up to the
root of the tree, updating balances and possibly performing a
correction should a balance become < -1 or > +1.
a. This can be done easily with a recursive insert; we take care of
updating balances etc. as we unwind the insertion - e.g.
to insert newinfo into tree rooted at r:
if r = nil then
insert new node with balance 0 here
else if newinfo < r^.info then
insert into left child
update balance of r
else
insert into right child
update balance of r
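In C++-like form this skeleton might look as follows (purely a sketch - the
node layout and the helper flag heightGrew are our own choices, and the
rebalancing step is only indicated by a comment, since the rotations are
developed in items 4-7 below):

    struct Node {
        int key;                     // the key (info) stored in this node
        int balance;                 // height(left subtree) - height(right subtree)
        Node *left, *right;
        Node(int k) : key(k), balance(0), left(0), right(0) {}
    };

    // Insert key into the subtree rooted at r; returns the (possibly new) root
    // of that subtree.  heightGrew reports whether the subtree's height
    // increased, so the caller can update its own balance as the recursion
    // unwinds.
    Node* insert(Node* r, int key, bool& heightGrew) {
        if (r == 0) {                            // empty spot: new node, balance 0
            heightGrew = true;
            return new Node(key);
        }
        bool childGrew = false;
        if (key < r->key) {
            r->left = insert(r->left, key, childGrew);
            if (childGrew) r->balance += 1;      // left side got taller
        } else {
            r->right = insert(r->right, key, childGrew);
            if (childGrew) r->balance -= 1;      // right side got taller
        }
        // If r->balance is now +2 or -2, a corrective rotation (items 4-7 below)
        // is applied here; the rotation restores the old subtree height, so in
        // that case the height did not grow.
        heightGrew = childGrew && (r->balance == 1 || r->balance == -1);
        return r;
    }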
b. To perform the update, we observe that one of two things can
have happened as a result of the insertion:
i. The height of the child was not changed. In this case, the
balance of the parent (r) does not change
ii. The height of the child increased by one. In this case, the
balance of the parent changes by +/- 1 (depending on whether
the insertion was on its left or right side.)
iii. We can decide between these two cases by looking at the
child before and after the insert.
Before insert After insert
nil --- height increased
bal = 0 bal = 0 height unchanged
bal = 0 bal = +/- 1 height increased
bal = +/- 1 bal = 0 height unchanged
bal = +/- 1 bal = +/- 1 height unchanged
c. Now if the height of the child increased, then we have the
following cases with regard to the parent's balance
Side where Parent's balance New balance when
insert done before insert child height increased
L -1 0
L 0 +1
L +1 TROUBLE
R -1 TROUBLE
R 0 -1
R +1 0
4. In the two cases labeled "TROUBLE" we have to perform some
corrective action at the parent node r. WE CONSIDER ONLY THE
CASE OF INSERTION ON THE LEFT HERE - insertion on the right is
a mirror image and is left as an exercise to the student.
5. The corrective action to be taken depends on the balance of the
child. Recall that there are two ways a child's height can
increase - it can go from nil to non-nil, or its balance can
go from 0 to +/- 1. Only the latter, however, can cause an
imbalance at the parent. (If the child was nil before the
insert, the parent could not already have been heavy on that side,
and so cannot become unbalanced there now.) Thus, we
have one of the following two cases
a. Insertion into left subtree of child:
BEFORE: AFTER:
(r) (r)
/ +1 \ / +1 \
(c) T3 (c) T3
/ 0 \ /+1 \
T1 T2 T1 T2
If height of T1 = h Height of T1 is now h+1
then height of T2 = h (Other heights unchanged)
and height of T3 = h
Overall height is h+2 Overall height is h+3
b. Insertion into right subtree of child:
BEFORE: AFTER:
(r) (r)
/ +1 \ / +1 \
(c) T3 (c) T3
/ 0 \ /-1 \
T1 T2 T1 T2
If height of T1 = h Height of T2 is now h+1
then height of T2 = h (Other heights unchanged)
and height of T3 = h
Overall height is h+2 Overall height is h+3
6. The first sort of problem is handled by a RIGHT ROTATION AROUND R:
BEFORE: AFTER INSERT: AFTER ROTATION:
(r) (r) (c)
/ +1 \ / +1 \ / 0 \
(c) T3 (c) T3 T1 (r)
/ 0 \ /+1 \ / 0 \
T1 T2 T1 T2 T2 T3
If height of T1 = h Height of T1 is now h+1
then height of T2 = h (Other heights unchanged)
and height of T3 = h
Overall height is h+2 Overall height is h+3 Overall height is h+2
a. Observe that the rotation preserves the inorder traversal order,
and thus the binary search tree property.
Before: inorder = T1 c T2 r T3
After: inorder = T1 c T2 r T3
b. Observe, too, that since the overall height of the subtree formerly
rooted at r (but now at c) is again h+2, no further corrections
will be needed higher in the tree.
c. Thus, after insertion and correcting rotation, the tree is once
again an AVL tree!
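A sketch of this rotation in C++ (using the illustrative Node structure from
the insertion sketch above; the balance settings apply to exactly this case,
where r's balance has become +2 and c's is +1):

    // Right rotation around r.  Returns the new root of the subtree (c);
    // the caller links it into place where r used to be.
    Node* rotateRight(Node* r) {
        Node* c = r->left;
        r->left  = c->right;    // T2 becomes r's left subtree
        c->right = r;           // r becomes c's right child
        r->balance = 0;         // r now has T2 (height h) and T3 (height h)
        c->balance = 0;         // c now has T1 (height h+1) and r (height h+1)
        return c;
    }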
7. The second sort of problem is more complex, and requires a DOUBLE
ROTATION involving c and r. To analyze this, we need to get the
right child of c into the act. (We'll call it g, since it's a
grandchild of r.)
BEFORE: AFTER: LEFT ROTATE RIGHT ROTATE
AROUND C:                  AROUND R:
(r) (r) (r) (g)
/ +1 \ / +1 \ / +1 \ / 0 \
(c) T3 (c) T3 (g) T3 (c) (r)
/ 0 \ /-1 \ / \ / * \ / * \
T1 (g) T1 (g) (c) T2b T1 T2a T2b T3
/ 0 \ /+/-1\ / \
T2a T2b T2a T2b T1 T2a
If height of T1 = h Height of either T2a or * = see below
then height of T2a=h-1 T2b (but not both) is
and height of T2b=h-1 now h.
and height of T3 = h (Other heights unchanged)
Overall height is h+2 Overall height is h+3 Overall height is h+2
a. The setting of the balances is now a bit more complex. While g's
balance is always 0 (since each subtree is of height h+1), we have
two cases for c and r, based on the balance of g after the
insertion but before the rotation.
Balance of g Height of subtrees Balance of c and r after
after insert after insert rotate
but before
rotate T2a T2b c r
-1 h-1 h +1 0
+1 h h-1 0 -1
b. Observe that the rotations preserve the inorder traversal order,
and thus the binary search tree property.
Before: inorder = T1 c T2a g T2b r T3
After: inorder = T1 c T2a g T2b r T3
c. Observe, too, that since the overall height of the subtree formerly
rooted at r (but now at g) is again h+2, no further corrections
will be needed higher in the tree.
d. Thus, after insertion and correcting rotation, the tree is once
again an AVL tree!
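Again as an illustrative C++ sketch (same Node structure; this handles exactly
the case above, where r's balance has become +2 and c's is -1):

    // Double (left-right) rotation: left around c, then right around r.
    // Returns the new root of the subtree (g).
    Node* rotateLeftRight(Node* r) {
        Node* c = r->left;
        Node* g = c->right;
        // Set the balances of c and r from g's pre-rotation balance (table in a.)
        if (g->balance == +1)      { c->balance = 0;  r->balance = -1; }
        else if (g->balance == -1) { c->balance = +1; r->balance = 0;  }
        else                       { c->balance = 0;  r->balance = 0;  }
        // (g->balance == 0 can only happen when g is itself the new node)
        g->balance = 0;
        c->right = g->left;      // T2a
        r->left  = g->right;     // T2b
        g->left  = c;
        g->right = r;
        return g;                // caller links g into place where r used to be
    }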
8. Note that these AVL insertion operations - even when rotations are
needed - still take time proportional to the height of the tree.
Thus, insertion into an AVL tree is O(log n).
F. Deletion from AVL trees
1. As always, deletion is a harder operation than insertion; but it is
always possible to delete a node from an AVL tree in O(log n) time.
2. To keep our life simple, we consider only the case of deleting a
node with at most one child. (Recall that the problem of deleting a
node with two children can be converted into this case by "promoting"
data from its inorder predecessor or successor and then deleting the
duplicated data lower in the tree.)
3. If a node has one child, then its balance before deletion is +/- 1,
and its child's balance is 0. We simply move the child up to take
the place in the structure of the deleted node.
4. As in insertion, we have to make a pass through the tree from the
parent of the deleted node back to the root, resetting balances and
possibly performing corrective rotations.
a. This time, we are looking for situations where the height of
a child DECREASED:
Before delete After delete
--- nil height decreased
bal = 0 bal = 0 height unchanged
bal = 0 bal = +/- 1 height unchanged
bal = +/- 1 bal = 0 height decreased
bal = +/- 1 bal = +/- 1 height unchanged
b. Now if the height of the child decreased, then we have the
following cases with regard to the parent's balance
Side where Parent's balance New balance when
delete done before delete child height decreased
L -1 TROUBLE
L 0 -1
L +1 0
R -1 0
R 0 +1
R +1 TROUBLE
5. In the two cases labeled "TROUBLE" we have to perform some
corrective action at the parent node r. WE CONSIDER ONLY THE
CASE OF DELETION ON THE LEFT HERE - deletion on the right is
a mirror image and is left as an exercise to the student.
6. The corrective action to be taken depends on the balance of the
OTHER child - the one on the side where the delete did NOT
occur. (This child must be non-nil, else an imbalance
could not have arisen.) Now, however, we have three major cases:
a. Left child deleted; right child heavy on right. Fix by
LEFT ROTATE around r.
BEFORE: AFTER DELETE: AFTER ROTATE:
(r) (r) (c)
/ -1 \ / -2 \ / 0 \
T1 (c) T1 (c) (r) T3
/ -1 \ / -1 \ / 0 \
T2 T3 T2 T3 T1 T2
If height of T1 = h Height of T1 is now h-1
then height of T2 = h-1 (Other heights unchanged)
and height of T3 = h
Overall height is h+2 Overall height is h+2 Overall height is h+1
b. Left child deleted; right child balanced. Fix by LEFT ROTATE
around r:
BEFORE: AFTER DELETE: AFTER ROTATE:
(r) (r) (c)
/ -1 \ / -2 \ /+1 \
T1 (c) T1 (c) (r) T3
/ 0 \ / 0 \ /-1 \
T2 T3 T2 T3 T1 T2
If height of T1 = h Height of T1 is now h-1
then height of T2 = h (Other heights unchanged)
and height of T3 = h
Overall height is h+2 Overall height is h+2 Overall height is h+2
c. Left child deleted; right child heavy on left. Fix by RIGHT
ROTATE around c, then LEFT ROTATE around r.
BEFORE: AFTER DELETE: AFTER BOTH ROTATES:
(r) (r) (g)
/ -1 \ / -2 \ / 0 \
T1 (c) T1 (c) (r) (c)
/ +1 \ / +1 \ / * \ / * \
(g) T3 (g) T3 T1 T2a T2b T3
/ \ / \
T2a T2b T2a T2b
If height of T1 = h Height of T1 is now h-1
then at least one of
T2a, T2b is h-1; other
may be h-2. The height of T3 = h-1.
Overall height is h+2 Overall height is h+2 Overall height is h+1
* Balances of r and c are functions of original balance of g
d. Observe that, as with insert, the rotations preserve the inorder
traversal order in every case.
e. However, in some cases the height of the overall subtree formerly
rooted at r (and now at c or g) decreases by 1. Thus, further
corrections may still be needed higher up in the tree - in
contrast to the case with insert.
IV. Splay Trees
-- ----- -----
A. The AVL tree we just considered gives guaranteed O(log N) cost for each
operation on the tree, at the cost of storing some additional information
in each node (its balance) and doing significant extra work on insertions
and deletions.
B. Our book discusses an alternative strategy that avoids the need for storing
additional information and does additional work on lookups and deletions
but not on insertions. Further, the additional work is less complex
than the additional work required to maintain an AVL tree.
1. This strategy, known as SPLAYING, does not guarantee that EACH
individual operation will have cost O(log N) - it could in fact have
cost O(N). However, it guarantees that the AVERAGE COST of any series
of operations is O(log N). (If one step in the series takes O(N)
time, the time taken by the remaining steps is small enough that the
average is O(log N).)
2. We say, therefore, that a splay tree has AMORTIZED COST for each
operation O(log N).
3. The basic idea is that, after looking up a node in the tree (as part
of a lookup or delete operation), we perform a series of rotations
working back up the tree which results in the node we just found
being made the root of the tree. If the node we found was down a
long, narrow branch of the tree, a side effect is to make the whole
tree much more balanced.
C. The whole process of doing this is discussed at length in chapter 4 of
the book, and a different approach is discussed in chapter 12. (The
latter approach does the splaying on the way down the tree when finding
a node, rather than on the way back up after finding it.) Chapter 12
also discusses the analysis that leads to the O(log N) amortized cost -
a rather messy process!
D. Time will not permit further discussion here.
V. 2-3-4 and Red-Black Trees
- ----- --- --------- -----
A. We have considered one type of search tree that has guaranteed O(log n)
performance for all operations: the AVL tree. We now consider another
type of search tree having a similar guarantee: the 2-3-4 tree,
and then an implementation of this basic idea known as a red-black tree.
The latter structure is particularly important because it is the one
used by the STL set, map, multiset, and multimap containers.
B. A 2-3-4 tree is a search tree in which each node has 2, 3, or 4 children -
and contains 1, 2, or 3 keys - e.g.
A 2 node:
( k )
/ \
All keys are < k All keys are > k
A 3 node:
( k1 k2 )
/ | \
All keys are < k1 | All keys are > k2
|
All keys lie between k1 and k2
A 4 node:
( k1 k2 k3 )
/ | | \
All keys are < k1 | | All keys are > k3
| |
All keys lie between k1 and k2 All keys lie between k2 and k3
C. Such a tree can be searched by an extension of our binary search tree
algorithms. We won't consider this, though, because we will eventually
use a different representation for the tree.
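For concreteness, one possible node layout and the corresponding search routine
might look like this in C++ (an illustrative sketch only - as noted, we will
actually use a different representation):

    struct Node234 {
        int nkeys;               // 1, 2, or 3 keys (a 2 node, 3 node, or 4 node)
        int key[3];              // key[0 .. nkeys-1], in ascending order
        Node234* child[4];       // child[0 .. nkeys]; all nil in a leaf
    };

    // Returns true iff k occurs in the tree rooted at t.
    bool search(Node234* t, int k) {
        while (t != 0) {
            int i = 0;
            while (i < t->nkeys && k > t->key[i]) i++;    // find the branch to take
            if (i < t->nkeys && k == t->key[i]) return true;
            t = t->child[i];     // keys in child[i] lie between key[i-1] and key[i]
        }
        return false;
    }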
D. Of more interest is the problem of maintaining the tree.
1. As always, we will insert new keys at the bottom of the tree. But
now we have several possibilities:
a. If we encounter a 2 node or a 3 node whose children are all nil,
instead of creating a new node we will add the key to the
existing node. Example - Insert Raccoon into:
Fox Hippo
/ | \
Aardvark Cow Dog Goose Zebra
The rightmost node becomes: Raccoon Zebra
b. If we encounter a 4 node whose children are all nil, we will SPLIT
the 4 node into two 2 nodes, and pass the middle key up to its
parent, and then insert the new key - e.g. insert Elephant into
the above tree:
Split the four node:
Cow Fox Hippo
/ | | \
Aardvark Dog Goose Raccoon Zebra
Insert the new key:
Cow Fox Hippo
/ | | \
Aardvark Dog Elephant Goose Raccoon Zebra
(Note that the split is done BEFORE the insertion.)
2. However, there is one place where this strategy can get us into
trouble. Suppose, when we split a four node, its parent is also a
four node? Then, we have no room to insert the key in the parent.
a. This can be handled by simply splitting the parent and promoting
one of its keys - which in turn could cause another split etc.
This can get messy.
b. A cleaner alternative is to adopt a policy of ALWAYS SPLITTING
any four node we encounter going down the tree on a search.
i. Even if we don't have to do so immediately, we will probably
have to do so eventually.
ii. This ensures whenever we split a node that its parent will be
either a 2 node or a 3 node, with room for the promoted key.
3. One special case remains - the root. If the root of the tree is
a four node, what do we do? (Where do we put the promoted key?)
a. The answer is that we create a new two node to become the root
of the tree, which adopts the two halves of the original root - e.g.
Before: After:
| |
Cow Fox Gopher Fox
/ \
Cow Gopher
b. When this occurs, the overall tree increases its height by one.
However, this will be a rather rare occurrence, since two promotions
from below (each the result of splitting a four node) will have to
occur before the root again needs to split.
4. One interesting property of a 2-3-4 tree is that it is ALWAYS
height-balanced. In particular:
a. Either all the children of a node are nil or none of them are.
b. All the nodes with nil children are at the same level.
This follows from the fact that we never insert new nodes into the
tree; rather, we insert into or split existing nodes. The only
way the height of the tree ever increases is when the root splits,
and this affects all leaves equally.
5. Example: insert T H E Q U I C K B R O W N F O X
into an initially empty 2-3-4 tree
T [ T ]
H [ H T ]
E [ E H T ]
Q [ H ]
/ \
[ E ] [ Q T ]
U [ H ]
/ \
[ E ] [ Q T U ]
I [ H T ]
/ | \
[ E ] [ I Q ] [ U ]
C [ H T ]
/ | \
[ C E ] [ I Q ] [ U ]
K [ H T ]
/ | \
[ C E ] [ I K Q ] [ U ]
B [ H T ]
/ | \
[ B C E ] [ I K Q ] [ U ]
R [ H K T ]
/ | | \
[ B C E ] [ I ] [ Q R ] [ U ]
O [ K ]
/ \
[ H ] [ T ]
/ \ / \
[ B C E ] [ I ] [ O Q R ] [ U ]
W [ K ]
/ \
[ H ] [ T ]
/ \ / \
[ B C E ] [ I ] [ O Q R ] [ U W ]
N [ K ]
/ \
[ H ] [ Q T ]
/ \ / | \
[ B C E ] [ I ] [ N O ] [ R ] [ U W ]
F [ K ]
/ \
[ C H ] [ Q T ]
/ | \ / | \
[ B ] [ E F ] [ I ] [ N O ] [ R ] [ U W ]
O already present
X [ K ]
/ \
[ C H ] [ Q T ]
/ | \ / | \
[ B ] [ E F ] [ I ] [ N O ] [ R ] [ U W X ]
Question: what sequence of keys would most quickly lead to another
insertion into the root?
Answer: Y M P
E. Let's analyze the efficiency of operations on a 2-3-4 tree. Clearly
insert and locate are O(h). (We don't consider delete here, because
it is considerably more complex to implement; however, it too is O(h)).
What is the relationship between the number of KEYS in a 2-3-4 tree (n)
and its height (h)? (Note: we worry about keys, not number of nodes.)
1. As we did with the AVL tree, we consider the related problem of
determining the MINIMUM number of keys in a 2-3-4 tree of height h.
Clearly, such a tree would have every node a two node. Thus, we
would have:
Level Number of nodes Number of keys
1 1 1
2 2 2
3 4 4
2. Therefore, we can see that n >= 2^h - 1, so
h <= log_2 (n+1)
3. We can also show by similar reasoning on a MAXIMAL tree (all 4 nodes)
that n <= 4^h - 1, so
h >= log_4 (n+1)
But since log_4 n = 0.5 * log_2 n, we have h = Theta(log n)
F. We have seen that 2-3-4 trees represent a nice way of maintaining a
balanced search tree. Unfortunately, the code to maintain and use them
is made complex by the fact that we have to deal with three different
types of node. We now consider a type of tree called a red-black tree
that implements the principle of the 2-3-4 tree more easily.
1. A red-black tree is a binary tree in which we associate a COLOR
(red or black) with each link. (This can be implemented by a single
tag bit on each pointer.)
2. We can represent a 2-3-4 tree by a red-black tree as follows:
two node ( ) / = black ptr
/ \ //= red ptr
three node ( ) or ( )
// \ / \\
( ) ( )
/ \ / \
four node ( )
// \\
( ) ( )
/ \ / \
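An illustrative C++ layout: the color of the link from a node's parent is
usually kept in the node itself (one bit), and a 4 node then shows up as a node
with two red children:

    enum Color { RED, BLACK };

    struct RBNode {
        int key;
        Color color;             // color of the link from this node's parent
        RBNode *left, *right;
    };

    // In the 2-3-4 view, a node whose two child links are both red is the
    // middle key of a 4 node; this is the test used on the way down on insert.
    bool isFourNode(RBNode* n) {
        return n != 0
            && n->left  != 0 && n->left->color  == RED
            && n->right != 0 && n->right->color == RED;
    }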
3. This kind of tree has several interesting properties:
a. Our lookup algorithm is the same as for an ordinary binary search
tree.
b. The number of black links on any path from root to leaf is the
same for all paths in the tree.
c. On any path from root to leaf, we never encounter two successive
red links. Thus, the total number of links on any path from
root to leaf is at most twice the number of links in the
corresponding 2-3-4 tree. (This relates to our demonstration that
the height of a 2-3-4 tree lies between log_4 n and log_2 n.)
4. On insert, when going down the tree, if we encounter a node with
two red children, it represents a four node and should be split.
There are four cases, depending on the parent of the four node.
(Note: most cases below are actually two separate cases, depending on
whether the node being split is the left child or the right child of
its parent. We consider only the left child case; the right child
one is a mirror image.)
a. There is no parent - the node being split is the root of the whole
tree.
2-3-4 TREE BEFORE RED-BLACK REPRESENTATION BEFORE
------------
| C1 C2 C3 | (C2)
------------ // \\
/ | | \ (C1) (C3)
T1 T2 T3 T4 / \ / \
T1 T2 T3 T4
2-3-4 TREE AFTER RED-BLACK REPRESENTATION AFTER
------
| C2 | (C2)
------ / \
/ \ (C1) (C3)
------ ------ / \ / \
| C1 | | C3 | T1 T2 T3 T4
------ ------
/ \ / \
T1 T2 T3 T4
Observe: The only change is to convert the two red child pointers
to black!
b. The parent is a two node.
2-3-4 TREE BEFORE RED-BLACK REPRESENTATION BEFORE
-----
| P |
-----
/ \ ( P )
------------ T5 / \
| C1 C2 C3 | (C2) T5
------------ // \\
/ | | \ (C1) (C3)
T1 T2 T3 T4 / \ / \
T1 T2 T3 T4
2-3-4 TREE AFTER RED-BLACK REPRESENTATION AFTER
--------
| C2 P |
--------
/ | \ ( P )
------ ----- T5 // \
| C1 | |C3 | (C2) T5
------ ----- / \
/ | | \ (C1) (C3)
T1 T2 T3 T4 / \ / \
T1 T2 T3 T4
Observe: the only change needed to implement the split is to
change the pointer from the parent to the node being
split to red, and to change the two child pointers of
the node being split to black!
c. The parent is a three node with node being split on its "black"
side.
2-3-4 TREE BEFORE RED-BLACK REPRESENTATION BEFORE
---------
| P1 P2 |
---------
/ | \ (P1)
------------ T5 T6 / \\
| C1 C2 C3 | (C2) (P2)
------------ // \\ / \
/ | | \ (C1) (C3) T5 T6
T1 T2 T3 T4 / \ / \
T1 T2 T3 T4
2-3-4 TREE AFTER RED-BLACK REPRESENTATION AFTER
------------
| C2 P1 P2 |
------------
/ | | \ (P1)
------ ----- T5 T6 // \\
| C1 | |C3 | (C2) (P2)
------ ----- / \ / \
/ | | \ (C1) (C3) T5 T6
T1 T2 T3 T4 / \ / \
T1 T2 T3 T4
Observe: This case is basically the same as the previous one.
d. The parent is a three node with node being split on its "red"
side. We have two subcases:
i. Node being split is leftmost child of parent.
2-3-4 TREE BEFORE RED-BLACK REPRESENTATION BEFORE
---------
| P1 P2 | (P2)
--------- // \
/ | \ (P1) T6
------------ T5 T6 / \
| C1 C2 C3 | (C2) T5
------------ // \\
/ | | \ (C1) (C3)
T1 T2 T3 T4 / \ / \
T1 T2 T3 T4
2-3-4 TREE AFTER RED-BLACK REPRESENTATION AFTER
------------
| C2 P1 P2 |
------------
/ | | \ (P1)
------ ------ T5 T6 // \\
| C1 | | C3 | (C2) (P2)
------ ------ / \ / \
/ | | \ (C1) (C3) T5 T6
T1 T2 T3 T4 / \ / \
T1 T2 T3 T4
Observe: This has required a right rotation around the root of
the parent. The root of the subtree is now P1, not P2.
ii. Node being split is second child from left of parent.
2-3-4 TREE BEFORE RED-BLACK REPRESENTATION BEFORE
---------
| P1 P2 | (P2)
--------- // \
/ | \ (P1) T6
T1 ------------ T6 / \
| C1 C2 C3 | T1 (C2)
------------ // \\
/ | | \ (C1) (C3)
T2 T3 T4 T5 / \ / \
T2 T3 T4 T5
2-3-4 TREE AFTER RED-BLACK REPRESENTATION AFTER
------------
| P1 C2 P2 | (C2)
------------ // \\
/ | | \ (P1) (P2)
T1 ------ ------ T6 / \ / \
| C1 | | C3 | T1 (C1) (C3) T6
------ ------ / \ / \
/ | | \ T2 T3 T4 T5
T2 T3 T4 T5
Observe: This has required a double rotation - left around the left
child of the parent, then right around the root of the parent.
The root of the subtree is now C2, not P2.
e. The parent can never be a four node; if it were, it would have
already been split when passing through it to the child we are now
splitting. Thus, we have covered all possible cases (except for
mirror images.)
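In the red-black representation, all of the cases above begin with the same
recoloring; cases d.i and d.ii then add a single or double rotation. A hedged
C++ sketch (using the illustrative RBNode layout above; the rotations
themselves are only indicated in comments):

    // Split the 4 node whose middle key is n, as we pass through it on the
    // way down during insertion.
    void split(RBNode* n, bool nIsTreeRoot) {
        if (!nIsTreeRoot)
            n->color = RED;          // n's key moves up into its parent's node
                                     // (case a: the root of the whole tree stays black)
        n->left->color  = BLACK;     // the two halves of the old 4 node
        n->right->color = BLACK;
        // Cases a, b and c need nothing more.  In cases d.i and d.ii the link
        // to n's parent was already red, so the flip leaves two red links in a
        // row; a single or double rotation around n's grandparent (just like
        // the AVL rotations seen earlier) is then applied to restore the shape.
    }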
5. Finally, we must consider what happens on insert when we reach the
bottom of the tree.
a. In the 2-3-4 tree, we never add nodes at the bottom of a tree -
we simply insert a new key into a leaf node. (Recall that our
splitting strategy guarantees that each node on the path from the
root down will be at most a 3 node, so there will always be
room.)
b. However, in the red-black implementation we actually DO add a node.
c. Again there are several cases dependent on the leaf into which the
key is to go. (Again, we consider insertion on the left. Insertion
on the right is the mirror image.)
i. The leaf is a two node:
2-3-4 TREE BEFORE RED-BLACK REPRESENTATION BEFORE
-----
| L | ( L )
----- / \
/ \ nil nil
nil nil
2-3-4 TREE AFTER RED-BLACK REPRESENTATION AFTER
-------
| N L | ( L )
------- // \
/ | \ (N) nil
nil nil nil / \
nil nil
ii. The leaf is a three node, with new key going on the "black"
side:
2-3-4 TREE BEFORE RED-BLACK REPRESENTATION BEFORE
---------
| L1 L2 | (L1)
--------- / \\
/ | \ nil (L2)
nil nil nil / \
nil nil
2-3-4 TREE AFTER RED-BLACK REPRESENTATION AFTER
-----------
| N L1 L2 | (L1)
----------- // \\
/ | | \ (N) (L2)
nil nil nil nil / \ / \
nil nil nil nil
iii. The leaf is a three node, with new key going on the outside of
"red" side:
2-3-4 TREE BEFORE RED-BLACK REPRESENTATION BEFORE
---------
| L1 L2 | (L2)
--------- // \
/ | \ (L1) nil
nil nil nil / \
nil nil
2-3-4 TREE AFTER RED-BLACK REPRESENTATION AFTER
-----------
| N L1 L2 | (L1)
----------- // \\
/ | | \ (N) (L2)
nil nil nil nil / \ / \
nil nil nil nil
Note rotation around root of subtree.
iv. The leaf is a three node, with new key going on the inside of
"red" side:
2-3-4 TREE BEFORE RED-BLACK REPRESENTATION BEFORE
---------
| L1 L2 | (L2)
--------- // \
/ | \ (L1) nil
nil nil nil / \
nil nil
2-3-4 TREE AFTER RED-BLACK REPRESENTATION AFTER
-----------
| L1 N L2 | (N)
----------- // \\
/ | | \ (L1) (L2)
nil nil nil nil / \ / \
nil nil nil nil
Note double rotation (new node is originally added as a child of
L1, then L1 subtree is rotated left, then main subtree is rotated
right.)
VI. Weight-Balanced Binary Search Trees
-- --------------- ------ ------ -----
A. As we indicated earlier, in cases where the set of key values to be
searched is known in advance, if we know the relative frequency of
searches for the different keys, then we can construct a search tree
that performs even better than a height-balanced one.
B. We now consider a method for finding the optimal binary search tree
for a given static set of keys, given an advance knowledge of the
probabilities of various values being sought.
1. First, though, we need to note that our discussion of probabilities
for various keys thus far has not been complete. In particular, we
have assumed that only keys actually in the tree will be searched
for.
a. If - as is often the case - some searches will involve keys not
in the tree, then we must also consider the cost of these failures.
b. For example, consider our tree of Pascal reserved words. When
the lexical scanner of a Pascal compiler encounters a word, it
does not know whether it is a reserved word or a user-defined
identifier until it checks the reserved word table. If the word
is not in the table, then it must be an identifier.
c. To handle this, we convert our search tree into an EXTENDED TREE
by adding FAILURE NODES (by convention, drawn as square boxes.)
Example: our balanced tree for the seven Pascal reserved words:
forward
/ \
case packed
/ \ / \
begin for otherwise varying
/ \ / \ / \ / \
[] [] [] [] [] [] [] []
Each failure node represents a group of keys for which the
search would fail - e.g. the leftmost one represents all keys
less than begin [a, apple, and]; the second all keys between
begin and case [boolean, c] etc.
d. In calculating the cost of a tree, we need to consider both
the probabilities of failures and of successes.
i. Let p_i be the probability of searching for key_i (1 <= i <= n)
ii. Let q_i be the probability of searching for a non-existent key
lying between key_i and key_(i+1). (Of course q_0 represents
all values less than key_1, and q_n all values greater than key_n.)
iii. Clearly, since we are working with probabilities, the sum of
all the p's and q's must be 1.
iv. For the above balanced tree, the total cost would be
p_4 + 2(p_2 + p_6) + 3(p_1 + p_3 + p_5 + p_7) + 3(q_0 + q_1 + ... + q_7)
- The cost of an internal node is its weight times its level -
i.e. the probability of its being searched for times the
# of key comparisons needed to find it.
- The cost of an external node is its weight times
(its level minus 1), which is the level of its parent - i.e. the
probability of its being searched for times the # of key comparisons
needed to get to the nil pointer that indicates that the
item is not in the tree.
2. To find an optimal tree, we need to define some terms and measures:
a. T_ij is a binary search tree containing key_(i+1) through key_j.
b. T_ii, then, is an empty tree, consisting only of the failure node
lying between key_i and key_(i+1).
c. The weight of T_ij is p_(i+1) + p_(i+2) + ... + p_j + q_i + q_(i+1) + ... + q_j
which is the probability that a search will end up in T_ij. The
weight of the empty tree T_ii, then, is q_i - the probability of
the failure node lying between key_i and key_(i+1). Note that, for a
non-empty tree, the weight is simply the probability of the root
plus the sum of the weights of the subtrees.
d. The cost of T_ij is calculated as follows:
- If T_ij is empty (consists only of a failure node), then its
cost is zero.
- Otherwise, its cost is the weight of its root, plus the sum
of the weights of its subtrees, plus the sum of the costs of
its subtrees.
- The first term represents the fact that search for the key
at the root costs one comparison.
- The rationale for including the costs of the subtrees in the
overall cost should be clear. To this, we add the WEIGHTS
of the subtrees to reflect the fact that we must do one
comparison at the root BEFORE deciding to go into the
subtree, and the probability that that comparison will lead
into the subtree is equal to the weight of the subtree.
e. Clearly, an optimal binary search tree is one whose cost is minimal.
f. Example - Horowitz (Pascal version) page 428 - TRANSPARENCY.
The cost of tree (b), with equal probabilities for all keys and
failures, is determined as follows:
i. Cost of external nodes = 0 in each case, and weights of
external nodes = 1/7 in each case.
ii. Cost of tree rooted at "do" = weight of root (1/7) +
sum of costs of subtrees (0) + sum of weights of subtrees (2/7) =
3/7. The weight of this subtree is also 3/7.
iii. Cost of tree rooted at "read" is 3/7 by similar reasoning, and
its weight is also 3/7.
iv. Cost of overall tree =
Weight of root = 1/7 +
Weight of left subtree = 3/7 +
Weight of right subtree = 3/7 +
Cost of left subtree = 3/7 +
Cost of right subtree = 3/7 = 13/7
3. Horowitz gives an algorithm for finding an optimal tree, given
a set of values for the p's and q's. The algorithm uses the
following terms:
a. T_ij is the OPTIMAL tree including keys i+1 .. j.
Therefore, T_0n is the optimal tree for the whole set of keys,
and is what we want to find.
b. r_ij is the ROOT of T_ij.
- Obviously, r_ij is undefined if i = j.
(We will record the value as 0 in this case.)
- If i < j, then the subtrees of T_ij are T_(i, r_ij - 1) and T_(r_ij, j)
(Clearly, if T_ij is optimal then its subtrees must be also.)
c. w_ij is the WEIGHT of T_ij.
- For i = j, w_ij = q_i.
- For i < j, w_ij = p_(r_ij) + w_(i, r_ij - 1) + w_(r_ij, j)
d. c_ij is the COST of T_ij.
- For i = j, c_ij = 0.
- For i < j, c_ij = p_(r_ij) + w_(i, r_ij - 1) + w_(r_ij, j) + c_(i, r_ij - 1) + c_(r_ij, j)
             = w_ij + c_(i, r_ij - 1) + c_(r_ij, j)
4. The operation of the algorithm for four keys is traced on page 431
(TRANSPARENCY), using the probabilities [multiplied by 16 for
convenience]: p = (3,3,1,1) and q = (2,3,1,1,1)
a. The first row represents empty trees, whose weights are simply
the appropriate "q" value, whose costs are 0, and whose roots
are undefined.
b. The second row represents trees containing just one key.
In each case, the weight is the sum of the weight of the one key
plus the weights of the two adjacent failure nodes, and the
cost is the weight of the one key (since the costs of failure
nodes are zero.) The root, of course, is the one key.
c. The third row represents the optimal choice for constructing
trees of two nodes.
i. For example, the first entry represents a tree including keys 1
and 2 - i.e. T_02. The two options would have been to let key 1
be the root or key 2 be the root. Calculating the costs:
- if key 1 is the root, then the cost is
p_1 + w_00 + w_12 + c_00 + c_12 = 3 + 2 + 7 + 0 + 7 = 19
- if key 2 is the root, then the cost is
p_2 + w_01 + w_22 + c_01 + c_22 = 3 + 8 + 1 + 8 + 0 = 20
Thus, 1 is chosen as r_02 and the cost of 19 is recorded.
ii. The remaining entries in the row are calculated in the
same way. Note that the weights and costs needed to compare
root choices are always available from previous rows.
d. Subsequent rows represent optimal trees with 3 and then 4
keys. The latter is, of course, the final answer.
5. This algorithm is implemented by the following program:
TRANSPARENCY p. 433
6. Time complexity? (ASK CLASS) - O(n^2)
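A sketch of the computation in C++, using the probabilities from the traced
example (illustrative only - Horowitz's actual program is on the transparency).
As written here, trying every possible root makes the work O(n^3); Horowitz
restricts the search for r_ij to the range r_(i,j-1) .. r_(i+1,j), which brings
it down to O(n^2):

    #include <iostream>

    const int N = 4;                              // number of keys
    int p[N+1] = {0, 3, 3, 1, 1};                 // p[1..N]  (scaled by 16)
    int q[N+1] = {2, 3, 1, 1, 1};                 // q[0..N]
    int w[N+1][N+1], c[N+1][N+1], r[N+1][N+1];

    int main() {
        for (int i = 0; i <= N; i++) {            // empty trees T_ii
            w[i][i] = q[i]; c[i][i] = 0; r[i][i] = 0;
        }
        for (int len = 1; len <= N; len++)        // trees containing len keys
            for (int i = 0; i + len <= N; i++) {
                int j = i + len;
                w[i][j] = w[i][j-1] + p[j] + q[j];
                int bestRoot = 0, bestCost = 0;
                for (int k = i+1; k <= j; k++) {  // try each key as the root
                    int cost = w[i][j] + c[i][k-1] + c[k][j];
                    if (bestRoot == 0 || cost < bestCost) { bestRoot = k; bestCost = cost; }
                }
                c[i][j] = bestCost; r[i][j] = bestRoot;
            }
        std::cout << "cost " << c[0][N] << ", root " << r[0][N] << "\n";
    }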
VII. Advanced Hashing
--- -------- -------
A. Review of basic concepts:
1. Though the actual key may be of any data type (often a string), we
treat the key as an integer by using some conversion technique - e.g.
by treating the characters in the key as digits radix 27 (if all
alphabetic) or 128 (if any ASCII character is allowed.)
2. We apply some key-to-address transformation technique (hash function)
to the key to reduce the range of possible values: 1 .. b or
0 .. b - 1 for some b.
3. We construct a table which is conceptually an array of b BUCKETS, each
of which consists of one or more SLOTS. The buckets are numbered
1 .. b or 0 .. b - 1. A key-value pair is stored, when possible, in
the bucket corresponding to the hash function of its key.
Example: A common hash function is the division remainder method.
If we have a hash table with 11 buckets numbered 0 .. 10, and
wish to store the key 42, we try to store it in bucket
42 mod 11 = 9.
4. Two keys that hash to the same bucket are called SYNONYMS, and if
both occur in the same table they are said to COLLIDE. If the
number of colliding keys hashing to the same bucket exceeds the number
of slots available in the bucket, then we have an OVERFLOW and must
deal with it somehow. Possible solutions we have looked at
previously include:
a. LINEAR PROBING - if a key belongs in bucket h but cannot fit,
then we try h + 1, h + 2 ... (wrapping around to the start as
needed.)
b. CHAINING - the hash table consists not of an array of buckets,
but of an array of LISTS OF BUCKETS. If there is not room for
a given key in existing buckets, we add a new bucket to the
appropriate list.
Both of these methods can lead to O(n) performance over time -
the first because we may (in principle) have to probe a large fraction
of all the buckets when inserting or looking up a key, and the latter
because the chains of buckets can grow long.
B. Performance improvement techniques for hashing include the following:
1. Improving the hash function. A hash function that disperses the
keys evenly over all the buckets gives much better performance than
one that tends to cluster many of the keys into one or a few
buckets.
Example: A very poor hash function for Gordon College student
ids would be to use just the first two digits. Most ID's are
of the form 93xxxxx, and this function would hash them all
to the same value!
2. Improving the overflow handling strategy. In particular, we want
to reduce the probability of repeated collisions for a given key.
3. Allowing the table to grow dynamically as keys are added, thus
reducing the probability of collision by increasing the range of
values of the hash function.
C. Possibilities for Hash Functions
1. Any hashing function should meet two basic criteria:
a. For all possible logical keys, it must produce a value in the
range 0..b-1 or 1..b.
b. It must disperse the logical keys uniformly - i.e. the probability
that a randomly chosen key hashes to any particular bucket should be
1/b - or very close to this.
This second criterion becomes more complicated if the keys to be
used exhibit some pattern or bias. (As was the case with our
student id example)
c. A further consideration is that the hashing function should not
be computationally-expensive, since we are trying to compete with
an O(log n) search strategy and can lose our advantage if too much
computation is required.
2. The division-remainder method
a. home-bucket := key mod b (to produce a result in the range 0..b-1)
or key mod b + 1 (to produce a result in the range 1..b)
b. Advantages:
i. Computationally simple if the key is an integer to begin with,
or if converting it to an integer is not too expensive.
ii. Provides good dispersion if b is a prime or at least has no
prime factors <= 20.
iii. Flexible choice of b values - many sizes to choose from.
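As a small illustration (the radix-128 conversion and the function name are our
own choices), a division-remainder hash for string keys can fold the conversion
and the mod together so the intermediate value never overflows:

    #include <string>

    // Treat the characters of the key as digits radix 128 and reduce mod b as
    // we go (Horner's rule); the result is a home bucket in 0 .. b-1.
    unsigned hashDivision(const std::string& key, unsigned b) {
        unsigned h = 0;
        for (std::string::size_type i = 0; i < key.size(); i++)
            h = (h * 128 + (unsigned char)key[i]) % b;
        return h;
    }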
3. The mid-square method
a. home-bucket := middle m bits of sqr(key).
b. This requires that b be a power of 2.
c. Example: Let b be 64, and let keys be integers ranging from 1 to
1000. Then the square of the key is 20 bits long, and we choose bits
7..12. Suppose our key is 50:
50 would hash as follows:
sqr(50) = 2500 = 0000 000[0 1001 1]100 0100 (base 2; the bits taken are bracketed)
home-bucket = 010011 (base 2) = 19
d. Advantages:
i. Computationally simple if the key is an integer to begin
with, or if converting it to an integer is not too expensive.
ii. Provides good dispersion, since the hash function depends on all
bits of the original key.
iii. Tables whose size is a power of 2 are often natural anyway.
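A sketch of this particular example in C++ (the shift amount and mask are
specific to b = 64 buckets and keys up to 1000, as above):

    // Mid-square hash: square the key and keep the 6 middle bits (bits 7..12
    // of the 20-bit square), giving a bucket number in 0 .. 63.
    unsigned midSquare(unsigned key) {
        unsigned sq = key * key;       // at most 20 bits when key <= 1000
        return (sq >> 7) & 0x3F;       // e.g. key = 50: sq = 2500, bucket = 19
    }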
4. Folding
a. Folding is one way of avoiding the need to convert a non-integer key
to an integer - which often poses a problem if the key is long
enough that the resultant value would not fit in the word length of
the underlying machine. (E.g. even a 7-letter alphabetic key,
treated as a number radix 27, could have a value bigger than the
largest 32 bit integer.)
b. The key is divided into a number of pieces, each of which is treated
as an integer. All of the pieces are added together, either
straight or with alternate pieces reversed.
c. Example: 123456789012 might be treated as four pieces
123 456 789 012
which could be added together one of two ways:
i. Shift folding: 123
456
789
012
---
1380
ii. Boundary folding: 123
654
789
210
---
1776
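A sketch of shift folding in C++ for a key held as a string of digits (the
piece size of 3 is just the choice used in the example; boundary folding would
reverse every other piece before adding):

    #include <string>
    #include <cstdlib>

    // Shift folding: break the key into 3-digit pieces and add them as
    // integers.  For "123456789012" this returns 123 + 456 + 789 + 12 = 1380.
    int shiftFold(const std::string& key) {
        int sum = 0;
        for (std::string::size_type i = 0; i < key.size(); i += 3)
            sum += std::atoi(key.substr(i, 3).c_str());
        return sum;
    }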
5. Digit analysis
a. The previous methods required no advance knowledge of the actual set
of keys to be used. If the keys (or truly characteristic subset)
are available in advance, though, a hashing function might be
developed based on an analysis of them.
b. One approach is to calculate the frequency distribution for
different values of each digit (letter) of the keys.
Example: frequency analysis on last names of students in class
6. Perfect hash functions
a. If the full set of keys is known in advance, it actually becomes
possible to generate a PERFECT HASH FUNCTION - one which hashes
each key differently.
b. Such functions typically have a range that exceeds the number of
keys, which means there are some values in the range to which no
key hashes. (These then become perennially vacant slots in the
hash table.) A perfect hash function that uses exactly the right
number of slots is called a MINIMAL PERFECT HASH FUNCTION.
c. The GNU software distribution includes a program - gperf - that
generates near-minimal perfect hash functions for fixed sets of
keys. It has been used to build reserved word tables for a number of
compilers, command interpreters etc.
7. One last topic we mention briefly is the notion of ORDER-PRESERVING
hash functions.
a. In general, it is not the case that if key1 < key2 then
hash(key1) < hash(key2). However, there are some hash functions
that have this property. They are known as order-preserving hash
functions.
b. An order-preserving hash function would be used in a case where one
wishes to have the ability to process table entries in ascending
order of key value, starting at some given point. Such processing
is needed when looking for a RANGE of key values - e.g.
JOHNS <= last_name < JOHNT
c. Of course, insisting that the hash function be order-preserving
may mean sacrificing some performance in terms of quality of
dispersion of the keys!
D. Improving Collision handling
1. Though a good hashing scheme that disperses keys uniformly can reduce
the number of synonyms and hence the probability of a collision, we
cannot avoid having to deal somehow with such collisions as do occur.
2. Comments on efficiency of linear open addressing
a. At first glance, it appears that hashing with linear open addressing
could be terribly inefficient: it could degenerate to searching the
entire table.
b. On the other hand, if the record we want is, in fact, in its home
bucket or very near to it, then this method works quite well.
c. The success of this method depends on two things:
i. Allocating enough space in the table so that there are sufficient
vacant slots to break up long searches. (A good rule of thumb is
to never allow more than 80% of the slots to actually be used - e.g.
if we wish to store records on 1200 students, then use a table
with at least 1500 slots, plus an appropriate hash function.)
ii. Choose a hash function that disperses the keys uniformly over the
slots.
d. One remaining problem that is hard to avoid, however, is the
problem of CLUSTERING.
i. Consider the following portion of a hashtable:
|_____________|
| Bucket x |
|_____________|
| Bucket x+1 |
|_____________|
| Bucket x+2 |
|_____________|
| Bucket x+3 |
|_____________|
....
Suppose bucket x overflows. Then a key belonging to bucket x
is inserted into bucket x+1.
ii. Of course, the effect of this overflow is to increase the
probability that bucket x+1 will also overflow, since it is now
receiving keys that map to two different buckets.
iii. When bucket x+1 overflows, it begins adding keys to bucket x+2.
This also becomes the place where further overflows from bucket
x must go, of course. So now bucket x+2 becomes the target for
keys hashing to three different buckets. This further enhances
the chances of bucket x+2 overflowing, which would make bucket
x+3 the target for keys hashing to four different buckets ...
iv. As you can see, linear probing suffers from the problem that
clusters of overflowing buckets can develop such that several
buckets "compete" for the same overflow space. (In the above
case, buckets x, x+1, x+2, x+3 and x+4 would all compete for
space in bucket x+4.) Further, once this clustering starts to
occur, it feeds on and compounds itself.
v. There are several alternatives available to reduce this
clustering problem.
3. Quadratic probing
a. Quadratic probing addresses the clustering problem of linear open
addressing by using a quadratic function to choose overflow buckets.
i. If a key belongs in bucket x, the following series of buckets is
examined until one is found with room to hold it:
x
(x+1) mod b
(x-1) mod b
(x+4) mod b
(x-4) mod b
(x+9) mod b
(x-9) mod b
..
i.e. the buckets probed are of the form (home +/- i^2) mod b
(And the same strategy is followed when looking up a key)
ii. Notice how this breaks up clusters. In the above example,
bucket x would first overflow to bucket x+1, increasing the
probability of overflow there. But now once bucket x+1
overflows, further overflows from bucket x would go into
bucket x-1, while overflows from bucket x+1 would go into
bucket x+2. Thus, buckets x and x+1 would not compete with
each other for overflow space, and the reinforcing effect
of local overflows would not occur.
b. Quadratic probing does, however, impose a restriction on our
choice of table sizes. In general, we want to be sure that
an insertion into a nearly full table will succeed if at all
possible, which means that - if necessary - we will eventually
probe each bucket in the table exactly once. It can be shown
that quadratic probing does this if the table size is a prime of the
form 4j + 3 - i.e. is a prime one less than a multiple of four.
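An illustrative probe loop in C++ (a table of int keys, with EMPTY marking an
unused slot; the loop bound relies on the fact just mentioned - for a prime b
of the form 4j + 3, i = 1 .. (b-1)/2 reaches every bucket):

    #include <vector>

    // Return the index of the slot holding key, or of an empty slot where it
    // can be inserted; -1 if neither is found.
    int probe(const std::vector<int>& table, int b, int key, int EMPTY) {
        int home = key % b;
        if (table[home] == key || table[home] == EMPTY) return home;
        for (int i = 1; i <= (b - 1) / 2; i++) {
            int up   = (home + i * i) % b;
            int down = ((home - i * i) % b + b) % b;       // keep the index >= 0
            if (table[up] == key   || table[up] == EMPTY)   return up;
            if (table[down] == key || table[down] == EMPTY) return down;
        }
        return -1;                                         // table is full
    }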
4. Rehashing
a. Another approach to solving the clustering problem is REHASHING.
Instead of using a single hash function, we use a series of hash
functions f1, f2, f3 ...
b. When an attempt is made to insert a key in the table, it is first
hashed using f1. If the resultant bucket is full, then the key is
hashed again using f2. If that bucket is full, then f3 is used, etc,
until some hash function hashes the key to a bucket that is empty.
The same strategy is followed when looking up a key.
c. One obvious challenge is developing a suitable series of hash
functions. Ideally, the hash functions should have the property
that if f1 hashes two keys to the same bucket, then f2 hashes them
to different buckets, etc. In contrast to other overflow handling
methods, this has the effect of causing two keys that collide
initially to not also collide on overflow. On the other hand, it
is hard to find functions having this property that also guarantee
that every bucket will eventually be tried in the case of insertion
into an almost full table.
E. Growing the table dynamically.
1. One issue with any of the schemes we have considered thus far is
correctly sizing the table.
a. A table where the ratio of available slots to actual keys stored is
large will likely have relatively few overflows, but will waste a
lot of space.
(There are many fewer car accidents in Vermont than in downtown NY!)
b. What choice does one make if it is hard to tell ahead of time how
big the table is going to get? (E.g. a given general purpose
program might need 100 slots on one run and 10,000 on the next.)
2. The use of chaining is one way to address this problem, of course.
a. Using chaining, the table can grow as needed, subject to total
available memory.
b. However, the bigger the table gets, the more the performance
degenerates.
Example: Suppose we use a hash function with range 1 .. 1000.
If there are fewer than 1000 keys in the table, and
the hash function has good dispersion, performance
is likely to be O(1). However, if we grow to 100,000
keys, then the average chain length becomes 100 and
we must do an average of 100 comparisons per probe.
c. An alternative to this is to periodically restructure the table -
e.g. doubling the size of the table and the range of the hash
function. This, of course, also requires that all existing
entries be re-inserted into the restructured table - a high
one-time cost for each restructuring.
F. Extendible Hashing.
1. Several schemes have been proposed to allow the size of a hash table
to grow dynamically in a smooth, efficient way. We consider only
one here. For others, see Smith and Barnes: Files and Databases
pp 124-135.
2. All such schemes use a hash function that generates a large range of
values. For example, on a 32-bit computer, a typical hash function
used with such a scheme would produce a full 32-bit value.
3. Initially, only a limited number of bits from the hash function are
actually used; the rest are ignored. When a bucket overflows,
however, it is split in two and an additional bit of the hash function
is then used to redistribute the keys between the halves.
4. The scheme we consider here makes use of a table called the directory,
whose size is a power of two. Each entry in the table points to a
bucket of keys, but not necessarily a unique bucket. (That is,
several table entries may point to the same bucket.) When we do
lookups or insertions, we use as many bits from the hash function
as are needed to compute an index for this table, and then follow
the pointer to the correct bucket.
Example: the following is a hashtable with bucket size 2, with
keys and hash values as shown. (Hash values are sums of
ASCII values of characters of keys with bits in reverse
order - not great, but OK). At present, three bits of
the hash function are used to distribute the keys.
------
000 ------------------------> HIPPO 0000000110
------ CAT 0001101100
001 ---------------\
------ \-------> AARDVARK 0011001001
010 -------------\
------ ----------> DOG 0101101100
011 -------------/ JACKAL 0110010110
------
100 ------------------------> ELEPHANT 1000101001
------
101 ------------------------> GOPHER 1010001110
------ FOX 1011011100
110 --------------\
------ ---------> BUFFALO 1111111110
111 --------------/
We now consider the following insertions:
MONKEY 1100101110 - would go in bucket with BUFFALO
OSPREY 0100011110 - would cause bucket containing DOG, JACKAL
to split. As a result, 010 entry in the
table would point to a bucket containing
DOG and OSPREY, while 011 entry would now
point to a bucket with JACKAL
IGUANA 1010110110 - would force us to go to using 4 bits to
differentiate keys, since there are already
two entries with 101 as their first 3 bits.
The new table (including MONKEY and OSPREY
from before) would look like this:
0000------------\
------ -----------> HIPPO 0000000110
0001------------/ CAT 0001101100
------
0010------------\
------ ------------> AARDVARK 0011001001
0011------------/
------
0100------------\
------ ------------> DOG 0101101100
0101------------/ OSPREY 0100011110
------
0110------------\
------ ------------> JACKAL 0110010110
0111------------/
------
1000------------\
------ ------------> ELEPHANT 1000101001
1001------------/
------ GOPHER 1010001110
1010-------------------------> IGUANA 1010110110
------
1011-------------------------> FOX 1011011100
------
1100----------\
------ \
1101------------\
------ ------------> MONKEY 1100101110
1110------------/ BUFFALO 1111111110
------ /
1111----------/
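A lookup in this scheme is just an index into the directory by the leading bits
of the hash value. An illustrative C++ sketch (the names, the use of a 32-bit
unsigned hash value, and the requirement 1 <= d <= 31 are our own assumptions;
bucket splitting and directory doubling are not shown):

    #include <vector>
    #include <string>

    struct Bucket {
        std::vector<std::string> keys;      // at most the bucket size of keys
    };

    // d = number of hash bits currently in use; several directory entries may
    // point at the same Bucket.
    Bucket* findBucket(const std::vector<Bucket*>& directory, int d,
                       unsigned hashValue) {
        unsigned index = hashValue >> (32 - d);     // the leading d bits
        return directory[index];
    }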
Copyright ©1999 - Russell C. Bjork