CS321 Lecture: B-Trees                          Last revised 11/23/99

Introduction
------------

   A. We have been devoting considerable attention recently to the subject
      of search structures: Data structures that can be used to store and
      retrieve information associated with a certain key.  The operations
      on such a structure can be pictured as follows:

        Insertion:
                     __________________
        Key, value  | Search           |
        ----------> | structure        |
                    | (key,value pairs)|
                    |__________________|

        Lookup:
                     ___________
        Key         | Search    | Value
        ----------> | structure | -------> 
                    |___________|

        Deletion:    ____________________________
                    | Search                     |
        ----------> | structure                  |
                    | (key and its value removed)|
                    |____________________________|

   B. Let's briefly review the structures we have considered thus far:

Structure       Insert          Lookup          Delete

"Pile"          O(1)            O(n)            O(1) if we know where victim is
Ordered array   O(n)            O(logn)         O(n)
Linked list     O(1) if we know O(n)            O(1) if we know where victim is
                     where it goes
Binary search
tree            O(logn)-O(n)    O(logn)-O(n)    O(logn)-O(n)
AVL, 2-3-4
trees           O(logn)         O(logn)         O(logn)
Hash table      O(1)-O(n)       O(1)-O(n)       O(1) if we know where victim is

   C. All of these structures have one thing in common: they were designed for
      use with search tables stored in main memory, where access to any item
      is equally fast (true random access).  However, it is often the case
      that we must build search structures on disk rather than in primary
      memory, for two reasons:

      1. Size.  Large structures cannot be kept in their entirety in Mp.

      2. Permanency.  Structures in Mp are volatile, and need to be created
         (or read in from disk) each time a program using them is run.

      In particular, when we considered indexed files in CS122, we saw that
      the indexed file organization requires an index structure on the primary 
      key and each of the secondary keys.  These are, in fact, search structures
      in which the values in the key value pairs are records in the file (or 
      pointers to them.)

   D. When we build structures on disk, we must deal with certain realities of
      access and transfer time:

      1. Random access to disk typically requires on the order of 10-20 ms
         access time to position the head and wait for data to come up under
         it.

      2. Once the head position is right, data can be transferred at rates in
         excess of 1 million bytes/sec.

      3. Observe, then, how total transfer times behave for different size
         blocks (assuming a fairly fast 10 ms access time, and 1 megabyte/sec
         transfer rate)

         Size of block  Access time     Transfer time   Total time

         1 byte         10 ms           1 us            10.001 ms
        10 bytes        10 ms           10 us           10.01 ms
       100 bytes        10 ms           100 us          10.1 ms
      1000 bytes        10 ms           1 ms            11 ms
     10000 bytes        10 ms           10 ms           20 ms

         Clearly, then, small transfers have very high overheads for access
         time, and one would prefer to organize a search structure in such a
         way as to allow fairly large transfers to/from disk (1000 bytes +)
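
         A quick way to see this arithmetic is the following small Python
         sketch; the 10 ms access time and 1 megabyte/sec transfer rate are
         the same assumed figures used in the table above:

            ACCESS_MS = 10                          # assumed seek + latency
            BYTES_PER_MS = 1000                     # 1 megabyte/sec

            for size in (1, 10, 100, 1000, 10000):
                total = ACCESS_MS + size / BYTES_PER_MS
                print(size, "bytes:", total, "ms total")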

      4. Most of the structures we have considered do not lend themselves well
         to on-disk implementation because they require accesses to too many
         different parts of the structure.

         a. E.g. binary search tree of 1000 nodes has minimum height 10, so
            would require 10 disk accesses / item.

         b. The exception is hashing, where we can use a hash function to
            select a block potentially containing many items, then search that 
            using some other means to find the item wanted.

   E. Today, we consider a search structure specifically designed for use with
      disk files: the B-Tree.

      1. A B-Tree is a form of search tree in which depth is traded for breadth.
         Each node contains multiple keys (instead of 1 as in a binary search
         tree), and so has multi-way branching rather than 2-way branching.

      2. The result is a very "bushy" sort of tree, with typical heights in the
         range 2-4, thus requiring only 2-4 disk accesses per operation.

I. Definitions
-  -----------

   A. Preliminary: An m-way search tree.

      1. An m-way search tree is a tree of degree m in which:

         a. Each node having s children (2 <= s <= m) contains s-1 keys.
            Let the children be called t  .. t     and the keys k  .. k
                                        1     s                  1     s-1

         b. The keys and children obey the following properties:

            i.  k < k < k  ... < k
                 1   2   3        s-1

           ii.  All the keys in child t are < k
                                       1       1

           iii. All the keys in child t  (1<i<s) lie between k    and k
                                       i                      i-1      i 

            iv. All the keys in child t are > k
                                       s       s-1 

      2. Observe that a binary search tree is simply an m-way search tree
         with m = 2.

      3. Examples of 4-way search trees:

                        C     F    K                    A
                      /    /    \    \                 / \
                    A B   D E    HIJ  L                   B
                   / | \ / | \  /||\  /\                 / \
                                                            C
         (where the empty children are failure nodes)      / \
                                                              D
                                                             / \
                                                                E
                                                               / \
                                                                  F
                                                                 / \
                                                                    etc..

         a. Clearly, the first is much more desirable than the second!  

         b. Note that if the first example were implemented as a binary search
            tree instead, it would have height 4 instead of 2, a sizable cost
            increase at 10ms per disk access.  (The savings become even larger
            as m increases.  For example, the first example could be implemented
            as a 12-way search tree with only 1 level.)

      4. Observe, then, that a 2-3-4 tree is simply a variant of a 4-way search
         tree.  In fact, B-Trees generalize some of the ideas we used with
         2-3-4 trees, though the standard algorithms for maintaining B-Trees
         are slightly different from those used with 2-3-4 trees.

   B. Definition of a B-Tree

      1. As was true with binary search trees, we recognize that m-way
         search trees can be very efficient if well-balanced, but have
         undesirable degenerate cases.  With binary search trees, we defined
         variants such as AVL trees and Red-Black trees that avoid the
         degenerate behavior.  We do the same here.

      2. A B-Tree of order m is an m-way search tree in which:

         a. All the failure nodes are on the same level.

         b. Each internal (non-failure) node, except the root, has at least
            ceil(m/2) children.

         c. The root, unless it is a failure node (which would mean the tree
            is totally empty), has at least 2 children.

         d. Of our two examples, only the first is a B-Tree

      3. Examples: which of the following is/are B-Tree(s) of order 5:
        
         E J O              E J O             E  J  O            E  J  O
       /  / \  \          / /  \  \         /  / \   \          /  / \   \
     ABC  F KLM PQRS   ABF  GH  KL PQRS    AB  FI  KL  PQRS   AB  HI  KL  PQRS
    //\\ /\ //\\//|\\ //\\ /|\ /|\ //|\\  /|\ /|\ /|\ //|\\   /|\ /|\ /|\ //|\\
                                                                 FG
                                                                 /|\
     no: Node "F"    no: the tree is    yes             no: not all the
     has only 2      not a search tree,                 failure nodes are on
     children        since F > E                        the same level


      4. Note: because all the failure nodes of a B-Tree are on the same
         (bottom) level, we normally do not bother to draw them.  Thus,
         we will draw the one good tree in the above example as follows from
         now on:

                                E  J  O
                               /  / \   \
                              AB  FI  KL  PQRS

   C. Some properties of a B-Tree

      1. What is the MAXIMUM number of KEYS in a B-Tree of order m of
         height h?
                
         a. In such a tree, each non-failure node would have the maximal
            number of children (m), and thus the maximal number of keys (m-1).
            Thus, we would have:

            1 node              m-1 keys                 at level 1
            m nodes             m * (m-1) keys           at level 2
            m**2 nodes          m * m * (m-1) keys       at level 3
            ...
            m**(h-1) nodes      m**(h-1) * (m-1) keys    at level h
                                --------------------
                                m**h - 1 keys total

         b. Compare this with our earlier result for complete binary trees of
            height h: 2**h - 1 nodes.

      2. What is the MINIMUM number of KEYS in a B-Tree of order m of
         height h?  

         a. In such a tree, the root would have only 2 children (1 key), since
            this is the minimum allowed for a root.  All other nodes would
            have ceil(m/2) children, and ceil(m/2) - 1 keys.

         b. For convenience, let c = ceil(m/2).

         1 node                 1 key                   at level 1
         2 nodes                2 * (c-1) keys          at level 2
         2*c nodes              2 * c * (c-1) keys      at level 3
         2*c**2 nodes           2 * c**2 * (c-1) keys   at level 4
         ...
         2*c**(h-2) nodes       2 * c**(h-2)*(c-1) keys at level h
                                -----------
                                2 * [c**(h-1)-1] + 1 =

                                2 * c**(h-1) - 1 keys total

      3. To determine the height of a B-Tree of order m containing n keys, we 
         solve each of the above for h, as follows:

         a. From the equation for the maximum number of keys, we know:

                n <= m**h - 1

            or, solving for h:

                n+1 <= m**h

                log (n+1) <= h
                   m

            Now, since h must be an integer, we can take the ceiling of the
            log to obtain:

                ceil( log (n+1) ) <= h
                         m

         b. From the equation for the minimum number of keys, we know:

                n >= 2 * c**(h-1) - 1

            or, solving for h

                (n + 1)
                ------- >= c**(h-1)
                   2

                log  ((n+1)/2) >= h-1
                   c

                h <= 1 + log ((n+1)/2)
                            c

            Now, since h must be an integer, we can use the floor of the log
            to obtain:

                h <= 1 + floor( log ((n+1)/2) )
                                   c

            (Recall that c = ceil(m/2), so this is the floor of the log to
            the base ceil(m/2) of (n+1)/2.)

         c. Combining the above results from minimal and maximal trees, we 
            obtain the following bounds for h:

                ceil(log (n+1)) <= h <= 1 + floor(log          (n+1)/2)
                        m                            ceil(m/2)

      4. Some examples:

         a. 1 million keys - B-Tree of order 200: height is 3 

            - Lower bound is ceil(log    1,000,001) = 3
                                      200

            (Note that a maximal tree of height 2, order 200, contains
             39,999 keys - so tree must have height at least 3)

            - Upper bound is 1+ floor(log    500,001) = 3
                                         100

            (Note that a minimal tree of height 4, order 200, contains 
             1,999,999 keys, so tree must have height no greater than 3)

         b. 2 million keys - B-Tree of order 200: 

            - Lower bound is still 3

            - Upper bound is now 4

            so the height could be 3 or 4.
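
         As a cross-check on these examples, the two bounds are easy to
         compute directly.  A minimal Python sketch (btree_height_bounds is
         our own name, not a standard routine; beware that floating-point
         rounding can matter when the log falls right at an integer):

            import math

            def btree_height_bounds(n, m):
                """Bounds on the height of a B-Tree of order m holding n
                   keys, from the formulas above."""
                c = math.ceil(m / 2)
                lower = math.ceil(math.log(n + 1, m))
                upper = 1 + math.floor(math.log((n + 1) / 2, c))
                return lower, upper

            print(btree_height_bounds(1000000, 200))    # (3, 3)
            print(btree_height_bounds(2000000, 200))    # (3, 4)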

   D. An important note: In our discussion, we have talked only about nodes
      containing KEYS.  In practice, we build search structures to allow us
      to associate keys with VALUES (e.g. name = key; phone number and address
      = value).  In the form of B-Tree we are discussing, then, a node actually
      contains s pointers to children, s-1 keys, and s-1 values.  These can be
      stored in one of two ways:

      1. The actual value can be stored in the node.  This, however, can reduce
         m, and thus the branching factor of the tree, if the size of the value
         is large compared to that of the key (as it often is).

         Example: node size 8000, key length 12, pointer size 4 bytes, allows
                  m = 500 if we don't store any value with the key

                  if we also have to store a value of size 36, however, we
                  would reduce m to 154
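
          The arithmetic here generalizes easily.  A small sketch (max_order
          is our own helper name, using the same bookkeeping as the example
          above: m pointers plus m-1 keys, and optionally m-1 values, per
          node):

             def max_order(node_size, key_size, value_size=0, ptr_size=4):
                 """Largest m with m*ptr + (m-1)*(key+value) <= node_size."""
                 entry = key_size + value_size
                 return (node_size + entry) // (ptr_size + entry)

             print(max_order(8000, 12))                  # 500
             print(max_order(8000, 12, value_size=36))   # 154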

      2. The node can contain a pointer to another disk block that stores the
         actual value.  (This additional pointer adds minimally to the size of
         the node.)  However, this means that successful searches require an
         additional access to get the data.

      3. Indexed files often use B-Trees, or one of the variants we will discuss
         later.  The index may be a B-Tree containing pointers to another area
         where the records proper are found; or the whole file may be a B-Tree.

II. Operations on B-Trees:
--  ---------- -- -------

    A. For our examples, we will use the following B-Tree of order 3 (sometimes
       called a 2-3 tree, since each node has 2 or 3 children and 1 or 2 keys). 
       We use order 3 to keep the size of the examples down:


                 J   T

         C F      M P         Y

      AB DE GH  KL NO R    UW   Z


    B. Locate a given key k in a tree rooted at t:

        Locate(k: key; t: BTree);

           if t = nil then 
                search fails
           else
             begin
               i := 1;
               while (i < Number of children of t) and (key[i] < k) do
                   i := i + 1;
               if (i < Number of children of t) and (key[i] = k) then
                   return associated information
               else
                   Locate(k, child[i])
             end

        Example: Locate J succeeds immediately at the root

                 Locate L: at the root, we end the while loop with i = 2,
                           since key[2] = 'T', so we go to the second child of 
                           the root

                           in that child, the while loop exits with i = 1,
                           (since key[1] = 'M' >'L'), so we go to the first 
                           child.

                           In that node, we find what we are looking for.

                 Locate Z: at the root, we end the while loop with i = 3,
                           since i = Number of children, so we go to the third
                           child of the root

                           in that child, the while loop exits with i = 2, for
                           the same reason, so we go to the second child.

                           In that node, we find Z

                 Locate X: at the root, we end the while loop with i = 3

                           In the third child, the while loop exits with i = 1,
                           since key[1] = 'Y' > 'X', so we go to the first 
                           child.

                           In that node, we exit the while loop with i = 3.
                           Since the third child (and all children, in fact) of
                           this node is a failure node, the search fails.
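
        The same procedure can be written as a short, runnable Python sketch.
        The node layout (parallel lists of keys and children, with an empty
        child list standing for a leaf whose children are all failure nodes)
        is our own assumption for illustration, not something dictated by
        B-Trees themselves:

          class Node:
              def __init__(self, keys=None, children=None):
                  self.keys = keys or []            # k1 .. k(s-1), ascending
                  self.children = children or []    # t1 .. ts; [] for a leaf

          def locate(k, t):
              i = 0                                 # 0-based, unlike the notes
              while i < len(t.keys) and t.keys[i] < k:
                  i += 1
              if i < len(t.keys) and t.keys[i] == k:
                  return (t, i)                     # found: node and position
              if not t.children:                    # child i is a failure node
                  return None                       # search fails
              return locate(k, t.children[i])       # descend into child i

          # e.g. the subtree rooted at M P in the example tree:
          mp = Node(["M", "P"],
                    [Node(["K", "L"]), Node(["N", "O"]), Node(["R"])])
          print(locate("L", mp) is not None)        # True
          print(locate("Q", mp) is not None)        # False - search fails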

   C. Inserting a new key.  (Assume that we disallow duplicate keys.)

      1. We first proceed as in locate, until we get to a leaf - that is,
         a node whose children are empty.  (Of course, if we find the key we
         are looking for along the way, we declare an error and quit.)

      2. If the leaf node we have arrived at contains less than the maximum
         number of keys, then we simply insert the new key at the appropriate
         point, and add an extra "nil" child pointer.

        Example:        Insert S: We work our way down to the node containing
                                  R.  Since it contains only one key, and can
                                  hold two, we add S, and our tree becomes:

                 J   T

         C F      M P         Y

      AB DE GH  KL NO RS    UW   Z


        Note that inserting a new key in a leaf may require moving other
        keys over.

        Example:        Insert Q in the original tree.  Result:

                 J   T

         C F      M P         Y

      AB DE GH  KL NO QR    UW   Z
                       ^
                       |__ R has been moved over one place


      3. Life becomes more interesting if the leaf we reach is already
         full.  (E.g. consider trying to insert "X" in the above.)  In this
         case, we cannot add a new node on a lower level, since this would
         violate one of the B-Tree constraints.  Instead, we proceed as
         follows:

         a. Allocate a new node from a free list, or extend the file by one
            node.

         b. Redistribute the keys in the original node, plus the new key,
            so that:

            - The first half remain in the original node
            - The middle key (in key order) is held aside for a purpose to be
              explained shortly.
            - The second half go into the new node.

          Note:

            The key we were inserting can go into either of the nodes, or
            it might be the middle key.  (e.g. if we are inserting X, it
            will go into the new node; but if we are inserting V into
            the same node, it would be the middle key.)

         c. Insert the middle key we saved out, plus a pointer to the newly
            created node, into the parent at an appropriate point, just after
            the pointer that we followed to go down to the node we split.  Of
            course, this means we move keys and pointers into the parent to
            make room.
 
        Example: insert X into the original tree

                               __________ this key was promoted into parent
                 J   T        |
                              v
         C F      M P         WY

      AB DE GH  KL NO R    U  X   Z
                           ^  ^
                           |  |___ new node now contains X
                           |______ original node now contains only U

      4. Observe that this strategy guarantees that the resulting tree will
         still meet all the tests for a B-Tree.

         a. Clearly, all the leaves are still on the same level.

         b. What about the number of children of the new nodes?

          - If we are forced to split a node, it is because it contained
            the maximum number of keys before insertion - m-1.  With the
            new key, this gives m keys, to be divided two ways plus the
            promoted key.  This leaves m-1 keys to be divided.

            - If m is odd, then each node gets (m-1)/2 keys, and has
              (m+1)/2 children, which is exactly ceil(m/2), as required.

            - If m is even, then one node gets (m-2)/2 keys, and the other
              gets m/2 keys.  The smaller node then has (m-2)/2 + 1 children =
              m/2 children, which is exactly ceil(m/2), as required.  (The 
              larger node has more than the minimum, which is fine.)

      5. Now what if there is no room for the promoted key in the parent?
         (Example: insert I into the original tree.  Node GH splits, with
         H to be promoted to node CF.  But this node has no room for another
         key and child.)

         a. Solution: split the parent as before, creating a new parent to
            hold half of the keys and pointers to half of the children.
            Again, promote one key and a pointer to the new node up one
            level.

         b. Note that, if carried to its ultimate, this can result in splitting
            the root, which is how the tree gains height.  At this point, the
            single middle key resulting from the splitting of the root becomes
            the one key in the new root.  (This is why we allow the root of a
            B-Tree to have as few as 2 children.)

        Example: insert I:

                    J
                /         \
              F             T
           /   \          /    \
         C     H        M P      Y
        / \   / \     / | \     /  \
      AB DE   G I   KL NO  R    UW   Z

      6. You will note that the approach we have taken to splitting nodes in
         a B-Tree is somewhat different from the one we used with 2-3-4 trees.
         Here, we have postponed splitting a node until absolutely necessary.
         The price is more complex code; but given the time required for disk
         accesses this policy makes some sense, since it postpones height
         increases until the absolute last moment possible.  (One could use the
         2-3-4 tree approach of anticipating the need for splits on the way down
         the tree if one wanted to, however.)
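
       The whole insertion algorithm is compact enough to sketch in Python.
       This is a sketch of the idea only (keys without values, no deletion),
       using the same kind of list-based node as in the Locate sketch above;
       the class and method names are our own:

         class Node:
             def __init__(self, keys=None, children=None):
                 self.keys = keys or []            # sorted keys
                 self.children = children or []    # child nodes; [] for a leaf

         class BTree:
             def __init__(self, m):
                 self.m = m                        # order: max number of children
                 self.root = Node()

             def insert(self, key):
                 split = self._insert(self.root, key)
                 if split:                         # the root itself split, so
                     mid, right = split            # the tree gains a level
                     self.root = Node([mid], [self.root, right])

             def _insert(self, node, key):
                 i = 0                             # find the position, as in Locate
                 while i < len(node.keys) and node.keys[i] < key:
                     i += 1
                 if i < len(node.keys) and node.keys[i] == key:
                     raise ValueError("duplicate key")
                 if node.children:                 # interior node: go down first
                     split = self._insert(node.children[i], key)
                     if not split:
                         return None
                     mid, new_child = split        # a child below us split:
                     node.keys.insert(i, mid)      # absorb its promoted key
                     node.children.insert(i + 1, new_child)
                 else:                             # leaf: the key goes here
                     node.keys.insert(i, key)
                 if len(node.keys) <= self.m - 1:
                     return None                   # still legal: no split needed
                 # over-full: keep the first half, promote the middle key, and
                 # move the second half into a newly allocated node
                 middle = len(node.keys) // 2
                 promoted = node.keys[middle]
                 right = Node(node.keys[middle + 1:], node.children[middle + 1:])
                 node.keys = node.keys[:middle]
                 node.children = node.children[:middle + 1]
                 return promoted, right

       With m = 3 this builds 2-3 trees like the ones in the running example
       (the exact shape depends on the order in which the keys arrive).  As in
       the discussion above, a node is split only when it actually overflows,
       rather than pre-emptively on the way down as in our 2-3-4 tree
       algorithms.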

   D. Deletion from a B-Tree

      1. As we have seen in other kinds of trees, deleting a key from a
         leaf will be much simpler than deleting a key from an interior node.
         As before, then, we use the trick of converting a deletion from an
         interior node into a deletion from a leaf by promoting a key from
         a leaf - typically the smallest key in the subtree rooted at the
         child just after the key (i.e. its in-order successor).

         Example: to delete J from the root of our original tree, we would
                  promote K to take its place, and delete K from the leaf.

      2. Deleting a key from a leaf is basically trivial - we simply slide
         other keys over as necessary to fill in the gap.  

        Example: Delete N from our original tree:

                 J   T

         C F      M P         Y

      AB DE GH  KL  O R    UW   Z


      3. However, we can run into a problem if the leaf we are deleting from
         already contains the minimal number of keys.

        Example: Delete R from our original tree.

      4. In this case, we essentially reverse the process we used to deal with
         an over-full node on insert.

         a. We find one of the brothers of the node from which we are deleting
            (we can use either side.)

         b. We rearrange keys between the node we are working on and the brother
            so as to give each the minimal number, if possible.  This will mean
            changing the divider key between them in the parent.

        Example: When deleting R from the original tree, we can combine R's
                 node with NO and rearrange as follows:

                 J   T

         C F      M O         Y

      AB DE GH  KL N  P    UW   Z


         c. If, as a result, we do not have enough keys to make two legal
            nodes (i.e. if the brother we are using also contains the minimal
            number of keys), then we combine the two nodes into one, also
            removing a key and child pointer from the parent.

        Example: working with the above (not the original tree), we can now
                 try to delete P.  Since the only node we can combine with
                 is N, and it has the minimal number of keys already, we must
                 pull O down from the parent and combine everything into one
                 node, recycling the other:

                 J   T

         C F       M          Y

      AB DE GH   KL NO     UW   Z

         d. Of course, removing a key from the parent may get us into trouble
            as well.  (e.g. suppose that, in succession, we removed L, N, and
            then O from the above).  In this case, the parent may have to
            "borrow" keys and children from a brother.  In an extreme case,
            we may even have to merge the parent with its brother, and could
            ultimately even reduce the height of the tree if we had to merge
            two children of the root.

        Example: Recall the tree we got by splitting the root:

                    J
                /         \
              F             T
           /   \          /    \
         C     H        M P      Y
        / \   / \     / | \     /  \
      AB DE   G I   KL NO  R    UW   Z

        Suppose we now try to delete Z:

        - We rearrange keys with the brother UW: W moves up to replace Y in
          the parent, and Y moves down into the now vacated node.

                    J
                /         \
              F             T
           /   \          /    \
         C     H        M P      W
        / \   / \     / | \     /  \
      AB DE   G I   KL NO  R    U   Y

        Suppose we now delete Y: We must merge with U, but this pulls W
        out of the parent, leaving it with too few keys.  (Zero, in this
        case; but in a higher degree tree we're in trouble when the
        number of keys drops below ceil(m/2) - 1.)

        - We therefore rearrange keys and children with MP:


                    J
                /         \
              F             P
           /   \         /    \
         C     H       M       T
        / \   / \     / \     /  \
      AB DE   G I   KL  NO   R    UW
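
       The borrow-or-merge step just described can also be sketched in a few
       lines of Python (fix_underflow is our own helper name, assuming the
       same list-based nodes as in the insertion sketch above; a real
       implementation would also try the left brother, and would recurse when
       the parent itself ends up under-full):

          import math

          def fix_underflow(parent, i, m):
              """Repair parent.children[i], which has too few keys, using
                 its right-hand brother parent.children[i + 1]."""
              node, brother = parent.children[i], parent.children[i + 1]
              min_keys = math.ceil(m / 2) - 1
              if len(brother.keys) > min_keys:
                  # brother can spare a key: rotate it through the divider
                  node.keys.append(parent.keys[i])        # divider comes down
                  parent.keys[i] = brother.keys.pop(0)    # first key goes up
                  if brother.children:
                      node.children.append(brother.children.pop(0))
              else:
                  # brother is minimal too: merge both nodes and the divider
                  node.keys += [parent.keys.pop(i)] + brother.keys
                  node.children += brother.children
                  parent.children.pop(i + 1)              # recycle the brother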
       
III. Variants of the B-Tree
---  -------- -- --- ------

   A. We have seen that a B-Tree of order m containing n keys has height

      ceil(log  (n+1)) <= h <= 1 + floor(log         ((n+1)/2) )
              m                             ceil(m/2)

      1. We have used this formula to calculate h given n and m.  However, it
         can also be used during design of the tree to calculate the m value
         needed to achieve a given target value of h for some n.  Let m    be
         the smallest m that does the job.  Then we have               min

         h          <= 1 + floor(log              ((n+1)/2))
          desired                   ceil(m  / 2)
                                          min

         Since the log shrinks as m grows, at m    the log will be just barely
                                               min
         less than the next largest integer, so we can approximate this by:

         h         =  log             ((n+1)/2)
          desired          ceil(m  / 2)
                                 min

         ceil(m   / 2) ~ ((n+1)/2)**(1/h        )
               min                      desired

         m    = 2 * ((n+1)/2) ** (1/h) - 1
          min

      2. In practice, one can rarely achieve a height of one (since that would
         mean the entire "tree" is a single node), and even a height of 2 is
         seldom obtainable for applications of any size.  However, keeping the
         height of the tree to 3 is a reasonable goal for most applications.

      3. Example - suppose we desire to limit h to 3.  We require

         m    = ceil (2 * cuberoot((n+1)/2)) - 1
          min

         Then we obtain the following lower bounds for m for different values 
         of n:

         n              Minimum m

        1000             15
        10,000           34
        100,000          73
        1,000,000       158
        10,000,000      341
        100,000,000     736

         Clearly, m grows rather slowly with n!
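
          The table is easy to reproduce from the formula; a tiny Python
          sketch (min_order is our own name for this helper):

             import math

             def min_order(n, h):
                 """Approximate smallest order m that keeps n keys within
                    height h, from the formula for m    above."""
                 return math.ceil(2 * ((n + 1) / 2) ** (1 / h)) - 1

             for n in (10**3, 10**4, 10**5, 10**6, 10**7, 10**8):
                 print(n, min_order(n, 3))    # 15, 34, 73, 158, 341, 736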

   B. Nonetheless, considerations of node size may make it difficult to achieve
      the necessary branching factor.  Recall that, in a B-Tree, each node
      contains up to m pointers, m-1 keys, AND M-1 VALUES ASSOCIATED WITH THE
      M-1 KEYS.

      1. Suppose we need an m-value of 20 for some application in which the
         keys are 10 bytes long and the associated values are 100 bytes long.
         If a pointer is 4 bytes, then the minimum node size is

         4*20 + 19*(10+100) = 2170 bytes (over 4 disk blocks if blocks are 512 
                                          bytes each)

      2. For efficient performance, we must guarantee that each node can be
         read or written with a single disk access - which requires that the
         entire node reside in contiguous blocks on disk.  Further, we must
         have an internal buffer big enough to store the node when it is
         read or written.  Even with a node size of 4+ blocks, the former
         condition may be difficult to achieve on a disk with a smaller
         cluster size.

   C. Two techniques can be used to hold the node size down while still
      achieving a desired branching factor.  These techniques lead to two
      variants of the B-Tree.

   D. A B* Tree of order m is an m-way search tree in which each node (save
      the root) has a minimum of ceil((2m-1)/3) children and a maximum of m
      children.

      1. Nodes in a B* Tree of order m have the same size as those in a
         B Tree of order m, but their minimum branching factor is greater.

      2. This is achieved by using the following strategy on insertion of
         a new key:

         a. If the leaf in which the key belongs has room for the new key,
            then it is put there (as with the basic B-Tree.)

         b. However, if the leaf is full, then instead of splitting the
            leaf we choose one of its brothers and attempt to redistribute
            keys between the two leaves.  (This is sort of the reverse of what
            we did when deleting a key from a B-Tree.)

            Example: In a B* Tree of order 5, each node other than the root
                     has 3-5 children (2-4 keys).  Consider the following
                     example of such a tree:

                        E   K   P   V
                      /   |   |   |   \
                   ABCD FGHI LMN  RSTU XYZ

            If we go to insert Q, we find that leaf RSTU is full.  Instead
            of splitting it (which would force a split of the root and a new
            level in the tree), we combine RSTU with one of its brothers - say
            LMN - and rearrange keys between them and the divider in the
            parent to get

                        E   K   Q   V
                      /   |   |   |   \
                   ABCD FGHI LMNP RSTU XYZ

         c. If the chosen brother is also full, then we combine the keys from
            the two nodes and split the result to give three nodes.  This
            preserves the ceil((2m-1)/3) branching factor, as follows:

            - The two full nodes each had m-1 keys.  Together with the
              divider key between them and the new key, we have 2m keys
              to work with.

            - We need two keys to separate 3 nodes, so we are left with
              2m-2 keys to divide between the 3 new nodes.  The following
              shows how these might be distributed, depending of m mod 3:

                m mod 3         Distribution of keys

                0               (2m-3)/3, (2m-3)/3, 2m/3
                1               (2m-2)/3, (2m-2)/3, (2m-2)/3
                2               (2m-4)/3, (2m-1)/3, (2m-1)/3

              Since each node has one more child than keys, this leads to
              the following distribution of children

                m mod 3         Distribution of children

                0               (2m)/3, (2m)/3, (2m+3)/3
                1               (2m+1)/3, (2m+1)/3, (2m+1)/3
                2               (2m-1)/3, (2m+2)/3, (2m+2)/3

              Since all these values are integers, it can be seen that, in
              each case, the node with the smallest number of children gets at
              least ceil((2m-1)/3) children, as required.

      3. To see the advantage of the B* Tree, consider the following table
         showing the minimum number of keys in a B Tree and a B* Tree of
         height 3 for different values of m

         m      minimal B Tree height 3         minimal B* Tree height 3

         5      17                              17
         10     49                              97
         20     199                             337
         50     1249                            2177
         100    4999                            8977
         200    19,999                          35,377

         (The advantage is even greater for higher trees)
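
         The minimum-key counts in this table follow directly from the two
         minimum branching factors; a small sketch of the arithmetic (our own
         helper, for a tree of height 3 whose root has only 2 children):

            import math

            def min_keys_h3(c):
                """Minimum keys in a height-3 tree whose non-root nodes have
                   at least c children: 1 + 2(c-1) + 2c(c-1) = 2c**2 - 1."""
                return 2 * c * c - 1

            for m in (5, 10, 20, 50, 100, 200):
                print(m,
                      min_keys_h3(math.ceil(m / 2)),              # B Tree
                      min_keys_h3(math.ceil((2 * m - 1) / 3)))    # B* Tree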

E. Another way to get a high branching factor while keeping node size down
   is the use of B+ Trees.

   1. We saw that the total size of a node is generally the limiting factor
      in terms of the value of "m" that can be used for a B-Tree.  Here, of
      course, the main villain is usually the value that is stored in the
      node along with the key.  If the value is - say - 10 times bigger than
      the key, then its presence in the node reduces the potential branching
      factor by a ratio of almost 10:1!

   2. One way to address this would be to not store the values in the tree
      at all.

      a. Rather, each node would contain up to m child pointers, m-1 keys
         and m-1 POINTERS TO VALUES STORED ELSEWHERE.  

      b. The difficulty with this scheme, though, is that once the tree has
         been searched to find the desired key, an additional disk access is
         needed to find the data.  The effect on performance is the same as if
         the height of the tree were increased by one, so this may undo the
         gain obtained by using the higher branching factor.

   3. A B+ tree addresses this problem as follows:

      a. Values are only stored in the lowest level of the tree.  Nodes at
         higher levels contain keys, but not values.

      b. This means that the branching factor in the upper levels is much
         greater than the branching factor at the lowest level (where the
         children are failure nodes.)

         Example: assume nodes are 512 bytes, keys are 10 bytes, values are 90
                  bytes, and pointers are 4 bytes.

                  Each node in the lowest level of a B+ tree could store up to 5
                  key-value pairs, with 12 bytes to spare.  (No pointers need
                  be stored, because the 6 children are all failure nodes.)

                  Each node at upper levels would have branching factor 37.
                  It would store up to 36 key-pointer pairs, plus one extra
                  pointer, with 4 bytes to spare.

                  We assume that we can distinguish between a leaf and a
                  non-leaf node in some way during our processing - perhaps by
                  keeping track of the height of the tree or by tagging the
                  node itself or the pointer to it in some special way.
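
          These two branching factors can be computed directly; a small
          sketch (bplus_fanouts is our own helper, using the node, key,
          value, and pointer sizes assumed above):

             def bplus_fanouts(node, key, value, ptr):
                 leaf_pairs = node // (key + value)       # key-value pairs
                 fanout = (node + key) // (key + ptr)     # m ptrs + (m-1) keys
                 return leaf_pairs, fanout

             print(bplus_fanouts(512, 10, 90, 4))         # (5, 37)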

      c. Of course, this means that all keys must occur at the lowest level
         of the tree, so that a value can be stored with them.  The keys in
         the upper levels, then, are copies of keys stored lower down; some
         keys are stored twice.  In particular, each upper level key is a copy
         of the greatest key in the subtree to its left.

          Example: given the above scenario, assume that we have a B+ tree
                  that holds the 26 letters of the alphabet as keys.

                  Since the maximum branching factor of a leaf is 6, the
                  minimum branching factor would be 3, and each leaf would
                  hold 2-5 keys.  Thus, we would have 6-13 leaves, which could
                   easily be accommodated as children of a single root node.
                  Thus, our tree might look like this:

                  ------------------------------------
                   B   E     I    K     N   Q   S    V
                  -------------------------------------
                 /   |     |    |    |    |   |    |    \
                AB  CDE  FGHI  JK   LMN  OPQ  RS  TUV  WXYZ

                  Note that the separator keys in the root are copies of the
                  last key in each leaf to the left of the separator.  We
                  could also have chosen to store the first key in the leaf
                  to the right.

      d. For this particular set of characteristics, we might contrast:

         i. This B+ tree of height 2, with plenty of room to grow without
            gaining height.

        ii. An ordinary B-Tree of order 6 (where all levels hold values) -
            which would be of height 3 for this configuration.

       iii. An ordinary B-Tree where all levels hold keys and POINTERS to
            values.  Here, since a key plus two pointers requires 18 bytes,
            we would have a maximum degree of 29 (28 key-value-pointer + 1
            extra pointer).  This could actually be done with height 1 for
            this data, but would grow to height 2 after just a few more
            insertions.  Note, though, that since getting to the data requires
            one more disk access it would behave like a tree of height 2 or
            3 as the case may be.

      e. Just to illustrate the concept of a B+ tree further, we consider
         what might happen with different assumptions about how many keys
         would fit in a non-leaf node, so that we end up with a three-level
         tree, like the following:
                                -----
                                | K |
                                -----
                                /   \
                  ---------------    ------------------
                  | B   E     I |    | N   Q   S    V |
                  ---------------    ------------------
                 /   |     |    |    |    |   |    |    \
                AB  CDE  FGHI  JK   LMN  OPQ  RS  TUV  WXYZ

IV. Implementing Various File Organizations with B-Trees
--  ------------ ------- ---- ------------- ---- -------

   A. Variable-length records pose a special problem when storing files that
      must support direct access to individual records.  However, B-Trees -
      especially the B+ variant - can easily accommodate variable-length
      records.  Instead of containing a fixed number of key-value pairs, each
      leaf would hold a number of such pairs dependent on the size of each.

   B. B-Trees are the most widely used structure for implementing indexed
      files.  Most commonly, the file is structured as a B+ tree on the basis
      of the primary key.

   C. Secondary indexes can be implemented by using additional B-Trees - one
      per additional index.  

      1. These can share the same file as the primary tree; the file header 
         block simply will contain a pointer to the root of the primary tree 
         and each secondary tree.

      2. One small question is how the secondary trees refer to the actual
         data records in the primary tree, since they should contain only keys
         and pointers to values to avoid duplicating whole records.

         a. They might contain pointers to the block and offset in block of
            the data.  However, this would mean that splittings and
            recombinations in the primary tree during insert/delete would
            have to trigger updates in the secondary tree - not only for the
            entries pertaining to the item inserted/deleted, but also for
            others moved by a block split or recombination.

         b. Or, the "pointer" might be a copy of the primary key of the
            record - so a secondary key lookup would be followed by a lookup
            of the primary key in the primary tree.  This, however, slows
            lookups.

         c. Or, it might be expedient to use an ordinary B-Tree for the primary
            tree, with pointers to values kept in an altogether separate area
            NOT affected by splits and recombinations.  (Note, then, that the
            presence of secondary indexes might alter the decision as to the
            optimal structure for the primary index.) 

      3. Another issue is the handling of duplicate keys.  Recall that, in
         most cases, we require the primary key for an indexed file to be
         unique; but the secondary keys may allow several records to have the
         same value.  (Example: student file - primary key = ID, secondary
         key = last name.)  How shall we handle the duplicate keys in the
         secondary index?

         a. Storing multiple entries in the index for the same key is one
            possibility, but this uses extra space to store the key more
            than once.

         b. We can let the index point to the FIRST record in the file with
            the specified key value, and let each record in the file include
            a link pointing to the next record with the same secondary key 
            value.  This provides fast retrieval of subsequent records with a
            given key after finding the first, but means that each record must
            contain a link field for each secondary key.  (This link is stored
            with the record, but not returned as part of it when the record is
            accessed.)

         c. We can make use of variable-length records in the index - each
            secondary key index entry consists of a key value and 1 or more
            pointers to records having that key value in them.

   D. Finally, B-Trees can be used to implement relative files!  Here, the
      relative record number serves as the key on which the tree is
      organized.  This has two advantages.

      1. No space need actually be allocated for nonexistent records.  If
         the search ends in a leaf that does not contain an entry for a
         desired record number, then we conclude the record does not exist.

      2. We can accommodate variable length records.

      However, finding a record by record number now requires multiple
      disk accesses, whereas before one access would do, so this may not
      be a good idea unless we need the space flexibility.

Copyright ©1999 - Russell C. Bjork