CS321 Lecture: Advanced Binary Search Trees Last revised 10/19/99
Materials: Transparencies from Pascal Edition of Horowitz pp. 428, 431, 433
I. Introduction
- ------------
A. In CS122, we spent some time on the subject of search structures: Data
structures that can be used to store and retrieve information associated
with a certain key. The operations on such a structure can be pictured
as follows:
Insertion:
__________________
Key, value | Search |
----------> | structure |
| (key,value pairs)|
|__________________|
Lookup:
___________
Key | Search | Value
----------> | structure | ------->
|___________|
Deletion ____________________________
| Search |
----------> | structure |
| (key and its value removed)|
|____________________________|
B. Let's briefly review the structures we considered in CS122:
Structure Insert Lookup Delete
"Pile" O(1) O(n) O(1) if we know where victim is
Ordered array O(n) O(logn) O(n)
Linked list O(1) if we know O(n) O(1) if we know where victim is
where it goes
Binary search
tree O(logn) - O(n) O(logn) - O(n) O(logn) - O(n)
Hash table O(1) - O(n) O(1) - O(n) O(1) - O(n)
C. Clearly, for tables of any significant size the best candidates are either
a binary search tree or a hash table.
1. Unfortunately, while both of these offer the potential for excellent
performance, both have the potential to degenerate to very poor
performance, depending on the actual keys that occur. In principle,
we could apply probabilistic techniques to assess the probability of
good performance, but we cannot know for certain that performance will
not turn out to be bad.
2. In this course, we want to look at some more advanced versions of the
binary search tree and the hash table. We will see that
a. In the case of the binary search tree, there are methods for
guaranteeing O(log n) performance
b. For hash tables where the set of keys is static and known in
advance (e.g. the reserved words of a programming language), we can
guarantee O(1) performance.
c. For dynamic hash tables we can increase the likelihood of O(1)
performance, but cannot guarantee it.
3. Thus, for dynamic search tables of large size, we will have a choice of
an O(log n) guaranteed performance, or a likely O(1) performance with
some small risk of degeneration to O(n). Which to use for a given
problem depends on the consequences of such degeneration.
a. If a system has tight performance requirements with severe
consequences for not meeting them (e.g. life-critical systems),
then the guaranteed O(log n) performance is clearly to be
preferred.
b. In many other cases, the possibility of O(1) performance may turn
out to be worth the risk.
II. Introduction to Balanced Binary Search Trees
-- ------------ -- -------- ------ ------ -----
A. We have seen that all operations on a binary search tree (locate,
insert, and delete) can be done in O(h) time, where h is the height of
the tree. However, depending on the order in which insertions are done,
h can vary from a minimum of log n to a maximum of n, where n is the
number of nodes. Optimum performance requires that we somehow ensure
that the tree is balanced, or nearly so.
B. Actually, there are two different ways of approaching balancing a
tree:
1. Height balancing attempts to make both subtrees of each node have
equal - or nearly equal - height. It results, then, in a tree whose
height differs only slightly from the optimal value of log n.
2. Weight balancing takes into account the PROBABILITIES of accessing
the different nodes - assuming such information is known to us.
a. For example, suppose we had to build a binary search tree consisting
of the following Pascal reserved words. Suppose further that we had
data available to us as to the relative frequency of usage of each
(expressed as a percentage of all uses of words in the group), as
shown:
begin 55% Note: No claim is made that these
case 25% represent actual frequencies for typical
for 11% Pascal code. In fact, the numbers are
forward 5% contrived to illustrate a point.
otherwise 2%
packed 1%
varying 1%
b. Suppose we constructed a height-balanced tree, as shown:
forward
/ \
case packed
/ \ / \
begin for otherwise varying
- 5% of the lookups would access just 1 node (forward)
- 25% + 1% = 26% would access 2 nodes (case, packed)
- 55% + 11% + 2% + 1% = 69% would access 3 nodes (the rest)
Therefore, the average number of accesses would be
(.05 * 1) + (.26 * 2) + (.69 * 3) = 2.64 nodes accessed per lookup
c. Now suppose, instead, we constructed the following search tree
begin
\
case
\
for
\
forward
\
otherwise
\
packed
\
varying
The average number of nodes visited by lookup is now
- 55% access 1 node (begin)
- 25% access 2 nodes (case)
- 11% access 3 nodes (for)
- 5% access 4 nodes (forward)
- 2% access 5 nodes (otherwise)
- 1% access 6 nodes (packed)
- 1% access 7 nodes (varying)
(.55 * 1) + (.25 * 2) + (.11 * 3) + (.05 * 4) + (.02 * 5) +
(.01 * 6) + (.01 * 7) = average 1.81 nodes accessed
This represents over a 30% savings in average lookup time
d. Interestingly, for the particular distribution of probability
values we have used, this tree is actually optimal. To see
that, consider what would happen if we rotated the tree about
one of the nodes - e.g. around the root:
case
/ \
begin for
\
forward
\
otherwise
\
packed
\
varying
We have now reduced the number of nodes accessed for lookups in
every case, save 1. But since begin is accessed 55% of the
time, the net change in average number of accesses is
(.55 * +1) + ((1 - .55) * - 1) = .55 - .45 = +.10. Thus, this
change makes the performance worse. The same phenomenon would
arise with other potential improvements.
3. In general, weight balancing is an appropriate optimization only
for static trees - i.e. trees in which the only operations performed
after initial construction are lookups (no inserts, deletes.) Such
search trees are common, though, since programming languages, command
interpreters and the like have lists of reserved or predefined words
that need to be searched regularly. Of course, weight balancing
also requires advance knowledge of probability distributions for
the accesses.
C. We will first consider various structures and algorithms for
height-balanced trees. Then we will return to look at weight-balanced
trees.
III. Height-Balanced Binary Search Trees: AVL Trees
--- --------------- ------ ------ ----- --- -----
A. We have seen that operations on a binary search tree take time
proportional to the height of the tree, which is, in the best case,
O(log n), but in the worst case O(n). We now consider a method for
maintaining a binary search tree with height O(log n), regardless of
the order in which the keys are inserted. The type of tree we will
consider is called an AVL tree, after its two inventors: Adelson-Velski
and Landis.
B. First, we begin with a preliminary definition. We say that a binary
tree is HB(k) for some integer k >= 1 iff it is empty, or
- its two subtrees differ in height by no more than k and
- its two subtrees are HB(k)
Note: every HB(k) tree is also HB(k+1), HB(k+2) ... - but to describe a
given tree we choose the smallest k for which the definition is
satisfied.
Example: an HB(1) tree:
()
/ \
() ()
/
()
an HB(2) tree:
()
/ \
() ()
/ \ \
() () ()
/ /
() ()
(though subtree heights are equal, right subtree is HB(2).)
C. We now define an AVL tree: an AVL tree is an HB(1) binary search tree.
D. Before we talk about how to maintain AVL trees, we need to consider
why they are of interest. We will therefore prove the following
theorem: The maximum height of an HB(1) tree containing n nodes is less
than 1.44 log n. (Thus, an AVL tree has height that is O(log n), though
up to 44% higher than the best-case complete tree.)
1. We approach our proof by asking a related question - what is the
MINIMUM number of nodes in an HB(1) tree of height h. (Clearly,
for a given number of nodes, such a tree will have maximal height).
2. We define the function minnodes(h) = minimum number of nodes in an
HB(1) tree of height h.
a. There are two trivial cases
minnodes(0) = 0
minnodes(1) = 1
b. For h > 1, we have a tree with a root and two subtrees.
i. One must have height h-1 for the overall tree to have height h.
ii. The other may have height either h-1 or h-2 (since the tree
is HB(1)). However, since minnodes(h) increases with h we will
get a smaller value if it has height h-2.
iii. Further, to get the minimum number of nodes overall, we want the
subtrees to be minimal.
iv. Thus, we get the recurrence
for h > 1, minnodes(h) = 1 + minnodes(h-1) + minnodes(h-2)
c. The solution to this recurrence is
minnodes(h) = Fib(h+2) - 1
To see this, observe that
minnodes(0) = Fib(2) - 1 = 1 - 1 = 0 (as required)
minnodes(1) = Fib(3) - 1 = 2 - 1 = 1 (as required)
For h > 1
minnodes(h) = Fib(h+2) - 1 = Fib(h+1) + Fib(h) - 1
= 1 + (Fib(h+1) - 1) + (Fib(h) - 1)
= 1 + minnodes(h-1) + minnodes(h-2)
(as required)
d. But Fib(x) has a closed form (Binet's formula):
Fib(x) = (1/sqrt(5)) * ((1 + sqrt(5))/2)**x - (1/sqrt(5)) * ((1 - sqrt(5))/2)**x
Using sqrt(5) = 2.24, we approximate
Fib(x) = .45*(1.62)**x - .45*(-.62)**x
But since the second term approaches zero rapidly for large x, we
can use
Fib(x) ~= .45*(1.62)**x for large x
e. This gives
minnodes(h) ~= .45*(1.62)**(h+2) - 1 for large h
For large h, the -1 becomes insignificant, so we can use
.45*(1.62)**(h+2)
f. Thus, for an HB(1) tree with n nodes, we have
n >= .45*(1.62)**(h+2)
Taking logs base 1.62:
log_1.62 n >= (h+2) + log_1.62 .45 = (h+2) - 1.66
h <= log_1.62 n - 0.34
But since log_1.62 n = 1.44 log_2 n, we have
h <= 1.44 log_2 n - 0.34
i.e. h = O(log n)
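As a quick check of the recurrence against the closed form, here is a small,
purely illustrative C++ program (the function names are our own); the two
columns it prints should agree:

    #include <iostream>

    // minnodes(h) computed directly from the recurrence above
    long minnodes(int h) {
        if (h <= 1) return h;                  // minnodes(0) = 0, minnodes(1) = 1
        return 1 + minnodes(h-1) + minnodes(h-2);
    }

    // Fibonacci numbers, with Fib(1) = Fib(2) = 1
    long fib(int x) {
        long a = 0, b = 1;                     // Fib(0) = 0, Fib(1) = 1
        for (int i = 0; i < x; i++) { long t = a + b; a = b; b = t; }
        return a;
    }

    int main() {
        for (int h = 0; h <= 20; h++)
            std::cout << h << ": " << minnodes(h) << "  " << fib(h+2) - 1 << "\n";
    }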
E. Maintaining AVL trees when inserting new nodes
1. We have seen that an AVL tree has height at worst 44% greater than
a complete binary tree. Moreover, while maintaining a binary search
tree as a complete tree under insertions would be costly, it
turns out that maintaining an AVL tree under insertions and deletions
is not. In fact, all operations on an AVL tree can be done in O(h)
time = O(log n).
2. The key idea is this: we associate with each node in the tree a field
recording its BALANCE.
a. The balance of a node is defined as follows:
height(left subtree) - height(right subtree).
b. Example: in the following tree the balances are recorded on each
node.
(0)
/ \
(1) (-1)
/ \ / \
(1) (0) (0) (-1)
/ \
(0) (0)
c. Clearly in an HB(1) tree the only possible balance values are -1,
0, and +1. Thus, if space is at a premium, we may be able to avoid
allocating a separate field for the balance by using one bit in each
child pointer to tag the node as "heavy on the left" (left child
tag bit set), "heavy on the right" (right child tag bit set) or
"balanced" (no tag bits set.) However, in our examples we will work
with a separate tag field.
3. In the insertion algorithm, we add new nodes at the bottom of the
tree as always. But after we do so, we work our way back up to the
root of the tree, updating balances and possibly performing a
correction should a balance become < -1 or > +1.
a. This can be done easily with a recursive insert; we take care of
updating balances etc. as we unwind the insertion - e.g.
to insert newinfo into tree rooted at r:
if r = nil then
insert new node with balance 0 here
else if newinfo < r^.info then
insert into left child
update balance of r
else
insert into right child
update balance of r
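In C++-like form this skeleton might look as follows (purely a sketch - the
node layout and the helper flag heightGrew are our own choices, and the
rebalancing step is only indicated by a comment, since the rotations are
developed in items 4-7 below):

    struct Node {
        int key;                     // the key (info) stored in this node
        int balance;                 // height(left subtree) - height(right subtree)
        Node *left, *right;
        Node(int k) : key(k), balance(0), left(0), right(0) {}
    };

    // Insert key into the subtree rooted at r; returns the (possibly new) root
    // of that subtree.  heightGrew reports whether the subtree's height
    // increased, so the caller can update its own balance as the recursion
    // unwinds.
    Node* insert(Node* r, int key, bool& heightGrew) {
        if (r == 0) {                            // empty spot: new node, balance 0
            heightGrew = true;
            return new Node(key);
        }
        bool childGrew = false;
        if (key < r->key) {
            r->left = insert(r->left, key, childGrew);
            if (childGrew) r->balance += 1;      // left side got taller
        } else {
            r->right = insert(r->right, key, childGrew);
            if (childGrew) r->balance -= 1;      // right side got taller
        }
        // If r->balance is now +2 or -2, a corrective rotation (items 4-7 below)
        // is applied here; the rotation restores the old subtree height, so in
        // that case the height did not grow.
        heightGrew = childGrew && (r->balance == 1 || r->balance == -1);
        return r;
    }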
b. To perform the update, we observe that one of two things can
have happened as a result of the insertion:
i. The height of the child was not changed. In this case, the
balance of the parent (r) does not change
ii. The height of the child increased by one. In this case, the
balance of the parent changes by +/- 1 (depending on whether
the insertion was on its left or right side.)
iii. We can decide between these two cases by looking at the
child before and after the insert.
Before insert After insert
nil --- height increased
bal = 0 bal = 0 height unchanged
bal = 0 bal = +/- 1 height increased
bal = +/- 1 bal = 0 height unchanged
bal = +/- 1 bal = +/- 1 height unchanged
c. Now if the height of the child increased, then we have the
following cases with regard to the parent's balance
Side where Parent's balance New balance when
insert done before insert child height increased
L -1 0
L 0 +1
L +1 TROUBLE
R -1 TROUBLE
R 0 -1
R +1 0
4. In the two cases labeled "TROUBLE" we have to perform some
corrective action at the parent node r. WE CONSIDER ONLY THE
CASE OF INSERTION ON THE LEFT HERE - insertion on the right is
a mirror image and is left as an exercise to the student.
5. The corrective action to be taken depends on the balance of the
child. Recall that there are two ways a child's height can
increase - it can go from nil to non-nil, or its balance can
go from 0 to +/- 1. Only the latter, however, can cause an
imbalance at the parent. (If the child was nil before the
insert, the parent could not already have been heavy on that side,
and so cannot become unbalanced there now.) Thus, we
have one of the following two cases
a. Insertion into left subtree of child:
BEFORE: AFTER:
(r) (r)
/ +1 \ / +1 \
(c) T3 (c) T3
/ 0 \ /+1 \
T1 T2 T1 T2
If height of T1 = h Height of T1 is now h+1
then height of T2 = h (Other heights unchanged)
and height of T3 = h
Overall height is h+2 Overall height is h+3
b. Insertion into right subtree of child:
BEFORE: AFTER:
(r) (r)
/ +1 \ / +1 \
(c) T3 (c) T3
/ 0 \ /-1 \
T1 T2 T1 T2
If height of T1 = h Height of T2 is now h+1
then height of T2 = h (Other heights unchanged)
and height of T3 = h
Overall height is h+2 Overall height is h+3
6. The first sort of problem is handled by a RIGHT ROTATION AROUND R:
BEFORE: AFTER INSERT: AFTER ROTATION:
(r) (r) (c)
/ +1 \ / +1 \ / 0 \
(c) T3 (c) T3 T1 (r)
/ 0 \ /+1 \ / 0 \
T1 T2 T1 T2 T2 T3
If height of T1 = h Height of T1 is now h+1
then height of T2 = h (Other heights unchanged)
and height of T3 = h
Overall height is h+2 Overall height is h+3 Overall height is h+2
a. Observe that the rotation preserves the inorder traversal order,
and thus the binary search tree property.
Before: inorder = T1 c T2 r T3
After: inorder = T1 c T2 r T3
b. Observe, too, that since the overall height of the subtree formerly
rooted at r (but now at c) is again h+2, no further corrections
will be needed higher in the tree.
c. Thus, after insertion and correcting rotation, the tree is once
again an AVL tree!
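A sketch of this rotation in C++ (using the illustrative Node structure from
the insertion sketch above; the balance settings apply to exactly this case,
where r's balance has become +2 and c's is +1):

    // Right rotation around r.  Returns the new root of the subtree (c);
    // the caller links it into place where r used to be.
    Node* rotateRight(Node* r) {
        Node* c = r->left;
        r->left  = c->right;    // T2 becomes r's left subtree
        c->right = r;           // r becomes c's right child
        r->balance = 0;         // r now has T2 (height h) and T3 (height h)
        c->balance = 0;         // c now has T1 (height h+1) and r (height h+1)
        return c;
    }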
7. The second sort of problem is more complex, and requires a DOUBLE
ROTATION involving c and r. To analyze this, we need to get the
right child of c into the act. (We'll call it g, since it's a
grandchild of r.)
BEFORE: AFTER: LEFT ROTATE RIGHT ROTATE
AROUND C:                  AROUND R:
(r) (r) (r) (g)
/ +1 \ / +1 \ / +1 \ / 0 \
(c) T3 (c) T3 (g) T3 (c) (r)
/ 0 \ /-1 \ / \ / * \ / * \
T1 (g) T1 (g) (c) T2b T1 T2a T2b T3
/ 0 \ /+/-1\ / \
T2a T2b T2a T2b T1 T2a
If height of T1 = h Height of either T2a or * = see below
then height of T2a=h-1 T2b (but not both) is
and height of T2b=h-1 now h.
and height of T3 = h (Other heights unchanged)
Overall height is h+2 Overall height is h+3 Overall height is h+2
a. The setting of the balances is now a bit more complex. While g's
balance is always 0 (since each subtree is of height h+1), we have
two cases for c and r, based on the balance of g after the
insertion but before the rotation.
Balance of g Height of subtrees Balance of c and r after
after insert after insert rotate
but before
rotate T2a T2b c r
-1 h-1 h +1 0
+1 h h-1 0 -1
b. Observe that the rotations preserve the inorder traversal order,
and thus the binary search tree property.
Before: inorder = T1 c T2a g T2b r T3
After: inorder = T1 c T2a g T2b r T3
c. Observe, too, that since the overall height of the subtree formerly
rooted at r (but now at g) is again h+2, no further corrections
will be needed higher in the tree.
d. Thus, after insertion and correcting rotation, the tree is once
again an AVL tree!
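Again as an illustrative C++ sketch (same Node structure; this handles exactly
the case above, where r's balance has become +2 and c's is -1):

    // Double (left-right) rotation: left around c, then right around r.
    // Returns the new root of the subtree (g).
    Node* rotateLeftRight(Node* r) {
        Node* c = r->left;
        Node* g = c->right;
        // Set the balances of c and r from g's pre-rotation balance (table in a.)
        if (g->balance == +1)      { c->balance = 0;  r->balance = -1; }
        else if (g->balance == -1) { c->balance = +1; r->balance = 0;  }
        else                       { c->balance = 0;  r->balance = 0;  }
        // (g->balance == 0 can only happen when g is itself the new node)
        g->balance = 0;
        c->right = g->left;      // T2a
        r->left  = g->right;     // T2b
        g->left  = c;
        g->right = r;
        return g;                // caller links g into place where r used to be
    }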
8. Note that these AVL insertion operations - even when rotations are
needed - still take time proportional to the height of the tree.
Thus, insertion into an AVL tree is O(log n).
F. Deletion from AVL trees
1. As always, deletion is a harder operation than insertion; but it is
always possible to delete a node from an AVL tree in O(log n) time.
2. To keep our life simple, we consider only the case of deleting a
node with at most one child. (Recall that the problem of deleting a
node with two children can be converted into this case by "promoting"
data from its inorder predecessor or successor and then deleting the
duplicated data lower in the tree.)
3. If a node has one child, then its balance before deletion is +/- 1,
and its child's balance is 0. We simply move the child up to take
the place in the structure of the deleted node.
4. As in insertion, we have to make a pass through the tree from the
parent of the deleted node back to the root, resetting balances and
possibly performing corrective rotations.
a. This time, we are looking for situations where the height of
a child DECREASED:
Before delete After delete
--- nil height decreased
bal = 0 bal = 0 height unchanged
bal = 0 bal = +/- 1 height unchanged
bal = +/- 1 bal = 0 height decreased
bal = +/- 1 bal = +/- 1 height unchanged
b. Now if the height of the child decreased, then we have the
following cases with regard to the parent's balance
Side where Parent's balance New balance when
delete done before delete child height decreased
L -1 TROUBLE
L 0 -1
L +1 0
R -1 0
R 0 +1
R +1 TROUBLE
5. In the two cases labeled "TROUBLE" we have to perform some
corrective action at the parent node r. WE CONSIDER ONLY THE
CASE OF DELETION ON THE LEFT HERE - deletion on the right is
a mirror image and is left as an exercise to the student.
6. The corrective action to be taken depends on the balance of the
OTHER child - the one on the side where the delete did NOT
occur. (This child must be non-nil, else an imbalance
could not have arisen.) Now, however, we have three major cases:
a. Left child deleted; right child heavy on right. Fix by
LEFT ROTATE around r.
BEFORE: AFTER DELETE: AFTER ROTATE:
(r) (r) (c)
/ -1 \ / -2 \ / 0 \
T1 (c) T1 (c) (r) T3
/ -1 \ / -1 \ / 0 \
T2 T3 T2 T3 T1 T2
If height of T1 = h Height of T1 is now h-1
then height of T2 = h-1 (Other heights unchanged)
and height of T3 = h
Overall height is h+2 Overall height is h+2 Overall height is h+1
b. Left child deleted; right child balanced. Fix by LEFT ROTATE
around r:
BEFORE: AFTER DELETE: AFTER ROTATE:
(r) (r) (c)
/ -1 \ / -2 \ /+1 \
T1 (c) T1 (c) (r) T3
/ 0 \ / 0 \ /-1 \
T2 T3 T2 T3 T1 T2
If height of T1 = h Height of T1 is now h-1
then height of T2 = h (Other heights unchanged)
and height of T3 = h
Overall height is h+2 Overall height is h+2 Overall height is h+2
c. Left child deleted; right child heavy on left. Fix by RIGHT
ROTATE around c, then LEFT ROTATE around r.
BEFORE: AFTER DELETE: AFTER BOTH ROTATES:
(r) (r) (g)
/ -1 \ / -2 \ / 0 \
T1 (c) T1 (c) (r) (c)
/ +1 \ / +1 \ / * \ / * \
(g) T3 (g) T3 T1 T2a T2b T3
/ \ / \
T2a T2b T2a T2b
If height of T1 = h Height of T1 is now h-1
then at least one of
T2a, T2b is h-1; other
may be h-2. The height of T3 = h-1.
Overall height is h+2 Overall height is h+2 Overall height is h+1
* Balances of r and c are functions of original balance of g
d. Observe that, as with insert, the rotations preserve the inorder
traversal order in every case.
e. However, in some cases the height of the overall subtree formerly
rooted at r (and now at c or g) decreases by 1. Thus, further
corrections may still be needed higher up in the tree - in
contrast to the case with insert.
IV. Splay Trees
-- ----- -----
A. The AVL tree we just considered gives guaranteed O(log N) cost for each
operation on the tree, at the cost of storing some additional information
in each node (its balance) and doing significant extra work on insertions
and deletions.
B. Our book discusses an alternative strategy that avoids the need for storing
additional information and does additional work on lookups and deletions
but not on insertions. Further, the additional work is less complex
than the additional work required to maintain an AVL tree.
1. This strategy, known as SPLAYING, does not guarantee that EACH
individual operation will have cost O(log N) - it could in fact have
cost O(N). However, it guarantees that the AVERAGE COST of any series
of operations is O(log N). (If one step in the series takes O(N)
time, the time taken by the remaining steps is small enough that the
average is O(log N).)
2. We say, therefore, that a splay tree has AMORTIZED COST for each
operation O(log N).
3. The basic idea is that, after looking up a node in the tree (as part
of a lookup or delete operation), we perform a series of rotations
working back up the tree which results in the node we just found
being made the root of the tree. If the node we found was down a
long, narrow branch of the tree, a side effect is to make the whole
tree much more balanced.
C. The whole process of doing this is discussed at length in chapter 4 of
the book, and a different approach is discussed in chapter 12. (The
latter approach does the splaying on the way down the tree when finding
a node, rather than on the way back up after finding it.) Chapter 12
also discusses the analysis that leads to the O(log N) amortized cost -
a rather messy process!
D. Time will not permit further discussion here.
V. 2-3-4 and Red-Black Trees
- ----- --- --------- -----
A. We have considered one type of search tree that has guaranteed O(log n)
performance for all operations: the AVL tree. We now consider another
type of search tree having a similar guarantee: the 2-3-4 tree,
and then an implementation of this basic idea known as a red-black tree.
The latter structure is particularly important because it is the one
used by the STL set, map, multiset, and multimap containers.
B. A 2-3-4 tree is a search tree in which each node has 2, 3, or 4 children -
and contains 1, 2, or 3 keys - e.g.
A 2 node:
( k )
/ \
All keys are < k All keys are > k
A 3 node:
( k1 k2 )
/ | \
All keys are < k1 | All keys are > k2
|
All keys lie between k1 and k2
A 4 node:
( k1 k2 k3 )
/ | | \
All keys are < k1 | | All keys are > k3
| |
All keys lie between k1 and k2 All keys lie between k2 and k3
C. Such a tree can be searched by an extension of our binary search tree
algorithms. We won't consider this, though, because we will eventually
use a different representation for the tree.
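For concreteness, one possible node layout and the corresponding search routine
might look like this in C++ (an illustrative sketch only - as noted, we will
actually use a different representation):

    struct Node234 {
        int nkeys;               // 1, 2, or 3 keys (a 2 node, 3 node, or 4 node)
        int key[3];              // key[0 .. nkeys-1], in ascending order
        Node234* child[4];       // child[0 .. nkeys]; all nil in a leaf
    };

    // Returns true iff k occurs in the tree rooted at t.
    bool search(Node234* t, int k) {
        while (t != 0) {
            int i = 0;
            while (i < t->nkeys && k > t->key[i]) i++;    // find the branch to take
            if (i < t->nkeys && k == t->key[i]) return true;
            t = t->child[i];     // keys in child[i] lie between key[i-1] and key[i]
        }
        return false;
    }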
D. Of more interest is the problem of maintaining the tree.
1. As always, we will insert new keys at the bottom of the tree. But
now we have several possibilities:
a. If we encounter a 2 node or a 3 node whose children are all nil,
instead of creating a new node we will add the key to the
existing node. Example - Insert Raccoon into:
Fox Hippo
/ | \
Aardvark Cow Dog Goose Zebra
The rightmost node becomes: Raccoon Zebra
b. If we encounter a 4 node whose children are all nil, we will SPLIT
the 4 node into two 2 nodes, and pass the middle key up to its
parent, and then insert the new key - e.g. insert Elephant into
the above tree:
Split the four node:
Cow Fox Hippo
/ | | \
Aardvark Dog Goose Raccoon Zebra
Insert the new key:
Cow Fox Hippo
/ | | \
Aardvark Dog Elephant Goose Raccoon Zebra
(Note that the split is done BEFORE the insertion.)
2. However, there is one place where this strategy can get us into
trouble. Suppose, when we split a four node, its parent is also a
four node? Then, we have no room to insert the key in the parent.
a. This can be handled by simply splitting the parent and promoting
one of its keys - which in turn could cause another split etc.
This can get messy.
b. A cleaner alternative is to adopt a policy of ALWAYS SPLITTING
any four node we encounter going down the tree on a search.
i. Even if we don't have to do so immediately, we will probably
have to do so eventually.
ii. This ensures whenever we split a node that its parent will be
either a 2 node or a 3 node, with room for the promoted key.
3. One special case remains - the root. If the root of the tree is
a four node, what do we do? (Where do we put the promoted key?)
a. The answer is that we create a new two node to become the root
of the tree, which adopts the two halves of the original root - e.g.
Before: After:
| |
Cow Fox Gopher Fox
/ \
Cow Gopher
b. When this occurs, the overall tree increases its height by one.
However, this will be a rather rare occurrence, since two promotions
from below (each the result of splitting a four node) will have to
occur before the root again needs to split.
4. One interesting property of a 2-3-4 tree is that it is ALWAYS
height-balanced. In particular:
a. Either all the children of a node are nil or none of them are.
b. All the nodes with nil children are at the same level.
This follows from the fact that we never insert new nodes into the
tree; rather, we insert into or split existing nodes. The only
way the height of the tree ever increases is when the root splits,
and this affects all leaves equally.
5. Example: insert T H E Q U I C K B R O W N F O X
into an initially empty 2-3-4 tree
T [ T ]
H [ H T ]
E [ E H T ]
Q [ H ]
/ \
[ E ] [ Q T ]
U [ H ]
/ \
[ E ] [ Q T U ]
I [ H T ]
/ | \
[ E ] [ I Q ] [ U ]
C [ H T ]
/ | \
[ C E ] [ I Q ] [ U ]
K [ H T ]
/ | \
[ C E ] [ I K Q ] [ U ]
B [ H T ]
/ | \
[ B C E ] [ I K Q ] [ U ]
R [ H K T ]
/ | | \
[ B C E ] [ I ] [ Q R ] [ U ]
O [ K ]
/ \
[ H ] [ T ]
/ \ / \
[ B C E ] [ I ] [ O Q R ] [ U ]
W [ K ]
/ \
[ H ] [ T ]
/ \ / \
[ B C E ] [ I ] [ O Q R ] [ U W ]
N [ K ]
/ \
[ H ] [ Q T ]
/ \ / | \
[ B C E ] [ I ] [ N O ] [ R ] [ U W ]
F [ K ]
/ \
[ C H ] [ Q T ]
/ | \ / | \
[ B ] [ E F ] [ I ] [ N O ] [ R ] [ U W ]
O already present
X [ K ]
/ \
[ C H ] [ Q T ]
/ | \ / | \
[ B ] [ E F ] [ I ] [ N O ] [ R ] [ U W X ]
Question: what sequence of keys would most quickly lead to another
insertion into the root?
Answer: Y M P
E. Let's analyze the efficiency of operations on a 2-3-4 tree. Clearly
insert and locate are O(h). (We don't consider delete here, because
it is considerably more complex to implement; however, it too is O(h)).
What is the relationship between the number of KEYS in a 2-3-4 tree (n)
and its height (h)? (Note: we worry about keys, not number of nodes.)
1. As we did with the AVL tree, we consider the related problem of
determining the MINIMUM number of keys in a 2-3-4 tree of height h.
Clearly, such a tree would have every node a two node. Thus, we
would have:
Level Number of nodes Number of keys
1 1 1
2 2 2
3 4 4
2. Therefore, we can see that n >= 2^h - 1, so
h <= log_2 (n+1)
3. We can also show by similar reasoning on a MAXIMAL tree (all 4 nodes)
that n <= 4^h - 1, so
h >= log_4 (n+1)
But since log_4 n = 0.5 * log_2 n, we have h = Theta(log n)
F. We have seen that 2-3-4 trees represent a nice way of maintaining a
balanced search tree. Unfortunately, the code to maintain and use them
is made complex by the fact that we have to deal with three different
types of node. We now consider a type of tree called a red-black tree
that implements the principle of the 2-3-4 tree more easily.
1. A red-black tree is a binary tree in which we associate a COLOR
(red or black) with each link. (This can be implemented by a single
tag bit on each pointer.)
2. We can represent a 2-3-4 tree by a red-black tree as follows:
two node ( ) / = black ptr
/ \ //= red ptr
three node ( ) or ( )
// \ / \\
( ) ( )
/ \ / \
four node ( )
// \\
( ) ( )
/ \ / \
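An illustrative C++ layout: the color of the link from a node's parent is
usually kept in the node itself (one bit), and a 4 node then shows up as a node
with two red children:

    enum Color { RED, BLACK };

    struct RBNode {
        int key;
        Color color;             // color of the link from this node's parent
        RBNode *left, *right;
    };

    // In the 2-3-4 view, a node whose two child links are both red is the
    // middle key of a 4 node; this is the test used on the way down on insert.
    bool isFourNode(RBNode* n) {
        return n != 0
            && n->left  != 0 && n->left->color  == RED
            && n->right != 0 && n->right->color == RED;
    }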
3. This kind of tree has several interesting properties:
a. Our lookup algorithm is the same as for an ordinary binary search
tree.
b. The number of black links on any path from root to leaf is the
same for all paths in the tree.
c. On any path from root to leaf, we never encounter two successive
red links. Thus, the total number of links on any path from
root to leaf is at most twice the number of links in the
corresponding 2-3-4 tree. (This relates to our demonstration that
the height of a 2-3-4 tree lies between log_4 n and log_2 n.)
4. On insert, when going down the tree, if we encounter a node with
two red children, it represents a four node and should be split.
There are four cases, depending on the parent of the four node.
(Note: most cases below are actually two separate cases, depending on
whether the node being split is the left child or the right child of
its parent. We consider only the left child case; the right child
one is a mirror image.)
a. There is no parent - the node being split is the root of the whole
tree.
2-3-4 TREE BEFORE RED-BLACK REPRESENTATION BEFORE
------------
| C1 C2 C3 | (C2)
------------ // \\
/ | | \ (C1) (C3)
T1 T2 T3 T4 / \ / \
T1 T2 T3 T4
2-3-4 TREE AFTER RED-BLACK REPRESENTATION AFTER
------
| C2 | (C2)
------ / \
/ \ (C1) (C3)
------ ------ / \ / \
| C1 | | C3 | T1 T2 T3 T4
------ ------
/ \ / \
T1 T2 T3 T4
Observe: The only change is to convert the two red child pointers
to black!
b. The parent is a two node.
2-3-4 TREE BEFORE RED-BLACK REPRESENTATION BEFORE
-----
| P |
-----
/ \ ( P )
------------ T5 / \
| C1 C2 C3 | (C2) T5
------------ // \\
/ | | \ (C1) (C3)
T1 T2 T3 T4 / \ / \
T1 T2 T3 T4
2-3-4 TREE AFTER RED-BLACK REPRESENTATION AFTER
--------
| C2 P |
--------
/ | \ ( P )
------ ----- T5 // \
| C1 | |C3 | (C2) T5
------ ----- / \
/ | | \ (C1) (C3)
T1 T2 T3 T4 / \ / \
T1 T2 T3 T4
Observe: the only change needed to implement the split is to
change the pointer from the parent to the node being
split to red, and to change the two child pointers of
the node being split to black!
c. The parent is a three node with node being split on its "black"
side.
2-3-4 TREE BEFORE RED-BLACK REPRESENTATION BEFORE
---------
| P1 P2 |
---------
/ | \ (P1)
------------ T5 T6 / \\
| C1 C2 C3 | (C2) (P2)
------------ // \\ / \
/ | | \ (C1) (C3) T5 T6
T1 T2 T3 T4 / \ / \
T1 T2 T3 T4
2-3-4 TREE AFTER RED-BLACK REPRESENTATION AFTER
------------
| C2 P1 P2 |
------------
/ | | \ (P1)
------ ----- T5 T6 // \\
| C1 | |C3 | (C2) (P2)
------ ----- / \ / \
/ | | \ (C1) (C3) T5 T6
T1 T2 T3 T4 / \ / \
T1 T2 T3 T4
Observe: This case is basically the same as the previous one.
d. The parent is a three node with node being split on its "red"
side. We have two subcases:
i. Node being split is leftmost child of parent.
2-3-4 TREE BEFORE RED-BLACK REPRESENTATION BEFORE
---------
| P1 P2 | (P2)
--------- // \
/ | \ (P1) T6
------------ T5 T6 / \
| C1 C2 C3 | (C2) T5
------------ // \\
/ | | \ (C1) (C3)
T1 T2 T3 T4 / \ / \
T1 T2 T3 T4
2-3-4 TREE AFTER RED-BLACK REPRESENTATION AFTER
------------
| C2 P1 P2 |
------------
/ | | \ (P1)
------ ------ T5 T6 // \\
| C1 | | C3 | (C2) (P2)
------ ------ / \ / \
/ | | \ (C1) (C3) T5 T6
T1 T2 T3 T4 / \ / \
T1 T2 T3 T4
Observe: This has required a right rotation around the root of
the parent. The root of the subtree is now P1, not P2.
ii. Node being split is second child from left of parent.
2-3-4 TREE BEFORE RED-BLACK REPRESENTATION BEFORE
---------
| P1 P2 | (P2)
--------- // \
/ | \ (P1) T6
T1 ------------ T6 / \
| C1 C2 C3 | T1 (C2)
------------ // \\
/ | | \ (C1) (C3)
T2 T3 T4 T5 / \ / \
T2 T3 T4 T5
2-3-4 TREE AFTER RED-BLACK REPRESENTATION AFTER
------------
| P1 C2 P2 | (C2)
------------ // \\
/ | | \ (P1) (P2)
T1 ------ ------ T6 / \ / \
| C1 | | C3 | T1 (C1) (C3) T6
------ ------ / \ / \
/ | | \ T2 T3 T4 T5
T2 T3 T4 T5
Observe: This has required a double rotation - left around the left
child of the parent, then right around the root of the parent.
The root of the subtree is now C2, not P2.
e. The parent can never be a four node; if it were, it would have
already been split when passing through it to the child we are now
splitting. Thus, we have covered all possible cases (except for
mirror images.)
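In the red-black representation, all of the cases above begin with the same
recoloring; cases d.i and d.ii then add a single or double rotation. A hedged
C++ sketch (using the illustrative RBNode layout above; the rotations
themselves are only indicated in comments):

    // Split the 4 node whose middle key is n, as we pass through it on the
    // way down during insertion.
    void split(RBNode* n, bool nIsTreeRoot) {
        if (!nIsTreeRoot)
            n->color = RED;          // n's key moves up into its parent's node
                                     // (case a: the root of the whole tree stays black)
        n->left->color  = BLACK;     // the two halves of the old 4 node
        n->right->color = BLACK;
        // Cases a, b and c need nothing more.  In cases d.i and d.ii the link
        // to n's parent was already red, so the flip leaves two red links in a
        // row; a single or double rotation around n's grandparent (just like
        // the AVL rotations seen earlier) is then applied to restore the shape.
    }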
5. Finally, we must consider what happens on insert when we reach the
bottom of the tree.
a. In the 2-3-4 tree, we never add nodes at the bottom of a tree -
we simply insert a new key into a leaf node. (Recall that our
splitting strategy guarantees that each node on the path from the
root down will be at most a 3 node, so there will always be
room.)
b. However, in the red-black implementation we actually DO add a node.
c. Again there are several cases dependent on the leaf into which the
key is to go. (Again, we consider insertion on the left. Insertion
on the right is the mirror image.)
i. The leaf is a two node:
2-3-4 TREE BEFORE RED-BLACK REPRESENTATION BEFORE
-----
| L | ( L )
----- / \
/ \ nil nil
nil nil
2-3-4 TREE AFTER RED-BLACK REPRESENTATION AFTER
-------
| N L | ( L )
------- // \
/ | \ (N) nil
nil nil nil / \
nil nil
ii. The leaf is a three node, with new key going on the "black"
side:
2-3-4 TREE BEFORE RED-BLACK REPRESENTATION BEFORE
---------
| L1 L2 | (L1)
--------- / \\
/ | \ nil (L2)
nil nil nil / \
nil nil
2-3-4 TREE AFTER RED-BLACK REPRESENTATION AFTER
-----------
| N L1 L2 | (L1)
----------- // \\
/ | | \ (N) (L2)
nil nil nil nil / \ / \
nil nil nil nil
iii. The leaf is a three node, with new key going on the outside of
"red" side:
2-3-4 TREE BEFORE RED-BLACK REPRESENTATION BEFORE
---------
| L1 L2 | (L2)
--------- // \
/ | \ (L1) nil
nil nil nil / \
nil nil
2-3-4 TREE AFTER RED-BLACK REPRESENTATION AFTER
-----------
| N L1 L2 | (L1)
----------- // \\
/ | | \ (N) (L2)
nil nil nil nil / \ / \
nil nil nil nil
Note rotation around root of subtree.
iv. The leaf is a three node, with new key going on the inside of
"red" side:
2-3-4 TREE BEFORE RED-BLACK REPRESENTATION BEFORE
---------
| L1 L2 | (L2)
--------- // \
/ | \ (L1) nil
nil nil nil / \
nil nil
2-3-4 TREE AFTER RED-BLACK REPRESENTATION AFTER
-----------
| L1 N L2 | (N)
----------- // \\
/ | | \ (L1) (L2)
nil nil nil nil / \ / \
nil nil nil nil
Note double rotation (new node is originally added as a child of
L1, then L1 subtree is rotated left, then main subtree is rotated
right.)
VI. Weight-Balanced Binary Search Trees
-- --------------- ------ ------ -----
A. As we indicated earlier, in cases where the set of key values to be
searched is known in advance, if we know the relative frequency of
searches for the different keys, then we can construct a search tree
that performs even better than a height-balanced one.
B. We now consider a method for finding the optimal binary search tree
for a given static set of keys, given an advance knowledge of the
probabilities of various values being sought.
1. First, though, we need to note that our discussion of probabilities
for various keys thus far has not been complete. In particular, we
have assumed that only keys actually in the tree will be searched
for.
a. If - as is often the case - some searches will involve keys not
in the tree, then we must also consider the cost of these failures.
b. For example, consider our tree of Pascal reserved words. When
the lexical scanner of a Pascal compiler encounters a word, it
does not know whether it is a reserved word or a user-defined
identifier until it checks the reserved word table. If the word
is not in the table, then it must be an identifier.
c. To handle this, we convert our search tree into an EXTENDED TREE
by adding FAILURE NODES (by convention, drawn as square boxes.)
Example: our balanced tree for the seven Pascal reserved words:
forward
/ \
case packed
/ \ / \
begin for otherwise varying
/ \ / \ / \ / \
[] [] [] [] [] [] [] []
Each failure node represents a group of keys for which the
search would fail - e.g. the leftmost one represents all keys
less than begin [a, apple, and]; the second all keys between
begin and case [boolean, c] etc.
d. In calculating the cost of a tree, we need to consider both
the probabilities of failures and of successes.
i. Let p_i be the probability of searching for key_i (1 <= i <= n)
ii. Let q_i be the probability of searching for a non-existent key
lying between key_i and key_(i+1). (Of course q_0 represents
all values less than key_1, and q_n all values greater than key_n.)
iii. Clearly, since we are working with probabilities, the sum of
all the p's and q's must be 1.
iv. For the above balanced tree, the total cost would be
p_4 + 2(p_2 + p_6) + 3(p_1 + p_3 + p_5 + p_7) + 3(q_0 + q_1 + ... + q_7)
- The cost of an internal node is its weight times its level -
i.e. the probability of its being searched for times the
# of key comparisons needed to find it.
- The cost of an external node is its weight times
(its level minus 1), which is the level of its parent - i.e. the
probability of its being searched for times the # of key comparisons
needed to get to the nil pointer that indicates that the
item is not in the tree.
2. To find an optimal tree, we need to define some terms and measures:
a. T_ij is a binary search tree containing key_(i+1) through key_j.
b. T_ii, then, is an empty tree, consisting only of the failure node
lying between key_i and key_(i+1).
c. The weight of T_ij is p_(i+1) + p_(i+2) + ... + p_j + q_i + q_(i+1) + ... + q_j
which is the probability that a search will end up in T_ij. The
weight of the empty tree T_ii, then, is q_i - the probability of
the failure node lying between key_i and key_(i+1). Note that, for a
non-empty tree, the weight is simply the probability of the root
plus the sum of the weights of the subtrees.
d. The cost of T_ij is calculated as follows:
- If T_ij is empty (consists only of a failure node), then its
cost is zero.
- Otherwise, its cost is the weight of its root, plus the sum
of the weights of its subtrees, plus the sum of the costs of
its subtrees.
- The first term represents the fact that search for the key
at the root costs one comparison.
- The rationale for including the costs of the subtrees in the
overall cost should be clear. To this, we add the WEIGHTS
of the subtrees to reflect the fact that we must do one
comparison at the root BEFORE deciding to go into the
subtree, and the probability that that comparison will lead
into the subtree is equal to the weight of the subtree.
e. Clearly, an optimal binary search tree is one whose cost is minimal.
f. Example - Horowitz (Pascal version) page 428 - TRANSPARENCY.
The cost of tree (b), with equal probabilities for all keys and
failures, is determined as follows:
i. Cost of external nodes = 0 in each case, and weights of
external nodes = 1/7 in each case.
ii. Cost of tree rooted at "do" = weight of root (1/7) +
sum of costs of subtrees (0) + sum of weights of subtrees (2/7) =
3/7. The weight of this subtree is also 3/7.
iii. Cost of tree rooted at "read" is 3/7 by similar reasoning, and
its weight is also 3/7.
iv. Cost of overall tree =
Weight of root = 1/7 +
Weight of left subtree = 3/7 +
Weight of right subtree = 3/7 +
Cost of left subtree = 3/7 +
Cost of right subtree = 3/7 = 13/7
3. Horowitz gives an algorithm for finding an optimal tree, given
a set of values for the p's and q's. The algorithm uses the
following terms:
a. T_ij is the OPTIMAL tree including keys i+1 .. j.
Therefore, T_0n is the optimal tree for the whole set of keys,
and is what we want to find.
b. r_ij is the ROOT of T_ij.
- Obviously, r_ij is undefined if i = j.
(We will record the value as 0 in this case.)
- If i < j, then the subtrees of T_ij are T_(i, r_ij - 1) and T_(r_ij, j)
(Clearly, if T_ij is optimal then its subtrees must be also.)
c. w_ij is the WEIGHT of T_ij.
- For i = j, w_ij = q_i.
- For i < j, w_ij = p_(r_ij) + w_(i, r_ij - 1) + w_(r_ij, j)
d. c_ij is the COST of T_ij.
- For i = j, c_ij = 0.
- For i < j, c_ij = p_(r_ij) + w_(i, r_ij - 1) + w_(r_ij, j) + c_(i, r_ij - 1) + c_(r_ij, j)
             = w_ij + c_(i, r_ij - 1) + c_(r_ij, j)
4. The operation of the algorithm for four keys is traced on page 431
(TRANSPARENCY), using the probabilities [multiplied by 16 for
convenience]: p = (3,3,1,1) and q = (2,3,1,1,1)
a. The first row represents empty trees, whose weights are simply
the appropriate "q" value, whose costs are 0, and whose roots
are undefined.
b. The second row represents trees containing just one key.
In each case, the weight is the sum of the weight of the one key
plus the weights of the two adjacent failure nodes, and the
cost is the weight of the one key (since the costs of failure
nodes are zero.) The root, of course, is the one key.
c. The third row represents the optimal choice for constructing
trees of two nodes.
i. For example, the first entry represents a tree including keys 1
and 2 - i.e. T_02. The two options would have been to let key 1
be the root or key 2 be the root. Calculating the costs:
- if key 1 is the root, then the cost is
p_1 + w_00 + w_12 + c_00 + c_12 = 3 + 2 + 7 + 0 + 7 = 19
- if key 2 is the root, then the cost is
p_2 + w_01 + w_22 + c_01 + c_22 = 3 + 8 + 1 + 8 + 0 = 20
Thus, 1 is chosen as r_02 and the cost of 19 is recorded.
ii. The remaining entries in the row are calculated in the
same way. Note that the weights and costs needed to compare
root choices are always available from previous rows.
d. Subsequent rows represent optimal trees with 3 and then 4
keys. The latter is, of course, the final answer.
5. This algorithm is implemented by the following program:
TRANSPARENCY p. 433
6. Time complexity? (ASK CLASS) - O(n^2)
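A sketch of the computation in C++, using the probabilities from the traced
example (illustrative only - Horowitz's actual program is on the transparency).
As written here, trying every possible root makes the work O(n^3); Horowitz
restricts the search for r_ij to the range r_(i,j-1) .. r_(i+1,j), which brings
it down to O(n^2):

    #include <iostream>

    const int N = 4;                              // number of keys
    int p[N+1] = {0, 3, 3, 1, 1};                 // p[1..N]  (scaled by 16)
    int q[N+1] = {2, 3, 1, 1, 1};                 // q[0..N]
    int w[N+1][N+1], c[N+1][N+1], r[N+1][N+1];

    int main() {
        for (int i = 0; i <= N; i++) {            // empty trees T_ii
            w[i][i] = q[i]; c[i][i] = 0; r[i][i] = 0;
        }
        for (int len = 1; len <= N; len++)        // trees containing len keys
            for (int i = 0; i + len <= N; i++) {
                int j = i + len;
                w[i][j] = w[i][j-1] + p[j] + q[j];
                int bestRoot = 0, bestCost = 0;
                for (int k = i+1; k <= j; k++) {  // try each key as the root
                    int cost = w[i][j] + c[i][k-1] + c[k][j];
                    if (bestRoot == 0 || cost < bestCost) { bestRoot = k; bestCost = cost; }
                }
                c[i][j] = bestCost; r[i][j] = bestRoot;
            }
        std::cout << "cost " << c[0][N] << ", root " << r[0][N] << "\n";
    }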
VII. Advanced Hashing
--- -------- -------
A. Review of basic concepts:
1. Though the actual key may be of any data type (often a string), we
treat the key as an integer by using some conversion technique - e.g.
by treating the characters in the key as digits radix 27 (if all
alphabetic) or 128 (if any ASCII character is allowed.)
2. We apply some key-to-address transformation technique (hash function)
to the key to reduce the range of possible values: 1 .. b or
0 .. b - 1 for some b.
3. We construct a table which is conceptually an array of b BUCKETS, each
of which consists of one or more SLOTS. The buckets are numbered
1 .. b or 0 .. b - 1. A key-value pair is stored, when possible, in
the bucket corresponding to the hash function of its key.
Example: A common hash function is the division remainder method.
If we have a hash table with 11 buckets numbered 0 .. 10, and
wish to store the key 42, we try to store it in bucket
42 mod 11 = 9.
4. Two keys that hash to the same bucket are called SYNONYMS, and if
both occur in the same table they are said to COLLIDE. If the
number of colliding keys hashing to the same bucket exceeds the number
of slots available in the bucket, then we have an OVERFLOW and must
deal with it somehow. Possible solutions we have looked at
previously include:
a. LINEAR PROBING - if a key belongs in bucket h but cannot fit,
then we try h + 1, h + 2 ... (wrapping around to the start as
needed.)
b. CHAINING - the hash table consists not of an array of buckets,
but of an array of LISTS OF BUCKETS. If there is not room for
a given key in existing buckets, we add a new bucket to the
appropriate list.
Both of these methods can lead to O(n) performance over time -
the first because we may (in principle) have to probe a large fraction
of all the buckets when inserting or looking up a key, and the latter
because the chains of buckets can grow long.
B. Performance improvement techniques for hashing include the following:
1. Improving the hash function. A hash function that disperses the
keys evenly over all the buckets gives much better performance than
one that tends to cluster many of the keys into one or a few
buckets.
Example: A very poor hash function for Gordon College student
ids would be to use just the first two digits. Most ID's are
of the form 93xxxxx, and this function would hash them all
to the same value!
2. Improving the overflow handling strategy. In particular, we want
to reduce the probability of repeated collisions for a given key.
3. Allowing the table to grow dynamically as keys are added, thus
reducing the probability of collision by increasing the range of
values of the hash function.
C. Possibilities for Hash Functions
1. Any hashing function should meet two basic criteria:
a. For all possible logical keys, it must produce a value in the
range 0..b-1 or 1..b.
b. It must disperse the logical keys uniformly - i.e. the probability
that a randomly chosen key hashes to any particular bucket should be
1/b - or very close to this.
This second criterion becomes more complicated if the keys to be
used exhibit some pattern or bias. (As was the case with our
student id example)
c. A further consideration is that the hashing function should not
be computationally-expensive, since we are trying to compete with
an O(log n) search strategy and can lose our advantage if too much
computation is required.
2. The division-remainder method
a. home-bucket := key mod b (to produce a result in the range 0..b-1)
or key mod b + 1 (to produce a result in the range 1..b)
b. Advantages:
i. Computationally simple if the key is an integer to begin with,
or if converting it to an integer is not too expensive.
ii. Provides good dispersion if b is a prime or at least has no
prime factors <= 20.
iii. Flexible choice of b values - many sizes to choose from.
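As a small illustration (the radix-128 conversion and the function name are our
own choices), a division-remainder hash for string keys can fold the conversion
and the mod together so the intermediate value never overflows:

    #include <string>

    // Treat the characters of the key as digits radix 128 and reduce mod b as
    // we go (Horner's rule); the result is a home bucket in 0 .. b-1.
    unsigned hashDivision(const std::string& key, unsigned b) {
        unsigned h = 0;
        for (std::string::size_type i = 0; i < key.size(); i++)
            h = (h * 128 + (unsigned char)key[i]) % b;
        return h;
    }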
3. The mid-square method
a. home-bucket := middle m bits of sqr(key).
b. This requires that b be a power of 2.
c. Example: Let b be 64, and let keys be integers ranging from 1 to
1000. Then the square of the key is 20 bits long, and we choose bits
7..12. Suppose our key is 50:
50 would hash as follows:
sqr(50) = 2500 = 0000 000[0 1001 1]100 0100 (base 2; the bits taken are bracketed)
home-bucket = 010011 (base 2) = 19
d. Advantages:
i. Computationally simple if the key is an integer to begin
with, or if converting it to an integer is not too expensive.
ii. Provides good dispersion, since the hash function depends on all
bits of the original key.
iii. Tables whose size is a power of 2 are often natural anyway.
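A sketch of this particular example in C++ (the shift amount and mask are
specific to b = 64 buckets and keys up to 1000, as above):

    // Mid-square hash: square the key and keep the 6 middle bits (bits 7..12
    // of the 20-bit square), giving a bucket number in 0 .. 63.
    unsigned midSquare(unsigned key) {
        unsigned sq = key * key;       // at most 20 bits when key <= 1000
        return (sq >> 7) & 0x3F;       // e.g. key = 50: sq = 2500, bucket = 19
    }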
4. Folding
a. Folding is one way of avoiding the need to convert a non-integer key
to an integer - which often poses a problem if the key is long
enough that the resultant value would not fit in the word length of
the underlying machine. (E.g. even a 7-letter alphabetic key,
treated as a number radix 27, could have a value bigger than the
largest 32 bit integer.)
b. The key is divided into a number of pieces, each of which is treated
as an integer. All of the pieces are added together, either
straight or with alternate pieces reversed.
c. Example: 123456789012 might be treated as four pieces
123 456 789 012
which could be added together one of two ways:
i. Shift folding: 123
456
789
012
---
1380
ii. Boundary folding: 123
654
789
210
---
1776
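A sketch of shift folding in C++ for a key held as a string of digits (the
piece size of 3 is just the choice used in the example; boundary folding would
reverse every other piece before adding):

    #include <string>
    #include <cstdlib>

    // Shift folding: break the key into 3-digit pieces and add them as
    // integers.  For "123456789012" this returns 123 + 456 + 789 + 12 = 1380.
    int shiftFold(const std::string& key) {
        int sum = 0;
        for (std::string::size_type i = 0; i < key.size(); i += 3)
            sum += std::atoi(key.substr(i, 3).c_str());
        return sum;
    }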
5. Digit analysis
a. The previous methods required no advance knowledge of the actual set
of keys to be used. If the keys (or truly characteristic subset)
are available in advance, though, a hashing function might be
developed based on an analysis of them.
b. One approach is to calculate the frequency distribution for
different values of each digit (letter) of the keys.
Example: frequency analysis on last names of students in class
6. Perfect hash functions
a. If the full set of keys is known in advance, it actually becomes
possible to generate a PERFECT HASH FUNCTION - one which hashes
each key differently.
b. Such functions typically have a range that exceeds the number of
keys, which means there are some values in the range to which no
key hashes. (These then become perennially vacant slots in the
hash table.) A perfect hash function that uses exactly the right
number of slots is called a MINIMAL PERFECT HASH FUNCTION.
c. The GNU software distribution includes a program - gperf - that
generates near-minimal perfect hash functions for fixed sets of
keys. It has been used to build reserved word tables for a number of
compilers, command interpreters etc.
7. One last topic we mention briefly is the notion of ORDER-PRESERVING
hash functions.
a. In general, it is not the case that if key1 < key2 then
hash(key1) < hash(key2). However, there are some hash functions
that have this property. They are known as order-preserving hash
functions.
b. An order-preserving hash function would be used in a case where one
wishes to have the ability to process table entries in ascending
order of key value, starting at some given point. Such processing
is needed when looking for a RANGE of key values - e.g.
JOHNS <= last_name < JOHNT
c. Of course, insisting that the hash function be order-preserving
may mean sacrificing some performance in terms of quality of
dispersion of the keys!
D. Improving Collision handling
1. Though a good hashing scheme that disperses keys uniformly can reduce
the number of synonyms and hence the probability of a collision, we
cannot avoid having to deal somehow with such collisions as do occur.
2. Comments on efficiency of linear open addressing
a. At first glance, it appears that hashing with linear open addressing
could be terribly inefficient: it could degenerate to searching the
entire table.
b. On the other hand, if the record we want is, in fact, in its home
bucket or very near to it, then this method works quite well.
c. The success of this method depends on two things:
i. Allocating enough space in the table so that there are sufficient
vacant slots to break up long searches. (A good rule of thumb is
to never allow more than 80% of the slots to actually be used - e.g.
if we wish to store records on 1200 students, then use a table
with at least 1500 slots, plus an appropriate hash function.)
ii. Choose a hash function that disperses the keys uniformly over the
slots.
d. One remaining problem that is hard to avoid, however, is the
problem of CLUSTERING.
i. Consider the following portion of a hashtable:
|_____________|
| Bucket x |
|_____________|
| Bucket x+1 |
|_____________|
| Bucket x+2 |
|_____________|
| Bucket x+3 |
|_____________|
....
Suppose bucket x overflows. Then a key belonging to bucket x
is inserted into bucket x+1.
ii. Of course, the effect of this overflow is to increase the
probability that bucket x+1 will also overflow, since it is now
receiving keys that map to two different buckets.
iii. When bucket x+1 overflows, it begins adding keys to bucket x+2.
This also becomes the place where further overflows from bucket
x must go, of course. So now bucket x+2 becomes the target for
keys hashing to three different buckets. This further enhances
the chances of bucket x+2 overflowing, which would make bucket
x+3 the target for keys hashing to four different buckets ...
iv. As you can see, linear probing suffers from the problem that
clusters of overflowing buckets can develop such that several
buckets "compete" for the same overflow space. (In the above
case, buckets x, x+1, x+2, x+3 and x+4 would all compete for
space in bucket x+4.) Further, once this clustering starts to
occur, it feeds on and compounds itself.
v. There are several alternatives available to reduce this
clustering problem.
3. Quadratic probing
a. Quadratic probing addresses the clustering problem of linear open
addressing by using a quadratic function to choose overflow buckets.
i. If a key belongs in bucket x, the following series of buckets is
examined until one is found with room to hold it:
x
(x+1) mod b
(x-1) mod b
(x+4) mod b
(x-4) mod b
(x+9) mod b
(x-9) mod b
..
i.e. the buckets probed are of the form (home +/- i^2) mod b
(And the same strategy is followed when looking up a key)
ii. Notice how this breaks up clusters. In the above example,
bucket x would first overflow to bucket x+1, increasing the
probability of overflow there. But now once bucket x+1
overflows, further overflows from bucket x would go into
bucket x-1, while overflows from bucket x+1 would go into
bucket x+2. Thus, buckets x and x+1 would not compete with
each other for overflow space, and the reinforcing effect
of local overflows would not occur.
b. Quadratic probing does, however, impose a restriction on our
choice of table sizes. In general, we want to be sure that
an insertion into a nearly full table will succeed if at all
possible, which means that - if necessary - we will eventually
probe each bucket in the table exactly once. It can be shown
that quadratic probing does this if the table size is a prime of the
form 4j + 3 - i.e. is a prime one less than a multiple of four.
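An illustrative probe loop in C++ (a table of int keys, with EMPTY marking an
unused slot; the loop bound relies on the fact just mentioned - for a prime b
of the form 4j + 3, i = 1 .. (b-1)/2 reaches every bucket):

    #include <vector>

    // Return the index of the slot holding key, or of an empty slot where it
    // can be inserted; -1 if neither is found.
    int probe(const std::vector<int>& table, int b, int key, int EMPTY) {
        int home = key % b;
        if (table[home] == key || table[home] == EMPTY) return home;
        for (int i = 1; i <= (b - 1) / 2; i++) {
            int up   = (home + i * i) % b;
            int down = ((home - i * i) % b + b) % b;       // keep the index >= 0
            if (table[up] == key   || table[up] == EMPTY)   return up;
            if (table[down] == key || table[down] == EMPTY) return down;
        }
        return -1;                                         // table is full
    }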
4. Rehashing
a. Another approach to solving the clustering problem is REHASHING.
Instead of using a single hash function, we use a series of hash
functions f1, f2, f3 ...
b. When an attempt is made to insert a key in the table, it is first
hashed using f1. If the resultant bucket is full, then the key is
hashed again using f2. If that bucket is full, then f3 is used, etc,
until some hash function hashes the key to a bucket that is empty.
The same strategy is followed when looking up a key.
c. One obvious challenge is developing a suitable series of hash
functions. Ideally, the hash functions should have the property
that if f1 hashes two keys to the same bucket, then f2 hashes them
to different buckets, etc. In contrast to other overflow handling
methods, this has the effect of causing two keys that collide
initially to not also collide on overflow. On the other hand, it
is hard to find functions having this property that also guarantee
that every bucket will eventually be tried in the case of insertion
into an almost full table.
E. Growing the table dynamically.
1. One issue with any of the schemes we have considered thus far is
correctly sizing the table.
a. A table where the ratio of available slots to actual keys stored is
large will likely have relatively few overflows, but will waste a
lot of space.
(There are many fewer car accidents in Vermont than in downtown NY!)
b. What choice does one make if it is hard to tell ahead of time how
big the table is going to get? (E.g. a given general purpose
program might need 100 slots on one run and 10,000 on the next.)
2. The use of chaining is one way to address this problem, of course.
a. Using chaining, the table can grow as needed, subject to total
available memory.
b. However, the bigger the table gets, the more the performance
degenerates.
Example: Suppose we use a hash function with range 1 .. 1000.
If there are fewer than 1000 keys in the table, and
the hash function has good dispersion, performance
is likely to be O(1). However, if we grow to 100,000
keys, then the average chain length becomes 100 and
we must do an average of 100 comparisons per probe.
c. An alternative to this is to periodically restructure the table -
e.g. doubling the size of the table and the range of the hash
function. This, of course, also requires that all existing
entries be re-inserted into the restructured table - a high
one-time cost for each restructuring.
F. Extendible Hashing.
1. Several schemes have been proposed to allow the size of a hash table
to grow dynamically in a smooth, efficient way. We consider only
one here. For others, see Smith and Barnes: Files and Databases
pp 124-135.
2. All such schemes use a hash function that generates a large range of
values. For example, on a 32-bit computer, a typical hash function
used with such a scheme would produce a full 32-bit value.
3. Initially, only a limited number of bits from the hash function are
actually used; the rest are ignored. When a bucket overflows,
however, it is split in two and an additional bit of the hash function
is then used to redistribute the keys between the halves.
4. The scheme we consider here makes use of a table called the directory,
whose size is a power of two. Each entry in the table points to a
bucket of keys, but not necessarily a unique bucket. (That is,
several table entries may point to the same bucket.) When we do
lookups or insertions, we use as many bits from the hash function
as are needed to compute an index for this table, and then follow
the pointer to the correct bucket.
Example: the following is a hashtable with bucket size 2, with
keys and hash values as shown. (Hash values are sums of
ASCII values of characters of keys with bits in reverse
order - not great, but OK). At present, three bits of
the hash function are used to distribute the keys.
------
000 ------------------------> HIPPO 0000000110
------ CAT 0001101100
001 ---------------\
------ \-------> AARDVARK 0011001001
010 -------------\
------ ----------> DOG 0101101100
011 -------------/ JACKAL 0110010110
------
100 ------------------------> ELEPHANT 1000101001
------
101 ------------------------> GOPHER 1010001110
------ FOX 1011011100
110 --------------\
------ ---------> BUFFALO 1111111110
111 --------------/
We now consider the following insertions:
MONKEY 1100101110 - would go in bucket with BUFFALO
OSPREY 0100011110 - would cause bucket containing DOG, JACKAL
to split. As a result, 010 entry in the
table would point to a bucket containing
DOG and OSPREY, while 011 entry would now
point to a bucket with JACKAL
IGUANA 1010110110 - would force us to go to using 4 bits to
differentiate keys, since there are already
two entries with 101 as their first 3 bits.
The new table (including MONKEY and OSPREY
from before) would look like this:
0000------------\
------ -----------> HIPPO 0000000110
0001------------/ CAT 0001101100
------
0010------------\
------ ------------> AARDVARK 0011001001
0011------------/
------
0100------------\
------ ------------> DOG 0101101100
0101------------/ OSPREY 0100011110
------
0110------------\
------ ------------> JACKAL 0110010110
0111------------/
------
1000------------\
------ ------------> ELEPHANT 1000101001
1001------------/
------ GOPHER 1010001110
1010-------------------------> IGUANA 1010110110
------
1011-------------------------> FOX 1011011100
------
1100----------\
------ \
1101------------\
------ ------------> MONKEY 1100101110
1110------------/ BUFFALO 1111111110
------ /
1111----------/
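A lookup in this scheme is just an index into the directory by the leading bits
of the hash value. An illustrative C++ sketch (the names, the use of a 32-bit
unsigned hash value, and the requirement 1 <= d <= 31 are our own assumptions;
bucket splitting and directory doubling are not shown):

    #include <vector>
    #include <string>

    struct Bucket {
        std::vector<std::string> keys;      // at most the bucket size of keys
    };

    // d = number of hash bits currently in use; several directory entries may
    // point at the same Bucket.
    Bucket* findBucket(const std::vector<Bucket*>& directory, int d,
                       unsigned hashValue) {
        unsigned index = hashValue >> (32 - d);     // the leading d bits
        return directory[index];
    }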
Copyright ©1999 - Russell C. Bjork