CS321 Lecture: Huffman codes revised 12/5/97
Materials: Transparencies of Huffman algorithm.
I. Introduction
- ------------
A. We introduced the concept of weight-balanced binary trees in conjunction
with binary search trees. Actually, weight-balanced trees have other
applications as well.
B. One area of considerable interest in many applications is DATA
COMPRESSION - reducing the number of bits required to store a given
body of data. We consider one approach here, based on weight-balanced
binary trees.
1. Suppose you were given the task of storing messages comprised of the
7 letters A-G plus space (just to keep things simple). In the absence
of any information about their relative frequency of use, the best you
could do would be to use a three-bit code - e.g.
000 = space
001 .. 111 = A .. G
2. However, suppose you were given the following frequency of usage data.
Out of every 100 characters, it is expected that:
10 are A's          Note: these data are contrived!
10 are B's
 5 are C's
 5 are D's
30 are E's
 5 are F's
 5 are G's
30 are spaces
a. Using the three-bit code we just considered, a typical message of
length 100 would use 300 bits.
b. Suppose, however, we used the following variable-length code
instead:
A = 000        NOTE: No shorter code can be a prefix of
B = 001              any longer code. Thus, we cannot
C = 0100             use codes like 00 or 01 - if we saw
D = 0101             these bits, we wouldn't know if they
E = 10               were a character in their own right or
F = 0110             part of the code for A/B or C/D.
G = 0111
space = 11
A message of length 100 with typical distribution would now need:
(10 * 3) + (10 * 3) + (5 * 4) + (5 * 4) + (30 * 2) + (5 * 4) +
(5 * 4) + (30 * 2) = 260 bits, a savings of about 13%
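As a sanity check on this arithmetic, here is a small Python sketch
(the names freq and code are ours, just for illustration):

    # Contrived frequencies (per 100 characters) and the variable-length
    # code from the table above.
    freq = {'A': 10, 'B': 10, 'C': 5, 'D': 5,
            'E': 30, 'F': 5, 'G': 5, ' ': 30}
    code = {'A': '000', 'B': '001', 'C': '0100', 'D': '0101',
            'E': '10', 'F': '0110', 'G': '0111', ' ': '11'}

    fixed    = sum(freq.values()) * 3                     # 300 bits
    variable = sum(freq[c] * len(code[c]) for c in freq)  # 260 bits
    print(fixed, variable, (fixed - variable) / fixed)    # 300 260 0.1333...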
3. A variable-length code can be represented by a decode tree, with
external nodes representing characters and internal nodes representing
a decision point at a single bit of the message - e.g.
                     (first bit)
                  / 0            \ 1
         (2nd bit)                 (2nd bit)
      / 0          \ 1              / 0 \ 1
(3rd bit)         (3rd bit)         [E]  [space]
 / 0 \ 1         / 0     \ 1
[A]  [B]    (4th bit)   (4th bit)
             / 0 \ 1     / 0 \ 1
            [C]  [D]    [F]  [G]
The optimum such tree is the one having the smallest weighted
external path length - i.e. the sum, over all the leaves, of each
leaf's level times its weight.
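To make the decoding process concrete, here is a small Python sketch
(ours, for illustration - not the book's code). It builds the decode
tree from the code table above, then decodes a bit string by walking
from the root and restarting at each leaf - which is exactly why the
prefix property matters:

    def build_decode_tree(code):
        root = [None, None]                # [child on 0, child on 1]
        for ch, bits in code.items():
            node = root
            for b in bits[:-1]:            # walk/create internal nodes
                i = int(b)
                if node[i] is None:
                    node[i] = [None, None]
                node = node[i]
            node[int(bits[-1])] = ch       # the last bit lands on a leaf
        return root

    def decode(bits, root):
        out, node = [], root
        for b in bits:
            node = node[int(b)]
            if isinstance(node, str):      # reached a leaf: emit, restart
                out.append(node)
                node = root
        return ''.join(out)

    tree = build_decode_tree({'A': '000', 'B': '001', 'C': '0100',
                              'D': '0101', 'E': '10', 'F': '0110',
                              'G': '0111', ' ': '11'})
    print(decode('0000011011', tree))      # prints 'ABE '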
II. The Huffman Algorithm
-- --- ------- ---------
A. An algorithm for computing such a weight-balanced code tree is the
Huffman algorithm, given in the book.
1. TRANSPARENCY
2. Basic method: we work with a linked list of partial trees.
a. Initially, the list contains one entry for each character.
b. At each iteration, we choose the two partial trees of least
weight and construct a new tree consisting of an internal
node plus these two as its children. We put this new tree
back on the list, with weight equal to the sum of its children's
weights.
c. Since each step reduces the length of the list by 1 (two
partial trees removed and one put back on), after n-1
iterations we have a list consisting of a single node, which
is our decode tree.
d. Example: For the above data. (Note: this isn't quite the way
the code actually works. For convenience, it builds the list
backwards. However, we ignore this. A code sketch follows the
trace.)
Initial list:

   A      B      C      D      E      F      G    space
  .10    .10    .05    .05    .30    .05    .05    .30
  / \    / \    / \    / \    / \    / \    / \    / \

Step 1 - remove C, D - and add new node:

   ()     A      B      E      F      G    space
  .10    .10    .10    .30    .05    .05    .30
  / \    / \    / \    / \    / \    / \    / \
 C   D

Step 2 - remove F, G - and add new node:

   ()     ()     A      B      E    space
  .10    .10    .10    .10    .30    .30
  / \    / \    / \    / \    / \    / \
 F   G  C   D

Step 3 - remove A, B - and add new node:

   ()     ()     ()     E    space
  .20    .10    .10    .30    .30
  / \    / \    / \    / \    / \
 A   B  F   G  C   D

Step 4 - remove the two .10 partial trees - and add new node:

    ()          ()      E    space
   .20         .20     .30    .30
   /  \        / \     / \    / \
 ()    ()     A   B
 / \   / \
C   D F   G

Step 5 - remove the two .20 partial trees - and add new node:

      ()              E    space
     .40             .30    .30
    /    \           / \    / \
 ()        ()
 / \      /  \
A   B   ()    ()
        / \   / \
       C   D F   G

Step 6 - remove E, space - and add new node:

  ()             ()
 .60            .40
 /  \          /    \
E   space   ()        ()
           /  \      /  \
          A    B   ()    ()
                   / \   / \
                  C   D F   G

Step 7 - construct final tree:

            ()
           1.00
         /      \
      ()           ()
    /    \        / \
 ()        ()    E  space
 / \      /  \
A   B   ()    ()
        / \   / \
       C   D F   G
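Here is a short Python sketch of the list-based construction just
traced (ours, for illustration - not the book's code). A partial tree
is a (weight, payload) pair, where the payload is either a character or
a pair of subtrees:

    def huffman(freqs):                     # freqs: {character: weight}
        trees = [(w, ch) for ch, w in freqs.items()]
        while len(trees) > 1:               # n-1 combining iterations
            trees.sort(key=lambda t: t[0])  # find the two partial trees
            a = trees.pop(0)                #   of least weight
            b = trees.pop(0)
            trees.append((a[0] + b[0], (a, b)))  # weight = sum of children
        return trees[0]                     # the single remaining tree

    def codes(tree, prefix=''):             # read the codes off the tree
        weight, payload = tree
        if isinstance(payload, str):
            return {payload: prefix}
        left, right = payload
        return {**codes(left, prefix + '0'), **codes(right, prefix + '1')}

    tree = huffman({'A': .10, 'B': .10, 'C': .05, 'D': .05,
                    'E': .30, 'F': .05, 'G': .05, ' ': .30})

Because of the ties among the .05 and .10 weights, the codes this
produces may pair siblings differently than the table in section I, but
the code lengths - and hence the 260-bit message cost - come out the
same.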
B. Analysis:
1. Constructing the initial list is O(n).
2. Transforming the list to a tree involves n-1 (= O(n)) iterations. On
each iteration, we scan the entire list to find the two partial
trees of least weight - O(n) - so this process, using the
simplest mechanism for storing the list of partial trees, is O(n^2).
3. Printing the tree is O(n).
4. The overall time is therefore O(n^2). However, we could reduce it to
O(n log n) by using a more sophisticated data structure for
the "list" of partial trees - e.g. a heap ordered by weight, as
sketched below.
C. We have applied this technique to individual characters in an alphabet.
It could also be profitably applied to larger units - e.g. we might
choose to have a single code for frequently occurring words (such as
"the") or sequences of letters within words (such as "th" or "ing").
Copyright ©1998 - Russell C. Bjork