CS321 Lecture: Huffman codes                            revised 12/5/97

Materials: Transparencies of Huffman algorithm.

I. Introduction
-  ------------

   A. We introduced the concept of weight-balanced binary trees in conjunction
      with binary search trees.  Actually, weight-balanced trees have other
      applications as well.

   B. One area of considerable interest in many applications is DATA 
      COMPRESSION - reducing the number of bits required to store a given
      body of data.  We consider one approach here, based on weight-balanced
      binary trees.
   
      1. Suppose you were given the task of storing messages composed of the
         7 letters A-G plus space (just to keep things simple).  In the absence
         of any information about their relative frequency of use, the best you
         could do would be to use a three-bit code - e.g.
   
         000        = space
         001 .. 111 = A .. G
   
      2. However, suppose you were given the following frequency of usage data.
         Out of every 100 characters, it is expected that:
   
           10 are A's           Note: these data are contrived!
           10 are B's
            5 are C's
            5 are D's
           30 are E's
            5 are F's
            5 are G's
           30 are spaces
   
         a. Using the three bit code we just considered, a typical message of
            length 100 would use 300 bits.
    
         b. Suppose, however, we used the following variable-length code
            instead:
  
            A     = 000         NOTE: No shorter code can be a prefix of
            B     = 001               any longer code.  Thus, we cannot
            C     = 0100              use codes like 00 or 01 - if we saw
            D     = 0101              these bits, we wouldn't know if they
            E     = 10                were a character in their own right or
            F     = 0110              part of the code for A/B or C/D.
            G     = 0111
            space = 11
   
            A message of length 100 with typical distribution would now need:
   
            (10 * 3) + (10 * 3) + (5 * 4) + (5 * 4) + (30 * 2) + (5 * 4) +
             (5 * 4) + (30 * 2) = 260 bits - a savings of about 13%
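
            The arithmetic above can be checked mechanically.  Here is a short
            Python sketch (an illustration, not part of the original notes);
            the frequency and code tables are copied from the ones given
            earlier:

```python
# Frequencies per 100 characters and the variable-length code from the notes.
freq = {'A': 10, 'B': 10, 'C': 5, 'D': 5, 'E': 30, 'F': 5, 'G': 5, ' ': 30}
code = {'A': '000', 'B': '001', 'C': '0100', 'D': '0101',
        'E': '10', 'F': '0110', 'G': '0111', ' ': '11'}

# Expected bits for a typical 100-character message.
bits = sum(freq[ch] * len(code[ch]) for ch in freq)
print(bits)                     # 260
print(round(1 - bits / 300, 2)) # 0.13 - about a 13% savings over 300 bits
```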

      3. A variable length code can be represented by a decode tree, with
         external nodes representing characters and internal nodes representing
         a decision point at a single bit of the message - e.g.

                                    (first bit)
                                / 0               \ 1
                         (2nd bit)              (2nd bit)
                        / 0     \ 1             / 0     \ 1
                  (3rd bit)   (3rd bit)       [E]     [space]
                  / 0  \ 1   / 0       \ 1
                [A]   [B]   (4th bit)  (4th bit)
                           / 0    \ 1   / 0   \ 1
                         [C]     [D]   [F]    [G]

         The optimum such tree is the one having the smallest weighted
         external path length - i.e. the sum of the levels of the leaves
         times their weights.
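
          Because no shorter code is a prefix of a longer one, decoding needs
          no separators between characters: start at the root, follow one edge
          per bit, and emit a character whenever an external node is reached.
          A minimal Python sketch of this walk (the nested tuples mirror the
          tree drawn above; this is an illustration, not code from the notes):

```python
# The decode tree above as nested pairs: index 0 = the 0-branch, 1 = the 1-branch.
# A string is an external node (a character); a tuple is an internal node.
tree = ((('A', 'B'), (('C', 'D'), ('F', 'G'))), ('E', ' '))

def decode(bits, tree):
    out, node = [], tree
    for b in bits:
        node = node[int(b)]        # take the 0- or 1-branch
        if isinstance(node, str):  # reached a leaf: emit it, restart at the root
            out.append(node)
            node = tree
    return ''.join(out)

print(decode('0001011', tree))     # A (000), E (10), space (11) -> 'AE '
```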
   
II. The Huffman Algorithm
--  --- ------- ---------

   A. An algorithm for computing such a weight-balanced code tree is the 
      Huffman algorithm, given in the book.
   
      1. TRANSPARENCY

      2. Basic method: we work with a linked list of partial trees.

         a. Initially, the list contains one entry for each character.

         b. At each iteration, we choose the two partial trees of least
            weight and construct a new tree consisting of an internal
            node plus these two as its children.  We put this new tree
            back on the list, with weight equal to the sum of its children's
            weights.

         c. Since each step reduces the length of the list by 1 (two
            partial trees removed and one put back on), after n-1
            iterations we have a list consisting of a single node, which
            is our decode tree.

         d. Example, for the above data.  (Note: this isn't quite the way
            the code actually works.  For convenience, it builds the list
            backwards.  However, we ignore this.)

            Initial list:        A    B    C    D    E    F    G   space
                                .10  .10  .05  .05  .30  .05  .05  .30
                                / \  / \  / \  / \  / \  / \  / \  / \

            Step 1 - remove C, D - and add new node:

                                 ()   A    B    E    F    G   space
                                .10  .10  .10  .30  .05  .05  .30
                                / \  / \  / \  / \  / \  / \  / \
                                C D

            Step 2 - remove F, G - and add new node:

                                 ()    ()   A    B    E   space
                                .10   .10  .10  .10  .30  .30
                                / \   / \  / \  / \  / \  / \
                                F G   C D

            Step 3 - remove A, B - and add new node:

                                 ()    ()   ()   E   space
                                .20   .10  .10  .30  .30
                                / \   / \  / \  / \  / \
                                A B   F G  C D

            Step 4 - remove two partial trees - and add new node:

                                 ()        ()    E   space
                                .20        .20   .30  .30
                                / \        / \   / \  / \
                              ()   ()      A B  
                             / \   / \
                             C D   F G

            Step 5 - remove two partial trees - and add new node:

                                 ()       E   space
                                .40      .30   .30
                                / \      / \   / \
                              ()   ()
                             / \   / \
                             A B ()   ()
                                 / \  / \
                                 C D  F G

            Step 6 - remove E, space - and add new node:

                                ()               ()
                                .60             .40
                                / \             / \ 
                                E  space      ()   ()
                                             / \   / \
                                             A B ()   ()
                                                 / \  / \
                                                 C D  F G

            Step 7 - construct final tree:

                                    ()
                                   1.00
                                  /    \
                                 ()     ()
                                / \     / \
                              ()   ()   E space
                             / \   / \
                             A B ()   ()
                                 / \  / \
                                 C D  F G
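
            The seven merge steps above can be sketched directly.  This is a
            minimal Python version of the list-based method (an illustration,
            not the book's code): each iteration scans the list for the two
            least weights and replaces them with a combined node.

```python
def least(trees):
    # Scan the whole list for the index of the least-weight entry - O(n).
    i = 0
    for k in range(1, len(trees)):
        if trees[k][0] < trees[i][0]:
            i = k
    return i

def huffman(weights):
    # One (weight, tree) entry per character; a bare string acts as a leaf.
    trees = [(w, ch) for ch, w in weights.items()]
    while len(trees) > 1:                    # n-1 iterations in all
        w1, t1 = trees.pop(least(trees))
        w2, t2 = trees.pop(least(trees))
        trees.append((w1 + w2, (t1, t2)))    # new internal node goes back on
    return trees[0][1]

def depths(tree, d=0):
    # Depth of each leaf = length of that character's code.
    if isinstance(tree, str):
        return {tree: d}
    out = {}
    for child in tree:
        out.update(depths(child, d + 1))
    return out

weights = {'A': 10, 'B': 10, 'C': 5, 'D': 5, 'E': 30, 'F': 5, 'G': 5, ' ': 30}
lengths = depths(huffman(weights))
print(sum(weights[ch] * lengths[ch] for ch in weights))  # 260, as in I.B.2.b
```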

   B. Analysis:

      1. Constructing the initial list is O(n).

      2. Transforming to a tree involves n-1 (= O(n)) iterations.  On
         each iteration, we scan the entire list to find the two partial
         trees of least weight = O(n) - so this process, using the
         simplest mechanism for storing the list of partial trees, is O(n^2).
         (It can be made O(n log n) by storing the partial trees in a heap.)

      3. Printing the tree is O(n).

      4. Overall is therefore O(n^2).  However, we could reduce time to
         O(n log n) by using a more sophisticated data structure for
         the "list" of partial trees - e.g. a heap based on weight.
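
         The heap-based variant suggested in point 4 can be sketched with
         Python's heapq module (again an illustration, not the book's code);
         a running counter breaks weight ties so the heap never has to
         compare two trees directly:

```python
import heapq
import itertools

def huffman_heap(weights):
    tiebreak = itertools.count()
    heap = [(w, next(tiebreak), ch) for ch, w in weights.items()]
    heapq.heapify(heap)                    # building the initial heap is O(n)
    while len(heap) > 1:                   # n-1 iterations, each O(log n)
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (t1, t2)))
    return heap[0][2]

def weighted_path_length(tree, weights, d=0):
    if isinstance(tree, str):              # leaf: depth times weight
        return d * weights[tree]
    return sum(weighted_path_length(c, weights, d + 1) for c in tree)

weights = {'A': 10, 'B': 10, 'C': 5, 'D': 5, 'E': 30, 'F': 5, 'G': 5, ' ': 30}
cost = weighted_path_length(huffman_heap(weights), weights)
print(cost)   # 260 - the same optimal cost as the list-based version
```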

   C. We have applied this technique to individual characters in an alphabet.
      It could also be profitably applied to larger units - e.g. we might
      choose to have a single code for frequently occurring words (such as
      "the") or sequences of letters within words (such as "th" or "ing").
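
      Nothing in the algorithm requires the "symbols" to be single
      characters; the same construction works over any token set.  A
      hypothetical sketch (the token weights here are made up purely for
      illustration):

```python
import heapq
import itertools

# Made-up weights for word/digram tokens - purely illustrative.
weights = {'the': 40, 'ing': 25, 'th': 20, 'e': 60, ' ': 55}

tiebreak = itertools.count()   # breaks weight ties without comparing trees
heap = [(w, next(tiebreak), tok) for tok, w in weights.items()]
heapq.heapify(heap)
while len(heap) > 1:
    w1, _, t1 = heapq.heappop(heap)
    w2, _, t2 = heapq.heappop(heap)
    heapq.heappush(heap, (w1 + w2, next(tiebreak), (t1, t2)))
tree = heap[0][2]

def depths(t, d=0):
    # Code length of each token = depth of its leaf.
    if isinstance(t, str):
        return {t: d}
    out = {}
    for child in t:
        out.update(depths(child, d + 1))
    return out

print(depths(tree))   # heavier tokens ('e', ' ', 'the') get the 2-bit codes
```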

Copyright ©1998 - Russell C. Bjork