Dyalog APL - Huffman packing.

cmp ← {cmp←1} ##.packH exp                  ⍝ Huffman packing.

Huffman's  algorithm  builds  a binary tree of unique items in the input string,
using  the  items'  frequency  of occurrence as a weighting. Frequently occuring
items  are  placed near the root of the tree and those less frequent towards the
leaves. For example, using the string 'Mississippi':

    Original string:    M i s s i s s i p p i
    unique items:       M i s p
    item frequency:     1 4 4 2

    Huffman tree:       ┌──┴──┐
                        s   ┌─┴──┐
                        4 ┌─┴─┐  i
                          M   p  4
                          1   2

An item may then be picked from the tree using binary code: 0:left, 1:right. For
example, in the above tree 'M' is at (1 0 0), 'i' is at (1 1) and 's' is at (0).
The full coding for 'Mississippi' is the concatenation of codes:

     M       i     s   s   i     s   s   i     p       p       i
    (1 0 0) (1 1) (0) (0) (1 1) (0) (0) (1 1) (1 0 1) (1 0 1) (1 1)

=>  1 0 0 1 1 0 0 1 1 0 0 1 1 1 0 1 1 0 1 1 1

Choosing  the shortest codes for the most frequently occurring items, guarantees
that the total code length is optimally short.

Notice  that no separation marker is needed between successive code items in the
output "bit string" as the length of each is determined by the path it describes
from  the  root  of the tree to its leaf. On expansion, bits are taken one-at-a-
time  from  the input bit stream, to navigate through the tree. Each time a leaf
is  encountered, it is emitted to the output stream and the process resumed with
the following bit, starting from the tree's root.

Output  from  the encoding is the bit string, together with its Huffman tree. In
addition,  in  order  to allow simple arrays of _any_ rank, the subject array is
ravelled, with its original shape prefixed to the output structure.

Technical notes:

The nested array form of the Huffman tree could be included directly in the out-
put  structure,  but  there are more space-efficient representations. An obvious
one  is  to emit the vector of unique items and its corresponding frequency vec-
tor.  This  is  appealing in that the sub-function that builds the tree from the
frequency  vector  may be used for both compression and expansion, and so may be
included in the packH capsule as a common local function.

Using this method, the output would look like this:

        packH'Mississippi'
    ┌──┬────┬───────┬─────────────────────────────────────────┐
    │11│Misp│1 4 4 2│1 0 0 1 1 0 0 1 1 0 0 1 1 1 0 1 1 0 1 1 1│
    └──┴────┴───────┴─────────────────────────────────────────┘

However, for an array where any item occurs more than 127 times, elements of the
frequency vector will occupy 16 bits apiece (and 32 bits if any item occurs more
than  32,767  times).  A more frugal method is to output the tree's [leaves] and
[depths]  vectors.  For  all  pratical purposes, items of the depths vector will
consume only 8 bits each.

        packH'Mississippi'
    ┌──┬────┬───────┬─────────────────────────────────────────┐
    │11│sMpi│1 3 3 2│1 0 0 1 1 0 0 1 1 0 0 1 1 1 0 1 1 0 1 1 1│
    └──┴────┴───────┴─────────────────────────────────────────┘

A  Huffman  tree has the property that each node has either 0 or 2 sub-trees. It
is  this  property that allows us to reconstruct the tree unambiguously from the
depths  of  its  leaves.  For  example, there are only four such distinct 4-leaf
trees, and they have distinct leaf-depth vectors.

    4-leaf trees:       ┌─∘┐         ┌─∘──┐       ┌─∘─┐        ┌─∘─┐
                      ┌─∘┐ D        ┌∘─┐  D       A  ┌∘─┐      A ┌─∘┐
                     ┌∘┐ C          A ┌∘┐           ┌∘┐ D        B ┌∘┐
                     A B              B C           B C            C D
    leaf depths:     3 3 2 1        2 3 3 1       1 3 3 2      1 2 3 3

In contrast, leaf-depth can not distinguish more general binary trees:

    2-leaf trees:     ┌∘┐          ┌∘─┐
                     ┌∘ B          ∘┐ B
                     A              A
    leaf depths:     2  1           2 1

As  a  larger  example, look at the following (approximate (*)) Huffman tree for
these  notes, with character '·' substituted for blanks and newlines. Notice how
the  most  frequently  occurring  character  '·'  is allocated code (0) and that
groups  of related characters such as '()', '{}', '[]', '├ ┼ ┤', '┌ ┐' and '└ ┘'
tend to appear at the same depth in the tree. Notice also, how the very presence
of  this  tree  diagram in the notes has forced the '─' character up to position
(1 0 0) in itself.

┌┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
·  ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────────┐
  ┌┴───────────────────────────────────────────────────────────────────────────────────────┐                                                                     ┌─────────┴─────────────────────────┐
  ─                      ┌─────────────────────────────────────────────────────────────────┴───────────────────┐                                   ┌─────────────┴───┐                             ┌─┴───────────────────────────────┐
     ┌───────────────────┴─┐                                                                     ┌─────────────┴┐                          ┌───────┴─┐             ┌─┴───┐   ┌─────────────────────┴┐    ┌───────────────────────────┴─┐
    ┌┴───────────┐        ┌┴───────────────────┐                                             ┌───┴─┐            t    ┌─────────────────────┴─┐      ┌┴─┐          ┌┴┐  ┌─┴┐ ┌┴───────────────────┐  e  ┌─┴─┐                          ┌┴┐
    c          ┌─┴─┐      h  ┌─────────────────┴───────────────────────┐                    ┌┴─┐  ┌┴─────────┐     ┌─┴───────┐              ┌┴─┐    o ┌┴─┐        │ n ┌┴┐ r a        ┌───────────┴┐   ┌┴┐ ┌┴───────────────────┐      s i
         ┌─────┴┐ ┌┴───┐    ┌┴─────┐                                 ┌─┴─────────────────┐  1 ┌┴┐ m    ┌─────┴┐   ┌┴┐      ┌─┴───────────┐  f ┌┴─┐    u ┌┴─────┐      l ┴          ┌─┴─┐          ┐   ┌ p d  ┌─────────────────┴─┐
       ┌─┴───┐  y v  ┌─┴┐   b  ┌───┴───┐               ┌─────────────┴┐                ┌─┴┐   0 →    ┌─┴─┐    .   , -    ┌─┴┐          ┌─┴┐   ' ┌┴┐     g  ┌───┴┐              ┌───┴┐ ┌┴─┐                  ┌┴─────────┐        ┌┴─┐
      ┌┴┐  ┌─┴┐     ┌┴┐ 3     ┌┴─┐   ┌─┴───┐       ┌───┴─────┐        x        ┌───────┴┐ w         ┌┴┐ ┌┴─┐           ┌─┴┐ M      ┌───┴┐ 2     └ ┘       ┌┴─┐  ⍝             ┌┴─┐  ( ) ┌┴───┐              :        ┌─┴─────┐  ┬ ┌┴┐
      } ↑ ┌┴┐ 4     ↓ ¯       A ┌┴┐ ┌┴┐  ┌─┴─┐   ┌─┴─┐   ┌───┴───┐         ┌───┴───┐    j           ⍺ T ⊃ ┌┴┐         ┌┴┐ ∘      ┌─┴─┐  H                 > ┌┴┐               k ┌┴┐     q  ┌─┴─┐                 ┌───┴┐    ┌─┴┐   ⍵ +
          B ≡                   F C D ⌽ ┌┴┐ ┌┴┐ ┌┴┐ ┌┴┐ ┌┴─┐   ┌─┴─┐     ┌─┴─┐   ┌─┴─┐                    " ├         ┼ ┤      ┌─┴┐ ┌┴┐                     = ~                 ⍉ ←       ┌┴┐ ┌┴┐              ┌─┴─┐  <  ┌─┴┐ {
                                        N E * [ ] \ R 9 S ┌┴┐ ┌┴┐ ┌┴┐   ┌┴┐ ┌┴┐ ┌┴┐ ┌┴┐                                       ┌┴┐ O 7 ⍬                                                   ⊂ ¨ / ∇             ┌┴┐ ┌┴┐   ┌┴┐ I
                                                          # ! ⍨ W ⍋ ⊥   ⍳ ; 5 _ ⍪ ⎕ ∨ ⍕                                       × |                                                                             U 6 8 ⍴   ≠ P

(*)  The  tree  is approximate, because inserting it into the notes, changes the
notes'  Huffman  tree slightly (as does adding this comment). Several iterations
of  recalculating  and replacing the tree resulted in a loop of several distinct
trees, of which the above is one. Therefore, the above tree _does not_ represent
these notes exactly unless adding this final remark happens to cause the iterat-
ions to converge! The tree was formatted using:

    hfmt←{                                                ⍝ Format Huffman tree.
        join←{⍵⍪⍉↑↑{⍺,' ',⍵}/↓∘⍉¨⍺}                       ⍝ join sub-trees.
        arch←{⍺⍺ ⍺[⎕IO+{⍵+∨\⍵}⍺⍺'┴'=⊃↓⍵]}                 ⍝ make ┌─┴─┐ archway.
        0=≡⍵:↑,↓⍕,⍵                                       ⍝ leaf: char mat.
        lft rgt←∇¨⍵                                       ⍝ sub-trees and their,
        case←×|≡¨⍵                                        ⍝ relative depths:
        0 0≡case:lft rgt join'┌┴┐'                        ⍝ leaf-leaf,
        0 1≡case:lft rgt join'┌┴',' ─┐'⌽arch rgt          ⍝ leaf-branch,
        1 0≡case:lft rgt join'┴┐',⍨' ─┌'+arch lft         ⍝ branch-leaf,
        lft rgt join(' ─┌'+arch lft),'┴',' ─┐'⌽arch rgt   ⍝ branch-branch.
    }

-----------
compression
-----------
The  tree  is  built  by taking the two most infrequently occurring items in the
unique  list  and  replacing them with a single item: their "pair" together with
the  sum of their frequencies. This process is repeated until only a single item
remains:

        tree←{                          ⍝ Huffman tree.
            1=⍴⍵:⊃⌽⊃⍵                   ⍝ list exhausted: done
            nxt←⍵[2↑⍋⊃¨⍵]               ⍝ next two lowest frequencies,
            freqs items←↓⍉↑nxt          ⍝ and corresponding items.
            ∇(⍵~nxt),⊂(+/freqs),⊂items  ⍝ collect 2 most infrequent items.
        }↓⍉↑freq uniq                   ⍝ tree from frequency-item pairs.

 ┌→──┬───┬───┬───┐    ┌→──┬───┬──────┐    ┌→──┬──────────┐    ┌→──────────────┐
 │1 M│4 i│4 s│2 p│ => │   │   │┌→┬──┐│ => │   │┌→┬──────┐│ => │┌→─┬──────────┐│
 └+─→┴+─→┴+─→┴+─→┘    │4 i│4 s││3│Mp││    │   ││ │┌→─┬─┐││    ││  │┌→┬──────┐││
                      │   │   │└─┴─→┘│    │4 s││7││Mp│i│││    ││  ││ │┌→─┬─┐│││
                      └+─→┴+─→┴─────→┘    │   ││ │└─→┴─┘││    ││11││s││Mp│i││││
                                          │   │└─┴─────→┘│    ││  ││ │└─→┴─┘│││
                 ┌→┬──────┐               └+─→┴─────────→┘    ││  │└─┴─────→┘││
 yielding tree:  │ │┌→─┬─┐│                                   │└~─┴─────────→┘│
                 │s││Mp│i││                                   └──────────────→┘
                 │ │└─→┴─┘│
                 └─┴─────→┘

The  leaves and depths vectors are extracted from the tree by a simple recursive
traversal.

        leaves depths←↓⍉↑tree{      ⍝ tree leaves and depths.
            ⍬≡⍴⍺:⊂⍺ ⍵               ⍝ leaf: leaf and depth.
            ↑,/⍺ ∇¨1+⍵              ⍝ branch: traverse deeper sub-branches.
        }0                          ⍝ starting at depth 0 for the root.

---------
expansion
---------
The  tree  is  restored from its [depths] and [leaves] vectors with a single re-
ductive  pass  across their transpose [depth-leaf] vector. During the reduction,
the  operand function examines adjacent nodes, replacing ones of identical depth
with a new pair of depth less by 1. The reduction is "seeded" with a dummy depth
of ¯1 on the right. Using the 'Mississippi' example:

        depths leaves               ⍝ depths and leaves vectors.
    ┌───────┬────┐
    │1 3 3 2│sMpi│
    └───────┴────┘

        ↓⍉↑depths leaves            ⍝ depth-leaf vector.
    ┌───┬───┬───┬───┐
    │1 s│3 M│3 p│2 i│
    └───┴───┴───┴───┘

    tree←⍬ 0 1⊃{                    ⍝ reconstituted Huffman tree.
        (⊃⍺)≠⊃⊃⍵:(⊂⍺),⍵             ⍝ distinct adjacent depths: continue.
        (⌽{⊃⍵-1}\⌽↓⍉↑⍺(⊃⍵))∇ 1↓⍵    ⍝ identical adjacent depths: coalesce.
    }/(↓⍉↑depths leaves),¯1         ⍝ depths/leaves ++ trailing dummy entry.

The  first  line  skips over (or accumulates into the result,) adjacent items if
their depths differ:

        (⊃⍺)≠⊃⊃⍵:(⊂⍺),⍵             ⍝ distinct adjacent depths: continue.

The  second  line  calls  the operand function recursively for adjacent items of
identical depth. We can break open the line to examine its constituent parts:

        (·················)∇ 1↓⍵    ⍝ repeat with adjacent nodes coalesced.
        ·············⍺(⊃⍵)······    ⍝ pair of adjacent same-depth tree nodes.
        ··········↓⍉↑···········    ⍝ => (depth depth) (node node)
        ·⌽{····}\⌽··············    ⍝ apply to _first_ item of pair.
        ···⊃⍵-1·················    ⍝ reduced depth
        ························    ⍝ => (depth-1)(node node)

The 'Mississippi' example proceeds through the reduction as follows:

                     ┌────────────────  ⍝ adjacent depths:
                  ┌──┴───┐
    ┌───┬───┬───┬─↓─┐  ┌─↓─┐
    │ 1 │ 3 │ 3 │ 2 │  │¯1 │            ⍝ depths 2 ¯1 distinct: continue.
    │ s │ M │ p │ i │  │   │
    └───┴───┴───┴───┘  └───┘

        ┌───┬───┬───┐  ┌───┬───┐
        │ 1 │ 3 │ 3 │  │ 2 │¯1 │        ⍝ depths 3 2  distinct: continue.
        │ s │ M │ p │  │ i │   │
        └───┴───┴───┘  └───┴───┘

            ┌───┬───┐  ┌───┬───┬───┐
            │ 1 │ 3 │  │ 3 │ 2 │¯1 │    ⍝ depths 3 3 same: pair 'M' 'p'
            │ s │ M │  │ p │ i │   │
            └───┴───┘  └───┴───┴───┘

            ┌───┬───┐  ┌───┬───┐
            │ 1 │ 2 │  │ 2 │¯1 │        ⍝ depths 2 2 same: pair 'Mp' 'i'
            │ s │┌┴┐│  │ i │   │
            │   │M p│  └───┴───┘
            └───┴───┘

          ┌───┬─────┐  ┌───┐
          │ 1 │   1 │  │¯1 │            ⍝ depths 1 ¯1 distinct: continue.
          │ s │ ┌─┴┐│  │   │
          │   │┌┴┐ i│  └───┘
          │   │M p  │
          └───┴─────┘

                ┌───┐  ┌─────┬───┐
                │ 1 │  │   1 │¯1 │      ⍝ depths 1 1 same: pair 's'('Mp' 'i')
                │ s │  │ ┌─┴┐│   │
                └───┘  │┌┴┐ i│   │
                       │M p  │   │
                       └─────┴───┘

            ┌───────┐  ┌───┐
            │ 0     │  │¯1 │            ⍝ depths 0 ¯1 distinct: continue,
            │┌┴───┐ │  │   │
            │s  ┌─┴┐│  └───┘
            │  ┌┴┐ i│
            │  M p  │
            └───────┘

                       ┌───────┬───┐
                       │ 0     │¯1 │    ⍝ reduction complete.
                       │┌┴───┐ │   │
                       │s  ┌─┴┐│   │
                       │  ┌┴┐ i│   │
                       │  M p  │   │
                       └───────┴───┘

The  tree  is extracted from the final (scalar) result of the (vector) reduction
using (⍬ 0 1⊃)

            ┌───────────────────┐
            │┌→─────────────┬──┐│
            ││┌→┬──────────┐│  ││
            │││ │┌→┬──────┐││  ││      ┌→┬──────┐
            │││ ││ │┌→─┬─┐│││  ││      │ │┌→─┬─┐│
 {⍬ 0 1⊃⍵}  │││0││s││Mp│i││││¯1││  =>  │s││Mp│i││
            │││ ││ │└─→┴─┘│││  ││      │ │└─→┴─┘│
            │││ │└─┴─────→┘││  ││      └─┴─────→┘
            ││└─┴─────────→┘│  ││
            │└─────────────→┴~─┘│
            └──────────────────→┘

Having  reconstructed  the Huffman tree, the bit-stream is traversed to retrieve
the  original items. A potential performance bottleneck for the unpacking funct-
ion, is that the algorithm accesses one bit at a time, using consecutive bits to
traverse the tree. An alternative approach is to construct a dictionary that can
be  indexed by the (2∘⊥) of a number of consecutive bits at once. If the maximum
depth  of the tree is [maxd], then a dictionary of (2*maxd) entries will have an
entry  for  any  possible  value of [maxd] bits taken from the input string. For
_short_ paths through the tree (for example, to the 's' of 'Mississippi'), dict-
ionary  entries  are duplicated. Each dictionary entry contains the value of its
leaf and the number of bits required to reach it, and hence to skip in the input
stream.  The input stream is padded with [maxd] dummy bits, to prevent the final
indexing operation from failing.

        dict
    ┌───┬───┬───┬───┬───┬───┬───┬───┐
    │1 s│1 s│1 s│1 s│3 M│3 p│2 i│2 i│
    └───┴───┴───┴───┴───┴───┴───┴───┘

Prefixing bit string values for each entry:

        ⍉↑(,⍳2 2 2)dict
    ┌─────┬───┐
    │0 0 0│1 s│
    ├─────┼───┤
    │0 0 1│1 s│
    ├─────┼───┤
    │0 1 0│1 s│
    ├─────┼───┤
    │0 1 1│1 s│
    ├─────┼───┤
    │1 0 0│3 M│
    ├─────┼───┤
    │1 0 1│3 p│
    ├─────┼───┤
    │1 1 0│2 i│
    ├─────┼───┤
    │1 1 1│2 i│
    └─────┴───┘

Ref: David A. Huffmann, "A Method for the Construction of Minimum-Redundancy
Codes," Proc. IRE, vol. 40, no. 9, pp. 1098-1101; September, 1952.

See also: Data_compression morse

Back to: contents

Back to: Workspaces