cmp ← {cmp←1} ##.packH exp          ⍝ Huffman packing.

Huffman's algorithm builds a binary tree of unique items in the input string,
using the items' frequency of occurrence as a weighting. Frequently occurring
items are placed near the root of the tree and those less frequent towards the
leaves. For example, using the string 'Mississippi':

    Original string:  M i s s i s s i p p i

       unique items:  M i s p
     item frequency:  1 4 4 2

    Huffman tree:   ┌──┴──┐
                    s   ┌─┴──┐
                    4 ┌─┴─┐  i
                      M   p  4
                      1   2

An item may then be picked from the tree using binary code: 0:left, 1:right.
For example, in the above tree 'M' is at (1 0 0), 'i' is at (1 1) and 's' is
at (0). The full coding for 'Mississippi' is the concatenation of codes:

       M      i    s   s    i    s   s    i      p       p      i
    (1 0 0) (1 1) (0) (0) (1 1) (0) (0) (1 1) (1 0 1) (1 0 1) (1 1)

    => 1 0 0 1 1 0 0 1 1 0 0 1 1 1 0 1 1 0 1 1 1

Choosing the shortest codes for the most frequently occurring items guarantees
that the total code length is optimally short. Notice that no separation
marker is needed between successive code items in the output "bit string", as
the length of each is determined by the path it describes from the root of the
tree to its leaf.

On expansion, bits are taken one at a time from the input bit stream, to
navigate through the tree. Each time a leaf is encountered, it is emitted to
the output stream and the process resumed with the following bit, starting
from the tree's root.

Output from the encoding is the bit string, together with its Huffman tree.
In addition, in order to allow simple arrays of _any_ rank, the subject array
is ravelled, with its original shape prefixed to the output structure.

Technical notes:

The nested array form of the Huffman tree could be included directly in the
output structure, but there are more space-efficient representations. An
obvious one is to emit the vector of unique items and its corresponding
frequency vector.
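For readers outside APL, the construction just described can be sketched in
Python (an illustrative translation only — the notes' own code is APL, and
names such as huffman_tree, codes and decode are this sketch's inventions):

```python
import heapq
from collections import Counter
from itertools import count

def huffman_tree(s):
    """Combine the two lowest-frequency nodes until one tree remains."""
    tick = count()                       # tie-breaker: heap never compares trees
    heap = [(f, next(tick), c) for c, f in Counter(s).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # two least frequent nodes
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (t1, t2)))
    return heap[0][2]

def codes(tree, prefix=''):
    """Path from root to each leaf: 0 = left, 1 = right."""
    if isinstance(tree, str):
        return {tree: prefix or '0'}
    l, r = tree
    return {**codes(l, prefix + '0'), **codes(r, prefix + '1')}

def decode(bits, tree):
    """Walk the tree bit by bit; emit a leaf and restart at the root."""
    out, node = [], tree
    for b in bits:
        node = node[int(b)]
        if isinstance(node, str):
            out.append(node)
            node = tree
    return ''.join(out)

tree = huffman_tree('Mississippi')
cs = codes(tree)
bits = ''.join(cs[c] for c in 'Mississippi')
assert cs == {'s': '0', 'M': '100', 'p': '101', 'i': '11'}
assert len(bits) == 21                   # 4×1 + 4×2 + 2×3 + 1×3 bits
assert decode(bits, tree) == 'Mississippi'
```

The insertion-order tick keeps the heap from ever comparing two tree tuples
(which would fail) and makes tie-breaking deterministic; any valid tie-break
yields the same optimal total of 21 bits.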
This is appealing in that the sub-function that builds the tree from the
frequency vector may be used for both compression and expansion, and so may
be included in the packH capsule as a common local function. Using this
method, the output would look like this:

          packH'Mississippi'
    ┌──┬────┬───────┬─────────────────────────────────────────┐
    │11│Misp│1 4 4 2│1 0 0 1 1 0 0 1 1 0 0 1 1 1 0 1 1 0 1 1 1│
    └──┴────┴───────┴─────────────────────────────────────────┘

However, for an array where any item occurs more than 127 times, elements of
the frequency vector will occupy 16 bits apiece (and 32 bits if any item
occurs more than 32,767 times). A more frugal method is to output the tree's
[leaves] and [depths] vectors. For all practical purposes, items of the
depths vector will consume only 8 bits each.

          packH'Mississippi'
    ┌──┬────┬───────┬─────────────────────────────────────────┐
    │11│sMpi│1 3 3 2│1 0 0 1 1 0 0 1 1 0 0 1 1 1 0 1 1 0 1 1 1│
    └──┴────┴───────┴─────────────────────────────────────────┘

A Huffman tree has the property that each node has either 0 or 2 sub-trees.
It is this property that allows us to reconstruct the tree unambiguously from
the depths of its leaves. For example, there are only four such distinct
4-leaf trees, and they have distinct leaf-depth vectors.

    4-leaf trees:    ┌─∘┐      ┌─∘──┐      ┌─∘─┐       ┌─∘─┐
                    ┌─∘┐ D     ┌∘─┐  D     A ┌∘─┐      A ┌─∘┐
                   ┌∘┐ C       A ┌∘┐         ┌∘┐ D       B ┌∘┐
                   A B           B C         B C           C D

    leaf depths:   3 3 2 1    2 3 3 1     1 3 3 2     1 2 3 3

In contrast, leaf-depth can not distinguish more general binary trees:

    2-leaf trees:   ┌∘┐     ┌∘─┐
                   ┌∘ B     ∘┐  B
                   A        A

    leaf depths:   2 1      2 1

As a larger example, look at the following (approximate (*)) Huffman tree for
these notes, with character '·' substituted for blanks and newlines. Notice
how the most frequently occurring character '·' is allocated code (0) and
that groups of related characters such as '()', '{}', '[]', '├ ┼ ┤', '┌ ┐'
and '└ ┘' tend to appear at the same depth in the tree.
Notice also, how the very presence of this tree diagram in the notes has forced the '─' character up to position (1 0 0) in itself. ┌┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ · ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────────┐ ┌┴───────────────────────────────────────────────────────────────────────────────────────┐ ┌─────────┴─────────────────────────┐ ─ ┌─────────────────────────────────────────────────────────────────┴───────────────────┐ ┌─────────────┴───┐ ┌─┴───────────────────────────────┐ ┌───────────────────┴─┐ ┌─────────────┴┐ ┌───────┴─┐ ┌─┴───┐ ┌─────────────────────┴┐ ┌───────────────────────────┴─┐ ┌┴───────────┐ ┌┴───────────────────┐ ┌───┴─┐ t ┌─────────────────────┴─┐ ┌┴─┐ ┌┴┐ ┌─┴┐ ┌┴───────────────────┐ e ┌─┴─┐ ┌┴┐ c ┌─┴─┐ h ┌─────────────────┴───────────────────────┐ ┌┴─┐ ┌┴─────────┐ ┌─┴───────┐ ┌┴─┐ o ┌┴─┐ │ n ┌┴┐ r a ┌───────────┴┐ ┌┴┐ ┌┴───────────────────┐ s i ┌─────┴┐ ┌┴───┐ ┌┴─────┐ ┌─┴─────────────────┐ 1 ┌┴┐ m ┌─────┴┐ ┌┴┐ ┌─┴───────────┐ f ┌┴─┐ u ┌┴─────┐ l ┴ ┌─┴─┐ ┐ ┌ p d ┌─────────────────┴─┐ ┌─┴───┐ y v ┌─┴┐ b ┌───┴───┐ ┌─────────────┴┐ ┌─┴┐ 0 → ┌─┴─┐ . , - ┌─┴┐ ┌─┴┐ ' ┌┴┐ g ┌───┴┐ ┌───┴┐ ┌┴─┐ ┌┴─────────┐ ┌┴─┐ ┌┴┐ ┌─┴┐ ┌┴┐ 3 ┌┴─┐ ┌─┴───┐ ┌───┴─────┐ x ┌───────┴┐ w ┌┴┐ ┌┴─┐ ┌─┴┐ M ┌───┴┐ 2 └ ┘ ┌┴─┐ ⍝ ┌┴─┐ ( ) ┌┴───┐ : ┌─┴─────┐ ┬ ┌┴┐ } ↑ ┌┴┐ 4 ↓ ¯ A ┌┴┐ ┌┴┐ ┌─┴─┐ ┌─┴─┐ ┌───┴───┐ ┌───┴───┐ j ⍺ T ⊃ ┌┴┐ ┌┴┐ ∘ ┌─┴─┐ H > ┌┴┐ k ┌┴┐ q ┌─┴─┐ ┌───┴┐ ┌─┴┐ ⍵ + B ≡ F C D ⌽ ┌┴┐ ┌┴┐ ┌┴┐ ┌┴┐ ┌┴─┐ ┌─┴─┐ ┌─┴─┐ ┌─┴─┐ " ├ ┼ ┤ ┌─┴┐ ┌┴┐ = ~ ⍉ ← ┌┴┐ ┌┴┐ ┌─┴─┐ < ┌─┴┐ { N E * [ ] \ R 9 S ┌┴┐ ┌┴┐ ┌┴┐ ┌┴┐ ┌┴┐ ┌┴┐ ┌┴┐ ┌┴┐ O 7 ⍬ ⊂ ¨ / ∇ ┌┴┐ ┌┴┐ ┌┴┐ I # ! ⍨ W ⍋ ⊥ ⍳ ; 5 _ ⍪ ⎕ ∨ ⍕ × | U 6 8 ⍴ ≠ P (*) The tree is approximate, because inserting it into the notes, changes the notes' Huffman tree slightly (as does adding this comment). 
Several iterations of recalculating and replacing the tree resulted in a loop
of several distinct trees, of which the above is one. Therefore, the above
tree _does not_ represent these notes exactly unless adding this final remark
happens to cause the iterations to converge!

The tree was formatted using:

    hfmt←{                                      ⍝ Format Huffman tree.
        join←{⍵⍪⍉↑↑{⍺,' ',⍵}/↓∘⍉¨⍺}             ⍝ join sub-trees.
        arch←{⍺⍺ ⍺[⎕IO+{⍵+∨\⍵}⍺⍺'┴'=⊃↓⍵]}       ⍝ make ┌─┴─┐ archway.
        0=≡⍵:↑,↓⍕,⍵                             ⍝ leaf: char mat.
        lft rgt←∇¨⍵                             ⍝ sub-trees and their,
        case←×|≡¨⍵                              ⍝ relative depths:
        0 0≡case:lft rgt join'┌┴┐'              ⍝ leaf-leaf,
        0 1≡case:lft rgt join'┌┴',' ─┐'⌽arch rgt        ⍝ leaf-branch,
        1 0≡case:lft rgt join'┴┐',⍨' ─┌'+arch lft       ⍝ branch-leaf,
        lft rgt join(' ─┌'+arch lft),'┴',' ─┐'⌽arch rgt ⍝ branch-branch.
    }

-----------
compression
-----------
The tree is built by taking the two least frequently occurring items in the
unique list and replacing them with a single item: their "pair", together
with the sum of their frequencies. This process is repeated until only a
single item remains:

    tree←{                          ⍝ Huffman tree.
        1=⍴⍵:⊃⌽⊃⍵                   ⍝ list exhausted: done
        nxt←⍵[2↑⍋⊃¨⍵]               ⍝ next two lowest frequencies,
        freqs items←↓⍉↑nxt          ⍝ and corresponding items.
        ∇(⍵~nxt),⊂(+/freqs),⊂items  ⍝ collect 2 most infrequent items.
    }↓⍉↑freq uniq                   ⍝ tree from frequency-item pairs.

    ┌→──┬───┬───┬───┐    ┌→──┬───┬──────┐    ┌→──┬──────────┐    ┌→──────────────┐
    │1 M│4 i│4 s│2 p│ => │   │   │┌→┬──┐│ => │   │┌→┬──────┐│ => │┌→─┬──────────┐│
    └+─→┴+─→┴+─→┴+─→┘    │4 i│4 s││3│Mp││    │   ││ │┌→─┬─┐││    ││  │┌→┬──────┐││
                         │   │   │└─┴─→┘│    │4 s││7││Mp│i│││    ││  ││ │┌→─┬─┐│││
                         └+─→┴+─→┴─────→┘    │   ││ │└─→┴─┘││    ││11││s││Mp│i││││
                                             │   │└─┴─────→┘│    ││  ││ │└─→┴─┘│││
                                             └+─→┴─────────→┘    ││  │└─┴─────→┘││
                                                                 │└~─┴─────────→┘│
    yielding tree:   ┌→┬──────┐                                  └──────────────→┘
                     │ │┌→─┬─┐│
                     │s││Mp│i││
                     │ │└─→┴─┘│
                     └─┴─────→┘

The leaves and depths vectors are extracted from the tree by a simple
recursive traversal.

    leaves depths←↓⍉↑tree{          ⍝ tree leaves and depths.
        ⍬≡⍴⍺:⊂⍺ ⍵                   ⍝ leaf: leaf and depth.
        ↑,/⍺ ∇¨1+⍵                  ⍝ branch: traverse deeper sub-branches.
    }0                              ⍝ starting at depth 0 for the root.

---------
expansion
---------
The tree is restored from its [depths] and [leaves] vectors with a single
reductive pass across their transpose: the [depth-leaf] vector. During the
reduction, the operand function examines adjacent nodes, replacing ones of
identical depth with a new pair of depth less by 1. The reduction is "seeded"
with a dummy depth of ¯1 on the right. Using the 'Mississippi' example:

          depths leaves             ⍝ depths and leaves vectors.
    ┌───────┬────┐
    │1 3 3 2│sMpi│
    └───────┴────┘

          ↓⍉↑depths leaves          ⍝ depth-leaf vector.
    ┌───┬───┬───┬───┐
    │1 s│3 M│3 p│2 i│
    └───┴───┴───┴───┘

    tree←⍬ 0 1⊃{                    ⍝ reconstituted Huffman tree.
        (⊃⍺)≠⊃⊃⍵:(⊂⍺),⍵             ⍝ distinct adjacent depths: continue.
        (⌽{⊃⍵-1}\⌽↓⍉↑⍺(⊃⍵))∇ 1↓⍵    ⍝ identical adjacent depths: coalesce.
    }/(↓⍉↑depths leaves),¯1         ⍝ depths/leaves ++ trailing dummy entry.

The first line skips over (or accumulates into the result) adjacent items if
their depths differ:

        (⊃⍺)≠⊃⊃⍵:(⊂⍺),⍵             ⍝ distinct adjacent depths: continue.

The second line calls the operand function recursively for adjacent items of
identical depth. We can break open the line to examine its constituent parts:

        (·················)∇ 1↓⍵    ⍝ repeat with adjacent nodes coalesced.
        ·············⍺(⊃⍵)······    ⍝ pair of adjacent same-depth tree nodes.
        ··········↓⍉↑···········    ⍝ => (depth depth) (node node)
        ·⌽{····}\⌽··············    ⍝ apply to _first_ item of pair.
        ···⊃⍵-1·················    ⍝ reduced depth
        ························    ⍝ => (depth-1)(node node)

The 'Mississippi' example proceeds through the reduction as follows:

                                    ⍝ adjacent depths marked ↓:
    ┌───┬───┬───┬─↓─┐   ┌─↓─┐
    │ 1 │ 3 │ 3 │ 2 │   │¯1 │       ⍝ depths 2 ¯1 distinct: continue.
    │ s │ M │ p │ i │   │   │
    └───┴───┴───┴───┘   └───┘

    ┌───┬───┬───┐   ┌───┬───┐
    │ 1 │ 3 │ 3 │   │ 2 │¯1 │       ⍝ depths 3 2 distinct: continue.
    │ s │ M │ p │   │ i │   │
    └───┴───┴───┘   └───┴───┘

    ┌───┬───┐   ┌───┬───┬───┐
    │ 1 │ 3 │   │ 3 │ 2 │¯1 │       ⍝ depths 3 3 same: pair 'M' 'p'
    │ s │ M │   │ p │ i │   │
    └───┴───┘   └───┴───┴───┘

    ┌───┬───┐   ┌───┬───┐
    │ 1 │ 2 │   │ 2 │¯1 │           ⍝ depths 2 2 same: pair 'Mp' 'i'
    │ s │┌┴┐│   │ i │   │
    │   │M p│   └───┴───┘
    └───┴───┘

    ┌───┬─────┐   ┌───┐
    │ 1 │  1  │   │¯1 │             ⍝ depths 1 ¯1 distinct: continue.
    │ s │ ┌─┴┐│   │   │
    │   │┌┴┐ i│   └───┘
    │   │M p  │
    └───┴─────┘

    ┌───┐   ┌─────┬───┐
    │ 1 │   │  1  │¯1 │             ⍝ depths 1 1 same: pair 's'('Mp' 'i')
    │ s │   │ ┌─┴┐│   │
    └───┘   │┌┴┐ i│   │
            │M p  │   │
            └─────┴───┘

    ┌───────┐   ┌───┐
    │   0   │   │¯1 │               ⍝ depths 0 ¯1 distinct: continue,
    │┌┴───┐ │   │   │
    │s ┌─┴┐│   └───┘
    │ ┌┴┐ i│
    │ M p  │
    └───────┘

    ┌───────┬───┐
    │   0   │¯1 │                   ⍝ reduction complete.
    │┌┴───┐ │   │
    │s ┌─┴┐│   │
    │ ┌┴┐ i│   │
    │ M p  │   │
    └───────┴───┘

The tree is extracted from the final (scalar) result of the (vector)
reduction using (⍬ 0 1⊃):

    ┌───────────────────┐
    │┌→─────────────┬──┐│
    ││┌→┬──────────┐│  ││
    │││ │┌→┬──────┐││  ││                 ┌→┬──────┐
    │││ ││ │┌→─┬─┐│││  ││                 │ │┌→─┬─┐│
    │││0││s││Mp│i││││¯1││   {⍬ 0 1⊃⍵} =>  │s││Mp│i││
    │││ ││ │└─→┴─┘│││  ││                 │ │└─→┴─┘│
    │││ │└─┴─────→┘││  ││                 └─┴─────→┘
    ││└─┴─────────→┘│  ││
    │└─────────────→┴~─┘│
    └──────────────────→┘

Having reconstructed the Huffman tree, the bit-stream is traversed to
retrieve the original items. A potential performance bottleneck for the
unpacking function is that the algorithm accesses one bit at a time, using
consecutive bits to traverse the tree. An alternative approach is to
construct a dictionary that can be indexed by the (2∘⊥) of a number of
consecutive bits at once. If the maximum depth of the tree is [maxd], then a
dictionary of (2*maxd) entries will have an entry for any possible value of
[maxd] bits taken from the input string. For _short_ paths through the tree
(for example, to the 's' of 'Mississippi'), dictionary entries are
duplicated. Each dictionary entry contains the value of its leaf and the
number of bits required to reach it, and hence to skip in the input stream.
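The dictionary scheme can likewise be sketched in Python (again an
illustrative translation; make_dict and decode_fast are this sketch's own
names). With maxd=3 for the 'Mississippi' tree, the table has 2*3, i.e. 8,
entries, and the short code for 's' occupies four duplicate slots:

```python
def make_dict(codes, maxd):
    """Table of 2^maxd entries, indexed by the next maxd input bits at once.
    Each entry is (bits consumed, leaf); short codes fill duplicate slots."""
    table = [None] * (1 << maxd)
    for leaf, code in codes.items():
        pad = maxd - len(code)
        base = int(code, 2) << pad          # code left-justified in maxd bits
        for i in range(1 << pad):           # duplicate over all pad patterns
            table[base + i] = (len(code), leaf)
    return table

def decode_fast(bits, table, maxd, n):
    """Decode n items, consuming several bits per table lookup."""
    bits += '0' * maxd                      # dummy padding: final index succeeds
    out, pos = [], 0
    for _ in range(n):
        skip, leaf = table[int(bits[pos:pos + maxd], 2)]
        out.append(leaf)
        pos += skip                         # skip just the bits of this code
    return ''.join(out)

codes = {'s': '0', 'M': '100', 'p': '101', 'i': '11'}   # tree from the notes
table = make_dict(codes, 3)
assert [leaf for _, leaf in table] == ['s','s','s','s','M','p','i','i']
assert decode_fast('100110011001110110111', table, 3, 11) == 'Mississippi'
```

The table's leaf column matches the dict display below: four duplicated 's'
entries for the 1-bit code, then M, p, and two 'i' entries.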
The input stream is padded with [maxd] dummy bits, to prevent the final
indexing operation from failing.

          dict
    ┌───┬───┬───┬───┬───┬───┬───┬───┐
    │1 s│1 s│1 s│1 s│3 M│3 p│2 i│2 i│
    └───┴───┴───┴───┴───┴───┴───┴───┘

Prefixing bit string values for each entry:

          ⍉↑(,⍳2 2 2)dict
    ┌─────┬───┐
    │0 0 0│1 s│
    ├─────┼───┤
    │0 0 1│1 s│
    ├─────┼───┤
    │0 1 0│1 s│
    ├─────┼───┤
    │0 1 1│1 s│
    ├─────┼───┤
    │1 0 0│3 M│
    ├─────┼───┤
    │1 0 1│3 p│
    ├─────┼───┤
    │1 1 0│2 i│
    ├─────┼───┤
    │1 1 1│2 i│
    └─────┴───┘

Ref: David A. Huffman, "A Method for the Construction of Minimum-Redundancy
     Codes," Proc. IRE, vol. 40, no. 9, pp. 1098-1101, September 1952.

See also: Data_compression morse

Back to: contents

Back to: Workspaces