cmp ← {cmp←12} ##.packZ exp ⍝ Abraham Lempel, Jacob Ziv, Terry Welch.
The LZW algorithm relies on recurrence of sequences (strings) of items in its
input vector. The items are typically characters, but may be drawn from any
"alphabet" of distinct values, including for example the boolean alphabet: 0 1,
or the unique items of a nested vector (see the example using →tokens←, below).
The algorithm maintains a "dictionary", which maps strings in the input vector
to their associated output codes. The dictionary initially contains mappings
for all possible strings of length one. Input is taken one item at a time to
find the longest initial string already present in the dictionary. The code for
that string is output, and the string, extended with the following item (i) from
the input vector, is added to the dictionary with the next unused code (obtained
by incrementing a counter). The process then repeats, with the next input string
beginning at item (i).
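A minimal sketch of this coding loop as a dfn may make it concrete. The name
lzwc, its recursive formulation and its left argument (a direct cap on the
number of dictionary entries, discussed in the next paragraph) are assumptions
for illustration only; this is not the packZ implementation, and it handles
only simple (non-nested) input vectors:
lzwc←{⎕IO←0                                 ⍝ sketch of the LZW coding loop.
    alph←∪⍵                                 ⍝ alphabet: unique items of the input.
    ⍺{                                      ⍝ ⍵: (dict codes current-string rest-of-input).
        dict codes w in←⍵
        0=≢in:codes,dict⍳⊂w                 ⍝ input exhausted: flush code of final string.
        wc←w,⊃in                            ⍝ current string extended with next item.
        (⊂wc)∊dict:⍺ ∇ dict codes wc(1↓in)  ⍝ extended string already known: keep growing.
        dict,←(⍺>≢dict)⍴⊂wc                 ⍝ new string takes the next unused code, if room.
        ⍺ ∇ dict(codes,dict⍳⊂w)(,⊃in)(1↓in) ⍝ emit code of w; next string starts here.
    }(,¨alph)⍬(,⊃⍵)(1↓⍵)                    ⍝ dict starts with all strings of length one.
}
For 'mississippi', 12 lzwc'mississippi' yields the code vector 0 1 2 2 5 7 3 3 1
and grows the same dictionary shown in the packZ examples below.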
The number of bits in an output code, and hence the maximum number of entries in
the dictionary, is usually fixed, and once this limit is reached, no more entries
are added. A code width of 12 bits has been recommended. The absolute value of
the left argument [cmp] specifies this limit. If [cmp] is negative, the dictionary,
rather than the compression structure, is returned. See example below.
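For instance, the default 12-bit width caps the dictionary at 2*12 entries:
2*12 ⍝ number of distinct 12-bit codes.
4096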
Technical notes:
Output from the function is a 3-vector: (shape codes alphabet). [codes] is a
vector of non-negative indices into the dictionary, encoded as a w-row boolean
matrix, where w is the number of bits (default maximum 12) allocated to each
value. The integer code vector is easily reconstituted by applying 2⊥ to this
matrix. [alphabet] is the set of unique items in the input vector.
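For example, the code vector produced for 'mississippi' (decoded in the examples
below) needs only 3 bits per value; ⊤ and 2⊥ show the encoding relationship,
though packZ builds the boolean matrix itself:
(3⍴2)⊤0 1 2 2 5 7 3 3 1 ⍝ codes packed as a 3-row boolean matrix.
0 0 0 0 1 1 0 0 0
0 0 1 1 0 1 1 1 0
0 1 0 0 1 1 1 1 1
2⊥(3⍴2)⊤0 1 2 2 5 7 3 3 1 ⍝ 2⊥ reconstitutes the integer codes.
0 1 2 2 5 7 3 3 1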
Each new "string" inserted into the dictionary is identical to an existing one,
but with one extra "character" appended. In light of this, the dictionary may be
represented as a tree. For an alphabet of ⍺ unique items, each node in the tree
has ⍺ sub-nodes: an "⍺-ary tree". For example, a boolean vector (with alphabet:
0, 1) will produce a binary tree and a string that uses all values from ⎕AV will
give rise to a 256-ary tree. Extracting the code associated with a string from
the dictionary amounts to using its characters, one at a time, to navigate down
through the tree's nodes. Encountering a null node signals that the string is
not currently in the dictionary, which can then be extended by adding a new node
with a codeword representing the new string. This type of tree is known in the
trade as a "trie" (Aho et al 1983).
Historical note:
The 20-year patent on this beautiful algorithm finally expired world-wide on 7th
July, 2004. Prior to this date, incorporating any coding of LZW in commercial
software incurred a royalty.
(muse:
This raises the question of whether algorithms are invented or discovered.
The following thought-experiment suggests they are _discovered_:
If it turns out that life on Mars has advanced sufficiently to produce math-
ematicians, their set of prime numbers must coincide with ours (even if
their anatomy has led them to a counting base other than decimal).
Neither civilisation invented the sequence of prime numbers; it was already
"out there", when we became aware of, or discovered it.
Same goes for algorithms. If the Martians are aware of prime numbers, in all
probability, they will also be aware of the sieving algorithm that extracts
them (perhaps they call it "The Sieve of Zork"). Similarly, it is unlikely
that they are unaware of the subtract-&-swap method for finding highest
common factors, which was discovered on our planet by Euclid. And so on.
Thus, algorithms exist independently of the cultures that discover them (an
idea proposed by the Greek philosopher and mathematician Plato ¯427-¯347).
Viewed in this light, useful algorithms could be seen as diamonds in the
otherwise dark coal-face that is the set of all expression transformation
sequences.
(
However, it's hard (for JMS) to see why the same argument doesn't apply
to artefacts generally accepted to be inventions: the baby sling, the
wheel/axle, the stirrup, the beer-can ring-pull, and so forth. Perhaps
it's a question of the degree of complexity of the construction or per-
haps it depends upon whether the components are discrete or continuous.
)
Of course, both inventions and discoveries may be patented.
)
Examples:
packZ'mississippi' ⍝ compress char vector.
11 0 0 0 0 1 1 0 0 0 misp
0 0 1 1 0 1 1 1 0
0 1 0 0 1 1 1 1 1
{⍵≡0 packZ packZ ⍵}'mississippi' ⍝ expand recovers original.
1
2⊥2⊃packZ'mississippi' ⍝ ... bool matrix decoded.
0 1 2 2 5 7 3 3 1
¯8 packZ'mississippi' ⍝ final compression dictionary.
┌─┬─┬─┬─┬──┬──┬──┬──┬───┬───┬──┬──┐
│m│i│s│p│mi│is│ss│si│iss│sip│pp│pi│
└─┴─┴─┴─┴──┴──┴──┴──┴───┴───┴──┴──┘
segs←{(¯12 packZ ⍵)[⎕io+2⊥2⊃packZ ⍵]} ⍝ dictionary entry per code word.
segs'mississippi' ⍝ segments for string.
┌─┬─┬─┬─┬──┬──┬─┬─┬─┐
│m│i│s│s│is│si│p│p│i│
└─┴─┴─┴─┴──┴──┴─┴─┴─┘
{⍵≡↑,/segs ⍵}'mississippi' ⍝ enlist of segs recovers original.
1
segs 40⍴'a' ⍝ segs for highly repetitive string.
┌─┬──┬───┬────┬─────┬──────┬───────┬────────┬────┐
│a│aa│aaa│aaaa│aaaaa│aaaaaa│aaaaaaa│aaaaaaaa│aaaa│
└─┴──┴───┴────┴─────┴──────┴───────┴────────┴────┘
segs 40⍴'abcd' ⍝ segs for less repetitive string.
┌─┬─┬─┬─┬──┬──┬───┬──┬──┬───┬───┬───┬────┬─────┬────┬───┐
│a│b│c│d│ab│cd│abc│da│bc│dab│cda│bcd│abcd│abcda│bcda│bcd│
└─┴─┴─┴─┴──┴──┴───┴──┴──┴───┴───┴───┴────┴─────┴────┴───┘
packZ'hello world' ⍝ compressed char vector.
┌──┬─────────────────────┬────────┐
│11│0 0 0 0 0 1 1 0 1 0 1│helo wrd│
│ │0 0 1 1 1 0 0 1 1 1 1│ │
│ │0 1 0 0 1 0 1 1 0 0 1│ │
└──┴─────────────────────┴────────┘
packZ 11 ⎕dr'hello world' ⍝ compressed bool vector.
┌──┬───────────────────────────────────────────────────────────────┬───┐
│88│0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0│0 1│
│ │0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 1 1 0 1 1 0 0 1 0 0 1 1 1 0│ │
│ │0 0 0 0 0 1 0 1 1 1 0 0 1 1 1 1 1 0 0 0 1 0 0 0 1 0 1 0 0 1 1 0│ │
│ │0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 1 1 1 0 0 1 0 1 1 1 0 0 0 0 1 0 0│ │
│ │0 1 1 0 0 0 0 0 1 0 1 0 1 0 0 0 1 1 1 1 1 0 1 0 1 0 0 1 0 0 0 0│ │
└──┴───────────────────────────────────────────────────────────────┴───┘
packZ 83 ⎕DR'hello world' ⍝ compressed integer vector.
┌──┬─────────────────────┬──────────────────────────────┐
│11│0 0 0 0 0 1 1 0 1 0 1│104 101 108 111 32 119 114 100│
│ │0 0 1 1 1 0 0 1 1 1 1│ │
│ │0 1 0 0 1 0 1 1 0 0 1│ │
└──┴─────────────────────┴──────────────────────────────┘
size←{⍬⍴⎕size'⍵'} ⍝ function for size in bytes.
size notes.packZ ⍝ size of notes.
7200
size packZ notes.packZ ⍝ size of compressed notes.
3948
size 8 packZ notes.packZ ⍝ ... with 8-bit code words.
5388
size 100 100⍴⍳7 ⍝ size of numeric matrix.
10020
size packZ 100 100⍴⍳7 ⍝ compressed size.
512
mat ⍝ nested matrix.
┌────────┬─────┬────────┐
│Scissors│Stone│Paper │
├────────┼─────┼────────┤
│Stone │Paper│Scissors│
└────────┴─────┴────────┘
size mat ⍝ matrix size.
188
size packZ mat ⍝ compressed size.
172
packZ mat ⍝ compressed matrix.
┌───┬─────────┬──────────────────────┐
│2 3│0 0 0 1 0│┌────────┬─────┬─────┐│
│ │0 0 1 0 0││Scissors│Stone│Paper││
│ │0 1 0 0 0│└────────┴─────┴─────┘│
└───┴─────────┴──────────────────────┘
tokens ⊃⎕nr'packZ' ⍝ tokens in first line of packZ.
┌─────┬─┬─┬───┬─┬───┬─┬─┬─┬─┬──────────────────────────┬───────────────────────────────┐
│packZ│←│{│⎕IO│ │⎕ML│←│0│ │1│ │⍝ Lempel-Ziv-Welch compression.│
└─────┴─┴─┴───┴─┴───┴─┴─┴─┴─┴──────────────────────────┴───────────────────────────────┘
toks←↑,/tokens¨⎕nr'packZ' ⍝ nested vector of packZ's tokens.
size toks ⍝ size of packZ's tokens.
11508
size packZ toks ⍝ size of compressed tokens.
4672
{⍵≡0 packZ packZ ⍵}toks ⍝ nested pack: full circle.
1
{⍵≡0 packZ packZ ⍵}'hello world' ⍝ char pack: full circle.
1
{⍵≡0 packZ packZ ⍵}11 ⎕dr'hello world' ⍝ bool pack: full circle.
1
See also: Data_compression tokens