cmp ← {cmp←12} ##.packZ exp ⍝ Abraham Lempel, Jacob Ziv, Terry Welch.
The LZW algorithm relies on recurrence of sequences (strings) of items in its
input vector. The items are typically characters, but may be drawn from any
"alphabet" of distinct values, including for example the boolean alphabet: 0 1,
or the unique items of a nested vector (see the example using →tokens←, below).
The algorithm maintains a "dictionary", which maps strings in the input vector
to their associated output codes. The dictionary initially contains mappings
for all possible strings of length one. Input is taken one item at a time to
find the longest initial string already present in the dictionary. The code for
that string is output, and the string, extended with the following item (i) from
the input vector, is added to the dictionary with the next unused code (obtained
by incrementing a counter). The process then repeats, with the next input string
beginning at item (i).
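A minimal sketch of this coding loop as a dfn may make it concrete. The name
lzwc, its recursive formulation and its left argument (a direct cap on the
number of dictionary entries, discussed in the next paragraph) are assumptions
for illustration only; this is not the packZ implementation, and it handles
only simple (non-nested) input vectors:
lzwc←{⎕IO←0                                 ⍝ sketch of the LZW coding loop.
    alph←∪⍵                                 ⍝ alphabet: unique items of the input.
    ⍺{                                      ⍝ ⍵: (dict codes current-string rest-of-input).
        dict codes w in←⍵
        0=≢in:codes,dict⍳⊂w                 ⍝ input exhausted: flush code of final string.
        wc←w,⊃in                            ⍝ current string extended with next item.
        (⊂wc)∊dict:⍺ ∇ dict codes wc(1↓in)  ⍝ extended string already known: keep growing.
        dict,←(⍺>≢dict)⍴⊂wc                 ⍝ new string takes the next unused code, if room.
        ⍺ ∇ dict(codes,dict⍳⊂w)(,⊃in)(1↓in) ⍝ emit code of w; next string starts here.
    }(,¨alph)⍬(,⊃⍵)(1↓⍵)                    ⍝ dict starts with all strings of length one.
}
For 'mississippi', 12 lzwc'mississippi' yields the code vector 0 1 2 2 5 7 3 3 1
and grows the same dictionary shown in the packZ examples below.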
The number of bits in an output code, and hence the maximum number of entries in
the dictionary, is usually fixed, and once this limit is reached, no more entries
are added. A code width of 12 bits has been recommended. The absolute value of
the left argument [cmp] specifies this limit. If [cmp] is negative, the dictionary,
rather than the compression structure, is returned. See example below.
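For instance, the default 12-bit width caps the dictionary at 2*12 entries:
2*12 ⍝ number of distinct 12-bit codes.
4096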
Technical notes:
Output from the function is a 3-vector: (shape codes alphabet). [codes] is a
vector of non-negative indices into the dictionary, encoded as a w-row boolean
matrix, where w is the number of bits (default maximum 12) allocated to each
value. The integer code vector is easily reconstituted by applying 2⊥ to this
matrix. [alphabet] is the set of unique items in the input vector.
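For example, the code vector produced for 'mississippi' (decoded in the examples
below) needs only 3 bits per value; ⊤ and 2⊥ show the encoding relationship,
though packZ builds the boolean matrix itself:
(3⍴2)⊤0 1 2 2 5 7 3 3 1 ⍝ codes packed as a 3-row boolean matrix.
0 0 0 0 1 1 0 0 0
0 0 1 1 0 1 1 1 0
0 1 0 0 1 1 1 1 1
2⊥(3⍴2)⊤0 1 2 2 5 7 3 3 1 ⍝ 2⊥ reconstitutes the integer codes.
0 1 2 2 5 7 3 3 1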
Each new "string" inserted into the dictionary is identical to an existing one,
but with one extra "character" appended. In light of this, the dictionary may be
represented as a tree. For an alphabet of ⍺ unique items, each node in the tree
has ⍺ sub-nodes: an "⍺-ary tree". For example, a boolean vector (with alphabet:
0, 1) will produce a binary tree and a string that uses all values from ⎕AV will
give rise to a 256-ary tree. Extracting the code associated with a string from
the dictionary amounts to using its characters, one at a time, to navigate down
through the tree's nodes. Encountering a null node signals that the string is
not currently in the dictionary, which can then be extended by adding a new node
with a codeword representing the new string. This type of tree is known in the
trade as a "trie" (Aho et al 1983).
Historical note:
The 20-year patent on this beautiful algorithm finally expired world-wide on 7th
July, 2004. Prior to this date, incorporating any coding of LZW in commercial
software incurred a royalty.
(muse:
This raises the question of whether algorithms are invented or discovered.
The following thought-experiment suggests they are _discovered_:
If it turns out that life on Mars has advanced sufficiently to produce math-
ematicians, their set of prime numbers must coincide with ours (even if
their anatomy has led them to a counting base other than decimal).
Neither civilisation invented the sequence of prime numbers; it was already
"out there", when we became aware of, or discovered it.
Same goes for algorithms. If the Martians are aware of prime numbers, in all
probability, they will also be aware of the sieving algorithm that extracts
them (perhaps they call it "The Sieve of Zork"). Similarly, it is unlikely
that they are unaware of the subtract-&-swap method for finding highest
common factors, which was discovered on our planet by Euclid. And so on.
Thus, algorithms exist independently of the cultures that discover them (an
idea proposed by the Greek philosopher and mathematician Plato ¯427-¯347).
Viewed in this light, useful algorithms could be seen as diamonds in the
otherwise dark coal-face that is the set of all expression transformation
sequences.
(
However, it's hard (for JMS) to see why the same argument doesn't apply
to artefacts generally accepted to be inventions: the baby sling, the
wheel/axle, the stirrup, the beer-can ring-pull, and so forth. Perhaps
it's a question of the degree of complexity of the construction or per-
haps it depends upon whether the components are discrete or continuous.
)
Of course, both inventions and discoveries may be patented.
)
Examples:
packZ'mississippi' ⍝ compress char vector.
11 0 0 0 0 1 1 0 0 0 misp
0 0 1 1 0 1 1 1 0
0 1 0 0 1 1 1 1 1
{⍵≡0 packZ packZ ⍵}'mississippi' ⍝ expand recovers original.
1
2⊥2⊃packZ'mississippi' ⍝ ... bool matrix decoded.
0 1 2 2 5 7 3 3 1
¯8 packZ'mississippi' ⍝ final compression dictionary.
┌─┬─┬─┬─┬──┬──┬──┬──┬───┬───┬──┬──┐
│m│i│s│p│mi│is│ss│si│iss│sip│pp│pi│
└─┴─┴─┴─┴──┴──┴──┴──┴───┴───┴──┴──┘
segs←{(¯12 packZ ⍵)[⎕io+2⊥2⊃packZ ⍵]} ⍝ dictionary entry per code word.
segs'mississippi' ⍝ segments for string.
┌─┬─┬─┬─┬──┬──┬─┬─┬─┐
│m│i│s│s│is│si│p│p│i│
└─┴─┴─┴─┴──┴──┴─┴─┴─┘
{⍵≡↑,/segs ⍵}'mississippi' ⍝ enlist of segs recovers original.
1
segs 40⍴'a' ⍝ segs for highly repetitive string.
┌─┬──┬───┬────┬─────┬──────┬───────┬────────┬────┐
│a│aa│aaa│aaaa│aaaaa│aaaaaa│aaaaaaa│aaaaaaaa│aaaa│
└─┴──┴───┴────┴─────┴──────┴───────┴────────┴────┘
segs 40⍴'abcd' ⍝ segs for less repetitive string.
┌─┬─┬─┬─┬──┬──┬───┬──┬──┬───┬───┬───┬────┬─────┬────┬───┐
│a│b│c│d│ab│cd│abc│da│bc│dab│cda│bcd│abcd│abcda│bcda│bcd│
└─┴─┴─┴─┴──┴──┴───┴──┴──┴───┴───┴───┴────┴─────┴────┴───┘
packZ'hello world' ⍝ compressed char vector.
┌──┬─────────────────────┬────────┐
│11│0 0 0 0 0 1 1 0 1 0 1│helo wrd│
│ │0 0 1 1 1 0 0 1 1 1 1│ │
│ │0 1 0 0 1 0 1 1 0 0 1│ │
└──┴─────────────────────┴────────┘
packZ 11 ⎕dr'hello world' ⍝ compressed bool vector.
┌──┬───────────────────────────────────────────────────────────────┬───┐
│88│0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0│0 1│
│ │0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 1 1 0 1 1 0 0 1 0 0 1 1 1 0│ │
│ │0 0 0 0 0 1 0 1 1 1 0 0 1 1 1 1 1 0 0 0 1 0 0 0 1 0 1 0 0 1 1 0│ │
│ │0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 1 1 1 0 0 1 0 1 1 1 0 0 0 0 1 0 0│ │
│ │0 1 1 0 0 0 0 0 1 0 1 0 1 0 0 0 1 1 1 1 1 0 1 0 1 0 0 1 0 0 0 0│ │
└──┴───────────────────────────────────────────────────────────────┴───┘
packZ 83 ⎕DR'hello world' ⍝ compressed integer vector.
┌──┬─────────────────────┬──────────────────────────────┐
│11│0 0 0 0 0 1 1 0 1 0 1│104 101 108 111 32 119 114 100│
│ │0 0 1 1 1 0 0 1 1 1 1│ │
│ │0 1 0 0 1 0 1 1 0 0 1│ │
└──┴─────────────────────┴──────────────────────────────┘
size←{⍬⍴⎕size'⍵'} ⍝ function for size in bytes.
size notes.packZ ⍝ size of notes.
7200
size packZ notes.packZ ⍝ size of compressed notes.
3948
size 8 packZ notes.packZ ⍝ ... with 8-bit code words.
5388
size 100 100⍴⍳7 ⍝ size of numeric matrix.
10020
size packZ 100 100⍴⍳7 ⍝ compressed size.
512
mat ⍝ nested matrix.
┌────────┬─────┬────────┐
│Scissors│Stone│Paper │
├────────┼─────┼────────┤
│Stone │Paper│Scissors│
└────────┴─────┴────────┘
size mat ⍝ matrix size.
188
size packZ mat ⍝ compressed size.
172
packZ mat ⍝ compressed matrix.
┌───┬─────────┬──────────────────────┐
│2 3│0 0 0 1 0│┌────────┬─────┬─────┐│
│ │0 0 1 0 0││Scissors│Stone│Paper││
│ │0 1 0 0 0│└────────┴─────┴─────┘│
└───┴─────────┴──────────────────────┘
tokens ⊃⎕nr'packZ' ⍝ tokens in first line of packZ.
┌─────┬─┬─┬───┬─┬───┬─┬─┬─┬─┬──────────────────────────┬───────────────────────────────┐
│packZ│←│{│⎕IO│ │⎕ML│←│0│ │1│ │⍝ Lempel-Ziv-Welch compression.│
└─────┴─┴─┴───┴─┴───┴─┴─┴─┴─┴──────────────────────────┴───────────────────────────────┘
toks←↑,/tokens¨⎕nr'packZ' ⍝ nested vector of packZ's tokens.
size toks ⍝ size of packZ's tokens.
11508
size packZ toks ⍝ size of compressed tokens.
4672
{⍵≡0 packZ packZ ⍵}toks ⍝ nested pack: full circle.
1
{⍵≡0 packZ packZ ⍵}'hello world' ⍝ char pack: full circle.
1
{⍵≡0 packZ packZ ⍵}11 ⎕dr'hello world' ⍝ bool pack: full circle.
1
See also: Data_compression tokens