Conservation of Bits

Copyright ©2003 by Paul Niquette. All rights reserved.

Can you think of a variable-length code that will conserve bits?

Here is a rather lopsided tree for representing a variable-length code.

The most frequent symbol is G, and only one bit is assigned to encode it.  The next most frequent symbol is C, which requires two bits.  That leaves (no pun intended) the remaining two symbols, A and T, which require three bits each. 

A little foozling will quickly convince you that these assignments are the best you can do on behalf of four symbols. 

Congratulations, you have re-invented the famous Huffman Coding Procedure, which is applied in nearly every information-intensive application that requires efficient use of binary bits in the representation of data. The previous sentence does not have an exclamation point; however, in recognition of its emphatic assertion, this one does!
You will find Huffman Codes in data storage and retrieval systems, modems, fax machines, and high-definition television.
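
If you care to foozle along in code, here is a minimal sketch of the procedure in Python (the function name and the tie-breaking rule are mine, not part of the puzzle), applied to the four nucleotides with weights 2, 3, 4, 5 -- proportional to the Earth-Life frequencies in the table below.

    import heapq

    def huffman_code(weights):
        """Build a Huffman code from a dict mapping symbol -> relative frequency."""
        # Heap entries: (subtree weight, tie-breaker, {symbol: code-so-far}).
        # The tie-breaker favors recently merged subtrees, which is what yields
        # the lopsided tree described above for these particular weights.
        heap = [(w, -i, {sym: ""}) for i, (sym, w) in enumerate(weights.items())]
        heapq.heapify(heap)
        order = len(heap)
        while len(heap) > 1:
            w1, _, codes1 = heapq.heappop(heap)   # lightest subtree
            w2, _, codes2 = heapq.heappop(heap)   # next lightest
            merged = {s: "0" + c for s, c in codes1.items()}        # left branch
            merged.update({s: "1" + c for s, c in codes2.items()})  # right branch
            heapq.heappush(heap, (w1 + w2, -order, merged))
            order += 1
        return heap[0][2]

    print(huffman_code({"T": 2, "A": 3, "C": 4, "G": 5}))
    # {'G': '0', 'C': '10', 'T': '110', 'A': '111'} -- one, two, three, and three
    # bits, as in the lopsided tree (the 0/1 labels within a level may differ).

With these weights a tie occurs along the way; a different tie-breaking rule yields a flat two-bit code instead, and, as the table below confirms, the two codes average exactly the same number of bits per nucleotide.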

So, then, for the Conservation of Bits, how well did your variable-length code do?  The following table provides the answer.
 
Nucleotide    Variable-Length Code    Earth-Life Frequencies    Pro-rated Bits
Guanine       0                       35.7%                     0.357
Cytosine      10                      28.6%                     0.572
Adenine       110                     21.4%                     0.642
Thymine       111                     14.3%                     0.429

The sum of 0.357 + 0.572 + 0.642 + 0.429 = 2.000.  On the average, apparently, the variable-length code requires two bits to represent each nucleotide.  Two bits? -- hey, that's the same as the fixed-length code. 
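
Skeptical readers can check the arithmetic with a few lines of Python (variable names mine):

    # Average bits per symbol = sum over symbols of (code length x frequency).
    freqs   = {"G": 0.357, "C": 0.286, "A": 0.214, "T": 0.143}   # Earth-Life frequencies
    lengths = {"G": 1,     "C": 2,     "A": 3,     "T": 3}       # variable-length code
    prorated = {s: round(lengths[s] * freqs[s], 3) for s in freqs}
    print(prorated)                          # {'G': 0.357, 'C': 0.572, 'A': 0.642, 'T': 0.429}
    print(round(sum(prorated.values()), 3))  # 2.0 -- no better than the fixed-length code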

Hmm, the answer to the question posed in this puzzle, "Can you think of a variable-length code that will conserve bits?" must therefore be...

No

 
Don't be discouraged.  The puzzle was cooked up to achieve that outcome.  The integers 2, 3, 4, 5 were chosen for the relative frequencies of T, A, C, G, which produced the indicated proportions in the table above (in nature, by the way, the relative frequencies of nucleotides are closer to 12, 17, 25, 28).

If, say, the relative frequencies had been 1, 2, 4, 8, the average number of Huffman Code bits per symbol would have been 1-2/3 (that is, 5/3).  The formal name for that number is the Shannon Entropy Limit (SEL), and for the case at hand, it represents a data compression of 0.833, conserving 0.167 bits per bit.  The same SEL would arise with various relative frequencies, including for example 1, 1, 4, 6 and 3, 9, 10, 29.  The most radical distribution of symbols might be something like 1, 1, 1, 100, which produces an SEL of about 1.05, thus approaching 50% in the Conservation of Bits.
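
Readers who want to verify those figures might try a short sketch like this one (the function name is mine); it exploits the standard fact that the total Huffman code length equals the sum of the merged subtree weights, so no tree need be built at all:

    import heapq

    def average_bits(weights):
        """Average Huffman-code bits per symbol for a list of relative frequencies."""
        heap = list(weights)
        heapq.heapify(heap)
        cost = 0
        while len(heap) > 1:
            merged = heapq.heappop(heap) + heapq.heappop(heap)  # combine two lightest
            cost += merged                                      # every merge adds its weight
            heapq.heappush(heap, merged)
        return cost / sum(weights)

    for w in ([2, 3, 4, 5], [1, 2, 4, 8], [1, 1, 4, 6], [3, 9, 10, 29], [1, 1, 1, 100]):
        print(w, round(average_bits(w), 3))
    # [2, 3, 4, 5]      2.0
    # [1, 2, 4, 8]      1.667  (1-2/3 bits, a compression of 0.833)
    # [1, 1, 4, 6]      1.667
    # [3, 9, 10, 29]    1.667
    # [1, 1, 1, 100]    1.049  (approaching 50% in the Conservation of Bits)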


David A. Huffman (1925-1999) is probably best known for the development of the Huffman Coding Procedure, the result of a term paper written while he was a graduate student at MIT.  He got a grade of A.

Huffman codes belong in a category called "comma-free codes," since no extra bits are needed for 'punctuation' or delimiting symbols in a data stream -- this, despite the fact that each symbol requires a variable number of bits in its respective code.  To qualify as a comma-free code, no codeword may appear as the opening bits -- the prefix -- of any longer codeword.
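
Here is a small sketch of that property in action (names and the sample message are mine): the puzzle's code is packed into a single bit stream and then decoded without any delimiting bits.

    # The puzzle's variable-length code: no codeword is the prefix of a longer one.
    code = {"G": "0", "C": "10", "A": "110", "T": "111"}

    def decode(bits, code):
        """Decode a delimiter-free bit string by matching one codeword at a time."""
        by_bits = {v: k for k, v in code.items()}
        symbols, buffer = [], ""
        for b in bits:
            buffer += b
            if buffer in by_bits:          # a complete codeword has been read
                symbols.append(by_bits[buffer])
                buffer = ""
        return "".join(symbols)

    message = "GATTACA"
    stream = "".join(code[s] for s in message)
    print(stream)                 # 011011111111010110 -- no 'comma' bits anywhere
    print(decode(stream, code))   # GATTACA

The greedy match is safe precisely because the code is prefix-free: the moment the buffer equals a codeword, it cannot also be the start of a different, longer codeword.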

An Internet search for "Huffman Code" will find more than 10,000 websites, several with applets for creating Huffman trees.

