No Huff and Puff—Just Compress!

March 16, 2025 - 5 minutes read - 889 words

The Huffman’s Algorithm

The Huffman’s algorithm was born in 1952 as a way to perform a lossless compression on data files. Its efficiency comes from the frequency analysis of characters present in the text to encode: shorter codes are used to encode more frequent characters while longer codes are used to encore less frequent ones.

It constitutes the foundation of modern text compression.

You can read more on variable-length encoding and tree representations under the Design tag on this blog.

Given the sequence 001, how do we know we should decode it as 0 & 01 and not 001?

Huffman’s codes for characters are never the prefix code for another character. Thanks to this, there is no ambiguity when decoding the text. In this example, the algorithm would not output 0 & 01 codes together.

How it works

The algorithm is based on frequency analysis of characters in the text to encode. The least two frequent characters are then put into a binary tree, which root node is the result of their combined frequency. This operation is repeated until there is only one node left.

The code for a given character is then the path followed along the binary tree to reach this character; by convention left nodes carry the 0 direction while right nodes carry 1.

To aid with visual representation, the below is an example of such tree.

Practical Example

To represent the Huffman’s code, we first need to define a Huffman’s node. Since we will be processing pairs of least frequent nodes, a min heap is used. The Huffman’s node requires to implement the less than magic method __lt__.

Note we need to handle the case where two characters have the same frequency by comparing their symbol instead.

class Huffman_Node:
    def __init__(self, freq, symbol, left=None, right=None):
        self.freq = freq
        self.symbol = symbol
        self.right = right
        self.left = left
        self.dir = ''

    def __lt__(self, other):
        if self.freq == other.freq:
            return self.symbol < other.symbol
        return self.freq < other.freq

    def __str__(self):
        return f'Symbol: {self.symbol}, Freq: {self.freq}, Left: {self.left}, Right: {self.right}'

From there, the Huffman’s tree can be generated. Traversing the tree and building paths to the child nodes provides the codes given to each character of the text. Those paths are kept in encoding/decoding tables.

import heapq
from collections import Counter

class Huffman_Tree:
    def __init__(self):
        self.root = None
        self.encoding_table = {}
        self.decoding_table = {}

    def _build_encoding_table(self, text):
        frequencies = Counter(text)
        self._get_huffman_coding(frequencies)

        def traverse(node, path=""):
            if not node:
                return
            if not node.left and not node.right:
                self.encoding_table[node.symbol] = path
                self.decoding_table[path] = node.symbol
                return
            if node.left:
                traverse(node.left, path + str(node.left.dir))
            if node.right:
                traverse(node.right, path + str(node.right.dir))
        return traverse(self.root)

    def _get_huffman_coding(self, frequencies):
        nodes = []

        for char, freq in frequencies.items():
            heapq.heappush(nodes, Huffman_Node(freq, char))

        while len(nodes) > 1:
            left = heapq.heappop(nodes)
            right = heapq.heappop(nodes)
            # assign directional value, convention is 0 left, 1 right
            left.dir = 0
            right.dir = 1

            combined_node = Huffman_Node(left.freq + right.freq, left.symbol + right.symbol, left, right)
            heapq.heappush(nodes, combined_node)
        #print([str(node) for node in nodes])
        self.root = nodes[0]

The encode method builds the encoding table and processes each character one by one, following the encoding table.

The decode method decodes the first sequence of bits found in the decoding table. This is possible thanks to the property of the Huffman’s algorithm not to encode a character with another’s prefix. This provides the guarantee that the first known found sequence is the correct one. This is done simply using a two pointers method.

class Huffman_Tree:
    
    def encode(self, text):
        self._build_encoding_table(text)
        encoded_text = ''
        for c in text:
            encoded_text += self.encoding_table[c]
        return encoded_text

    def decode(self, text):
        if not self.decoding_table:
            print('Decoding table not built! Unable to proceed')
            return ''
        # prefixes of the table are guaranteed to not be shared with other
        decoded_text = ''
        l, r = 0, 0
        while r < len(text):
            curr_bits = text[l:r+1]
            if curr_bits in self.decoding_table:
                decoded_text += self.decoding_table[curr_bits]
                l = r + 1
            r += 1
        return decoded_text

Fancy a run?

The below code encodes and decodes the test string “leane.dev is my bedtime read; counting bits guaranteed”.

if __name__ == '__main__':
    tree = Huffman_Tree()

    ### My Example ###
    example_str = "leane.dev is my bedtime read; counting bits guaranteed"
    encoded_example = tree.encode(example_str)
    print(f'The encoded example is: \n{encoded_example} \n with length {len(encoded_example)} \n while original length'
          f' was {len(example_str)}\n')
    print(f'Encoding table is: \n {tree.encoding_table}')
    print(f'Decoding table is: \n {tree.decoding_table}')

    decoded_str = tree.decode(encoded_example)
    print(f'Decoded string is: {decoded_str}')

The encoded example is: 
10001110101111110101011010100110111110101011010000010110010110001010000101100100111011100110101011111101011110010110110101000101111000001111000111011110110000101000011010010000010110000001011111111011111100011011011001 
 with length 218 
 while original length was 54

Encoding table is: 
 {'s': '0000', 'u': '0001', 't': '001', ' ': '010', 'y': '01100', '.': '011010', ';': '011011', 'a': '0111', 'b': '10000', 'c': '100010', 'l': '100011', 'd': '1001', 'e': '101', 'g': '11000', 'm': '11001', 'i': '1101', 'n': '1110', 'o': '111100', 'v': '111101', 'r': '11111'}
Decoding table is: 
 {'0000': 's', '0001': 'u', '001': 't', '010': ' ', '01100': 'y', '011010': '.', '011011': ';', '0111': 'a', '10000': 'b', '100010': 'c', '100011': 'l', '1001': 'd', '101': 'e', '11000': 'g', '11001': 'm', '1101': 'i', '1110': 'n', '111100': 'o', '111101': 'v', '11111': 'r'}
Decoded string is: leane.dev is my bedtime read; counting bits guaranteed

Encoding this 54 characters example string would have required 432 bits or 54 bytes following the ASCII scheme.

Using the Huffman’s algorithm, we could do it with only 218 bits, representing a 50% compression.

Even assuming a 7 bits ASCII encoding, we reach a 43% compression!

The Huffman’s Algorithm

How it works

Practical Example

Fancy a run?

References