
Huffman Encoding

Huffman codes are a very effective and widely used technique for compressing data. The Huffman encoding problem is to find the minimum-length bit string that can be used to encode a given string of symbols. The algorithm uses a table of how often each character occurs to assign each character a binary string, so that frequent characters get short codes and the total encoded length is optimal. It builds a binary tree with the help of a simple heap-based priority queue. Each leaf is labeled with a character and its frequency of occurrence. Each internal node is labeled with the sum of the weights of the leaves in its subtree. The Huffman encoding scheme is an example of a greedy algorithm: at every step it merges the two lowest-weight nodes.
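
The greedy step can be illustrated without a heap. The sketch below is not part of the program on this page: `huffman_cost` is a hypothetical helper that repeatedly merges the two smallest weights with a plain linear scan and adds up the merged weights. That sum equals the total number of bits in the encoded message, because every merge adds one bit to the code of each character underneath it.

```c
#include <stddef.h>

/* Hypothetical helper (not in the original listing): returns the total
   number of bits of the Huffman-encoded message for a frequency table.
   The array is modified in place. Uses an O(n^2) scan instead of a heap. */
long huffman_cost(long *freq, size_t n)
{
        long total = 0;
        while (n > 1)
        {
                /* a = index of the smallest weight, b = index of the second smallest */
                size_t a = 0, b = 1, i;
                if (freq[b] < freq[a]) { a = 1; b = 0; }
                for (i = 2; i < n; i++)
                {
                        if (freq[i] < freq[a]) { b = a; a = i; }
                        else if (freq[i] < freq[b]) { b = i; }
                }
                /* Greedy step: merge the two smallest weights */
                long merged = freq[a] + freq[b];
                total += merged;
                /* Keep the merged weight in slot a and fill slot b with the
                   last element, so the array shrinks by one entry */
                freq[a] = merged;
                freq[b] = freq[n - 1];
                n--;
        }
        return total;
}
```

For the illustrative frequencies 45, 13, 12, 16, 9 and 5 (chosen here as an example, not taken from this page), the function returns 224, compared with 300 bits for a fixed 3-bit code per character.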

Analysis

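The running time of Huffman on a set of n characters is O(n log n). Inserting the n leaves into the heap costs O(log n) each, and building the tree takes n - 1 merge steps, each of which performs two DeleteMin operations and one Insert on a heap of at most n nodes, again at O(log n) per operation.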

Huffman Encoding - C Program Source code for generating Huffman Codes

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct node
{
        char ch;
        int freq;
        struct node *left;
        struct node *right;
} node;

/* Declaring heap globally so that we do not need to pass it as an argument every time */
/* Heap implemented here is a Min Heap, stored from index 1 */
node *heap[1000000];
int heapSize;

/* Initialize Heap */
void Init()
{
        heapSize = 0;
        /* heap[0] is a sentinel smaller than any real frequency */
        heap[0] = (node *)malloc(sizeof(node));
        heap[0]->freq = -INT_MAX;
}

/* Insert an element into the heap */
void Insert(node *element)
{
        heapSize++;
        heap[heapSize] = element; /* Insert in the last place */
        /* Adjust its position */
        int now = heapSize;
        while (heap[now / 2]->freq > element->freq)
        {
                heap[now] = heap[now / 2];
                now /= 2;
        }
        heap[now] = element;
}

node *DeleteMin()
{
        /* heap[1] is the minimum element. So we remove heap[1]. Size of the heap is decreased.
           Now heap[1] has to be filled. We put the last element in its place and see if it fits.
           If it does not fit, take the minimum element among both its children and replace the
           parent with it. Again see if the last element fits in that place. */
        node *minElement, *lastElement;
        int child, now;
        minElement = heap[1];
        lastElement = heap[heapSize--];
        /* now refers to the index at which we are now */
        for (now = 1; now * 2 <= heapSize; now = child)
        {
                /* child is the index of the element which is minimum among both the children */
                /* Indexes of children are now*2 and now*2 + 1 */
                child = now * 2;
                /* child != heapSize because heap[heapSize+1] does not exist, which means it has
                   only one child */
                if (child != heapSize && heap[child + 1]->freq < heap[child]->freq)
                {
                        child++;
                }
                /* To check if the last element fits or not it suffices to check if the last
                   element is greater than the minimum element among both the children */
                if (lastElement->freq > heap[child]->freq)
                {
                        heap[now] = heap[child];
                }
                else /* It fits there */
                {
                        break;
                }
        }
        heap[now] = lastElement;
        return minElement;
}

void print(node *temp, char *code)
{
        if (temp->left == NULL && temp->right == NULL)
        {
                printf("char %c code %s\n", temp->ch, code);
                return;
        }
        int length = strlen(code);
        char leftcode[512], rightcode[512];
        strcpy(leftcode, code);
        strcpy(rightcode, code);
        leftcode[length] = '0';
        leftcode[length + 1] = '\0';
        rightcode[length] = '1';
        rightcode[length + 1] = '\0';
        print(temp->left, leftcode);
        print(temp->right, rightcode);
}

/* Given the list of characters along with their frequencies, our goal is to find the encoding
   of the characters such that total length of message when encoded becomes minimum */
int main()
{
        Init();
        int distinct_char;
        scanf("%d", &distinct_char);
        char ch = 0;
        int freq;
        int iter;
        for (iter = 0; iter < distinct_char; iter++)
        {
                char t[4];
                /* Scanning the character as a string to avoid formatting issues of input.
                   The width 3 keeps the read inside t. */
                scanf("%3s", t);
                ch = t[0];
                scanf("%d", &freq);
                node *temp = (node *)malloc(sizeof(node));
                temp->ch = ch;
                temp->freq = freq;
                temp->left = temp->right = NULL;
                Insert(temp);
        }
        /* Special Case */
        if (distinct_char == 1)
        {
                printf("char %c code 0\n", ch);
                return 0;
        }
        for (iter = 0; iter < distinct_char - 1; iter++)
        {
                node *left = DeleteMin();
                node *right = DeleteMin();
                node *temp = (node *)malloc(sizeof(node));
                temp->ch = 0;
                temp->left = left;
                temp->right = right;
                temp->freq = left->freq + right->freq;
                Insert(temp);
        }
        node *tree = DeleteMin();
        char code[512];
        code[0] = '\0';
        print(tree, code);
        return 0;
}
```

Rough notes about the algorithm and how it is implemented in the code above:

The heap is declared globally so that it does not need to be passed as an argument every time. The heap implemented here is a min-heap. Each heap entry points to a node structure with the fields ch (the character), freq (its integer frequency), left (a pointer to the node's left subtree) and right (a pointer to the node's right subtree).

Given the list of characters along with their frequencies, our goal is to find an encoding of the characters such that the total length of the encoded message is minimum.

First, the heap is initialized with heapSize = 0. heap[0] is allocated with malloc(sizeof(node)) and its frequency is set to -INT_MAX, the negative of the largest signed int, so that it acts as a sentinel smaller than every real frequency.

Read each character and its frequency. Store the values in a temporary node and set its left and right subtrees to NULL.

```
temp -> ch = ch
temp -> freq = freq
temp -> left = temp -> right = NULL
```

Insert the temp node into the heap using the Insert function. In the special case where there is only one distinct character, print code 0 for that character and stop.
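
The leaf setup above can be wrapped in a small function. This is a minimal sketch, not code from the listing: `make_leaf` is a hypothetical name, and the out-of-memory check is an addition that the original program does not perform.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct node
{
        char ch;
        int freq;
        struct node *left;
        struct node *right;
} node;

/* Hypothetical helper: allocate a leaf for one character.
   Exits the program if memory cannot be allocated. */
node *make_leaf(char ch, int freq)
{
        node *leaf = (node *)malloc(sizeof(node));
        if (leaf == NULL)
        {
                fprintf(stderr, "out of memory\n");
                exit(EXIT_FAILURE);
        }
        leaf->ch = ch;
        leaf->freq = freq;
        leaf->left = leaf->right = NULL; /* a leaf has no children */
        return leaf;
}
```

With such a helper, the body of the input loop would reduce to a call like `Insert(make_leaf(ch, freq))`.
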

Then repeat distinct_char - 1 times: remove the two minimum-frequency nodes from the heap with DeleteMin and make them the left and right subtrees of a new temporary node. Set the frequency of the temporary node to the sum of the frequencies of its left and right subtree nodes. Insert this temporary node into the heap.

```
For iter = 0 to distinct_char-2
    node * left = DeleteMin()
    node * right = DeleteMin()
    node * temp = (node *) malloc(sizeof(node))
    temp -> ch = 0
    temp -> left = left
    temp -> right = right
    temp -> freq = left->freq + right -> freq
    Insert(temp)
```

When the loop ends, the heap holds a single node, the root of the Huffman tree. Remove it with node *tree = DeleteMin(). Declare a character array code[512] and initialize code[0] = '\0' (the null terminator, so the code starts as an empty string). Print the codes by calling the print function on the tree.
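
The merge step can be isolated in a helper, and the finished tree can be released once the codes have been printed. Both functions below are assumptions for illustration: `merge_nodes` and `free_tree` are hypothetical names, and the program above never frees its nodes.

```c
#include <stdlib.h>

typedef struct node
{
        char ch;
        int freq;
        struct node *left;
        struct node *right;
} node;

/* Hypothetical helper: build an internal node whose weight is the sum
   of the weights of its two children. Returns NULL if malloc fails. */
node *merge_nodes(node *left, node *right)
{
        node *parent = (node *)malloc(sizeof(node));
        if (parent == NULL)
        {
                return NULL;
        }
        parent->ch = 0; /* internal nodes carry no character */
        parent->freq = left->freq + right->freq;
        parent->left = left;
        parent->right = right;
        return parent;
}

/* Hypothetical helper: free every node of the tree, children first */
void free_tree(node *root)
{
        if (root == NULL)
        {
                return;
        }
        free_tree(root->left);
        free_tree(root->right);
        free(root);
}
```

In the loop, `Insert(merge_nodes(left, right))` would replace the five assignment lines, and `free_tree(tree)` could be called after `print`.
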

```
Insert function: it takes the element to be inserted into the heap as an argument.
• heapSize is increased by 1, and the element is placed at the last position.
      heapSize++
      heap[heapSize] = element
• The position of the element is then adjusted so that the heap property is maintained.
  This is done by comparing the element with its parent and moving the parent down for
  as long as the parent's frequency is greater than the element's. Store heapSize in a
  temporary variable now (the index we are currently at). While
  heap[now/2]->freq > element->freq:
      o heap[now] = heap[now/2], i.e. replace the value at index now with the
        value of its parent (index now/2).
      o Divide now by 2 to move up one level in the heap.
  The sentinel at heap[0] has the smallest possible frequency, so the loop always
  stops at the root.
• When the right index has been found, store the element there.
      heap[now] = element
```
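
The same sift-up logic can be tried in isolation on plain integers. The sketch below is a simplified stand-in rather than the listing's code: `keys`, `keyCount`, `InitKeys`, `InsertKey` and the capacity of 64 are all hypothetical, and the capacity check is an addition. As in the listing, index 0 holds a sentinel so the loop needs no explicit test for the root.

```c
#include <limits.h>

#define HEAP_CAPACITY 64 /* illustrative size, chosen for this sketch */

int keys[HEAP_CAPACITY + 1]; /* keys[1..keyCount] form the min-heap */
int keyCount;

/* Empty the heap and place the sentinel at index 0 */
void InitKeys(void)
{
        keyCount = 0;
        keys[0] = INT_MIN; /* no key is smaller than the sentinel */
}

/* Insert key into the min-heap. Returns 0 on success, -1 if the heap is full. */
int InsertKey(int key)
{
        int now;
        if (keyCount == HEAP_CAPACITY)
        {
                return -1;
        }
        now = ++keyCount;
        /* Move larger parents down one level; the sentinel stops the loop at index 1 */
        while (keys[now / 2] > key)
        {
                keys[now] = keys[now / 2];
                now /= 2;
        }
        keys[now] = key;
        return 0;
}
```

After calling `InitKeys()` and inserting 5, 3, 8, 1 and 4, the smallest key, 1, sits at `keys[1]`.
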

```
DeleteMin function: heap[1] is the minimum element, so we remove heap[1] and decrease
the size of the heap. The hole at heap[1] now has to be filled. We try the last element
of the heap there and see if it fits. If it does not fit, move the smaller of the
hole's two children up into the hole and try the last element in the place that child
left behind. In other words, percolate down, swapping with the minimum child as
necessary. To check whether the last element fits, it suffices to check that it is no
greater than the smaller of the two children; if so, we are done. The comparison is
made on the frequencies.

print function: it takes a pointer to a tree node as temp and a pointer to the code array.
• If temp->left and temp->right are both NULL, then this is a leaf of the tree.
  Hence print the character and its code, and return from the function.
• Initialize an integer variable length to the length of the string in code.
• Declare two arrays leftcode[512] and rightcode[512] to store the code of the left
  subtree and right subtree respectively. Initially copy code to both leftcode and
  rightcode.
• Append '0' and the terminating '\0' to leftcode, and '1' and '\0' to rightcode.
• Recurse into the left subtree with leftcode and into the right subtree with rightcode.
```
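
The print function copies two 512-byte arrays on every call. One possible alternative, sketched here under assumptions, shares a single buffer and tracks the current depth instead. `collect_codes` and the 256-entry table are hypothetical and not part of the listing. The sketch assumes every internal node has two children, which holds for trees built by the merge loop, and that no code is longer than 255 bits.

```c
#include <string.h>

typedef struct node
{
        char ch;
        int freq;
        struct node *left;
        struct node *right;
} node;

/* Hypothetical variant of print: stores the code of each character in
   table[character] instead of printing it. buffer holds the bits on the
   path from the root to the current node; depth is how many are in use. */
void collect_codes(const node *temp, char *buffer, int depth, char table[256][256])
{
        if (temp->left == NULL && temp->right == NULL)
        {
                buffer[depth] = '\0'; /* terminate the path string */
                strcpy(table[(unsigned char)temp->ch], buffer);
                return;
        }
        buffer[depth] = '0'; /* left edge contributes a 0 bit */
        collect_codes(temp->left, buffer, depth + 1, table);
        buffer[depth] = '1'; /* right edge contributes a 1 bit */
        collect_codes(temp->right, buffer, depth + 1, table);
}
```

A caller would pass the root, a `char buffer[256]`, a depth of 0 and a zero-initialized table. Note that a tree consisting of a single leaf yields an empty code here, whereas the program above handles that case separately by printing code 0.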
