
Algorithms: Greedy Algorithms - Data Compression Using Huffman Encoding with C Program source code





Huffman Encoding


Huffman coding is an effective and widely used technique for lossless data compression. The problem is to find a prefix-free binary code that minimizes the total number of bits needed to encode a string of symbols. Given a table of how often each character occurs, the algorithm assigns every character a binary string, giving shorter codes to more frequent characters; the result is optimal among codes that assign each character its own fixed bit string. The code is represented as a binary tree, built with a simple heap-based priority queue. Each leaf is labeled with a character and its frequency of occurrence. Each internal node is labeled with the sum of the weights of the leaves in its subtree. The Huffman encoding scheme is an example of a greedy algorithm: at every step it merges the two subtrees of lowest weight.

Analysis

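The running time of Huffman's algorithm on a set of n characters is O(n log n). Inserting the n characters into the heap costs O(n log n), and the main loop performs n - 1 merges, each of which does two DeleteMin operations and one Insert at O(log n) apiece.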





Huffman Encoding - C Program Source code for generating Huffman Codes



#include <string.h>
#include <stdio.h>
#include <limits.h>
#include <stdlib.h>
typedef struct node
{
        char ch;
        int freq;
        struct node *left;
        struct node *right;
} node;
/* Declaring heap globally so that we do not need to pass it as an argument every time */
/* Heap implemented here is Min Heap */
node *heap[1000000];
int heapSize;
/* Initialize Heap */
void Init()
{
        heapSize = 0;
        heap[0] = (node *)malloc(sizeof(node));
        heap[0]->freq = -INT_MAX;
}
/* Insert an element into the heap */
void Insert(node *element)
{
        heapSize++;
        heap[heapSize] = element; /* Insert in the last place */
        /* Adjust its position */
        int now = heapSize;
        while (heap[now / 2]->freq > element->freq)
        {
                heap[now] = heap[now / 2];
                now /= 2;
        }
        heap[now] = element;
}
node *DeleteMin()
{
        /* heap[1] is the minimum element. So we remove heap[1]. Size of the heap is decreased.
           Now heap[1] has to be filled. We put the last element in its place and see if it fits.
           If it does not fit, take the minimum element among both its children and replace the parent with it.
           Again see if the last element fits in that place. */
        node *minElement, *lastElement;
        int child, now;
        minElement = heap[1];
        lastElement = heap[heapSize--];
        /* now refers to the index at which we are now */
        for (now = 1; now * 2 <= heapSize; now = child)
        {
                /* child is the index of the element which is minimum among both the children */
                /* Indexes of children are i*2 and i*2 + 1 */
                child = now * 2;
                /* child != heapSize because heap[heapSize+1] does not exist, which means it has only one
                   child */
                if (child != heapSize && heap[child + 1]->freq < heap[child]->freq)
                {
                        child++;
                }
                /* To check if the last element fits or not it suffices to check that the last element
                   is not greater than the minimum element among both the children */
                if (lastElement->freq > heap[child]->freq)
                {
                        heap[now] = heap[child];
                }
                else /* It fits there */
                {
                        break;
                }
        }
        heap[now] = lastElement;
        return minElement;
}
void print(node *temp, char *code)
{
        if (temp->left == NULL && temp->right == NULL)
        {
                printf("char %c code %s\n", temp->ch, code);
                return;
        }
        int length = strlen(code);
        char leftcode[512], rightcode[512];
        strcpy(leftcode, code);
        strcpy(rightcode, code);
        leftcode[length] = '0';
        leftcode[length + 1] = '\0';
        rightcode[length] = '1';
        rightcode[length + 1] = '\0';
        print(temp->left, leftcode);
        print(temp->right, rightcode);
}
/* Given the list of characters along with their frequencies, our goal is to determine the encoding of the
   characters such that the total length of the encoded message is minimum */
int main()
{
        Init();
        int distinct_char;
        /* Stop on a missing or non-positive count; the heap would otherwise be empty below */
        if (scanf("%d", &distinct_char) != 1 || distinct_char < 1)
        {
                return 1;
        }
        char ch = 0;
        int freq;
        int iter;
        for (iter = 0; iter < distinct_char; iter++)
        {
                char t[4];
                /* Scanning the character as a string to avoid formatting issues of input.
                   The width limit keeps scanf from writing past the end of t */
                if (scanf("%3s", t) != 1 || scanf("%d", &freq) != 1)
                {
                        return 1;
                }
                ch = t[0];
                node *temp = (node *)malloc(sizeof(node));
                temp->ch = ch;
                temp->freq = freq;
                temp->left = temp->right = NULL;
                Insert(temp);
        }
        /* Special Case */
        if (distinct_char == 1)
        {
                printf("char %c code 0\n", ch);
                return 0;
        }
        for (iter = 0; iter < distinct_char - 1; iter++)
        {
                node *left = DeleteMin();
                node *right = DeleteMin();
                node *temp = (node *)malloc(sizeof(node));
                temp->ch = 0;
                temp->left = left;
                temp->right = right;
                temp->freq = left->freq + right->freq;
                Insert(temp);
        }
        node *tree = DeleteMin();
        char code[512];
        code[0] = '\0';
        print(tree, code);
        return 0;
}


Rough notes about the Algorithm and how it is implemented in the code above:

The heap is declared globally so that it does not have to be passed as an argument every time. The heap implemented here is a min-heap. Each heap entry is a pointer to a node structure with the fields ch (the character), freq (an integer frequency), left (pointer to the left subtree of the node) and right (pointer to the right subtree of the node).

Given the list of characters along with their frequencies, our goal is to determine the encoding of the characters such that the total length of the encoded message is minimum.

First the heap is initialized: heapSize = 0, heap[0] = (node *)malloc(sizeof(node)) and heap[0]->freq = -INT_MAX. heap[0] is a sentinel whose frequency is smaller than any real frequency, so the loop in Insert always stops at the root.
Input each character (read as a string) and the frequency of that character. Store the values in a temp node and initialize its left and right subtrees to NULL.

temp->ch = ch
temp->freq = freq
temp->left = temp->right = NULL
Insert the temp node in the heap using the Insert function.
For the special case when there is only one character, print the code 0 for that character and exit.

Repeat distinct_char - 1 times: remove the two nodes with the lowest frequencies from the heap using DeleteMin and make them the left and right subtrees of a new temporary node. Set the frequency of the temporary node to the sum of the frequencies of its left and right subtrees, then insert the temporary node in the heap.

For iter = 0 to distinct_char-2
node *left = DeleteMin()
node *right = DeleteMin()
node *temp = (node *) malloc(sizeof(node))
temp->ch = 0
temp->left = left
temp->right = right
temp->freq = left->freq + right->freq
Insert(temp)
When the loop ends, the heap holds a single node, which is the root of the Huffman tree. Remove it and store it in tree.
node *tree = DeleteMin()
Declare an array of character type, code[512], and initialize code[0] = '\0' (the null character, so code starts as the empty string).
Print the code of every character using the print function.

Insert function - It takes the element to be inserted in the heap as an argument.
• heapSize is increased by 1, and the element is placed in the last position.
heapSize++
heap[heapSize] = element
• Now the position of the element is adjusted so that the heap property is maintained. This is done by comparing it with its parent and moving the parent down until the element is no smaller than its parent. Store heapSize in a temporary variable (now, the index we are currently at). While heap[now/2]->freq > element->freq,
o heap[now] = heap[now/2], i.e. replace the value at index now by the value of its parent (index now/2)
o Divide now by 2 to move up one level in the heap.
• When the right index has been found, store the element there.
heap[now] = element

DeleteMin function - heap[1] is the minimum element, so we remove heap[1] and decrease the size of the heap. This leaves a hole at index 1, which we try to fill with the last element of the heap. If the last element does not fit there, move the smaller of the two children up into the hole and try again at that child's position. In other words, percolate the hole down, always towards the smaller child. To check whether the last element fits, it suffices to check that it is not greater than the smaller of the two children; if so, we are done. The comparisons are made on the frequencies of the nodes.

print function - It takes a pointer to a tree node (temp) and a pointer to the code array built so far.
• If temp->left and temp->right are both NULL, this node is a leaf of the tree. Hence print the character and its code, and return from the function.
• Set an integer variable length to the length of the string in code.
• Declare two arrays leftcode[512] and rightcode[512] to store the code of the left subtree and right subtree respectively. Initially copy code to both leftcode and rightcode.
• Append '0' and the terminating '\0' to leftcode, and '1' and the terminating '\0' to rightcode.
• Recurse into the left subtree with leftcode and into the right subtree with rightcode.


