Huffman coding
Huffman coding is a lossless data compression algorithm that is used to compress data in a way that minimizes the number of bits used to represent the data. It is a widely used algorithm and is considered one of the most effective ways to compress data.
In this blog post, I will explain how Huffman coding works and why it is so useful.
Huffman coding algorithm is commonly used in many applications, including data transmission and storage.
A character in a computer is typically represented using 8 bits, which is known as a byte. This means that a single character can be stored using 8 bits, or 1 byte, of memory.
ASCII encoding scheme uses 7 bits to represent a character, while the Unicode scheme uses 16 bits per character.
A has a ascii code 65 and can be represented in bits as 01000001.
Let’s suppose we have a string of 20 characters:
let str = "BACABABCCACDIAIDACAB";
Total bits used are : 20 X 8 bits = 160 bits.
we need 160 bits to store a string with length of 20.
It uses a binary tree to represent the frequency of characters in a given text. The first step in the Huffman coding process is to calculate the frequency of each character in the text. The binary tree is constructed in such a way that the most frequently occurring characters are represented by shorter bit strings, while less frequently occurring characters are represented by longer bit strings.
In our string “BACABABCCACDIAIDACAB”, we have 5 characters that are repeating, so max we need 3 bits to store it.
As you can see we can represent each character using 3 bit and each different character have different bit code.
Now if we calculate how many bits we need then it will be 20 X 3 bits = 60 bits.
Reducing the size from 160 to 60 bits.
we have our own bit representation of the original message.
But the question is how the other person know by just looking at the bit representation of the message what exactly it means?
The answer is, the other person need to know the table that we have used for mapping so that the message will be converted back to original text.
So, we need to send the table along with the message as well so that it is easy to convert. we need to calculate the size of table as well.
- we have 5 characters that needs to be send in ASCII codes , so it will take 5 X 8 = 40 bits.
- For each custom bit representation of character, we need: 5 X 3 bits = 15 bits.
- so the total bits required to send the compressed message with table is 60 + 40 + 15 = 115 bits.
This is also known as Fixed size code.
Huffman coding uses optimal Merge pattern. It says that each alphabet should be arranged in increasing order of their occurance. and merge the 2 smallest to make a node.
- we have listed our unique characters based on their occurrences in increasing order. Let’s suppose each character on the base is a node.
- we need to combine two smallest node and then create a node.
- this is how our tree looks like:
Now based on this tree you can create a new table with dynamic code size. let’s do it:
Let’s understand , how we have created the codes:
we’ll start from the top and then move downwards to reach our character and we keep track of our code as well.
- To get code for character A, let’s start from top, in order to reach A we need to move downwards. so our path is 1 and 1, that is why our code for A is 11.
- To reach character B from top, we need to move 0 and 1.
- Similarly for C:
- For D:
- For I:
if we calculate now, we need :
2 bits to store A , B, and C. 3 bits to store D and I.
So total size is (7*2)+(4*2)+(5*2)+(2*3)+(2*3) = 44 bits for message.
And, 40 bits for 5 characters plus 12 bits to send new codes.
Total size of the new message along with table is 96 bits.
This is also known as Dynamic size code.
Conclusion:
Huffman coding is a method of compressing data that involves creating a binary tree of symbols and their frequencies in order to assign shorter codewords to more frequent symbols and longer codewords to less frequent symbols. This results in a more efficient way of encoding the data, resulting in smaller file sizes and faster transmission times.
Overall, Huffman coding is a useful technique for data compression and can be applied to a wide range of applications