Short Story ⟡ Informatics

Today I'm a New Member of Information Theory Club

Exploring data compression and how to efficiently represent information without losing meaning.

  • #source coding
  • #Huffman coding
  • #compression ratio
  • #prefix-free codes

"I filled out the membership form."

Yuki handed over the form, and Aoi accepted it.

"You're officially a new member now. Congratulations."

"Thank you so much!"

Riku applauded. "We got another member! So is today a welcome party?"

"Doing club activities properly is the best welcome," Aoi smiled. "Today's theme is source coding."

"Source coding?" Yuki tilted their head.

"The theory of data compression. Technology to encode messages efficiently."

Aoi wrote characters on the whiteboard.

"For example, suppose we transmit this string: 'aaaabbbbc'."

"Four a's, four b's, one c," Riku counted.

"Normally, representing each character with 8 bits requires 72 bits for 9 characters."
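
Aoi's 72-bit figure is simple arithmetic; a quick Python check, assuming a plain fixed-width encoding such as ASCII where every character costs 8 bits:

```python
message = "aaaabbbbc"

# Fixed-width encoding: every character costs the same 8 bits,
# no matter how often it appears.
BITS_PER_CHAR = 8
total_bits = len(message) * BITS_PER_CHAR
print(total_bits)  # 9 characters x 8 bits = 72
```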

"But since a and b appear many times, we can probably make it shorter," Yuki said.

"Sharp! That's the basic idea of coding," Aoi nodded. "Assign short codes to frequent characters, long codes to rare ones."

"Specifically?"

"There's a method called Huffman coding. We can mechanically create optimal codes."

Aoi drew a table.

  symbol   probability
  a        4/9 ≈ 44%
  b        4/9 ≈ 44%
  c        1/9 ≈ 11%

"For this probability distribution, we construct a Huffman tree."

Riku said anxiously, "A tree?"

"Don't worry. It's a simple procedure." Aoi began drawing a diagram.

  1. Combine the two nodes with the lowest probabilities
  2. Give the new node the sum of their probabilities
  3. Repeat until everything merges into one tree

"Since c is only 11 percent, it gets merged first..." Yuki thought.

"Right. c pairs with either a or b, since those two are tied at 44 percent. Different tie-breaks give different trees, but every one of them is optimal."

Aoi showed the completed tree.

"a: 0 b: 10 c: 11"

"a is 1 bit, b and c are 2 bits!" Riku was surprised.

"Average code length is 4/9×1 + 4/9×2 + 1/9×2 = 14/9 ≈ 1.56 bits per character."
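
The whiteboard procedure can be sketched in a few lines of Python for the string 'aaaabbbbc'. Tie-breaking between the two 44-percent symbols is arbitrary, so the 1-bit codeword may land on a different letter than on Aoi's whiteboard, but the code lengths (one 1-bit and two 2-bit codewords) and the average come out the same:

```python
import heapq
from collections import Counter
from fractions import Fraction

message = "aaaabbbbc"
freq = Counter(message)  # {'a': 4, 'b': 4, 'c': 1}

# Steps from the whiteboard: repeatedly merge the two lowest-weight
# nodes until one tree remains. The running index breaks ties so
# tuples never compare a string against a subtree.
heap = [(count, i, ch) for i, (ch, count) in enumerate(sorted(freq.items()))]
heapq.heapify(heap)
tick = len(heap)
while len(heap) > 1:
    w1, _, left = heapq.heappop(heap)
    w2, _, right = heapq.heappop(heap)
    heapq.heappush(heap, (w1 + w2, tick, (left, right)))
    tick += 1

# Walk the finished tree: left edge emits '0', right edge emits '1'.
codes = {}
def assign(node, prefix):
    if isinstance(node, str):
        codes[node] = prefix or "0"
    else:
        assign(node[0], prefix + "0")
        assign(node[1], prefix + "1")
assign(heap[0][2], "")

# Average length is 14/9 ≈ 1.56 bits per character.
avg = sum(Fraction(freq[ch], len(message)) * len(code)
          for ch, code in codes.items())
print(codes, float(avg))
```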

"Originally 8 bits, so a major reduction!" Yuki calculated.

"Yes. But this is an idealized example. In practice, we also consider context."

Riku had a question. "But how do you know where one code ends and the next begins? If you receive '010', how do you know it means 'a' then 'b' rather than some other split?"

"Good question. That's the prefix-free property," Aoi explained. "No codeword is a prefix of any other codeword, so every bit string decodes in exactly one way."

"Prefix?"

"For example, if a were '0' and b were '01', then a's codeword would be a prefix of b's. Seeing '01', you couldn't tell whether it's b or an a followed by the start of another codeword."

"So we need to design codewords like '0', '10', '11'," Yuki understood.

"Exactly. Huffman codes are always prefix-free."
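
Aoi's uniqueness claim is easy to check with a small decoder, using the codebook from the whiteboard (a: 0, b: 10, c: 11):

```python
# Codebook from the whiteboard; no codeword is a prefix of another.
codes = {"a": "0", "b": "10", "c": "11"}

def encode(text):
    return "".join(codes[ch] for ch in text)

def decode(bits):
    # Because the code is prefix-free, we can scan left to right and
    # emit a symbol the moment the buffer matches a codeword.
    inverse = {v: k for k, v in codes.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    assert buf == "", "leftover bits: not a valid codeword sequence"
    return "".join(out)

bits = encode("aaaabbbbc")
print(bits, len(bits))   # 14 bits instead of 72
print(decode(bits))      # round-trips to 'aaaabbbbc'
print(decode("010"))     # Riku's example: decodes only as 'ab'
```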

Riku wrote in his notebook. "So for longer messages, can we compress more?"

"It approaches the source's entropy. Entropy is the theoretical lower limit on the average number of bits per symbol."

"Entropy appears again," Yuki laughed.

"It's a fundamental concept involved in all of information theory."

Aoi showed another example. "English text has entropy around 1.5 bits per character. So theoretically, we can compress from 8 bits to 1.5 bits."
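
Shannon entropy puts a number on that limit. A quick check for the club's own source, using H = -Σ p·log2(p) with the probabilities 4/9, 4/9, 1/9:

```python
import math

# Symbol probabilities of the example source 'aaaabbbbc'.
probs = [4/9, 4/9, 1/9]

# Shannon entropy: the lower bound, in bits per symbol, on the
# average length of any lossless code for this source.
entropy = -sum(p * math.log2(p) for p in probs)

# Average length of the whiteboard code (a: 1 bit, b and c: 2 bits).
huffman_avg = 4/9 * 1 + 4/9 * 2 + 1/9 * 2

# Entropy ≈ 1.39 bits; the Huffman code's ≈ 1.56 bits sits just above it.
print(round(entropy, 3), round(huffman_avg, 3))
```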

"But how do zip and rar actually work?" Riku asked.

"They use dictionary-based methods: find repeating patterns and replace them with short references to earlier occurrences. LZ77, LZ78, LZW, and so on. zip's DEFLATE format actually combines LZ77 with Huffman coding."

"Sounds complicated..."

"The algorithms are complex, but the principle is the same. Reduce redundancy."
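
One way to see the principle concretely: Python's standard zlib module implements DEFLATE, which chains LZ77 with Huffman coding. Repetitive input collapses dramatically, while random bytes barely shrink:

```python
import os
import zlib

# A highly repetitive message: a dictionary coder spots the repeats
# and replaces them with short back-references to earlier text.
data = b"aaaabbbbc" * 1000           # 9000 bytes of raw input
compressed = zlib.compress(data, 9)  # DEFLATE = LZ77 + Huffman coding
print(len(data), len(compressed))    # output is a tiny fraction of the input

# Random bytes have no redundancy to remove, so they stay nearly full size.
noise = os.urandom(9000)
print(len(zlib.compress(noise, 9)))
```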

Yuki said seriously, "I want to study more."

"Then next week, let's actually construct Huffman codes by hand," Aoi proposed.

"Let's do it!"

Riku nodded too. "I'll work hard."

"You both seem like you'll be good members," Aoi said with satisfaction.

Yuki's heart raced. The days in the information theory club were truly beginning.