"Aoi-senpai, can I ask a strange question?"
Yuki was comparing two graphs by the club room window.
"Strange questions are the most interesting. What is it?"
"Can we measure how 'different' two probability distributions are?"
Aoi's eyes lit up. "Kullback-Leibler divergence. It's called KL divergence."
"KL... distance?"
"Not exactly distance. But it can quantify the 'separation' between distributions."
Riku peeked over. "Cool name! Like KL-senpai."
"KL-senpai is fictional," Aoi chuckled. "But it might be a good metaphor. KL divergence always wants to measure the distance between two distributions."
Aoi wrote an equation in the notebook.
"D_KL(P||Q) = Σ P(x) log(P(x)/Q(x))"
"Ugh, another difficult-looking equation," Yuki frowned.
"Calm down. Let P be the true distribution and Q be the model's distribution. KL divergence represents the 'informational loss' when approximating P using Q."
"Informational loss?"
Riku raised his hand. "For example, say I make a weather forecast. The truth is 70 percent sunny, 30 percent rain, but my model predicts 50 percent sunny, 50 percent rain?"
"Good example, Riku," Aoi nodded. "KL divergence quantifies that error."
Yuki attempted the calculation. "Um, 0.7×log₂(0.7/0.5) + 0.3×log₂(0.3/0.5)..."
"About 0.12 bits. So Riku's model deviates from the true distribution by about 0.12 bits per forecast."
"Seems small, but over many forecasts it adds up to a big difference," Aoi added.
Riku pondered. "So if I perfectly predicted 70-30?"
"KL divergence is zero. Perfect match."
"But what if we measure the opposite? I mean, D_KL(Q||P)?"
Yuki started calculating, but Aoi stopped them.
"Generally, it's a different value. That's why KL divergence is asymmetric. This is why it's not a true distance."
"Asymmetric...?" Yuki asked curiously.
"Yes. The separation from P to Q and from Q to P aren't the same."
Aoi drew a diagram on the whiteboard.
"For example, suppose P is a broad distribution and Q is a narrow one. If rare events in P are predicted to occur frequently in Q, D_KL(P||Q) might be small. But if frequent events in Q are predicted to rarely occur in P, D_KL(Q||P) becomes large."
Riku tilted his head. "Like it changes depending on your perspective?"
"Exactly. For KL divergence, 'from whose viewpoint' matters."
Yuki suddenly thought of something. "Is this used in machine learning?"
"Sharp. In model training, we minimize KL divergence to bring the predicted distribution closer to the true distribution."
"I've heard of cross entropy," Riku said.
"That's part of KL divergence. H(P,Q) = H(P) + D_KL(P||Q). Since P's entropy is fixed, minimizing cross entropy is the same as minimizing KL divergence."
Yuki's eyes sparkled. "That's why we use cross entropy as a loss function!"
"Correct. Information theory and machine learning intersect here."
Riku drew a picture in his notebook. "KL-senpai, always standing between two distributions saying, 'You guys are 0.12 bits off!'?"
Yuki and Aoi laughed.
"Riku's personifications sometimes hit the mark," Aoi admitted.
"But asymmetry means even KL-senpai's view changes with position."
"Philosophical," Yuki said.
"Information theory lies at the boundary of mathematics and philosophy," Aoi said quietly.
Outside the window, the sunset was beginning. KL-senpai continues measuring the distance between invisible distributions today, too.