KL divergence, or Kullback-Leibler divergence, is a measure of how one probability distribution diverges from a second, reference probability distribution. It is a fundamental concept in information theory and statistics, commonly used to quantify how different two probability distributions are.

Key Points about KL Divergence:

  1. Definition:

    • For discrete probability distributions ( P ) and ( Q ), the KL divergence from ( Q ) to ( P ) is defined as: $$ D_{KL}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} $$
    • For continuous distributions, the sum is replaced by an integral.
  2. Interpretation:

    • KL divergence measures the "extra" amount of information required to encode samples from distribution ( P ) using a code optimized for distribution ( Q ).
    • It is not symmetric, meaning ( D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P) ) in general.
  3. Properties:

    • Non-negativity: ( D_{KL}(P \parallel Q) \geq 0 ), with equality if and only if ( P = Q ) almost everywhere.
    • Asymmetry: As mentioned, it is not symmetric, and it does not satisfy the triangle inequality, so it is not a true metric or distance.
  4. Applications:

    • Machine Learning: Used in algorithms like Variational Autoencoders (VAEs) and in regularizing policy updates in reinforcement learning (e.g., PPO).
    • Statistics: Helps in hypothesis testing and model selection.
    • Information Theory: Measures the inefficiency of assuming that the distribution is ( Q ) when the true distribution is ( P ).
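The definition and properties above can be checked numerically. Below is a minimal sketch of the discrete formula (the function name `kl_divergence` and the example distributions are illustrative, not from any particular library):

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) in nats.

    Assumes p and q are probability vectors over the same support,
    with q[i] > 0 wherever p[i] > 0.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Terms with p[i] == 0 contribute nothing, by the convention 0 log 0 = 0.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(kl_divergence(p, q))  # non-negative
print(kl_divergence(q, p))  # generally a different value: asymmetry
print(kl_divergence(p, p))  # 0.0: equality holds iff P = Q
```

Running this shows all three properties at once: both divergences are non-negative, the two orderings give different numbers, and the divergence of a distribution from itself is exactly zero.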

KL divergence is a powerful tool for comparing probability distributions and is widely used in various fields of data science and machine learning.
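As a concrete instance of the VAE application mentioned above: when the approximate posterior is a diagonal Gaussian ( N(\mu, \sigma^2) ) and the prior is a standard normal, the KL term has a well-known closed form, ( \frac{1}{2} \sum_i (\sigma_i^2 + \mu_i^2 - 1 - \log \sigma_i^2) ). A small sketch (the function name is illustrative; `log_var` denotes ( \log \sigma^2 ) per dimension):

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian,
    the regularization term in the VAE objective."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# When the posterior equals the prior (mu = 0, sigma^2 = 1), the KL is 0.
print(gaussian_kl_to_standard_normal(np.zeros(3), np.zeros(3)))  # 0.0
```

This closed form is why VAE implementations never evaluate the KL integral numerically for Gaussian posteriors: it reduces to elementwise arithmetic on the encoder's outputs.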