Mutual information

In probability theory and information theory, the mutual information of two jointly distributed random variables is a quantity that measures the independence of the two variables. The unit of measurement of mutual information is the bit.

Informally, mutual information measures the information of X that is shared by Y. If X and Y are independent, then X contains no information about Y and vice versa, so their mutual information is zero. If X and Y are identical (this means they describe the same trial, not two trials of the same sort occurring together), then all information conveyed by X is shared with Y: knowing X reveals nothing new about Y and vice versa, therefore the mutual information is the same as the information conveyed by X (or Y) alone, namely the entropy of X. In a specific sense (see below), mutual information quantifies the distance between the joint distribution of X and Y and the product of their marginal distributions.

Formally, if the joint probability distribution of X and Y is p with p(x, y) = Pr(X=x, Y=y), the probability distribution of X alone is f with f(x) = Pr(X=x), and the probability distribution of Y alone is g with g(y) = Pr(Y=y), then the mutual information of X and Y is given by I(X, Y), defined as follows for the discrete case:

I(X,Y) = \sum_{x,y} p(x,y) \times \log_2 \frac{p(x,y)}{f(x)\,g(y)}.

In the continuous case, replace summation by a definite double integral:

I(X,Y) = \int_{(-\infty,\infty) \times (-\infty,\infty)} p(x,y) \times \log_2 \frac{p(x,y)}{f(x)\,g(y)} \; d(x,y)

Mutual information is a measure of independence in the following sense: I(X, Y) = 0 iff X and Y are independent random variables. This is easy to see in one direction: if X and Y are independent, then p(x,y) = f(x) × g(y), and therefore:

\log \frac{p(x,y)}{f(x)\,g(y)} = \log 1 = 0

Moreover, mutual information is nonnegative (i.e. I(X,Y) ≥ 0; see below) and symmetric (i.e. I(X,Y) = I(Y,X)).

Several generalizations of mutual information to more than two random variables have been proposed, but a widely agreed on definition has not yet emerged.

Relation to other quantities

Mutual information can be equivalently expressed as

I(X,Y) = H(X) - H(X | Y) = H(Y) - H(Y | X)

where H(X) and H(X|Y) are the unconditional entropy and conditional entropy of X, likewise H(Y) and H(Y|X) are the unconditional and conditional entropy of Y, with

H(X) = -\sum_x f(x) \,\log_2 f(x)

and

H(X\,|\,Y) = -\sum_y g(y) \sum_x \frac{p(x,y)}{g(y)} \times \log_2 \frac{p(x,y)}{g(y)}

Since H(X) > H(X|Y), this proves the nonnegativity property stated above.

Mutual information can also be expressed in terms of the Kullback-Leibler divergence. Let q(x, y) = f(x) × g(y); then

I(X,Y) = KL(p,q).

Furthermore, let hy(x) = p(x, y) / g(y). Then

I(X,Y) = \sum_y g(y) \sum_x h_y(x) \times \log_2 \frac{h_y(x)}{f(x)}
= \sum_y g(y) \; \mathit{KL}(h_y,f)
= EY[KL(hy,f)]

Thus mutual information can also be understood as the expectation of the Kullback-Leibler divergence between the conditional distribution h of X given Y and the univariate distribution f of X: the more different the distributions f and h, the greater the information gain.

Applications of mutual information

In many applications, one wants to maximize mutual information (thus increasing dependencies), which is often equivalent to minimizing conditional entropy. Examples include:

  • Discriminative training procedures for hidden Markov models have been proposed based on the maximum mutual information (MMI) criterion.
  • Mutual information has been used as a criterion for feature selection and feature transformations in machine learning.

References

  • Athanasios Papoulis. Probability, Random Variables, and Stochastic Processes, second edition. New York: McGraw-Hill, 1984. (See Chapter 15.)

This article is licensed under the GNU Free Documentation License. It uses material from Wikipedia article. Browse Wikipedia for more information.