#probability #math/information-theory #data-science/modeling #math/stats/uncertainty #math/stats/frequentist

Fisher Information quantifies the amount of information that an observable random variable $X$ carries about an unknown parameter $\theta$ upon which the probability of $X$ depends. In effect, it measures how precisely we can estimate the parameter $\theta$ from a sample.

Mathematical Definition

If $f(X; \theta)$ represents the PDF (or PMF) of the variable $X$ conditional on $\theta$, we first define the score function as the partial derivative of the natural logarithm of the likelihood function with respect to $\theta$: \(s(X; \theta) = \frac{\partial}{\partial \theta} \log f(X; \theta)\)

Because the expected value of the score function evaluated at the true parameter is zero, the Fisher Information $\mathcal{I}(\theta)$ is defined as the variance of the score function: \(\mathcal{I}(\theta) = \mathbb{E}\left[ \left( \frac{\partial}{\partial \theta} \log f(X; \theta) \right)^2 \right]\)

Under certain regularity conditions (namely, that the distribution is sufficiently smooth and we can swap the order of integration and differentiation), Fisher Information can be equivalently (and often more conveniently) expressed as the expected value of the negative second derivative of the log-likelihood: \(\mathcal{I}(\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log f(X; \theta)\right]\)
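As a quick numerical sanity check (a sketch using NumPy; the Bernoulli model, parameter value, and sample size are illustrative choices, not from the note), both definitions can be estimated by Monte Carlo and compared against the closed form $\mathcal{I}(p) = 1/(p(1-p))$:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3                              # true Bernoulli parameter (illustrative)
x = rng.binomial(1, p, size=200_000)

# Score: d/dp log f(x; p) = x/p - (1-x)/(1-p)
score = x / p - (1 - x) / (1 - p)

# Definition 1: variance of the score (its mean is ~0 at the true p)
fisher_var = np.mean(score**2)

# Definition 2: expected negative second derivative of the log-likelihood,
# since d^2/dp^2 log f(x; p) = -x/p^2 - (1-x)/(1-p)^2
fisher_hess = np.mean(x / p**2 + (1 - x) / (1 - p) ** 2)

closed_form = 1.0 / (p * (1 - p))    # analytic Fisher Information, ≈ 4.76
print(fisher_var, fisher_hess, closed_form)
```

Both Monte Carlo estimates agree with the analytic value up to sampling noise, illustrating the equivalence of the two expressions under the regularity conditions.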

Intuition

Geometrically, Fisher Information measures the curvature (or sharpness) of the log-likelihood function around its peak.

  • High Fisher Information implies a sharply peaked log-likelihood curve. Even small changes in $\theta$ noticeably change the likelihood of the data, allowing us to pinpoint the true parameter with high confidence and a small margin of error.
  • Low Fisher Information implies a flat log-likelihood curve. The data does not strongly distinguish between different values of $\theta$, leading to high uncertainty in our estimates.
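The curvature picture can be made concrete with a small sketch (Gaussian mean estimation with known variance; the sample sizes and grid are illustrative choices). For $n$ i.i.d. observations the total information is $n\mathcal{I}(\mu) = n/\sigma^2$, so a larger sample produces a more sharply curved log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0

def log_likelihood(mu, data):
    # Gaussian log-likelihood in mu (additive constants dropped;
    # they do not affect curvature)
    return -0.5 * np.sum((data - mu) ** 2) / sigma**2

mus = np.linspace(-1, 1, 201)
small = rng.normal(0.0, sigma, size=5)
large = rng.normal(0.0, sigma, size=500)

ll_small = np.array([log_likelihood(m, small) for m in mus])
ll_large = np.array([log_likelihood(m, large) for m in mus])

def curvature(ll, mus):
    # Numerical second difference; the Gaussian log-likelihood is quadratic
    # in mu, so curvature is the same everywhere on the grid
    i = len(ll) // 2
    h = mus[1] - mus[0]
    return -(ll[i - 1] - 2 * ll[i] + ll[i + 1]) / h**2

c_small = curvature(ll_small, mus)  # ≈ n/sigma^2 = 5
c_large = curvature(ll_large, mus)  # ≈ n/sigma^2 = 500
print(c_small, c_large)
```

The curvature of the log-likelihood grows in proportion to the sample size, which is exactly why more data yields tighter estimates.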

Important Connections

  • Cramér-Rao Lower Bound: The Fisher Information establishes the theoretical minimum variance for any unbiased estimator. Specifically, the variance of an unbiased estimator cannot be lower than the inverse of the Fisher Information, $1/\mathcal{I}(\theta)$ (or $1/(n\mathcal{I}(\theta))$ for $n$ i.i.d. observations).
  • Jeffreys Prior: In Bayesian inference, Fisher Information is used to construct the Jeffreys Prior ($p(\theta) \propto \sqrt{\mathcal{I}(\theta)}$), which is an objective, non-informative prior probability that remains invariant under reparameterization.
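The Cramér-Rao bound can be illustrated numerically (a sketch; the Bernoulli model and simulation sizes are illustrative assumptions). The sample proportion is an unbiased estimator of a Bernoulli $p$, and its variance $p(1-p)/n$ attains the bound $1/(n\mathcal{I}(p))$ exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, reps = 0.3, 100, 20_000

# The sample mean is the unbiased MLE of a Bernoulli proportion
samples = rng.binomial(1, p, size=(reps, n))
p_hat = samples.mean(axis=1)

empirical_var = p_hat.var()
# Cramér-Rao bound for n i.i.d. observations: 1 / (n * I(p)) = p(1-p)/n
crlb = p * (1 - p) / n

print(empirical_var, crlb)  # the sample proportion attains the bound
```

Estimators that attain the bound, like this one, are called efficient; for most models the bound is only approached asymptotically.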

For models parameterized by a vector of multiple parameters, this concept generalizes into the Fisher Information Matrix, whose elements are the covariances of the components of the score vector: \(\mathcal{I}_{ij}(\theta) = \mathbb{E}\left[\frac{\partial}{\partial \theta_i} \log f(X; \theta) \, \frac{\partial}{\partial \theta_j} \log f(X; \theta)\right]\)
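The matrix form can be estimated by averaging outer products of score vectors (a sketch; the $\mathcal{N}(\mu, \sigma)$ example and sample size are illustrative assumptions). For a Gaussian parameterized by $(\mu, \sigma)$ the analytic result is $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=500_000)

# Score vector of N(mu, sigma): partial derivatives of log f w.r.t. (mu, sigma)
s_mu = (x - mu) / sigma**2
s_sigma = (x - mu) ** 2 / sigma**3 - 1.0 / sigma

scores = np.stack([s_mu, s_sigma])   # shape (2, n)
fim = scores @ scores.T / x.size     # Monte Carlo estimate of E[score scoreᵀ]

# Analytic Fisher Information Matrix: diag(1/sigma^2, 2/sigma^2)
print(fim)
```

Note the off-diagonal entries are approximately zero: for a Gaussian, the scores for $\mu$ and $\sigma$ are uncorrelated, so the two parameters can be estimated independently (to first order).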
