Primer on efficiency and regular estimators

Sep 4, 2024 · Aleksandr Podkopaev

In this post, we briefly review some of the classic statistical concepts required for studying the optimality of different estimators. Concepts and definitions are supplemented with examples and proofs (when short and simple). We consider a simple parametric setting: we assume that the model class can be parameterized by a finite-dimensional parameter $\theta\in\Theta$.

Score function and Fisher information

Suppose that a sample $X_1,\dots, X_n$, where $X_i\sim p(x;\theta)$, is observed. When $p(x;\theta)$ is viewed as a function of $x$ with $\theta$ fixed, it is referred to as the probability density/mass function (PDF/PMF), and when it is viewed as a function of $\theta$ with $x$ fixed, it is referred to as the likelihood function. For now, we assume that $n=1$.

Definition. The score function is defined as the gradient of the log-likelihood function with respect to parameter vector $\theta$:

\[ s(\theta;x) = \nabla_\theta \log p(x;\theta) \]

Once $X$ is plugged into the score, $s(\theta; X)$ becomes a random variable (since $X$ is random). It is easy to see that it is a mean-zero random variable (expectations are taken with respect to $X\sim p(x;\theta)$):

\[ \begin{aligned} \mathbb{E} [s(\theta;X)] &= \int s(\theta;x) p(x;\theta) dx \\ &= \int \Big(\nabla_\theta \log p(x;\theta) \Big) p(x;\theta) dx \\ &= \int \frac{ \nabla_\theta p(x;\theta)}{p(x;\theta)} p(x;\theta) dx \\ &= \nabla_\theta \int p(x;\theta) dx \\ &= \nabla_\theta 1 = 0, \\ \end{aligned} \]

where we assume that the necessary regularity conditions hold.

Definition. The Fisher information matrix is defined as the expected outer product of the score function:

\[ I(\theta) = \mathbb{E}[s(\theta;X)s^\top(\theta;X)]. \]
Since the score is mean-zero, $I(\theta)$ is the covariance matrix of the score. Next, consider the expected Hessian of the log-likelihood:

\[ \begin{aligned} \mathbb{E} [\nabla_\theta^2 \log p(X;\theta)] &= \mathbb{E} [\nabla_\theta (\nabla_\theta \log p(X;\theta))] \\ &= \mathbb{E} \Big[\nabla_\theta \Big( \frac{\nabla_\theta p(X;\theta)}{p(X;\theta)}\Big)\Big] \\ &= \mathbb{E} \Big[ \frac{\nabla^2_\theta p(X;\theta)}{p(X;\theta)} - \frac{\nabla_\theta p(X;\theta) (\nabla_\theta p(X;\theta))^\top}{p^2(X;\theta)} \Big] \\ &= \mathbb{E} \Big[ \frac{\nabla^2_\theta p(X;\theta)}{p(X;\theta)} - s(\theta;X)s^\top(\theta;X) \Big]. \end{aligned} \]

Since (again under regularity conditions):

\[ \mathbb{E} \Big[ \frac{\nabla^2_\theta p(X;\theta)}{p(X;\theta)} \Big] = \int \nabla^2_\theta p(x;\theta) dx = \nabla^2_\theta \int p(x;\theta) dx = 0, \]

we get that:

\[ I_1(\theta) = - \mathbb{E} [\nabla_\theta^2 \log p(X;\theta)] = - \mathbb{E} [\nabla_\theta s(\theta;X)], \]

where the subscript highlights that this Fisher information matrix is computed based on a single observation. It is easy to see that if there are $n$ i.i.d. samples, then the corresponding Fisher information is $I(\theta) = nI_1(\theta)$. Lastly, Fisher information is parametrization-dependent, and there is a well-known result relating the Fisher information under different parametrizations.
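
As a quick sanity check, here is a minimal Python sketch (the Exponential($\theta$) model, with score $s(\theta;x)=1/\theta-x$ and Hessian $-1/\theta^2$, is chosen purely for illustration; numpy is assumed to be available) that verifies by Monte Carlo that the score has mean zero and that the two expressions for $I_1(\theta)$ agree:

```python
# Monte Carlo check of the score/Fisher-information identities for the
# Exponential(theta) model p(x; theta) = theta * exp(-theta * x), x > 0:
# the score is s(theta; x) = 1/theta - x, the Hessian of the log-likelihood
# is -1/theta^2, and I_1(theta) = 1/theta^2.
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0
x = rng.exponential(scale=1.0 / theta, size=1_000_000)

score = 1.0 / theta - x                # s(theta; X)
hessian = -np.ones_like(x) / theta**2  # second derivative of log p(X; theta)

print("E[score]     ~", score.mean())       # ~ 0
print("E[score^2]   ~", (score**2).mean())  # ~ 1/theta^2 = 0.25
print("-E[hessian]  ~", -hessian.mean())    # ~ 1/theta^2 = 0.25
```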

Cramer-Rao bound

In many cases, we care about the optimality of a particular estimator. Showing optimality consists of two steps: (a) showing that no estimator can do better than some benchmark, and (b) showing that a particular estimator does in fact attain that benchmark. Suppose that we measure optimality using squared risk (mean-squared error):

\[ R_{MSE}(\hat\theta, \theta) = \mathbb{E} \|\hat\theta-\theta\|^2. \]

Suppose that $\mathbb{E}\hat\theta= m$ (not necessarily equal to $\theta$). We can easily derive that MSE decomposes as follows:

\[ \begin{aligned} \mathbb{E} \|\hat\theta - \theta\|^2 &= \mathbb{E} \|\hat\theta - m + m- \theta \|^2\\ &= \mathbb{E} \|\hat\theta - m\|^2 + 2 \mathbb{E} [(\hat\theta - m)^\top (m- \theta)] + \mathbb{E} \| m- \theta\|^2\\ &= \mathbb{E} \|\hat\theta - m\|^2 + \| m- \theta\|^2\\ &= \mathbb{V} (\hat\theta) + \| m- \theta\|^2, \end{aligned} \]

where the first term is the variance of $\hat\theta$ and the second term is the (squared) bias of the estimator. If $\mathbb{E}\hat\theta = \theta$, i.e., the estimator is unbiased, then the MSE-optimal estimator is the one with the lowest variance (note that in many cases, we can achieve lower MSE by reducing variance at the cost of introducing some bias).
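
This decomposition is easy to verify numerically. The sketch below uses a deliberately biased shrinkage estimator $c\,\overline{X}_n$ of a Gaussian mean (the constant $c$ and the settings are illustrative, not from the text) and checks by Monte Carlo that the MSE matches variance plus squared bias:

```python
# Monte Carlo check that MSE = variance + squared bias for the (biased)
# shrinkage estimator c * xbar of the mean of N(mu, 1) observations.
import numpy as np

rng = np.random.default_rng(1)
mu, n, c, reps = 1.5, 20, 0.9, 200_000

x = rng.normal(loc=mu, scale=1.0, size=(reps, n))
theta_hat = c * x.mean(axis=1)

mse = np.mean((theta_hat - mu) ** 2)
variance = theta_hat.var()
bias_sq = (theta_hat.mean() - mu) ** 2

print("MSE               ~", mse)                 # ~ c^2/n + (c - 1)^2 * mu^2
print("variance + bias^2 ~", variance + bias_sq)  # should match the MSE
```

The benchmark for unbiased estimators is presented in the following result.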

Theorem (Cramer-Rao bound). Suppose that a sample $X_1,\dots, X_n$, where $X_i\sim p(x;\theta)$, is observed, and that $\hat\theta$ is an unbiased estimator of $\theta$. Then:

\[ \mathbb{V} (\hat\theta) \geq \frac{1}{I(\theta)} = \frac{1}{nI_1(\theta)}. \]
Short proof

We consider the univariate case (the multivariate case is analogous, with scalar inequalities replaced by matrix inequalities). In addition, we assume $n=1$, but the proof is identical after replacing $s(\theta;X)$ with $\sum_{i=1}^n s(\theta;X_i)$. Below, all expectations are taken with respect to $p(x;\theta)$. Since $\mathbb{E} [s(\theta; X)] = 0$, we have:

\[ \text{Cov} (\hat\theta, s(\theta; X)) = \mathbb{E} [\hat\theta s(\theta; X) ]. \]

Moreover, we have (note that $\hat\theta$ depends only on the sample, not on the parameter):

\[ \begin{aligned} \mathbb{E} [\hat\theta \cdot s(\theta; X) ] &= \int \hat\theta (x) \cdot (\nabla_\theta \log p(x;\theta)) \cdot p(x;\theta) dx \\ &= \int \hat\theta (x) \cdot \nabla_\theta p(x;\theta) dx \\ &= \nabla_\theta \int \hat\theta(x) \cdot p(x;\theta) dx \\ &= \nabla_\theta (\mathbb{E} \hat\theta(X)) \\ &= \nabla_\theta \theta = 1, \end{aligned} \]

where the unbiasedness is used in the last line. By Cauchy-Schwarz, we have:

\[ |\text{Cov} (\hat\theta, s(\theta; X))|^2 \leq \mathbb{V} (\hat\theta) \cdot \mathbb{V} [s(\theta; X)]. \]

Therefore,

\[ \mathbb{V} (\hat\theta) \geq \frac{|\text{Cov} (\hat\theta, s(\theta; X))|^2}{\mathbb{V} [s(\theta; X)]} = \frac{1}{I_1(\theta)}. \]

Side note: a version that can be used for vectors is

\[ \mathbb{V}(Y) \geq \text{Cov}(Y,X) \mathbb{V}^{-1}(X)\text{Cov}(X,Y), \]

where inequalities are interpreted in a matrix sense, i.e., the difference is positive-semidefinite (PSD).
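
To see the bound in action, the following minimal sketch (settings are illustrative) compares two unbiased estimators of the mean of $\mathcal{N}(\mu,1)$ data: here $I_1(\mu)=1$, so the bound is $1/n$; the sample mean attains it, while the sample median (also unbiased for this symmetric distribution) has variance roughly $\pi/(2n)$ and does not:

```python
# Monte Carlo illustration of the Cramer-Rao bound for N(mu, 1) data:
# I_1(mu) = 1, so any unbiased estimator has variance >= 1/n.
import numpy as np

rng = np.random.default_rng(2)
mu, n, reps = 0.7, 50, 200_000

x = rng.normal(loc=mu, scale=1.0, size=(reps, n))
sample_mean = x.mean(axis=1)
sample_median = np.median(x, axis=1)

print("Cramer-Rao bound   =", 1.0 / n)             # 0.02
print("var(sample mean)   ~", sample_mean.var())   # ~ 1/n (attains the bound)
print("var(sample median) ~", sample_median.var()) # ~ pi/(2n) > 1/n
```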

One may also care about estimating some function of $\theta$: $\psi(\theta)$. Suppose that $\hat\psi$ is an unbiased estimator for $\psi(\theta)$, i.e., $\mathbb{E} [\hat\psi] = \psi(\theta)$. Then it holds that:

\[ \mathbb{V}(\hat\psi) \geq \frac{[\psi'(\theta)]^2}{I(\theta)}. \]

The proof follows exactly the same argument as the one above, but relies on the following result for covariance:

\[ \begin{aligned} \text{Cov} (\hat\psi, s(\theta; X)) &= \mathbb{E} [\hat\psi \cdot s(\theta; X)] \\ &= \nabla_\theta [\mathbb{E} [\hat\psi]] \\ &= \nabla_\theta [\psi(\theta)]. \end{aligned} \]

We can also view the above result as a generalization of the Cramer-Rao bound to biased estimators, i.e., the target is $\theta$ but the estimator $\hat\theta$ satisfies $\mathbb{E}\hat\theta\neq \theta$. In this case, the result provides an optimality benchmark for estimators with the same bias: the variance of any estimator whose mean function is $\psi(\theta)$ (equivalently, whose bias is $\psi(\theta)-\theta$) is lower bounded by the quantity above.
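
As a concrete illustration (not from the original text), consider estimating $\psi(\mu)=\mu^2$ from $\mathcal{N}(\mu,1)$ data. Since $\mathbb{E}[\overline{X}_n^2]=\mu^2+1/n$, the estimator $\overline{X}_n^2-1/n$ is unbiased for $\mu^2$, and its exact variance, $4\mu^2/n+2/n^2$, strictly exceeds the bound $[\psi'(\mu)]^2/(nI_1(\mu))=4\mu^2/n$:

```python
# Monte Carlo illustration of the generalized bound for psi(mu) = mu^2
# under N(mu, 1) data: xbar^2 - 1/n is unbiased for mu^2 and its variance
# (exactly 4 * mu^2 / n + 2 / n^2) exceeds the bound 4 * mu^2 / n.
import numpy as np

rng = np.random.default_rng(3)
mu, n, reps = 1.0, 50, 200_000

xbar = rng.normal(loc=mu, scale=1.0, size=(reps, n)).mean(axis=1)
psi_hat = xbar**2 - 1.0 / n  # unbiased estimator of mu^2

print("E[psi_hat]        ~", psi_hat.mean())  # ~ mu^2 = 1.0
print("var(psi_hat)      ~", psi_hat.var())   # ~ 4*mu^2/n + 2/n^2 = 0.0808
print("generalized bound =", 4 * mu**2 / n)   # 0.08
```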

Super-efficiency

As discussed above, the variance of any unbiased estimator is lower bounded by the Cramer-Rao bound. Generally, most reasonable estimators are at least asymptotically unbiased: the bias goes to zero as $n\to\infty$ but may not be exactly zero for any finite $n$.

One may expect that the asymptotic variance of such estimators is also greater than or equal to the Cramer-Rao bound. This is generally the case: estimators whose asymptotic variance equals the Cramer-Rao bound are referred to as asymptotically efficient (e.g., the MLE is efficient for parametric models under suitable regularity conditions).

However, one can also construct estimators that: (a) are asymptotically unbiased, and (b) have asymptotic variance matching the Cramer-Rao lower bound for most parameter values $\theta\in\Theta$, yet strictly smaller asymptotic variance for some parameter values. We consider a canonical example of such a super-efficient estimator.

Hodges’ (Hodges-Le Cam) estimator

Suppose that one observes a sample $\{X_i\}_{i=1}^n$, where $X_i\sim \mathcal{N}(\mu,1)$. The MLE of $\mu$ is given by the sample average:

\[ \overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i. \]

After $\sqrt{n}$-scaling, it follows from the central limit theorem (CLT) that:

\[ \sqrt{n}(\overline{X}_n-\mu)\overset{d}{\to} \mathcal{N}(0,1) \]

(the above is simply the CLT; for a sample of normally distributed random variables, the display actually holds with equality in distribution for every $n$). In other words, the (asymptotic) variance equals one for all parameter values $\mu$ (after root-$n$ scaling) and can be shown to match the Cramer-Rao lower bound.

Next, we introduce an estimator (Hodges’ estimator) that improves upon the MLE at $\mu=0$. Hodges’ estimator is defined as

\[ \hat\mu_n = \overline{X}_n \times \textbf{1}\{|\overline{X}_n|>n^{-1/4}\}, \]

i.e., this estimator is a thresholded version of the sample mean. First, consider the asymptotic behavior of Hodges’ estimator:

  • Suppose that $\mu=0$. Since $|\overline{X}_n|\leq n^{-1/4}$ implies $\hat{\mu}_n=0$, we have: \[ \begin{aligned} P_{\mu=0}(\sqrt{n}\hat\mu_n=0) &= P_{\mu=0}(\hat\mu_n=0) \\ &\geq P_{\mu=0}(|\overline{X}_n|\leq n^{-1/4}) \\ &= P_{\mu=0}(\sqrt{n}|\overline{X}_n|\leq n^{1/4}) \\ &= \Phi(n^{1/4})-\Phi(-n^{1/4}) \overset{n\to\infty}{\to} 1, \end{aligned} \] where $\Phi(\cdot)$ is the CDF of standard normal random variable. Therefore, $\sqrt{n}\hat{\mu}_n = \sqrt{n}(\hat{\mu}_n-\mu)$ converges to a degenerate random variable, and hence, the asymptotic variance is zero (after root-$n$ scaling).
  • Suppose that $\mu \neq 0$. We know that $\overline{X}_n$ converges almost surely to $\mu$ with $|\mu|>0$, so almost surely $|\overline{X}_n|> n^{-1/4}$ for all sufficiently large $n$, and hence $\hat\mu_n=\overline{X}_n$ eventually. Therefore, $\sqrt{n} (\hat\mu_n-\overline{X}_n)\overset{a.s.}{\to} 0$, and hence, $\sqrt{n}(\hat\mu_n-\mu) = \sqrt{n}(\hat\mu_n-\overline{X}_n) + \sqrt{n}(\overline{X}_n-\mu)\overset{d}{\to}\mathcal{N}(0,1)$ by Slutsky’s theorem, which in turn implies that the asymptotic variance is one (after root-$n$ scaling). A simulation illustrating both regimes is sketched right after this list.
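
Since $\overline{X}_n\sim\mathcal{N}(\mu,1/n)$ exactly, the minimal sketch below draws $\overline{X}_n$ directly rather than simulating full samples (settings are illustrative):

```python
# Hodges' estimator mu_hat = xbar * 1{|xbar| > n^(-1/4)}: the variance of
# sqrt(n) * (mu_hat - mu) is close to 0 at mu = 0 and close to 1 at mu != 0.
import numpy as np

rng = np.random.default_rng(4)
n, reps = 10_000, 500_000

for mu in (0.0, 1.0):
    xbar = rng.normal(loc=mu, scale=1.0 / np.sqrt(n), size=reps)  # exact law of the mean
    hodges = xbar * (np.abs(xbar) > n ** (-0.25))
    scaled = np.sqrt(n) * (hodges - mu)
    print(f"mu = {mu}: variance of sqrt(n) * (hodges - mu) ~ {scaled.var():.4f}")
```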

It may seem that the MLE is sub-optimal because Hodges’ estimator is super-efficient: it has no worse asymptotic variance for any $\mu$ and strictly better asymptotic variance at $\mu=0$ (despite being only asymptotically unbiased). While super-efficiency may seem like a desirable property, this is not necessarily the case.

In the case of Hodges’ estimator, super-efficiency is gained at the expense of poor estimation in the neighborhood of zero. To illustrate that, we consider the (exact) squared risk of both estimators for different sample sizes and mean values. In the plot below, the risk of both estimators is scaled (multiplied by $n$) for visualization purposes: equivalently, the scaled risk of Hodges’ estimator can be read as the ratio of its risk to that of the MLE.

Squared risk of Hodges' estimator (derivation)

Under the Gaussian setting, we have $\overline{X}_n \sim\mathcal{N}(\mu,1/n)$. Therefore, $\mathbb{E}[\overline{X}_n^2]=\mu^2+1/n$. We have:

\[ \begin{aligned} \mathbb{E} [\overline{X}_n \times \textbf{1}\{|\overline{X}_n|> n^{-1/4}\}] &= \mathbb{E} [\overline{X}_n]-\mathbb{E} [\overline{X}_n \times \textbf{1}\{|\overline{X}_n|\leq n^{-1/4}\}]\\ &= \mu-\mathbb{E} [\overline{X}_n \times \textbf{1}\{|\overline{X}_n|\leq n^{-1/4}\}]. \end{aligned} \]

Focusing on the second term, we obtain:

\[ \begin{aligned} & \quad \mathbb{E} [\overline{X}_n \times \textbf{1}\{|\overline{X}_n|\leq n^{-1/4}\}] \\ &= \int_{-n^{-1/4}}^{n^{-1/4}} \frac{x\sqrt{n}}{\sqrt{2\pi}}\exp\Big(-\frac{n(x-\mu)^2}{2}\Big)dx \\ &= \int_{(-n^{-1/4}-\mu)\sqrt{n}}^{(n^{-1/4}-\mu)\sqrt{n}} \frac{(t/\sqrt{n}+\mu)}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt \\ &= \frac{1}{\sqrt{n}}\int_{(-n^{-1/4}-\mu)\sqrt{n}}^{(n^{-1/4}-\mu)\sqrt{n}} \frac{t}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt \\ &+ \mu\cdot\int_{(-n^{-1/4}-\mu)\sqrt{n}}^{(n^{-1/4}-\mu)\sqrt{n}} \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt \\ &= -\frac{1}{\sqrt{n}} \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big) \Big|_{(-n^{-1/4}-\mu)\sqrt{n}}^{(n^{-1/4}-\mu)\sqrt{n}} \\ &+ \mu \Big( \Phi((n^{-1/4}-\mu)\sqrt{n}) - \Phi((-n^{-1/4}-\mu)\sqrt{n}) \Big). \end{aligned} \]

Next, let’s derive the second moment of Hodges’ estimator; as before, it suffices to compute the truncated piece:

\[ \begin{aligned} & \quad \mathbb{E} [\overline{X}_n^2 \times \textbf{1}\{|\overline{X}_n|\leq n^{-1/4}\}] \\ &= \int_{-n^{-1/4}}^{n^{-1/4}} \frac{x^2\sqrt{n}}{\sqrt{2\pi}}\exp\Big(-\frac{n(x-\mu)^2}{2}\Big)dx \\ &= \int_{(-n^{-1/4}-\mu)\sqrt{n}}^{(n^{-1/4}-\mu)\sqrt{n}} \frac{(t/\sqrt{n}+\mu)^2}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt \\ &= \int_{(-n^{-1/4}-\mu)\sqrt{n}}^{(n^{-1/4}-\mu)\sqrt{n}} \frac{t^2n^{-1}}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt \\ &+ \int_{(-n^{-1/4}-\mu)\sqrt{n}}^{(n^{-1/4}-\mu)\sqrt{n}} \frac{2t\mu n^{-1/2}}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt \\ &+ \int_{(-n^{-1/4}-\mu)\sqrt{n}}^{(n^{-1/4}-\mu)\sqrt{n}} \frac{\mu^2}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt. \\ \end{aligned} \]

For the first term (the second and third are handled analogously to the mean case), we have:

\[ \begin{aligned} &\quad \int_{(-n^{-1/4}-\mu)\sqrt{n}}^{(n^{-1/4}-\mu)\sqrt{n}} \frac{t^2n^{-1}}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt \\ &= -\frac{tn^{-1}}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)\Big|_{(-n^{-1/4}-\mu)\sqrt{n}}^{(n^{-1/4}-\mu)\sqrt{n}}\\ &+ \int_{(-n^{-1/4}-\mu)\sqrt{n}}^{(n^{-1/4}-\mu)\sqrt{n}} \frac{n^{-1}}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt, \end{aligned} \]

which completes the derivation of the first and second moments of Hodges’ estimator. The bias and variance follow immediately.
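
For completeness, here is a short sketch (the helper `hodges_scaled_risk` is ours; scipy is assumed to be available) that turns the closed-form moments above into the exact scaled risk $n\,\mathbb{E}(\hat\mu_n-\mu)^2$; evaluating it on a grid of $\mu$ reproduces the qualitative picture described next, with values well below one near $\mu=0$ and a pronounced spike at intermediate values of $\mu$:

```python
# Exact scaled risk n * E[(mu_hat - mu)^2] of Hodges' estimator, assembled
# from the truncated moments derived above; the scaled risk of the MLE is 1.
import numpy as np
from scipy.stats import norm


def hodges_scaled_risk(mu, n):
    a = (-n ** (-0.25) - mu) * np.sqrt(n)  # lower limit after substitution
    b = (n ** (-0.25) - mu) * np.sqrt(n)   # upper limit after substitution
    d_phi = norm.cdf(b) - norm.cdf(a)
    # E[xbar * 1{|xbar| <= n^(-1/4)}]
    m1_trunc = (norm.pdf(a) - norm.pdf(b)) / np.sqrt(n) + mu * d_phi
    # E[xbar^2 * 1{|xbar| <= n^(-1/4)}]
    m2_trunc = ((a * norm.pdf(a) - b * norm.pdf(b) + d_phi) / n
                + 2 * mu * (norm.pdf(a) - norm.pdf(b)) / np.sqrt(n)
                + mu**2 * d_phi)
    m1 = mu - m1_trunc               # E[mu_hat]
    m2 = mu**2 + 1.0 / n - m2_trunc  # E[mu_hat^2]
    return n * (m2 - 2 * mu * m1 + mu**2)


for n in (100, 1000):
    grid = np.linspace(-1.0, 1.0, 9)
    print(n, np.round([hodges_scaled_risk(mu, n) for mu in grid], 2))
```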

The (scaled) squared risk of the MLE is equal to one, irrespective of the sample size or the parameter value $\mu$. The (scaled) squared risk of Hodges’ estimator depends on both the sample size and the parameter value. We observe that super-efficiency at zero is gained at the expense of poor estimation in the neighborhood of zero: for every value of $n$, there is some value of $\mu$ for which the risk of Hodges’ estimator is much higher than that of the MLE.

Regular estimators

The existence of super-efficient estimators is one of the reasons for introducing the notion of regular estimators. Informally, the limiting distribution of a regular estimator (after appropriate normalization) does not change under small perturbations of the true parameter $\theta\in\Theta$.

Consider a Local Data Generating Process (LDGP) at $\theta_0$: for each $n$, consider a (triangular) array of random variables $\{X_{in}\}_{i=1}^n$, where $X_{in}\sim P_{\theta_n}$, $i=1,\dots,n$, and $\sqrt{n}(\theta_n-\theta_0)\overset{n\to\infty}{\to} h\in\mathbb{R}^k$, i.e., $\theta_n$ approaches the fixed parameter $\theta_0$.

Definition [Regular estimator]. $T_n=T_n(X_{1n},\dots,X_{nn})$ is called a regular estimator of $\psi(\theta)$ at $\theta_0$ if for every LDGP at $\theta_0$ (i.e., if for every sequence $\{\theta_n\}_{n=1}^\infty$ with $\sqrt{n}(\theta_n-\theta_0)\to h\in\mathbb{R}^k$ for some $h$) it holds that:

\[ \sqrt{n}(T_n-\psi(\theta_n)) \overset{d}{\to} L_{\theta_0}, \]

where the convergence in distribution is under the law $P_{\theta_n}$ (i.e., for each $n$ the data distribution is different), and the limiting distribution $L_{\theta_0}$ does not depend on the particular sequence $\{\theta_n\}$ (in particular, it does not depend on $h$).

Let us consider the MLE and Hodges’ estimator from the previous section, and examine regularity at $\mu_0=0$. Fix any $h\in \mathbb{R}$ and consider $\mu_n = \mu_0 + h/\sqrt{n} = h/\sqrt{n}$. We have:

  • For any $n$, it holds for the MLE $\overline{X}_n$ that $\overline{X}_n\sim \mathcal{N}(\mu_n,1/n)$. Therefore, $\sqrt{n}(\overline{X}_n-\mu_n)\sim \mathcal{N}(0,1)$, and hence the limiting (here, exact) distribution does not depend on $h$.

  • Note that: $\overline{X}_n \overset{d}{=}\mu_n+\frac{1}{\sqrt{n}}Z$ for $Z\sim\mathcal{N}(0,1)$. For Hodges’ estimator, it holds that:

    \[ \begin{aligned} \hat\mu_n &= \overline{X}_n \times \textbf{1}\{|\overline{X}_n|> n^{-1/4}\}\\ &= \Big( \mu_n+\frac{1}{\sqrt{n}}Z \Big) \times \textbf{1}\{| \mu_n+\frac{1}{\sqrt{n}}Z |> n^{-1/4}\} \\ &= \Big( \frac{h}{\sqrt{n}}+\frac{1}{\sqrt{n}}Z \Big) \times \textbf{1}\{| \frac{h}{\sqrt{n}}+\frac{1}{\sqrt{n}}Z |> n^{-1/4}\}\\ &= \frac{1}{\sqrt{n}} \Big( h+Z \Big) \times \textbf{1}\{| h+Z |> n^{1/4}\}. \end{aligned} \]

    Therefore, we have:

    \[ \begin{aligned} \sqrt{n} (\hat\mu_n - \mu_n) &= \sqrt{n} \hat\mu_n - h\\ &= \Big( h+Z \Big) \times \textbf{1}\{| h+Z |> n^{1/4}\} - h. \end{aligned} \]

    Note that $W_n:=\textbf{1}\{| h+Z |> n^{1/4}\}$ defines a sequence of Bernoulli random variables. We can easily show that $W_n \overset{d}{\to}0$: it suffices to show that for any $c\in(0,1)$, $P(W_n\leq c)\to 1$ as $n\to\infty$. Indeed,

    \[ \begin{aligned} P(W_n\leq c) &=P(W_n=0) \\ &= P(| h+Z |\leq n^{1/4})\\ & = \Phi(n^{1/4}-h)-\Phi(-n^{1/4}-h)\overset{n\to\infty}{\to} 1. \end{aligned} \]

    Therefore (using Slutsky’s theorem)

    \[ \begin{aligned} \Big( h+Z \Big) \times \textbf{1}\{| h+Z |> n^{1/4}\} - h \overset{d}{\to} -h, \end{aligned} \]

    and hence, the limiting distribution of $\sqrt{n}(\hat\mu_n-\mu_n)$ is a point mass at $-h$, which depends on $h$; we conclude that Hodges’ estimator is not regular at zero (a short simulation illustrating this is sketched below).
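
A quick simulation sketch under the local sequence $\mu_n=h/\sqrt{n}$ (again drawing $\overline{X}_n$ directly from its exact $\mathcal{N}(\mu_n,1/n)$ law; settings are illustrative) makes the contrast concrete: the scaled MLE error is standard normal regardless of $h$, while Hodges’ estimator piles up near $-h$:

```python
# Behavior under the local data generating process mu_n = h / sqrt(n):
# sqrt(n) * (xbar - mu_n) is exactly N(0, 1) for any h, while for Hodges'
# estimator the scaled error concentrates near -h, so the limit depends on h.
import numpy as np

rng = np.random.default_rng(5)
n, reps, h = 10_000, 200_000, 2.0
mu_n = h / np.sqrt(n)

xbar = rng.normal(loc=mu_n, scale=1.0 / np.sqrt(n), size=reps)
hodges = xbar * (np.abs(xbar) > n ** (-0.25))

for name, est in [("MLE", xbar), ("Hodges", hodges)]:
    scaled = np.sqrt(n) * (est - mu_n)
    print(f"{name:7s}: mean ~ {scaled.mean():+.3f}, std ~ {scaled.std():.3f}")
```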

Conclusion

This is the first post in a series, and there is a lot more to cover. For example, the James-Stein estimator, another famous example of a non-regular estimator, deserves a separate discussion of its own (as well as the corresponding implications, such as the inadmissibility of the MLE). Regularity plays an important role in the asymptotic analysis of efficient estimators and will be recalled in future posts.
