Primer on efficiency and regular estimators

Sep 4, 2024
Aleksandr Podkopaev

In this post, we briefly review some of the classic statistical concepts required for studying the optimality of different estimators. Concepts and definitions are supplemented with examples and proofs (when short and simple). We consider a simple parametric setting: we assume that the model class can be parameterized by a finite-dimensional parameter $\theta\in\Theta$.

Score function and Fisher information

Suppose that a sample $X_1,\dots, X_n$, where $X_i\sim p(x;\theta)$, is observed. When $p(x;\theta)$ is viewed as a function of $x$ with $\theta$ fixed, it is referred to as the probability density/mass function (PDF/PMF), and when it is viewed as a function of $\theta$ with $x$ fixed, it is referred to as the likelihood function. For now, we assume that $n=1$.

Definition. The score function is defined as the gradient of the log-likelihood function with respect to the parameter vector $\theta$:

$$ s(\theta;x) = \nabla_\theta \log p(x;\theta). $$

Once $X$ is plugged into the score, $s(\theta; X)$ becomes a random variable (since $X$ is random). It is easy to see that it is a mean-zero random variable (expectations are taken wrt $X\sim p(x;\theta)$):

$$ \begin{aligned} \mathbb{E} [s(\theta;X)] &= \int s(\theta;x) p(x;\theta) dx \\ &= \int \Big(\nabla_\theta \log p(x;\theta) \Big) p(x;\theta) dx \\ &= \int \frac{ \nabla_\theta p(x;\theta)}{p(x;\theta)} p(x;\theta) dx \\ &= \nabla_\theta \int p(x;\theta) dx \\ &= \nabla_\theta 1 = 0, \end{aligned} $$

where we assume that the necessary regularity conditions hold.

Definition. The Fisher information matrix is defined as the expected outer product of the score function:

$$ I(\theta) = \mathbb{E}[s(\theta;X)s^\top(\theta;X)]. $$
Since the score is mean-zero, $I(\theta)$ is the covariance matrix of the score. Consider the expected Hessian (matrix of second derivatives) of the log-likelihood:

$$ \begin{aligned} \mathbb{E} (\nabla_\theta^2 \log p(X;\theta)) &= \mathbb{E} (\nabla_\theta (\nabla_\theta \log p(X;\theta))) \\ &= \mathbb{E} \Big(\nabla_\theta \big( \frac{\nabla_\theta p(X;\theta)}{p(X;\theta)}\big)\Big) \\ &= \mathbb{E} \Big( \frac{\nabla^2_\theta p(X;\theta)}{p(X;\theta)} - \frac{\nabla_\theta p(X;\theta) (\nabla_\theta p(X;\theta))^\top}{p^2(X;\theta)} \Big) \\ &= \mathbb{E} \Big( \frac{\nabla^2_\theta p(X;\theta)}{p(X;\theta)} - s(\theta;X)s^\top(\theta;X) \Big). \end{aligned} $$

Since (again under regularity conditions):

$$ \mathbb{E} \Big( \frac{\nabla^2_\theta p(X;\theta)}{p(X;\theta)} \Big) = \int \nabla^2_\theta p(x;\theta) dx = \nabla^2_\theta \int p(x;\theta) dx = 0, $$

we get that:

$$ I_1(\theta) = - \mathbb{E} (\nabla_\theta^2 \log p(X;\theta)) = - \mathbb{E} (\nabla_\theta s(\theta;X)), $$

where the subscript highlights that this Fisher information matrix is computed based on a single observation. It is easy to see that if there are $n$ i.i.d. samples, then the corresponding Fisher information is $I(\theta) = nI_1(\theta)$. Lastly, Fisher information is parametrization-dependent, and there is a well-known change-of-variables result that relates the Fisher information under different parametrizations.
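As a quick numerical sanity check (not part of the classical argument above), here is a minimal Python sketch, assuming numpy is available, that verifies the two identities for a Bernoulli($\theta$) model, where $s(\theta;x) = x/\theta - (1-x)/(1-\theta)$ and $I_1(\theta) = 1/(\theta(1-\theta))$; the particular value of $\theta$ is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                                   # true Bernoulli parameter (arbitrary)
x = rng.binomial(1, theta, size=1_000_000)    # one observation per Monte Carlo draw

# Score of a single Bernoulli observation: d/dtheta [x log(theta) + (1-x) log(1-theta)]
score = x / theta - (1 - x) / (1 - theta)
# Second derivative of the log-likelihood (the 1-D "Hessian")
hess = -x / theta**2 - (1 - x) / (1 - theta)**2

print("E[score]        ~", score.mean())       # should be close to 0
print("E[score^2]      ~", (score**2).mean())  # should be close to I_1(theta)
print("-E[hessian]     ~", -hess.mean())       # should be close to I_1(theta)
print("closed-form I_1 =", 1 / (theta * (1 - theta)))
```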

Cramer-Rao bound

In many cases, we care about the optimality of a particular estimator. Showing optimality consists of two steps: (a) showing that no estimator can do better than some benchmark, and (b) showing that a particular estimator does in fact attain that benchmark. Suppose that we measure optimality using squared risk (mean-squared error):

$$ R_{MSE}(\hat\theta, \theta) = \mathbb{E} \|\hat\theta-\theta\|^2. $$

Suppose that $\mathbb{E}\hat\theta = m$ (not necessarily equal to $\theta$). We can easily derive that the MSE decomposes as follows:

$$ \begin{aligned} \mathbb{E} \|\hat\theta - \theta\|^2 &= \mathbb{E} \|\hat\theta - m + m- \theta \|^2\\ &= \mathbb{E} \|\hat\theta - m\|^2 + 2 \mathbb{E} [(\hat\theta - m)^\top (m- \theta)] + \mathbb{E} \| m- \theta\|^2\\ &= \mathbb{E} \|\hat\theta - m\|^2 + \| m- \theta\|^2\\ &= \mathbb{V} (\hat\theta) + \| m- \theta\|^2, \end{aligned} $$

where the first term is the variance of $\hat\theta$ and the second term is the (squared) bias of the estimator. If $\mathbb{E}\hat\theta = \theta$, i.e., the estimator is unbiased, then the MSE-optimal estimator is the one that has the lowest variance (note that in many cases, we can achieve lower MSE by reducing variance at the cost of introducing some bias). The benchmark for unbiased estimators is presented in the following result.
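To make the decomposition concrete, here is a short Monte Carlo sketch (assuming numpy; the shrinkage estimator $c\,\overline{X}_n$ and all parameter values are my own illustrative choices) checking that MSE equals variance plus squared bias.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, c, reps = 2.0, 20, 0.8, 200_000   # all values are arbitrary illustrations

x = rng.normal(mu, 1.0, size=(reps, n))
theta_hat = c * x.mean(axis=1)           # deliberately biased (shrunken) estimator of mu

mse = ((theta_hat - mu) ** 2).mean()
var = theta_hat.var()
bias_sq = (theta_hat.mean() - mu) ** 2

print("MSE               ~", mse)
print("variance + bias^2 ~", var + bias_sq)   # the two quantities should agree
```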

Theorem (Cramer-Rao bound). Suppose that $X_1,\dots, X_n$ is a sample with $X_i\sim p(x;\theta)$, and that $\hat\theta$ is an unbiased estimator of $\theta$. Then:

$$ \mathbb{V} (\hat\theta) \geq \frac{1}{I(\theta)} = \frac{1}{nI_1(\theta)}. $$
Short proof

We consider the univariate case (the multivariate case is analogous, with scalar inequalities replaced by matrix inequalities). In addition, we assume $n=1$, but the proof is identical after replacing $s(\theta;X)$ with $\sum_{i=1}^n s(\theta;X_i)$. Below, all expectations are taken wrt $p(x;\theta)$. Since $\mathbb{E}\, s(\theta; X) = 0$, we have:

$$ \text{Cov} (\hat\theta, s(\theta; X)) = \mathbb{E} [\hat\theta s(\theta; X) ]. $$

Moreover, we have ($\hat\theta$ depends only on the sample, not on the parameter):

$$ \begin{aligned} \mathbb{E} [\hat\theta \cdot s(\theta; X) ] &= \int \hat\theta (x) \cdot (\nabla_\theta \log p(x;\theta)) \cdot p(x;\theta) dx \\ &= \int \hat\theta (x) \cdot \nabla_\theta p(x;\theta) dx \\ &= \nabla_\theta \int \hat\theta(x) \cdot p(x;\theta) dx \\ &= \nabla_\theta (\mathbb{E} \hat\theta(X)) \\ &= \nabla_\theta \theta = 1, \end{aligned} $$

where the unbiasedness is used in the last line. By Cauchy-Schwarz, we have:

$$ |\text{Cov} (\hat\theta, s(\theta; X))|^2 \leq \mathbb{V} (\hat\theta) \cdot \mathbb{V} [s(\theta; X)]. $$

Therefore,

$$ \mathbb{V} (\hat\theta) \geq \frac{|\text{Cov} (\hat\theta, s(\theta; X))|^2}{\mathbb{V} [s(\theta;X)]} = \frac{1}{I_1(\theta)}. $$

Side note: a version that can be used for vectors is

$$ \mathbb{V}(Y) \geq \text{Cov}(Y,X)\, \mathbb{V}^{-1}(X)\, \text{Cov}(X,Y), $$

where inequalities are interpreted in a matrix sense, i.e., the difference is positive-semidefinite (PSD).
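As a small illustration of the theorem (a sketch under the assumption that numpy is available; the normal-mean example and the sample sizes are my own choices), the sample mean of $\mathcal{N}(\mu,1)$ data attains the bound $1/(nI_1(\mu)) = 1/n$, while another unbiased estimator, the sample median, does not:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, reps = 1.5, 30, 200_000                # arbitrary choices

x = rng.normal(mu, 1.0, size=(reps, n))
xbar = x.mean(axis=1)                         # unbiased, attains the bound
med = np.median(x, axis=1)                    # also unbiased (by symmetry), less efficient

print("Cramer-Rao bound 1/n =", 1 / n)
print("Var(sample mean)     ~", xbar.var())   # ~ 1/n
print("Var(sample median)   ~", med.var())    # ~ pi/(2n) > 1/n
```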

One may also care about estimating some function of $\theta$, say $\psi(\theta)$. Suppose that $\hat\psi$ is an unbiased estimator of $\psi(\theta)$, i.e., $\mathbb{E} [\hat\psi] = \psi(\theta)$. Then it holds that:

$$ \mathbb{V}(\hat\psi) \geq \frac{[\psi'(\theta)]^2}{I(\theta)}. $$

The proof follows exactly the same argument as the one above, but relies on the following result for covariance:

$$ \begin{aligned} \text{Cov} (\hat\psi, s(\theta; X)) &= \mathbb{E} [\hat\psi \cdot s(\theta; X)] \\ &= \nabla_\theta [\mathbb{E} [\hat\psi]] \\ &= \nabla_\theta [\psi(\theta)]. \end{aligned} $$

We can also view this result as a generalization of the Cramer-Rao bound to biased estimators: the target is $\theta$, but the estimator $\hat\theta$ satisfies $\mathbb{E}\hat\theta = \psi(\theta) \neq \theta$. In this case, the result provides an optimality benchmark for estimators with the same bias, i.e., the variance of any estimator whose expectation is $\psi(\theta)$ (and hence whose bias is $\psi(\theta)-\theta$) is lower bounded by the quantity above.
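As a concrete example of this functional version of the bound (again a hedged sketch with arbitrary parameter choices, assuming numpy), consider estimating $\psi(\lambda)=1/\lambda$ in an Exponential($\lambda$) model with the sample mean, which is unbiased and attains $[\psi'(\lambda)]^2/(nI_1(\lambda)) = 1/(n\lambda^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 2.0, 25, 200_000               # arbitrary rate, sample size, replications

x = rng.exponential(scale=1 / lam, size=(reps, n))
psi_hat = x.mean(axis=1)                      # unbiased estimator of psi(lambda) = 1/lambda

# For Exp(lambda): I_1(lambda) = 1/lambda^2 and psi'(lambda) = -1/lambda^2,
# so the bound equals [psi'(lambda)]^2 / (n I_1(lambda)) = 1 / (n lambda^2).
print("Var(psi_hat)     ~", psi_hat.var())
print("Cramer-Rao bound =", 1 / (n * lam**2))
```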

Super-efficiency

As discussed above, the variance of any unbiased estimator is lower bounded by the Cramer-Rao bound. Generally, most reasonable estimators are at least asymptotically unbiased: the bias goes to zero as $n\to\infty$, but may not be exactly zero for any finite $n$.

One may expect that the asymptotic variance of such estimators is also greater than or equal to the Cramer-Rao bound. This is generally the case: estimators whose asymptotic variance equals the Cramer-Rao bound are referred to as asymptotically efficient (e.g., the MLE is efficient for parametric models under suitable regularity conditions).

However, one can also construct estimators that: (a) are asymptotically unbiased, and (b) have asymptotic variance matching the Cramer-Rao lower bound for most parameter values $\theta\in\Theta$, yet smaller variance for some parameter values. We consider a canonical example of such a super-efficient estimator below.

Hodges’ (Hodges-Le Cam) estimator

Suppose that one observes a sample $\{X_i\}_{i=1}^n$, where $X_i\sim \mathcal{N}(\mu,1)$. The MLE of $\mu$ is given by the sample average:

$$ \overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i. $$

After $\sqrt{n}$-scaling, it follows from the central limit theorem (CLT) that:

$$ \sqrt{n}(\overline{X}_n-\mu)\overset{d}{\to} \mathcal{N}(0,1) $$

(the above is simply the CLT, and for a sample of normally distributed random variables, the distributional statement actually holds exactly for every $n$). In other words, the (asymptotic) variance after root-$n$ scaling equals one for all parameter values $\mu$, and can be shown to match the Cramer-Rao lower bound.

Next, we introduce an estimator (Hodges’ estimator) that improves upon the MLE at $\mu=0$. Hodges’ estimator is defined as

$$ \hat\mu_n = \overline{X}_n \times \textbf{1}\{|\overline{X}_n|>n^{-1/4}\}, $$

i.e., this estimator is a thresholded version of the sample mean. First, consider the asymptotic behavior of Hodges’ estimator:

  • Suppose that $\mu=0$. Since $|\overline{X}_n|\leq n^{-1/4}$ implies $\hat{\mu}_n=0$, we have: $$ \begin{aligned} P_{\mu=0}(\sqrt{n}\hat\mu_n=0) &= P_{\mu=0}(\hat\mu_n=0) \\ &\geq P_{\mu=0}(|\overline{X}_n|\leq n^{-1/4}) \\ &= P_{\mu=0}(\sqrt{n}|\overline{X}_n|\leq n^{1/4}) \\ &= \Phi(n^{1/4})-\Phi(-n^{1/4}) \overset{n\to\infty}{\to} 1, \end{aligned} $$ where $\Phi(\cdot)$ is the CDF of a standard normal random variable. Therefore, $\sqrt{n}\hat{\mu}_n = \sqrt{n}(\hat{\mu}_n-\mu)$ converges to a random variable degenerate at zero, and hence the asymptotic variance (after root-$n$ scaling) is zero.
  • Suppose that $\mu \neq 0$. We know that $\overline{X}_n$ converges almost surely to $\mu$, with $|\mu|>0$, while the threshold $n^{-1/4}$ shrinks to zero; hence $\hat\mu_n=\overline{X}_n$ for all sufficiently large $n$ (almost surely), so $\sqrt{n} (\hat\mu_n-\overline{X}_n)\overset{a.s.}{\to} 0$. Therefore, $\sqrt{n}(\hat\mu_n-\mu) = \sqrt{n}(\overline{X}_n-\mu) + o_{a.s.}(1) \overset{d}{\to}\mathcal{N}(0,1)$ by Slutsky’s theorem, which in turn implies that the asymptotic variance (after root-$n$ scaling) is one. Both regimes are illustrated in the simulation sketch below the list.
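The following short simulation sketch (assuming numpy; the sample size, number of replications, and the helper name hodges are my own choices) illustrates both regimes by drawing $\overline{X}_n\sim\mathcal{N}(\mu,1/n)$ directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10_000, 100_000                     # arbitrary sample size and replications

def hodges(xbar, n):
    """Hodges' estimator: the sample mean thresholded at n**(-1/4)."""
    return np.where(np.abs(xbar) > n ** (-0.25), xbar, 0.0)

for mu in (0.0, 1.0):
    xbar = rng.normal(mu, 1.0 / np.sqrt(n), size=reps)   # X_bar ~ N(mu, 1/n)
    err = np.sqrt(n) * (hodges(xbar, n) - mu)             # root-n scaled error
    print(f"mu = {mu}: Var[sqrt(n)(mu_hat - mu)] ~ {err.var():.3f}")
    # ~ 0 at mu = 0 (super-efficiency), ~ 1 at mu = 1 (same as the MLE)
```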

It may seem that the MLE is sub-optimal, since Hodges’ estimator is super-efficient: it has better asymptotic variance at $\mu=0$ (despite being only asymptotically unbiased). While super-efficiency may seem like a good property, it is not necessarily so.

In the case of Hodges’ estimator, super-efficiency is gained at the expense of poor estimation in a neighborhood of zero. To illustrate this, we consider the (exact) squared risk of both estimators for different sample sizes and mean values. In the plot below, the risk of both estimators is scaled (multiplied by $n$) for visualization purposes: one can think of the ratio of the risk of Hodges’ estimator relative to that of the MLE.

Squared risk of Hodges' estimator (derivation)

Under the Gaussian setting, we have $\overline{X}_n \sim\mathcal{N}(\mu,1/n)$. Therefore, $\mathbb{E}[\overline{X}_n^2]=\mu^2+1/n$. We have:

$$ \begin{aligned} \mathbb{E} [\overline{X}_n \times \textbf{1}\{|\overline{X}_n|> n^{-1/4}\}] &= \mathbb{E} [\overline{X}_n]-\mathbb{E} [\overline{X}_n \times \textbf{1}\{|\overline{X}_n|\leq n^{-1/4}\}]\\ &= \mu-\mathbb{E} [\overline{X}_n \times \textbf{1}\{|\overline{X}_n|\leq n^{-1/4}\}]. \end{aligned} $$

Focusing on the second term, we obtain:

$$ \begin{aligned} & \quad \mathbb{E} [\overline{X}_n \times \textbf{1}\{|\overline{X}_n|\leq n^{-1/4}\}] \\ &= \int_{-n^{-1/4}}^{n^{-1/4}} \frac{x\sqrt{n}}{\sqrt{2\pi}}\exp\Big(-\frac{n(x-\mu)^2}{2}\Big)dx \\ &= \int_{(-n^{-1/4}-\mu)\sqrt{n}}^{(n^{-1/4}-\mu)\sqrt{n}} \frac{(t/\sqrt{n}+\mu)}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt \\ &= \frac{1}{\sqrt{n}}\int_{(-n^{-1/4}-\mu)\sqrt{n}}^{(n^{-1/4}-\mu)\sqrt{n}} \frac{t}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt + \mu\cdot\int_{(-n^{-1/4}-\mu)\sqrt{n}}^{(n^{-1/4}-\mu)\sqrt{n}} \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt \\ &= -\frac{1}{\sqrt{n}} \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big) \Big|_{(-n^{-1/4}-\mu)\sqrt{n}}^{(n^{-1/4}-\mu)\sqrt{n}} + \mu \Big( \Phi((n^{-1/4}-\mu)\sqrt{n}) - \Phi((-n^{-1/4}-\mu)\sqrt{n}) \Big). \end{aligned} $$

Next, we derive the analogous truncated term needed for the second moment of Hodges’ estimator:

$$ \begin{aligned} & \quad \mathbb{E} [\overline{X}_n^2 \times \textbf{1}\{|\overline{X}_n|\leq n^{-1/4}\}] \\ &= \int_{-n^{-1/4}}^{n^{-1/4}} \frac{x^2\sqrt{n}}{\sqrt{2\pi}}\exp\Big(-\frac{n(x-\mu)^2}{2}\Big)dx \\ &= \int_{(-n^{-1/4}-\mu)\sqrt{n}}^{(n^{-1/4}-\mu)\sqrt{n}} \frac{(t/\sqrt{n}+\mu)^2}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt \\ &= \int_{(-n^{-1/4}-\mu)\sqrt{n}}^{(n^{-1/4}-\mu)\sqrt{n}} \frac{t^2n^{-1}}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt + \int_{(-n^{-1/4}-\mu)\sqrt{n}}^{(n^{-1/4}-\mu)\sqrt{n}} \frac{2t\mu n^{-1/2}}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt + \int_{(-n^{-1/4}-\mu)\sqrt{n}}^{(n^{-1/4}-\mu)\sqrt{n}} \frac{\mu^2}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt. \end{aligned} $$

For the first term (the second and third are handled analogously to the mean case), we have:

$$ \begin{aligned} &\quad \int_{(-n^{-1/4}-\mu)\sqrt{n}}^{(n^{-1/4}-\mu)\sqrt{n}} \frac{t^2n^{-1}}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt \\ &= -\frac{tn^{-1}}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)\Big|_{(-n^{-1/4}-\mu)\sqrt{n}}^{(n^{-1/4}-\mu)\sqrt{n}} + \int_{(-n^{-1/4}-\mu)\sqrt{n}}^{(n^{-1/4}-\mu)\sqrt{n}} \frac{n^{-1}}{\sqrt{2\pi}}\exp\Big(-\frac{t^2}{2}\Big)dt, \end{aligned} $$

which completes the derivation of the first and second moments of Hodges’ estimator. The bias, variance, and hence the exact risk follow directly.
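For completeness, here is a sketch (assuming numpy and scipy; the function name and the grid of $\mu$ values are my own choices) that implements the exact scaled risk $n\cdot\mathbb{E}(\hat\mu_n-\mu)^2$ from the truncated moments derived above. Writing $A = \mathbb{E}[\overline{X}_n\textbf{1}\{|\overline{X}_n|\leq n^{-1/4}\}]$ and $B = \mathbb{E}[\overline{X}_n^2\textbf{1}\{|\overline{X}_n|\leq n^{-1/4}\}]$, so that $\mathbb{E}\hat\mu_n = \mu - A$ and $\mathbb{E}\hat\mu_n^2 = \mu^2 + 1/n - B$, the MSE equals $1/n - B + 2\mu A$.

```python
import numpy as np
from scipy.stats import norm

def hodges_scaled_risk(mu, n):
    """Exact n * E[(mu_hat_n - mu)^2] for Hodges' estimator when X_i ~ N(mu, 1)."""
    c = n ** (-0.25)                                  # threshold n^{-1/4}
    a, b = (-c - mu) * np.sqrt(n), (c - mu) * np.sqrt(n)
    dPhi = norm.cdf(b) - norm.cdf(a)
    dphi = norm.pdf(a) - norm.pdf(b)
    # A = E[Xbar 1{|Xbar| <= c}],  B = E[Xbar^2 1{|Xbar| <= c}] (derived above)
    A = dphi / np.sqrt(n) + mu * dPhi
    B = (a * norm.pdf(a) - b * norm.pdf(b) + dPhi) / n \
        + 2 * mu * dphi / np.sqrt(n) + mu**2 * dPhi
    return n * (1 / n - B + 2 * mu * A)

for n in (100, 10_000):
    mus = np.array([0.0, 0.5 * n ** (-0.25), n ** (-0.25), 1.0])
    print(n, np.round(hodges_scaled_risk(mus, n), 3))  # MLE reference value is 1
```

Evaluated near $\mu\approx n^{-1/4}$ this reproduces the risk spike discussed next, while at $\mu=0$ the scaled risk is close to zero and far from zero it is close to one.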

The (scaled) squared risk of the MLE is equal to one, irrespective of the sample size or the parameter value $\mu$. The (scaled) squared risk of Hodges’ estimator depends on both the sample size and the parameter value. We observe that super-efficiency at zero is gained at the expense of poor estimation in a neighborhood of zero: for every value of $n$, there is some value of $\mu$ for which the risk of Hodges’ estimator is much higher than that of the MLE.

Regular estimators

The existence of super-efficient estimators is one of the reasons for introducing the notion of regular estimators. Informally, the limiting distribution of a regular estimator (after appropriate normalization) does not change under small perturbations of the true parameter $\theta\in\Theta$.

Consider a Local Data Generating Process (LDGP) at $\theta_0$: for each $n$, consider a (triangular) array of random variables $\{X_{in}\}_{i=1}^n$, where $X_{in}\sim P_{\theta_n}$, $i=1,\dots,n$, and $\sqrt{n}(\theta_n-\theta_0)\overset{n\to\infty}{\to} h\in\mathbb{R}^k$, i.e., $\theta_n$ approaches the fixed parameter $\theta_0$.

Definition [Regular estimator]. $T_n=T_n(X_{1n},\dots,X_{nn})$ is called a regular estimator of $\psi(\theta)$ at $\theta_0$ if for every LDGP at $\theta_0$ (i.e., for every sequence $\{\theta_n\}_{n=1}^\infty$ with $\sqrt{n}(\theta_n-\theta_0)\to h\in\mathbb{R}^k$ for some $h$) it holds that:

$$ \sqrt{n}(T_n-\psi(\theta_n)) \overset{d}{\to} L_{\theta_0}, $$

where the convergence is in distribution under $P_{\theta_n}$ (i.e., the data-generating distribution changes with $n$), and the limiting distribution $L_{\theta_0}$ does not depend on $h$.

Let us revisit the MLE and Hodges’ estimator from the previous section. Fix any $h\in \mathbb{R}$ and consider the LDGP at $\mu_0=0$: $\mu_n = \mu_0 + h/\sqrt{n} = h/\sqrt{n}$. We have:

  • For any $n$, it holds for the MLE $\overline{X}_n$ that $\overline{X}_n\sim \mathcal{N}(\mu_n,1/n)$. Therefore, $\sqrt{n}(\overline{X}_n-\mu_n)\sim \mathcal{N}(0,1)$, and hence the limiting (here, exact) distribution does not depend on $h$: the MLE is regular.

  • Note that $\overline{X}_n \overset{d}{=}\mu_n+\frac{1}{\sqrt{n}}Z$ for $Z\sim\mathcal{N}(0,1)$. For Hodges’ estimator, it holds that:

    $$ \begin{aligned} \hat\mu_n &= \overline{X}_n \times \textbf{1}\{|\overline{X}_n|> n^{-1/4}\}\\ &= \Big( \mu_n+\frac{1}{\sqrt{n}}Z \Big) \times \textbf{1}\Big\{\Big| \mu_n+\frac{1}{\sqrt{n}}Z \Big|> n^{-1/4}\Big\} \\ &= \Big( \frac{h}{\sqrt{n}}+\frac{1}{\sqrt{n}}Z \Big) \times \textbf{1}\Big\{\Big| \frac{h}{\sqrt{n}}+\frac{1}{\sqrt{n}}Z \Big|> n^{-1/4}\Big\}\\ &= \frac{1}{\sqrt{n}} \Big( h+Z \Big) \times \textbf{1}\{| h+Z |> n^{1/4}\}. \end{aligned} $$

    Therefore, we have:

    $$ \begin{aligned} \sqrt{n} (\hat\mu_n - \mu_n) &= \sqrt{n} \hat\mu_n - h\\ &= \Big( h+Z \Big) \times \textbf{1}\{| h+Z |> n^{1/4}\} - h. \end{aligned} $$

    Note that $W_n:=\textbf{1}\{| h+Z |> n^{1/4}\}$ defines a sequence of Bernoulli random variables, and we can easily show that $W_n \overset{d}{\to}0$: it suffices to show that for any $c\in(0,1)$, $P(W_n\leq c)\to 1$ as $n\to\infty$. Indeed,

    $$ \begin{aligned} P(W_n\leq c) &=P(W_n=0) \\ &= P(| h+Z |\leq n^{1/4})\\ & = \Phi(n^{1/4}-h)-\Phi(-n^{1/4}-h)\overset{n\to\infty}{\to} 1. \end{aligned} $$

    Therefore (using Slutsky’s theorem)

    $$ \sqrt{n} (\hat\mu_n - \mu_n) = \Big( h+Z \Big) \times \textbf{1}\{| h+Z |> n^{1/4}\} - h \overset{d}{\to} -h, $$

    and hence, since the limit depends on $h$, we conclude that Hodges’ estimator is not regular at zero.
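To see the non-regularity numerically, here is a final sketch (assuming numpy; $n$, the number of replications, and the values of $h$ are arbitrary choices) that simulates the LDGP $\mu_n = h/\sqrt{n}$ and compares the distribution of $\sqrt{n}(\hat\mu_n-\mu_n)$ for the MLE and Hodges’ estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1_000_000, 100_000                  # arbitrary choices

for h in (0.0, 1.0, 3.0):
    mu_n = h / np.sqrt(n)                                 # local parameter sequence
    xbar = rng.normal(mu_n, 1.0 / np.sqrt(n), size=reps)  # X_bar ~ N(mu_n, 1/n)
    mu_hat = np.where(np.abs(xbar) > n ** (-0.25), xbar, 0.0)  # Hodges' estimator
    err_mle = np.sqrt(n) * (xbar - mu_n)
    err_hod = np.sqrt(n) * (mu_hat - mu_n)
    print(f"h = {h}: MLE (mean, var) ~ ({err_mle.mean():+.2f}, {err_mle.var():.2f}), "
          f"Hodges (mean, var) ~ ({err_hod.mean():+.2f}, {err_hod.var():.2f})")
```

The summary statistics for the MLE do not change with $h$, while for Hodges’ estimator the location of $\sqrt{n}(\hat\mu_n-\mu_n)$ drifts to roughly $-h$, matching the limit derived above.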

Conclusion

This is the first post in a series, and there is a lot more to cover. For example, the James-Stein estimator, another famous example of a non-regular estimator, deserves a separate discussion of its own (along with the corresponding implications, such as the inadmissibility of the MLE). Regularity plays an important role in the asymptotic analysis of efficient estimators and will come up again in future posts.
