The first assignment is
available. Today I will go over the Rao-Blackwell theorem
again, with clear proofs of the lemmas that it uses.
Lemma: Let A be any set of mutually exclusive, exhaustive events. Let f be any function of x. Then E[f(x)] = E[ E[f(x)|A] where the outer expectation averages over the events A, and the inner expectation averages over the x belonging to each A.
The statement of the lemma assumes that the subsets {A} are a
partition of X.
Proof (discrete case only): Using the definition of expectation, E[ E[f(x)|A] = SUM_A p(A) * [ SUM_x in A p(x|x in A)f(x) ].
Write A(x) to mean the subset A containing x. Because the {A}
are a partition of X, E[ E[f(x)|A] = SUM_x p(A(x)) *
p(x|x in A(x)) * f(x).
Now p(x|x in B) = p(x and x in B)/p(x in B) by the definition of conditional probability. The numerator is 0 if x is not in B. It is just p(x) if x is in B.
So E[ E[f(x)|A] = SUM_x p(A(x)) * p(x) / p(A(x)) *
f(x) and we can cancel p(A(x)) out, giving the result.
Definition: The function c: R-> R is convex if bc(x) + (1-b)c(y) >= c(bx +(1-b)y) for all x and y, for 0 <= b <= 1.
For example, the functions x^2 and 1/x are convex. Subject to minor conditions, a function is convex iff its second derivative is always positive. (This means that if the first derivative is ever zero, the function has a global minimum.) See picture!Lemma [Jensen's
inequality, 1906]: Let u be a function X -> R (also called
a real-valued random variable) and let c be a convex function .
Then c(E[u]) <= E[c(u)].
Note: The probability distribution on X determines a
probability distribution on the real numbers that are the values u(x).
Proof: We'll just do the proof in the finite discrete
case. Let U = {u1, ...., un} be the n different values of u with
non-zero probability.
The proof goes by induction. The base case is n = 2.
Inductive case: Suppose Jensen's inequality is true for n. We'll prove it for n+1. We need to prove E[c(u)] >= c(E[u]).
Note that E[u] = pu_n+1 + (1-p)z where p = P(u = u_n+1) and z = E[u1, ..., un].
Similarly, E[c(u)] = pc(u_n+1) +
(1-p)E[c(u1), ...,c(un)] .
The inductive hypothesis tells us that E[c(u1),
...,c(un)] >= c(E[u1, ..., un]). So E[c(u)] >=
pc(u_n+1) + (1-p)c(z).
The definition of convexity tells us that pc(u_n+1) +
(1-p)c(z) >= c(pu_n+1 + (1-p)z ) =
c(E[u]). END OF PROOF.
Theorem [Calyampudi Radhakrishna Rao, 1945]: Let P_theta be a family of distributions on a sample space X, where theta in Theta. Suppose g tilde: X -> R is an unbiased estimator of g: Theta -> R. Let t be a sufficient statistic for theta. Then g hat(a) = E_theta[ g tilde(x) | t(x)=a] is an unbiased estimator for g with variance equal-or-smaller to that of g tilde.
Proof outline:
(1) Show that g hat(a) = E_theta [ g tilde(x) | t(x) = a ] really is a
function of a only, not of theta.
(2) Show that E_theta [ g hat(t) ] = g(theta) for every theta .
(3) Show that var_theta(g hat(t)) <= var_theta( g tilde(x)).
Step (1): t is sufficient for theta, so p(x | t(x) = a), i.e. the distribution of x conditional on t, does not depend on theta. By the definition of expectation, if the space X is discrete then the expectation of f(x) given t(x) = a is
SUM_{x s.t. t(x) = a} f(x)p(x | t(x)=a)For any function f, this expectation is the same regardless of theta. A similar argument applies if the space X is continuous, with an integral instead of a sum, but there are technical details we won't go into.
Step(2): We must show that E_theta [ g hat(t) ] = g(theta) for
arbitrary theta.
Proof: We use the lemma about nested expectations. f(x) is g tilde(x) and the event A contains all x such that t(x) = a. We have
E[g hat(a)] = E[ E[ g tilde(x) | t(x) = a] ] = E[ g tilde(x)] = g(theta).Step (3): We use Jensen's inequality where we condition on the event t(x) = a, and the random variable u = g tilde(x). The inequality says E[ c(g tilde(x)) | t(x)=a ] >= c(E[ g tilde(x) | t(x)=a ]) = c(g hat(t)).
Taking the expectation again of each side gives E[ E[ c(g tilde(x)) | t(x) ] ] >= E[ c(g hat(t)) ]
Now set c(u) = (u - g(theta))^2. Note that for each theta this
is a different function of u, but that's ok. The constant g(theta)
is the expectation of g hat(t) so E[ c(g hat(t)) ] is
the variance of g hat(t).
Using the lemma about nested expectations again gives
E[ E[ c(g tilde(x)) | t(x)=a ] ] = E[ c(g tilde(x)) ]which is the variance of g tilde(x).
Lemma: Suppose g hat is the unique function of t that is an unbiased estimator of g(theta). Then g hat(t) is an MVUE.
Proof: Let g star(x) be any unbiased estimator of g(theta). Note that E[ g star(x) | t ] is a function of t and an unbiased estimator. So E[g star(x)] = g hat(t). By the Rao-Blackwell theorem, g hat(t) has variance equal-or-smaller than the variance of g star(x). So g hat(t) is the estimator that has smallest variance among all unbiased estimators.How can we know that the sufficient statistic t has the property
that g hat(t) is unique? The answer is via the concept of
completeness.
Definition: Let P_theta be a family of distributions on a sample space Y. This family is complete if
E_theta [f(y)] = 0 for every theta implies f(y) = 0 almost everywhere.In other words, for every function f(y) that is not zero almost everywhere, its expectation for some theta reveals this fact.
Example: Consider the family of binomial distributions
P_theta(t) = (n choose t) theta^t (1-theta)^(n-t)for 0 <= theta <= 1 and t = 0, 1, ... n.
Suppose that SUM_t=0 to n f(t) (n choose t) theta^t (1-theta)^(n-t) equals zero for all theta.
This implies that SUM_t=0 to n f(t) (n choose t) phi^t equals zero for all phi > 0. The righthand side is a polynomial in phi which can equal zero only if the coefficients f(t) are all zero. So the family of binomial distributions is complete.Theorem [Lehmann-Scheffe]: Suppose t(x) is a complete sufficient statistic for theta, and suppose g tilde: X->R is an unbiased estimator of g(theta). Then g hat(t(x)) = E_theta[ g tilde(y) | t(y)=t(x) ] is the unique MVUE.
Proof: Let g star(x) be any other unbiased estimator of g(theta). Consider g bar(t) = E[ g star(x) | t ]. This is a function of t and an unbiased estimator.
So E[ g hat(t) - g bar(t) ] = 0 for every theta. By completeness g hat(t) - g bar(t) = 0 for all t, so g hat and g bar are the same.
So the Rao-Blackwell process always gives the same improved estimator, regardless of which crude estimator we begin with.
Algorithm:
(1) Find a sufficient statistic t.
(2) Show that the family of distributions of t is complete.
(3) Find a crude unbiased estimator g tilde(x).
(4) Evaluate g hat(t(x)) = E_theta[ g tilde(y) | t(y)=t(x) ]
Instead of steps 3 and 4, sometines you can directly guess some g bar(t) and prove that it is unbiased.
Steps 1 and 2 only have to be done once for a given family of distributions P_theta. They can then be reused for different estimation targets g(theta).
How do you do step (1)? Use the factorization theorem.
How do you do step (2)? Use the completeness theorem for an
exponential family.
p_theta(x) = f(theta,t(x))*h(x).Proof: See Silvey, page 27.
Example: Let x = (x1 ... xn) be an iid sample from a Gaussian N(mu,sigma^2). We have
p_theta(x) = (2 pi signa^2)^-0.5n * exp( -1/2sigma^2 * SUM (xi - mu)^2 )Here it looks like the parameter mu is involved with each separate xi. However we can rewrite the above as
p_theta(x) = (2 pi signa^2)^-0.5n * exp( -1/2sigma^2 * [ SUM (xi - x bar^2 + n(x bar - mu)^2] )Here it is only a function of x that is involved with mu and sigma^2, namely t(x)= (x bar, SUM (xi - xbar)^2). So this t(x) is a sufficient statistic.
Note: For non-Gaussian distribution families, typically the
sample mean and variance are not sufficient by themselves.