9.9. Summary
9.9.1. Terminology Review
Use the flashcards below to help you review the terminology introduced in this chapter.
9.9.2. Key take-aways
Expected Value
The expected value, or mean, of a discrete random variable \(X\) is \begin{equation*} \mu_X = E[X] = \sum_{x \in \operatorname{Range}(X)} x p_X(x). \end{equation*}
The expected value, or mean, of a continuous random variable \(X\) is \begin{equation*} \mu_X = E[X] = \int_{-\infty}^{\infty} x f_X(x)~dx. \end{equation*}
The mean \(\mu_X\) is the value that minimizes the mean-squared error to the random variable, \(E\left[ \left( X - \mu_X\right)^2 \right]\).
The mode of a random variable \(X\) is the value with the highest probability (if \(X\) is discrete) or highest probability density (if \(X\) is continuous).
The median of a random variable is a value that divides the probability into two equal parts, i.e., a value \(m\) such that \(P(X \le m) \ge 1/2\) and \(P(X \ge m) \ge 1/2\). For a continuous random variable, this reduces to \(P(X \le m) = P(X > m) = 1/2\). The median is not necessarily unique, especially for discrete random variables.
The mean is not generally equal to the median or mode.
Properties of expected value:
Expected value of a constant is that constant \(E[c] = c\).
Expected value is a linear operator, \(E[aX +bY] =a E[X] + b E[Y]\).
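As a minimal sketch of these definitions (with a made-up PMF, not an example from the chapter), the mean of a discrete random variable can be computed directly from its PMF, and the linearity property checked numerically:

```python
# Minimal sketch with a hypothetical PMF: compute E[X] and check linearity.
import numpy as np

x = np.array([0, 1, 2, 3])            # values in Range(X) (made up for illustration)
px = np.array([0.1, 0.2, 0.4, 0.3])   # p_X(x); must sum to 1

mu_X = np.sum(x * px)                 # E[X] = sum over x of x * p_X(x)
print(mu_X)                           # 1.9

# Linearity with a constant: E[aX + b] = a E[X] + b
a, b = 2.0, 5.0
print(np.sum((a * x + b) * px), a * mu_X + b)   # both 8.8
```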
Moments
The Law of the Unconscious Statistician (LOTUS rule) says:
If \(X\) is a discrete random variable and \(g(x)\) is a real function, then \begin{equation*} E\left[ g \left( X \right) \right] = \sum_x g(x) p_X(x). \end{equation*}
If \(X\) is a continuous random variable and \(g(x)\) is a real function, then \begin{equation*} E\left[ g \left( X \right) \right] = \int_{-\infty}^{\infty} g(x) f_X(x)~ dx. \end{equation*}
The \(n\)th moment is \(E[X^n]\). For continuous random variables, this is \begin{equation*} E\left[ X^n \right] = \int_{-\infty}^{\infty} x^n f_X(x)~ dx. \end{equation*}
The mean is the first moment.
The mean weights each value of the random variable by the probability or probability density for that value.
The \(n\)th central moment is \(E\left[\left(X - \mu_X\right) ^n\right ]\). For continuous random variables, this is \begin{equation*} E\left[ \left( X - \mu_X \right)^n \right] = \int_{-\infty}^{\infty} \left(x - \mu_X \right)^n f_X(x)~ dx. \end{equation*}
The variance is the second central moment, \(\sigma_{X}^{2} = E \left[ \left(X - \mu_X \right)^2 \right]\).
The variance weights each value of the random variable by its distance squared from the mean. Thus, the variance measures the spread of the distribution away from the mean.
The standard deviation of a random variable is the square root of its variance, \(\sigma_X = \sqrt{ E \left[ \left( X - \mu_X \right)^2 \right] }\).
Properties of variance:
Variance of a constant is 0: \(\operatorname{Var}[c] = 0\).
Variance is not affected by adding a constant: \(\operatorname{Var}[X+c] = \operatorname{Var}[X]\).
Scaling a random variable \(X\) by a deterministic constant \(c\) scales the variance by the square of the constant: \(\operatorname{Var}[cX] = c^2 \operatorname{Var}[X]\).
The variance of the sum of independent random variables \(X_0, X_1, \ldots, X_{n-1}\) is the sum of the variances: \begin{equation*} \operatorname{Var}\left[ \sum_{i=0}^{n-1} X_i \right] = \sum_{i=0}^{n-1} \operatorname{Var}\left[X_i \right] . \end{equation*}
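As a sketch of the moment formulas above (using an Exponential(1) density as an assumed example, not one from the chapter), the moments of a continuous random variable can be computed by numerical integration:

```python
# Minimal sketch: moments of a continuous RV via LOTUS, using the assumed
# Exponential(1) pdf f_X(x) = exp(-x), x >= 0, as the example density.
import numpy as np
from scipy.integrate import quad

f_X = lambda x: np.exp(-x)

mean, _ = quad(lambda x: x * f_X(x), 0, np.inf)               # first moment E[X]
second, _ = quad(lambda x: x ** 2 * f_X(x), 0, np.inf)        # second moment E[X^2]
var, _ = quad(lambda x: (x - mean) ** 2 * f_X(x), 0, np.inf)  # second central moment
print(mean, second, var, np.sqrt(var))                        # ~1.0, 2.0, 1.0, 1.0
```

(Equivalently, \(\operatorname{Var}[X] = E[X^2] - \mu_X^2\), which the printed values confirm.)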
Parameter Estimation
In the following, let \(X\) be a random variable that has a distribution with some parameter \(\theta\). For instance, \(\theta\) could be the mean or variance of the distribution.
Let \(\mathbf{x}\) be an \(N\)-vector of independent data points drawn from the distribution of \(X\). Let \(\mathbf{X}\) be a random vector consisting of \(N\) independent random variables with the same distribution as \(X\).
In classical parameter estimation, \(\theta\) is treated as deterministic but unknown; in Bayesian parameter estimation, \(\theta\) is treated as a random variable.
We only consider classical parameter estimation.
A point estimate or estimate \(\hat{\theta}\) is a single numerical value that estimates \(\theta\) and is computed from the data vector \(\mathbf{x}\); i.e., it can be computed as some function \(\hat{\theta} = g(\mathbf{x})\).
A point estimator or estimator is a random variable \(\hat{\Theta}\) that we get if we apply the estimation function \(g()\) to the random vector \(\mathbf{X}\). The estimator can be used to characterize the estimation process over the possible values of the data.
The estimator error is \(\hat{\Theta} - \theta\).
The estimator bias is \(E\left[\hat{\Theta}\right] - \theta\).
An estimator is unbiased if \(E\left[\hat{\Theta}\right] = \theta\), i.e., if the estimator bias is zero.
The sample mean estimator is unbiased.
The sample mean estimator minimizes the mean-squared error.
If the mean is known, the sample variance estimator with denominator \(N\) is unbiased.
If the mean is unknown, the sample variance estimator is biased if the denominator is \(N\) and unbiased if the denominator is \(N-1\). The change in the denominator reflects a reduction of one degree of freedom: one degree of freedom is used up in estimating the mean.
The biased variance estimator is denoted by \(S_{N}^{2}\), and the unbiased variance estimator is denoted by \(S_{N-1}^{2}\).
The biased variance estimate is denoted by \(s_{N}^{2}\), and the unbiased variance estimate is denoted by \(s_{N-1}^{2}\).
The unbiased variance estimator has a higher mean-squared error than the biased variance estimator.
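A minimal sketch of the two variance estimates on made-up data (NumPy's ddof argument controls the denominator):

```python
# Minimal sketch with made-up data: s_N^2 (denominator N) vs. s_{N-1}^2 (denominator N-1).
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=2.0, size=20)   # hypothetical sample; true variance is 4

s2_biased = np.var(data, ddof=0)     # denominator N (biased when the mean is estimated)
s2_unbiased = np.var(data, ddof=1)   # denominator N - 1 (unbiased)
print(s2_biased, s2_unbiased)
print(s2_biased * len(data) / (len(data) - 1))   # equals s2_unbiased
```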
Confidence Intervals for Estimates
Confidence intervals can be used to provide more information than a point estimate.
As before, a \(c\)% confidence interval does not mean that the true value \(\theta\) is in that confidence interval with probability \(c\); instead, it means that when the same procedure is repeated many times, \(c\)% of the confidence intervals constructed will contain the true value.
Confidence intervals can be estimated using bootstrap resampling by generating many sample point estimates and finding an interval that contains 95% of the sample point estimates.
Confidence intervals created via bootstrapping are generally too narrow unless the data contains more than 100 values.
The standard error of the mean (SEM) is the standard deviation of the mean estimator.
For \(n\) independent samples from a random variable with known standard deviation \(\sigma_X\), the SEM is \begin{equation*} \sigma_{\hat{X}} = \frac{ \sigma_X}{\sqrt{n}}. \end{equation*}
Confidence intervals for the sample mean can be constructed using a Normal approximation when the variances of the groups are known.
When the variances of the groups are not known, we have to use the variance estimate instead of the true variance. The result is that the confidence intervals are computed from a Student’s \(t\) distribution instead of the Normal distribution.
For small samples, both the bootstrap confidence intervals and the analytical confidence intervals using the Student’s \(t\) approximation are too narrow to achieve the specified confidence level.
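A minimal sketch (on made-up data) of a 95% confidence interval for the mean using the Student's \(t\) distribution, since the variance must be estimated from the data:

```python
# Minimal sketch with made-up data: 95% t-based confidence interval for the mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(loc=10.0, scale=3.0, size=25)   # hypothetical sample

n = len(data)
xbar = np.mean(data)
sem = np.std(data, ddof=1) / np.sqrt(n)           # estimated standard error of the mean

print(stats.t.interval(0.95, n - 1, loc=xbar, scale=sem))   # (lower, upper)
```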
Testing a Difference of Means
In this section, we considered how to perform a NHST for a difference of means between two groups.
The null hypothesis is that the two groups come from the same population and have the same means.
The test statistic is the difference in means between the two groups.
If the two groups come from distributions with known variances, the test statistic is Normal.
If the two groups come from distributions with unknown variances, the test statistic follows the Student’s \(t\) distribution. A NHST based on this test statistic is called a \(t\)-test.
A NHST can also be carried out using bootstrap resampling.
SciPy's scipy.stats module has a scipy.stats.ttest_ind() function for carrying out a \(t\)-test on data from independent groups.
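A minimal usage sketch on made-up data:

```python
# Minimal sketch: two-sample t-test on made-up independent groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
group1 = rng.normal(loc=5.0, scale=2.0, size=30)
group2 = rng.normal(loc=6.0, scale=2.0, size=30)

result = stats.ttest_ind(group1, group2)
print(result.statistic, result.pvalue)   # reject H0 at level alpha if pvalue < alpha
```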
Sampling and Bootstrap Distributions of Parameters
The sampling distribution of an estimator \(\hat{\Theta}\) is the probability distribution of \(\hat{\Theta}\), i.e., the distribution of the estimate over repeated random samples of the data.
The sampling distribution of the mean estimator is approximately Normal, provided the number of data samples is sufficiently large (at least in the tens).
The sampling distribution can be approximated by the bootstrap distribution of the estimator when the number of data samples is sufficiently large (100 or more).
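A minimal sketch (on made-up data) of building the bootstrap distribution of the sample-mean estimator by resampling with replacement:

```python
# Minimal sketch with made-up data: bootstrap distribution of the sample mean.
import numpy as np

rng = np.random.default_rng(4)
data = rng.exponential(scale=2.0, size=200)   # hypothetical sample; true mean is 2

boot_means = np.array([
    np.mean(rng.choice(data, size=len(data), replace=True))
    for _ in range(10_000)
])

print(np.percentile(boot_means, [2.5, 97.5]))   # percentile bootstrap 95% CI for the mean
```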
Effect Size, Power, and Sample Size Selection
Let \(\alpha\) and \(\beta\) be the acceptable levels of Type-I and Type-II errors, respectively.
The power is \(1- \beta\).
Effect size is a measure of separation between distributions.
Cohen’s \(d\) is a type of effect size for a difference of means. Cohen’s \(d\) is defined as \begin{equation*} d = \frac{ \mu_X - \mu_Y}{\sigma}, \end{equation*} where \(\sigma\) is the common, known standard deviation of the two groups.
Effect sizes are sometimes mapped to text descriptors that vary from “very small” for \(d=0.01\) to “huge” for \(d=2\).
The required group sizes can be computed from \(\alpha\), \(1-\beta\), and \(d\). There are generally multiple solutions.
Under the Normal approximation, the group sizes can be computed using inverse \(Q\) functions.
The required sizes of the groups are larger when the effect size is smaller.
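As a sketch of how the inverse \(Q\) function enters the sample-size calculation, the following uses the standard Normal-approximation formula for a two-sided, two-sample test of means, \(n \approx 2\left(Q^{-1}(\alpha/2) + Q^{-1}(\beta)\right)^2 / d^2\) per group; the details (e.g., one-sided vs. two-sided) may differ from the chapter's derivation:

```python
# Minimal sketch: per-group sample size from alpha, power, and Cohen's d under
# the Normal approximation (standard two-sided, two-sample formula, assumed here).
import numpy as np
from scipy import stats

alpha = 0.05   # acceptable Type-I error rate
beta = 0.20    # acceptable Type-II error rate (power = 1 - beta = 0.8)
d = 0.5        # Cohen's d ("medium" effect size)

z_alpha = stats.norm.isf(alpha / 2)   # inverse Q function: Q^{-1}(alpha/2)
z_beta = stats.norm.isf(beta)         # Q^{-1}(beta)

n_per_group = 2 * (z_alpha + z_beta) ** 2 / d ** 2
print(int(np.ceil(n_per_group)))      # about 63 per group
```

As expected from the last take-away, shrinking \(d\) (a smaller effect) or \(\alpha\) and \(\beta\) (stricter error requirements) increases the required group size.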