Let \(X_1,\ldots,X_n\) be a simple random sample from a population with mean \(\mu\) and variance \(\sigma^2\).
\(\bar X\) is the sample mean and \(S^2\) is the sample variance.
If the population variance \(\sigma^2\) is known, then by the central limit theorem (CLT), for large \(n\),
\[ {\bar X-\mu\over \sigma/\sqrt{n}}\dot\sim N(0,1) \]
This is equivalent to
\[ \bar X \dot\sim N(\mu,{\sigma^2\over n}) \]
\[ P(-z_{\alpha/2}\leq {\bar X-\mu\over \sigma/\sqrt{n}}\leq z_{\alpha/2})=1-\alpha\\ P(-z_{\alpha/2}\cdot \sigma/\sqrt{n}\leq \bar X-\mu\leq z_{\alpha/2}\cdot \sigma/\sqrt{n})=1-\alpha\\ P( \bar X-z_{\alpha/2}\cdot \sigma/\sqrt{n}\leq\mu\leq \bar X+z_{\alpha/2}\cdot \sigma/\sqrt{n})=1-\alpha \]
A \((1-\alpha)100\%\) confidence interval for \(\mu\) is
\[ \begin{pmatrix}\bar X-z_{\alpha/2}{\sigma\over \sqrt{n}},\bar X+z_{\alpha/2}{\sigma\over \sqrt{n}}\end{pmatrix} \]
\(z_{\alpha/2}{\sigma\over \sqrt{n}}\) is the Margin of Error.
This confidence interval is valid only if both of the following hold:
\(n\geq 30\) (so that the central limit theorem applies) or the population has a normal distribution
\(\sigma^2\) is known
This method is rarely used in practice, because \(\sigma^2\) is seldom known.
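As a minimal sketch of the known-variance interval above, the following R code computes it from summary numbers. The helper name `z_interval` and the inputs (\(\bar X=50\), \(\sigma=10\), \(n=36\)) are hypothetical and chosen only for illustration; they are not data from these notes.

```r
# Hypothetical helper: z confidence interval for a mean when sigma is known.
z_interval <- function(xbar, sigma, n, conf_level = 0.95) {
  alpha <- 1 - conf_level
  z <- qnorm(1 - alpha / 2)   # critical value z_{alpha/2}
  me <- z * sigma / sqrt(n)   # margin of error
  c(lower = xbar - me, upper = xbar + me)
}

# Illustrative (made-up) inputs
ci_z <- z_interval(xbar = 50, sigma = 10, n = 36, conf_level = 0.95)
ci_z
```

The interval is centred at \(\bar X\) and its half-width is the margin of error, so quadrupling \(n\) halves the width.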
If we make the additional assumption that \(X_1,\ldots,X_n\) are iid \(N(\mu,\sigma^2)\), then
\[ {\bar X-\mu\over S/\sqrt{n}}\sim T_{n-1} \]
Here \(\sigma\) is replaced by the sample standard deviation \(S\), and \(T_{n-1}\) denotes the t distribution with \(n-1\) degrees of freedom.
A \((1-\alpha)100\%\) confidence interval for \(\mu\) is
\[ \begin{pmatrix}\bar X-t_{n-1,\alpha/2}{S\over \sqrt{n}},\bar X+t_{n-1,\alpha/2}{S\over \sqrt{n}}\end{pmatrix} \]
Compared with the known-variance interval, \(z_{\alpha/2}\) is replaced by the critical value \(t_{n-1,\alpha/2}\). Note that \(z_{\alpha/2}<t_{n-1,\alpha/2}\), so the t interval is wider.
Notes:
If the population is normal, this confidence interval holds for all \(n\).
If the population is not normal, this confidence interval will still work well if \(n\geq 30\).
\(t_{n-1,\alpha/2}\) can be found in R using qt(1 - alpha/2, n - 1).
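To illustrate the relationship between the two critical values, here is a small sketch comparing `qt` and `qnorm` at a 90% confidence level. The sample sizes are arbitrary choices made for illustration.

```r
alpha <- 0.10
n <- c(5, 10, 30, 100, 1000)           # arbitrary illustrative sample sizes

t_crit <- qt(1 - alpha / 2, df = n - 1)  # t_{n-1, alpha/2} for each n
z_crit <- qnorm(1 - alpha / 2)           # z_{alpha/2}

data.frame(n = n, t_crit = round(t_crit, 3), z_crit = round(z_crit, 3))
```

The t critical value is always larger than the z critical value and decreases toward it as \(n\) grows.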
Example: The following data include temperatures for a random sample of 10 time points in New York in September 1973. Find a 90% confidence interval for the average temperature.
temp=c(81,72,67,88,93,96,84,82,82,67)
qqnorm(temp)
qqline(temp)
The normal Q-Q plot looks consistent with normality.
\[ n=10,\ \bar X=81.20,\ S=10.01\\ t_{9,0.05}=\text{qt}(0.95,9)=1.833\\ ME=1.833\times{10.01\over\sqrt{10}}=5.80\\ 81.20\pm5.80=[75.40,87.00] \]
Interpretation:
We are 90% confident that the average temperature in New York in September 1973 is between 75.40 and 87.00 degrees.
We can obtain T confidence intervals in R:
confint(lm(temp ~ 1),level =0.9)
## 5 % 95 %
## (Intercept) 75.39804 87.00196
# lm stands for linear model; level is the confidence level (90% in this case)
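As a cross-check, the same interval can be reproduced step by step from the formula and compared with `t.test`. This is a sketch only; the variable names are chosen here for illustration.

```r
temp <- c(81, 72, 67, 88, 93, 96, 84, 82, 82, 67)

n <- length(temp)
xbar <- mean(temp)                 # sample mean
s <- sd(temp)                      # sample standard deviation
t_crit <- qt(0.95, df = n - 1)     # t_{9, 0.05} for a 90% interval
me <- t_crit * s / sqrt(n)         # margin of error

ci_manual <- c(xbar - me, xbar + me)
ci_ttest <- as.numeric(t.test(temp, conf.level = 0.90)$conf.int)

ci_manual
ci_ttest
```

Both approaches should agree with the `confint` output shown above.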
Precision
For means, the margin of error is
\[ MOE=z_{\alpha/2}{\sigma\over \sqrt{n}}\quad\text{or}\quad MOE=t_{n-1,\alpha/2} {S\over \sqrt{n}} \]
Let \(L\) be the desired length of the confidence interval, that is, twice the margin of error.
\(L=2\times z_{\alpha/2}\times {\sigma\over \sqrt{n}}\to n=(2\times z_{\alpha/2}\times {\sigma\over L})^2\)
To make the confidence interval shorter, we can increase the sample size.
In practice \(\sigma\) is unknown, so replace it with a preliminary estimate \(S_{pr}\) and choose \(n\) so that
\[ n\geq(2z_{\alpha/2}{S_{pr}\over L})^2 \]
Note: Often a preliminary estimate of \(\sigma\) is not available. In that case:
Guess a likely range of values that the variable will take.
Use \(S_{pr}={\text{range}\over 3.5}\) (if roughly uniform) or \(S_{pr}={\text{range}\over 4}\) (if roughly normal).
Example: What sample size should we use to obtain a 90% confidence interval of length 5 for the average temperature in New York in 1973?
\[ L=5,\ S_{pr}=10.01,\ z_{\alpha/2}=\text{qnorm}(0.95)=1.645\\ n\geq\left[2\times 1.645\times{10.01\over 5}\right]^2=43.38,\quad\text{so } n=44 \]
Always round up.
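The sample size calculation can be sketched in R as follows. The function name `sample_size_mean` is hypothetical, and the guessed range of 40 degrees in the second call is an assumption made purely to illustrate the range rule.

```r
# Hypothetical helper: smallest n giving a z interval of length at most L.
sample_size_mean <- function(L, s_pr, conf_level = 0.90) {
  alpha <- 1 - conf_level
  z <- qnorm(1 - alpha / 2)         # z_{alpha/2}
  ceiling((2 * z * s_pr / L)^2)     # always round up
}

# Using the preliminary estimate S_pr = 10.01 from the earlier sample
n_needed <- sample_size_mean(L = 5, s_pr = 10.01, conf_level = 0.90)

# Using the range rule with an assumed (made-up) range of 40 and a normal shape
n_from_range <- sample_size_mean(L = 5, s_pr = 40 / 4, conf_level = 0.90)

c(n_needed = n_needed, n_from_range = n_from_range)
```

`ceiling` implements the "always round up" rule, since rounding down would give an interval longer than \(L\).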
Example: Suppose we want a confidence interval for the variance in temperature in New York in 1973.
Here \(\sigma^2\) is the true variance and \(S^2\) is the sample variance used to estimate it.
It turns out that if the population has a normal distribution, then
\[ {(n-1)S^2\over \sigma^2}\sim\chi^2_{n-1} \]
If \(Z_1,\ldots,Z_n\) are iid \(N(0,1)\), then \(Z_1^2+\cdots+Z_n^2\sim \chi_n^2\).
\(X\sim \chi_v^2\) means that \(X\) has the \(\chi^2\) distribution with \(v\) degrees of freedom.
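A quick simulation can make the distribution of \((n-1)S^2/\sigma^2\) more concrete. This is a sketch under assumed values (\(n=10\), \(\mu=80\), \(\sigma=10\), 20000 replications, a fixed seed), none of which come from the notes. It relies on the standard facts that a \(\chi^2_v\) variable has mean \(v\) and variance \(2v\).

```r
set.seed(1)                 # fixed seed so the run is reproducible
n <- 10                     # assumed sample size
sigma <- 10                 # assumed true standard deviation
reps <- 20000               # number of simulated samples

# For each simulated normal sample, compute (n - 1) * S^2 / sigma^2
stat <- replicate(reps, {
  x <- rnorm(n, mean = 80, sd = sigma)
  (n - 1) * var(x) / sigma^2
})

# Compare simulated moments with the chi-square(n - 1) values: mean 9, variance 18
c(sim_mean = mean(stat), sim_var = var(stat))
```

The simulated mean and variance should land close to \(n-1\) and \(2(n-1)\), up to simulation noise.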
Plot of \(\chi_v^2\) densities:
Positive support
Non-symmetric
library(ggplot2)
# Chi-square density on [0, 20] with df = 1
ggplot(data.frame(x = c(0, 20)), aes(x = x)) +
  stat_function(fun = dchisq, args = list(df = 1))
# df = 2
ggplot(data.frame(x = c(0, 20)), aes(x = x)) +
  stat_function(fun = dchisq, args = list(df = 2))
# df = 4
ggplot(data.frame(x = c(0, 20)), aes(x = x)) +
  stat_function(fun = dchisq, args = list(df = 4))
# df = 8
ggplot(data.frame(x = c(0, 20)), aes(x = x)) +
  stat_function(fun = dchisq, args = list(df = 8))
A \((1-\alpha)100\%\) confidence interval for \(\sigma^2\) is
\[ {(n-1)S^2\over \chi_{n-1,\alpha/2}^2}<\sigma^2<{(n-1)S^2\over \chi_{n-1,1-\alpha/2}^2} \]
Percentiles of the \(\chi^2\) distribution can be found in R:
qchisq(p, v) gives \(\chi^2_{v,1-p}\), the value with area \(p\) to its left and area \(1-p\) to its right.
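Putting the pieces together, the interval for \(\sigma^2\) can be sketched in R for the temperature sample. The 90% confidence level is an assumption made here for illustration, since the example does not state one, and the variable names are chosen for this sketch.

```r
temp <- c(81, 72, 67, 88, 93, 96, 84, 82, 82, 67)

n <- length(temp)
s2 <- var(temp)        # sample variance S^2
alpha <- 0.10          # assumed: 90% confidence level

chi_upper <- qchisq(1 - alpha / 2, df = n - 1)  # chi^2_{n-1, alpha/2}
chi_lower <- qchisq(alpha / 2, df = n - 1)      # chi^2_{n-1, 1-alpha/2}

# Dividing by the larger critical value gives the lower endpoint
ci_var <- c(lower = (n - 1) * s2 / chi_upper,
            upper = (n - 1) * s2 / chi_lower)
ci_var
sqrt(ci_var)           # corresponding interval for sigma
```

Because the \(\chi^2\) distribution is not symmetric, the interval is not centred at \(S^2\).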