Paired data arise from an alternative sampling design used for the comparison of two means.
Each data point in the first sample is matched with a unique data point in the second sample.
Studies with paired data can reduce uncontrolled variability.
Example: A certain lake has been designated for pollution clean-up. One way to assess the effectiveness of the clean-up measures is to randomly select a number n of locations from which water samples are taken and analyzed both before and after the clean-up.
The paired data are the water-quality measurements at each location before and after the clean-up.
Example: Another way of designing the study is to select a random sample of \(n_1\) locations from which water samples are taken to assess the water quality before the clean-up measures, and a different random sample of \(n_2\) locations which will serve to assess the water quality after the clean-up measures.
In this case the two samples are independent, and \(n_1\) and \(n_2\) may differ.
Example: Suppose a sample of 10 students was given a diagnostic test before studying a particular module and again after completing it. We want to find out whether, in general, our teaching improves students' knowledge and skills (i.e. test scores). The scores are as follows:
| Student | Pre | Post |
|---|---|---|
| 1 | 18 | 22 |
| 2 | 21 | 25 |
| 3 | 16 | 17 |
| 4 | 22 | 24 |
| 5 | 19 | 16 |
| 6 | 24 | 29 |
| 7 | 17 | 20 |
| 8 | 21 | 23 |
| 9 | 23 | 19 |
| 10 | 18 | 20 |
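Paired data are analyzed through the within-pair differences \(D\), one per pair. In the test-score example, \(D\) is the Post score minus the Pre score, so a positive difference indicates improvement.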
Notation:
\(\mu_D=\mu_1-\mu_2=\) population mean difference
\(\bar D=\) sample mean difference
\(S_D=\) sample standard deviation of the differences
\(n=\) number of pairs
Example:
\[ n=10,~\bar D=1.6,~S_D=2.95 \]
\[ H_0:\mu_D=0\\ H_a:\mu_D>0 \]
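The summary statistics in this example can be reproduced from the score table. The following is a minimal sketch in R; the variable names `d`, `d_bar`, and `s_d` are illustrative choices, and the differences are taken as Post minus Pre so that a positive value means improvement.

```r
# Scores from the table above
pre  <- c(18, 21, 16, 22, 19, 24, 17, 21, 23, 18)
post <- c(22, 25, 17, 24, 16, 29, 20, 23, 19, 20)

# One difference per student (Post minus Pre)
d <- post - pre

n     <- length(d)  # number of pairs
d_bar <- mean(d)    # sample mean difference
s_d   <- sd(d)      # sample standard deviation of the differences

# Rounded to two decimals for comparison with the values quoted above
round(c(n = n, d_bar = d_bar, s_d = s_d), 2)
```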
Result:
If the differences are normally distributed, then
\[ {\bar D-\mu_D\over S_D/\sqrt{n}}\sim T_{n-1} \]
\(S_D/\sqrt{n}\) is the standard error of \(\bar D\).
The calculation is the same as for a one-sample t test, applied to the differences.
A \((1-\alpha)100\%\) confidence interval for \(\mu_D=\mu_1-\mu_2\) is
\[ \bar D\pm t_{n-1,\alpha/2}{S_D\over \sqrt{n}} \]
Example: What is a 90% confidence interval for the average improvement in test scores?
\[ n=10,~n-1=9 \]
\[ t_{9,0.05}=qt(0.95,df=9)=1.833 \]
So a 90% confidence interval for \(\mu_D\) is
\[ 1.6\pm 1.833\times {2.95\over \sqrt{10}}=(-0.11,3.31) \]
We are 90% confident that the average improvement is between -0.11 and 3.31 points.
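The interval can also be computed step by step in R. This is a sketch of the formula \(\bar D\pm t_{n-1,\alpha/2}\,S_D/\sqrt{n}\) applied to the test-score data; the names `se`, `t_crit`, and `ci` are illustrative.

```r
pre  <- c(18, 21, 16, 22, 19, 24, 17, 21, 23, 18)
post <- c(22, 25, 17, 24, 16, 29, 20, 23, 19, 20)
d <- post - pre                  # Post minus Pre
n <- length(d)

se     <- sd(d) / sqrt(n)        # standard error of the mean difference
t_crit <- qt(0.95, df = n - 1)   # upper 5% point of T with 9 df (90% two-sided)

# Lower and upper limits of the 90% confidence interval
ci <- mean(d) + c(-1, 1) * t_crit * se
round(ci, 2)
```

Calling `t.test(post, pre, paired = TRUE, conf.level = 0.90)$conf.int` should return the same interval.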
\[ H_0:\mu_D=\Delta_0~~~H_0:\mu_D=\Delta_0~~~H_0:\mu_D=\Delta_0\\ H_a:\mu_D>\Delta_0~~~H_a:\mu_D<\Delta_0~~~H_a:\mu_D\neq\Delta_0 \]
\[ T_{H_0}={\bar D-\Delta_0\over S_D/\sqrt{n}} \]
If \(H_0\) is true, \(T_{H_0}\sim T_{n-1}\).
Rejection rules and p-values are found in the same way as for the one-sample t test.
Example: Test whether the average improvement is greater than 1 point using \(\alpha\) = 0.10.
\[ H_0:\mu_D=1\\ H_a:\mu_D>1 \]
\[ T_{H_0}={\bar D-\Delta_0\over S_D/\sqrt{n}}={1.6-1\over 2.95/\sqrt{10}}=0.643 \]
Method 1: Reject \(H_0\) if \(T_{H_0}>qt(0.9,df=9)=1.383\). Since \(0.643<1.383\), we do not reject \(H_0\).
Method 2: Reject \(H_0\) if the p-value is less than \(\alpha\) = 0.10. The p-value is
\[ 1-pt(0.643,df=9)=0.2681 \]
Since the p-value exceeds \(\alpha\) = 0.10, we do not reject \(H_0\): there is not sufficient evidence that the average improvement is greater than 1 point.
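Both methods can be carried out by hand in R before turning to a built-in test. The sketch below follows the formulas of this example; `delta0`, `t_stat`, `t_crit`, and `reject` are illustrative names, not part of any package.

```r
pre  <- c(18, 21, 16, 22, 19, 24, 17, 21, 23, 18)
post <- c(22, 25, 17, 24, 16, 29, 20, 23, 19, 20)
d <- post - pre                     # Post minus Pre
n <- length(d)

delta0 <- 1                         # hypothesized mean improvement under H0
alpha  <- 0.10                      # significance level

# Test statistic for H0: mu_D = delta0 against Ha: mu_D > delta0
t_stat <- (mean(d) - delta0) / (sd(d) / sqrt(n))

# Method 1: compare with the upper-alpha critical value of T with n - 1 df
t_crit <- qt(1 - alpha, df = n - 1)
reject <- t_stat > t_crit

# Method 2: upper-tail p-value
p_value <- 1 - pt(t_stat, df = n - 1)

c(t_stat = t_stat, t_crit = t_crit, p_value = p_value)
reject
```

The two methods always agree: `reject` is `TRUE` exactly when `p_value < alpha`.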
```r
pre  <- c(18, 21, 16, 22, 19, 24, 17, 21, 23, 18)
post <- c(22, 25, 17, 24, 16, 29, 20, 23, 19, 20)
# Paired one-sided test of H0: mu_D = 1 against Ha: mu_D > 1
t.test(post, pre, mu = 1, paired = TRUE, alternative = "greater")
```

```
##
##  Paired t-test
##
## data:  post and pre
## t = 0.64286, df = 9, p-value = 0.2682
## alternative hypothesis: true mean difference is greater than 1
## 95 percent confidence interval:
##  -0.1109054        Inf
## sample estimates:
## mean difference
##             1.6
```
We can compare proportions from two populations using data from two independent samples.
Let \(X_1\sim Bin(n_1,p_1)\) and \(\hat p_1={X_1\over n_1}\); for large \(n_1\), \(\hat p_1\dot\sim N\left(p_1,{p_1(1-p_1)\over n_1}\right)\).
Let \(X_2\sim Bin(n_2,p_2)\) and \(\hat p_2={X_2\over n_2}\); for large \(n_2\), \(\hat p_2\dot\sim N\left(p_2,{p_2(1-p_2)\over n_2}\right)\).
Result: If \(n_1\hat p_1\geq 8\), \(n_1(1-\hat p_1)\geq 8\), \(n_2\hat p_2\geq8\), and \(n_2(1-\hat p_2)\geq8\), then
\[ {(\hat p_1-\hat p_2)-(p_1-p_2)\over \sqrt{{\hat p_1(1-\hat p_1)\over n_1}+{\hat p_2(1-\hat p_2)\over n_2}}}\dot \sim N(0,1) \]
\(\sqrt{{\hat p_1(1-\hat p_1)\over n_1}+{\hat p_2(1-\hat p_2)\over n_2}}\) is the estimated standard error of \(\hat p_1-\hat p_2\).
Example: Time magazine reported the results of a telephone poll of 800 adult Americans. The question posed to those surveyed was: “Should the federal tax on cigarettes be raised to pay for health care reform?”
| | Non-smokers | Smokers |
|---|---|---|
| Sample size | \(n_1 = 605\) | \(n_2 = 195\) |
| Said “yes” | 351 | 41 |
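The sample proportions, and the sample-size condition for the normal approximation, can be checked directly from these counts. The following is a minimal sketch in R; the names `x1`, `x2`, `p1_hat`, `p2_hat`, and `counts` are illustrative, and the threshold of 8 is the one used in the result stated earlier.

```r
# Poll counts: group 1 = non-smokers, group 2 = smokers
n1 <- 605; x1 <- 351   # sample size and number answering "yes"
n2 <- 195; x2 <- 41

p1_hat <- x1 / n1      # sample proportion of "yes" among non-smokers
p2_hat <- x2 / n2      # sample proportion of "yes" among smokers

# Observed "yes" and "no" counts in each group; all should be at least 8
counts <- c(n1 * p1_hat, n1 * (1 - p1_hat),
            n2 * p2_hat, n2 * (1 - p2_hat))

round(c(p1_hat = p1_hat, p2_hat = p2_hat), 4)
all(counts >= 8)
```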
A \((1-\alpha)100\%\) confidence interval for \(p_1-p_2\) is
\[ (\hat p_1-\hat p_2)\pm z_{\alpha/2}\sqrt{{\hat p_1(1-\hat p_1)\over n_1}+{\hat p_2(1-\hat p_2)\over n_2}} \]
Example: Find a 95% confidence interval for the difference in the proportions of non-smokers and smokers who say “yes”.
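One way to work this example is to apply the interval formula above to the poll counts. The sketch below is an illustration rather than a solution taken from the notes; the names `se`, `z_crit`, and `ci` are illustrative.

```r
# Poll counts: group 1 = non-smokers, group 2 = smokers
n1 <- 605; x1 <- 351
n2 <- 195; x2 <- 41

p1_hat <- x1 / n1
p2_hat <- x2 / n2

# Estimated standard error of the difference in sample proportions
se <- sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)

# z_{alpha/2} for a 95% interval (alpha = 0.05)
z_crit <- qnorm(0.975)

# Lower and upper limits of the 95% confidence interval for p1 - p2
ci <- (p1_hat - p2_hat) + c(-1, 1) * z_crit * se
round(ci, 3)
```

Under these assumptions the interval is roughly (0.30, 0.44). It lies entirely above 0, which suggests that a larger proportion of non-smokers than smokers said “yes”.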