Paired Data

Paired data arise from an alternative sampling design used for the comparison of two means.

Example: A certain lake has been designated for pollution clean-up. One way to assess the effectiveness of the clean-up measures is to randomly select a number n of locations from which water samples are taken and analyzed both before and after the clean-up.

Here the paired data are the before and after measurements taken at each location.

Example: Another way of designing the study is to select a random sample of \(n_1\) locations from which water samples are taken to assess the water quality before the clean-up measures, and a different random sample of \(n_2\) locations which will serve to assess the water quality after the clean-up measures.

In this case we have two independent samples, and \(n_1\) and \(n_2\) may differ.

Example: Suppose a sample of 10 students was given a diagnostic test before studying a particular module and again after completing it. We want to find out whether, in general, our teaching leads to improvements in students' knowledge/skills (i.e. test scores). The scores are as follows:

Student Pre Post
1 18 22
2 21 25
3 16 17
4 22 24
5 19 16
6 24 29
7 17 20
8 21 23
9 23 19
10 18 20

Paired data are analyzed through the within-pair differences \(D\): for each pair, one observation is subtracted from the other. For the test scores above, \(D=\text{Post}-\text{Pre}\).

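Notation:

  • \(D_i\): the difference for pair \(i\), \(i=1,\dots,n\)

  • \(\bar D\): the sample mean of the differences

  • \(S_D\): the sample standard deviation of the differences

  • \(\mu_D\): the population mean difference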

Example:

\[ n=10,~\bar D=1.6,~S_D=2.95 \]

\[ H_0:\mu_D=0\\ H_a:\mu_D>0 \]
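The summary statistics in this example can be recomputed from the score table. The following is a minimal sketch, assuming the differences are taken as Post minus Pre; the variable names `d`, `d_bar` and `s_d` are illustrative and not part of the original notes.

```r
# Scores from the table above
pre  <- c(18, 21, 16, 22, 19, 24, 17, 21, 23, 18)
post <- c(22, 25, 17, 24, 16, 29, 20, 23, 19, 20)

# Within-pair differences (assumed direction: post minus pre)
d <- post - pre

n     <- length(d)  # number of pairs
d_bar <- mean(d)    # sample mean of the differences
s_d   <- sd(d)      # sample standard deviation of the differences

round(c(n = n, d_bar = d_bar, s_d = s_d), 2)
```

Rounded to two decimals, these should agree with the values \(n=10\), \(\bar D=1.6\) and \(S_D=2.95\) quoted above.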

Result:

If the differences are normally distributed, then

\[ {\bar D-\mu_D\over S_D/\sqrt{n}}\sim T_{n-1} \]

\(S_D/\sqrt{n}\) is the standard error of \(\bar D\).

The calculation is the same as for a one-sample t test, applied to the differences.
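To illustrate the link with the one-sample t test, the sketch below runs R's `t.test` twice on the test-score data: once as a paired test and once as a one-sample test on the differences. The object names `paired_fit` and `one_sample_fit` are illustrative, and both calls use the default null value of 0.

```r
pre  <- c(18, 21, 16, 22, 19, 24, 17, 21, 23, 18)
post <- c(22, 25, 17, 24, 16, 29, 20, 23, 19, 20)

# Paired t test on the two columns
paired_fit <- t.test(post, pre, paired = TRUE)

# One-sample t test on the within-pair differences
one_sample_fit <- t.test(post - pre)

# The two test statistics and p-values should coincide
c(paired     = unname(paired_fit$statistic),
  one_sample = unname(one_sample_fit$statistic))
c(paired     = paired_fit$p.value,
  one_sample = one_sample_fit$p.value)
```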

Confidence Interval

A \((1-\alpha)100\%\) confidence interval for \(\mu_D=\mu_1-\mu_2\) is

\[ \bar D\pm t_{n-1,\alpha/2}{S_D\over \sqrt{n}} \]

Example: What is a 90% confidence interval for the average improvement in test scores?

\[ n=10,~n-1=9 \]

\[ t_{9,0.05}=qt(0.95,df=9)=1.833 \]

So a 90% confidence interval for \(\mu_D\) is

\[ 1.6\pm 1.833\times {2.95\over \sqrt{10}}=(-0.11,3.31) \]

We are 90% confident that the average improvement is between -0.11 and 3.31.
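The interval can be reproduced step by step in R. This is a sketch that starts from the rounded summary statistics quoted in the example; the names `conf_level`, `t_crit`, `margin` and `ci` are illustrative.

```r
# Summary statistics quoted in the example (rounded)
d_bar <- 1.6
s_d   <- 2.95
n     <- 10

conf_level <- 0.90
alpha      <- 1 - conf_level

# Critical value t_{n-1, alpha/2}
t_crit <- qt(1 - alpha / 2, df = n - 1)

# Margin of error: critical value times the standard error of the mean difference
margin <- t_crit * s_d / sqrt(n)

ci <- c(lower = d_bar - margin, upper = d_bar + margin)
round(ci, 2)
```

Changing `conf_level` gives intervals at other confidence levels from the same summary statistics.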

Hypothesis Test

  1. State the hypotheses

\[ H_0:\mu_D=\Delta_0~~~H_0:\mu_D=\Delta_0~~~H_0:\mu_D=\Delta_0\\ H_a:\mu_D>\Delta_0~~~H_a:\mu_D<\Delta_0~~~H_a:\mu_D\neq\Delta_0 \]

  2. Compute the test statistic

\[ T_{H_0}={\bar D-\Delta_0\over S_D/\sqrt{n}} \]

  3. Reach a conclusion
  • If \(H_0\) is true, \(T_{H_0}\sim T_{n-1}\).

  • We can find rejection rules and p-values using the same method as in our previous t tests.

  4. State the conclusion in the context of the problem.
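For reference, the standard t-test rejection rules and p-values for the three alternatives can be written out as follows. Here \(T_{H_0}\) is the observed test statistic, \(T_{n-1}\) denotes a random variable with a t distribution on \(n-1\) degrees of freedom, and \(t_{n-1,\alpha}\) is its upper \(\alpha\) critical value.

  • \(H_a:\mu_D>\Delta_0\): reject \(H_0\) if \(T_{H_0}>t_{n-1,\alpha}\); the p-value is \(P(T_{n-1}>T_{H_0})\).

  • \(H_a:\mu_D<\Delta_0\): reject \(H_0\) if \(T_{H_0}<-t_{n-1,\alpha}\); the p-value is \(P(T_{n-1}<T_{H_0})\).

  • \(H_a:\mu_D\neq\Delta_0\): reject \(H_0\) if \(|T_{H_0}|>t_{n-1,\alpha/2}\); the p-value is \(2P(T_{n-1}>|T_{H_0}|)\).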

Example: Test whether the average improvement is greater than 1 point using \(\alpha\) = 0.10.

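  1. State the hypotheses

\[ H_0:\mu_D=1\\ H_a:\mu_D>1 \]

  2. Compute the test statistic

\[ T_{H_0}={\bar D-\Delta_0\over S_D/\sqrt{n}}={1.6-1\over 2.95/\sqrt{10}}=0.643 \]

  3. Reach a conclusion

  • Method 1: Reject \(H_0\) if \(T_{H_0}>qt(0.9,df=9)=1.383\). Since \(0.643<1.383\), we fail to reject \(H_0\).

  • Method 2: Reject \(H_0\) if the p-value is less than \(\alpha\) = 0.10. The p-value is

\[ 1-pt(0.643,df=9)=0.2681 \]

which is greater than 0.10, so we fail to reject \(H_0\).

  4. State the conclusion: at the 0.10 level, we do not have sufficient evidence that the average improvement is greater than 1 point.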
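Both decision methods in this example can be checked by hand in R. The sketch below starts from the rounded summary statistics used above; the names `delta_0`, `t_stat`, `t_crit`, `p_value` and `reject` are illustrative.

```r
# Summary statistics quoted in the example (rounded)
d_bar <- 1.6
s_d   <- 2.95
n     <- 10

delta_0 <- 1     # hypothesized mean difference under H0
alpha   <- 0.10  # significance level

# Test statistic
t_stat <- (d_bar - delta_0) / (s_d / sqrt(n))

# Method 1: compare with the upper-tail critical value
t_crit <- qt(1 - alpha, df = n - 1)
reject <- t_stat > t_crit

# Method 2: upper-tail p-value
p_value <- 1 - pt(t_stat, df = n - 1)

round(c(t_stat = t_stat, t_crit = t_crit, p_value = p_value), 3)
reject
```

Both methods necessarily lead to the same decision, since the p-value falls below \(\alpha\) exactly when the statistic exceeds the critical value.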

In R

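# Test scores for the same 10 students before and after the module
pre <- c(18,21,16,22,19,24,17,21,23,18)
post <- c(22,25,17,24,16,29,20,23,19,20)
# Paired t test of H0: mu_D = 1 vs Ha: mu_D > 1, with D = post - pre
t.test(post, pre, mu = 1, paired = TRUE, alternative = "greater")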
## 
##  Paired t-test
## 
## data:  post and pre
## t = 0.64286, df = 9, p-value = 0.2682
## alternative hypothesis: true mean difference is greater than 1
## 95 percent confidence interval:
##  -0.1109054        Inf
## sample estimates:
## mean difference 
##             1.6
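Note that the interval printed above is a one-sided 95% lower bound, because the call uses `alternative = "greater"` and the default `conf.level = 0.95`. A two-sided 90% interval can be requested directly, as in the sketch below; the object name `fit90` is illustrative.

```r
pre  <- c(18, 21, 16, 22, 19, 24, 17, 21, 23, 18)
post <- c(22, 25, 17, 24, 16, 29, 20, 23, 19, 20)

# Two-sided paired t procedure with a 90% confidence level
fit90 <- t.test(post, pre, paired = TRUE, conf.level = 0.90)

# Lower and upper limits of the 90% interval for the mean difference
fit90$conf.int
```

The lower limit matches the one-sided 95% bound in the output above because both are based on the same critical value \(t_{9,0.05}\).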

Comparing Two Proportions

We can compare proportions from two populations using data from two independent samples.

Result: If \(n_1\hat p_1\geq 8\), \(n_1(1-\hat p_1)\geq 8\), \(n_2\hat p_2\geq8\), \(n_2(1-\hat p_2)\geq8\), then

\[ {(\hat p_1-\hat p_2)-(p_1-p_2)\over \sqrt{{\hat p_1(1-\hat p_1)\over n_1}+{\hat p_2(1-\hat p_2)\over n_2}}}\dot \sim N(0,1) \]

\(\sqrt{{\hat p_1(1-\hat p_1)\over n_1}+{\hat p_2(1-\hat p_2)\over n_2}}\) is the standard error.

Example: Time magazine reported the result of a telephone poll of 800 adult Americans. The question posed of the Americans who were surveyed was: “Should the federal tax on cigarettes be raised to pay for health care reform?”

  • Non-smokers: \(n_1 = 605\), of whom 351 said “yes”

  • Smokers: \(n_2 = 195\), of whom 41 said “yes”
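Before using the normal approximation, the sample proportions and the sample-size conditions from the Result can be checked for the poll data. This is a minimal sketch; the names `x1`, `x2`, `counts`, `conditions_ok` and `se` are illustrative, and the threshold of 8 is the one stated in these notes.

```r
# Poll data: sample sizes and number of "yes" answers
n1 <- 605; x1 <- 351  # non-smokers
n2 <- 195; x2 <- 41   # smokers

# Sample proportions
p1_hat <- x1 / n1
p2_hat <- x2 / n2

# Observed successes and failures in each sample
counts <- c(n1 * p1_hat, n1 * (1 - p1_hat),
            n2 * p2_hat, n2 * (1 - p2_hat))
conditions_ok <- all(counts >= 8)

# Standard error of the difference in sample proportions
se <- sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)

round(c(p1_hat = p1_hat, p2_hat = p2_hat, se = se), 4)
conditions_ok
```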

Confidence Interval

A \((1-\alpha)100\%\) confidence interval for \(p_1-p_2\) is

\[ (\hat p_1-\hat p_2)\pm z_{\alpha/2}\sqrt{{\hat p_1(1-\hat p_1)\over n_1}+{\hat p_2(1-\hat p_2)\over n_2}} \]

Example: Find a 95% confidence interval for the difference in the proportions of non-smokers and smokers who say “yes”.
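One way to carry out this calculation is to apply the confidence interval formula directly to the poll counts. The sketch below does so; the names `diff_hat`, `z_crit` and `ci` are illustrative.

```r
# Poll data: sample sizes and number of "yes" answers
n1 <- 605; x1 <- 351  # non-smokers
n2 <- 195; x2 <- 41   # smokers

p1_hat <- x1 / n1
p2_hat <- x2 / n2

# Point estimate and standard error of p1 - p2
diff_hat <- p1_hat - p2_hat
se <- sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)

# Critical value z_{alpha/2} for a 95% interval
alpha  <- 0.05
z_crit <- qnorm(1 - alpha / 2)

ci <- diff_hat + c(-1, 1) * z_crit * se
round(ci, 3)
```

As a cross-check, `prop.test(c(x1, x2), c(n1, n2), correct = FALSE)$conf.int` should report the same interval, since without the continuity correction it is based on the same standard error.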