---
output:
  html_document:
    df_print: paged
    toc: yes
  html_notebook: default
  pdf_document:
    toc: yes
---
```{r, echo=FALSE, warning=FALSE}
library(knitr)
#This code automatically tidies code so that it does not reach over the page
opts_chunk$set(tidy.opts=list(width.cutoff=50),tidy=TRUE, rownames.print = FALSE, rows.print = 10)
options(scipen = 999, digits = 7)
```
# Hypothesis testing
## Introduction
::: {.infobox .download data-latex="{download}"}
[You can download the corresponding R-Code here](./Code/06-hypothesis_testing.R)
:::
<br>
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/cJRwmWWCpZE" frameborder="0" allowfullscreen></iframe>
</div>
<br>
We test hypotheses because we are confined to taking samples – we rarely work with the entire population. In the previous chapter, we introduced the standard error (i.e., the standard deviation of a large number of hypothetical samples) as an estimate of how well a particular sample represents the population. We also saw how we can construct confidence intervals around the sample mean $\bar x$ by computing $SE_{\bar x}$ as an estimate of $\sigma_{\bar x}$ using $s$ as an estimate of $\sigma$ and calculating the 95% CI as $\bar x \pm 1.96 * SE_{\bar x}$. Although we do not know the true population mean ($\mu$), we might have a hypothesis about it, and this would tell us what the corresponding sampling distribution looks like. Based on the sampling distribution of the hypothesized population mean, we could then determine the probability of a given sample **assuming that the hypothesis is true**.
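As a brief recap of the previous chapter, the following minimal sketch shows how the standard error and the 95% confidence interval would be computed for a small hypothetical sample vector (the values are made up for illustration only):

```{r, eval = FALSE}
x <- c(12, 25, 8, 31, 17, 22, 9, 28, 15, 20)  # hypothetical listening times
SE_x <- sd(x) / sqrt(length(x))               # standard error of the mean
mean(x) - 1.96 * SE_x                         # lower bound of the 95% CI
mean(x) + 1.96 * SE_x                         # upper bound of the 95% CI
```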
Let us again begin by assuming we know the entire population, using the example of music listening times among students from the previous chapter. As a reminder, the following plot shows the distribution of music listening times in the population of WU students.
```{r, message = FALSE, warning=FALSE}
library(tidyverse)
library(ggplot2)
library(latex2exp)
set.seed(321)
hours <- rgamma(n = 25000, shape = 2, scale = 10)
ggplot(data.frame(hours)) +
geom_histogram(aes(x = hours), bins = 30, fill='white', color='black') +
geom_vline(xintercept = mean(hours), size = 1) + theme_bw() +
labs(title = "Histogram of listening times",
subtitle = TeX(sprintf("Population mean ($\\mu$) = %.2f; population standard deviation ($\\sigma$) = %.2f",round(mean(hours),2),round(sd(hours),2))),
y = 'Number of students',
x = 'Hours')
```
In this example, the population mean ($\mu$) is equal to `r round(mean(hours),2)`, and the population standard deviation $\sigma$ is equal to `r round(sd(hours),2)`.
### The null hypothesis
<br>
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/DZOVAkWNgTg" frameborder="0" allowfullscreen></iframe>
</div>
<br>
Let us assume that we were planning to take a random sample of 50 students from this population and our hypothesis was that the mean listening time is equal to some specific value $\mu_0$, say $10$. This would be our **null hypothesis**. The null hypothesis refers to the statement that is being tested and is usually a statement of the status quo, one of no difference or no effect. In our example, the null hypothesis would state that there is no difference between the true population mean $\mu$ and the hypothesized value $\mu_0$ (in our example $10$), which can be expressed as follows:
$$
H_0: \mu = \mu_0
$$
When conducting research, we are usually interested in providing evidence against the null hypothesis. If we then observe sufficient evidence against it, our estimate is said to be significant. If the null hypothesis is rejected, this is taken as support for the **alternative hypothesis**. The alternative hypothesis assumes that some difference exists, which can be expressed as follows:
$$
H_1: \mu \neq \mu_0
$$
Accepting the alternative hypothesis in turn will often lead to changes in opinions or actions. Note that while the null hypothesis may be rejected, it can never be accepted based on a single test. If we fail to reject the null hypothesis, it means that we simply haven't collected enough evidence against the null hypothesis to disprove it. In classical hypothesis testing, there is no way to determine whether the null hypothesis is true. **Hypothesis testing** provides a means to quantify to what extent the data from our sample is in line with the null hypothesis.
In order to quantify the concept of "sufficient evidence" we look at the theoretical distribution of the sample means given our null hypothesis and the sample standard error. Using the available information we can infer the sampling distribution for our null hypothesis. Recall that the standard deviation of the sampling distribution (i.e., the standard error of the mean) is given by $\sigma_{\bar x}={\sigma \over \sqrt{n}}$, and thus can be computed as follows:
```{r}
mean_pop <- mean(hours)
sigma <- sd(hours) #population standard deviation
n <- 50 #sample size
standard_error <- sigma/sqrt(n) #standard error
standard_error
```
Since we know from the central limit theorem that the sampling distribution is normal for large enough samples, we can now visualize the expected sampling distribution **if our null hypothesis was in fact true** (i.e., if there was no difference between the true population mean and the hypothesized mean of 10).
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.height = 4, fig.width = 8}
library(latex2exp)
H_0 <- 10
p1 <- 0.025
p2 <- 0.975
min <- 0
max <- 20
norm1 <- round(qnorm(p1), digits = 3)
norm2 <- round(qnorm(p2), digits = 3)
ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dnorm, args = list(mean = H_0, sd = standard_error)) +
stat_function(fun = dnorm, args = list(mean = H_0, sd = standard_error), xlim = c(min, qnorm(p1, mean = H_0, sd = standard_error)), geom = "area") +
stat_function(fun = dnorm, args = list(mean = H_0, sd = standard_error), xlim = c(max, qnorm(p2, mean = H_0, sd = standard_error)), geom = "area") +
scale_x_continuous(breaks=c(0,qnorm(p1, mean = H_0, sd = standard_error),10,qnorm(p2, mean = H_0, sd = standard_error),20), labels=c("0",TeX(sprintf("%.2f $* \\sigma_{\\bar x}$",qnorm(p1))),"10",TeX(sprintf("%.2f $* \\sigma_{\\bar x}$",qnorm(p2))),"20")) +
labs(title = TeX(sprintf("Theoretical density given null hypothesis $\\mu_0=$ 10 ($\\sigma_{\\bar x}$ = %.2f)",standard_error)),x = "Hours", y = "Density") +
theme(legend.position="none") +
theme_bw()
```
We also know that 95% of the probability is within `r round(qnorm(p2),2)` standard deviations from the mean. Values further away from the mean than that are rather unlikely if our hypothesis about the population mean was indeed true. This is shown by the shaded areas, also known as the "rejection region". To test our hypothesis that the population mean is equal to $10$, let us take a random sample from the population.
```{r}
set.seed(12567)
H_0 <- 10
student_sample <- sample(1:25000, size = 50, replace = FALSE)
music_listening_sample <- data.frame(hours = hours[student_sample])
mean_sample <- mean(music_listening_sample$hours)
ggplot(music_listening_sample) +
geom_histogram(aes(x = hours), fill = 'white', color = 'black', bins = 20) +
theme_bw() + geom_vline(xintercept = mean_sample, color = 'black', size=1) +
labs(title = TeX(sprintf("Distribution of values in the sample ($n =$ %.0f, $\\bar{x] = $ %.2f, s = %.2f)",n,mean_sample,sd(music_listening_sample$hours))),x = "Hours", y = "Frequency")
```
The mean listening time in the sample (black line) $\bar x$ is `r round(mean_sample,2)`. We can already see from the graphic above that such a value is rather unlikely under the hypothesis that the population mean is $10$. Intuitively, such a result would therefore provide evidence against our null hypothesis. But how could we quantify specifically how unlikely it is to obtain such a value and decide whether or not to reject the null hypothesis? Significance tests can be used to provide answers to these questions.
### Statistical inference on a sample
#### Test statistic
##### z-scores
Let's go back to the sampling distribution above. We know that 95% of all values will fall within `r round(qnorm(p2),2)` standard deviations from the mean. So if we could express the distance between our sample mean and the null hypothesis in terms of standard deviations, we could make statements about the probability of getting a sample mean of the observed magnitude (or more extreme values). Essentially, we would like to know how many standard deviations ($\sigma_{\bar x}$) our sample mean ($\bar x$) is away from the population mean if the null hypothesis was true ($\mu_0$). This can be formally expressed as follows:
$$
\bar x- \mu_0 = z \sigma_{\bar x}
$$
In this equation, ```z``` will tell us how many standard deviations the sample mean $\bar x$ is away from the null hypothesis $\mu_0$. Solving for ```z``` gives us:
$$
z = {\bar x- \mu_0 \over \sigma_{\bar x}}={\bar x- \mu_0 \over \sigma / \sqrt{n}}
$$
This standardized value (or "z-score") is also referred to as a **test statistic**. Let's compute the test statistic for our example above:
```{r}
z_score <- (mean_sample - H_0)/(sigma/sqrt(n))
z_score
```
To make a decision on whether the difference can be deemed statistically significant, we now need to compare this calculated test statistic to a meaningful threshold. In order to do so, we need to decide on a significance level $\alpha$, which expresses the probability of finding an effect that does not actually exist (i.e., Type I Error). You can find a detailed discussion of this point at the end of this chapter. For now, we will adopt the widely accepted significance level of 5% and set $\alpha$ to 0.05. The critical value for the normal distribution and $\alpha$ = 0.05 can be computed using the ```qnorm()``` function as follows:
```{r}
z_crit <- qnorm(0.975)
z_crit
```
We use ```0.975``` and not ```0.95``` since we are running a two-sided test and need to account for the rejection region at the other end of the distribution. Recall that for the normal distribution, 95% of the total probability falls within `r round(qnorm(0.975),2)` standard deviations of the mean, so that higher (absolute) values provide evidence against the null hypothesis. Generally, we speak of a statistically significant effect if the (absolute) calculated test statistic is larger than the (absolute) critical value. We can easily check if this is the case in our example:
```{r}
abs(z_score) > abs(z_crit)
```
Since the absolute value of the calculated test statistic is larger than the critical value, we would reject $H_0$ and conclude that the true population mean $\mu$ is significantly different from the hypothesized value $\mu_0 = 10$.
##### t-statistic
You may have noticed that the formula for the z-score above assumes that we know the true population standard deviation ($\sigma$) when computing the standard deviation of the sampling distribution ($\sigma_{\bar x}$) in the denominator. However, the population standard deviation is usually not known in the real world and therefore represents another unknown population parameter which we have to estimate from the sample. We saw in the previous chapter that we usually use $s$ as an estimate of $\sigma$ and $SE_{\bar x}$ as an estimate of $\sigma_{\bar x}$. Intuitively, we should be more conservative regarding the critical value that we used above to assess whether we have a significant effect, to reflect this uncertainty about the true population standard deviation. That is, the threshold for a "significant" effect should be higher to safeguard against falsely claiming a significant effect when there is none. If we replace $\sigma_{\bar x}$ by its estimate $SE_{\bar x}$ in the formula for the z-score, we get a new test statistic (i.e., the **t-statistic**) with its own distribution (the **t-distribution**):
$$
t = {\bar x- \mu_0 \over SE_{\bar x}}={\bar x- \mu_0 \over s / \sqrt{n}}
$$
Here, $\bar x$ denotes the sample mean and $s$ the sample standard deviation. The t-distribution has more probability in its "tails", i.e., farther away from the mean. This reflects the higher uncertainty introduced by replacing the population standard deviation with its sample estimate. Intuitively, this is particularly relevant for small samples, since the uncertainty about the true population parameters decreases with increasing sample size. This is reflected by the fact that the exact shape of the t-distribution depends on the **degrees of freedom**, which is the sample size minus one (i.e., $n-1$). To see this, the following graph shows the t-distribution with different degrees of freedom for a two-tailed test and $\alpha = 0.05$. The grey curve shows the normal distribution.
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.height = 8, fig.width = 8}
library(cowplot)
library(gridExtra)
library(grid)
df <- 5
p1 <- 0.025
p2 <- 0.975
min <- -5
max <- 5
t1 <- round(qt(p1, df = df), digits = 3)
t2 <- round(qt(p2, df = df), digits = 3)
plot1 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dnorm, color = "grey") +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(min, qt(p1, df = df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(max, qt(p2, df = df)), geom = "area") +
stat_function(fun = dnorm, color = "grey") +
scale_x_continuous(breaks = c(t1, 0, t2)) +
labs(title = paste0("df= ", df),x = "x", y = "Density") +
theme(legend.position="none") +
theme_bw()
df <- 10
p1 <- 0.025
p2 <- 0.975
min <- -5
max <- 5
t1 <- round(qt(p1, df = df), digits = 3)
t2 <- round(qt(p2, df = df), digits = 3)
plot2 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dnorm, color = "grey") +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(min, qt(p1, df = df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(max, qt(p2, df = df)), geom = "area") +
scale_x_continuous(breaks = c(t1, 0, t2)) +
labs(title = paste0("df= ",df),x = "x", y = "Density") +
theme(legend.position = "none") +
theme_bw()
df <- 100
p1 <- 0.025
p2 <- 0.975
min <- -5
max <- 5
t1 <- round(qt(p1, df = df), digits = 3)
t2 <- round(qt(p2, df = df), digits = 3)
plot3 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dnorm, color = "grey") +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(min, qt(p1, df = df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(max, qt(p2, df = df)), geom = "area") +
scale_x_continuous(breaks = c(t1, 0, t2)) +
labs(title = paste0("df= ",df),x = "x", y = "Density") +
theme(legend.position = "none") +
theme_bw()
df <- 1000
p1 <- 0.025
p2 <- 0.975
min <- -5
max <- 5
t1 <- round(qt(p1, df = df), digits = 3)
t2 <- round(qt(p2, df = df), digits = 3)
plot4 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dnorm, color = "grey") +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(min, qt(p1, df = df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(max, qt(p2, df = df)), geom = "area") +
scale_x_continuous(breaks = c(t1, 0, t2)) +
labs(title = paste0("df= ",df),
x = "x", y = "Density") +
theme(legend.position = "none") +
theme_bw()
p <- plot_grid(plot1, plot2, plot3, plot4, ncol = 2,
labels = c("A", "B","C","D"))
title <- ggdraw() + draw_label('Degrees of freedom and the t-distribution', fontface='bold')
p <- plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1)) # rel_heights values control title margins
print(p)
```
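The same convergence can be verified numerically by comparing the critical values of the t-distribution for increasing degrees of freedom to the critical value of the standard normal distribution (the chosen degrees of freedom mirror those in the plots above):

```{r}
qt(0.975, df = c(5, 10, 100, 1000))  # t-based critical values
qnorm(0.975)                         # critical value of the normal distribution
```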
Notice that as $n$ gets larger, the t-distribution gets closer and closer to the normal distribution, reflecting the fact that the uncertainty introduced by $s$ is reduced. To summarize, we now have an estimate for the standard deviation of the distribution of the sample mean (i.e., $SE_{\bar x}$) and an appropriate distribution that takes into account the necessary uncertainty (i.e., the t-distribution). Let us now compute the t-statistic according to the formula above:
```{r}
SE <- (sd(music_listening_sample$hours)/sqrt(n))
t_score <- (mean_sample - H_0)/SE
t_score
```
Notice that the value of the t-statistic is higher compared to the z-score (`r round(z_score,2)`). This can be attributed to the fact that, in this sample, $s$ underestimates the true population standard deviation $\sigma$. Hence, the critical value would need to be larger to adjust for this additional uncertainty. This is what the t-distribution does. Let us compute the critical value from the t-distribution with ```n - 1``` degrees of freedom.
```{r}
df = n - 1
t_crit <- qt(0.975, df = df)
t_crit
```
Again, we use ```0.975``` and not ```0.95``` since we are running a two-sided test and need to account for the rejection region at the other end of the distribution. Notice that the new critical value based on the t-distribution is larger, to reflect the uncertainty when estimating $\sigma$ from $s$. Now we can see that the calculated test statistic is still larger than the critical value.
```{r}
abs(t_score) > abs(t_crit)
```
The following graphic shows that the calculated test statistic (red line) falls into the rejection region so that, in our example, we would reject the null hypothesis that the true population mean is equal to $10$.
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.height = 6, fig.width = 8}
p1 <- 0.025
p2 <- 0.975
min <- -6
max <- 6
t1 <- round(qt(p1, df = df), digits = 3)
t2 <- round(qt(p2, df = df), digits = 3)
ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(min, qt(p1, df = df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(max, qt(p2, df = df)), geom = "area") +
geom_vline(xintercept = t_score, color = 'red', size=1) +
scale_x_continuous(breaks = c(t1, 0, t2)) +
labs(title = "Theoretical density given null hypothesis 10 and sample t-statistic",
x = "x", y = "Density") +
theme(legend.position = "none") +
theme_bw()
```
**Decision:** Reject $H_0$, given that the calculated test statistic is larger than the critical value.
Something to keep in mind here is that the test statistic is a function of the sample size. Thus, as $n$ gets larger, the test statistic gets larger as well and we are more likely to find a significant effect. This reflects the decrease in uncertainty about the true population mean as our sample size increases.
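To illustrate this, the following sketch holds the observed mean difference and the sample standard deviation from our example fixed and recomputes the t-statistic for a few hypothetical sample sizes (the sizes are chosen for illustration only):

```{r}
n_hypothetical <- c(10, 50, 200, 1000)  # hypothetical sample sizes
s_sample <- sd(music_listening_sample$hours)
(mean_sample - H_0) / (s_sample / sqrt(n_hypothetical))  # t-statistic grows with n
```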
#### P-values
In the previous section, we computed the test statistic, which tells us how close our sample is to the null hypothesis. The p-value corresponds to the probability that the test statistic would take a value as extreme or more extreme than the one that we actually observed, **assuming that the null hypothesis is true**. It is important to note that this is a **conditional probability**: we compute the probability of observing a sample mean (or a more extreme value) conditional on the assumption that the null hypothesis is true. The ```pt()``` function can be used to compute this probability. It is the cumulative distribution function of the t-distribution. Cumulative probability means that the function returns the probability that the test statistic will take a value **less than or equal to** the calculated test statistic, given the degrees of freedom. However, we are interested in obtaining the probability of observing a test statistic **larger than or equal to** the calculated test statistic under the null hypothesis (i.e., the p-value). Thus, we need to subtract the cumulative probability from 1. In addition, since we are running a two-sided test, we need to multiply the probability by 2 to account for the rejection region at the other side of the distribution.
```{r}
p_value <- 2*(1-pt(abs(t_score), df = df))
p_value
```
This value corresponds to the probability of observing a mean equal to or larger than the one we obtained from our sample, if the null hypothesis was true. As you can see, this probability is very low. A small p-value signals that it is unlikely to observe the calculated test statistic under the null hypothesis. To decide whether or not to reject the null hypothesis, we would now compare this value to the level of significance ($\alpha$) that we chose for our test. For this example, we adopt the widely accepted significance level of 5%, so any test result with a p-value < 0.05 would be deemed statistically significant. Note that the p-value is directly related to the value of the test statistic. The relationship is such that the higher (lower) the absolute value of the test statistic, the lower (higher) the p-value.
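The following sketch illustrates this relationship by computing the two-sided p-value for a few hypothetical values of the test statistic on the same degrees of freedom:

```{r}
t_values <- c(1, 2, 3, 4)        # hypothetical test statistics
2 * (1 - pt(t_values, df = df))  # corresponding two-sided p-values
```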
**Decision:** Reject $H_0$, given that the p-value is smaller than 0.05.
#### Confidence interval
For a given statistic calculated for a sample of observations (e.g., listening times), a 95% confidence interval can be constructed such that in 95% of samples, the true population mean will fall within its limits. If the parameter value specified in the null hypothesis (here $10$) does not lie within the bounds, we reject $H_0$. Building on what we learned about confidence intervals in the previous chapter, the 95% confidence interval based on the t-distribution can be computed as follows:
$$
CI_{lower} = {\bar x} - t_{1-{\alpha \over 2}} * SE_{\bar x} \\
CI_{upper} = {\bar x} + t_{1-{\alpha \over 2}} * SE_{\bar x}
$$
It is easy to compute this interval manually:
```{r message=FALSE, warning=FALSE}
ci_lower <- (mean_sample)-qt(0.975, df = df)*SE
ci_upper <- (mean_sample)+qt(0.975, df = df)*SE
ci_lower
ci_upper
```
The interpretation of this interval is as follows: if we (hypothetically) took 100 samples and calculated the mean and confidence interval for each of them, then the true population mean would be included in about 95% of these intervals. The CI is informative when reporting the result of your test, since it provides an estimate of the uncertainty associated with the test result. From the test statistic or the p-value alone, it is not easy to judge in which range the true population parameter is located. The CI provides an estimate of this range.
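Because we simulated the entire population above, we can check this interpretation ourselves. The following sketch repeatedly draws samples of 50 students from the population, computes the 95% confidence interval for each, and records how often the true population mean is covered (the seed and the number of replications are arbitrary choices for this illustration):

```{r}
set.seed(123)
replications <- 1000
covered <- logical(replications)
for (i in 1:replications) {
  sample_i <- sample(hours, size = 50)  # draw a random sample of 50 students
  se_i <- sd(sample_i) / sqrt(50)       # standard error of the sample mean
  ci_l <- mean(sample_i) - qt(0.975, df = 49) * se_i
  ci_u <- mean(sample_i) + qt(0.975, df = 49) * se_i
  covered[i] <- mean(hours) >= ci_l & mean(hours) <= ci_u
}
mean(covered)  # share of intervals that contain the true population mean
```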
**Decision:** Reject $H_0$, given that the parameter value from the null hypothesis ($10$) is not included in the interval.
To summarize, you can see that we arrive at the same conclusion (i.e., reject $H_0$), irrespective of whether we use the test statistic, the p-value, or the confidence interval. However, keep in mind that rejecting the null hypothesis does not prove the alternative hypothesis (we can merely provide support for it). Rather, think of the p-value as the chance of obtaining the data we've collected assuming that the null hypothesis is true. You should report the confidence interval to provide an estimate of the uncertainty associated with your test results.
### Choosing the right test
The test statistic, as we have seen, measures how close the sample is to the null hypothesis and often follows a well-known distribution (e.g., normal, t, or chi-square). To select the correct test, various factors need to be taken into consideration. Some examples are:
* On what scale are your variables measured (categorical vs. continuous)?
* Do you want to test for relationships or differences?
* If you test for differences, how many groups would you like to test?
* For parametric tests, are the assumptions fulfilled?
The previous discussion used a **one sample t-test** as an example, which requires that the variable is measured on an interval or ratio scale. If you are confronted with other settings, the following flow chart provides a rough guideline on selecting the correct test:
![Flowchart for selecting an appropriate test (source: McElreath, R. (2016): Statistical Rethinking, p. 2)](https://github.com/IMSMWU/Teaching/raw/master/MRDA2017/testselection.JPG)
For a detailed overview of the different types of tests, please also refer to <a href="https://stats.idre.ucla.edu/other/mult-pkg/whatstat/" target="_blank">this overview</a> by UCLA.
#### Parametric vs. non-parametric tests
A basic distinction can be made between parametric and non-parametric tests. **Parametric tests** require that variables are measured on an interval or ratio scale and that the sampling distribution follows a known distribution. **Non-parametric tests**, on the other hand, do not require the sampling distribution to be normally distributed (they are also known as "assumption-free tests"). These tests may be used when the variable of interest is measured on an ordinal scale or when the parametric assumptions do not hold. They often rely on ranking the data instead of analyzing the actual scores. By ranking the data, information on the magnitude of differences is lost. Thus, parametric tests are more powerful if the sampling distribution is normally distributed. In this chapter, we will first focus on parametric tests and cover non-parametric tests later.
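To illustrate what is meant by losing information on the magnitude of differences, consider how ranking treats a small set of hypothetical values:

```{r}
values <- c(2, 3, 100)  # hypothetical scores
rank(values)            # the ranks preserve the order, but not the distances
```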
#### One-tailed vs. two-tailed test
For some tests you may choose between a **one-tailed test** and a **two-tailed test**. The choice depends on the hypothesis you specified, i.e., whether you specified a directional or a non-directional hypothesis. In the example above, we used a **non-directional hypothesis**. That is, we stated that the mean is different from the comparison value $\mu_0$, but we did not state the direction of the effect. A **directional hypothesis** states the direction of the effect. For example, we might test whether the population mean is smaller than a comparison value:
$$
H_0: \mu \ge \mu_0 \\
H_1: \mu < \mu_0
$$
Similarly, we could test whether the population mean is larger than a comparison value:
$$
H_0: \mu \le \mu_0 \\
H_1: \mu > \mu_0
$$
Connected to the decision of how to phrase the hypotheses (directional vs. non-directional) is the choice of a **one-tailed test** versus a **two-tailed test**. Let's first think about the meaning of a one-tailed test. Using a significance level of 0.05, a one-tailed test means that 5% of the total area under the probability distribution of our test statistic is located in one tail. Thus, under a one-tailed test, we test for the possibility of the relationship in one direction only, disregarding the possibility of a relationship in the other direction. In our example, a one-tailed test could test either whether the mean listening time is significantly larger or significantly smaller than the hypothesized value, but not both. Depending on the direction, the mean listening time is significantly larger (smaller) if the test statistic is located in the top (bottom) 5% of its probability distribution.
The following graph shows the critical values that our test statistic would need to surpass so that the difference between the population mean and the comparison value would be deemed statistically significant.
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig2, fig.align="center", fig.height = 3, fig.width = 10}
library(cowplot)
library(gridExtra)
library(grid)
df <- n-1
p1 <- 0.025
p2 <- 0.975
min <- -5
max <- 5
t1 <- round(qt(p1, df = df), digits = 3)
t2 <- round(qt(p2, df = df), digits = 3)
plot1 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(min, qt(p1, df = df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(max, qt(p2, df = df)), geom = "area") +
scale_x_continuous(breaks = c(t1, 0, t2)) +
labs(title = paste0("Two-sided test"),
subtitle = "0.025 of total area on each side; df = 49",
x = "x", y = "Density") +
theme(legend.position = "none") +
theme_bw()
df <- n-1
p1 <- 0.000
p2 <- 0.950
min <- -5
max <- 5
t1 <- round(qt(p1,df=df), digits = 3)
t2 <- round(qt(p2,df=df), digits = 3)
plot2 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(min,qt(p1,df=df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(max,qt(p2,df=df)), geom = "area") +
scale_x_continuous(breaks = c(t1,0,t2)) +
labs(title = paste0("One-sided test (right)"),
subtitle = "0.05 of total area on the right; df = 49",
x = "x", y = "Density") +
theme(legend.position="none") + theme_bw()
df <- n-1
p1 <- 0.000
p2 <- 0.050
min <- -5
max <- 5
t1 <- round(qt(p1,df=df), digits = 3)
t2 <- round(qt(p2,df=df), digits = 3)
plot3 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(max,qt(p1,df=df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(min,qt(p2,df=df)), geom = "area") +
scale_x_continuous(breaks = c(t1,0,t2)) +
labs(title = paste0("One-sided test (left)"),
subtitle = "0.05 of total area on the left; df = 49",
x = "x", y = "Density") +
theme(legend.position="none") + theme_bw()
p <- plot_grid(plot3,plot1, plot2, ncol = 3)
print(p)
```
It can be seen that under a one-sided test, the rejection region is at one end of the distribution or the other. In a two-sided test, the rejection region is split between the two tails. As a consequence, the critical value of the test statistic is smaller (in absolute terms) using a one-tailed test, meaning that it has more power to detect an effect. Having said that, in most applications, we would like to be able to catch effects in both directions, simply because we often cannot rule out that an effect might exist that is not in the hypothesized direction. For example, if we conducted a one-tailed test for a mean larger than some specified value but the mean turned out to be substantially smaller, then testing a one-directional hypothesis ($H_0: \mu \le \mu_0$) would not allow us to conclude that there is a significant effect, because there is no rejection region at this end of the distribution.
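The difference in critical values can be checked directly. With 49 degrees of freedom (as in our example), the one-tailed critical value is smaller in absolute terms than the two-tailed one:

```{r}
qt(0.95, df = 49)   # one-tailed critical value (5% in one tail)
qt(0.975, df = 49)  # two-tailed critical value (2.5% in each tail)
```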
As we have seen, the process of hypothesis testing consists of various steps:
1. Formulate null and alternative hypotheses
2. Select an appropriate test
3. Choose the level of significance ($\alpha$)
4. Descriptive statistics and data visualization
5. Conduct significance test
6. Report results and draw a marketing conclusion
In the following, we will go through the individual steps using examples for different tests.
## One sample t-test
The example we used in the introduction was an example of the **one sample t-test** and we computed all statistics by hand to explain the underlying intuition. When you conduct hypothesis tests using R, you do not need to calculate these statistics by hand, since there are built-in routines to conduct the steps for you. Let us use the same example again to see how you would conduct hypothesis tests in R.
**1. Formulate null and alternative hypotheses**
The null hypothesis states that there is no difference between the true population mean $\mu$ and the hypothesized value (i.e., $10$), while the alternative hypothesis states the opposite:
$$
H_0: \mu = 10 \\
H_1: \mu \neq 10
$$
**2. Select an appropriate test**
Because we would like to test if the mean of a variable is different from a specified threshold, the one-sample t-test is appropriate. The assumptions of the test are 1) that the variable is measured using an interval or ratio scale, and 2) that the sampling distribution is normal. Both assumptions are met since 1) listening time is a ratio scale, and 2) we deem the sample size (n = 50) large enough to assume a normal sampling distribution according to the central limit theorem.
**3. Choose the level of significance**
We choose the conventional 5% significance level.
**4. Descriptive statistics and data visualization**
Provide descriptive statistics using the ```describe()``` function:
```{r, message = FALSE, warning=FALSE}
library(psych)
describe(music_listening_sample)
```
From this, we can already see that the mean is different from the hypothesized value. The question, however, remains whether this difference is statistically significant, given the sample size and the variability in the data. Since we only have one continuous variable, we can visualize the distribution in a histogram.
```{r}
ggplot(music_listening_sample) +
geom_histogram(aes(x = hours), fill = 'white', color = 'black', bins = 20) +
theme_bw() +
labs(title = "Distribution of values in the sample",x = "Hours", y = "Frequency")
```
**5. Conduct significance test**
At the beginning of the chapter, we saw how you could conduct a significance test by hand. However, R has built-in routines that you can use to conduct the analyses. The ```t.test()``` function can be used to conduct the test. To test if the mean listening time among WU students was 10 hours, you can use the following code:
```{r}
H_0 <- 10
t.test(music_listening_sample$hours, mu = H_0, alternative = 'two.sided')
```
Note that if you had stated a directional hypothesis (i.e., the mean is either greater or smaller than 10 hours), you could easily amend the code to conduct a one-sided test by changing the argument ```alternative``` from ```'two.sided'``` to either ```'less'``` or ```'greater'```.
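For example, to test the directional hypothesis that the mean listening time is greater than 10 hours, the call could be amended as follows (shown as an illustrative variant of the test above):

```{r, eval = FALSE}
t.test(music_listening_sample$hours, mu = H_0, alternative = 'greater')
```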
Note that you could also combine the results from the statistical test and the visualization using the `ggstatsplot` package as follows.
```{r, message = FALSE, warning=FALSE, fig.align="center"}
library(ggstatsplot)
gghistostats(
data = music_listening_sample, # dataframe from which variable is to be taken
x = hours, # numeric variable whose distribution is of interest
title = "Distribution of listening times", # title for the plot
caption = "Notes: Test based on a random sample of 50 students.",
type = "parametric", # one sample t-test
conf.level = 0.95, # changing confidence level for effect size
bar.measure = "mix", # what does the bar length denote
test.value = 10, # default value is 0
test.value.line = TRUE, # display a vertical line at test value
effsize.type = "d", # display effect size (Cohen's d in output)
test.value.color = "#0072B2", # color for the line for test value
centrality.para = "mean", # which measure of central tendency is to be plotted
centrality.color = "darkred", # decides color for central tendency line
binwidth = 2, # binwidth value (experiment)
messages = FALSE, # turn off the messages
bf.message = FALSE
)
```
You may notice some additional output in this plot related to the measure of effect size (Cohen's d). Don't worry about it at this stage; we will come back to this later in this chapter.
**6. Report results and draw a marketing conclusion**
Note that the results are the same as above, when we computed the test by hand. You could summarize the results as follows:
On average, the listening times in our sample were different from 10 hours per month (Mean = 18.59 hours, SE = 1.77). This difference was significant t(49) = 4.842, p < .05 (95% CI = [15.03; 22.16]). Based on this evidence, we can conclude that the mean in our sample is significantly higher than the hypothesized population mean of $10$ hours, providing evidence against the null hypothesis.
Note that in the reporting above, the number ```49``` in parentheses refers to the degrees of freedom that are available from the output.
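If you need the individual numbers for reporting, you can also store the result of ```t.test()``` and access its components directly; the elements shown below are standard components of the returned test object:

```{r}
t_test_result <- t.test(music_listening_sample$hours, mu = H_0, alternative = 'two.sided')
t_test_result$statistic  # t-value
t_test_result$parameter  # degrees of freedom
t_test_result$p.value    # p-value
t_test_result$conf.int   # 95% confidence interval
```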
## Comparing two means
In the one-sample test above, we tested the hypothesis that the population mean has some specific value $\mu_0$ using data from only one sample. In marketing (as in many other disciplines), you will often be confronted with a situation where you wish to compare the means of two groups. For example, you may conduct an experiment and randomly split your sample into two groups, one of which receives a treatment (experimental group) while the other doesn't (control group). In this case, the units (e.g., participants, products) in each group are different ('between-subjects design') and the samples are said to be independent. Hence, we would use an **independent-means t-test**. If you run an experiment with two experimental conditions and the same units (e.g., participants, products) were observed in both experimental conditions, the sample is said to be dependent in the sense that you have the same units in each group ('within-subjects design'). In this case, we would need to conduct a **dependent-means t-test**. Both tests are described in the following sections, beginning with the independent-means t-test.
### Independent-means t-test
<br>
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/7APeiQ3_46A" frameborder="0" allowfullscreen></iframe>
</div>
<br>
Using an independent-means t-test, we can compare the means of two possibly different populations. It is, for example, quite common for online companies to test new service features by running an experiment and randomly splitting their website visitors into two groups: one is exposed to the website with the new feature (experimental group) and the other group is not exposed to the new feature (control group). This is a typical A/B-Test scenario.
As an example, imagine that a music streaming service would like to introduce a new playlist feature that lets its users access playlists created by other users. The goal is to analyze how the new service feature impacts the listening time of users. The service randomly splits a representative subset of its users into two groups and collects data about their listening times over one month. Let's create a data set to simulate such a scenario.
```{r, eval = TRUE, echo = FALSE, message = FALSE, warning=FALSE}
set.seed(321)
hours_population_1 <- rgamma(25000, shape = 2, scale = 10)
set.seed(12567)
sample_1 <- sample(1:25000, size = 98, replace = FALSE)
sample_1_hours <- hours_population_1[sample_1]
sample_1_df <- data.frame(hours = round(sample_1_hours,0), group = "A")
set.seed(321)
hours_population_2 <- rgamma(25000, shape = 2.5, scale = 11)
set.seed(12567)
sample_2 <- sample(1:25000, size = 112, replace = FALSE)
sample_2_hours <- hours_population_2[sample_2]
sample_2_df <- data.frame(hours = round(sample_2_hours,0), group = "B")
hours_a_b <- rbind(sample_1_df,sample_2_df)
head(hours_a_b)
```
```{r, eval = TRUE, echo = TRUE, message = FALSE, warning=FALSE}
hours_a_b <- read.table("https://raw.githubusercontent.com/IMSMWU/Teaching/master/MRDA2017/hours_a_b.csv",
sep = ",",
header = TRUE)
head(hours_a_b)
```
This data set contains two variables: the variable ```hours``` indicates the music listening times (in hours) and the variable ```group``` indicates from which group the observation comes, where 'A' refers to the control group (with the standard service) and 'B' refers to the experimental group (with the new playlist feature). Let's first look at the descriptive statistics by group using the ```describeBy``` function:
```{r}
library(psych)
describeBy(hours_a_b$hours, hours_a_b$group)
```
From this, we can already see that there is a difference in means between groups A and B. We can also see that the number of observations is different, as is the standard deviation. The question that we would like to answer is whether there is a significant difference in mean listening times between the groups. Remember that different users are contained in each group ('between-subjects design') and that the observations in one group are independent of the observations in the other group. Before we see how you can easily conduct an independent-means t-test, let's go over some theory first.
#### Theory
As a starting point, let us label the unknown population mean of group A (control group) in our experiment $\mu_1$, and that of group B (experimental group) $\mu_2$. In this setting, the null hypothesis would state that the mean in group A is equal to the mean in group B:
$$
H_0: \mu_1=\mu_2
$$
This is equivalent to stating that the difference between the two groups ($\delta$) is zero:
$$
H_0: \mu_1 - \mu_2=0=\delta
$$
That is, $\delta$ is the new unknown population parameter, so that the null and alternative hypothesis become:
$$
H_0: \delta = 0 \\
H_1: \delta \ne 0
$$
Remember that we usually don't have access to the entire population, so that we cannot observe $\delta$ and have to estimate it from a sample statistic, which we define as $d = \bar x_1-\bar x_2$, i.e., the difference between the sample means from group A ($\bar x_1$) and group B ($\bar x_2$). But can we really estimate $\delta$ from $d$? Remember from the previous chapter that we could estimate $\mu$ from $\bar x$, because if we (hypothetically) take a larger number of samples, the distribution of the means of these samples (the sampling distribution) will be normally distributed and its mean will be (in the limit) equal to the population mean. It turns out that we can use the same underlying logic here. The above samples were drawn from two different populations with $\mu_1$ and $\mu_2$. Let us compute the difference in means between these two populations:
```{r}
delta_pop <- mean(hours_population_1)-mean(hours_population_2)
delta_pop
```
This means that the true difference between the mean listening times of groups A and B is `r round(delta_pop,2)`. Let us now repeat the exercise from the previous chapter: let us repeatedly ($20,000$ times) draw random samples of 100 users from each of these populations, compute the difference in sample means (i.e., $d$, our estimate of $\delta$), store the difference for each draw, and create a histogram of $d$.
```{r}
set.seed(321)
hours_population_1 <- rgamma(25000, shape = 2, scale = 10)
hours_population_2 <- rgamma(25000, shape = 2.5, scale = 11)
samples <- 20000
mean_delta <- matrix(NA, nrow = samples)
for (i in 1:samples){
student_sample <- sample(1:25000, size = 100, replace = FALSE)
mean_delta[i,] <- mean(hours_population_1[student_sample])-mean(hours_population_2[student_sample])
}
ggplot(data.frame(mean_delta)) +
geom_histogram(aes(x = mean_delta), bins = 30, fill='white', color='black') +
theme_bw() +
theme(legend.title = element_blank()) +
geom_vline(aes(xintercept = mean(mean_delta)), size=1) + xlab("d") +
ggtitle(TeX(sprintf("%d samples; $d_{\\bar{x}}$ = %.2f",samples, round(mean(mean_delta),2))))
```
This gives us the sampling distribution of the mean differences between the samples. You will notice that this distribution follows a normal distribution and is centered around the true difference between the populations. This means that, on average, the difference between two sample means $d$ is a good estimate of $\delta$. In our example, the difference between $\bar x_1$ and $\bar x_2$ is:
```{r}
mean_x1 <- mean(hours_a_b[hours_a_b$group=="A","hours"])
mean_x1
mean_x2 <- mean(hours_a_b[hours_a_b$group=="B","hours"])
mean_x2
d <- mean_x1-mean_x2
d
```
Now that we have $d$ as an estimate of $\delta$, how can we find out if the observed difference is significantly different from the null hypothesis (i.e., $\delta = 0$)?
Recall from the previous section that the standard deviation of the sampling distribution $\sigma_{\bar x}$ (i.e., the standard error) gives us an indication of the precision of our estimate. Further recall that the standard error can be calculated as $\sigma_{\bar x}={\sigma \over \sqrt{n}}$. So how can we calculate the standard error of the difference between two population means? According to the **variance sum law**, to find the variance of the sampling distribution of differences, we merely need to add together the variances of the sampling distributions of the two populations that we are comparing. To find the standard error, we only need to take the square root of the variance (because the standard error is the standard deviation of the sampling distribution and the standard deviation is the square root of the variance), so that we get:
$$
\sigma_{\bar x_1-\bar x_2} = \sqrt{{\sigma_1^2 \over n_1}+{\sigma_2^2 \over n_2}}
$$
But recall that we don't actually know the true population standard deviation, so we use $SE_{\bar x_1-\bar x_2}$ as an estimate of $\sigma_{\bar x_1-\bar x_2}$:
$$
SE_{\bar x_1-\bar x_2} = \sqrt{{s_1^2 \over n_1}+{s_2^2 \over n_2}}
$$
Hence, for our example, we can calculate the standard error as follows:
```{r}
n1 <- 98
n2 <- 112
s1 <- var(hours_a_b[hours_a_b$group=="A","hours"])
s1
s2 <- var(hours_a_b[hours_a_b$group=="B","hours"])
s2
SE_x1_x2 <- sqrt(s1/n1+s2/n2)
SE_x1_x2
```
Recall from above that we can calculate the t-statistic as:
$$
t= {\bar x - \mu_0 \over {s \over \sqrt{n}}}
$$
Exchanging $\bar x$ for $d$, we get
$$
t= {(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2) \over {\sqrt{{s_1^2 \over n_1}+{s_2^2 \over n_2}}}}
$$
Note that according to our hypothesis $\mu_1-\mu_2=0$, so that we can calculate the t-statistic as:
```{r}
t_score <- d/SE_x1_x2
t_score
```
Following the example of our one sample t-test above, we would now need to compare this calculated test statistic to a critical value in order to assess if $d$ is sufficiently far away from the null hypothesis to be statistically significant. To do this, we would need to know the exact t-distribution, which depends on the degrees of freedom. The problem is that deriving the degrees of freedom in this case is not that obvious. If we were willing to assume that $\sigma_1=\sigma_2$, the correct t-distribution has $n_1 - 1 + n_2 - 1$ degrees of freedom (i.e., the sum of the degrees of freedom of the two samples). However, because in real life we do not know if $\sigma_1=\sigma_2$, we need to account for this additional uncertainty. We will not go into detail here, but R automatically uses a sophisticated approach to correct the degrees of freedom called Welch's correction, as we will see in the subsequent application.
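If you were willing to impose the equal-variance assumption, you could set ```var.equal = TRUE``` in ```t.test()```, which uses the pooled variance estimate with $n_1 + n_2 - 2$ degrees of freedom; the default ```var.equal = FALSE``` applies Welch's correction. The following sketch contrasts the two variants using our data:

```{r, eval = FALSE}
t.test(hours ~ group, data = hours_a_b, var.equal = TRUE)   # pooled variance, df = n1 + n2 - 2
t.test(hours ~ group, data = hours_a_b, var.equal = FALSE)  # Welch's correction (default)
```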
#### Application
The section above explained the theory behind the independent-means t-test and showed how to compute the statistics manually. Obviously, you don't have to compute these statistics by hand; this section shows you how to conduct an independent-means t-test in R using the example from above.
**1. Formulate null and alternative hypotheses**
We wish to analyze whether there is a significant difference in music listening times between groups A and B. So our null hypothesis is that the means from the two populations are the same (i.e., there is no difference), while the alternative hypothesis states the opposite:
$$
H_0: \mu_1=\mu_2\\
H_1: \mu_1 \ne \mu_2
$$
**2. Select an appropriate test**
Since we have a ratio-scaled variable (i.e., listening times) and two independent groups, where the observations in one group are independent of the observations in the other group (i.e., the groups contain different units), the independent-means t-test is appropriate.
**3. Choose the level of significance**
We choose the conventional 5% significance level.
**4. Descriptive statistics and data visualization**
We can compute the descriptive statistics for each group separately, using the ```describeBy()``` function:
```{r}
library(psych)
describeBy(hours_a_b$hours, hours_a_b$group)
```
This already shows us that the means of groups A and B are different. We can visualize the data using a boxplot and a histogram.
```{r, message = FALSE, warning=FALSE}
ggplot(hours_a_b, aes(x = group, y = hours)) +
geom_boxplot() +
geom_jitter(alpha = 0.2, color = "red") +
labs(x = "Group", y = "Listening time (hours)") +
ggtitle("Boxplot of listening times") +
theme_bw()
ggplot(hours_a_b,aes(hours)) +
geom_histogram(col = "black", fill = "darkblue") +
labs(x = "Listening time (hours)", y = "Frequency") +
ggtitle("Histogram of listening times") +
facet_wrap(~group) +
theme_bw()
```
**5. Conduct significance test**
To conduct the independent means t-test, we can use the ```t.test()``` function:
```{r}
t.test(hours ~ group, data = hours_a_b, mu = 0, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
```
Again, we could combine the results of the statistical test and the visualization using the `ggstatsplot` package.
```{r, message = FALSE, warning=FALSE, fig.align="center"}
library(ggstatsplot)
ggbetweenstats(
data = hours_a_b,
plot.type = "box",
x = group, # 2 groups
y = hours ,
type = "p", # default
effsize.type = "d", # display effect size (Cohen's d in output)
messages = FALSE,
bf.message = FALSE,
mean.ci = TRUE,
title = "Mean listening times for different groups"
)
```
**6. Report results and draw a marketing conclusion**
The results showed that listening times were higher in the experimental group (Mean = 28.50, SE = 1.70) compared to the control group (Mean = 18.11, SE = 1.22). This means that the listening times were `r abs(round(d,2))` hours higher on average in the experimental group, compared to the control group. An independent-means t-test showed that this difference is significant t(195.73) = 4.96, p < .05 (95% CI = [6.26, 14.51]).
### Dependent-means t-test
<br>
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/vIcrWJ6sJu8" frameborder="0" allowfullscreen></iframe>
</div>
<br>
While the independent-means t-test is used when different units (e.g., participants, products) were assigned to the different conditions, the **dependent-means t-test** is used when there are two experimental conditions and the same units (e.g., participants, products) were observed in both experimental conditions.
Imagine, for example, a slightly different experimental setup for the above experiment. Imagine that we do not assign different users to the groups, but that a sample of 100 users gets to use the music streaming service with the new feature for one month and we compare the music listening times of these users during the month of the experiment with the listening time in the previous month. Let us generate data for this example:
```{r, eval = TRUE, echo = FALSE, message = FALSE, warning=FALSE}
set.seed(321)
hours_population_1 <- rgamma(25000, shape = 2, scale = 10)
set.seed(12567)
sample_1 <- sample(1:25000, size = 100, replace = FALSE)
sample_1_hours <- hours_population_1[sample_1]
set.seed(321)
hours_population_2 <- rgamma(25000, shape = 2.5, scale = 11)
set.seed(12567)
sample_2 <- sample(1:25000, size = 100, replace = FALSE)
sample_2_hours <- hours_population_2[sample_2]
hours_a_b_paired <- data.frame(hours_a = round(sample_1_hours,0),hours_b = round(sample_2_hours,0))
head(hours_a_b_paired)
```
```{r, eval = TRUE, echo = TRUE, message = FALSE, warning=FALSE}
hours_a_b_paired <- read.table("https://raw.githubusercontent.com/IMSMWU/Teaching/master/MRDA2017/hours_a_b_paired.csv",
sep = ",",
header = TRUE)
head(hours_a_b_paired)
```
Note that the data set has almost the same structure as before, only that we now have two variables representing the listening times of each user in the month before the experiment and during the month of the experiment when the new feature was tested.
#### Theory
In this case, we want to test the hypothesis that there is no difference in the mean listening times between the two months. This can be expressed as follows:
$$
H_0: \mu_D = 0 \\
$$
Note that the hypothesis only refers to one population, since both observations come from the same units (i.e., users). To use consistent notation, we replace $\mu_D$ with $\delta$ and get:
$$
H_0: \delta = 0 \\
H_1: \delta \neq 0
$$
where $\delta$ denotes the difference between the observed listening times from the two consecutive months **of the same users**. As in the previous example, since we do not observe the entire population, we estimate $\delta$ based on the sample using $d$, which is the difference in mean listening time between the two months for our sample. Note that we assume that everything else (e.g., number of new releases) remained constant over the two months to keep it simple. We can show as above that the sampling distribution follows a normal distribution with a mean that is (in the limit) the same as the population mean. This means, again, that the difference in sample means is a good estimate for the difference in population means. Let's compute a new variable $d$, which is the difference between the two months.
```{r}
hours_a_b_paired$d <- hours_a_b_paired$hours_a - hours_a_b_paired$hours_b
head(hours_a_b_paired)
```
Note that we now have a new variable, which is the difference in listening times (in hours) between the two months. The mean of this difference is:
```{r}
mean_d <- mean(hours_a_b_paired$d)
mean_d
```
Again, we use $SE_{\bar d}$ as an estimate of $\sigma_{\bar d}$, the standard error of the mean difference:
$$
SE_{\bar d}={s \over \sqrt{n}}
$$
Hence, we can compute the standard error as:
```{r}
n <- nrow(hours_a_b_paired)
SE_d <- sd(hours_a_b_paired$d)/sqrt(n)
SE_d
```
The test statistic is therefore:
$$
t = {\bar d- \mu_0 \over SE_{\bar d}}
$$
on 99 (i.e., n-1) degrees of freedom. Now we can compute the t-statistic as follows:
```{r}
t_score <- mean_d/SE_d
t_score
qt(0.975,df=99)
```
Note that in the case of the dependent-means t-test, we only base our hypothesis on one population and hence there is only one population variance. This is because in the dependent sample test, the observations come from the same observational units (i.e., users). Hence, there is no unsystematic variation due to potential differences between users that were assigned to the experimental groups. This means that the influence of unobserved factors (unsystematic variation) relative to the variation due to the experimental manipulation (systematic variation) is not as strong in the dependent-means test compared to the independent-means test and we don't need to correct for differences in the population variances.
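Another way to see this is that the dependent-means t-test is equivalent to a one-sample t-test on the difference variable $d$ that we computed above:

```{r}
t.test(hours_a_b_paired$d, mu = 0, alternative = "two.sided")
```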
#### Application
Again, we don't have to compute all this by hand since the ```t.test(...)``` function can be used to do it for us. Now we have to use the argument ```paired=TRUE``` to let R know that we are working with dependent observations.
**1. Formulate null and alternative hypotheses**
We would like to test whether there is a difference in music listening times between the two consecutive months. Our null hypothesis is that there is no difference, while the alternative hypothesis states the opposite:
$$
H_0: \mu_D = 0 \\
H_1: \mu_D \neq 0
$$
**2. Select an appropriate test**
Since we have a ratio scaled variable (i.e., listening times) and two observations of the same group of users (i.e., the groups contain the same units), the dependent-means t-test is appropriate.
**3. Choose the level of significance**
We choose the conventional 5% significance level.
**4. Descriptive statistics and data visualization**
We can compute the descriptive statistics for each month separately, using the ```describe()``` function:
```{r}
library(psych)
describe(hours_a_b_paired)
```
This already shows us that the means of the two months differ. We can visualize the data using a plot of means, a boxplot, and a histogram.
To plot the data, we need to do some restructuring first, since the variables are currently stored in two separate columns ("hours_a" and "hours_b"). This is also known as the "wide" format. To plot the data, we need all observations to be stored in one variable. This is also known as the "long" format. We can use the ```melt(...)``` function from the ```reshape2``` package to "melt" the two variables into one column.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "plot of means (dependent test)"}
library(reshape2)
hours_a_b_paired_long <- melt(hours_a_b_paired[, c("hours_a", "hours_b")])
names(hours_a_b_paired_long) <- c("group","hours")
head(hours_a_b_paired_long)
```
Now we are ready to plot the data:
```{r, message = FALSE, warning=FALSE}
ggplot(hours_a_b_paired_long, aes(x = group, y = hours)) +
geom_boxplot() +
geom_jitter(alpha = 0.2, color = "red") +
labs(x = "Group", y = "Listening time (hours)") +
ggtitle("Boxplot of listening times") +
theme_bw()
ggplot(hours_a_b_paired_long,aes(hours)) +
geom_histogram(col = "black", fill = "darkblue") +
labs(x = "Listening time (hours)", y = "Frequency") +
ggtitle("Histogram of listening times") +
facet_wrap(~group) +
theme_bw()
```
**5. Conduct significance test**
To conduct the dependent-means t-test, we can use the ```t.test()``` function with the argument ```paired = TRUE```:
```{r, message = FALSE, warning=FALSE}
t.test(hours_a_b_paired$hours_a, hours_a_b_paired$hours_b, mu = 0, alternative = "two.sided", conf.level = 0.95, paired=TRUE)
```
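Because the paired test is based on the within-user differences, we would obtain exactly the same result by running a one-sample t-test against 0 on the difference variable ```d``` that we created above:
```{r, message = FALSE, warning=FALSE}
# equivalent one-sample t-test on the within-user differences
t.test(hours_a_b_paired$d, mu = 0, alternative = "two.sided", conf.level = 0.95)
```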
Again, we could combine the results of the statistical test and the visualization using the `ggstatsplot` package.
```{r, message = FALSE, warning=FALSE, fig.align="center"}
library(ggstatsplot)
ggwithinstats(
data = hours_a_b_paired_long,
x = group,
y = hours,
path.point = FALSE,
path.mean = TRUE,
sort = "descending", # ordering groups along the x-axis based on
sort.fun = median, # values of `y` variable
title = "Mean listening times for different treatments",
messages = FALSE,
bf.message = FALSE,
mean.ci = TRUE,
effsize.type = "d" # display effect size (Cohen's d in output)
)
```
**6. Report results and draw a marketing conclusion**
On average, the same users used the service more when it included the new feature (M = 29.58, SE = 1.84) compared to the service without the feature (M = 17.93, SE = 1.21). This difference was significant, t(99) = 5.41, p < .05 (95% CI = [7.38, 15.91]).
### Further considerations
<br>
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/ctwQn6YYUBM" frameborder="0" allowfullscreen></iframe>
</div>
<br>
#### Type I and Type II Errors
When choosing the level of significance ($\alpha$), it is important to note that this choice affects both the type I and the type II error:
* Type I error: When we believe there is a genuine effect in our population, when in fact there isn't. Probability of type I error ($\alpha$) = level of significance.
* Type II error: When we believe that there is no effect in the population, when in fact there is.
This following table shows the possible outcomes of a test (retain vs. reject $H_0$), depending on whether $H_0$ is true or false in reality.
|                               | Retain <b>H<sub>0</sub></b>                                  | Reject <b>H<sub>0</sub></b>                   |
|-------------------------------|--------------------------------------------------------------|-----------------------------------------------|
| <b>H<sub>0</sub> is true</b>  | Correct decision:<br>1-α (probability of correct retention) | Type I error:<br>α (level of significance)   |
| <b>H<sub>0</sub> is false</b> | Type II error:<br>β (type II error rate)                     | Correct decision:<br>1-β (power of the test) |
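To build some intuition for what the type I error rate means, the following small simulation (the specific numbers are arbitrary and only serve as an illustration) repeatedly draws two samples from the *same* population, so that $H_0$ is true by construction, and records how often the t-test falsely rejects it. With $\alpha$ = 0.05, this should happen in roughly 5% of the replications.
```{r}
set.seed(123)
# simulate 1,000 experiments in which H0 is true (both groups are drawn
# from the same population) and record the p-value of each t-test
p_values <- replicate(1000, {
  group_a <- rnorm(100, mean = 20, sd = 10)
  group_b <- rnorm(100, mean = 20, sd = 10)
  t.test(group_a, group_b)$p.value
})
# share of false rejections; this should be close to alpha = 0.05
mean(p_values < 0.05)
```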
#### Significance level, sample size, power, and effect size
When you plan to conduct an experiment, there are some factors that are under direct control of the researcher:
* **Significance level ($\alpha$)**: The probability of finding an effect that does not genuinely exist.
* **Sample size (n)**: The number of observations in each group of the experimental design.
Unlike α and n, which are specified by the researcher, the magnitude of β depends on the actual value of the population parameter. In addition, β is influenced by the effect size (e.g., Cohen’s d), a standardized measure of the magnitude of an observed effect. The following parameters are affected more indirectly:
* **Power (1-β)**: The probability of finding an effect that genuinely exists.
* **Effect size (d)**: Standardized measure of the magnitude of the effect under the alternative hypothesis.
Although β is unknown, it is related to α. For example, if we want to be very sure that we do not falsely identify an effect that does not exist (i.e., make a type I error), the probability of identifying an effect that does exist (i.e., 1-β) decreases, and vice versa. Thus, an extremely low value of α (e.g., α = 0.0001) will result in intolerably high β errors. A common approach is to set α = 0.05 and to aim for a power (1-β) of 0.80 (i.e., β = 0.20).
Unlike the t-value of our test, the effect size (d) is unaffected by the sample size and can be categorized as follows (see Cohen, J. 1988):
* 0.2 (small effect)
* 0.5 (medium effect)
* 0.8 (large effect)
In order to test more subtle effects (smaller effect sizes), you need a larger sample size than for more obvious effects. In <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2205186" target="_blank">this paper</a>, you can find a list of examples of different effect sizes and the number of observations you need to reliably find an effect of that magnitude. Although the exact effect size is unknown before the experiment, you might be able to make an educated guess about it (e.g., based on previous studies).
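As an illustration of this relationship, the built-in ```power.t.test()``` function can be used to compute the required sample size per group for a given effect size, significance level, and power. The values below (α = 0.05, power = 0.80, and two of the effect sizes from Cohen's categorization) are only meant as an example; notice how much larger the sample needs to be to reliably detect the small effect.
```{r}
# with sd = 1, "delta" corresponds to the standardized effect size (Cohen's d)
power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.8, type = "two.sample") # small effect
power.t.test(delta = 0.8, sd = 1, sig.level = 0.05, power = 0.8, type = "two.sample") # large effect
```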
If you wish to obtain a standardized measure of the effect, you may compute the effect size (Cohen's d) using the ```cohensD()``` function from the ```lsr``` package. Using the examples from the independent-means t-test above, we would use: