output | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
|
::: {.infobox .download data-latex="{download}"} You can download the corresponding R-Code here :::
We test hypotheses because we are confined to taking samples – we rarely work with the entire population. In the previous chapter, we introduced the standard error (i.e., the standard deviation of a large number of hypothetical samples) as an estimate of how well a particular sample represents the population. We also saw how we can construct confidence intervals around the sample mean
Let us again begin by assuming we know the entire population using the example of music listening times among students from the previous example. As a reminder, the following plot shows the distribution of music listening times in the population of WU students.
library(tidyverse)
library(ggplot2)
library(latex2exp)
set.seed(321)
hours <- rgamma(n = 25000, shape = 2, scale = 10)
ggplot(data.frame(hours)) + geom_histogram(aes(x = hours),
bins = 30, fill = "white", color = "black") + geom_vline(xintercept = mean(hours),
size = 1) + theme_bw() + labs(title = "Histogram of listening times",
subtitle = TeX(sprintf("Population mean ($\\mu$) = %.2f; population standard deviation ($\\sigma$) = %.2f",
round(mean(hours), 2), round(sd(hours), 2))),
y = "Number of students", x = "Hours")
In this example, the population mean (
Let us assume that we were planning to take a random sample of 50 students from this population and our hypothesis was that the mean listening time is equal to some specific value
$$ H_0: \mu = \mu_0 $$ When conducting research, we are usually interested in providing evidence against the null hypothesis. If we then observe sufficient evidence against it and our estimate is said to be significant. If the null hypothesis is rejected, this is taken as support for the alternative hypothesis. The alternative hypothesis assumes that some difference exists, which can be expressed as follows:
$$ H_1: \mu \neq \mu_0 $$ Accepting the alternative hypothesis in turn will often lead to changes in opinions or actions. Note that while the null hypothesis may be rejected, it can never be accepted based on a single test. If we fail to reject the null hypothesis, it means that we simply haven't collected enough evidence against the null hypothesis to disprove it. In classical hypothesis testing, there is no way to determine whether the null hypothesis is true. Hypothesis testing provides a means to quantify to what extent the data from our sample is in line with the null hypothesis.
In order to quantify the concept of "sufficient evidence" we look at the theoretical distribution of the sample means given our null hypothesis and the sample standard error. Using the available information we can infer the sampling distribution for our null hypothesis. Recall that the standard deviation of the sampling distribution (i.e., the standard error of the mean) is given by
mean_pop <- mean(hours)
sigma <- sd(hours) #population standard deviation
n <- 50 #sample size
standard_error <- sigma/sqrt(n) #standard error
standard_error
## [1] 2.001639
Since we know from the central limit theorem that the sampling distribution is normal for large enough samples, we can now visualize the expected sampling distribution if our null hypothesis was in fact true (i.e., if the was no difference between the true population mean and the hypothesized mean of 10).
We also know that 95% of the probability is within 1.96 standard deviations from the mean. Values higher than that are rather unlikely, if our hypothesis about the population mean was indeed true. This is shown by the shaded area, also known as the "rejection region". To test our hypothesis that the population mean is equal to
set.seed(12567)
H_0 <- 10
student_sample <- sample(1:25000, size = 50, replace = FALSE)
music_listening_sample <- data.frame(hours = hours[student_sample])
mean_sample <- mean(music_listening_sample$hours)
ggplot(music_listening_sample) + geom_histogram(aes(x = hours),
fill = "white", color = "black", bins = 20) + theme_bw() +
geom_vline(xintercept = mean_sample, color = "black",
size = 1) + labs(title = TeX(sprintf("Distribution of values in the sample ($n =$ %.0f, $\\bar{x] = $ %.2f, s = %.2f)",
n, mean_sample, sd(music_listening_sample$hours))),
x = "Hours", y = "Frequency")
The mean listening time in the sample (black line)
Let's go back to the sampling distribution above. We know that 95% of all values will fall within 1.96 standard deviations from the mean. So if we could express the distance between our sample mean and the null hypothesis in terms of standard deviations, we could make statements about the probability of getting a sample mean of the observed magnitude (or more extreme values). Essentially, we would like to know how many standard deviations (
In this equation, z
will tell us how many standard deviations the sample mean z
gives us:
This standardized value (or "z-score") is also referred to as a test statistic. Let's compute the test statistic for our example above:
z_score <- (mean_sample - H_0)/(sigma/sqrt(n))
z_score
## [1] 4.292454
To make a decision on whether the difference can be deemed statistically significant, we now need to compare this calculated test statistic to a meaningful threshold. In order to do so, we need to decide on a significance level qnorm()
function as follows:
z_crit <- qnorm(0.975)
z_crit
## [1] 1.959964
We use 0.975
and not 0.95
since we are running a two-sided test and need to account for the rejection region at the other end of the distribution. Recall that for the normal distribution, 95% of the total probability falls within 1.96 standard deviations of the mean, so that higher (absolute) values provide evidence against the null hypothesis. Generally, we speak of a statistically significant effect if the (absolute) calculated test statistic is larger than the (absolute) critical value. We can easily check if this is the case in our example:
abs(z_score) > abs(z_crit)
## [1] TRUE
Since the absolute value of the calculated test statistic is larger than the critical value, we would reject
You may have noticed that the formula for the z-score above assumes that we know the true population standard deviation (
Here,
Notice that as
SE <- (sd(music_listening_sample$hours)/sqrt(n))
t_score <- (mean_sample - H_0)/SE
t_score
## [1] 4.84204
Notice that the value of the t-statistic is higher compared to the z-score (4.29). This can be attributed to the fact that by using the n - 1
degrees of freedom.
df = n - 1
t_crit <- qt(0.975, df = df)
t_crit
## [1] 2.009575
Again, we use 0.975
and not 0.95
since we are running a two-sided test and need to account for the rejection region at the other end of the distribution. Notice that the new critical value based on the t-distributionis larger, to reflect the uncertainty when estimating
abs(t_score) > abs(t_crit)
## [1] TRUE
The following graphics shows that the calculated test statistic (red line) falls into the rejection region so that in our example, we would reject the null hypothesis that the true population mean is equal to
Decision: Reject
Something to keep in mind here is the fact the test statistic is a function of the sample size. This, as
In the previous section, we computed the test statistic, which tells us how close our sample is to the null hypothesis. The p-value corresponds to the probability that the test statistic would take a value as extreme or more extreme than the one that we actually observed, assuming that the null hypothesis is true. It is important to note that this is a conditional probability: we compute the probability of observing a sample mean (or a more extreme value) conditional on the assumption that the null hypothesis is true. The pnorm()
function can be used to compute this probability. It is the cumulative probability distribution function of the `normal distribution. Cumulative probability means that the function returns the probability that the test statistic will take a value less than or equal to the calculated test statistic given the degrees of freedom. However, we are interested in obtaining the probability of observing a test statistic larger than or equal to the calculated test statistic under the null hypothesis (i.e., the p-value). Thus, we need to subtract the cumulative probability from 1. In addition, since we are running a two-sided test, we need to multiply the probability by 2 to account for the rejection region at the other side of the distribution.
p_value <- 2 * (1 - pt(abs(t_score), df = df))
p_value
## [1] 0.00001326885
This value corresponds to the probability of observing a mean equal to or larger than the one we obtained from our sample, if the null hypothesis was true. As you can see, this probability is very low. A small p-value signals that it is unlikely to observe the calculated test statistic under the null hypothesis. To decide whether or not to reject the null hypothesis, we would now compare this value to the level of significance (
Decision: Reject
For a given statistic calculated for a sample of observations (e.g., listening times), a 95% confidence interval can be constructed such that in 95% of samples, the true value of the true population mean will fall within its limits. If the parameter value specified in the null hypothesis (here
It is easy to compute this interval manually:
ci_lower <- (mean_sample) - qt(0.975, df = df) * SE
ci_upper <- (mean_sample) + qt(0.975, df = df) * SE
ci_lower
## [1] 15.02606
ci_upper
## [1] 22.15783
The interpretation of this interval is as follows: if we would (hypothetically) take 100 samples and calculated the mean and confidence interval for each of them, then the true population mean would be included in 95% of these intervals. The CI is informative when reporting the result of your test, since it provides an estimate of the uncertainty associated with the test result. From the test statistic or the p-value alone, it is not easy to judge in which range the true population parameter is located. The CI provides an estimate of this range.
Decision: Reject
To summarize, you can see that we arrive at the same conclusion (i.e., reject
The test statistic, as we have seen, measures how close the sample is to the null hypothesis and often follows a well-known distribution (e.g., normal, t, or chi-square). To select the correct test, various factors need to be taken into consideration. Some examples are:
- On what scale are your variables measured (categorical vs. continuous)?
- Do you want to test for relationships or differences?
- If you test for differences, how many groups would you like to test?
- For parametric tests, are the assumptions fulfilled?
The previous discussion used a one sample t-test as an example, which requires that variable is measured on an interval or ratio scale. If you are confronted with other settings, the following flow chart provides a rough guideline on selecting the correct test:
For a detailed overview over the different type of tests, please also refer to this overview by the UCLA.
A basic distinction can be made between parametric and non-parametric tests. Parametric tests require that variables are measured on an interval or ratio scale and that the sampling distribution follows a known distribution. Non-Parametric tests on the other hand do not require the sampling distribution to be normally distributed (a.k.a. "assumption free tests"). These tests may be used when the variable of interest is measured on an ordinal scale or when the parametric assumptions do not hold. They often rely on ranking the data instead of analyzing the actual scores. By ranking the data, information on the magnitude of differences is lost. Thus, parametric tests are more powerful if the sampling distribution is normally distributed. In this chapter, we will first focus on parametric tests and cover non-parametric tests later.
For some tests you may choose between a one-tailed test versus a two-tailed test. The choice depends on the hypothesis you specified, i.e., whether you specified a directional or a non-directional hypotheses. In the example above, we used a non-directional hypothesis. That is, we stated that the mean is different from the comparison value
Similarly, we could test whether the population mean is larger than a comparison value:
Connected to the decision of how to phrase the hypotheses (directional vs. non-directional) is the choice of a one-tailed test versus a two-tailed test. Let's first think about the meaning of a one-tailed test. Using a significance level of 0.05, a one-tailed test means that 5% of the total area under the probability distribution of our test statistic is located in one tail. Thus, under a one-tailed test, we test for the possibility of the relationship in one direction only, disregarding the possibility of a relationship in the other direction. In our example, a one-tailed test could test either if the mean listening time is significantly larger or smaller compared to the control condition, but not both. Depending on the direction, the mean listening time is significantly larger (smaller) if the test statistic is located in the top (bottom) 5% of its probability distribution.
The following graph shows the critical values that our test statistic would need to surpass so that the difference between the population mean and the comparison value would be deemed statistically significant.
It can be seen that under a one-sided test, the rejection region is at one end of the distribution or the other. In a two-sided test, the rejection region is split between the two tails. As a consequence, the critical value of the test statistic is smaller using a one-tailed test, meaning that it has more power to detect an effect. Having said that, in most applications, we would like to be able catch effects in both directions, simply because we can often not rule out that an effect might exist that is not in the hypothesized direction. For example, if we would conduct a one-tailed test for a mean larger than some specified value but the mean turns out to be substantially smaller, then testing a one-directional hypothesis (
As we have seen, the process of hypothesis testing consists of various steps:
- Formulate null and alternative hypotheses
- Select an appropriate test
- Choose the level of significance (
$\alpha$ ) - Descriptive statistics and data visualization
- Conduct significance test
- Report results and draw a marketing conclusion
In the following, we will go through the individual steps using examples for different tests.
The example we used in the introduction was an example of the one sample t-test and we computed all statistics by hand to explain the underlying intuition. When you conduct hypothesis tests using R, you do not need to calculate these statistics by hand, since there are build-in routines to conduct the steps for you. Let us use the same example again to see how you would conduct hypothesis tests in R.
1. Formulate null and alternative hypotheses
The null hypothesis states that there is no difference between the true population mean
2. Select an appropriate test
Because we would like to test if the mean of a variable is different from a specified threshold, the one-sample t-test is appropriate. The assumptions of the test are 1) that the variable is measured using an interval or ratio scale, and 2) that the sampling distribution is normal. Both assumptions are met since 1) listening time is a ratio scale, and 2) we deem the sample size (n = 50) large enough to assume a normal sampling distribution according to the central limit theorem.
3. Choose the level of significance
We choose the conventional 5% significance level.
4. Descriptive statistics and data visualization
Provide descriptive statistics using the describe()
function:
library(psych)
describe(student_sample)
From this, we can already see that the mean is different from the hypothesized value. The question however remains, whether this difference is significantly different, given the sample size and the variability in the data. Since we only have one continuous variable, we can visualize the distribution in a histogram.
ggplot(music_listening_sample) + geom_histogram(aes(x = hours),
fill = "white", color = "black", bins = 20) + theme_bw() +
labs(title = "Distribution of values in the sample",
x = "Hours", y = "Frequency")
5. Conduct significance test
In the beginning of the chapter, we saw, how you could conduct significance test by hand. However, R has built-in routines that you can use to conduct the analyses. The t.test()
function can be used to conduct the test. To test if the listening time among WU students was 10, you can use the following code:
H_0 <- 10
t.test(music_listening_sample$hours, mu = H_0, alternative = "two.sided")
##
## One Sample t-test
##
## data: music_listening_sample$hours
## t = 4.842, df = 49, p-value = 0.00001327
## alternative hypothesis: true mean is not equal to 10
## 95 percent confidence interval:
## 15.02606 22.15783
## sample estimates:
## mean of x
## 18.59194
Note that if you would have stated a directional hypothesis (i.e., the mean is either greater or smaller than 10 hours), you could easily amend the code to conduct a one sided test by changing the argument alternative
from 'two.sided'
to either 'less'
or 'greater'
.
Note that you could also combine the results from the statistical test and the visualization using the ggstatsplot
package as follows.
library(ggstatsplot)
gghistostats(
data = music_listening_sample, # dataframe from which variable is to be taken
x = hours, # numeric variable whose distribution is of interest
title = "Distribution of listening times", # title for the plot
caption = "Notes: Test based on a random sample of 50 students.",
type = "parametric", # one sample t-test
conf.level = 0.95, # changing confidence level for effect size
bar.measure = "mix", # what does the bar length denote
test.value = 10, # default value is 0
test.value.line = TRUE, # display a vertical line at test value
effsize.type = "d", # display effect size (Cohen's d in output)
test.value.color = "#0072B2", # color for the line for test value
centrality.para = "mean", # which measure of central tendency is to be plotted
centrality.color = "darkred", # decides color for central tendency line
binwidth = 2, # binwidth value (experiment)
messages = FALSE, # turn off the messages
bf.message = FALSE
)
You may nice some additional output in this plot related to the measure of effect size (Cohen's d). Don't worry about it at this stage, we will come back to this later in this chapter.
6. Report results and draw a marketing conclusion
Note that the results are the same as above, when we computed the test by hand. You could summarize the results as follows:
On average, the listening times in our sample were different form 10 hours per month (Mean = 18.59 hours, SE = 1.77). This difference was significant t(49) = 4.842, p < .05 (95% CI = [15.03; 22.16]). Based on this evidence, we can conclude that the mean in our sample is significantly lower compared to the hypothesized population mean of
Note that in the reporting above, the number 49
in parenthesis refers to the degrees of freedom that are available from the output.
In the one-sample test above, we tested the hypothesis that the population mean has some specific value
Using an independent-means t-test, we can compare the means of two possibly different populations. It is, for example, quite common for online companies to test new service features by running an experiment and randomly splitting their website visitors into two groups: one is exposed to the website with the new feature (experimental group) and the other group is not exposed to the new feature (control group). This is a typical A/B-Test scenario.
As an example, imagine that a music streaming service would like to introduce a new playlist feature that let's their users access playlists created by other users. The goal is to analyze how the new service feature impacts the listening time of users. The service randomly splits a representative subset of their users into two groups and collects data about their listening times over one month. Let's create a data set to simulate such a scenario.
hours_a_b <- read.table("https://raw.githubusercontent.com/IMSMWU/Teaching/master/MRDA2017/hours_a_b.csv",
sep = ",", header = TRUE)
head(hours_a_b)
This data set contains two variables: the variable hours
indicates the music listening times (in hours) and the variable group
indicates from which group the observation comes, where 'A' refers to the control group (with the standard service) and 'B' refers to the experimental group (with the new playlist feature). Let's first look at the descriptive statistics by group using the describeBy
function:
library(psych)
describeBy(hours_a_b$hours, hours_a_b$group)
##
## Descriptive statistics by group
## group: A
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 98 18.11 12.1 15 16.88 10.38 2 65 63 1.08 1.21 1.22
## ------------------------------------------------------------
## group: B
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 112 28.5 17.97 24.5 26.56 15.57 1 83 82 0.96 0.82 1.7
From this, we can already see that there is a difference in means between groups A and B. We can also see that the number of observations is different, as is the standard deviation. The question that we would like to answer is whether there is a significant difference in mean listening times between the groups. Remember that different users are contained in each group ('between-subjects design') and that the observations in one group are independent of the observations in the other group. Before we will see how you can easily conduct an independent-means t-test, let's go over some theory first.
As a starting point, let us label the unknown population mean of group A (control group) in our experiment
This is equivalent to stating that the difference between the two groups (
That is,
Remember that we usually don't have access to the entire population so that we can not observe
delta_pop <- mean(hours_population_1) - mean(hours_population_2)
delta_pop
## [1] -7.422855
This means that the true difference between the mean listening times of groups a and b is -7.42. Let us now repeat the exercise from the previous chapter: let us repeatedly draw a large number of
set.seed(321)
hours_population_1 <- rgamma(25000, shape = 2, scale = 10)
hours_population_2 <- rgamma(25000, shape = 2.5, scale = 11)
samples <- 20000
mean_delta <- matrix(NA, nrow = samples)
for (i in 1:samples) {
student_sample <- sample(1:25000, size = 100, replace = FALSE)
mean_delta[i, ] <- mean(hours_population_1[student_sample]) -
mean(hours_population_2[student_sample])
}
ggplot(data.frame(mean_delta)) + geom_histogram(aes(x = mean_delta),
bins = 30, fill = "white", color = "black") + theme_bw() +
theme(legend.title = element_blank()) + geom_vline(aes(xintercept = mean(mean_delta)),
size = 1) + xlab("d") + ggtitle(TeX(sprintf("%d samples; $d_{\\bar{x}}$ = %.2f",
samples, round(mean(mean_delta), 2))))
This gives us the sampling distribution of the mean differences between the samples. You will notice that this distribution follows a normal distribution and is centered around the true difference between the populations. This means that, on average, the difference between two sample means
mean_x1 <- mean(hours_a_b[hours_a_b$group == "A", "hours"])
mean_x1
## [1] 18.11224
mean_x2 <- mean(hours_a_b[hours_a_b$group == "B", "hours"])
mean_x2
## [1] 28.5
d <- mean_x1 - mean_x2
d
## [1] -10.38776
Now that we have
Recall from the previous section, that the standard deviation of the sampling distribution
But recall that we don't actually know the true population standard deviation, so we use
Hence, for our example, we can calculate the standard error as follows:
n1 <- 98
n2 <- 112
s1 <- var(hours_a_b[hours_a_b$group == "A", "hours"])
s1
## [1] 146.4924
s2 <- var(hours_a_b[hours_a_b$group == "B", "hours"])
s2
## [1] 322.9189
SE_x1_x2 <- sqrt(s1/n1 + s2/n2)
SE_x1_x2
## [1] 2.092373
Recall from above that we can calculate the t-statistic as:
Exchanging
Note that according to our hypothesis
t_score <- d/SE_x1_x2
t_score
## [1] -4.964581
Following the example of our one sample t-test above, we would now need to compare this calculated test statistic to a critical value in order to assess if
The section above explained the theory behind the independent-means t-test and showed how to compute the statistics manually. Obviously you don't have to compute these statistics by hand in this section shows you how to conduct an independent-means t-test in R using the example from above.
1. Formulate null and alternative hypotheses
We wish to analyze whether there is a significant difference in music listening times between groups A and B. So our null hypothesis is that the means from the two populations are the same (i.e., there is no difference), while the alternative hypothesis states the opposite:
2. Select an appropriate test
Since we have a ratio scaled variable (i.e., listening times) and two independent groups, where the mean of one sample is independent of the group of the second sample (i.e., the groups contain different units), the independent-means t-test is appropriate.
3. Choose the level of significance
We choose the conventional 5% significance level.
4. Descriptive statistics and data visualization
We can compute the descriptive statistics for each group separately, using the describeBy()
function:
library(psych)
describeBy(hours_a_b$hours, hours_a_b$group)
##
## Descriptive statistics by group
## group: A
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 98 18.11 12.1 15 16.88 10.38 2 65 63 1.08 1.21 1.22
## ------------------------------------------------------------
## group: B
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 112 28.5 17.97 24.5 26.56 15.57 1 83 82 0.96 0.82 1.7
This already shows us that the mean between groups A and B are different. We can visualize the data using a boxplot and a histogram.
ggplot(hours_a_b, aes(x = group, y = hours)) + geom_boxplot() +
geom_jitter(alpha = 0.2, color = "red") + labs(x = "Group",
y = "Listening time (hours)") + ggtitle("Boxplot of listening times") +
theme_bw()
ggplot(hours_a_b, aes(hours)) + geom_histogram(col = "black",
fill = "darkblue") + labs(x = "Listening time (hours)",
y = "Frequency") + ggtitle("Histogram of listening times") +
facet_wrap(~group) + theme_bw()
5. Conduct significance test
To conduct the independent means t-test, we can use the t.test()
function:
t.test(hours ~ group, data = hours_a_b, mu = 0, alternative = "two.sided",
conf.level = 0.95, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: hours by group
## t = -4.9646, df = 195.73, p-value = 0.000001494
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
## -14.514246 -6.261264
## sample estimates:
## mean in group A mean in group B
## 18.11224 28.50000
Again, we could combine the results of the statistical test and the visualization using the ggstatsplot
package.
library(ggstatsplot)
ggbetweenstats(
data = hours_a_b,
plot.type = "box",
x = group, # 2 groups
y = hours ,
type = "p", # default
effsize.type = "d", # display effect size (Cohen's d in output)
messages = FALSE,
bf.message = FALSE,
mean.ci = TRUE,
title = "Mean listening times for different groups"
)
6. Report results and draw a marketing conclusion
The results showed that listening times were higher in the experimental group (Mean = 28.50, SE = 1.70) compared to the control group (Mean = 18.11, SE = 1.22). This means that the listening times were 10.39 hours higher on average in the experimental group, compared to the control group. An independent-means t-test showed that this difference is significant t(195.73) = 4.96, p < .05 (95% CI = [6.26, 14.51]).
While the independent-means t-test is used when different units (e.g., participants, products) were assigned to the different condition, the dependent-means t-test is used when there are two experimental conditions and the same units (e.g., participants, products) were observed in both experimental conditions.
Imagine, for example, a slightly different experimental setup for the above experiment. Imagine that we do not assign different users to the groups, but that a sample of 100 users gets to use the music streaming service with the new feature for one month and we compare the music listening times of these users during the month of the experiment with the listening time in the previous month. Let us generate data for this example:
hours_a_b_paired <- read.table("https://raw.githubusercontent.com/IMSMWU/Teaching/master/MRDA2017/hours_a_b_paired.csv",
sep = ",", header = TRUE)
head(hours_a_b_paired)
Note that the data set has almost the same structure as before only that we know have two variables representing the listening times of each user in the month before the experiment and during the month of the experiment when the new feature was tested.
In this case, we want to test the hypothesis that there is no difference in mean the mean listening times between the two months. This can be expressed as follows:
$$
H_0: \mu_D = 0 \
$$
Note that the hypothesis only refers to one population, since both observations come from the same units (i.e., users). To use consistent notation, we replace
where
hours_a_b_paired$d <- hours_a_b_paired$hours_a - hours_a_b_paired$hours_b
head(hours_a_b_paired)
Note that we now have a new variable, which is the difference in listening times (in hours) between the two months. The mean of this difference is:
mean_d <- mean(hours_a_b_paired$d)
mean_d
## [1] -11.65
Again, we use
$$ SE_{\bar d}={s \over \sqrt{n}} $$ Hence, we can compute the standard error as:
n <- nrow(hours_a_b_paired)
SE_d <- sd(hours_a_b_paired$d)/sqrt(n)
SE_d
## [1] 2.151503
The test statistic is therefore:
$$ t = {\bar d- \mu_0 \over SE_{\bar d}} $$ on 99 (i.e., n-1) degrees of freedom. Now we can compute the t-statistic as follows:
t_score <- mean_d/SE_d
t_score
## [1] -5.41482
qt(0.975, df = 99)
## [1] 1.984217
Note that in the case of the dependent-means t-test, we only base our hypothesis on one population and hence there is only one population variance. This is because in the dependent sample test, the observations come from the same observational units (i.e., users). Hence, there is no unsystematic variation due to potential differences between users that were assigned to the experimental groups. This means that the influence of unobserved factors (unsystematic variation) relative to the variation due to the experimental manipulation (systematic variation) is not as strong in the dependent-means test compared to the independent-means test and we don't need to correct for differences in the population variances.
Again, we don't have to compute all this by hand since the t.test(...)
function can be used to do it for us. Now we have to use the argument paired=TRUE
to let R know that we are working with dependent observations.
1. Formulate null and alternative hypotheses
We would like to the test if there is a difference in music listening times between the two consecutive months, so our null hypothesis is that there is no difference, while the alternative hypothesis states the opposite:
2. Select an appropriate test
Since we have a ratio scaled variable (i.e., listening times) and two observations of the same group of users (i.e., the groups contain the same units), the dependent-means t-test is appropriate.
3. Choose the level of significance
We choose the conventional 5% significance level.
4. Descriptive statistics and data visualization
We can compute the descriptive statistics for each month separately, using the describe()
function:
library(psych)
describe(hours_a_b_paired)
This already shows us that the mean between the two months are different. We can visiualize the data using a plot of means, boxplot, and a histogram.
To plot the data, we need to do some restructuring first, since the variables are now stored in two different columns ("hours_a" and "hours_b"). This is also known as the "wide" format. To plot the data we need all observations to be stored in one variable. This is also known as the "long" format. We can use the melt(...)
function from the reshape2
package to "melt" the two variable into one column to plot the data.
library(reshape2)
hours_a_b_paired_long <- melt(hours_a_b_paired[, c("hours_a",
"hours_b")])
names(hours_a_b_paired_long) <- c("group", "hours")
head(hours_a_b_paired_long)
Now we are ready to plot the data:
ggplot(hours_a_b_paired_long, aes(x = group, y = hours)) +
geom_boxplot() + geom_jitter(alpha = 0.2, color = "red") +
labs(x = "Group", y = "Listening time (hours)") +
ggtitle("Boxplot of listening times") + theme_bw()
ggplot(hours_a_b_paired_long, aes(hours)) + geom_histogram(col = "black",
fill = "darkblue") + labs(x = "Listening time (hours)",
y = "Frequency") + ggtitle("Histogram of listening times") +
facet_wrap(~group) + theme_bw()
5. Conduct significance test
To conduct the independent means t-test, we can use the t.test()
function with the argument paired = TRUE
:
t.test(hours_a_b_paired$hours_a, hours_a_b_paired$hours_b,
mu = 0, alternative = "two.sided", conf.level = 0.95,
paired = TRUE)
##
## Paired t-test
##
## data: hours_a_b_paired$hours_a and hours_a_b_paired$hours_b
## t = -5.4148, df = 99, p-value = 0.00000043
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -15.919048 -7.380952
## sample estimates:
## mean of the differences
## -11.65
Again, we could combine the results of the statistical test and the visualization using the ggstatsplot
package.
library(ggstatsplot)
ggwithinstats(
data = hours_a_b_paired_long,
x = group,
y = hours,
path.point = FALSE,
path.mean = TRUE,
sort = "descending", # ordering groups along the x-axis based on
sort.fun = median, # values of `y` variable
title = "Mean listening times for different treatments",
messages = FALSE,
bf.message = FALSE,
mean.ci = TRUE,
effsize.type = "d" # display effect size (Cohen's d in output)
)
6. Report results and draw a marketing conclusion
On average, the same users used the service more when it included the new feature (M = 29.58, SE = 1.84) compared to the service without the feature (M = 17.93, SE = 1.21). This difference was significant t(99) = 5.41, p < .05 (95% CI = [7.38, 15.91]).
When choosing the level of significance (
- Type I error: When we believe there is a genuine effect in our population, when in fact there isn't. Probability of type I error (
$\alpha$ ) = level of significance. - Type II error: When we believe that there is no effect in the population, when in fact there is.
This following table shows the possible outcomes of a test (retain vs. reject
Retain H0 | Reject H0 | |
---|---|---|
H0 is true | Correct decision: 1-α (probability of correct retention); |
Type 1 error: α (level of significance) |
H0 is false | Type 2 error: β (type 2 error rate) |
Correct decision: 1-β (power of the test) |
When you plan to conduct an experiment, there are some factors that are under direct control of the researcher:
-
Significance level (
$\alpha$ ): The probability of finding an effect that does not genuinely exist. - Sample size (n): The number of observations in each group of the experimental design.
Unlike α and n, which are specified by the researcher, the magnitude of β depends on the actual value of the population parameter. In addition, β is influenced by the effect size (e.g., Cohen’s d), which can be used to determine a standardized measure of the magnitude of an observed effect. The following parameters are affected more indirectly:
- Power (1-β): The probability of finding an effect that does genuinely exists.
- Effect size (d): Standardized measure of the effect size under the alternate hypothesis.
Although β is unknown, it is related to α. For example, if we would like to be absolutely sure that we do not falsely identify an effect which does not exist (i.e., make a type I error), this means that the probability of identifying an effect that does exist (i.e., 1-β) decreases and vice versa. Thus, an extremely low value of α (e.g., α = 0.0001) will result in intolerably high β errors. A common approach is to set α=0.05 and β=0.80.
Unlike the t-value of our test, the effect size (d) is unaffected by the sample size and can be categorized as follows (see Cohen, J. 1988):
- 0.2 (small effect)
- 0.5 (medium effect)
- 0.8 (large effect)
In order to test more subtle effects (smaller effect sizes), you need a larger sample size compared to the test of more obvious effects. In this paper, you can find a list of examples for different effect sizes and the number of observations you need to reliably find an effect of that magnitude. Although the exact effect size is unknown before the experiment, you might be able to make a guess about the effect size (e.g., based on previous studies).
If you wish to obtain a standardized measure of the effect, you may compute the effect size (Cohen's d) using the cohensD()
function from the lsr
package. Using the examples from the independent-means t-test above, we would use:
library(lsr)
cohensD(hours ~ group, data = hours_a_b)
## [1] 0.6696301
According to the thresholds defined above, this effect would be judged to be a small-medium effect.
For the dependent-means t-test, we would use:
cohensD(hours_a_b_paired$hours_a, hours_a_b_paired$hours_b,
method = "paired")
## [1] 0.541482
According to the thresholds defined above, this effect would also be judged to be a small-medium effect.
When constructing an experimental design, your goal should be to maximize the power of the test while maintaining an acceptable significance level and keeping the sample as small as possible. To achieve this goal, you may use the pwr
package, which let's you compute n
, d
, alpha
, and power
. You only need to specify three of the four input variables to get the fourth.
For example, what sample size do we need (per group) to identify an effect with d = 0.6, α = 0.05, and power = 0.8:
library(pwr)
pwr.t.test(d = 0.6, sig.level = 0.05, power = 0.8,
type = c("two.sample"), alternative = c("two.sided"))
##
## Two-sample t test power calculation
##
## n = 44.58577
## d = 0.6
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
Or we could ask, what is the power of our test with 51 observations in each group, d = 0.6, and α = 0.05:
pwr.t.test(n = 51, d = 0.6, sig.level = 0.05, type = c("two.sample"),
alternative = c("two.sided"))
##
## Two-sample t test power calculation
##
## n = 51
## d = 0.6
## sig.level = 0.05
## power = 0.850985
## alternative = two.sided
##
## NOTE: n is number in *each* group
From my experience, students tend to place a lot of weight on p-values when interpreting their research findings. It is therefore important to note some points that hopefully help to put the meaning of a "significant" vs. "insignificant" test result into perspective. So what does a significant test result actually tell us?
- The importance of an effect? → No, significance depends on sample size.
- That the null hypothesis is false? → No, it is always false.
- That the null hypothesis is true? → No, it is never true.
It is important to understand what the p-value actually tells you.
::: {.infobox_orange .hint data-latex="{hint}"}
A p-value of < 0.05 means that the probability of finding a difference of at least the observed magnitude is less than 5% if the null hypothesis was true. In other words, if there really wouldn't be a difference between the groups, it tells you the probability of observing the difference that you found in your data (or more extreme differences).
:::
The following points provide some guidance on how to interpret significant and insignificant test results.
Significant result
- Even if the probability of the effect being a chance result is small (e.g., less than .05) it doesn't necessarily mean that the effect is important.
- Very small and unimportant effects can turn out to be statistically significant if the sample size is large enough.
Insignificant result
- If the probability of the effect occurring by chance is large (greater than .05), the alternative hypothesis is rejected. However, this does not mean that the null hypothesis is true.
- Although an effect might not be large enough to be anything other than a chance finding, it doesn't mean that the effect is zero.
- In fact, two random samples will always have slightly different means that would deemed to be statistically significant if the samples were large enough.
Thus, you should not base your research conclusion on p-values alone!
It is also crucial to determine the sample size before you run the experiment or before you start your analysis. Why? Consider the following example:
- You run an experiment
- After each respondent you analyze the data and look at the mean difference between the two groups with a t-test
- You stop when you have a significant effect
This is called p-hacking and should be avoided at all costs. Assuming that both groups come from the same population (i.e., there is no difference in the means): What is the likelihood that the result will be significant at some point? In other words, what is the likelihood that you will draw the wrong conclusion from your data that there is an effect, while there is none? This is shown in the following graph using simulated data - the color red indicates significant test results that arise although there is no effect (i.e., false positives).