---
output:
html_document:
toc: yes
pdf_document:
toc: yes
html_notebook: default
---
```{r echo=FALSE, eval=TRUE, message=FALSE, warning=FALSE}
library(knitr)
options(scipen = 999)
#This code automatically tidies code so that it does not reach over the page
opts_chunk$set(tidy.opts=list(width.cutoff=50),tidy=TRUE, rownames.print = FALSE, rows.print = 10)
opts_chunk$set(cache=T)
```
```{r message=FALSE, warning=FALSE, echo=F, eval=TRUE,paged.print = FALSE}
options(digits = 8)
```
# Regression
::: {.infobox .download data-latex="{download}"}
[You can download the corresponding R-Code here](./Code/10-regression.R)
:::
## Correlation
<br>
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/HeIW9_ijHF4" frameborder="0" allowfullscreen></iframe>
</div>
<br>
Before we start with regression analysis, we will review the basic concept of correlation first. Correlation helps us to determine the degree to which the variation in one variable, X, is related to the variation in another variable, Y.
### Correlation coefficient
The correlation coefficient summarizes the strength of the linear relationship between two metric (interval or ratio scaled) variables. Let's consider a simple example. Say you conduct a survey to investigate the relationship between the attitude towards a city and the duration of residency. The "Attitude" variable can take values between 1 (very unfavorable) and 12 (very favorable), and the "duration of residency" is measured in years. Let's further assume for this example that the attitude measurement represents an interval scale (although it is usually not realistic to assume that the scale points on an itemized rating scale have the same distance). To keep it simple, let's further assume that you only asked 12 people. We can create a short data set like this:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE,paged.print = FALSE}
library(psych)
attitude <- c(6,9,8,3,10,4,5,2,11,9,10,2)
duration <- c(10,12,12,4,12,6,8,2,18,9,17,2)
att_data <- data.frame(attitude, duration)
att_data <- att_data[order(-attitude), ]
att_data$respodentID <- c(1:12)
str(att_data)
psych::describe(att_data[, c("attitude","duration")])
att_data
```
Let's look at the data first. The following graph shows the individual data points for the "duration of residency" variable, where the y-axis shows the duration of residency in years and the x-axis shows the respondent ID. The blue horizontal line represents the mean of the variable (`r round(mean(att_data$duration),2)`) and the vertical lines show the distance of the individual data points from the mean.
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.cap = "Scores for duration of residency variable"}
library(ggplot2)
h <- round(mean(att_data$duration), 2)
ggplot(att_data, aes(x = respodentID, y = duration)) +
geom_point(size = 3, color = "deepskyblue4") +
scale_x_continuous(breaks = 1:12) +
geom_hline(data = att_data, aes(yintercept = mean(duration)), color ="deepskyblue4") +
labs(x = "Observations",y = "Duration of residency", size = 11) +
coord_cartesian(ylim = c(0, 18)) +
geom_segment(aes(x = respodentID,y = duration, xend = respodentID,
yend = mean(duration)), color = "deepskyblue4", size = 1) +
theme(axis.title = element_text(size = 16),
axis.text = element_text(size=16),
strip.text.x = element_text(size = 16),
legend.position="none") +
theme_bw()
```
You can see that some respondents have been living in the city longer than average and others for a shorter time than average. Let's do the same for the second variable ("Attitude"). Again, the y-axis shows the observed scores for this variable and the x-axis shows the respondent ID.
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.cap = "Scores for attitude variable"}
ggplot(att_data, aes(x = respodentID, y = attitude)) +
geom_point(size = 3, color = "#f9756d") +
scale_x_continuous(breaks = 1:12) +
geom_hline(data = att_data, aes(yintercept = mean(attitude)), color = "#f9756d") +
labs(x = "Observations",y = "Attitude", size = 11) +
coord_cartesian(ylim = c(0,18)) +
geom_segment(aes(x = respodentID, y = attitude, xend = respodentID,
yend = mean(attitude)), color = "#f9756d", size = 1) +
theme_bw()
```
Again, we can see that some respondents have an above average attitude towards the city (more favorable) and some respondents have a below average attitude towards the city. Let's combine both variables in one graph now to see if there is some co-movement:
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE,fig.align="center", fig.cap = "Scores for attitude and duration of residency variables"}
ggplot(att_data) +
geom_point(size = 3, aes(respodentID, attitude), color = "#f9756d") +
geom_point(size = 3, aes(respodentID, duration), color = "deepskyblue4") +
scale_x_continuous(breaks = 1:12) +
geom_hline(data = att_data, aes(yintercept = mean(duration)), color = "deepskyblue4") +
geom_hline(data = att_data, aes(yintercept = mean(attitude)), color = "#f9756d") +
labs(x = "Observations", y = "Duration/Attitude", size = 11) +
coord_cartesian(ylim = c(0, 18)) +
scale_color_manual(values = c("#f9756d", "deepskyblue4")) +
theme_bw()
```
We can see that there is indeed some co-movement here. The variables <b>covary</b> because respondents who have an above (below) average attitude towards the city also appear to have been living in the city for an above (below) average amount of time and vice versa. Correlation helps us to quantify this relationship. Before you proceed to compute the correlation coefficient, you should first look at the data. We usually use a scatterplot to visualize the relationship between two metric variables:
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.cap = "Scatterplot for duration and attitude variables"}
ggplot(att_data) +
geom_point(size = 3, aes(duration, attitude)) +
labs(x = "Duration",y = "Attitude", size = 11) +
theme_bw()
```
How can we compute the correlation coefficient? Remember that the variance measures the average deviation from the mean of a variable:
\begin{equation}
\begin{split}
s_x^2&=\frac{\sum_{i=1}^{N} (X_i-\overline{X})^2}{N-1} \\
&= \frac{\sum_{i=1}^{N} (X_i-\overline{X})*(X_i-\overline{X})}{N-1}
\end{split}
(\#eq:variance)
\end{equation}
When we consider two variables, we multiply the deviation for one variable by the respective deviation for the second variable:
<p style="text-align:center;">
$(X_i-\overline{X})*(Y_i-\overline{Y})$
</p>
This is called the cross-product deviation. Then we sum the cross-product deviations:
<p style="text-align:center;">
$\sum_{i=1}^{N}(X_i-\overline{X})*(Y_i-\overline{Y})$
</p>
... and compute the average of the sum of all cross-product deviations to get the <b>covariance</b>:
\begin{equation}
Cov(x, y) =\frac{\sum_{i=1}^{N}(X_i-\overline{X})*(Y_i-\overline{Y})}{N-1}
(\#eq:covariance)
\end{equation}
You can easily compute the covariance manually as follows:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
x <- att_data$duration
x_bar <- mean(att_data$duration)
y <- att_data$attitude
y_bar <- mean(att_data$attitude)
N <- nrow(att_data)
cov <- (sum((x - x_bar)*(y - y_bar))) / (N - 1)
cov
```
Or you simply use the built-in ```cov()``` function:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
cov(att_data$duration, att_data$attitude) # apply the cov function
```
A positive covariance indicates that as one variable deviates from the mean, the other variable deviates in the same direction. A negative covariance indicates that as one variable deviates from the mean (e.g., increases), the other variable deviates in the opposite direction (e.g., decreases).
However, the size of the covariance depends on the scale of measurement. Larger scale units will lead to larger covariance. To overcome the problem of dependence on measurement scale, we need to convert the covariance to a standard set of units through standardization by dividing the covariance by the standard deviation (similar to how we compute z-scores).
With two variables, there are two standard deviations. We simply multiply the two standard deviations. We then divide the covariance by the product of the two standard deviations to get the standardized covariance, which is known as a correlation coefficient r:
\begin{equation}
r=\frac{Cov_{xy}}{s_x*s_y}
(\#eq:corcoeff)
\end{equation}
This is known as the product moment correlation (r) and it is straightforward to compute:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
x_sd <- sd(att_data$duration)
y_sd <- sd(att_data$attitude)
r <- cov/(x_sd*y_sd)
r
```
Or you could just use the ```cor()``` function:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
cor(att_data[, c("attitude", "duration")], method = "pearson", use = "complete")
```
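Note that, unlike the covariance, the correlation coefficient does not depend on the units of measurement. As a quick illustration (a hypothetical rescaling, not part of the survey), we can express the duration of residency in months instead of years: the covariance changes, while the correlation coefficient stays the same.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
duration_months <- att_data$duration * 12 # rescale duration from years to months
cov(duration_months, att_data$attitude) # the covariance changes with the scale ...
cor(duration_months, att_data$attitude) # ... the correlation coefficient does not
```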
The properties of the correlation coefficient ('r') are:
* ranges from -1 to +1
* +1 indicates a perfect positive linear relationship
* -1 indicates a perfect negative linear relationship
* 0 indicates no linear relationship
* ± .1 represents small effect
* ± .3 represents medium effect
* ± .5 represents large effect
### Significance testing
How can we determine if our two variables are significantly related? To test this, we denote the population correlation coefficient *ρ* and test the null hypothesis of no relationship between the variables:
$$H_0:\rho=0$$
$$H_1:\rho\ne0$$
The test statistic is:
\begin{equation}
t=\frac{r*\sqrt{N-2}}{\sqrt{1-r^2}}
(\#eq:cortest)
\end{equation}
It has a t distribution with n - 2 degrees of freedom. Then, we follow the usual procedure of calculating the test statistic and comparing the test statistic to the critical value of the underlying probability distribution. If the calculated test statistic is larger than the critical value, the null hypothesis of no relationship between X and Y is rejected.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
t_calc <- r*sqrt(N - 2)/sqrt(1 - r^2) #calculated test statistic
t_calc
df <- (N - 2) #degrees of freedom
t_crit <- qt(0.975, df) #critical value
t_crit
pt(q = t_calc, df = df, lower.tail = F) * 2 #p-value
```
Or you can simply use the ```cor.test()``` function, which also produces the 95% confidence interval:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
cor.test(att_data$attitude, att_data$duration, alternative = "two.sided", method = "pearson", conf.level = 0.95)
```
To determine the linear relationship between variables, the data only needs to be measured using interval scales. If you want to test the significance of the association, the sampling distribution needs to be normally distributed (we usually assume this when our data are normally distributed or when N is large). If parametric assumptions are violated, you should use non-parametric tests:
* Spearman's correlation coefficient: requires ordinal data and ranks the data before applying Pearson's equation.
* Kendall's tau: use when N is small or the number of tied ranks is large.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
cor.test(att_data$attitude, att_data$duration, alternative = "two.sided", method = "spearman", conf.level = 0.95)
cor.test(att_data$attitude, att_data$duration, alternative = "two.sided", method = "kendall", conf.level = 0.95)
```
Report the results:
A Pearson product-moment correlation coefficient was computed to assess the relationship between the duration of residence in a city and the attitude toward the city. There was a positive correlation between the two variables, r = 0.936, n = 12, p < 0.05. A scatterplot summarizes the results (Figure XY).
**A note on the interpretation of correlation coefficients:**
As we have already seen in chapter 1, correlation coefficients give no indication of the direction of causality. In our example, we can conclude that the attitude toward the city is more positive as the years of residence increases. However, we cannot say that the years of residence cause the attitudes to be more positive. There are two main reasons for caution when interpreting correlations:
* Third-variable problem: there may be other unobserved factors that affect both the 'attitude towards a city' and the 'duration of residency' variables
* Direction of causality: Correlations say nothing about which variable causes the other to change (reverse causality: attitudes may just as well cause the years of residence variable).
## Regression
<br>
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/rtvDHLuXUEI" frameborder="0" allowfullscreen></iframe>
</div>
<br>
Correlations measure relationships between variables (i.e., how much two variables covary). Using regression analysis we can predict the outcome of a dependent variable (Y) from one or more independent variables (X). For example, we could be interested in how many products we will sell if we increase the advertising expenditures by 1,000 Euros. In regression analysis, we fit a model to our data and use it to predict the values of the dependent variable from one predictor variable (bivariate regression) or several predictor variables (multiple regression). The following table shows a comparison of correlation and regression analysis:
<br>
&nbsp; | Correlation | Regression
-------------|-------------------------- | --------------------------
Estimated coefficient | Coefficient of correlation (bounded between -1 and +1) | Regression coefficient (not bounded a priori)
Interpretation | Linear association between two variables; Association is bidirectional | (Linear) relation between one or more independent variables and dependent variable; Relation is directional
Role of theory | Theory neither required nor testable | Theory required and testable
<br>
### Simple linear regression
In simple linear regression, we assess the relationship between one dependent (regressand) and one independent (regressor) variable. The goal is to fit a line through a scatterplot of observations in order to find the line that best describes the data.
Suppose you are a marketing research analyst at a music label and your task is to suggest, on the basis of historical data, a marketing plan for the next year that will maximize product sales. The data set that is available to you includes information on the sales of music downloads (thousands of units), advertising expenditures (in Euros), the number of radio plays an artist received per week (airplay), the number of previous releases of an artist (starpower), repertoire origin (country; 0 = local, 1 = international), and genre (1 = rock, 2 = pop, 3 = electronic). Let's load and inspect the data first:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
regression <- read.table("https://raw.githubusercontent.com/IMSMWU/Teaching/master/MRDA2017/music_sales_regression.dat",
sep = "\t",
header = TRUE) #read in data
regression$country <- factor(regression$country, levels = c(0:1), labels = c("local", "international")) #convert grouping variable to factor
regression$genre <- factor(regression$genre, levels = c(1:3), labels = c("rock", "pop","electronic")) #convert grouping variable to factor
head(regression)
```
```{r message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE, paged.print = FALSE}
psych::describe(regression) #descriptive statistics using psych
```
As stated above, regression analysis may be used to relate a quantitative response ("dependent variable") to one or more predictor variables ("independent variables"). In a simple linear regression, we have one dependent and one independent variable and we regress the dependent variable on the independent variable.
Here are a few important questions that we might seek to address based on the data:
* Is there a relationship between advertising budget and sales?
* How strong is the relationship between advertising budget and sales?
* Which other variables contribute to sales?
* How accurately can we estimate the effect of each variable on sales?
* How accurately can we predict future sales?
* Is the relationship linear?
* Is there synergy among the advertising activities?
We may use linear regression to answer these questions. We will see later that the interpretation of the results strongly depends on the goal of the analysis - whether you would like to simply predict an outcome variable or you would like to explain the causal effect of the independent variable on the dependent variable (see chapter 1). Let's start with the first question and investigate the relationship between advertising and sales.
#### Estimating the coefficients
A simple linear regression model only has one predictor and can be written as:
\begin{equation}
Y=\beta_0+\beta_1X+\epsilon
(\#eq:regequ)
\end{equation}
In our specific context, let's consider only the influence of advertising on sales for now:
\begin{equation}
Sales=\beta_0+\beta_1*adspend+\epsilon
(\#eq:regequadv)
\end{equation}
The word "adspend" represents data on advertising expenditures that we have observed and β<sub>1</sub> (the "slope"") represents the unknown relationship between advertising expenditures and sales. It tells you by how much sales will increase for an additional Euro spent on advertising. β<sub>0</sub> (the "intercept") is the number of sales we would expect if no money is spent on advertising. Together, β<sub>0</sub> and β<sub>1</sub> represent the model coefficients or *parameters*. The error term (ε) captures everything that we miss by using our model, including, (1) misspecifications (the true relationship might not be linear), (2) omitted variables (other variables might drive sales), and (3) measurement error (our measurement of the variables might be imperfect).
Once we have used our training data to produce estimates for the model coefficients, we can predict future sales on the basis of a particular value of advertising expenditures by computing:
\begin{equation}
\hat{Sales}=\hat{\beta_0}+\hat{\beta_1}*adspend
(\#eq:predreg)
\end{equation}
We use the hat symbol, <sup>^</sup>, to denote the estimated value for an unknown parameter or coefficient, or to denote the predicted value of the response (sales). In practice, β<sub>0</sub> and β<sub>1</sub> are unknown and must be estimated from the data to make predictions. In the case of our advertising example, the data set consists of the advertising budget and product sales of 200 music songs (n = 200). Our goal is to obtain coefficient estimates such that the linear model fits the available data well. In other words, we fit a line through the scatterplot of observations and try to find the line that best describes the data. The following graph shows the scatterplot for our data, where the black line shows the regression line. The grey vertical lines show the differences between the predicted values (the regression line) and the observed values. These differences are referred to as the residuals ("e").
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.cap = "Ordinary least squares (OLS)"}
library(dplyr)
options(scipen = 999)
rm(regression)
rm(advertising)
regression <- read.table("https://raw.githubusercontent.com/IMSMWU/Teaching/master/MRDA2017/multiple_regression.dat",
sep = "\t",
header = TRUE) #read in data
lm <- lm(sales ~ adspend, data = regression)
regression$yhat <- predict(lm)
ggplot(regression, aes(x = adspend,y = sales)) +
geom_point(size = 2, color = "deepskyblue4") +
labs(x = "Advertising expenditure (in Euros)", y = "Sales", size = 11) +
geom_segment(aes(x = adspend, y = sales, xend = adspend, yend = yhat),
color = "grey",size = 0.5, linetype = "solid", alpha = 0.8) +
geom_smooth(method = "lm", se = FALSE, color = "black") +
theme(axis.title = element_text(size = 16),
axis.text = element_text(size = 16),
strip.text.x = element_text(size = 16),
legend.position="none") +
theme_bw()
```
The estimation of the regression function is based on the idea of the method of least squares (OLS = ordinary least squares). The first step is to calculate the residuals by subtracting the predicted values from the observed values.
<p style="text-align:center;">
$e_i = Y_i-(\beta_0+\beta_1X_i)$
</p>
This difference is then minimized by minimizing the sum of the squared residuals:
\begin{equation}
\sum_{i=1}^{N} e_i^2= \sum_{i=1}^{N} [Y_i-(\beta_0+\beta_1X_i)]^2\rightarrow min!
(\#eq:rss)
\end{equation}
e<sub>i</sub>: Residuals (i = 1,2,...,N)<br>
Y<sub>i</sub>: Values of the dependent variable (i = 1,2,...,N) <br>
β<sub>0</sub>: Intercept<br>
β<sub>1</sub>: Regression coefficient / slope parameter<br>
X<sub>i</sub>: Values of the independent variable (i = 1,2,...,N)<br>
N: Number of observations<br>
This is also referred to as the <b>residual sum of squares (RSS)</b>, which you may still remember from the previous chapter on ANOVA. Now we need to choose the values for β<sub>0</sub> and β<sub>1</sub> that minimize RSS. So how can we derive these values for the regression coefficient? The equation for β<sub>1</sub> is given by:
\begin{equation}
\hat{\beta_1}=\frac{COV_{XY}}{s_x^2}
(\#eq:slope)
\end{equation}
The exact mathematical derivation of this formula is beyond the scope of this script, but the intuition is to calculate the first derivative of the squared residuals with respect to β<sub>1</sub> and set it to zero, thereby finding the β<sub>1</sub> that minimizes the term. Using the above formula, you can easily compute β<sub>1</sub> using the following code:
```{r message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE}
cov_y_x <- cov(regression$adspend, regression$sales)
cov_y_x
var_x <- var(regression$adspend)
var_x
beta_1 <- cov_y_x/var_x
beta_1
```
The interpretation of β<sub>1</sub> is as follows:
For every extra Euro spent on advertising, sales can be expected to increase by `r round(beta_1,3)` units. Or, in other words, if we increase our marketing budget by 1,000 Euros, sales can be expected to increase by `r round(beta_1,3)*1000` units.
Using the estimated coefficient for β<sub>1</sub>, it is easy to compute β<sub>0</sub> (the intercept) as follows:
\begin{equation}
\hat{\beta_0}=\overline{Y}-\hat{\beta_1}\overline{X}
(\#eq:intercept)
\end{equation}
The R code for this is:
```{r message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE}
beta_0 <- mean(regression$sales) - beta_1*mean(regression$adspend)
beta_0
```
The interpretation of β<sub>0</sub> is as follows:
If we spend no money on advertising, we would expect to sell `r round(beta_0,3)` units.
You may also verify this based on a scatterplot of the data. The following plot shows the scatterplot including the regression line, which is estimated using OLS.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "Scatterplot"}
ggplot(regression, mapping = aes(adspend, sales)) +
geom_point(shape = 1) +
geom_smooth(method = "lm", fill = "blue", alpha = 0.1) +
labs(x = "Advertising expenditures (EUR)", y = "Number of sales") +
theme_bw()
```
You can see that the regression line intersects with the y-axis at `r round(beta_0,2)`, which corresponds to the expected sales level when advertising expenditure (on the x-axis) is zero (i.e., the intercept β<sub>0</sub>). The slope coefficient (β<sub>1</sub>) tells you by how much sales (on the y-axis) would increase if advertising expenditures (on the x-axis) are increased by one unit.
#### Significance testing
In the next step, we assess if the effect of advertising on sales is statistically significant. This means that we test the null hypothesis H<sub>0</sub>: "There is no relationship between advertising and sales" versus the alternative hypothesis H<sub>1</sub>: "There is some relationship between advertising and sales". Or, to state this formally:
$$H_0:\beta_1=0$$
$$H_1:\beta_1\ne0$$
How can we test if the effect is statistically significant? Recall the generalized equation to derive a test statistic:
\begin{equation}
test\ statistic = \frac{effect}{error}
(\#eq:teststatgeneral)
\end{equation}
The effect is given by the β<sub>1</sub> coefficient in this case. To compute the test statistic, we need to come up with a measure of uncertainty around this estimate (the error). This is because we use information from a sample to estimate the least squares line to make inferences regarding the regression line in the entire population. Since we only have access to one sample, the regression line will be slightly different every time we take a different sample from the population. This is sampling variation and it is perfectly normal! It just means that we need to take into account the uncertainty around the estimate, which is achieved by the standard error. Thus, the test statistic for our hypothesis is given by:
\begin{equation}
t = \frac{\hat{\beta_1}}{SE(\hat{\beta_1})}
(\#eq:teststatreg)
\end{equation}
After calculating the test statistic, we compare its value to the values that we would expect to find if there was no effect based on the t-distribution. In a regression context, the degrees of freedom are given by ```N - p - 1``` where N is the sample size and p is the number of predictors. In our case, we have 200 observations and one predictor. Thus, the degrees of freedom are 200 - 1 - 1 = 198. In the regression output below, R provides the exact probability of observing a t value of this magnitude (or larger) if the null hypothesis was true. This probability - as we already saw in chapter 6 - is the p-value. A small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the outcome variable due to chance in the absence of any real association between the predictor and the outcome.
To estimate the regression model in R, you can use the ```lm()``` function. Within the function, you first specify the dependent variable ("sales") and independent variable ("adspend") separated by a ```~``` (tilde). As mentioned previously, this is known as _formula notation_ in R. The ```data = regression``` argument specifies that the variables come from the data frame named "regression". Strictly speaking, you use the ```lm()``` function to create an object called "simple_regression," which holds the regression output. You can then view the results using the ```summary()``` function:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
simple_regression <- lm(sales ~ adspend, data = regression) #estimate linear model
summary(simple_regression) #summary of results
```
Note that the estimated coefficients for β<sub>0</sub> (`r round(summary(simple_regression)$coefficients[1],3)`) and β<sub>1</sub> (`r round(summary(simple_regression)$coefficients[2],3)`) correspond to the results of our manual computation above. The associated t-values and p-values are given in the output. The t-values are larger than the critical t-values for the 95% confidence level, as the associated p-values (smaller than 0.05) indicate. In the case of the coefficient for β<sub>1</sub>, this means that the probability of observing an association between advertising and sales of the observed magnitude (or larger) is smaller than 0.05 if the true value of β<sub>1</sub> was, in fact, 0. This finding leads us to reject the null hypothesis of no association between advertising and sales.
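To connect the output to the test statistic formula above, here is a small sketch that recomputes the t-value and the two-sided p-value for the ```adspend``` coefficient by hand, using the estimate and standard error reported by ```summary()``` (the object names are just for illustration):
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
coefs <- summary(simple_regression)$coefficients # estimate, std. error, t value, p-value
t_adspend <- coefs["adspend", "Estimate"]/coefs["adspend", "Std. Error"] # t = beta_1 / SE(beta_1)
t_adspend
2 * pt(abs(t_adspend), df = df.residual(simple_regression), lower.tail = FALSE) # two-sided p-value
```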
The coefficients associated with the respective variables represent <b>point estimates</b>. To obtain a better understanding of the range of values that the coefficients could take, it is helpful to compute <b>confidence intervals</b>. A 95% confidence interval is defined as a range of values such that with a 95% probability, the range will contain the true unknown value of the parameter. For example, for β<sub>1</sub>, the confidence interval can be computed as:
\begin{equation}
CI = \hat{\beta_1}\pm(t_{1-\frac{\alpha}{2}}*SE(\beta_1))
(\#eq:regCI)
\end{equation}
It is easy to compute confidence intervals in R using the ```confint()``` function. You just have to provide the name of your estimated model as an argument:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
confint(simple_regression)
```
For our model, the 95% confidence interval for β<sub>0</sub> is [`r round(confint(simple_regression)[1,1],2)`,`r round(confint(simple_regression)[1,2],2)`], and the 95% confidence interval for β<sub>1</sub> is [`r round(confint(simple_regression)[2,1],2)`,`r round(confint(simple_regression)[2,2],2)`]. Thus, we can conclude that when we do not spend any money on advertising, sales will be somewhere between `r round(confint(simple_regression)[1,1],0)` and `r round(confint(simple_regression)[1,2],0)` units on average. In addition, for each increase in advertising expenditures by one Euro, there will be an average increase in sales of between `r round(confint(simple_regression)[2,1],2)` and `r round(confint(simple_regression)[2,2],2)`. If you revisit the graphic depiction of the regression model above, the uncertainty regarding the intercept and slope parameters can be seen in the confidence bounds (blue area) around the regression line.
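If you would like to verify what ```confint()``` does, the interval for β<sub>1</sub> can also be reproduced manually from the formula above; a minimal sketch:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
beta_1_hat <- summary(simple_regression)$coefficients["adspend", "Estimate"]
se_beta_1 <- summary(simple_regression)$coefficients["adspend", "Std. Error"]
beta_1_hat + c(-1, 1) * qt(0.975, df = df.residual(simple_regression)) * se_beta_1 # lower and upper bound
```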
#### Assessing model fit
<br>
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/nG4_st29Qe8" frameborder="0" allowfullscreen></iframe>
</div>
<br>
Once we have rejected the null hypothesis in favor of the alternative hypothesis, the next step is to investigate how well the model represents ("fits") the data. How can we assess the model fit?
* First, we calculate the fit of the most basic model (i.e., the mean)
* Then, we calculate the fit of the best model (i.e., the regression model)
* A good model should fit the data significantly better than the basic model
* R<sup>2</sup>: Represents the percentage of the variation in the outcome that can be explained by the model
* The F-ratio measures how much the model has improved the prediction of the outcome compared to the level of inaccuracy in the model
Similar to ANOVA, the calculation of model fit statistics relies on estimating the different sum of squares values. SS<sub>T</sub> is the difference between the observed data and the mean value of Y (aka. total variation). In the absence of any other information, the mean value of Y ($\overline{Y}$) represents the best guess on where a particular observation $Y_{i}$ at a given level of advertising will fall:
\begin{equation}
SS_T= \sum_{i=1}^{N} (Y_i-\overline{Y})^2
(\#eq:regSST)
\end{equation}
The following graph shows the total sum of squares:
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.cap = "Total sum of squares"}
library(dplyr)
options(scipen = 999)
rm(regression)
rm(advertising)
regression <- read.table("https://raw.githubusercontent.com/IMSMWU/Teaching/master/MRDA2017/music_sales_regression.dat",
sep = "\t",
header = TRUE) #read in data
lm <- lm(sales ~ adspend, data = regression)
regression$yhat <- predict(lm)
ggplot(regression, aes(x = adspend, y = sales)) +
geom_point(size = 2, color = "deepskyblue4") +
labs(x = "Advertising", y = "Sales", size = 11) +
geom_segment(aes(x = adspend, y = sales, xend = adspend,
yend = mean(sales)), color = "grey",
size = 0.5, linetype = "solid", alpha = 0.8) +
geom_hline(data = regression, aes(yintercept = mean(sales)), color = "black", size = 1) +
theme(axis.title = element_text(size = 16),
axis.text = element_text(size = 16),
strip.text.x = element_text(size = 16),
legend.position="none") +
theme_bw()
```
Based on our linear model, the best guess about the sales level at a given level of advertising is the predicted value $\hat{Y}_i$. The model sum of squares (SS<sub>M</sub>) therefore has the mathematical representation:
\begin{equation}
SS_M= \sum_{i=1}^{N} (\hat{Y}_i-\overline{Y})^2
(\#eq:regSSM)
\end{equation}
The model sum of squares represents the improvement in prediction resulting from using the regression model rather than the mean of the data. The following graph shows the model sum of squares for our example:
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.cap = "Ordinary least squares (OLS)"}
ggplot(regression, aes(x = adspend, y = sales)) +
geom_point(size = 2, color = "deepskyblue4") +
labs(x = "Advertising",y = "Sales", size = 11) +
geom_segment(aes(x = adspend, y = yhat, xend = adspend, yend = mean(sales)),
color = "grey", size = 0.5, linetype = "solid", alpha = 0.8) +
geom_smooth(method = "lm", se = FALSE, color = "black") +
geom_hline(data = regression, aes(yintercept = mean(sales)), color = "black", size = 1) +
theme(axis.title = element_text(size = 16),
axis.text = element_text(size = 16),
strip.text.x = element_text(size = 16),
legend.position = "none") +
theme_bw()
```
The residual sum of squares (SS<sub>R</sub>) is the difference between the observed data points ($Y_{i}$) and the predicted values along the regression line ($\hat{Y}_{i}$), i.e., the variation *not* explained by the model.
\begin{equation}
SS_R= \sum_{i=1}^{N} ({Y}_{i}-\hat{Y}_{i})^2
(\#eq:regSSR)
\end{equation}
The following graph shows the residual sum of squares for our example:
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.cap = "Ordinary least squares (OLS)"}
ggplot(regression, aes(x = adspend, y = sales)) +
geom_point(size = 2, color = "deepskyblue4") +
labs(x = "Advertising",y = "Sales", size = 11) +
geom_segment(aes(x = adspend, y = sales, xend = adspend, yend = yhat),
color = "grey", size = 0.5, linetype = "solid", alpha = 0.8) +
geom_smooth(method = "lm", se = FALSE, color = "black") +
theme(axis.title = element_text(size = 16),
axis.text = element_text(size = 16),
strip.text.x = element_text(size = 16),
legend.position = "none") +
theme_bw()
```
Based on these statistics, we can determine how well the model fits the data, as we will see next.
##### R-squared {-}
The R<sup>2</sup> statistic represents the proportion of variance that is explained by the model and is computed as:
\begin{equation}
R^2= \frac{SS_M}{SS_T}
(\#eq:regR2)
\end{equation}
It takes values between 0 (very bad fit) and 1 (very good fit). Note that when the goal of your model is to *predict* future outcomes, a "too good" model fit can pose severe challenges. The reason is that the model might fit your specific sample so well, that it will only predict well within the sample but not generalize to other samples. This is called **overfitting** and it shows that there is a trade-off between model fit and out-of-sample predictive ability of the model, if the goal is to predict beyond the sample. We will come back to this point later in this chapter.
You can get a first impression of the fit of the model by inspecting the scatter plot as can be seen in the plot below. If the observations are highly dispersed around the regression line (left plot), the fit will be lower compared to a data set where the values are less dispersed (right plot).
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.height = 4, fig.width = 10, fig.cap="Good vs. bad model fit"}
library(cowplot)
library(gridExtra)
library(grid)
set.seed(44)
#x3 <- rlnorm(250, log(1), log(0.6))
options(scipen = 999)
options(digits = 2)
x1 <- rnorm(200,614.41,485)
x <- as.data.frame(subset(x1,x1>0))
names(x)<-c('adspend')
error <- rnorm(nrow(x))
sales <- round(134 + 0.094*x$adspend + 50*error)
sales_data <- data.frame(sales,x$adspend)
names(sales_data)<-c('sales','adspend')
examplereg <- subset(sales_data,sales>0 & adspend>0)
#summary(examplereg)
lm <- lm(sales ~ adspend, data = examplereg)
#summary(lm)
examplereg$yhat <- predict(lm)
scatter_plot1 <- ggplot(examplereg,aes(adspend,sales)) +
geom_point(size=2,shape=1) + # Use hollow circles
geom_smooth(method="lm") + # Add linear examplereg line (by default includes 95% confidence region);
scale_x_continuous(name="advertising expenditures", limits=c(0, 1800)) +
scale_y_continuous(name="sales", limits=c(0, 400)) +
theme_bw() +
labs(title = paste0("R-squared: ",round(summary(lm)$r.squared,2)))
#x3 <- rlnorm(250, log(1), log(0.6))
options(scipen = 999)
options(digits = 2)
x1 <- rnorm(200,614.41,485)
x <- as.data.frame(subset(x1,x1>0))
names(x)<-c('adspend')
error <- rnorm(nrow(x))
sales <- round(134 + 0.094*x$adspend + 20*error)
sales_data <- data.frame(sales,x$adspend)
names(sales_data)<-c('sales','adspend')
#summary(sales_data)
examplereg <- subset(sales_data,sales>0 & adspend>0)
#summary(examplereg)
lm <- lm(sales ~ adspend, data = examplereg)
examplereg$yhat <- predict(lm)
scatter_plot2 <- ggplot(examplereg,aes(adspend,sales)) +
geom_point(size=2,shape=1) + # Use hollow circles
geom_smooth(method="lm") + # Add linear examplereg line (by default includes 95% confidence region);
scale_x_continuous(name="advertising expenditures", limits=c(0, 1800)) +
scale_y_continuous(name="sales", limits=c(0, 400)) +
theme_bw() +
labs(title = paste0("R-squared: ",round(summary(lm)$r.squared,2)))
#p <- plot_grid(plot1, plot2, ncol = 2)
p <- plot_grid(scatter_plot1,scatter_plot2, ncol = 2)
print(p)
# now add the title
#title <- ggdraw() + draw_label("", fontface='bold')
#plot_full <- plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1)) # rel_heights values control title margins
#print(plot_full)
```
The R<sup>2</sup> statistic is reported in the regression output (see above). However, you could also extract the relevant sum of squares statistics from the regression object using the ```anova()``` function to compute it manually:
```{r message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE,paged.print = FALSE}
anova(simple_regression) #anova results
```
Now we can compute R<sup>2</sup> in the same way that we have computed Eta<sup>2</sup> in the last section:
```{r message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE,paged.print = FALSE}
r2 <- anova(simple_regression)$'Sum Sq'[1]/(anova(simple_regression)$'Sum Sq'[1] + anova(simple_regression)$'Sum Sq'[2]) #compute R2
r2
```
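With a single predictor, R<sup>2</sup> is simply the squared Pearson correlation between the dependent and the independent variable, which provides another quick consistency check (here computed from the data stored in the model object):
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
cor(simple_regression$model$sales, simple_regression$model$adspend)^2 # squared correlation
summary(simple_regression)$r.squared # R-squared reported in the model summary
```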
##### Adjusted R-squared {-}
Due to the way the R<sup>2</sup> statistic is calculated, it will never decrease if a new explanatory variable is introduced into the model. This means that every new independent variable either doesn't change the R<sup>2</sup> or increases it, even if there is no real relationship between the new variable and the dependent variable. Hence, one could be tempted to just add as many variables as possible to increase the R<sup>2</sup> and thus obtain a "better" model. However, this actually only leads to more noise and therefore a worse model.
To account for this, there is a fit statistic closely related to R<sup>2</sup>, the **adjusted R<sup>2</sup>**. It can be calculated as follows:
\begin{equation}
\overline{R^2} = 1 - (1 - R^2)\frac{n-1}{n - k - 1}
(\#eq:adjustedR2)
\end{equation}
where ```n``` is the total number of observations and ```k``` is the total number of explanatory variables. The adjusted R<sup>2</sup> is equal to or less than the regular R<sup>2</sup> and can be negative. It will only increase if the added variable adds more explanatory power than one would expect by pure chance. Essentially, it contains a "penalty" for including unnecessary variables and therefore favors more parsimonious models. As such, it is well suited for comparing different models and is very useful in the model selection stage of a project. In R, the standard ```lm()``` function automatically also reports the adjusted R<sup>2</sup> as you can see above.
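To illustrate the formula, the following sketch computes the adjusted R<sup>2</sup> for the simple model by hand and compares it to the value reported by ```summary()```:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
r2 <- summary(simple_regression)$r.squared
n <- nobs(simple_regression) # number of observations
k <- 1 # number of explanatory variables in the simple model
1 - (1 - r2) * (n - 1)/(n - k - 1) # adjusted R-squared computed manually
summary(simple_regression)$adj.r.squared # adjusted R-squared reported by summary()
```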
##### F-test {-}
Similar to the ANOVA in chapter 6, another significance test is the F-test, which tests the null hypothesis:
$$H_0:R^2=0$$
<br>
Or, to state it slightly differently:
$$H_0:\beta_1=\beta_2=\dots=\beta_k=0$$
<br>
This means that, similar to the ANOVA, we test whether any of the included independent variables has a significant effect on the dependent variable. So far, we have only included one independent variable, but we will extend the set of predictor variables below.
The F-test statistic is calculated as follows:
\begin{equation}
F=\frac{\frac{SS_M}{k}}{\frac{SS_R}{(n-k-1)}}=\frac{MS_M}{MS_R}
(\#eq:regF)
\end{equation}
which follows an F distribution with k (the number of predictors) and n - k - 1 degrees of freedom. In other words, you divide the systematic ("explained") variation due to the predictor variables by the unsystematic ("unexplained") variation.
The result of the F-test is provided in the regression output. However, you can also compute the F-test manually using the ANOVA results from the model:
```{r message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE,paged.print = FALSE}
anova(simple_regression) #anova results
f_calc <- anova(simple_regression)$'Mean Sq'[1]/anova(simple_regression)$'Mean Sq'[2] #compute F
f_calc
f_crit <- qf(.95, df1 = 1, df2 = 198) #critical value with df1 = k = 1 and df2 = n - k - 1 = 198
f_crit
f_calc > f_crit #test if calculated test statistic is larger than critical value
```
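Alternatively, the F-statistic and its degrees of freedom can be extracted directly from the summary object, and the corresponding p-value recomputed from the F distribution; a small sketch:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
f_stat <- summary(simple_regression)$fstatistic # named vector: value, numdf, dendf
f_stat
pf(f_stat["value"], df1 = f_stat["numdf"], df2 = f_stat["dendf"], lower.tail = FALSE) # p-value of the F-test
```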
#### Using the model
After fitting the model, we can use the estimated coefficients to predict sales for different values of advertising. Suppose you want to predict sales for a new product, and the company plans to spend 800 Euros on advertising. How much will it sell? You can easily compute this either by hand:
$$\hat{sales}=134.134 + 0.09612*800=211$$
<br>
... or by extracting the estimated coefficients from the model summary:
```{r message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE}
summary(simple_regression)$coefficients[1,1] + # the intercept
summary(simple_regression)$coefficients[2,1]*800 # the slope * 800
```
The predicted value of the dependent variable is 211 units, i.e., the product will (on average) sell 211 units.
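Instead of extracting the coefficients by hand, you could also use the ```predict()``` function, which takes a data frame with the new predictor values; a minimal sketch (optionally with a confidence interval around the expected sales):
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
predict(simple_regression, newdata = data.frame(adspend = 800)) # point prediction
predict(simple_regression, newdata = data.frame(adspend = 800), interval = "confidence") # with 95% confidence interval
```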
### Multiple linear regression
<br>
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/SDB2PhHMgxg" frameborder="0" allowfullscreen></iframe>
</div>
<br>
Multiple linear regression is a statistical technique that simultaneously tests the relationships between two or more independent variables and an interval-scaled dependent variable. The general form of the equation is given by:
\begin{equation}
Y=\beta_0+\beta_1*X_1+\beta_2*X_2+\dots+\beta_n*X_n+\epsilon
(\#eq:regequmult)
\end{equation}
Again, we aim to find the linear combination of predictors that correlates maximally with the outcome variable. Note that if you change the composition of predictors, the partial regression coefficient of an independent variable will differ from its bivariate regression coefficient. This is because the regressors are usually correlated and, in the bivariate regression, any variation in Y that is shared by X<sub>1</sub> and X<sub>2</sub> is attributed to X<sub>1</sub> alone. The interpretation of a partial regression coefficient is the expected change in Y when the corresponding X is changed by one unit while all other predictors are held constant.
Let's extend the previous example. Say, in addition to the influence of advertising, you are interested in estimating the influence of radio airplay on the number of album downloads. The corresponding equation would then be given by:
\begin{equation}
Sales=\beta_0+\beta_1*adspend+\beta_2*airplay+\epsilon
(\#eq:regequadv2)
\end{equation}
The words "adspend" and "airplay" represent data that we have observed on advertising expenditures and number of radio plays, and β<sub>1</sub> and β<sub>2</sub> represent the unknown relationship between sales and advertising expenditures and radio airplay, respectively. The corresponding coefficients tell you by how much sales will increase for an additional Euro spent on advertising (when radio airplay is held constant) and by how much sales will increase for an additional radio play (when advertising expenditures are held constant). Thus, we can make predictions about album sales based not only on advertising spending, but also on radio airplay.
With several predictors, the partitioning of sum of squares is the same as in the bivariate model, except that the model is no longer a 2-D straight line. With two predictors, the regression line becomes a 3-D regression plane. In our example:
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE,fig.align="center", fig.cap = "Regression plane"}
#https://stackoverflow.com/questions/44322350/add-regression-plane-in-r-using-plotly
library(plotly)
library(reshape2)
n <- nrow(regression)
x1 <- regression$adspend
x2 <- regression$airplay
x3 <- rnorm(n)>0.5
y <- regression$sales
df <- data.frame(y, x1, x2)
### Estimation of the regression plane
mod <- lm(y ~ x1+x2)
cf.mod <- coef(mod)
### Calculate z on a grid of x-y values
x1.seq <- seq(min(x1),max(x1),length.out=25)
x2.seq <- seq(min(x2),max(x2),length.out=25)
z <- t(outer(x1.seq, x2.seq, function(x,y) cf.mod[1]+cf.mod[2]*x+cf.mod[3]*y))
#### Draw the plane with "plot_ly" and add points with "add_trace"
cols <- c("darkgray", "steelblue")
cols <- cols[x3+1]
library(plotly)
p <- plot_ly(x=~x1.seq, y=~x2.seq, z=~z,
colors = c("#A9D0F5", "#08088A"),type="surface") %>%
add_trace(data=df, x=x1, y=x2, z=y, mode="markers", type="scatter3d",
marker = list(color=cols, size=3, opacity=0.8, symbol=75)) %>%
layout(scene = list(
xaxis = list(title = "adspend"),
yaxis = list(title = "airplay"),
zaxis = list(title = "sales")))
p
```
As in the bivariate case, the plane is fitted to the data with the aim of predicting the observed data as well as possible. The deviations of the observations from the plane represent the residuals (the error we make in predicting the observed data from the model). Note that this is conceptually the same as in the bivariate case, except that the computation is more complex (we won't go into the details here). The model is fairly easy to plot using a 3-D scatterplot because we only have two predictors. While multiple regression models with more than two predictors are not as easy to visualize, you may apply the same principles when interpreting the model outcome:
* Total sum of squares (SS<sub>T</sub>) is still the difference between the observed data and the mean value of Y (total variation)
* Residual sum of squares (SS<sub>R</sub>) is still the difference between the observed data and the values predicted by the model (unexplained variation)
* Model sum of squares (SS<sub>M</sub>) is still the difference between the values predicted by the model and the mean value of Y (explained variation)
* R measures the multiple correlation between the predictors and the outcome
* R<sup>2</sup> is the amount of variation in the outcome variable explained by the model
Estimating multiple regression models is straightforward using the ```lm()``` function. You just need to separate the individual predictors on the right hand side of the equation using the ```+``` symbol. For example, the model:
\begin{equation}
Sales=\beta_0+\beta_1*adspend+\beta_2*airplay+\beta_3*starpower+\epsilon
(\#eq:regequadv3)
\end{equation}
could be estimated as follows:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
multiple_regression <- lm(sales ~ adspend + airplay + starpower, data = regression) #estimate linear model
summary(multiple_regression) #summary of results
```
The interpretation of the coefficients is as follows:
* adspend (β<sub>1</sub>): when advertising expenditures increase by 1 Euro, sales will increase by `r round(summary(multiple_regression)$coefficients[2],3)` units
* airplay (β<sub>2</sub>): when radio airplay increases by 1 play per week, sales will increase by `r round(summary(multiple_regression)$coefficients[3],3)` units
* starpower (β<sub>3</sub>): when the number of previous albums increases by 1, sales will increase by `r round(summary(multiple_regression)$coefficients[4],3)` units
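As noted above, these are partial coefficients, so they will generally differ from the corresponding bivariate coefficients because the predictors are correlated. A quick sketch comparing the bivariate and the partial coefficient for ```adspend```:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
coef(lm(sales ~ adspend, data = regression))["adspend"] # bivariate coefficient of adspend
coef(multiple_regression)["adspend"] # partial coefficient, controlling for airplay and starpower
```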
The associated t-values and p-values are also given in the output. You can see that the p-values are smaller than 0.05 for all three coefficients. Hence, all effects are "significant". This means that if the null hypothesis was true (i.e., there was no effect between the variables and sales), the probability of observing associations of the estimated magnitudes (or larger) is very small (e.g., smaller than 0.05).
Again, to get a better feeling for the range of values that the coefficients could take, it is helpful to compute <b>confidence intervals</b>.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
confint(multiple_regression)
```
What does this tell you? Recall that a 95% confidence interval is defined as a range of values such that with a 95% probability, the range will contain the true unknown value of the parameter. For example, for β<sub>3</sub>, the confidence interval is [`r confint(multiple_regression)[4,1]`,`r confint(multiple_regression)[4,2]`]. Thus, although we have computed a point estimate of `r round(summary(multiple_regression)$coefficients[4],3)` for the effect of starpower on sales based on our sample, the effect might actually just as well take any other value within this range, considering the sample size and the variability in our data. You could also visualize the output from your regression model including the confidence intervals using the `ggstatsplot` package as follows:
```{r message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE,fig.align="center", fig.cap = "Confidence intervals for regression model"}
library(ggstatsplot)
ggcoefstats(x = multiple_regression,
title = "Sales predicted by adspend, airplay, & starpower")
```
The output also tells us that `r summary(multiple_regression)$r.squared*100`% of the variation can be explained by our model. You may also visually inspect the fit of the model by plotting the predicted values against the observed values. We can extract the predicted values using the ```predict()``` function. So let's create a new variable ```yhat```, which contains those predicted values.
```{r message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE}
regression$yhat <- predict(multiple_regression) # predicted values from the multiple regression model
```
We can now use this variable to plot the predicted values against the observed values. In the following plot, the model fit would be perfect if all points fell exactly on the diagonal line; the larger the distance between the points and the line, the worse the model fit.
```{r message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE,fig.align="center", fig.cap = "Model fit"}
ggplot(regression,aes(yhat,sales)) +
geom_point(size=2,shape=1) + #Use hollow circles
scale_x_continuous(name="predicted values") +
scale_y_continuous(name="observed values") +
geom_abline(intercept = 0, slope = 1) +
theme_bw()
```
**Partial plots**
In the context of a simple linear regression (i.e., with a single independent variable), a scatter plot of the dependent variable against the independent variable provides a good indication of the nature of the relationship. If there is more than one independent variable, however, things become more complicated. The reason is that although the scatter plot still shows the relationship between the two variables, it does not take into account the effect of the other independent variables in the model. Partial regression plots show the effect of adding another variable to a model that already controls for the remaining variables in the model. In other words, a partial regression plot is a scatterplot of the residuals of the outcome variable and of the respective predictor when both are regressed separately on the remaining predictors. As an example, consider the effect of advertising expenditures on sales. In this case, the partial plot would show the effect of adding advertising expenditures as an explanatory variable while controlling for the variation that is explained by airplay and starpower in both variables (sales and advertising). Think of it as the purified relationship between advertising and sales that remains after controlling for other factors. The partial plots can easily be created using the ```avPlots()``` function from the ```car``` package:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE,fig.align="center", fig.cap = "Partial plots"}
library(car)
avPlots(multiple_regression)
```
**Using the model**
After fitting the model, we can use the estimated coefficients to predict sales for different values of advertising, airplay, and starpower. Suppose you would like to predict sales for a new music album with advertising expenditures of 800, airplay of 30 and starpower of 5. How much will it sell?
$$\hat{sales}=-26.61 + 0.084*800 + 3.367*30 + 11.08*5= 197.74$$
<br>
... or by extracting the estimated coefficients:
```{r message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE}
summary(multiple_regression)$coefficients[1,1] +
summary(multiple_regression)$coefficients[2,1]*800 +
summary(multiple_regression)$coefficients[3,1]*30 +
summary(multiple_regression)$coefficients[4,1]*5
```
The predicted value of the dependent variable is 198 units, i.e., the product will sell 198 units.
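The same prediction can again be obtained with the ```predict()``` function by handing over the new predictor values in a data frame (the name ```new_album``` is just for illustration):
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
new_album <- data.frame(adspend = 800, airplay = 30, starpower = 5) # hypothetical new album
predict(multiple_regression, newdata = new_album)
```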
**Comparing effects**
Using the output from the regression model above, it is difficult to compare the effects of the independent variables because they are all measured on different scales (Euros, radio plays, releases). Standardized regression coefficients can be used to judge the relative importance of the predictor variables. Standardization is achieved by multiplying the unstandardized coefficient by the ratio of the standard deviations of the independent and dependent variables:
\begin{equation}
B_{k}=\beta_{k} * \frac{s_{x_k}}{s_y}
(\#eq:stdcoeff)
\end{equation}
Hence, the standardized coefficient will tell you by how many standard deviations the outcome will change as a result of a one standard deviation change in the predictor variable. Standardized coefficients can be easily computed using the ```lm.beta()``` function from the ```lm.beta``` package.
```{r message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE}
library(lm.beta)
lm.beta(multiple_regression)
```
The results show that for ```adspend``` and ```airplay```, a change by one standard deviation will result in a 0.51 standard deviation change in sales, whereas for ```starpower```, a one standard deviation change will only lead to a 0.19 standard deviation change in sales. Hence, while the effects of ```adspend``` and ```airplay``` are comparable in magnitude, the effect of ```starpower``` is less strong.
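If you would like to see where these values come from, you can also apply the formula above manually. The following sketch computes the standardized coefficient for ```adspend```, assuming that the respective variables are contained in the ```regression``` data frame used above:
```{r message=FALSE, warning=FALSE}
# Standardized coefficient for adspend: unstandardized coefficient * sd(x) / sd(y)
coef(multiple_regression)["adspend"] * sd(regression$adspend) / sd(regression$sales)
```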
<br>
## Potential problems
Once you have built and estimated your model it is important to run diagnostics to ensure that the results are accurate. In the following section we will discuss common problems.
### Outliers
<br>
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/TphnIgvRUlA" frameborder="0" allowfullscreen></iframe>
</div>
<br>
Outliers are data points that differ vastly from the trend. They can bias a model because they alter the parameter estimates. Consider the example below. A linear regression was performed twice on the same data set, except that during the second estimation the two green points were turned into outliers by moving them to the positions indicated in red. The solid red line is the regression line based on the unaltered data set, while the dotted black line was estimated using the altered data set. As you can see, the second regression would lead to different conclusions than the first. Therefore, it is important to identify outliers and deal with them appropriately.
```{r, message=F, warning=F, echo = FALSE, eval = TRUE, fig.cap = "Effects of outliers", fig.align="center"}
set.seed(123)
x <- 1:30
y <- x*2.8 + rnorm(30, sd = 10)
y_altered <- y
y_altered[29] <- -50
y_altered[27] <- -47
ggplot(data = data.frame(X = x, Y = y, Y_alt = y_altered), aes(x = X, y = Y)) +
geom_point(aes(x = X, y = Y_alt), col = "red") +
geom_point() +
geom_point(aes(x = 27, y = Y[27]), col = "green") +
geom_point(aes(x = 29, y = Y[29]), col = "green") +
geom_smooth(method = "lm", se = FALSE, col = "firebrick", lwd = 0.5) +
geom_smooth(aes(x = X, y = Y_alt), method = "lm", se = FALSE, col = "black", lwd = 0.5, lty = 2) +
theme_bw()
```
One quick way to visually detect outliers is to create a scatterplot (as above) and check whether anything seems off. Another approach is to inspect the studentized residuals. If there are no outliers in your data, about 95% of the studentized residuals should lie between -2 and 2, as implied by the normal distribution. Values well outside of this range are unlikely to occur by chance and warrant further inspection. As a rule of thumb, observations whose studentized residuals exceed 3 in absolute value are potential outliers.
The studentized residuals can be obtained in R with the function ```rstudent()```. We can use this function to create a new variable that contains the studentized residuals. The music sales regression from before yields the following residuals:
```{r, warning=FALSE, message=FALSE}
regression$stud_resid <- rstudent(multiple_regression)
head(regression)
```
A good way to visually inspect the studentized residuals is to plot them in a scatterplot and roughly check if most of the observations are within the -3, 3 bounds.
```{r, warning=FALSE, message=FALSE, fig.cap="Plot of the studentized residuals", fig.align="center"}
plot(1:nrow(regression),regression$stud_resid, ylim=c(-3.3,3.3)) #create scatterplot
abline(h=c(-3,3),col="red",lty=2) #add reference lines
```
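Besides the visual check, you can also quantify this, e.g., by computing the share of studentized residuals that fall within the ±2 and ±3 bounds. This is merely a quick numerical sketch based on the ```stud_resid``` variable created above:
```{r, warning=FALSE, message=FALSE}
# Share of observations with studentized residuals within +/-2 and +/-3
mean(abs(regression$stud_resid) <= 2)
mean(abs(regression$stud_resid) <= 3)
```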
To identify potential outliers in our data set, we can apply a filter to our data:
```{r, warning = FALSE, message = FALSE}
outliers <- subset(regression,abs(stud_resid)>3)
outliers
```
After a detailed inspection of the potential outliers, you may decide whether or not to delete the affected observations from the data set. If an outlier has resulted from an error in data collection, you might simply remove the observation. However, even observations with extreme values are not necessarily influential for the estimated regression line, i.e., the results might not change much whether we include or exclude them from the analysis. The decision of whether to exclude an outlier is therefore closely related to the question of whether this observation is an influential observation, as will be discussed next.
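One way to check whether the flagged observations actually matter for your results is to re-estimate the model without them and compare the coefficients. The following is only a sketch of this idea (using the rule of thumb from above), not a recommendation to routinely drop observations:
```{r, warning=FALSE, message=FALSE}
# Re-estimate the same model without observations whose studentized residuals exceed 3 in absolute value
regression_reduced <- subset(regression, abs(stud_resid) <= 3)
multiple_regression_reduced <- update(multiple_regression, data = regression_reduced)
# Compare the coefficients of both models
cbind(full_sample = coef(multiple_regression),
      reduced_sample = coef(multiple_regression_reduced))
```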
### Influential observations
Related to the issue of outliers is that of influential observations, i.e., observations that exert undue influence on the parameter estimates. It is possible to determine whether or not the results are driven by an influential observation by calculating how far the predicted values for your data would move if the model were fitted without this particular observation. This distance is called **Cook's distance**. To identify influential observations, we can inspect the respective plots created from the model output. A common rule of thumb is to classify observations with a Cook's distance > 1 as influential (although opinions vary on this). The following plot can be used to see the Cook's distance associated with each data point:
```{r, warning = FALSE, message = FALSE,fig.align="center", fig.cap = "Cook's distance"}
plot(multiple_regression,4)
```
It is easy to see that none of the Cook's distance values is close to the critical value of 1. Another useful plot to identify influential observations is plot number 5 from the output:
```{r, warning = FALSE, message = FALSE,fig.align="center", fig.cap = "Residuals vs. Leverage"}
plot(multiple_regression,5)
```
In this plot, we look for cases outside of the dashed lines, which represent the **Cook's distance** thresholds of 0.5 and 1 (included by default). In our example, these lines are not even visible, since all Cook's distance values are far below the critical values. Generally, you would watch out for outlying values in the upper right or lower right corner of the plot, as cases in those regions can be influential on the regression line. In our example, there are no influential cases.
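If you prefer a numerical check in addition to the plots, you can also extract the Cook's distance values directly with the ```cooks.distance()``` function and check whether any of them exceed the (conservative) threshold of 1. A minimal sketch:
```{r, warning=FALSE, message=FALSE}
# Largest Cook's distance values in the sample
head(sort(cooks.distance(multiple_regression), decreasing = TRUE))
# Check whether any observation exceeds the threshold of 1
any(cooks.distance(multiple_regression) > 1)
```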
To see how influential observations can impact your regression, have a look at <a href="https://en.wikipedia.org/wiki/Anscombe%27s_quartet" target="_blank">this example</a>.
::: {.infobox_orange .hint data-latex="{hint}"}
To summarize, if you detect outliers in your data, you should test whether these observations exert undue influence on your results using the Cook's distance statistic as described above. If you find observations that bias your results, you should remove them.
:::
### Non-linearity
<br>
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/Fsf0q_6sKKY" frameborder="0" allowfullscreen></iframe>
</div>
<br>
An important underlying assumption for OLS is that of linearity, meaning that the relationship between the dependent and the independent variable can be reasonably approximated in linear terms. One quick way to assess whether a linear relationship can be assumed is to inspect the added variable plots that we already came across earlier:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE,fig.align="center", fig.cap = "Partial plots"}
library(car)
avPlots(multiple_regression)
```
In our example, it appears that linear relationships can be reasonably assumed. Please note, however, that the linear model implies two things:
* Constant marginal returns
* Elasticities increase with X
These assumptions may not be justifiable in certain contexts. As an example, consider the effect of marketing expenditures on sales. The linear model assumes that changing your advertising expenditures from, say, 10€ to 11€ will change sales by the same amount as changing your expenditures from, say, 100,000€ to 100,001€. This is what we mean by **constant marginal returns**: irrespective of the level of advertising, spending an additional Euro on advertising will change sales by the same amount. Or consider the effect of price on sales: a linear model assumes that changing the price from, say, 10€ to 11€ will change sales by the same amount as increasing the price from, say, 20€ to 21€. An elasticity tells you the relative change in the outcome variable (e.g., sales) due to a relative change in the predictor variable. For example, if we change our advertising expenditures by 1%, sales will change by XY%. As we have seen, the linear model assumes constant marginal returns, which implies that the **elasticity increases** with the level of the independent variable. In our example, advertising becomes relatively more effective at higher spending levels: as we move to higher levels of advertising expenditures, a smaller relative change in advertising expenditure is needed to yield the same relative change in sales.
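To make this more tangible, you can compute the elasticity implied by the linear model at different levels of advertising. For a linear model, the point elasticity of sales with respect to advertising is the coefficient of ```adspend``` multiplied by the ratio of advertising expenditures to predicted sales. The following is only a rough sketch of this idea; the values chosen for airplay (30) and starpower (5) are arbitrary and merely serve as an illustration:
```{r message=FALSE, warning=FALSE}
# Implied point elasticity of sales w.r.t. advertising: beta_adspend * adspend / predicted sales
elasticity_adspend <- function(adspend_level) {
  predicted_sales <- predict(multiple_regression,
                             newdata = data.frame(adspend = adspend_level,
                                                  airplay = 30, starpower = 5))
  unname(coef(multiple_regression)["adspend"] * adspend_level / predicted_sales)
}
elasticity_adspend(100)   # elasticity at a low spending level
elasticity_adspend(1000)  # elasticity at a higher spending level
```
As you can see, the implied elasticity is larger at higher spending levels, which reflects the constant marginal returns of the linear specification.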
In marketing applications, it is often more realistic to assume **decreasing marginal returns**, meaning that the return from an increase in advertising decreases with increasing levels of advertising (e.g., an increase in advertising expenditures from 10€ to 11€ will lead to a larger change in sales than a change from, say, 100,000€ to 100,001€). We will see how to implement such a model further below in the section on extensions of the linear model.