---
output:
html_document:
df_print: paged
html_notebook: default
---
```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(digits = 7)
```
## Regression
### Linear regression
This chapter contains the Appendix for the chapter on regression analysis.
In order to understand the coefficients in the multiple regression we can derive them separately. We have already seen that in the univariate case:
$$
\hat{\beta_1}=\frac{COV_{XY}}{s_x^2}
$$
We have [also seen that we can isolate the partial effect](#partial-plots) of a single variable in a multiple regression. Specifically, we can calculate the coefficient of the i-th variable as
$$
\hat{\beta_{i}} = {COV(\tilde{Y_{i}}, \tilde{X_{i}}) \over V(\tilde{X_i})}
$$
where $\tilde{Y_{i}}$ is the residual from the regression of $Y$ on all variables except for the i-th and $\tilde{X_i}$ is the residual from the regression of $X_i$ on all other variables. Let's illustrate this with an example.
```{r, echo=TRUE}
x1 <- rnorm(100, 5, 15)
x2 <- rnorm(100, 20, 10) + 3*x1
y <- 50 + 2*x1 + 7*x2 + rnorm(100, 0, 5)
```
We have added some randomness through the ```rnorm()``` command, so the coefficients will not be exactly as we set them, but close. Clearly, both x1 and x2 have a partial influence on y.
```{r }
mod <- lm(y~x1+x2)
summary(mod)
```
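Notice that x1 and x2 are highly correlated by construction, which we can confirm directly:
```{r}
cor(x1, x2)
```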
To see the partial effect of x1 we run regressions of x1 on x2 as well as of y on x2 and obtain the residuals $\tilde{x}_1$ and $\tilde{y}_1$.
```{r }
regX1 <- lm(x1~x2)
summary(regX1)
tildeX1 <- residuals(regX1)
regY1 <- lm(y~x2)
tildeY1 <- residuals(regY1)
```
We can use the residuals to create the partial plots as seen above.
```{r }
library(ggplot2) # for ggplot()
library(car)     # for avPlots()
# This is the same as avPlots()
ggplot(data.frame(y = tildeY1, x = tildeX1), aes(x = x, y = y)) +
geom_point(shape = 1) +
geom_smooth(method = 'lm', se = FALSE, color = "red") +
labs(y = "y | others", x = "x1 | others") +
theme_bw()
avPlots(mod)
```
And the regression of the residuals of y on the residuals of x1 yields the same coefficient as in the original regression (up to rounding error).
```{r}
# This is the same coefficient as in the original regression
summary(lm(tildeY1~tildeX1 - 1)) # -1 since we want no intercept
print("Coefficient using the partial covariance:")
cov(tildeY1, tildeX1)/var(tildeX1)
```
### Logistic regression
#### Maximum likelihood estimation
For non-linear models we cannot turn to linear algebra to solve for the unknown parameters. However, we can use a statistical method called maximum likelihood estimation (MLE). Given observed data and an assumption about their distribution, we can estimate for which parameter values the observed data are most likely to occur. We do this by calculating the joint log likelihood of the observations given _a_ choice of parameters (more on that below). Next, we change the parameter values a little bit and check whether the joint log likelihood improves. We keep doing that until we cannot find better parameter values anymore, that is, until we have reached the maximum of the likelihood.

In the following example we have a histogram of a data set which we assume to be normally distributed. You can try to find the best parameter values by changing the mean and the variance. Recall that the normal distribution is fully described by just those two parameters. In each step the log likelihood is calculated automatically. The maximum likelihood found so far is shown as the high score, and the line in the lower graph shows the development of the log likelihood over the past couple of tries.
<iframe src="https://learn.wu.ac.at/shiny/imsm/maxLikelihood/" style="border: none; width: 800px; height: 900px"></iframe>
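The same idea can be sketched in a few lines of R. The data and the grid below are made up purely for illustration: we draw a sample, pretend we only know that it is normally distributed, and search a grid of candidate means and standard deviations for the pair with the highest joint log likelihood.
```{r, message = FALSE, warning = FALSE}
set.seed(1) # only for reproducibility of this illustration
obs <- rnorm(500, mean = 4, sd = 2) # "observed" data whose parameters we pretend not to know

# Grid of candidate parameter values
candidates <- expand.grid(mu = seq(2, 6, by = 0.1), sigma = seq(0.5, 4, by = 0.1))

# Joint log likelihood of the sample for each candidate pair
candidates$loglik <- apply(candidates, 1, function(par) {
  sum(dnorm(obs, mean = par[1], sd = par[2], log = TRUE))
})

# The pair with the highest log likelihood should be close to the true values (4 and 2)
candidates[which.max(candidates$loglik), ]
```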
#### Estimation of the parameters $\beta_i$
Let's return to the model to figure out how the parameters $\beta_i$ are estimated. In the data we observe the $y_i$, the outcome variable that is either $0$ or $1$, and the $x_{i,j}$, the predictors for each observation. In addition, by using the logit model we have assumed the following functional form.
$$
P(y_i = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 * x_{1,i} + \beta_2 * x_{2,i} + ... +\beta_m * x_{m,i})}}
$$
So far we have estimated the $\beta_i$ with R but have not discussed how that estimation actually works. In contrast to linear models (e.g., OLS) we cannot rely on linear algebra to solve for the $\beta_i$, since the function is now non-linear. Therefore, we turn to a methodology called _maximum likelihood estimation_. Basically, we try out different values for the $\beta_i$ and choose the combination that best fits the observed data. In other words: we choose the combination of $\beta_i$ that maximizes the likelihood of observing the given data set.

For each individual we have information about the binary outcome and the predictors. For those whose outcome is $1$ we want to choose the $\beta_i$ such that the predicted probability of the outcome ($P(y_i = 1)$) is as close to $1$ as possible. At the same time, for those whose outcome is $0$ we would like the prediction ($P(y_i = 1)$) to be as close to $0$ as possible. Therefore, we "split" the sample into two groups (outcome $1$ and outcome $0$). For the first group each individual has the probability function shown above, which we want to maximize for this group. For the other group we would need to minimize the joint probability, since we want our prediction to be as close to $0$ as possible. Alternatively, we can easily transform the probability to $P(y_i = 0)$, which we can then maximize. Since probabilities add up to $1$ and we only have two possible outcomes, $P(y_i = 0) = 1 - P(y_i = 1)$. This is convenient because now we can calculate the joint likelihood of all observations and maximize a single function. We assume that the observations are independent of each other, so the joint likelihood is just the product of the individual probabilities. Thus, for the group of $n_1$ individuals with outcome $1$ we have the joint likelihood
$$
\prod_{i=1}^{n_1} {1 \over 1 + e^{-(\beta_0 + \beta_1 * x_{1,i} + \beta_2 * x_{2,i} + ... +\beta_m * x_{m,i})}}
$$
For the second group of $n_2$ individuals with outcome $0$ we get the joint likelihood
$$
\prod_{j=1}^{n_{2}} \left(1 - {1 \over 1 + e^{-(\beta_0 + \beta_1 * x_{1,j} + \beta_2 * x_{2,j} + ... +\beta_m * x_{m,j})}}\right)
$$
In order to find the optimal $\beta_i$ we need to combine the two groups into one joint likelihood function for all observations. We can once again do that by multiplying the two likelihoods with each other. However, we now need an indicator that determines which group each observation belongs to.
$$
\prod_{i = 1}^{n_1 + n_2} \left({1 \over 1 + e^{-(\beta_0 + \beta_1 * x_{1,i} + \beta_2 * x_{2,i} + ... +\beta_m * x_{m,i})}}\right)^{y_i} \times \left( 1 - {1 \over 1 + e^{-(\beta_0 + \beta_1 * x_{1,i} + \beta_2 * x_{2,i} + ... +\beta_m * x_{m,i})}}\right)^{1-y_i}
$$
The indicators $(\cdot)^{y_i}$ and $(\cdot)^{1-y_i}$ select the appropriate likelihood. To illustrate this consider an individual with outcome $y_j = 1$
$$
\begin{align*}
&\left({1 \over 1 + e^{-(\beta_0 + \beta_1 * x_{1,j} + \beta_2 * x_{2,j} + ... +\beta_m * x_{m,j})}}\right)^{y_j} \times \left( 1 - {1 \over 1 + e^{-(\beta_0 + \beta_1 * x_{1,j} + \beta_2 * x_{2,j} + ... +\beta_m * x_{m,j})}}\right)^{1-y_j}\\
=&\left({1 \over 1 + e^{-(\beta_0 + \beta_1 * x_{1,j} + \beta_2 * x_{2,j} + ... +\beta_m * x_{m,j})}}\right)^{1} \times \left( 1 - {1 \over 1 + e^{-(\beta_0 + \beta_1 * x_{1,j} + \beta_2 * x_{2,j} + ... +\beta_m * x_{m,j})}}\right)^{0}\\
=&{1 \over 1 + e^{-(\beta_0 + \beta_1 * x_{1,j} + \beta_2 * x_{2,j} + ... +\beta_m * x_{m,j})}} \times 1\\
=&{1 \over 1 + e^{-(\beta_0 + \beta_1 * x_{1,j} + \beta_2 * x_{2,j} + ... +\beta_m * x_{m,j})}}
\end{align*}
$$
Equivalently for an individual with outcome $y_s = 0$:
$$
\begin{align*}
&\left({1 \over 1 + e^{-(\beta_0 + \beta_1 * x_{1,s} + \beta_2 * x_{2,s} + ... +\beta_m * x_{m,s})}}\right)^{y_s} \times \left( 1 - {1 \over 1 + e^{-(\beta_0 + \beta_1 * x_{1,s} + \beta_2 * x_{2,s} + ... +\beta_m * x_{m,s})}}\right)^{1-y_s}\\
=&\left({1 \over 1 + e^{-(\beta_0 + \beta_1 * x_{1,s} + \beta_2 * x_{2,s} + ... +\beta_m * x_{m,s})}}\right)^{0} \times \left( 1 - {1 \over 1 + e^{-(\beta_0 + \beta_1 * x_{1,s} + \beta_2 * x_{2,s} + ... +\beta_m * x_{m,s})}}\right)^{1}\\
=&1\times \left( 1 - {1 \over 1 + e^{-(\beta_0 + \beta_1 * x_{1,s} + \beta_2 * x_{2,s} + ... +\beta_m * x_{m,s})}}\right)^{1}\\
=& 1 - {1 \over 1 + e^{-(\beta_0 + \beta_1 * x_{1,s} + \beta_2 * x_{2,s} + ... +\beta_m * x_{m,s})}}
\end{align*}
$$
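We can convince ourselves of this selection mechanism with a quick numerical check (the probability value and the variable names below are arbitrary and only used for illustration):
```{r, message = FALSE, warning = FALSE}
p_pred <- 0.7 # some predicted probability P(y = 1)
y_obs <- 1
p_pred^y_obs * (1 - p_pred)^(1 - y_obs) # equals p_pred
y_obs <- 0
p_pred^y_obs * (1 - p_pred)^(1 - y_obs) # equals 1 - p_pred
```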
#### Example
Consider an experiment in which we want to find out whether a person will listen to the full song or skip it. Further, assume that the genre of the song is the only determining factor of whether somebody is likely to skip it or not. Each individual has rated the genre on a scale from 1 (worst) to 10 (best). Our model looks as follows:
$$
P(\text{skipped}_i) = {1 \over 1 + e^{-(\beta_0 + \beta_1 \text{rating}_i)}}
$$
The probability that individual $i$ will skip a song is the logistic function with $X = \beta_0 + \beta_1 * \text{rating}_i$, where $\text{rating}_i$ is individual $i$'s rating of the genre of the song. Since we assume independence of individuals we can write the joint likelihood as
$$
\prod_{i=1}^{N} \left({1 \over 1 + e^{-(\beta_0 + \beta_1 \text{rating}_i)}}\right)^{y_i} \times \left(1- {1 \over 1 + e^{-(\beta_0 + \beta_1 \text{rating}_i)}}\right)^{(1-y_i)}
$$
Notice that $y_i$ is either equal to one or to zero, so for each individual only one of the two factors is not equal to one (recall that any number raised to the power of $0$ equals $1$). The factor on the left of the multiplication sign "looks at" individuals who skipped the song given their rating, and the factor on the right looks at individuals who did not skip the song given their rating. For the former we want the predicted probability of skipping to be as high as possible. For the latter we want the predicted probability of *not* skipping to be as high as possible and thus write $1 - {1 \over 1+ e^{-(\beta_0 + \beta_1 \text{rating}_i)}}$. This is convenient since we can now maximize a single function. Another way to simplify the calculation and make it computationally feasible is taking the logarithm. This ensures that extremely small probabilities can still be processed by the computer ([see this illustration](#sum-of-ln-vs-product)).
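Taking the logarithm turns the product into a sum of individual contributions, which is exactly the quantity we will compute in R below:
$$
\ln L(\beta_0, \beta_1) = \sum_{i=1}^{N} \left[ y_i \ln\left({1 \over 1 + e^{-(\beta_0 + \beta_1 \text{rating}_i)}}\right) + (1-y_i) \ln\left(1 - {1 \over 1 + e^{-(\beta_0 + \beta_1 \text{rating}_i)}}\right) \right]
$$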
```{r echo = FALSE}
set.seed(123)
```
Manually applying this in R would look as follows. We begin by simulating some data. ```ratingGenre``` is a vector of 10000 randomly generated integers between 1 and 10. ```pSkip``` is a vector of probabilities generated by applying the logistic function to our linear model, with the parameters set to $\beta_0 = 1$ and $\beta_1 = -0.3$.
```{r, message = FALSE, warning = FALSE}
ratingGenre <- sample(1:10, size = 10000, replace = TRUE)
linModel <- 1 - 0.3 * ratingGenre
pSkip <- 1/(1+exp(-linModel))
```
Now we have to sample whether a user skipped a song or not, based on their probability of skipping. The resulting vector ```skipped``` is composed of 0s and 1s and indicates whether or not a person skipped a song.
```{r, message = FALSE, warning = FALSE}
skipped <- 1:length(pSkip)
for(i in 1:length(pSkip)){
skipped[i] <- sample(c(1, 0), size = 1, prob = c(pSkip[i], 1-pSkip[i]))
}
```
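As an aside, the same sampling step could be written more compactly with ```rbinom()```, which draws Bernoulli outcomes directly from a vector of probabilities. This is shown only as a sketch (not run), since it would draw different random numbers than the loop above:
```{r, eval = FALSE}
# Equivalent, more concise alternative to the loop above (not run)
skipped <- rbinom(length(pSkip), size = 1, prob = pSkip)
```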
Our simulated data now looks as follows:
```{r, echo = FALSE, fig.align="center", fig.cap="Percent of songs skipped as a function of the genre rating"}
freq <- prop.table(table(data.frame(skipped = skipped, rating = ratingGenre)), margin = 2) * 100
freq <- as.data.frame(freq)
colnames(freq) <- c("Skipped", "rating", "Freq")
freq$Skipped <- factor(freq$Skipped, levels = c(0,1), labels = c("No", "Yes"))
ggplot(data = freq, aes(x = rating, y = Freq, fill = Skipped)) +
geom_col() +
ylab("Percent") +
xlab("Genre rating") +
theme_bw()
```
The visualization shows that an increase in ```ratingGenre``` leads to a decrease in the probability of a song being skipped. Now we want to perform maximum likelihood estimation and see how close we get to the true parameter values. To achieve this, we need a function that, given values for $\beta_0$ and $\beta_1$, returns the value of the log likelihood. The following code defines such a function.
```{r, message = FALSE, warning = FALSE}
mlEstimate <- function(beta_0, beta_1){
pred <- 1/(1+exp(-(beta_0 + beta_1 * ratingGenre)))
loglik <- skipped * log(pred) + (1-skipped) * log(1 - pred)
sum(loglik)
}
```
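As a quick sanity check, the log likelihood evaluated at the true parameter values should be higher (i.e., less negative) than at some arbitrary guess:
```{r, message = FALSE, warning = FALSE}
mlEstimate(1, -0.3) # log likelihood at the true parameter values
mlEstimate(0, 0)    # log likelihood at an arbitrary guess
```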
The log likelihood function has the following form.
```{r, echo = FALSE, fig.cap="Plot of the log likelihood function", fig.align="center"}
library(plotly)
beta_0 <- seq(0, 2, length.out = 100)
beta_1 <- seq(-0.4, -0.2, length.out = 100)
opts <- expand.grid(beta_0, beta_1)
opts$loglik <- apply(opts, 1, function(x) mlEstimate(x[1], x[2]))
z <- matrix(opts$loglik, ncol = length(beta_1)) # rows correspond to beta_0 (y axis), columns to beta_1 (x axis)
plot_ly(x =beta_1, y = beta_0, z = z) %>%
add_surface() %>%
layout(scene = list(
xaxis = list(title = "beta_1"),
yaxis = list(title = "beta_0"),
zaxis = list(title = "Log likelihood")
))
```
As you can see, the maximum of the log likelihood function lies around $\beta_1 = -0.3$ and $\beta_0 = 1$, the true parameter values. Now we need an algorithm that finds the combination of $\beta_0$ and $\beta_1$ that maximizes this function. There are multiple ways to go about this. The ```glm()``` function uses an iterative algorithm (iteratively re-weighted least squares); alternatively, a general-purpose optimizer such as the built-in ```optim()``` function could be used. Here we will use a slightly different approach to make the process more intuitive, something called _grid maximization_. Basically, we make a list of all plausible combinations of parameter values and calculate the log likelihood for each combination. Then we simply select the parameter combination with the highest log likelihood.
```prob_beta_0``` and ```prob_beta_1``` are now vectors that contain 100 plausible values for each parameter.
```{r, message = FALSE, warning = FALSE}
prob_beta_0 <- seq(0.5, 1.5, length.out = 100)
prob_beta_1 <- seq(-1, 0, length.out = 100)
# Print one vector as an example
prob_beta_0
```
Next we create a data frame that contains all possible combinations of ```prob_beta_0``` and ```prob_beta_1```. The ```expand.grid()``` function does exactly that.
```{r, message = FALSE, warning = FALSE}
params <- expand.grid(prob_beta_0, prob_beta_1)
# Print df
params
```
With the ```apply()``` function we can calculate the log likelihood for each parameter combination in the data frame. Note that the ```params``` data frame now has a new column containing the log likelihood.
```{r, message = FALSE, warning = FALSE}
params$loglik <- apply(params, 1, function(x) mlEstimate(x[1], x[2]))
# print df
params
```
Next we simply find the highest log likelihood value and the associated parameter values.
```{r, message = FALSE, warning = FALSE}
maxLik <- which.max(params$loglik)
params[maxLik, ]
```
As you can see, our method comes pretty close to the true parameter values. For comparison, here is the same model calculated with the ```glm()``` function.
```{r, message = FALSE, warning = FALSE}
summary(glm(skipped ~ ratingGenre, family = binomial(link = 'logit')))
```
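For completeness, the same maximum can also be found with R's general-purpose optimizer ```optim()```. The sketch below simply wraps our ```mlEstimate()``` function (```negLogLik``` is just a helper name introduced here); since ```optim()``` minimizes by default, we pass the negative log likelihood.
```{r, message = FALSE, warning = FALSE}
negLogLik <- function(par) -mlEstimate(par[1], par[2])
optim(par = c(0, 0), fn = negLogLik)$par # should be close to beta_0 = 1, beta_1 = -0.3
```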
##### Sum of LN vs Product
To illustrate why the natural logarithm (ln) is important for computations with numbers close to zero (such as joint probabilities), we create a vector of fictitious small probabilities and multiply them with each other. As we can see, already a relatively small number of values close to $0$ leads to their product underflowing to $0$ on a computer. If we are only interested in comparing magnitudes (as in the case of likelihoods), we can safely apply the ln and sum the results instead, since the logarithm is a monotonically increasing function and will thus not change the ranking.
```{r, message = FALSE, warning = FALSE}
probs <- rep(0.000001, 56) # 56 probabilities, each equal to 0.000001
prod(probs)
lProbs <- log(probs) # this is the ln
sum(lProbs)
probs2 <- rep(probs, 10) # now 10 times as many values
lProbs2 <- log(probs2)
sum(lProbs2)
```