---
title: "DSC 531 - Statistical Simulations and Data Analysis: Second Assignment"
author:
- Nikolas Petrou^[petrou.p.nikolas@ucy.ac.cy]
- Michail Athanasiades^[mathan10@ucy.ac.cy]
- Fotis Kyriakou^[fkyria01@ucy.ac.cy]
output: html_document
---
## Introduction
This R Markdown document contains the programming part of the second assignment of the course DSC 531 - Statistical Simulations and Data Analysis of the University of Cyprus MSc in Data Science programme.
The essential mathematical calculations and most of the accompanying comments and observations are included in the LaTeX report document.
## Declaration of general functions and constants
Declaring a constant variable that will be used as the system tolerance for the various checks throughout the work.
```{r}
# Declaring a Constant Variable which will be used as
# system tolerance for the different testing checks throughout the work
# Setting tolerance to take a very small value
TOL <- .Machine$double.eps^0.9
lockBinding("TOL", globalenv()) # Variable is now locked, and cannot change value
```
Defining functions which will be used for different checks throughout the work.
```{r}
# Custom function that returns TRUE if the given x is a natural number
# considering that 0 is not a natural number
is.naturalnumber <- function(x){
x > TOL & abs(x - round(x)) < TOL
}
# Custom function that returns TRUE if the given x is an integer number
is.integernumber <- function(x){
abs(x - round(x)) < TOL
}
```
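A few quick sanity checks (illustrative only, not part of the assignment) confirm the intended behaviour of these helpers:
```{r}
# Sanity checks for the helper functions (illustrative only)
is.naturalnumber(3)    # TRUE
is.naturalnumber(0)    # FALSE: 0 is not treated as a natural number here
is.naturalnumber(2.5)  # FALSE
is.integernumber(-4)   # TRUE
```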
## Exercise 1: Pseudorandom number generator
### Exercise 1.A
Below is a function that, for a given natural number $n$ and an initial value $X_0$, computes the next $n$ terms $(X_1, X_2, ..., X_n)$ using the specified pseudorandom number generator (a middle-square scheme on two-digit numbers).
```{r}
# Function that for a given natural number n and an initial value X0
# computes the n next terms (X1, X2, ..., Xn) by using a specified pseudorandom
# number generator
generate_terms <- function(n, X0){
# Necessary checks for the input arguments
if (! is.naturalnumber(n))
stop("Input of given numbers of terms 'n' is not a Natural number")
if (! (is.integernumber(X0) && (X0 >= 0) && (X0 <=99)))
stop("Input 'X0' must be an integer number in [0, 99]")
# Initialize terms vector
X <- c(X0)
for (i in 1:n){
X_i_sqr <- X[i]^2
# Calculate how many zeros are required for padding
padding_digits_required <- 4 - nchar(as.character(X_i_sqr))
# Initializing the padding zero characters
padding <- strrep('0', padding_digits_required)
# Initializing character string by pasting the padding and X_i_sqr
X_i_char <- paste(c(padding, X_i_sqr), collapse = "")
# Extracting the two middle digits of the four digit
# number as an integer
# Function as.integer() also removes the zeros from head
X[i+1] <- as.integer(substr(X_i_char, 2, 3))
}
# Return vector with the n generated terms
# Note that the initial value X0 is not returned
return(X[-1])
}
```
### Exercise 1.B
Below, the method of Exercise 1.A is applied for various values of $n$ and $X_0$.
In the first case, the method did not seem to cause any problem, but only because the value of $n$ was very small, so the sequence was too short to reveal the issues.
```{r}
# Case which does not seem to cause any problem (low value of n)
generate_terms(n=3, X0=8)
```
First problematic case: during the generation process, the resulting four-digit number had zeros as its two middle digits, so from that point on the generator produced only zeros.
```{r}
# First problematic case: the four-digit square has zeros as its two
# middle digits, so the generator keeps producing only zeros
generate_terms(n=10, X0=8)
# Again the same problematic case
generate_terms(n=50, X0=33)
```
Second problematic case: the generator reaches a value $x$ whose square has $x$ itself as its two middle digits (a fixed point). The generator therefore converges and keeps producing $x$ as its only outcome.
```{r}
# Second problematic case: the chain reaches a fixed point x whose square
# has x itself as its two middle digits, so the generator keeps producing x
generate_terms(n=10, X0=22)
# Again the same problematic case
generate_terms(n=50, X0=88)
```
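To see how widespread such failure modes are, here is a minimal sketch (our own addition, not part of the assignment) that scans all valid seeds and flags those whose sequence eventually collapses to a single repeated value:
```{r}
# Scan all valid seeds and keep those whose last 10 of 50 generated terms
# are identical, i.e. the chain has been absorbed into a fixed point such as 0
degenerate <- Filter(function(x0) {
  terms <- generate_terms(n = 50, X0 = x0)
  length(unique(tail(terms, 10))) == 1
}, 0:99)
degenerate
```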
## Exercise 2: Inversion
Defining a function that returns the value of the probability density function $f(x)$ for a given $x$.
```{r}
# Function that returns the probability density function f
# value for a given x
f <- function(x){
# Necessary checks for the input arguments
if (! is.numeric(x))
stop("Input 'x' must be a real number")
# Returns the function value for given x
ifelse(x>=1 , 1/(x^2) , 0)
}
```
Defining a function that generates random numbers with probability density function $f$, using the inversion method.
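For reference, the inverse CDF used in the code follows directly from $f$: for $x \ge 1$,
$$F(x) = \int_1^x \frac{1}{t^2} \, dt = 1 - \frac{1}{x}, \qquad F^{-1}(u) = \frac{1}{1-u},$$
and since $1-U \sim U(0,1)$ whenever $U \sim U(0,1)$, setting $X = 1/U$ is equivalent.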
```{r}
# Function which generates random numbers with probability density
# function f, by using the inversion method
generate.inversion.f <- function(n=1){
# Necessary checks for the input arguments
if (! is.naturalnumber(n))
stop("Input of given numbers of terms 'n' should be a Natural number")
# Seed for reproducibility
set.seed(420)
# To simulate from the pdf f(x), generate u from U(0, 1) and set
# x = F^(-1)(u) = 1/(1-u); since 1-U ~ U(0, 1) as well, x = 1/u is equivalent
U <- runif(n, min = 0, max = 1)
X <- 1/U
return(X)
}
```
Plotting a histogram and a density plot of 10000 random numbers from the implemented function, in order to compare them with the actual density $f$.
```{r}
# Simulate n=10000 samples
sim.sample <- generate.inversion.f(10000)
# Plot histogram
# By default, the histogram plot will be extended to a range so that the plot approaches 0 at the extreme.
# In order to restrict the plot to a range that will be more easily interpretable,
# the x axis was limited to 25.
hist(sim.sample, xlim = c(0, 25), ylim = c(0, 1.1), freq = FALSE,
breaks = 'freedman-diaconis', main = "Histogram of simulated sample")
# Plot the theoretical density curve
curve(f(x), col = "blue", lwd = 3, add = TRUE)
# By default, density will extend the range so that the density curve approaches 0 at the extreme.
# In order to restrict the curve to a range that will be more easily interpretable, the x axis was limited to 25.
dens.our <- density(sim.sample, from = 0, to = 25)
plot(dens.our, main="Density estimation", ylim=c(0, 1.0), lwd = 2)
lines(dens.our$x, f(dens.our$x), col="blue", lwd = 2)
legend("topright",legend=c("Sample", "f(x)"),
col=c("black", "blue"), lty=1, cex=1.5)
```
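As an additional, optional check (our own addition, not required by the exercise), a Kolmogorov-Smirnov test can compare the empirical distribution of the sample with the theoretical CDF $F(x) = 1 - 1/x$:
```{r}
# Compare the simulated sample with the theoretical CDF F(x) = 1 - 1/x
F.theoretical <- function(q) ifelse(q >= 1, 1 - 1/q, 0)
ks.test(sim.sample, F.theoretical)
```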
## Exercise 3: Rejection sampling
### Exercise 3.A
Defining a function that employs the rejection algorithm to simulate $n$ samples from the target distribution $\mathcal{N}(0,1)$, with the standard Cauchy distribution as the proposal.
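In brief (the full derivation is in the LaTeX report), with $g$ the standard Cauchy density $g(x) = \frac{1}{\pi(1+x^2)}$, the ratio to bound is
$$\frac{f(x)}{g(x)} = \pi (1+x^2)\, \phi(x) = \sqrt{\frac{\pi}{2}}\,(1+x^2)\, e^{-x^2/2},$$
whose derivative vanishes at $x = 0$ and $x = \pm 1$; the maximum is attained at $x = \pm 1$, giving
$$C = \sqrt{2\pi}\, e^{-1/2} \approx 1.52.$$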
```{r}
# Function that employs the rejection algorithm in order to simulate n samples
# from the standard normal distribution, with the proposal being the standard
# Cauchy distribution. The function also returns the total number of draws.
generate.rejection.f <- function(n=1){
# Necessary checks for the input arguments
if (! is.naturalnumber(n))
stop("Input of given numbers of samples 'n' should be a Natural number")
# Seed for reproducibility
set.seed(420)
# From theory, the expected number of draws per acceptance is
# C = sup f(x)/g(x), where f(x) and g(x) are the densities of the target
# and candidate distributions respectively.
# The supremum is attained at x = 1 or x = -1, and C is approximately 1.52
C <- dnorm(x=1)/dcauchy(x=1)
# Initialize empty numeric vector
sampled_data <- numeric()
# Initialize variable for draws
total_draws <- 0
# While the accepted simulated values are less than the
# desired simulations
while(length(sampled_data) < n){
# Simulate u from uniform(0, 1)
u <- runif(n=1, min = 0, max = 1)
# Simulate x from the candidate g (standard Cauchy)
x <- rcauchy(n=1, location = 0, scale = 1)
# Every proposal counts as a draw, whether accepted or not
total_draws <- total_draws + 1
# Accept x if u <= f(x)/(C*g(x)), otherwise reject it
if (u <= dnorm(x)/(C*dcauchy(x)))
sampled_data <- append(sampled_data, x)
}
return(list(data = sampled_data, total_draws = total_draws))
}
```
### Exercise 3.B
Running the simulation for $n=10000$ samples and comparing the resulting histogram with the theoretical density of the target $\mathcal{N}(0,1)$.
```{r}
n <- 10000
simulation <- generate.rejection.f(n=n)
hist(simulation$data, freq=FALSE, main = "Histogram of the simulated data")
curve(dnorm(x), col="blue", add = TRUE)
legend("topright",legend=c("Density of N(0,1)"),col=c("blue"), lty=1, cex=0.8)
```
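As a further optional check (our own addition, not part of the exercise), the sample moments can be compared with the target's mean $0$ and standard deviation $1$:
```{r}
# The sample mean and standard deviation should be close to 0 and 1 respectively
mean(simulation$data)
sd(simulation$data)
```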
### Exercise 3.C
The total number of draws that occurred is to be compared with the expected number of draws.
The simulation function returns the total number of draws as part of its output. Regarding the theoretical value, it was shown that the expected number of draws until an acceptance occurs is $C = \sup_{x \in \mathbb{R}}\frac{f(x)}{g(x)}$; after calculations, the supremum is attained at $x=\pm1$.
The two values are compared below:
```{r}
# From theory, the expected number of draws per acceptance is
# C = sup f(x)/g(x), where f(x) and g(x) are the densities of the target
# and candidate distributions respectively; the supremum is attained
# at x = 1 or x = -1, and C is approximately 1.52
C <- dnorm(x=1)/dcauchy(x=1)
# Calculate expected number of draws
expected_draws <- C * n
paste("Total draws occurred: ", simulation$total_draws)
paste("Expected draws: ", expected_draws)
paste("Total draws occurred - Expected draws ratio: ",
simulation$total_draws / expected_draws)
```
In conclusion, the two values are approximately equal.
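As an optional numerical cross-check (our own addition, not required by the exercise), maximizing $f(x)/g(x)$ directly should recover the constant $C$ used above:
```{r}
# Maximize the density ratio numerically; the maximum should be near x = 1
# with objective value close to C ~ 1.52
optimize(function(x) dnorm(x) / dcauchy(x), interval = c(0, 5), maximum = TRUE)
```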
## Exercise 4
### Exercise 4.A
Regarding the first part of the exercise, the integral $J$ was evaluated through calculations shown in the LaTeX report document. The evaluation yields:
$$J = \int_{-\infty}^{\infty} (x+a)^{2}\, \phi(x)\, dx = 1 + a^2$$
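In brief, expanding the square and using the standard normal moments $\mathbb{E}(X) = 0$ and $\mathbb{E}(X^2) = 1$:
$$J = \mathbb{E}\big[(X+a)^2\big] = \mathbb{E}(X^2) + 2a\,\mathbb{E}(X) + a^2 = 1 + a^2.$$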
For the more detailed mathematical calculations, refer to the LaTeX report document.
### Exercise 4.B
Defining a function that returns the Monte Carlo estimate of $\mathbb{E}(h(X))$, for $X \sim \mathcal{N}(0, 1)$ and a given function $h$.
```{r}
# Function which returns the Monte Carlo estimate for a given function func()
# by generating simulations X from the standard normal N(0, 1) and returning
# the estimated expected value E(func(X))
MC_integrate_sd_norm <- function(n, func, a){
# Necessary checks for the input arguments
if (! is.naturalnumber(n))
stop("Input of given numbers of samples 'n' should be a Natural number")
if (! is.numeric(a))
stop("Input 'a' must be a real number")
if (! is.function(func))
stop("Input 'func' should be a defined function")
if (! length(formals(func)) == 2)
stop("Input function 'func' should take exactly two parameters (x and a)")
# Seed for reproducibility
set.seed(420)
# Simulate n times from normal
x.sam <- rnorm(n, mean=0, sd=1)
# Calculate and return the Monte Carlo estimate
theta.mc <- mean(sapply(x.sam, func, a=a))
return(theta.mc)
}
```
Defining the function $h$, which in this case is $h(x) = (x+a)^2$.
```{r}
# Initialize h function
h <- function(x, a){
# Necessary checks for the input arguments
if (! is.numeric(x))
stop("Input 'x' must be a real number")
if (! is.numeric(a))
stop("Input 'a' must be a real number")
# Return the function value
(x+a)^2
}
```
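As a quick numerical cross-check (our own addition, not part of the exercise), R's built-in integrator can evaluate $J$ directly; for the arbitrary illustrative value $a = 2$ it should return $1 + 2^2 = 5$:
```{r}
# Numerically integrate (x + a)^2 * phi(x) over the real line for a = 2
integrate(function(x) h(x, a = 2) * dnorm(x), lower = -Inf, upper = Inf)
```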
Estimating $J$ by Monte Carlo integration using different numbers of simulations $n$ and different values of $a$.
```{r}
# Estimating J by Monte-Carlo integration using different number of simulations n and a values
for(a in 0:4){
for(n in c(100, 1000)){
print(paste("n=", n," a=", a))
print(paste("Estimated value for J ", MC_integrate_sd_norm(n=n, func=h, a=a)))
print(paste("Actual value for J ", 1+a**2))
cat("\n")
}
}
```
As expected, as the number of samples $n$ increases, the estimate becomes more accurate, i.e. closer to the actual value.
### Exercise 4.C
Next, $J$ is computed through importance sampling based on the candidate $g(x) = \phi(x-a)$. It is important to mention that $\phi(x-a)$ is exactly the probability density function of the $\mathcal{N}(a, 1)$ distribution; therefore, the candidate $g$ was set to be that density.
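The estimator rests on the importance sampling identity
$$\mathbb{E}_f[h(X)] = \int h(x)\,\frac{f(x)}{g(x)}\,g(x)\,dx = \mathbb{E}_g\!\left[h(X)\,\frac{f(X)}{g(X)}\right],$$
so samples drawn from $g$ are reweighted by $w(x) = f(x)/g(x)$.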
```{r}
# Function which uses importance sampling with the density of N(a, 1) as the
# candidate g(x) to estimate E(func(X)) under the target distribution N(0, 1)
IS_generate<- function(n, func, a){
# Necessary checks for the input arguments
if (! is.naturalnumber(n))
stop("Input of given numbers of samples 'n' should be a Natural number")
if (! is.numeric(a))
stop("Input 'a' must be a real number")
if (! is.function(func))
stop("Input 'func' should be a defined function")
if (! length(formals(func)) == 2)
stop("Input function 'func' should take exactly two parameters (x and a)")
# Seed for reproducibility
set.seed(420)
# Simulate from g(x)
x.sam = rnorm(n, mean=a, sd=1)
# Weight the simulations: w = f(x)/g(x), computed equivalently as
# w = exp(log(f(x)) - log(g(x))) for numerical stability
w = exp(dnorm(x.sam, mean=0, sd=1, log=TRUE) - dnorm(x.sam, mean=a, sd=1, log=TRUE))
# Returning the sample average
mean(w*sapply(x.sam, func, a=a))
}
```
Estimating $J$ by importance sampling using different numbers of simulations $n$ and different values of $a$.
```{r}
for(a in 0:4){
for(n in c(100, 1000)){
print(paste("n=", n," a=", a))
print(paste("Estimated value for J using IS: ", IS_generate(n=n, func=h, a=a)))
print(paste("Actual value for J: ", 1+a**2))
cat("\n")
}
}
```
## Exercise 5
The following function implements the random walk Metropolis-Hastings algorithm for sampling from the target distribution $\mathcal{N}(100, 1)$, using proposals $Y = X + \epsilon$ where $\epsilon$ follows the standard normal distribution.
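Since the proposal is symmetric, the Hastings ratio reduces to the Metropolis form
$$\alpha(y \mid x) = \min\!\left(1, \frac{\pi(y)}{\pi(x)}\right),$$
where $\pi$ is the target density; the code works with $\log \pi(y) - \log \pi(x)$ to avoid numerical underflow far from $\mu = 100$.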
```{r}
# Implementation of the random walk Metropolis-Hastings algorithm for sampling
# from the target distribution N(100, 1), using proposals Y = X + epsilon
# where epsilon follows the standard normal distribution
random.walk.metropolis.hastings.generate <- function(n=1, x0){
# Necessary checks for the input arguments
if (! is.numeric(x0))
stop("Input 'x0' must be a real number")
if (! is.naturalnumber(n))
stop("Input of given numbers of samples 'n' should be a Natural number")
# Seed for reproducibility
set.seed(420)
# Initialize empty numeric vector
X <- numeric()
X[1] <- x0
# Iterate to build the path X1, ..., Xn
for(i in 1:n){
# Simulate epsilons which follow the symmetric proposal N(0,1)
# and let y = x + epsilon
epsilon <- rnorm(n=1, mean=0, sd=1)
y <- X[i] + epsilon
# Compute the acceptance ratio a on the log scale; since the proposal
# is symmetric, a = min(1, pi(y)/pi(x)), and working with logs avoids
# numerical issues with the mean equal to 100
a <- min(dnorm(x=y, mean = 100, sd = 1, log = TRUE) - dnorm(x=X[i], mean = 100, sd = 1, log = TRUE),
log(1))
# Simulate u from u(0, 1)
u <- runif(n=1, min=0, max=1)
# If log(u) <= a (i.e. u <= a(y|x)) then accept y and move there,
# otherwise reject the move and stay at x
x_new <- ifelse(log(u) <= a, y, X[i])
X <- append(X, x_new)
}
return(X)
}
```
Experimenting with different starting values for $X_0$ and creating a plot for the path $X_0, X_1, ..., X_n$ which illustrates the burn-in period.
```{r}
library('ggplot2')
library('gridExtra')
par(mfrow=c(2,2))
# Initialize the list that will hold the trace plots
plist <- list()
# Set n for 1000 samples
n <- 1000
# Running the Metropolis-Hastings algorithm for sampling from the target distribution N(100, 1)
# with different starting values
for (x0 in c(-100, 85, 100, 115)){
X <- random.walk.metropolis.hastings.generate(n=n, x0=x0)
# Trace plot of the samples to see how the chain moved around
plist[[length(plist)+1]] <- ggplot(data=data.frame(X), aes(x=1:length(X), y=X)) +
geom_line() +
labs(title = paste("Chain movement for x0=", x0) , x='Iteration') +
theme(text=element_text(size=12.5))
# Histograms of the samples to see if they indeed look like a standard normal distribution
hist(X, breaks='fd', main = paste("Histogram for x0 = ", x0))
}
# Plotting the traces of the samples to see how the chain moved
# around for the different starting values
# Finding an ideal number of columns for plot dynamically
nCol <- floor(sqrt(length(plist)))
# Dynamically call grid.arrange to plot multiple ggplots
do.call("grid.arrange", c(plist, ncol=nCol))
```
As expected, it was noticed that when the algorithm is initialized with an $X_0$ value that is very far (e.g. $X_0 = -100$) from the actual mean $\mu=100$ of the target distribution, the algorithm requires more iterations until it converges. On the other hand, when the algorithm is initialized at exactly $\mu=100$, it converges instantly and samples from the target distribution are generated from the start.
The histograms confirm this: with initial $X_0$ values closer to the actual $\mu=100$, the generated samples were more likely to follow the target distribution $\mathcal{N}(100,1)$. In cases where many samples were generated before convergence occurred, those pre-convergence samples had a visible impact on the corresponding histogram; the more iterations the algorithm needed to reach stationarity, the less normal the histogram of the generated samples looked.
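A common remedy, sketched below as our own addition (not required by the exercise; the cut-off of 200 iterations is an arbitrary illustrative choice), is to discard a burn-in prefix before summarizing the chain:
```{r}
# Drop the first 200 iterations as burn-in before summarizing the chain
X <- random.walk.metropolis.hastings.generate(n = 1000, x0 = 85)
X.post <- X[-(1:200)]
c(mean = mean(X.post), sd = sd(X.post))
```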