---
title: "Students Performance in Exams: Exploratory Data Analysis"
author: "Nikolas Petrou"
date: "13/10/2021"
output:
  html_document:
    toc: true
    toc_depth: 3
  word_document: default
  pdf_document: default
---
# Introduction
This work is a project for the **DSC531: Statistical Simulation course**. The project focuses on the Exploratory Data Analysis (EDA) of the given Performance dataset, which includes marks obtained by students in different subjects.
The aim was to first clean (if necessary) the data and then perform a full Exploratory Data Analysis (with summary statistics, plots and statistical hypothesis testing), which would help to understand the variation within the variables. Additionally, it was requested to find correlations or any patterns between variables if they exist, and more depending on the questions that will be raised.
# Requirements and libraries
```{r include = FALSE}
# Will make the plots fit better in the rmarkdown document
knitr::opts_chunk$set(fig.height = 7, fig.width = 15)
```
The current list of objects in the environment is removed, in order to ensure a clean R environment before any operations.
```{r}
# Remove the list of objects from the environment, just to
# ensure a clean R environment before any operations
rm(list=ls())
```
The libraries which are going to be used are loaded.
```{r}
# Import library for the functionality that will render an R DataFrame as an HTML table
# knitr creates tables in LaTeX, HTML or Markdown
# Specifically, the function which is in interest is the knitr::kable()
library(knitr)
# Library that checks for missing values
# Will be needed for the function miss_var_summary()
library(naniar)
# Libraries for data visualization
library(ggplot2)
# Will be using the ggarrange() function in order to draw multiple ggplots on the same plot
library(ggpubr)
# Library that is used for data manipulation
# Will mainly be used for the pipeline operations %>% and for the filter() and bind() functions
library(dplyr)
# Library which will be used to look at the samples' skewness
library(moments)
# The library of 'fastDummies' will be utilized in order to obtain
# the one-hot representation of the categorical variables
library(fastDummies)
# The corrplot package provides a visual exploratory tool on correlation matrices,
# which helps with the detection of hidden patterns among variables
library('corrplot')
# Companion to Applied Regression package
library('car')
# For the Linear Models which use Lasso, the glmnet library is utilized,
# which provides efficient procedures for fitting Linear Regression models
library(glmnet)
# Using the gbm for the boosting models, specifically for the gradient boosting trees
library(gbm)
# Package that has functions to streamline the model training process for complex regression and classification problems
library(caret)
```
Setting a seed to make the script reproducible
```{r}
# Make the script reproducible
set.seed(420)
```
# The Students Performance dataset
The Performance dataset consists of the marks secured by students in various subjects. It is already known and given that the variables in the data are:
* Gender
* Race/ethnicity as a group variable
* Level of education of parents
* Quality of lunch taken
* Whether the student has completed a test preparation course
* The students’ scores for:
+ mathematics
+ reading
+ writing
## Data Load and Missing Values
Initially, the dataset was loaded directly from the csv file which contained the Performance dataset
```{r}
# Read the Performance dataset from the csv file
data <- read.csv('Performance.csv', header = TRUE)
# Confirming that the resulting data variable is a data.frame
class(data)
```
The first ten rows of the dataframe are presented using the _**kable**_ table generator.
```{r}
# Observation of the dataset
# Printing the first ten rows-data points of the dataset
kable(head(data, 10))
```
It was noticed that indeed the columns-variables are the eight variables which were described earlier.
Next, it was examined whether the data had missing (NA) values. The _**miss_var_summary()**_ function of the _**naniar**_ package was used in order to summarize the missing values in each variable.
```{r}
# Confirming that the dataset has no NA values
miss_var_summary(data) # Get a summary for the missing values of the different variables
cat('Total null values in dataset:', sum(is.na(data)))
# Heatplot of missingness across the entire data frame
vis_miss(data)
```
Fortunately, there were no missing values at all. Therefore no imputation was needed for the values of the different variables.
## Structure of Data and Variables
Following, the structure and types of the variables are going to be studied.
```{r}
# Observe the structure and data type of each column of the DataFrame,
# by utilizing the str() function
str(data)
```
Since, depending on the version of R, categorical columns are sometimes loaded as characters rather than Factors, the character columns were manually cast to Factors.
```{r}
# Loop through the columns of the DataFrame and change
# the type of the character columns to Factor
for (colname in colnames(data)){
# Checks if the column is from the character class
# and changes its type to Factor
if (class(data[[colname]]) == "character")
data[[colname]] <- as.factor(data[[colname]])
}
```
```{r}
# Observe the structure and data type of each column of the DataFrame,
# by utilizing the str() function
str(data)
```
As it is shown, there are 1000 datapoints-observations in the dataset.
Additionally, the three score variables (math.score, reading.score, writing.score) were integers, while the rest of the variables (gender, race.ethnicity, parental.level.of.education, lunch, test.preparation.course) were Factors (variables that are used to categorize the stored data).
The Factor levels of the categorical variables were inspected in order to distinguish their unique values.
```{r}
# Loops through the columns of the DataFrame and for the factor columns prints
# the different factor levels (unique values of the categorical variables)
for (colname in colnames(data)){
# Checks if the column is from the factor class
if (class(data[[colname]]) == "factor")
cat("Unique values of", colname, ": ", levels(data[[colname]]), '\n')
}
```
As it is distinguished, the different categorical variables have the following unique values:
* Gender:
+ female
+ male
* Race/ethnicity:
+ group A
+ group B
+ group C
+ group D
+ group E
* Parental level of education:
+ associate's degree
+ bachelor's degree
+ high school
+ master's degree
+ some college
+ some high school
* Lunch:
+ free/reduced
+ standard
* Test preparation course:
+ completed
+ none
From the levels of parental level of education, it can easily be seen that it would make sense for its different levels to follow a specific order, i.e. to be sorted (since, for example, a master's degree is considered higher education than high school).
In order to accurately sort the levels of education, some things regarding the different levels must be specified. Obviously, some high school and high school are the lowest levels of education in the current dataset.
Regarding associate degrees and college diplomas, both typically last two to three years and are considered a level of qualification above a high school diploma and below a bachelor's degree. Thus, a person who has only completed some college is considered _"less educated"_ than a person who has completed an associate degree. Since a bachelor's degree requires three to four years, it is ranked above the aforementioned levels but below a master's degree, as a bachelor's degree is a prerequisite for most master's programmes.
Therefore, the order of the factors was changed into the following order:
* Parental level of education:
+ some high school
+ high school
+ some college
+ associate's degree
+ bachelor's degree
+ master's degree
```{r}
# Changing the ordering of the levels for the Factor "parental.level.of.education"
data$parental.level.of.education <- factor(data$parental.level.of.education ,
levels = c("some high school", "high school", "some college",
"associate's degree", "bachelor's degree", "master's degree"))
# Confirmation that the order has been changed successfully
cat("New Factor levels of", colname, ": ", levels(data$parental.level.of.education), '\n')
```
## Data summary
Subsequently, some summary statistics were analyzed, in order to get an initial idea of how the values of the variables are distributed.
```{r}
# Summary for the Factor variables
kable(summary(Filter(is.factor, data)))
# Summary for the score integer variables
kable(summary(Filter(is.integer, data)))
```
Even from a brief summary, it can be seen that the minimum math score was zero, which indicates that some extreme cases exist. This is going to be further analyzed later with some visualization.
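As a first peek (an illustrative sketch, not part of the original analysis), the observations with the minimum math score can be listed directly:
```{r}
# Illustrative peek (assumption: the data frame `data` loaded above):
# show the observations whose math score equals the minimum observed score
kable(data[data$math.score == min(data$math.score), ])
```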
Additionally, it is somewhat surprising that most of the individuals did not take the test preparation course. Most of the parents have either completed some college or hold an associate's degree, while only a few have a master's degree.
# Data Statistics
## Analysis of the scoring variables
Next, boxplots for the score variables were plotted in order to visualize the distribution of scores
```{r}
# boxplot for math.score
boxplot.math.score <- ggplot(data, aes(x=math.score)) +
geom_boxplot(fill='#A4A4A4', color="black") + coord_flip() +
labs(title="Box plot for Math scores",x="Math Scores", y = "")
# boxplot for reading.score
boxplot.reading.score<- ggplot(data, aes(x=reading.score)) +
geom_boxplot(fill='#A4A4A4', color="black") + coord_flip() +
labs(title="Box plot for Reading Scores",x="Reading Scores", y = "")
# boxplot for writing.score
boxplot.writing.score <- ggplot(data, aes(x=writing.score)) +
geom_boxplot(fill='#A4A4A4', color="black") + coord_flip() +
labs(title="Box plot for Writing scores",x="Writing Scores", y = "")
# The three defined plots are going to be plotted on this page
ggarrange(boxplot.writing.score, boxplot.math.score, boxplot.reading.score,
ncol = 3, nrow = 1)
```
From the above plots, it was seen that there are values which lie below the lower whisker. Even though those values are more than 1.5 IQR below Q1 (Q1 - 1.5 * IQR), they are either exactly zero or above zero, and zero is in our case the lowest possible score a person can get.
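As a quick sanity check (a minimal sketch, not part of the original analysis), the lower-whisker threshold Q1 - 1.5 * IQR can be computed for the math scores and the points below it counted:
```{r}
# Illustrative check (assumption: the data frame `data` loaded above):
# count how many math scores fall below the lower whisker Q1 - 1.5 * IQR
q1.math <- quantile(data$math.score, 0.25)
lower.whisker.math <- q1.math - 1.5 * IQR(data$math.score)
sum(data$math.score < lower.whisker.math)  # number of low-score outliers
min(data$math.score)                       # the minimum score is zero, not negative
```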
Interpreting the above plots and looking at the values within the Interquartile Range (IQR), for all of the subjects most students' scores were above 50, which is usually the pass/fail boundary in exams. Therefore, only a small portion of the students were below the pass/fail boundary.
Specifically, assuming that the pass/fail boundary is a score of 50, the ratios of passed students for the subjects were calculated. In order to avoid code repetition and increase readability, two functions (check_above_fifty() and get_prop_of_passed_in_subject()) that were used multiple times were implemented.
```{r}
# In order to avoid code repetition and increase readability, two functions that were multiple times used were implemented.
# Returns true if the given x is above 50
check_above_fifty <- function(x){
# Check if the given x is integer or numeric, and stop if it is not
stopifnot((class(x) == "integer") || (class(x) == "numeric"))
return (x > 50)
}
# Returns the proportion of the passed students for the given subject
get_prop_of_passed_in_subject <- function(data, subject.score){
# Check if the given data is a data.frame, and stop if it is not
stopifnot(class(data) == "data.frame")
# Check if the given subject.score variable exists as a column in the data.frame data, and stop if it is not
stopifnot(subject.score %in% colnames(data))
# Counting the total students/rows that their scores were above 50 (pass/fail limit)
total_students_passed <- length(Filter(check_above_fifty, data[, subject.score]))
# Returning the proportion of passed students for the given subject
return (total_students_passed/length(data[, subject.score]))
}
# Calculating the pass ratios for the three subjects
pass.math <- get_prop_of_passed_in_subject(data, 'math.score')
pass.reading <-get_prop_of_passed_in_subject(data, 'reading.score')
pass.writing <- get_prop_of_passed_in_subject(data, 'writing.score')
# Saving the values in a data.frame
pass_ratios <- data.frame(subject=c("Maths", "Reading", "Writing"),
pass_rate=c(pass.math, pass.reading, pass.writing ))
# Plotting pass rates with basic ggplot barplot
ggplot(data=pass_ratios, aes(x=subject, y=pass_rate)) +
geom_bar(stat="identity", fill="steelblue") +
geom_text(aes(label=pass_rate), vjust=1.6, color="white", size=5) +
theme_minimal() +
labs(title="Pass rates per subject", x="Subject", y = "Pass rate")
```
In general, assuming that a score of 50 is the minimum to pass, the passing rates are good. The students who fail mostly fail the math exam, while reading seems to be the subject with the highest pass rate.
Next, the densities of the different scores were plotted on top of their histograms. The distribution-density of each score was visualized with a histogram. The bins of the histograms were calculated manually with the use of the __Freedman–Diaconis rule__, which was also discussed in class and is less sensitive to outliers in the data. The __Freedman–Diaconis__ choice calculates the bin width h as:
$$
h=2 \frac{\operatorname{IQR}(x)}{\sqrt[3]{n}}
$$
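As an illustration of the rule (a minimal sketch, not part of the original analysis), the bin width and the implied number of bins can be computed by hand for the math scores; this is roughly what the nclass.FD() function used below does internally:
```{r}
# Illustrative computation (assumption: the data frame `data` loaded above):
# Freedman-Diaconis bin width h = 2 * IQR(x) / n^(1/3) for the math scores
x.ms <- data$math.score
h.ms <- 2 * IQR(x.ms) / length(x.ms)^(1/3)
h.ms
# implied number of bins over the observed range, roughly what nclass.FD() returns
ceiling(diff(range(x.ms)) / h.ms)
```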
```{r}
# Histogram with FD method and color by groups for writing scores
breaks.writing.score <- pretty(range(data[,'writing.score']), n = nclass.FD(data[,'writing.score']), min.n = 1)
writing.score.dens.plot <- ggplot(data, aes(x=writing.score)) +
geom_histogram(aes(y=..density..), alpha=0.5, breaks = breaks.writing.score,
position="identity") +
geom_density(alpha=.2) + ggtitle("Distribution of Writing Scores")
# Histogram with FD method and color by groups for math scores
breaks.math.score <- pretty(range(data[,'math.score']), n = nclass.FD(data[,'math.score']), min.n = 1)
math.score.dens.plot <- ggplot(data, aes(x=math.score)) +
geom_histogram(aes(y=..density..), alpha=0.5, breaks = breaks.math.score,
position="identity") +
geom_density(alpha=.2) + ggtitle("Distribution of Math Scores")
# Histogram with FD method and color by groups for reading scores
breaks.reading.score <- pretty(range(data[,'reading.score']), n = nclass.FD(data[,'reading.score']), min.n = 1)
reading.score.dens.plot <- ggplot(data, aes(x=reading.score)) +
geom_histogram(aes(y=..density..), alpha=0.5, breaks = breaks.reading.score,
position="identity") +
geom_density(alpha=.2) + ggtitle("Distribution of Reading Scores")
# The three defined plots are going to be plotted on this page
ggarrange(math.score.dens.plot, reading.score.dens.plot, writing.score.dens.plot,
ncol = 3, nrow = 1)
```
As illustrated, the histogram of the math scores has an empty gap on its left side, which corresponds to the low-score outliers that were also observed in the boxplots. Even though all of the score variables follow a bell-shaped distribution, both their boxplots and plotted densities looked slightly negatively skewed.
The script below calculates the skewness of the samples.
```{r}
# The sample skewness for writing.score
skewness(data$writing.score)
# The sample skewness for math.score
skewness(data$math.score)
# The sample skewness for reading.score
skewness(data$reading.score)
```
Indeed, all the score variables were not exactly symmetric, since their skewness indicated that the distributions were slightly negatively skewed.
Since the distributions are not exactly symmetric but follow a bell-shaped distribution, it was not easily distinguishable whether they follow the Normal (Gaussian) distribution or not. Thus, it was further checked whether the scores follow the Normal distribution, with more visualization and hypothesis testing.
The variables were first standardized to have zero mean and a standard deviation of one, so that their densities could be compared with the density of the Standard Normal distribution. The Probability Density Function (PDF) of the data was estimated by utilizing the default Gaussian Kernel Density Estimation, which computes kernel density estimates.
```{r}
# Arranging 3 figures in 1 rows and 3 columns
par(mfrow=c(1,3))
# Density Estimation (non-parametric estimation of the PDF of the standardized sample)
# Density estimation of scaled Math Score
std.ms<-scale(data$math.score)
dens.std <- density(std.ms)
x.data <- dens.std$x
plot(dens.std, main="Density estimation of scaled Math Score", ylim=c(0,0.5))
lines(x.data,dnorm(x.data), col="red")
legend("topleft",legend=c("Standardised data", "Standard Normal"),
col=c("black", "red"), lty=1, cex=1.25)
# Density estimation of scaled Reading Score
std.rs<-scale(data$reading.score)
dens.std <- density(std.rs)
x.data <- dens.std$x
plot(dens.std, main="Density estimation of scaled Reading Score", ylim=c(0,0.5))
lines(x.data,dnorm(x.data), col="red")
legend("topleft",legend=c("Standardised data", "Standard Normal"),
col=c("black", "red"), lty=1, cex=1.25)
# Density estimation of scaled Writing Score
std.ws<-scale(data$writing.score)
dens.std <- density(std.ws)
x.data <- dens.std$x
plot(dens.std, main="Density estimation of scaled Writing Score", ylim=c(0,0.5))
lines(x.data,dnorm(x.data), col="red")
legend("topleft",legend=c("Standardised data", "Standard Normal"),
col=c("black", "red"), lty=1, cex=1.25)
```
The estimated PDF of the math score looks very similar to the PDF of the Standard Normal, while the other two are questionable. On that account, a Quantile-Quantile plot (QQ plot) and the non-parametric Kolmogorov-Smirnov test were employed in order to test the distributions for normality.
```{r}
# Arranging 3 figures in 1 rows and 3 columns
par(mfrow=c(1,3))
# QQ-plots for the different score variables
# For each different score, first standardize the scores (scale and center)
# and compare with Normal Distribution by using a qqtest and run a Kolmogorov-Smirnov test afterwards to compare with the Normal Distribution
qqnorm(std.ms,main='Normal QQ-plot of standardised Maths Score',col='deepskyblue4')
qqline(std.ms,col='red')
qqnorm(std.rs,main='Normal QQ-plot of standardised Reading Score',col='deepskyblue4')
qqline(std.rs,col='red')
qqnorm(std.ws,main='Normal QQ-plot of standardised Writing Score', col='deepskyblue4')
qqline(std.ws,col='red')
# Kolmogorov-Smirnov tests
ks.test(std.ms,'pnorm')
ks.test(std.rs,'pnorm')
ks.test(std.ws,'pnorm')
```
The null hypothesis of each test was that the specified score variable follows the Normal distribution (no deviation from normality).
The p-values for the math score (0.297) and the writing score (0.06297) are larger than 0.05 (5% level of significance), therefore it was concluded that the distributions of the math and writing scores were not significantly different from the Normal distribution.
Regarding the reading scores, the p-value was 0.04257, which is lower than 0.05 (5% level of significance), thus the null hypothesis that the reading scores follow the Normal distribution was rejected.
Next, the mean and median were compared for the different scores.
```{r}
math_stats <- data.frame(subject=c('Maths', 'Maths'), metric=c('median','mean'),
metric_value=c(median(data$math.score), mean(data$math.score)))
reading_stats <- data.frame(subject=c('Reading', 'Reading'), metric=c('median','mean'),
metric_value=c(median(data$reading.score), mean(data$reading.score)))
writing_stats <- data.frame(subject=c('Writing', 'Writing'), metric=c('median','mean'),
metric_value=c(median(data$writing.score), mean(data$writing.score)))
# Bind the two data frames by row and keep the columns (function of dplyr)
subject_stats <- bind_rows(math_stats, reading_stats, writing_stats)
# Using "position=position_dodge()" in order to have three different bars per metric
ggplot(data=subject_stats, aes(x=subject, fill=metric, y=metric_value)) +
geom_bar(stat="identity", position=position_dodge()) +
guides(fill = guide_legend(title = "Metric")) +
geom_text(aes(label=round(metric_value, digits = 2)), color="black", size=5, position=position_dodge(width = .9)) +
labs(title="Mean and median writing score per subject", x="Subject", y = "score")
```
The lowest median and mean scores are for maths, while the highest are for reading. This further shows that maths is the subject most students have a hard time with, while students perform better in reading.
## Comparisons between males and females
During this part of the analysis, differences between the populations of males and females were explored.
Following, the distribution of the two genders is presented.
```{r}
# Cast the gender column from the data in a table-format to get the counts of each gender group
gender_count <- table(data$gender)
# Define gender columns
genders <- names(gender_count)
# Pie chart for the gender distribution (In order to show the percentage)
pie_data <- as.data.frame(round(gender_count/sum(gender_count)*100, digits=2)) # First cast the table to a new Data Frame with the percentages
ggplot(data=pie_data, aes(x = "", y = Freq, fill = Var1)) +
geom_col() +
geom_text(aes(label = paste("", gender_count,"\n", Freq, "%")),
position = position_stack(vjust = 0.5),
show.legend = FALSE) + coord_polar(theta = "y") +
guides(fill = guide_legend(title = "Gender")) +
theme( axis.title.x = element_blank(), axis.title.y = element_blank())
```
The frequencies of the genders in the dataset are not exactly balanced, since there are slightly more females than males (518 to 482). This will be taken into account for the rest of the analysis.
Following, the distributions of the different categorical variables by gender are illustrated by utilizing the barplot of the _ggplot2_ package.
```{r}
# Stacked barplot with multiple groups for the race.ethnicity
# Using "position=position_dodge()" in order to have two different bars per group
race.ethnicity.plt <- ggplot(data=data, aes(x=race.ethnicity, fill=gender)) +
geom_bar(stat="count", position=position_dodge()) +
guides(fill = guide_legend(title = "Gender")) +
labs(title="Distribution of Ethnicities", x="race ethinicity", y = "")
# Stacked barplot with multiple groups for the parental.level.of.education
# Using "position=position_dodge()" in order to have two different bars per group
parental.level.of.education.plt <- ggplot(data=data, aes(x=parental.level.of.education, fill=gender)) +
geom_bar(stat="count", position=position_dodge()) +
guides(fill = guide_legend(title = "Gender")) +
labs(title="Distribution of Different PLE", x="parental level of education", y = "")
# Stacked barplot with multiple groups for the test.preparation.course.plt
# Using "position=position_dodge()" in order to have two different bars per group
test.preparation.course.plt <- ggplot(data=data, aes(x=test.preparation.course, fill=gender)) +
geom_bar(stat="count", position=position_dodge()) +
guides(fill = guide_legend(title = "Gender")) +
labs(title="Distribution of Test Preparation Course",x="Test preperation course", y = "")
# Stacked barplot with multiple groups for the lunch.plt
# Using "position=position_dodge()" in order to have two different bars per group
lunch.plt <- ggplot(data=data, aes(x=lunch, fill=gender)) +
geom_bar(stat="count", position=position_dodge()) +
guides(fill = guide_legend(title = "Gender")) +
labs(title="Distribution of Lunch",x="Lunch", y = "")
# The three defined plots are going to be plotted on this page
ggarrange(race.ethnicity.plt, parental.level.of.education.plt,
test.preparation.course.plt, lunch.plt,
ncol = 2, nrow = 2)
```
It can be seen that some ethnic groups have more male students than female students (e.g. groups A and E), and vice-versa (e.g. group C). Also, as mentioned, the populations of males and females are not exactly balanced, thus it is not safe to draw many conclusions from the above plots.
One solution to the above-mentioned issue was to plot proportions instead of frequencies by re-scaling.
Regarding which of the two populations takes the test preparation course more, the relative proportions were plotted:
```{r}
data %>% count(gender, test.preparation.course) %>% group_by(gender) %>%
mutate(prop = n / sum(n)) %>%
ggplot(mapping = aes(x = gender, y = test.preparation.course)) +
geom_tile(mapping = aes(fill = prop)) +
geom_text(aes(label=round(prop, digits=3)), color="white")
```
As can be observed, the difference between the two groups is very small. The proportion of males who completed the test preparation course is slightly higher compared to the females (0.361 to 0.355).
Regarding which of the two populations takes the standard lunch more, the relative proportions were plotted:
```{r}
data %>% count(gender, lunch) %>% group_by(gender) %>%
mutate(prop = n / sum(n)) %>%
ggplot(mapping = aes(x = gender, y = lunch)) +
geom_tile(mapping = aes(fill = prop)) +
geom_text(aes(label=round(prop, digits=3)), color="white")
```
Again, the difference between the two groups is small. The proportion of females who have free/reduced lunch is slightly higher compared to the males (0.365 to 0.344).
Next, boxplots for the different score variables were plotted by gender. Additionally, the distribution-density of each score was visualized with a histogram. The bins of the histograms were calculated manually with the use of the __Freedman–Diaconis rule__.
```{r}
# boxplot for math.score
# Using "position=position_dodge()" in order to have a different boxplot per gender
boxplot.gender.math.score <- ggplot(data=data, aes(x=math.score, y=gender, fill=gender)) +
geom_boxplot() + coord_flip() +
labs(title="Box plot for Math Scores",x="Math Scores", y = "Gender")
# boxplot for reading.score
# Using "position=position_dodge()" in order to have a different boxplot per gender
boxplot.gender.reading.score<- ggplot(data=data, aes(x=reading.score, y=gender, fill=gender)) +
geom_boxplot() + coord_flip() +
labs(title="Box plot for Reading Scores",x="Reading Scores", y = "Gender")
# boxplot for writing.score
# Using "position=position_dodge()" in order to have a different boxplot per gender
boxplot.gender.writing.score <- ggplot(data=data, aes(x=writing.score, y=gender, fill=gender)) +
geom_boxplot() + coord_flip() +
labs(title="Box plot for Writing Scores",x="Writing Scores", y = "Gender")
# The three defined plots are going to be plotted on this page
ggarrange(boxplot.gender.math.score, boxplot.gender.reading.score, boxplot.gender.writing.score,
ncol = 3, nrow = 1)
```
```{r}
# Histogram with Freedman–Diaconis method and color by groups for writing scores
breaks.writing.score <- pretty(range(data[,'writing.score']), n = nclass.FD(data[,'writing.score']), min.n = 1)
writing.score.dens.plot <- ggplot(data, aes(x=writing.score, color=gender, fill=gender)) +
geom_histogram(aes(y=..density..), alpha=0.5, breaks = breaks.writing.score,
position="identity") +
geom_density(alpha=.2) + ggtitle("Distribution of Writing Scores")
# Histogram with Freedman–Diaconis method and color by groups for math scores
breaks.math.score <- pretty(range(data[,'math.score']), n = nclass.FD(data[,'math.score']), min.n = 1)
math.score.dens.plot <- ggplot(data, aes(x=math.score, color=gender, fill=gender)) +
geom_histogram(aes(y=..density..), alpha=0.5, breaks = breaks.math.score,
position="identity") +
geom_density(alpha=.2) + ggtitle("Distribution of Math Scores")
# Histogram with Freedman–Diaconis method and color by groups for reading scores
breaks.reading.score <- pretty(range(data[,'reading.score']), n = nclass.FD(data[,'reading.score']), min.n = 1)
reading.score.dens.plot <- ggplot(data, aes(x=reading.score, color=gender, fill=gender)) +
geom_histogram(aes(y=..density..), alpha=0.5, breaks = breaks.reading.score,
position="identity") +
geom_density(alpha=.2) + ggtitle("Distribution of Reading Scores")
# The three defined plots are going to be plotted on this page
ggarrange(math.score.dens.plot, reading.score.dens.plot, writing.score.dens.plot,
ncol = 3, nrow = 1)
```
Firstly, an interesting point is that the medians and interquartile ranges (IQR) of the two populations for the different scores revealed that females are more likely to perform better on the writing and reading exams, while males performed better on the math exams. Those observations were then checked and confirmed with hypothesis testing. Moreover, even though females performed better on writing and reading, there are more outliers (values more than 1.5 IQR below Q1) in the female population than in the male population.
In order to compare the scores of the two individual populations (males and females), it was decided to utilize a two-sample t-test. As the literature suggests, the t-test assumes normality of the data. During the **__Analysis of the scoring variables__** section, it was shown that only the math and writing scores were normally distributed. The scoring variables of the two individual populations (males and females) were further tested for normality in order to employ the t-test and accurately compare the means of the two groups.
A Quantile-Quantile plot (QQ plot) and the non-parametric Kolmogorov-Smirnov test were employed in order to test the distributions of the different scores of the males for normality.
```{r}
# QQ-plots of Males' Performance:
par(mfrow=c(1,3))
# For each different score of the males, first standardize the scores (scale and center)
# and compare with Normal Distribution by using a qqtest and run a Kolmogorov-Smirnov test afterwards to compare with the Normal Distribution
std.males_ms<-scale(data$math.score[data$gender=='male'])
qqnorm(std.males_ms,main='Normal QQ-plot of standardised Maths Score of Males',col='deepskyblue4')
qqline(std.males_ms,col='red')
std.males_rs<-scale(data$reading.score[data$gender=='male'])
qqnorm(std.males_rs,main='Normal QQ-plot of standardised Reading Score of Males',col='deepskyblue4')
qqline(std.males_rs,col='red')
std.males_ws<-scale(data$writing.score[data$gender=='male'])
qqnorm(std.males_ws,main='Normal QQ-plot of standardised Writing Score of Males', col='deepskyblue4')
qqline(std.males_ws,col='red')
# Run the non-parametric Kolmogorov-Smirnov test to compare with the Normal Distribution
ks.test(std.males_ms,'pnorm')
ks.test(std.males_ws,'pnorm')
ks.test(std.males_rs,'pnorm')
```
The null hypothesis of each test was that the specified score variable (for the males only) follows the Normal distribution.
Regarding the male group, the p-values for the math score (0.4632), writing score (0.4271) and reading score (0.2654) are much larger than 0.05 (5% level of significance), therefore it was concluded that the distributions of the different scores for the males were not significantly different from the Normal distribution.
A Quantile-Quantile plot (QQ plot) and the non-parametric Kolmogorov-Smirnov test were employed in order to test the distributions of the different scores of the females for normality.
```{r}
# QQ-plots of Females' Performance:
par(mfrow=c(1,3))
# For each different score of the females, first standardize the scores (scale and center)
# and compare with Normal Distribution by using a qqtest and run a Kolmogorov-Smirnov test afterwards to compare with the Normal Distribution
std.females_ms<-scale(data$math.score[data$gender=='female'])
qqnorm(std.females_ms,main='Normal QQ-plot of standardised Maths Score of Females',col='deepskyblue4')
qqline(std.females_ms,col='red')
std.females_rs<-scale(data$reading.score[data$gender=='female'])
qqnorm(std.females_rs,main='Normal QQ-plot of standardised Reading Score of Females',col='deepskyblue4')
qqline(std.females_rs,col='red')
std.females_ws<-scale(data$writing.score[data$gender=='female'])
qqnorm(std.females_ws,main='Normal QQ-plot of standardised Writing Score of Females', col='deepskyblue4')
qqline(std.females_ws,col='red')
# Run the non-parametric Kolmogorov-Smirnov test to compare with the Normal Distribution
ks.test(std.females_ms,'pnorm')
ks.test(std.females_ws,'pnorm')
ks.test(std.females_rs,'pnorm')
```
The null hypothesis of each test was that the specified score variable (for the females only) follows the Normal distribution.
Regarding the female group, the p-values for the math score (0.2835) and reading score (0.2058) are larger than 0.05 (5% level of significance), therefore it was concluded that the distributions of the math and reading scores were not significantly different from the Normal distribution.
Regarding the writing scores, the p-value was 0.04018, which is lower than 0.05 (5% level of significance), thus the null hypothesis that the writing scores follow the Normal distribution was rejected.
Therefore, since normality holds in both individual populations only for the math and reading scores, t-tests were employed only for those two scores.
Eventually, a two-sample t-test was performed in order to see if there is a significant difference in the means of the two different populations (males and females) for the scores in the two subjects.
There are several types of t-tests, but in this case a two-sample t-test was utilized, since we were interested in comparing samples that come from two different populations.
Since more t-tests were employed during this analysis, and in order to avoid code repetition and increase readability, a function that runs the whole procedure dynamically was implemented. Specifically, the implemented function __compare.populations()__ divides the data samples into two populations, prints the mean, variance and standard deviation of the two samples and finally runs a two-sample t-test.
```{r}
# Function that dynamically divides the data samples of the given dataframe data.df in to two populations based on group.variable. Prints summary statistics of the two samples for the variable.to.observe column and finally runs a two sample t-test
compare.populations <- function(data.df, group.variable, variable.to.observe){
# Check conditions for the input of the data
stopifnot(class(group.variable) == "character")
stopifnot(class(variable.to.observe) == "character")
stopifnot(class(data.df) == "data.frame")
groups <- levels(data.df[[group.variable]])
# Check that groups is a vector of size two
stopifnot(class(groups) == "character")
stopifnot(length(groups) == 2)
# Create an empty list which the vectors of the data
# are going to be added
groups.data <- list()
# Divide data samples in to two populations dynamically
for (group in groups){
group.data <- filter(data.df, data.df[[group.variable]] == group)
groups.data[[length(groups.data)+1]] <- group.data
}
# Comparing the Measure of variability between both samples
i <- 1
for (group.data in groups.data){
cat("Mean of group", groups[i], mean(group.data[[variable.to.observe]]), "\n")
cat("Variance of group", groups[i], var(group.data[[variable.to.observe]]), "\n")
cat("Standard Deviation of group", groups[i], sd(group.data[[variable.to.observe]]), "\n", "\n")
i <- i + 1
}
# Two-sample t-test
t.test(data.df[[variable.to.observe]] ~ data.df[[group.variable]])
}
```
The following two-sample t-test was conducted for the reading score:
* Variable to split population: gender
* Hypotheses:
+ H0: There is no difference between the average reading score of male and female students
+ H1: There is a difference between the average reading score of male and female students
* Significance level: 0.05 (95 percent confidence level)
```{r}
# TWO SAMPLE T-TEST
# VARIABLE: GENDER
# Hypotheses for reading score:
# H0: There is no difference between the average reading score of male and female students
# H1: There is a difference between the average reading score of male and female students
# Significant level 0.05
compare.populations(data, 'gender', 'reading.score')
```
Interpreting the outcome of the t-test, the difference in reading score between females (Mean = 72.6; SD = 14.37) and males (Mean = 65.47; SD = 13.93) was significant (t-value = 7.968; p-value = 4.376e-15).
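For reference (an illustrative sketch, not part of the original analysis), the 95% confidence interval for the difference in means can also be read directly from the object returned by t.test(), which is the same test that compare.populations() runs internally:
```{r}
# Illustrative sketch (assumption: the data frame `data` loaded above):
# re-run the reading-score t-test, keeping the object to inspect its components
res.reading <- t.test(reading.score ~ gender, data = data)
res.reading$conf.int  # 95% CI for the difference in means; it does not contain zero
res.reading$p.value
```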
Next, the following two-sample t-test was conducted for the math score:
* Variable to split population: gender
* Hypotheses:
+ H0: There is no difference between the average math score of male and female students
+ H1: There is a difference between the average math score of male and female students
* Significance level: 0.05 (95 percent confidence level)
```{r}
# TWO SAMPLE T-TEST
# VARIABLE: GENDER
# Hypotheses for math score:
# H0: There is no difference between the average math score of male and female students
# H1: There is a difference between the average math score of male and female students
# Significant level 0.05
compare.populations(data, 'gender', 'math.score')
```
Interpreting the outcome of the t-test, the difference in math score between females (Mean = 63.63; SD = 15.49) and males (Mean = 68.73; SD = 14.36) was significant (t-value = -5.398; p-value = 8.421e-08).
Lastly, since the writing scores are not normally distributed for the female group, the writing scores of the two groups were compared by their mean and median.
```{r}
# Isolating the writing scores of the two populations
female.writing.score <- filter(data, data$gender == 'female')[, 'writing.score']
male.writing.score <- filter(data, data$gender == 'male')[, 'writing.score']
# Constructing data.frames for the median and mean of the two groups
female_writing_stats <- data.frame(gender=c('female', 'female'), metric=c('median','mean'),
metric_value=c(median(female.writing.score), mean(female.writing.score)))
male_writing_stats <- data.frame(gender=c('male', 'male'), metric=c('median','mean'),
metric_value=c(median(male.writing.score), mean(male.writing.score)))
# Bind the two data frames by row and keep the columns (function of dplyr)
writing_stats <- bind_rows(female_writing_stats, male_writing_stats)
# Stacked barplot with multiple groups for the means and medians
# Using "position=position_dodge()" in order to have two different bars per group
ggplot(data=writing_stats, aes(x=metric, fill=gender, y=metric_value)) +
geom_bar(stat="identity", position=position_dodge()) +
guides(fill = guide_legend(title = "Gender")) +
geom_text(aes(label=round(metric_value, digits = 2)), color="black", size=5, position=position_dodge(width = .9)) +
labs(title="Mean and median writing score per gender", x="metric", y = "score")
```
As shown, both the mean and the median writing score of the female group were higher than those of the male group.
In conclusion, after running the two hypothesis tests, it was shown to be statistically significant at the 5% level that, in general, females outperformed males in reading, while males performed better in maths. Finally, by looking at the distributions, box plots and summary statistics for the writing score, it seems that females also outperform males in that subject.
## Analysis of test preparation course
From the summary of the variables, it was already known that the group sample sizes across test preparation are unequal (none=642, completed=358). However, this was taken into account while analyzing the findings on the influence of test preparation on subject scores.
To see if test preparation affects the subject results, a two-sample Welch's t-test was used to compare the means of the two individual populations (none, completed). Welch's t-test assumes normality of the distributions of the different scores of the 'none' and 'completed' course populations, hence the non-parametric Kolmogorov-Smirnov test was used to test for normality.
```{r}
### VARIABLE: TEST PREPARATION COURSE
# CHECK FOR NORMALITY reading
std.none_rs<-scale(data$reading.score[data$test.preparation.course=="none"])
std.completed_rs<-scale(data$reading.score[data$test.preparation.course=='completed'])
ks.test(std.none_rs,'pnorm')
ks.test(std.completed_rs,'pnorm')
# CHECK FOR NORMALITY WRITING
std.none_ws<-scale(data$writing.score[data$test.preparation.course=="none"])
std.completed_ws<-scale(data$writing.score[data$test.preparation.course=='completed'])
ks.test(std.none_ws,'pnorm')
ks.test(std.completed_ws,'pnorm')
# CHECK FOR NORMALITY MATH
std.none_ms<-scale(data$math.score[data$test.preparation.course=="none"])
std.completed_ms<-scale(data$math.score[data$test.preparation.course=='completed'])
ks.test(std.none_ms,'pnorm')
ks.test(std.completed_ms,'pnorm')
```
As a result of the non-parametric Kolmogorov-Smirnov tests at the 5% level of significance, we may infer that the various scores of the 'none' and 'completed' course populations do not violate the normality assumption of Welch's t-test, since all p-values are greater than 0.05.
We were then able to continue with the hypothesis testing and infer whether the test preparation really did influence the students' results, since Welch's t-test takes into account that the variances of the two individual populations might not be the same.
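To illustrate the point about unequal variances (a minimal sketch, not part of the original analysis), note that t.test() applies the Welch correction by default (var.equal = FALSE); setting var.equal = TRUE would instead run the pooled-variance Student's t-test, which changes the degrees of freedom:
```{r}
# Illustrative comparison (assumption: the data frame `data` loaded above):
# Welch's t-test (default) vs the pooled-variance Student's t-test
welch.math  <- t.test(math.score ~ test.preparation.course, data = data)
pooled.math <- t.test(math.score ~ test.preparation.course, data = data, var.equal = TRUE)
c(welch.df = unname(welch.math$parameter), pooled.df = unname(pooled.math$parameter))
```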
The following two-sample Welch's t-test was conducted for the writing score variable between the two different test preparation populations (none and completed).
```{r}
# TWO SAMPLE T-TEST
#->Variables to Observe: WRITING SCORE - COURSE PREP
#->Hypotheses:
#H0: There is no difference between the average writing score on preparation and none preparation course .
#H1: There is a difference between the average writing score on preparation and none preparation course .
#->Significant level 0.05
#two-sample t-test
t.test(writing.score ~ test.preparation.course, data = data)
## p-value < 0.05 -> reject H0, so there is a significant difference between the average writing scores of the completed and none test preparation groups
```
The result of the hypothesis test reveals that there is indeed a significant difference between the mean writing scores of the 'none' and 'completed' test preparation groups; since the p-value (2.2e-16) of the test was less than 0.05, the null hypothesis was rejected.
The following two-sample Welch's t-test was conducted for the reading score variable between the two different test preparation populations (none and completed).
```{r}
# TWO SAMPLE T-TEST
#->Variables to Observe: READING SCORE - COURSE PREP
#->Hypotheses:
#H0: There is no difference between the average reading score on preparation and none preparation course .
#H1: There is a difference between the average reading score on preparation and none preparation course .
#->Significant level 0.05
#two-sample t-test
t.test(reading.score ~ test.preparation.course, data = data)
## p-value < 0.05 -> reject H0, so there is a significant difference between the average reading scores of the completed and none test preparation groups
```
The result of the hypothesis test reveals that there is indeed a significant difference between the mean reading scores of the 'none' and 'completed' test preparation groups; since the p-value (4.389e-15) of the test was less than 0.05, the null hypothesis was rejected.
The following two-sample Welch's t-test was conducted for the math score variable between the two different test preparation populations (none and completed).
```{r}
# TWO SAMPLE T-TEST
#->Variables to Observe: MATH SCORE - COURSE PREP
#->Hypotheses:
#H0: There is no difference between the average math score on preparation and none preparation course .
#H1: There is a difference between the average math score on preparation and none preparation course .
#->Significant level 0.05
#two-sample t-test
t.test(math.score ~ test.preparation.course, data = data)
## p-value < 0.05 -> reject H0, so there is a significant difference between the average math scores of the completed and none test preparation groups
```
The result of the hypothesis test reveals that there is indeed a significant difference between the mean math scores of the 'none' and 'completed' test preparation groups; since the p-value (1.043e-08) of the test was less than 0.05, the null hypothesis was rejected.
To conclude, students who had completed the test preparation course appear to score better in all subjects than those who did not take any preparation course.
## Analysis of lunch
From the summary of the variables, it was already known that the group sample sizes across lunch are unequal (free/reduced=355, standard=645). However, this was taken into account while analysing our findings on the influence of lunch on subject scores.
To observe whether lunch affects the subject results, a two-sample Welch's t-test was used to compare the means of the two individual populations (free/reduced, standard). Welch's t-test assumes normality of the distributions of the different scores of the free/reduced and standard lunch populations, hence the non-parametric Kolmogorov-Smirnov test was used to test for normality.
```{r}
### VARIABLE:LUNCH
#CHECK FOR NORMALITY reading
std.freereduced_rs<-scale(data$reading.score[data$lunch=="free/reduced"])
std.standard_rs<-scale(data$reading.score[data$lunch=='standard'])
ks.test(std.freereduced_rs,'pnorm')
ks.test(std.standard_rs,'pnorm')
#CHECK FOR NORMALITY WRITING
std.freereduced_ws<-scale(data$writing.score[data$lunch=="free/reduced"])
std.standard_ws<-scale(data$writing.score[data$lunch=='standard'])
ks.test(std.freereduced_ws,'pnorm')
ks.test(std.standard_ws,'pnorm')
#CHECK FOR NORMALITY MATH
std.freereduced_ms<-scale(data$math.score[data$lunch=="free/reduced"])
std.standard_ms<-scale(data$math.score[data$lunch=='standard'])
ks.test(std.freereduced_ms,'pnorm')
ks.test(std.standard_ms,'pnorm')
```
As a result of the non-parametric Kolmogorov-Smirnov tests at the 5% level of significance, we may infer that the various scores of the free/reduced and standard lunch populations do not violate the normality assumption of Welch's t-test, since all p-values are greater than 0.05.
We were then able to continue with the hypothesis testing and infer whether lunch really did influence the students' results, since Welch's t-test takes into account that the variances of the two individual populations might not be the same.
The following two-sample Welch's t-tests were conducted for the different score variables between the two different lunch populations (free/reduced and standard).
```{r}
# TWO SAMPLE T-TEST
#->Variables to Observe: WRITING SCORE/READING/MATH- LUNCH
#->Hypotheses:
#H0: There is no difference between the average writing score on standard and free lunch .
#H1: There is a difference between the average writing score on standard and free lunch .
#->Significant level 0.05
#two-sample t-test
t.test(writing.score ~ lunch, data = data)
## p-value < 0.05 -> reject H0, so there is a significant difference between the average writing scores of the standard and free/reduced lunch groups
#two-sample t-test
t.test(reading.score ~ lunch, data = data)
## p-value < 0.05 -> reject H0, so there is a significant difference between the average reading scores of the standard and free/reduced lunch groups
#two-sample t-test
t.test(math.score ~ lunch, data = data)
## p-value < 0.05 -> reject H0, so there is a significant difference between the average math scores of the standard and free/reduced lunch groups
```
Regarding the reading score, the result of the hypothesis test reveals that lunch indeed makes a significant difference in the mean reading score between the free/reduced and standard populations; since the p-value (8.422e-13) of the test was less than 0.05, the null hypothesis was rejected.
For the writing score, the result of the hypothesis test reveals that lunch indeed makes a significant difference in the mean writing score between the free/reduced and standard populations; since the p-value (1.716e-14) of the test was less than 0.05, the null hypothesis was rejected.
Finally, for the math score, the result of the hypothesis test revealed that there is indeed a significant difference between the mean math scores of the free/reduced and standard populations; since the p-value (2.2e-16) of the test was less than 0.05, the null hypothesis was rejected.
## Comparisons between the different ethnicities
From the initial summary of the variables, it was already known that the group sample sizes across race/ethnicity are unequal. However, this was taken into account while analysing the different findings on the influence of race/ethnicity on subject scores.
Boxplots for the scores based on the different race/ethnicity groups:
```{r}
# boxplot for math.score
# Using "position=position_dodge()" in order to have a different boxplot per group
boxplot.gender.math.score <- ggplot(data=data, aes(x=math.score, y=race.ethnicity, fill=race.ethnicity)) +
geom_boxplot() + coord_flip() +
labs(title="Box plot for Math Scores",x="Math Scores", y = "Ethinicity")
# boxplot for reading.score
# Using "position=position_dodge()" in order to have a different boxplot per group
boxplot.gender.reading.score<- ggplot(data=data, aes(x=reading.score, y=race.ethnicity, fill=race.ethnicity)) +
geom_boxplot() + coord_flip() +
labs(title="Box plot for Reading Scores",x="Reading Scores", y = "Ethinicity")
# boxplot for writing.score
# Using "position=position_dodge()" in order to have a different boxplot per group
boxplot.gender.writing.score <- ggplot(data=data, aes(x=writing.score, y=race.ethnicity, fill=race.ethnicity)) +
geom_boxplot() + coord_flip() +
labs(title="Box plot for Writing Scores",x="Writing Scores", y = "Ethinicity")
# The three defined plots are going to be plotted on this page
ggarrange(boxplot.gender.math.score, boxplot.gender.reading.score, boxplot.gender.writing.score,
ncol = 3, nrow = 1)
```
It was discovered that, for all the subjects, race/ethnicity group E had the highest performance, while group A did not do as well as the other ethnic groups. One thing to keep in mind is that the ranking of performance is almost the same across all subjects. However, in order to compare the means of each race/ethnicity population and be more precise in our conclusions, it was chosen to cross-check these results using an ANOVA test (analysis of variance).
During the analysis, it was determined whether race/ethnicity had an impact on subject scores or not by using an ANOVA test. However, ANOVA assumes normality in the data \ref{lantz2013impact}, so it was investigated how the different ethnic groups' populations are distributed, in order to set up the tests and conclude whether ethnicity has an influence on subject scores.
To assess the normality of the distributions of the different scores of each race/ethnicity population, the non-parametric Kolmogorov-Smirnov test was used.
```{r}
### VARIABLE: RACE ETHNICITY
# CHECK FOR NORMALITY FOR READING SCORE
std.groupA_rs<-scale(data$reading.score[data$race.ethnicity=='group A'])
std.groupB_rs<-scale(data$reading.score[data$race.ethnicity=='group B'])
std.groupC_rs<-scale(data$reading.score[data$race.ethnicity=='group C'])
std.groupD_rs<-scale(data$reading.score[data$race.ethnicity=='group D'])
std.groupE_rs<-scale(data$reading.score[data$race.ethnicity=='group E'])
ks.test(std.groupA_rs,'pnorm')
ks.test(std.groupB_rs,'pnorm')
ks.test(std.groupC_rs,'pnorm')
ks.test(std.groupD_rs,'pnorm')
ks.test(std.groupE_rs,'pnorm')
# CHECK FOR NORMALITY FOR WRITING SCORE
std.groupA_ws<-scale(data$writing.score[data$race.ethnicity=='group A'])
std.groupB_ws<-scale(data$writing.score[data$race.ethnicity=='group B'])
std.groupC_ws<-scale(data$writing.score[data$race.ethnicity=='group C'])
std.groupD_ws<-scale(data$writing.score[data$race.ethnicity=='group D'])
std.groupE_ws<-scale(data$writing.score[data$race.ethnicity=='group E'])
ks.test(std.groupA_ws,'pnorm')
ks.test(std.groupB_ws,'pnorm')
ks.test(std.groupC_ws,'pnorm')
ks.test(std.groupD_ws,'pnorm')
ks.test(std.groupE_ws,'pnorm')
# CHECK FOR NORMALITY FOR MATH SCORE
std.groupA_ms<-scale(data$math.score[data$race.ethnicity=='group A'])
std.groupB_ms<-scale(data$math.score[data$race.ethnicity=='group B'])
std.groupC_ms<-scale(data$math.score[data$race.ethnicity=='group C'])
std.groupD_ms<-scale(data$math.score[data$race.ethnicity=='group D'])
std.groupE_ms<-scale(data$math.score[data$race.ethnicity=='group E'])
ks.test(std.groupA_ms,'pnorm')
ks.test(std.groupB_ms,'pnorm')
ks.test(std.groupC_ms,'pnorm')
ks.test(std.groupD_ms,'pnorm')
ks.test(std.groupE_ms,'pnorm')
```
As a result of the non-parametric Kolmogorov-Smirnov tests at the 5% level of significance, we may infer that the various scores of each race/ethnicity population do not violate the ANOVA test's normality assumption, since all p-values are greater than 0.05. However, we were hesitant to employ the classic ANOVA test, since the test's accuracy may be influenced by the different sample group sizes.
As a response, we ended up employing the one-way Welch's ANOVA test, which does not assume equal variances and accounts for the unequal sample sizes.
Race/ethnicity influence on the different scores:
```{r}
#perform Welch's ANOVA
oneway.test(data$reading.score ~ data$race.ethnicity, data = data, var.equal = FALSE) #notice that we check with
#var.equal= false
#Tukey-Kramer test
TukeyHSD(aov(data$reading.score ~ data$race.ethnicity))
#perform Welch's ANOVA
oneway.test(data$writing.score ~ data$race.ethnicity, data = data, var.equal = FALSE) #notice that we check with
#var.equal= false
#Tukey-Kramer test
TukeyHSD(aov(data$writing.score ~ data$race.ethnicity))
#perform Welch's ANOVA
oneway.test(data$math.score ~ data$race.ethnicity, data = data, var.equal = FALSE) #notice that we check with
#var.equal= false
#Tukey-Kramer test
TukeyHSD(aov(data$math.score ~ data$race.ethnicity))
```
To summarize, students from groups D and E outperform students from group A in reading scores, while students from group E appear to outperform group B as well. In terms of writing scores, it was found that students from groups C, D and E had higher writing scores than students from group A, while students from groups D and E appear to outperform even those from group B. Finally, in terms of math scores, we may infer that group E outperformed groups C, D, B and A, whereas students from group D outperformed those from groups B and A.
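To make such pairwise summaries easier to read (a minimal sketch, not part of the original analysis), the Tukey-Kramer output can be filtered to keep only the comparisons whose adjusted p-value is below 0.05:
```{r}
# Illustrative sketch (assumption: the data frame `data` loaded above):
# keep only the pairwise math-score comparisons with an adjusted p-value below 0.05
tukey.math <- TukeyHSD(aov(math.score ~ race.ethnicity, data = data))
tukey.math.mat <- tukey.math$race.ethnicity        # matrix with columns diff, lwr, upr, p adj
tukey.math.mat[tukey.math.mat[, "p adj"] < 0.05, ]
```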
## Comparisons between the different levels of education
Boxplots for the scores based on the different parental levels of education:
```{r}
# boxplot for writing.score
# Using "position=position_dodge()" in order to have a different boxplot per group
boxplot.gender.writing.score <- ggplot(data=data, aes(x=writing.score, y=parental.level.of.education, fill=parental.level.of.education)) +
geom_boxplot() + coord_flip() + theme(axis.text.x=element_blank()) +
labs(title="Box plot for Writing Scores",x="Writing Scores", y = "parental education")
# boxplot for math.score
# Using "position=position_dodge()" in order to have a different boxplot per group
boxplot.gender.math.score <- ggplot(data=data, aes(x=math.score, y=parental.level.of.education, fill=parental.level.of.education)) +
geom_boxplot() + coord_flip() + theme(axis.text.x=element_blank()) +
labs(title="Box plot for Math Scores",x="Math Scores", y = "parental education")
# boxplot for reading.score
# Using "position=position_dodge()" in order to have a different boxplot per group
boxplot.gender.reading.score<- ggplot(data=data, aes(x=reading.score, y=parental.level.of.education, fill=parental.level.of.education)) +
geom_boxplot() + coord_flip() + theme(axis.text.x=element_blank()) +