This is a simple analysis I did in R using the Students Performance in Exams dataset from Kaggle.
Loading the libraries
library(ggplot2)
library(dplyr)
library(gridExtra)
library(fastDummies)
library(caTools)
library(randomForest)
library(caret)
library(broom)
Reading the data
# stringsAsFactors = TRUE keeps the text columns as factors, which the
# level-editing below relies on (R >= 4.0 defaults to FALSE)
data <- read.csv('./data/StudentsPerformance.csv', stringsAsFactors = TRUE)
data <- data %>%
  mutate(average = (reading.score + math.score + writing.score) / 3) %>%
  # keep only the group letter, e.g. "group A" -> "A"
  # (as.character because substr() does not accept factors)
  mutate(race.ethnicity = substr(as.character(race.ethnicity), 7, 7)) %>%
  mutate(test.preparation.course = test.preparation.course == "completed") %>%
  mutate(lunch = lunch == "standard")
head(data)
# Abbreviate the levels of parental education
data$parental.level.of.education <- recode(data$parental.level.of.education,
  "high school" = "hs",
  "master's degree" = "md",
  "bachelor's degree" = "bd",
  "some college" = "c",
  "associate's degree" = "as",
  "some high school" = "h")
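A quick sanity check (not in the original run) that the renaming worked:
# Should list the six abbreviations: as, bd, hs, md, c, h
levels(data$parental.level.of.education)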
Exploring the factor columns
# Draw a boxplot of each of the four score columns (columns 6:9) against
# the given factor column mv, then arrange the plots in a grid
explore_boxplot <- function(mv){
  p <- list()
  for (rv in names(data[, 6:9])) {
    tmp <- ggplot(data) +
      # .data[[...]] instead of the deprecated aes_string()
      geom_boxplot(aes(x = .data[[mv]], y = .data[[rv]]))
    p[[length(p) + 1]] <- tmp
  }
  do.call(grid.arrange, c(p))
}
Gender
explore_boxplot("gender")
- Gender does appear to affect the scores, although the overall averages are fairly balanced
- Males do better in math, while females do better in writing & reading
Race
explore_boxplot("race.ethnicity")
- Can’t say for sure, but a rough look suggests that race/ethnicity has an effect too
- Group E seems to do better than the other groups, while group A does not do as well
Lunch
explore_boxplot("lunch")
- TRUE indicates a standard lunch
- This matches expectations: a good lunch goes with better scores
- So a short conclusion: eat a good lunch for a better score :D
Test Preparation
explore_boxplot("test.preparation.course")
- Again, as expected, students who took the test preparation course performed better than those who did not.
Parent's Education
explore_boxplot("parental.level.of.education")
- There definitely is a relationship between parents’ education level and their children’s scores, but it isn’t as clear and direct as the other relationships.
- A rough glance, though, suggests that students whose parents have just a high school level of education perform the worst.
Confirming the relationship
- Unlike the other 4 properties, the effect of parents’ level of education on the student might not be as significant. I will run t-tests on the coefficients of a linear model with parents’ level of education as the predictor.
- Assuming a significance level of alpha = 0.05.
mod <- lm(average ~ parental.level.of.education, data = data)
summary(mod)
##
## Call:
## lm(formula = average ~ parental.level.of.education, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.108 -9.259 0.874 9.895 33.892
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.5691 0.9344 74.457 < 2e-16 ***
## parental.level.of.educationbd 2.3547 1.5860 1.485 0.13796
## parental.level.of.educationhs -6.4721 1.3645 -4.743 2.41e-06 ***
## parental.level.of.educationmd 4.0298 2.0391 1.976 0.04840 *
## parental.level.of.educationc -1.0927 1.3155 -0.831 0.40640
## parental.level.of.educationh -4.4611 1.3985 -3.190 0.00147 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.92 on 994 degrees of freedom
## Multiple R-squared: 0.05131, Adjusted R-squared: 0.04654
## F-statistic: 10.75 on 5 and 994 DF, p-value: 4.381e-10
- At this alpha, the “high school” and “some high school” levels have a significant (negative) effect relative to the baseline (“associate’s degree”), and “master’s degree” is marginally significant, while the other levels are not.
- This conforms to the hypothesis I made earlier.
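Since broom is already loaded, the coefficient table can also be pulled out programmatically. A small sketch (not part of the original run) that flags the significant terms at alpha = 0.05:
# Coefficient table as a data frame, with a significance flag
tidy(mod) %>%
  filter(term != "(Intercept)") %>%
  mutate(significant = p.value < 0.05) %>%
  arrange(p.value)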
Testing with a simple linear model
model <- lm(math.score ~ parental.level.of.education + test.preparation.course + lunch + gender + race.ethnicity, data = data)
summary(model)
##
## Call:
## lm(formula = math.score ~ parental.level.of.education + test.preparation.course +
## lunch + gender + race.ethnicity, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.357 -8.744 0.166 9.001 30.655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.1358 1.8115 28.780 < 2e-16 ***
## parental.level.of.educationbd 1.9661 1.5020 1.309 0.190831
## parental.level.of.educationhs -4.8027 1.2971 -3.703 0.000225 ***
## parental.level.of.educationmd 2.8884 1.9382 1.490 0.136490
## parental.level.of.educationc -0.5827 1.2470 -0.467 0.640431
## parental.level.of.educationh -4.2487 1.3331 -3.187 0.001482 **
## test.preparation.courseTRUE 5.4947 0.8756 6.275 5.22e-10 ***
## lunchTRUE 10.8768 0.8727 12.463 < 2e-16 ***
## gendermale 4.9953 0.8390 5.954 3.63e-09 ***
## race.ethnicityB 2.0408 1.6998 1.201 0.230181
## race.ethnicityC 2.4700 1.5918 1.552 0.121060
## race.ethnicityD 5.3410 1.6241 3.289 0.001042 **
## race.ethnicityE 10.1347 1.8015 5.626 2.41e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.17 on 987 degrees of freedom
## Multiple R-squared: 0.2548, Adjusted R-squared: 0.2457
## F-statistic: 28.12 on 12 and 987 DF, p-value: < 2.2e-16
- Given that some coefficients are insignificant, I will create dummy variables & split the data into training and test sets.
Creating dummy variables & cleaning the data
cleaned <- dummy_cols(data, select_columns = c("gender", "race.ethnicity",
"parental.level.of.education",
"lunch", "test.preparation.course"), remove_selected_columns = TRUE)
head(cleaned)
# Drop one dummy per group to avoid perfect collinearity (the dummy variable trap)
redundants <- c("race.ethnicity_E", "gender_male", "parental.level.of.education_h", "lunch_TRUE", "test.preparation.course_TRUE")
cleaned <- cleaned[, !(names(cleaned) %in% redundants)]
head(cleaned)
# Note: call set.seed() first if you want a reproducible split
splitted <- sample.split(cleaned$average, SplitRatio = 0.8)
train <- cleaned[splitted == TRUE, ]
test <- cleaned[splitted == FALSE, ]
dim(train)
## [1] 811 16
dim(test)
## [1] 189 16
Various Models
I am going to try some of the regression algorithms I am familiar with:
- Linear Regression
- Random Forest
- KNN
Linear Regression
# A function that builds the formula based on variables given
build_linear_formula <- function(label, variables) {
f <- as.formula(paste(label, "~", paste(variables, collapse = "+")))
return(f)
}
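As a quick illustration (not in the original run), calling it with two of the dummy columns returns an ordinary R formula object:
# Example usage with two of the dummy columns from cleaned
build_linear_formula("math.score", c("gender_female", "lunch_FALSE"))
## math.score ~ gender_female + lunch_FALSE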
Building the model
linear_model <- lm(build_linear_formula("math.score",
names(cleaned)[5:16]), data = train)
summary(linear_model)
##
## Call:
## lm(formula = build_linear_formula("math.score", names(cleaned)[5:16]),
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.257 -8.984 0.465 9.593 31.186
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.8385 1.7978 43.854 < 2e-16 ***
## gender_female -5.1019 0.9391 -5.433 7.36e-08 ***
## race.ethnicity_A -9.6317 1.9874 -4.846 1.51e-06 ***
## race.ethnicity_B -7.4053 1.6811 -4.405 1.20e-05 ***
## race.ethnicity_C -6.2709 1.5260 -4.109 4.38e-05 ***
## race.ethnicity_D -3.9705 1.5595 -2.546 0.011085 *
## parental.level.of.education_as 4.9676 1.4891 3.336 0.000890 ***
## parental.level.of.education_bd 6.7571 1.7601 3.839 0.000133 ***
## parental.level.of.education_hs -0.3117 1.5396 -0.202 0.839609
## parental.level.of.education_md 9.1215 2.1580 4.227 2.64e-05 ***
## parental.level.of.education_c 3.5386 1.4877 2.379 0.017615 *
## lunch_FALSE -11.7172 0.9756 -12.010 < 2e-16 ***
## test.preparation.course_FALSE -6.4910 0.9866 -6.579 8.57e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.28 on 798 degrees of freedom
## Multiple R-squared: 0.2757, Adjusted R-squared: 0.2648
## F-statistic: 25.31 on 12 and 798 DF, p-value: < 2.2e-16
- This model doesn’t perform very well: R^2 is quite small (~0.28), so the model explains only a small share of the variance.
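Since broom is loaded, glance() gives the same fit statistics as a one-row data frame, which is handy for comparing models side by side (a small sketch, not part of the original run):
# One-row fit summary: r.squared, adj.r.squared, sigma, p.value, ...
glance(linear_model)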
Evaluating the model
predict_linear <- predict(linear_model, test)
linear_test <- test %>%
  mutate(pred = predict_linear) %>%
  mutate(se = (pred - math.score)^2)
rmse_linear <- sqrt(mean(linear_test$se))
rmse_linear
## [1] 12.90526
- For this simple linear regression model, the RMSE is about 12.9. Will other algorithms do better?
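The same evaluation recurs for each model below; as a sketch, it could be wrapped in a helper equivalent to the computation above (the inline version is repeated below for transparency):
# Root-mean-squared error between predictions and actual values
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}
rmse(predict_linear, test$math.score)  # same value as rmse_linear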
Random Forest
Training the model
# Train a random forest with the package defaults (500 trees)
rf_model <- randomForest(
  build_linear_formula("math.score", names(cleaned)[5:16]),
  data = train
)
Evaluating Random forest
predict_rf <- predict(rf_model, test)
rf_test <- test %>%
  mutate(pred = predict_rf) %>%
  mutate(se = (pred - math.score)^2)
rmse_rf <- sqrt(mean(rf_test$se))
rmse_rf
## [1] 13.71061
- Hmmm, looks like random forest performs at about the same level as the linear regression (slightly worse in this run).
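To see which features the forest actually leans on, its built-in variable importance is worth a look (a quick sketch, not in the original analysis):
# IncNodePurity: total decrease in node impurity from splits on each variable
importance(rf_model)
varImpPlot(rf_model)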
KNN
Training the model
# caret's knnreg: x = the 12 dummy columns, y = math.score (default k = 5)
knn_model <- knnreg(
  train[, 5:16],
  train[, 1]
)
Evaluating KNN
predict_knn <- predict(knn_model, test[, 5:16])
knn_test <- test %>%
  mutate(pred = predict_knn) %>%
  mutate(se = (pred - math.score)^2)
rmse_knn <- sqrt(mean(knn_test$se))
rmse_knn
## [1] 14.01039
Trying different k values
# Fit a KNN model with k = n and return its RMSE on the test set
my_knn <- function(n) {
  knn_model_inner <- knnreg(
    train[, 5:16],
    train[, 1],
    k = n)
  predict_knn_inner <- predict(knn_model_inner, test[, 5:16])
  knn_test_inner <- test %>%
    mutate(pred = predict_knn_inner) %>%
    mutate(se = (pred - math.score)^2)
  sqrt(mean(knn_test_inner$se))
}
res <- sapply(3:50, my_knn)
knn_multiple <- data.frame(n = 3:50, result = res)
ggplot(knn_multiple, aes(x = n, y = result)) +
geom_line()
least <- min(knn_multiple$result)
best_k <- knn_multiple[knn_multiple$result == least, 1]
paste("The best k value for the KNN model appears to be", best_k, "with an RMSE of", least)
## [1] "The best k value for the KNN model appears to be 38 with an RMSE of 12.6103025866823"
Conclusion
Running the test multiple times shows that these 3 basic models give an RMSE of roughly 11-13 for math.score.
Referring to a notebook from another Kaggle user, Student Performance Regressor by Joseph Chan, I agree that the features provided in this dataset are insufficient for an accurate prediction of the test scores. Furthermore, predicting a continuous variable from only discrete features isn't a great setup to begin with.
However, based on some of the plots I made above, here are the conclusions I can draw:
- Female students perform better in reading and writing, while male students perform better in mathematics, but the overall averages are approximately the same.
- Students who had the standard lunch performed better overall.
- Students who took the preparation course performed better overall.
- Students in race/ethnicity group E performed better, while students in group A performed slightly worse.
- Students whose parents have only a high school level of education did slightly worse compared to other students.