In the previous chapter, which looked at correlation, we identified several variables as being appropriate for testing the hypotheses generated in the preparation phase. The justification for selection was based on assessing the normality of the variables and their correlation under the general linear statistical model. Records with missing data were investigated and removed from the dataset. In using correlations, we are trying to see if there is a consistent pattern between two variables: does one increase or decrease when the other increases? The other scenario we need to investigate is whether there is a difference in how concepts are experienced by different groupings. For example, one of the hypotheses suggested a difference between the performance of male and female students. Before we can evaluate the nature of that difference, we must first test whether, based on our sample data, these two cohorts represent separate populations in terms of our concept of interest.
In this chapter we will continue the exploration and analysis and select appropriate statistical tests to evaluate these differential effects, with specific reference to the Portuguese Student Performance dataset from Cortez and Silva (2008).
There are two datasets that must be imported:
############
# PART: Import data
############
# Libraries used throughout this chapter for data wrangling, plots, formatted tables and Levene's test.
library(tidyverse)   # dplyr, tidyr, ggplot2
library(kableExtra)  # kbl(), kable_styling()
library(car)         # leveneTest()
# Dataset 1: Import sperformance-dataset
tbl_sperf_all <- read.csv('sperformance-dataset.csv', header = TRUE)
names(tbl_sperf_all)[1] <- 'School' # Fix issue with the name of the first field.
# Dataset 2: Import sperformance-dataset variable description, created by me
tbl_sperf_description_all <- read.csv('TU060_MATH9102_Student_variables_description.csv', header = TRUE)
The choice of test we perform to investigate whether differential effects exist for different groups depends on the measurement level of the variable and the shape of the data. If our data is measured at the interval level and is normally distributed, then we can use the independent samples t-test, provided the observations are independent. If our data is ordinal, or is continuous scale data that does not conform to the normal distribution, then we must use a non-parametric test such as the Mann-Whitney U test.
The other type of test we have is a repeated measures test. This is where we have the same group but take two measurements, the first at time T1 and the second at a later time T2. If the data is measured at the interval level and is normally distributed we can use the paired samples t-test; if the data is not normally distributed, or is ordinal, we can use a Wilcoxon signed-rank test.
In t-testing, the theory is that if the two groups we are investigating come from the same population, then their means and standard deviations for the variable of interest will be the same. If we find that the mean and standard deviation differ beyond our significance level, then we can assume the groups come from two different populations rather than one for the variable of interest. Beyond roughly 120 observations the t-distribution closely approximates the z-distribution.
At this point we can use the t-distribution to calculate a probability and evaluate the null hypothesis that there is no difference in means between the two groups against the alternative hypothesis that there is a difference. With a two-tailed significance level of .05, the rejection region lies beyond the critical values of +/-1.96 (for large samples). If our result falls in the rejection region, our t statistic is one that only 5% of samples drawn from a single population would produce; that 5% chance of rejecting a true null hypothesis is the Type I error rate.
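To make the rejection region concrete, the critical t values and the probability of an observed t statistic can be obtained directly from the t distribution in R. The following is a minimal sketch; the t value and degrees of freedom are illustrative placeholders:
# -------------- Illustration: t distribution rejection region --------------- #
alpha <- 0.05      # two-tailed significance level
df    <- 337       # illustrative degrees of freedom (n1 + n2 - 2)
t_obs <- -2.66     # illustrative observed t statistic
# Critical values marking the two-tailed rejection region
t_crit <- stats::qt(1 - alpha / 2, df = df)
c(lower = -t_crit, upper = t_crit)
# Two-tailed probability of a t statistic at least this extreme under the null hypothesis
p_value <- 2 * stats::pt(-abs(t_obs), df = df)
p_value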
Because the t-test is parametric, it is based on the normal distribution. As such, when selecting the parametric t-test to assess difference, the questions we need to answer are: is the variable of interest measured at the interval level or above, is it approximately normally distributed, are the observations independent, and are the group variances homogeneous?
The relationship between the two groups is quantified by the t statistic. From that, we can calculate the effect size, which is a measure of how different the groups are for the variable of interest.
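As a quick illustration of how the t statistic converts into effect sizes, Cohen's d and eta squared can be derived from t and its degrees of freedom. This is a minimal sketch with illustrative values; the calculation on the real data appears in the t-test code later in this chapter:
# -------------- Illustration: effect sizes derived from a t statistic --------------- #
t_stat <- -2.66   # illustrative t statistic
df     <- 337     # illustrative degrees of freedom
# Cohen's d for two independent groups of (roughly) equal size
cohens_d <- 2 * t_stat / sqrt(df)
# Eta squared: proportion of variance in the outcome explained by group membership
eta_sq <- t_stat^2 / (t_stat^2 + df)
round(c(d = cohens_d, eta.sq = eta_sq), 3)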
When selecting the non-parametric Mann-Whitney U test to assess difference, the criteria are less strict: the variable of interest need only be measured at the ordinal level or above, no assumption of normality is required, but the observations must still be independent.
The relationship between two groups is quantified by the U statistic for the Mann-Whitney test, reported as the W statistic in R because we are using the wilcox.test implementation. From that, we can calculate the effect size, which is a measure of how different the groups are for the variable of interest based on relative rankings.
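For the rank-based test, a commonly reported effect size is r, computed from the standardised test statistic Z and the total sample size. A minimal sketch with illustrative values:
# -------------- Illustration: r effect size from a Mann-Whitney Z --------------- #
z_stat  <- 0.118   # illustrative standardised test statistic (Z)
n_total <- 382     # illustrative total sample size (n1 + n2)
# r = |Z| / sqrt(N); roughly 0.1 small, 0.3 medium, 0.5 large
r_eff <- abs(z_stat) / sqrt(n_total)
round(r_eff, 3)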
If we have a small number of outliers, where "small" is judged against our significance level, then we can tolerate a small loss of accuracy in our alpha level and use parametric tests without significantly increasing the possibility of making an incorrect inference about a hypothesis, while gaining the additional accuracy and power that parametric tests offer over non-parametric tests. In general it is better to retain the higher power of parametric testing if our data can be assumed to be approximately normal. The acceptable range depends on the domain to which the analysis applies.
For assessing variables with outliers or missing values, Tabachnick and Fidell (2007) suggest the heuristic that if missing data represent less than 5% of the total and are missing at random from a large dataset, almost any procedure for handling the missing values yields similar results, including simply omitting the affected records.
From the preparation phase we had the Hypothesis:
HA: Male and Female students will perform differently overall.
Using our student performance dataset we are going to investigate whether there is a significant difference in the mean performance score for students who are male or female. To do this difference test, we have one independent/predictor/input variable that is considered categorical: a nominal binary variable with two possible values, M or F, representing whether the observed student is male or female. We have one continuous dependent variable (grade in each subject) which meets the criterion of being at least interval data. Our goal is to use the independent samples t-test to tell us if there is a significant difference between these two groups (male and female students). The null hypothesis is that there is no difference in performance between male (M) and female (F) students; the alternative hypothesis is that there is a difference. This will be a two-tailed test with a significance level of .05. Observations are independent because they came from different people.
The only remaining questions to answer before selecting which difference test to perform relate to Normality and Homoscedasticity and they are addressed below.
In this instance, as part of our data exploration phase we have already determined that all Performance variables in our dataset can be treated as normal once outliers have been removed. As such these steps are not repeated here but the summary statistics are shown for reference.
| | median | mean | SE.mean | CI.mean.0.95 | var | std.dev | coef.var | std_skew | std_kurt | gt_2sd | gt_3sd |
|---|---|---|---|---|---|---|---|---|---|---|---|
| mG1 | 11 | 11.28614 | 0.1758996 | 0.3459959 | 10.488890 | 3.238656 | 0.2869588 | 1.646218 | -2.505487 | 3.539823 | 0 |
| mG2 | 11 | 11.43953 | 0.1730551 | 0.3404007 | 10.152397 | 3.186283 | 0.2785327 | 1.551482 | -2.093272 | 7.079646 | 0 |
| mG3 | 11 | 11.61947 | 0.1769549 | 0.3480716 | 10.615123 | 3.258086 | 0.2803989 | 1.646578 | -1.701956 | 3.834808 | 0 |
| pG1 | 12 | 12.36578 | 0.1311432 | 0.2579596 | 5.830305 | 2.414602 | 0.1952648 | 1.024839 | -1.390545 | 3.539823 | 0 |
| pG2 | 12 | 12.47493 | 0.1302716 | 0.2562453 | 5.753068 | 2.398555 | 0.1922701 | 2.509813 | -1.315284 | 3.244838 | 0 |
| pG3 | 13 | 12.82301 | 0.1402359 | 0.2758450 | 6.666806 | 2.582016 | 0.2013580 | -1.47584 | 3.064222 | 5.014749 | 0.2949853 |
Below we see the summary statistics, box plots, and Levene's test results for each subject grade across the two groups. The result is non-significant for the exam scores (the value in the Pr(>F) column is more than .05) regardless of whether we centre on the median or the mean. This indicates that the variances are not significantly different (i.e., they are similar and the homogeneity of variance assumption is tenable).
############
# PART: Homoscedasticity
############
# Create a subset dataframe with just the variables of interest.
tbl_sperf_sex_diff <- tbl_sperf_all %>%
select(sex, contains('mG'), contains('pG')) %>%
filter(mG1 != 0,
mG2 != 0,
mG3 != 0,
pG1 != 0,
pG2 != 0,
pG3 != 0) # Filtering records with missing data.
# -------------- Box Plot --------------- #
# Just a little eyeball test of variance and mean to cross-check with Levene's test
tbl_sperf_sex_diff %>%
gather(pG1, mG1, pG2, mG2, pG3, mG3, key = "var", value = "value") %>%
ggplot(aes(x = var, y = value, fill = sex)) +
geom_boxplot() +
theme_bw() +
labs(
y = "Grades",
x = "Performance Variables",
title = "Box Plots to eye ball variance",
subtitle = "Difference testing: Male and Female"
)
# -------------- Create summary statistics --------------- #
tbl_sperf_sex_diff_stats <-
psych::describeBy(tbl_sperf_sex_diff, tbl_sperf_sex_diff$sex, mat = TRUE) %>%
filter(!is.na(skew)) # removes categorical variables.
# Pretty print table
tbl_sperf_sex_diff_stats %>%
kbl(caption = "Summary statistics for Performance (zero scores removed) by student Sex") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | item | group1 | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mG11 | 3 | F | 2 | 173 | 10.83237 | 3.197145 | 10 | 10.65468 | 2.9652 | 5 | 19 | 14 | 0.4188343 | -0.6508696 | 0.2430744 |
| mG12 | 4 | M | 2 | 166 | 11.75904 | 3.223363 | 12 | 11.74627 | 2.9652 | 3 | 19 | 16 | 0.0164799 | -0.5784710 | 0.2501815 |
| mG21 | 5 | F | 3 | 173 | 10.99422 | 3.126215 | 10 | 10.92086 | 2.9652 | 5 | 18 | 13 | 0.2573263 | -0.6044830 | 0.2376817 |
| mG22 | 6 | M | 3 | 166 | 11.90361 | 3.191331 | 12 | 11.82090 | 4.4478 | 5 | 19 | 14 | 0.1460622 | -0.5770168 | 0.2476953 |
| mG31 | 7 | F | 4 | 173 | 11.16185 | 3.230908 | 11 | 11.10072 | 2.9652 | 4 | 19 | 15 | 0.2429386 | -0.4505928 | 0.2456414 |
| mG32 | 8 | M | 4 | 166 | 12.09639 | 3.227212 | 12 | 11.99254 | 2.9652 | 5 | 20 | 15 | 0.2030577 | -0.5528532 | 0.2504802 |
| pG11 | 9 | F | 5 | 173 | 12.86705 | 2.295018 | 13 | 12.84173 | 1.4826 | 7 | 19 | 12 | 0.0070466 | -0.0936779 | 0.1744870 |
| pG12 | 10 | M | 5 | 166 | 11.84337 | 2.432018 | 12 | 11.73134 | 2.9652 | 7 | 18 | 11 | 0.3346958 | -0.4699227 | 0.1887612 |
| pG21 | 11 | F | 6 | 173 | 12.94220 | 2.250868 | 13 | 12.84173 | 2.9652 | 8 | 19 | 11 | 0.3756715 | -0.2234017 | 0.1711303 |
| pG22 | 12 | M | 6 | 166 | 11.98795 | 2.456872 | 12 | 11.85075 | 2.9652 | 7 | 18 | 11 | 0.4232647 | -0.4939750 | 0.1906902 |
| pG31 | 13 | F | 7 | 173 | 13.32370 | 2.307562 | 13 | 13.26619 | 2.9652 | 8 | 19 | 11 | 0.2148588 | -0.5819840 | 0.1754407 |
| pG32 | 14 | M | 7 | 166 | 12.30120 | 2.751242 | 12 | 12.32836 | 2.9652 | 1 | 19 | 18 | -0.2924813 | 1.0350364 | 0.2135378 |
# -------------- Levene's test --------------- #
# Conduct Levene's test for homogeneity of variance from the car package - the null hypothesis is that the variances in the groups are equal, so to
# assume homogeneity we would expect the probability to not be statistically significant.
# Part 1: Iterate through variables with median
variable_count <- 7
result <- list()
for (n in 2:variable_count) {
variable <- colnames(tbl_sperf_sex_diff)[n]
result[[variable]] <-
leveneTest(
y = tbl_sperf_sex_diff[, n],
group = as.factor(tbl_sperf_sex_diff$sex),
center = median
)
}
#Pr(>F) is your probability - in this case it is not statistically significant for any variable so we can assume homogeneity.
print.listof(result)
## mG1 :
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.0552 0.8144
## 337
##
## mG2 :
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.012 0.9128
## 337
##
## mG3 :
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.1175 0.732
## 337
##
## pG1 :
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 1.453 0.2289
## 337
##
## pG2 :
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 1.7574 0.1858
## 337
##
## pG3 :
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 3.086 0.07988 .
## 337
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Part 2: Iterate through variables with mean this time.
variable_count <- 7
result <- list()
for (n in 2:variable_count) {
variable <- colnames(tbl_sperf_sex_diff)[n]
result[[variable]] <-
leveneTest(
y = tbl_sperf_sex_diff[, n],
group = as.factor(tbl_sperf_sex_diff$sex),
center = mean
)
}
# Pr(>F) is your probability - in this case it is not statistically significant for any variable so we can assume homogeneity.
# No difference in outcomes from using mean versus median, which is a good sign.
print.listof(result)
## mG1 :
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 1 0.0121 0.9126
## 337
##
## mG2 :
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 1 0.0202 0.887
## 337
##
## mG3 :
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 1 0.0631 0.8017
## 337
##
## pG1 :
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 1 1.2826 0.2582
## 337
##
## pG2 :
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 1 1.6905 0.1944
## 337
##
## pG3 :
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 1 2.8945 0.08981 .
## 337
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When we compare the t statistic obtained during the test to the standard t distribution, we can see if the value is so unusual that it falls in the tail regions of the distribution (2.5% in each tail for a two-tailed significance level of .05). If we find that this is the case, we can conclude that, in comparison to all other mean differences, the mean difference between our groups is so unusual that it is less likely to be due to random chance and more likely to reflect our alternative hypothesis. In this event we can reject the null hypothesis as the likely explanation for the difference between the two groups.
############
# PART: T-Test
############
# -------------- Conduct the t-test --------------- #
#Conduct the t-test from package stats
#In this case we can use the var.equal = TRUE option to specify equal variances and a pooled variance estimate
variable_count <- 7
tbl_test_result <- data.frame()
tbl_test_effectsize <- data.frame()
for (n in 2:variable_count) {
variable <- colnames(tbl_sperf_sex_diff)[n]
test_result <-
stats::t.test(tbl_sperf_sex_diff[, n]~as.factor(tbl_sperf_sex_diff$sex), var.equal = TRUE) %>%
broom::tidy() %>% as.data.frame()
# Build output table
row.names(test_result) <- variable
tbl_test_result <- rbind(tbl_test_result,test_result)
#---------- Calculate Cohens D Effect size ---------- #
effcd <- round(effectsize::t_to_d(t = test_result$statistic, test_result$parameter),2)
#---------- Calculate Eta Squared Effect size ---------- #
#Eta squared calculation
effes <- round((test_result$statistic^2) / ((test_result$statistic^2) + test_result$parameter), 3)
# Build output table
tbl_merged_effectsize <- merge(effcd,effes)
row.names(tbl_merged_effectsize) <- variable
tbl_test_effectsize <- rbind(tbl_test_effectsize,tbl_merged_effectsize)
}
## tidy up column names
colnames(tbl_test_result)[1] <- 'mean.diff.est'
colnames(tbl_test_result)[2] <- paste('mean', levels(as.factor(tbl_sperf_sex_diff$sex))[1], sep = ".")
colnames(tbl_test_result)[3] <- paste('mean', levels(as.factor(tbl_sperf_sex_diff$sex))[2], sep = ".")
#P-value is your probability - in this case every result was statistically significant at p < .05
# -------------- Pretty Print Test statistics --------------- #
tbl_test_result %>%
kbl(caption = "Summary of T-Test Statistics for the Male and Female student Groups") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | mean.diff.est | mean.F | mean.M | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|---|---|---|
| mG1 | -0.9266662 | 10.83237 | 11.75904 | -2.657017 | 0.0082588 | 337 | -1.6126906 | -0.2406418 | Two Sample t-test | two.sided |
| mG2 | -0.9093948 | 10.99422 | 11.90361 | -2.650216 | 0.0084234 | 337 | -1.5843608 | -0.2344288 | Two Sample t-test | two.sided |
| mG3 | -0.9345358 | 11.16185 | 12.09639 | -2.663740 | 0.0080990 | 337 | -1.6246401 | -0.2444316 | Two Sample t-test | two.sided |
| pG1 | 1.0236785 | 12.86705 | 11.84337 | 3.987135 | 0.0000820 | 337 | 0.5186531 | 1.5287039 | Two Sample t-test | two.sided |
| pG2 | 0.9542447 | 12.94220 | 11.98795 | 3.731071 | 0.0002237 | 337 | 0.4511650 | 1.4573245 | Two Sample t-test | two.sided |
| pG3 | 1.0224946 | 13.32370 | 12.30120 | 3.713155 | 0.0002395 | 337 | 0.4808324 | 1.5641568 | Two Sample t-test | two.sided |
# -------------- Pretty Print Test Effect size --------------- #
tbl_test_effectsize %>%
kbl(caption = "Summary of T-Test Effectsize for the Male and Female student Groups") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | d | CI | CI_low | CI_high | y (eta squared) |
|---|---|---|---|---|---|
| mG1 | -0.29 | 0.95 | -0.50 | -0.07 | 0.021 |
| mG2 | -0.29 | 0.95 | -0.50 | -0.07 | 0.020 |
| mG3 | -0.29 | 0.95 | -0.50 | -0.08 | 0.021 |
| pG1 | 0.43 | 0.95 | 0.22 | 0.65 | 0.045 |
| pG2 | 0.41 | 0.95 | 0.19 | 0.62 | 0.040 |
| pG3 | 0.40 | 0.95 | 0.19 | 0.62 | 0.039 |
A series of independent-samples t-tests were conducted to compare performance scores in Portuguese and Maths for respondents who are Male and those who are Female.
For Maths initial grade a statistically significant difference in the scores was found (M=10.83, SD= 3.20 for respondents who are Female, M= 11.76, SD= 3.22 for respondents who are Male), (t(337)= 2.66 , p < 0.05). Cohen’s d also indicated a small effect size (0.29).
For Maths Intermediate grade a statistically significant difference in the scores was found (M=10.99, SD= 3.13 for respondents who are Female, M= 11.90, SD=3.19 for respondents who are Male), (t(337)= 2.65, p < 0.05). Cohen’s d also indicated a small effect size (0.29).
For Maths final grade a statistically significant difference in the scores was found (M=11.16, SD= 3.23 for respondents who are Female, M= 12.10, SD=3.23 for respondents who are Male), (t(337)=2.66, p < 0.05). Cohen’s d also indicated a small effect size (0.29).
For Portuguese initial grade a statistically significant difference in the scores was found (M=12.87, SD=2.30 for respondents who are Female, M=11.84, SD=2.43 for respondents who are Male), (t(337)= 3.99, p < 0.001). Cohen’s d also indicated a small effect size (0.43).
For Portuguese Intermediate grade a statistically significant difference in the scores was found (M=12.94, SD=2.25 for respondents who are Female, M=11.99, SD=2.46 for respondents who are Male), (t(337)= 3.73, p < 0.001). Cohen’s d also indicated a small effect size (0.41).
For Portuguese final grade a statistically significant difference in the scores was found (M=13.32, SD= 2.31 for respondents who are Female, M= 12.30, SD=2.75 for respondents who are Male), (t(337)=3.71, p < 0.001). Cohen’s d also indicated a small effect size (0.40).
Given these results there is justification to reject the null hypothesis that the differences between the two groups are due to chance. There is justification to accept the alternative hypothesis that mean performance scores for male and female students differ in the population.
From the preparation phase we had the Hypothesis:
HA: Male and Female students will have different levels of absenteeism overall.
Using our student performance dataset we are going to investigate whether there is a significant difference in the ranked absentee scores for students who are male or female.
To do this difference test we have one independent/predictor/input variable that is considered categorical: a nominal variable with two possible values, M or F. We have one dependent/response/outcome variable, the total number of absences, which meets the criterion of being at least ordinal data (it is interval data, but skewed). Our goal is to use the independent samples Mann-Whitney U test to tell us if there is a significant difference between these two groups. The null hypothesis is that there is no difference in absenteeism between male and female students; the alternative hypothesis is that there is a difference. Observations are independent because they came from different people.
The test works by looking at whether there is a similar number of high and low ranks in each group once absenteeism has been ranked across all students. If the pattern of ranks is similar, the null hypothesis holds and there is no difference between the groups; if one group holds a greater share of the high (or low) ranks relative to the other, that would suggest there is a difference between the two groups.
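To illustrate the ranking idea on a small made-up example, the observations from both groups are pooled and ranked, the rank sums per group are compared, and U follows from the rank sum of one group. A minimal sketch (the values are invented for the illustration):
# -------------- Illustration: rank sums behind the Mann-Whitney U test --------------- #
# Made-up absences for two small groups
toy <- data.frame(
  sex      = factor(c("F", "F", "F", "F", "M", "M", "M", "M")),
  absences = c(0, 2, 4, 10, 1, 3, 6, 20)
)
# Rank the pooled values, then sum the ranks within each group
toy$rank <- rank(toy$absences)
rank_sums <- tapply(toy$rank, toy$sex, sum)
rank_sums
# U for group F: rank sum of F minus n_F * (n_F + 1) / 2
n_f <- sum(toy$sex == "F")
u_f <- rank_sums[["F"]] - n_f * (n_f + 1) / 2
u_f
# Cross-check against the built-in implementation (R reports U for the first group as W)
stats::wilcox.test(absences ~ sex, data = toy)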
To illustrate the assessment of normality for data that is not normally distributed, we selected the absences variables for Portuguese and Maths from the sample dataset. The Normal Quantile Plot (Q-Q plot) shows that most observations fall off the reference line, with curves in the distribution of observations for both subjects. As such, the variables are not approximately normal, as we have large clusters and gaps affecting the shape of the distribution.
If we still had any doubts about normality, the next step is to quantify how far from normal the distribution is. To do this, we calculate statistics for skew and kurtosis and standardise them (value/standard error) so we can compare them against heuristics. Standardised skewness scores between +/-2 (1.96 rounded) are considered acceptable in order to assume a normal distribution. Skewness for both variables exceeded our acceptable range, with standardised values of 32.26 for absences in Maths and 17.41 for absences in Portuguese.
In terms of quantifying the proportion of the data that is not normal, we generated standardised z scores for each variable and calculated the percentage of standardised scores falling outside an acceptable range. Neither absences in Maths nor absences in Portuguese were within our acceptable range at the 99.7% level. Based on this assessment, neither variable can be treated as normally distributed. This supports selecting the Mann-Whitney test over the independent samples t-test, as the t-test requires a normal sampling distribution.
| | median | mean | SE.mean | CI.mean.0.95 | var | std.dev | coef.var | std_skew | std_kurt | gt_2sd | gt_3sd |
|---|---|---|---|---|---|---|---|---|---|---|---|
| absences.m | 3 | 5.319372 | 0.3901418 | 0.7671006 | 58.14445 | 7.625251 | 1.433487 | 32.26225 | 106.969 | 3.403141 | 0.7853403 |
| absences.p | 2 | 3.672775 | 0.2510110 | 0.4935404 | 24.06850 | 4.905965 | 1.335765 | 17.41551 | 25.23263 | 6.544503 | 1.570681 |
| | item | group1 | vars | n | median | mad | min | max | range | skew | kurtosis | IQR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| absences.m1 | 3 | F | 2 | 198 | 4 | 5.9304 | 0 | 75 | 75 | 4.057366 | 22.676835 | 7.0 |
| absences.m2 | 4 | M | 2 | 184 | 3 | 4.4478 | 0 | 30 | 30 | 1.531947 | 2.480317 | 8.0 |
| absences.p1 | 5 | F | 3 | 198 | 2 | 2.9652 | 0 | 32 | 32 | 2.416823 | 7.900982 | 5.5 |
| absences.p2 | 6 | M | 3 | 184 | 2 | 2.9652 | 0 | 22 | 22 | 1.788762 | 3.352829 | 6.0 |
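For reference, the percentage of standardised scores falling outside the acceptable range described above can be computed along the lines of the sketch below. The column name absences.m and the cut-offs of 1.96 and 3.29 are assumptions for the illustration:
# -------------- Illustration: percentage of standardised scores beyond a cut-off --------------- #
x <- tbl_sperf_all$absences.m            # assumed column name for Maths absences
z <- abs(scale(x))                       # absolute standardised (z) scores
pct_gt_196 <- 100 * mean(z > 1.96, na.rm = TRUE)   # beyond the 95% range
pct_gt_329 <- 100 * mean(z > 3.29, na.rm = TRUE)   # beyond the 99.9% range
round(c(gt_1.96 = pct_gt_196, gt_3.29 = pct_gt_329), 2)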
Based on the analysis above, a non-parametric independent samples Mann-Whitney U test is selected to evaluate whether two different groups exist. We compare the U statistic obtained during the test to its reference distribution to see if the value is so unusual that it falls in the tail regions of the distribution (2.5% in each tail for a two-tailed significance level of .05). If we find this, we can conclude that, in comparison to all other ranking differences, the ranking difference between the groups under test is so unusual that it has a lower probability of being down to random chance and a higher probability of reflecting our alternative hypothesis. In this event, we reject the null hypothesis as the likely explanation for the difference between the two groups.
############
# PART: wilcox
############
# -------------- Conduct the U-test --------------- #
#Conduct the U-test from package stats
#tbl_sabsence_sex_diff is a subset data frame holding sex plus the two absence variables (absences.m, absences.p)
variable_count <- 3
tbl_test_result <- data.frame()
tbl_test_effectsize <- data.frame()
test_result_zscore <- list()
test_result_reff <- list()
for (n in 2:variable_count) {
variable <- colnames(tbl_sabsence_sex_diff)[n]
test_result <-
stats::wilcox.test(tbl_sabsence_sex_diff[, n] ~ sex, data = tbl_sabsence_sex_diff) %>%
broom::tidy() %>% as.data.frame()
#To calculate Z we can use the Wilcox test from the coin package
test_result_zscore[[variable]] <-
coin::wilcox_test(tbl_sabsence_sex_diff[, n] ~ as.factor(sex), data = tbl_sabsence_sex_diff)
# Build output table
row.names(test_result) <- variable
tbl_test_result <- rbind(tbl_test_result, test_result)
}
#---------- Calculate the R Effect size ---------- #
test_result_reff[['absences.m']] <-
rstatix::wilcox_effsize(absences.m ~ sex, data = tbl_sabsence_sex_diff)
test_result_reff[['absences.p']] <-
rstatix::wilcox_effsize(absences.p ~ sex, data = tbl_sabsence_sex_diff)
# -------------- Pretty Print Test statistics --------------- #
tbl_test_result %>%
kbl(caption = "Summary of U-Test Statistics for the Male and Female student Groups") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | statistic | p.value | method | alternative |
|---|---|---|---|---|
| absences.m | 18341.0 | 0.9063626 | Wilcoxon rank sum test with continuity correction | two.sided |
| absences.p | 18381.5 | 0.8738837 | Wilcoxon rank sum test with continuity correction | two.sided |
print.listof(test_result_zscore)
## absences.m :
##
## Asymptotic Wilcoxon-Mann-Whitney Test
##
## data: tbl_sabsence_sex_diff[, n] by as.factor(sex) (F, M)
## Z = 0.1181, p-value = 0.906
## alternative hypothesis: true mu is not equal to 0
##
##
## absences.p :
##
## Asymptotic Wilcoxon-Mann-Whitney Test
##
## data: tbl_sabsence_sex_diff[, n] by as.factor(sex) (F, M)
## Z = 0.15921, p-value = 0.8735
## alternative hypothesis: true mu is not equal to 0
print.listof(test_result_reff)
## absences.m :
## # A tibble: 1 x 7
## .y. group1 group2 effsize n1 n2 magnitude
## * <chr> <chr> <chr> <dbl> <int> <int> <ord>
## 1 absences.m F M 0.00604 198 184 small
##
## absences.p :
## # A tibble: 1 x 7
## .y. group1 group2 effsize n1 n2 magnitude
## * <chr> <chr> <chr> <dbl> <int> <int> <ord>
## 1 absences.p F M 0.00815 198 184 small
Given these results there is insufficient justification to accept the alternative hypothesis that absence levels for male and female students differ in the population, so we retain the null hypothesis.
Following this theory, if we have a small number of outliers, where "small" is judged against our significance level, then we can tolerate a small loss of accuracy in our alpha value (.05 in this instance) and use parametric tests without significantly increasing the possibility of making an incorrect inference about a hypothesis. This is desirable because parametric tests give us additional accuracy and power compared with non-parametric tests, reducing the risk of making a Type 2 error.
Missing data can be considered a form of outlier: we have a variable but are missing a value for that variable in some records. It is common, particularly when dealing with data about human beings, that not all variables have values in all cases. In the student performance dataset we have at least one zero grade value for 43 students (11.3%), predominantly in the Maths scores. The purpose of this section is to take a deeper look at these variables in terms of their treatment as missing data. The following section outlines the process and criteria for making a decision about these variables, and reports the findings as well as any impact our choices may have had on the hypothesis testing conducted. First, we must quantify the scale of missing data and identify any patterns that may exist.
We have six variables measuring student performance in the Maths and Portuguese subjects. The most common pattern is where all variables have data, and therefore missing data is not an issue. The next most common is where mG3, the final Maths grade, alone is missing (24 cases). This is followed by the case where both mG3 and mG2 are missing (13 cases), then by pG3 alone (3 cases), pG3 and mG3 combined (2 cases), and lastly one case where pG1 was missing. There were no records where all six grades were missing. At first glance there is no real pattern to this missing data, other than that a value missing for mG2 meant it was missing for mG3 also.
##
## Variables sorted by number of missings:
## Variable Count
## mG3 39
## mG2 13
## pG3 5
## pG1 1
## mG1 0
## pG2 0
| | Combinations (mG1:mG2:mG3:pG1:pG2:pG3, 1 = missing) | Count | Percent |
|---|---|---|---|
| 1 | 0:0:0:0:0:0 | 339 | 88.7434555 |
| 4 | 0:0:1:0:0:0 | 24 | 6.2827225 |
| 6 | 0:1:1:0:0:0 | 13 | 3.4031414 |
| 2 | 0:0:0:0:0:1 | 3 | 0.7853403 |
| 5 | 0:0:1:0:0:1 | 2 | 0.5235602 |
| 3 | 0:0:0:1:0:0 | 1 | 0.2617801 |
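The code that produced the missing-data summary above is not shown; one way such a summary could be generated is with the VIM package, after recoding zero grades to NA. This is a sketch only, under those assumptions:
# -------------- Illustration: summarising missing-data patterns (sketch) --------------- #
# Assumes the grade columns exist in tbl_sperf_all and that a zero grade represents missing data.
tbl_grades_na <- tbl_sperf_all %>%
  select(mG1, mG2, mG3, pG1, pG2, pG3) %>%
  mutate(across(everything(), ~ na_if(.x, 0)))   # treat zero grades as missing
missing_summary <- VIM::aggr(tbl_grades_na, plot = FALSE)
summary(missing_summary)   # counts per variable and per combination of variables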
Based on the analysis of missing data above, our working assumption will be that removing the records with zero scores will not have a statistically significant impact on our outcomes. However, when it comes to predictive modelling and inference we will run the model both including and excluding the records with missing values to see whether it makes a difference.
Similar to the t-test, the ANOVA test is based on the normal distribution and uses the mean for each group, but it differs in that it examines the variance around the mean within each group and how that relates to the variation between the groups. Our starting point is the overall mean for the variable of interest; we then look at how different the group means are from it.
In ANOVA testing, the theory is that for the multiple groups, if they are from the same population then the overall mean of the variable of interest will be very close to the mean for each group and the variation around the mean for each group will be similar. If we find that the group means and variance are different with regards to our significance level, then we can assume the groups are from different populations rather than one overall for the variable of interest. As such this is the same idea as used for the independent samples t test.
The output of ANOVA is the F-statistic, which is a ratio and a measure of effect. It is similar to the t statistic in that it compares the amount of systematic variance in the data to the amount of unsystematic variance; in other words, it compares what we observe to what we would expect for a distribution with the same degrees of freedom. If the F ratio is less than 1, it represents a non-significant result and we retain the null hypothesis that there is only one population. If the F-statistic is greater than 1, it indicates that there is some effect above and beyond the effect of individual differences in performance.
If we find an effect we then need to determine whether it is statistically significant. To do this we compare our obtained F-statistic against the largest value one would expect to get by chance alone in a standard F-distribution with the same degrees of freedom. The p-value associated with the F-statistic is the probability that differences between groups this large could occur by chance if the null hypothesis is correct, so we are looking for a low p-value if we want evidence to support the alternative hypothesis. If our result is within the rejection region, it means our F-statistic is one that only 5% of samples drawn from the same population would produce (the F test uses only the upper tail of the distribution); rejecting a true null hypothesis on this basis is a Type I error.
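To make this concrete, the critical F value and the probability of an observed F statistic can be obtained from the F distribution in R. A minimal sketch with illustrative degrees of freedom:
# -------------- Illustration: F distribution critical value and p-value --------------- #
alpha      <- 0.05
df_between <- 4      # illustrative between-groups degrees of freedom (k - 1)
df_within  <- 372    # illustrative within-groups degrees of freedom (N - k)
f_obs      <- 5.09   # illustrative observed F statistic
# Largest F expected by chance alone at the chosen significance level (upper tail only)
f_crit <- stats::qf(1 - alpha, df1 = df_between, df2 = df_within)
# Probability of an F at least this large if the null hypothesis is true
p_value <- stats::pf(f_obs, df1 = df_between, df2 = df_within, lower.tail = FALSE)
round(c(f.crit = f_crit, p.value = p_value), 4)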
ANOVA tests for one overall effect only (this makes it an omnibus test), so it can tell us whether group membership had an effect, but it doesn't provide specific information about which groups differed. To determine this we need to perform post-hoc testing.
From the preparation phase we had the Hypothesis:
HA: Maternal educational achievement has a differential effect on student performance overall.
Using our student performance dataset we are going to investigate whether there is a significant difference in the mean performance score for students whose mothers attained different levels of educational achievement.
To do this test, we require one independent/predictor/input variable that is considered categorical. We will use the ordinal variable Medu representing mother’s education (0 – none, 1 – primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education, 4 – higher education). We have one continuous variable (student grade) which meets the criterion of being at least interval data. Our goal is to use the independent samples ANOVA test to tell us if there is a significant difference between groups based on mother’s education, binned into five groups. The null hypothesis is that there is no difference in performance; the alternative hypothesis is that there is a difference based on mothers’ education. This will be a one-way between-groups ANOVA test with a significance level of .05. Observations are independent because they came from different people. Before selecting the difference test to perform, we address the issues of normality and homoscedasticity.
If the variable of interest is normal then we can use ANOVA. The variable of interest being used to illustrate this methodology is Portuguese final grade, pG3 which has previously been established to follow the normal distribution once outliers have been removed.
| | item | group1 | vars | n | mean | sd | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pG31 | 6 | 0 | 2 | 3 | 12.33333 | 2.309401 | 12.33333 | 0.0000 | 11 | 15 | 4 | 0.3849002 | -2.3333333 | 1.3333333 |
| pG32 | 7 | 1 | 2 | 49 | 12.02041 | 2.609995 | 11.80488 | 2.9652 | 7 | 18 | 11 | 0.6531037 | 0.0995218 | 0.3728565 |
| pG33 | 8 | 2 | 2 | 97 | 12.13402 | 2.356855 | 12.12658 | 1.4826 | 6 | 18 | 12 | -0.0595605 | 0.2001724 | 0.2393024 |
| pG34 | 9 | 3 | 2 | 95 | 12.52632 | 2.949367 | 12.59740 | 2.9652 | 1 | 19 | 18 | -0.5567336 | 1.3561680 | 0.3025987 |
| pG35 | 10 | 4 | 2 | 133 | 13.44361 | 2.291002 | 13.42991 | 2.9652 | 8 | 19 | 11 | 0.0308614 | -0.4584421 | 0.1986551 |
Below we see box plots and Bartlett's test results for each group. The output of Bartlett's test for the Portuguese final grade is shown below. The result is non-significant (the p-value is greater than .05), which indicates that the variances are not significantly different (i.e., they are similar and the homogeneity of variance assumption is tenable). We will therefore use Tukey for our post-hoc test.
############
# PART: Homoscedasticity
############
# -------------- Box Plot --------------- #
# Just a little eyeball test of variance and mean to cross-check with Bartlett's test
tbl_sperf_medu_diff %>%
gather(pG3, key = "var", value = "value") %>%
ggplot(aes(x = var, y = value, fill = value)) +
geom_boxplot() +
theme_bw() +
labs(
y = "Grades",
x = "Performance Variables",
title = "Box Plots to eye ball variance",
subtitle = "Difference testing: Mothers education"
) + facet_wrap(~Medu)
# -------------- Bartlett's test --------------- #
# Conduct Bartlett's test for homogeneity of variance (stats::bartlett.test) - the null hypothesis is that the variances in the groups are equal, so to
# assume homogeneity we would expect the probability to not be statistically significant.
result <- list()
result[["pG3"]] <- stats::bartlett.test(pG3 ~ Medu, data = tbl_sperf_medu_diff)
print.listof(result) # The p-value is not statistically significant, so homogeneity of variance can be assumed.
Based on the analysis above it is safe to select a one-way ANOVA test to evaluate whether different groups exist. For the post-hoc test we will assume the variances are equal and use Tukey to determine which groups differ.
When we compare the F-statistic obtained during the test to the standard F distribution, we want to see if our value is in the upper tail of the distribution (significance level of .05). If we find this, we can conclude that, in comparison to all other possible samples, the ratio of between-group to within-group variance under test is so unusual that it has a lower probability of being down to random chance and a higher probability of reflecting our alternative hypothesis. In this event we can reject the null hypothesis as the likely explanation for the difference between the groups.
############
# PART: ANOVA
############
# -------------- Conduct the ANOVA --------------- #
#Conduct ANOVA using the userfriendlyscience test oneway
#In this case we can use Tukey as the post-hoc test option since variances in the groups are equal
#If variances were not equal we would use Games-Howell
userfriendlyscience::oneway(x = tbl_sperf_medu_diff$Medu,y=tbl_sperf_medu_diff$pG3,posthoc='Tukey')
## ### Oneway Anova for y=pG3 and x=Medu (groups: 0, 1, 2, 3, 4)
##
## Omega squared: 95% CI = [.01; .09], point estimate = .04
## Eta Squared: 95% CI = [.01; .08], point estimate = .05
##
## SS Df MS F p
## Between groups (error + effect) 130.39 4 32.6 5.09 .001
## Within groups (error only) 2381.42 372 6.4
##
##
## ### Post hoc test: Tukey
##
## diff lwr upr p adj
## 1-0 -0.31 -4.44 3.81 1.000
## 2-0 -0.2 -4.27 3.87 1.000
## 3-0 0.19 -3.87 4.26 1.000
## 4-0 1.11 -2.94 5.16 .944
## 2-1 0.11 -1.1 1.33 .999
## 3-1 0.51 -0.71 1.73 .787
## 4-1 1.42 0.26 2.58 .007
## 3-2 0.39 -0.61 1.39 .820
## 4-2 1.31 0.38 2.24 .001
## 4-3 0.92 -0.01 1.85 .056
## P-value < .001 so this is statistically significant result between groups.
#use the aov function - same as one way but makes it easier to access values for reporting
test_result <- stats::aov(pG3~Medu, data = tbl_sperf_medu_diff)
#Get the F statistic into a variable to make reporting easier
test_result_fstat<-summary(test_result)[[1]][["F value"]][[1]]
#Get the p value into a variable to make reporting easier
test_result_aovpvalue<-summary(test_result)[[1]][["Pr(>F)"]][[1]]
#---------- Calculate Eta Squared Effect size ---------- #
#In the report we use the aov result (test_result) to retrieve the degrees of freedom
#and the eta_sq function from the sjstats package to calculate the effect size
test_result_aoveta<-sjstats::eta_sq(test_result)[2]
There was a statistically significant difference in Portuguese final scores across the five maternal education groups: F(4, 372) = 5.09, p = .001. Despite reaching statistical significance, the actual difference in mean scores between groups was quite small; the effect size, calculated using eta squared, was .05 (omega squared = .04). Post-hoc comparisons using the Tukey HSD test indicated that the mean score for Group 4 (M=13.44, SD=2.29) was significantly different from Group 1 (M=12.02, SD=2.61) and Group 2 (M=12.13, SD=2.36).
Based on this analysis we can assume that mother’s educational level has a differential effect on student performance, with the difference driven by students whose mothers attained the highest level (higher education) scoring above those whose mothers attained lower levels.
If our data is ordinal, or is continuous scale data that does not conform to the normal distribution, then we cannot use the one-way ANOVA test to establish whether there is a difference between groups. The Kruskal-Wallis test (Kruskal & Wallis, 1952) is the non-parametric counterpart of the one-way independent ANOVA. The theory supporting it is similar to the Mann-Whitney test, in that it uses ranked data: the values of the variable are ranked across all groups and the sums of the ranks in each group are used to calculate the test statistic, which is referred to as the Kruskal-Wallis chi-squared or the H statistic.
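As a small illustration of this on made-up data, the H statistic (reported by R as the Kruskal-Wallis chi-squared) can be obtained with the built-in kruskal.test:
# -------------- Illustration: Kruskal-Wallis on made-up data --------------- #
toy_kw <- data.frame(
  traveltime = factor(rep(c("1", "2", "3"), each = 4)),
  absences   = c(0, 2, 4, 6, 1, 3, 5, 9, 2, 8, 10, 12)
)
# H statistic, its degrees of freedom (k - 1) and p-value
stats::kruskal.test(absences ~ traveltime, data = toy_kw)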
From the preparation phase we had the Hypothesis:
HA: Time spent travelling to school has a differential effect on student absence overall.
Using our student performance dataset we are going to investigate whether there is a significant difference in school attendance for students based on time travelled to school.
To do this difference test, we require one independent variable that is considered categorical. We will use the ordinal variable traveltime representing home to school travel time (numeric: 1 – < 15 min., 2 – 15 to 30 min., 3 – 30 min. to 1 hour, 4 – > 1 hour). We have one continuous dependent variable (absences), which has already been established as being non-normally distributed. Our goal is to use the Kruskal-Wallis test to tell us if there is a significant difference between these four groups. The null hypothesis is that there is no difference in absenteeism level; the alternative hypothesis is that there is a difference based on time travelled to school. This is a one-way between-groups design with a significance level of .05. Observations are independent because they came from different people.
| | item | group1 | vars | n | median | mad | min | max | range | skew | kurtosis | IQR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| absences.p1 | 5 | 1 | 2 | 250 | 2 | 2.9652 | 0 | 32 | 32 | 2.095858 | 5.6191998 | 5.5 |
| absences.p2 | 6 | 2 | 2 | 102 | 2 | 2.9652 | 0 | 30 | 30 | 2.324534 | 7.1534150 | 6.0 |
| absences.p3 | 7 | 3 | 2 | 22 | 4 | 2.9652 | 0 | 16 | 16 | 1.430468 | 1.5942880 | 3.5 |
| absences.p4 | 8 | 4 | 2 | 8 | 2 | 2.9652 | 0 | 8 | 8 | 1.010057 | -0.2763264 | 2.5 |
Based on the analysis above, it is reasonable to select the non-parametric Kruskal-Wallis test to evaluate whether different groups exist within the data. The test provides an H statistic, which is compared against the chi-squared distribution with k - 1 degrees of freedom. If our H statistic falls in the upper tail of that distribution (significance level of .05), we can conclude that, in comparison to all other ranking differences, the ranking difference between the groups under test has a lower probability of being down to random chance and a higher probability of reflecting our alternative hypothesis. In this event we can reject the null hypothesis as the likely explanation for the difference between the groups.
############
# PART: Kruskal-Wallis
############
# -------------- Conduct the Kruskal-Wallis --------------- #
test_result <-
stats::kruskal.test(absences.p~traveltime.p,data=tbl_sabsence_traveltime_diff)
# -------------- Conduct Post Hoc test --------------- #
#Need library FSA to run the post-hoc tests
test_result_post_hoc <- FSA::dunnTest(x=tbl_sabsence_traveltime_diff$absences.p, g=as.factor(tbl_sabsence_traveltime_diff$traveltime.p), method="bonferroni")
print(test_result_post_hoc, dunn.test.results = TRUE)
## Kruskal-Wallis rank sum test
##
## data: x and g
## Kruskal-Wallis chi-squared = 3.238, df = 3, p-value = 0.36
##
##
## Comparison of x by g
## (Bonferroni)
## Col Mean-|
## Row Mean | 1 2 3
## ---------+---------------------------------
## 2 | -0.723587
## | 1.0000
## |
## 3 | -1.643534 -1.193175
## | 0.6016 1.0000
## |
## 4 | 0.428276 0.650503 1.257850
## | 1.0000 1.0000 1.0000
##
## alpha = 0.05
## Reject Ho if p <= alpha
#---------- calculate the effect size eta squared -------------------- #
test_result_effsize <- rstatix::kruskal_effsize(tbl_sabsence_traveltime_diff, absences.p ~ traveltime.p,
ci = FALSE, conf.level = 0.95, ci.type = "perc", nboot = 1000) # bootstrap options only apply when ci = TRUE
print(test_result_effsize)
## # A tibble: 1 x 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 absences.p 382 0.000630 eta2[H] small
There was no statistically significant difference between the travel-time groups (H(3) = 3.24, p = .36), and the differences between groups were quite small. The effect size, calculated using eta squared, was very small (0.0006). Post-hoc testing was conducted using Dunn's test with Bonferroni correction, and this also confirmed there was no difference between any pair of groups.
Based on this analysis, insufficient evidence has been found to accept the alternative hypothesis; we therefore retain the null hypothesis that any differences between the groups are due to random chance.
We test whether a difference effect exists for nominal variables using the Chi-squared test. When we compare the Chi-squared statistic obtained during the test to the standard Chi-squared distribution, we can see whether the value falls in the upper tail of the distribution (significance level of .05). If we find this to be the case, we can conclude that the difference between the groups under test is so unusual that it is unlikely to be due to random chance and more likely to reflect our alternative hypothesis.
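Under the null hypothesis of no association, the expected count for each cell is the row total multiplied by the column total, divided by the overall total, and the Chi-squared statistic sums the squared deviations of observed from expected counts, each scaled by the expected count. A minimal sketch using the observed counts from the cross-tabulation shown further below:
# -------------- Illustration: expected counts and the Chi-squared statistic --------------- #
# Observed counts for sex by activities.p (taken from the cross table output below)
observed <- matrix(c(105, 93,
                     77, 107),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(sex = c("F", "M"), activities.p = c("no", "yes")))
# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
# Chi-squared statistic (without Yates' continuity correction)
chi_sq <- sum((observed - expected)^2 / expected)
round(expected, 3)
round(chi_sq, 3)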
We have the Hypothesis:
HA: There are differences between extra-curricular activities engagement for respondents who are male or female
Using our student performance dataset we investigate if there is a significant difference in extra-curricular activities engagement for students who are male and those who are female. We have two binary categorical variables: student sex (sex) and after school activity participation (activities.p). A variable measuring after school activity appears twice in our dataset because it was collected twice, once in each class. For the purposes of illustrating nominal difference evaluation using Chi-Square we will use activities.p as our variable measuring after school activity, but we later investigate if there was significant difference between the repeated measures of this variable.
############
# PART: Chi
############
# -------------- Conduct the Chi-Square --------------- #
#Use the Crosstable function
#CrossTable(predictor, outcome, fisher = TRUE, chisq = TRUE, expected = TRUE)
gmodels::CrossTable(tbl_sactivity_sex_diff$sex, tbl_sactivity_sex_diff$activities.p, fisher = TRUE, chisq = TRUE, expected = TRUE, sresid = TRUE, format = "SPSS")
##
## Cell Contents
## |-------------------------|
## | Count |
## | Expected Values |
## | Chi-square contribution |
## | Row Percent |
## | Column Percent |
## | Total Percent |
## | Std Residual |
## |-------------------------|
##
## Total Observations in Table: 382
##
## | tbl_sactivity_sex_diff$activities.p
## tbl_sactivity_sex_diff$sex | no | yes | Row Total |
## ---------------------------|-----------|-----------|-----------|
## F | 105 | 93 | 198 |
## | 94.335 | 103.665 | |
## | 1.206 | 1.097 | |
## | 53.030% | 46.970% | 51.832% |
## | 57.692% | 46.500% | |
## | 27.487% | 24.346% | |
## | 1.098 | -1.047 | |
## ---------------------------|-----------|-----------|-----------|
## M | 77 | 107 | 184 |
## | 87.665 | 96.335 | |
## | 1.297 | 1.181 | |
## | 41.848% | 58.152% | 48.168% |
## | 42.308% | 53.500% | |
## | 20.157% | 28.010% | |
## | -1.139 | 1.087 | |
## ---------------------------|-----------|-----------|-----------|
## Column Total | 182 | 200 | 382 |
## | 47.644% | 52.356% | |
## ---------------------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 4.781025 d.f. = 1 p = 0.02877499
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 4.343239 d.f. = 1 p = 0.03715617
##
##
## Fisher's Exact Test for Count Data
## ------------------------------------------------------------
## Sample estimate odds ratio: 1.567009
##
## Alternative hypothesis: true odds ratio is not equal to 1
## p = 0.03160227
## 95% confidence interval: 1.02609 2.39997
##
## Alternative hypothesis: true odds ratio is less than 1
## p = 0.9890319
## 95% confidence interval: 0 2.24789
##
## Alternative hypothesis: true odds ratio is greater than 1
## p = 0.01849912
## 95% confidence interval: 1.094444 Inf
##
##
##
## Minimum expected frequency: 87.66492
#A simpler way of doing the Chi-Square test
#Create your contingency table
contingency_table <-xtabs(~activities.p+sex, data=tbl_sactivity_sex_diff)
ctest_test_result <-stats::chisq.test(contingency_table, correct=TRUE)#chi square test
#correct=TRUE to get Yates correction needed for 2x2 table
# -------------- Calculate the effect Size --------------- #
ctest_test_result$chi_effphi <- sjstats::phi(contingency_table)
ctest_test_result$chi_effcramer <- sjstats::cramer(contingency_table)
print.listof(ctest_test_result)
## statistic :
## X-squared
## 4.343239
##
## parameter :
## df
## 1
##
## p.value :
## [1] 0.03715617
##
## method :
## [1] "Pearson's Chi-squared test with Yates' continuity correction"
##
## data.name :
## [1] "contingency_table"
##
## observed :
## sex
## activities.p F M
## no 105 77
## yes 93 107
##
## expected :
## sex
## activities.p F M
## no 94.33508 87.66492
## yes 103.66492 96.33508
##
## residuals :
## sex
## activities.p F M
## no 1.098047 -1.139055
## yes -1.047470 1.086589
##
## stdres :
## sex
## activities.p F M
## no 2.186556 -2.186556
## yes -2.186556 2.186556
##
## chi_effphi :
## [1] 0.1118739
##
## chi_effcramer :
## [1] 0.1118739
A Chi-Square test for independence (with Yates' continuity correction) indicated a significant association between sex and reported participation in after school activities, χ2(1, n = 382) = 4.34, p < .05, phi = .11. As such we reject the null hypothesis and accept the alternative hypothesis that there is a difference in after school activity engagement between male and female students. The odds of attending after school activities were 1.57 times higher for male students than for female students.
So far we have only looked at tests for independent groups, but as mentioned there is another type of test for related samples: the repeated measures test. This is where we have the same group but take two measurements, the first at time T1 and the second at a later time T2. If the data is measured at the interval level and is normally distributed we can use the paired samples t-test; if the data is not normally distributed, or is ordinal, we use a Wilcoxon signed-rank test; if our data is nominal with two repeated measurements we use McNemar's test; and for more than two repeated measurements of an ordinal or non-normal variable we use the Friedman ANOVA.
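None of the paired tests for continuous or ordinal data are needed for our dataset, but for reference a repeated measures comparison could be run along the following lines. The scores at T1 and T2 are made up for the sketch:
# -------------- Illustration: repeated measures tests on made-up data --------------- #
# Made-up scores for the same ten students at time T1 and time T2
t1 <- c(8, 10, 7, 12, 9, 11, 6, 13, 10, 9)
t2 <- c(9, 12, 10, 16, 14, 17, 13, 21, 19, 19)
# Paired samples t-test (interval data, approximately normal differences)
stats::t.test(t1, t2, paired = TRUE)
# Wilcoxon signed-rank test (ordinal data or non-normal differences)
stats::wilcox.test(t1, t2, paired = TRUE)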
McNemar’s test is used to determine whether there are differences on a binary dependent variable between two related measurements. It can be considered similar to the paired-samples t-test, but for a binary nominal variable rather than a continuous scale variable. If we had more than two repeated measurements of the nominal variable, we could use Cochran’s Q test.
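McNemar's test uses only the discordant pairs, the cases whose response changed between the two measurements. With b cases changing from no to yes and c cases changing from yes to no, the continuity-corrected statistic is (|b - c| - 1)^2 / (b + c) on one degree of freedom. A minimal sketch with illustrative counts:
# -------------- Illustration: McNemar's statistic from discordant pairs --------------- #
b  <- 2   # illustrative count changing from "no" at T1 to "yes" at T2
c_ <- 3   # illustrative count changing from "yes" at T1 to "no" at T2
# Continuity-corrected McNemar chi-squared on 1 degree of freedom
chi_sq_mcnemar <- (abs(b - c_) - 1)^2 / (b + c_)
p_value <- stats::pchisq(chi_sq_mcnemar, df = 1, lower.tail = FALSE)
round(c(chi.sq = chi_sq_mcnemar, p = p_value), 4)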
As mentioned previously, students were asked twice about their engagement in after school activities, with a time difference between the surveys. Due to how the demographic survey was administered, there are two variables in the dataset capturing after school activity participation: one collected during the Maths class and the other collected during the Portuguese class, at different times. As such, it is valid for the same student to have different responses to the question, and their circumstances may have changed between surveys. An inspection of the data revealed that only 5 records contained different responses for activities.p versus activities.m.
HA: There are differences in extra-curricular activities engagement for respondents between measurements.
Using our student performance dataset we investigate whether there is a significant difference in extra-curricular activities engagement for students between measurements. We have one binary categorical variable, after school activity participation, measured twice (activities.p, activities.m), which meets our requirement of at least one dichotomous variable. We have two related measurements of the same students, giving matched-pair before-and-after observations. The null hypothesis is that there is no difference between the two measurements, while the alternative hypothesis is that there is. To our knowledge no treatment happened between the repeated measures, so our expectation is to find no difference between the two measurements for this variable, with any variation between them expected to be due to chance or changed individual circumstances rather than a systematic effect.
We will use McNemar's test to determine whether the proportion of participants who participated in after school activities (as opposed to those who did not) differed between the first and second survey. This will provide supporting evidence to justify disregarding the differences between survey responses for this variable in our predictive model. While we do not expect to find a significant result, we have included this test to illustrate repeated measures nominal difference evaluation using Chi-Square and McNemar's test.
############
# PART: mcnemar = TRUE
############
# -------------- Conduct the Chi-Square with mcnemar = TRUE --------------- #
#Use the Crosstable function
#CrossTable(predictor, outcome, fisher = TRUE, chisq = TRUE, expected = TRUE, mcnemar = TRUE)
gmodels::CrossTable(tbl_sactivity_diff$activities.m, tbl_sactivity_diff$activities.p, mcnemar = TRUE, expected = TRUE, sresid = TRUE, prop.chisq = FALSE, format = "SPSS")
##
## Cell Contents
## |-------------------------|
## | Count |
## | Expected Values |
## | Row Percent |
## | Column Percent |
## | Total Percent |
## | Std Residual |
## |-------------------------|
##
## Total Observations in Table: 382
##
## | tbl_sactivity_diff$activities.p
## tbl_sactivity_diff$activities.m | no | yes | Row Total |
## --------------------------------|-----------|-----------|-----------|
## no | 179 | 2 | 181 |
## | 86.236 | 94.764 | |
## | 98.895% | 1.105% | 47.382% |
## | 98.352% | 1.000% | |
## | 46.859% | 0.524% | |
## | 9.989 | -9.529 | |
## --------------------------------|-----------|-----------|-----------|
## yes | 3 | 198 | 201 |
## | 95.764 | 105.236 | |
## | 1.493% | 98.507% | 52.618% |
## | 1.648% | 99.000% | |
## | 0.785% | 51.832% | |
## | -9.479 | 9.043 | |
## --------------------------------|-----------|-----------|-----------|
## Column Total | 182 | 200 | 382 |
## | 47.644% | 52.356% | |
## --------------------------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 362.2236 d.f. = 1 p = 9.23437e-81
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 358.3293 d.f. = 1 p = 6.506784e-80
##
##
## McNemar's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 0.2 d.f. = 1 p = 0.6547208
##
## McNemar's Chi-squared test with continuity correction
## ------------------------------------------------------------
## Chi^2 = 0 d.f. = 1 p = 1
##
##
## Minimum expected frequency: 86.2356
#A simpler way of doing the McNemar test
#Create your contingency table
contingency_table <-xtabs(~activities.p+activities.m, data=tbl_sactivity_diff)
ctest_test_result <-stats::mcnemar.test(contingency_table, correct=TRUE) #mcnemar
#correct=TRUE to get Yates correction needed for 2x2 table
# -------------- Calculate the effect Size --------------- #
ctest_test_result$chi_effphi <- sjstats::phi(contingency_table)
ctest_test_result$chi_effcramer <- sjstats::cramer(contingency_table)
print.listof(ctest_test_result)
## statistic :
## McNemar's chi-squared
## 0
##
## parameter :
## df
## 1
##
## p.value :
## [1] 1
##
## method :
## [1] "McNemar's Chi-squared test with continuity correction"
##
## data.name :
## [1] "contingency_table"
##
## chi_effphi :
## [1] 0.9737707
##
## chi_effcramer :
## [1] 0.9737707
A McNemar’s chi-squared repeated measures test for difference (with continuity correction) indicated no significant change in after school activity participation, χ2(1, n = 382) = 0, p = 1. As such we retain the null hypothesis and reject the alternative hypothesis that there is a difference in after school activity engagement between the repeated measures. The phi coefficient of .97 for the contingency table reflects the very strong agreement between the two measurements, and the odds of attending after school activities were effectively the same at both measurements.
Cortez, P., & Silva, A. (2008). Using data mining to predict secondary school student performance. In A. Brito & J. Teixeira (Eds.), Proceedings of 5th Future Business Technology Conference (FUBUTEC 2008) (pp. 5-12). Porto, Portugal: EUROSIS. ISBN 978-9077381-39-7. https://repositorium.sdum.uminho.pt/bitstream/1822/8024/1/student.pdf
Cohen, J. (1988). Set correlation and contingency tables. Applied Psychological Measurement, 12(4), 425-434. https://doi.org/10.1177/014662168801200410
George, D., & Mallery, P. (2003). SPSS for Windows step-by-step: A simple guide and reference, 14.0 update (7th ed.). http://lst-iiep.iiep-unesco.org/cgi-bin/wwwi32.exe/[in=epidoc1.in]/?t2000=026564/(100).
Tabachnick, B. G., Fidell, L. S., & Ullman, J. B. (2007). Using multivariate statistics (5th ed., pp. 481-498). Boston, MA: Pearson.