Second Language Acquisition Experiment II

In my previous post, I wrote about a Second Language Acquisition experiment I worked on last semester with the director of USC’s Spanish language program. In this follow-up, I’ll discuss how I analyzed the data. In other words, get ready for some pictures. 📊 🤓

First, a refresher: Participants took a pre-test on conditional sentences in Spanish. They were divided into 5 groups. Each group watched a different video explaining conditional sentences (except for the control group, which watched no video) and then took a post-test on the conditional. The director wanted to see whether these videos had significantly different effects on the change in students’ test scores.

In my previous post, I imported and cleaned the data. The DataFrame I’d made contained biographical data for each participant, along with their treatment group and their answers to the pre- and post-tests. The column names for the test questions encoded information about each question (whether it came from the pre- or post-test, the type of question, etc.).

For reference, here are the column names.

print(inner_merge.columns)
Index(['timestamp_pre', 'email', 'first_name', 'last_name', 'eng_native',
       'treatment', 'pre_02_ds_0_1', 'pre_18_hp_0_0', 'pre_23_rl_0_0',
       'pre_25_hp_0_0',
       ...
       'pos_11_rl_1_1', 'pos_08_hp_0_1', 'pos_31_rl_1_1', 'pos_29_hp_0_1',
       'pos_11_rl_0_1', 'pos_27_hp_0_0', 'pos_02_ds_0_1', 'pos_22_hp_0_1',
       'pos_30_hp_0_1', 'pos_04_ds_0_1'],
      dtype='object', length=143)
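
Each test column packs several fields into its name. Here’s a rough sketch of how a name like pre_02_ds_0_1 could be unpacked; the question-number and trailing-digit interpretations are guesses on my part, and only the pre_/pos_ prefix actually matters for the analysis below.

def parse_question_column(name):
    """Split a test column name like 'pre_02_ds_0_1' into its parts."""
    phase, number, qtype, flag_a, flag_b = name.split('_')
    return {
        'phase': phase,             # 'pre' or 'pos'
        'number': int(number),      # presumably the question number
        'type': qtype,              # question-type code ('ds', 'hp', 'rl', ...)
        'flags': (flag_a, flag_b),  # trailing digits; their meaning isn't needed here
    }

print(parse_question_column('pre_02_ds_0_1'))
# {'phase': 'pre', 'number': 2, 'type': 'ds', 'flags': ('0', '1')}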

Pre- and Post-test Scores

My first goal was to isolate the pre- and post-test scores. I filtered the DataFrame for columns whose names began with pre_, then pos_, summed each participant’s answers, and put the totals (along with the treatment group) into a new DataFrame named analysis.

import pandas as pd
import numpy as np

column_list = inner_merge.columns.tolist()

# keep only the columns holding pre-test answers, then post-test answers
pre_criteria = [x for x in column_list if x.startswith('pre_')]
pre_totals = inner_merge[pre_criteria]

pos_criteria = [x for x in column_list if x.startswith('pos_')]
pos_totals = inner_merge[pos_criteria]

analysis = pd.DataFrame()
analysis['treatment'] = inner_merge['treatment']
analysis['pre_total'] = pre_totals.sum(axis=1)
analysis['pos_total'] = pos_totals.sum(axis=1)

print(analysis.head(), '\n')
print(analysis.info())
  treatment  pre_total  pos_total
0         2         39         49
1         3         34         46
2         1         63         81
3         2         36         26
4         1         78        100 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 430 entries, 0 to 429
Data columns (total 3 columns):
treatment    430 non-null object
pre_total    430 non-null int64
pos_total    430 non-null int64
dtypes: int64(2), object(1)
memory usage: 13.4+ KB
None

When I looked at the means for the pre- and post-tests, it seemed like students did better on the post-test overall. However, the standard deviations were wide, so it wasn’t obvious from the summary statistics alone whether those differences were significant.

print('MEAN PRE AND POST SCORES \n', analysis[['pre_total', 'pos_total']].mean(), '\n')
print('STD PRE AND POST SCORES \n',analysis[['pre_total', 'pos_total']].std(), '\n')
print('MEAN PRE AND POST SCORES BY GROUP \n', analysis.groupby('treatment').mean(), '\n')
print('STD PRE AND POST SCORES BY GROUP \n', analysis.groupby('treatment').std())
MEAN PRE AND POST SCORES 
pre_total    48.123256
pos_total    60.251163
dtype: float64 

STD PRE AND POST SCORES 
pre_total    16.028854
pos_total    22.544898
dtype: float64 

MEAN PRE AND POST SCORES BY GROUP 
           pre_total  pos_total
treatment                      
1          49.256410  62.333333
2          47.988889  62.366667
3          49.556818  63.954545
4          45.725275  56.868132
5          48.313253  55.783133 

STD PRE AND POST SCORES BY GROUP 
           pre_total  pos_total
treatment                      
1          14.871253  24.955371
2          17.039607  23.884765
3          14.985708  20.671436
4          16.628667  21.857726
5          16.423457  20.512369
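
If I’d wanted to test the overall pre-to-post change directly, a paired t-test would be the natural tool, since each participant contributes one pre-test and one post-test score. Here’s a minimal sketch with scipy; the between-group question, which is what the experiment is really about, is left for the ANOVA below.

from scipy import stats

# Paired t-test: did scores change from pre-test to post-test overall?
t, p = stats.ttest_rel(analysis['pre_total'], analysis['pos_total'])
print('t-statistic: {}'.format(t))
print('p-value: {}'.format(p))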

Visualization

OK, picture time. I started with a series of matplotlib histograms of pre- and post-test scores for each treatment group. At first glance, it appeared that students improved on the post-test on average, but there was also a wider range of scores on the post-test. In other words, students did better overall, but a few students did worse individually. (That, or they were so tired at the end of the experiment that they rushed through the post-test.)

import matplotlib.pyplot as plt

for t in sorted(analysis['treatment'].unique()):
    
    df = analysis[analysis['treatment'] == t]
    
    fig = plt.figure(figsize=(10.00, 3.00))
    
    print('TREATMENT GROUP:', t)
    
    pre_ax = fig.add_subplot(1,2,1)
    pos_ax = fig.add_subplot(1,2,2)
    
    pre_ax.hist(df['pre_total'])
    pos_ax.hist(df['pos_total'])
    
    pre_ax.set_xlabel('pre-test score')
    pos_ax.set_xlabel('post-test score')
    
    pre_ax.set_xlim((0, 100))
    pos_ax.set_xlim((0, 100))
    pre_ax.set_ylim((0, 25))
    pos_ax.set_ylim((0, 25))
    
    fig.tight_layout()
    
    plt.show()

TREATMENT GROUP: 1
[histograms of pre- and post-test scores]

TREATMENT GROUP: 2
[histograms of pre- and post-test scores]

TREATMENT GROUP: 3
[histograms of pre- and post-test scores]

TREATMENT GROUP: 4
[histograms of pre- and post-test scores]

TREATMENT GROUP: 5
[histograms of pre- and post-test scores]

Looking at the same data in a boxplot seemed to support the story that pre- and post-test scores were essentially the same across treatments.

analysis.boxplot(['pre_total','pos_total'], by = 'treatment')
plt.show()

[boxplots of pre- and post-test scores, grouped by treatment]

Next, I looked at the difference between students’ pre- and post-test scores. Here, I used a boxplot (via pandas’ matplotlib-based .boxplot()) and a swarm plot from seaborn, which I’d learned about in DataCamp’s Introduction to Data Visualization with Python course.

analysis["test_dif"] = analysis['pos_total'] - analysis["pre_total"]

analysis.boxplot(['test_dif'], by='treatment')
plt.xlabel('teatment')
plt.ylabel('change in test score')
plt.show()

[boxplot of change in test score by treatment]

import seaborn as sns

sns.swarmplot(x='treatment', y='test_dif', data=analysis)
plt.xlabel('treatment')
plt.ylabel('change in test score')
plt.show()

[swarm plot of change in test score by treatment]

These plots didn’t suggest significant differences between the treatment groups, but I decided to look at the empirical cumulative distribution function (ECDF) to make sure. I’d learned about this in DataCamp’s Statistical Thinking in Python course.

# Function for calculating ECDF

def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""

    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: x
    x = np.sort(data)

    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n

    return x, y

# Compute ECDFs

x_1, y_1 = ecdf(analysis.loc[analysis['treatment'] == '1', 'test_dif'])
x_2, y_2 = ecdf(analysis.loc[analysis['treatment'] == '2', 'test_dif'])
x_3, y_3 = ecdf(analysis.loc[analysis['treatment'] == '3', 'test_dif'])
x_4, y_4 = ecdf(analysis.loc[analysis['treatment'] == '4', 'test_dif'])
x_5, y_5 = ecdf(analysis.loc[analysis['treatment'] == '5', 'test_dif'])

# Plot ECDFs

_ = plt.plot(x_1, y_1, marker='.', linestyle = 'none')
_ = plt.plot(x_2, y_2, marker='.', linestyle = 'none')
_ = plt.plot(x_3, y_3, marker='.', linestyle = 'none')
_ = plt.plot(x_4, y_4, marker='.', linestyle = 'none')
_ = plt.plot(x_5, y_5, marker='.', linestyle = 'none')

# Margins

plt.margins(0.02)

# Annotate the plot

plt.legend(('treatment 1', 'treatment 2', 'treatment 3', 'treatment 4', 'treatment 5 (control)'), loc='lower right')
_ = plt.xlabel('change in test score')
_ = plt.ylabel('ECDF')

# Display

plt.show()

[ECDF of change in test score for each treatment]

Looking at the ECDFs for the 5 treatments, it was hard to see much difference, but treatment 5 (the control) did appear to diverge from the other groups. So I ran a one-way ANOVA test (which is what you’ve been waiting for this whole time, right?).

My null hypothesis (H0) was that there was no difference in the mean change in test score between the 5 groups. The alternative hypothesis (H1) was that at least one group’s mean change differed from the others. (Since the treatment column is stored as strings, the groups are selected with '1' through '5' below.)

from scipy import stats

F, p = stats.f_oneway(analysis.loc[analysis['treatment'] == '1', 'test_dif'],
                      analysis.loc[analysis['treatment'] == '2', 'test_dif'],
                      analysis.loc[analysis['treatment'] == '3', 'test_dif'],
                      analysis.loc[analysis['treatment'] == '4', 'test_dif'],
                      analysis.loc[analysis['treatment'] == '5', 'test_dif']
)

print ('F-value: {}'.format(F))
print ('p-value from the F-distribution: {}'.format(p))
F-value: 2.968634461487222
p-value from the F-distribution: 0.01940028372616215

A p-value of .019 is statistically significant at the conventional .05 level, so I rejected the null hypothesis. But the ECDF plot suggested that the significance was driven by the control group alone, so I re-ran the ANOVA without the control.

F, p = stats.f_oneway(analysis.loc[analysis['treatment'] == '1', 'test_dif'],
                      analysis.loc[analysis['treatment'] == '2', 'test_dif'],
                      analysis.loc[analysis['treatment'] == '3', 'test_dif'],
                      analysis.loc[analysis['treatment'] == '4', 'test_dif']
)

print ('F-value: {}'.format(F))
print ('p-value from the F-distribution: {}'.format(p))
F-value: 0.8029042329491185
p-value from the F-distribution: 0.49293002632018534

Suddenly the p-value jumped to .49 — not significant at all.

Finally, I ran the ANOVA on the control versus all other groups.

video_groups = ['1', '2', '3', '4']
F, p = stats.f_oneway(analysis.loc[analysis['treatment'].isin(video_groups), 'test_dif'],
                      analysis.loc[analysis['treatment'] == '5', 'test_dif']
)

print ('F-value: {}'.format(F))
print ('p-value from the F-distribution: {}'.format(p))
F-value: 9.246376783494622
p-value from the F-distribution: 0.0025045686383411802

With a p-value of .002, it does seem that it was the control group’s performance relative to the other treatments that explained the statistical significance.
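
Running several ANOVAs back to back like this invites multiple-comparison problems, so a cleaner follow-up would be a single post-hoc test, such as Tukey’s HSD, which compares every pair of groups while controlling the family-wise error rate. Here’s a minimal sketch, assuming statsmodels is available.

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Compare every pair of treatment groups on the change in test score,
# adjusting for multiple comparisons.
tukey = pairwise_tukeyhsd(endog=analysis['test_dif'],
                          groups=analysis['treatment'],
                          alpha=0.05)
print(tukey.summary())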

Conclusions

Students who watched any of the 4 videos improved more than those who watched no video at all, but no particular video was more effective than any other at improving test performance. Maybe the difference between the videos was too subtle to make an impact. Perhaps it was sufficient for the students to read a few sample sentences (which appeared in all the videos) in order to improve performance.

What does this mean for Spanish at USC? Rather than impacting the way the program is delivered to students, this experiment showed faculty that it was possible to evaluate learning in a critical and rigorous way. Now that the director has put systems in place for running experiments, she can run other experiments in future semesters. The results from those experiments can be used to improve the program and folded into self-assessment and accreditation exercises. In short, this was a critical first step in making a great program even better.

As with all things learning, creating a process for continual improvement is more important than any one product.

Written on September 4, 2018