# Second Language Acquisition Experiment II

In my previous post, I wrote about a Second Language Acquisition experiment I worked on last semester with the director of USC’s Spanish language program. In this follow-up, I’ll discuss how I analyzed the data. In other words, get ready for some pictures. 📊 🤓

First, a refresher: participants took a pre-test on conditional sentences in Spanish. They were divided into 5 groups. Each group watched a different video explaining conditional sentences (except for the control group, which watched none) and then took a post-test on the conditional. The director wanted to see whether these videos had significantly different effects on the change in students' test scores.

In my previous post, I imported and cleaned the data. The DataFrame I'd made contained biographic data for each participant, along with their treatment group and their answers to the pre- and post-tests. The column names for the tests encoded information about each question (whether it came from the pre- or post-test, the type of question, etc.).

For reference, here are the column names.

```
print(inner_merge.columns)
```

```
Index(['timestamp_pre', 'email', 'first_name', 'last_name', 'eng_native',
'treatment', 'pre_02_ds_0_1', 'pre_18_hp_0_0', 'pre_23_rl_0_0',
'pre_25_hp_0_0',
...
'pos_11_rl_1_1', 'pos_08_hp_0_1', 'pos_31_rl_1_1', 'pos_29_hp_0_1',
'pos_11_rl_0_1', 'pos_27_hp_0_0', 'pos_02_ds_0_1', 'pos_22_hp_0_1',
'pos_30_hp_0_1', 'pos_04_ds_0_1'],
dtype='object', length=143)
```
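For instance, the underscore-separated fields of a column name can be pulled apart with `str.split`. (The first field is the test phase and the third is the question type, per the naming scheme above; my reading of the second field as a question number and of the trailing flags is an assumption.)

```python
# Split one test-column name into its underscore-separated fields.
# phase and qtype follow the naming scheme described above; treating
# the second field as a question number and the rest as flags is a guess.
name = 'pre_02_ds_0_1'
phase, number, qtype, *flags = name.split('_')
print(phase, number, qtype, flags)  # pre 02 ds ['0', '1']
```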

## Pre- and Post-test Scores

My first goal was to isolate the pre- and post-test scores. I filtered the DataFrame for columns whose names began with `pre_`, then `pos_`, and put that information into a new DataFrame named `analysis`.

```
import pandas as pd
import numpy as np

# Select the pre- and post-test columns by name prefix
column_list = inner_merge.columns.tolist()
pre_totals = inner_merge[[x for x in column_list if 'pre_' in x]]
pos_totals = inner_merge[[x for x in column_list if 'pos_' in x]]

# Total score per participant = row-wise sum of the item columns
analysis = pd.DataFrame()
analysis['treatment'] = inner_merge['treatment']
analysis['pre_total'] = pre_totals.sum(axis=1)
analysis['pos_total'] = pos_totals.sum(axis=1)
print(analysis.head(), '\n')
print(analysis.info())
```

```
  treatment  pre_total  pos_total
0         2         39         49
1         3         34         46
2         1         63         81
3         2         36         26
4         1         78        100
<class 'pandas.core.frame.DataFrame'>
Int64Index: 430 entries, 0 to 429
Data columns (total 3 columns):
treatment    430 non-null object
pre_total    430 non-null int64
pos_total    430 non-null int64
dtypes: int64(2), object(1)
memory usage: 13.4+ KB
None
```
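As an aside, the same prefix selection can be written more compactly with `DataFrame.filter(like=...)`, which keeps every column whose name contains a given substring. A minimal sketch on a toy stand-in for `inner_merge` (the real data isn't reproduced here):

```python
import pandas as pd

# Toy stand-in for inner_merge: two pre- and two post-test item columns.
df = pd.DataFrame({
    'treatment': ['1', '2'],
    'pre_01_ds_0_1': [1, 0],
    'pre_02_hp_0_0': [0, 1],
    'pos_01_ds_0_1': [1, 1],
    'pos_02_hp_0_0': [1, 0],
})

# filter(like=...) keeps columns whose names contain the substring,
# matching the list-comprehension approach above.
pre_total = df.filter(like='pre_').sum(axis=1)
pos_total = df.filter(like='pos_').sum(axis=1)
print(pre_total.tolist(), pos_total.tolist())  # [1, 1] [2, 1]
```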

When I looked at the means for the pre- and post-tests, it seemed like students did better on the post test overall. However, the standard deviations were wide enough that those differences didn’t appear to be significant.

```
print('MEAN PRE AND POST SCORES \n', analysis[['pre_total', 'pos_total']].mean(), '\n')
print('STD PRE AND POST SCORES \n',analysis[['pre_total', 'pos_total']].std(), '\n')
print('MEAN PRE AND POST SCORES BY GROUP \n', analysis.groupby('treatment').mean(), '\n')
print('STD PRE AND POST SCORES BY GROUP \n', analysis.groupby('treatment').std())
```

```
MEAN PRE AND POST SCORES
pre_total    48.123256
pos_total    60.251163
dtype: float64

STD PRE AND POST SCORES
pre_total    16.028854
pos_total    22.544898
dtype: float64

MEAN PRE AND POST SCORES BY GROUP
           pre_total  pos_total
treatment
1          49.256410  62.333333
2          47.988889  62.366667
3          49.556818  63.954545
4          45.725275  56.868132
5          48.313253  55.783133

STD PRE AND POST SCORES BY GROUP
           pre_total  pos_total
treatment
1          14.871253  24.955371
2          17.039607  23.884765
3          14.985708  20.671436
4          16.628667  21.857726
5          16.423457  20.512369
```
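The per-group mean and std calls can also be combined: `groupby(...).agg(['mean', 'std'])` returns both statistics in one table with a two-level column index. A sketch with made-up scores (not the real data):

```python
import pandas as pd

# Made-up miniature version of the analysis DataFrame.
analysis = pd.DataFrame({
    'treatment': ['1', '1', '2', '2'],
    'pre_total': [40, 60, 45, 55],
    'pos_total': [50, 70, 60, 70],
})

# One call produces both statistics per treatment group.
summary = analysis.groupby('treatment')[['pre_total', 'pos_total']].agg(['mean', 'std'])
print(summary)
```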

## Visualization

OK, picture time. I started with a series of `matplotlib` histograms of pre- and post-test scores for each treatment. At first glance, it appeared that students improved in the post-test on average, but there was also a wider range of scores in the post-test. In other words, students did better overall, but a few students did worse individually. (That, or they were so tired at the end of the experiment that they rushed through the post-test.)

```
import matplotlib.pyplot as plt

for t in sorted(analysis['treatment'].unique()):
    df = analysis[analysis['treatment'] == t]
    fig = plt.figure(figsize=(10.00, 3.00))
    print('TREATMENT GROUP:', t)
    pre_ax = fig.add_subplot(1, 2, 1)
    pos_ax = fig.add_subplot(1, 2, 2)
    pre_ax.hist(df['pre_total'])
    pos_ax.hist(df['pos_total'])
    pre_ax.set_xlabel('pre-test score')
    pos_ax.set_xlabel('post-test score')
    pre_ax.set_xlim((0, 100))
    pos_ax.set_xlim((0, 100))
    pre_ax.set_ylim((0, 25))
    pos_ax.set_ylim((0, 25))
    fig.tight_layout()
    plt.show()
```

*(Histograms of pre- and post-test scores for treatment groups 1 through 5.)*
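`plt.hist` defaults to 10 equal-width bins over the data range. To inspect the bin counts behind a histogram numerically, `np.histogram` performs the same binning; a small sketch with made-up scores:

```python
import numpy as np

# Made-up test scores on a 0-100 scale.
scores = np.array([10, 20, 20, 55, 90, 95])

# Same default binning as plt.hist with bins=10 over (0, 100).
counts, edges = np.histogram(scores, bins=10, range=(0, 100))
print(counts.tolist())  # [0, 1, 2, 0, 0, 1, 0, 0, 0, 2]
print(counts.sum())     # 6: every score lands in exactly one bin
```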

Looking at the same data in a boxplot seemed to support the story that pre- and post-test scores were essentially the same across treatments.

```
analysis.boxplot(['pre_total','pos_total'], by = 'treatment')
plt.show()
```

Next, I looked at the *difference* between students' pre- and post-test scores. Here, I used a boxplot from `matplotlib` and a swarm plot from `seaborn`, which I'd learned about in the *Introduction to Data Visualization with Python* course from DataCamp.

```
analysis['test_dif'] = analysis['pos_total'] - analysis['pre_total']
analysis.boxplot(['test_dif'], by='treatment')
plt.xlabel('treatment')
plt.ylabel('change in test score')
plt.show()
```

```
import seaborn as sns
sns.swarmplot(x='treatment', y='test_dif', data=analysis)
plt.xlabel('treatment')
plt.ylabel('change in test score')
plt.show()
```

These plots didn’t suggest significant differences between the treatment groups, but I decided to look at the cumulative distribution function to make sure. I’d learned about this in the *Statistical Thinking in Python* course from DataCamp.

```
# Function for calculating the ECDF
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)
    # x-data for the ECDF: x
    x = np.sort(data)
    # y-data for the ECDF: y
    y = np.arange(1, n + 1) / n
    return x, y

# Compute ECDFs
x_1, y_1 = ecdf(analysis.loc[analysis['treatment'] == '1', 'test_dif'])
x_2, y_2 = ecdf(analysis.loc[analysis['treatment'] == '2', 'test_dif'])
x_3, y_3 = ecdf(analysis.loc[analysis['treatment'] == '3', 'test_dif'])
x_4, y_4 = ecdf(analysis.loc[analysis['treatment'] == '4', 'test_dif'])
x_5, y_5 = ecdf(analysis.loc[analysis['treatment'] == '5', 'test_dif'])

# Plot ECDFs
_ = plt.plot(x_1, y_1, marker='.', linestyle='none')
_ = plt.plot(x_2, y_2, marker='.', linestyle='none')
_ = plt.plot(x_3, y_3, marker='.', linestyle='none')
_ = plt.plot(x_4, y_4, marker='.', linestyle='none')
_ = plt.plot(x_5, y_5, marker='.', linestyle='none')

# Margins and annotations
plt.margins(0.02)
plt.legend(('treatment 1', 'treatment 2', 'treatment 3', 'treatment 4',
            'treatment 5 (control)'), loc='lower right')
_ = plt.xlabel('change in test score')
_ = plt.ylabel('ECDF')

# Display
plt.show()
```
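As a quick sanity check on the `ecdf` helper, running it on a small unsorted array shows that the x-values come back sorted and the y-values step from 1/n up to 1:

```python
import numpy as np

def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y

# Four points: each step of the ECDF adds 1/4 = 0.25.
x, y = ecdf([3, 1, 2, 4])
print(x.tolist())  # [1, 2, 3, 4]
print(y.tolist())  # [0.25, 0.5, 0.75, 1.0]
```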

Looking at the ECDFs for the 5 treatments, it was hard to see much difference, but treatment 5 (the control) did appear to diverge from the other groups. So I ran a one-way ANOVA test (which is what you've been waiting for this whole time, right?).

My null hypothesis (**H₀**) was that there was no difference in the mean change in test score between the 5 groups. The alternative hypothesis (**H₁**) was that there was a difference in the mean change in test scores.

```
from scipy import stats

F, p = stats.f_oneway(analysis.loc[analysis['treatment'] == '1', 'test_dif'],
                      analysis.loc[analysis['treatment'] == '2', 'test_dif'],
                      analysis.loc[analysis['treatment'] == '3', 'test_dif'],
                      analysis.loc[analysis['treatment'] == '4', 'test_dif'],
                      analysis.loc[analysis['treatment'] == '5', 'test_dif'])
print('F-value: {}'.format(F))
print('p-value from the F-distribution: {}'.format(p))
```

```
F-value: 2.968634461487222
p-value from the F-distribution: 0.01940028372616215
```

A p-value of .019 is statistically significant, so I rejected the null hypothesis. But the ECDF plot suggested that the significance was driven entirely by the control group, so I re-ran the ANOVA without it.

```
F, p = stats.f_oneway(analysis.loc[analysis['treatment'] == '1', 'test_dif'],
                      analysis.loc[analysis['treatment'] == '2', 'test_dif'],
                      analysis.loc[analysis['treatment'] == '3', 'test_dif'],
                      analysis.loc[analysis['treatment'] == '4', 'test_dif'])
print('F-value: {}'.format(F))
print('p-value from the F-distribution: {}'.format(p))
```

```
F-value: 0.8029042329491185
p-value from the F-distribution: 0.49293002632018534
```

Without the control, the p-value jumped to .49, which is not significant at all.

Finally, I ran the ANOVA on the control versus all other groups.

```
treated = ['1', '2', '3', '4']
F, p = stats.f_oneway(analysis.loc[analysis['treatment'].isin(treated), 'test_dif'],
                      analysis.loc[analysis['treatment'] == '5', 'test_dif'])
print('F-value: {}'.format(F))
print('p-value from the F-distribution: {}'.format(p))
```

```
F-value: 9.246376783494622
p-value from the F-distribution: 0.0025045686383411802
```

With a p-value of .002, it does seem that the difference between the control group and the other treatments was what drove the statistical significance.
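With only two groups, a one-way ANOVA is equivalent to a two-sample t-test with equal variances: the F statistic is exactly the square of the t statistic. A sketch on synthetic score changes (the real data isn't reproduced here; the means, spreads, and group sizes below are made up):

```python
import numpy as np
from scipy import stats

# Synthetic change-in-score data: a "watched a video" pool and a control.
rng = np.random.default_rng(0)
treated = rng.normal(loc=14, scale=20, size=80)
control = rng.normal(loc=5, scale=20, size=20)

F, p_f = stats.f_oneway(treated, control)
t, p_t = stats.ttest_ind(treated, control)  # equal variances by default

# For two groups, F = t^2 and the p-values coincide.
print(np.isclose(F, t ** 2), np.isclose(p_f, p_t))  # True True
```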

## Conclusions

Students who watched *any* of the 4 videos did better than those who watched no video at all, but no particular video was more effective than any other in improving test performance. Maybe the difference between the videos was too subtle to make an impact. Perhaps it was sufficient for the students to read a few sample sentences (which appeared in all the videos) in order to improve performance.

What does this mean for Spanish at USC? Rather than impacting the way the program is delivered to students, this experiment showed faculty that it was possible to evaluate learning in a critical and rigorous way. Now that the director has put systems in place for running experiments, she can run other experiments in future semesters. The results from those experiments can be used to improve the program and folded into self-assessment and accreditation exercises. In short, this was a critical first step in making a great program even better.

As with all things learning, creating a process for continual improvement is more important than any one product.