Second Language Acquisition Experiment I

I recently had the opportunity to work with the director of USC’s Spanish program on an experiment. She was interested in whether tweaking a teaching intervention improved students’ understanding of the grammar used in conditional sentences (“If I study, I will get an A on the exam”). She’s the expert in Second Language Acquisition (SLA), so she designed the experiment. I helped her record and edit the videos, wrangle the data, and analyze the results. In this first post, I’ll discuss how I imported and cleaned the data.

In the experiment, the Spanish Director took all the students in all the sections of Spanish II (about 450) and randomly assigned them to four treatment groups and one control group. All groups took a multiple-choice pre-test in which they had to evaluate the grammatical correctness of several sentences. Then each treatment group watched a different video on conditional sentences, while the control group watched a non-language-related video. Finally, they all took a post-test.
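
The random assignment itself can be sketched in a few lines of pandas. Everything below is hypothetical (made-up emails, an arbitrary seed); the real experiment worked from the actual Spanish II enrollment.

```python
import numpy as np
import pandas as pd

# Hypothetical roster of 450 students (emails made up).
roster = pd.DataFrame({'email': ['student%03d@usc.edu' % i for i in range(450)]})

# Deal students evenly into 5 groups (4 treatments + 1 control), then shuffle.
rng = np.random.default_rng(seed=42)
roster['group'] = rng.permutation(np.arange(len(roster)) % 5 + 1)

print(roster['group'].value_counts().sort_index())
```

Dealing before shuffling guarantees the five groups end up exactly the same size, which a purely independent coin-flip assignment would not.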

Importing the Data

The Spanish Director gave me a list of questions from the pre- and post-tests. I then manually entered the following variables for each question:

  • question group (since there are several variations of each question)
  • question type
    • rl = real
    • hp = hipotético
    • ds = distractor
  • reverse (1 for sentences that started with si clause, 0 for all others)
  • correct (1 for grammatically correct, 0 for grammatically incorrect)

Then I imported and formatted the data in Python.

import pandas as pd
import numpy as np

file = "2017-12-08 - goretti - question list.xlsx"

dtypes = {'question': object, 
          'group': object,
          'type': object,
          'reverse': np.int32,
          'correct': np.int32}

questions = pd.read_excel(file, dtype = dtypes)
questions['group'] = questions['group'].apply(lambda x: format(x, '02d'))
print(questions.head(7))
                                            question group type  reverse  \
0  Antes se construaban muchas torres de vigilancia.    01   ds        0   
1  Antes se construaban muchas torres de vigilancia.    01   ds        0   
2   Antes se construían muchas torres de vigilancia.    01   ds        0   
3   Antes se construían muchas torres de vigilancia.    01   ds        0   
4  Antes se construyeron muchas torres de vigilan...    01   ds        0   
5  Antes se construyeron muchas torres de vigilan...    01   ds        0   
6  Antes se escribaba con un tipo de letra elabor...    02   ds        0   

   correct  
0        0  
1        0  
2        1  
3        1  
4        1  
5        1  
6        0  
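
The format(x, '02d') call zero-pads the group numbers. Without padding, the string labels would sort lexicographically rather than numerically, as a quick illustration shows:

```python
groups = [1, 2, 10]

padded = [format(g, '02d') for g in groups]
print(sorted(padded))                  # ['01', '02', '10'] — numeric order
print(sorted(str(g) for g in groups))  # ['1', '10', '2'] — '10' sorts before '2'
```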

Next, I imported the pre- and post-test data, which had been collected using a Google form. (Below, I’m using iloc[] instead of head() to keep students’ personal details private.)

file = '2017-12-08 - goretti - pretest.csv'
pretest = pd.read_csv(file, parse_dates = [0], infer_datetime_format = True)
print(pretest.iloc[0:6,4:8])
  Is English your native language? What group are you in?  \
0                               No                Group 2   
1                              Yes                Group 3   
2                              Yes                Group 1   
3                              Yes                Group 2   
4                              Yes                Group 1   
5                              Yes                Group 4   

  Antes se escribía con un tipo de letra elaborado.  \
0                                     I don't know.   
1                        The sentence is incorrect.   
2                                     I don't know.   
3                        The sentence is incorrect.   
4                        The sentence is incorrect.   
5                        The sentence is incorrect.   

  Si no sería vegetariano, comería carne.  
0              The sentence is incorrect.  
1                The sentence is correct.  
2                The sentence is correct.  
3                The sentence is correct.  
4              The sentence is incorrect.  
5                The sentence is correct.  

file = '2017-12-08 - goretti - posttest.csv'
postest = pd.read_csv(file, parse_dates = [0], infer_datetime_format = True)
print(postest.iloc[0:6,4:8])
  What group are you in?  \
0                Group 1   
1                Group 2   
2                Group 3   
3                Group 1   
4                Group 2   
5                Group 5   

  Por la noche se cerraban las puertas para proteger la ciudad.  \
0                           The sentence is correct.              
1                           The sentence is correct.              
2                           The sentence is correct.              
3                           The sentence is correct.              
4                                      I don't know.              
5                           The sentence is correct.              

  Recibiría una A en todas mis clases, si estudiara todas las noches.  \
0                         The sentence is incorrect.                    
1                         The sentence is incorrect.                    
2                         The sentence is incorrect.                    
3                           The sentence is correct.                    
4                           The sentence is correct.                    
5                           The sentence is correct.                    

  Si necesito energía, comiera chocolate.  
0              The sentence is incorrect.  
1                The sentence is correct.  
2                The sentence is correct.  
3              The sentence is incorrect.  
4              The sentence is incorrect.  
5                The sentence is correct. 

Cleaning the Data

In order to analyze the data, I needed to do two things:

  1. Create uniform and systematic column headers for the questions in the pretest and postest DataFrames.
  2. Transform student responses to correct and incorrect (i.e. 1 and 0).

This is where all the manual data entry I’d done for the questions DataFrame paid off.

First, I created a dictionary from questions so that I could easily look up their characteristics.

question_mod = questions.set_index('question')
question_dict = question_mod.to_dict(orient='index')
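
With orient='index', the result maps each question string to a dict of its attributes, so one lookup returns everything needed to build a column name. A toy version with a single row reproduced from the head() output above:

```python
import pandas as pd

# A one-row stand-in for the full questions DataFrame.
questions_toy = pd.DataFrame({
    'question': ['Antes se construían muchas torres de vigilancia.'],
    'group': ['01'],
    'type': ['ds'],
    'reverse': [0],
    'correct': [1],
})
toy_dict = questions_toy.set_index('question').to_dict(orient='index')

info = toy_dict['Antes se construían muchas torres de vigilancia.']
print(info)  # {'group': '01', 'type': 'ds', 'reverse': 0, 'correct': 1}
```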

Second, I standardized the column names using the information in that dictionary.

old_names = pretest.columns.tolist()[6:]
# Skipping the first 6 columns because they don't contain questions.

new_names = []

for item in old_names:
    string = 'pre_'
    string = string + str(question_dict[item]['group']) + '_'
    string = string + str(question_dict[item]['type']) + '_'
    string = string + str(question_dict[item]['reverse']) + '_'
    string = string + str(question_dict[item]['correct'])
    new_names.append(string)

new_columns = ['timestamp_pre', 'email', 'first_name', 'last_name', 'eng_native', 'treatment']
new_columns = new_columns + new_names
pretest.columns = new_columns

old_names = postest.columns.tolist()[5:]
# Skipping the first 5 columns because they don't contain questions.

new_names = []

for item in old_names:
    string = 'pos_'
    string = string + str(question_dict[item]['group']) + '_'
    string = string + str(question_dict[item]['type']) + '_'
    string = string + str(question_dict[item]['reverse']) + '_'
    string = string + str(question_dict[item]['correct'])
    new_names.append(string)

new_columns = ['timestamp_pos', 'email', 'first_name', 'last_name', 'treatment']
new_columns = new_columns + new_names
postest.columns = new_columns
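
Since the two renaming loops differ only in the prefix, they could be collapsed into a single helper. This is a refactoring sketch, not the code I actually ran:

```python
def build_names(old_names, prefix, question_dict):
    """Build 'prefix_group_type_reverse_correct' column names."""
    return [
        '_'.join([
            prefix,
            str(question_dict[q]['group']),
            str(question_dict[q]['type']),
            str(question_dict[q]['reverse']),
            str(question_dict[q]['correct']),
        ])
        for q in old_names
    ]
```

Then the pre-test columns become the metadata names plus build_names(old_names, 'pre', question_dict), and likewise with 'pos' for the post-test.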

Third, I recoded the student answers as correct or incorrect.

def valid_answers(col_name):
    # The last character of a standardized column name records whether
    # the sentence is grammatically correct ('1') or incorrect ('0').
    if col_name[0] == 'p':
        if col_name[-1] == "1":
            return ["The sentence is correct."]
        else:
            return ["The sentence is incorrect."]
        
coded = (
    pretest
    .iloc[:, 6:]
    .apply(lambda col: col.isin(valid_answers(col.name)))
    .astype(int)
    )

pretest_coded = pd.concat([pretest.iloc[:, :6], coded], axis=1)
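
On a toy column (made-up responses), the recode works like this; note that “I don’t know.” responses end up scored as incorrect, just like a wrong answer:

```python
import pandas as pd

def valid_answers(col_name):
    # The last character of the standardized name says which response is right.
    if col_name[-1] == '1':
        return ['The sentence is correct.']
    return ['The sentence is incorrect.']

toy = pd.DataFrame({'pre_02_ds_0_1': [
    'The sentence is correct.',
    'The sentence is incorrect.',
    "I don't know.",
]})
coded_toy = toy.apply(lambda col: col.isin(valid_answers(col.name))).astype(int)
print(coded_toy['pre_02_ds_0_1'].tolist())  # [1, 0, 0]
```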

I also needed to recode the eng_native and treatment data.

pretest_coded['eng_native'] = pretest_coded.eng_native.apply(lambda x: x == 'Yes').astype(int)
pretest_coded['treatment'] = pretest_coded.treatment.str.split(' ').str.get(-1)

Here’s what the clean DataFrame looks like.

print(pretest_coded.iloc[0:6,4:8])
   eng_native treatment  pre_02_ds_0_1  pre_18_hp_0_0
0           0         2              0              1
1           1         3              0              0
2           1         1              0              0
3           1         2              0              0
4           1         1              0              1
5           1         4              0              0

The treatment for the post-test was basically the same.

def valid_answers(col_name):
    if col_name[0] == 'p':
        if col_name[-1] == "1":
            return ["The sentence is correct."]
        else:
            return ["The sentence is incorrect."]
        
coded = (
    postest
    .iloc[:, 5:]
    .apply(lambda col: col.isin(valid_answers(col.name)))
    .astype(int)
    )

postest_coded = pd.concat([postest.iloc[:, :5], coded], axis=1)
postest_coded['treatment'] = postest_coded.treatment.str.split(' ').str.get(-1)

print(postest_coded.iloc[0:6,4:8])
  treatment  pos_05_ds_0_1  pos_10_hp_1_1  pos_16_rl_0_0
0         1              1              0              1
1         2              1              0              0
2         3              1              0              0
3         1              1              1              1
4         2              0              1              1
5         5              1              1              0

The final step was merging the pre- and post-test data. I opted for an inner merge since it looked like several students ended up not taking the pre- or the post-test.

postest_coded.drop(['first_name', 'last_name', 'treatment'], axis=1, inplace=True)
inner_merge = pd.merge(pretest_coded, postest_coded, on="email", how = 'inner')
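
To see exactly how many students an inner merge would drop from each side, an outer merge with indicator=True is a handy diagnostic (the emails below are made up):

```python
import pandas as pd

pre = pd.DataFrame({'email': ['a@usc.edu', 'b@usc.edu', 'c@usc.edu'],
                    'pre_score': [1, 0, 1]})
pos = pd.DataFrame({'email': ['b@usc.edu', 'c@usc.edu', 'd@usc.edu'],
                    'pos_score': [1, 1, 0]})

check = pd.merge(pre, pos, on='email', how='outer', indicator=True)
counts = check['_merge'].value_counts()
print(counts)
# 'left_only' students took only the pre-test; 'right_only' only the post-test.
```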

And for safekeeping, I exported a copy of the merged data to a CSV file.

inner_merge.to_csv("2018-01-25 - goretti - inner merge.csv")

Tune in Next Time

In the next post, I’ll review how I analyzed the data. Stay tuned!

Written on August 29, 2018