Aggregating Spanish Placement Exams

This week, I began exploring our backlog of language placement exams. I think the best way to talk about this is to walk you through the process of answering a sample question. For instance, how many students have taken our Spanish exam since we started collecting data?

To answer this question, I needed to combine all of the fixed-width text files into a single data set that I could analyze. I did this using glob. The script below combines all the results into a single text file.

import glob

import pandas as pd

file_pattern = "*.txt"

text_files = glob.glob(file_pattern)

# Concatenate every exam file into one combined text file
with open("result.txt", "wb") as outfile:
    for file in text_files:
        with open(file, "rb") as infile:
            outfile.write(infile.read())

clean_results_list = []

# Keep only complete records: 61 characters plus the trailing newline
with open("result.txt", "r") as results:
    for line in results:
        if len(line) == 62:
            clean_results_list.append(line)

# Write the cleaned records to a dated file, e.g. "2017-10-10 - all_results.txt"
out_name = pd.to_datetime("today").strftime("%Y-%m-%d") + " - all_results.txt"
with open(out_name, "w") as clean_results_file:
    for line in clean_results_list:
        clean_results_file.write(line)

One drawback of this code is that it filters out every line that isn’t exactly 62 characters long. I figured it would be easier to drop those scores than to clean them. This reduced the number of results from about 10,000 to 9,100. A better option might be a regular expression that finds and salvages those 900 results, as in the sketch below. Suggestions welcome.
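Here’s a rough idea of what that could look like. This is only a sketch: it assumes the most common defect is stray whitespace padding at the end of an otherwise complete record, which may not match what’s actually in the data.

import re

# Sketch: salvage lines that contain a full 61-character record
# followed only by stray whitespace, instead of dropping them.
# (Assumption: trailing padding is the most common defect.)
salvageable = re.compile(r"^(.{61})\s+$")

for line in data:
    if len(line) == 62:
        clean_results_list.append(line)
    else:
        match = salvageable.match(line)
        if match:
            # Restore the newline so the record matches the clean ones
            clean_results_list.append(match.group(1) + "\n")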

The next step was to take the remaining results and import them into a pandas DataFrame. Originally, I planned to use the script I discussed in an earlier post. However, I ran into a few problems. First, there were missing values in the “birthdate” and “score” columns, which meant I couldn’t assign datatypes to those columns in the read_fwf call. Second, there were several duplicate rows that needed to be eliminated. So here’s how I modified the code:

col_breaks = [(0, 19), (20, 30), (31, 33), (40, 42), (43, 46), (47, 55), (55, 61)]
col_names = ["name", "id", "score", "language", "placement", "test date", "birthdate"]

# Abbreviations from ISO 639-2
langs = {"3": "SPA", "4": "FRA", "5": "GER", "6": "LAT", "7": "JPN", "8": "KOR", "9": "RUS", "10": "CHI", "11": "ARA", "12": "PER", "13": "ITA", "14": "POR", "15": "GRE"}

filepath = "2017-10-10 - all_results.txt"


# Import the fixed-width text file
df = pd.read_fwf(filepath, colspecs=col_breaks, header=None, names=col_names,
                 dtype=object, parse_dates=[5], infer_datetime_format=True)

# Replace numeric language codes with their ISO 639-2 abbreviations
df["language"] = df["language"].map(langs)

# Deal with missing data: coerce unparseable values to NaN/NaT
df["score"] = pd.to_numeric(df["score"], errors="coerce")
df["birthdate"] = pd.to_datetime(df["birthdate"], errors="coerce")

# Remove duplicates, keeping the last occurrence of each name
dupes = df.duplicated("name", keep="last")
deduped = df.loc[~dupes]

Removing duplicates brought the count from 9,100 to 8,700.
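To verify those counts at each stage, a quick check like this works (the numbers above are rounded):

# Row counts before and after removing duplicates
print(len(df))       # roughly 9,100 after the length filter
print(len(deduped))  # roughly 8,700 after dropping duplicate names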

Finally, I had to reshape the data using pivot_table, then resample the Spanish column by year:

by_lang = deduped.pivot_table(aggfunc="count", index="test date", values="score", columns="language")
by_lang_agg = by_lang.resample("A")["SPA"].sum()
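The plotting code isn’t shown in the pipeline above, but a minimal sketch using pandas’ built-in matplotlib plotting could look like this; the axis labels are placeholders of my choosing:

import matplotlib.pyplot as plt

# Sketch: plot the yearly count of Spanish exams.
# Labels are placeholders, not from the original pipeline.
ax = by_lang_agg.plot()
ax.set_xlabel("Year")
ax.set_ylabel("Spanish exams taken")
plt.tight_layout()
plt.show()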

…and here’s your answer in the form of a nifty graph. Enjoy!

Written on October 10, 2017