Importing Fixed-width Text Files

In my previous post, I discussed the process of converting the results from our language placement test into a fixed-width text file that’s compatible with our student information system. But what about going in the other direction? We have years worth of data in text files, and they’re ripe for analysis!

Fortunately, pandas comes with an easy solution: read_fwf. This allows me to wrangle fixed-width data into a DataFrame.

To give you an idea of what I’m working with, here’s a hypothetical result from a placement exam:

TROJAN TOMMY T      0123456789 93 0 0 0  32 II 10/02/17010199

What’s all this mean? The line starts with name and student number. Then there’s the test score, three placeholder 0s, a one- or two-digit code for the language, and the placement level (as a digit and Roman numeral). Finally, there’s the test date in mm/dd/yy format, followed directly by the student’s birthday in mmddyy format. If you’re confused, that’s because it’s confusing!

Next, here’s the script I wrote:

import pandas as pd
import numpy as np

col_breaks = [(0,19),(20,30), (31,33), (40,42), (43,46), (47,55), (55,61)]
col_names = ["name", "id", "score", "language", "placement", "test date", "birthdate"]
col_types = {"name": np.object, "id": np.object, "score": np.int64, "language": np.object, "placement": np.object, "test date": np.object, "birthdate": np.object}

# Three-letter abbreviations from ISO 639-2
langs = {"3": "SPA", "4": "FRA", "5": "GER", "6": "LAT", "7": "JPN", "8": "KOR", "9": "RUS", "10": "CHI", "11": "ARA", "12": "PER", "13": "ITA", "14": "POR", "15": "GRE"}

df = pd.read_fwf("sis_upload.txt", colspecs = col_breaks, header = None, names = col_names, dtype = col_types, 
     parse_dates = [5, 6], infer_datetime_format = True) 

df["language"] = df["language"].apply(lambda x: langs[x])

Finally, here’s the first row of the resulting DataFrame. Note how the parse_dates and infer_datetime_format arguments convert the test and birth dates into datetime objects.

                   name          id  score language placement  test date  \
0        TROJAN TOMMY T  0123456789     93      SPA        II 2017-10-02 

0  1999-01-01 

Et voilà! The next step would be to use glob, which I discussed in my previous post, is to concatenate our backlog of results into a single DataFrame. That leaves the question of storage. I’m working on a SQL solution for this and other data we need to easily access in the Language Center. Stay tuned!

Written on October 2, 2017