Bad Birthdays Are Killing Me!

Every time we do a round of placement tests at USC, we upload the results to our student information system (SIS) as a fixed-width text file. In the past, that file was created manually in Excel, so one of the first things I did when I arrived was create a script to automate that process. It’s definitely made things a lot quicker, but there’s still a problem: bad birthdays.

In about 3% of the exams I upload, a student has entered either heir birthdate or ID number incorrectly. (Hey, we all get a little nervous during tests, right?) When I upload a result with an incorrect birthdate or ID, I get an email notification from SIS that looks like this:

Test birthdate 04/04/00 does not match SIS_PROFILE birthdate for 03/03/2000 1234567890 TROJAN TOMMY

Invalid USCID 1234567891 for TROJAN TINA

I then need to find the offending line in the original results file, copy it to a new file, make the correction, and then upload the new file to SIS. The same thing happens when a student has entered a bad ID number. With all the exams we do during the year, this can become a time killer.

The Solution

Fortunately, Python can help automate this process. After looking over some suggestions for grabbing email, here’s what I came up with:

Get Error Report Email

Import packages.

import email
import imaplib
import pandas as pd
import re

Get email from inbox.

FROM_EMAIL  = 'nice_try@usc.edu'
FROM_PWD    = 'nice_try'
SMTP_SERVER = 'outlook.office365.com'

mail = imaplib.IMAP4_SSL(SMTP_SERVER)
mail.login(FROM_EMAIL,FROM_PWD)
mail.select('INBOX')

Get latest error report from SIS and turn it into a string. (This was the part that gave me the most trouble because I had no idea how work with the imaplib and email packages. I also didn’t know that an email comes as byte literal and not, like I dunno, some sort of string.)

result, data = mail.uid('search', None, 'FROM "nice_try@sis-cal1.usc.edu"')
latest_email_uid = data[0].split()[-1]
result, data = mail.uid('fetch', latest_email_uid, '(RFC822)')
raw_email = data[0][1]

Get body of email as a string.

raw_email_string = raw_email.decode('utf-8')
new_email_string = email.message_from_string(raw_email_string)
 
if new_email_string.is_multipart():
    for payload in new_email_string.get_payload():
        sis_message = payload.get_payload()
else:
    sis_message = new_email_string.get_payload()

Match Email String to Regular Expression

For incorrect birthdates, get a list of tuples that include (bad_bday, correct_bday, id). For incorrect id numbers, get a list of bad ids.

bad_bday_pattern = r'Test birthdate (\d{2}/\d{2}/\d{2}) does not match SIS_PROFILE birthdate for (\d{2}/\d{2}/\d{4}) (\d{10}) .*'
bad_bday_list = re.findall(bad_bday_pattern, sis_message)

bad_id_pattern = r'Invalid USCID (\d{10}) for .*'
bad_id_list = re.findall(bad_id_pattern, sis_message)

Correct Bad Exam Results

Import the fixed-width file into a DataFrame.

results = '/Users/nicholascifuentes-goodbody/Desktop/2018-04-99 - sis_upload.txt'
col_breaks = [(0,20), (20,30), (30,33), (40,42),    (43,46),     (47,55),     (55,61)    ]
col_names =  ["name", "id",    "score", "language", "placement", "test_date", "birthdate"]

df = pd.read_fwf(
	results, 
	colspecs = col_breaks, 
	header = None, 
	names = col_names, 
	dtype = object,
)

Extract results with bad birthdates, correct them, and put them in a new DataFrame.

corrected_results = pd.DataFrame()

for student in bad_bday_list:
    mask = df['id'] == student[2]
    result = df[mask]
    result['birthdate'] = student[1].replace('/', '')[0:4] + student[1].replace('/', '')[5:-1]
    corrected_results = corrected_results.append(result)
    

Extract results with bad ID numbers and append them to the same DataFrame. (I also added a !! marker so I can correct them later.)

for student in bad_id_list:
    mask = df['id'] == student
    result = df[mask]
    result['id'] = student + '!!'
    corrected_results = corrected_results.append(result)

Write New File

These two functions are for creating the corrected fixed-width text file according to the format that SIS requires.

def romans(x):
	if x == "I":
		return "1"
	elif x == "II":
		return "2"
	elif x == "III":
		return "3"
	else:
		return "MET"
def makestring(row):
    string = str(
        str('%-20s' % row['name']) +
        row['id']                  +
        str(row['score']).rjust(3) +
        " 0 0 0 "                  +
        row['language'].rjust(2)   +
        romans(row['placement'])   +
        row['placement'].rjust(3)  + 
        ' '                        +
        row['test_date']           +
        row['birthdate']
    )
    
    return string

Create the fixed-width file.

new_file = open('/Users/nicholascifuentes-goodbody/Desktop/DONE.txt', 'w')

for lab, row in corrected_results.iterrows():
    string = makestring(row)
    new_file.write(string + '\n')

new_file.close()

Next Steps

Now that I have a proof of concept, I’m planning to refine all this code. I’ve been working on my certification as an instructor for Software Carpentry, and I’d like to take what I’m learning to make this executable from the command line.

In an idea world, I’d like to have access to a SQL database with the information for all USC’s students. That way, I could correct all the bad data before I upload them to SIS in the first place. But that’s more about institutional politics than data science. Alas, Python lacks a diplomacy package. 😄

Written on May 8, 2018