Coming Out of My Shell

The reason I love all the data science and programming techniques I’ve been learning is that it feels like magic. There’s a task that I have to do over and over again. I write a few lines of code. I never have to do it again. And everything I learn ends up helping me somewhere, often in ways I can’t anticipate. It’s like learning a language when you’re living abroad. As soon as you know a new word, you hear it everywhere, and a little piece of the world is suddenly unlocked.

In my last blog post, I mentioned that I wanted to make my scripts executable from the command line. As part of the training for my Carpentry certification, I’ve been working through the Software Carpentry lessons for Unix and Python. All that studying served as motivation to dive into the Terminal.

The first cool thing I learned is that there are a lot of Unix commands that are helpful for the work I do at the USC Language Center. For instance, the cat command lets me concatenate text files. Since all the test results I deal with are CSV files, I can use cat to combine the day’s test results into a single file. Then there’s awk, which allows me to eliminate duplicate rows, so I can get rid of the repeated header when I combine those CSV files. And I can combine both of these commands into a single line of code using the pipe (|). This works for passing the output from these Unix commands to a Python script, too (more on that below).
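To make that concrete, here’s a small sketch of the pipeline. The file names are stand-ins for the day’s test results, and the awk one-liner is the usual dedupe idiom (print a line only the first time it’s seen):

```shell
# Two sample CSVs standing in for the day's test results
printf 'ID,Score\n1001,90\n' > morning.csv
printf 'ID,Score\n1002,85\n' > afternoon.csv

# cat concatenates the files; awk prints each line only the first time
# it appears, so the repeated header row (and any duplicate rows) drop out
cat morning.csv afternoon.csv | awk '!seen[$0]++' > combined.csv

cat combined.csv
# ID,Score
# 1001,90
# 1002,85
```

Swap the redirect at the end for another pipe and the combined output flows straight into a Python script instead of a file.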

Even cooler, I can turn all these Terminal commands into a service on my Mac using Automator. For example, the following service takes a bunch of CSV files, combines them with cat and awk, and then uses a Python script to convert everything into the properly formatted fixed-width file I need to upload to our student information system:

In order for all of this to work, there were several issues I had to figure out.

First, in order to run a Python script from the command line, it needs to be executable. So I had to go into the Terminal, use cd to get to the folder where the script is, and then run the following command:

$ chmod +x filename.py

Second, the first line of the Python script needs to be a shebang line giving the path to the Python interpreter I want to run the script with. In this case:

#!/usr/local/bin/python3

Third, I had to use the sys library in my Python script in order to get the output from the Unix commands. And for this particular script, I had to use StringIO from io to get that input into a form where pandas could read it into a DataFrame. Here’s an abbreviated version of the script to give you an idea (the full script is in my repository):

import pandas as pd
import sys
from io import StringIO

def main():
    '''Takes data stream (string), writes results to FWF on Desktop'''
    # sys.stdin itself is always truthy, so check whether anything
    # was actually piped in
    assert not sys.stdin.isatty(), "No input"

    results_string = StringIO(sys.stdin.read())

    # Import into DataFrame
    columns = ["Student Name", "Month", "Day", "Year", "ID",
               "Special Codes", "Total Score", "Grade"]

    results = pd.read_csv(
        results_string,
        usecols=columns,
        dtype=object,
        na_values="      ",
    )

if __name__ == '__main__':
    main()

BTW, notice the if clause at the end of this code. I learned from the Carpentry lesson that this runs the function main() when the script is executed from the command line, but not when the script is imported by another program.

Finally, I had to learn a little about sys. Namely, sys.stdin is for reading data piped into a script, whereas sys.argv is for picking up arguments like filenames. In this case, the UNIX command awk is piping text to fixed-width-cmd.py, so I needed to use sys.stdin.read() to get that text as a string, and then use StringIO() to wrap it in a file-like object that pandas.read_csv() can read.
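Here’s a tiny standalone sketch of that stdin-and-StringIO dance. The sample CSV text stands in for what awk would actually pipe in:

```python
import sys
from io import StringIO

import pandas as pd

# sys.argv is a list of strings: the script's name plus anything typed
# after it on the command line (e.g. filenames)
print(sys.argv[0])

# sys.stdin.read() returns everything piped in as one string; here a
# sample string stands in for that piped-in text
text = "ID,Score\n1001,90\n1002,85\n"

# StringIO wraps the string in a file-like object, which is what
# pd.read_csv() expects
results = pd.read_csv(StringIO(text), dtype=object)
print(len(results))  # 2
```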

Once I had all of these issues figured out, I saved my service as an application. So now, I can drag CSV files onto the application icon and the fixed-width file appears on my desktop. Here’s a video:

See, it’s magic!

Written on June 20, 2018