Coming Out of My Shell
The reason I love all the data science and programming techniques I’ve been learning is that they feel like magic. There’s a task that I have to do over and over again. I write a few lines of code. I never have to do it again. And everything I learn ends up helping me somewhere, often in ways I can’t anticipate. It’s like learning a language when you’re living abroad. As soon as you know a new word, you hear it everywhere, and a little piece of the world is suddenly unlocked.
In my last blog post, I mentioned that I wanted to make my scripts executable from the command line. As part of the training for my Carpentry certification, I’ve been working through the Software Carpentry lessons for Unix and Python. All that studying served as motivation to dive into the Terminal.
The first cool thing I learned is that there are a lot of Unix commands that are helpful for the work I do at the USC Language Center. For instance, the cat command lets me concatenate text files. Since all the test results I deal with are CSV files, I can use cat to combine the day’s test results into a single file. Then there’s awk, which allows me to eliminate duplicate rows, so I can get rid of the repeated header when I combine those CSV files. And I can combine both of these commands into a single line of code using the pipe (|). This works for passing the output from these Unix commands to a Python script, too (more on that below).
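If awk looks opaque, the header cleanup can be sketched in Python. This is not the command used in the service, just an illustration of the same idea, assuming the duplicate rows are exact repeats: keep the first copy of every line and drop any later copy. The function name combine_csv_text and the sample rows are made up for the example.

```python
def combine_csv_text(chunks):
    """Concatenate CSV texts, keeping only the first copy of each line."""
    seen = set()
    kept = []
    for chunk in chunks:
        for line in chunk.splitlines():
            if line not in seen:
                seen.add(line)
                kept.append(line)
    return "".join(line + "\n" for line in kept)


# Hypothetical sample data: two result files that share the same header row.
morning = "Student Name,ID,Total Score\nAda,001,95\n"
afternoon = "Student Name,ID,Total Score\nGrace,002,88\n"

combined = combine_csv_text([morning, afternoon])
print(combined)
```

One thing to keep in mind with this approach: it removes every repeated line, not just headers, so two genuinely identical data rows would also collapse into one.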
Even cooler, I can turn all these Terminal commands into a service on my Mac using Automator. For example, the following service takes a bunch of CSV files, combines them together with awk, and then uses a Python script to convert everything into the properly formatted fixed-width file I need to upload to our student information system:
In order for all of this to work, there were several issues I had to figure out.
First, in order to run a Python script from the command line, it needs to be executable. So I had to go into the Terminal, use cd to get to the folder where the script is, and then run the following command:
$ chmod +x filename.py
Second, the first line of the Python script needs to be a shebang line: #! followed by the path to the Python interpreter I want to run the script with. In this case:
Third, I had to use the sys module in my Python script in order to get the output from the Unix commands. And for this particular script, I had to use io to get that input into a form where pandas could read it into a DataFrame. Here’s an abbreviated version of the script to give you an idea (the full script is in my repository):
import pandas as pd
import sys
from io import StringIO

def main():
    '''Takes data stream (string), writes results to FWF on Desktop'''
    assert sys.stdin, "No input"
    results_string = StringIO(sys.stdin.read())

    # Import into DataFrame
    columns = ["Student Name", "Month", "Day", "Year", "ID",
               "Special Codes", "Total Score", "Grade"]
    results = pd.read_csv(
        results_string,
        usecols=columns,
        dtype=object,
        na_values=" "
    )

if __name__ == '__main__':
    main()
BTW, notice the if clause at the end of this code. I learned from the Carpentry lesson that this runs the function main() when the file is executed from the command line, but not when it is imported as a module.
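Here’s a minimal sketch of how that guard behaves, separate from the real script. The body of main() is a placeholder invented for the example; the only point is that Python sets __name__ to "__main__" when a file is run directly, and to the module’s name when it is imported.

```python
def main():
    """Placeholder for the script's real work; it only returns a message."""
    return "main() ran"


# True only when this file is run directly (for example, from the Terminal).
# When another script imports this file, __name__ is the module's name
# instead, so main() is defined but not called.
if __name__ == "__main__":
    print(main())
```

The practical upside is that the same file can be run as a command and also imported elsewhere for its functions without kicking off the whole job.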
Finally, I had to learn a little about the difference between sys.stdin and sys.argv: sys.stdin is for passing data into a script, whereas sys.argv is for passing in command-line arguments such as filenames. In this case, the Unix command awk is passing text to fixed-width-cmd.py, so I needed to use sys.stdin.read() and then use StringIO() to wrap that text in a file-like object that could be read by pandas.
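To make the sys.stdin versus sys.argv distinction concrete, here’s a small sketch that accepts either kind of input. It is not the fixed-width script itself: it uses the standard library’s csv module instead of pandas so it runs without extra installs, the shebang on the first line is a common portable form rather than the path actually used here, and the function names and sample data are made up for the example.

```python
#!/usr/bin/env python3
# Assumed shebang for illustration; the real script's interpreter path may differ.
import csv
import sys
from io import StringIO


def rows_from_text(text):
    """Wrap a string in StringIO so csv can read it like an open file."""
    return list(csv.DictReader(StringIO(text)))


def main(argv=None, stdin=None):
    """Read CSV rows from named files if any are given, else from stdin."""
    argv = sys.argv if argv is None else argv
    stdin = sys.stdin if stdin is None else stdin
    rows = []
    if len(argv) > 1:
        # argv[0] is the script name; the rest are arguments (here, filenames).
        for path in argv[1:]:
            with open(path, newline="") as handle:
                rows.extend(csv.DictReader(handle))
    else:
        # No arguments on the command line, so expect piped text on stdin.
        rows = rows_from_text(stdin.read())
    return rows


# Simulate `some_command | script.py`: no filename arguments, text on stdin.
piped = StringIO("ID,Total Score\n001,95\n002,88\n")
rows = main(["script.py"], piped)
print(rows)
```

In a real script, calling main() with no arguments would fall back to the actual sys.argv and sys.stdin, so the same code handles both a pipe and a list of filenames.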
Once I had all of these issues figured out, I saved my service as an application. So now, I can drag CSV files onto the application icon and the fixed-width file appears on my desktop. Here’s a video:
See, it’s magic!