Combining CSV Files with Glob

An important part of my job at the USC Language Center is administering placement tests and making the results available to students, advisors, and other administrators. Several times during the year, students take our tests using Scantron forms, and I end up with several CSV files — one for each of the languages we offer. I then need to make sure that all those results end up in a single, fixed-width text file that’s compatible with the university’s student information system. It’s one of those data management tasks that are perfect for automation with python.

Enter glob, a python package that helps you find multiple pathnames matching a certain pattern. For example, it allows me to quickly find all the CSV files that I need to combine into a single DataFrame that I then convert into a fixed-width text file.

import pandas as pd
import glob
import time

# Creating a date string for the exam, required by out student information system
date_str = str(time.strftime("%m/%d/%y"))

# Creating list of columns I want to import into my DataFrame
columns = ["Student Name", "Month", "Day", "Year", "ID", "Special Codes", "Total Score", "Grade"]

# Set my empty DataFrame
frame = pd.DataFrame()

# Use glob to collect all my CSV files into one DataFrame
for f in glob.glob("/filepath/placement_tests/*.csv") :
	df = pd.read_csv(f, usecols = columns, dtype = object)
	frame = frame.append(df, ignore_index = True)

# Add date string
frame["Test Date"] = date_str

I’ve been working through a DataCamp course on cleaning data taught by Daniel Chen. There he advised a slightly different strategy: using glob to create a list of DataFrames and then using the .concat() method to combine them all together. It would look something like this:

# Collect CSV files
pattern = '/filepath/placement_tests/*.csv'
csv_files = glob.glob(pattern)

# Create an empty list
frames = []

#  Iterate over csv_files
for csv in csv_files:
    df = pd.read_csv(csv, usecols = columns, dtype = object)

# Concatenate frames into a single DataFrame
test_results = pd.concat(frames)

I don’t know if one of these strategies is better than the other for this use case. Still, it’s interesting to see that there is more than one way to get things done!

Do you find yourself dealing with these sorts of repetitive data tasks in your work? How do you deal with them?

Written on September 5, 2017