Combining CSV Files with Glob
An important part of my job at the USC Language Center is administering placement tests and making the results available to students, advisors, and other administrators. Several times during the year, students take our tests using Scantron forms, and I end up with several CSV files — one for each of the languages we offer. I then need to make sure that all those results end up in a single, fixed-width text file that’s compatible with the university’s student information system. It’s one of those data management tasks that are perfect for automation with python.
glob, a python package that helps you find multiple pathnames matching a certain pattern. For example, it allows me to quickly find all the CSV files that I need to combine into a single DataFrame that I then convert into a fixed-width text file.
import pandas as pd import glob import time # Creating a date string for the exam, required by out student information system date_str = str(time.strftime("%m/%d/%y")) # Creating list of columns I want to import into my DataFrame columns = ["Student Name", "Month", "Day", "Year", "ID", "Special Codes", "Total Score", "Grade"] # Set my empty DataFrame frame = pd.DataFrame() # Use glob to collect all my CSV files into one DataFrame for f in glob.glob("/filepath/placement_tests/*.csv") : df = pd.read_csv(f, usecols = columns, dtype = object) frame = frame.append(df, ignore_index = True) # Add date string frame["Test Date"] = date_str
I’ve been working through a DataCamp course on cleaning data taught by Daniel Chen. There he advised a slightly different strategy: using
glob to create a list of DataFrames and then using the
.concat() method to combine them all together. It would look something like this:
# Collect CSV files pattern = '/filepath/placement_tests/*.csv' csv_files = glob.glob(pattern) # Create an empty list frames =  # Iterate over csv_files for csv in csv_files: df = pd.read_csv(csv, usecols = columns, dtype = object) frames.append(df) # Concatenate frames into a single DataFrame test_results = pd.concat(frames)
I don’t know if one of these strategies is better than the other for this use case. Still, it’s interesting to see that there is more than one way to get things done!
Do you find yourself dealing with these sorts of repetitive data tasks in your work? How do you deal with them?