Webscraping DataCamp for a Personal Archive

Most of what I’ve learned about data science has come from the courses I’ve taken at DataCamp. Until now, every time I took a course, I copied and pasted the course outline into a Markdown file. This gives me an archive I can refer to when I face similar problems down the road. But it takes time, time I could be using to take more DataCamp courses!

It recently occurred to me that I could use what I’ve been learning to automate the archiving process. So after spending some quality time with the Beautiful Soup documentation, staring at the HTML for this DataCamp course, and asking for advice on Stack Overflow, I’ve come up with a solution.

First, this is what the website for a DataCamp course looks like:

[Image: screenshot of a DataCamp course page]

And here’s the code I put together:

from urllib.request import urlopen
from bs4 import BeautifulSoup
from collections import OrderedDict

url = 'https://www.datacamp.com/courses/experimental-design-in-r'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')

# Chapter titles live in <h4> tags and lessons in <li> tags.
lesson_outline = soup.find_all(['h4', 'li'])

chapters = OrderedDict()   # {chapter: [(lesson_name, lesson_link), ...], ...}
chapter = None             # most recent chapter title seen so far

for item in lesson_outline:
    # Tags with no class attribute yield an empty list, so no KeyError handling is needed.
    classes = item.get('class', [])
    if 'chapter__title' in classes:
        chapter = item.text.strip()
        chapters[chapter] = []
    elif 'chapter__exercise' in classes and chapter is not None:
        # Skip any lesson that shows up before the first chapter title.
        lesson_name = item.find('h5').text
        lesson_link = item.find('a').attrs['href']
        chapters[chapter].append((lesson_name, lesson_link))

The two big things I learned here are how Beautiful Soup organizes the data it scrapes from a website and how you can put lists and tuples inside of dictionaries. This last one was a real sticking point as I was deciding how to structure the data.
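To make that nested structure concrete, here is a minimal sketch that builds the same shape by hand, with no scraping involved. The chapter and lesson names are taken from this course's outline, but the `/placeholder-N` links are made up for illustration and are not real DataCamp URLs.

```python
from collections import OrderedDict

# Same shape as the scraper's result: {chapter: [(lesson_name, lesson_link), ...]}
chapters = OrderedDict()

# The links below are hypothetical placeholders, not real DataCamp URLs.
chapters['Introduction to Experimental Design'] = []
chapters['Introduction to Experimental Design'].append(('Randomization', '/placeholder-1'))
chapters['Introduction to Experimental Design'].append(('Replication', '/placeholder-2'))

chapters['Basic Experiments'] = [('A/B Testing', '/placeholder-3')]

# Each value is a list, so lessons keep their order and can be indexed.
first_chapter_lessons = chapters['Introduction to Experimental Design']

# Each list element is a tuple, so it unpacks into a name and a link.
lesson_name, lesson_link = first_chapter_lessons[0]

# Count the lessons in each chapter.
lesson_counts = {chapter: len(lessons) for chapter, lessons in chapters.items()}
```

The dictionary keys give me the chapters, the lists keep the lessons in order, and the tuples keep each lesson's name and link together.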

When it’s all said and done, here’s the data that I’ll write into a Markdown file:

for chapter, lessons in chapters.items():
    print('\n# ', chapter, '\n')
    for lesson_name, lesson_link in lessons:
        print("   *", lesson_name)

Which prints:

#  Introduction to Experimental Design 

   * Intro to Experimental Design
   * A basic experiment
   * Randomization
   * Replication
   * Blocking
   * Hypothesis Testing
   * One sided vs. Two sided tests
   * pwr package Help Docs exploration
   * Power & Sample Size Calculations

#  Basic Experiments 

   * Single & Multiple Factor Experiments
   * Exploratory Data Analysis (EDA) Lending Club
   * How does loan purpose affect amount funded?
   * Which loan purpose mean is different?
   * Multiple Factor Experiments
   * Model Validation
   * Pre-modeling EDA
   * Post-modeling validation plots + variance
   * Kruskal-Wallis rank sum test
   * A/B Testing
   * Which post-A/B test test?
   * Sample size for A/B test
   * Basic A/B test
   * A/B tests vs. multivariable experiments

#  Randomized Complete (& Balanced Incomplete) Block Designs 

   * Intro to NHANES dataset & Sampling
   * NHANES dataset construction
   * NHANES EDA
   * NHANES Data Cleaning
   * Resampling NHANES data
   * Randomized Complete Block Designs (RCBD)
   * Which is NOT a good blocking factor?
   * Drawing RCBDs with Agricolae
   * NHANES RCBD
   * RCBD Model Validation
   * Balanced Incomplete Block Designs (BIBD)
   * Is a BIBD even possible?
   * Drawing BIBDs with agricolae
   * BIBD - cat's kidney function
   * NHANES BIBD

#  Latin Squares, Graeco-Latin Squares, & Factorial experiments 

   * Latin Squares
   * NYC SAT Scores EDA
   * Dealing with Missing Test Scores
   * Drawing Latin Squares with agricolae
   * Latin Square with NYC SAT Scores
   * Graeco-Latin Squares
   * NYC SAT Scores Data Viz
   * Drawing Graeco-Latin Squares with agricolae
   * Graeco-Latin Square with NYC SAT Scores
   * Factorial Experiments
   * NYC SAT Scores Factorial EDA
   * Factorial Experiment with NYC SAT Scores
   * Evaluating the NYC SAT Scores Factorial Model
   * What's next in Experimental Design

The next step will be to tweak the output to better work with my note-taking system (Zettelkasten all the way, baby!). That should be easy. Then I’d like to iterate through each lesson and turn those into notes, too. That will be harder.
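As a rough sketch of that first step, something like the following could turn the `chapters` dictionary into Markdown text. The function name `chapters_to_markdown`, the heading and link format, and the sample links are all my own assumptions for illustration. I haven't settled on what my note-taking system actually needs.

```python
def chapters_to_markdown(chapters):
    """Render {chapter: [(lesson_name, lesson_link), ...]} as Markdown text."""
    lines = []
    for chapter, lessons in chapters.items():
        lines.append('# ' + chapter)
        lines.append('')
        for lesson_name, lesson_link in lessons:
            # One bullet per lesson, with the lesson name linking to its page.
            lines.append('* [{}]({})'.format(lesson_name, lesson_link))
        lines.append('')
    return '\n'.join(lines)


# Hypothetical sample data; the links are placeholders, not real DataCamp URLs.
sample = {
    'Basic Experiments': [
        ('A/B Testing', '/placeholder-1'),
        ('Model Validation', '/placeholder-2'),
    ],
}
markdown = chapters_to_markdown(sample)
```

Building a string first keeps the formatting separate from the file handling. Saving it is then a single call such as `pathlib.Path('course.md').write_text(markdown)`, where `course.md` is just an example filename.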

For now, the good news is that I’m using the things I’ve learned to accelerate the pace at which I can learn more things. It’s like compound interest for your brain. 📈🧠

Written on August 2, 2018