Webscraping DataCamp Courses II
I was so psyched about scraping the DataCamp website that I dove right into the second part of the project. In my previous post, I showed how I was able to iterate through all the chapters and lessons from the course landing page. In this post, I’ll focus on how I extracted all the lesson info.
I’m not gonna lie to you: This is a long post. But the payoff is definitely worth it!
Structure of DataCamp Lessons
Before we get into the code, I want to look at how DataCamp courses are structured and what information I want to store in my notes.
Courses are divided into chapters. Each chapter is made up of several lessons.
For each lesson, there’s (1) a description of the exercises, (2) instructions, (3) starter code, and (4) a success message that appears once you’ve submitted (5) the correct code.
Extracting Lesson Info
First, I imported the relevant packages.
    from urllib.request import urlopen #For fetching the webpage
    from bs4 import BeautifulSoup #For parsing the HTML
    import json #For dealing with website data that comes in JSON
    import subprocess #For accessing pandoc
    from subprocess import Popen, PIPE, STDOUT #For accessing pandoc
    import re #For regex searching
    import pprint #For reading through JSONs

    url = 'https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=2'
    html = urlopen(url)
    soup = BeautifulSoup(html, 'lxml')
After noodling around in the HTML for a while, I figured out that the data I needed to access was in a JSON. That JSON was nested in a text string that itself was nested in a <script> tag. So I needed to extract the string from the <script>, and the JSON from the string.
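    # find_all() returns a list of tags, so pick the <script> whose text holds the preloaded state.
    scripts = soup.find_all('script')
    string = next(s.string for s in scripts if s.string and 'window.PRELOADED_STATE' in s.string) #Don't confuse `string` the variable with `.string` from `BeautifulSoup`.
    # str.strip() removes a set of characters, not a prefix, so split on the first '=' and drop the trailing ';'.
    json_text = string.split('=', 1)[1].strip().rstrip(';')
    lesson_json = json.loads(json_text)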
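To make the string-to-JSON step easier to poke at offline, here is a minimal sketch that uses only the standard library. The HTML fragment is made up for illustration and the helper name `extract_preloaded_state` is hypothetical; the real DataCamp markup may well differ.

```python
import json
import re

# Made-up page fragment; the real DataCamp markup may differ.
SAMPLE_HTML = """
<html><body>
<script>var unrelated = 1;</script>
<script>window.PRELOADED_STATE={"exercises": {"all": [{"type": "VideoExercise", "title": "Intro"}]}};</script>
</body></html>
"""

def extract_preloaded_state(html, marker='window.PRELOADED_STATE'):
    """Return the JSON object assigned to `marker` inside a <script> tag."""
    # Grab the body of every <script> element.
    for body in re.findall(r'<script[^>]*>(.*?)</script>', html, flags=re.DOTALL):
        body = body.strip()
        if not body.startswith(marker):
            continue
        # Keep everything after the first '=', then drop the trailing ';'.
        json_text = body.split('=', 1)[1].strip().rstrip(';')
        return json.loads(json_text)
    raise ValueError('no <script> tag assigns ' + marker)

state = extract_preloaded_state(SAMPLE_HTML)
print(state['exercises']['all'][0]['title'])
```

A regex is enough for this toy fragment; for a real page, a proper HTML parser such as BeautifulSoup is the safer choice.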
With the JSON in hand, I started looking through the keys. I won’t show it here, but pprint was extremely helpful in wading through the JSON code in order to figure out which keys were important. In the end, I drilled down to 'all'. It turned out that it was called 'all' because all the information for all the lessons in the chapter was here in a list of dictionaries. Sweet!
dict_keys(['systemStatus', 'backendSession', 'settings', 'autocomplete', 'user', 'fileBrowser', 'chapter', 'ltiChecker', 'location', 'course', 'exercises'])
dict_keys(['sample_code', 'sct', 'instructions', 'question', 'hint', 'possible_answers', 'number', 'user', 'randomNumber', 'assignment', 'feedbacks', 'attachments', 'title', 'xp', 'language', 'pre_exercise_code', 'solution', 'type', 'id'])
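For wading through a nested JSON like this, a small helper that outlines the keys level by level can complement pprint. This is only a sketch: the function name `key_outline` and the sample dictionary are made up for illustration and are not part of the original code.

```python
def key_outline(obj, depth=2, indent=0):
    """Return one line per key of nested dicts, descending at most `depth` levels."""
    lines = []
    if depth == 0:
        return lines
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(' ' * indent + f'{key} ({type(value).__name__})')
            lines.extend(key_outline(value, depth - 1, indent + 2))
    elif isinstance(obj, list) and obj:
        # Only inspect the first element; list items usually share a shape.
        lines.extend(key_outline(obj[0], depth, indent))
    return lines

# Made-up stand-in for the much larger lesson JSON.
sample = {
    'course': {'title': 'Some course'},
    'exercises': {'all': [{'type': 'NormalExercise', 'title': 'A basic experiment'}]},
}
print('\n'.join(key_outline(sample, depth=3)))
```

Raising `depth` one level at a time keeps the printout short enough to read while you look for the interesting keys.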
In DataCamp courses, there are different types of lessons (free coding, multiple-choice, etc.), and that information is stored in each exercise's 'type' key.
    for item in lesson_json['exercises']['all']:
        print(item['type'], ':', item['title'])

    VideoExercise : Intro to Experimental Design
    NormalExercise : A basic experiment
    NormalExercise : Randomization
    NormalExercise : Replication
    NormalExercise : Blocking
    VideoExercise : Hypothesis Testing
    BulletExercise : One sided vs. Two sided tests
    MultipleChoiceExercise : pwr package Help Docs exploration
    BulletExercise : Power & Sample Size Calculations
I could skip video exercises because there was no text to capture. However, normal, bullet and multiple-choice exercises each required slightly different treatments. So I made functions for each of them.
    # The parameter is named `exercise` so that it does not shadow the imported json module.
    def NormalExercise_print(exercise):
        print('#', exercise['title'], '\n')
        print('## Exercise\n')
        print(convert_2_md(exercise['assignment']))
        print('## Instructions\n')
        print(convert_2_md(exercise['instructions']))
        print('## Code\n')
        print('```\n' + convert_2_md(exercise['sample_code']).replace('\\', ''),'\n```')
        print('```\n' + convert_2_md(exercise['solution']).replace('\\', ''),'\n```')
        print(get_success_msg(exercise['sct']) + '\n')

    def BulletExercise_print(exercise):
        print('# ' + exercise['title'], '\n')
        print('## Exercise\n')
        print(convert_2_md(exercise['assignment']))
        print('## Instructions & Code \n')
        # Each sub-exercise carries its own instructions, code, and success message.
        for item in exercise['subexercises']:
            print(convert_2_md(item['instructions']))
            print('```\n' + item['sample_code'] + '\n```')
            print('```\n' + item['solution'] + '\n```')
            print(get_success_msg(item['sct']) + '\n')

    def MultipleChoiceExercise_print(exercise):
        print('# ' + exercise['title'], '\n')
        print('## Exercise\n')
        print(convert_2_md(exercise['assignment']))
        print("## Choices\n")
        for choice in exercise['instructions']:
            print('* ' + choice)
        print('\n**Correct answer: ' + get_correct_mc(exercise['sct']) + '**\n')
        print(get_success_msg(exercise['sct']) + '\n')
In order to make these functions easier to work with (and to follow some of the coding practices I learned as part of my Carpentry certification), I created a few sub-functions. The most interesting of these was convert_2_md(), which converts HTML syntax to Markdown using pandoc. In order to do that, I had to learn a little bit about the subprocess package. This blog post by Eddie Smith was a big help. Here’s what I ended up with:
    def convert_2_md(string):
        p = Popen(['pandoc', '-f', 'html', '-t', 'markdown', '--wrap=preserve'], stdout=PIPE, stdin=PIPE, stderr=STDOUT)
        # communicate() returns a (stdout, stderr) tuple; keep only stdout.
        text = p.communicate(input=string.encode('utf-8'))[0]
        text = text.decode('utf-8')
        return text
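The same pipe-text-through-a-program idea can also be sketched with subprocess.run, which wraps the Popen and communicate() steps in one call. The helper name `pipe_through` is hypothetical, and the child process here is a Python one-liner standing in for pandoc so that the sketch runs on a machine without pandoc installed.

```python
import subprocess
import sys

def pipe_through(command, text):
    """Send `text` to `command` on stdin and return what it writes to stdout."""
    result = subprocess.run(
        command,
        input=text.encode('utf-8'),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.decode('utf-8')

# Stand-in child process: upper-cases whatever it reads from stdin.
upper = [sys.executable, '-c', 'import sys; sys.stdout.write(sys.stdin.read().upper())']
print(pipe_through(upper, '<p>hello</p>'))

# With pandoc on the PATH, the same helper would take the arguments used in this post:
# pipe_through(['pandoc', '-f', 'html', '-t', 'markdown', '--wrap=preserve'], '<p>hello</p>')
```

Keeping stderr separate and passing check=True means a pandoc failure surfaces as an exception instead of ending up inside the notes.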
I also needed a function to pull out the message you get from a DataCamp lesson when you submit the correct answer. This is where regular expressions came in.
    def get_success_msg(string):
        # Capture the text between success_msg(" and the first ").
        match = re.search(r'success_msg\("(.*?)"\)', string)
        if match is not None:
            message = match.group(1)
            return message
        else:
            return ''
And while we’re at it, why not another regex for finding the correct answer in multiple-choice lessons?
    def get_correct_mc(string):
        # Capture the single digit that follows test_mc(correct = .
        match = re.search(r'test_mc\(correct = (\d),', string)
        if match is not None:
            message = match.group(1)
            return message
        else:
            return ''
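To see what the two patterns capture, here is a small sketch run against hypothetical SCT strings. The strings are invented to fit the patterns and are not copied from DataCamp; the looser third pattern is my own variation, not something from the original code.

```python
import re

# Hypothetical SCT snippets, written to match the patterns in this post.
normal_sct = 'test_function("t.test")\nsuccess_msg("Excellent job!")'
mc_sct = 'test_mc(correct = 2, feedback_msgs = c("No", "Yes", "No"))\nsuccess_msg("Nice!")'

success = re.search(r'success_msg\("(.*?)"\)', normal_sct)
correct = re.search(r'test_mc\(correct = (\d),', mc_sct)
missing = re.search(r'test_mc\(correct = (\d),', normal_sct)

print(success.group(1))  # the captured success message
print(correct.group(1))  # the captured digit, as a string
print(missing)           # None, because normal_sct has no test_mc() call

# A looser variation that tolerates different spacing and multi-digit answers.
loose = re.search(r'test_mc\(\s*correct\s*=\s*(\d+)', 'test_mc(correct=12)')
print(loose.group(1))
```

Because re.search returns None when nothing matches, the None check in both helper functions is what keeps exercises without a success message from raising an error.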
At this point, I would like to thank you for getting this far into the post (most likely because you’re a member of my family or a super nerd). Either way, you’ve made it to the good stuff. Here’s the loop that brings everything together. It looks at the type of lesson, parses it accordingly, and prints everything out in Markdown.
    for item in lesson_json['exercises']['all']:
        if item['type'] == 'VideoExercise':
            pass
        elif item['type'] == 'NormalExercise':
            NormalExercise_print(item)
        elif item['type'] == 'BulletExercise':
            BulletExercise_print(item)
        elif item['type'] == 'MultipleChoiceExercise':
            MultipleChoiceExercise_print(item)
    # A basic experiment

    ## Exercise

    `ToothGrowth` is a built-in dataset in R from a study that examined the effect of three different doses of Vitamin C on the length of the odontoplasts, the cells responsible for teeth growth in 60 guinea pigs, with length as the measured outcome variable. We'll call it "tooth length" throughout this chapter for ease.

    Built-in data can be loaded with the `data()` function. It will load the dataset as a data frame with the same name called in the function. You can load the famous `iris` dataset, for example, using `data("iris")`.

    Suppose you know that the average length of a guinea pigs odontoplasts is 18 micrometers. Conduct a t-test on the ToothGrowth dataset. Test to check that the mean of `len` is not equal to 18.

    ## Instructions

    - Load the `ToothGrowth` dataset with the `data()` function, like shown above. Run this line and then type `ToothGrowth` into the console to be sure it loaded.
    - Use `t.test()` to test if the `len` variable is not equal to 18 micrometers.

    ## Code

    ```
    # Load the ToothGrowth dataset
    data(___)

    # Perform a two-sided t-test
    t.test(x = ___, alternative = ___, mu = ___)
    ```

    ```
    # Load the ToothGrowth dataset
    data("ToothGrowth")

    # Perform a two-sided t-test
    t.test(x = ToothGrowth$len, alternative = "two.sided", mu = 18)
    ```

    Excellent job! (Aren't you glad the guinea pigs' teeth aren't 18 *inches* long?)
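As an aside, the type-by-type branching could also be expressed as a dispatch table, which makes adding a new exercise type a one-line change. This is an alternative sketch, not the approach used in this post; the stub handlers and the `HANDLERS` and `render` names are hypothetical and stand in for the three print functions.

```python
# Stub handlers standing in for the post's three *_print functions.
def normal_stub(exercise):
    return 'Normal -> ' + exercise['title']

def bullet_stub(exercise):
    return 'Bullet -> ' + exercise['title']

def multiple_choice_stub(exercise):
    return 'MultipleChoice -> ' + exercise['title']

# Map each exercise type to the function that knows how to render it.
HANDLERS = {
    'NormalExercise': normal_stub,
    'BulletExercise': bullet_stub,
    'MultipleChoiceExercise': multiple_choice_stub,
}

def render(exercise):
    """Return the rendered text, or None for types without a handler (e.g. videos)."""
    handler = HANDLERS.get(exercise['type'])
    return handler(exercise) if handler else None

exercises = [
    {'type': 'VideoExercise', 'title': 'Intro to Experimental Design'},
    {'type': 'NormalExercise', 'title': 'A basic experiment'},
]
for exercise in exercises:
    text = render(exercise)
    if text is not None:
        print(text)
```

Types that are missing from the table, such as video exercises, are simply skipped, which mirrors the `pass` branch in the loop.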
I didn’t include the full output here, but you get the idea. All the info I need is in Markdown format for my notes. All that time I would’ve wasted on copying and pasting can now be spent on moving quickly through the lesson.
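Since the functions print rather than return their Markdown, one way to get the text into a notes file is to capture standard output. The sketch below assumes that approach; `capture_markdown` and `demo_print` are hypothetical names, and `demo_print` is a stand-in for any of the print functions in this post.

```python
import contextlib
import io

def capture_markdown(print_function, exercise):
    """Run a printing function and return everything it printed as one string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        print_function(exercise)
    return buffer.getvalue()

def demo_print(exercise):
    # Hypothetical stand-in for NormalExercise_print and friends.
    print('# ' + exercise['title'], '\n')
    print('## Exercise\n')

markdown = capture_markdown(demo_print, {'title': 'Randomization'})
print(markdown)

# The captured string could then be written to a notes file, for example:
# with open('notes.md', 'w') as f:   # 'notes.md' is a made-up filename
#     f.write(markdown)
```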
The next step will be to put this and the previous post together, iterate through the whole course, and build several Markdown documents at once: a table of contents and a file for each chapter. Can’t wait to put it all together!