Webscraping DataCamp Courses III

In my last two posts (here and here), I discussed my efforts to scrape the DataCamp website so that I could have a record of the courses I’ve taken. In this final post, I’m putting all the pieces together, discussing how the script works and showing what the final product looks like.

There’s a lot of code here, so I’ve frontloaded it into the first half of the post. You can choose your own adventure. If you’re interested in how the script works, go to the next section. If you’re interested in seeing what the script does, skip to the section “In Action.”

Before I start, I want to say how excited I am about this project. A year ago, I could barely print "Hello world", but thanks to sites like DataCamp, communities like The Carpentries, and a heavy dose of Stack Overflow, I’m doing things like this. So if you’re interested in coding or data science but think it’s something you could never do, think again. Keep a growth mindset, and dive in.

How It Works

Here’s how the script works. The user passes the course URL to get_whole_course(). This function uses get_course_outline() to get an ordered list of the chapters and lessons in the course, make_chapter_notes() to create a text file of notes for each chapter, and download_chapter_slides() to download a PDF of each chapter’s slides. It then creates a text file with a table of contents for the course. This allows for easy navigation of the course once all the text files are in my notes system (I use nvALT and The Archive). Everything is organized using unique, 14-digit “z numbers,” which are assigned to each course and chapter.
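
For reference, each scraped course ends up as one text file per chapter plus a course-level table of contents, all keyed by these z numbers. A minimal sketch of the resulting files (the IDs and titles here are placeholders; make_z_number() builds each ID from the current timestamp plus a two-digit counter):

20180815170900 First chapter title.txt
20180815170901 Second chapter title.txt
20180815170902 Third chapter title.txt
20180815170903 Course title.txt

The chapter slide PDFs are saved under the same base names, so a chapter’s notes and slides sort next to each other.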

Here are the libraries I used, with a brief explanation of the role each one plays:

from bs4 import BeautifulSoup					#For parsing the HTML
from collections import OrderedDict				#For storing chapter and lesson info
from subprocess import Popen, PIPE, STDOUT		#For accessing pandoc
from urllib.request import urlopen, urlretrieve	#For fetching the webpage, downloading PDF of slides
import json										#For dealing with website data that comes in JSON
import pandas as pd								#For creating Zettelkasten-style filenames
import re										#For regex parsing `sct`
import subprocess								#For accessing pandoc
import pprint									#For looking through webpages, dictionaries, etc

And here’s all the code you need to capture an entire course:

def get_whole_course(link, return_z=False):
	'''Receives course URL from user. Gets ordered dict of chapters/lessons. 
	Creates list of unique z_numbers, one for each chapter. 
	Creates txt file for each chapter, fills each file with lesson content.
	Gets course name. Creates table of contents.
	Feeds to: get_course_outline(), make_z_number(), make_chapter_notes(), 
	get_course_title(), create_table_of_contents(), download_chapter_slides()'''
	course_dictionary = get_course_outline(link)
	z_list = make_z_number(len(course_dictionary.items())+1)
	chapter_names_Z_list = []
	for chapter, lessons in course_dictionary.items():
		z_index = list(course_dictionary.keys()).index(chapter)
		filename = z_list[z_index] + ' ' + chapter + '.txt'
		chapter_names_Z_list.append(filename)
		chapter_link = lessons[1][1]	#Link to one of the chapter's lessons (any lesson link works)
		make_chapter_notes(filename, chapter_link)
	course_name = get_course_title(link)
	create_table_of_contents(course_dictionary, course_name, z_list)
	first_key = list(course_dictionary.keys())[0]
	one_link = course_dictionary[first_key][0][1]
	download_chapter_slides(chapter_names_Z_list, one_link)
	if return_z:
		return z_list[-1]
    
def get_course_outline(link):
	'''Receives link to course landing page from get_whole_course(). 
	Returns ordered dict of chapters with lessons.'''
	html = urlopen(link)
	soup = BeautifulSoup(html, 'lxml')
	lesson_outline = soup.find_all(['h4', 'li'])
	chapters = OrderedDict()   # {chapter: [(lesson_name, lesson_link), ...], ...}
	for item in lesson_outline:
		attributes = item.attrs
		try:
			class_type = attributes['class'][0]
			if class_type == 'chapter__title':
				chapter = item.text.strip()
				chapters[chapter] = []
			if class_type == 'chapter__exercise':
				lesson_name = item.find('h5').text
				lesson_link = item.find('a').attrs['href']
				chapters[chapter].append((lesson_name, lesson_link))
		except KeyError:
			pass
	return(chapters)

def make_z_number(num):
	'''Takes int and returns a list of unique, 14-digit numbers. Only goes to 99.'''
	assert num < 100, 'Enter an int that is less than 100. Max list size is 99.'
	string_list = []
	z_index = 0
	for x in range(num):
		z_string = pd.to_datetime('now').strftime('%Y%m%d%H%S') #Year, month, day, hour, second (seconds instead of minutes, to avoid repeated IDs)
		z_string = z_string + '{0:0>2}'.format(z_index)
		string_list.append(z_string)
		z_index += 1
	return string_list

def make_chapter_notes(filename, link):
	'''Receives filename and lesson link from get_whole_course(). (Note that a link from any lesson 
	in a chapter will work. That is, any lesson link has all the information for the chapter.)
	Cycles through all lessons in chapter, converting each lesson and sub-exercise from HTML to
	Markdown. Prints all chapter content into text file.
	Feeds to: get_lesson_json(), NormalExercise_print(), BulletExercise_print(), 
	MultipleChoiceExercise_print(), download_chapter_slides()'''
	lesson_json = get_lesson_json(link)
	for item in lesson_json['exercises']['all']:
		if item['type'] == 'VideoExercise':
			pass
		elif item['type'] == 'NormalExercise':
			NormalExercise_print(item, filename)
		elif item['type'] == 'BulletExercise':
			BulletExercise_print(item, filename)
		elif item['type'] == 'MultipleChoiceExercise':
			MultipleChoiceExercise_print(item, filename)

def get_course_title(link):
	'''Receives link from get_whole_course(). Gets website. Returns title.'''
	html = urlopen(link)
	soup = BeautifulSoup(html, 'lxml')
	return soup.title.text

def create_table_of_contents(dictionary, course_name, z_list):
	'''Receives course dictionary, course name, and list of unique z_numbers from
	get_whole_course(). Creates text file with contents of course, formatted in Markdown, with
	wiki-style links to each chapter.'''    
	filename = z_list[-1] + ' ' + course_name + '.txt'
	with open(filename, 'a') as f:
		for chapter, lessons in dictionary.items():
			z_index = list(dictionary.keys()).index(chapter)
			print('\n# ', '[[' + z_list[z_index] + ']]', chapter, '\n', file=f)
			for lesson_name, lesson_link in lessons:
				print("   *", lesson_name, file=f)

def get_lesson_json(link):
	'''Receives lesson link from make_chapter_notes() and returns 
	the dictionary that holds all the information for the lesson's parent chapter.'''
	html = urlopen(link)
	soup = BeautifulSoup(html, 'lxml')
	string = soup.find_all('script')[3].string	#The script tag that holds window.PRELOADED_STATE
	json_text = string.strip('window.PRELOADED_STATE=')[:-1]	#Strip the JS prefix and trailing semicolon, leaving just the JSON
	lesson_json = json.loads(json_text)
	return lesson_json
                
def NormalExercise_print(json, f):
	'''Works with make_chapter_notes. Parses NormalExercise type lessons and prints them in 
	markdown to file.
	Feeds to: convert_2_md(), get_success_msg().'''
	with open(f, 'a') as f:
		print('#', json['title'], '\n', file=f)
		print('## Exercise\n', file=f)
		print(convert_2_md(json['assignment']), file=f)
		print('## Instructions\n', file=f)
		print(convert_2_md(json['instructions'][:-2]), file=f)
		print('## Code\n', file=f)
		print('```\n' + convert_2_md(json['sample_code']).replace('\\', ''),'\n```\n', file=f)
		print('```\n' + convert_2_md(json['solution']).replace('\\', ''),'\n```\n', file=f)
		print(get_success_msg(json['sct']) + '\n', file=f)

def BulletExercise_print(json, f):
	'''Works with make_chapter_notes. Parses BulletExercise type lessons and prints them in 
	markdown to file.
	Feeds to: convert_2_md(), get_success_msg().'''
	with open(f, 'a') as f:
		print('# ' + json['title'], '\n', file=f)
		print('## Exercise\n', file=f)
		print(convert_2_md(json['assignment']), file=f)
		print('## Instructions & Code \n', file=f)  
		for item in json['subexercises']:
			print(convert_2_md(item['instructions']), file=f)
			print('```\n' + item['sample_code'] + '\n```\n', file=f)
			print('```\n' + item['solution'] + '\n```', file=f)
			print(get_success_msg(item['sct']) + '\n', file=f)

def MultipleChoiceExercise_print(json, f):
	'''Works with make_chapter_notes. Parses MultipleChoiceExercise type lessons and prints them in 
	markdown to file.
	Feeds to: convert_2_md(), get_correct_mc(), get_success_msg().'''
	with open(f, 'a') as f:
		print('# ' + json['title'], '\n', file=f)
		print('## Exercise\n', file=f)
		print(convert_2_md(json['assignment']), file=f)
		print("## Choices\n", file=f)
		for choice in json['instructions']:
			print('* ' + choice, file=f)
		print('\n**Correct answer: ' + get_correct_mc(json['sct']) + '**\n', file=f)
		print(get_success_msg(json['sct']) + '\n', file=f)

def convert_2_md(string):
	'''Receives a string of HTML and uses Pandoc to return the string in Markdown.
	Source: http://www.practicallyefficient.com/2016/12/04/pandoc-and-python.html'''
	p = Popen(['pandoc', '-f', 'html', '-t', 'markdown', '--wrap=preserve'], stdout=PIPE, stdin=PIPE, stderr=STDOUT)
	text = p.communicate(input=string.encode('utf-8'))[0]
	text = unescape(text.decode('utf-8'))
	return text

def get_success_msg(string):
	'''Parses text from DataCamp `sct` JSON and returns the success message as a string.'''
	match = re.search(r'success_msg\("(.*?)"\)', string)
	if match != None:
		message = match.group(1)
		return message
	else:
		return ''

def get_correct_mc(string):
	'''Parses text from DataCamp `sct` JSON and returns the correct answer for MultipleChoiceExercise
	type lessons. Works with MultipleChoiceExercise_print()'''
	match = re.search(r'test_mc\(correct = (\d),', string)
	if match != None:
		message = match.group(1)
		return message
	else:
		return ''
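
# For context, the `sct` field holds DataCamp's submission-correctness tests as a string of
# function calls. A hypothetical example of what the two parsers above pull out of it:
#
#   sct = 'test_mc(correct = 2, msgs = ["Try again.", "Nice!"]); success_msg("Great job!")'
#   get_success_msg(sct)   # -> 'Great job!'
#   get_correct_mc(sct)    # -> '2'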

def download_chapter_slides(chapter_names, one_link):
	'''Receives the list of chapter_names and a link to the first lesson of the first chapter
	from get_whole_course(). Gets links for the PDF slides for each chapter, downloads each PDF,
	and saves it with the same name as the chapter txt file for easy indexing.'''
	course_json = get_lesson_json(one_link)
	pdf_links = []
	for item in course_json['course']['chapters']:
		pdf_links.append(item['slides_link'])
	pdf_tuples = list(zip(chapter_names, pdf_links))
	for t in pdf_tuples:
		filename = t[0][:-len('.txt')] + '.pdf'	#Swap the .txt extension for .pdf
		if t[1]:
			urlretrieve(t[1], filename)

def unescape(s): 
	'''Receives string from convert_2_md() and unescapes non-ascii characters.
	Source: https://wiki.python.org/moin/EscapingHtml'''
	s = s.replace("&lt;", "<")
	s = s.replace("&gt;", ">")
	s = s.replace("&amp;", "&")
	return s

Once the above code is loaded, all you have to do is grab the link for the course and run get_whole_course().

link = 'https://www.datacamp.com/courses/introduction-to-the-tidyverse'
get_whole_course(link)
print('Done!')

And, as a bonus, you can use the function below to scrape a whole track. In the next section, I demo what it looks like to scrape the track that I’m currently working on.

def get_whole_track(link):
    '''Receives the URL of a track landing page. Scrapes the name and link of each course in
    the track, runs get_whole_course() on every one, and writes a track-level table of contents
    with wiki-style links to each course.
    Feeds to: get_whole_course()'''
    html = urlopen(link)
    soup = BeautifulSoup(html, 'lxml')
    track = soup.find_all('a', attrs={'class':'course-block__link ds-snowplow-link-course-block'})
    track_title = soup.title.text
    courses = []

    for x in track:
        title = x.find('h4').text
        tail = x.attrs['href']
        url = 'https://www.datacamp.com' + tail
        courses.append((title, url))
    
    track_file = '20180815170711 ' + track_title + '.txt'  #Create your own z_number first
    
    with open(track_file, 'a') as f:
        print('#', track_title + '\n', file=f)
        for course in courses:
            z_num = get_whole_course(course[1], return_z=True)
            course_name = '[[' + z_num + ']] ' + course[0]
            print('*', course_name, file=f)

In Action

So what happens when you scrape a whole track? Here’s the “Data Scientist with R” track, 23 courses that’ll take you from R-zero to R-hero:

Each course looks something like this:

When I run the get_whole_track() function, here’s the table of contents I end up with:
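
Roughly, the track file is a Markdown list with a wiki-style link to each course’s table of contents (the titles and z numbers below are placeholders; the real file uses the scraped course names):

# Track title

* [[20180815171001]] First course title
* [[20180815171002]] Second course title
* ...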

Here’s the table of contents for the course:
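
The course file, as written by create_table_of_contents(), looks roughly like this (placeholder names again):

# [[20180815170900]] First chapter title

   * First lesson
   * Second lesson

# [[20180815170901]] Second chapter title

   * ...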

And here’s the first chapter of the course:
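
Each chapter note is a sequence of lessons in the format produced by the *_print() functions above, roughly (placeholder content):

# Lesson title

## Exercise

The exercise prompt, converted from HTML to Markdown.

## Instructions

The lesson’s instructions.

## Code

The sample code and solution, each in a fenced code block, followed by the success message from `sct`.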

And that’s it!

What’s Next?

Studying! I’ve been so focused on getting the notes for the “Data Scientist with R” track that I haven’t been actually completing the courses. (Typical, right?) So the next step for me is doubling down on completing the track and adding more R to my skill set.

If you see me noodling around on this blog, please remind me that I should be studying. 😉

Written on August 15, 2018