Scraping the HAD website

A project log for Hackaday statistics

Adventures in web scraping and data analysis

pjkim00 • 06/09/2016 at 06:48 • 1 Comment

I started off knowing nothing about web scraping. I found a good link that shows how to scrape using Python:

I found a few websites that explain XPath syntax and I was off to the races. So, a few baby steps first.

from lxml import html
import requests
page = requests.get('')
tree = html.fromstring(page.content)
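# get post titles
titles = tree.xpath('//article/header/h1/a/text()')
# get post IDs
postIDs = tree.xpath('//article/@id')
# get Date of publication
dates = tree.xpath('//article/header/div/span[@class="entry-date"]/a/text()')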
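
To see how path expressions like these pick fields out of a page, here is a minimal Python 3 sketch that runs offline. It swaps in the standard library's ElementTree, which supports only a subset of XPath, so text and attributes are read from the matched elements rather than selected with text() or @id. The SAMPLE markup and the extract_posts helper are hypothetical, modeled on the element names used in this log; the real pages are larger and may not parse as strict XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; not copied from the real site.
SAMPLE = """
<html><body>
  <article id="post-2">
    <header>
      <h1><a href="#">Second post</a></h1>
      <div><span class="entry-date"><a href="#">June 8, 2016</a></span></div>
    </header>
  </article>
  <article id="post-1">
    <header>
      <h1><a href="#">First post</a></h1>
      <div><span class="entry-date"><a href="#">June 7, 2016</a></span></div>
    </header>
  </article>
</body></html>
"""


def extract_posts(markup):
    """Return (id, date, title) tuples, one per <article> element."""
    root = ET.fromstring(markup)
    rows = []
    for article in root.iter("article"):
        # Paths are relative to the current article, so the fields of
        # one post can never be paired with those of another.
        title = article.find("header/h1/a")
        date = article.find("header/div/span[@class='entry-date']/a")
        rows.append((
            article.get("id"),
            date.text if date is not None else "",
            title.text if title is not None else "",
        ))
    return rows


posts = extract_posts(SAMPLE)
for post in posts:
    print(post)
```

Querying inside each article, rather than collecting one page-wide list per field, means a post with a missing field yields an empty string instead of shifting every later value up by one.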

Eventually I wrote a script to scrape the entire HAD archive. On Wednesday, June 8th at 11 PM Pacific time, the archive had 3223 pages. I decided to include the article ID, date of publication, title, author, number of comments, "posted in" categories, and tags. Here is a quick and dirty Python script to output all the data to a tab-delimited file:

from lxml import html
import requests

fh = open("Hackaday.txt", 'w')
for pageNum in xrange(1,3224,1):
    page = requests.get(''%pageNum)
    tree = html.fromstring(page.content)

    titles = tree.xpath('//article/header/h1/a/text()')
    postIDs = tree.xpath('//article/@id')
    dates = tree.xpath('//article/header/div/span[@class="entry-date"]/a/text()')
    authors = tree.xpath('//article/header/div/a[@rel="author"]/text()')
    commentCounts = tree.xpath('//article/header/div/a[@class="comments-counts comments-counts-top"]/text()')
    commentCounts = [i.strip() for i in commentCounts]
    posts = []
    tags = []
    for i in xrange(len(titles)):
        # "posted in" categories for the i-th article on the page
        posts.append(tree.xpath('//article[%d]/footer/span/a[@rel="category tag"]/text()'%(i+1)))
        # tags for the same article (assumption: tag links are marked rel="tag")
        tags.append(tree.xpath('//article[%d]/footer/span/a[@rel="tag"]/text()'%(i+1)))
    for i in xrange(len(titles)):
        #print postIDs[i] + '\t' + dates[i] +'\t' +titles[i] +'\t' + authors[i]+'\t'+commentCounts[i]+ '\t' + ",".join(posts[i]) + '\t' + ",".join(tags[i])
        fh.write(postIDs[i] + '\t' + dates[i] +'\t' +titles[i] +'\t' + authors[i]+'\t'+commentCounts[i]+ '\t' + ",".join(posts[i]) + '\t' + ",".join(tags[i]) + '\n')
fh.close()
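
Joining fields with '\t' by hand works as long as no title or tag contains a tab or a newline of its own. As a hedged alternative, the Python 3 sketch below lets the standard csv module do the writing and collapses stray whitespace first. The write_rows helper and the row passed to it are illustrative placeholders, not scraped data.

```python
import csv
import io


def write_rows(handle, rows):
    """Write rows as tab-delimited text, one article per line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for row in rows:
        # Collapse tabs, newlines and repeated spaces inside each field
        # so every article keeps the same number of columns.
        writer.writerow([" ".join(str(field).split()) for field in row])


# Placeholder row, for illustration only.
buffer = io.StringIO()
write_rows(buffer, [
    ("post-1", "June 8, 2016", "A title\twith a tab", "Some Author",
     "1 Comment", "misc hacks", "tag one,tag two"),
])
output = buffer.getvalue()
print(output, end="")
```

With the default settings, csv.writer wraps a field in double quotes if it contains a quote character, so a reader of the file should use the csv module with the same delimiter rather than a plain split.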

I felt a bit guilty about scraping the entire website, but Brian said it was OK. The HTML for each page is ~60 KB, so 3223 pages come to about 193 MB of data. This was distilled down to 3.5 MB of data, and the whole run took about 25 minutes.
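
The arithmetic behind those figures can be checked in a few lines of Python 3. The inputs are the numbers quoted above; the per-page time and the reduction factor are derived here and assume the ~60 KB average holds across all pages.

```python
PAGES = 3223       # archive pages scraped
PAGE_KB = 60       # approximate HTML size per page
OUTPUT_MB = 3.5    # size of the tab-delimited result
MINUTES = 25       # approximate run time

total_mb = PAGES * PAGE_KB / 1000          # decimal megabytes downloaded
seconds_per_page = MINUTES * 60 / PAGES    # average time per request
reduction = total_mb / OUTPUT_MB           # raw HTML versus distilled data

print("downloaded: about %.0f MB" % total_mb)
print("time per page: about %.2f s" % seconds_per_page)
print("size reduction: about %.0fx" % reduction)
```

That works out to roughly two requests per second against the site, and a result around 55 times smaller than the HTML it came from.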

The latest post is #207753 and the earliest is post #7. The numbers are not sequential, and there are a total of 22556 articles. The file looks like this:

post-207753	June 8, 2016	Hackaday Prize Entry: The Green Machine	Anool Mahidharia	1 Comment	The Hackaday Prize	2016 Hackaday Prize,arduino,Coating machine,grbl,Hackaday Prize,linear motion,motor,raspberry pi,Spraying machine,stepper driver,the hackaday prize
post-208524	June 8, 2016	Rainbow Cats Announce Engagement	Kristina Panos	1 Comment	ATtiny Hacks	attiny,because cats,blinkenlights,RGB LED,smd soldering,wedding announcements
post-208544	June 8, 2016	Talking Star Trek	Al Williams	8 Comments	linux hacks,software hacks	computer speech,natural language,speech recognition,star trek,text to speech,voice command,voice recognition
post-11	September 9, 2004	hack the dakota disposable camera	Phillip Torrone	1 Comment	digital cameras hacks
post-10	September 8, 2004	mod the cuecat, and scan barcodes…	Phillip Torrone	1 Comment	misc hacks
post-9	September 7, 2004	make a nintendo controller in to a usb joystick	Phillip Torrone	22 Comments	computer hacks,macs hacks
post-8	September 6, 2004	change the voice of an aibo ers-7	Phillip Torrone	10 Comments	robots hacks
post-7	September 5, 2004	radioshack phone dialer – red box	Phillip Torrone	38 Comments	misc hacks

I'll upload a zipped version. Hopefully this will save HAD from being scraped over and over again. I'll start slicing and dicing the data soon.
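
As a starting point for that slicing, here is a minimal Python 3 sketch that turns lines of the file into typed records. The two SAMPLE rows are abridged from the output shown above. The parse_row helper is hypothetical, and treating a non-numeric comment label as zero is an assumption about how posts without comments are labeled.

```python
import csv
import io
from datetime import datetime

# Two rows abridged from the sample output above.
SAMPLE = (
    "post-208544\tJune 8, 2016\tTalking Star Trek\tAl Williams\t8 Comments\t"
    "linux hacks,software hacks\tcomputer speech,natural language\n"
    "post-9\tSeptember 7, 2004\tmake a nintendo controller in to a usb joystick\t"
    "Phillip Torrone\t22 Comments\tcomputer hacks,macs hacks\n"
)


def parse_row(fields):
    """Turn one raw row into a dict with typed values."""
    # Pad short rows: the oldest posts in the sample have no tags column.
    fields = list(fields) + [""] * (7 - len(fields))
    post_id, date, title, author, comments, categories, tags = fields[:7]
    count = comments.split()[0] if comments.split() else ""
    return {
        "id": int(post_id.replace("post-", "")),
        "date": datetime.strptime(date, "%B %d, %Y").date(),
        "title": title,
        "author": author,
        # Assumption: a label that does not start with a number means 0.
        "comments": int(count) if count.isdigit() else 0,
        "categories": [c for c in categories.split(",") if c],
        "tags": [t for t in tags.split(",") if t],
    }


# QUOTE_NONE keeps quote characters in titles as ordinary text.
reader = csv.reader(io.StringIO(SAMPLE), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = [parse_row(r) for r in reader]
for row in rows:
    print(row["id"], row["date"], row["comments"], row["categories"])
```

To run it on the real file, replace the io.StringIO(SAMPLE) argument with an open handle to Hackaday.txt.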

Addendum: for whatever reason, two articles were missing the posts/tags fields. I fixed them manually and uploaded the corrected file.
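
Rows like these can be found automatically rather than by eye. The Python 3 sketch below flags any line whose "posted in" column is empty or absent. It checks only that column because the oldest posts in the sample legitimately carry no tags. The find_incomplete helper is hypothetical, and the second sample line is a made-up placeholder, not one of the two real articles.

```python
def find_incomplete(lines):
    """Return (line_number, post_id) for rows lacking a category field."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        # Column 6 (index 5) holds the "posted in" categories.
        categories = fields[5] if len(fields) > 5 else ""
        if not categories.strip():
            problems.append((number, fields[0]))
    return problems


sample = [
    # Row taken from the sample output above.
    "post-9\tSeptember 7, 2004\tmake a nintendo controller in to a usb joystick"
    "\tPhillip Torrone\t22 Comments\tcomputer hacks,macs hacks\n",
    # Made-up placeholder row with empty category and tag columns.
    "post-0\tJanuary 1, 2000\tplaceholder title\tplaceholder author"
    "\t0 Comments\t\t\n",
]

incomplete = find_incomplete(sample)
print(incomplete)
```

Passing an open file handle instead of the sample list would report the line numbers to fix in the real output.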


aimeeyoung wrote 02/06/2019 at 14:59

Interesting article, but I did not really understand anything hahah. I want to learn Python, I wonder, a person with a creative mind, will be able to master such knowledge? The last thing I did was high school homework help (btw using this site) my friends. But it starts to bother me. I want to know for myself the world of numbers and calculations. Today I will start my training :)
