I started off knowing nothing about web scraping. I found a good link that shows how to scrape using Python:
Found a few websites that explain XPath syntax, and I was off to the races. So, a few baby steps first:
from lxml import html
import requests

page = requests.get('http://hackaday.com/blog/page/3000/')
tree = html.fromstring(page.content)

# get post titles
tree.xpath('//article/header/h1/a/text()')
# get post IDs
tree.xpath('//article/@id')
# get date of publication
tree.xpath('//article/header/div/span[@class="entry-date"]/a/text()')
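If you want to try these XPath queries without hitting the live site, here's a self-contained sketch that runs them against a tiny hand-written HTML snippet mimicking Hackaday's article markup (the snippet and its values are made up; only the XPath expressions come from the script above):

```python
from lxml import html

# Minimal stand-in for one archive page; structure inferred from the XPath queries.
SAMPLE = """
<article id="post-12345">
  <header>
    <h1><a>Example Post Title</a></h1>
    <div><span class="entry-date"><a>June 8, 2016</a></span></div>
  </header>
</article>
"""

tree = html.fromstring(SAMPLE)
titles = tree.xpath('//article/header/h1/a/text()')   # post titles
ids = tree.xpath('//article/@id')                     # post IDs
dates = tree.xpath('//article/header/div/span[@class="entry-date"]/a/text()')  # dates
print(titles, ids, dates)
```

Running it prints the three lists pulled out of the sample article, which is exactly what the real queries return for each live archive page.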
Eventually I wrote a script to scrape the entire HAD archives. On Wednesday, June 8th at 11 PM Pacific time, the archive had 3223 pages. I decided to include the article ID, date of publication, title, author, number of comments, "posted in" categories, and tags. Here is a quick and dirty Python script to output all the data to a tab-delimited file:
# Python 2 script (note the xrange calls)
from lxml import html
import requests

fh = open("Hackaday.txt", 'w')
for pageNum in xrange(1, 3224, 1):
    page = requests.get('http://hackaday.com/blog/page/%d/' % pageNum)
    tree = html.fromstring(page.content)
    titles = tree.xpath('//article/header/h1/a/text()')
    postIDs = tree.xpath('//article/@id')
    dates = tree.xpath('//article/header/div/span[@class="entry-date"]/a/text()')
    authors = tree.xpath('//article/header/div/a[@rel="author"]/text()')
    commentCounts = tree.xpath('//article/header/div/a[@class="comments-counts comments-counts-top"]/text()')
    commentCounts = [i.strip() for i in commentCounts]
    posts = []
    tags = []
    for i in xrange(len(titles)):
        posts.append(tree.xpath('//article[%d]/footer/span/a[@rel="category tag"]/text()' % (i + 1)))
        tags.append(tree.xpath('//article[%d]/footer/span/a[@rel="tag"]/text()' % (i + 1)))
    for i in xrange(len(titles)):
        # print postIDs[i] + '\t' + dates[i] + '\t' + titles[i] + '\t' + authors[i] + '\t' + commentCounts[i] + '\t' + ",".join(posts[i]) + '\t' + ",".join(tags[i])
        fh.write(postIDs[i] + '\t' + dates[i] + '\t' + titles[i] + '\t' + authors[i] + '\t' + commentCounts[i] + '\t' + ",".join(posts[i]) + '\t' + ",".join(tags[i]) + '\n')
fh.close()
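The write loop just glues the seven fields together with tabs. As a sanity check on that line format (and to make it easy to tweak later), the logic can be factored into a small helper; `make_row` is my own name for it, not part of the original script:

```python
def make_row(post_id, date, title, author, comments, posts, tags):
    """Join one article's fields into the tab-delimited line the scraper writes."""
    return '\t'.join([post_id, date, title, author, comments,
                      ','.join(posts), ','.join(tags)]) + '\n'

row = make_row('post-207753', 'June 8, 2016',
               'Hackaday Prize Entry: The Green Machine', 'Anool Mahidharia',
               '1 Comment', ['The Hackaday Prize'],
               ['2016 Hackaday Prize', 'arduino'])
print(row)
```

Each "posted in" category and tag list collapses to a comma-separated string, which is why commas inside the fields would be a problem; tabs were a safer delimiter for this data.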
I felt a bit guilty about scraping the entire website, but Brian said it was OK. The HTML for each page is ~60 KB; times 3223 pages, that's about 193 MB of data. It was distilled down to 3.5 MB, and the whole run took about 25 minutes.
The latest post is #207753 and the earliest is post #7. The numbers are not sequential, and there are a total of 22556 articles. The file looks like this:
post-207753 June 8, 2016 Hackaday Prize Entry: The Green Machine Anool Mahidharia 1 Comment The Hackaday Prize 2016 Hackaday Prize,arduino,Coating machine,grbl,Hackaday Prize,linear motion,motor,raspberry pi,Spraying machine,stepper driver,the hackaday prize
post-208524 June 8, 2016 Rainbow Cats Announce Engagement Kristina Panos 1 Comment ATtiny Hacks attiny,because cats,blinkenlights,RGB LED,smd soldering,wedding announcements
post-208544 June 8, 2016 Talking Star Trek Al Williams 8 Comments linux hacks,software hacks computer speech,natural language,speech recognition,star trek,text to speech,voice command,voice recognition
.....
post-11 September 9, 2004 hack the dakota disposable camera Phillip Torrone 1 Comment digital cameras hacks
post-10 September 8, 2004 mod the cuecat, and scan barcodes… Phillip Torrone 1 Comment misc hacks
post-9 September 7, 2004 make a nintendo controller in to a usb joystick Phillip Torrone 22 Comments computer hacks,macs hacks
post-8 September 6, 2004 change the voice of an aibo ers-7 Phillip Torrone 10 Comments robots hacks
post-7 September 5, 2004 radioshack phone dialer – red box Phillip Torrone 38 Comments misc hacks

I'll upload a zipped version. Hopefully this will save HAD from being scraped over and over again. I'll start slicing and dicing the data soon.
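For the slicing and dicing, a tab-delimited file reads back naturally with Python's csv module. Here's a minimal sketch; the sample rows are adapted from the output above, and the field names are my own labels, not anything the scraper wrote:

```python
import csv
import io
from collections import Counter

# A few rows in the same tab-delimited layout the scraper produced.
sample = (
    "post-207753\tJune 8, 2016\tHackaday Prize Entry: The Green Machine\tAnool Mahidharia\t1 Comment\tThe Hackaday Prize\t2016 Hackaday Prize,arduino\n"
    "post-11\tSeptember 9, 2004\thack the dakota disposable camera\tPhillip Torrone\t1 Comment\tdigital cameras hacks\t\n"
    "post-7\tSeptember 5, 2004\tradioshack phone dialer – red box\tPhillip Torrone\t38 Comments\tmisc hacks\t\n"
)

fields = ['id', 'date', 'title', 'author', 'comments', 'posted_in', 'tags']
rows = [dict(zip(fields, r))
        for r in csv.reader(io.StringIO(sample), delimiter='\t')]

# e.g. count articles per author
per_author = Counter(r['author'] for r in rows)
print(per_author)
```

To read the real file, swap `io.StringIO(sample)` for `open("Hackaday.txt")`; from there, grouping by year, author, or tag is a one-liner each.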
Addendum: for whatever reason, two articles were missing the posts/tags fields. I fixed them manually and uploaded the corrected file.