• Prolific authors/submitters and other tidbits

    pjkim00 06/09/2016 at 23:46 0 comments

    I'll continue to put up interesting things as I think of them. Here are a few interesting tidbits.

    Most often used post tags:

    misc hacks 2327
    Arduino Hacks 1792
    news 1492
    classic hacks 1291
    robots hacks 1248
    tool hacks 1200
    home hacks 1024
    led hacks 1024
    Microcontrollers 893
    Hackaday Columns 813
    peripherals hacks 778
    Featured 750
    transportation hacks 742
    slider 711
    3d Printer hacks 696
    hardware 661
    security hacks 657
    Raspberry Pi 634
    digital cameras hacks 589
    home entertainment hacks 587

    Perhaps unsurprisingly, Arduino hacks are near the top of the list.
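
    A tally like the one above can be produced from the tab-delimited scrape file with a counter. This is a minimal sketch, not the code that produced the numbers: `count_labels` is a made-up helper, and the column index is an assumption (5 would be the "posted ins" column and 6 the tags column in the file layout described in the scraping log).

```python
from collections import Counter


def count_labels(lines, column):
    """Count comma-separated labels found in one tab-delimited column."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= column or not fields[column]:
            continue  # some rows lack the trailing columns
        counts.update(label.strip() for label in fields[column].split(","))
    return counts


# Three rows copied from the scrape file (they have no tags column).
sample = [
    "post-11\tSeptember 9, 2004\thack the dakota disposable camera\tPhillip Torrone\t1 Comment\tdigital cameras hacks\n",
    "post-9\tSeptember 7, 2004\tmake a nintendo controller in to a usb joystick\tPhillip Torrone\t22 Comments\tcomputer hacks,macs hacks\n",
    "post-8\tSeptember 6, 2004\tchange the voice of an aibo ers-7\tPhillip Torrone\t10 Comments\trobots hacks\n",
]

counts = count_labels(sample, column=5)
for label, n in counts.most_common():
    print(label, n)
```

    On the full file, `count_labels(open("Hackaday.txt"), 5).most_common(20)` would give a top-20 list in the same shape as the one above.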

    If you look at the most prolific authors you get:

    Mike Szczys 5716
    Brian Benchoff 3834
    Caleb Kraft 1567
    Eliot 1332
    James Hobson 1063
    Mike Nathan 1039
    Will O'Brien 805
    Adam Fabio 530
    Elliot Williams 405
    Al Williams 401
    Kristina Panos 322
    Rich Bremer 290
    Jakob Griffith 269
    Eric Evenchick 265
    Rick Osgood 225
    Gerrit Coetzee 215
    Marsh 213
    Jeremy Cook 199
    Dan Maloney 198
    Bryan Cockfield 187
    Kevin Dady 187
    Mathieu Stephan 180
    Anool Mahidharia 160
    Juan Aguilar 160
    Vine Veneziani 137

    Plotting the number of articles per week, segregated by the top ten authors, over time gives the following picture:

    You can clearly see when submitters became active and when they stopped. Brian had an early submission sometime in 2006, before he joined HAD. Mike Szczys was active early and then started tailing off around 2013, presumably because of other behind-the-scenes activities.
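
    The per-week, per-author counts behind a plot like this can be sketched as follows. The helper names (`week_start`, `weekly_counts_by_author`) are made up for illustration, and bucketing weeks by their Monday is an assumption; the original plot may have binned the data differently.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def week_start(date_text):
    """Return the Monday of the week containing a date like 'June 8, 2016'."""
    day = datetime.strptime(date_text, "%B %d, %Y").date()
    return day - timedelta(days=day.weekday())


def weekly_counts_by_author(rows, top_n=10):
    """rows: (date_text, author) pairs. Returns {author: Counter({week: n})}."""
    rows = list(rows)
    totals = Counter(author for _, author in rows)
    top = {author for author, _ in totals.most_common(top_n)}
    weekly = defaultdict(Counter)
    for date_text, author in rows:
        if author in top:
            weekly[author][week_start(date_text)] += 1
    return weekly


# (date, author) pairs taken from the sample rows of the scrape file.
rows = [
    ("June 8, 2016", "Anool Mahidharia"),
    ("June 8, 2016", "Kristina Panos"),
    ("June 8, 2016", "Al Williams"),
    ("September 9, 2004", "Phillip Torrone"),
    ("September 8, 2004", "Phillip Torrone"),
    ("September 7, 2004", "Phillip Torrone"),
    ("September 6, 2004", "Phillip Torrone"),
    ("September 5, 2004", "Phillip Torrone"),
]

weekly = weekly_counts_by_author(rows, top_n=1)
for author, per_week in weekly.items():
    for week, n in sorted(per_week.items()):
        print(author, week, n)
```

    Each author's counter can then be handed to whatever plotting tool is at hand, one line or stacked band per author.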

  • Featured articles over time

    pjkim00 06/09/2016 at 18:08 0 comments

    Here is the data requested: featured per week and %featured.

    The above was for articles with the "Featured" post marker. If you include "Featured", "Retrotechtacular", "Hackaday Columns", "The Hackaday Prize", "Ask Hackaday", "Hackaday Store", and "Interviews", that roughly triples the number of articles, but the overall shape looks the same.
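
    Here is a rough sketch of how featured-per-week and %featured could be computed. It assumes the "Featured" marker lives in the comma-separated "posted ins" column, which the log does not state explicitly, and the rows below are made up purely to exercise the function; they are not real data.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def week_start(date_text):
    """Return the Monday of the week containing a date like 'June 8, 2016'."""
    day = datetime.strptime(date_text, "%B %d, %Y").date()
    return day - timedelta(days=day.weekday())


def featured_share_by_week(rows, markers=("Featured",)):
    """rows: (date_text, posted_in_text) pairs.

    Returns {week: (featured_count, total_count, percent_featured)}.
    """
    markers = set(markers)
    totals = defaultdict(int)
    featured = defaultdict(int)
    for date_text, posted_in in rows:
        week = week_start(date_text)
        totals[week] += 1
        if markers & set(posted_in.split(",")):
            featured[week] += 1
    return {
        week: (featured[week], totals[week], 100.0 * featured[week] / totals[week])
        for week in totals
    }


# Made-up rows for illustration only.
made_up_rows = [
    ("June 6, 2016", "Featured,misc hacks"),
    ("June 7, 2016", "misc hacks"),
    ("June 8, 2016", "Hackaday Columns,Featured"),
    ("June 8, 2016", "news"),
    ("June 14, 2016", "Hackaday Columns"),
]

narrow = featured_share_by_week(made_up_rows)
wide = featured_share_by_week(made_up_rows, markers=("Featured", "Hackaday Columns"))
```

    Passing the longer list of markers from the paragraph above as `markers` gives the wider definition of "featured".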

  • First Plot

    pjkim00 06/09/2016 at 08:19 0 comments

    OK, first plot of the data before I go to bed. I munged the data and plotted posts per day as a function of time. Not surprisingly, the number of posts per day has been going up since the early days. Somewhat surprisingly, the maximum was way back on Feb 28, 2011, when there were no fewer than 16 posts! Here you go:

    True to the site's name, most days early on had one article. Now the mode appears to be 8 per day.
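
    The posts-per-day numbers, the busiest day, and the mode can be sketched with two counters. `posts_per_day_stats` is a made-up helper name, and days with no posts at all never show up in the counts, so the mode here is taken over days that had at least one post.

```python
from collections import Counter


def posts_per_day_stats(dates):
    """Return (per-day counts, busiest day, its count, most common daily count)."""
    per_day = Counter(dates)
    busiest_day, busiest_count = per_day.most_common(1)[0]
    mode_count = Counter(per_day.values()).most_common(1)[0][0]
    return per_day, busiest_day, busiest_count, mode_count


# Publication dates taken from the sample rows of the scrape file.
dates = [
    "June 8, 2016", "June 8, 2016", "June 8, 2016",
    "September 9, 2004", "September 8, 2004", "September 7, 2004",
    "September 6, 2004", "September 5, 2004",
]

per_day, busiest_day, busiest_count, mode_count = posts_per_day_stats(dates)
print(busiest_day, busiest_count, mode_count)
```

    Restricting `dates` to a recent range before calling the function gives the "now" figure rather than the all-time one.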

  • Scraping the HAD website

    pjkim00 06/09/2016 at 06:48 1 comment

    I started off knowing nothing about web scraping. I found a good guide that shows how to scrape using Python:

    http://docs.python-guide.org/en/latest/scenarios/scrape/

    I found a few websites that explain the XPath syntax and I was off to the races. So, a few baby steps first.

    from lxml import html
    import requests
    page = requests.get('http://hackaday.com/blog/page/3000/')
    tree = html.fromstring(page.content)
    # get post titles
    tree.xpath('//article/header/h1/a/text()')
    # get post IDs
    tree.xpath('//article/@id')
    # get Date of publication
    tree.xpath('//article/header/div/span[@class="entry-date"]/a/text()')
    

    Eventually I wrote a script to scrape the entire HAD archive. On Wednesday, June 8th at 11 PM Pacific time, it had 3223 pages. I decided to include article ID, date of publication, title, author, number of comments, "posted ins", and tags. Here is a quick and dirty Python script that writes all the data to a tab-delimited file:

    # Python 2 script (uses xrange)
    from lxml import html
    import requests
    
    # One output line per article:
    # ID, date, title, author, comment count, "posted ins", tags
    fh = open("Hackaday.txt", 'w')
    for pageNum in xrange(1, 3224):  # archive pages 1..3223
        page = requests.get('http://hackaday.com/blog/page/%d/' % pageNum)
        tree = html.fromstring(page.content)
    
        # Page-wide lists; entry i of each list is taken to belong to article i
        titles = tree.xpath('//article/header/h1/a/text()')
        postIDs = tree.xpath('//article/@id')
        dates = tree.xpath('//article/header/div/span[@class="entry-date"]/a/text()')
        authors = tree.xpath('//article/header/div/a[@rel="author"]/text()')
        commentCounts = tree.xpath('//article/header/div/a[@class="comments-counts comments-counts-top"]/text()')
        commentCounts = [i.strip() for i in commentCounts]
        # "Posted ins" and tags are collected separately for each article
        posts = []
        tags = []
        for i in xrange(len(titles)):
            posts.append(tree.xpath('//article[%d]/footer/span/a[@rel="category tag"]/text()' % (i+1)))
            tags.append(tree.xpath('//article[%d]/footer/span/a[@rel="tag"]/text()' % (i+1)))
        # Write one tab-delimited line per article
        for i in xrange(len(titles)):
            fh.write(postIDs[i] + '\t' + dates[i] + '\t' + titles[i] + '\t' + authors[i] + '\t' + commentCounts[i] + '\t' + ",".join(posts[i]) + '\t' + ",".join(tags[i]) + '\n')
    fh.close()

    I felt a bit guilty about scraping the entire website, but Brian said it was OK. At ~60 KB of HTML per page, 3223 pages come to about 193 MB of data. This was distilled down to 3.5 MB, and the whole run took about 25 minutes.
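
    The script above collects each field as a page-wide list and pairs the lists up by position, so a single article with a missing field would shift everything after it on that page. One possible variation, not the script that produced the file, is to walk the `<article>` elements and run relative XPath queries inside each one. The sketch below is written for Python 3 and covers only the parsing step; `first_text` and `parse_page` are made-up helper names, and the inline HTML is a stripped-down stand-in for a real listing page.

```python
from lxml import html


def first_text(node, path):
    """Return the first stripped text match for a relative XPath, or ''."""
    found = node.xpath(path)
    return found[0].strip() if found else ""


def parse_page(page_html):
    """Return one tuple of fields per <article> element on a listing page."""
    tree = html.fromstring(page_html)
    rows = []
    for article in tree.xpath('//article'):
        rows.append((
            article.get('id', ''),
            first_text(article, './header/div/span[@class="entry-date"]/a/text()'),
            first_text(article, './header/h1/a/text()'),
            first_text(article, './header/div/a[@rel="author"]/text()'),
            first_text(article, './header/div/a[@class="comments-counts comments-counts-top"]/text()'),
            ",".join(article.xpath('./footer/span/a[@rel="category tag"]/text()')),
            ",".join(article.xpath('./footer/span/a[@rel="tag"]/text()')),
        ))
    return rows


# Stand-in markup: the second article has no comment link and no footer.
sample_html = """
<html><body>
<article id="post-1">
  <header><h1><a href="#">First title</a></h1>
    <div><span class="entry-date"><a href="#">June 8, 2016</a></span>
      <a rel="author" href="#">Author A</a>
      <a class="comments-counts comments-counts-top" href="#"> 1 Comment </a></div>
  </header>
  <footer><span><a rel="category tag" href="#">misc hacks</a></span>
    <span><a rel="tag" href="#">arduino</a><a rel="tag" href="#">led</a></span></footer>
</article>
<article id="post-2">
  <header><h1><a href="#">Second title</a></h1>
    <div><span class="entry-date"><a href="#">June 7, 2016</a></span>
      <a rel="author" href="#">Author B</a></div>
  </header>
</article>
</body></html>
"""

rows = parse_page(sample_html)
for row in rows:
    print("\t".join(row))
```

    With this layout a missing field becomes an empty string in that article's own row instead of misaligning its neighbours.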

    The latest post is #207753 and the earliest is post #7. The numbers are not sequential, and there are a total of 22556 articles. The file looks like this:

    post-207753	June 8, 2016	Hackaday Prize Entry: The Green Machine	Anool Mahidharia	1 Comment	The Hackaday Prize	2016 Hackaday Prize,arduino,Coating machine,grbl,Hackaday Prize,linear motion,motor,raspberry pi,Spraying machine,stepper driver,the hackaday prize
    post-208524	June 8, 2016	Rainbow Cats Announce Engagement	Kristina Panos	1 Comment	ATtiny Hacks	attiny,because cats,blinkenlights,RGB LED,smd soldering,wedding announcements
    post-208544	June 8, 2016	Talking Star Trek	Al Williams	8 Comments	linux hacks,software hacks	computer speech,natural language,speech recognition,star trek,text to speech,voice command,voice recognition
    .....
    post-11	September 9, 2004	hack the dakota disposable camera	Phillip Torrone	1 Comment	digital cameras hacks
    post-10	September 8, 2004	mod the cuecat, and scan barcodes…	Phillip Torrone	1 Comment	misc hacks
    post-9	September 7, 2004	make a nintendo controller in to a usb joystick	Phillip Torrone	22 Comments	computer hacks,macs hacks
    post-8	September 6, 2004	change the voice of an aibo ers-7	Phillip Torrone	10 Comments	robots hacks
    post-7	September 5, 2004	radioshack phone dialer – red box	Phillip Torrone	38 Comments	misc hacks
    I'll upload a zipped version. Hopefully this will save HAD from being scraped over and over again. I'll start slicing and dicing the data soon.

    Addendum: for whatever reason, two articles were missing the posts/tags fields. I fixed them manually and uploaded the corrected file.
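
    For anyone picking up the file, here is a small sketch of reading it back in with Python 3's `csv` module. The field names in `FIELDS` are my own labels for the seven columns, rows with fewer columns (such as the early posts without tags) are padded with empty strings, and treating a comment field that does not start with a number as 0 is an assumption.

```python
import csv
import io

FIELDS = ["post_id", "date", "title", "author", "comments", "posted_in", "tags"]


def read_posts(handle):
    """Yield one dict per line of the tab-delimited scrape file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue
        row = row + [""] * (len(FIELDS) - len(row))  # pad short rows
        post = dict(zip(FIELDS, row))
        words = post["comments"].split()
        # "22 Comments" -> 22; anything not starting with a number -> 0
        post["comments"] = int(words[0]) if words and words[0].isdigit() else 0
        post["posted_in"] = [p for p in post["posted_in"].split(",") if p]
        post["tags"] = [t for t in post["tags"].split(",") if t]
        yield post


# Two rows copied from the file, one with tags and one without.
sample = io.StringIO(
    "post-208544\tJune 8, 2016\tTalking Star Trek\tAl Williams\t8 Comments\t"
    "linux hacks,software hacks\tcomputer speech,natural language,"
    "speech recognition,star trek,text to speech,voice command,voice recognition\n"
    "post-9\tSeptember 7, 2004\tmake a nintendo controller in to a usb joystick\t"
    "Phillip Torrone\t22 Comments\tcomputer hacks,macs hacks\n"
)

posts = list(read_posts(sample))
for post in posts:
    print(post["post_id"], post["comments"], post["posted_in"], len(post["tags"]))
```

    For the real file, open it with `open("Hackaday.txt", newline="")` and pass the handle to `read_posts`; the right `encoding` argument depends on how the file was written.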