Scoring by vocabulary

A project log for Hackaday.io Spambot Hunter Project

Re-claiming hackaday.io from the spam bots

Stuart Longland • 02/03/2018 at 23:03

So, I implemented scoring as a fairly naïve sum of individual word scores, computed with the earlier algorithm.

Because the spambot-to-real-user ratio isn't 50:50, this meant that plenty of normal words and phrases boosted the spambots' scores to normal levels.

I needed to re-think how I used that data.  In the end, I decided to dump the words' scores into an array, sort it in ascending order, then sum the worst 10.

That proved to be quite effective.  It means they can't cheat the system by putting up their usual text, then slapping in a handful of hacker lingo words to boost their score.  Users that only use one or two bad words will usually score highly enough to avoid getting flagged.
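As a sketch of that idea (the function name and sample scores here are hypothetical, not lifted from the actual script), the worst-10 summing looks like:

```python
def user_score(word_scores, worst_n=10):
    """Sum the N lowest (most spam-like) per-word scores for a user.

    word_scores: per-word scores, where negative means spam-associated.
    Summing only the worst few means padding a profile with hacker
    lingo can't dilute a cluster of genuinely bad words.
    """
    return sum(sorted(word_scores)[:worst_n])

# A profile that is mostly neutral but has a cluster of bad words
# still comes out strongly negative:
scores = [0.5, 0.2, -1.0, -0.9, -0.8, 0.1, -1.2, 0.4, -0.7, -1.1, 0.3, -0.6]
print(user_score(scores))  # clearly negative despite the neutral words
```

If a user has fewer than ten scored words, the slice simply sums everything they have, so sparse profiles still get a score.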

The database now has nearly 20,000 words and 65,800 word pairs, scored according to typical usage of users that have arrived at hackaday.io since early June last year.

With that information, the script can auto-classify some users with better accuracy:

2018-02-03 22:47:02,402       INFO HADSHApp.crawler 25397/MainThread: New user: USA Tile & Marble [#xxxxxx]
2018-02-03 22:47:02,404      DEBUG HADSHApp.crawler 25397/MainThread: User USA Tile & Marble [#xxxxxx] is in groups set() (classifie
d False)
2018-02-03 22:47:03,433      DEBUG HADSHApp.crawler 25397/MainThread: Inspecting user USA Tile & Marble [#xxxxxx]
2018-02-03 22:47:03,440    WARNING polyglot.detect.base 25397/MainThread: Detector is not able to detect the language reliably.
2018-02-03 22:47:03,443      DEBUG     HADSHApp.api 25397/MainThread: Query arguments: {'per_page': 50, 'page': 1, 'api_key': 'xxxxx
xxxxxxxxxxx'}
2018-02-03 22:47:03,446      DEBUG     HADSHApp.api 25397/MainThread: GET 'https://api.hackaday.io/v1/users/xxxxxx/links?per_page=50
&page=1&api_key=xxxxxxxxxxxxxxxx'
2018-02-03 22:47:04,683       INFO HADSHApp.crawler 25397/MainThread: User USA Tile & Marble [#xxxxxx] has link to VIEW OUR SHOWROOM
 <[REDACTED]>
2018-02-03 22:47:04,754      DEBUG HADSHApp.crawler 25397/MainThread: New word: mosaic
2018-02-03 22:47:04,789      DEBUG HADSHApp.crawler 25397/MainThread: New word: ceramic
2018-02-03 22:47:04,808      DEBUG HADSHApp.crawler 25397/MainThread: New word: tiles
2018-02-03 22:47:04,818      DEBUG HADSHApp.crawler 25397/MainThread: New word: porcelain
2018-02-03 22:47:04,862      DEBUG HADSHApp.crawler 25397/MainThread: New word: marble
2018-02-03 22:47:04,891      DEBUG HADSHApp.crawler 25397/MainThread: New word: showroom
2018-02-03 22:47:04,901      DEBUG HADSHApp.crawler 25397/MainThread: New word: travertine
2018-02-03 22:47:04,911      DEBUG HADSHApp.crawler 25397/MainThread: New word: collection
2018-02-03 22:47:04,945      DEBUG HADSHApp.crawler 25397/MainThread: New word: flooring
2018-02-03 22:47:04,963      DEBUG HADSHApp.crawler 25397/MainThread: New word: pompano
2018-02-03 22:47:04,973      DEBUG HADSHApp.crawler 25397/MainThread: New word: tile
2018-02-03 22:47:06,090      DEBUG HADSHApp.crawler 25397/MainThread: User USA Tile & Marble [#xxxxxx] has score -3.675362
2018-02-03 22:47:06,098      DEBUG HADSHApp.crawler 25397/MainThread: Auto-classifying USA Tile & Marble [#xxxxxx] as suspect

That's out of the logs.  The script "learned" some new words there.  In the database, we can see how those words are scored:

hadsh=> select score, count, score::float/count::float rel_score from word where word='flooring';
 score | count | rel_score 
-------+-------+-----------
    -2 |     2 |        -1
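The `rel_score` column is just the accumulated score divided by the number of times the word has been seen, giving a per-occurrence average between -1 (only ever seen on spam profiles) and +1 (only ever seen on legitimate ones).  A rough Python equivalent (function name hypothetical):

```python
def relative_score(score, count):
    """Per-occurrence average score for a word: -1 means the word has
    only appeared on spam profiles, +1 only on legitimate ones."""
    return score / count if count else 0.0

print(relative_score(-2, 2))  # -1.0, matching the 'flooring' row above
```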

As I say, machine learning at its most primitive.  I've considered whether to integrate the URIBL or SURBL DNS blacklists, but so far this hasn't been necessary, nor have any of the links I've tried actually shown up in those blacklists.

For now, the site is once again blocked, so it's back to the manual methods.  If things go to plan, we should be able to expand the data set to cover arrivals in late 2016 once the user retrieval resumes.
