• Scoring by vocabulary

    Stuart Longland02/03/2018 at 23:03 0 comments

    So, I implemented the scoring by a fairly naïve summing of individual word scores, computed with the earlier algorithm.

    This… because the spambot-to-real-user ratio is nowhere near 50:50, meant that plenty of ordinary words and phrases boosted the spambots' scores up to normal levels.

    I needed to re-think how I used that data.  In the end, I decided to dump the words' scores into an array, sort it in ascending order, then sum the worst 10.
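
    As a rough sketch (the function and variable names here are illustrative, not the actual project code), the idea looks something like this:

    def profile_score(word_scores, worst_n=10):
        """Score a profile by summing only its worst-scoring words.

        word_scores is a list of per-word relative scores (score/count),
        each roughly in the range -1.0 (spammy) to +1.0 (legitimate).
        """
        # Sort ascending so the most "spammy" words come first,
        # then sum only the worst few.
        return sum(sorted(word_scores)[:worst_n])

    # One bad word amongst otherwise-normal text barely moves the total.
    print(profile_score([-1.0, 0.9, 0.95, 1.0, 1.0, 1.0]))   # 3.85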

    That proved to be quite effective.  It means they can't cheat the system by putting their usual text up, then slapping in a handful of hacker-lingo words to boost their score.  Users that only use one or two bad words will usually score highly enough to avoid getting flagged.

    The database now has nearly 20000 words and 65800 word pairs, scored according to typical usage of users that have arrived at hackaday.io since early June last year.

    With that information, the script can auto-classify some users with better accuracy:

    2018-02-03 22:47:02,402       INFO HADSHApp.crawler 25397/MainThread: New user: USA Tile & Marble [#xxxxxx]
    2018-02-03 22:47:02,404      DEBUG HADSHApp.crawler 25397/MainThread: User USA Tile & Marble [#xxxxxx] is in groups set() (classified False)
    2018-02-03 22:47:03,433      DEBUG HADSHApp.crawler 25397/MainThread: Inspecting user USA Tile & Marble [#xxxxxx]
    2018-02-03 22:47:03,440    WARNING polyglot.detect.base 25397/MainThread: Detector is not able to detect the language reliably.
    2018-02-03 22:47:03,443      DEBUG     HADSHApp.api 25397/MainThread: Query arguments: {'per_page': 50, 'page': 1, 'api_key': 'xxxxxxxxxxxxxxxx'}
    2018-02-03 22:47:03,446      DEBUG     HADSHApp.api 25397/MainThread: GET 'https://api.hackaday.io/v1/users/xxxxxx/links?per_page=50&page=1&api_key=xxxxxxxxxxxxxxxx'
    2018-02-03 22:47:04,683       INFO HADSHApp.crawler 25397/MainThread: User USA Tile & Marble [#xxxxxx] has link to VIEW OUR SHOWROOM <[REDACTED]>
    2018-02-03 22:47:04,754      DEBUG HADSHApp.crawler 25397/MainThread: New word: mosaic
    2018-02-03 22:47:04,789      DEBUG HADSHApp.crawler 25397/MainThread: New word: ceramic
    2018-02-03 22:47:04,808      DEBUG HADSHApp.crawler 25397/MainThread: New word: tiles
    2018-02-03 22:47:04,818      DEBUG HADSHApp.crawler 25397/MainThread: New word: porcelain
    2018-02-03 22:47:04,862      DEBUG HADSHApp.crawler 25397/MainThread: New word: marble
    2018-02-03 22:47:04,891      DEBUG HADSHApp.crawler 25397/MainThread: New word: showroom
    2018-02-03 22:47:04,901      DEBUG HADSHApp.crawler 25397/MainThread: New word: travertine
    2018-02-03 22:47:04,911      DEBUG HADSHApp.crawler 25397/MainThread: New word: collection
    2018-02-03 22:47:04,945      DEBUG HADSHApp.crawler 25397/MainThread: New word: flooring
    2018-02-03 22:47:04,963      DEBUG HADSHApp.crawler 25397/MainThread: New word: pompano
    2018-02-03 22:47:04,973      DEBUG HADSHApp.crawler 25397/MainThread: New word: tile
    2018-02-03 22:47:06,090      DEBUG HADSHApp.crawler 25397/MainThread: User USA Tile & Marble [#xxxxxx] has score -3.675362
    2018-02-03 22:47:06,098      DEBUG HADSHApp.crawler 25397/MainThread: Auto-classifying USA Tile & Marble [#xxxxxx] as suspect

    That's out of the logs.  The script "learned" some new words there.  In the database, we can see how those words are scored:

    hadsh=> select score, count, score::float/count::float rel_score from word where word='flooring';
     score | count | rel_score 
    -------+-------+-----------
        -2 |     2 |        -1

    As I say, machine learning at its most primitive.  I've considered whether to integrate the URIBL or SURBL DNS blacklists, but so far this has not been necessary, nor have any of the links I tried actually shown up in those blacklists.

    For now, the site is once again blocked, so it's back to the manual methods.  Things going to plan, we should be able to expand the data set to cover arrivals in late 2016 once user retrieval resumes.

  • Scoring on words

    Stuart Longland02/02/2018 at 23:33 0 comments

    So, I implemented polyglot in the code, and while I'm not yet doing the user classification, I am at least collecting the information, and already I'm seeing some trends.

    I have a table of words with four columns:

    • BIGINT primary key (indexing by numeric values is easier for a database)
    • TEXT word (with an index for fast look-up)
    • INTEGER score
    • INTEGER count

    The score and count are how we'll keep track of the "spammyness" of a word.  These are updated when a user is classified (by a human).  If the user is classed as legitimate, both are incremented; if they're a spambot user, count is incremented while score is decremented.

    The end result is that, when you compute score/count, this normalised score is closer to +1.0 for words that are typical of legitimate users, and closer to -1.0 for spambot users.
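
    A minimal sketch of that update-and-normalise logic, using an in-memory dict in place of the actual word table:

    from collections import defaultdict

    # word -> [score, count]; the real thing lives in the database.
    word_stats = defaultdict(lambda: [0, 0])

    def classify(profile_words, legit):
        """Update word statistics after a human classifies a profile."""
        for word in profile_words:
            stats = word_stats[word.lower()]
            stats[0] += 1 if legit else -1   # score
            stats[1] += 1                    # count

    def rel_score(word):
        """Normalised score: near +1.0 is legitimate-looking, -1.0 is spammy."""
        score, count = word_stats[word.lower()]
        return score / count if count else 0.0

    classify(["hardware", "hacker"], legit=True)
    classify(["estate", "agent"], legit=False)
    print(rel_score("estate"))   # -1.0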

    I'm not sure how good the tokenisation is in polyglot for non-English scripts, but so far, the spambots I've seen that post in Korean or Chinese tend to have other traits that scream spambot, like creating lots of "projects" with much the same text.

    When a 5-minute-old user account has 20 projects, one scratches their head and wonders why a legitimate user would do that.

    Already, this approach is showing some insights:

    hadsh=> select word, score, count, score::float/count::float rel_score from word where count>1 order by rel_score, count desc;
           word        | score | count |     rel_score      
    -------------------+-------+-------+--------------------
     estate            |    -7 |     7 |                 -1
     agent             |    -6 |     6 |                 -1
     top               |    -4 |     4 |                 -1
     realtors          |    -4 |     4 |                 -1
     bowie             |    -3 |     3 |                 -1
     rated             |    -2 |     2 |                 -1
     polska            |    -2 |     2 |                 -1
     transhelsa        |    -2 |     2 |                 -1
     erekcja           |    -2 |     2 |                 -1
     real              |    -4 |    10 |               -0.4
     md                |    -1 |     3 | -0.333333333333333
     en                |     0 |     8 |                  0
     happy             |     0 |     2 |                  0
     local             |     0 |     2 |                  0
     barcelona         |     1 |     3 |  0.333333333333333
     mi                |     1 |     3 |  0.333333333333333
     la                |     3 |     7 |  0.428571428571429
     best              |     5 |    11 |  0.454545454545455
     really            |     4 |     6 |  0.666666666666667
     de                |    13 |    19 |  0.684210526315789
     hi                |     6 |     8 |               0.75
     am                |   133 |   139 |  0.956834532374101
     in                |   171 |   175 |  0.977142857142857
     with              |    86 |    88 |  0.977272727272727
     my                |    92 |    94 |  0.978723404255319
     i                 |   406 |   412 |  0.985436893203884
     and               |   363 |   367 |  0.989100817438692
     a                 |   290 |   292 |  0.993150684931507
     ,                 |   727 |   731 |   0.99452804377565
     .                 |   806 |   810 |  0.995061728395062
     github            |   363 |   363 |                  1
     to                |   345 |   345 |                  1
     twitter           |   172 |   172 |                  1
    …
     california        |    27 |    27 |                  1
     interesting       |    26 |    26 |                  1
     know              |    26 |    26 |                  1
     germany           |    26 |    26 |                  1
     work              |    25 |    25 |                  1
     electrical        |    25 |    25 |                  1
     enthusiast        |    25 |    25 |                  1
     arduino           |    25 |    25 |                  1
     working           |    24 |    24 |                  1
     3d                |    24 |    24 |                  1
     as                |    24 |    24 |                  1
     science           |    24 |    24 |                  1
     world             |    24 |    24 |                  1
     &                 |    23 |    23 |                  1
     make              |    23 |    23 |                  1
     hack              |    23 |    23 |                  1
     hobbyist          |    22 |    22 |                  1
     indonesia         |    22 |    22 |                  1
     iot               |    22 |    22 |                  1
     what              |    22 |    22 |                  1
     years             |    22 |    22 |                  1
     have              |    22 |    22 |                  1
     you               |    22 |    22 |                  1
     ingin             |    22 |    22 |                  1
     hardware          |    21 |    21 |                  1
     all               |    21 |    21 |                  1
     diy               |    21 |    21 |                  1
     retired           |    20 |    20 |                  1
     because           |    20 |    20 |                  1
     guy               |    19 |    19 |                  1
     here              |    19 |    19 |                  1
     ideas             |    19 |    19 |                  1
     cool              |    18 |    18 |                  1
     old               |    18 |    18 |                  1
    …

    So some are definitely specific to spammers… and I'll apologise now to the people of Bowie, MD (you can blame one of your local business owners for the bad reputation; it'd only take 3 legitimate users mentioning "bowie" to offset this).

    Already, we know to be suspicious the moment they mention "realtors" or "estate".  Word adjacency is also tracked:

    hadsh=> select (select word from word where word_id=proceeding_id) as proceeding, (select word from word where word_id=following_id) as following, score, count, score::float/count::float rel_score from word_adjacent where count>1 order by rel_score, count desc;
       proceeding   |   following    | score | count |     rel_score     
    ----------------+----------------+-------+-------+-------------------
     real           | estate         |    -7 |     7 |                -1
     estate         | agent          |    -6 |     6 |                -1
     best           | real           |    -2 |     2 |                -1
     top            | rated          |    -2 |     2 |                -1
     rated          | real           |    -2 |     2 |                -1
     , |...
    Read more »

  • Playing with polyglot

    Stuart Longland02/02/2018 at 11:07 0 comments

    So, last post I discussed tokenising the language and counting up word frequency.  I did some SQL queries that crudely stripped the HTML and chopped up the text into words.

    It worked, kinda.

    Firstly, it was an extremely naïve algorithm: it would tokenise the word "you're" as "you" and "re".  I could try to embed exceptions, but that'd only work for English.  It would somewhat work for German, French and Spanish, since English borrows a lot of words from those languages, but I think it'd have less success there, since those languages have their own special rules.

    It'd fall flat on its face where it came to Arabic, Chinese, Japanese, Korean… etc.

    Now, I'd have to think back to the mid-90s, when I was studying Japanese, to recall how the sentence structures there worked.  We had to study it as part of primary school, and I was never any good at it then.  To this day I recall something about someone having an itchy knee, a guy named Roko, then it got a little rude!

    So I'd be the last person that should be writing a natural language parser for Japanese, let alone the others!

    Enter polyglot.

    Polyglot is a toolkit for natural language processing in Python.  Among its features is tokenisation.  It is able to detect the language used, then apply rules to tokenise the text into words, which we can then count and use.  I tried it with some recent users here, copying and pasting their profile text into the ipython shell, and lo and behold, it was able to identify the language and tokenise the words out.
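
    The sort of thing I was doing in the shell looks roughly like this; treat it as an illustrative sketch, and note that depending on the language, polyglot may want extra model data downloaded first:

    from polyglot.text import Text

    profile = "Hi, I'm a hardware hacker who loves retro computers."

    text = Text(profile)
    print(text.language.code)   # detected language code, e.g. 'en'
    print(text.words)           # tokenised word list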

    It may not be perfect, but it's better than anything I can write.

    The catch is, it's GPLv3, whereas up to now, my code was under the BSD license.  Since I'm the sole developer so far, we'll just switch to GPLv3 as well.  Not my cup of tea as a software license (in particular, being able to sue for infringement counts for nought if you don't have the time or money to defend your copyright), but it's not really a big deal in this case.

    I'll look around for something to strip the HTML out, and we should be in business.
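
    For the stripping itself, something along these lines using just the standard library would probably do (a sketch, not necessarily what ends up in the project):

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect the text content of an HTML fragment, discarding tags."""
        def __init__(self):
            super().__init__()
            self._chunks = []

        def handle_data(self, data):
            self._chunks.append(data)

        def text(self):
            return " ".join(self._chunks)

    def strip_html(fragment):
        parser = TextExtractor()
        parser.feed(fragment)
        return parser.text()

    print(strip_html('<p>I like <a href="https://example.com">electronics</a>!</p>'))
    # "I like  electronics !"… the whitespace will need tidying, but the tags are gone.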

  • Classifying users through posted content

    Stuart Longland01/22/2018 at 10:32 0 comments

    Many years ago… when I used to muck around with things like IRC, I had a bot on the Freenode network.

    no_body was based on PerlBot.  Aside from annoying some members of #gentoo-mips… it did have a few party tricks, such as doing Google searches, scraping the weather data off the BoM website, and a few other skills.

    One which would have an application here is tallying up the vocabulary of a channel.  It'd tokenise sentences, throw out punctuation, tally up the number of times each word was used, and store that in a database.  A web frontend could retrieve that information.

    There are certain keywords that the spammers love to use.  Not many legitimate users mention that they are into "marketing", or that they are "SEO" experts, or perhaps they really like those little pills that … well… you get the idea.

    I don't have the code for that old bot, or at least I don't think I do… and in any case, it probably wasn't as well constructed as I'd like.

    It would seem that a simple scoring system could work here.  When a user is marked as "legit", it could tally up the keywords used by that person in their profile, and up-vote them.  When marked as "suspect", it could down vote those same keywords.

    Ultimately, the words that the spammers mainly use would attain very high negative numbers, the words us regulars use would attain high positive numbers.

    If a user profile mentions a significant number of these "negative" words, that's a further clue that the profile might belong to someone that's up to no good.  To achieve this, the HTML will need to be decoded to plain text.

    It might be helpful for hyperlinks to be decoded to the link text plus the domain of the site, since I note one particular spammer has been keen to promote the same half-dozen sites over the past week.  This behaviour would backfire for them rather spectacularly.
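
    Pulling the domain out of an href is straightforward with the standard library; a hypothetical example, using the same LINK: marker convention as the SQL experiment below:

    from urllib.parse import urlparse

    href = "https://spammy-seo-site.example/landing?utm_source=hackaday"
    link_text = "VIEW OUR SHOWROOM"

    # Reduce the hyperlink to "link text plus domain", which is what gets scored.
    token = "LINK:%s %s" % (urlparse(href).netloc, link_text)
    print(token)   # LINK:spammy-seo-site.example VIEW OUR SHOWROOM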

    The other metric would be the number of projects the user has posted, compared to the account's age… if a user has been there 5 minutes, and somehow has 100 projects to their name, that's a clue.
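
    That heuristic is simple enough to express directly; the threshold here is made up purely for illustration:

    from datetime import datetime, timezone

    def too_many_projects(created, project_count, per_hour_limit=2.0):
        """Flag accounts whose project count is implausible for their age.

        created must be a timezone-aware datetime of account creation.
        """
        age_hours = (datetime.now(timezone.utc) - created).total_seconds() / 3600.0
        # Allow a small grace figure so a brand-new account with one
        # genuine project isn't flagged immediately.
        return project_count > max(1.0, age_hours * per_hour_limit)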

    The good news though is that the latter variety have not shown up on the projects page the way they have done in the past.  Likely, that's the hard work of the hackaday.io development / operations team, a job well done keeping on top of that. :-)

    For now, my server is still blocked, I suspect that should change in about a fortnight.  I have some data that I can analyse, and that might be a worthwhile sub-project… analyse what I have and get things ready so that processing can resume once the big switch is thrown back again.


    Just tried an experiment… via the SQL console (since my cookie has expired, I am unable to log in: the site can't verify me because the API reports "hourly limit exceeded")…
    select distinct
        word
    from
        (
            select regexp_split_to_table(word, E'\\s+') word from (
                select case
                    when word like 'LINK:%' then split_part(word, ':', 2)
                    else regexp_replace(word, E'[^a-z0-9]+', E' ', 'g')
                end as word
            from
                (
                    select distinct
                        regexp_split_to_table(
                            regexp_replace(
                                regexp_replace(
                                    lower(about_me || ' ' ||
                                        who_am_i || location ||
                                        what_i_would_like_to_do),
                                E'<a.*href="https?://([^/]+).*">(.*)</a>', E' LINK:\\1 \\2'),
                            E'<[^>]+>', E''), E'\\s+'
                        ) as word
                    from user_detail
                ) as words
            ) as filtered_words
        ) as all_words
    where
        word like 'c%'
    order by word
    limit 20;

    That gives me this table result:

        word    
    ------------
     c
     cafe
     cairo
     california
     campo
     can
     canable
     canada
     cannot
     capelle
     capture
     car
     carolina
     cars
     casino
     celebrity
     cena
     central
     cfa
     challenge
    (20 rows)

    Some of those are place names; nothing suspicious there.  We've had a few casinos spam-advertised in the past, and so that word appears there.  Now, can I count frequency?

    select distinct
        word, count(*) as freq
    from
        (
            select regexp_split_to_table(word, E'\\s+') word from (
                select case
                    when word like 'LINK:%' then split_part(word, ':', 2)
                    else regexp_replace(word,...
    Read more »

  • Back to the old fashioned way…

    Stuart Longland01/09/2018 at 10:26 0 comments

    Since my last update, I managed to make a small background client that would continually pull in new users, both polling for the newest accounts, and pulling in the historical back-log.

    This required a hack to actually get the user IDs, because any request for users sorting by creation date presently yields an error 500 response from api.hackaday.io.  My work-around is to try it the way it's supposed to work, and if that returns error 500, fetch the All Hackers page, scrape that for the IDs, then bulk retrieve those IDs.

    As soon as the retrieval bug is fixed though, my code should revert to doing it the normal way, and I can look at removing that work-around code.

    The downside though is that this does burn up requests… and evidently, I've now hit the monthly limit.  I've no idea what that limit is, as it isn't documented, but I can say that I definitely hit it in the small hours of this morning.

    A pain, but it'll be back next month.  The limit is there for a reason.  This kind of data mining is not what SupplyFrame had in mind when they produced the API.

    Another quirk of the API: it reports users that have been deleted.  The only way you can know is to try and fetch the profile page… I use a HEAD query for this; if it comes back 404, the account is a dud.
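
    Something like this; a sketch using requests for brevity (the project itself runs on Tornado, so the real code would presumably use its asynchronous client), with an illustrative profile URL:

    import requests

    def is_deleted(profile_url):
        """HEAD the public profile page; a 404 means the account is a dud."""
        response = requests.head(profile_url, allow_redirects=True, timeout=10)
        return response.status_code == 404

    print(is_deleted("https://hackaday.io/hacker/xxxxxx"))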

    Prior to the stream getting cut off, I had managed to go as far back as late September last year.  Many historical spambot accounts were dealt with there.  I hope I didn't cause the moderators too much trouble in doing this.

    Having some machine-assistance for classifying users, as simple as the rules were, and the presentation of all the information on one page, made life a lot easier in deciding whether something needed reporting or not.

    I could see at a glance whether someone was legitimate.  I recorded only one false negative (a spambot account that got missed by the code) and a few false positives, but seeing the link information there meant it was trivial to check many accounts at once.

    So the proof-of-concept worked.  The JavaScript code might've been utter spaghetti, and the UI was nothing pretty, but it worked.

    Next steps, I think, would be to clean up the front-end side of things and add some reporting on the users that have been classified.

  • Current status

    Stuart Longland01/07/2018 at 01:24 0 comments

    So, I've done some further work… and while I've run into some issues, particularly when I hit one of the rate limits of the API, things are moving along.

    Right now, it doesn't do any background scanning… the retrieval of users and classification is done as someone is scrolling through the user list.  As each user is encountered, it looks through their profile and links for particular text patterns.

    If those patterns are found, then those parts of the profile are stored.  When displayed, if there's any suspicion about the account, you get a full view of that profile, whereas non-suspicious profiles are shown in brief form.
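
    The patterns themselves are just simple keyword matching.  A hypothetical illustration (the real pattern list lives in the project code):

    import re

    # Illustrative only: a few of the sorts of phrases that raise suspicion.
    SUSPECT_PATTERNS = [
        re.compile(r'\bseo\b', re.IGNORECASE),
        re.compile(r'\breal\s+estate\b', re.IGNORECASE),
        re.compile(r'\bcasino\b', re.IGNORECASE),
    ]

    def suspicious_snippets(profile_text):
        """Return the fragments of a profile that matched a suspect pattern."""
        return [m.group(0) for p in SUSPECT_PATTERNS
                for m in p.finditer(profile_text)]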

    In this screenshot, you can see the top and bottom users are flagged due to their external links, but they are clearly legitimate users (and welcome, by the way).  The second and third users haven't been: the algorithms didn't spot anything particularly worrisome, and so it's made a note of the profile name and little more.

    The one I have highlighted (and blurred the links on… don't want to give them free advertising) is one of the Polish spambots we've been having problems with for a good few months now.  I've since reported that user account.

    Previously, one had to actually visit the profile page to see whether the user was legitimate or not; now it's possible to inspect a large number of profiles at a glance.

    As you reach the bottom of the page, the page loads the next batch.

    By doing this, the algorithm doesn't need to be particularly smart … it's relying on human intelligence to act on the final output.

    The next stage would be to look at whether users are being followed/skulled (by someone other than the hackaday.io site itself), whether they are contributing to projects (could be tricky; there's no API call for that to my knowledge), or whether they own projects that are being skulled/followed by others.

    The user avatars are there… so there's scope also to scan through those and look for duplicates, or for various patterns in the avatar image itself (such as the presence of a QR code).
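
    Duplicate detection could be as simple as hashing the avatar bytes; a sketch, noting that catching re-encoded or resized copies would need a perceptual hash instead:

    import hashlib
    from collections import defaultdict

    def find_duplicate_avatars(avatars):
        """avatars: iterable of (user_id, image_bytes) pairs.

        Returns a mapping of hash -> user IDs sharing an identical avatar.
        """
        seen = defaultdict(list)
        for user_id, image_bytes in avatars:
            seen[hashlib.sha256(image_bytes).hexdigest()].append(user_id)
        return {digest: users for digest, users in seen.items() if len(users) > 1}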

    The page requires that you log in via the hackaday.io site and search engines are forbidden from indexing the site, so there's no risk of this being abused to boost page rankings.

    Also worth doing would be providing a means to manually mark a user as spammer or not-spammer.  Given any hackaday.io user can log in, including spambot users, I think this ability should be restricted to a trusted group: perhaps contributors to this project (and if you want to help, by all means, join me).

    Being able to flag users should mean we can reduce double-ups when reporting, which should reduce the workload of the site moderators here who otherwise have to investigate each report.  Long term, it may be possible to provide an API that SupplyFrame themselves can use to monitor bot activity too.

    Long term, we want to make this site a tough nut for spam, make it too expensive for "SEO"s to use this site to increase their page rank so they'll go elsewhere (or better yet, re-evaluate their methodologies).

  • Beginnings of a working site

    Stuart Longland01/06/2018 at 15:04 2 comments

    Okay, so after working around the previous bug, I now have a page that will allow me to scroll through all the users and display ones that have hyperlinks off to some commercial site.

    I have to work on the formatting a bit … and I have to make the back-end code a good bit smarter so that it doesn't keep re-loading the same profiles over and over.  Right now I'm getting 403 errors, and I think that's because I've exhausted the token bucket.

    Never mind… it already allowed me to spot some more spambot accounts much more easily than doing it the manual way.

    I also need to work on avatar scaling, as right now the image sizes are all over the place.  I'll probably do that server-side and save myself some bandwidth.
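
    Server-side, Pillow would make short work of the scaling; a sketch, assuming Pillow is acceptable as a dependency:

    from io import BytesIO
    from PIL import Image

    def scale_avatar(image_bytes, size=(64, 64)):
        """Scale an avatar down to a bounded size, returning PNG bytes."""
        img = Image.open(BytesIO(image_bytes)).convert("RGBA")
        img.thumbnail(size)   # preserves aspect ratio, never enlarges
        out = BytesIO()
        img.save(out, format="PNG")
        return out.getvalue()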

  • Workaround found… screen-scraping to the rescue

    Stuart Longland01/06/2018 at 13:33 0 comments

    Okay, so I really didn't want to screen-scrape HTML to do this, but until the sortby=newest bug is fixed, I haven't much choice.

    So I've resorted to an unholy hack.  I first try doing it properly, but if that fails, I grab https://hackaday.io/hackers?sort=newest and look for user IDs… then do a batch read of those IDs.

    The end result is I'm able to obtain the same data with a few extra HTTP calls.
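
    In rough terms, the fallback looks something like this; the regular expression and the batch-retrieval call are guesses at the shape of the real code, not copies of it:

    import re
    import requests

    HACKERS_URL = "https://hackaday.io/hackers?sort=newest"

    def newest_user_ids():
        """Scrape the All Hackers page for user IDs as a fallback."""
        page = requests.get(HACKERS_URL, timeout=30)
        page.raise_for_status()
        # Hypothetical pattern: profile links of the form /hacker/123456.
        return sorted(set(int(m) for m in re.findall(r'/hacker/(\d+)', page.text)))

    def fetch_users(api, user_ids):
        """Bulk-retrieve the scraped IDs through the normal API wrapper."""
        return api.get_users(ids=user_ids)   # placeholder for the real wrapper call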

  • API server bugs

    Stuart Longland01/06/2018 at 11:20 0 comments

    So, it appears I can successfully forward a user to the Hackaday.io log-in page, they authorise the application, and it sends them back… I find out who they are via OAuth2 and can decide what to let them do on that basis.

    Great… now to start pulling in data.  The logical point to start would be to start reading the newest arrivals feed…

    RC=0 stuartl@rikishi ~/projects/hadsh $ curl 'https://api.hackaday.io/v1/users?api_key=MY_KEY&sortby=newest' | python -m json.tool
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100    53  100    53    0     0     53      0  0:00:01 --:--:--  0:00:01    53
    {
        "error": {
            "code": 500,
            "type": "Internal Server Error"
        }
    }
    

    Seems there are a number of documented URIs that… if attempted… yield a 500 Internal Server Error.  Take out the sortby and it works, except then it's telling me the most influential users, which is not what I'm after.  In fact, all the other sortby options work; just not newest.

    https://hadsh.vk4msl.id.au/data/newcomers.json was supposed to return the newest users, with the view of displaying that data and having it pull in more users as you keep scrolling down.

  • Pulling the bits and pieces together

    Stuart Longland01/06/2018 at 06:15 0 comments

    So… I have my instance.  Not much to look at right now, just nginx reporting that it can't connect to the back-end, because I haven't written it yet.  PostgreSQL Server 10.0 is installed, along with Tornado 4.4.2 and Python 3.5.4.

    I have, however, set it up to host static files, such as this one.

    The code is up on GitHub.  The API wrapper is pretty crude for now, but it should at least get the ball rolling.