• Cleaning up the top-level domain scores

    Stuart Longland 09/09/2018 at 10:21 0 comments

    So, earlier I alluded to an issue with the hostname parsing in that it was dumb: it would strip off the very top-level domain (so in 'example.com' it would ignore the '.com'), but didn't treat country-code domains any differently, e.g. 'digikey.com.au' would generate scores for 'digikey.com.au' (correct) and 'com.au' (not correct).

    To add more fun, some countries skip that second-level domain.  youtu.be anyone?

    Thankfully, there's a list I can consult.  So, I wrote some code to consume this, and slurped it into a temporary table to perform the clean-up.  Below is what the scores were for all those top-level domains.

    hadsh=> select * from hostname where hostname in (select name from tld);
     hostname_id |     hostname      | score | count
    -------------+-------------------+-------+-------
            2191 | co.nz             |     2 |     2
            2640 | com.ve            |     0 |     0
            2641 | net.ve            |     0 |     0
            2226 | github.io         |     3 |     3
            2456 | blogspot.co.nz    |     1 |     1
            2671 | blogspot.de       |     0 |     0
            2306 | co.jp             |     1 |     1
            2353 | com.br            |    -7 |     7
            1847 | co.uk             |   -51 |    63
            1967 | org.br            |    -1 |     1
            1978 | com.pl            |    -3 |     3
            2374 | net.in            |    -1 |     1
            2038 | org.uk            |    -4 |     4
            2103 | org.pl            |    -1 |     1
            1865 | blogspot.com      |     4 |     6
            2503 | com.au            |    -5 |     5
            1901 | co.il             |    -3 |     3
            1898 | com.tr            |    -1 |     1
            1899 | blogspot.com.tr   |    -1 |     1
            1935 | yolasite.com      |    -2 |     2
            2111 | edu.pl            |   -10 |    10
            2134 | azurewebsites.net |     1 |     1
    (22 rows)
    

    I've just cleaned those up.  Hopefully the new domain parsing code will take care of the rest.
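
    For the curious, the idea behind the new parsing looks something like this.  It's a simplified sketch, assuming the list in question is the Public Suffix List; it ignores the list's wildcard and exception rules, and the file and function names are illustrative rather than the actual HADSH code.

    def load_suffixes(path='public_suffix_list.dat'):
        """Load the public suffix list into a set (wildcard rules ignored)."""
        suffixes = set()
        with open(path, encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith('//'):
                    continue
                suffixes.add(line.lstrip('!*.'))
        return suffixes

    def score_keys(hostname, suffixes):
        """Return the hostname and its parent domains, stopping before a
        bare public suffix such as 'com.au', 'co.uk' or 'github.io'."""
        labels = hostname.lower().split('.')
        keys = []
        for i in range(len(labels)):
            candidate = '.'.join(labels[i:])
            if candidate in suffixes:
                break   # never score the suffix itself
            keys.append(candidate)
        return keys

    # score_keys('www.digikey.com.au', suffixes)
    #   -> ['www.digikey.com.au', 'digikey.com.au'], with 'com.au' excluded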

  • Re-working the frontend using modern JavaScript

    Stuart Longland 09/09/2018 at 03:41 0 comments

      So, the API is blocked yet again, and I've spent some time re-working the front-end UI.

      One mistake I made early on was to have the system immediately begin classification of a user the moment you clicked the "legit" or "suspect" buttons.  That was fine if the click was intentional, but if you did it by mistake, tough: it took you at your word and immediately moved the user to that group.

      I later added a delay, giving me a moment to re-load the page or defer the user, and thus a way to cancel such mistakes before they took effect.

      Most of the time, though, there's nothing to flag a user as spam, so you find yourself clicking the "legit" button a lot.  I rectified this with a button that automatically marks the "auto_legit" users as "legit", letting me focus on the ones that are actually spammers and letting the others fall through.

      Thinking about this, bulk operations work well: it's better to tell the system how you intend to mark a user, then commit that result later once you're satisfied someone is legitimate or suspect.  Sometimes a user isn't picked up as a spammer but you want to do some more careful checking, so it helps to be able to mark them as "neutral" and come back to them later.

      The way the UI manipulated the DOM was awkward too.  I didn't want to drag in something big like jQuery, AngularJS or Dojo; what I needed to do wasn't that hard, it just needed a little abstraction.

      That abstraction is a class called DOMElement.  It just removes a little bit of boilerplate when manipulating the DOM and provides some conveniences.

      The UI was also very mouse-heavy.  The keyboard actually works quite well for lots of operations.  It'd be nice if I could just navigate between the users by pressing keys, and flag them with different keys.  Happily, L and S are at opposite ends of a US keyboard.  If I use the term "hide" rather than "defer", H is in the middle.  I was thinking of N for next and B to go back, which suggests using U for neutral.  Pressing C should commit what's pending.

      It turns out that in JavaScript, there are a few tricks you need to make this happen on a DIV:

      1. Set the tab index on the DIV so it can accept keyboard focus.
      2. Add an event listener on the DIV.  That listener should disable the default action when it handles a key press (so you don't get the find-in-page pop-up making a nuisance of itself), and it should also watch for the ALT and CTRL modifiers so it can ignore combinations like CTRL-L (go to location bar) or CTRL-C (copy).

      A bit more code, and I had a way to do all of the above.

      The code now uses classes a lot more, wraps XMLHttpRequest in Promises, and I've tried to better separate the business logic from the display logic.

      I want to make it a bit more modular; maybe I should look at a loader like require.js or similar… it annoys me that while it's possible for HTML to embed a script either in-line or from a separate file, that script is not able to "import" other resources itself without further help.

      Right now, it's 51kB of hand-written JavaScript, and gzip packs that down to 12kB.  I won't win the 5k website contest with that (never mind the back-end would disqualify it anyway), but it's not going to take an age to load over my ADSL2 uplink either.

      This is what it looks like today.  The original version had a loading spinner animation that was reminiscent of early terminal-based software, so I've decided to keep that aesthetic.

      Down the bottom, it gives me a count of how many users are flagged.  I've found that big batches don't work so well, so it now gives a warning when you reach the century:

      As it's committing these, it also counts them down so I know how long it'll be before it's done.  In the event of a failure, it tries to carry on with as many of the other operations as possible.  Those that failed stay flagged, and can be retried.

      Up the top, I can now jump around the user...

    Read more »

  • Progress Report

    Stuart Longland 08/04/2018 at 03:13 0 comments

    Well, this project has been slowly grinding away in the background.  My manual reporting of spam accounts has been my primary focus, and it is through our collective reporting of spam (the "report as inappropriate" link … seriously, I urge you all to hunt down this crap and report it so the mods can dispose of it), and the hard work of the Hackaday.io dev-ops team, that this place has been kept largely free of spam.

    That said, the scanning via the API has still been useful: in the last month it has flagged at least half a dozen users I'd missed who really shouldn't be here.

    One change to the algorithm I made last month resulted in a drastic increase in how quickly the script moves through the "All Hackers" pages.  Rather than fetching IDs and then immediately querying those IDs, I stash them for later.  This allows me to then request profile data 50 users at a time, reducing the number of hits to the API.  As of writing, the crawler has scanned each page back to page number 10368.
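
    In rough terms, the change looks like this; a sketch only, where fetch_page_ids(), fetch_user_profiles() and inspect_user() are hypothetical stand-ins for the real crawler and API calls.

    from collections import deque

    BATCH_SIZE = 50
    pending_ids = deque()

    def scan_page(page_num):
        """Scrape an 'All Hackers' page and stash the user IDs for later."""
        pending_ids.extend(fetch_page_ids(page_num))

    def drain_pending():
        """Request profile data in bulk, 50 users per API hit."""
        while pending_ids:
            batch = [pending_ids.popleft()
                     for _ in range(min(BATCH_SIZE, len(pending_ids)))]
            for profile in fetch_user_profiles(batch):
                inspect_user(profile)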

    How many pages are there?  Well, the first users on hackaday.io are waay back on page 14395.  That'll probably slip back a bit, but soon it'll be a case of working through that list of IDs and focussing on the newcomers only.  I don't expect to see much spam as I work back towards the last page of the newest user list.

    I certainly am considering how to generalise the code so that it can be more readily used for spam detection on other sites.  One blind-spot we have with regards to URI blacklists is stuff posted to web communities such as this.  Email has proven a hard target, so the spammers have naturally migrated to these web-based communities.

    By using a system such as this to scan what people are submitting in real time, we can really put a dint in the business model of would-be spammers.  The HADSH code is useless for this as-is, as it's dependent on the Hackaday.io API.  I'll probably still stick with Python for now, and I think ditching sqlalchemy for an asynchronous PostgreSQL client is on the cards, along with a re-jig of the schema to make this project more general.


    Update 2018-08-10: It got there:
    hadsh=> select * from newest_user_page_refresh order by page_num desc limit 10;
     page_num |         refresh_date          
    ----------+-------------------------------
        14442 | 2018-08-10 07:27:17.559179+10
        14441 | 2018-08-10 06:46:27.558809+10
        14440 | 2018-08-10 06:15:44.022249+10
        14439 | 2018-08-10 05:43:58.536599+10
        14438 | 2018-08-10 05:42:52.539986+10
        14437 | 2018-08-10 05:41:48.405866+10
        14436 | 2018-08-10 05:40:44.790953+10
        14435 | 2018-08-10 05:39:41.411618+10
        14434 | 2018-08-10 05:38:37.689943+10
        14433 | 2018-08-10 05:37:33.53155+10
    

  • Progress and future plans

    Stuart Longland 05/06/2018 at 04:38 1 comment

    So, it's been about 5 months now since I started this project.  I still battle things like request quotas but the project has matured a bit and I'm starting to get some ideas on how to push it forward.

    Since I started this, so far:

    • 78276 profiles have been analysed.
    • 76681 profiles have been cleared as legitimate user accounts
    • 918 spam profiles have been identified and reported (many of these have now been removed)
    • 46370 words and 186568 word pairs have been identified across many languages

    This is not counting the users found by manually patrolling the All Hackers pages.  Lots have been found and reported that way too.

    Some users have been real big contributors to the vocabulary.  @Morning.Star in particular had the classification script grinding away for 10 minutes.  That's more a reflection on the inefficiencies in my code than anything else.  That said, do have a look at his work, seems Nintendo are.

    Getting a feel for what words are typical of our community, and what words are typical of the spammers, really does help in identifying future new users.

    Thanks to some now deleted users, the "word" 출장만남 is one of the most strongly negatively scored "words".  (And yeah, I'm sure they offer quite a "meeting".  Never mind that there's an increasing number of women in business either — seems that part of Korea has not gotten the equal opportunities memo yet.)

    As soon as one of these spammers pops up, with some tart who's more interested in looking at her phone than at the camera, this system sees words like that and immediately knows there's trouble.

    Someone mentions "pi", or "stuff", and the opposite occurs.

    Strangely, the most frequent neutral word (no score) seen thus far is "ich"… I can only think German.

    The system as developed so far is working.  It's far from ideal, and is suffering some growing pains, but it generally works well enough that we can say the concept has been proven.  I haven't yet looked at avatar analysis; the mob that were QR-code mad seem to have gone underground.

    One goal I've considered here is whether something could be made that would plug into the infrastructure used to host hackaday.io.  api.hackaday.io advertises Jetty, which is a Java-based web server.  OpenNLP is an option for doing this on that platform.

    Java is far from a favourite language for me, but I do know how to write code in it, even if my knowledge is rusty.  It's getting a polish at the moment thanks to my workplace and nHaystack, and thankfully if I'm working on it here, I don't have Tridium's idiosyncrasies to worry about.

    If I keep it as Python, probably the next steps will be to consider re-writing the database logic to either better interface with sqlalchemy, or ditch sqlalchemy in favour of a dedicated PostgreSQL library, ideally one that's asynchronous.  I also need to re-vamp the front-end, as that's rather hacky at present.

    PostgreSQL has worked well though; I see no reason to change that.  Right now the sluggishness is down to inefficient use of the database libraries rather than the database itself.

  • Analysing user projects and pages for tokens

    Stuart Longland 03/03/2018 at 23:12 1 comment

    Just before the month ended, I tweaked my code so that it'd scan user-created projects and pages for tokens to include when computing the score.  This is because there are some spambots that put all their "content" in pages or projects and not on the profile page.

    Yesterday saw me refine the system further, tweaking how this data is displayed and collected.  Notably, if a user is newly registered and has a very low absolute score when analysed, we schedule a check later to see if they do anything.
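
    As a sketch, the deferral amounts to recording when to look at the account again; the threshold, delay and next_inspection field here are illustrative, not the real schema.

    from datetime import datetime, timedelta

    SCORE_THRESHOLD = 1.0               # illustrative: "too close to call"
    RECHECK_AFTER = timedelta(hours=6)  # illustrative delay

    def maybe_defer(user, score):
        """If there's too little evidence either way, note when to re-check."""
        if abs(score) < SCORE_THRESHOLD:
            user.next_inspection = datetime.utcnow() + RECHECK_AFTER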

    There's a spambot… spruiking some massage service in Korea (thank-you Google Translate).  The sea of red indicates these are words that only spambots use.  There's a couple that were used by legitimate users, but most of it is unique to spambots.  It's a wonder it didn't get to a score of -10, given the sheer number of negatively scored words.

    This could almost be good enough to flag the user automatically, but I still believe in human oversight.  No legitimate user scores this low!  To quote one of the members of the Ipswich radio club: "Man must be master!"

    I'm not sure what would be needed to bash this into a form that could run at SupplyFrame… but I believe this at least proves the concept.

  • Scoring by vocabulary

    Stuart Longland 02/03/2018 at 23:03 0 comments

    So, I implemented the scoring by a fairly naïve summing of individual word scores, computed with the earlier algorithm.

    This… due to the fact that the spambot-to-real user ratio isn't 50:50, meant that plenty of normal words and phrases boosted the spambots' scores to normal levels.

    I needed to re-think how I used that data.  In the end, I decided to dump the words' scores into an array, sort it in ascending order, then sum the worst 10.

    That proved to be quite effective.  It means they can't cheat the system by putting their usual text up, then slapping in a handful of hacker lingo words to boost their score.  Users that only use one or two bad words will usually score highly enough to avoid getting flagged.
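
    In code terms, the change is tiny; a minimal sketch of the idea:

    def user_score(word_scores, worst_n=10):
        """Sum only the N most negative word scores, so a pile of innocuous
        hacker lingo can't drown out the spammy vocabulary."""
        return sum(sorted(word_scores)[:worst_n])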

    The database now has nearly 20000 words and 65800 word pairs, scored according to typical usage of users that have arrived at hackaday.io since early June last year.

    With that information, the script can auto-classify some users with better accuracy:

    2018-02-03 22:47:02,402       INFO HADSHApp.crawler 25397/MainThread: New user: USA Tile & Marble [#xxxxxx]
    2018-02-03 22:47:02,404      DEBUG HADSHApp.crawler 25397/MainThread: User USA Tile & Marble [#xxxxxx] is in groups set() (classified False)
    2018-02-03 22:47:03,433      DEBUG HADSHApp.crawler 25397/MainThread: Inspecting user USA Tile & Marble [#xxxxxx]
    2018-02-03 22:47:03,440    WARNING polyglot.detect.base 25397/MainThread: Detector is not able to detect the language reliably.
    2018-02-03 22:47:03,443      DEBUG     HADSHApp.api 25397/MainThread: Query arguments: {'per_page': 50, 'page': 1, 'api_key': 'xxxxxxxxxxxxxxxx'}
    2018-02-03 22:47:03,446      DEBUG     HADSHApp.api 25397/MainThread: GET 'https://api.hackaday.io/v1/users/xxxxxx/links?per_page=50&page=1&api_key=xxxxxxxxxxxxxxxx'
    2018-02-03 22:47:04,683       INFO HADSHApp.crawler 25397/MainThread: User USA Tile & Marble [#xxxxxx] has link to VIEW OUR SHOWROOM <[REDACTED]>
    2018-02-03 22:47:04,754      DEBUG HADSHApp.crawler 25397/MainThread: New word: mosaic
    2018-02-03 22:47:04,789      DEBUG HADSHApp.crawler 25397/MainThread: New word: ceramic
    2018-02-03 22:47:04,808      DEBUG HADSHApp.crawler 25397/MainThread: New word: tiles
    2018-02-03 22:47:04,818      DEBUG HADSHApp.crawler 25397/MainThread: New word: porcelain
    2018-02-03 22:47:04,862      DEBUG HADSHApp.crawler 25397/MainThread: New word: marble
    2018-02-03 22:47:04,891      DEBUG HADSHApp.crawler 25397/MainThread: New word: showroom
    2018-02-03 22:47:04,901      DEBUG HADSHApp.crawler 25397/MainThread: New word: travertine
    2018-02-03 22:47:04,911      DEBUG HADSHApp.crawler 25397/MainThread: New word: collection
    2018-02-03 22:47:04,945      DEBUG HADSHApp.crawler 25397/MainThread: New word: flooring
    2018-02-03 22:47:04,963      DEBUG HADSHApp.crawler 25397/MainThread: New word: pompano
    2018-02-03 22:47:04,973      DEBUG HADSHApp.crawler 25397/MainThread: New word: tile
    2018-02-03 22:47:06,090      DEBUG HADSHApp.crawler 25397/MainThread: User USA Tile & Marble [#xxxxxx] has score -3.675362
    2018-02-03 22:47:06,098      DEBUG HADSHApp.crawler 25397/MainThread: Auto-classifying USA Tile & Marble [#xxxxxx] as suspect

    That's out of the logs.  The script "learned" some new words there.  In the database, we can see how those words are scored:

    hadsh=> select score, count, score::float/count::float rel_score from word where word='flooring';
     score | count | rel_score 
    -------+-------+-----------
        -2 |     2 |        -1

    As I say, machine learning at its most primitive.  I've considered whether to integrate the URIBL or SURBL DNS blacklists, but so far this has not been necessary, nor have I seen any links (of the few that I tried) that actually show up in those blacklists.

    For now, the site is once again blocked, so it's back to the manual methods.  If things go to plan, we should be able to expand the data set to cover arrivals in late 2016 once the user retrieval resumes.

  • Scoring on words

    Stuart Longland 02/02/2018 at 23:33 0 comments

    So, I implemented polyglot in the code, and while I'm not yet doing the user classification, I am at least collecting the information, and already I'm seeing some trends.

    I have a table of words with four columns:

    • BIGINT primary key (indexing by numeric values is easier for a database)
    • TEXT word (with an index for fast look-up)
    • INTEGER score
    • INTEGER count

    The score and count are how we'll keep track of the "spamminess" of a word.  These are updated when a user is classified (by a human).  If the user is classed as legitimate, both get incremented; if they're a spambot user, count is incremented while score is decremented.

    The end result is that, when you compute score/count, this normalised score is closer to +1.0 for words that are typical of legitimate users, and closer to -1.0 for spambot users.
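
    As a sketch, with plain Python standing in for the actual SQL updates:

    def update_word(word_row, is_legit):
        """Adjust a word's running tally after a human classifies a user."""
        word_row.count += 1
        word_row.score += 1 if is_legit else -1

    def rel_score(word_row):
        """Normalised score: near +1.0 for words typical of legitimate users,
        near -1.0 for words typical of spambots."""
        return word_row.score / word_row.count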

    I'm not sure how good the tokenisation is in polyglot for non-English scripts, but so far the spambots I've seen that post in Korean or Chinese tend to have other traits that scream spambot: like creating lots of "projects" with much the same text.

    When a 5-minute-old user account has 20 projects, one scratches their head and wonders why a legitimate user would do that.

    Already, this approach is showing some insights:

    hadsh=> select word, score, count, score::float/count::float rel_score from word where count>1 order by rel_score, count desc;
           word        | score | count |     rel_score      
    -------------------+-------+-------+--------------------
     estate            |    -7 |     7 |                 -1
     agent             |    -6 |     6 |                 -1
     top               |    -4 |     4 |                 -1
     realtors          |    -4 |     4 |                 -1
     bowie             |    -3 |     3 |                 -1
     rated             |    -2 |     2 |                 -1
     polska            |    -2 |     2 |                 -1
     transhelsa        |    -2 |     2 |                 -1
     erekcja           |    -2 |     2 |                 -1
     real              |    -4 |    10 |               -0.4
     md                |    -1 |     3 | -0.333333333333333
     en                |     0 |     8 |                  0
     happy             |     0 |     2 |                  0
     local             |     0 |     2 |                  0
     barcelona         |     1 |     3 |  0.333333333333333
     mi                |     1 |     3 |  0.333333333333333
     la                |     3 |     7 |  0.428571428571429
     best              |     5 |    11 |  0.454545454545455
     really            |     4 |     6 |  0.666666666666667
     de                |    13 |    19 |  0.684210526315789
     hi                |     6 |     8 |               0.75
     am                |   133 |   139 |  0.956834532374101
     in                |   171 |   175 |  0.977142857142857
     with              |    86 |    88 |  0.977272727272727
     my                |    92 |    94 |  0.978723404255319
     i                 |   406 |   412 |  0.985436893203884
     and               |   363 |   367 |  0.989100817438692
     a                 |   290 |   292 |  0.993150684931507
     ,                 |   727 |   731 |   0.99452804377565
     .                 |   806 |   810 |  0.995061728395062
     github            |   363 |   363 |                  1
     to                |   345 |   345 |                  1
     twitter           |   172 |   172 |                  1
    …
     california        |    27 |    27 |                  1
     interesting       |    26 |    26 |                  1
     know              |    26 |    26 |                  1
     germany           |    26 |    26 |                  1
     work              |    25 |    25 |                  1
     electrical        |    25 |    25 |                  1
     enthusiast        |    25 |    25 |                  1
     arduino           |    25 |    25 |                  1
     working           |    24 |    24 |                  1
     3d                |    24 |    24 |                  1
     as                |    24 |    24 |                  1
     science           |    24 |    24 |                  1
     world             |    24 |    24 |                  1
     &                 |    23 |    23 |                  1
     make              |    23 |    23 |                  1
     hack              |    23 |    23 |                  1
     hobbyist          |    22 |    22 |                  1
     indonesia         |    22 |    22 |                  1
     iot               |    22 |    22 |                  1
     what              |    22 |    22 |                  1
     years             |    22 |    22 |                  1
     have              |    22 |    22 |                  1
     you               |    22 |    22 |                  1
     ingin             |    22 |    22 |                  1
     hardware          |    21 |    21 |                  1
     all               |    21 |    21 |                  1
     diy               |    21 |    21 |                  1
     retired           |    20 |    20 |                  1
     because           |    20 |    20 |                  1
     guy               |    19 |    19 |                  1
     here              |    19 |    19 |                  1
     ideas             |    19 |    19 |                  1
     cool              |    18 |    18 |                  1
     old               |    18 |    18 |                  1
    …

    So some are definitely specific to spammers… and I'll apologise now to the people of Bowie, MD (you can blame one of your local business owners for the bad reputation, it'd only take 3 legitimate users mentioning "bowie" to offset this).

    Already, we know to be suspicious the moment they mention "realtors" or "estate".  Word adjacency is also tracked:

    hadsh=> select (select word from word where word_id=proceeding_id) as proceeding, (select word from word where word_id=following_id) as following, score, count, score::float/count::float rel_score from word_adjacent where count>1 order by rel_score, count desc;
       proceeding   |   following    | score | count |     rel_score     
    ----------------+----------------+-------+-------+-------------------
     real           | estate         |    -7 |     7 |                -1
     estate         | agent          |    -6 |     6 |                -1
     best           | real           |    -2 |     2 |                -1
     top            | rated          |    -2 |     2 |                -1
     rated          | real           |    -2 |     2 |                -1
     , |...
    Read more »

  • Playing with polyglot

    Stuart Longland 02/02/2018 at 11:07 0 comments

    So, last post I discussed tokenising the language and counting up word frequency.  I did some SQL queries that crudely stripped the HTML and chopped up the text into words.

    It worked, kinda.

    Firstly, it was an extremely naïve algorithm: it would tokenise the word "you're" as "you" and "re".  I could try to embed exceptions, but that'd only work for English.  It would somewhat work for German, French and Spanish, since English borrows a lot of words from those languages, but I think it'd have less success there since those languages have their own special rules.

    It'd fall flat on its face where it came to Arabic, Chinese, Japanese, Korean… etc.

    Now, I'd have to think back to the mid-90s when I was studying Japanese to think how the sentence structures there worked.  We had to as part of primary school studies, and I was never any good at it then.  To this day I recall something about someone having an itchy knee, a guy named Roko, then it got a little rude!

    So I'd be the last person that should be writing a natural language parser for Japanese, let alone the others!

    Enter polyglot.

    Polyglot is a toolkit for natural language processing in Python.  Among its features is tokenisation.  It is able to detect the language used, then apply rules to tokenise the text into words, which we can then count and use.  I tried it with some recent users here, copying and pasting their profile text into the ipython shell, and lo and behold, it was able to identify the language and tokenise the words out.
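
    The calls involved look roughly like this; a sketch only, since exact results depend on the installed models and polyglot's pycld2/PyICU dependencies.

    from polyglot.text import Text

    blob = Text("Ich bin ein Bastler und baue gerne Roboter.")
    # ("I am a tinkerer and like building robots.")
    print(blob.language.code)   # e.g. 'de'
    print(blob.words)           # tokenised words, ready for tallying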

    It may not be perfect, but it's better than anything I can write.

    The catch is, it's GPLv3, whereas up to now my code was under the BSD license.  Since I'm the sole developer so far, we'll just switch to GPLv3 as well.  Not my cup of tea for a software license (in particular, being able to sue for infringement counts for nought if you don't have the time or money to defend your copyright), but it's not really a big deal in this case.

    I'll look around for something to strip the HTML out, and we should be in business.

  • Classifying users through posted content

    Stuart Longland 01/22/2018 at 10:32 0 comments

    Many years ago… when I used to muck around with things like IRC, I had a bot on the Freenode network.

    no_body was based on PerlBot.  Aside from annoying some members of #gentoo-mips… it did have a few party tricks, such as doing Google searches and scraping the weather data off the BoM website, among a few other skills.

    One which would have an application here is tallying up the vocabulary of a channel.  It'd tokenise sentences, throw out punctuation, tally up the number of times each word was used, and store that in a database.  A web frontend could retrieve that information.

    There are certain keywords that the spammers love to use.  Not many legitimate users mention that they are into "marketing", or that they are "SEO" experts, or perhaps they really like those little pills that … well… you get the idea.

    I don't have the code for that old bot, or at least I don't think I do… and in any case, it probably wasn't as well constructed as I'd like.

    It would seem that a simple scoring system could work here.  When a user is marked as "legit", it could tally up the keywords used by that person in their profile, and up-vote them.  When marked as "suspect", it could down vote those same keywords.

    Ultimately, the words that the spammers mainly use would attain very high negative numbers, the words us regulars use would attain high positive numbers.

    If a user profile mentions a significant number of these "negative" words, that's a further clue that the profile might belong to someone that's up to no good.  To achieve this, the HTML will need to be decoded to plain text.

    It might be helpful for hyperlinks to be decoded to the link text plus the domain of the site, since I note one particular spammer has been keen to promote the same half-dozen sites over the past week.  This behaviour would backfire for them rather spectacularly.
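
    A rough sketch of that decoding, using lxml as one possible HTML stripper (the library choice is an assumption); the LINK: prefix mirrors the SQL experiment further down.

    from urllib.parse import urlsplit
    import lxml.html

    def decode_profile(html):
        """Return plain text plus a 'LINK:<domain>' token per hyperlink."""
        doc = lxml.html.fromstring(html)
        link_tokens = []
        for a in doc.iter('a'):
            domain = urlsplit(a.get('href') or '').hostname
            if domain:
                link_tokens.append('LINK:' + domain)
        return ' '.join([doc.text_content()] + link_tokens)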

    The other metric would be the number of projects the user has posted, compared to the account's age… if a user has been there 5 minutes, and somehow has 100 projects to their name, that's a clue.
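
    As a crude sketch, with the threshold plucked out of the air:

    from datetime import datetime, timezone

    def project_rate_suspicious(created, num_projects, per_hour_limit=10):
        """Flag accounts that post projects implausibly fast for a human."""
        age_hours = max(
            (datetime.now(timezone.utc) - created).total_seconds() / 3600.0,
            1.0 / 60.0)   # treat brand-new accounts as at least a minute old
        return (num_projects / age_hours) > per_hour_limit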

    The good news though is that the latter variety have not shown up on the projects page the way they have done in the past.  Likely, that's the hard work of the hackaday.io development / operations team, a job well done keeping on top of that. :-)

    For now, my server is still blocked; I suspect that should change in about a fortnight.  I have some data that I can analyse, and that might be a worthwhile sub-project… analyse what I have and get things ready so that processing can resume once the big switch is thrown back again.


    Just tried an experiment… via the SQL console (since my cookie has expired, I am unable to log in: the site can't verify me because the API reports "hourly limit exceeded")…
    select distinct
        word
    from
        (
            select regexp_split_to_table(word, E'\\s+') word from (
                select case
                    when word like 'LINK:%' then split_part(word, ':', 2)
                    else regexp_replace(word, E'[^a-z0-9]+', E' ', 'g')
                end as word
            from
                (
                    select distinct
                        regexp_split_to_table(
                            regexp_replace(
                                regexp_replace(
                                    lower(about_me || ' ' ||
                                        who_am_i || location ||
                                        what_i_would_like_to_do),
                                E'<a.*href="https?://([^/]+).*">(.*)</a>', E' LINK:\\1 \\2'),
                            E'<[^>]+>', E''), E'\\s+'
                        ) as word
                    from user_detail
                ) as words
            ) as filtered_words
        ) as all_words
    where
        word like 'c%'
    order by word
    limit 20;

    That gives me this table result:

        word    
    ------------
     c
     cafe
     cairo
     california
     campo
     can
     canable
     canada
     cannot
     capelle
     capture
     car
     carolina
     cars
     casino
     celebrity
     cena
     central
     cfa
     challenge
    (20 rows)

    Some of those are place names, nothing suspicious there.  We've had a few casinos spam-advertised in the past, and so the word appears there.  Now, can I count frequency?

    select distinct
        word, count(*) as freq
    from
        (
            select regexp_split_to_table(word, E'\\s+') word from (
                select case
                    when word like 'LINK:%' then split_part(word, ':', 2)
                    else regexp_replace(word,...
    Read more »

  • Back to the old fashioned way…

    Stuart Longland 01/09/2018 at 10:26 0 comments

    Since my last update, I managed to make a small background client that would continually pull in new users, both polling for the newest accounts, and pulling in the historical back-log.

    This required a hack to actually get the user IDs, because any request for users sorted by creation date presently yields an error 500 response from api.hackaday.io.  My work-around is to try it the way it's supposed to work, and if that returns error 500, fetch the All Hackers page, scrape that for the IDs, then bulk-retrieve those IDs.

    As soon as the retrieval bug is fixed though, my code should revert to doing it the normal way, and I can look at removing that work-around code.

    The downside though is that this does burn up the requests… and evidently, I've now hit the monthly limit.  I have no idea what that limit is without scraping the logs, as it isn't documented, but I can say that I definitely hit it in the small hours of this morning.

    A pain, but it'll be back next month.  The limit is there for a reason.  This kind of data mining is not what SupplyFrame had in mind when they produced the API.

    Another quirk of the API: it reports users that have been deleted.  The only way you can know is to try and fetch the profile page… I use a HEAD query for this; if it comes back 404, the account is a dud.
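
    In effect, something like this; a sketch using the requests library for illustration, with the profile URL pattern assumed.

    import requests

    def account_exists(profile_url):
        """HEAD the public profile page; a 404 means the account is a dud."""
        resp = requests.head(profile_url, allow_redirects=True)
        return resp.status_code != 404

    # e.g. account_exists('https://hackaday.io/hacker/123456')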

    Prior to the stream getting cut off, I had managed to go as far back as late September last year.  Many historical spambot accounts were dealt with there.  I hope I didn't cause the moderators too much trouble in doing this.

    Having some machine-assistance for classifying users, as simple as the rules were, and the presentation of all the information on one page, made life a lot easier in deciding whether something needed reporting or not.

    I could see at a glance whether someone was legitimate.  I recorded only one false negative (a spambot account that the code missed); there were a few false positives, but seeing the link information there meant it was trivial to check many accounts all at once.

    So the proof-of-concept worked.  The JavaScript code might've been utter spaghetti, and the UI was nothing pretty, but it worked.

    Next steps, I think, would be to clean up the front-end side of things and add some reporting on the users that have been classified.