• Closure of the project

    Stuart Longland  02/06/2019 at 19:41  0 comments

    Sadly, there comes a time when one realises a project is no longer viable, and for this project, I think that time has come.  The reasons are threefold:

    1. Issues with the API that have to date gone unfixed after being reported more than a year ago
    2. Changes to the website that have broken work-arounds to those API issues
    3. A lack of time on my part

    I actually haven't reported any spam for a good month now.  Not that there hasn't been any; I'm sure there has.  There are at least three or four regular spambot operators harassing this website: one based in Poland, one in the Czech Republic, and the others as yet unidentified.

    The Polish one is the most regular, and there was an example of this person's work in my last post.  Traits include:

    • The "about me" text is identical to the title of one of the links; often written in Polish
    • Often, the avatar chosen is of a character in a Nintendo game, although there are a few others (e.g. several Apple-themed avatars, one featuring a ring of coloured pencils, etc)
    • Sometimes, the "display name" matches one of a couple of regular expressions (a quick check against these is sketched after this list):
      • ^[a-zA-Z][0-9][a-zA-Z][a-zA-Z][a-zA-Z]$ -- which also matches some radio callsigns (particularly UK, Italy and US calls)
      • [0-9][a-zA-Z][0-9][0-9][0-9][a-zA-Z]$
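
    A quick Python check against those two patterns (a match is only a weak signal, given the callsign overlap noted above):

    import re

    # The two display-name patterns described above.
    DISPLAY_NAME_PATTERNS = [
        re.compile(r'^[a-zA-Z][0-9][a-zA-Z][a-zA-Z][a-zA-Z]$'),
        re.compile(r'[0-9][a-zA-Z][0-9][0-9][0-9][a-zA-Z]$'),
    ]

    def display_name_is_suspicious(name):
        """Return True if the display name matches one of the spambot patterns."""
        return any(p.search(name) for p in DISPLAY_NAME_PATTERNS)

    print(display_name_is_suspicious("a1bcd"))     # True (also the shape of a radio callsign)
    print(display_name_is_suspicious("Jane Doe"))  # False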

    The Czech operator is a little more subtle, using English text and a "normal" name (usually male), and is thus harder to spot just by looking at the profile list.

    PCBGogo and ALLPCB at one point were making particular pests of themselves.  I don't mind PCB manufacturers having a presence here, but this isn't an advertising billboard, it's a community.  If you want the former, there's Facebook for that.

    Cleaning up the mess was easily a two-hour effort each night, every night, with a dozen or more hours spent each week-end.  My other projects were suffering from the lack of time being allocated to them.

    I developed keyboard shortcuts to locate the "Report" link (Firefox will "jump" to the link if you hit ' and then type "report"), and kept some canned reasons to copy and paste into the field that pops up.  With some work on the DOM (e.g. using JQuery as in my previous post), you can set up keyboard shortcuts to streamline this, which did save a lot of time.  Sadly, not enough.

    A web service that collected the information and performed some automatic analysis also helped identify spammers that the manual methods missed.  I think there's a lot to be learned from the approach used, so I'll probably archive the Github project rather than delete it.  The server's development proved a useful testing ground for refining such spam-fighting algorithms.

    In order to actually retrieve the information, I used the Hackaday.io API.  This wasn't exactly efficient (something GraphQL-based could be a lot better), but it worked well enough.  A big thorn in my side, though, was that getting a listing of users sorted by either user ID or registration date was nigh on impossible: the former was never supported, and the latter returned error 500s.

    The latter problem has been reported several times already, and represents the single highest barrier to this project, the second being the monthly request quota.

    My work-around, which was documented here (I won't go looking through the logs for the references), was to do a regex-based screen-scrape of the "new users" listing on hackaday.io.  That worked well until a few months back, when a design change replaced the user ID-based links to users with "vanity" URLs built on the user name.

    The work-around to this would be to load the linked-to page, then scrape it for the user ID (it appears in several places).  Having gone that far, one might as well scrape the other information from that page as well.  I haven't done that server-side, but I did implement a Greasemonkey/Tampermonkey script that does exactly this in my previous log.  The downside to this approach is that it generates a...

    Read more »

  • Browser-based spambot hunting

    Stuart Longland  01/09/2019 at 09:50  0 comments

    So, I've been slowly doing a major re-work of the back-end server.  Some of this is because the Hackaday site has changed in ways that have broken the old scripts (particularly the code that retrieves new users), and some of it is because the old code was woefully inefficient, unstable, or both.

    One bugbear of mine is that when the API gets blocked, I'm back to doing things manually.  That actually makes the spambots a lot harder to pick out: while they do have very distinct traits, those traits are not apparent from the standard user profile list.

    The initial spin of the HADSH server used the API directly to retrieve users as you browsed, and basically replicated the same view but with more profile information provided in the user list so you could more easily spot the spammers.  That was the original concept: to better highlight the information that was being published so you could make a better decision.

    In thinking about this, I thought: "how can this be done without a back-end server?"  Turns out, this wasn't so difficult.  Firefox and Chromium have extensions that allow you to run arbitrary scripts on a web page: Greasemonkey and Tampermonkey.  I could write a script that retrieves the additional information and inserts it directly into the DOM of the profile list page.

    The problem was how to retrieve this information.  Yes, there's the hackaday.io API, if you know the user ID, but one of the changes made is that the profile list page no longer displays the user ID.  (Indeed, the server I wrote was using that very hack to enumerate new users.)

    You can get the ID from the user profile page, and I can get everything else from that page too, but for now I just want the user's links.  They're in the HTML, but how do you extract them?  As it happens, the Hackaday.io website is built on JQuery, so one is able to use that to perform an XMLHttpRequest, then, with some post-processing, parse the HTML that comes back and extract the links.  These can then be injected back into the user list DOM, so that all the information I need is on one page.

    As a bonus, I don't need to collect user profile information persistently on a back-end server, which is something I was never comfortable doing, but needed to do for technical reasons previously.

    In order to avoid hammering the hackaday.io server too hard (and to avoid getting blacklisted), these XHRs are rate-limited with a randomised interval and issued one at a time.  Yes, it takes a few seconds for all the users' links to show up, but it beats having to open each one manually in a new tab, and it reduces the chances of false positives.
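
    The throttling itself is nothing exotic.  Here it is sketched in Python purely to show the principle (the userscript does the same thing with XHRs; the delay bounds below are arbitrary illustration values):

    import random
    import time
    import urllib.request

    def fetch_politely(urls, min_delay=2.0, max_delay=6.0):
        """Fetch each URL in turn, sleeping a random interval between requests.

        The delay bounds are illustrative; the point is that requests go out
        one at a time with a randomised gap, so the remote server never sees
        a burst of traffic.
        """
        for url in urls:
            with urllib.request.urlopen(url) as response:
                yield url, response.read()
            time.sleep(random.uniform(min_delay, max_delay))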

    The result looks like this:

    Previously, I'd have to visit each and every one of those pages, but this provides the links right there so that I can investigate further.  Here, we can see the traits of a certain Polish spammer: they always create users whose "about me" text matches the name of the hyperlink.  Others almost always have a link to their business.

    The only ones that won't show up here are the ones that post pages or projects linking to their website.  I'd have to look at the DOM of a profile page more carefully to see if that could be extracted and placed here, maybe just the headings so it's not too cluttered.

    There's potential for this to do XHRs to a back-end system which could look up those links on URIBL or check whether they've been spamvertised here before.  The back-end system can be less concerned with "users" and more with the use of individual words, what websites are linked to, and so on.  There are a few libraries that do natural language tokenisation, but none as good as Python's polyglot, so being able to take a blob of text, throw it at the server, and have it tell me the score of each word used would be a bonus.

    The avatar can also be checked: I actually started looking into this, using 5 hashing algorithms.  SHA512, which is used to check for...

    Read more »

  • Cleaning up the top-level domain scores

    Stuart Longland  09/09/2018 at 10:21  0 comments

    So, earlier I alluded to an issue with the hostname parsing in that it was dumb: it would strip off the very top-level domain (so for 'example.com' it would ignore the '.com'), but it didn't treat country-code second-level domains any differently; e.g. 'digikey.com.au' would generate scores for 'digikey.com.au' (correct) and 'com.au' (not correct).

    To add more fun, some countries skip that second-level domain.  youtu.be anyone?

    Thankfully, there's a list I can consult.  So, I wrote some code to consume this, and slurped it into a temporary table to perform the clean-up.  Below is what the scores were for all those top-level domains.

    hadsh=> select * from hostname where hostname in (select name from tld);
     hostname_id |     hostname      | score | count
    -------------+-------------------+-------+-------
            2191 | co.nz             |     2 |     2
            2640 | com.ve            |     0 |     0
            2641 | net.ve            |     0 |     0
            2226 | github.io         |     3 |     3
            2456 | blogspot.co.nz    |     1 |     1
            2671 | blogspot.de       |     0 |     0
            2306 | co.jp             |     1 |     1
            2353 | com.br            |    -7 |     7
            1847 | co.uk             |   -51 |    63
            1967 | org.br            |    -1 |     1
            1978 | com.pl            |    -3 |     3
            2374 | net.in            |    -1 |     1
            2038 | org.uk            |    -4 |     4
            2103 | org.pl            |    -1 |     1
            1865 | blogspot.com      |     4 |     6
            2503 | com.au            |    -5 |     5
            1901 | co.il             |    -3 |     3
            1898 | com.tr            |    -1 |     1
            1899 | blogspot.com.tr   |    -1 |     1
            1935 | yolasite.com      |    -2 |     2
            2111 | edu.pl            |   -10 |    10
            2134 | azurewebsites.net |     1 |     1
    (22 rows)
    

    I've just cleaned those up.  Hopefully the new domain parsing code will take care of the rest.
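
    For what it's worth, the same suffix-aware split is available off the shelf.  Assuming the list in question is the public suffix list (the tld table above includes entries such as github.io and blogspot.com, which appear in it), the tldextract package consumes that list and separates the registrable domain from the suffix, so only the former ever gets scored.  A rough sketch (the HADSH code consumes the list itself instead):

    # tldextract splits a hostname into subdomain, registrable domain and
    # public suffix, so a country-code second-level domain like "com.au" is
    # never scored on its own.
    import tldextract

    for host in ("digikey.com.au", "youtu.be"):
        ext = tldextract.extract(host)
        print(f"{host}: registered domain {ext.registered_domain!r}, suffix {ext.suffix!r}")

    # digikey.com.au: registered domain 'digikey.com.au', suffix 'com.au'
    # youtu.be: registered domain 'youtu.be', suffix 'be'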

  • Re-working the frontend using modern JavaScript

    Stuart Longland  09/09/2018 at 03:41  0 comments

    So, the API is blocked yet again, and I've spent some time re-working the front-end UI.

    One mistake I made early on was to have the system immediately begin classification of a user the moment you clicked the "legit" or "suspect" buttons.  That's fine if the click was intentional, but if you did it by mistake, tough: it would take you at your word and immediately move the user to that group.

    I later added a delay to give me a moment to re-load the page or defer the user, so there was a way to cancel such mistakes before they took effect.

    Most of the time, though, there's nothing to flag a user as spam, so you find yourself clicking the "legit" button a lot.  I rectified this with a button that automatically marks the "auto_legit" users as "legit", letting me focus on the ones that are actually spammers and letting the others fall through.

    Thinking about this, bulk operations work well: it's better to tell the system how to mark a user, then later commit that result once you are satisfied the user is legitimate or suspect.  Sometimes a user isn't picked up as a spammer but warrants more careful checking, so it helps to be able to mark them as "neutral" and come back to them later.

    The way the UI manipulated the DOM was awkward too.  I didn't want to drag in something big like JQuery, AngularJS or Dojo; what I needed to do wasn't that hard, it just needed a little abstraction.

    That abstraction is a class called DOMElement.  That just removes a little bit of boilerplate when manipulating the DOM and provides some conveniences.

    The UI was also very mouse-heavy, yet the keyboard actually works quite well for lots of operations.  It'd be nice to navigate between users by pressing keys and flag them with other keys.  Happily, L and S are at opposite ends of a US keyboard.  If I use the term "hide" rather than "defer", H is in the middle.  I was thinking of N for next and B to go back, and with N taken, that suggests U for neutral.  Pressing C should commit what's pending.

    It turns out that in JavaScript there are a few tricks needed to make this work on a DIV:

    1. Set the tab index on the DIV so it can accept keyboard focus
    2. Add an event listener on the DIV.  That listener should suppress the default action when it handles a key press (so the find-in-page pop-up doesn't make a nuisance of itself), and it should also watch for the ALT and CTRL modifiers so it can ignore combinations like CTRL-L (go to location bar) or CTRL-C (copy).

    A bit more code, and I had a way to do all of the above.

    The code now uses classes a lot more, wraps XMLHttpRequest in Promises, and I've tried to better separate the business logic from the display logic.

    I want to make it a bit more modular, maybe I should look at a loader like require.js or similar… it annoys me that while it's possible for HTML to embed a script either in-line or from a separate file, that script is not able to "import" other resources itself without further help.

    Right now, it's 51kB of hand-written JavaScript, and gzip packs that down to 12kB.  I won't win the 5k website contest with that (never mind the back-end would disqualify it anyway), but it's not going to take an age to load over my ADSL2 uplink either.

    This is what it looks like today.  The original version had a loading spinner animation that was reminiscent of early terminal-based software, so I've decided to keep that aesthetic.

    Down the bottom, it gives me a count of how many users are flagged.  I've found that big batches don't work so well, so it now gives a warning when you reach the century:

    As it's committing these, it also counts them down so I know how long it'll be before it's done.  In the event of a failure, it also tries to carry on with as many of the other operations as possible.  Those that failed stay flagged, and can be retried.

    Up the top, I can now jump around the user...

    Read more »

  • Progress Report

    Stuart Longland  08/04/2018 at 03:13  0 comments

    Well, this project has been slowly grinding away in the background.  My manual reporting of spam accounts has been my primary focus, and it is through our collective reporting of spam (the "report as inappropriate" link … seriously, I urge you all to hunt down this crap and report it so the mods can dispose of it) and the hard work of the Hackaday.io dev-ops team that this place has been kept largely free of spam.

    That said, the scanning via the API has still been useful: in the last month it has flagged at least half a dozen users I'd missed which really shouldn't be here.

    One change to the algorithm I made last month resulted in a drastic increase in how quickly the script moves through the "All Hackers" pages.  Rather than fetching IDs and then immediately querying those IDs, I stash them for later.  This allows me to then request profile data 50 users at a time, reducing the number of hits on the API.  As of writing, the script has worked its way back to page number 10368.
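
    The stash-then-batch idea itself is simple.  A sketch of the chunking side, with fetch_profiles() standing in as a hypothetical placeholder for whatever call retrieves several profiles in one API request:

    def chunked(ids, size=50):
        """Yield successive batches of user IDs, e.g. 50 at a time."""
        batch = []
        for user_id in ids:
            batch.append(user_id)
            if len(batch) >= size:
                yield batch
                batch = []
        if batch:
            yield batch

    # fetch_profiles() is a hypothetical placeholder, not the real client code:
    # it takes a list of IDs and returns their profiles in one API request.
    def crawl(stashed_ids, fetch_profiles):
        for batch in chunked(stashed_ids):
            yield from fetch_profiles(batch)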

    How many pages are there?  Well, the first users on hackaday.io are waay back on page 14395.  That'll probably slip back a bit, but soon it'll be a case of working through that list of IDs and focussing on the newcomers only.  I don't expect to see much spam as I work back towards the last page of the newest user list.

    I certainly am considering how to generalise the code so that it can be more readily used for spam detection on other sites.  One blind-spot we have with regards to URI blacklists is stuff posted to web communities such as this.  Email has proven a hard target, so the spammers have naturally migrated to these web-based communities.

    By using a system such as this to scan what people submit in real time, we can really put a dent in the business model of would-be spammers.  The HADSH code is useless for this as-is, as it's dependent on the Hackaday.io API.  I'll probably still stick with Python for now; ditching sqlalchemy for an asynchronous PostgreSQL client is on the plans, along with a re-jig of the schema to make this project more general.


    Update 2018-08-10: It got there:
    hadsh=> select * from newest_user_page_refresh order by page_num desc limit 10;
     page_num |         refresh_date          
    ----------+-------------------------------
        14442 | 2018-08-10 07:27:17.559179+10
        14441 | 2018-08-10 06:46:27.558809+10
        14440 | 2018-08-10 06:15:44.022249+10
        14439 | 2018-08-10 05:43:58.536599+10
        14438 | 2018-08-10 05:42:52.539986+10
        14437 | 2018-08-10 05:41:48.405866+10
        14436 | 2018-08-10 05:40:44.790953+10
        14435 | 2018-08-10 05:39:41.411618+10
        14434 | 2018-08-10 05:38:37.689943+10
        14433 | 2018-08-10 05:37:33.53155+10
    

  • Progress and future plans

    Stuart Longland  05/06/2018 at 04:38  1 comment

    So, it's been about 5 months now since I started this project.  I still battle things like request quotas but the project has matured a bit and I'm starting to get some ideas on how to push it forward.

    Since I started this, so far:

    • 78276 profiles have been analysed.
    • 76681 profiles have been cleared as legitimate user accounts
    • 918 spam profiles have been identified and reported (many of these have now been removed)
    • 46370 words and 186568 word pairs have been identified across many languages

    This is not counting the users found by manually patrolling the All Hackers pages.  Lots have been found and reported that way too.

    Some users have been really big contributors to the vocabulary.  @Morning.Star in particular had the classification script grinding away for 10 minutes; that's more a reflection of the inefficiencies in my code than anything else.  That said, do have a look at his work, seems Nintendo are.

    Getting a feel for what words are typical of our community, and what words are typical of the spammers, really does help in identifying future new users.

    Thanks to some now-deleted users, the "word" 출장만남 (roughly, "outcall meeting") is one of the most strongly negatively scored "words".  (And yeah, I'm sure they offer quite a "meeting".  Never mind that there's an increasing number of women in business either — seems that part of Korea has not gotten the equal opportunities memo yet.)

    As soon as one of these spammers pops up, with some tart more interested in looking at her phone than at the camera, this system sees words like that and immediately knows there's trouble.

    Someone mentions "pi", or "stuff", and the opposite occurs.

    Strangely, the most frequent neutral word (no score) seen thus far is "ich"… I can only think German.

    The system as developed so far is working.  It's far from ideal and is suffering some growing pains, but it generally works well enough that we can say the concept has been proven.  I haven't yet looked at avatar analysis; the mob that were QR-code mad seem to have gone underground.

    One goal I've considered here is whether something could be made that would plug into the infrastructure used to host hackaday.io: api.hackaday.io advertises Jetty, which is a Java-based web server.  OpenNLP is an option for doing this kind of analysis on that platform.

    Java is far from a favourite language for me, but I do know how to write code in it, even if my knowledge is rusty.  It's getting a polish at the moment thanks to my workplace and nHaystack, and thankfully if I'm working on it here, I don't have Tridium's idiosyncrasies to worry about.

    If I keep it as Python, probably the next steps will be to consider re-writing the database logic to either better interface with sqlalchemy, or ditch sqlalchemy in favour of a dedicated PostgreSQL library, ideally one that's asynchronous.  I also need to re-vamp the front-end, as that's rather hacky at present.

    PostgreSQL has worked well, though; I see no reason to change that.  Right now the sluggishness is due more to inefficient use of the database libraries than to the database itself.

  • Analysing user projects and pages for tokens

    Stuart Longland  03/03/2018 at 23:12  1 comment

    Just before the month ended, I tweaked my code so that it'd scan user-created projects and pages for tokens to include when computing the score.  This is because there are some spambots that put all their "content" in pages or projects and not on the profile page.

    Yesterday saw me refine the system further, tweaking how this data is displayed and collected.  Notably, if a newly registered user has a very low absolute score when analysed, we schedule a check later to see if they do anything.
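
    A minimal sketch of that re-check rule; the threshold and delay here are made-up illustration values, and user is a stand-in object with a next_check field:

    from datetime import datetime, timedelta

    RECHECK_THRESHOLD = 1.0             # illustration value only
    RECHECK_DELAY = timedelta(hours=6)  # illustration value only

    def schedule_next_check(user, score):
        """If the score is too close to zero to call either way, look again later."""
        if abs(score) < RECHECK_THRESHOLD:
            user.next_check = datetime.utcnow() + RECHECK_DELAY
        else:
            user.next_check = None  # confident enough; no follow-up needed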

    There's a spambot… spruiking some massage service in Korea (thank you, Google Translate).  The sea of red indicates words that only spambots use.  There are a couple that were used by legitimate users, but most of it is unique to spambots.  It's a wonder it didn't reach a score of -10, given the sheer number of negatively scored words.

    This could almost be good enough to flag the user automatically, but I still believe in human oversight.  No legitimate user scores this low!  To quote one of the members of the Ipswich radio club: "Man must be master!"

    I'm not sure what would be needed to bash this into a form that could run at SupplyFrame… but I believe this at least proves the concept.

  • Scoring by vocabulary

    Stuart Longland  02/03/2018 at 23:03  0 comments

    So, I implemented the scoring by a fairly naïve summing of individual word scores, computed with the earlier algorithm.

    This, because the spambot-to-real-user ratio isn't 50:50, meant that plenty of normal words and phrases boosted the spambots' scores up to normal levels.

    I needed to re-think how I used that data.  In the end, I decided to dump the words' scores into an array, sort it in ascending order, then sum the worst 10.

    That proved to be quite effective.  It means they can't cheat the system by putting their usual text up, then slapping in a handful of hacker lingo words to boost their score.  Users that only use one or two bad words will usually score highly enough to avoid getting flagged.
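
    The worst-10 rule itself is only a few lines.  A sketch, assuming each word's normalised score (score divided by count) has already been looked up:

    def user_score(word_scores, worst_n=10):
        """Sum the N most negative per-word scores for a profile.

        word_scores: normalised per-word scores (score/count, roughly in the
        range -1.0 to +1.0).  Summing only the worst few means a spammer
        can't drown bad words out by padding with hacker lingo.
        """
        return sum(sorted(word_scores)[:worst_n])

    spammy = [-1.0] * 12 + [1.0] * 5   # mostly spam vocabulary, padded with lingo
    print(user_score(spammy))          # -10.0: the padding doesn't help

    mostly_fine = [-1.0, -0.4] + [1.0] * 20
    print(user_score(mostly_fine))     # roughly 6.6: a couple of bad words don't sink a real user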

    The database now has nearly 20000 words and 65800 word pairs, scored according to typical usage of users that have arrived at hackaday.io since early June last year.

    With that information, the script can auto-classify some users with better accuracy:

    2018-02-03 22:47:02,402       INFO HADSHApp.crawler 25397/MainThread: New user: USA Tile & Marble [#xxxxxx]
    2018-02-03 22:47:02,404      DEBUG HADSHApp.crawler 25397/MainThread: User USA Tile & Marble [#xxxxxx] is in groups set() (classified False)
    2018-02-03 22:47:03,433      DEBUG HADSHApp.crawler 25397/MainThread: Inspecting user USA Tile & Marble [#xxxxxx]
    2018-02-03 22:47:03,440    WARNING polyglot.detect.base 25397/MainThread: Detector is not able to detect the language reliably.
    2018-02-03 22:47:03,443      DEBUG     HADSHApp.api 25397/MainThread: Query arguments: {'per_page': 50, 'page': 1, 'api_key': 'xxxxxxxxxxxxxxxx'}
    2018-02-03 22:47:03,446      DEBUG     HADSHApp.api 25397/MainThread: GET 'https://api.hackaday.io/v1/users/xxxxxx/links?per_page=50&page=1&api_key=xxxxxxxxxxxxxxxx'
    2018-02-03 22:47:04,683       INFO HADSHApp.crawler 25397/MainThread: User USA Tile & Marble [#xxxxxx] has link to VIEW OUR SHOWROOM <[REDACTED]>
    2018-02-03 22:47:04,754      DEBUG HADSHApp.crawler 25397/MainThread: New word: mosaic
    2018-02-03 22:47:04,789      DEBUG HADSHApp.crawler 25397/MainThread: New word: ceramic
    2018-02-03 22:47:04,808      DEBUG HADSHApp.crawler 25397/MainThread: New word: tiles
    2018-02-03 22:47:04,818      DEBUG HADSHApp.crawler 25397/MainThread: New word: porcelain
    2018-02-03 22:47:04,862      DEBUG HADSHApp.crawler 25397/MainThread: New word: marble
    2018-02-03 22:47:04,891      DEBUG HADSHApp.crawler 25397/MainThread: New word: showroom
    2018-02-03 22:47:04,901      DEBUG HADSHApp.crawler 25397/MainThread: New word: travertine
    2018-02-03 22:47:04,911      DEBUG HADSHApp.crawler 25397/MainThread: New word: collection
    2018-02-03 22:47:04,945      DEBUG HADSHApp.crawler 25397/MainThread: New word: flooring
    2018-02-03 22:47:04,963      DEBUG HADSHApp.crawler 25397/MainThread: New word: pompano
    2018-02-03 22:47:04,973      DEBUG HADSHApp.crawler 25397/MainThread: New word: tile
    2018-02-03 22:47:06,090      DEBUG HADSHApp.crawler 25397/MainThread: User USA Tile & Marble [#xxxxxx] has score -3.675362
    2018-02-03 22:47:06,098      DEBUG HADSHApp.crawler 25397/MainThread: Auto-classifying USA Tile & Marble [#xxxxxx] as suspect

    That's out of the logs.  The script "learned" some new words there.  In the database, we can see how those words are scored:

    hadsh=> select score, count, score::float/count::float rel_score from word where word='flooring';
     score | count | rel_score 
    -------+-------+-----------
        -2 |     2 |        -1

    As I say, machine learning at its most primitive.  I've considered whether to integrate the URIBL or SURBL DNS blacklists, but so far it hasn't been necessary, nor have I seen any links yet (of the few that I tried) actually show up in those blacklists.
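
    For reference, such a check is just a DNS A-record query against the blacklist zone.  A sketch against SURBL's combined list (multi.surbl.org is SURBL's published zone; you'd feed it the registered domain from each link, and heavy use is subject to their usage policy):

    import socket

    def domain_is_listed(domain, zone="multi.surbl.org"):
        """Return True if the registered domain appears in a SURBL-style DNSBL.

        A listed domain resolves to an address in 127.0.0.0/8 under the
        blacklist zone; an unlisted one simply fails to resolve.
        """
        try:
            addr = socket.gethostbyname(f"{domain}.{zone}")
        except socket.gaierror:
            return False
        return addr.startswith("127.")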

    For now, the site is once again blocked, so it's back to the manual methods again.  Things going to plan, we should be able to expand the data set to cover arrivals in late 2016 once the user retrieval resumes.

  • Scoring on words

    Stuart Longland  02/02/2018 at 23:33  0 comments

    So, I implemented polyglot in the code, and while I'm not yet doing the user classification, I am at least collecting the information, and already I'm seeing some trends.

    I have a table of words with four columns:

    • BIGINT primary key (indexing by numeric values is easier for a database)
    • TEXT word (with an index for fast look-up)
    • INTEGER score
    • INTEGER count

    The score and count are how we'll keep track of the "spamminess" of a word.  These are updated when a user is classified (by a human).  If the user is classed as legitimate, both get incremented; if they're a spambot user, count is incremented while score is decremented.

    The end result is that, when you compute score/count, this normalised score is closer to +1.0 for words that are typical of legitimate users, and closer to -1.0 for spambot users.
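
    In code, the update rule and the normalised score look something like this (a sketch only; word_row stands in for a row of the word table, and the real code does this against PostgreSQL):

    def update_word(word_row, user_is_legit):
        """Apply one human classification to a word's running tallies."""
        word_row.count += 1
        word_row.score += 1 if user_is_legit else -1

    def normalised_score(word_row):
        """+1.0 for words only legitimate users use, -1.0 for spambot-only words."""
        return word_row.score / word_row.count if word_row.count else 0.0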

    I'm not sure how good the tokenisation is in polyglot for non-English scripts, but so far the spambots I've seen that post in Korean or Chinese tend to have other traits that scream spambot, like creating lots of "projects" with much the same text.

    When a 5-minute-old user account has 20 projects, one scratches their head and wonders why a legitimate user would do that.

    Already, this approach is showing some insights:

    hadsh=> select word, score, count, score::float/count::float rel_score from word where count>1 order by rel_score, count desc;
           word        | score | count |     rel_score      
    -------------------+-------+-------+--------------------
     estate            |    -7 |     7 |                 -1
     agent             |    -6 |     6 |                 -1
     top               |    -4 |     4 |                 -1
     realtors          |    -4 |     4 |                 -1
     bowie             |    -3 |     3 |                 -1
     rated             |    -2 |     2 |                 -1
     polska            |    -2 |     2 |                 -1
     transhelsa        |    -2 |     2 |                 -1
     erekcja           |    -2 |     2 |                 -1
     real              |    -4 |    10 |               -0.4
     md                |    -1 |     3 | -0.333333333333333
     en                |     0 |     8 |                  0
     happy             |     0 |     2 |                  0
     local             |     0 |     2 |                  0
     barcelona         |     1 |     3 |  0.333333333333333
     mi                |     1 |     3 |  0.333333333333333
     la                |     3 |     7 |  0.428571428571429
     best              |     5 |    11 |  0.454545454545455
     really            |     4 |     6 |  0.666666666666667
     de                |    13 |    19 |  0.684210526315789
     hi                |     6 |     8 |               0.75
     am                |   133 |   139 |  0.956834532374101
     in                |   171 |   175 |  0.977142857142857
     with              |    86 |    88 |  0.977272727272727
     my                |    92 |    94 |  0.978723404255319
     i                 |   406 |   412 |  0.985436893203884
     and               |   363 |   367 |  0.989100817438692
     a                 |   290 |   292 |  0.993150684931507
     ,                 |   727 |   731 |   0.99452804377565
     .                 |   806 |   810 |  0.995061728395062
     github            |   363 |   363 |                  1
     to                |   345 |   345 |                  1
     twitter           |   172 |   172 |                  1
    …
     california        |    27 |    27 |                  1
     interesting       |    26 |    26 |                  1
     know              |    26 |    26 |                  1
     germany           |    26 |    26 |                  1
     work              |    25 |    25 |                  1
     electrical        |    25 |    25 |                  1
     enthusiast        |    25 |    25 |                  1
     arduino           |    25 |    25 |                  1
     working           |    24 |    24 |                  1
     3d                |    24 |    24 |                  1
     as                |    24 |    24 |                  1
     science           |    24 |    24 |                  1
     world             |    24 |    24 |                  1
     &                 |    23 |    23 |                  1
     make              |    23 |    23 |                  1
     hack              |    23 |    23 |                  1
     hobbyist          |    22 |    22 |                  1
     indonesia         |    22 |    22 |                  1
     iot               |    22 |    22 |                  1
     what              |    22 |    22 |                  1
     years             |    22 |    22 |                  1
     have              |    22 |    22 |                  1
     you               |    22 |    22 |                  1
     ingin             |    22 |    22 |                  1
     hardware          |    21 |    21 |                  1
     all               |    21 |    21 |                  1
     diy               |    21 |    21 |                  1
     retired           |    20 |    20 |                  1
     because           |    20 |    20 |                  1
     guy               |    19 |    19 |                  1
     here              |    19 |    19 |                  1
     ideas             |    19 |    19 |                  1
     cool              |    18 |    18 |                  1
     old               |    18 |    18 |                  1
    …

    So some are definitely specific to spammers… and I'll apologise now to the people of Bowie, MD (you can blame one of your local business owners for the bad reputation, it'd only take 3 legitimate users mentioning "bowie" to offset this).

    Already, we know to be suspicious the moment they mention "realtors" or "estate".  Word adjacency is also tracked:

    hadsh=> select (select word from word where word_id=proceeding_id) as proceeding, (select word from word where word_id=following_id) as following, score, count, score::float/count::float rel_score from word_adjacent where count>1 order by rel_score, count desc;
       proceeding   |   following    | score | count |     rel_score     
    ----------------+----------------+-------+-------+-------------------
     real           | estate         |    -7 |     7 |                -1
     estate         | agent          |    -6 |     6 |                -1
     best           | real           |    -2 |     2 |                -1
     top            | rated          |    -2 |     2 |                -1
     rated          | real           |    -2 |     2 |                -1
     , |...
    Read more »

  • Playing with polyglot

    Stuart Longland  02/02/2018 at 11:07  0 comments

    So, last post I discussed tokenising the language and counting up word frequency.  I did some SQL queries that crudely stripped the HTML and chopped up the text into words.

    It worked, kinda.

    Firstly, it was an extremely naïve algorithm: it would tokenise the word "you're" as "you" and "re".  I could try to embed exceptions, but that'd only work for English.  It would somewhat work for German, French and Spanish, since English borrows a lot of words from those languages, but I think it'd have less success there since those languages have their own special rules.

    It'd fall flat on its face when it came to Arabic, Chinese, Japanese, Korean… etc.

    Now, I'd have to think back to the mid-90s when I was studying Japanese to think how the sentence structures there worked.  We had to as part of primary school studies, and I was never any good at it then.  To this day I recall something about someone having an itchy knee, a guy named Roko, then it got a little rude!

    So I'd be the last person that should be writing a natural language parser for Japanese, let alone the others!

    Enter polyglot.

    Polyglot is a toolkit for natural language processing in Python.  Among its features is tokenisation.  It is able to detect the language used, then apply rules to tokenise the text into words, which we can then count and use.  I tried it with some recent users here, copying and pasting their profile text into the ipython shell, and lo and behold, it was able to identify the language and tokenise the words out.
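
    A minimal example of the sort of thing that ipython session looked like; polyglot's Text class does both the language detection and the tokenisation (output shown approximately):

    from polyglot.text import Text

    # "Hello everyone, I am an electronics engineer."
    sample = Text("Bonjour tout le monde, je suis ingénieur en électronique.")
    print(sample.language.code)  # 'fr' -- detected automatically
    print(sample.words)          # ['Bonjour', 'tout', 'le', 'monde', ',', 'je', ...]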

    It may not be perfect, but it's better than anything I can write.

    The catch is, it's GPLv3, whereas up to now my code was under the BSD license.  Since I'm the sole developer so far, we'll just switch to GPLv3 as well.  Not my cup of tea for a software license (in particular, being able to sue for infringement counts for nought if you don't have the time or money to defend your copyright), but it's not really a big deal in this case.

    I'll look around for something to strip the HTML out, and we should be in business.
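
    One candidate that needs no extra dependencies is the standard library's html.parser; a sketch of using it to keep only the text content:

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect only the text content of an HTML fragment."""
        def __init__(self):
            super().__init__()
            self.parts = []

        def handle_data(self, data):
            self.parts.append(data)

    def strip_html(fragment):
        extractor = TextExtractor()
        extractor.feed(fragment)
        return "".join(extractor.parts)

    print(strip_html("<p>I build <b>3D printers</b> in my spare time.</p>"))
    # I build 3D printers in my spare time.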