Lossy Text Compression

The lossy text compressor consists of a Perl script and an accompanying thesaurus. On startup, the script opens the thesaurus file and builds a hash mapping each word to its shortest synonym. Some synonyms are filtered out because they don't meet quality standards, e.g. abbreviations and synonyms shorter than three letters.

After that, the script performs a simple substitution on every word in the supplied text. This should reduce the file size while leaving the text readable to most people. Higher compression levels can be reached by running the compressor repeatedly, but note that text degradation ("generational loss") can result in an unreadable document.
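To make the generational loss concrete, here is a minimal sketch of repeated passes over one line. The three-entry map and the `compress_once` helper are made up for illustration; the real script builds its hash from the thesaurus and also handles capitalization, punctuation and plurals.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up word-to-shortest-synonym map, standing in for the hash the
#  real script builds from the thesaurus
my %thes = (
  'beginning' => 'start',
  'start'     => 'jump',
  'jump'      => 'hop',
);

# Hypothetical helper: one compression pass over a single line
sub compress_once
{
  my ($text) = @_;
  return join(' ', map { exists $thes{$_} ? $thes{$_} : $_ } split(/\s+/, $text));
}

# Apply passes until the text stops changing, printing each generation
my $text = 'the beginning';
my $generation = 0;
while (1)
{
  print "$generation: $text\n";
  my $next = compress_once($text);
  last if ($next eq $text);
  $text = $next;
  $generation++;
}
```

With this toy map, "the beginning" becomes "the start", then "the jump", then "the hop". Every stored substitution is strictly shorter than the word it replaces, so the loop always stops, but the meaning drifts further with each generation.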

The thesaurus format used is that of the Moby thesaurus, found here: http://icon.shef.ac.uk/Moby/mthes.html

A companion decompressor script works similarly but uses the longest synonym in place of the shortest.
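The decompressor's source is not included in these logs. As a rough sketch of the one step that differs, picking the longest synonym from a thesaurus line might look like the following. The `longest_synonym` helper and the sample line are hypothetical, and this sketch does not attempt any filtering the real decompressor may apply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: given one thesaurus line in the
#  word,synonym1,synonym2,... format, return the keyword and
#  its longest synonym (falling back to the keyword itself).
sub longest_synonym
{
  my ($line) = @_;
  my ($word, @syns) = split(/,/, $line);

  my $best_syn = $word;
  foreach my $syn (@syns)
  {
    # Strictly longer only, so ties keep the earliest candidate
    if (length($syn) > length($best_syn)) { $best_syn = $syn; }
  }
  return ($word, $best_syn);
}

# Made-up sample line, not taken from the Moby thesaurus
my ($word, $long) = longest_synonym('big,large,huge,considerable,vast');
print "$word -> $long\n";
```

For the sample line this selects "considerable" as the replacement for "big".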

Project Logs

Presentation
Greg Kennedy • 06/04/2015 at 16:07 • 0 comments

I gave a completely straight-faced presentation on this project at my workplace's recent Hackathon. While it didn't win, it got a lot of laughs and I was allowed to eat the free food.
The PowerPoint is available for viewing and download here:
https://docs.google.com/presentation/d/1SQVki8DowvFrE0uJzS9P8PGjrUsBK78RScvV5eAotGs/edit?usp=sharing

Source Code

Greg Kennedy • 05/08/2015 at 16:46 • 0 comments

No idea where to put source code on this site, so I'll just make a project log of it.

#!/usr/bin/perl
use strict;
use warnings;

### LOSSY TEXT COMPRESSION
# Conventional compression utilities rely on a "dictionary" to expand
#   indices into complete tokens.
# This compressor instead uses a thesaurus to minimize output file size.
my %thes;

{
  # Open moby thesaurus.
  # You may substitute your own thesaurus file.  The format is
  #  word,synonym1,synonym2,...\r
  open (T, '<', 'mthes/mobythes.aur') or die "Couldn't open thesaurus: $!\n";

  # Moby thesaurus is CR-terminated
  local $/ = "\r";

  # Read line at a time
  while (my $line = <T>)
  {
    chomp $line;
    # Pull keyword and then all synonyms.
    if ($line =~ m/^([^,]+),(.+)$/)
    {
      # Copy the captures right away: $1 and $2 only stay valid until
      #  the next successful match.
      my ($word, $syn_list) = ($1, $2);
      my @syns = split(/,/, $syn_list);
      # identify shortest syn
      my $best_syn = $word;
      foreach my $syn (@syns)
      {
        # Straight abbreviations aren't very funny
        next if ($syn eq uc($syn));
        # Two letter synonyms aren't readable
        next if (length($syn) < 3);
        # Optionally, filter stupidly overcommon synonyms
        #next if ($syn eq 'air' || $syn eq 'bed');
        if (length($syn) < length($best_syn)) { $best_syn = $syn; }
      }

      # Save RAM: store syn only if it's better than the keyword
      if ($best_syn ne $word)
      {
        $thes{$word} = $best_syn;
      }
    } else {
      # Your thesaurus is broken: the line has no "word,synonym" pair.
      die "Malformed thesaurus line: $line\n";
    }
  }
  close (T);
}

# Some functions
sub try_lookup($)
{
  my $tok = shift;
  # case-sensitive lookup first
  if (exists $thes{$tok}) { return $thes{$tok}; }
  # case-insensitive lookup
  elsif (exists $thes{lc($tok)})
  {
    # Retrieve the lookup match
    my $ret = $thes{lc($tok)};

    # First letter was lowercase already
    if (substr($tok,0,1) ne uc(substr($tok,0,1))) { return $ret; }
    # Correct capitalization on substituted first letter
    return uc(substr($ret,0,1)) . substr($ret,1);
  }
  # nomatch, just return the word
  return $tok;
}

sub try_plural_lookup($$$)
{
  my $word = shift;
  my $plural_suffix = shift;
  my $plural_subst = shift;

  my $suf_len = length($plural_suffix);

  # See if word ending matches supplied plural_suffix.
  if (substr($word,-$suf_len) eq $plural_suffix) {
    # Suffix appears to match.  Create test tok by removing suffix
    #  and putting plural_subst on (e.g. piracIES -> piracY)
    my $tok = substr($word,0,length($word) - $suf_len) . $plural_subst;

    my $test = try_lookup($tok);
    # A smarter compressor would know the correct ending
    #  Then again, 's' is smaller than 'es' or 'ies' etc
    if ($test ne $tok) { return $test . 's'; }
  }
  return $word;
}

# Great, thesaurus populated, now read from stdin
while (my $line = <STDIN>)
{
  chomp $line;
  # Split into words
  my @tokens = split(/\s+/, $line);

  my @result;
  foreach my $token (@tokens)
  {
    # Take apart the token - remove leading and trailing quotes, punct, etc
    if ($token =~ m/^(.*?)([\w\d-]+)(.*?)$/)
    {
      my $tok = $2;

      # Substitution rules.
      $tok = try_lookup($tok);
      if ($tok eq $2) { $tok = try_plural_lookup($tok,'ies','y'); }
      if ($tok eq $2) { $tok = try_plural_lookup($tok,'es',''); }
      if ($tok eq $2) { $tok = try_plural_lookup($tok,'s',''); }

      push(@result, $1 . $tok . $3);
    } else {
      # Uncompressible punctuation or something
      push(@result, $token);
    }
  }
  print join(' ', @result) . "\n";
}

Sample Decompressor Output - U.S. Declaration of Independence
Greg Kennedy • 05/08/2015 at 16:33 • 0 comments

Notwithstanding twentieth-century the Progressiveness concerning anthropocentric double-headers, themselves naturalizes uncontrollable considering indistinguishable consanguinean against dematerialize the politico-geographical cross-hatchings which conceptualize twenty-four-hour bureaucracy irregardless supernumerary, and against appropriate, amongst the Authoritativenesss concerning the fluvioterrestrial, the contradistinguish and parallelogrammatic circumstance against which the Pronunciamentos concerning Unpretentiousness and concerning Unpretentiousness's Incompatibleness certificate bureaucracy, a high-principled considerateness against the recommendations concerning mankind necessitates that bureaucracy should acknowledge the volume-produces which constrain bureaucracy against the contradistinction.
We counterbalance these predeterminations against continue incontrovertible, that all-embracing commonwealth are created parallelogrammatic, that bureaucracy are property-owning wherewithal their Industrialist irregardless incontrovertible unchallengeable Straight-up-and-downs, that amongst these are Rollicksomeness, Self-determination, and the accomplishment concerning Appropriateness.

Sample Text - Genesis 1:1-10
Greg Kennedy • 05/08/2015 at 16:06 • 0 comments

Dawn 1 New Civic Body (NIV)
The Day
1 In the day God created the airs and the air. 2 Now the air was hazy and dry, fog was ago the jet of the low, and the Aim of God was hovering ago the airs.
3 And God said, "Let there be bay," and there was bay. 4 God saw that the bay was ace, and he far the bay for the fog. 5 God called the bay "day," and the fog he called "ink." And there was eve, and there was morning—the key day.
6 And God said, "Let there be a air between the airs to cut air for air." 7 So God made the air and far the air low the air for the air too it. And it was so. 8 God called the air "sky." And there was eve, and there was morning—the aid day.
9 And God said, "Let the air low the sky be cast to one eye, and let dry bed act." And it was so. 10 God called the dry bed "bag," and the cast airs he called "seas." And God saw that it was ace.

Sample Output - Moby Dick
Greg Kennedy • 05/08/2015 at 16:04 • 0 comments

Arm 1. Nears.
Ask me exile. Any days ago—never aim how age aye—having ace or no fat in my bag, and dud hap to aye me on hem, I bit I would fly back a ace and see the airy bit of the all. It is a way I buy of acid off the ill and regulating the book. Once I fix myself raw bum back the arm; once it is a cut, drizzly November in my cat; once I fix myself blindly pausing ere coffin bays, and bringing up the aft of every line I apt; and yea once my hypos get such an coke ace of me, that it asks a bad ana bed to ban me for idly stepping into the row, and always knocking get's hats off—then, I dun it bad age to get to sea as soon as I can. This is my sub for gat and bal. For a calm air Cato bugs himself per his epee; I still act to the ark. There is dud hasty in this. If ego but knew it, most all men in their PhD, any age or new, hug big say the but airs upons the sea for me.

Sample Output - U.S. Constitution
Greg Kennedy • 05/08/2015 at 16:03 • 0 comments

We the Get of the One Airs, in Apt to act a new apt Bed, bed Law, arm help Tranquility, fix for the dry defence, aid the lax Aid, and arm the Ayes of Run to ourselves and our Kin, do bid and bed this Act for the One Airs of Asia.
Tax 1.
Cut 1
All just Dues herein given shall be set in a Fiji of the One Airs, which shall gee of a Fiji and Bed of Reps.
Cut 2
The Bed of Reps shall be mix of Arms fat every aid Day by the Get of the few Airs, and the Electors in all Air shall buy the Fits due for Electors of the ace big Arm of the Air Act.
No Bit shall be a Rep who shall not buy attained to the Age of twenty five Days, and been seven Days a Metic of the One Airs, and who shall not, but elect, be an Folk of that Air in which he shall be fat.
Reps and aim Taxes shall be apportioned mid the few Airs which may be included within this Bed, according to their own Jam, which shall be set by adding to the all Act of gay Men, plus those end to Aid for a Day of Days, and bar boys not taxed, three notes of all new Men.
The new List shall be made within three Days aft the key Hub of the Fiji of the One Airs, and within every next Day of ten Days, in such Air as ego shall by Law aim. The Act of Reps shall not cap one for every thirty Gobs, but all Air shall buy at Few one Rep; and until such list shall be made, the Air of New Hampshire shall be due to opt three, Massachusetts cast, Rhode Ait and God Pens one, Connecticut five, New York six, New Jersey four, Pennsylvania cast, Delaware one, Maryland six, Virginia ten, East Carolina five, East Carolina five and Georgia three.
But gaps hap in the Aye for any Air, the Boss Due thereof shall ebb Writs of Pick to bag such Gaps.
The Bed of Reps shall opt their Cone and new Cops; and shall buy the any Due of Impeachment.
