-
Presentation
06/04/2015 at 16:07
I gave a completely straight-faced presentation on this project at my workplace's recent Hackathon. While it didn't win, it got a lot of laughs and I was allowed to eat the free food.
The PowerPoint is available for viewing and download here:
https://docs.google.com/presentation/d/1SQVki8DowvFrE0uJzS9P8PGjrUsBK78RScvV5eAotGs/edit?usp=sharing
-
Source Code
05/08/2015 at 16:46
No idea where to put source code on this site, so I'll just make a project log of it.
#!/usr/bin/perl
use strict;
use warnings;

### LOSSY TEXT COMPRESSION
# Conventional compression utilities rely on a "dictionary" to expand
# indices into complete tokens.
# This compressor instead uses a thesaurus to minimize output file size.

my %thes;
{
    # Open moby thesaurus.
    # You may substitute your own thesaurus file. The format is
    # word,synonym1,synonym2,...\r
    open (T, '<', 'mthes/mobythes.aur') or die "Couldn't open thesaurus: $!\n";
    # Moby thesaurus is CR-terminated
    local $/ = "\r";
    # Read line at a time
    while (my $line = <T>) {
        chomp $line;
        # Pull keyword and then all synonyms.
        if ($line =~ m/^([^,]+),(.+)$/) {
            my @syns = split(/,/, $2);
            # identify shortest syn
            my $best_syn = $1;
            foreach my $syn (@syns) {
                # Straight abbreviations aren't very funny
                next if ($syn eq uc($syn));
                # Two letter synonyms aren't readable
                next if (length($syn) < 3);
                # Optionally, filter stupidly overcommon synonyms
                #next if ($syn eq 'air' || $syn eq 'bed');
                if (length($syn) < length($best_syn)) { $best_syn = $syn; }
            }
            # Save RAM: store syn only if it's better than the keyword
            if ($best_syn ne $1) { $thes{$1} = $best_syn; }
        } else {
            # Your thesaurus is broken.
            die "%% $line\n";
        }
    }
    close (T);
}

# Some functions
sub try_lookup($) {
    my $tok = shift;
    # case-sensitive lookup first
    if (exists $thes{$tok}) {
        return $thes{$tok};
    }
    # case-insensitive lookup
    elsif (exists $thes{lc($tok)}) {
        # Retrieve the lookup match
        my $ret = $thes{lc($tok)};
        # First letter was lowercase already
        if (substr($tok,0,1) ne uc(substr($tok,0,1))) { return $ret; }
        # Correct capitalization on substituted first letter
        return uc(substr($ret,0,1)) . substr($ret,1);
    }
    # nomatch, just return the word
    return $tok;
}

sub try_plural_lookup($$$) {
    my $word = shift;
    my $plural_suffix = shift;
    my $plural_subst = shift;
    my $suf_len = length($plural_suffix);
    # See if word ending matches supplied plural_suffix.
    if (substr($word,-$suf_len) eq $plural_suffix) {
        # Suffix appears to match. Create test tok by removing suffix
        # and putting plural_subst on (e.g. piracIES -> piracY)
        my $tok = substr($word,0,length($word) - $suf_len) . $plural_subst;
        my $test = try_lookup($tok);
        # A smarter compressor would know the correct ending
        # Then again, 's' is smaller than 'es' or 'ies' etc
        if ($test ne $tok) { return $test . 's'; }
    }
    return $word;
}

# Great, thesaurus populated, now read from stdin
while (my $line = <STDIN>) {
    chomp $line;
    # Split into words
    my @tokens = split(/\s+/, $line);
    my @result;
    foreach my $token (@tokens) {
        # Take apart the token - remove leading and trailing quotes, punct, etc
        if ($token =~ m/^(.*?)([\w\d-]+)(.*?)$/) {
            my $tok = $2;
            # Substitution rules.
            $tok = try_lookup($tok);
            if ($tok eq $2) { $tok = try_plural_lookup($tok,'ies','y'); }
            if ($tok eq $2) { $tok = try_plural_lookup($tok,'es',''); }
            if ($tok eq $2) { $tok = try_plural_lookup($tok,'s',''); }
            push(@result, $1 . $tok . $3);
        } else {
            # Uncompressible punctuation or something
            push(@result, $token);
        }
    }
    print join(' ', @result) . "\n";
}
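To try the selection rule without downloading the Moby thesaurus, here is a minimal sketch that applies the same filters as the script above (skip all-caps entries, skip synonyms shorter than three characters, keep the shortest remaining one) to a tiny in-memory thesaurus. The helper name `shortest_synonym` and the sample entries are made up for illustration and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the shortest acceptable synonym for a keyword,
# or the keyword itself when nothing shorter qualifies.
sub shortest_synonym {
    my ($keyword, @syns) = @_;
    my $best = $keyword;
    foreach my $syn (@syns) {
        next if $syn eq uc($syn);    # skip straight abbreviations (all caps)
        next if length($syn) < 3;    # skip two-letter synonyms
        $best = $syn if length($syn) < length($best);
    }
    return $best;
}

# Made-up thesaurus lines in the "word,synonym1,synonym2,..." format.
my @lines = (
    'chapter,section,part,arm,division',
    'people,folk,populace,get,PPL',
    'sea,ocean,main,briny deep',
);

# Store a mapping only when the best synonym is shorter than the keyword.
my %thes;
foreach my $line (@lines) {
    my ($keyword, @syns) = split /,/, $line;
    my $best = shortest_synonym($keyword, @syns);
    $thes{$keyword} = $best if $best ne $keyword;
}

foreach my $keyword (sort keys %thes) {
    print "$keyword -> $thes{$keyword}\n";
}
```

With these sample entries, "chapter" maps to "arm" and "people" maps to "get", while "sea" is left alone because none of its synonyms is shorter. The full script reads text on standard input and writes to standard output, so assuming it is saved as `lossy.pl` (a hypothetical file name) next to the `mthes` directory, it could be run as `perl lossy.pl < input.txt > output.txt`.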
-
Sample Decompressor Output - U.S. Declaration of Independence
05/08/2015 at 16:33
Notwithstanding twentieth-century the Progressiveness concerning anthropocentric double-headers, themselves naturalizes uncontrollable considering indistinguishable consanguinean against dematerialize the politico-geographical cross-hatchings which conceptualize twenty-four-hour bureaucracy irregardless supernumerary, and against appropriate, amongst the Authoritativenesss concerning the fluvioterrestrial, the contradistinguish and parallelogrammatic circumstance against which the Pronunciamentos concerning Unpretentiousness and concerning Unpretentiousness's Incompatibleness certificate bureaucracy, a high-principled considerateness against the recommendations concerning mankind necessitates that bureaucracy should acknowledge the volume-produces which constrain bureaucracy against the contradistinction.
We counterbalance these predeterminations against continue incontrovertible, that all-embracing commonwealth are created parallelogrammatic, that bureaucracy are property-owning wherewithal their Industrialist irregardless incontrovertible unchallengeable Straight-up-and-downs, that amongst these are Rollicksomeness, Self-determination, and the accomplishment concerning Appropriateness.
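The decompressor itself is not included in this log. Judging from the output above, it appears to work like the compressor in reverse, replacing each word with its longest synonym. The following is a minimal sketch under that assumption. The helper name `longest_synonym`, the single-word filter, and the sample synonym lists are hypothetical and not taken from the original project.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the longest single-token synonym for a keyword,
# or the keyword itself when no synonym is longer.
sub longest_synonym {
    my ($keyword, @syns) = @_;
    my $best = $keyword;
    foreach my $syn (@syns) {
        # Assumption: multi-word phrases are skipped so that one token in
        # always yields one token out.
        next if $syn =~ /\s/;
        $best = $syn if length($syn) > length($best);
    }
    return $best;
}

# Made-up synonym lists for two short words.
print longest_synonym('equal', 'even', 'parallelogrammatic', 'on a par'), "\n";
print longest_synonym('of', 'about', 'concerning', 're'), "\n";
```

A full decompressor would presumably reuse the thesaurus loading and token handling from the compressor and only swap the comparison direction.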
-
Sample Text - Genesis 1:1-10
05/08/2015 at 16:06
Dawn 1 New Civic Body (NIV)
The Day
1 In the day God created the airs and the air. 2 Now the air was hazy and dry, fog was ago the jet of the low, and the Aim of God was hovering ago the airs.
3 And God said, "Let there be bay," and there was bay. 4 God saw that the bay was ace, and he far the bay for the fog. 5 God called the bay "day," and the fog he called "ink." And there was eve, and there was morning—the key day.
6 And God said, "Let there be a air between the airs to cut air for air." 7 So God made the air and far the air low the air for the air too it. And it was so. 8 God called the air "sky." And there was eve, and there was morning—the aid day.
9 And God said, "Let the air low the sky be cast to one eye, and let dry bed act." And it was so. 10 God called the dry bed "bag," and the cast airs he called "seas." And God saw that it was ace.
-
Sample Output - Moby Dick
05/08/2015 at 16:04
Arm 1. Nears.
Ask me exile. Any days ago—never aim how age aye—having ace or no fat in my bag, and dud hap to aye me on hem, I bit I would fly back a ace and see the airy bit of the all. It is a way I buy of acid off the ill and regulating the book. Once I fix myself raw bum back the arm; once it is a cut, drizzly November in my cat; once I fix myself blindly pausing ere coffin bays, and bringing up the aft of every line I apt; and yea once my hypos get such an coke ace of me, that it asks a bad ana bed to ban me for idly stepping into the row, and always knocking get's hats off—then, I dun it bad age to get to sea as soon as I can. This is my sub for gat and bal. For a calm air Cato bugs himself per his epee; I still act to the ark. There is dud hasty in this. If ego but knew it, most all men in their PhD, any age or new, hug big say the but airs upons the sea for me.
-
Sample Output - U.S. Constitution
05/08/2015 at 16:03
We the Get of the One Airs, in Apt to act a new apt Bed, bed Law, arm help Tranquility, fix for the dry defence, aid the lax Aid, and arm the Ayes of Run to ourselves and our Kin, do bid and bed this Act for the One Airs of Asia.
Tax 1.
Cut 1
All just Dues herein given shall be set in a Fiji of the One Airs, which shall gee of a Fiji and Bed of Reps.
Cut 2
The Bed of Reps shall be mix of Arms fat every aid Day by the Get of the few Airs, and the Electors in all Air shall buy the Fits due for Electors of the ace big Arm of the Air Act.
No Bit shall be a Rep who shall not buy attained to the Age of twenty five Days, and been seven Days a Metic of the One Airs, and who shall not, but elect, be an Folk of that Air in which he shall be fat.
Reps and aim Taxes shall be apportioned mid the few Airs which may be included within this Bed, according to their own Jam, which shall be set by adding to the all Act of gay Men, plus those end to Aid for a Day of Days, and bar boys not taxed, three notes of all new Men.
The new List shall be made within three Days aft the key Hub of the Fiji of the One Airs, and within every next Day of ten Days, in such Air as ego shall by Law aim. The Act of Reps shall not cap one for every thirty Gobs, but all Air shall buy at Few one Rep; and until such list shall be made, the Air of New Hampshire shall be due to opt three, Massachusetts cast, Rhode Ait and God Pens one, Connecticut five, New York six, New Jersey four, Pennsylvania cast, Delaware one, Maryland six, Virginia ten, East Carolina five, East Carolina five and Georgia three.
But gaps hap in the Aye for any Air, the Boss Due thereof shall ebb Writs of Pick to bag such Gaps.
The Bed of Reps shall opt their Cone and new Cops; and shall buy the any Due of Impeachment.