The Daily WTF: Curious Perversions in Information Technology
Welcome to TDWTF Forums Sign in | Join | Help
in Search

Speed Challenge 37 - ITT We Consider Word Scrambles

Last post 02-15-2008 7:04 PM by zzo38. 35 replies.
Page 1 of 1 (36 items)
Sort Posts: Previous Next
  • 09-24-2007 8:13 PM

    • kirchhoff
    • Top 150 Contributor
    • Joined on 02-27-2007
    • ECE 280 (Circuit Analysis)
    • Posts 216

    Speed Challenge 37 - ITT We Consider Word Scrambles

    A programmer friend of yours is making a whiz-bang AJAX-y Web2.0 game that mimics the 'Word Scramble' mode of Brain Age. (Take any word, and scramble the letters in it -- this game ups the ante by moving the letters around while you try to figure out the anagram.)

    Naturally he plans to let end-users submit words to appear in the puzzles. He has reviewed his server-side data quality checking with you ("Yes I remembered to normalize case!") but you are still not convinced this is a good idea.

    "What if a user submits a list of internet memes or curse words?"

    "Oh, that'll be hilarious! It's the power of peer to peer."

    It takes all your willpower to avoid rolling your eyes. Still, you humor him because this project has made him more enthusiastic than he has been in some months.

    In a moment of clarity he realizes that beyond checking for illegal characters and phrases, there is the distinct possibility that one or more words in the user-provided list could anagram to each other, thus some of the puzzles could have multiple correct answers -- he'd like to avoid this as he wanted to animate the letters moving into the correct order when you got the word wrong, but it'd be too hard for his meager DHTML skills to support showing the multiple solutions in a way he found satisfactory.

    "Should I even be worried about this? What are the chances of this happening in any random set of english words?"

    Help our perplexed puzzle programmer!!! Propose a procedure for procuring the percentage of words in a plain text phile which do not pose a problem for this anagram project.

    Translation: Given a text file on standard input, ignoring case and punctionation and such, determine the percentage of words which have possible collisions when anagrammed. Ignore words shorter than 4 letters, as they wouldn't make fun puzzles. Since we're trying to get a good estimate out of this program, the word lists can be very long (> 10,000 words) so your method needs to scale a little bit.
     

  • 09-25-2007 1:23 AM In reply to

    • flop
    • Top 500 Contributor
    • Joined on 04-27-2007
    • Posts 51

    Re: Speed Challenge 37 - ITT We Consider Word Scrambles

    #!/usr/bin/perl
    # one word per line, at least 4 characters in a word.

    $dupl=$count=0;
    %ana=();
    while (<>)
    {
            @chars=sort m#(\l[a-z])#gi;
            next unless @chars>=4;

            $dupl++ if (++$ana{join("",@chars)}) >1;
            $count++;
    }

    printf "Of %d words %d duplicate anagrams (%5.2f%%).\n",
            $count, $dupl, $dupl*100.0/$count;
  • 09-25-2007 2:13 AM In reply to

    • flop
    • Top 500 Contributor
    • Joined on 04-27-2007
    • Posts 51

    Re: Speed Challenge 37 - ITT We Consider Word Scrambles

    Sorry - just saw that \l doesn't seem to work. Please replace the line to:
    @chars=sort map { lc } m#([a-z])#gi;
  • 09-25-2007 10:43 AM In reply to

    • kirchhoff
    • Top 150 Contributor
    • Joined on 02-27-2007
    • ECE 280 (Circuit Analysis)
    • Posts 216

    Re: Speed Challenge 37 - ITT We Consider Word Scrambles

    flop, your frugal foray into this frightful folly is fitting, and is my favorite (albeit first) finalist.

    Can anyone in attendance allocate an attenuated amount of ASCII in your algorithm?

  • 09-25-2007 10:49 AM In reply to

    • kirchhoff
    • Top 150 Contributor
    • Joined on 02-27-2007
    • ECE 280 (Circuit Analysis)
    • Posts 216

    Re: Speed Challenge 37 - ITT We Consider Word Scrambles

    Incidentally -- I'm indicating 16% as the neighborhood number that you should be seeing as you substitute your samples.

    That's an amazing appearence of anagrams in our lay lexicon! 

  • 09-25-2007 4:32 PM In reply to

    Re: Speed Challenge 37 - ITT We Consider Word Scrambles

    flop:
    #!/usr/bin/perl
    # one word per line, at least 4 characters in a word.

    $dupl=$count=0;
    %ana=();
    while (<>)
    {
            @chars=sort m#(\l[a-z])#gi;
            next unless @chars>=4;

            $dupl++ if (++$ana{join("",@chars)}) >1;
            $count++;
    }

    printf "Of %d words %d duplicate anagrams (%5.2f%%).\n",
            $count, $dupl, $dupl*100.0/$count;

    Ok...which one of those lines accesses a dictionary? I'm assuming it's the (later corrected) line regarding the \l? (Something to do with #m or #gi?)

    The first three pages of searching Google for "perl dictionary" and various variants revealed no valuable results.

    (Sorry, I couldn't really alliterate that with 'v' very easily...)

     

  • 09-25-2007 5:16 PM In reply to

    • kirchhoff
    • Top 150 Contributor
    • Joined on 02-27-2007
    • ECE 280 (Circuit Analysis)
    • Posts 216

    Re: Speed Challenge 37 - ITT We Consider Word Scrambles

    too_many_usernames:

    Ok...which one of those lines accesses a dictionary?

     

    #!/usr/bin/perl
    # one word per line, at least 4 characters in a word.

    $dupl=$count=0;
    %ana=();  ## "ana" is declared as a dictionary here.
    while (<>)
    {
            @chars=sort m#(\l[a-z])#gi;
            next unless @chars>=4;

            $dupl++ if (
    ++$ana{join("",@chars)}) >1;    ## "ana" is indexed by the result of join on the
            $count++;                                   ## chars array, and the value is post-incremented.
    }

    printf "Of %d words %d duplicate anagrams (%5.2f%%).\n",
            $count, $dupl, $dupl*100.0/$count;

    various variants revealed no valuable results... (Sorry, I couldn't really alliterate that with 'v' very easily...)

    Don't fret, fair forum-goer, you couched your conundrum cleverly, even as the alliteration ailed you.

  • 09-25-2007 10:13 PM In reply to

    Re: Speed Challenge 37 - ITT We Consider Word Scrambles

    kirchhoff:

    too_many_usernames:

    Ok...which one of those lines accesses a dictionary?

     

    #!/usr/bin/perl
    # one word per line, at least 4 characters in a word.

    $dupl=$count=0;
    %ana=();  ## "ana" is declared as a dictionary here.

    Don't fret, fair forum-goer, you couched your conundrum cleverly, even as the alliteration ailed you.


    I'm terribly late to the Perl game in general (that is, this is the first time I've really looked at it in any detail at all), but it appears that the line quoted above just declares 'ana' be a hash set to a null list - so where do the words get populated into the dictionary from this?  That is, it looks like the program just checks for anagrams in string characters, not anagrams of actual words in, say, the English language.  (That is, it would say that 'peanut' and 'eapunt' are anagrams, but the latter of those is not a valid English word...)

    Or am I still missing something completely here?

    (Weariness now weakens one's waking workable train of thought...) 
  • 09-26-2007 1:07 AM In reply to

    Re: Speed Challenge 37 - ITT We Consider Word Scrambles

    too_many_usernames:


    flop:
    #!/usr/bin/perl
    # one word per line, at least 4 characters in a word.

    $dupl=$count=0;
    %ana=();
    while (<>)
    {
        @chars=sort m#(\l[a-z])#gi;
        next unless @chars>=4;

        $dupl++ if (++$ana{join("",@chars)}) >1;
        $count++;
    }

    printf "Of %d words %d duplicate anagrams (%5.2f%%).\n",
    $count, $dupl, $dupl*100.0/$count;


    Ok...which one of those lines accesses a dictionary? I'm assuming it's the (later corrected) line regarding the \l? (Something to do with #m or #gi?)


    #!/usr/bin/perl
    # one word per line, at least 4 characters in a word.

    #A somewhat heavily commented and transformed version.


    $dupl=$count=0;
    %ana=(); #declares an empty dictionary
    while (<>) #Magic angle brackets. Equivalent to while($_ = <STDIN>)
    {         #This reads each line of input into $_

        @chars=sort map { lc } m#([a-z])#gi;
        #@chars = sort(map( { lc } $_ =~ m#([a-z])#gi)) -- parentheses added
        # $_ =~ m#([a-z])#gi returns a list composed of all the characters in the
        # range [a-z] (case insensitve). So "redraw!" => ("r", "e", "d", "r", "a", "w")
        # map { lc } converts a list to lowercase. So ("T", "H", "E") => ("t", "h", "e")
        # sort() sorts a list using the default sort function. (r,e,d,r,a,w) => (a,d,e,r,r,w)
        next unless @chars>=4;

        $dupl++ if (++$ana{join("",@chars)}) >1;
        #expanded:
        # $tmp = join("", @chars)
        # $ana{$tmp}++
        # if($ana{$tmp} > 1){ $dupl++; }

        # join("separater", @list) returns a string conataining all of the elements of @list,
        # with "separater" between each array index. So join("-", (a,d,e,r,r,w)) => "a-d-e-r-r-w"

        # $ana{$tmp} accesses the hashmap / dictionary. ++, though, is somewhat magical. undef++ => 1.
        # So if $ana{$tmp} doesn't exist, it is created and set to the value of 1.

        # if($ana{$tmp} > 1) checks for duplicates. Consider the input file:
        # Hello
        # drawer
        # redraw
        # reward
        # World

        # Start: dupl = 0, count = 0, %ana = ()
        # Hello: dupl = 0, count = 1, %ana = (ehllo => 1)
        # drawer: dupl = 0, count = 2, %ana = (ehllo => 1, aderrw => 1)
        # redraw: dupl = 1, count = 3, %ana = (ehllo => 1, aderrw => 2)
        # reward: dupl = 2, count = 4, %ana = (ehllo => 1, aderrw => 3)
        # World: dupl = 2, count = 5, %ana = (ehllo => 1, aderrw => 3, dlorw => 1)
        $count++;
    }

    printf "Of %d words %d duplicate anagrams (%5.2f%%).\n",
    $count, $dupl, $dupl*100.0/$count;

    ----
    And my own version, sporting superior statistics (although much, much longer then flop's solution):

    #!/usr/bin/perl -w
    use strict;
    use Getopt::Long;

    my $dump_pairs = 0;
    my $verbose = 0;
    my $skip = 1;
    GetOptions("dump!" => \$dump_pairs, "skip!" => \$skip,
        "verbose" => sub { $verbose++ } );

    my %dictionary = ();
    my $words_loaded = 0;

    while(my $line = <>){
        my $orig = $line;
        chomp($orig);
        my $simple = lc($orig);
        $simple =~ s/[^a-z0-9]//g;
        next if length($simple) < 5 && $skip;
        my $key = join("", sort(split(//, $simple)));
        if( ! defined($dictionary{$key})){
            $dictionary{$key} = [];
        }
        push(@{$dictionary{$key}}, $orig);
        $words_loaded++;
        if($words_loaded % 1000 == 0 ){
            printf(STDERR "Loaded $words_loaded words.\n");
        }
    }

    my %dupelists = ();

    foreach my $key (sort(keys(%dictionary))){
        my @arr = @{$dictionary{$key}};
        my $num_dupes = $#arr + 1;
        if($num_dupes > 1){
            if(! defined($dupelists{$num_dupes})){
                $dupelists{$num_dupes} = [];
            }
            push(@{$dupelists{$num_dupes}}, $dictionary{$key});
        }
    }
    #Now print out some statistics.
    my $total_dupes = 0;
    # num_groups x size_of_group
    $total_dupes += ($#{$dupelists{$_}} + 1) * $_ for keys(%dupelists);

    sub percent { my ($n, $t) = @_; my $p = ($n * 100) / $t; sprintf("%6.3f %%", $p) }
    print("Out of $words_loaded total words, $total_dupes are duplicates. (" .
    percent($total_dupes, $words_loaded) . ")\n");

    print("Detailed statistics.\n");
    foreach my $k (sort {$b <=> $a} keys(%dupelists) ){
        my @arr = @{$dupelists{$k}};
        my $num = $#arr + 1;
        my $total = $num * $k;
        print("  $total of $words_loaded (" . percent($total, $words_loaded) .
            ") have $k anagrams each.\n");

    }

    if($dump_pairs){
        local $" = ", ";
        foreach my $k (sort {$b <=> $a} keys(%dupelists) ){
            my @arr = @{$dupelists{$k}};
            my $num = $#arr + 1;
            my $total = $num * $k;
            print("==== The $total words that each have $k anagrams ====\n");
            foreach(@arr) {
                print("  (@$_)\n");
            }
            print("\n\n");
        }
    }



    The forum software ate the indentation; sorry for the double post.


  • 09-26-2007 2:18 PM In reply to

    Re: Speed Challenge 37 - ITT We Consider Word Scrambles

    ohcamacj:

    (lots of code removed :-) )

    You've confirmed that the internal dictionary starts empty, and has to have its words entered from somewhere.

    And wow, what the heck is the utility of simplifying "$_ = <STDIN>" to simply "<>"??  Is it just sadism or masochism?

    I really do understand why people get frustrated trying to read Pearl for the first time.  Most other languages I've seen you can tell what's going on without the loss of information due to removing context of "verbose, English-like token strings."  I suppose Perl is great for efficiency (the famous "do anything in one line of code" philosophy), and once you know what all the little tricks are that's cool, but I think I'm beginning to agree that Perl is some of the most difficult to maintain code out there (either that, or the learning curve to be able to read Perl "native" rather than having to think about what the code is doing is extremely steep...)

    I'm going to stick with C and assembly, thanks very much! ;-) 

  • 09-26-2007 2:40 PM In reply to

    Re: Speed Challenge 37 - ITT We Consider Word Scrambles

    too_many_usernames:
    I'm going to stick with [some languages], thanks very much! ;-)
    Statements that match this is meaning are terrifying. It's like a carpenter saying, "I'm going to stick with my hammer, thanks very much! ;-)" While a hammer is a rather robust tool and can probably be used as the only tool to build an entire house, would you want to live in a house built entirely with just a hammer? (The answer to this rhetorical question is no, in case you were wondering.)
  • 09-26-2007 3:38 PM In reply to

    • kirchhoff
    • Top 150 Contributor
    • Joined on 02-27-2007
    • ECE 280 (Circuit Analysis)
    • Posts 216

    Re: Speed Challenge 37 - ITT We Consider Word Scrambles

    Just to clarify something, which ohcamacj did not (apparently) for you.

    Yes, the dictionary starts out empty.

    Now hashes (dictionaries) in perl have a special property. When you attempt to access a member of the dictionary by a key, that expression is both an lvalue and an rvalue. Meaning the value of the expression $dictionary{"key"} is the value pointed to by "key", but you can also assign to it or increment it to store a new value. Unlike C, you do not need getter and setter functions for dictionaries.

    Thus doing ++$dictionary{"key"} is an expression that 1) fetches the value for key (which is a number, or the undefined value if the key doesn't actually exist yet) 2) Increments that value (keep in mind that undefined + 1 = 1 in perl) 3) Re-assigns this new value into the dictionary at key, creating the entry if it doesn't yet exist.

    Very handy stuff.
     

     

  • 09-27-2007 7:50 AM In reply to

    Re: Speed Challenge 37 - ITT We Consider Word Scrambles

    too_many_usernames:

    And wow, what the heck is the utility of simplifying "$_ = <STDIN>" to simply "<>"??  Is it just sadism or masochism?

    That isn't an accurate expansion.

    $_ can be omitted in most places in perl; it is invariably easier to read code which doesn't have those extra characters scattered all over the place. $_ is just the default target of most operations. Compare the following:

    while (<>) {print if /^ /}

    while ($_ = <>) {print $_ if $_ =~ /^ /}

    They mean the same thing, but the first is a lot easier to read. 

    I really do understand why people get frustrated trying to read Pearl for the first time.

    Do you also understand why English speakers get frustrated trying to read Chinese for the first time?

    If you just substitute "Chinese" for "Perl" in that paragraph, the meaning won't significantly change, and the flaws should be obvious. Yes, it probably would be less efficient if an English company were to write all of its documentation and memos in Chinese. That is not the language's fault.

  • 09-27-2007 10:45 AM In reply to

    Re: Speed Challenge 37 - ITT We Consider Word Scrambles

    Welbog:
    too_many_usernames:
    I'm going to stick with [some languages], thanks very much! ;-)
    Statements that match this is meaning are terrifying. It's like a carpenter saying, "I'm going to stick with my hammer, thanks very much! ;-)" While a hammer is a rather robust tool and can probably be used as the only tool to build an entire house, would you want to live in a house built entirely with just a hammer? (The answer to this rhetorical question is no, in case you were wondering.)

    I suppose my ";-)" was not enough indication of sarcasm....

     

    Also, regarding

    asuffield:

    Do you also understand why English speakers get frustrated trying to read Chinese for the first time?

    If you just substitute "Chinese" for "Perl" in that paragraph, the meaning won't significantly change, and the flaws should be obvious. Yes, it probably would be less efficient if an English company were to write all of its documentation and memos in Chinese. That is not the language's fault.

     

    I totally agree. In both cases, you can't just look at the language without some reference material to provide context. The context missing in the Perl bit was regarding the fact that "without any explicit target of an operation, the default is to use $_". There is no way that, just looking at a bit of code (and not its results, etc.) that you could infer that fact - outside information is required.  When thinking about Chinese/English, the reference material is the physical universe - the thing represented by the English word "potato" is the common reference for whatever Chinese character/spoken word references the same object.  That said, just like real languages, some of the mechanisms of referring to those reference objects are more easy to pick up than others (nouns for physical objects are easier to translate than abstract concepts like, say, the words for "philosophical argument").  Also common to both languages is that without proper context, it's all meaningless.  Some computer languages (the ones that are designed to mimic a spoken language) have more built-in context than languages like Perl - so to learn those languages, less reference material may be required than for languages like, say, Brainfuck that don't have any built-in context at all.
  • 09-27-2007 11:40 AM In reply to

    Re: Speed Challenge 37 - ITT We Consider Word Scrambles

    Damnable sarcasm! We really need an Internet-wide convention to use orange text to indicate sarcasm. I interpreted the ;-) as something along the lines of "I'm cool 'cause I use C and assembly!" and that's why I failed.
  • 09-27-2007 1:00 PM In reply to

    • kirchhoff
    • Top 150 Contributor
    • Joined on 02-27-2007
    • ECE 280 (Circuit Analysis)
    • Posts 216

    Re: Speed Challenge 37 - ITT We Consider Word Scrambles

    Okay so I guess flop won... any objections?
  • 10-01-2007 4:53 PM In reply to

    Re: Speed Challenge 37 - ITT We Consider Word Scrambles

    Bonus points for creative use of the in operator, right?

    // calculate percentage of anagrammed words in a collection,
    // which we'll assume is an array.
    // each anagram counts to the total, so this script doesn't count the
    // *unique* anagram percentage.
    function anaPercentage(words) {
       var wordmap = {};
       var found = 0;

       for (var i=0; i < words.length; i++) {
          if (words[i].length < 4) continue;

          words[i] = (words[i].toLowerCase().match(/[a-z]/g) || []);

          chk = checksum(words[i]);
          if (chk in wordmap) ++wordmap[chk];
          else wordmap[chk] = 1;
       }

       for (var w in wordmap)
          if (wordmap[w] > 1)
             found += wordmap[w];

       return Math.round((found / words.length * 100) * 100) / 100;
    }

    //string with unique chars in a string and frequencies.
    //this sum identifies the anagram. Anagrams will have the same checksum.
    function checksum(word) {
       var chars = word.sort();
       var sum = '';
       var charmap = {};

       for (var i=0; i < chars.length; i++)
          if (chars[i] in charmap) ++charmap[chars[i]];
          else charmap[chars[i]] = 1;

       for (var key in charmap) sum += key + charmap[key];

       return sum;
    }
    — Flurp.
  • 10-01-2007 4:59 PM In reply to

    Re: Speed Challenge 37 - ITT We Consider Word Scrambles

    Instead of hastily editing within the time limit, I'm going to expose myself and suddenly notice that the bit with the checksum horror is complete bullshit. So just replace chk = checksum(words[i]); with chk = words[i].sort();

    But since flop's entry is in Perl, he wins by default.

    — Flurp.
  • 10-01-2007 5:33 PM In reply to

    • kirchhoff
    • Top 150 Contributor
    • Joined on 02-27-2007
    • ECE 280 (Circuit Analysis)
    • Posts 216

    Re: Speed Challenge 37 - ITT We Consider Word Scrambles

    I think you wanted chk = words[i].split("").sort().join("");

    In any case, your entry in Javascript is gutsy. You win, dhromed, provided you put up the next problem before flop does. HA

    (Incidentally I did my test using a god-awful mix of ruby one-liners and shell commands because I'm an asshole)

  • 10-02-2007 9:19 AM In reply to

    • flop
    • Top 500 Contributor
    • Joined on 04-27-2007
    • Posts 51

    Re: Speed Challenge 37 - ITT We Consider Word Scrambles

    kirchhoff:
    You win, dhromed, provided you put up the next problem before flop does. HA

    Just when I got such a nice problem for all of you ... Well, let me know whether you're interested.

  • 10-02-2007 10:45 AM In reply to

    • kirchhoff
    • Top 150 Contributor
    • Joined on 02-27-2007
    • ECE 280 (Circuit Analysis)
    • Posts 216

    Re: Speed Challenge 37 - ITT We Consider Word Scrambles

    flop:
    kirchhoff:
    You win, dhromed, provided you put up the next problem before flop does. HA

    Just when I got such a nice problem for all of you ... Well, let me know whether you're interested.

    Go for it, you beat dhromed to the punch. I mean, if people don't post problems, then there's nothing to solve! 

  • 10-02-2007 11:59 AM In reply to

    • flop
    • Top 500 Contributor
    • Joined on 04-27-2007
    • Posts 51

    Re: Speed Challenge 37 - ITT We Consider Word Scrambles

    Here you are ... Challenge 38.
  • 10-02-2007 8:31 PM In reply to

    Re: Sp