|
Speed Challenge 37 - ITT We Consider Word Scrambles
Last post 02-15-2008 7:04 PM by zzo38. 35 replies.
-
09-24-2007 8:13 PM
|
|
-
kirchhoff


- Joined on 02-27-2007
- ECE 280 (Circuit Analysis)
- Posts 216
|
Speed Challenge 37 - ITT We Consider Word Scrambles
A programmer friend of yours is making a whiz-bang AJAX-y Web2.0 game that mimics the 'Word Scramble' mode of Brain Age. (Take any word, and scramble the letters in it -- this game ups the ante by moving the letters around while you try to figure out the anagram.)
Naturally he plans to let end-users submit words to appear in the puzzles. He has reviewed his server-side data quality checking with you ("Yes I remembered to normalize case!") but you are still not convinced this is a good idea. "What if a user submits a list of internet memes or curse words?" "Oh, that'll be hilarious! It's the power of peer to peer." It takes all your willpower to avoid rolling your eyes. Still, you humor him because this project has made him more enthusiastic than he has been in some months. In a moment of clarity he realizes that beyond checking for illegal characters and phrases, there is the distinct possibility that one or more words in the user-provided list could anagram to each other, thus some of the puzzles could have multiple correct answers -- he'd like to avoid this as he wanted to animate the letters moving into the correct order when you got the word wrong, but it'd be too hard for his meager DHTML skills to support showing the multiple solutions in a way he found satisfactory.
"Should I even be worried about this? What are the chances of this happening in any random set of english words?" Help our perplexed puzzle programmer!!! Propose a procedure for procuring the percentage of words in a plain text phile which do not pose a problem for this anagram project. Translation: Given a text file on standard input, ignoring case and punctionation and such, determine the percentage of words which have possible collisions when anagrammed. Ignore words shorter than 4 letters, as they wouldn't make fun puzzles. Since we're trying to get a good estimate out of this program, the word lists can be very long (> 10,000 words) so your method needs to scale a little bit.
|
|
-
-
flop


- Joined on 04-27-2007
- Posts 51
|
Re: Speed Challenge 37 - ITT We Consider Word Scrambles
#!/usr/bin/perl
# one word per line, at least 4 characters in a word.
$dupl=$count=0;
%ana=();
while (<>)
{
@chars=sort m#(\l[a-z])#gi;
next unless @chars>=4;
$dupl++ if (++$ana{join("",@chars)}) >1;
$count++;
}
printf "Of %d words %d duplicate anagrams (%5.2f%%).\n",
$count, $dupl, $dupl*100.0/$count;
|
|
-
-
flop


- Joined on 04-27-2007
- Posts 51
|
Re: Speed Challenge 37 - ITT We Consider Word Scrambles
Sorry - just saw that \l doesn't seem to work. Please replace the line to:
@chars=sort map { lc } m#([a-z])#gi;
|
|
-
-
kirchhoff


- Joined on 02-27-2007
- ECE 280 (Circuit Analysis)
- Posts 216
|
Re: Speed Challenge 37 - ITT We Consider Word Scrambles
flop, your frugal foray into this frightful folly is fitting, and is my favorite (albeit first) finalist. Can anyone in attendance allocate an attenuated amount of ASCII in your algorithm?
|
|
-
-
kirchhoff


- Joined on 02-27-2007
- ECE 280 (Circuit Analysis)
- Posts 216
|
Re: Speed Challenge 37 - ITT We Consider Word Scrambles
Incidentally -- I'm indicating 16% as the neighborhood number that you should be seeing as you substitute your samples. That's an amazing appearence of anagrams in our lay lexicon!
|
|
-
-
too_many_usernames


- Joined on 12-09-2005
- (0,0,0) in local frame
- Posts 165
|
Re: Speed Challenge 37 - ITT We Consider Word Scrambles
flop:
#!/usr/bin/perl
# one word per line, at least 4 characters in a word.
$dupl=$count=0;
%ana=();
while (<>)
{
@chars=sort m#(\l[a-z])#gi;
next unless @chars>=4;
$dupl++ if (++$ana{join("",@chars)}) >1;
$count++;
}
printf "Of %d words %d duplicate anagrams (%5.2f%%).\n",
$count, $dupl, $dupl*100.0/$count;
Ok...which one of those lines accesses a dictionary? I'm assuming it's the (later corrected) line regarding the \l? (Something to do with #m or #gi?) The first three pages of searching Google for "perl dictionary" and various variants revealed no valuable results. (Sorry, I couldn't really alliterate that with 'v' very easily...)
|
|
-
-
kirchhoff


- Joined on 02-27-2007
- ECE 280 (Circuit Analysis)
- Posts 216
|
Re: Speed Challenge 37 - ITT We Consider Word Scrambles
too_many_usernames:Ok...which one of those lines accesses a dictionary?
#!/usr/bin/perl
# one word per line, at least 4 characters in a word.
$dupl=$count=0;
%ana=(); ## "ana" is declared as a dictionary here.
while (<>)
{
@chars=sort m#(\l[a-z])#gi;
next unless @chars>=4;
$dupl++ if (++$ana{join("",@chars)}) >1; ## "ana"
is indexed by the result of join on
the
$count++;
## chars array, and the value is post-incremented.
}
printf "Of %d words %d duplicate anagrams (%5.2f%%).\n",
$count, $dupl, $dupl*100.0/$count;
various variants revealed no valuable results... (Sorry, I
couldn't really alliterate that with 'v' very easily...)
Don't fret, fair forum-goer, you couched your conundrum cleverly, even as the alliteration ailed you.
|
|
-
-
too_many_usernames


- Joined on 12-09-2005
- (0,0,0) in local frame
- Posts 165
|
Re: Speed Challenge 37 - ITT We Consider Word Scrambles
kirchhoff: too_many_usernames:Ok...which one of those lines accesses a dictionary?
#!/usr/bin/perl
# one word per line, at least 4 characters in a word.
$dupl=$count=0;
%ana=(); ## "ana" is declared as a dictionary here.
Don't fret, fair forum-goer, you couched your conundrum cleverly, even as the alliteration ailed you.
I'm terribly late to the Perl game in general (that is, this is the first time I've really looked at it in any detail at all), but it appears that the line quoted above just declares 'ana' be a hash set to a null list - so where do the words get populated into the dictionary from this? That is, it looks like the program just checks for anagrams in string characters, not anagrams of actual words in, say, the English language. (That is, it would say that 'peanut' and 'eapunt' are anagrams, but the latter of those is not a valid English word...)
Or am I still missing something completely here?
(Weariness now weakens one's waking workable train of thought...)
|
|
-
-
ohcamacj


- Joined on 05-27-2007
- Posts 16
|
Re: Speed Challenge 37 - ITT We Consider Word Scrambles
too_many_usernames: flop: #!/usr/bin/perl # one word per line, at least 4 characters in a word.
$dupl=$count=0; %ana=(); while (<>) { @chars=sort m#(\l[a-z])#gi; next unless @chars>=4;
$dupl++ if (++$ana{join("",@chars)}) >1; $count++; }
printf "Of %d words %d duplicate anagrams (%5.2f%%).\n", $count, $dupl, $dupl*100.0/$count;
Ok...which one of those lines accesses a dictionary? I'm assuming it's the (later corrected) line regarding the \l? (Something to do with #m or #gi?)
#!/usr/bin/perl # one word per line, at least 4 characters in a word. #A somewhat heavily commented and transformed version. $dupl=$count=0; %ana=(); #declares an empty dictionary while (<>) #Magic angle brackets. Equivalent to while($_ = <STDIN>) { #This reads each line of input into $_ @chars=sort map { lc } m#([a-z])#gi; #@chars = sort(map( { lc } $_ =~ m#([a-z])#gi)) -- parentheses added # $_ =~ m#([a-z])#gi returns a list composed of all the characters in the # range [a-z] (case insensitve). So "redraw!" => ("r", "e", "d", "r", "a", "w") # map { lc } converts a list to lowercase. So ("T", "H", "E") => ("t", "h", "e") # sort() sorts a list using the default sort function. (r,e,d,r,a,w) => (a,d,e,r,r,w) next unless @chars>=4; $dupl++ if (++$ana{join("",@chars)}) >1; #expanded: # $tmp = join("", @chars) # $ana{$tmp}++ # if($ana{$tmp} > 1){ $dupl++; } # join("separater", @list) returns a string conataining all of the elements of @list, # with "separater" between each array index. So join("-", (a,d,e,r,r,w)) => "a-d-e-r-r-w" # $ana{$tmp} accesses the hashmap / dictionary. ++, though, is somewhat magical. undef++ => 1. # So if $ana{$tmp} doesn't exist, it is created and set to the value of 1. # if($ana{$tmp} > 1) checks for duplicates. Consider the input file: # Hello # drawer # redraw # reward # World # Start: dupl = 0, count = 0, %ana = () # Hello: dupl = 0, count = 1, %ana = (ehllo => 1) # drawer: dupl = 0, count = 2, %ana = (ehllo => 1, aderrw => 1) # redraw: dupl = 1, count = 3, %ana = (ehllo => 1, aderrw => 2) # reward: dupl = 2, count = 4, %ana = (ehllo => 1, aderrw => 3) # World: dupl = 2, count = 5, %ana = (ehllo => 1, aderrw => 3, dlorw => 1) $count++; } printf "Of %d words %d duplicate anagrams (%5.2f%%).\n", $count, $dupl, $dupl*100.0/$count; ---- And my own version, sporting superior statistics (although much, much longer then flop's solution): #!/usr/bin/perl -w use strict; use Getopt::Long;
my $dump_pairs = 0; my $verbose = 0; my $skip = 1; GetOptions("dump!" => \$dump_pairs, "skip!" => \$skip, "verbose" => sub { $verbose++ } );
my %dictionary = (); my $words_loaded = 0;
while(my $line = <>){ my $orig = $line; chomp($orig); my $simple = lc($orig); $simple =~ s/[^a-z0-9]//g; next if length($simple) < 5 && $skip; my $key = join("", sort(split(//, $simple))); if( ! defined($dictionary{$key})){ $dictionary{$key} = []; } push(@{$dictionary{$key}}, $orig); $words_loaded++; if($words_loaded % 1000 == 0 ){ printf(STDERR "Loaded $words_loaded words.\n"); } }
my %dupelists = ();
foreach my $key (sort(keys(%dictionary))){ my @arr = @{$dictionary{$key}}; my $num_dupes = $#arr + 1; if($num_dupes > 1){ if(! defined($dupelists{$num_dupes})){ $dupelists{$num_dupes} = []; } push(@{$dupelists{$num_dupes}}, $dictionary{$key}); } } #Now print out some statistics. my $total_dupes = 0; # num_groups x size_of_group $total_dupes += ($#{$dupelists{$_}} + 1) * $_ for keys(%dupelists);
sub percent { my ($n, $t) = @_; my $p = ($n * 100) / $t; sprintf("%6.3f %%", $p) } print("Out of $words_loaded total words, $total_dupes are duplicates. (" . percent($total_dupes, $words_loaded) . ")\n");
print("Detailed statistics.\n"); foreach my $k (sort {$b <=> $a} keys(%dupelists) ){ my @arr = @{$dupelists{$k}}; my $num = $#arr + 1; my $total = $num * $k; print(" $total of $words_loaded (" . percent($total, $words_loaded) . ") have $k anagrams each.\n");
}
if($dump_pairs){ local $" = ", "; foreach my $k (sort {$b <=> $a} keys(%dupelists) ){ my @arr = @{$dupelists{$k}}; my $num = $#arr + 1; my $total = $num * $k; print("==== The $total words that each have $k anagrams ====\n"); foreach(@arr) { print(" (@$_)\n"); } print("\n\n"); } }
The forum software ate the indentation; sorry for the double post.
|
|
-
-
too_many_usernames


- Joined on 12-09-2005
- (0,0,0) in local frame
- Posts 165
|
Re: Speed Challenge 37 - ITT We Consider Word Scrambles
ohcamacj:(lots of code removed :-) )
You've confirmed that the internal dictionary starts empty, and has to have its words entered from somewhere. And wow, what the heck is the utility of simplifying "$_ = <STDIN>" to simply "<>"?? Is it just sadism or masochism? I really do understand why people get frustrated trying to read Pearl for the first time. Most other languages I've seen you can tell what's going on without the loss of information due to removing context of "verbose, English-like token strings." I suppose Perl is great for efficiency (the famous "do anything in one line of code" philosophy), and once you know what all the little tricks are that's cool, but I think I'm beginning to agree that Perl is some of the most difficult to maintain code out there (either that, or the learning curve to be able to read Perl "native" rather than having to think about what the code is doing is extremely steep...)
I'm going to stick with C and assembly, thanks very much! ;-)
|
|
-
-
Welbog


- Joined on 02-08-2007
- Posts 450
|
Re: Speed Challenge 37 - ITT We Consider Word Scrambles
too_many_usernames:I'm going to stick with [some languages], thanks very much! ;-)
Statements that match this is meaning are terrifying. It's like a carpenter saying, "I'm going to stick with my hammer, thanks very much! ;-)" While a hammer is a rather robust tool and can probably be used as the only tool to build an entire house, would you want to live in a house built entirely with just a hammer? (The answer to this rhetorical question is no, in case you were wondering.)
|
|
-
-
kirchhoff


- Joined on 02-27-2007
- ECE 280 (Circuit Analysis)
- Posts 216
|
Re: Speed Challenge 37 - ITT We Consider Word Scrambles
Just to clarify something, which ohcamacj did not (apparently) for you. Yes, the dictionary starts out empty. Now hashes (dictionaries) in perl have a special property. When you attempt to access a member of the dictionary by a key, that expression is both an lvalue and an rvalue. Meaning the value of the expression $dictionary{"key"} is the value pointed to by "key", but you can also assign to it or increment it to store a new value. Unlike C, you do not need getter and setter functions for dictionaries. Thus doing ++$dictionary{"key"} is an expression that 1) fetches the value for key (which is a number, or the undefined value if the key doesn't actually exist yet) 2) Increments that value (keep in mind that undefined + 1 = 1 in perl) 3) Re-assigns this new value into the dictionary at key, creating the entry if it doesn't yet exist. Very handy stuff.
|
|
-
-
asuffield


- Joined on 05-31-2006
- Posts 2,137
|
Re: Speed Challenge 37 - ITT We Consider Word Scrambles
too_many_usernames:And wow, what the heck is the utility of simplifying "$_ = <STDIN>" to simply "<>"?? Is it just sadism or masochism?
That isn't an accurate expansion. $_ can be omitted in most places in perl; it is invariably easier to read code which doesn't have those extra characters scattered all over the place. $_ is just the default target of most operations. Compare the following: while (<>) {print if /^ /} while ($_ = <>) {print $_ if $_ =~ /^ /} They mean the same thing, but the first is a lot easier to read. I really do understand why people get frustrated trying to read Pearl for the first time.
Do you also understand why English speakers get frustrated trying to read Chinese for the first time? If you just substitute "Chinese" for "Perl" in that paragraph, the meaning won't significantly change, and the flaws should be obvious. Yes, it probably would be less efficient if an English company were to write all of its documentation and memos in Chinese. That is not the language's fault.
|
|
-
-
too_many_usernames


- Joined on 12-09-2005
- (0,0,0) in local frame
- Posts 165
|
Re: Speed Challenge 37 - ITT We Consider Word Scrambles
Welbog: too_many_usernames:I'm going to stick with [some languages], thanks very much! ;-)
Statements that match this is meaning are terrifying. It's like a carpenter saying, "I'm going to stick with my hammer, thanks very much! ;-)" While a hammer is a rather robust tool and can probably be used as the only tool to build an entire house, would you want to live in a house built entirely with just a hammer? (The answer to this rhetorical question is no, in case you were wondering.)
I suppose my ";-)" was not enough indication of sarcasm.... Also, regarding asuffield:Do you also understand why English speakers get frustrated trying to read Chinese for the first time? If
you just substitute "Chinese" for "Perl" in that paragraph, the meaning
won't significantly change, and the flaws should be obvious. Yes, it
probably would be less efficient if an English company were to write
all of its documentation and memos in Chinese. That is not the
language's fault.
I totally agree. In both cases, you can't just look at the language without some reference material to provide context. The context missing in the Perl bit was regarding the fact that "without any explicit target of an operation, the default is to use $_". There is no way that, just looking at a bit of code (and not its results, etc.) that you could infer that fact - outside information is required. When thinking about Chinese/English, the reference material is the physical universe - the thing represented by the English word "potato" is the common reference for whatever Chinese character/spoken word references the same object. That said, just like real languages, some of the mechanisms of referring to those reference objects are more easy to pick up than others (nouns for physical objects are easier to translate than abstract concepts like, say, the words for "philosophical argument"). Also common to both languages is that without proper context, it's all meaningless. Some computer languages (the ones that are designed to mimic a spoken language) have more built-in context than languages like Perl - so to learn those languages, less reference material may be required than for languages like, say, Brainfuck that don't have any built-in context at all.
|
|
-
-
Welbog


- Joined on 02-08-2007
- Posts 450
|
Re: Speed Challenge 37 - ITT We Consider Word Scrambles
Damnable sarcasm! We really need an Internet-wide convention to use orange text to indicate sarcasm. I interpreted the ;-) as something along the lines of "I'm cool 'cause I use C and assembly!" and that's why I failed.
|
|
-
-
kirchhoff


- Joined on 02-27-2007
- ECE 280 (Circuit Analysis)
- Posts 216
|
Re: Speed Challenge 37 - ITT We Consider Word Scrambles
Okay so I guess flop won... any objections?
|
|
-
-
dhromed


- Joined on 04-13-2005
- Dutchland
- Posts 2,691
|
Re: Speed Challenge 37 - ITT We Consider Word Scrambles
Bonus points for creative use of the in operator, right?
// calculate percentage of anagrammed words in a collection, // which we'll assume is an array. // each anagram counts to the total, so this script doesn't count the // *unique* anagram percentage. function anaPercentage(words) { var wordmap = {}; var found = 0;
for (var i=0; i < words.length; i++) { if (words[i].length < 4) continue;
words[i] = (words[i].toLowerCase().match(/[a-z]/g) || []);
chk = checksum(words[i]); if (chk in wordmap) ++wordmap[chk]; else wordmap[chk] = 1; }
for (var w in wordmap) if (wordmap[w] > 1) found += wordmap[w];
return Math.round((found / words.length * 100) * 100) / 100; }
//string with unique chars in a string and frequencies. //this sum identifies the anagram. Anagrams will have the same checksum. function checksum(word) { var chars = word.sort(); var sum = ''; var charmap = {};
for (var i=0; i < chars.length; i++) if (chars[i] in charmap) ++charmap[chars[i]]; else charmap[chars[i]] = 1;
for (var key in charmap) sum += key + charmap[key];
return sum; }
— Flurp.
|
|
-
-
dhromed


- Joined on 04-13-2005
- Dutchland
- Posts 2,691
|
Re: Speed Challenge 37 - ITT We Consider Word Scrambles
Instead of hastily editing within the time limit, I'm going to expose myself and suddenly notice that the bit with the checksum horror is complete bullshit. So just replace chk = checksum(words[i]); with chk = words[i].sort();
But since flop's entry is in Perl, he wins by default.
— Flurp.
|
|
-
-
kirchhoff


- Joined on 02-27-2007
- ECE 280 (Circuit Analysis)
- Posts 216
|
Re: Speed Challenge 37 - ITT We Consider Word Scrambles
I think you wanted chk = words[i].split("").sort().join(""); In
any case, your entry in Javascript is gutsy. You win, dhromed,
provided you put up the next problem before flop does. HA
(Incidentally I did my test using a god-awful mix of ruby one-liners and shell commands because I'm an asshole)
|
|
-
-
flop


- Joined on 04-27-2007
- Posts 51
|
Re: Speed Challenge 37 - ITT We Consider Word Scrambles
kirchhoff:You win, dhromed,
provided you put up the next problem before flop does. HA
Just when I got such a nice problem for all of you ... Well, let me know whether you're interested.
|
|
-
-
kirchhoff


- Joined on 02-27-2007
- ECE 280 (Circuit Analysis)
- Posts 216
|
Re: Speed Challenge 37 - ITT We Consider Word Scrambles
flop: kirchhoff:You win, dhromed,
provided you put up the next problem before flop does. HA
Just when I got such a nice problem for all of you ... Well, let me know whether you're interested.
Go for it, you beat dhromed to the punch. I mean, if people don't post problems, then there's nothing to solve!
|
|
-
-
|
|