Scrabble Distributions

Jul. 12th, 2006 | 02:49 am
music: Red Label - Bill Conti

I was wondering about the Scrabble letter distributions and scoring. After talking to a very good friend, I now have some data. Specifically:
Scrabble letter distributions
SOWPODS dictionary (British + North American combo, 216,555 words)
TWL dictionary (US, 168,092 words)
I believe the US dictionary is OCTWL (aka TWL98), specifically, but I haven't been able to confirm that. It's definitely not (OC)TWL2.

One of the things that interested me was all the extra (relative to American) 'u's in British spelling. The same set of letters is included in both editions of Scrabble. So it would suck if you were required to play colour and so on, but still only had four 'u's to go around. In the case of color vs colour, I found the following exclusive words:
color:
  • SOWPODS: 1 (concolor)
  • TWL: 2 (colorpoint)
colour:
  • SOWPODS: 35
  • TWL: 0
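
A rough sketch of how counts like these could be reproduced, assuming "exclusive" means words that appear in one list but not the other, and that each dictionary is a plain text file with one word per line. The function names and the toy word sets are made up for illustration; they are not the real SOWPODS or TWL data.

```python
def load_words(path):
    """Read a one-word-per-line file into a set of lowercase words."""
    with open(path, encoding="ascii", errors="ignore") as f:
        return {line.strip().lower() for line in f if line.strip()}


def exclusive_matches(words_a, words_b, fragment):
    """Words containing `fragment` that are in words_a but not in words_b."""
    only_a = set(words_a) - set(words_b)
    return sorted(w for w in only_a if fragment in w)


# Toy stand-ins for the two dictionaries (hypothetical, not the real lists).
sowpods = {"colour", "colours", "color", "concolor", "cat"}
twl = {"color", "colorpoint", "cat"}

print(exclusive_matches(sowpods, twl, "colour"))  # ['colour', 'colours']
print(exclusive_matches(sowpods, twl, "color"))   # ['concolor']
print(exclusive_matches(twl, sowpods, "color"))   # ['colorpoint']
```

Note that "colour" does not contain the string "color", so the two searches don't overlap.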

For comparison:

[Figure: SOWPODS, TWL, Scrabble]
[Figure: SOWPODS, TWL]

To get a feel for the distributions as a whole:

[Figure: SOWPODS, TWL, Scrabble]
[Figure: SOWPODS, TWL]
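
For anyone who wants to redo the comparison, here is a minimal sketch of one way to do it: count each letter's share of all letters in a word list, then set that next to each letter's share of the tile bag. The tile counts are the standard English-language set (98 lettered tiles, blanks left out); the function names and the toy word list are placeholders of mine.

```python
from collections import Counter

# Standard English Scrabble tile counts, excluding the two blanks.
TILES = {
    "a": 9, "b": 2, "c": 2, "d": 4, "e": 12, "f": 2, "g": 3, "h": 2, "i": 9,
    "j": 1, "k": 1, "l": 4, "m": 2, "n": 6, "o": 8, "p": 2, "q": 1, "r": 6,
    "s": 4, "t": 6, "u": 4, "v": 2, "w": 2, "x": 1, "y": 2, "z": 1,
}


def letter_shares(words):
    """Fraction of all letters in `words` accounted for by each letter."""
    counts = Counter(
        ch for w in words for ch in w.lower() if "a" <= ch <= "z"
    )
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def tile_shares(tiles=TILES):
    """Fraction of the lettered tiles accounted for by each letter."""
    total = sum(tiles.values())
    return {ch: n / total for ch, n in tiles.items()}


def compare(words):
    """Rows of (letter, share in word list, share in tile bag), a to z."""
    lex = letter_shares(words)
    bag = tile_shares()
    return [(ch, lex.get(ch, 0.0), bag[ch]) for ch in sorted(bag)]


# Toy word list; swap in the real dictionary to get meaningful numbers.
for letter, in_words, in_bag in compare(["colour", "color", "quiz"]):
    print(f"{letter}  words={in_words:.3f}  tiles={in_bag:.3f}")
```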


As for scoring, the point values are in order with Scrabble's letter/tile distribution, but they're out of order compared to the letter frequencies in either dictionary. I'm not even sure what a good method for calculating point values would be. Raw lexical frequency doesn't seem terribly good because it doesn't consider how likely you are to know the words. Weighting the lexical letter frequencies by corpus word frequency seems like a good way to adjust for this. Still, that doesn't take into account the strategic value of the letter's position and of the word as a whole (you're unlikely to have the opportunity to play 15-letter words).
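
To make the corpus-weighting idea concrete, here is a sketch of what I have in mind. It assumes a mapping from word to corpus count is available; `corpus_counts` below is entirely made up, and treating words missing from the corpus as weight 0 is just one possible choice.

```python
from collections import Counter


def weighted_letter_shares(lexicon, corpus_counts, default=0):
    """Letter shares where each word counts once per corpus occurrence.

    `corpus_counts` maps word -> occurrences in some corpus (hypothetical
    input). Words absent from it get `default` as their weight.
    """
    totals = Counter()
    for word in lexicon:
        weight = corpus_counts.get(word, default)
        if weight <= 0:
            continue  # word never seen in the corpus: contributes nothing
        for ch in word.lower():
            totals[ch] += weight
    grand_total = sum(totals.values())
    return {ch: n / grand_total for ch, n in totals.items()} if grand_total else {}


# Toy example: "cat" is common, "qi" is rare, "zax" never shows up.
lexicon = ["cat", "qi", "zax"]
corpus_counts = {"cat": 3, "qi": 1}
shares = weighted_letter_shares(lexicon, corpus_counts)
print(sorted(shares.items()))
```

Setting `default=1` instead would give every dictionary word at least a little weight, which lands somewhere between the raw lexical count and the pure corpus weighting.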

I tried to find a corpus to use, but nothing decent seemed to be easy to acquire, never mind parse. I suspect I can get my hands on one if I ask around, but I'm not sure I care enough to bother. Trying to assign values to words and letters based on position and such seems more entertaining anyway.


Comments {8}

Scarlet

(no subject)

from: [info]scarls17
date: Jul. 12th, 2006 02:05 pm (UTC)

I am impressed with your research and findings!

Atrus

(no subject)

from: [info]nikolasco
date: Jul. 13th, 2006 06:35 am (UTC)

Out of curiosity, how did you come across this entry?

Scarlet

(no subject)

from: [info]scarls17
date: Jul. 13th, 2006 11:01 am (UTC)

I can't remember, actually. I think I was bored at work and searched for something on blog search or something? Or, are you on DC Blogs?

Atrus

(no subject)

from: [info]nikolasco
date: Jul. 13th, 2006 07:38 pm (UTC)

I'm on DC Blogs Live, so that's probably it.

Scarlet

(no subject)

from: [info]scarls17
date: Jul. 13th, 2006 07:39 pm (UTC)

Ahh yes, that's it!

there are too many i's in scrabble

from: anonymous
date: May. 26th, 2007 02:54 pm (UTC)

every single game I've ever played - I'm left looking at a handful of i's at some point.. am I alone..?!

Atrus

Re: there are too many i's in scrabble

from: [info]nikolasco
date: May. 27th, 2007 06:11 pm (UTC)
Link

It's probably a sign that you need to take care to use up the 'i's. Rack maintenance is one of the most important aspects of the game, based on the research done on computer Scrabble players.

It is worth noting that my analysis uses the entire dictionary, which includes far more words than you know. An analysis that weighted words based on how common they are (how likely you are to use them) would probably yield different results. Sadly, I haven't found a convenient corpus to work with.

Re: there are too many i's in scrabble

from: anonymous
date: Nov. 7th, 2008 11:30 pm (UTC)

I guess the best method for tile frequency and score would take into account a hybrid of lexicon and corpus-based frequencies. Perhaps I'd put slightly more emphasis on the lexicon since knowledge of vocabulary is quite changeable. Another problem for me in Scrabble is what becomes acceptable in the SOWPODS and TWL dictionaries.
