Scrabble Distributions
Jul. 12th, 2006 | 02:49 am
music: Red Label - Bill Conti
I was wondering about the Scrabble letter distributions and scoring. After talking to a very good friend, I now have some data. Specifically:
Scrabble letter distributions
SOWPODS dictionary (British + North American combo, 216,555 words)
TWL dictionary (US, 168,092 words)
I believe the US dictionary is OCTWL (aka TWL98), specifically, but I haven't been able to confirm that. It's definitely not (OC)TWL2.
One of the things that interested me was all the extra 'u's in British spelling (relative to American). Both editions of Scrabble include the same set of letters, so it would suck if you were required to play colour and so on but still only had four 'u's to go around. In the case of color vs colour, I found the following words exclusive to one dictionary or the other:
color:
- SOWPODS: 1 (concolor)
- TWL: 2 (colorpoint)
colour:
- SOWPODS: 35
- TWL: 0
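A count like this can be reproduced with a set difference. The following is only a minimal Python sketch, not necessarily how the numbers above were produced: the function name `exclusive_matches` is hypothetical, and the tiny word lists are made up to stand in for the real SOWPODS and TWL files.

```python
def exclusive_matches(words_a, words_b, fragment):
    """Return words containing `fragment` that are in words_a but not in words_b."""
    only_in_a = set(words_a) - set(words_b)
    return sorted(w for w in only_in_a if fragment in w)


# Made-up samples, NOT the real dictionaries.
sowpods_sample = {"color", "colors", "colour", "colours", "concolor"}
twl_sample = {"color", "colors", "colorpoint"}

print(exclusive_matches(sowpods_sample, twl_sample, "color"))
print(exclusive_matches(twl_sample, sowpods_sample, "color"))
print(exclusive_matches(sowpods_sample, twl_sample, "colour"))
print(exclusive_matches(twl_sample, sowpods_sample, "colour"))
```

With the real lists, each dictionary would be loaded as one lowercase word per line before being passed in.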
For comparison:
[Chart: SOWPODS, TWL, Scrabble]
[Chart: SOWPODS, TWL]
To get a feel for the distributions as a whole:
[Chart: SOWPODS, TWL, Scrabble]
[Chart: SOWPODS, TWL]
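The numbers behind a comparison like this can be sketched in a few lines of Python. This is an illustration under assumptions rather than the original analysis: the helper names `letter_shares` and `tile_shares` are hypothetical, blanks are left out, and the tile counts are the standard English-language set (98 lettered tiles, including four 'u's).

```python
from collections import Counter

# Standard English Scrabble tile counts, blanks excluded.
SCRABBLE_TILES = {
    "a": 9, "b": 2, "c": 2, "d": 4, "e": 12, "f": 2, "g": 3, "h": 2, "i": 9,
    "j": 1, "k": 1, "l": 4, "m": 2, "n": 6, "o": 8, "p": 2, "q": 1, "r": 6,
    "s": 4, "t": 6, "u": 4, "v": 2, "w": 2, "x": 1, "y": 2, "z": 1,
}


def letter_shares(words):
    """Fraction of all letters in `words` accounted for by each letter a-z."""
    counts = Counter(ch for w in words for ch in w.lower() if "a" <= ch <= "z")
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def tile_shares(tiles=SCRABBLE_TILES):
    """Fraction of the lettered tiles accounted for by each letter."""
    total = sum(tiles.values())
    return {ch: n / total for ch, n in tiles.items()}


# Toy input; a real run would pass in a whole dictionary's word list.
sample = letter_shares(["colour", "color"])
print(round(sample["u"], 3), round(tile_shares()["u"], 3))
```

Comparing shares rather than raw counts keeps the two dictionaries and the tile bag on the same scale despite their different sizes.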
As for scoring, the point values are in order with Scrabble's letter/tile distribution, but out of order compared to the letter frequencies in either dictionary. I'm not even sure what a good method for point calculation would be. Raw lexical frequency doesn't seem terribly good because it doesn't consider how likely you are to know the words. Weighting the lexical letter frequencies by corpus word frequency seems like a good way to adjust for this. Still, that doesn't take into account the strategic value of a letter's position and of the word as a whole (you're unlikely to have the opportunity to play 15-letter words).
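To make the corpus-weighting idea concrete, here is a rough Python sketch. It assumes a mapping from each word to how often it occurs in some corpus; the function name `weighted_letter_shares` is hypothetical and the frequencies are invented purely for illustration, since no actual corpus is used here.

```python
from collections import Counter


def weighted_letter_shares(word_freq):
    """Letter shares where each word counts in proportion to its corpus frequency."""
    counts = Counter()
    for word, freq in word_freq.items():
        for ch in word.lower():
            if "a" <= ch <= "z":
                counts[ch] += freq
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


# Invented frequencies: a common word next to an obscure one.
toy_corpus = {"at": 9, "qi": 1}
shares = weighted_letter_shares(toy_corpus)
print(shares)
```

In an unweighted count the four letters here would each get an equal share; with weighting, the letters of the common word dominate, which is the adjustment described above. It still says nothing about positional or strategic value.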
I tried to find a corpus to use, but nothing decent seemed to be easy to acquire, never mind parse. I suspect I can get my hands on one if I ask around, but I'm not sure I care enough to bother. Trying to assign values to words and letters based on position and such seems more entertaining anyway.
there are too many i's in scrabble
from: anonymous
date: May. 26th, 2007 02:54 pm (UTC)
Re: there are too many i's in scrabble
from: nikolasco
date: May. 27th, 2007 06:11 pm (UTC)
It is worth noting that my analysis uses the entire dictionary, which includes far more words than you know. An analysis that weighted words based on how common they are (how likely you are to use them) would probably yield different results. Sadly, I haven't found a convenient corpus to work with.