Word Frequencies: Truth, Philology, and Chrestomathy
I wrote a small program to compute word frequencies and have used it
to analyze some texts. For instance, do you know what the twenty
most common words are in the King James Version of the Bible?
So, you have a Strong's Concordance. Good for you! But we can
put this program to use anyway.
For instance, Project Gutenberg has an
etext version of the KJV Bible available for download. Running it
through my tool and sorting the raw data gives frequency counts
that differ from Strong's.
What does this mean? Either:
- Strong's numbers are off. (This seems pretty unlikely!)
- Project Gutenberg scanned an inaccurate printer's copy of the KJV.
- The scanning/proofreading process used by Gutenberg has caused
some of the words to be incorrect (a contrived example would be
an instance of "bouk" instead of "book").
- Some other unaccounted-for reason...
A Way of Verifying the Truth as the Truth
So, why do we care? Well, the Bible happens to be a very good candidate
for this sort of analysis because of the extensive work done by Strong
and the years of scholarship spent verifying Strong's numbers as correct.
We have learned that, most likely, the Project Gutenberg etext is
incorrect, without having to proofread it. This is very valuable!
For the record, here are the top twenty words in the Project Gutenberg
etext, along with the corresponding counts from Strong's:
Figure 1. Top twenty words, with Project Gutenberg and Strong's frequencies

| word  | Gutenberg | Strong's |
| the   |     63924 |    64040 |
| and   |     51696 |    51714 |
| of    |     34617 |    34755 |
| to    |     13562 |    13643 |
| that  |     12913 |    12916 |
| in    |     12667 |    12674 |
| he    |     10420 |    10431 |
| shall |      9838 |     9837 |
| unto  |      8997 |     9003 |
| for   |      8971 |     8985 |
| I     |      8854 |     8853 |
| his   |      8473 |     8478 |
| a     |      8177 |     8284 |
| lord  |      7830 |     7836 |
| they  |      7376 |     7377 |
| be    |      7013 |     7012 |
| is    |      6989 |     6992 |
| him   |      6659 |     6667 |
| not   |      6596 |     6597 |
| them  |      6430 |     6429 |
So, this has its uses, but how many other works have had this much
pre-digital-era analysis? Not many. That makes this sort of comparison
a chicken-and-egg problem: how can we verify one etext against another
and baseline our statistics against a "known good" copy when none has
the "known good" quality?
This leads me to a possible alternate use for code like this...
Marking the Correct without Knowing the Correct
When we want to spell check a document, words that occur with very
low frequency and do not pass the spelling check are probably wrong.
Conversely, words that occur with high frequency but do not pass the
spelling check might well be right...
Does your dictionary contain the word "saith"? Well, it
occurs 1,262 times in the KJV (both Project Gutenberg's KJV etext
and Strong's agree on this one!), so despite not being in your
dictionary, and being archaic, perhaps an adaptive spell checker
should leave it alone!
However, we would need some sort of "leave it alone" threshold
built into the spell checker, and still need a "verify all words"
capability as well. It is conceivable that someone would misspell
"fahrenhiet" (sic) hundreds of times in a single document, either
because they never learned how to spell it, learned how to spell it
incorrectly, or keep making the same mechanical error in their typing.
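To make that concrete, here is a sketch of the "leave it alone" rule;
the dictionary, the threshold value, and the word counts below are
all made up for illustration:

    def check_words(words, dictionary, counts, threshold=100,
                    verify_all=False):
        """Return the words we would flag as possibly misspelled."""
        flagged = []
        for word in words:
            if word in dictionary:
                continue                  # passes the spelling check
            if not verify_all and counts.get(word, 0) >= threshold:
                continue                  # frequent unknown word: leave it alone
            flagged.append(word)
        return flagged

    counts = {"saith": 1262, "fahrenhiet": 300, "bouk": 1}
    dictionary = {"the", "book", "and"}
    print(check_words(["saith", "fahrenhiet", "bouk"], dictionary, counts))
    # -> ['bouk']: "saith" is spared by its frequency, but so is
    #    "fahrenhiet", which is exactly why the "verify all words"
    #    mode is still needed.
    print(check_words(["saith", "fahrenhiet", "bouk"], dictionary, counts,
                      verify_all=True))
    # -> ['saith', 'fahrenhiet', 'bouk']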
Chrestomathy: Reading in Tongues
We can use this program in other ways as well - for example, let us
suppose that I wanted to learn English from the King James Version
of the Bible - in fact, I want to learn English solely to be
able to read the KJV!
It is in my best interest, then, to compile a statistically
useful list of words to learn, in order of priority. I know that I want
to start by reading the four Gospels (my chrestomathy), so I will
compile a list exclusively of words from Matthew, Mark, Luke, and John to
use as my dictionary, and also as the list of words I should learn first
to make my learning as fast as possible - basically, which words will
give me the best results for the time spent.
The reason this is useful is that after a short time, I will know the
basic words rather well, and I will often be able to deduce the meaning
of an unknown word from its surrounding context.
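In code, the priority list is just the frequency table of the
chrestomathy sorted in descending order. A sketch, where the
validation rule (lowercased runs of letters and apostrophes) and the
file name "gospels.txt" are assumptions of mine:

    import re
    from collections import Counter

    def study_list(text, n=500):
        """The n most rewarding words to learn first: the most frequent
        words of the chrestomathy, in descending order of frequency."""
        words = re.findall(r"[a-z']+", text.lower())
        return [word for word, _ in Counter(words).most_common(n)]

    # "gospels.txt" is a hypothetical file holding Matthew through John.
    with open("gospels.txt") as f:
        for rank, word in enumerate(study_list(f.read(), n=25), start=1):
            print(rank, word)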
While this would work well for English, in languages with a large
amount of verb inflection and noun and adjective agreement, such
as Greek or Latin, this programmatic method would have to include the
ability to look up the roots of words accurately from their grammatical
position and build a list of those roots that we could then go and
learn to conjugate, agglutinate, and such. This is much less trivial, and
I hope to write such a program someday.
Perhaps you will write one first?
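In the meantime, here is a toy illustration of the root-lookup step -
nothing more than a hand-built table mapping inflected forms to their
lemmas; the table and the Latin examples are made up, and a real tool
would need genuine morphological analysis to resolve ambiguous forms:

    # A tiny hand-made lemma table - purely hypothetical.
    LEMMAS = {"amo": "amo", "amas": "amo", "amat": "amo",
              "rosam": "rosa", "rosae": "rosa"}

    def root_frequencies(words):
        """Fold inflected forms into their roots, tracking unknown forms."""
        counts, unknown = {}, set()
        for word in words:
            root = LEMMAS.get(word)
            if root is None:
                unknown.add(word)     # needs a human (or a better table)
            else:
                counts[root] = counts.get(root, 0) + 1
        return counts, unknown

    print(root_frequencies(["amo", "amat", "rosam", "rosae", "vidi"]))
    # -> ({'amo': 2, 'rosa': 2}, {'vidi'})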
For those who feel inclined to see my code:
This code is basically a word validator and counter: each validated
word is added to a binary tree, or, if it is already in the tree, the
word's counter is incremented. When the file has been processed, the
statistics are printed to the output. Not a very complicated program.
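In outline, a sketch of that same design in Python (the exact
validation rule here - lowercased runs of letters and apostrophes - is
just one choice among many) would look something like:

    import re
    import sys

    class Node:
        """One distinct word, with its running count."""
        def __init__(self, word):
            self.word, self.count = word, 1
            self.left = self.right = None

    def insert(root, word):
        """Add word to the tree, or bump its counter if already present."""
        if root is None:
            return Node(word)
        if word == root.word:
            root.count += 1
        elif word < root.word:
            root.left = insert(root.left, word)
        else:
            root.right = insert(root.right, word)
        return root

    def walk(root):
        """Yield (word, count) pairs in alphabetical order."""
        if root is not None:
            yield from walk(root.left)
            yield root.word, root.count
            yield from walk(root.right)

    if __name__ == "__main__":
        root = None
        for word in re.findall(r"[a-z']+", sys.stdin.read().lower()):
            root = insert(root, word)
        # Sort by descending count to reproduce the "top twenty" table.
        for word, count in sorted(walk(root), key=lambda wc: -wc[1])[:20]:
            print(f"{word:>8} {count:8d}")

Running it as, say, python wordfreq.py < kjv.txt (the file name is
hypothetical) reproduces a table like Figure 1.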
Strong's Concordance (You will want to scroll down to the "Strong's Number Search" tool.)
Project Gutenberg Edition of the King James Bible (2nd version, 10th ed.)
Copyright © 2002 by John Holder