This repository provides cleaned lists of the most frequent words and n-grams (sequences of n words) in the Google Books Ngram Corpus (v3/20200217, all languages), together with some English translations, plus customizable Python code which reproduces these lists.
Lists with the most frequent n-grams are provided separately by language and n. Available languages are Chinese (simplified), English, English Fiction, French, German, Hebrew, Italian, Russian, and Spanish. n ranges from 1 to 5. In the provided lists the language subcorpora are restricted to books published in the years 2010-2019, but in the Python code both this and the number of most frequent n-grams included can be adjusted.
The lists are found in the ngrams directory. For all languages except Hebrew, cleaned lists of the most frequent n-grams are provided for every n from 1 to 5.
For Hebrew, due to the small corpus size, only the 200 most frequent 4-grams and 80 most frequent 5-grams are provided.
All cleaned lists also contain the number of times each n-gram occurs in the corpus (its frequency, column `freq`). For 1-grams (words) there are two additional columns:

- `cumshare`, which for each word contains the cumulative share of all words in the corpus made up by that word and all more frequent words.
- `en`, which contains the English translation of the word, obtained using the Google Cloud Translate API (only for non-English languages).

Here are the first rows of 1grams_french.csv:
ngram | freq | cumshare | en |
---|---|---|---|
de | 1380202965 | 0.048 | of |
la | 823756863 | 0.077 | the |
et | 651571349 | 0.100 | and |
le | 614855518 | 0.121 | the |
à | 577644624 | 0.142 | at |
l' | 527188618 | 0.160 | the |
les | 503689143 | 0.178 | them |
en | 390657918 | 0.191 | in |
des | 384774428 | 0.205 | of the |
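For instance, the cumshare column makes it easy to see how many of the most frequent words are needed to cover a given share of all word occurrences. A minimal sketch, assuming pandas is installed, the working directory is the repository root, and the file sits directly in the ngrams directory as described above:

```python
import pandas as pd

# Load the cleaned French 1-gram list shown above.
df = pd.read_csv("ngrams/1grams_french.csv")

# How many of the most frequent words cover 50% and 70% of all
# word occurrences in the subcorpus, according to the cumshare column?
for target in (0.5, 0.7):
    if df["cumshare"].iloc[-1] >= target:
        n_words = int((df["cumshare"] < target).sum()) + 1
        print(f"{n_words} words cover {target:.0%} of word occurrences")
```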
The lists found directly in the ngrams directory have been cleaned and are intended for use when developing language-learning materials. The sub-directory ngrams/more contains uncleaned and less cleaned versions which might be of use for e.g. linguists:
- 0_raw
- 1a_no_pos
- 1b_with_pos
- 2_removed

To provide some motivation for why learning the most frequent words first may be a good idea when learning a language, the following graph is provided.
For each language, it plots the frequency rank of each 1-gram (i.e. word) on the x-axis and the cumshare
on the y-axis. So, for example, after learning the 1000 most frequent French words, one can understand more than 70% of all words, counted with duplicates, occurring in a typical book published between 2010 and 2019 in version 20200217 of the French Google Books Ngram Corpus.
For n-grams other than 1-grams the returns to learning the most frequent ones are not as steep, as there are so many possible combinations of words. Still, people tend to learn better when learning things in context, so one use of them could be to find common example phrases for each 1-gram. Another approach is the following: Say one wants to learn the 1000 most common words in some language. Then one could, for example, create a minimal list of the most common 4-grams which include these 1000 words and learn it.
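A minimal sketch of that last idea, using a greedy approximation of such a minimal list (the 4-gram file name is assumed to follow the same pattern as 1grams_french.csv; this is an illustration, not part of the repository's code):

```python
import pandas as pd

# Hypothetical file names following the pattern of the cleaned lists.
words = pd.read_csv("ngrams/1grams_french.csv")["ngram"].head(1000)
fourgrams = pd.read_csv("ngrams/4grams_french.csv")["ngram"]

# Greedily pick 4-grams, most frequent first, until every target word
# that occurs in some 4-gram has been covered.
to_cover = set(words)
selected = []
for ngram in fourgrams:  # the lists are sorted by descending frequency
    hit = to_cover & set(ngram.split())
    if hit:
        selected.append(ngram)
        to_cover -= hit
    if not to_cover:
        break

print(f"{len(selected)} 4-grams cover {len(words) - len(to_cover)} of the top 1000 words")
```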
Although the n-gram lists have been cleaned with language learning in mind and contain some English translations, they are not intended to be used directly for learning, but rather as intermediate resources for developing language-learning materials. The provided translations only give the English word closest to the most common meaning of the word. Moreover, depending on one's language-learning goals, the Google Books Ngram Corpus might not be the best corpus to base learning materials on – see the next section.
This repository is based on the Google Books Ngram Corpus Version 3 (with version identifier 20200217), made available by Google as n-gram lists here. This is also the data that underlies the Google Books Ngram Viewer. The corpus is a subset, selected by Google based on the quality of optical character recognition and metadata, of the books digitized by Google and contains around 6% of all books ever published (1, 2, 3).
When assessing the quality of a corpus, both its size and its representativeness of the kind of material one is interested in are important.
The Google Books Ngram Corpus Version 3 is huge, as this table of the count of words in it by language and subcorpus illustrates:
Language | # words, all years | # words, 2010-2019 |
---|---|---|
Chinese | 10,778,094,737 | 257,989,144 |
English | 1,997,515,570,677 | 283,795,232,871 |
American English | 1,167,153,993,435 | 103,514,367,264 |
British English | 336,950,312,247 | 45,271,592,771 |
English Fiction | 158,981,617,587 | 73,746,188,539 |
French | 328,796,168,553 | 35,216,041,238 |
German | 286,463,423,242 | 57,039,530,618 |
Hebrew | 7,263,771,123 | 76,953,586 |
Italian | 120,410,089,963 | 15,473,063,630 |
Russian | 89,415,200,246 | 19,648,780,340 |
Spanish | 158,874,356,254 | 17,573,531,785 |
Note that these numbers are for the total number of words, not the number of unique words. They also include many entries which aren't valid words; because of the size of the corpus, the cleaning steps are only performed on the most common words, so the number of words that would remain in the whole corpus after cleaning is not known. Moreover, Google only makes available words that appear over 40 times, while these totals also include words appearing less often than that.
We see that even after restricting the language subcorpora to books published in the last 10 available years, the number of words is still larger than 15 billion for each subcorpus, except Chinese and Hebrew. This is substantially larger than for all other corpora in common use, which never seem to contain more than a few billion words and often are much smaller than that (4).
When it comes to representativeness, it depends on the intended use. The Google Books Ngram Corpus contains only books (no periodicals, spoken language, websites, etc.). Each book edition is included at most once. Most of these books come from a small number of large university libraries, "over 40" in Version 1 of the corpus, while a smaller share is obtained directly from publishers (1). So, for example, if one intends to use this corpus for learning a language in which one mainly will read books of the kind large university libraries are interested in, then the words in this corpus are likely to be quite representative of the population of words one might encounter in the future.
The code producing everything is in the python directory. Each .py-file is a script that can be run from the command line using `python python/filename.py`, where the working directory has to be the directory containing the python directory. Every .py-file has a settings section at the top, and additional cleaning settings can be specified using the files in python/extra_settings. The default settings have been chosen to make the code run reasonably fast and keep the resulting repository size reasonably small.
Optionally, start by running create_source_data_lists.py from the repository root directory to recreate the source-data folder with lists of links to the Google source data files.
Run download_and_extract_most_freq.py from the repository root directory to download each file listed in source-data (a ".gz-file") and extract the most frequent n-grams in it into a list saved in ngrams/more/{lang}/most_freq_ngrams_per_gz_file
. To save computer resources, each .gz-file is deleted immediately after this step. Since the lists of most frequent n-grams per .gz-file still take up around 36GB with the default settings, only one example list is uploaded to GitHub: ngrams_1-00006-of-00024.gz.csv. No cleaning has been performed at this stage, so this is how the raw data looks.
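In outline, the extraction for one downloaded file amounts to something like the sketch below, assuming the v3 line format of an n-gram followed by tab-separated year,match_count,volume_count triples (the actual script is the authoritative version and has more settings):

```python
import gzip
from collections import Counter

# Assumed v3 line format: "ngram\tyear,match_count,volume_count\t..."
counts = Counter()
with gzip.open("1-00006-of-00024.gz", "rt", encoding="utf-8") as f:
    for line in f:
        ngram, *yearly = line.rstrip("\n").split("\t")
        freq = sum(
            int(match_count)
            for year, match_count, _ in (entry.split(",") for entry in yearly)
            if 2010 <= int(year) <= 2019
        )
        if freq > 0:
            counts[ngram] = freq

# Keep only the most frequent n-grams from this file (cutoff is illustrative).
most_freq = counts.most_common(10_000)
```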
Run gather_and_clean.py to gather all the n-grams into lists of the overall most frequent ones and clean these lists (see the next section for details).
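The gathering part is essentially a concatenate-group-sort over the per-.gz-file lists. A rough sketch, assuming those lists are CSVs with ngram and freq columns and that the language folder is named french (the cleaning itself is covered below):

```python
import glob
import pandas as pd

# Assumed layout: one CSV per source .gz-file with columns "ngram" and "freq".
files = glob.glob("ngrams/more/french/most_freq_ngrams_per_gz_file/ngrams_1-*.csv")

gathered = (
    pd.concat(pd.read_csv(f) for f in files)
    .groupby("ngram", as_index=False)["freq"].sum()
    .sort_values("freq", ascending=False)
    .head(10_000)  # illustrative cutoff for 1-grams
)
```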
Run google_cloud_translate.py to add English translations to all non-English 1-grams using the Google Cloud Translate API (this requires an API key; see the file header). By default only 1-grams are translated and only to English, but by changing the settings any n-gram can be translated into any language supported by Google. Google randomly capitalizes translations, so an attempt is made to correct for this. Moreover, a limited number of manual corrections are applied using manual_translations_1grams.csv.
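For reference, translating a batch of words with the google-cloud-translate client library looks roughly like the sketch below; the lower-casing at the end only illustrates the kind of capitalization correction mentioned, not the script's exact logic (credentials must be configured, e.g. via the GOOGLE_APPLICATION_CREDENTIALS environment variable):

```python
from google.cloud import translate_v2 as translate

client = translate.Client()  # picks up Google Cloud credentials

words = ["de", "la", "et", "le"]
results = client.translate(words, source_language="fr", target_language="en")

translations = []
for word, res in zip(words, results):
    en = res["translatedText"]
    # Illustrative fix: if the source word is lower case,
    # force the translation to lower case as well.
    translations.append(en.lower() if word.islower() else en)

print(translations)
```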
Finally, graph_1grams_cumshare_rank.py produces graph_1grams_cumshare_rank_light.svg and its dark version.
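A stripped-down version of such a plot, for a single language and assuming matplotlib is installed (the script in the repository handles all languages, the styling, and the dark variant; the output file name here is illustrative):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("ngrams/1grams_french.csv")

plt.plot(range(1, len(df) + 1), df["cumshare"], label="French")
plt.xscale("log")  # a log scale makes the low ranks readable
plt.xlabel("1-gram frequency rank")
plt.ylabel("cumshare")
plt.legend()
plt.savefig("cumshare_rank_french.svg")
```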
All the cleaning steps are performed in gather_and_clean.py, except the cleaning of the Google translations.
The following cleaning steps are performed programmatically; among other things, n-grams containing certain characters (e.g. " " and ",") are removed and apostrophes ("'") receive special handling, with language-specific rules such as "-" for Russian and """ for Hebrew (see gather_and_clean.py for the complete set of rules).

Moreover, the following cleaning steps have been performed manually, using the English translations to help inform decisions:
n-grams judged not to belong in the lists are collected in extra_{n}grams_to_exclude.csv and excluded. When manually deciding which words to include and exclude, the following rules were applied. Exclude: person names (some exceptions: Jesus, God), city names (some exceptions: if they differ a lot from English and are common enough), company names, abbreviations (some exceptions, e.g. ma, pa), word parts, and words in the wrong language (except if in common use). Do not exclude: country names, names for ethnic/national groups of people, geographical names (e.g. rivers, oceans), colloquial terms, and interjections.
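To give a flavor of what such filtering can look like in code (this is an illustration only; the repository's actual rules live in gather_and_clean.py and the extra_{n}grams_to_exclude.csv files), one might keep only n-grams whose characters all belong to a per-language set of allowed characters:

```python
import re

# Hypothetical per-language patterns of allowed characters; the real
# rules are more detailed and language-specific.
ALLOWED = {
    "french": re.compile(r"^[a-zàâæçéèêëîïôöœùûü' -]+$", re.IGNORECASE),
    "russian": re.compile(r"^[а-яё -]+$", re.IGNORECASE),
}

def keep(ngram: str, lang: str) -> bool:
    """Return True if every character of the n-gram is allowed for lang."""
    return bool(ALLOWED[lang].match(ngram))

print(keep("l'homme", "french"))  # True
print(keep("3,5", "french"))      # False
```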
For a carefully created list of all words in the English part of the Google Books Ngram Corpus (version 20200217), sorted by frequency, see the hackerb9/gwordlist repository.
An extensive list of frequency lists in many languages is available on Wiktionary.
What remains to be done is completing the final manual cleaning for Hebrew and the cleaning of the lists of n-grams for n > 1. An issue that might be addressed in the future is that some of the latter contain "," as a word. It might also be desirable to include more common abbreviations. No separate n-gram lists are provided for the American English and British English corpora. Of course, some errors remain.
The content of this repository is licensed under the Creative Commons Attribution 3.0 Unported License. Issue reports and pull requests are most welcome!