This repository provides cleaned lists of the most frequent words and n-grams (sequences of n words) in the Google Books Ngram Corpus (v3/20200217, all languages), including some English translations, plus customizable Python code that reproduces these lists.
Lists with n-grams
Lists with the most frequent n-grams are provided separately by language and n. Available languages are Chinese (simplified), English, English Fiction, French, German, Hebrew, Italian, Russian, and Spanish. n ranges from 1 to 5. In the provided lists, the language subcorpora are restricted to books published in the years 2010-2019; both this year range and the number of most frequent n-grams included can be adjusted in the Python code.
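As a rough illustration of what restricting to a year range and keeping the most frequent n-grams involves, here is a minimal sketch. It is not the repository's actual code: the function name `top_ngrams`, its parameters, and the `(ngram, year, count)` input shape are assumptions made for this example only.

```python
from collections import Counter


def top_ngrams(records, year_start=2010, year_end=2019, n_top=3):
    """Sum counts per n-gram within a year range and return the n_top largest.

    `records` is an iterable of (ngram, year, count) tuples. This input
    shape is hypothetical and chosen only to keep the example small.
    """
    totals = Counter()
    for ngram, year, count in records:
        # Only count occurrences from books published in the chosen years.
        if year_start <= year <= year_end:
            totals[ngram] += count
    # most_common sorts by total count, highest first.
    return totals.most_common(n_top)


# Toy data, not taken from the corpus.
sample = [
    ("the", 2005, 50),
    ("the", 2012, 30),
    ("of", 2015, 20),
    ("of", 2018, 15),
    ("and", 2011, 10),
    ("and", 1999, 100),
]

print(top_ngrams(sample, n_top=2))
```

Widening the range, for example `top_ngrams(sample, year_start=1990, n_top=2)`, changes the ranking because counts from older books are then included.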
The lists are found in the ngrams directory. For all languages except Hebrew, cleaned lists are provided for the
10,000 most frequent 1-grams,
5,000 most frequent 2-grams,
3,000 most frequent 3-grams,
1,000 most frequent 4-grams,
1,000 most frequent 5-grams.
For Hebrew, due to the small corpus size, only the 200 most frequent 4-grams and 80 most frequent 5-grams are provided.
All cleaned lists also contain the number of times each n-gram occurs in the corpus (its frequency, column freq). For 1-grams (words) there are two additional columns:
cumshare, which contains for each word the cumulative share of all words in the corpus accounted for by that word and all more frequent words.
en, which contains the English translation of the word obtained using the Google Cloud Translate API (only for non-English languages).
Here are the first 10 rows of 1grams_french.csv:
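| ngram | freq | cumshare | en |
|-------|------------|-------|--------|
| de | 1380202965 | 0.048 | of |
| la | 823756863 | 0.077 | the |
| et | 651571349 | 0.100 | and |
| le | 614855518 | 0.121 | the |
| à | 577644624 | 0.142 | at |
| l' | 527188618 | 0.160 | the |
| les | 503689143 | 0.178 | them |
| en | 390657918 | 0.191 | in |
| des | 384774428 | 0.205 | of the |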
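The cumshare column makes it easy to ask how many of the most frequent words are needed to cover a given share of the corpus. The sketch below shows one way to do this with the standard library. It assumes the file is comma-separated with the header shown above; the helper name `words_needed` is hypothetical, and the inline sample simply repeats the French rows listed above rather than reading the real file.

```python
import csv
import io

# Excerpt mirroring the rows of 1grams_french.csv shown above.
# Assumption: the real file is comma-separated with this header.
SAMPLE_CSV = """ngram,freq,cumshare,en
de,1380202965,0.048,of
la,823756863,0.077,the
et,651571349,0.100,and
le,614855518,0.121,the
à,577644624,0.142,at
l',527188618,0.160,the
les,503689143,0.178,them
en,390657918,0.191,in
des,384774428,0.205,of the
"""


def words_needed(rows, target_share):
    """Return how many top words are needed to reach target_share.

    `rows` must be ordered from most to least frequent, as in the lists.
    Returns None if the listed words never reach the target.
    """
    for rank, row in enumerate(rows, start=1):
        if float(row["cumshare"]) >= target_share:
            return rank
    return None


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(words_needed(rows, 0.10))  # 3
print(words_needed(rows, 0.50))  # None: the excerpt stops at 0.205
```

To run this against a real list, replace `io.StringIO(SAMPLE_CSV)` with an opened file, for example `open("ngrams/1grams_french.csv", encoding="utf-8", newline="")`.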
The lists found directly in the ngrams directory have been cleaned and are intended for use when developing language-learning materials. The sub-directory ngrams/more contains uncleaned and less cleaned versions which might be of use for e.g. linguists:
the most frequent raw n-grams as Google stores them (suffixed 0_raw),
the entries without part-of-speech (POS) tags (suffixed 1a_no_pos),
the entries with POS tags (only for 1-grams, suffixed 1b_with_pos),
entries excluded from the final cleaned lists (suffixed 2_removed).
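For readers working with the raw lists, the sketch below shows one possible way to tell POS-tagged entries from untagged ones. It assumes the tagging convention used by the Google Books Ngram Corpus, where a tagged word looks like `run_VERB` and a tag placeholder looks like `_NOUN_`. The function name `has_pos_tag` and the regular expression are illustrative only and are not necessarily how the repository's code performs this step.

```python
import re

# POS tags used in the Google Books Ngram Corpus
# (assumed here to be the complete set).
POS_TAGS = ("NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
            "ADP", "NUM", "CONJ", "PRT", ".", "X")
_TAGS = "|".join(re.escape(tag) for tag in POS_TAGS)

# A token counts as tagged if it is "word_TAG" or a placeholder "_TAG_".
# _START_, _END_ and _ROOT_ are sentence and dependency markers.
_POS_TOKEN = re.compile(
    rf"^(?:.+_(?:{_TAGS})|_(?:{_TAGS}|START|END|ROOT)_)$"
)


def has_pos_tag(ngram: str) -> bool:
    """Return True if any space-separated token of the n-gram is tagged."""
    return any(_POS_TOKEN.match(token) for token in ngram.split(" "))


entries = ["run", "run_VERB", "the cat", "_NOUN_ of", "foo_bar"]
no_pos = [e for e in entries if not has_pos_tag(e)]
with_pos = [e for e in entries if has_pos_tag(e)]
print(no_pos)
print(with_pos)
```

Note that a word containing an underscore, such as `foo_bar`, is treated as untagged because `bar` is not one of the known tags.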
Learning languages with this
To motivate why learning the most frequent words first may be a good idea when learning a language, the following graph is provided.