predict4all is an accurate, fast, lightweight, multilingual, free and open-source next word prediction library.
It is designed to be integrated into applications that display possible next words to help user input: virtual keyboards, text editors, AAC systems...
What makes predict4all original is its correction model: it relies on a set of correction rules (general or specific: accents, grammar, missing spaces, etc.). This model allows correction to happen earlier in the prediction than string-distance techniques do. It also allows the correction to behave like existing correctors (e.g. GBoard) while being enhanced with custom rules based on user errors (dysorthography, dyslexia, etc.).
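To make the idea of rule-based correction more concrete, here is a minimal, self-contained sketch. It is not predict4all's implementation: the class `ToyAccentRuleCorrector`, its rule table and its methods are hypothetical names made up for illustration. It only shows why a rule can be applied to a partially typed word, whereas a string distance is usually computed against complete words.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy illustration only (hypothetical class); predict4all's real correction model is more elaborate. */
final class ToyAccentRuleCorrector {

    // Assumed accent rules: a typed letter may stand for one of these accented letters
    private static final Map<Character, String> ACCENT_RULES = Map.of('e', "éèê", 'a', "àâ");

    private ToyAccentRuleCorrector() {
    }

    /** Generates every variant of the prefix obtained by applying the rules to zero or more letters. */
    static Set<String> expand(String prefix) {
        Set<String> variants = new LinkedHashSet<>();
        variants.add("");
        for (char typed : prefix.toCharArray()) {
            // The typed letter itself, followed by its possible corrections
            String alternatives = typed + ACCENT_RULES.getOrDefault(typed, "");
            Set<String> next = new LinkedHashSet<>();
            for (String variant : variants) {
                for (char c : alternatives.toCharArray()) {
                    next.add(variant + c);
                }
            }
            variants = next;
        }
        // The number of variants grows quickly with the prefix length: fine for a toy, not for production
        return variants;
    }

    /** Returns the vocabulary words starting with the typed prefix or with one of its corrected variants. */
    static List<String> candidates(String prefix, List<String> vocabulary) {
        Set<String> variants = expand(prefix);
        List<String> matches = new ArrayList<>();
        for (String word : vocabulary) {
            if (variants.stream().anyMatch(word::startsWith)) {
                matches.add(word);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> vocabulary = List.of("était", "étale", "état", "manger");
        System.out.println(candidates("eta", vocabulary)); // prints [était, étale, état]
    }
}
```

In this sketch, the three typed letters "eta" are enough to reach "était": the correction is applied to the prefix and does not wait for the whole word.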
predict4all was co-designed with speech therapists and occupational therapists from CMRRF Kerpape and Hopital Raymond Poincaré to ensure that it matches the needs and requirements of users with speech and writing difficulties. Particular attention was given to identifying the common and specific mistakes made by people with dysorthography and dyslexia.
Currently, predict4all supports the French language (rules and a pre-trained language model are provided).
This library was developed in the collaborative project predict4all, involving CMRRF Kerpape, Hopital Raymond Poincaré and the BdTln Team, LIFAT, Université de Tours.
predict4all is supported by Fondation Paul Bennetot, Fondation du Groupe Matmut under Fondation de l'Avenir, Paris, France (project AP-FPB 16-001)
This project has been integrated into the following AAC software: LifeCompanion, Sibylle, CiviKey.
The project is still being developed as part of the AAC4ALL project.
Train (see "Training your own language model") or download a language model. A pre-computed French language model is available:
The French language model was trained on more than 20 million words from Wikipedia and subtitle corpora. The vocabulary contains ~112,000 unique words.
Get the library through your favorite dependency manager :
Maven
<dependency>
<groupId>io.github.mthebaud</groupId>
<artifactId>predict4all</artifactId>
<version>1.2.0</version>
</dependency>
Gradle
implementation 'io.github.mthebaud:predict4all:1.2.0'
In the following examples, we assume that you have already initialized the language model, the predictor, etc.
final File FILE_NGRAMS = new File("fr_ngrams.bin");
final File FILE_WORDS = new File("fr_words.bin");
LanguageModel languageModel = new FrenchLanguageModel();
PredictionParameter predictionParameter = new PredictionParameter(languageModel);
WordDictionary dictionary = WordDictionary.loadDictionary(languageModel, FILE_WORDS);
try (StaticNGramTrieDictionary ngramDictionary = StaticNGramTrieDictionary.open(FILE_NGRAMS)) {
    WordPredictor wordPredictor = new WordPredictor(predictionParameter, dictionary, ngramDictionary);
    // EXAMPLE CODE SHOULD RUN HERE
}
You can find complete working code for these examples and more complex ones in predict4all-example
Please read the Javadoc (public classes are well documented)
WordPredictionResult predictionResult = wordPredictor.predict("j'aime manger des ");
for (WordPrediction prediction : predictionResult.getPredictions()) {
    System.out.println(prediction);
}
Result (French language model)
trucs = 0.16105959785338766 (insert = trucs, remove = 0, space = true)
fruits = 0.16093509126844632 (insert = fruits, remove = 0, space = true)
bonbons = 0.11072838908013616 (insert = bonbons, remove = 0, space = true)
gâteaux = 0.1107102433239866 (insert = gâteaux, remove = 0, space = true)
frites = 0.1107077522148962 (insert = frites, remove = 0, space = true)
WordPredictionResult predictionResult = wordPredictor.predict("je te r");
for (WordPrediction prediction : predictionResult.getPredictions()) {
    System.out.println(prediction);
}
Result (French language model)
rappelle = 0.25714609184509885 (insert = appelle, remove = 0, space = true)
remercie = 0.12539880967030353 (insert = emercie, remove = 0, space = true)
ramène = 0.09357117922321868 (insert = amène, remove = 0, space = true)
retrouve = 0.07317575867400958 (insert = etrouve, remove = 0, space = true)
rejoins = 0.06404375655722373 (insert = ejoins, remove = 0, space = true)
To tune the WordPredictor, you can explore the PredictionParameter Javadoc.
CorrectionRuleNode root = new CorrectionRuleNode(CorrectionRuleNodeType.NODE);
root.addChild(FrenchDefaultCorrectionRuleGenerator.CorrectionRuleType.ACCENTS.generateNodeFor(predictionParameter));
predictionParameter.setCorrectionRulesRoot(root);
predictionParameter.setEnableWordCorrection(true);
WordPredictor wordPredictor = new WordPredictor(predictionParameter, dictionary, ngramDictionary);
WordPredictionResult predictionResult = wordPredictor.predict("il eta");
for (WordPrediction prediction : predictionResult.getPredictions()) {
    System.out.println(prediction);
}
Result (French language model)
était = 0.9485814446960688 (insert = était, remove = 3, space = true)
établit = 0.05138460933797299 (insert = établit, remove = 3, space = true)
étale = 7.544080911878824E-6 (insert = étale, remove = 3, space = true)
établissait = 4.03283914323952E-6 (insert = établissait, remove = 3, space = true)
étaye = 4.025324786425216E-6 (insert = étaye, remove = 3, space = true)
In this example, remove becomes positive because the first letter of the typed word is incorrect: the previously typed text should be removed before the insertion.
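The way a client application might apply such a result to its text field can be sketched as follows. This is a hedged, standalone illustration: `PredictionApplier` and its `apply` method are hypothetical helpers, not part of predict4all. It assumes that `remove` counts the characters to delete at the end of the typed text and that `space = true` means a space should follow the inserted word; the three values are passed as plain parameters rather than read from the library's objects.

```java
/** Hypothetical helper, not part of predict4all: applies a prediction to the text typed so far. */
final class PredictionApplier {

    private PredictionApplier() {
    }

    /**
     * @param typedText text currently in the input field
     * @param insert    characters to insert (the "insert" value printed above)
     * @param remove    number of trailing characters to delete first (the "remove" value)
     * @param space     whether a space should follow the inserted text (assumed meaning of "space")
     * @return the text after the prediction has been applied
     */
    static String apply(String typedText, String insert, int remove, boolean space) {
        if (remove < 0 || remove > typedText.length()) {
            throw new IllegalArgumentException("remove must be between 0 and the typed text length");
        }
        // Drop the incorrect characters at the end of the input, if any
        String kept = typedText.substring(0, typedText.length() - remove);
        return kept + insert + (space ? " " : "");
    }

    public static void main(String[] args) {
        // Simple completion: nothing to remove, only the missing letters are inserted
        System.out.println("[" + apply("je te r", "appelle", 0, true) + "]"); // prints [je te rappelle ]
        // Correction: "eta" is removed, then the whole corrected word is inserted
        System.out.println("[" + apply("il eta", "était", 3, true) + "]"); // prints [il était ]
    }
}
```

With the values shown in the results above, "je te r" becomes "je te rappelle " and "il eta" becomes "il était ".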
DynamicNGramDictionary dynamicNGramDictionary = new DynamicNGramDictionary(4);
predictionParameter.setDynamicModelEnabled(true);
WordPredictor wordPredictor = new WordPredictor(predictionParameter, dictionary, ngramDictionary, dynamicNGramDictionary);
WordPredictionResult predictionResult = wordPredictor.predict("je vais à la ");
for (WordPrediction prediction : predictionResult.getPredictions()) {
    System.out.println(prediction);
}
wordPredictor.trainDynamicModel("je vais à la gare");
predictionResult = wordPredictor.predict("je vais à la ");
for (WordPrediction prediction : predictionResult.getPredictions()) {
    System.out.println(prediction);
}
Result (French language model)
- Before training
fête = 0.3670450710570904 (insert = fête, remove = 0, space = true)
bibliothèque = 0.22412342109445696 (insert = bibliothèque, remove = 0, space = true)
salle = 0.22398910838330122 (insert = salle, remove = 0, space = true)
fin = 0.014600071765987328 (insert = fin, remove = 0, space = true)
suite = 0.014315510457449597 (insert = suite, remove = 0, space = true)
- After training
fête = 0.35000112941797795 (insert = fête, remove = 0, space = true)
bibliothèque = 0.2137161256141207 (insert = bibliothèque, remove = 0, space = true)
salle = 0.213588049788271 (insert = salle, remove = 0, space = true)
gare = 0.045754860284824 (insert = gare, remove = 0, space = true)
fin = 0.013922109328323544 (insert = fin, remove = 0, space = true)
In this example, the word "gare" appears after training the model with "je vais à la gare".
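The principle behind this behavior can be sketched with a toy model. The class below is hypothetical and far simpler than predict4all's dynamic model: it only counts pairs of consecutive words (bigrams), and the counts of the "shared" model are invented for the demonstration. It shows how user-specific counts can be layered on top of shared data that is never modified.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy illustration only (hypothetical class): a bigram counter, not predict4all's implementation. */
final class ToyDynamicBigramModel {

    // Counts from the shared, read-only model: previous word -> (next word -> count)
    private final Map<String, Map<String, Integer>> staticCounts;
    // Counts learned from the user at runtime; only this map is ever modified
    private final Map<String, Map<String, Integer>> userCounts = new HashMap<>();

    ToyDynamicBigramModel(Map<String, Map<String, Integer>> staticCounts) {
        this.staticCounts = staticCounts;
    }

    /** Learns every pair of consecutive words in the sentence. */
    void train(String sentence) {
        String[] words = sentence.trim().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            userCounts.computeIfAbsent(words[i], k -> new HashMap<>())
                    .merge(words[i + 1], 1, Integer::sum);
        }
    }

    /** Returns the candidate next words after previousWord, most frequent first. */
    List<String> predict(String previousWord) {
        Map<String, Integer> shared = staticCounts.getOrDefault(previousWord, Map.of());
        Map<String, Integer> learned = userCounts.getOrDefault(previousWord, Map.of());
        // Copy the shared counts, then add the user counts on top of them
        Map<String, Integer> merged = new TreeMap<>(shared);
        learned.forEach((word, count) -> merged.merge(word, count, Integer::sum));
        List<String> result = new ArrayList<>(merged.keySet());
        // Stable sort by descending count; ties keep the alphabetical order of the TreeMap
        result.sort(Comparator.comparing((String w) -> merged.get(w)).reversed());
        return result;
    }

    public static void main(String[] args) {
        // Invented counts standing in for a pre-trained model
        Map<String, Map<String, Integer>> shared = Map.of("la", Map.of("fête", 3, "salle", 2));
        ToyDynamicBigramModel model = new ToyDynamicBigramModel(shared);
        System.out.println(model.predict("la")); // prints [fête, salle]
        model.train("je vais à la gare");
        System.out.println(model.predict("la")); // prints [fête, salle, gare]
    }
}
```

As in the real example, "gare" only shows up after training, and it ranks below the words that the shared model already knew well.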
Be careful: training a model with incorrect sentences will corrupt your data.
When using a dynamic model, you should take care of saving and loading two different files: the user n-grams and the user word dictionary.
The original files are not modified, so they can be shared across different users: this is the recommended implementation pattern.
Once your model is trained, you may want to save it :
dynamicNGramDictionary.saveDictionary(new File("fr_user_ngrams.bin"));
and later load it again (and pass it to WordPredictor constructor)
DynamicNGramDictionary dynamicNGramDictionary = DynamicNGramDictionary.load(new File("fr_user_ngrams.bin"));
You can also save the word dictionary if new words have been added :
dictionary.saveUserDictionary(new File("fr_user_words.bin"));
and later load it again (on an existing WordDictionary instance)
dictionary.loadUserDictionary(new File("fr_user_words.bin"));
It is sometimes useful to modify the available vocabulary to better adapt predictions to the user.
This can be done by working with WordDictionary. For example, you can disable a word:
Word maisonWord = dictionary.getWord("maison");
maisonWord.setForceInvalid(true, true);
Or you can show the user every custom word added to the dictionary:
dictionary.getAllWords().stream()
        .filter(w -> w.isValidToBePredicted(predictionParameter)) // skip words that would never appear in predictions
        .filter(Word::isUserWord) // keep only the words added by the user
        .forEach(w -> System.out.println(w.getWord()));
When you modify words (original or added by users), don't forget to save the user dictionary: it stores the user words as well as the modifications made to original words.
You can find further information in the Word Javadoc.
When using predict4all, you should take note that :
To train your own language model, you will first need to prepare :
A good CPU is also important: predict4all makes heavy use of multithreaded algorithms, so the more cores you have, the faster the training will be.
Then, you can run the executable jar (precompiled version available) from the command line:
java -Xmx16G -jar predict4all-model-trainer-cmd-1.1.0-all.jar -config fr_training_configuration.json -language fr -ngram-dictionary fr_ngrams.bin -word-dictionary fr_words.bin path/to/corpus
This command launches a training run, allows the JVM to use 16 GB of memory, and specifies the input and output configuration.
The generated data files will be fr_ngrams.bin and fr_words.bin.
Alternatively, you can check LanguageDataModelTrainer in predict4all-model-trainer-cmd to launch your training programmatically.
Please let us know if you use predict4all!
Feel free to file an issue if you need assistance or if you find a bug.
This software is distributed under the Apache License 2.0 (see file LICENCE)
This project was developed using various NLP techniques (mainly n-gram based).
predict4all is built using GitHub Actions to publish new versions on Maven Central.
To create a version
Reference document on publishing