Now with LLaMa3 support

LLaMa2lang v0.6

This repository contains convenience scripts to finetune LLaMa3-8B (or any other foundation model) for chat in any language other than English. The rationale is that LLaMa3 is trained primarily on English data and, while it works to some extent for other languages, its performance there is poor compared to English.

Combine the power of fine-tuning with the power of RAG: check out our RAG Me Up repository on RAG, which can be used on top of your models tuned with LLaMa2Lang.

TL;DR

pip install -r requirements.txt

# Translate OASST1 to target language
python translate.py m2m target_lang checkpoint_location

# Combine the checkpoint files into a dataset
python combine_checkpoints.py input_folder output_location

# Finetune
python finetune.py tuned_model dataset_name instruction_prompt

# Optionally finetune with DPO (RLHF)
python finetune_dpo.py tuned_model dataset_name instruction_prompt

# Run inference
python run_inference.py model_name instruction_prompt input
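
The commands above can also be assembled from a small driver script. The following is a minimal sketch, not part of the repository: the helper name `build_pipeline`, the dataset and model names, and the choice to pass the same folder to `translate.py` and `combine_checkpoints.py` are all assumptions made for illustration.

```python
import shlex

def build_pipeline(target_lang, checkpoint_dir, dataset_name, tuned_model,
                   instruction_prompt, chat_input, translator="m2m"):
    """Return the TL;DR commands as argument lists, in execution order."""
    return [
        ["python", "translate.py", translator, target_lang, checkpoint_dir],
        # ASSUMPTION: the folder written by translate.py is the input folder here.
        ["python", "combine_checkpoints.py", checkpoint_dir, dataset_name],
        ["python", "finetune.py", tuned_model, dataset_name, instruction_prompt],
        ["python", "run_inference.py", tuned_model, instruction_prompt, chat_input],
    ]

# Hypothetical names; replace them with your own dataset and model identifiers.
commands = build_pipeline(
    target_lang="nl",
    checkpoint_dir="./output_nl",
    dataset_name="my-user/oasst1_nl",
    tuned_model="my-user/llama-3-8b-nl",
    instruction_prompt="Je bent een generieke chatbot die altijd in het Nederlands antwoord geeft.",
    chat_input="Wat is de hoofdstad van Nederland?",
)

for command in commands:
    # shlex.join quotes arguments that contain spaces, so the line can be pasted into a shell.
    print(shlex.join(command))
    # To execute instead of print: subprocess.run(command, check=True)
```

Passing argument lists rather than one shell string keeps the multi-word instruction prompt intact as a single argument.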

What it does

The process we follow to tune a foundation model such as LLaMa3 for a specific language is as follows:

1. Load a dataset that contains Q&A/instruction pairs.
2. Translate the entire dataset to a given target language.
3. Load the translated dataset and extract threads by recursively selecting prompts together with their highest-ranked answers only, continuing through to subsequent prompts.
4. Turn the threads into prompts following a given (customizable) template.
5. Use QLoRA and PEFT to instruction-finetune a base foundation model on this dataset.
- Use QLoRA and PEFT to finetune with DPO to extend the model's capabilities further and teach it to prefer chosen answers over rejected ones. Note that your base dataset must contain this information.
- As an alternative to DPO, you can achieve the same with ORPO.
6. Run inference using the newly trained model.
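
The thread-extraction step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: it assumes OASST-style records with `message_id`, `parent_id`, `rank` and `text` fields, assumes that a lower `rank` value means a better answer, and walks the tree iteratively rather than recursively. The function name `extract_threads` is hypothetical.

```python
from collections import defaultdict

def extract_threads(messages):
    """Build one thread per root prompt by always following the best-ranked child."""
    children = defaultdict(list)
    for message in messages:
        children[message["parent_id"]].append(message)

    def rank_key(message):
        # ASSUMPTION: lower rank is better; messages without a rank sort last.
        rank = message.get("rank")
        return rank if rank is not None else float("inf")

    threads = []
    for root in children[None]:  # root prompts have no parent
        thread, current = [root], root
        while children[current["message_id"]]:
            current = min(children[current["message_id"]], key=rank_key)
            thread.append(current)
        threads.append(thread)
    return threads

messages = [
    {"message_id": "a", "parent_id": None, "rank": None, "text": "Q1"},
    {"message_id": "b", "parent_id": "a", "rank": 1, "text": "weaker answer"},
    {"message_id": "c", "parent_id": "a", "rank": 0, "text": "best answer"},
    {"message_id": "d", "parent_id": "c", "rank": None, "text": "Q2"},
    {"message_id": "e", "parent_id": "d", "rank": 0, "text": "A2"},
]

threads = extract_threads(messages)
thread_texts = [[message["text"] for message in thread] for thread in threads]
print(thread_texts)  # [['Q1', 'best answer', 'Q2', 'A2']]
```

Only the best-ranked branch survives, which is why the weaker answer never appears in the resulting thread.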

Supported paradigms

Translation

OPUS
M2M
MADLAD
mBART
NLLB
Seamless (Large only)
Tower Instruct (can correct spelling mistakes)

Base datasets

The following have been tested, but more will potentially work

OASST1
OASST2

Supported foundation models

LLaMa3
LLaMa2
Mistral
(Unofficial) Mixtral 8x7B

Roadmap

[L2L-6] Investigate interoperability with other libraries (Axolotl, llamacpp, unsloth)
[L2L-7] Allow for different quantizations next to QLoRA (GGUF, GPTQ, AWQ)
[L2L-10] Support extending the tokenizer and vocabulary

Cost and runtime

The above process can be run entirely on a free Google Colab T4 GPU. The last step, however, can only be run successfully with sufficiently short context windows and a batch size of at most 2. In addition, the translation in step 2 takes about 36 hours in total for any given language, so it should be run in multiple sessions if you want to stick with a free Google Colab GPU.

Our fine-tuned models for step 5 were trained on an A40 on vast.ai and cost us less than a dollar per model, completing in about 1.5 hours.
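
Splitting a long translation run over several sessions relies on checkpoint files. The sketch below illustrates the general idea only: the function name `translate_with_checkpoints`, the file naming scheme and the stand-in translator are hypothetical and do not describe how `translate.py` names or resumes its checkpoints.

```python
import json
import tempfile
from pathlib import Path

def translate_with_checkpoints(records, translate_fn, checkpoint_dir, checkpoint_n=400):
    """Translate records in chunks, skipping chunks whose checkpoint file already exists."""
    out = Path(checkpoint_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    for start in range(0, len(records), checkpoint_n):
        # Hypothetical naming scheme: one JSON array per chunk, keyed by start offset.
        path = out / f"checkpoint_{start:08d}.json"
        if path.exists():
            continue  # this chunk was finished in an earlier session
        chunk = [translate_fn(record) for record in records[start:start + checkpoint_n]]
        path.write_text(json.dumps(chunk, ensure_ascii=False), encoding="utf-8")
        written += 1
    return written

with tempfile.TemporaryDirectory() as tmp:
    records = ["a", "b", "c", "d", "e"]
    # str.upper stands in for a real translation model call.
    first_run = translate_with_checkpoints(records, str.upper, tmp, checkpoint_n=2)
    second_run = translate_with_checkpoints(records, str.upper, tmp, checkpoint_n=2)
    print(first_run, second_run)  # 3 0
```

A smaller `checkpoint_n` loses less work when a session is cut off, at the price of more files to combine afterwards.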

Usage

Make sure pytorch is installed and working for your environment (use of CUDA is preferable): https://pytorch.org/get-started/locally/
Clone the repo and install the requirements.

pip install -r requirements.txt

Translate your base dataset to your designated target language.

 usage: translate.py [-h] [--quant8] [--quant4] [--base_dataset BASE_DATASET] [--base_dataset_text_field BASE_DATASET_TEXT_FIELD] [--base_dataset_lang_field BASE_DATASET_LANG_FIELD]
                    [--checkpoint_n CHECKPOINT_N] [--batch_size BATCH_SIZE] [--max_length MAX_LENGTH] [--cpu] [--source_lang SOURCE_LANG]
                    {opus,mbart,madlad,m2m,nllb,seamless_m4t_v2,towerinstruct} ... target_lang checkpoint_location

Translate an instruct/RLHF dataset to a given target language using a variety of translation models

positional arguments:
  {opus,mbart,madlad,m2m,nllb,seamless_m4t_v2,towerinstruct}
                        The model/architecture used for translation.
    opus                Translate the dataset using HelsinkiNLP OPUS models.
    mbart               Translate the dataset using mBART.
    madlad              Translate the dataset using Google's MADLAD models.
    m2m                 Translate the dataset using Facebook's M2M models.
    nllb                Translate the dataset using Facebook's NLLB models.
    seamless_m4t_v2     Translate the dataset using Facebook's SeamlessM4T-v2 multimodal models.
    towerinstruct       Translate the dataset using Unbabel's Tower Instruct. Make sure your target language is in the 10 languages supported by the model.
  target_lang           The target language. Make sure you use language codes defined by the translation model you are using.
  checkpoint_location   The folder the script will write (JSONized) checkpoint files to. Folder will be created if it doesn't exist.

options:
  -h, --help            show this help message and exit
  --quant8              Optional flag to load the translation model in 8 bits. Decreases memory usage, increases running time
  --quant4              Optional flag to load the translation model in 4 bits. Decreases memory usage, increases running time
  --base_dataset BASE_DATASET
                        The base dataset to translate, defaults to OpenAssistant/oasst1
  --base_dataset_text_field BASE_DATASET_TEXT_FIELD
                        The base dataset's column name containing the actual text to translate. Defaults to text
  --base_dataset_lang_field BASE_DATASET_LANG_FIELD
                        The base dataset's column name containing the language the source text was written in. Defaults to lang
  --checkpoint_n CHECKPOINT_N
                        An integer representing how often a checkpoint file will be written out. To start off, 400 is a reasonable number.
  --batch_size BATCH_SIZE
                        The batch size for a single translation model. Adjust based on your GPU capacity. Default is 10.
  --max_length MAX_LENGTH
                        How much tokens to generate at most. More tokens might be more accurate for lengthy input but creates a risk of running out of memory. Default is unlimited.
  --cpu                 Forces usage of CPU. By default GPU is taken if available.
  --source_lang SOURCE_LANG
                        Source language to select from OASST based on lang property of dataset

If you want more parameters for the different translation models, run:

python translate.py [MODEL] -h

Be sure to specify the model-specific parameters first, before the common parameters from the list above. Example calls:

# Using M2M with 4bit quantization and different batch sizes to translate Dutch
python translate.py m2m nl ./output_nl --quant4 --batch_size 20

# Using madlad 7B with 8bit quantization for German with different max_length
python translate.py madlad --model_size 7b de ./output_de --quant8 --batch_size 5 --max_length 512

# Be sure to use target language codes that the model you use understands
python translate.py mbart xh_ZA ./output_xhosa
python translate.py nllb nld_Latn ./output_nl
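
Since each translation model expects its own language codes, a small lookup table can prevent passing the wrong one. The table below is illustrative only: the entries beyond those shown in the example calls are taken from the respective model conventions as best recalled and should be checked against each model's documentation. The names `TARGET_CODES` and `target_code` are hypothetical.

```python
# Illustrative only: verify and extend from each model's documentation before relying on it.
TARGET_CODES = {
    "m2m": {"dutch": "nl", "german": "de"},
    "madlad": {"dutch": "nl", "german": "de"},
    "mbart": {"dutch": "nl_XX", "german": "de_DE", "xhosa": "xh_ZA"},
    "nllb": {"dutch": "nld_Latn", "german": "deu_Latn"},
}

def target_code(model, language):
    """Look up the code a given translation model expects for a language name."""
    try:
        return TARGET_CODES[model][language.lower()]
    except KeyError:
        raise ValueError(f"No code known for {language!r} with {model!r}") from None

print(target_code("m2m", "Dutch"))   # nl
print(target_code("nllb", "Dutch"))  # nld_Latn
```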

Combine the JSON arrays from the checkpoint files into a Huggingface Dataset, then either write it to disk or publish it to Huggingface. The script tries to write to disk by default and falls back to publishing to Huggingface if the folder doesn't exist on disk. To publish to Huggingface, make sure your HF_TOKEN environment variable is set up as described in the documentation.

 usage: combine_checkpoints.py [-h] input_folder output_location

Combine checkpoint files from translation.

positional arguments:
  input_folder     The checkpoint folder used in translation, with the target language appended.
                   Example: "./output_nl".
  output_location  Where to write the Huggingface Dataset. Can be a disk location or a Huggingface
                   Dataset repository.

options:
  -h, --help       show this help message and exit
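
The core of combining checkpoints is concatenating JSON arrays. The following is a minimal standard-library sketch, assuming every checkpoint file holds one JSON array of records and that files may sit in subfolders; it is not the repository's `combine_checkpoints.py`. The resulting list could then be turned into a Huggingface Dataset, for example with `datasets.Dataset.from_list`.

```python
import json
import tempfile
from pathlib import Path

def combine_checkpoints(input_folder):
    """Concatenate all JSON arrays found under input_folder into one list of records."""
    records = []
    # ASSUMPTION: checkpoints are *.json files, possibly nested in subfolders.
    for path in sorted(Path(input_folder).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            records.extend(json.load(handle))
    return records

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "part_0.json").write_text(
        json.dumps([{"text": "Hallo"}, {"text": "Wereld"}]), encoding="utf-8")
    Path(tmp, "part_1.json").write_text(
        json.dumps([{"text": "Dag"}]), encoding="utf-8")
    combined = combine_checkpoints(tmp)

combined_texts = [record["text"] for record in combined]
print(combined_texts)  # ['Hallo', 'Wereld', 'Dag']
```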

Turn the translated messages into chat/instruct/prompt threads and instruction-finetune a foundation model using LoRA and PEFT.

 usage: finetune.py [-h] [--base_model BASE_MODEL] [--base_dataset_text_field BASE_DATASET_TEXT_FIELD] [--base_dataset_rank_field BASE_DATASET_RANK_FIELD] [--base_dataset_id_field BASE_DATASET_ID_FIELD] [--base_dataset_parent_field BASE_DATASET_PARENT_FIELD]
                   [--base_dataset_role_field BASE_DATASET_ROLE_FIELD] [--quant8] [--noquant] [--max_seq_length MAX_SEQ_LENGTH] [--num_train_epochs NUM_TRAIN_EPOCHS] [--batch_size BATCH_SIZE] [--threads_output_name THREADS_OUTPUT_NAME] [--thread_template THREAD_TEMPLATE]
                   [--padding PADDING]
                   tuned_model dataset_name instruction_prompt

Finetune a base instruct/chat model using (Q)LoRA and PEFT

positional arguments:
  tuned_model           The name of the resulting tuned model.
  dataset_name          The name of the dataset to use for fine-tuning. This should be the output of the combine_checkpoints script.
  instruction_prompt    An instruction message added to every prompt given to the chatbot to force it to answer in the target language. Example: "You are a generic chatbot that always answers in English."

options:
  -h, --help            show this help message and exit
  --base_model BASE_MODEL
                        The base foundation model. Default is "NousResearch/Meta-Llama-3-8B-Instruct".
  --base_dataset_text_field BASE_DATASET_TEXT_FIELD
                        The dataset's column name containing the actual text to translate. Defaults to text
  --base_dataset_rank_field BASE_DATASET_RANK_FIELD
                        The dataset's column name containing the rank of an answer given to a prompt. Defaults to rank
  --base_dataset_id_field BASE_DATASET_ID_FIELD
                        The dataset's column name containing the id of a text. Defaults to message_id
  --base_dataset_parent_field BASE_DATASET_PARENT_FIELD
                        The dataset's column name containing the parent id of a text. Defaults to parent_id
  --base_dataset_role_field BASE_DATASET_ROLE_FIELD
                        The dataset's column name containing the role of the author of the text (eg. prompter, assistant). Defaults to role
  --quant8              Finetunes the model in 8 bits. Requires more memory than the default 4 bit.
  --noquant             Do not quantize the finetuning. Requires more memory than the default 4 bit and optional 8 bit.
  --max_seq_length MAX_SEQ_LENGTH
                        The maximum sequence length to use in finetuning. Should most likely line up with your base model's default max_seq_length. Default is 512.
  --num_train_epochs NUM_TRAIN_EPOCHS
                        Number of epochs to use. 2 is default and has been shown to work well.
  --batch_size BATCH_SIZE
                        The batch size to use in finetuning. Adjust to fit in your GPU vRAM. Default is 4
  --threads_output_name THREADS_OUTPUT_NAME
                        If specified, the threads created in this script for finetuning will also be saved to disk or HuggingFace Hub.
  --thread_template THREAD_TEMPLATE
                        A file containing the thread template to use. Default is threads/template_fefault.txt
  --padding PADDING     What padding to use, can be either left or right.

6.1 [OPTIONAL] Finetune using DPO (similar to RLHF)

 usage: finetune_dpo.py [-h] [--base_model BASE_MODEL] [--base_dataset_text_field BASE_DATASET_TEXT_FIELD] [--base_dataset_rank_field BASE_DATASET_RANK_FIELD] [--base_dataset_id_field BASE_DATASET_ID_FIELD] [--base_dataset_parent_field BASE_DATASET_PARENT_FIELD] [--quant8]
                       [--noquant] [--max_seq_length MAX_SEQ_LENGTH] [--max_prompt_length MAX_PROMPT_LENGTH] [--num_train_epochs NUM_TRAIN_EPOCHS] [--batch_size BATCH_SIZE] [--threads_output_name THREADS_OUTPUT_NAME] [--thread_template THREAD_TEMPLATE] [--max_steps MAX_STEPS]
                       [--padding PADDING]
                       tuned_model dataset_name instruction_prompt

Finetune a base instruct/chat model using (Q)LoRA and PEFT using DPO (RLHF)

positional arguments:
  tuned_model           The name of the resulting tuned model.
  dataset_name          The name of the dataset to use for fine-tuning. This should be the output of the combine_checkpoints script.
  instruction_prompt    An instruction message added to every prompt given to the chatbot to force it to answer in the target language. Example: "You are a generic chatbot that always answers in English."

options:
  -h, --help            show this help message and exit
  --base_model BASE_MODEL
                        The base foundation model. Default is "NousResearch/Meta-Llama-3-8B-Instruct".
  --base_dataset_text_field BASE_DATASET_TEXT_FIELD
                        The dataset's column name containing the actual text to translate. Defaults to text
  --base_dataset_rank_field BASE_DATASET_RANK_FIELD
                        The dataset's column name containing the rank of an answer given to a prompt. Defaults to rank
  --base_dataset_id_field BASE_DATASET_ID_FIELD
                        The dataset's column name containing the id of a text. Defaults to message_id
  --base_dataset_parent_field BASE_DATASET_PARENT_FIELD
                        The dataset's column name containing the parent id of a text. Defaults to parent_id
  --quant8              Finetunes the model in 8 bits. Requires more memory than the default 4 bit.
  --noquant             Do not quantize the finetuning. Requires more memory than the default 4 bit and optional 8 bit.
  --max_seq_length MAX_SEQ_LENGTH
                        The maximum sequence length to use in finetuning. Should most likely line up with your base model's default max_seq_length. Default is 512.
  --max_prompt_length MAX_PROMPT_LENGTH
                        The maximum length of the prompts to use. Default is 512.
  --num_train_epochs NUM_TRAIN_EPOCHS
                        Number of epochs to use. 2 is default and has been shown to work well.
  --batch_size BATCH_SIZE
                        The batch size to use in finetuning. Adjust to fit in your GPU vRAM. Default is 4
  --threads_output_name THREADS_OUTPUT_NAME
                        If specified, the threads created in this script for finetuning will also be saved to disk or HuggingFace Hub.
  --thread_template THREAD_TEMPLATE
                        A file containing the thread template to use. Default is threads/template_fefault.txt
  --max_steps MAX_STEPS
                        The maximum number of steps to run DPO for. Default is -1 which will run the data through fully for the number of epochs but this will be very time-consuming.
  --padding PADDING     What padding to use, can be either left or right.
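
Preference tuning such as DPO needs pairs of a preferred and a rejected answer for the same prompt, which is why the base dataset must carry rank information. The sketch below shows one plausible way to derive such pairs from ranked answers. It is an assumption for illustration, not a description of what `finetune_dpo.py` does: it assumes a lower rank is better, and it uses the common `prompt`/`chosen`/`rejected` column naming. The function name `build_preference_pairs` is hypothetical.

```python
def build_preference_pairs(prompt, answers):
    """Pair the best-ranked answer (chosen) with every lower-ranked answer (rejected)."""
    # ASSUMPTION: lower rank is better; unranked answers are ignored.
    ranked = sorted(
        (answer for answer in answers if answer.get("rank") is not None),
        key=lambda answer: answer["rank"],
    )
    if len(ranked) < 2:
        return []  # no preference signal without at least two ranked answers
    chosen = ranked[0]
    return [
        {"prompt": prompt, "chosen": chosen["text"], "rejected": other["text"]}
        for other in ranked[1:]
    ]

answers = [
    {"text": "Amsterdam", "rank": 0},
    {"text": "Rotterdam", "rank": 2},
    {"text": "Den Haag", "rank": 1},
]
pairs = build_preference_pairs("Wat is de hoofdstad van Nederland?", answers)
for pair in pairs:
    print(pair["chosen"], ">", pair["rejected"])
```

Prompts with fewer than two ranked answers yield no pairs, so they contribute nothing to preference tuning.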

6.2 [OPTIONAL] Finetune using ORPO (similar to RLHF)

 usage: finetune_orpo.py [-h] [--base_model BASE_MODEL] [--base_dataset_text_field BASE_DATASET_TEXT_FIELD] [--base_dataset_rank_field BASE_DATASET_RANK_FIELD] [--base_dataset_id_field BASE_DATASET_ID_FIELD] [--base_dataset_parent_field BASE_DATASET_PARENT_FIELD] [--quant8]
                        [--noquant] [--max_seq_length MAX_SEQ_LENGTH] [--max_prompt_length MAX_PROMPT_LENGTH] [--num_train_epochs NUM_TRAIN_EPOCHS] [--batch_size BATCH_SIZE] [--threads_output_name THREADS_OUTPUT_NAME] [--thread_template THREAD_TEMPLATE] [--max_steps MAX_STEPS]
                        [--padding PADDING]
                        tuned_model dataset_name instruction_prompt

Finetune a base instruct/chat model using (Q)LoRA and PEFT using ORPO (RLHF)

positional arguments:
  tuned_model           The name of the resulting tuned model.
  dataset_name          The name of the dataset to use for fine-tuning. This should be the output of the combine_checkpoints script.
  instruction_prompt    An instruction message added to every prompt given to the chatbot to force it to answer in the target language. Example: "You are a generic chatbot that always answers in English."

options:
  -h, --help            show this help message and exit
  --base_model BASE_MODEL
                        The base foundation model. Default is "NousResearch/Meta-Llama-3-8B-Instruct".
  --base_dataset_text_field BASE_DATASET_TEXT_FIELD
                        The dataset's column name containing the actual text to translate. Defaults to text
  --base_dataset_rank_field BASE_DATASET_RANK_FIELD
                        The dataset's column name containing the rank of an answer given to a prompt. Defaults to rank
  --base_dataset_id_field BASE_DATASET_ID_FIELD
                        The dataset's column name containing the id of a text. Defaults to message_id
  --base_dataset_parent_field BASE_DATASET_PARENT_FIELD
                        The dataset's column name containing the parent id of a text. Defaults to parent_id
  --quant8              Finetunes the model in 8 bits. Requires more memory than the default 4 bit.
  --noquant             Do not quantize the finetuning. Requires more memory than the default 4 bit and optional 8 bit.
  --max_seq_length MAX_SEQ_LENGTH
                        The maximum sequence length to use in finetuning. Should most likely line up with your base model's default max_seq_length. Default is 512.
  --max_prompt_length MAX_PROMPT_LENGTH
                        The maximum length of the prompts to use. Default is 512.
  --num_train_epochs NUM_TRAIN_EPOCHS
                        Number of epochs to use. 2 is default and has been shown to work well.
  --batch_size BATCH_SIZE
                        The batch size to use in finetuning. Adjust to fit in your GPU vRAM. Default is 4
  --threads_output_name THREADS_OUTPUT_NAME
                        If specified, the threads created in this script for finetuning will also be saved to disk or HuggingFace Hub.
  --thread_template THREAD_TEMPLATE
                        A file containing the thread template to use. Default is threads/template_fefault.txt
  --max_steps MAX_STEPS
                        The maximum number of steps to run ORPO for. Default is -1 which will run the data through fully for the number of epochs but this will be very time-consuming.
  --padding PADDING     What padding to use, can be either left or right.

Run inference using the newly created QLoRA model.

 usage: run_inference.py [-h] model_name instruction_prompt input

Script to run inference on a tuned model.

positional arguments:
  model_name          The name of the tuned model that you pushed to Huggingface in the previous
                      step.
  instruction_prompt  An instruction message added to every prompt given to the chatbot to force
                      it to answer in the target language.
  input               The actual chat input prompt. The script is only meant for testing purposes
                      and exits after answering.

options:
  -h, --help          show this help message and exit

Choosing the right translation model

How do I know which translation model to choose for my target language?

We have you covered with our benchmark.py script, which helps you make a reasonably good guess (the dataset we use is the same one the OPUS models are trained on, so the outcomes are always favorable towards OPUS). For usage, see the help of this script below. Models are loaded in 4-bit quantization and run on a small sample of the OPUS books subset.

Be sure to use the most frequently occurring languages in your base dataset as source_language and your target translation language as target_language. For OASST1, for example, be sure to run at least en and es as source languages.

 usage: benchmark.py [-h] [--cpu] [--start START] [--n N] [--max_length MAX_LENGTH] source_language target_language included_models

Benchmark all the different translation models for a specific source and target language to find out which performs best. This uses 4bit quantization to limit GPU usage. Note:
the outcomes are indicative - you cannot assume corretness of the BLEU and CHRF scores but you can compare models against each other relatively.

positional arguments:
  source_language       The source language you want to test for. Check your dataset to see which occur most prevalent or use English as a good start.
  target_language       The source language you want to test for. This should be the language you want to apply the translate script on. Note: in benchmark, we use 2-character
                        language codes, in constrast to translate.py where you need to specify whatever your model expects.
  included_models       Comma-separated list of models to include. Allowed values are: opus, m2m_418m, m2m_1.2b, madlad_3b, madlad_7b, madlad_10b, madlad_7bbt, mbart,
                        nllb_distilled600m, nllb_1.3b, nllb_distilled1.3b, nllb_3.3b, seamless

options:
  -h, --help            show this help message and exit
  --cpu                 Forces usage of CPU. By default GPU is taken if available.
  --start START         The starting offset to include sentences from the OPUS books dataset from. Defaults to 0.
  --n N                 The number of sentences to benchmark on. Defaults to 100.
  --max_length MAX_LENGTH
                        How much tokens to generate at most. More tokens might be more accurate for lengthy input but creates a risk of running out of memory. Default is 512.
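
As the help text notes, the scores are only meaningful for comparing models against each other. The sketch below illustrates that kind of relative comparison with a simplified character n-gram F-score. It is a stand-in written for illustration, not the chrF implementation used by benchmark.py, and the model names and outputs in the example are made up.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count all character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(hypothesis, reference, max_n=3, beta=2.0):
    """Average F-beta over character n-gram orders 1..max_n (simplified, chrF-like)."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((hyp & ref).values())
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * precision * recall / (beta**2 * precision + recall))
    return sum(scores) / len(scores) if scores else 0.0

reference = "De kat zit op de mat."
# Hypothetical model outputs, for illustration only.
candidates = {"model_a": "De kat zit op de mat", "model_b": "Kat mat"}
ranking = sorted(candidates, key=lambda name: simple_chrf(candidates[name], reference), reverse=True)
print(ranking)  # ['model_a', 'model_b']
```

The absolute numbers from a simplified metric like this carry little meaning; only the ordering between models on the same sample is informative.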

Datasets and models

We have already created, and will continue to create, numerous datasets and models. Want to help democratize LLMs? Clone the repo and create datasets and models for other languages, then create a PR.

Translated oasst1 datasets

Dutch UnderstandLing/oasst1_nl	Spanish UnderstandLing/oasst1_es	French UnderstandLing/oasst1_fr	German UnderstandLing/oasst1_de
Catalan xaviviro/oasst1_ca	Portuguese UnderstandLing/oasst1_pt	Arabic HeshamHaroon/oasst-arabic	Italian UnderstandLing/oasst1_it
Russian UnderstandLing/oasst1_ru	Hindi UnderstandLing/oasst1_hi	Chinese UnderstandLing/oasst1_zh	Polish chrystians/oasst1_pl
Japanese UnderstandLing/oasst1_jap	Basque xezpeleta/oasst1_eu	Bengali UnderstandLing/oasst1_bn	Turkish UnderstandLing/oasst1_tr

Language-specific ❗LLaMa3-8B❗ chat model adapters

Make sure you have access to Meta's LLaMa3-8B model and set your HF_TOKEN before using these models.

UnderstandLing/Llama-3-8B-Instruct-nl Dutch	UnderstandLing/Llama-3-8B-Instruct-es Spanish	UnderstandLing/Llama-3-8B-Instruct-fr French	UnderstandLing/Llama-3-8B-Instruct-de German
UnderstandLing/Llama-3-8B-Instruct-pt Portuguese	UnderstandLing/Llama-3-8B-Instruct-it Italian	UnderstandLing/Llama-3-8B-Instruct-hi Hindi	UnderstandLing/Llama-3-8B-Instruct-ru Russian

Translated LLaMa2 thread chat prompt datasets

Dutch UnderstandLing/oasst1_nl_threads	Spanish UnderstandLing/oasst1_es_threads	French UnderstandLing/oasst1_fr_threads	German UnderstandLing/oasst1_de_threads
Catalan xaviviro/oasst1_ca_threads	Portuguese UnderstandLing/oasst1_pt_threads	Arabic HeshamHaroon/oasst-arabic_threads	Italian UnderstandLing/oasst1_it_threads
Russian UnderstandLing/oasst1_ru_threads	Hindi UnderstandLing/oasst1_hi_threads	Chinese UnderstandLing/oasst1_zh_threads	Polish chrystians/oasst1_pl_threads
Japanese UnderstandLing/oasst1_jap_threads	Basque xezpeleta/oasst1_eu_threads	Bengali UnderstandLing/oasst1_bn_threads	Turkish UnderstandLing/oasst1_tr_threads

Language-specific LLaMa2-7B chat model adapters

UnderstandLing/llama-2-7b-chat-nl Dutch	UnderstandLing/llama-2-7b-chat-es Spanish	UnderstandLing/llama-2-7b-chat-fr French	UnderstandLing/llama-2-7b-chat-de German
xaviviro/llama-2-7b-chat-ca Catalan	UnderstandLing/llama-2-7b-chat-pt Portuguese	HeshamHaroon/llama-2-7b-chat-ar Arabic	UnderstandLing/llama-2-7b-chat-it Italian
UnderstandLing/llama-2-7b-chat-ru Russian	UnderstandLing/llama-2-7b-chat-hi Hindi	UnderstandLing/llama-2-7b-chat-zh Chinese	chrystians/llama-2-7b-chat-pl-polish-polski Polish
xezpeleta/llama-2-7b-chat-eu Basque	UnderstandLing/llama-2-7b-chat-bn Bengali	UnderstandLing/llama-2-7b-chat-tr Turkish

Language-specific Mistral chat model adapters

UnderstandLing/Mistral-7B-Instruct-v0.2-nl Dutch	UnderstandLing/Mistral-7B-Instruct-v0.2-es Spanish	UnderstandLing/Mistral-7B-Instruct-v0.2-de German

Language-specific LLaMa2-13B chat model adapters

UnderstandLing/llama-2-13b-chat-nl Dutch	UnderstandLing/llama-2-13b-chat-es Spanish	UnderstandLing/llama-2-13b-chat-fr French

Language-specific Mixtral-8x7B chat model adapters

UnderstandLing/Mixtral-8x7B-Instruct-nl Dutch

Empirical performance

Dutch

<s>[INST] <<SYS>> Je bent een generieke chatbot die altijd in het Nederlands antwoord geeft. <</SYS>> Wat is de hoofdstad van Nederland? [/INST] Amsterdam</s>

<s>[INST] <<SYS>> Je bent een generieke chatbot die altijd in het Nederlands antwoord geeft. <</SYS>> Wat is de hoofdstad van Nederland? [/INST] Amsterdam</s><s>[INST] Hoeveel inwoners heeft die stad? [/INST] 850 duizend inwoners (2023)</s>

<s>[INST] <<SYS>> Je bent een generieke chatbot die altijd in het Nederlands antwoord geeft. <</SYS>> Wie is de minister-president van Nederland? [/INST] Mark Rutte is sinds 2010 minister-president van Nederland. Hij is meerdere keren herkozen.</s>
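
The examples above follow the LLaMa2 chat layout, with the system instruction wrapped into the first user turn. A minimal sketch of a formatter that reproduces this layout is given below. It mirrors only the strings shown here: the function name `format_llama2_thread` is hypothetical, the repository's own thread templates remain the authoritative source, and other foundation models such as LLaMa3 use a different chat format.

```python
def format_llama2_thread(system_prompt, turns):
    """Render (user, assistant) turns in the LLaMa2 chat layout shown in the examples."""
    parts = []
    for index, (user, assistant) in enumerate(turns):
        if index == 0:
            # The system instruction is embedded in the first user turn only.
            user = f"<<SYS>> {system_prompt} <</SYS>> {user}"
        parts.append(f"<s>[INST] {user} [/INST] {assistant}</s>")
    return "".join(parts)

system_prompt = "Je bent een generieke chatbot die altijd in het Nederlands antwoord geeft."
turns = [
    ("Wat is de hoofdstad van Nederland?", "Amsterdam"),
    ("Hoeveel inwoners heeft die stad?", "850 duizend inwoners (2023)"),
]
formatted = format_llama2_thread(system_prompt, turns)
print(formatted)
```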

FAQ

Q: Why do you translate the full OASST1/2 dataset first? Wouldn't it be faster to translate only the highest-ranked threads?
A: While you can gain a lot in processing time by first creating the threads and then translating them, we provide full OASST1/2 translations to the community because we believe they can be useful on their own.
Q: How well do the fine-tunes perform compared to vanilla LLaMa3?
A: While we do not have formal benchmarks, getting LLaMa3 to consistently speak a language other than English is challenging, if not impossible. The non-English text it does produce is often grammatically broken. Our fine-tunes do not show this behavior.
Q: Can I use other frameworks for fine-tuning?
A: Yes, you can. We use Axolotl for training on multi-GPU setups.
Q: Can I mix different translation models?
A: Absolutely. We think it might even increase performance to have the translation done by multiple models. You can achieve this by stopping a translation early and continuing from the checkpoints by rerunning the translate script with a different translation model.