Download LLaMa2lang - LLaMa2lang source code download

Now with LLaMa3 support

LLaMa2lang v0.6

This repository contains convenience scripts to fine-tune LLaMa3-8B (or any other foundation model) for chat in any non-English language. The rationale is that LLaMa3 is trained primarily on English data, and although it works to some extent in other languages, its performance there is poor compared to English.

Combine the power of fine-tuning with the power of RAG: check out our RAG Me Up repository on RAG, which can be used on top of models tuned with LLaMa2Lang.

TL;DR

pip install -r requirements.txt

# Translate OASST1 to target language
python translate.py m2m target_lang checkpoint_location

# Combine the checkpoint files into a dataset
python combine_checkpoints.py input_folder output_location

# Finetune
python finetune.py tuned_model dataset_name instruction_prompt

# Optionally finetune with DPO (RLHF)
python finetune_dpo.py tuned_model dataset_name instruction_prompt

# Run inference
python run_inference.py model_name instruction_prompt input

What it does

The process we follow to tune a foundation model such as LLaMa3 for a specific language is as follows:

1. Load a dataset that contains Q&A/instruction pairs.
2. Translate the entire dataset to a given target language.
3. Load the translated dataset and extract threads by recursively selecting, for each prompt, only the highest-ranked answer, and continuing through the subsequent prompts.
4. Turn the threads into prompts following a given (customizable) template.
5. Use QLoRA and PEFT to instruction-tune a base foundation model on this dataset.
   - Use QLoRA and PEFT to fine-tune with DPO to extend the model's capabilities further and teach it preferred answers over rejected ones. Note that your base dataset must contain this information.
   - As an alternative to DPO, you can achieve the same with ORPO.
6. Run inference using the newly trained model.
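
The thread-extraction step can be sketched as follows. This is a minimal illustration, not the repository's actual code: it assumes messages are plain dictionaries using the default column names listed in the finetune.py help (message_id, parent_id, rank, role, text), that a lower rank means a better answer, and that unranked messages sort last. The helper name extract_threads is hypothetical.

```python
from collections import defaultdict


def extract_threads(messages):
    """Return one thread per root prompt, following the best-ranked reply at each level."""
    children = defaultdict(list)
    for msg in messages:
        children[msg["parent_id"]].append(msg)

    def best(candidates):
        # Assumption: lower rank is better; unranked messages sort last.
        return min(candidates, key=lambda m: float("inf") if m["rank"] is None else m["rank"])

    threads = []
    for root in children[None]:
        thread, current = [root], root
        # Walk down the tree, keeping only the best reply at every level.
        while children[current["message_id"]]:
            current = best(children[current["message_id"]])
            thread.append(current)
        threads.append(thread)
    return threads


messages = [
    {"message_id": "a", "parent_id": None, "rank": None, "role": "prompter", "text": "Q1"},
    {"message_id": "b", "parent_id": "a", "rank": 1, "role": "assistant", "text": "A1 (second best)"},
    {"message_id": "c", "parent_id": "a", "rank": 0, "role": "assistant", "text": "A1 (best)"},
    {"message_id": "d", "parent_id": "c", "rank": None, "role": "prompter", "text": "Q2"},
    {"message_id": "e", "parent_id": "d", "rank": 0, "role": "assistant", "text": "A2"},
]
threads = extract_threads(messages)
print([[m["message_id"] for m in t] for t in threads])  # [['a', 'c', 'd', 'e']]
```

The lower-ranked reply "b" is dropped, and the thread continues through the follow-up prompt under the best answer.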

Supported paradigms

Translation

OPUS
M2M
MADLAD
mBART
NLLB
Seamless (Large only)
Tower Instruct (can correct spelling mistakes)

Base datasets

The following have been tested, but potentially more will work

OASST1
OASST2

Supported foundation models

LLaMa3
LLaMa2
Mistral
(Unofficial) Mixtral 8x7B

Roadmap

[L2L-6] Investigate interoperability with other libraries (Axolotl, llamacpp, unsloth)
[L2L-7] Allow for different quantizations next to QLoRA (GGUF, GPTQ, AWQ)
[L2L-10] Support extending the tokenizer and vocabulary

Cost and runtime

The above process can be run entirely on a free Google Colab T4 GPU. The last step, however, only runs successfully with short enough context windows and a batch size of at most 2. In addition, the translation in step 2 takes about 36 hours in total for any given language, so it should be run in multiple sessions if you want to stick with a free Google Colab GPU.

Our fine-tuned models for step 5 were trained using an A40 on vast.ai and cost us less than a dollar per model, completing in about 1.5 hours.
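
The figures above lend themselves to some quick arithmetic. The sketch below takes the 36 hours, 1.5 hours, and "less than a dollar" from the text. The 6 usable GPU hours per free session is purely a hypothetical value chosen for illustration, since the text does not state a session limit.

```python
import math

TRANSLATION_HOURS = 36    # from the text: total translation time per language
FINETUNE_HOURS = 1.5      # from the text: fine-tuning time on an A40
FINETUNE_COST_CAP = 1.0   # from the text: "less than a dollar" per model


def sessions_needed(total_hours, session_hours):
    """Number of separate sessions needed to cover total_hours."""
    return math.ceil(total_hours / session_hours)


def implied_hourly_rate_cap(cost_cap, hours):
    """Upper bound on the hourly price implied by a total cost cap."""
    return cost_cap / hours


# Assumption: a hypothetical limit of 6 usable GPU hours per free session.
print(sessions_needed(TRANSLATION_HOURS, 6))                                  # 6
print(round(implied_hourly_rate_cap(FINETUNE_COST_CAP, FINETUNE_HOURS), 2))   # 0.67
```

In other words, finishing in 1.5 hours for under a dollar implies an hourly rate below roughly 0.67 dollars.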

Usage

Make sure pytorch is installed and working for your environment (use of CUDA preferable): https://pytorch.org/get-started/locally/
Clone the repo and install the requirements.

pip install -r requirements.txt

Translate your base dataset to your designated target language.

 usage: translate.py [-h] [--quant8] [--quant4] [--base_dataset BASE_DATASET] [--base_dataset_text_field BASE_DATASET_TEXT_FIELD] [--base_dataset_lang_field BASE_DATASET_LANG_FIELD]
                    [--checkpoint_n CHECKPOINT_N] [--batch_size BATCH_SIZE] [--max_length MAX_LENGTH] [--cpu] [--source_lang SOURCE_LANG]
                    {opus,mbart,madlad,m2m,nllb,seamless_m4t_v2,towerinstruct} ... target_lang checkpoint_location

Translate an instruct/RLHF dataset to a given target language using a variety of translation models

positional arguments:
  {opus,mbart,madlad,m2m,nllb,seamless_m4t_v2,towerinstruct}
                        The model/architecture used for translation.
    opus                Translate the dataset using HelsinkiNLP OPUS models.
    mbart               Translate the dataset using mBART.
    madlad              Translate the dataset using Google's MADLAD models.
    m2m                 Translate the dataset using Facebook's M2M models.
    nllb                Translate the dataset using Facebook's NLLB models.
    seamless_m4t_v2     Translate the dataset using Facebook's SeamlessM4T-v2 multimodal models.
    towerinstruct       Translate the dataset using Unbabel's Tower Instruct. Make sure your target language is in the 10 languages supported by the model.
  target_lang           The target language. Make sure you use language codes defined by the translation model you are using.
  checkpoint_location   The folder the script will write (JSONized) checkpoint files to. Folder will be created if it doesn't exist.

options:
  -h, --help            show this help message and exit
  --quant8              Optional flag to load the translation model in 8 bits. Decreases memory usage, increases running time
  --quant4              Optional flag to load the translation model in 4 bits. Decreases memory usage, increases running time
  --base_dataset BASE_DATASET
                        The base dataset to translate, defaults to OpenAssistant/oasst1
  --base_dataset_text_field BASE_DATASET_TEXT_FIELD
                        The base dataset's column name containing the actual text to translate. Defaults to text
  --base_dataset_lang_field BASE_DATASET_LANG_FIELD
                        The base dataset's column name containing the language the source text was written in. Defaults to lang
  --checkpoint_n CHECKPOINT_N
                        An integer representing how often a checkpoint file will be written out. To start off, 400 is a reasonable number.
  --batch_size BATCH_SIZE
                        The batch size for a single translation model. Adjust based on your GPU capacity. Default is 10.
  --max_length MAX_LENGTH
                        How much tokens to generate at most. More tokens might be more accurate for lengthy input but creates a risk of running out of memory. Default is unlimited.
  --cpu                 Forces usage of CPU. By default GPU is taken if available.
  --source_lang SOURCE_LANG
                        Source language to select from OASST based on lang property of dataset
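
The --checkpoint_n, --batch_size and checkpoint_location options describe a batch-and-checkpoint loop. A minimal sketch of that idea is shown below, under stated assumptions: translate_batch stands in for whichever translation model is used, the file naming scheme is hypothetical, and the real translate.py may organize its checkpoints differently.

```python
import json
import tempfile
from pathlib import Path


def translate_with_checkpoints(texts, translate_batch, checkpoint_location,
                               checkpoint_n=400, batch_size=10):
    """Translate texts in batches, writing a JSON array every checkpoint_n records."""
    out_dir = Path(checkpoint_location)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it doesn't exist
    buffer, written = [], []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        buffer.extend(translate_batch(batch))
        is_last_batch = start + batch_size >= len(texts)
        if len(buffer) >= checkpoint_n or is_last_batch:
            # Hypothetical naming: the number of records processed so far.
            path = out_dir / f"upto_{start + len(batch)}.json"
            path.write_text(json.dumps(buffer, ensure_ascii=False), encoding="utf-8")
            written.append(path)
            buffer = []
    return written


def fake_translate(batch):
    """Stand-in for a real translation model call."""
    return [{"text": text.upper()} for text in batch]


with tempfile.TemporaryDirectory() as tmp:
    files = translate_with_checkpoints([f"text {i}" for i in range(25)],
                                       fake_translate, tmp,
                                       checkpoint_n=10, batch_size=5)
    print([p.name for p in files])  # ['upto_10.json', 'upto_20.json', 'upto_25.json']
```

Because each checkpoint is a self-contained JSON array, an interrupted run loses at most the records buffered since the last write.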

If you want more parameters for the different translation models, run:

python translate.py [MODEL] -h

Be sure to specify model-specific parameters before the common parameters from the list above. Example calls:

# Using M2M with 4bit quantization and a different batch size to translate Dutch
python translate.py m2m nl ./output_nl --quant4 --batch_size 20

# Using madlad 7B with 8bit quantization for German with different max_length
python translate.py madlad --model_size 7b de ./output_de --quant8 --batch_size 5 --max_length 512

# Be sure to use target language codes that the model you use understands
python translate.py mbart xh_ZA ./output_xhosa
python translate.py nllb nld_Latn ./output_nl
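
The ordering used in the calls above (model name, then model-specific options, then the positional arguments, then common options) can be captured in a small helper. This is only an illustrative sketch: build_translate_command is a hypothetical function, and it uses only flags that appear in the examples above.

```python
import shlex


def build_translate_command(model, target_lang, checkpoint_location,
                            model_args=(), common_args=()):
    """Assemble a translate.py call with model-specific args right after the model name."""
    argv = ["python", "translate.py", model, *model_args,
            target_lang, checkpoint_location, *common_args]
    return shlex.join(argv)


command = build_translate_command(
    "madlad", "de", "./output_de",
    model_args=["--model_size", "7b"],
    common_args=["--quant8", "--batch_size", "5", "--max_length", "512"],
)
print(command)
# python translate.py madlad --model_size 7b de ./output_de --quant8 --batch_size 5 --max_length 512
```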

Combine the JSON arrays from the checkpoint files into a Huggingface Dataset and then either write it to disk or publish it to Huggingface. By default the script tries to write to disk and falls back to publishing to Huggingface if the folder doesn't exist on disk. To publish to Huggingface, make sure your HF_TOKEN environment variable is set up as per the documentation.
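
The combining step can be sketched with the standard library alone. The sketch assumes that every checkpoint file is a .json file holding a single JSON array and that all files sit directly in one folder. The actual combine_checkpoints.py may lay out its input differently, and it additionally writes a Huggingface Dataset, which is not shown here.

```python
import json
import tempfile
from pathlib import Path


def combine_checkpoints(input_folder):
    """Concatenate the JSON arrays of every *.json file in input_folder."""
    records = []
    # Files are read in lexicographic order of their names.
    for path in sorted(Path(input_folder).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            records.extend(json.load(handle))
    return records


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "upto_2.json").write_text(
        json.dumps([{"text": "hallo"}, {"text": "wereld"}]), encoding="utf-8")
    Path(tmp, "upto_3.json").write_text(
        json.dumps([{"text": "dag"}]), encoding="utf-8")
    combined = combine_checkpoints(tmp)

print(len(combined))  # 3
```

A list of dictionaries like this is a convenient intermediate form for building a dataset object afterwards.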

 usage: combine_checkpoints.py [-h] input_folder output_location

Combine checkpoint files from translation.

positional arguments:
  input_folder     The checkpoint folder used in translation, with the target language appended.
                   Example: "./output_nl".
  output_location  Where to write the Huggingface Dataset. Can be a disk location or a Huggingface
                   Dataset repository.

options:
  -h, --help       show this help message and exit

Turn the translated messages into chat/instruct/prompt threads and instruction-tune a foundation model using LoRA and PEFT.

 usage: finetune.py [-h] [--base_model BASE_MODEL] [--base_dataset_text_field BASE_DATASET_TEXT_FIELD] [--base_dataset_rank_field BASE_DATASET_RANK_FIELD] [--base_dataset_id_field BASE_DATASET_ID_FIELD] [--base_dataset_parent_field BASE_DATASET_PARENT_FIELD]
                   [--base_dataset_role_field BASE_DATASET_ROLE_FIELD] [--quant8] [--noquant] [--max_seq_length MAX_SEQ_LENGTH] [--num_train_epochs NUM_TRAIN_EPOCHS] [--batch_size BATCH_SIZE] [--threads_output_name THREADS_OUTPUT_NAME] [--thread_template THREAD_TEMPLATE]
                   [--padding PADDING]
                   tuned_model dataset_name instruction_prompt

Finetune a base instruct/chat model using (Q)LoRA and PEFT

positional arguments:
  tuned_model           The name of the resulting tuned model.
  dataset_name          The name of the dataset to use for fine-tuning. This should be the output of the combine_checkpoints script.
  instruction_prompt    An instruction message added to every prompt given to the chatbot to force it to answer in the target language. Example: "You are a generic chatbot that always answers in English."

options:
  -h, --help            show this help message and exit
  --base_model BASE_MODEL
                        The base foundation model. Default is "NousResearch/Meta-Llama-3-8B-Instruct".
  --base_dataset_text_field BASE_DATASET_TEXT_FIELD
                        The dataset's column name containing the actual text to translate. Defaults to text
  --base_dataset_rank_field BASE_DATASET_RANK_FIELD
                        The dataset's column name containing the rank of an answer given to a prompt. Defaults to rank
  --base_dataset_id_field BASE_DATASET_ID_FIELD
                        The dataset's column name containing the id of a text. Defaults to message_id
  --base_dataset_parent_field BASE_DATASET_PARENT_FIELD
                        The dataset's column name containing the parent id of a text. Defaults to parent_id
  --base_dataset_role_field BASE_DATASET_ROLE_FIELD
                        The dataset's column name containing the role of the author of the text (eg. prompter, assistant). Defaults to role
  --quant8              Finetunes the model in 8 bits. Requires more memory than the default 4 bit.
  --noquant             Do not quantize the finetuning. Requires more memory than the default 4 bit and optional 8 bit.
  --max_seq_length MAX_SEQ_LENGTH
                        The maximum sequence length to use in finetuning. Should most likely line up with your base model's default max_seq_length. Default is 512.
  --num_train_epochs NUM_TRAIN_EPOCHS
                        Number of epochs to use. 2 is default and has been shown to work well.
  --batch_size BATCH_SIZE
                        The batch size to use in finetuning. Adjust to fit in your GPU vRAM. Default is 4
  --threads_output_name THREADS_OUTPUT_NAME
                        If specified, the threads created in this script for finetuning will also be saved to disk or HuggingFace Hub.
  --thread_template THREAD_TEMPLATE
                        A file containing the thread template to use. Default is threads/template_default.txt
  --padding PADDING     What padding to use, can be either left or right.
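
The quantization flags above amount to a three-way choice: 4 bit by default, 8 bit with --quant8, and no quantization with --noquant. The following sketch re-creates just that part of the interface with argparse. It is a hypothetical reconstruction rather than the script's source, and giving --noquant precedence when both flags are passed is an assumption.

```python
import argparse


def build_parser():
    """Hypothetical subset of the finetune.py command line."""
    parser = argparse.ArgumentParser(description="Sketch of the finetune.py quantization flags")
    parser.add_argument("tuned_model")
    parser.add_argument("dataset_name")
    parser.add_argument("instruction_prompt")
    parser.add_argument("--quant8", action="store_true")
    parser.add_argument("--noquant", action="store_true")
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--max_seq_length", type=int, default=512)
    return parser


def quantization_bits(args):
    """Return 4 (default), 8, or None for no quantization."""
    if args.noquant:  # assumption: --noquant wins if both flags are given
        return None
    return 8 if args.quant8 else 4


# "my-model" and "my-dataset" are placeholder names.
args = build_parser().parse_args(
    ["my-model", "my-dataset", "You are a generic chatbot that always answers in Dutch.", "--quant8"])
print(quantization_bits(args), args.batch_size, args.max_seq_length)  # 8 4 512
```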

6.1 [OPTIONAL] Fine-tune using DPO (similar to RLHF)

 usage: finetune_dpo.py [-h] [--base_model BASE_MODEL] [--base_dataset_text_field BASE_DATASET_TEXT_FIELD] [--base_dataset_rank_field BASE_DATASET_RANK_FIELD] [--base_dataset_id_field BASE_DATASET_ID_FIELD] [--base_dataset_parent_field BASE_DATASET_PARENT_FIELD] [--quant8]
                       [--noquant] [--max_seq_length MAX_SEQ_LENGTH] [--max_prompt_length MAX_PROMPT_LENGTH] [--num_train_epochs NUM_TRAIN_EPOCHS] [--batch_size BATCH_SIZE] [--threads_output_name THREADS_OUTPUT_NAME] [--thread_template THREAD_TEMPLATE] [--max_steps MAX_STEPS]
                       [--padding PADDING]
                       tuned_model dataset_name instruction_prompt

Finetune a base instruct/chat model using (Q)LoRA and PEFT using DPO (RLHF)

positional arguments:
  tuned_model           The name of the resulting tuned model.
  dataset_name          The name of the dataset to use for fine-tuning. This should be the output of the combine_checkpoints script.
  instruction_prompt    An instruction message added to every prompt given to the chatbot to force it to answer in the target language. Example: "You are a generic chatbot that always answers in English."

options:
  -h, --help            show this help message and exit
  --base_model BASE_MODEL
                        The base foundation model. Default is "NousResearch/Meta-Llama-3-8B-Instruct".
  --base_dataset_text_field BASE_DATASET_TEXT_FIELD
                        The dataset's column name containing the actual text to translate. Defaults to text
  --base_dataset_rank_field BASE_DATASET_RANK_FIELD
                        The dataset's column name containing the rank of an answer given to a prompt. Defaults to rank
  --base_dataset_id_field BASE_DATASET_ID_FIELD
                        The dataset's column name containing the id of a text. Defaults to message_id
  --base_dataset_parent_field BASE_DATASET_PARENT_FIELD
                        The dataset's column name containing the parent id of a text. Defaults to parent_id
  --quant8              Finetunes the model in 8 bits. Requires more memory than the default 4 bit.
  --noquant             Do not quantize the finetuning. Requires more memory than the default 4 bit and optional 8 bit.
  --max_seq_length MAX_SEQ_LENGTH
                        The maximum sequence length to use in finetuning. Should most likely line up with your base model's default max_seq_length. Default is 512.
  --max_prompt_length MAX_PROMPT_LENGTH
                        The maximum length of the prompts to use. Default is 512.
  --num_train_epochs NUM_TRAIN_EPOCHS
                        Number of epochs to use. 2 is default and has been shown to work well.
  --batch_size BATCH_SIZE
                        The batch size to use in finetuning. Adjust to fit in your GPU vRAM. Default is 4
  --threads_output_name THREADS_OUTPUT_NAME
                        If specified, the threads created in this script for finetuning will also be saved to disk or HuggingFace Hub.
  --thread_template THREAD_TEMPLATE
                        A file containing the thread template to use. Default is threads/template_default.txt
  --max_steps MAX_STEPS
                        The maximum number of steps to run DPO for. Default is -1 which will run the data through fully for the number of epochs but this will be very time-consuming.
  --padding PADDING     What padding to use, can be either left or right.
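
DPO needs a preferred and a rejected answer for each prompt, which is why the rank column matters here. One way such pairs could be derived from ranked replies is sketched below. It is an assumption for illustration, not a description of how finetune_dpo.py selects pairs: the best-ranked answer is taken as chosen and the worst-ranked as rejected, with lower rank meaning better. The prompt/chosen/rejected keys and the helper name are likewise illustrative.

```python
def build_preference_pairs(messages):
    """For each prompt with at least two ranked answers, pair the best with the worst."""
    answers_by_parent = {}
    for msg in messages:
        if msg["role"] == "assistant" and msg["rank"] is not None:
            answers_by_parent.setdefault(msg["parent_id"], []).append(msg)
    prompts = {m["message_id"]: m["text"] for m in messages if m["role"] == "prompter"}

    pairs = []
    for parent_id, answers in answers_by_parent.items():
        if parent_id not in prompts or len(answers) < 2:
            continue  # a single answer gives nothing to contrast
        ranked = sorted(answers, key=lambda m: m["rank"])  # assumption: lower rank is better
        pairs.append({"prompt": prompts[parent_id],
                      "chosen": ranked[0]["text"],
                      "rejected": ranked[-1]["text"]})
    return pairs


messages = [
    {"message_id": "p1", "parent_id": None, "role": "prompter", "rank": None, "text": "Q1"},
    {"message_id": "a1", "parent_id": "p1", "role": "assistant", "rank": 1, "text": "middle answer"},
    {"message_id": "a2", "parent_id": "p1", "role": "assistant", "rank": 0, "text": "best answer"},
    {"message_id": "a3", "parent_id": "p1", "role": "assistant", "rank": 2, "text": "worst answer"},
    {"message_id": "p2", "parent_id": None, "role": "prompter", "rank": None, "text": "Q2"},
    {"message_id": "a4", "parent_id": "p2", "role": "assistant", "rank": 0, "text": "only answer"},
]
pairs = build_preference_pairs(messages)
print(pairs)
```

The second prompt yields no pair because it has only one ranked answer.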

6.1 [OPTIONAL] Fine-tune using ORPO (similar to RLHF)

 usage: finetune_orpo.py [-h] [--base_model BASE_MODEL] [--base_dataset_text_field BASE_DATASET_TEXT_FIELD] [--base_dataset_rank_field BASE_DATASET_RANK_FIELD] [--base_dataset_id_field BASE_DATASET_ID_FIELD] [--base_dataset_parent_field BASE_DATASET_PARENT_FIELD] [--quant8]
                        [--noquant] [--max_seq_length MAX_SEQ_LENGTH] [--max_prompt_length MAX_PROMPT_LENGTH] [--num_train_epochs NUM_TRAIN_EPOCHS] [--batch_size BATCH_SIZE] [--threads_output_name THREADS_OUTPUT_NAME] [--thread_template THREAD_TEMPLATE] [--max_steps MAX_STEPS]
                        [--padding PADDING]
                        tuned_model dataset_name instruction_prompt

Finetune a base instruct/chat model using (Q)LoRA and PEFT using ORPO (RLHF)

positional arguments:
  tuned_model           The name of the resulting tuned model.
  dataset_name          The name of the dataset to use for fine-tuning. This should be the output of the combine_checkpoints script.
  instruction_prompt    An instruction message added to every prompt given to the chatbot to force it to answer in the target language. Example: "You are a generic chatbot that always answers in English."

options:
  -h, --help            show this help message and exit
  --base_model BASE_MODEL
                        The base foundation model. Default is "NousResearch/Meta-Llama-3-8B-Instruct".
  --base_dataset_text_field BASE_DATASET_TEXT_FIELD
                        The dataset's column name containing the actual text to translate. Defaults to text
  --base_dataset_rank_field BASE_DATASET_RANK_FIELD
                        The dataset's column name containing the rank of an answer given to a prompt. Defaults to rank
  --base_dataset_id_field BASE_DATASET_ID_FIELD
                        The dataset's column name containing the id of a text. Defaults to message_id
  --base_dataset_parent_field BASE_DATASET_PARENT_FIELD
                        The dataset's column name containing the parent id of a text. Defaults to parent_id
  --quant8              Finetunes the model in 8 bits. Requires more memory than the default 4 bit.
  --noquant             Do not quantize the finetuning. Requires more memory than the default 4 bit and optional 8 bit.
  --max_seq_length MAX_SEQ_LENGTH
                        The maximum sequence length to use in finetuning. Should most likely line up with your base model's default max_seq_length. Default is 512.
  --max_prompt_length MAX_PROMPT_LENGTH
                        The maximum length of the prompts to use. Default is 512.
  --num_train_epochs NUM_TRAIN_EPOCHS
                        Number of epochs to use. 2 is default and has been shown to work well.
  --batch_size BATCH_SIZE
                        The batch size to use in finetuning. Adjust to fit in your GPU vRAM. Default is 4
  --threads_output_name THREADS_OUTPUT_NAME
                        If specified, the threads created in this script for finetuning will also be saved to disk or HuggingFace Hub.
  --thread_template THREAD_TEMPLATE
                        A file containing the thread template to use. Default is threads/template_default.txt
  --max_steps MAX_STEPS
                        The maximum number of steps to run ORPO for. Default is -1 which will run the data through fully for the number of epochs but this will be very time-consuming.
  --padding PADDING     What padding to use, can be either left or right.

Run inference using the newly created QLoRA model.

 usage: run_inference.py [-h] model_name instruction_prompt input

Script to run inference on a tuned model.

positional arguments:
  model_name          The name of the tuned model that you pushed to Huggingface in the previous
                      step.
  instruction_prompt  An instruction message added to every prompt given to the chatbot to force
                      it to answer in the target language.
  input               The actual chat input prompt. The script is only meant for testing purposes
                      and exits after answering.

options:
  -h, --help          show this help message and exit

Choosing the right translation model

How do I know which translation model to choose for my target language?

We have you covered with our benchmark.py script, which helps make a reasonable guess (the dataset we use is the same one the OPUS models are trained on, so the outcomes always favor OPUS). For usage, see the help of this script below. Models are loaded in 4-bit quantization and run on a small sample of the OPUS books subset.

Be sure to use the most commonly occurring languages in your base dataset as source_language and your target translation language as target_language. For OASST1, for example, be sure to run at least en and es as source languages.

 usage: benchmark.py [-h] [--cpu] [--start START] [--n N] [--max_length MAX_LENGTH] source_language target_language included_models

Benchmark all the different translation models for a specific source and target language to find out which performs best. This uses 4bit quantization to limit GPU usage. Note:
the outcomes are indicative - you cannot assume correctness of the BLEU and CHRF scores but you can compare models against each other relatively.

positional arguments:
  source_language       The source language you want to test for. Check your dataset to see which occur most prevalent or use English as a good start.
  target_language       The target language you want to test for. This should be the language you want to apply the translate script on. Note: in benchmark, we use 2-character
                        language codes, in contrast to translate.py where you need to specify whatever your model expects.
  included_models       Comma-separated list of models to include. Allowed values are: opus, m2m_418m, m2m_1.2b, madlad_3b, madlad_7b, madlad_10b, madlad_7bbt, mbart,
                        nllb_distilled600m, nllb_1.3b, nllb_distilled1.3b, nllb_3.3b, seamless

options:
  -h, --help            show this help message and exit
  --cpu                 Forces usage of CPU. By default GPU is taken if available.
  --start START         The starting offset to include sentences from the OPUS books dataset from. Defaults to 0.
  --n N                 The number of sentences to benchmark on. Defaults to 100.
  --max_length MAX_LENGTH
                        How much tokens to generate at most. More tokens might be more accurate for lengthy input but creates a risk of running out of memory. Default is 512.
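
The benchmark compares models relative to each other on translation scores. The sketch below illustrates that kind of relative comparison with a simplified character n-gram F-score in the spirit of chrF. It is not the scorer benchmark.py uses. The model names, example sentences, and function names are all hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count all character n-grams of length n in text (whitespace included)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def char_ngram_fscore(hypothesis, reference, max_n=6, beta=2.0):
    """Average n-gram precision and recall over orders 1..max_n, combined as F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


reference = "De kat zit op de mat."
outputs = {  # hypothetical model outputs
    "model_a": "De kat zit op de mat.",
    "model_b": "De hond ligt op het bed.",
}
ranked = sorted(outputs, key=lambda name: char_ngram_fscore(outputs[name], reference),
                reverse=True)
print(ranked)  # ['model_a', 'model_b']
```

As the help text notes, only the ordering between models is meaningful, not the absolute values.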

Datasets and models

We have already created and will continue to create numerous datasets and models. Want to help democratize LLMs? Clone the repo, create datasets and models for other languages, and then open a PR.

Translated oasst1 datasets


Dutch UnderstandLing/oasst1_nl	Spanish UnderstandLing/oasst1_es	French UnderstandLing/oasst1_fr	German UnderstandLing/oasst1_de
Catalan xaviviro/oasst1_ca	Portuguese UnderstandLing/oasst1_pt	Arabic HeshamHaroon/oasst-arabic	Italian UnderstandLing/oasst1_it
Russian UnderstandLing/oasst1_ru	Hindi UnderstandLing/oasst1_hi	Chinese UnderstandLing/oasst1_zh	Polish chrystians/oasst1_pl
Japanese UnderstandLing/oasst1_jap	Basque xezpeleta/oasst1_eu	Bengali UnderstandLing/oasst1_bn	Turkish UnderstandLing/oasst1_tr

Language-specific ❗LLaMa3-8B❗ chat model adapters

Make sure you have access to Meta's LLaMa3-8B model and set your HF_TOKEN before using these models.


UnderstandLing/Llama-3-8B-Instruct-nl Dutch	UnderstandLing/Llama-3-8B-Instruct-es Spanish	UnderstandLing/Llama-3-8B-Instruct-fr French	UnderstandLing/Llama-3-8B-Instruct-de German
UnderstandLing/Llama-3-8B-Instruct-pt Portuguese	UnderstandLing/Llama-3-8B-Instruct-it Italian	UnderstandLing/Llama-3-8B-Instruct-hi Hindi	UnderstandLing/Llama-3-8B-Instruct-ru Russian

Translated LLaMa2 thread chat prompt datasets


Dutch UnderstandLing/oasst1_nl_threads	Spanish UnderstandLing/oasst1_es_threads	French UnderstandLing/oasst1_fr_threads	German UnderstandLing/oasst1_de_threads
Catalan xaviviro/oasst1_ca_threads	Portuguese UnderstandLing/oasst1_pt_threads	Arabic HeshamHaroon/oasst-arabic_threads	Italian UnderstandLing/oasst1_it_threads
Russian UnderstandLing/oasst1_ru_threads	Hindi UnderstandLing/oasst1_hi_threads	Chinese UnderstandLing/oasst1_zh_threads	Polish chrystians/oasst1_pl_threads
Japanese UnderstandLing/oasst1_jap_threads	Basque xezpeleta/oasst1_eu_threads	Bengali UnderstandLing/oasst1_bn_threads	Turkish UnderstandLing/oasst1_tr_threads

Language-specific LLaMa2-7B chat model adapters


UnderstandLing/llama-2-7b-chat-nl Dutch	UnderstandLing/llama-2-7b-chat-es Spanish	UnderstandLing/llama-2-7b-chat-fr French	UnderstandLing/llama-2-7b-chat-de German
xaviviro/llama-2-7b-chat-ca Catalan	UnderstandLing/llama-2-7b-chat-pt Portuguese	HeshamHaroon/llama-2-7b-chat-ar Arabic	UnderstandLing/llama-2-7b-chat-it Italian
UnderstandLing/llama-2-7b-chat-ru Russian	UnderstandLing/llama-2-7b-chat-hi Hindi	UnderstandLing/llama-2-7b-chat-zh Chinese	chrystians/llama-2-7b-chat-pl-polish-polski Polish
xezpeleta/llama-2-7b-chat-eu Basque	UnderstandLing/llama-2-7b-chat-bn Bengali	UnderstandLing/llama-2-7b-chat-tr Turkish

Language-specific Mistral chat model adapters


UnderstandLing/Mistral-7B-Instruct-v0.2-nl Dutch	UnderstandLing/Mistral-7B-Instruct-v0.2-es Spanish	UnderstandLing/Mistral-7B-Instruct-v0.2-de German

Language-specific LLaMa2-13B chat model adapters


UnderstandLing/llama-2-13b-chat-nl Dutch	UnderstandLing/llama-2-13b-chat-es Spanish	UnderstandLing/llama-2-13b-chat-fr French

Language-specific Mixtral-8x7B chat model adapters


UnderstandLing/Mixtral-8x7B-Instruct-nl Dutch

Empirical performance

Dutch

<s>[INST] <<SYS>> Je bent een generieke chatbot die altijd in het Nederlands antwoord geeft. <</SYS>> Wat is de hoofdstad van Nederland? [/INST] Amsterdam</s>

<s>[INST] <<SYS>> Je bent een generieke chatbot die altijd in het Nederlands antwoord geeft. <</SYS>> Wat is de hoofdstad van Nederland? [/INST] Amsterdam</s><s>[INST] Hoeveel inwoners heeft die stad? [/INST] 850 duizend inwoners (2023)</s>

<s>[INST] <<SYS>> Je bent een generieke chatbot die altijd in het Nederlands antwoord geeft. <</SYS>> Wie is de minister-president van Nederland? [/INST] Mark Rutte is sinds 2010 minister-president van Nederland. Hij is meerdere keren herkozen.</s>
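
The examples above share one layout: the system instruction appears only in the first turn, and every turn is wrapped in its own start and end markers. A minimal sketch that reproduces this layout is shown below. It hardcodes the format seen in the examples instead of reading a thread template file, and format_llama2_thread is a hypothetical helper name.

```python
def format_llama2_thread(instruction_prompt, turns):
    """Render (prompt, answer) turns in the chat layout shown in the Dutch examples."""
    parts = []
    for index, (prompt, answer) in enumerate(turns):
        # Only the first turn carries the system instruction.
        system = f"<<SYS>> {instruction_prompt} <</SYS>> " if index == 0 else ""
        parts.append(f"<s>[INST] {system}{prompt} [/INST] {answer}</s>")
    return "".join(parts)


rendered = format_llama2_thread(
    "Je bent een generieke chatbot die altijd in het Nederlands antwoord geeft.",
    [
        ("Wat is de hoofdstad van Nederland?", "Amsterdam"),
        ("Hoeveel inwoners heeft die stad?", "850 duizend inwoners (2023)"),
    ],
)
print(rendered)
```

With these inputs the output matches the second Dutch example character for character.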

FAQ

Q: Why do you translate the full OASST1/2 dataset first? Wouldn't it be faster to translate only the highest-ranked threads?
A: While you can gain quite a lot in throughput time by first creating the threads and then translating them, we provide full OASST1/2 translations to the community because we believe they can be useful on their own.
Q: How well do the fine-tunes perform compared to vanilla LLaMa3?
A: While we do not have formal benchmarks, getting LLaMa3 to consistently speak a language other than English is challenging, if not impossible. The non-English text it does produce is often grammatically broken. Our fine-tunes do not show this behavior.
Q: Can I use other frameworks for fine-tuning?
A: Yes, you can. We use Axolotl for training on multi-GPU setups.
Q: Can I mix different translation models?
A: Absolutely. We think having the translation done by multiple models might even increase performance. You can achieve this by stopping a translation early and continuing from the checkpoints, rerunning the translate script with a different translation model.

Funding

We are actively looking for funding to democratize AI and advance its applications. Contact us at [email protected] if you want to invest.
