instruction datasets herunterladen - instruction datasets Quellcode herunterladen

instruction datasets

Anderer Quellcode

1.0.0

Herunterladen

Befehlsoptimierungsdatensätze

Alle verfügbaren Datensätze für die Befehlsoptimierung großer Sprachmodelle

Goldstandard-Datensätze

P3: https://github.com/bigscience-workshop/promptsource, https://huggingface.co/datasets/bigscience/P3
- Sammlung angeforderter englischer Datensätze, die eine Vielzahl von NLP-Aufgaben abdecken
- 2000 Eingabeaufforderungstypen über 270 Datensätze
xP3: https://huggingface.co/datasets/bigscience/xP3mt
- Mischung aus 13 Trainingsaufgaben in 46 Sprachen mit Eingabeaufforderungen in 20 Sprachen (maschinell aus dem Englischen übersetzt)
Natürliche Anweisungen v2: https://github.com/allenai/natural-instructions
- Ein Benchmark von 1.616 verschiedenen NLP-Aufgaben und ihren von Experten verfassten Anweisungen, die 76 verschiedene Aufgabentypen und 55 verschiedene Sprachen abdecken.
Die Flan-Sammlung: https://github.com/google-research/FLAN/tree/main/flan/v2
- Obermenge einiger der Datensätze hier
- 1836 Aufgaben, 15 Mio. Beispiele
Assistent öffnen: https://huggingface.co/datasets/OpenAssistant/oasst1
- Von Menschen kommentierter Konversationskorpus im Assistentenstil, bestehend aus 161.443 Nachrichten, verteilt auf 66.497 Konversationsbäume, in 35 verschiedenen Sprachen, kommentiert mit 461.292 Qualitätsbewertungen
LIMA: 1K hochwertige Anleitung
- https://huggingface.co/datasets/GAIR/lima
databricks-dolly-15k: https://github.com/databrickslabs/dolly/tree/master/data
PRESTO: https://github.com/google-research-datasets/presto
- 550.000 kontextbezogene mehrsprachige Gespräche zwischen Menschen und virtuellen Assistenten
BB3x: https://parl.ai/projects/bb3x/
InstructCTG: https://github.com/MichaelZhouwang/InstructCTG
- Framework für kontrollierte Generierung https://arxiv.org/abs/2304.14293
CrossFit: https://github.com/INK-USC/CrossFit
Aufgabenquelle: https://arxiv.org/abs/2301.05948
ExMix: https://arxiv.org/abs/2111.10952
InstructEval: https://github.com/declare-lab/instruct-eval
M3IT: https://huggingface.co/datasets/MMInstruction/M3IT
- https://arxiv.org/abs/2306.04387
- 2,4 Millionen multimodale Instanzen und 400 Anweisungen für 40 Aufgaben und 80 Sprachen
MIMIC-IT: Multimodale In-Context-Anweisungsoptimierung: https://arxiv.org/abs/2306.05425
MultiInstruct: https://github.com/VT-NLP/MultiInstruct
COLLIE: https://github.com/princeton-nlp/Collie
Mind2Web: Auf dem Weg zu einem generalistischen Agenten für das Web https://osu-nlp-group.github.io/Mind2Web/
Android in the Wild: Ein umfangreicher Datensatz für die Android-Gerätesteuerung: https://github.com/google-research/google-research/tree/master/android_in_the_wild
FLASK: Feinkörnige Sprachmodellbewertung basierend auf Alignment-Skill-Sets https://github.com/kaistAI/FLASK
Safe-RLHF: https://arxiv.org/abs/2310.12773
- https://arxiv.org/pdf/2310.12773.pdf#https%3A//github.com/PKU-Alignment/safe-rlhf
HelpSteer: https://huggingface.co/datasets/nvidia/HelpSteer

Silberstandard/Generiert mit LM

Selbstunterricht: https://github.com/yizhongw/self-instruct
Unnatürliche Anweisungen: https://github.com/orhonovich/unnatural-instructions
Alpaka: https://huggingface.co/datasets/tatsu-lab/alpaca
- Alpaca-Clean: https://github.com/gururise/AlpacaDataCleaned
Code Alpaka: https://github.com/sahil280114/codealpaca
AlpacaGPT3.5Customized: https://huggingface.co/datasets/whitefox44/AlpacaGPT3.5Customized
GPT4All: https://github.com/nomic-ai/gpt4all
- GPT4All-pruned: https://huggingface.co/datasets/Nebulous/gpt4all_pruned
ShareGPT: https://huggingface.co/datasets/RyokoAI/ShareGPT52K
GPTeacher: https://github.com/teknium1/GPTeacher
KAMEL?: https://www.camel-ai.org/
Human ChatGPT-Vergleichskorpus: https://github.com/Hello-SimpleAI/chatgpt-comparison-detection
InstructionWild: https://github.com/XueFuzhao/InstructionWild
Anleitung Tuning mit GPT-4: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM
Guanako: https://huggingface.co/datasets/JosephusCheung/GuanacoDataset
Der LongForm-Datensatz: https://github.com/akoksal/LongForm/tree/main/dataset
- LLM-Anweisungsgenerierung für einen vielfältigen Satz von Korpusbeispielen (27.739 Anweisungen und Langtextpaare)
UltraChat: https://huggingface.co/datasets/stingning/ultrachat
LLaVA Visual Instruct 150K: https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K
- GPT-generierte multimodale Befehlsfolgedaten
GPT4Tools: https://github.com/StevenGrove/GPT4Tools
- Anweisungsdaten zum Durchführen von API-Aufrufen an mehrere multimodale Modelle
LaMini-Anleitung: https://huggingface.co/datasets/MBZUAI/LaMini-instruction
- 2,58 Millionen Paare von Anweisungen und Antworten
Evol-Instruct 70k: https://github.com/nlpxucan/WizardLM
Dynosaurier: https://dynosaur-it.github.io/
Alpaka-Farm: https://github.com/tatsu-lab/alpaca_farm
- https://huggingface.co/datasets/tatsu-lab/alpaca_farm
ign_clean_instruct_dataset_500k: https://huggingface.co/datasets/ignmilton/ign_clean_instruct_dataset_500k
Airoboros: https://github.com/jondurbin/airoboros
UltraFeedback: https://huggingface.co/datasets/openbmb/UltraFeedback
WildChat: Korpus von 570.000 realen Benutzer-ChatGPT-Interaktionen https://wildchat.allen.ai/
Feedback-Sammlung: https://arxiv.org/abs/2310.08491
- https://huggingface.co/datasets/kaist-ai/Feedback-Collection

Präferenzdatensätze (können zum Trainieren des Belohnungsmodells verwendet werden)

HH-RLHF: https://huggingface.co/datasets/Anthropic/hh-rlhf
- Enthält menschliche Bewertungen der Schädlichkeit und Nützlichkeit von Modellergebnissen. Der Datensatz enthält ca. 160.000 von Menschen bewertete Beispiele, wobei jedes Beispiel in diesem Datensatz aus einem Antwortpaar eines Chatbots besteht, von denen eine von Menschen bevorzugt wird.
OpenAI WebGPT: https://huggingface.co/datasets/openai/webgpt_comparisons
- Enthält insgesamt etwa 20.000 Vergleiche, wobei jedes Beispiel eine Frage, ein Paar Modellantworten und Metadaten umfasst. Die Antworten werden von Menschen mit einem Präferenzwert bewertet.
OpenAI-Zusammenfassung: https://huggingface.co/datasets/openai/summarize_from_feedback
- Enthält ca. 93.000 Beispiele. Jedes Beispiel besteht aus Feedback von Menschen zu den von einem Modell generierten Zusammenfassungen. Menschliche Bewerter wählten aus zwei Optionen die bessere Zusammenfassung.
Stanford Human Preferences Dataset (SHP): https://huggingface.co/datasets/stanfordnlp/SHP
- 385.000 kollektive menschliche Präferenzen gegenüber Antworten auf Fragen/Anweisungen in 18 verschiedenen Themenbereichen
Stack-Exchange-Einstellungen: https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences
SLF5K: https://huggingface.co/datasets/JeremyAlain/SLF5K
qa-from-hf: https://github.com/lil-lab/qa-from-hf
Nektar: https://huggingface.co/datasets/berkeley-nest/Nectar
JudgeLM-100K: https://huggingface.co/datasets/BAAI/JudgeLM-100K
UltraFeedback: https://huggingface.co/datasets/openbmb/UltraFeedback

Sonstiges

OIG: https://huggingface.co/datasets/laion/OIG
- Obermenge einiger der Datensätze hier
oa_leet10k: https://huggingface.co/datasets/ehartford/oa_leet10k
- LeetCode-Probleme in mehreren Programmiersprachen gelöst
ProSocial-Dialog: https://huggingface.co/datasets/allenai/prosocial-dialog
ConvoKit: https://convokit.cornell.edu/documentation/datasets.html
CoT-Sammlung: https://github.com/kaist-lklab/CoT-Collection
DialogStudio: https://github.com/salesforce/DialogStudio
Chatbot-Arena-Gespräche https://huggingface.co/datasets/lmsys/chatbot_arena_conversations
lmsys 1M: https://huggingface.co/datasets/lmsys/lmsys-chat-1m
Konversationschroniken: https://conversation-chronicles.github.io/

Expandieren

Zusätzliche Informationen