hinglish conv dataset

Hinglish Conversation Dataset

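Hinglish is a hybrid of Hindi and English that is widely used in India and draws on the vocabulary and grammar of both languages. Indians often use Hinglish in text conversations. Hinglish text consists mostly of Hindi sentences transliterated into Latin (English) characters. For example: "Aaj ka din bohot acha hai" (roughly, "Today is a very good day").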

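A Hinglish dataset is useful for fine-tuning open-source LLMs such as LLAMA-2 that did not see this kind of data during training. GPT-3 and later models, however, have already seen Hinglish data during training.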

Dataset

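Because GPT-3 and later models saw Hinglish data during training, we use them to generate conversations, which are then post-processed to produce a clean dataset. The GPT prompt used is: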

 I want you to generate a Hinglish conversation between two young Indians - a male and a female. Feel free to assume the names of these young Indians. The conversation should contain 100 dialogues. Conversation should be in the format [Name]: [Message]. Conversation should be strictly in Hinglish. If the conversation happens in English, I will punish you. The conversation should be slightly flirty in nature - ending in a romantic moment. The conversation is around the topic: '{setting}'. Do not change subjects frequently. If possible, talk about a subject at length.

See topics.md for the various conversation contexts supplied to the prompt as '{setting}'.
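
The prompt-filling and post-processing steps can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the README does not describe the post-processing, so the helper names `build_prompt` and `parse_conversation`, the shortened template, the regular expression, and the sample topic are all assumptions made for this example.

```python
import re

# Abbreviated stand-in for the full prompt quoted above (an assumption for
# brevity); only the '{setting}' placeholder matters for this sketch.
PROMPT_TEMPLATE = (
    "I want you to generate a Hinglish conversation between two young Indians. "
    "Conversation should be in the format [Name]: [Message]. "
    "The conversation is around the topic: '{setting}'."
)

# Matches lines of the form "Name: message" or "[Name]: message".
# The name is limited to letters and spaces so that preamble text such as
# "Sure! Here is the conversation:" is not mistaken for a speaker label.
TURN_PATTERN = re.compile(r"^\s*\[?([A-Za-z][A-Za-z ]{0,30}?)\]?\s*:\s*(.+?)\s*$")


def build_prompt(setting: str) -> str:
    """Insert one conversation topic into the prompt template."""
    return PROMPT_TEMPLATE.format(setting=setting)


def parse_conversation(raw: str) -> list[dict[str, str]]:
    """Keep only well-formed dialogue turns from raw model output."""
    turns = []
    for line in raw.splitlines():
        match = TURN_PATTERN.match(line)
        if match:  # blank lines and chatty preambles are dropped
            turns.append(
                {"speaker": match.group(1).strip(), "text": match.group(2)}
            )
    return turns


if __name__ == "__main__":
    # "college canteen" is a made-up topic, not taken from topics.md.
    print(build_prompt("college canteen"))
    sample = "Sure! Here is the conversation:\nRiya: Aaj ka din bohot acha hai\n"
    print(parse_conversation(sample))
```

Storing each turn as a speaker/text pair keeps the data easy to reshape later, for example into the chat format expected by a fine-tuning script. A real cleaning pass would probably also check that each conversation has exactly two speakers and filter out turns written entirely in English.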