uniflow llm based pdf extraction text cleaning data clustering

0.0.31

uniflow

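uniflow provides a unified LLM interface to extract and transform raw documents.

Document types: Uniflow supports extracting data from PDF, HTML, and TXT.
LLM agnostic: Uniflow supports the most commonly used LLMs for text transformation, including
- OpenAI models (GPT-3.5 and GPT-4),
- Google Gemini models (Gemini 1.5, multimodal),
- AWS Bedrock models,
- Hugging Face open-source models, including Mistral-7B,
- Azure OpenAI models, and others.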

❓ The Problems to Tackle

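Uniflow addresses two key challenges that ML scientists face when preparing LLM training data:

First, extracting legacy documents such as PDF and Word files into clean text that LLMs can learn from is tricky, because PDF layouts are complex and information is lost during extraction.
Second, transforming the extracted data into a format suitable for training LLMs is labor-intensive: it involves creating datasets with a preferred and a rejected answer for each question to support feedback-based learning techniques.

Hence, we built Uniflow, a unified LLM interface to extract and transform raw documents.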

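Use Cases

Uniflow aims to help every data scientist generate their own privacy-preserving, ready-to-use training datasets for LLM fine-tuning, and thereby make LLM fine-tuning more accessible to everyone.

Check out the Uniflow hands-on solutions:

- Extract financial reports (PDF) into summaries
- Extract financial reports (PDF) and fine-tune a financial LLM
- Extract a math book (HTML) into your question-answer dataset
- Extract PDFs into your question-answer dataset
- Build RLHF/RLAIF preference datasets for LLM fine-tuning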

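Installation

Installing `uniflow` takes about 5-10 minutes if you follow the 3 steps below:

Create a conda environment in your terminal using:

conda create -n uniflow python=3.10 -y
conda activate uniflow  # some OS requires `source activate uniflow`

Install the PyTorch build that is compatible with your OS.
- If you are using a GPU, install PyTorch based on your CUDA version. You can find your CUDA version via `nvcc -V`.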
```
 pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121  # cu121 means cuda 12.1
```
- If you are using a CPU instance:
```
 pip3 install torch
```
Install `uniflow`:
```
 pip3 install uniflow
```
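- (Optional) If you are running one of the OpenAI flows, you must set up your OpenAI API key. To do so, create a `.env` file in your root uniflow folder, then add the following line to the `.env` file: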
```
 OPENAI_API_KEY=YOUR_API_KEY
```
- (Optional) If you are running the `HuggingfaceModelFlow`, you will also need to install the `transformers`, `accelerate`, `bitsandbytes`, and `scipy` libraries:
```
 pip3 install transformers accelerate bitsandbytes scipy
```
- (Optional) If you are running the `LMQGModelFlow`, you will also need to install the `lmqg` and `spacy` libraries:
```
 pip3 install lmqg spacy
```
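
After working through the steps above, it can help to confirm which packages are actually importable in the active environment. The following is a minimal sketch using only the standard library; the helper name `check_packages` is hypothetical, and it assumes each package's import name matches its pip name as listed in the install commands.

```python
from importlib.util import find_spec

# Import names assumed to match the pip package names used in the steps above.
REQUIRED = ["torch", "uniflow"]
OPTIONAL = ["transformers", "accelerate", "bitsandbytes", "scipy", "lmqg", "spacy"]


def check_packages(names):
    """Return {name: bool} telling whether each top-level package can be found."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Print a short report; optional packages only matter for the flows that need them.
    for name, found in check_packages(REQUIRED + OPTIONAL).items():
        print(f"{name:15s} {'ok' if found else 'missing'}")
```

A missing optional package is only a problem if you plan to run the flow that depends on it.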

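Congratulations, you have finished the installation!

Dev Setup

If you are interested in contributing, here is the preliminary development setup.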

 conda create -n uniflow python=3.10 -y
conda activate uniflow
cd uniflow
pip3 install poetry
poetry install --no-root

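AWS EC2 Dev Setup

If you are on EC2, you can launch a GPU instance with the following configuration:

- EC2 `g4dn.xlarge` (if you want to run a pretrained LLM with 7B parameters)
- Deep Learning AMI PyTorch GPU 2.0.1 (Ubuntu 20.04)
- EBS: at least 100G

API Keys

If you are running one of the OpenAI flows, you must set up your OpenAI API key.

To do so, create a `.env` file in your root uniflow folder, then add the following line to the `.env` file: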

 OPENAI_API_KEY=YOUR_API_KEY
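
Before running an OpenAI flow, you may want to check that the key is really in place. The sketch below is not part of uniflow; `read_env_file` and `openai_key_is_set` are hypothetical helpers that parse simple `KEY=VALUE` lines with the standard library, under the assumption that the `.env` file has the one-line format shown above.

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def openai_key_is_set(path=".env"):
    """True if a non-placeholder OPENAI_API_KEY exists in the environment or the file."""
    key = os.environ.get("OPENAI_API_KEY") or read_env_file(path).get("OPENAI_API_KEY", "")
    return bool(key) and key != "YOUR_API_KEY"
```

Calling `openai_key_is_set()` from the root uniflow folder returns `False` if the file is missing or still contains the `YOUR_API_KEY` placeholder.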

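Uniflow Manual

Overview

To use `uniflow`, follow three main steps:

1. Pick a `Config`
   This determines the LLM and the different configurable parameters.
2. Construct your `Prompts`
   Construct the context that you want to use to prompt your model. You can configure custom instructions and examples using the `PromptTemplate` class.
3. Run your `Flow`
   Run the flow on your input data and generate output from your LLM.

Note: We are currently building the `Preprocessing` flow to help process data from different sources, such as `pdf`, `html`, `Markdown`, and more.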

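1. Config

The `Config` determines which LLM is used and how the input data is serialized and deserialized. It also has parameters that are specific to the LLM.

Here is a table of the different predefined configurations you can use and their corresponding LLMs:

| Config | LLM |
| --- | --- |
| Config | `gpt-3.5-turbo-1106` |
| OpenAIConfig | `gpt-3.5-turbo-1106` |
| HuggingfaceConfig | `mistralai/Mistral-7B-Instruct-v0.1` |
| LMQGConfig | `lmqg/t5-base-squad-qg-ae` |

You can run each config with its defaults, or you can pass custom parameters (such as `temperature` or `batch_size`) to the config for your use case. See the Advanced Custom Configuration section for more details.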

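2. Prompting

By default, `uniflow` is set up to generate questions and answers based on the `Context` you pass in. To do so, it has a default instruction and a few-shot examples that it uses to guide the LLM.

Here is the default instruction: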

 Generate one question and its corresponding answer based on the last context in the last example. Follow the format of the examples below to include context, question, and answer in the response

Here are the default few-shot examples:

    context="The quick brown fox jumps over the lazy brown dog.",
    question="What is the color of the fox?",
    answer="brown."

    context="The quick brown fox jumps over the lazy black dog.",
    question="What is the color of the dog?",
    answer="black."

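To run with these default instructions and examples, all you need to do is pass a list of `Context` objects to the flow. `uniflow` will then generate a custom prompt with the instruction and few-shot examples for each `Context` object and send it to the LLM. See the Running the Flow section for more details.

Context

The `Context` class is used to pass in the context for the LLM prompt. A `Context` consists of a `context` property, which is a string of text.

To run `uniflow` with the default instruction and few-shot examples, you can pass a list of `Context` objects to the flow. For example: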
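
To make the idea of a prompt built from an instruction, few-shot examples, and a new context more concrete, here is a small standalone sketch. It does not reproduce uniflow's actual prompt format, which this document does not show; `render_prompt` and its `key: value` layout are assumptions made purely for illustration.

```python
# The default instruction and few-shot examples quoted in this section.
DEFAULT_INSTRUCTION = (
    "Generate one question and its corresponding answer based on the last "
    "context in the last example. Follow the format of the examples below to "
    "include context, question, and answer in the response"
)

DEFAULT_EXAMPLES = [
    {"context": "The quick brown fox jumps over the lazy brown dog.",
     "question": "What is the color of the fox?", "answer": "brown."},
    {"context": "The quick brown fox jumps over the lazy black dog.",
     "question": "What is the color of the dog?", "answer": "black."},
]


def render_prompt(instruction, examples, context):
    """Hypothetical layout: the instruction, each example, then the new context."""
    parts = [instruction]
    for example in examples:
        parts.append("\n".join(f"{key}: {value}" for key, value in example.items()))
    # The new context comes last, so the model completes the question and answer for it.
    parts.append(f"context: {context}")
    return "\n\n".join(parts)


prompt = render_prompt(DEFAULT_INSTRUCTION, DEFAULT_EXAMPLES,
                       "It was a sunny day and the sky color is blue.")
```

One such prompt would be produced per `Context` object, which is why the instruction refers to "the last context in the last example".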

 from uniflow.op.prompt import Context

data = [
    Context(
        context="The quick brown fox jumps over the lazy brown dog.",
    ),
    ...
]

client.run(data)

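For a more detailed overview of running the flow, see the Running the Flow section.

PromptTemplate

If you want to run with a custom prompt instruction or custom few-shot examples, you can use the `PromptTemplate` object. It has `instruction` and `examples` properties.

| Property | Type | Description |
| --- | --- | --- |
| `instruction` | str | Detailed instructions for the LLM |
| `examples` | List[Context] | The few-shot examples. |

You can overwrite any of the defaults as needed.

To see an example of how to use the `PromptTemplate` to run `uniflow` with a custom `instruction`, few-shot examples, and custom `Context` fields to generate a summary, check out the openai_pdf_source_10k_summary notebook.

Running the Flow

Once you have decided on your `Config` and prompting strategy, you can run the flow on your input data.

Import the `uniflow` `Client`, `Config`, and `Context` objects.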

 from uniflow.flow.client import TransformClient
from uniflow.flow.config import TransformOpenAIConfig, OpenAIModelConfig
from uniflow.op.prompt import Context

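Preprocess your data into chunks to pass into the flow. In the future we will have `Preprocessing` flows to help with this step, but for now you can use a library of your choice, such as `pypdf`, to chunk your data.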
```
 raw_input_context = ["It was a sunny day and the sky color is blue.", "My name is bobby and I am a talent software engineer working on AI/ML."]
```
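
If your text is already extracted (for example with a PDF library of your choice), chunking can be as simple as grouping paragraphs up to a size limit. The sketch below is one possible approach, not uniflow's own preprocessing; `chunk_text` and the `max_chars` limit are hypothetical names chosen for this example.

```python
def chunk_text(text, max_chars=500):
    """Greedily pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if not current or len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this paragraph would exceed the limit, so start a new chunk.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

The returned list of strings can be used in place of `raw_input_context` in the next step.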

Create a list of `Context` objects to pass your data into the flow.

 data = [
    Context(context=c)
    for c in raw_input_context
]

[Optional] If you want to use a custom instruction and/or examples, create a `PromptTemplate`.

from uniflow.op.prompt import PromptTemplate

guided_prompt = PromptTemplate(
    instruction="Generate a one sentence summary based on the last context below. Follow the format of the examples below to include context and summary in the response",
    few_shot_prompt=[
        Context(
            context="When you're operating on the maker's schedule, meetings are a disaster. A single meeting can blow a whole afternoon, by breaking it into two pieces each too small to do anything hard in. Plus you have to remember to go to the meeting. That's no problem for someone on the manager's schedule. There's always something coming on the next hour; the only question is what. But when someone on the maker's schedule has a meeting, they have to think about it.",
            summary="Meetings disrupt the productivity of those following a maker's schedule, dividing their time into impractical segments, while those on a manager's schedule are accustomed to a continuous flow of tasks.",
        ),
    ],
)

Create a `Config` object to pass to the `Client` object.

 config = TransformOpenAIConfig(
    prompt_template=guided_prompt,
    model_config=OpenAIModelConfig(
        response_format={"type": "json_object"}
    ),
)
client = TransformClient(config)

Use the `client` object to run the flow on the input data.
```
 output = client.run(data)
```
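Process the output data. By default, the LLM output is a list of output dicts, one for each `Context`. Each dict has a `response` property that holds the LLM response, along with any errors. For example, `output[0]['output'][0]` would look like this: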
```
 {
    'response': [{'context': 'It was a sunny day and the sky color is blue.',
    'question': 'What was the color of the sky?',
    'answer': 'blue.'}],
    'error': 'No errors.'
}
```
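
To work with many results at once, the nested structure can be flattened into a plain list. The sketch below operates on ordinary dicts shaped like the example above; `collect_responses` is a hypothetical helper, and it assumes every record has an `output` list whose entries carry `response` and `error` keys, with `'No errors.'` marking success as in the example.

```python
def collect_responses(output):
    """Split flow output into a flat list of response items and a list of errors."""
    items, errors = [], []
    for index, record in enumerate(output):
        for result in record.get("output", []):
            items.extend(result.get("response", []))
            error = result.get("error")
            # Assumption: the literal string 'No errors.' means the call succeeded.
            if error and error != "No errors.":
                errors.append((index, error))
    return items, errors


# A record shaped like the example output shown above.
sample_output = [
    {"output": [{
        "response": [{"context": "It was a sunny day and the sky color is blue.",
                      "question": "What was the color of the sky?",
                      "answer": "blue."}],
        "error": "No errors.",
    }]},
]
items, errors = collect_responses(sample_output)
```

Here `items` holds one question-answer dict and `errors` is empty; each error entry pairs the index of the input `Context` with its message.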

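Examples

For more examples, see the example folder.

Advanced Custom Configuration

If you want to further tune specific parameters such as the LLM model, the number of threads, or the temperature, you can also configure the flow by passing custom configurations or arguments to the `Config` object.

Every configuration has the following parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| `prompt_template` | `PromptTemplate` | The template to use for the guided prompt. |
| `num_threads` | int | The number of threads to use for the flow. |
| `model_config` | `ModelConfig` | The configuration to pass to the model. |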
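
As a rough illustration of what a thread count buys you, the sketch below fans a function out over a list of inputs with a fixed-size thread pool. It is a generic standard-library pattern, not uniflow's internal implementation (which this document does not describe); `run_in_threads` and `process_one` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_threads(process_one, items, num_threads=1):
    """Apply process_one to every item with num_threads workers, keeping input order."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map yields results in the same order as the inputs.
        return list(pool.map(process_one, items))


# Example: a stand-in for a slow, I/O-bound call such as a request to an LLM API.
results = run_in_threads(str.upper, ["first chunk", "second chunk"], num_threads=2)
```

Because calls to a hosted LLM mostly wait on the network, several threads can overlap that waiting time; API rate limits still cap the useful number.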

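You can further configure the `model_config` by passing in one of the Model Configs with custom parameters.

Model Config

A Model Config is a configuration that is passed to the base `Config` object. It determines which LLM model is used and has parameters that are specific to that model.

ModelConfig

The base config is called `ModelConfig` and has the following parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model_name` | str | gpt-3.5-turbo-1106 | OpenAI site |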

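OpenAIModelConfig

The `OpenAIModelConfig` inherits from the `ModelConfig` and has the following additional parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `num_calls` | int | 1 | The number of calls to make to the OpenAI API. |
| `temperature` | float | 1.5 | The temperature to use for the OpenAI API. |
| `response_format` | Dict[str, str] | {"type": "text"} | The response format to use for the OpenAI API. The type can be "text" or "json_object". |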

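HuggingfaceModelConfig

The `HuggingfaceModelConfig` inherits from the `ModelConfig`, but overrides the `model_name` parameter to use the `mistralai/Mistral-7B-Instruct-v0.1` model by default.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model_name` | str | mistralai/Mistral-7B-Instruct-v0.1 | Hugging Face site |
| `batch_size` | int | 1 | The batch size to use for the Hugging Face API. |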

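LMQGModelConfig

The `LMQGModelConfig` inherits from the `ModelConfig`, but overrides the `model_name` parameter to use the `lmqg/t5-base-squad-qg-ae` model by default.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model_name` | str | lmqg/t5-base-squad-qg-ae | Hugging Face site |
| `batch_size` | int | 1 | The batch size to use for the LMQG API. |

Custom Configuration Example

Here is an example of how to pass a custom configuration to the `Client` object: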

 from uniflow.flow.client import TransformClient
from uniflow.flow.config import TransformOpenAIConfig, OpenAIModelConfig
from uniflow.op.prompt import Context


contexts = ["It was a sunny day and the sky color is blue.", "My name is bobby and I am a talent software engineer working on AI/ML."]

data = [
    Context(
        context=c
    )
    for c in contexts
]

config = TransformOpenAIConfig(
  num_threads=2,
  model_config=OpenAIModelConfig(
    model_name="gpt-4",
    num_calls=2,
    temperature=0.5,
  ),
)
client = TransformClient(config)
output = client.run(data)

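As you can see, we pass the custom parameters to the `OpenAIModelConfig`, which is in turn passed to the `TransformOpenAIConfig` configuration, according to our needs.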
