uniflow llm based pdf extraction text cleaning data clustering

Version 0.0.31

uniflow

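uniflow provides a unified LLM interface to extract and transform raw documents.

Document types: Uniflow supports extracting data from PDF, HTML and TXT.
LLM agnostic: Uniflow supports the most commonly used LLMs for text transformation, including
- OpenAI models (GPT-3.5 and GPT-4),
- Google Gemini models (Gemini 1.5, multimodal),
- AWS Bedrock models,
- Hugging Face open-source models, including Mistral-7B,
- Azure OpenAI models, and more.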

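❓ The problems to tackle

Uniflow addresses two key challenges in preparing LLM training data for ML scientists:

First, extracting legacy documents such as PDF and Word files into clean text that LLMs can learn from is tricky, because PDF layouts are complex and information is lost during extraction; and
Second, transforming the extracted data into a format suitable for training LLMs is labor-intensive, because it involves creating datasets with a preferred and a rejected answer for each question to support feedback-based learning techniques.

Hence, we built Uniflow, a unified LLM interface to extract and transform raw documents.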

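Use cases

Uniflow aims to help every data scientist generate their own privacy-preserving, ready-to-use training datasets for LLM fine-tuning, and so make LLM fine-tuning more accessible to everyone.

Check out the Uniflow hands-on solutions:

Extract financial reports (PDFs) into summaries
Extract financial reports (PDFs) and fine-tune a financial LLM
Extract a math book (HTML) into your question-answer dataset
Extract PDFs into your question-answer dataset
Build RLHF/RLAIF preference datasets for LLM fine-tuning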

Installation

Installing uniflow takes about 5-10 minutes if you follow the 3 steps below:

Create a conda environment on your terminal using:

conda create -n uniflow python=3.10 -y
conda activate uniflow  # some OS requires `source activate uniflow`

Install the compatible pytorch based on your OS.
- If you are on a GPU, install pytorch based on your CUDA version. You can find your CUDA version via nvcc -V.
```
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121  # cu121 means cuda 12.1
```
- If you are on a CPU instance,
```
pip3 install torch
```
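Install uniflow:
```
pip3 install uniflow
```
- (Optional) If you are running one of the OpenAI flows, you must set up your OpenAI API key. To do so, create a .env file in your root uniflow folder. Then add the following line to the .env file:
```
OPENAI_API_KEY=YOUR_API_KEY
```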
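- (Optional) If you are running the HuggingfaceModelFlow, you will also need to install the transformers, accelerate, bitsandbytes, and scipy libraries:
```
pip3 install transformers accelerate bitsandbytes scipy
```
- (Optional) If you are running the LMQGModelFlow, you will also need to install the lmqg and spacy libraries:
```
pip3 install lmqg spacy
```

Congrats, you have finished the installation!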

Dev setup

If you are interested in contributing, here is the preliminary development setup.

conda create -n uniflow python=3.10 -y
conda activate uniflow
cd uniflow
pip3 install poetry
poetry install --no-root

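AWS EC2 dev setup

If you are on EC2, you can launch a GPU instance with the following configuration:

EC2 g4dn.xlarge (if you want to run a pretrained LLM with 7B parameters)
Deep Learning AMI PyTorch GPU 2.0.1 (Ubuntu 20.04)
EBS: at least 100G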

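API keys

If you are running one of the OpenAI flows, you must set up your OpenAI API key.

To do so, create a .env file in your root uniflow folder. Then add the following line to the .env file:

OPENAI_API_KEY=YOUR_API_KEY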
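
Before running an OpenAI flow, it can help to confirm that the .env file really defines the key. The sketch below is a minimal, standard-library-only check. The helper names `read_env_file` and `has_openai_key` are hypothetical and not part of uniflow, and the parser assumes simple `KEY=VALUE` lines only. How uniflow itself loads the file is not covered here.

```python
from pathlib import Path


def read_env_file(path: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines from a .env file (hypothetical helper, not part of uniflow)."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def has_openai_key(path: str = ".env") -> bool:
    """Return True if the file exists and defines a non-empty OPENAI_API_KEY."""
    if not Path(path).is_file():
        return False
    return bool(read_env_file(path).get("OPENAI_API_KEY"))
```

Calling `has_openai_key()` from the root uniflow folder returns False when the file is missing or the key is empty, which is usually a quicker signal than a failed API call.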

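Uniflow manual

Overview

To use uniflow, follow three main steps:

1. Pick a Config
   This determines the LLM and the different configurable parameters.
2. Construct your Prompts
   Construct the context that you want to use to prompt your model. You can configure custom instructions and examples using the PromptTemplate class.
3. Run your Flow
   Run the flow on your input data and generate output from your LLM.

Note: We are currently also building Preprocessing flows to help process data from different sources, such as pdf, html, Markdown, and more.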

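1. Config

The Config determines which LLM is used and how the input data is serialized and deserialized. It also has parameters that are specific to the LLM.

Here is a table of the different pre-defined configurations you can use and their corresponding LLMs:

Config	LLM
Config	`gpt-3.5-turbo-1106`
OpenAIConfig	`gpt-3.5-turbo-1106`
HuggingfaceConfig	`mistralai/Mistral-7B-Instruct-v0.1`
LMQGConfig	`lmqg/t5-base-squad-qg-ae`

You can run each config with the defaults, or you can pass custom parameters (such as temperature or batch_size) to the config for your use case. See the Advanced custom configuration section for more details.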

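2. Prompting

By default, uniflow is set up to generate questions and answers based on the Context you pass in.

Here is the default instruction:

Generate one question and its corresponding answer based on the last context in the last example. Follow the format of the examples below to include context, question, and answer in the response

Here are the default few-shot examples:

    context="The quick brown fox jumps over the lazy brown dog.",
    question="What is the color of the fox?",
    answer="brown."

    context="The quick brown fox jumps over the lazy black dog.",
    question="What is the color of the dog?",
    answer="black."

To run with these default instructions and examples, all you need to do is pass a list of Context objects to the flow. uniflow will then generate a custom prompt, containing the instruction and the few-shot examples, for each Context object and send it to the LLM. See the Running the flow section for more details.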
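
To make the idea concrete, the sketch below shows one way an instruction, the few-shot examples, and a new context could be combined into a single prompt string. The function `build_prompt` and its text layout are assumptions for illustration only; uniflow's actual prompt serialization may differ.

```python
DEFAULT_INSTRUCTION = (
    "Generate one question and its corresponding answer based on the last "
    "context in the last example. Follow the format of the examples below to "
    "include context, question, and answer in the response"
)

DEFAULT_EXAMPLES = [
    {
        "context": "The quick brown fox jumps over the lazy brown dog.",
        "question": "What is the color of the fox?",
        "answer": "brown.",
    },
    {
        "context": "The quick brown fox jumps over the lazy black dog.",
        "question": "What is the color of the dog?",
        "answer": "black.",
    },
]


def build_prompt(instruction, examples, context):
    """Combine instruction, few-shot examples, and one new context (illustrative layout only)."""
    blocks = [instruction]
    for example in examples:
        blocks.append("\n".join(f"{key}: {value}" for key, value in example.items()))
    # The new context goes last, leaving question and answer for the LLM to fill in.
    blocks.append(f"context: {context}")
    return "\n\n".join(blocks)


prompt = build_prompt(
    DEFAULT_INSTRUCTION,
    DEFAULT_EXAMPLES,
    "It was a sunny day and the sky color is blue.",
)
print(prompt)
```

Placing the new context last mirrors the wording of the default instruction, which asks the model to work from "the last context in the last example".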

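Context

The Context class is used to pass in the context for the LLM prompt. A Context consists of a context attribute, which is a string of text.

To run uniflow with the default instructions and few-shot examples, you can pass a list of Context objects to the flow. For example:

from uniflow.op.prompt import Context

data = [
    Context(
        context="The quick brown fox jumps over the lazy brown dog.",
    ),
    ...
]

# client is a TransformClient; see the Running the flow section for how to create it.
client.run(data)

For a more detailed overview of running the flow, see the Running the flow section.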

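PromptTemplate

If you want to run with a custom prompt instruction or custom few-shot examples, you can use the PromptTemplate object. It has an instruction attribute and a few-shot examples attribute, which the code in this document passes as few_shot_prompt.

Property	Type	Description
`instruction`	str	Detailed instruction for the LLM.
`few_shot_prompt`	List[Context]	The few-shot examples.

You can overwrite any of the defaults as needed.

To see an example of how to use the PromptTemplate to run uniflow with a custom instruction, few-shot examples, and custom Context fields to generate a summary, check out the openai_pdf_source_10k_summary notebook.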

Running the flow

Once you have decided on your Config and prompting strategy, you can run the flow on the input data.

Import the uniflow Client, Config, and Context objects.

from uniflow.flow.client import TransformClient
from uniflow.flow.config import TransformOpenAIConfig, OpenAIModelConfig
from uniflow.op.prompt import Context

Preprocess your data into chunks to pass into the flow. In the future we will have Preprocessing flows to help with this step, but for now you can use a library of your choice, such as pypdf, to chunk your data.
```
raw_input_context = ["It was a sunny day and the sky color is blue.", "My name is bobby and I am a talent software engineer working on AI/ML."]
```
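
As one possible way to do the chunking without extra dependencies, the sketch below splits already-extracted text on blank lines and merges short paragraphs up to a size limit. The function name `chunk_text` and the `max_chars` limit are hypothetical choices, not part of uniflow. Extracting the text from a PDF (for example with pypdf) is assumed to have happened beforehand.

```python
def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Split text into paragraph-based chunks of at most max_chars characters where possible."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single paragraph longer than max_chars is kept whole.
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

The returned list of strings has the same shape as raw_input_context, so it can be used in its place in the next step.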

Create a list of Context objects to pass your data into the flow.

data = [
    Context(context=c)
    for c in raw_input_context
]

[Optional] If you want to use a custom instruction and/or custom examples, create a PromptTemplate.

from uniflow.op.prompt import PromptTemplate

guided_prompt = PromptTemplate(
    instruction="Generate a one sentence summary based on the last context below. Follow the format of the examples below to include context and summary in the response",
    few_shot_prompt=[
        Context(
            context="When you're operating on the maker's schedule, meetings are a disaster. A single meeting can blow a whole afternoon, by breaking it into two pieces each too small to do anything hard in. Plus you have to remember to go to the meeting. That's no problem for someone on the manager's schedule. There's always something coming on the next hour; the only question is what. But when someone on the maker's schedule has a meeting, they have to think about it.",
            summary="Meetings disrupt the productivity of those following a maker's schedule, dividing their time into impractical segments, while those on a manager's schedule are accustomed to a continuous flow of tasks.",
        ),
    ],
)

Create a Config object to pass into the Client object.

config = TransformOpenAIConfig(
    prompt_template=guided_prompt,
    model_config=OpenAIModelConfig(
        response_format={"type": "json_object"}
    ),
)
client = TransformClient(config)

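Use the client object to run the flow on the input data.
```
output = client.run(data)
```
Process the output data. By default, the LLM output is a list of output dicts, one for each Context passed into the flow. Each dict has a response attribute that holds the LLM response, along with any errors. For example, output[0]['output'][0] would look like this:
```
{
    'response': [{'context': 'It was a sunny day and the sky color is blue.',
    'question': 'What was the color of the sky?',
    'answer': 'blue.'}],
    'error': 'No errors.'
}
```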
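
Given that structure, a small helper can flatten the results into question-answer pairs. The sketch below assumes the output shape described in this section (a list of dicts, each with an 'output' list whose items carry 'response' and 'error'). The function name `collect_qa_pairs` is hypothetical, and the sample data is copied from the example rather than produced by a real run.

```python
def collect_qa_pairs(output):
    """Flatten flow output into (question, answer) tuples, skipping malformed entries."""
    pairs = []
    for item in output:
        for result in item.get("output", []):
            for response in result.get("response", []):
                # Only keep responses that parsed into a dict with both fields.
                if isinstance(response, dict) and "question" in response and "answer" in response:
                    pairs.append((response["question"], response["answer"]))
    return pairs


# Sample data in the shape described in this section (not a real LLM run).
sample_output = [
    {
        "output": [
            {
                "response": [
                    {
                        "context": "It was a sunny day and the sky color is blue.",
                        "question": "What was the color of the sky?",
                        "answer": "blue.",
                    }
                ],
                "error": "No errors.",
            }
        ]
    }
]

print(collect_qa_pairs(sample_output))
```

Skipping entries that are not dicts keeps the helper usable when a response could not be parsed into the expected fields.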

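Examples

For more examples, see the example folder.

Advanced custom configuration

You can also configure the flow by passing custom configurations or arguments to the Config object if you want to further tune specific parameters such as the LLM model, the number of threads, or the temperature.

Every configuration has the following parameters:

Parameter	Type	Description
`prompt_template`	`PromptTemplate`	The template to use for the guided prompt.
`num_threads`	int	The number of threads to use for the flow.
`model_config`	`ModelConfig`	The configuration to pass to the model.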
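
As a rough mental model for num_threads, the sketch below fans a list of inputs out over a thread pool and returns the results in input order. It is an illustration built on Python's concurrent.futures, not uniflow's implementation: `fake_transform` is a stand-in for an LLM call and `run_in_threads` is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_transform(context: str) -> dict:
    """Stand-in for an LLM call; returns the context length instead of a real response."""
    return {"context": context, "length": len(context)}


def run_in_threads(contexts, num_threads=1):
    """Apply fake_transform to each context using num_threads workers, preserving input order."""
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        # executor.map yields results in the order of the inputs.
        return list(executor.map(fake_transform, contexts))


results = run_in_threads(["It was a sunny day.", "My name is bobby."], num_threads=2)
print(results)
```

Threads suit this kind of workload when each call mostly waits on a network response, which is typically the case for hosted LLM APIs.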

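You can further configure the model_config by passing in one of the Model Configs with custom parameters.

Model Config

A Model Config is a configuration that is passed to the base Config object. It determines which LLM model is used and has parameters that are specific to that model.

ModelConfig

The base config is called ModelConfig and has the following parameters:

Parameter	Type	Default	Description
`model_name`	str	gpt-3.5-turbo-1106	OpenAI site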

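OpenAIModelConfig

OpenAIModelConfig inherits from ModelConfig and has the following additional parameters:

Parameter	Type	Default	Description
`num_calls`	int	1	The number of calls to make to the OpenAI API.
`temperature`	float	1.5	The temperature to use for the OpenAI API.
`response_format`	Dict[str, str]	{"type": "text"}	The response format to use for the OpenAI API. Can be "text" or "json".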

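HuggingfaceModelConfig

HuggingfaceModelConfig inherits from ModelConfig, but overrides the model_name parameter to use the mistralai/Mistral-7B-Instruct-v0.1 model by default.

Parameter	Type	Default	Description
`model_name`	str	mistralai/Mistral-7B-Instruct-v0.1	Hugging Face site
`batch_size`	int	1	The batch size to use for the Hugging Face API.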

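LMQGModelConfig

LMQGModelConfig inherits from ModelConfig, but overrides the model_name parameter to use the lmqg/t5-base-squad-qg-ae model by default.

Parameter	Type	Default	Description
`model_name`	str	lmqg/t5-base-squad-qg-ae	Hugging Face site
`batch_size`	int	1	The batch size to use for the LMQG API.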
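
The inheritance relationship and the defaults listed in the tables above can be pictured with plain dataclasses. The sketch below is an illustration only: the `Sketch*` class names are hypothetical stand-ins rather than uniflow's classes, and the defaults are copied from the tables in this section.

```python
from dataclasses import dataclass, field


@dataclass
class SketchModelConfig:
    """Stand-in for the base model config described above."""
    model_name: str = "gpt-3.5-turbo-1106"


@dataclass
class SketchOpenAIModelConfig(SketchModelConfig):
    """Adds the OpenAI-specific parameters from the table."""
    num_calls: int = 1
    temperature: float = 1.5
    response_format: dict = field(default_factory=lambda: {"type": "text"})


@dataclass
class SketchHuggingfaceModelConfig(SketchModelConfig):
    """Overrides model_name and adds batch_size."""
    model_name: str = "mistralai/Mistral-7B-Instruct-v0.1"
    batch_size: int = 1


@dataclass
class SketchLMQGModelConfig(SketchModelConfig):
    """Overrides model_name and adds batch_size."""
    model_name: str = "lmqg/t5-base-squad-qg-ae"
    batch_size: int = 1


# Overriding only what differs keeps the remaining defaults intact.
custom = SketchOpenAIModelConfig(model_name="gpt-4", temperature=0.5)
print(custom)
```

The pattern to note is that a caller overrides only the fields that differ, and every other parameter keeps the default of its class.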

Custom configuration example

Here is an example of how to pass a custom configuration to the Client object:

from uniflow.flow.client import TransformClient
from uniflow.flow.config import TransformOpenAIConfig, OpenAIModelConfig
from uniflow.op.prompt import Context


contexts = ["It was a sunny day and the sky color is blue.", "My name is bobby and I am a talent software engineer working on AI/ML."]

# Wrap each raw string in a Context object.
data = [
    Context(
        context=c
    )
    for c in contexts
]

# Use the config class that is imported above, with custom model parameters.
config = TransformOpenAIConfig(
    num_threads=2,
    model_config=OpenAIModelConfig(
        model_name="gpt-4",
        num_calls=2,
        temperature=0.5,
    ),
)
client = TransformClient(config)
output = client.run(data)

As you can see, we pass the custom parameters to the OpenAIModelConfig, which is in turn passed to the TransformOpenAIConfig.

Additional information

Version: 0.0.31
Type: other source code
Updated: 2024-12-06
Size: 31.58 MB
Source: GitHub
