拉格實驗室

其他源碼

下載

RAGLAB：用於檢索增強生成的模組化且面向研究的統一框架

RAGLAB 是一個模組化、研究導向的開源框架，用於檢索增強生成 (RAG) 演算法。它提供了 6 種現有 RAG 演算法的複製品和具有 10 個基準資料集的綜合評估系統，可實現 RAG 演算法之間的公平比較，並輕鬆擴展以高效開發新演算法、資料集和評估指標。

訊息

2024年10月6日：我們的論文已被EMNLP 2024系統演示接收。您可以在 RAGLAB 中找到我們的論文。
2024.9.9：RAGLAB開源了評估結果中的所有日誌檔案和評估文件？
2024年8月20日：RAGLAB開源了4個型號？
2024年8月6日：RAGLAB發表了？

特徵

全面的 RAG 生態系統：支援從資料收集、培訓到自動評估的整個 RAG 管道。
高階演算法實作：重現 6 種最先進的 RAG 演算法，並具有易於擴展的框架來開發新演算法。
互動模式與評估模式：互動模式專為快速理解演算法而設計。評估模式是專門為再現論文成果和科學研究而設計的。
公平比較平台：提供跨 5 種任務類型和 10 個資料集的 6 種演算法的基準結果。
高效率的檢索器用戶端：提供本地API用於並行存取和緩存，平均延遲低於1秒。
多功能生成器支援：與 70B+ 模型、VLLM 和量化技術相容。
靈活的教學實驗室：針對各種 RAG 場景的可自訂教學範本。

安裝環境

開發環境：pytorch:2.0.1-py3.10-cuda11.8.0-devel-ubuntu22.04
安裝迷你康達
git 克隆 RAGLAB
```
 https://github.com/fate-ubw/RAGLAB.git
```

從 yml 檔案建立環境

cd 拉格實驗室
conda env create -f 環境.yml

手動安裝 flash-attn、en_core_web_sm、punkt

 pip install flash-attn==2.2
python -m spacy 下載 en_core_web_sm
python -m nltk.downloader punkt

型號

raglab需要幾個模型請下載

cd 拉格實驗室
mkdir 模型cd 模型
mkdir output_models# 檢索器模型mkdir colbertv2.0
Huggingface-cli 下載 colbert-ir/colbertv2.0 --local-dir colbertv2.0/ --local-dir-use-symlinks False
mkdir 設計者-msmarco
Huggingface-cli 下載 facebook/contriever-msmarco --local-dir contriever-msmarco/ --local-dir-use-symlinks False# 微調產生器# 8B modelmkdir Llama3-8B-baseline
Huggingface-cli 下載 RAGLAB/Llama3-8B-baseline --local-dir Llama3-8B-baseline/ --local-dir-use-symlinks False
mkdir selfrag_llama3_8b-epoch_0_1
Huggingface-cli 下載 RAGLAB/selfrag_llama3-8B --local-dir selfrag_llama3_8b-epoch_0_1/ --local-dir-use-symlinks False# 70B modelmkdir Llama3-70B-baseline-adapter
Huggingface-cli 下載 RAGLAB/Llama3-70B-baseline-adapter --local-dir Llama3-70B-baseline-adapter/ --local-dir-use-symlinks False
mkdir selfrag_llama3_70B-適配器
Huggingface-cli 下載 RAGLAB/selfrag_llama3-70B-adapter --local-dir selfrag_llama3_70B-adapter/ --local-dir-use-symlinks False
mkdir Meta-Llama-3-70B
Huggingface-cli 下載meta-llama/Meta-Llama-3-70B --local-dir Meta-Llama-3-70B/ --local-dir-use-symlinks False#finetune 和 LoRAmkdir Meta-Llama-3 的基本模型-8B
Huggingface-cli 下載meta-llama/Meta-Llama-3-8B --local-dir Meta-Llama-3-8B/ --local-dir-use-symlinks False# ALCE Metric Modelsmkdir gpt2-large
Huggingface-cli 下載 openai-community/gpt2-large --local-dir gpt2-large/ --local-dir-use-symlinks False
mkdir 羅伯塔-大隊
Huggingface-cli 下載 gaotianyu1350/roberta-large-squad --local-dir roberta-large-squad/ --local-dir-use-symlinks False
mkdir t5_xxl_true_nli_mixture
Huggingface-cli download google/t5_xxl_true_nli_mixture --local-dir t5_xxl_true_nli_mixture/ --local-dir-use-symlinks False#factscore模型我們使用gpt3.5dkselfself-slperkkk sm來自官方模型的來自官方模型我們的
Huggingface-cli 下載 selfrag/selfrag_llama2_7b --local-dir selfrag_llama2_7b/ --local-dir-use-symlinks False# 你可以從 Huggingface 下載其他模型作為生成器

完整數據

如果您只需要了解不同演算法如何運作，RAGLAB開發的互動模式可以滿足您的需求。
如果你想重現論文中的結果，你需要從Hugging Face上下載所有需要的數據，包括訓練數據、知識數據和評估數據。我們已經為您打包了所有數據，因此您只需下載即可使用。
```
 cd 拉格實驗室
Huggingface-cli 下載 RAGLAB/data --local-dir data --repo-type 資料集
```

在互動模式下運行 Raglab

交互模式專為快速理解演算法而設計。在互動模式下，您可以非常快速地執行各種演算法，以了解不同演算法的推理過程，而無需下載任何額外的資料。

設定科爾伯特伺服器

raglab 中整合的所有演算法都包含兩種模式： interact和evaluation 。測試階段以interact方式演示，僅供演示和教育？

筆記

由於colbert對絕對路徑的要求，需要修改設定檔中的index_dbPath和text_dbPath以使用絕對路徑。

修改設定檔中的index_dbPath和text_dbPath ：colbert_server-10samples.yaml

 index_dbPath：/your_root_path/RAGLAB/data/retrieval/colbertv2.0_embedding/wiki2023-10samples
text_dbPath：/your_root_path/RAGLAB/data/retrieval/colbertv2.0_passages/wiki2023-10samples/enwiki-20230401-10samples.tsv

運行科爾伯特伺服器

cd 拉格實驗室
sh 運行/colbert_server/colbert_server-10samples.sh

筆記

此時colbert embedding會提示因路徑錯誤，需要重新處理colbert embedding。請輸入yes ，然後raglab將自動幫助您處理嵌入並啟動colbert伺服器。

現在請打開另一個終端並嘗試請求 colbert 伺服器
```
cd 拉格實驗室
sh 運行/colbert_server/ask_api.sh
```

如果回傳結果，則表示colbert伺服器啟動成功！？

運行 selfrag（簡短形式和自適應檢索）互動模式測試 10 樣本嵌入

cd 拉格實驗室
sh run/rag_inference/3-selfrag_reproduction-interact-short_form-adaptive_retrieval.sh

恭喜！
在raglab中，每個演算法在互動模式下內建了10個查詢，這些查詢是從不同的基準測試中採樣的

重現紙質結果

筆記

記得在運行論文結果之前下載 wiki2018 知識資料庫和模型

檢索伺服器和API

由於colbert對絕對路徑的要求，需要修改設定檔中的index_dbPath和text_dbPath並處理wiki2018嵌入資料庫

cd RAGLAB/config/colbert_server
vim colbert_server.yaml
index_dbPath：{your_root_path}/RAGLAB/data/retrieval/colbertv2.0_embedding/wiki2018
text_dbPath：{your_root_path}/RAGLAB/data/retrieval/colbertv2.0_passages/wiki2018/wiki2018.tsv

 vim /data/retrieval/colbertv2.0_embedding/wiki2018/indexes/wiki2018/metadata.json#更改根路徑，其他參數不需要修改"collection": "/{your_root_path}/RAGLAB/data/retrieval/colbertv2. 0_passages/wiki_path}/RAGLAB/data/retrieval/colbertv2. 0_passages/wiki2018 /wiki2018.tsv","實驗": "/{your_root_path}/RAGLAB/data/retrieval/colbertv2.0_embedding/wiki2018",

修改wiki2018嵌入原始檔中綁定的絕對路徑
修改設定檔中的路徑

注意：colbert_server 至少需要 60GB 內存

cd 拉格實驗室
sh 運行/colbert_server/colbert_server.sh

打開另一個終端測試您的 ColBERT 伺服器

cd 拉格實驗室
sh 運行/colbert_server/ask_api.sh

ColBERT伺服器啟動成功！？

自動GPU調度器

推理實驗需要並行運行數百個腳本，需要使用自動GPU調度器為並行的不同bash腳本自動分配GPU。
安裝simple_gpu_scheduler
```
 pip 安裝 simple_gpu_scheduler
```

在一條線上運行數百個實驗？

 cd 拉格實驗室
simple_gpu_scheduler --gpus 0,1,2,3,4,5,6,7 < auto_gpu_scheduling_scripts/auto_run-llama3_8b-baseline-scripts.txt#其他腳本可以用同樣的方法運行

如何寫your_script.txt？

 # auto_inference_selfreg-7b.txtsh run/rag_inference/selfrag_reproduction/selfrag_reproduction-evaluation-short_form-PubHealth-adaptive_retrieval-pregiven_passages.sh
sh run/rag_inference/selfrag_reproduct/selfrag_reproduct-evaluation-short_form-PubHealth-always_retrieval-pregiven_passages.sh

這是一個例子

ALCE 和 Factscore 評估

RAGLAB包括3種經典的評估方法：準確度、F1和EM（精確匹配）。這3種方法計算簡單，可以在推理過程中動態計算。然而，ALCE 和 Factscore 這兩個高階指標需要在評估之前完成推理過程。

ALCE ：RAGLAB 已將 ALCE 儲存庫整合到 RAGLAB 中。您只需在設定檔中設定推理結果的路徑即可。

 cd RAGLABcd run/ALCE/# 變更每個 sh 檔案中推理產生檔案的路徑# 例如：# python ./ALCE/eval.py --f './data/eval_results/ASQA/{your_input_file_path}.jsonl' # - -mauve # --qasimple_gpu_scheduler --gpus 0,1,2,3,4,5,6,7 < auto_gpu_scheduling_scripts/auto_eval_ALCE.txt

評估結果將與輸入檔在同一目錄下，檔名後綴為.score
Factscore ：Factscore 環境需要安裝torch 1.13.1 ，這與 RAGLAB 訓練和推理模組所需的 flash-attn 版本衝突。因此，RAGLAB目前無法整合Factscore環境，因此使用者需要單獨安裝Factscore環境進行評估。

安裝Factscore環境後，請修改bash檔案中推理結果的路徑

cd RAGLAB/run/Factscore/# 變更每個 sh 檔案中推理產生檔案的路徑# 例如：# python ./FActScore/factscore/factscorer.py # --input_path './data/eval_results/Factscore/{your_input_file_path './data/eval_results/Factscore/{your_input_file_path }. jsonl' # --model_name "retrieval+ChatGPT"# --openai_key ./api_keys.txt # --data_dir ./data/retrieval/colbertv2.0_passages/wiki2023 # --verbosesimple_gpu_scheduler --gpus 0,1,2, 3, 4,5,6,7 < auto_gpu_scheduling_scripts/auto_eval_Factscore.txt

評估結果將與輸入檔案在同一目錄下，檔案名稱後綴為_factscore_output.json

筆記

在Factscore評估過程中，我們使用GPT-3.5作為評估模型，因此無需下載本地模型。如果需要使用本地模型評估Factscore，請參考Factscore

從源頭處理知識資料庫

如果您想自行處理知識庫，請參考以下步驟。 RAGLAB已將處理後的知識庫上傳至Hugging Face
文件：process_wiki.md

火車模型

本節介紹 RAGLAB 中訓練模型的過程。您可以從 HuggingFace? 下載所有預訓練模型，或使用下面的教學從頭開始訓練。
所有數據提供了微調所需的所有數據。
文件：train_docs.md

引文

如果您發現此儲存庫有用，請引用我們的工作。

@inproceedings{zhang-etal-2024-raglab,
    title = "{RAGLAB}: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation",
    author = "Zhang, Xuanwang and
      Song, Yunze and
      Wang, Yidong and
      Tang, Shuyun and
      others",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = dec,
    year = "2024",
    publisher = "Association for Computational Linguistics",
}