FlashRank.jl (v0.4.1)

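FlashRank.jl is inspired by the FlashRank Python package, originally developed by Prithiviraj Damodaran. It uses model weights from Prithiviraj's HF repo and Svilupp's HF repo to provide a fast and efficient way to rank documents by their relevance to a given query, without GPUs or large dependencies.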

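This improves Retrieval Augmented Generation (RAG) pipelines by prioritizing the best-suited documents. The smallest model runs on almost any machine.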

Features

Four ranking models:
- Tiny (~4MB, INT8): ms-marco-TinyBERT-L-2-v2 (default) (alias :tiny)
- MiniLM L-4 (~70MB, FP32): ms-marco-MiniLM-L-4-v2 ONNX (alias :mini4)
- MiniLM L-6 (~83.4MB, FP32): ms-marco-MiniLM-L-6-v2 ONNX (alias :mini6)
- MiniLM L-12 (~23MB, INT8): ms-marco-MiniLM-L-12-v2 (alias :mini or :mini12)

Lightweight dependencies: heavy frameworks such as Flux and CUDA are avoided for ease of integration.

How fast is it? With the Tiny model, you can rank 100 documents in roughly 0.1 seconds on a laptop. With the MiniLM (12-layer) model, you can rank 100 documents in roughly 0.4 seconds.

Tip: Pick the largest model your latency budget allows. MiniLM L-12 is the slowest but the most accurate.

Note that the BERT models used here have a maximum chunk size of 512 tokens (anything beyond that is truncated).
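Because anything past the limit is truncated, long documents are usually split before ranking. The following is a minimal sketch of one way to do that in plain Julia. The helper `split_into_chunks` and its parameters `max_words` and `overlap` are hypothetical and not part of FlashRank.jl, and the word count is only a rough proxy for the real token count.

```julia
# Hypothetical helper (not part of FlashRank.jl): split a long text into
# overlapping word windows so each chunk stays well below the token limit.
# Word count is only a rough proxy for the BERT token count.
function split_into_chunks(text::AbstractString; max_words::Int = 300, overlap::Int = 30)
    @assert 0 <= overlap < max_words "overlap must be smaller than max_words"
    words = split(text)
    isempty(words) && return String[]
    chunks = String[]
    step = max_words - overlap
    start = 1
    while true
        stop = min(start + max_words - 1, length(words))
        push!(chunks, join(words[start:stop], " "))
        stop == length(words) && break
        start += step
    end
    return chunks
end

# A synthetic 700-word text is split into three overlapping chunks
long_text = join(("word$i" for i in 1:700), " ")
chunks = split_into_chunks(long_text)
```

Each chunk can then be ranked as its own passage. The overlap reduces the chance that a relevant sentence is cut in half at a chunk boundary.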

Installation

Add it to your environment with:

using Pkg
Pkg.activate(".")
Pkg.add("FlashRank")

Usage

Ranking your documents for a given query is as simple as:

ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"
using FlashRank

ranker = RankerModel() # Defaults to model = `:tiny`

query = "How to speedup LLMs?"
passages = [
    "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.",
    "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more \$\$\$ you will save. vllm project is a must-read for this direction, and now they have just released the paper",
    "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run.",
    "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.",
    "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels",
];

result = rank(ranker, query, passages)

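result is of type RankResult and contains the sorted passages, their scores (0-1, where 1 is the best), and the positions of the sorted documents (referring to the original passages vector).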
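To make the meaning of "positions" concrete, here is a small plain-Julia sketch. It does not call FlashRank.jl and is not how the package computes its result; the scores are made up and all variable names are illustrative only.

```julia
# Illustrative only: plain Julia mirroring what "positions" means,
# not the internals of FlashRank.jl.
passages = ["doc A", "doc B", "doc C"]
scores = Float32[0.12, 0.91, 0.47]   # made-up relevance scores, one per passage

# Indices into the original vector, best score first
positions = sortperm(scores; rev = true)
sorted_passages = passages[positions]
sorted_scores = scores[positions]

# Keep only the best two passages, e.g. to limit the context passed to an LLM
top_n = 2
top_passages = sorted_passages[1:min(top_n, end)]
```

Keeping the positions alongside the sorted passages lets you map each ranked item back to metadata stored in the same order as the original vector, such as source URLs or document IDs.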

The following is a brief outline of how to integrate FlashRank.jl into a PromptingTools.jl RAG pipeline.

For a full example, see examples/prompting_tools_integration.jl.

using FlashRank
using PromptingTools
using PromptingTools.Experimental.RAGTools
const RT = PromptingTools.Experimental.RAGTools

# Wrap the model to be a valid Ranker recognized by RAGTools
# It will be provided to the airag/rerank function to avoid instantiating it on every call
struct FlashRanker <: RT.AbstractReranker
    model::RankerModel
end
reranker = RankerModel(:tiny) |> FlashRanker

# Define the method for ranking with it
function RT.rerank(
        reranker::FlashRanker, index::RT.AbstractDocumentIndex, question::AbstractString,
        candidates::RT.AbstractCandidateChunks; kwargs...)
    ## omitted for brevity
    ## See examples/prompting_tools_integration.jl for details
end

## Apply to the pipeline configuration, eg,
cfg = RAGConfig(; retriever = RT.AdvancedRetriever(; reranker))
## assumes existing index
question = "Tell me about prehistoric animals"
result = airag(cfg, index; question, return_all = true)

Advanced Usage

You can also get fairly "coarse" but fast embeddings with the tiny_embed model (Bert-L4).

embedder = FlashRank.EmbedderModel(:tiny_embed)

passages = ["This is a test", "This is another test"]
result = FlashRank.embed(embedder, passages)
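Embeddings like these are typically compared with cosine similarity. The sketch below uses made-up vectors rather than real model output, and it assumes a matrix layout with one column per passage; that layout is an assumption, so check the shape of what embed returns before adapting it. The function `cosine_similarity` is defined here and is not part of FlashRank.jl.

```julia
using LinearAlgebra

# Made-up 4-dimensional embeddings, one column per passage (an assumed layout;
# check the shape of what embed returns before relying on it).
embeddings = Float32[1 1 0;
                     0 1 0;
                     0 0 1;
                     0 0 1]

# Cosine similarity: dot product divided by the product of the vector norms
cosine_similarity(a, b) = dot(a, b) / (norm(a) * norm(b))

# Compare the first passage against every passage (including itself)
query_vec = embeddings[:, 1]
sims = [cosine_similarity(query_vec, embeddings[:, j]) for j in axes(embeddings, 2)]

# Index of the most similar passage other than the first one
best_other = argmax(sims[2:end]) + 1
```

A fast embedding pass of this kind can pre-filter a large candidate set before the more accurate, but slower, ranking models are applied.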