FlashRank.jl

v0.4.1

FlashRank.jl is inspired by the excellent FlashRank Python package, originally developed by Prithiviraj Damodaran. This package uses model weights from Prithiviraj's HF repo and Svilupp's HF repo to provide a fast and efficient way to rank documents by relevance to a given query, without GPUs or large dependencies.

This improves Retrieval Augmented Generation (RAG) pipelines by prioritizing the most suitable documents. The smallest model can run on almost any machine.

Features

Four ranking models:
- Tiny (~4 MB, INT8): ms-marco-TinyBERT-L-2-v2 (default) (alias :tiny)
- MiniLM L-4 (~70 MB, FP32): ms-marco-MiniLM-L-4-v2 ONNX (alias :mini4)
- MiniLM L-6 (~83.4 MB, FP32): ms-marco-MiniLM-L-6-v2 ONNX (alias :mini6)
- MiniLM L-12 (~23 MB, INT8): ms-marco-MiniLM-L-12-v2 (alias :mini or :mini12)

Lightweight dependencies: heavy frameworks such as Flux and CUDA are avoided for easier integration.

How fast is it? With the Tiny model, you can rank 100 documents in about 0.1 seconds on a laptop. With the MiniLM (12-layer) model, you can rank 100 documents in about 0.4 seconds.

Tip: Pick the largest model that fits your latency budget. MiniLM L-12 is the slowest but offers the best accuracy.
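The tip above can be sketched as a small selection helper. This is only an illustration: pick_model is a hypothetical function (not part of FlashRank.jl), it uses just the two latency figures quoted above, and it assumes latency grows linearly with the number of documents, which you should verify on your own hardware.

```julia
# Hypothetical helper, not part of FlashRank.jl.
# `latencies` holds (alias, seconds per 100 documents), ordered fastest to slowest.
# Only the two figures quoted in the text are included.
function pick_model(budget_seconds::Real, n_docs::Integer;
        latencies = [(:tiny, 0.1), (:mini12, 0.4)])
    choice = nothing
    for (alias, secs_per_100) in latencies
        # Assumption: latency scales linearly with the number of documents.
        estimate = secs_per_100 * n_docs / 100
        if estimate <= budget_seconds
            choice = alias  # later entries are slower but more accurate
        end
    end
    return choice  # `nothing` if even the smallest model exceeds the budget
end

pick_model(0.5, 100)   # fits the 12-layer MiniLM estimate
pick_model(0.2, 100)   # only the Tiny estimate fits
```

The returned alias could then be passed to RankerModel, for example RankerModel(:tiny).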

Note that we use BERT models with a maximum chunk size of 512 tokens (anything beyond that is truncated).

Installation

Add it to your environment with:

using Pkg
Pkg.activate(".")
Pkg.add("FlashRank")

Usage

Ranking your documents for a given query is as simple as:

# Accept the model download (via DataDeps) without an interactive prompt
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"
using FlashRank

ranker = RankerModel() # Defaults to model = `:tiny`

query = "How to speedup LLMs?"
# Note: `$` must be escaped inside Julia string literals
passages = [
    "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.",
    "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more \$\$\$ you will save. vllm project is a must-read for this direction, and now they have just released the paper",
    "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run.",
    "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.",
    "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels",
];

result = rank(ranker, query, passages)

result is of type RankResult and contains the sorted passages, their scores (0-1, where 1 is best), and the positions of the sorted documents (indices into the original passages vector).
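To make the relationship between scores, positions, and sorted passages concrete, here is a minimal sketch in plain Julia. It does not call FlashRank.jl and does not assume any field names of RankResult; the scores are made-up values used only for illustration.

```julia
passages = ["doc A", "doc B", "doc C"]
scores = [0.2, 0.9, 0.5]  # hypothetical relevance scores, one per passage

# Positions are indices into the original `passages` vector, best score first
positions = sortperm(scores; rev = true)

sorted_passages = passages[positions]
sorted_scores = scores[positions]
```

Here positions[1] is the index of the highest-scoring passage in the original vector, which lets you map a ranked document back to its source metadata.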

Here is a brief outline of how you can integrate FlashRank.jl into your PromptingTools.jl RAG pipeline.

For a full example, see examples/prompting_tools_integration.jl.

using FlashRank
using PromptingTools
using PromptingTools.Experimental.RAGTools
const RT = PromptingTools.Experimental.RAGTools

# Wrap the model to be a valid Ranker recognized by RAGTools
# It will be provided to the airag/rerank function to avoid instantiating it on every call
struct FlashRanker <: RT.AbstractReranker
    model::RankerModel
end
reranker = RankerModel(:tiny) |> FlashRanker

# Define the method for ranking with it
function RT.rerank(
        reranker::FlashRanker, index::RT.AbstractDocumentIndex, question::AbstractString,
        candidates::RT.AbstractCandidateChunks; kwargs...)
    ## omitted for brevity
    ## See examples/prompting_tools_integration.jl for details
end

## Apply to the pipeline configuration, eg,
cfg = RAGConfig(; retriever = RT.AdvancedRetriever(; reranker))
## assumes existing index
question = "Tell me about prehistoric animals"
result = airag(cfg, index; question, return_all = true)

Advanced Usage

You can also use fairly "coarse" but fast embeddings with the tiny_embed model (Bert-L4).

embedder = FlashRank.EmbedderModel(:tiny_embed)

passages = ["This is a test", "This is another test"]
result = FlashRank.embed(embedder, passages)
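A common next step with embeddings is comparing passages by cosine similarity. The sketch below uses only Julia's standard library and a stand-in matrix rather than real model output. The layout (one column per passage) is an assumption; check the value returned by embed before adapting it.

```julia
using LinearAlgebra

# Stand-in for real embeddings; assumed layout: one column per passage
embeddings = Float32[1 0 1;
                     0 1 1]

# Cosine similarity between two embedding vectors
cosine(a, b) = dot(a, b) / (norm(a) * norm(b))

sim_12 = cosine(embeddings[:, 1], embeddings[:, 2])  # orthogonal vectors
sim_13 = cosine(embeddings[:, 1], embeddings[:, 3])  # vectors 45 degrees apart
```

If the vectors are already normalized to unit length, the division can be dropped and a plain dot product gives the same result.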

Acknowledgments

FlashRank and Transformers.jl have been essential to the development of this package.
Special thanks to Prithiviraj Damodaran for the original FlashRank and the INT8-quantized model weights.
Thanks also to Transformers.jl for the WordPiece implementation and BERT tokenizer, which were forked for this package (to minimize dependencies).

Roadmap

- Provide a package extension for PromptingTools
- Bring even smaller models (e.g., Bert-L2-128D)
- Introduce a simple length-based adjustment of the embedding similarity score
- Re-upload embedding models with mask-based pooling (no real difference, just theoretically correct)
