FlashRank.jl

FlashRank.jl is inspired by the excellent FlashRank Python package, originally developed by Prithiviraj Damodaran. This package uses model weights from Prithiviraj's HF repository and Svilupp's HF repository to provide a fast, efficient way to rank documents by relevance to any query, without GPUs or large dependencies.

This improves Retrieval Augmented Generation (RAG) pipelines by prioritizing the most suitable documents. The smallest model runs on practically any machine.

Features

Four ranking models:
- Tiny (~4 MB, INT8): ms-marco-TinyBERT-L-2-v2 (default) (alias :tiny)
- MiniLM L-4 (~70 MB, FP32): ms-marco-MiniLM-L-4-v2 ONNX (alias :mini4)
- MiniLM L-6 (~83.4 MB, FP32): ms-marco-MiniLM-L-6-v2 ONNX (alias :mini6)
- MiniLM L-12 (~23 MB, INT8): ms-marco-MiniLM-L-12-v2 (alias :mini or mini12)
Lightweight dependencies, avoiding heavy frameworks like Flux and CUDA for ease of integration.

How fast is it? With the Tiny model, you can rank 100 documents in roughly 0.1 seconds on a laptop. With the MiniLM (12-layer) model, you can rank 100 documents in roughly 0.4 seconds.

Tip: Pick the largest model that fits your latency budget. MiniLM L-12 is the slowest, but it has the best accuracy.

Note that these are BERT models with a maximum chunk size of 512 tokens (anything beyond that is truncated).
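Because anything past 512 tokens is truncated, long documents may be worth splitting before ranking. The following is a rough sketch, assuming whitespace-separated words as a stand-in for tokens (WordPiece token counts are usually higher than word counts, so the default limit is kept conservative). `chunk_words` and `max_words` are hypothetical names for illustration and are not part of FlashRank.jl:

```julia
# Hypothetical helper (not part of FlashRank.jl):
# split a long text into chunks of at most `max_words` whitespace-separated words.
function chunk_words(text::AbstractString; max_words::Int = 300)
    words = split(text) # split on any whitespace
    return [join(part, " ") for part in Iterators.partition(words, max_words)]
end

# Small illustration with an artificially low limit
chunks = chunk_words("one two three four five"; max_words = 2)
```

Each chunk can then be passed as its own passage, so that no part of a long document is silently dropped.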

Installation

Add it to your environment with:

using Pkg
Pkg.activate(".")
Pkg.add("FlashRank")

Usage

Ranking your documents for a given query is as simple as:

# Accept the DataDeps download prompt automatically (model weights are fetched on first use)
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"
using FlashRank

ranker = RankerModel() # Defaults to model = `:tiny`

query = "How to speedup LLMs?"
# Note: `$` must be escaped inside Julia string literals
passages = [
    "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.",
    "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more \$\$\$ you will save. vllm project is a must-read for this direction, and now they have just released the paper",
    "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run.",
    "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.",
    "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels",
];

result = rank(ranker, query, passages)

result is of type RankResult and contains the sorted passages, their scores (0-1, where 1 is the best), and the positions of the sorted documents (referring to the original passages vector).
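To make the relationship between scores, positions, and the original vector concrete, here is a minimal sketch in plain Julia. It does not call FlashRank.jl: the passages and scores are made up, and the variable names are only illustrative, not the actual fields of RankResult.

```julia
# Made-up inputs for illustration only (not produced by a model)
passages = ["doc A", "doc B", "doc C"]
scores = Float32[0.2, 0.9, 0.5]

# Indices into the original `passages` vector, best score first
positions = sortperm(scores; rev = true)

# Passages and scores reordered from most to least relevant
sorted_passages = passages[positions]
sorted_scores = scores[positions]
```

Here `positions[1]` is the index, in the original vector, of the best-scoring passage, which is the sense in which positions "refer to the original passages vector".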

Here is a brief outline of how you can integrate FlashRank.jl into your PromptingTools.jl RAG pipeline.

For a full example, see examples/prompting_tools_integration.jl.

using FlashRank
using PromptingTools
using PromptingTools.Experimental.RAGTools
const RT = PromptingTools.Experimental.RAGTools

# Wrap the model to be a valid Ranker recognized by RAGTools
# It will be provided to the airag/rerank function to avoid instantiating it on every call
struct FlashRanker <: RT.AbstractReranker
    model::RankerModel
end
reranker = RankerModel(:tiny) |> FlashRanker

# Define the method for ranking with it
function RT.rerank(
        reranker::FlashRanker, index::RT.AbstractDocumentIndex, question::AbstractString,
        candidates::RT.AbstractCandidateChunks; kwargs...)
    ## omitted for brevity
    ## See examples/prompting_tools_integration.jl for details
end

## Apply to the pipeline configuration, eg,
cfg = RAGConfig(; retriever = RT.AdvancedRetriever(; reranker))
## assumes existing index
question = "Tell me about prehistoric animals"
result = airag(cfg, index; question, return_all = true)

Advanced Usage

You can also get fairly "coarse" but fast embeddings with the tiny_embed model (Bert-L4).

embedder = FlashRank.EmbedderModel(:tiny_embed)

passages = ["This is a test", "This is another test"]
result = FlashRank.embed(embedder, passages)
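Embeddings like these are typically compared with cosine similarity. The following is a minimal sketch, assuming the embeddings are available as a matrix with one column per passage. The matrix below is made up for illustration and does not come from FlashRank.jl, and `cosine_similarity` is a hypothetical helper, not a function of the package:

```julia
using LinearAlgebra

# Hypothetical helper: cosine similarity between two vectors
cosine_similarity(a, b) = dot(a, b) / (norm(a) * norm(b))

# Made-up 3-dimensional embeddings, one column per passage (illustration only)
embeddings = Float32[1 2 0;
                     0 0 1;
                     0 0 0]

sim_12 = cosine_similarity(embeddings[:, 1], embeddings[:, 2]) # columns point the same way
sim_13 = cosine_similarity(embeddings[:, 1], embeddings[:, 3]) # columns are orthogonal
```

Values close to 1 indicate passages pointing in the same direction in embedding space, while values close to 0 indicate unrelated ones.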

Acknowledgments

FlashRank and Transformers.jl were essential in the development of this package.
Special thanks to Prithiviraj Damodaran for the original FlashRank and for the INT8 quantized model weights.
Thanks also to Transformers.jl for the WordPiece implementation and the BERT tokenizer, which were forked into this package (to minimize dependencies).

Roadmap

- Provide a package extension for PromptingTools
- Bring even smaller models (e.g., Bert-L2-128D)
- Introduce a simple length-based adjustment to the embedding similarity score
- Re-upload the embedding models with mask-based pooling (no real difference, just theoretically correct)

Additional information

Version v0.4.1
Type Other source code
Last updated 2024-12-23
Size 31.33 KB
Source GitHub