FlashRank.jl

v0.4.1

FlashRank.jl is inspired by the excellent FlashRank Python package, originally developed by Prithiviraj Damodaran. This package uses model weights from Prithiviraj's HF repo and Svilupp's HF repo to provide a fast and efficient way to rank documents by relevance to a given query, without GPUs or large dependencies.

This improves Retrieval-Augmented Generation (RAG) pipelines by prioritizing the most suitable documents. The smallest model can run on almost any machine.

Features

Four ranking models:
- Tiny (~4 MB, INT8): ms-marco-TinyBERT-L-2-v2 (default) (alias :tiny)
- MiniLM L-4 (~70 MB, FP32): ms-marco-MiniLM-L-4-v2 ONNX (alias :mini4)
- MiniLM L-6 (~83.4 MB, FP32): ms-marco-MiniLM-L-6-v2 ONNX (alias :mini6)
- MiniLM L-12 (~23 MB, INT8): ms-marco-MiniLM-L-12-v2 (alias :mini or mini12)

Lightweight dependencies, avoiding heavy frameworks like Flux and CUDA for ease of integration.

How fast is it? With the Tiny model, you can rank 100 documents in about 0.1 seconds on a laptop. With the MiniLM (12-layer) model, you can rank 100 documents in about 0.4 seconds.

Tip: pick the largest model you can afford within your latency budget. MiniLM L-12 is the slowest but has the best accuracy.
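
The tip above can be turned into a small selection helper. This is a rough sketch, not part of FlashRank.jl: `pick_model` and `LATENCY_PER_100` are hypothetical names, the two latency figures are the laptop numbers quoted above, and the code assumes that ranking time scales linearly with the number of documents, which you should verify on your own hardware.

```julia
# Approximate seconds per 100 documents, taken from the figures quoted above.
# Only :tiny and :mini are quoted; measure the other models yourself.
const LATENCY_PER_100 = [(:tiny, 0.1), (:mini, 0.4)]  # ordered fastest to slowest

# Hypothetical helper: return the slowest (most accurate) alias whose
# estimated time fits the budget, assuming linear scaling in `n_docs`.
function pick_model(n_docs::Integer, budget_s::Real)
    choice = nothing
    for (alias, secs) in LATENCY_PER_100
        estimate = secs * n_docs / 100
        estimate <= budget_s && (choice = alias)
    end
    return choice  # `nothing` if even the smallest model is too slow
end

println(pick_model(100, 0.5))
println(pick_model(100, 0.2))
```

The returned alias could then be passed to `RankerModel`, as in `RankerModel(:tiny)`.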

Note that the models are BERT models with a maximum chunk size of 512 tokens (anything longer is truncated).
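
Because longer inputs are truncated, it can help to split long documents before ranking them. Below is a minimal sketch in plain Julia; `chunk_words` is a hypothetical helper, not a FlashRank.jl function. It counts whitespace-separated words, which is only a rough proxy for WordPiece tokens, so the default window is deliberately kept well below 512.

```julia
# Hypothetical helper: split a long text into overlapping word windows.
# Words are only a rough proxy for BERT WordPiece tokens, so the window
# is kept well below the 512-token limit.
function chunk_words(text::AbstractString; window::Int = 200, overlap::Int = 20)
    @assert 0 <= overlap < window "overlap must be smaller than window"
    words = split(text)
    isempty(words) && return String[]
    step = window - overlap
    chunks = String[]
    start = 1
    while true
        stop = min(start + window - 1, length(words))
        push!(chunks, join(words[start:stop], " "))
        stop == length(words) && break
        start += step
    end
    return chunks
end

long_doc = join(("word$i" for i in 1:450), " ")
chunks = chunk_words(long_doc)
println(length(chunks), " chunks")
```

Each chunk can then be ranked as its own passage. The overlap reduces the chance that a relevant sentence is cut in half at a chunk boundary.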

Installation

Add it to your environment with:

using Pkg
Pkg.activate(".")
Pkg.add("FlashRank")

Usage

Ranking your documents for a given query is as simple as:

ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"
using FlashRank

ranker = RankerModel() # Defaults to model = `:tiny`

query = "How to speedup LLMs?"
# Note: `$` must be escaped inside Julia string literals
passages = [
    "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.",
    "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more \$\$\$ you will save. vllm project is a must-read for this direction, and now they have just released the paper",
    "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run.",
    "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.",
    "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels",
];

result = rank(ranker, query, passages)

result is of type RankResult and contains the sorted passages, their scores (0-1, where 1 is the best), and the positions of the sorted documents (referring to the original passages vector).
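
To make the relationship between scores, positions, and sorted passages concrete, here is a small illustration in plain Julia. It does not call FlashRank and does not assume any RankResult field names; the scores are made up, and the variable names are chosen only for this sketch.

```julia
# Illustration only: plain vectors standing in for a ranker's output.
passages = ["doc A", "doc B", "doc C"]
raw_scores = Float32[0.12, 0.87, 0.45]  # made-up scores, in input order

# Positions are indices into the original `passages` vector, best first
positions = sortperm(raw_scores; rev = true)
sorted_passages = passages[positions]
sorted_scores = raw_scores[positions]

# Keep only the top-k passages, e.g. to build the context for a RAG prompt
top_k = 2
context = sorted_passages[1:min(top_k, end)]
println(context)
```

Indexing the original vector with the positions reproduces the sorted passages, which is how the positions in a rank result map back to your own document metadata.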

Below is a brief outline of how to integrate FlashRank.jl into your PromptingTools.jl RAG pipeline.

For a full example, see examples/prompting_tools_integration.jl.

using FlashRank
using PromptingTools
using PromptingTools.Experimental.RAGTools
const RT = PromptingTools.Experimental.RAGTools

# Wrap the model to be a valid Ranker recognized by RAGTools
# It will be provided to the airag/rerank function to avoid instantiating it on every call
struct FlashRanker <: RT.AbstractReranker
    model::RankerModel
end
reranker = RankerModel(:tiny) |> FlashRanker

# Define the method for ranking with it
function RT.rerank(
        reranker::FlashRanker, index::RT.AbstractDocumentIndex, question::AbstractString,
        candidates::RT.AbstractCandidateChunks; kwargs...)
    ## omitted for brevity
    ## See examples/prompting_tools_integration.jl for details
end

## Apply to the pipeline configuration, eg,
cfg = RAGConfig(; retriever = RT.AdvancedRetriever(; reranker))
## assumes existing index
question = "Tell me about prehistoric animals"
result = airag(cfg, index; question, return_all = true)

Advanced Usage

You can also get fairly "coarse" but fast embeddings with the tiny_embed model (Bert-L4).

embedder = FlashRank.EmbedderModel(:tiny_embed)

passages = ["This is a test", "This is another test"]
result = FlashRank.embed(embedder, passages)
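
A common next step with embeddings is to compare them by cosine similarity. The sketch below is plain Julia and does not depend on FlashRank. It assumes the embeddings are available as a matrix with one embedding per column; the text above does not state the layout of the embed result, so check it before reusing this. `cosine_similarity` is a hypothetical helper, and the small matrix stands in for real model output.

```julia
# Assumption: one embedding per column, as a plain Float32 matrix.
# Hand-written values stand in for real model output here.
emb = Float32[1 0 1; 0 1 1]  # 2 dimensions × 3 "documents"

# Cosine similarity between every pair of columns
function cosine_similarity(E::AbstractMatrix)
    normalized = E ./ sqrt.(sum(abs2, E; dims = 1))  # L2-normalize each column
    return normalized' * normalized
end

sims = cosine_similarity(emb)
println(round.(sims; digits = 3))
```

Entry `sims[i, j]` is the similarity between documents `i` and `j`; the diagonal is 1 because every column is compared with itself.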

Acknowledgments

FlashRank and Transformers.jl have been essential in the development of this package.
Special thanks to Prithiviraj Damodaran for the original FlashRank and the INT8 quantized model weights.
And to Transformers.jl for the WordPiece implementation and the BERT tokenizer, which have been forked for this package (to minimize dependencies).

Roadmap

- Provide a package extension for PromptingTools
- Bring even smaller models (eg, Bert-L2-128D)
- Introduce a simple length-based adjustment to the embedding similarity score
- Re-upload the embedding models with mask-based pooling (no real difference, just theoretically correct)

Additional information

Version: v0.4.1
Type: Other source code
Updated: 2024-12-23
Size: 31.33 KB
Source: GitHub
