FlashRank.jl

v0.4.1

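FlashRank.jl is inspired by the excellent FlashRank Python package, originally developed by Prithiviraj Damodaran. It uses model weights from Prithiviraj's HF repo and Svilupp's HF repo to provide a fast and efficient way to rank documents by relevance to a given query, without GPUs or large dependencies.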

This improves Retrieval Augmented Generation (RAG) pipelines by prioritizing the most relevant documents. The smallest model can run on almost any machine.

Features

Four ranking models:
- Tiny (~4MB, INT8): ms-marco-TinyBERT-L-2-v2 (default) (alias :tiny)
- MiniLM L-4 (~70MB, FP32): ms-marco-MiniLM-L-4-v2 ONNX (alias :mini4)
- MiniLM L-6 (~83.4MB, FP32): ms-marco-MiniLM-L-6-v2 ONNX (alias :mini6)
- MiniLM L-12 (~23MB, INT8): ms-marco-MiniLM-L-12-v2 (alias :mini or :mini12)

Lightweight dependencies: heavy frameworks such as Flux and CUDA are avoided for ease of integration.

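How fast is it? With the Tiny model, you can rank 100 documents in about 0.1 seconds on a laptop. With the MiniLM (12-layer) model, you can rank 100 documents in about 0.4 seconds.

Tip: Pick the largest model your latency budget allows. MiniLM L-12 is the slowest but the most accurate.

Note that these are BERT models with a maximum chunk size of 512 tokens (anything beyond that is truncated).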
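Because of the 512-token truncation, long documents are usually split into smaller chunks before ranking. The following is a minimal sketch of one way to do that, not part of FlashRank.jl: `chunk_words` is a hypothetical helper, and whitespace-separated words are only a rough proxy for WordPiece tokens (a word often maps to more than one token, and the query shares the same budget), so the `max_words` default is a conservative assumption.

```julia
# Hypothetical helper (not part of FlashRank.jl): split a long text into
# chunks of at most `max_words` whitespace-separated words.
# Word count is only a rough stand-in for the model's token count.
function chunk_words(text::AbstractString; max_words::Int = 200)
    max_words > 0 || throw(ArgumentError("max_words must be positive"))
    words = split(text)
    # Iterators.partition yields consecutive groups of up to `max_words` words
    return [join(chunk, " ") for chunk in Iterators.partition(words, max_words)]
end

# Example: a synthetic 450-word document is split into three chunks
long_doc = join(("word$i" for i in 1:450), " ")
chunks = chunk_words(long_doc; max_words = 200)
```

Each chunk can then be ranked as its own passage, which keeps the tail of a long document from being silently dropped.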

Installation

Add it to your environment with:

using Pkg
Pkg.activate(".")
Pkg.add("FlashRank")

Usage

Ranking your documents for a given query is as simple as:

ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"
using FlashRank

ranker = RankerModel() # Defaults to model = `:tiny`

query = "How to speedup LLMs?"
# Note: literal dollar signs must be escaped (\$) inside Julia string literals
passages = [
        "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.",
        "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more \$\$\$ you will save. vllm project is a must-read for this direction, and now they have just released the paper",
        "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run.",
        "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.",
        "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels",
];

result = rank(ranker, query, passages)

result is of type RankResult and contains the sorted passages, their scores (0-1, where 1 is best), and the positions of the sorted documents (indices into the original passages vector).
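To make the relationship between the three pieces concrete, here is a small sketch in plain Julia that does not call FlashRank.jl at all. The scores are made up for illustration, and the variable names (`positions`, `sorted_passages`, `sorted_scores`) are assumptions chosen for readability, not the field names of RankResult.

```julia
# Illustrative only: made-up scores standing in for a reranker's output.
passages = ["doc A", "doc B", "doc C"]
scores = Float32[0.12, 0.87, 0.45]   # hypothetical relevance scores in 0-1

# Indices into the original `passages` vector, best score first
positions = sortperm(scores; rev = true)

# Sorted views of the inputs, aligned with `positions`
sorted_passages = passages[positions]
sorted_scores = scores[positions]
```

The positions let you map each ranked passage back to whatever metadata you keep alongside the original vector, such as source file or chunk ID.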

Here is a brief outline of how to integrate FlashRank.jl into a PromptingTools.jl RAG pipeline.

For a full example, see examples/prompting_tools_integration.jl.

using FlashRank
using PromptingTools
using PromptingTools.Experimental.RAGTools
const RT = PromptingTools.Experimental.RAGTools

# Wrap the model to be a valid Ranker recognized by RAGTools
# It will be provided to the airag/rerank function to avoid instantiating it on every call
struct FlashRanker <: RT.AbstractReranker
    model::RankerModel
end
reranker = RankerModel(:tiny) |> FlashRanker

# Define the method for ranking with it
function RT.rerank(
        reranker::FlashRanker, index::RT.AbstractDocumentIndex, question::AbstractString,
        candidates::RT.AbstractCandidateChunks; kwargs...)
    ## omitted for brevity
    ## See examples/prompting_tools_integration.jl for details
end

## Apply to the pipeline configuration, eg,
cfg = RAGConfig(; retriever = RT.AdvancedRetriever(; reranker))
## assumes existing index
question = "Tell me about prehistoric animals"
result = airag(cfg, index; question, return_all = true)

Advanced Usage

You can also get fairly "coarse" but fast embeddings with the tiny_embed model (Bert-L4).

embedder = FlashRank.EmbedderModel(:tiny_embed)

passages = ["This is a test", "This is another test"]
result = FlashRank.embed(embedder, passages)
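A common use for such embeddings is comparing passages by cosine similarity. The sketch below is plain Julia on a made-up matrix and does not call FlashRank.jl. It assumes the embeddings are available as a numeric matrix with one column per passage, which is an assumption about the layout rather than something stated here, and `cosine_similarity` is a hypothetical helper.

```julia
using LinearAlgebra

# Hypothetical helper: cosine similarity between two vectors
cosine_similarity(a, b) = dot(a, b) / (norm(a) * norm(b))

# Made-up 3-dimensional vectors standing in for real embeddings.
# Assumed layout: one column per passage.
emb = Float32[1 1 0;
              0 1 0;
              0 0 1]

sim_12 = cosine_similarity(emb[:, 1], emb[:, 2])  # partially aligned columns
sim_13 = cosine_similarity(emb[:, 1], emb[:, 3])  # orthogonal columns
```

If the real embeddings turn out to be stored row-wise, swap the indexing to `emb[i, :]`.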

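Acknowledgments

FlashRank and Transformers.jl have been essential to the development of this package.
Special thanks to Prithiviraj Damodaran for the original FlashRank and for the INT8 quantized model weights.
Thanks also to Transformers.jl for the WordPiece implementation and BERT tokenizer, which were forked for this package to minimize dependencies.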