FlashRank.jl

v0.4.1

FlashRank.jl is inspired by the awesome FlashRank Python package, originally developed by Prithiviraj Damodaran. This package uses model weights from Prithiviraj's HF repo and Svilupp's HF repo to provide a fast and efficient way to rank documents by relevance to any given query, without GPUs or large dependencies.

This improves Retrieval-Augmented Generation (RAG) pipelines by prioritizing the most relevant documents. The smallest model can run on almost any machine.

Features

Four ranking models:
- Tiny (~4 MB, INT8): ms-marco-TinyBERT-L-2-v2 (default) (alias :tiny)
- MiniLM L-4 (~70 MB, FP32): ms-marco-MiniLM-L-4-v2 ONNX (alias :mini4)
- MiniLM L-6 (~83.4 MB, FP32): ms-marco-MiniLM-L-6-v2 ONNX (alias :mini6)
- MiniLM L-12 (~23 MB, INT8): ms-marco-MiniLM-L-12-v2 (alias :mini or :mini12)
Lightweight dependencies that avoid heavy frameworks such as Flux and CUDA, for ease of integration.

How fast is it? With the Tiny model, you can rank 100 documents in ~0.1 seconds on a laptop. With the MiniLM (12-layer) model, you can rank 100 documents in ~0.4 seconds.

Tip: Pick the largest model you can afford within your latency budget. MiniLM L-12 is the slowest but has the best accuracy.

Note that these are BERT models with a maximum chunk size of 512 tokens (anything longer will be truncated).
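Because anything beyond the 512-token limit is truncated, long documents may need to be split before ranking. The sketch below is one rough way to do that; `chunk_words` and its `max_words` parameter are hypothetical helpers, not part of FlashRank.jl. It counts whitespace-separated words, which only approximates the WordPiece token count, and the query most likely shares the same token budget, so the default is kept well below 512.

```julia
# Hypothetical helper (not part of FlashRank.jl): split `text` into chunks of
# at most `max_words` whitespace-separated words.
# Words are only a rough proxy for WordPiece tokens, so stay well below 512.
function chunk_words(text::AbstractString; max_words::Int = 200)
    words = split(text)
    return [join(words[i:min(i + max_words - 1, length(words))], " ")
            for i in 1:max_words:length(words)]
end

# A synthetic 450-word passage, used only for illustration
long_passage = join(("word$i" for i in 1:450), " ")
chunks = chunk_words(long_passage; max_words = 200)
println(length(chunks))  # number of chunks produced
```

Each chunk can then be ranked as its own passage, at the cost of more passages to score.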

Installation

Add it to your environment with:

using Pkg
Pkg.activate(".")
Pkg.add("FlashRank")

Usage

Ranking your documents for a given query is as simple as:

# Accept DataDeps download prompts automatically (model weights are fetched on first use)
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"
using FlashRank

ranker = RankerModel() # Defaults to model = `:tiny`

query = "How to speedup LLMs?"
# Note: `$` must be escaped inside Julia string literals
passages = [
        "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.",
        "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more \$\$\$ you will save. vllm project is a must-read for this direction, and now they have just released the paper",
        "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run.",
        "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.",
        "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels",
];

result = rank(ranker, query, passages)

result is of type RankResult and contains the sorted passages, their scores (0-1, where 1 is the best), and the positions of the sorted documents (relative to the original passages vector).
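To illustrate how sorted positions relate to the original vector, here is a minimal sketch in plain Julia. It does not call FlashRank; the scores are made up, and the variable names are illustrative rather than the field names of RankResult.

```julia
# Hypothetical scores, one per document, in the ORIGINAL order (made-up values)
docs = ["doc A", "doc B", "doc C"]
scores = [0.12, 0.87, 0.45]

# Positions of the documents in descending order of score,
# expressed as indices into the original `docs` vector
positions = sortperm(scores; rev = true)

sorted_docs = docs[positions]      # documents, best first
sorted_scores = scores[positions]  # matching scores, best first

# Keep only the top-k documents, e.g. to pass to an LLM as context
top_k = 2
context = first(sorted_docs, top_k)
println(context)
```

Keeping the positions makes it possible to map each ranked passage back to its source record (metadata, IDs, and so on) in the original collection.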

Here is a brief outline of how you can integrate FlashRank.jl into a PromptingTools.jl RAG pipeline.

For a full example, see examples/prompting_tools_integration.jl.

using FlashRank
using PromptingTools
using PromptingTools.Experimental.RAGTools
const RT = PromptingTools.Experimental.RAGTools

# Wrap the model to be a valid Ranker recognized by RAGTools
# It will be provided to the airag/rerank function to avoid instantiating it on every call
struct FlashRanker <: RT.AbstractReranker
    model::RankerModel
end
reranker = RankerModel(:tiny) |> FlashRanker

# Define the method for ranking with it
function RT.rerank(
        reranker::FlashRanker, index::RT.AbstractDocumentIndex, question::AbstractString,
        candidates::RT.AbstractCandidateChunks; kwargs...)
    ## omitted for brevity
    ## See examples/prompting_tools_integration.jl for details
end

## Apply to the pipeline configuration, eg,
cfg = RAGConfig(; retriever = RT.AdvancedRetriever(; reranker))
## assumes existing index
question = "Tell me about prehistoric animals"
result = airag(cfg, index; question, return_all = true)

Advanced Usage

You can also use fairly "coarse" but fast embeddings with the tiny_embed model (Bert-L4).

embedder = FlashRank.EmbedderModel(:tiny_embed)

passages = ["This is a test", "This is another test"]
result = FlashRank.embed(embedder, passages)
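Embeddings are typically compared with cosine similarity. The sketch below shows the computation on a small made-up matrix; it assumes embeddings are stored column-wise (one column per passage), which is an assumption and not something this README specifies about the value returned by FlashRank.embed, so check the actual layout before adapting it.

```julia
# Hypothetical embeddings, one COLUMN per passage (made-up values, not FlashRank output)
E = [1.0 0.0 1.0;
     0.0 1.0 1.0]

# Normalize each column to unit length; cosine similarity is then a matrix product
En = E ./ sqrt.(sum(abs2, E; dims = 1))
S = En' * En   # S[i, j] = cosine similarity between passage i and passage j

println(round.(S; digits = 3))
```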

Acknowledgments

FlashRank and Transformers.jl have been essential to the development of this package.
Special thanks to Prithiviraj Damodaran for the original FlashRank and the INT8 quantized model weights.
And to Transformers.jl for the WordPiece implementation and BERT tokenizer, which were adapted for this package (to minimize dependencies).

Roadmap

- Provide a package extension for PromptingTools.
- Bring even smaller models (e.g., Bert-L2-128D).
- Introduce a simple length-based adjustment for the embedding similarity score.
- Re-upload the embedding models with mask-based pooling (no real difference, just theoretically correct).
