SemanticCaches.jl is a very hacky implementation of a semantic cache for AI applications, to save time and money on repeated requests. It is not particularly fast, because we are trying to prevent API calls that can take even 20 seconds.
Note that we use a tiny BERT model with a maximum chunk size of 512 tokens to provide fast local embeddings that run on a CPU. For longer sentences, we split them into several chunks and consider the average of their embeddings, but use that carefully! The latency can shoot up and become worse than simply calling the original API.
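The chunk-and-average step can be sketched as follows. This is a minimal illustration, not the package's implementation: `embed_chunk` is a hypothetical stand-in for the real tiny BERT embedder, and chunking is done by characters here rather than by tokens.

```julia
using LinearAlgebra: normalize

# Hypothetical stand-in for the real tiny BERT embedder (from FlashRank):
# any function mapping a text chunk to a normalized vector would do here.
embed_chunk(chunk::AbstractString) =
    normalize(Float32[length(chunk), count(isspace, chunk) + 1, 1])

# Split a long input into fixed-size chunks (the real package chunks by
# tokens, max 512 per chunk), embed each, and average the embeddings.
function embed_long(text::AbstractString; chunk_size::Int = 512)
    chunks = [text[i:min(i + chunk_size - 1, end)] for i in 1:chunk_size:lastindex(text)]
    avg = sum(embed_chunk(c) for c in chunks) / length(chunks)
    return normalize(avg)  # re-normalize so cosine similarity is a plain dot product
end

emb = embed_long("say hi! "^1000)
```

Averaging chunk embeddings is cheap but lossy, which is exactly why the latency/quality warning above applies to long inputs.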
To install SemanticCaches.jl, simply add the package via the Julia package manager:
using Pkg;
Pkg.activate(".")
Pkg.add("SemanticCaches")
## This line is very important to be able to download the models!!!
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"
using SemanticCaches
sem_cache = SemanticCache()
# First argument: the key must always match exactly, eg, model, temperature, etc
# Second argument: the input text to be compared with the cache, can be fuzzy matched
item = sem_cache("key1", "say hi!"; verbose = 1) # the verbose flag can be 0, 1, or 2 for different levels of detail
if !isvalid(item)
    @info "cache miss!"
    item.output = "expensive result X"
    # Save the result to the cache for future reference
    push!(sem_cache, item)
end
# In practice, long texts may take too long to embed even with our tiny model
# so let's not compare anything above 2000 tokens =~ 5000 characters (threshold of c. 100ms)
hash_cache = HashCache()
input = "say hi"
input = "say hi"^1000
active_cache = length(input) > 5000 ? hash_cache : sem_cache
item = active_cache("key1", input; verbose = 1)
if !isvalid(item)
    @info "cache miss!"
    item.output = "expensive result X"
    push!(active_cache, item)
end
The main goal of building this package was to cache expensive API calls to GenAI models.
The system offers exact matching (faster, HashCache) and semantic similarity lookups (slower, SemanticCache) for STRING inputs. In addition, all requests are first compared on a "cache key", which must always match exactly for requests to be considered interchangeable (eg, same model, same provider, same temperature, etc). You need to choose the appropriate cache key and input for your use case. The default choice for the cache key should be the model name.
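For example, if several parameters must match exactly, you could combine them into a single key. This is a hypothetical helper for illustration, not part of the package:

```julia
# Hypothetical helper: combine everything that must match exactly into one key.
make_cache_key(; model::AbstractString, temperature::Real = 0.0) =
    string(model, "|temp=", temperature)

make_cache_key(model = "gpt-4o-mini", temperature = 0.7)  # "gpt-4o-mini|temp=0.7"
```

Anything left out of the key (like the message text) can then be fuzzy-matched by the semantic lookup.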
What happens when you call the cache (providing a cache_key and a string_input)?

- All cached items are stored in a vector cache.items.
- When a request comes in, we look up the cache_key to find the indices of the corresponding items in items. If the cache_key is not found, we return a CachedItem with an empty output field (ie, isvalid(item) == false).
- We embed the string_input and normalize the embedding (to make it easier to compare the cosine distance later).
- We then compare it against the embeddings of the cached items. If the cosine distance is above the min_similarity threshold, we return the cached item (the output can be found in the field item.output).
- If no cached item is found, we return a CachedItem with an empty output field (ie, isvalid(item) == false).
- Once you have computed the response and saved it in item.output, you can push the item to the cache by calling push!(cache, item).
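The lookup flow can be sketched in plain Julia. This is an illustrative toy, not the package's code: `Item` stands in for CachedItem, and `mock_embed` (character 3-gram hashing) replaces the real BERT embedder.

```julia
using LinearAlgebra: normalize, dot

# Toy stand-in for the package's CachedItem.
struct Item
    input::String
    embedding::Vector{Float32}
    output::Union{Nothing, String}
end

# Mock embedder: hash character 3-grams into a fixed-size normalized vector.
function mock_embed(s::AbstractString; dim::Int = 64)
    v = zeros(Float32, dim)
    for i in 1:max(length(s) - 2, 1)
        v[mod1(hash(SubString(s, i, min(i + 2, lastindex(s)))), dim)] += 1
    end
    return normalize(v)
end

const STORE = Dict{String, Vector{Item}}()  # items grouped by exact cache key

function lookup(cache_key::String, input::String; min_similarity = 0.95)
    emb = mock_embed(input)
    for item in get(STORE, cache_key, Item[])  # 1) exact match on the cache key
        # 2) fuzzy match on the input via cosine similarity of normalized embeddings
        dot(emb, item.embedding) >= min_similarity && return item
    end
    return Item(input, emb, nothing)  # miss: empty output field
end

save!(cache_key, item) = push!(get!(STORE, cache_key, Item[]), item)

item = lookup("key1", "say hi!")
if item.output === nothing  # cache miss: compute the result and store it
    item = Item(item.input, item.embedding, "expensive result X")
    save!("key1", item)
end
lookup("key1", "say hi!").output  # now a cache hit
```

Note how a mismatched cache key skips the similarity search entirely, which is what makes the key a cheap first-stage filter.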
Based on your knowledge of the API calls, you need to determine: 1) the cache key (separate storage of cached items, eg, for different models or temperatures) and 2) how to unpack the HTTP request into a string (eg, unpack and join the formatted message contents for the OpenAI API).
Here is a brief outline of how you can use SemanticCaches.jl with PromptingTools.jl.
using PromptingTools
using SemanticCaches
using HTTP

## Define the new caching mechanism as a layer for HTTP
## See documentation [here](https://juliaweb.github.io/HTTP.jl/stable/client/#Quick-Examples)
module MyCache

using HTTP, JSON3
using SemanticCaches

const SEM_CACHE = SemanticCache()
const HASH_CACHE = HashCache()

function cache_layer(handler)
    return function (req; cache_key::Union{AbstractString,Nothing} = nothing, kw...)
        # only apply the cache layer if the user passed `cache_key`
        # we could also use the contents of the payload, eg, `cache_key = get(body, "model", "unknown")`
        if req.method == "POST" && cache_key !== nothing
            body = JSON3.read(copy(req.body))
            if occursin("v1/chat/completions", req.target)
                ## We're in the chat completion endpoint
                input = join([m["content"] for m in body["messages"]], " ")
            elseif occursin("v1/embeddings", req.target)
                ## We're in the embedding endpoint
                input = body["input"]
            else
                ## Skip, unknown API
                return handler(req; kw...)
            end
            ## Check the cache
            @info "Check if we can cache this request ($(length(input)) chars)"
            active_cache = length(input) > 5000 ? HASH_CACHE : SEM_CACHE
            item = active_cache("key1", input; verbose = 2) # change verbosity to 0 to disable detailed logs
            if !isvalid(item)
                @info "Cache miss! Pinging the API"
                # pass the request along to the next layer by calling the `cache_layer` arg `handler`
                resp = handler(req; kw...)
                item.output = resp
                # Let's remember it for the next time
                push!(active_cache, item)
            end
            ## Return the calculated or cached result
            return item.output
        end
        # pass the request along to the next layer by calling the `cache_layer` arg `handler`
        # also pass along the trailing keyword args `kw...`
        return handler(req; kw...)
    end
end

# Create a new client with the auth layer added
HTTP.@client [cache_layer]

end # module
# Let's push the layer globally in all HTTP.jl requests
HTTP.pushlayer!(MyCache.cache_layer)
# HTTP.poplayer!() # to remove it later

# Let's call the API
@time msg = aigenerate("What is the meaning of life?"; http_kwargs = (; cache_key = "key1"))
# The first call will be slow as usual, but any subsequent call should be pretty quick - try it a few times!
You can also use it for embeddings, eg,
@time msg = aiembed("how is it going?"; http_kwargs = (; cache_key = "key2")) # 0.7s
@time msg = aiembed("how is it going?"; http_kwargs = (; cache_key = "key2")) # 0.02s
# Even with a tiny difference (no question mark), it still picks the right cache
@time msg = aiembed("how is it going"; http_kwargs = (; cache_key = "key2")) # 0.02s
You can remove the cache layer by calling HTTP.poplayer!() (and add it back again if you made some changes).

You can probe the cache by inspecting MyCache.SEM_CACHE (eg, MyCache.SEM_CACHE.items[1]).
How is the performance?
The majority of the time will be spent on 1) the tiny embeddings (for large texts, eg, thousands of tokens) and 2) computing the cosine similarity (for large caches, eg, over 10k items).
For reference, embedding smaller texts, like the questions to embed, takes only a few milliseconds. Embedding 2000 tokens can take 50-100ms.
When it comes to the caching system itself, there are many locks to avoid faults, but the overhead is still negligible - I ran experiments with 100k sequential insertions and the time per item was only a few milliseconds (dominated by the cosine similarity). If your bottleneck is in the cosine similarity calculation (c. 4ms for 100k items), consider moving the vectors into a matrix for contiguous memory and/or using boolean embeddings with a Hamming distance (xor operator, c. an order of magnitude speedup).
All in all, the system is faster than necessary for normal workloads with thousands of cached items. If your payloads are large (consider swapping to disk), you are more likely to run into GC and memory problems than to be compute-bound. Remember that the motivation is to prevent API calls that take 1-20 seconds!
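The boolean-embedding idea mentioned above can be sketched as follows (an illustrative sketch, not part of the package): binarize each embedding by sign, pack the bits into UInt64 words, and compare with xor plus popcount.

```julia
# Binarize an embedding by sign and pack the bits into UInt64 words.
function pack_bits(emb::AbstractVector{<:Real})
    words = zeros(UInt64, cld(length(emb), 64))
    for (i, x) in pairs(emb)
        x > 0 && (words[cld(i, 64)] |= UInt64(1) << ((i - 1) % 64))
    end
    return words
end

# Hamming distance: xor the words and count the differing bits (popcount).
hamming(a::Vector{UInt64}, b::Vector{UInt64}) =
    sum(count_ones(xor(x, y)) for (x, y) in zip(a, b))

a = pack_bits([0.3, -0.1, 0.7, -0.2])
b = pack_bits([0.2, -0.3, -0.4, -0.1])
hamming(a, b)  # 1: only the third dimension flipped sign
```

A 384-dim Float32 embedding shrinks to six UInt64 words, and `xor`/`count_ones` compile to single instructions, which is where the order-of-magnitude speedup comes from.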
How to measure the time it takes to do X?
Have a look at the example snippets below - time whichever part you are interested in.
sem_cache = SemanticCache()
# First argument: the key must always match exactly, eg, model, temperature, etc
# Second argument: the input text to be compared with the cache, can be fuzzy matched
item = sem_cache("key1", "say hi!"; verbose = 1) # the verbose flag can be 0, 1, or 2 for different levels of detail
if !isvalid(item)
    @info "cache miss!"
    item.output = "expensive result X"
    # Save the result to the cache for future reference
    push!(sem_cache, item)
end
Embedding only (to adjust the min_similarity threshold, or to time the embedding)
using SemanticCaches.FlashRank: embed
using SemanticCaches: EMBEDDER

@time res = embed(EMBEDDER, "say hi")
# 0.000903 seconds (104 allocations: 19.273 KiB)
# see res.elapsed or res.embeddings

# long inputs (split into several chunks and then combining the embeddings)
@time embed(EMBEDDER, "say hi"^1000)
# 0.032148 seconds (8.11 k allocations: 662.656 KiB)
How to set the min_similarity threshold?
You can set the min_similarity threshold by adding the kwarg, eg, active_cache("key1", input; verbose=2, min_similarity=0.95).
The default is 0.95, which is a very high threshold. For practical purposes, I would recommend ~0.9. If you expect some typos, you can go even a bit lower (eg, 0.85).
Warning

Be careful with the similarity thresholds. It is hard to embed super short sequences well! You might want to adjust the threshold depending on the length of your inputs. Always test the thresholds with your inputs!
If you want to calculate the cosine similarity, remember to normalize the embeddings first, or divide the dot product by the norms.
using SemanticCaches.LinearAlgebra: normalize, norm, dot
cosine_similarity = dot(r1.embeddings, r2.embeddings) / (norm(r1.embeddings) * norm(r2.embeddings))
# remember that 1 is the best similarity, -1 is the exact opposite
You can compare different inputs to determine the best threshold for your use cases
emb1 = embed(EMBEDDER, "How is it going?") |> x -> vec(x.embeddings) |> normalize
emb2 = embed(EMBEDDER, "How is it goin'?") |> x -> vec(x.embeddings) |> normalize
dot(emb1, emb2) # 0.944

emb1 = embed(EMBEDDER, "How is it going?") |> x -> vec(x.embeddings) |> normalize
emb2 = embed(EMBEDDER, "How is it goin'") |> x -> vec(x.embeddings) |> normalize
dot(emb1, emb2) # 0.920
How to debug it?
Enable verbose logging by adding the kwarg verbose = 2, eg, item = active_cache("key1", input; verbose=2).
[ ] Time-based cache validity
[ ] Speed up the embedding process / consider preprocessing the inputs
[ ] Native integration with PromptingTools and the API schemas