SemanticCaches.jl is a very hacky implementation of a semantic cache for AI applications, to save time and money on repeated requests. It is not particularly fast, because we are trying to prevent API calls that can take even 20 seconds.
Note that we use a tiny BERT model with a maximum chunk size of 512 tokens to provide fast local embeddings that run on a CPU. For longer sentences, we split them into several chunks and consider the average of their embeddings, but use that carefully! The latency can shoot up and become worse than simply calling the original API.
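The chunk-and-average step can be sketched as follows. This is a minimal illustration, not the package's implementation: `embed_chunk` is a hypothetical stand-in for the real tiny BERT embedder, and chunking is done by characters here rather than by tokens.

```julia
using LinearAlgebra: normalize

# Hypothetical stand-in for the real tiny BERT embedder (from FlashRank):
# any function mapping a text chunk to a normalized vector would do here.
embed_chunk(chunk::AbstractString) =
    normalize(Float32[length(chunk), count(isspace, chunk) + 1, 1])

# Split a long input into fixed-size chunks (the real package chunks by
# tokens, max 512 per chunk), embed each, and average the embeddings.
function embed_long(text::AbstractString; chunk_size::Int = 512)
    chunks = [text[i:min(i + chunk_size - 1, end)] for i in 1:chunk_size:lastindex(text)]
    avg = sum(embed_chunk(c) for c in chunks) / length(chunks)
    return normalize(avg)  # re-normalize so cosine similarity is a plain dot product
end

emb = embed_long("say hi! "^1000)
```

Averaging chunk embeddings is cheap but lossy, which is exactly why the latency/quality warning above applies to long inputs.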
To install SemanticCaches.jl, simply add the package via the Julia package manager:
using Pkg;
Pkg.activate(".")
Pkg.add("SemanticCaches")
## This line is very important to be able to download the models!!!
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"
using SemanticCaches
sem_cache = SemanticCache()
# First argument: the key must always match exactly, eg, model, temperature, etc
# Second argument: the input text to be compared with the cache, can be fuzzy matched
item = sem_cache("key1", "say hi!"; verbose = 1) # the verbose flag can be 0, 1, or 2 for different levels of detail
if !isvalid(item)
    @info "cache miss!"
    item.output = "expensive result X"
    # Save the result to the cache for future reference
    push!(sem_cache, item)
end
# In practice, long texts may take too long to embed even with our tiny model
# so let's not compare anything above 2000 tokens =~ 5000 characters (threshold of c. 100ms)
hash_cache = HashCache()
input = "say hi"
input = "say hi"^1000
active_cache = length(input) > 5000 ? hash_cache : sem_cache
item = active_cache("key1", input; verbose = 1)
if !isvalid(item)
    @info "cache miss!"
    item.output = "expensive result X"
    push!(active_cache, item)
end
The main goal of building this package was to cache expensive API calls to GenAI models.
The system offers exact matching (faster, HashCache) and semantic similarity lookups (slower, SemanticCache) for STRING inputs. In addition, all requests are first compared on a "cache key", which must always match exactly for requests to be considered interchangeable (eg, same model, same provider, same temperature, etc). You need to choose the appropriate cache key and input for your use case. The default choice for the cache key should be the model name.
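For example, if several parameters must match exactly, you could combine them into a single key. This is a hypothetical helper for illustration, not part of the package:

```julia
# Hypothetical helper: combine everything that must match exactly into one key.
make_cache_key(; model::AbstractString, temperature::Real = 0.0) =
    string(model, "|temp=", temperature)

make_cache_key(model = "gpt-4o-mini", temperature = 0.7)  # "gpt-4o-mini|temp=0.7"
```

Anything left out of the key (like the message text) can then be fuzzy-matched by the semantic lookup.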
What happens when you call the cache (providing a cache_key and a string_input)?

- All cached items are stored in a vector cache.items.
- When a request comes in, we look up the cache_key to find the indices of the corresponding items in items. If the cache_key is not found, we return a CachedItem with an empty output field (ie, isvalid(item) == false).
- We embed the string_input and normalize the embedding (to make it easier to compare the cosine distance later).
- We then compare it against the embeddings of the cached items. If the cosine distance is above the min_similarity threshold, we return the cached item (the output can be found in the field item.output).
- If no cached item is found, we return a CachedItem with an empty output field (ie, isvalid(item) == false).
- Once you have computed the response and saved it in item.output, you can push the item to the cache by calling push!(cache, item).
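The lookup flow can be sketched in plain Julia. This is an illustrative toy, not the package's code: `Item` stands in for CachedItem, and `mock_embed` (character 3-gram hashing) replaces the real BERT embedder.

```julia
using LinearAlgebra: normalize, dot

# Toy stand-in for the package's CachedItem.
struct Item
    input::String
    embedding::Vector{Float32}
    output::Union{Nothing, String}
end

# Mock embedder: hash character 3-grams into a fixed-size normalized vector.
function mock_embed(s::AbstractString; dim::Int = 64)
    v = zeros(Float32, dim)
    for i in 1:max(length(s) - 2, 1)
        v[mod1(hash(SubString(s, i, min(i + 2, lastindex(s)))), dim)] += 1
    end
    return normalize(v)
end

const STORE = Dict{String, Vector{Item}}()  # items grouped by exact cache key

function lookup(cache_key::String, input::String; min_similarity = 0.95)
    emb = mock_embed(input)
    for item in get(STORE, cache_key, Item[])  # 1) exact match on the cache key
        # 2) fuzzy match on the input via cosine similarity of normalized embeddings
        dot(emb, item.embedding) >= min_similarity && return item
    end
    return Item(input, emb, nothing)  # miss: empty output field
end

save!(cache_key, item) = push!(get!(STORE, cache_key, Item[]), item)

item = lookup("key1", "say hi!")
if item.output === nothing  # cache miss: compute the result and store it
    item = Item(item.input, item.embedding, "expensive result X")
    save!("key1", item)
end
lookup("key1", "say hi!").output  # now a cache hit
```

Note how a mismatched cache key skips the similarity search entirely, which is what makes the key a cheap first-stage filter.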
Based on your knowledge of the API calls, you need to determine: 1) the cache key (separate storage of cached items, eg, for different models or temperatures) and 2) how to unpack the HTTP request into a string (eg, unpack and join the formatted message contents for the OpenAI API).
Here is a brief outline of how you can use SemanticCaches.jl with PromptingTools.jl.
using PromptingTools
using SemanticCaches
using HTTP

## Define the new caching mechanism as a layer for HTTP
## See documentation [here](https://juliaweb.github.io/HTTP.jl/stable/client/#Quick-Examples)
module MyCache

using HTTP, JSON3
using SemanticCaches

const SEM_CACHE = SemanticCache()
const HASH_CACHE = HashCache()

function cache_layer(handler)
    return function (req; cache_key::Union{AbstractString,Nothing} = nothing, kw...)
        # only apply the cache layer if the user passed `cache_key`
        # we could also use the contents of the payload, eg, `cache_key = get(body, "model", "unknown")`
        if req.method == "POST" && cache_key !== nothing
            body = JSON3.read(copy(req.body))
            if occursin("v1/chat/completions", req.target)
                ## We're in the chat completion endpoint
                input = join([m["content"] for m in body["messages"]], " ")
            elseif occursin("v1/embeddings", req.target)
                ## We're in the embedding endpoint
                input = body["input"]
            else
                ## Skip, unknown API
                return handler(req; kw...)
            end
            ## Check the cache
            @info "Check if we can cache this request ($(length(input)) chars)"
            active_cache = length(input) > 5000 ? HASH_CACHE : SEM_CACHE
            item = active_cache("key1", input; verbose = 2) # change verbosity to 0 to disable detailed logs
            if !isvalid(item)
                @info "Cache miss! Pinging the API"
                # pass the request along to the next layer by calling the `cache_layer` arg `handler`
                resp = handler(req; kw...)
                item.output = resp
                # Let's remember it for the next time
                push!(active_cache, item)
            end
            ## Return the calculated or cached result
            return item.output
        end
        # pass the request along to the next layer by calling the `cache_layer` arg `handler`
        # also pass along the trailing keyword args `kw...`
        return handler(req; kw...)
    end
end

# Create a new client with the auth layer added
HTTP.@client [cache_layer]

end # module
# Let's push the layer globally in all HTTP.jl requests
HTTP.pushlayer!(MyCache.cache_layer)
# HTTP.poplayer!() # to remove it later

# Let's call the API
@time msg = aigenerate("What is the meaning of life?"; http_kwargs = (; cache_key = "key1"))
# The first call will be slow as usual, but any subsequent call should be pretty quick - try it a few times!
You can also use it for embeddings, eg,
@time msg = aiembed("how is it going?"; http_kwargs = (; cache_key = "key2")) # 0.7s
@time msg = aiembed("how is it going?"; http_kwargs = (; cache_key = "key2")) # 0.02s
# Even with a tiny difference (no question mark), it still picks the right cache
@time msg = aiembed("how is it going"; http_kwargs = (; cache_key = "key2")) # 0.02s
You can remove the cache layer by calling HTTP.poplayer!() (and add it back again if you made some changes).

You can probe the cache by inspecting MyCache.SEM_CACHE (eg, MyCache.SEM_CACHE.items[1]).
How is the performance?
The majority of the time will be spent on 1) the tiny embeddings (for large texts, eg, thousands of tokens) and 2) computing the cosine similarity (for large caches, eg, over 10k items).
For reference, embedding smaller texts, like the questions to embed, takes only a few milliseconds. Embedding 2000 tokens can take 50-100ms.
When it comes to the caching system itself, there are many locks to avoid faults, but the overhead is still negligible - I ran experiments with 100k sequential insertions and the time per item was only a few milliseconds (dominated by the cosine similarity). If your bottleneck is in the cosine similarity calculation (c. 4ms for 100k items), consider moving the vectors into a matrix for contiguous memory and/or using boolean embeddings with a Hamming distance (xor operator, c. an order of magnitude speedup).
All in all, the system is faster than necessary for normal workloads with thousands of cached items. If your payloads are large (consider swapping to disk), you are more likely to run into GC and memory problems than to be compute-bound. Remember that the motivation is to prevent API calls that take 1-20 seconds!
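The boolean-embedding idea mentioned above can be sketched as follows (an illustrative sketch, not part of the package): binarize each embedding by sign, pack the bits into UInt64 words, and compare with xor plus popcount.

```julia
# Binarize an embedding by sign and pack the bits into UInt64 words.
function pack_bits(emb::AbstractVector{<:Real})
    words = zeros(UInt64, cld(length(emb), 64))
    for (i, x) in pairs(emb)
        x > 0 && (words[cld(i, 64)] |= UInt64(1) << ((i - 1) % 64))
    end
    return words
end

# Hamming distance: xor the words and count the differing bits (popcount).
hamming(a::Vector{UInt64}, b::Vector{UInt64}) =
    sum(count_ones(xor(x, y)) for (x, y) in zip(a, b))

a = pack_bits([0.3, -0.1, 0.7, -0.2])
b = pack_bits([0.2, -0.3, -0.4, -0.1])
hamming(a, b)  # 1: only the third dimension flipped sign
```

A 384-dim Float32 embedding shrinks to six UInt64 words, and `xor`/`count_ones` compile to single instructions, which is where the order-of-magnitude speedup comes from.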
How to measure the time it takes to do X?
Have a look at the example snippets below - time whichever part you are interested in.
sem_cache = SemanticCache()
# First argument: the key must always match exactly, eg, model, temperature, etc
# Second argument: the input text to be compared with the cache, can be fuzzy matched
item = sem_cache("key1", "say hi!"; verbose = 1) # the verbose flag can be 0, 1, or 2 for different levels of detail
if !isvalid(item)
    @info "cache miss!"
    item.output = "expensive result X"
    # Save the result to the cache for future reference
    push!(sem_cache, item)
end
Embedding only (to adjust the min_similarity threshold, or to time the embedding)
using SemanticCaches.FlashRank: embed
using SemanticCaches: EMBEDDER

@time res = embed(EMBEDDER, "say hi")
# 0.000903 seconds (104 allocations: 19.273 KiB)
# see res.elapsed or res.embeddings

# long inputs (split into several chunks and then combining the embeddings)
@time embed(EMBEDDER, "say hi"^1000)
# 0.032148 seconds (8.11 k allocations: 662.656 KiB)
How to set the min_similarity threshold?
You can set the min_similarity threshold by adding the kwarg, eg, active_cache("key1", input; verbose=2, min_similarity=0.95).
The default is 0.95, which is a very high threshold. For practical purposes, I would recommend ~0.9. If you expect some typos, you can go even a bit lower (eg, 0.85).
Warning

Be careful with the similarity thresholds. It is hard to embed super short sequences well! You might want to adjust the threshold depending on the length of your inputs. Always test the thresholds with your inputs!
If you want to calculate the cosine similarity, remember to normalize the embeddings first, or divide the dot product by the norms.
using SemanticCaches.LinearAlgebra: normalize, norm, dot
cosine_similarity = dot(r1.embeddings, r2.embeddings) / (norm(r1.embeddings) * norm(r2.embeddings))
# remember that 1 is the best similarity, -1 is the exact opposite
You can compare different inputs to determine the best threshold for your use cases
emb1 = embed(EMBEDDER, "How is it going?") |> x -> vec(x.embeddings) |> normalize
emb2 = embed(EMBEDDER, "How is it goin'?") |> x -> vec(x.embeddings) |> normalize
dot(emb1, emb2) # 0.944

emb1 = embed(EMBEDDER, "How is it going?") |> x -> vec(x.embeddings) |> normalize
emb2 = embed(EMBEDDER, "How is it goin'") |> x -> vec(x.embeddings) |> normalize
dot(emb1, emb2) # 0.920
How to debug it?
Enable verbose logging by adding the kwarg verbose = 2, eg, item = active_cache("key1", input; verbose=2).
[ ] Time-based cache validity
[ ] Speed up the embedding process / consider preprocessing the inputs
[ ] Native integration with PromptingTools and the API schemas