SemanticCaches.jl is a very hacky implementation of semantic caching for AI applications, designed to save time and money on repeated requests. It is not particularly fast, because we are trying to prevent API calls that can take even 20 seconds.
Note that we use a tiny BERT model with a maximum chunk size of 512 tokens to provide fast local embeddings that run on a CPU. For longer sentences, we split them into several chunks and use their average embedding, but use this carefully! The latency can skyrocket and become worse than simply calling the original API.
To install SemanticCaches.jl, simply add the package using the Julia package manager:
using Pkg;
Pkg.activate(".")
Pkg.add("SemanticCaches")

## This line is very important to be able to download the models!!!
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"
using SemanticCaches
sem_cache = SemanticCache()
# First argument: the key must always match exactly, eg, model, temperature, etc
# Second argument: the input text to be compared with the cache, can be fuzzy matched
item = sem_cache("key1", "say hi!"; verbose = 1) # the verbose flag can be 0, 1, or 2 for different levels of detail
if !isvalid(item)
    @info "Cache miss!"
    item.output = "expensive result X"
    # Save the result to the cache for future reference
    push!(sem_cache, item)
end
# In practice, long texts may take too long to embed even with our tiny model,
# so let's not compare anything above 2000 tokens =~ 5000 characters (threshold of c. 100ms)
hash_cache = HashCache()
# try a short input, then a long one
input = "say hi"
input = "say hi"^1000

active_cache = length(input) > 5000 ? hash_cache : sem_cache
item = active_cache("key1", input; verbose = 1)
if !isvalid(item)
    @info "Cache miss!"
    item.output = "expensive result X"
    push!(active_cache, item)
end
The primary objective of building this package was to cache expensive API calls to GenAI models.
The system offers exact matching for STRING inputs (faster, `HashCache`) and semantic similarity lookup (slower, `SemanticCache`). In addition, all requests are first compared on a "cache key": the provided key must always match exactly for two requests to be considered interchangeable (eg, same model, same provider, same temperature, etc.). You need to choose the appropriate cache key and input for your use case. The default choice for the cache key should be the model name.
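For instance, if you pick the model name as the cache key, a call might look like the following (the model name "gpt-4o-mini" is just a hypothetical example):

# Hypothetical example: using the model name as the cache key keeps responses
# from different models from ever being treated as interchangeable
item = sem_cache("gpt-4o-mini", "What is the meaning of life?"; verbose = 1)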
What happens when you call the cache (providing a `cache_key` and a `string_input`)?

- All cached outputs are stored in a vector `cache.items`.
- When a request comes in, the `cache_key` is looked up to find the indices of the corresponding items in `items`. If the `cache_key` is not found, we return a `CachedItem` with an empty `output` field (ie, `isvalid(item) == false`).
- We embed the `string_input` and normalize the embedding (to make the later cosine-distance comparison easier).
- We then compare the cosine distance against the embeddings of the cached items; if it clears the `min_similarity` threshold, we return the cached item (the output can be found in the field `item.output`), as shown in the sketch below.
- If no cached item is found, we return a `CachedItem` with an empty `output` field (ie, `isvalid(item) == false`).
- Once you have computed the response and saved it in `item.output`, you can push the item to the cache by calling `push!(cache, item)`.
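Here is a minimal sketch of that similarity-lookup step, assuming the cached embeddings are already normalized (the `cached_embeddings` and `cached_outputs` containers are hypothetical stand-ins for the package's internal storage, not its actual API):

using SemanticCaches.FlashRank: embed
using SemanticCaches: EMBEDDER
using LinearAlgebra: normalize, dot

# hypothetical stand-ins for the items stored under one cache key
cached_embeddings = [normalize(vec(embed(EMBEDDER, "say hi!").embeddings))]
cached_outputs = ["expensive result X"]

query = normalize(vec(embed(EMBEDDER, "say hi").embeddings))
# with normalized vectors, the dot product equals the cosine similarity
sims = [dot(query, emb) for emb in cached_embeddings]
best, idx = findmax(sims)
best >= 0.95 ? cached_outputs[idx] : nothing # 0.95 is the default min_similarity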
Based on your knowledge of the API calls, you need to determine: 1) the cache key (separate storage of cached items, eg, different models or temperatures) and 2) how to unpack the HTTP request into a string (eg, unwrap and join the formatted message contents for the OpenAI API).
Here is a brief outline of how you can use SemanticCaches.jl with PromptingTools.jl.
using PromptingTools
using SemanticCaches
using HTTP

## Define the new caching mechanism as a layer for HTTP
## See documentation [here](https://juliaweb.github.io/HTTP.jl/stable/client/#Quick-Examples)
module MyCache

using HTTP, JSON3
using SemanticCaches

const SEM_CACHE = SemanticCache()
const HASH_CACHE = HashCache()

function cache_layer(handler)
    return function (req; cache_key::Union{AbstractString,Nothing} = nothing, kw...)
        # only apply the cache layer if the user passed `cache_key`
        # we could also use the contents of the payload, eg, `cache_key = get(body, "model", "unknown")`
        if req.method == "POST" && cache_key !== nothing
            body = JSON3.read(copy(req.body))
            if occursin("v1/chat/completions", req.target)
                ## We're in the chat completion endpoint
                input = join([m["content"] for m in body["messages"]], " ")
            elseif occursin("v1/embeddings", req.target)
                ## We're in the embedding endpoint
                input = body["input"]
            else
                ## Skip, unknown API
                return handler(req; kw...)
            end
            ## Check the cache
            @info "Check if we can cache this request ($(length(input)) chars)"
            active_cache = length(input) > 5000 ? HASH_CACHE : SEM_CACHE
            item = active_cache("key1", input; verbose = 2) # change verbosity to 0 to disable detailed logs
            if !isvalid(item)
                @info "Cache miss! Pinging the API"
                # pass the request along to the next layer by calling the `handler` argument of `cache_layer`
                resp = handler(req; kw...)
                item.output = resp
                # Let's remember it for the next time
                push!(active_cache, item)
            end
            ## Return the calculated or cached result
            return item.output
        end
        # pass the request along to the next layer by calling the `handler` argument of `cache_layer`
        # also pass along the trailing keyword args `kw...`
        return handler(req; kw...)
    end
end

# Create a new client with the caching layer added
HTTP.@client [cache_layer]

end # module
# Let's push the layer globally in all HTTP.jl requests
HTTP.pushlayer!(MyCache.cache_layer)
# HTTP.poplayer!() # to remove it later

# Let's call the API
@time msg = aigenerate("What is the meaning of life?"; http_kwargs = (; cache_key = "key1"))
# The first call will be slow as usual, but any subsequent call should be pretty quick - try it a few times!
You can also use it for embeddings, eg,
@time msg = aiembed("how is it going?"; http_kwargs = (; cache_key = "key2")) # 0.7s
@time msg = aiembed("how is it going?"; http_kwargs = (; cache_key = "key2")) # 0.02s
# Even with a tiny difference (no question mark), it still picks the right cache
@time msg = aiembed("how is it going"; http_kwargs = (; cache_key = "key2")) # 0.02s
You can remove the cache layer by calling `HTTP.poplayer!()` (and add it back again if you make some changes).
You can probe the cache by inspecting `MyCache.SEM_CACHE` (eg, `MyCache.SEM_CACHE.items[1]`).
How is the performance?
The majority of the time will be spent in 1) computing the tiny embeddings (for large texts, eg, thousands of tokens) and 2) calculating the cosine similarity (for large caches, eg, more than 10k items).
For reference, embedding smaller texts like the questions you want to compare takes only a few milliseconds. Embedding 2000 tokens can take anywhere from 50-100ms.
When it comes to the caching system itself, there are many locks to avoid faults, but the overhead is still negligible - I ran experiments with 100k sequential insertions and the time per item was only a few milliseconds (dominated by the cosine similarity). If your bottleneck is the cosine similarity calculation (c. 4ms for 100k items), consider moving the vectors into a matrix for contiguous memory and/or using Boolean embeddings with Hamming distance (XOR operator, c. an order of magnitude speedup).
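As a minimal sketch of that Boolean-embedding idea, assuming the embedding length is a multiple of 64 (the `bitpack` and `hamming` helpers below are hypothetical illustrations, not part of the package):

using SemanticCaches.FlashRank: embed
using SemanticCaches: EMBEDDER

# binarize each dimension by sign and pack 64 dimensions per UInt64 word
# (assumes length(v) is a multiple of 64)
bitpack(v) = [reduce(|, (UInt64(v[i + j] > 0) << j for j in 0:63)) for i in 1:64:length(v) - 63]
# Hamming distance: XOR the packed words and count the differing bits
hamming(a, b) = sum(count_ones(x ⊻ y) for (x, y) in zip(a, b))

b1 = bitpack(vec(embed(EMBEDDER, "How is it going?").embeddings))
b2 = bitpack(vec(embed(EMBEDDER, "How is it goin'?").embeddings))
hamming(b1, b2) # lower means more similar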
All in all, the system is faster than necessary for normal workloads with thousands of cached items. You are more likely to run into GC and memory problems if your payloads are big (consider swapping to disk) than to hit compute limits. Remember that the motivation is to prevent API calls that take anywhere between 1-20 seconds!
How to measure the time it takes to do X?
Have a look at the example snippets below - just time the part you are interested in.
sem_cache = SemanticCache()
# First argument: the key must always match exactly, eg, model, temperature, etc
# Second argument: the input text to be compared with the cache, can be fuzzy matched
item = sem_cache("key1", "say hi!"; verbose = 1) # the verbose flag can be 0, 1, or 2 for different levels of detail
if !isvalid(item)
    @info "Cache miss!"
    item.output = "expensive result X"
    # Save the result to the cache for future reference
    push!(sem_cache, item)
end
Embedding only (to tune the `min_similarity` threshold or to time the embeddings):
using SemanticCaches.FlashRank: embed
using SemanticCaches: EMBEDDER

@time res = embed(EMBEDDER, "say hi")
# 0.000903 seconds (104 allocations: 19.273 KiB)
# see res.elapsed or res.embeddings

# long inputs (split into several chunks, whose embeddings are then combined)
@time embed(EMBEDDER, "say hi"^1000)
# 0.032148 seconds (8.11 k allocations: 662.656 KiB)
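For those long inputs, the package splits the text into chunks and combines their embeddings (as noted at the top). The sketch below illustrates that averaging behaviour using fixed-size character windows as a rough stand-in for the 512-token chunks; the chunking details here are purely illustrative, not the package's actual internals:

using SemanticCaches.FlashRank: embed
using SemanticCaches: EMBEDDER
using LinearAlgebra: normalize

long_text = "say hi "^2000 # much longer than one 512-token chunk
# approximate the token chunks with fixed-size character windows (illustration only)
chunks = [long_text[i:min(i + 2047, end)] for i in 1:2048:lastindex(long_text)]
# embed each chunk, average the vectors, and re-normalize for cosine comparisons
embs = [vec(embed(EMBEDDER, chunk).embeddings) for chunk in chunks]
avg_emb = normalize(sum(embs) ./ length(embs))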
How to set the `min_similarity` threshold?
You can set the `min_similarity` threshold by adding the kwarg, eg, `active_cache("key1", input; verbose=2, min_similarity=0.95)`.
The default is 0.95, which is a very high threshold. For practical purposes, I would recommend ~0.9. If you expect some typos to come through, you can go even a bit lower (eg, 0.85).
Warning: Be careful with the similarity thresholds. It is hard to embed super-short sequences well! You may want to adjust the threshold depending on the length of the input. Always test the thresholds on your own inputs!
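As a sketch of such a length-dependent threshold (the 20-character cut-off and both threshold values are arbitrary assumptions, purely to illustrate the idea):

# hypothetical heuristic: demand a stricter match for very short inputs,
# where embeddings are less reliable, and relax it for longer ones
min_sim = length(input) < 20 ? 0.98 : 0.90
item = active_cache("key1", input; verbose = 2, min_similarity = min_sim)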
If you want to calculate the cosine similarity yourself, remember to `normalize` the embeddings first or divide the dot product by the norms.
using SemanticCaches.LinearAlgebra: normalize, norm, dot
cosine_similarity = dot(r1.embeddings, r2.embeddings) / (norm(r1.embeddings) * norm(r2.embeddings))
# remember that 1 is the best similarity, -1 is the exact opposite
You can compare different inputs to determine the best threshold for your use cases:
emb1 = embed(EMBEDDER, "How is it going?") |> x -> vec(x.embeddings) |> normalize
emb2 = embed(EMBEDDER, "How is it goin'?") |> x -> vec(x.embeddings) |> normalize
dot(emb1, emb2) # 0.944

emb1 = embed(EMBEDDER, "How is it going?") |> x -> vec(x.embeddings) |> normalize
emb2 = embed(EMBEDDER, "How is it goin'") |> x -> vec(x.embeddings) |> normalize
dot(emb1, emb2) # 0.920
How to debug it?
Enable verbose logging by adding the kwarg `verbose = 2`, eg, `item = active_cache("key1", input; verbose=2)`.
Roadmap:

- [ ] Time-based cache validity
- [ ] Speed up the embedding process / consider pre-processing the inputs
- [ ] Native integration with PromptingTools and the API schemas