SemanticCaches.jl is a very hacky implementation of semantic caching for AI applications, designed to save time and money on repeated requests. It is not particularly fast, because we are trying to prevent API calls that can take even 20 seconds.
Note that we use a tiny BERT model with a maximum chunk size of 512 tokens to provide fast local embeddings that run on a CPU. For longer sentences, we split them into several chunks and use their average embedding, but use this carefully! The latency can skyrocket and become worse than simply calling the original API.
To install SemanticCaches.jl, simply add the package using the Julia package manager:
using Pkg;
Pkg.activate(".")
Pkg.add("SemanticCaches")

## This line is very important to be able to download the models!!!
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"
using SemanticCaches
sem_cache = SemanticCache()
# First argument: the key must always match exactly, eg, model, temperature, etc
# Second argument: the input text to be compared with the cache, can be fuzzy matched
item = sem_cache("key1", "say hi!"; verbose = 1) # the verbose flag can be 0, 1, or 2 for different levels of detail
if !isvalid(item)
    @info "Cache miss!"
    item.output = "expensive result X"
    # Save the result to the cache for future reference
    push!(sem_cache, item)
end
# In practice, long texts may take too long to embed even with our tiny model,
# so let's not compare anything above 2000 tokens =~ 5000 characters (threshold of c. 100ms)
hash_cache = HashCache()
# try a short input, then a long one
input = "say hi"
input = "say hi"^1000

active_cache = length(input) > 5000 ? hash_cache : sem_cache
item = active_cache("key1", input; verbose = 1)
if !isvalid(item)
    @info "Cache miss!"
    item.output = "expensive result X"
    push!(active_cache, item)
end
The primary objective of building this package was to cache expensive API calls to GenAI models.
The system offers exact matching for STRING inputs (faster, `HashCache`) and semantic similarity lookup (slower, `SemanticCache`). In addition, all requests are first compared on a "cache key": the provided key must always match exactly for two requests to be considered interchangeable (eg, same model, same provider, same temperature, etc.). You need to choose the appropriate cache key and input for your use case. The default choice for the cache key should be the model name.
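For instance, if you pick the model name as the cache key, a call might look like the following (the model name "gpt-4o-mini" is just a hypothetical example):

# Hypothetical example: using the model name as the cache key keeps responses
# from different models from ever being treated as interchangeable
item = sem_cache("gpt-4o-mini", "What is the meaning of life?"; verbose = 1)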
What happens when you call the cache (providing a `cache_key` and a `string_input`)?

- All cached outputs are stored in a vector `cache.items`.
- When a request comes in, the `cache_key` is looked up to find the indices of the corresponding items in `items`. If the `cache_key` is not found, we return a `CachedItem` with an empty `output` field (ie, `isvalid(item) == false`).
- We embed the `string_input` and normalize the embedding (to make the later cosine-distance comparison easier).
- We then compare the cosine distance against the embeddings of the cached items; if it clears the `min_similarity` threshold, we return the cached item (the output can be found in the field `item.output`), as shown in the sketch below.
- If no cached item is found, we return a `CachedItem` with an empty `output` field (ie, `isvalid(item) == false`).
- Once you have computed the response and saved it in `item.output`, you can push the item to the cache by calling `push!(cache, item)`.
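Here is a minimal sketch of that similarity-lookup step, assuming the cached embeddings are already normalized (the `cached_embeddings` and `cached_outputs` containers are hypothetical stand-ins for the package's internal storage, not its actual API):

using SemanticCaches.FlashRank: embed
using SemanticCaches: EMBEDDER
using LinearAlgebra: normalize, dot

# hypothetical stand-ins for the items stored under one cache key
cached_embeddings = [normalize(vec(embed(EMBEDDER, "say hi!").embeddings))]
cached_outputs = ["expensive result X"]

query = normalize(vec(embed(EMBEDDER, "say hi").embeddings))
# with normalized vectors, the dot product equals the cosine similarity
sims = [dot(query, emb) for emb in cached_embeddings]
best, idx = findmax(sims)
best >= 0.95 ? cached_outputs[idx] : nothing # 0.95 is the default min_similarity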
Based on your knowledge of the API calls, you need to determine: 1) the cache key (separate storage of cached items, eg, different models or temperatures) and 2) how to unpack the HTTP request into a string (eg, unwrap and join the formatted message contents for the OpenAI API).
Here is a brief outline of how you can use SemanticCaches.jl with PromptingTools.jl.
using PromptingTools
using SemanticCaches
using HTTP

## Define the new caching mechanism as a layer for HTTP
## See documentation [here](https://juliaweb.github.io/HTTP.jl/stable/client/#Quick-Examples)
module MyCache

using HTTP, JSON3
using SemanticCaches

const SEM_CACHE = SemanticCache()
const HASH_CACHE = HashCache()

function cache_layer(handler)
    return function (req; cache_key::Union{AbstractString,Nothing} = nothing, kw...)
        # only apply the cache layer if the user passed `cache_key`
        # we could also use the contents of the payload, eg, `cache_key = get(body, "model", "unknown")`
        if req.method == "POST" && cache_key !== nothing
            body = JSON3.read(copy(req.body))
            if occursin("v1/chat/completions", req.target)
                ## We're in the chat completion endpoint
                input = join([m["content"] for m in body["messages"]], " ")
            elseif occursin("v1/embeddings", req.target)
                ## We're in the embedding endpoint
                input = body["input"]
            else
                ## Skip, unknown API
                return handler(req; kw...)
            end
            ## Check the cache
            @info "Check if we can cache this request ($(length(input)) chars)"
            active_cache = length(input) > 5000 ? HASH_CACHE : SEM_CACHE
            item = active_cache("key1", input; verbose = 2) # change verbosity to 0 to disable detailed logs
            if !isvalid(item)
                @info "Cache miss! Pinging the API"
                # pass the request along to the next layer by calling the `handler` argument of `cache_layer`
                resp = handler(req; kw...)
                item.output = resp
                # Let's remember it for the next time
                push!(active_cache, item)
            end
            ## Return the calculated or cached result
            return item.output
        end
        # pass the request along to the next layer by calling the `handler` argument of `cache_layer`
        # also pass along the trailing keyword args `kw...`
        return handler(req; kw...)
    end
end

# Create a new client with the caching layer added
HTTP.@client [cache_layer]

end # module
# Let's push the layer globally in all HTTP.jl requests
HTTP.pushlayer!(MyCache.cache_layer)
# HTTP.poplayer!() # to remove it later

# Let's call the API
@time msg = aigenerate("What is the meaning of life?"; http_kwargs = (; cache_key = "key1"))
# The first call will be slow as usual, but any subsequent call should be pretty quick - try it a few times!
You can also use it for embeddings, eg,
@time msg = aiembed("how is it going?"; http_kwargs = (; cache_key = "key2")) # 0.7s
@time msg = aiembed("how is it going?"; http_kwargs = (; cache_key = "key2")) # 0.02s
# Even with a tiny difference (no question mark), it still picks the right cache
@time msg = aiembed("how is it going"; http_kwargs = (; cache_key = "key2")) # 0.02s
You can remove the cache layer by calling `HTTP.poplayer!()` (and add it back again if you make some changes).
You can probe the cache by inspecting `MyCache.SEM_CACHE` (eg, `MyCache.SEM_CACHE.items[1]`).
How is the performance?
The majority of the time will be spent in 1) computing the tiny embeddings (for large texts, eg, thousands of tokens) and 2) calculating the cosine similarity (for large caches, eg, more than 10k items).
For reference, embedding smaller texts like the questions you want to compare takes only a few milliseconds. Embedding 2000 tokens can take anywhere from 50-100ms.
When it comes to the caching system itself, there are many locks to avoid faults, but the overhead is still negligible - I ran experiments with 100k sequential insertions and the time per item was only a few milliseconds (dominated by the cosine similarity). If your bottleneck is the cosine similarity calculation (c. 4ms for 100k items), consider moving the vectors into a matrix for contiguous memory and/or using Boolean embeddings with Hamming distance (XOR operator, c. an order of magnitude speedup).
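As a minimal sketch of that Boolean-embedding idea, assuming the embedding length is a multiple of 64 (the `bitpack` and `hamming` helpers below are hypothetical illustrations, not part of the package):

using SemanticCaches.FlashRank: embed
using SemanticCaches: EMBEDDER

# binarize each dimension by sign and pack 64 dimensions per UInt64 word
# (assumes length(v) is a multiple of 64)
bitpack(v) = [reduce(|, (UInt64(v[i + j] > 0) << j for j in 0:63)) for i in 1:64:length(v) - 63]
# Hamming distance: XOR the packed words and count the differing bits
hamming(a, b) = sum(count_ones(x ⊻ y) for (x, y) in zip(a, b))

b1 = bitpack(vec(embed(EMBEDDER, "How is it going?").embeddings))
b2 = bitpack(vec(embed(EMBEDDER, "How is it goin'?").embeddings))
hamming(b1, b2) # lower means more similar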
All in all, the system is faster than necessary for normal workloads with thousands of cached items. You are more likely to run into GC and memory problems if your payloads are big (consider swapping to disk) than to hit compute limits. Remember that the motivation is to prevent API calls that take anywhere between 1-20 seconds!
How to measure the time it takes to do X?
Have a look at the example snippets below - just time the part you are interested in.
sem_cache = SemanticCache()
# First argument: the key must always match exactly, eg, model, temperature, etc
# Second argument: the input text to be compared with the cache, can be fuzzy matched
item = sem_cache("key1", "say hi!"; verbose = 1) # the verbose flag can be 0, 1, or 2 for different levels of detail
if !isvalid(item)
    @info "Cache miss!"
    item.output = "expensive result X"
    # Save the result to the cache for future reference
    push!(sem_cache, item)
end
Embedding only (to tune the `min_similarity` threshold or to time the embeddings):
using SemanticCaches.FlashRank: embed
using SemanticCaches: EMBEDDER

@time res = embed(EMBEDDER, "say hi")
# 0.000903 seconds (104 allocations: 19.273 KiB)
# see res.elapsed or res.embeddings

# long inputs (split into several chunks, whose embeddings are then combined)
@time embed(EMBEDDER, "say hi"^1000)
# 0.032148 seconds (8.11 k allocations: 662.656 KiB)
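For those long inputs, the package splits the text into chunks and combines their embeddings (as noted at the top). The sketch below illustrates that averaging behaviour using fixed-size character windows as a rough stand-in for the 512-token chunks; the chunking details here are purely illustrative, not the package's actual internals:

using SemanticCaches.FlashRank: embed
using SemanticCaches: EMBEDDER
using LinearAlgebra: normalize

long_text = "say hi "^2000 # much longer than one 512-token chunk
# approximate the token chunks with fixed-size character windows (illustration only)
chunks = [long_text[i:min(i + 2047, end)] for i in 1:2048:lastindex(long_text)]
# embed each chunk, average the vectors, and re-normalize for cosine comparisons
embs = [vec(embed(EMBEDDER, chunk).embeddings) for chunk in chunks]
avg_emb = normalize(sum(embs) ./ length(embs))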
How to set the `min_similarity` threshold?
You can set the `min_similarity` threshold by adding the kwarg, eg, `active_cache("key1", input; verbose=2, min_similarity=0.95)`.
The default is 0.95, which is a very high threshold. For practical purposes, I would recommend ~0.9. If you expect some typos to come through, you can go even a bit lower (eg, 0.85).
Warning: Be careful with the similarity thresholds. It is hard to embed super-short sequences well! You may want to adjust the threshold depending on the length of the input. Always test the thresholds on your own inputs!
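As a sketch of such a length-dependent threshold (the 20-character cut-off and both threshold values are arbitrary assumptions, purely to illustrate the idea):

# hypothetical heuristic: demand a stricter match for very short inputs,
# where embeddings are less reliable, and relax it for longer ones
min_sim = length(input) < 20 ? 0.98 : 0.90
item = active_cache("key1", input; verbose = 2, min_similarity = min_sim)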
If you want to calculate the cosine similarity yourself, remember to `normalize` the embeddings first or divide the dot product by the norms.
using SemanticCaches.LinearAlgebra: normalize, norm, dot
cosine_similarity = dot(r1.embeddings, r2.embeddings) / (norm(r1.embeddings) * norm(r2.embeddings))
# remember that 1 is the best similarity, -1 is the exact opposite
You can compare different inputs to determine the best threshold for your use cases:
emb1 = embed(EMBEDDER, "How is it going?") |> x -> vec(x.embeddings) |> normalize
emb2 = embed(EMBEDDER, "How is it goin'?") |> x -> vec(x.embeddings) |> normalize
dot(emb1, emb2) # 0.944

emb1 = embed(EMBEDDER, "How is it going?") |> x -> vec(x.embeddings) |> normalize
emb2 = embed(EMBEDDER, "How is it goin'") |> x -> vec(x.embeddings) |> normalize
dot(emb1, emb2) # 0.920
How to debug it?
Enable verbose logging by adding the kwarg `verbose = 2`, eg, `item = active_cache("key1", input; verbose=2)`.
Roadmap:

- [ ] Time-based cache validity
- [ ] Speed up the embedding process / consider pre-processing the inputs
- [ ] Native integration with PromptingTools and the API schemas