SemanticCaches.jl

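SemanticCaches.jl is a very hacky implementation of a semantic cache for AI applications, meant to save time and money on repeated requests. It is not particularly fast, because its goal is to avoid API calls that can take as long as 20 seconds.

Note that it uses a tiny BERT model with a maximum chunk size of 512 tokens to provide fast local embeddings on a CPU. Longer inputs are split into several chunks and represented by the average of the chunk embeddings, but use this carefully! The latency can skyrocket and become worse than simply calling the original API.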

Installation

To install SemanticCaches.jl, simply add the package using the Julia package manager:

using Pkg
Pkg.activate(".")
Pkg.add("SemanticCaches")

Quick Start Guide

## This line is very important to be able to download the models!!!
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"
using SemanticCaches

sem_cache = SemanticCache()
# First argument: the key must always match exactly, eg, model, temperature, etc
# Second argument: the input text to be compared with the cache, can be fuzzy matched
item = sem_cache("key1", "say hi!"; verbose = 1) # the verbose flag can be 0, 1, or 2 for different levels of detail
if !isvalid(item)
    @info "cache miss!"
    item.output = "expensive result X"
    # Save the result to the cache for future reference
    push!(sem_cache, item)
end

# In practice, long texts may take too long to embed even with our tiny model,
# so let's not compare anything above 2000 tokens =~ 5000 characters (threshold of c. 100ms)

hash_cache = HashCache()
input = "say hi"
input = "say hi "^1000

# Long inputs fall back to exact (hash) matching; short inputs use semantic matching
active_cache = length(input) > 5000 ? hash_cache : sem_cache
item = active_cache("key1", input; verbose = 1)

if !isvalid(item)
    @info "cache miss!"
    item.output = "expensive result X"
    push!(active_cache, item)
end

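How it Works

The primary objective of this package is to cache expensive API calls to GenAI models.

The system offers exact matching (faster, HashCache) and semantic similarity lookup (slower, SemanticCache) of STRING inputs. In addition, all requests are first compared on a "cache key": a key that must always match exactly for two requests to be considered interchangeable (eg, same model, same provider, same temperature). You need to choose the appropriate cache key and input for your use case. The default choice for the cache key should be the model name.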
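As an illustration of choosing a cache key, the sketch below joins the request settings that must match exactly into one string. `make_cache_key`, its keyword names, and the `|` separator are hypothetical and not part of the package; any scheme that yields the same string for interchangeable requests would do.

```julia
# Hypothetical helper (not part of SemanticCaches.jl): build one string from
# the settings that must match exactly for two requests to be interchangeable.
function make_cache_key(; model::AbstractString, provider::AbstractString = "unknown",
        temperature::Real = 0.0)
    return join((provider, model, temperature), "|")
end

key_a = make_cache_key(model = "my-model", provider = "my-provider", temperature = 0.7)
key_b = make_cache_key(model = "my-model", provider = "my-provider", temperature = 0.0)
```

Here `key_a` and `key_b` differ only in temperature, so they would be stored and looked up separately.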

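What happens when you call the cache (providing cache_key and string_input)?

All cached outputs are stored in the vector cache.items.
When a request arrives, the cache_key is looked up to find the indices of the corresponding items in items. If the cache_key is not found, a CachedItem with an empty output field is returned (ie, isvalid(item) == false).
The string_input is embedded with a tiny BERT model and the embedding is normalized (so that cosine similarity can later be computed as a plain dot product).
The embedding is then compared with the embeddings of the cached items using cosine similarity.
If the cosine similarity is higher than the min_similarity threshold, the cached item is returned (the output can be found in the field item.output).

If no cached item is found, a CachedItem with an empty output field is returned (ie, isvalid(item) == false). Once you have computed the response and saved it in item.output, you can push the item to the cache by calling push!(cache, item).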
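The lookup steps above can be sketched in plain Julia. This is a toy model under stated assumptions, not the package's implementation: `toy_embed` (normalized letter counts) stands in for the BERT embedder, and `ToyItem`, `is_hit`, and `lookup` are hypothetical names chosen for this example.

```julia
using LinearAlgebra: norm, dot

# Hypothetical stand-in for the BERT embedder: normalized counts of letters a-z.
function toy_embed(text::AbstractString)
    counts = zeros(Float64, 26)
    for c in lowercase(text)
        'a' <= c <= 'z' && (counts[c - 'a' + 1] += 1)
    end
    len = norm(counts)
    return len == 0 ? counts : counts ./ len
end

# Hypothetical item type; it only mimics the idea of an item with an `output` field.
mutable struct ToyItem
    key::String
    embedding::Vector{Float64}
    output::Union{Nothing,String}
end

is_hit(item::ToyItem) = item.output !== nothing

function lookup(items::Vector{ToyItem}, key::AbstractString, input::AbstractString;
        min_similarity::Float64 = 0.95)
    emb = toy_embed(input)
    best_item, best_sim = nothing, -Inf
    for item in items
        item.key == key || continue      # the cache key must match exactly
        sim = dot(emb, item.embedding)   # cosine similarity of unit vectors
        if sim > best_sim
            best_item, best_sim = item, sim
        end
    end
    if best_item !== nothing && best_sim >= min_similarity
        return best_item                 # hit: reuse the stored output
    end
    return ToyItem(key, emb, nothing)    # miss: item with an empty output
end

items = ToyItem[]
item = lookup(items, "key1", "say hi!")
if !is_hit(item)
    item.output = "expensive result X"
    push!(items, item)
end

hit = lookup(items, "key1", "say hi")                     # same letters as "say hi!"
miss_key = lookup(items, "key2", "say hi!")               # different cache key
miss_text = lookup(items, "key1", "what is the weather?") # dissimilar text
```

The key check runs before any similarity comparison, so an item stored under one key is never returned for another key, however similar the text.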

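Suitable Use Cases

This package is great if you know you will have a small volume of requests (eg, fewer than 10k per session or machine).
It is ideal for reducing the cost of running evals, because even when you change your RAG pipeline configuration, many of the calls are repeated and can take advantage of the cache.
Lastly, this package can be really useful for demos and small user applications, where you know some of the system inputs upfront and can cache them to show incredible response times.
This package is NOT suitable for production systems with hundreds of thousands of requests. Also note that this is a very basic cache that you need to invalidate manually over time.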

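Advanced Usage

Caching HTTP Requests

Based on your knowledge of the API calls being made, you need to determine: 1) the cache key (a separate store of cached items, eg, for different models or temperatures), and 2) how to unpack the HTTP request into a string (eg, unwrap and join the formatted message contents for the OpenAI API).

Here is a brief outline of how you can use SemanticCaches.jl with PromptingTools.jl.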

using PromptingTools
using SemanticCaches
using HTTP

## Define the new caching mechanism as a layer for HTTP
## See documentation [here](https://juliaweb.github.io/HTTP.jl/stable/client/#Quick-Examples)
module MyCache

using HTTP, JSON3
using SemanticCaches

const SEM_CACHE = SemanticCache()
const HASH_CACHE = HashCache()

function cache_layer(handler)
    return function (req; cache_key::Union{AbstractString,Nothing} = nothing, kw...)
        # only apply the cache layer if the user passed `cache_key`
        # we could also use the contents of the payload, eg, `cache_key = get(body, "model", "unknown")`
        if req.method == "POST" && cache_key !== nothing
            body = JSON3.read(copy(req.body))
            if occursin("v1/chat/completions", req.target)
                ## We're in chat completion endpoint
                input = join([m["content"] for m in body["messages"]], " ")
            elseif occursin("v1/embeddings", req.target)
                ## We're in embedding endpoint
                input = body["input"]
            else
                ## Skip, unknown API
                return handler(req; kw...)
            end
            ## Check the cache
            @info "Check if we can cache this request ($(length(input)) chars)"
            active_cache = length(input) > 5000 ? HASH_CACHE : SEM_CACHE
            # use the caller's `cache_key` so that items stored under different keys stay separate
            item = active_cache(cache_key, input; verbose = 2) # change verbosity to 0 to disable detailed logs
            if !isvalid(item)
                @info "Cache miss! Pinging the API"
                # pass the request along to the next layer by calling `cache_layer` arg `handler`
                resp = handler(req; kw...)
                item.output = resp
                # Let's remember it for the next time
                push!(active_cache, item)
            end
            ## Return the calculated or cached result
            return item.output
        end
        # pass the request along to the next layer by calling `cache_layer` arg `handler`
        # also pass along the trailing keyword args `kw...`
        return handler(req; kw...)
    end
end

# Create a new client with the cache layer added
HTTP.@client [cache_layer]

end # module


# Let's push the layer globally in all HTTP.jl requests
HTTP.pushlayer!(MyCache.cache_layer)
# HTTP.poplayer!() # to remove it later

# Let's call the API
@time msg = aigenerate("What is the meaning of life?"; http_kwargs = (; cache_key = "key1"))

# The first call will be slow as usual, but any subsequent call should be pretty quick - try it a few times!

You can also use it for embeddings:

@time msg = aiembed("how is it going?"; http_kwargs = (; cache_key = "key2")) # 0.7s
@time msg = aiembed("how is it going?"; http_kwargs = (; cache_key = "key2")) # 0.02s

# Even with a tiny difference (no question mark), it still picks the right cache
@time msg = aiembed("how is it going"; http_kwargs = (; cache_key = "key2")) # 0.02s

You can remove the cache layer by calling HTTP.poplayer!() (and add it again if you have made changes).

You can inspect the cache through MyCache.SEM_CACHE (eg, MyCache.SEM_CACHE.items[1]).

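Frequently Asked Questions

How is the performance?

The majority of the time is spent in 1) computing the tiny-model embeddings (for large texts, eg, thousands of tokens) and 2) calculating cosine similarity (for large caches, eg, over 10k items).

For reference, embedding a short text such as a question takes only a few milliseconds. Embedding 2,000 tokens can take 50 to 100 ms.

As for the caching system itself, there are many locks to avoid faults, but the overhead is still negligible. An experiment with 100k sequential insertions took only a few milliseconds per item (dominated by the cosine similarity). If your bottleneck is the cosine similarity calculation (c. 4 ms for 100k items), consider moving the vectors into a matrix for contiguous memory and/or using Boolean embeddings with Hamming distance (XOR operator, c. an order of magnitude speed-up).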
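The two optimizations mentioned above can be sketched as follows. This is a self-contained illustration on random data, not code from the package: the sizes, the sign-based binarization, and the helper name `hamming` are assumptions made for the example, and no timing claims are implied.

```julia
using LinearAlgebra: normalize

# Toy data: unit-length random embeddings (the sizes are arbitrary).
n, d = 1_000, 64
vectors = [normalize(randn(d)) for _ in 1:n]
query = vectors[42]                 # reuse a stored vector as the query

# Option 1: one contiguous d x n matrix and a single matrix-vector product.
E = reduce(hcat, vectors)           # each column is one embedding
sims = E' * query                   # cosine similarities, since all vectors are unit length
best_dense = argmax(sims)

# Option 2: Boolean embeddings (here: the sign of each coordinate) compared by
# Hamming distance, ie, the number of set bits after an element-wise XOR.
bits = [v .> 0 for v in vectors]    # each element is a BitVector
qbits = query .> 0
hamming(a::BitVector, b::BitVector) = count(xor.(a, b))
best_binary = argmin([hamming(qbits, b) for b in bits])
```

Both searches should recover index 42 here because the query is an exact copy of a stored vector. Binarized embeddings discard information, so for near matches it is worth checking how the results compare with the dense similarities on your own data.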

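All in all, the system is faster than necessary for normal workloads with thousands of cached items. If your payloads are big, you are more likely to run into GC and memory problems (consider swapping to disk) than into compute limits. Remember that the motivation is to avoid API calls that take anywhere between 1 and 20 seconds.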

How to measure the time it takes to do X?

Have a look at the example snippets below and time whichever part you are interested in.

sem_cache = SemanticCache()
# First argument: the key must always match exactly, eg, model, temperature, etc
# Second argument: the input text to be compared with the cache, can be fuzzy matched
item = sem_cache("key1", "say hi!"; verbose = 1) # the verbose flag can be 0, 1, or 2 for different levels of detail
if !isvalid(item)
    @info "cache miss!"
    item.output = "expensive result X"
    # Save the result to the cache for future reference
    push!(sem_cache, item)
end

Embedding only (to tune the min_similarity threshold or to time the embedding)

using SemanticCaches.FlashRank: embed
using SemanticCaches: EMBEDDER

@time res = embed(EMBEDDER, "say hi")
#   0.000903 seconds (104 allocations: 19.273 KiB)
# see res.elapsed or res.embeddings

# long inputs (split into several chunks and then combining the embeddings)
@time embed(EMBEDDER, "say hi "^1000)
#   0.032148 seconds (8.11 k allocations: 662.656 KiB)

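How to set the min_similarity threshold?

You can set the min_similarity threshold by passing it as a keyword argument, eg, active_cache("key1", input; verbose=2, min_similarity=0.95).

The default is 0.95, which is a very high threshold. For practical purposes, ~0.9 is recommended. If you expect typos, you can go a bit lower still (eg, 0.85).

Warning

Be careful with similarity thresholds. It is hard to embed very short sequences well! You may want to adjust the threshold depending on the length of the input. Always test thresholds with your own inputs.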
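One way to adjust the threshold by input length is a small helper like the one below. It is only a sketch: `pick_min_similarity` is a hypothetical function, and both the cut-offs and the choice to be stricter for short inputs are placeholder assumptions that should be validated on your own data.

```julia
# Hypothetical helper (not part of SemanticCaches.jl): choose a threshold from
# the input length. All numbers are placeholders, not recommendations.
function pick_min_similarity(input::AbstractString)
    n = length(input)
    n < 20 && return 0.98    # very short inputs: require a near-exact match
    n < 200 && return 0.95
    return 0.90              # longer inputs: tolerate small differences
end

short_threshold = pick_min_similarity("hi")
long_threshold = pick_min_similarity("a"^500)

# Intended use with a cache from the examples above:
# item = active_cache("key1", input; min_similarity = pick_min_similarity(input))
```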

If you want to calculate the cosine similarity yourself, remember to normalize the embeddings first or to divide the dot product by the norms.

using SemanticCaches.LinearAlgebra: normalize, norm, dot
cosine_similarity = dot(r1.embeddings, r2.embeddings) / (norm(r1.embeddings) * norm(r2.embeddings))
# remember that 1 is the best similarity, -1 is the exact opposite
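The two approaches give the same number, which a small stand-alone check with made-up vectors can confirm. This sketch uses only Julia's LinearAlgebra standard library; the vectors `a` and `b` are arbitrary illustration values, not real embeddings.

```julia
using LinearAlgebra: normalize, norm, dot

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.5]

# Variant 1: divide the dot product by the norms.
sim_divided = dot(a, b) / (norm(a) * norm(b))

# Variant 2: normalize first, then take a plain dot product.
sim_normalized = dot(normalize(a), normalize(b))

# A vector compared with its negation gives the lowest possible similarity.
sim_opposite = dot(normalize(a), normalize(-a))
```

Normalizing once at insertion time means each later comparison is just a dot product, which is why the cache stores normalized embeddings.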

You can compare different inputs to determine the best threshold for your use case:

emb1 = embed(EMBEDDER, "How is it going?") |> x -> vec(x.embeddings) |> normalize
emb2 = embed(EMBEDDER, "How is it goin'?") |> x -> vec(x.embeddings) |> normalize
dot(emb1, emb2) # 0.944

emb1 = embed(EMBEDDER, "How is it going?") |> x -> vec(x.embeddings) |> normalize
emb2 = embed(EMBEDDER, "How is it goin'") |> x -> vec(x.embeddings) |> normalize
dot(emb1, emb2) # 0.920