SemanticCaches.jl

SemanticCaches.jl is a very hacky implementation of a semantic cache for AI applications, meant to save time and money on repeated requests. It is not particularly fast, because the goal is to avoid API calls that can take up to 20 seconds.

Note that we use a tiny BERT model with a maximum chunk size of 512 tokens so that local embeddings run quickly on a CPU. Longer texts are split into several chunks and represented by the average of the chunk embeddings, but use this carefully! The latency can skyrocket and end up worse than simply calling the original API.
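
The chunk-and-average idea described above can be sketched as follows. This is a minimal illustration, not the package's implementation: `toy_embed` is a hypothetical stand-in for the BERT model, and "tokens" here are just whitespace-separated words rather than real model tokens.

```julia
using LinearAlgebra: normalize, norm
using Statistics: mean

# Hypothetical stand-in for an embedding model (assumption, not the real BERT):
# each token is hashed into one of `dim` buckets and the bucket counts are returned.
function toy_embed(tokens::AbstractVector{<:AbstractString}; dim::Int = 8)
    v = zeros(Float64, dim)
    for t in tokens
        v[Int(hash(t) % UInt(dim)) + 1] += 1.0
    end
    return v
end

# Split the tokens into chunks of at most `max_chunk`, embed each chunk,
# average the chunk embeddings, and normalize the result to unit length.
# Assumes `tokens` is non-empty.
function embed_long(tokens::AbstractVector{<:AbstractString}; max_chunk::Int = 512, dim::Int = 8)
    chunks = Iterators.partition(tokens, max_chunk)
    embeddings = [toy_embed(chunk; dim) for chunk in chunks]
    return normalize(mean(embeddings))
end

tokens = split("say hi "^1000)   # 2000 whitespace-separated tokens
emb = embed_long(tokens)         # averaged over 4 chunks (512 + 512 + 512 + 464)
```

Every chunk costs one pass through the model, which is why latency grows with input length.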

Installation

To install SemanticCaches.jl, simply add the package using the Julia package manager:

using Pkg
Pkg.activate(".")
Pkg.add("SemanticCaches")

Quick Start Guide

## This line is very important to be able to download the models!!!
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"
using SemanticCaches

sem_cache = SemanticCache()
# First argument: the key must always match exactly, eg, model, temperature, etc
# Second argument: the input text to be compared with the cache, can be fuzzy matched
item = sem_cache("key1", "say hi!"; verbose = 1) # notice the verbose flag; it can be 0, 1, or 2 for different levels of detail
if !isvalid(item)
    @info "cache miss!"
    item.output = "expensive result X"
    # Save the result to the cache for future reference
    push!(sem_cache, item)
end

# In practice, long texts may take too long to embed even with our tiny model
# so let's not compare anything above 2000 tokens =~ 5000 characters (threshold of c. 100ms)

hash_cache = HashCache()
input = "say hi"
input = "say hi "^1000

active_cache = length(input) > 5000 ? hash_cache : sem_cache
item = active_cache("key1", input; verbose = 1)

if !isvalid(item)
    @info "cache miss!"
    item.output = "expensive result X"
    push!(active_cache, item)
end

How it Works

The primary objective of this package is to cache expensive API calls to GenAI models.

The system offers exact matching (faster, HashCache) and semantic similarity lookup (slower, SemanticCache) of STRING inputs. In addition, all requests are first compared on a "cache key", which must always match exactly for requests to be considered interchangeable (eg, same model, same provider, same temperature, etc.). You need to choose the appropriate cache key and input depending on your use case. A good default choice for the cache key is the model name.
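
To make the role of the cache key concrete, here is a minimal sketch of exact matching. `ToyHashCache`, `lookup`, and `save!` are hypothetical names made up for this illustration and are not the package's HashCache API; the point is only that both the cache key and the input must match exactly.

```julia
# Toy exact-match cache (illustrative assumption, not the package's HashCache).
# Entries are stored under the pair (cache_key, input); the Dict hashes the pair,
# so a hit requires both parts to be identical.
struct ToyHashCache
    store::Dict{Tuple{String,String},String}
end
ToyHashCache() = ToyHashCache(Dict{Tuple{String,String},String}())

# Returns the cached output, or `nothing` on a cache miss.
function lookup(cache::ToyHashCache, cache_key::AbstractString, input::AbstractString)
    return get(cache.store, (String(cache_key), String(input)), nothing)
end

# Stores `output` under the (cache_key, input) pair.
function save!(cache::ToyHashCache, cache_key::AbstractString, input::AbstractString, output::AbstractString)
    cache.store[(String(cache_key), String(input))] = String(output)
    return cache
end

cache = ToyHashCache()
save!(cache, "model-a", "say hi!", "Hi there!")
hit = lookup(cache, "model-a", "say hi!")         # same key, same input
miss_key = lookup(cache, "model-b", "say hi!")    # different cache key
miss_input = lookup(cache, "model-a", "say hi")   # input differs by one character
```

With exact matching, even a missing punctuation mark is a miss; the semantic cache exists to tolerate such differences.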

What happens when you call the cache (providing cache_key and string_input)?

All cached outputs are stored in a vector cache.items.
When we receive a request, the cache_key is looked up to find the indices of the corresponding items in items. If the cache_key is not found, we return a CachedItem with an empty output field (ie, isvalid(item) == false).
We embed the string_input using a tiny BERT model and normalize the embeddings (to make it easier to compare cosine similarity later).
We then compute the cosine similarity against the embeddings of the cached items.
If the cosine similarity is higher than the min_similarity threshold, we return the cached item (the output can be found in the field item.output).

If we do not find any cached item, we return a CachedItem with an empty output field (ie, isvalid(item) == false). Once you have computed the response and saved it in item.output, you can push the item to the cache by calling push!(cache, item).
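
The lookup flow described above can be sketched in plain Julia. Everything here is an assumption made for illustration: `ToyItem`, `ToyCache`, `lookup`, and `is_hit` are hypothetical names, and `toy_embed` (hashed character bigrams) stands in for the BERT model. It is not the package's code.

```julia
using LinearAlgebra: normalize, dot

# Toy types for illustration only (not the package's definitions).
mutable struct ToyItem
    key::String
    input::String
    embedding::Vector{Float64}
    output::Union{Nothing,String}
end

struct ToyCache
    items::Vector{ToyItem}
end
ToyCache() = ToyCache(ToyItem[])

# Hypothetical stand-in embedder: counts of hashed character bigrams, unit length.
function toy_embed(text::AbstractString; dim::Int = 64)
    v = zeros(Float64, dim)
    chars = collect(lowercase(text))
    for i in 1:(length(chars) - 1)
        v[Int(hash((chars[i], chars[i + 1])) % UInt(dim)) + 1] += 1.0
    end
    return iszero(v) ? v : normalize(v)
end

# 1) keep only items with an exactly matching key, 2) embed the input,
# 3) score by dot product (cosine similarity, since embeddings are unit length),
# 4) return the best item if it clears `min_similarity`, else an empty item.
function lookup(cache::ToyCache, key::AbstractString, input::AbstractString; min_similarity = 0.95)
    emb = toy_embed(input)
    best_score, best_item = -Inf, nothing
    for item in cache.items
        item.key == key || continue
        score = dot(emb, item.embedding)
        if score > best_score
            best_score, best_item = score, item
        end
    end
    if best_item !== nothing && best_score >= min_similarity
        return best_item
    end
    return ToyItem(String(key), String(input), emb, nothing)
end

# An item counts as a hit only once it carries an output.
is_hit(item::ToyItem) = item.output !== nothing
```

On a miss, the caller fills in `item.output` and appends the item to `cache.items`, mirroring the `push!(cache, item)` step described above.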

Suitable Use Cases

This package is great if you know you will have a smaller volume of requests (eg, <10,000 per session or machine).
It is ideal for reducing the cost of running your evals, because even when you change your RAG pipeline configuration, many of the calls will be repeated and can take advantage of caching.
Lastly, this package can be really useful for demos and small user applications, where you know some of the system inputs upfront, so you can cache them and show incredible response times!
This package is NOT suitable for production systems with hundreds of thousands of requests, and remember that this is a very basic cache that you need to invalidate manually over time!
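
Manual invalidation could look roughly like the sketch below, which drops entries older than a chosen age. It uses a hypothetical `TimedEntry` type with a `created_at` timestamp; whether the package's cached items carry such a field is an assumption, so treat this as the general idea rather than a recipe for `cache.items`.

```julia
using Dates

# Hypothetical cache entry with a creation timestamp (assumption for illustration).
struct TimedEntry
    key::String
    output::String
    created_at::DateTime
end

# Remove every entry older than `max_age` relative to `now_time`.
# Returns the number of removed entries.
function invalidate!(entries::Vector{TimedEntry}; max_age::Period = Hour(24), now_time::DateTime = now())
    before = length(entries)
    filter!(e -> now_time - e.created_at <= max_age, entries)
    return before - length(entries)
end

entries = [
    TimedEntry("key1", "old result", DateTime(2023, 12, 31, 23)),  # 25 hours old
    TimedEntry("key1", "new result", DateTime(2024, 1, 1, 12)),    # 12 hours old
]
removed = invalidate!(entries; max_age = Hour(24), now_time = DateTime(2024, 1, 2))
```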

Advanced Usage

Caching HTTP Requests

Based on your knowledge of the API calls being made, you need to determine: 1) the cache key (a separate store of cached items, eg, for different models or temperatures) and 2) how to unpack the HTTP request into a string (eg, unwrap and join the formatted message contents for the OpenAI API).

Here is a brief outline of how you can use SemanticCaches.jl with PromptingTools.jl.

using PromptingTools
using SemanticCaches
using HTTP

## Define the new caching mechanism as a layer for HTTP
## See documentation [here](https://juliaweb.github.io/HTTP.jl/stable/client/#Quick-Examples)
module MyCache

using HTTP, JSON3
using SemanticCaches

const SEM_CACHE = SemanticCache()
const HASH_CACHE = HashCache()

function cache_layer(handler)
    return function (req; cache_key::Union{AbstractString,Nothing} = nothing, kw...)
        # only apply the cache layer if the user passed `cache_key`
        # we could also use the contents of the payload, eg, `cache_key = get(body, "model", "unknown")`
        if req.method == "POST" && cache_key !== nothing
            body = JSON3.read(copy(req.body))
            if occursin("v1/chat/completions", req.target)
                ## We're in chat completion endpoint
                input = join([m["content"] for m in body["messages"]], " ")
            elseif occursin("v1/embeddings", req.target)
                ## We're in embedding endpoint
                input = body["input"]
            else
                ## Skip, unknown API
                return handler(req; kw...)
            end
            ## Check the cache
            @info "Check if we can cache this request ($(length(input)) chars)"
            active_cache = length(input) > 5000 ? HASH_CACHE : SEM_CACHE
            # use the caller's `cache_key` so that items stored under different keys stay separate
            item = active_cache(cache_key, input; verbose = 2) # change verbosity to 0 to disable detailed logs
            if !isvalid(item)
                @info "Cache miss! Pinging the API"
                # pass the request along to the next layer by calling `cache_layer` arg `handler`
                resp = handler(req; kw...)
                item.output = resp
                # Let's remember it for the next time
                push!(active_cache, item)
            end
            ## Return the calculated or cached result
            return item.output
        end
        # pass the request along to the next layer by calling `cache_layer` arg `handler`
        # also pass along the trailing keyword args `kw...`
        return handler(req; kw...)
    end
end

# Create a new client with the cache layer added
HTTP.@client [cache_layer]

end # module


# Let's push the layer globally in all HTTP.jl requests
HTTP.pushlayer!(MyCache.cache_layer)
# HTTP.poplayer!() # to remove it later

# Let's call the API
@time msg = aigenerate("What is the meaning of life?"; http_kwargs = (; cache_key = "key1"))

# The first call will be slow as usual, but any subsequent call should be pretty quick - try it a few times!

You can also use it for embeddings, eg,

@time msg = aiembed("how is it going?"; http_kwargs = (; cache_key = "key2")) # 0.7s
@time msg = aiembed("how is it going?"; http_kwargs = (; cache_key = "key2")) # 0.02s

# Even with a tiny difference (no question mark), it still picks the right cache
@time msg = aiembed("how is it going"; http_kwargs = (; cache_key = "key2")) # 0.02s

You can remove the cache layer by calling HTTP.poplayer!() (and add it again if you have made changes).

You can inspect the cache by calling MyCache.SEM_CACHE (eg, MyCache.SEM_CACHE.items[1]).

Frequently Asked Questions

How is the performance?

The majority of the time is spent in 1) computing the embeddings (for large texts, eg, thousands of tokens) and 2) calculating cosine similarity (for large caches, eg, over 10,000 items).

For reference, embedding short texts such as questions takes only a few milliseconds. Embedding 2,000 tokens can take anywhere from 50 to 100 ms.

When it comes to the caching system, there are many locks to avoid faults, but the overhead is still negligible - I ran experiments with 100,000 sequential insertions and the time per item was only a few milliseconds (dominated by the cosine similarity). If your bottleneck is the cosine similarity calculation (c. 4 ms for 100,000 items), consider moving the vectors into a matrix for contiguous memory and/or using Boolean embeddings with Hamming distance (XOR operator, c. an order of magnitude speed-up).
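
The two optimizations mentioned above can be sketched as follows. The vectors are made-up four-dimensional examples rather than real embeddings, and `cosine_scores`, `binarize`, and `hamming` are hypothetical helper names; the speed-up figures quoted above are not reproduced by this toy example.

```julia
using LinearAlgebra: normalize

# Unit-length embeddings stored as the columns of one matrix (contiguous memory),
# so all cosine similarities come from a single matrix-vector product.
cosine_scores(embeddings::Matrix{Float64}, query::Vector{Float64}) = embeddings' * query

# Boolean embedding: keep only the sign of each dimension.
binarize(v::AbstractVector{<:Real}) = v .> 0

# Hamming distance: XOR the bit vectors and count the differing positions.
hamming(a::BitVector, b::BitVector) = count(xor.(a, b))

vecs = [
    normalize([1.0, 0.2, -0.3, 0.5]),
    normalize([-0.8, 0.1, 0.4, -0.6]),
    normalize([0.9, 0.3, -0.2, 0.4]),
]
E = reduce(hcat, vecs)                        # 4x3 matrix, one column per cached item
query = normalize([1.0, 0.25, -0.25, 0.5])

scores = cosine_scores(E, query)              # one similarity per cached item
best = argmax(scores)                         # index of the most similar item

qbits = binarize(query)
dists = [hamming(qbits, binarize(v)) for v in vecs]  # smaller means more similar
```

Binarizing discards magnitude information, so items 1 and 3 become indistinguishable here even though their cosine scores differ; that loss of precision is the trade-off for the cheaper comparison.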

All in all, the system is faster than necessary for normal workloads with thousands of cached items. If your payloads are large, you are more likely to run into GC and memory problems (consider swapping to disk) than to hit compute bounds. Remember that the motivation is to prevent API calls that take anywhere between 1 and 20 seconds!

How to measure the time it takes to do X?

Have a look at the example snippets below and time whichever part you are interested in.

sem_cache = SemanticCache()
# First argument: the key must always match exactly, eg, model, temperature, etc
# Second argument: the input text to be compared with the cache, can be fuzzy matched
item = sem_cache("key1", "say hi!"; verbose = 1) # notice the verbose flag; it can be 0, 1, or 2 for different levels of detail
if !isvalid(item)
    @info "cache miss!"
    item.output = "expensive result X"
    # Save the result to the cache for future reference
    push!(sem_cache, item)
end

Embedding only (to tune the min_similarity threshold or to time the embedding)

using SemanticCaches.FlashRank: embed
using SemanticCaches: EMBEDDER

@time res = embed(EMBEDDER, "say hi")
#   0.000903 seconds (104 allocations: 19.273 KiB)
# see res.elapsed or res.embeddings

# long inputs (split into several chunks and then combining the embeddings)
@time embed(EMBEDDER, "say hi "^1000)
#   0.032148 seconds (8.11 k allocations: 662.656 KiB)

How to set the min_similarity threshold?

You can set the min_similarity threshold by adding the kwarg, eg, active_cache("key1", input; verbose=2, min_similarity=0.95).

The default is 0.95, which is a very high threshold. For practical purposes, I would recommend ~0.9. If you expect some typos, you can go even a bit lower (eg, 0.85).

Warning

Be careful with similarity thresholds. It is hard to embed very short sequences well! You might want to adjust the threshold depending on the length of the input. Always test with your own inputs!!

If you want to calculate the cosine similarity yourself, remember to normalize the embeddings first, or divide the dot product by the norms.

using SemanticCaches.LinearAlgebra: normalize, norm, dot
cosine_similarity = dot(r1.embeddings, r2.embeddings) / (norm(r1.embeddings) * norm(r2.embeddings))
# remember that 1 is the best similarity, -1 is the exact opposite
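
As a quick sanity check that the two routes agree, here is a small example with made-up vectors (not real embeddings). Normalizing first and dividing by the norms give the same cosine similarity, while the raw dot product does not.

```julia
using LinearAlgebra: normalize, norm, dot

a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

raw = dot(a, b)                                    # 24.0, not a similarity
sim_divide = dot(a, b) / (norm(a) * norm(b))       # divide by the norms
sim_normalized = dot(normalize(a), normalize(b))   # normalize first
opposite = dot(normalize(a), normalize(-a))        # exact opposite direction
```

Both `sim_divide` and `sim_normalized` come out at 0.96 (up to floating-point rounding), and `opposite` at -1.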

You can compare different inputs to determine the best threshold for your use case.

emb1 = embed(EMBEDDER, "How is it going?") |> x -> vec(x.embeddings) |> normalize
emb2 = embed(EMBEDDER, "How is it goin'?") |> x -> vec(x.embeddings) |> normalize
dot(emb1, emb2) # 0.944

emb1 = embed(EMBEDDER, "How is it going?") |> x -> vec(x.embeddings) |> normalize
emb2 = embed(EMBEDDER, "How is it goin'") |> x -> vec(x.embeddings) |> normalize
dot(emb1, emb2) # 0.920