thumb下載 - thumb源碼下載

大拇指

一個簡單的法學碩士即時測試庫。

快速啟動

1.安裝庫

pip install thumb

2. 設定測試

 import os
import thumb

# Set your API key: https://platform.openai.com/account/api-keys
os . environ [ "OPENAI_API_KEY" ] = "YOUR_API_KEY_HERE"

# set up a prompt templates for the a/b test
prompt_a = "tell me a joke"
prompt_b = "tell me a family friendly joke"

# generate the responses
test = thumb . test ([ prompt_a , prompt_b ])

3. 對答案進行評分

預設情況下，每個提示非同步運行 10 次，這比順序運行快大約 9 倍。在 Jupyter Notebooks 中，會顯示一個簡單的使用者介面，用於盲評回應（您看不到哪個提示產生了回應）。

對所有回應進行評級後，將按提示範本細分計算以下效能統計資料：

avg_score正回饋量佔所有運行的百分比
avg_tokens ：提示和回應中使用了多少個令牌
avg_cost ：平均運行提示成本的估計

筆記本中會顯示一個簡單的報告，完整的資料將保存到 CSV 檔案thumb/ThumbTest-{TestID}.csv中。

功能性

測試用例

測試案例是指當您想要使用不同的輸入變數測試提示範本時。例如，如果您想要測試包含喜劇演員姓名變數的提示模板，您可以為不同的喜劇演員設定測試案例。

 # set up a prompt templates for the a/b test
prompt_a = "tell me a joke in the style of {comedian}"
prompt_b = "tell me a family friendly joke in the style of {comedian}"

# set test cases with different input variables
cases = [
  { "comedian" : "chris rock" }, 
  { "comedian" : "ricky gervais" }, 
  { "comedian" : "robin williams" }
  ]

# generate the responses
test = thumb . test ([ prompt_a , prompt_b ], cases )

每個測試案例都將針對每個提示模板運行，因此在本範例中，您將獲得6 個組合（3 個測試案例x 2 個提示模板），每個組合將運行10 次（總共60 次對OpenAI的調用）。每個測試案例必須包含提示範本中每個變數的值。

每個測試案例中的提示可能有多個變數。例如，如果您想要測試包含喜劇演員姓名變數和笑話主題的提示模板，您可以為不同的喜劇演員和主題設定測試案例。

 # set up a prompt templates for the a/b test
prompt_a = "tell me a joke about {subject} in the style of {comedian}"
prompt_b = "tell me a family friendly joke about {subject} in the style of {comedian}"

# set test cases with different input variables
cases = [
  { "subject" : "joe biden" , "comedian" : "chris rock" }, 
  { "subject" : "joe biden" , "comedian" : "ricky gervais" }, 
  { "subject" : "donald trump" , "comedian" : "chris rock" }, 
  { "subject" : "donald trump" , "comedian" : "ricky gervais" }, 
  ]

# generate the responses
test = thumb . test ([ prompt_a , prompt_b ], cases )

每個案例都會針對每個提示進行測試，以便在給定相同輸入資料的情況下對每個提示的效能進行公平比較。透過 4 個測試案例和 2 個提示，您將獲得 8 種組合（4 個測試案例 x 2 個提示模板），每個組合將運行 10 次（總共 80 次對 OpenAI 的呼叫）。

模型測試

 # set up a prompt templates for the a/b test
prompt_a = "tell me a joke"
prompt_b = "tell me a family friendly joke"

# generate the responses
test = thumb . test ([ prompt_a , prompt_b ], models = [ "gpt-4" , "gpt-3.5-turbo" ])

這將針對每個模型運行每個提示，以便在給定相同輸入資料的情況下對每個提示的效能進行公平比較。透過 2 個提示和 2 個模型，您將獲得 4 種組合（2 個提示 x 2 個模型），每個組合將運行 10 次（總共 40 次呼叫 OpenAI）。

系統訊息

 # set up a prompt templates for the a/b test
system_message = "You are the comedian {comedian}"

prompt_a = [ system_message , "tell me a funny joke about {subject}" ]
prompt_b = [ system_message , "tell me a hillarious joke {subject}" ]

cases = [{ "subject" : "joe biden" , "comedian" : "chris rock" }, 
         { "subject" : "donald trump" , "comedian" : "chris rock" }]

# generate the responses
test = thumb . test ([ prompt_a , prompt_b ], cases )

提示可以是字串或字串陣列。如果提示是數組，則第一個字串用作系統訊息，其餘提示在人類訊息和助理訊息之間交替（ [system, human, ai, human, ai, ...] ）。這對於測試包含系統訊息或使用預熱的提示（將先前的訊息插入聊天中以引導 AI 實現所需的行為）非常有用。

 # set up a prompt templates for the a/b test
system_message = "You are the comedian {comedian}"

prompt_a = [ system_message , # system
            "tell me a funny joke about {subject}" , # human
            "Sorry, as an AI language model, I am not capable of humor" , # assistant
            "That's fine just try your best" ] # human
prompt_b = [ system_message , # system
            "tell me a hillarious joke about {subject}" , # human
            "Sorry, as an AI language model, I am not capable of humor" , # assistant
            "That's fine just try your best" ] # human

cases = [{ "subject" : "joe biden" , "comedian" : "chris rock" }, 
         { "subject" : "donald trump" , "comedian" : "chris rock" }]

# generate the responses
test = thumb . test ([ prompt_a , prompt_b ], cases )

評估報告

測試完成後，您將獲得一份完整的評估報告，按 PID、CID 和型號細分，以及按所有組合細分的總體報告。如果您只測試一種模型或一種情況，這些故障將被丟棄。報告底部顯示一個按鍵，可查看哪個 ID 對應於哪個提示或案例。

參數

thumb.test函數採用以下參數：

必需的

提示：要測試的提示（字串）數組

選修的

case ：輸入到每個提示範本的變數字典（預設值： None ）
running ：每個提示和測試案例產生的回應數（預設值： 10 ）
models ：要從中產生回應的 OpenAI 模型清單（預設值：[ gpt-3.5-turbo ]）
async_generate ：一個布林值，表示是非同步運行還是順序運行（預設值： True ）

如果您使用 2 個提示範本和 3 個測試案例執行 10 次測試，則需要10 x 2 x 3 = 60呼叫 OpenAI。請注意：特別是使用 GPT-4 時，成本會迅速增加！

如果LANGCHAIN_API_KEY設定為環境變數（可選），則會自動啟用對 LangSmith 的 Langchain 追蹤。

載入和新增

.test()函數傳回一個ThumbTest物件。您可以為測試添加更多提示或案例，或多次執行測試。您還可以隨時產生、評估和匯出測試數據。

 # set up a prompt templates for the a/b test
prompt_a = "tell me a joke"
prompt_b = "tell me a family friendly joke"

# generate the responses
test = thumb . test ([ prompt_a , prompt_b ])

# add more prompts
test . add_prompts ([ "tell me a knock knock joke" , "tell me a knock knock joke about {subject}" ])

# add more cases
test . add_cases ([{ "subject" : "joe biden" }, { "subject" : "donald trump" }])

# run each prompt and case 5 more times
test . add_runs ( 5 )

# generate the responses
test . generate ()

# rate the responses
test . evaluate ()

# export the test data for analysis
test . export_to_csv ()

每個提示模板從每個測試案例獲取相同的輸入數據，但提示不需要使用測試案例中的所有變數。如上例所示， tell me a knock knock joke提示不使用subject變量，但它仍然為每個測試案例生成一次（沒有變量）。

在為提示和大小寫組合產生每組運行後，測試資料緩存在本機 JSON 檔案thumb/.cache/{TestID}.json中。如果你的測試被中斷，或者你想新增測試，你可以使用thumb.load函數從快取載入測試資料。

 # load a previous test
test_id = "abcd1234" # replace with your test id
test = thumb . load ( f"thumb/.cache/ { test_id } .json" )

# run each prompt and case 2 more times
test . add_runs ( 2 )

# generate the responses
test . generate ()

# rate the responses
test . evaluate ()

# export the test data for analysis
test . export_to_csv ()