維基搜尋
1.0.0
Streamlit 應用程序,用於對超過 1000 萬個由 Weaviate 嵌入向量化的維基百科文件進行多語言語義搜尋。該實作基於 Cohere 的部落格「使用 LLM 進行搜尋」及其相應的筆記本。它可以比較關鍵字搜尋、密集檢索和混合搜尋的效能來查詢維基百科資料集。它進一步演示如何使用 Cohere Rerank 來提高結果的準確性,並使用 Cohere Generation 根據所述排名結果提供回應。
語義搜尋是指在產生結果時考慮搜尋短語的意圖和上下文含義的搜尋演算法,而不是僅僅關注關鍵字匹配。它透過理解查詢背後的語義或含義來提供更準確和相關的結果。
嵌入是表示單字、句子、文件、圖像或音訊等資料的浮點數向量(列表)。所述數字表示捕獲資料的上下文、層次結構和相似性。它們可用於下游任務,例如分類、聚類、異常值檢測和語義搜尋。
向量資料庫(例如 Weaviate)是專門為優化嵌入的儲存和查詢功能而建構的。在實踐中,向量資料庫使用不同演算法的組合,這些演算法都參與近似最近鄰 (ANN) 搜尋。這些演算法透過雜湊、量化或基於圖的搜尋來優化搜尋。
關鍵字匹配:它會尋找屬性中包含搜尋字詞的物件。根據BM25F函數對結果進行評分:
@ retry ( wait = wait_random_exponential ( min = 1 , max = 5 ), stop = stop_after_attempt ( 5 ))
def with_bm25 ( self , query , lang = 'en' , top_n = 10 ) -> list :
"""
Performs a keyword search (sparse retrieval) on Wikipedia Articles using embeddings stored in Weaviate.
Parameters:
- query (str): The search query.
- lang (str, optional): The language of the articles. Default is 'en'.
- top_n (int, optional): The number of top results to return. Default is 10.
Returns:
- list: List of top articles based on BM25F scoring.
"""
logging . info ( "with_bm25()" )
where_filter = {
"path" : [ "lang" ],
"operator" : "Equal" ,
"valueString" : lang
}
response = (
self . weaviate . query . get ( "Articles" , self . WIKIPEDIA_PROPERTIES )
. with_bm25 ( query = query )
. with_where ( where_filter )
. with_limit ( top_n )
. do ()
)
return response [ "data" ][ "Get" ][ "Articles" ]
密集檢索:尋找與原始(非向量化)文字最相似的物件:
@ retry ( wait = wait_random_exponential ( min = 1 , max = 5 ), stop = stop_after_attempt ( 5 ))
def with_neartext ( self , query , lang = 'en' , top_n = 10 ) -> list :
"""
Performs a semantic search (dense retrieval) on Wikipedia Articles using embeddings stored in Weaviate.
Parameters:
- query (str): The search query.
- lang (str, optional): The language of the articles. Default is 'en'.
- top_n (int, optional): The number of top results to return. Default is 10.
Returns:
- list: List of top articles based on semantic similarity.
"""
logging . info ( "with_neartext()" )
nearText = {
"concepts" : [ query ]
}
where_filter = {
"path" : [ "lang" ],
"operator" : "Equal" ,
"valueString" : lang
}
response = (
self . weaviate . query . get ( "Articles" , self . WIKIPEDIA_PROPERTIES )
. with_near_text ( nearText )
. with_where ( where_filter )
. with_limit ( top_n )
. do ()
)
return response [ 'data' ][ 'Get' ][ 'Articles' ]
混合搜尋:根據關鍵字 (bm25) 搜尋和向量搜尋結果的加權組合產生結果。
@ retry ( wait = wait_random_exponential ( min = 1 , max = 5 ), stop = stop_after_attempt ( 5 ))
def with_hybrid ( self , query , lang = 'en' , top_n = 10 ) -> list :
"""
Performs a hybrid search on Wikipedia Articles using embeddings stored in Weaviate.
Parameters:
- query (str): The search query.
- lang (str, optional): The language of the articles. Default is 'en'.
- top_n (int, optional): The number of top results to return. Default is 10.
Returns:
- list: List of top articles based on hybrid scoring.
"""
logging . info ( "with_hybrid()" )
where_filter = {
"path" : [ "lang" ],
"operator" : "Equal" ,
"valueString" : lang
}
response = (
self . weaviate . query . get ( "Articles" , self . WIKIPEDIA_PROPERTIES )
. with_hybrid ( query = query )
. with_where ( where_filter )
. with_limit ( top_n )
. do ()
)
return response [ "data" ][ "Get" ][ "Articles" ]
@ retry ( wait = wait_random_exponential ( min = 1 , max = 5 ), stop = stop_after_attempt ( 5 ))
def rerank ( self , query , documents , top_n = 10 , model = 'rerank-english-v2.0' ) -> dict :
"""
Reranks a list of responses using Cohere's reranking API.
Parameters:
- query (str): The search query.
- documents (list): List of documents to be reranked.
- top_n (int, optional): The number of top reranked results to return. Default is 10.
- model: The model to use for reranking. Default is 'rerank-english-v2.0'.
Returns:
- dict: Reranked documents from Cohere's API.
"""
return self . cohere . rerank ( query = query , documents = documents , top_n = top_n , model = model )
來源:Cohere
@ retry ( wait = wait_random_exponential ( min = 1 , max = 5 ), stop = stop_after_attempt ( 5 ))
def with_llm ( self , context , query , temperature = 0.2 , model = "command" , lang = "english" ) -> list :
prompt = f"""
Use the information provided below to answer the questions at the end. /
Include some curious or relevant facts extracted from the context. /
Generate the answer in the language of the query. If you cannot determine the language of the query use { lang } . /
If the answer to the question is not contained in the provided information, generate "The answer is not in the context".
---
Context information:
{ context }
---
Question:
{ query }
"""
return self . cohere . generate (
prompt = prompt ,
num_generations = 1 ,
max_tokens = 1000 ,
temperature = temperature ,
model = model ,
)
[email protected]:dcarpintero/wikisearch.git
Windows:
py -m venv .venv
.venvscriptsactivate
macOS/Linux
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run ./app.py
示範 Web 應用程式部署到 Streamlit Cloud,可在 https://wikisearch.streamlit.app/ 取得