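Streamlit app for multilingual semantic search over more than 10 million Wikipedia documents vectorized into embeddings by Weaviate. The implementation is based on Cohere's blog post "Using LLMs for Search" and its corresponding notebook. It lets you compare the performance of keyword search, dense retrieval, and hybrid search when querying the Wikipedia dataset. It further demonstrates how to use Cohere Rerank to improve the accuracy of the results, and Cohere Generation to provide a response based on those ranked results.
Vector databases such as Weaviate are purpose-built to optimize the storage and querying of embeddings. In practice, a vector database uses a combination of different algorithms that all take part in Approximate Nearest Neighbor (ANN) search. These algorithms optimize the search through hashing, quantization, or graph-based search.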
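As a rough illustration of the hashing idea behind ANN search, the sketch below compares a brute-force cosine-similarity search with a lookup restricted to one hash bucket (random-hyperplane style locality-sensitive hashing). The toy 2-D vectors, the single fixed hyperplane, and all function names are assumptions made for this example; this is not how Weaviate is implemented.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def exact_top_k(query, vectors, k):
    """Brute-force search: score every stored vector, keep the k most similar."""
    ranked = sorted(vectors, key=lambda name: cosine(query, vectors[name]), reverse=True)
    return ranked[:k]

def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane, set by the sign of the dot product."""
    return tuple(
        int(sum(v * h for v, h in zip(vector, plane)) >= 0) for plane in hyperplanes
    )

def build_buckets(vectors, hyperplanes):
    """Group vector names by their hash signature."""
    buckets = {}
    for name, vec in vectors.items():
        buckets.setdefault(lsh_signature(vec, hyperplanes), []).append(name)
    return buckets

def approximate_top_k(query, vectors, buckets, hyperplanes, k):
    """Only score the vectors that share the query's bucket."""
    candidates = buckets.get(lsh_signature(query, hyperplanes), [])
    ranked = sorted(candidates, key=lambda name: cosine(query, vectors[name]), reverse=True)
    return ranked[:k]

# Toy embeddings, invented for illustration only.
vectors = {
    "cat": [0.9, 0.1],
    "kitten": [0.8, 0.2],
    "car": [0.1, 0.9],
    "truck": [0.2, 0.8],
}
hyperplanes = [[1.0, -1.0]]  # a single fixed hyperplane (the line x = y)
buckets = build_buckets(vectors, hyperplanes)
query = [1.0, 0.1]

print(exact_top_k(query, vectors, 2))
print(approximate_top_k(query, vectors, buckets, hyperplanes, 2))
```

The approximate search scores only two of the four vectors here. With real data the saving is larger, at the cost of occasionally missing a neighbor that hashed into a different bucket.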
import logging

from tenacity import retry, stop_after_attempt, wait_random_exponential

@retry(wait=wait_random_exponential(min=1, max=5), stop=stop_after_attempt(5))
def with_bm25(self, query, lang='en', top_n=10) -> list:
    """
    Performs a keyword search (sparse retrieval) on Wikipedia Articles stored in Weaviate.

    Parameters:
    - query (str): The search query.
    - lang (str, optional): The language of the articles. Default is 'en'.
    - top_n (int, optional): The number of top results to return. Default is 10.

    Returns:
    - list: List of top articles based on BM25F scoring.
    """
    logging.info("with_bm25()")

    # Restrict results to articles in the requested language.
    where_filter = {
        "path": ["lang"],
        "operator": "Equal",
        "valueString": lang
    }
    response = (
        self.weaviate.query.get("Articles", self.WIKIPEDIA_PROPERTIES)
        .with_bm25(query=query)
        .with_where(where_filter)
        .with_limit(top_n)
        .do()
    )
    return response["data"]["Get"]["Articles"]
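To give a feel for what keyword scoring does, here is a minimal sketch of plain BM25 in pure Python. It is a simplification: Weaviate's BM25F additionally weights individual properties, and the whitespace tokenizer, the `k1` and `b` defaults, and the sample documents are assumptions chosen for this example.

```python
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates and is normalized by document length.
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores

docs = [
    "the eiffel tower is in paris",
    "paris is the capital of france",
    "the great wall is in china",
]
scores = bm25_scores("eiffel tower", docs)
print(scores)
```

Only the first document contains the query terms, so it is the only one with a non-zero score. A query such as "landmark in france" would match on nothing but the literal words it shares with a document, which is the limitation dense retrieval addresses.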
@retry(wait=wait_random_exponential(min=1, max=5), stop=stop_after_attempt(5))
def with_neartext(self, query, lang='en', top_n=10) -> list:
    """
    Performs a semantic search (dense retrieval) on Wikipedia Articles using embeddings stored in Weaviate.

    Parameters:
    - query (str): The search query.
    - lang (str, optional): The language of the articles. Default is 'en'.
    - top_n (int, optional): The number of top results to return. Default is 10.

    Returns:
    - list: List of top articles based on semantic similarity.
    """
    logging.info("with_neartext()")

    # The query text is vectorized and compared against the stored embeddings.
    nearText = {
        "concepts": [query]
    }
    # Restrict results to articles in the requested language.
    where_filter = {
        "path": ["lang"],
        "operator": "Equal",
        "valueString": lang
    }
    response = (
        self.weaviate.query.get("Articles", self.WIKIPEDIA_PROPERTIES)
        .with_near_text(nearText)
        .with_where(where_filter)
        .with_limit(top_n)
        .do()
    )
    return response["data"]["Get"]["Articles"]
Hybrid search: produces results based on a weighted combination of keyword (bm25) search results and vector search results.
@retry(wait=wait_random_exponential(min=1, max=5), stop=stop_after_attempt(5))
def with_hybrid(self, query, lang='en', top_n=10) -> list:
    """
    Performs a hybrid search on Wikipedia Articles using embeddings stored in Weaviate.

    Parameters:
    - query (str): The search query.
    - lang (str, optional): The language of the articles. Default is 'en'.
    - top_n (int, optional): The number of top results to return. Default is 10.

    Returns:
    - list: List of top articles based on hybrid scoring.
    """
    logging.info("with_hybrid()")

    # Restrict results to articles in the requested language.
    where_filter = {
        "path": ["lang"],
        "operator": "Equal",
        "valueString": lang
    }
    response = (
        self.weaviate.query.get("Articles", self.WIKIPEDIA_PROPERTIES)
        .with_hybrid(query=query)
        .with_where(where_filter)
        .with_limit(top_n)
        .do()
    )
    return response["data"]["Get"]["Articles"]
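The weighted combination behind hybrid search can be sketched as follows. This is an illustrative simplification, not Weaviate's actual fusion algorithm: the min-max normalization, the function names, and the sample scores are assumptions. The `alpha` convention (1 means vector only, 0 means keyword only) follows the one used by Weaviate's hybrid search.

```python
def min_max_normalize(scores):
    """Rescale a non-empty {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (score - lo) / (hi - lo) for doc, score in scores.items()}

def hybrid_fusion(keyword_scores, vector_scores, alpha=0.5):
    """Blend keyword and vector scores; returns (doc_id, score) pairs, best first."""
    keyword = min_max_normalize(keyword_scores)
    vector = min_max_normalize(vector_scores)
    # A document missing from one result list contributes 0 for that list.
    fused = {
        doc: alpha * vector.get(doc, 0.0) + (1 - alpha) * keyword.get(doc, 0.0)
        for doc in set(keyword) | set(vector)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Invented scores for four documents, for illustration only.
keyword_scores = {"A": 2.0, "B": 1.0, "C": 0.0}
vector_scores = {"B": 0.9, "C": 0.8, "D": 0.1}

ranking = hybrid_fusion(keyword_scores, vector_scores, alpha=0.5)
print([doc for doc, _ in ranking])
```

Document B wins at `alpha=0.5` because it scores reasonably well in both lists, even though A is the best keyword match. Moving `alpha` toward 0 or 1 shifts the ranking toward the keyword or the vector results respectively.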
@retry(wait=wait_random_exponential(min=1, max=5), stop=stop_after_attempt(5))
def rerank(self, query, documents, top_n=10, model='rerank-english-v2.0') -> dict:
    """
    Reranks a list of responses using Cohere's reranking API.

    Parameters:
    - query (str): The search query.
    - documents (list): List of documents to be reranked.
    - top_n (int, optional): The number of top reranked results to return. Default is 10.
    - model (str, optional): The model to use for reranking. Default is 'rerank-english-v2.0'.

    Returns:
    - dict: Reranked documents from Cohere's API.
    """
    return self.cohere.rerank(query=query, documents=documents, top_n=top_n, model=model)
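The second-stage reranking flow can be sketched offline with a stand-in scorer. The example below uses simple word overlap in place of Cohere's learned relevance model, so it only illustrates the shape of the step: take the candidates from the first-stage search, score each one against the query, and keep the `top_n` best. The function names, the result dictionary layout, and the sample candidates are assumptions for this example.

```python
def overlap_score(query, document):
    """Jaccard word overlap, a crude stand-in for a learned relevance model."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    union = query_terms | doc_terms
    return len(query_terms & doc_terms) / len(union) if union else 0.0

def toy_rerank(query, documents, top_n=10):
    """Score every candidate against the query and return the top_n, best first."""
    scored = [
        {"index": i, "relevance_score": overlap_score(query, doc), "document": doc}
        for i, doc in enumerate(documents)
    ]
    scored.sort(key=lambda result: result["relevance_score"], reverse=True)
    return scored[:top_n]

# Candidates as they might come back from a first-stage search (invented).
candidates = [
    "Paris is the capital of France",
    "The capital of Canada is Ottawa",
    "Canada is known for maple syrup",
]
results = toy_rerank("what is the capital of canada", candidates, top_n=2)
for result in results:
    print(result["index"], round(result["relevance_score"], 3), result["document"])
```

Each result keeps the index of the document in the original candidate list, which makes it straightforward to map reranked entries back to the full article records returned by the search step.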
@retry(wait=wait_random_exponential(min=1, max=5), stop=stop_after_attempt(5))
def with_llm(self, context, query, temperature=0.2, model="command", lang="english") -> list:
    # Build a prompt that grounds the answer in the retrieved context.
    prompt = f"""
        Use the information provided below to answer the questions at the end. /
        Include some curious or relevant facts extracted from the context. /
        Generate the answer in the language of the query. If you cannot determine the language of the query use {lang}. /
        If the answer to the question is not contained in the provided information, generate "The answer is not in the context".
        Context information:
        {context}
        {query}
        """
    return self.cohere.generate(
        prompt=prompt,
        num_generations=1,
        max_tokens=1000,
        temperature=temperature,
        model=model,
    )
git clone git@github.com:dcarpintero/wikisearch.git
Create a virtual environment. On Windows:
py -m venv .venv
On macOS or Linux:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run ./app.py
The demo web app is deployed to Streamlit Cloud and is available at https://wikisearch.streamlit.app/