WikiSearch
1.0.0
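A Streamlit application for multilingual semantic search over more than 10 million Wikipedia documents vectorized as embeddings by Weaviate. The implementation is based on Cohere's blog post "Using LLMs for Search" and its accompanying notebook. It lets you compare keyword search, dense retrieval, and hybrid search when querying the Wikipedia dataset. It also demonstrates how Cohere Rerank can improve the accuracy of results, and how Cohere Generate can produce a response from the ranked results.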
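Semantic search refers to search algorithms that consider the intent and contextual meaning of the search phrase when producing results, rather than focusing solely on keyword matching. By understanding the semantics, or meaning, behind a query, it provides more accurate and relevant results.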
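Embeddings are vectors (lists) of floating-point numbers that represent data such as words, sentences, documents, images, or audio. This numeric representation captures the context, hierarchy, and similarity of the data. Embeddings can be used for downstream tasks such as classification, clustering, outlier detection, and semantic search.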
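To make the notion of similarity between embeddings concrete, here is a minimal sketch that compares vectors with cosine similarity. The three-dimensional vectors are made-up toy values, not the output of any embedding model, and the function name `cosine_similarity` is only illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" (assumed values for illustration only).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

print("cat vs kitten:", cosine_similarity(cat, kitten))
print("cat vs car:   ", cosine_similarity(cat, car))
```

Vectors that point in a similar direction score close to 1, which is how related items end up near each other in a semantic search.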
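Vector databases, such as Weaviate, are purpose-built to optimize the storage and querying of embeddings. In practice, a vector database uses a combination of algorithms that all take part in Approximate Nearest Neighbor (ANN) search. These algorithms optimize the search through hashing, quantization, or graph-based search.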
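As a rough illustration of the hashing idea only, the sketch below buckets vectors by which side of a few hyperplanes they fall on, so a query is compared against a single bucket instead of the whole collection. The hyperplanes, vectors, and function names are hypothetical, and this is not a description of how Weaviate indexes data.

```python
from collections import defaultdict

# Fixed hyperplane normals, chosen by hand for the example.
# Real hashing schemes typically draw many of them at random.
HYPERPLANES = [(1.0, 0.0), (0.0, 1.0)]


def signature(vec):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        int(sum(h * v for h, v in zip(plane, vec)) >= 0)
        for plane in HYPERPLANES
    )


def build_index(vectors):
    """Group vector names by their bit signature."""
    buckets = defaultdict(list)
    for name, vec in vectors.items():
        buckets[signature(vec)].append(name)
    return buckets


def candidates(index, query):
    """Return only the vectors that share the query's bucket."""
    return index.get(signature(query), [])


vectors = {
    "a": (0.9, 0.8),
    "b": (0.7, 0.95),
    "c": (-0.8, 0.6),
    "d": (-0.5, -0.9),
}
index = build_index(vectors)
print(candidates(index, (1.0, 0.5)))
```

The search is approximate because a true nearest neighbor can sit just across a hyperplane in a different bucket. That is the trade-off ANN methods accept in exchange for speed.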
Keyword matching: finds objects that contain the search terms in their properties. Results are scored with the BM25F function:
@retry(wait=wait_random_exponential(min=1, max=5), stop=stop_after_attempt(5))
def with_bm25(self, query, lang='en', top_n=10) -> list:
    """
    Performs a keyword search (sparse retrieval) on Wikipedia Articles stored in Weaviate.

    Parameters:
    - query (str): The search query.
    - lang (str, optional): The language of the articles. Default is 'en'.
    - top_n (int, optional): The number of top results to return. Default is 10.

    Returns:
    - list: List of top articles based on BM25F scoring.
    """
    logging.info("with_bm25()")

    # Restrict results to articles in the requested language.
    where_filter = {
        "path": ["lang"],
        "operator": "Equal",
        "valueString": lang
    }

    response = (
        self.weaviate.query.get("Articles", self.WIKIPEDIA_PROPERTIES)
        .with_bm25(query=query)
        .with_where(where_filter)
        .with_limit(top_n)
        .do()
    )
    return response["data"]["Get"]["Articles"]
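For intuition about the scoring, the sketch below implements plain BM25 over whitespace-tokenized text. It is a simplification: BM25F additionally weights individual properties (fields), which is omitted here. The values `k1=1.2` and `b=0.75` are commonly used defaults assumed for the example, not values taken from this project.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with plain BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "the eiffel tower is in paris",
    "paris is the capital of france",
    "the great wall is in china",
]
scores = bm25_scores("eiffel tower", docs)
print(scores)
```

Only documents that literally contain a query term receive a non-zero score, which is the main limitation that dense retrieval addresses.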
Dense retrieval: finds the objects most similar to a raw (un-vectorized) text:
@retry(wait=wait_random_exponential(min=1, max=5), stop=stop_after_attempt(5))
def with_neartext(self, query, lang='en', top_n=10) -> list:
    """
    Performs a semantic search (dense retrieval) on Wikipedia Articles using embeddings stored in Weaviate.

    Parameters:
    - query (str): The search query.
    - lang (str, optional): The language of the articles. Default is 'en'.
    - top_n (int, optional): The number of top results to return. Default is 10.

    Returns:
    - list: List of top articles based on semantic similarity.
    """
    logging.info("with_neartext()")

    nearText = {
        "concepts": [query]
    }

    where_filter = {
        "path": ["lang"],
        "operator": "Equal",
        "valueString": lang
    }

    response = (
        self.weaviate.query.get("Articles", self.WIKIPEDIA_PROPERTIES)
        .with_near_text(nearText)
        .with_where(where_filter)
        .with_limit(top_n)
        .do()
    )
    return response['data']['Get']['Articles']
Hybrid search: produces results from a weighted combination of the keyword (BM25) search and vector search results:
@retry(wait=wait_random_exponential(min=1, max=5), stop=stop_after_attempt(5))
def with_hybrid(self, query, lang='en', top_n=10) -> list:
    """
    Performs a hybrid search on Wikipedia Articles using embeddings stored in Weaviate.

    Parameters:
    - query (str): The search query.
    - lang (str, optional): The language of the articles. Default is 'en'.
    - top_n (int, optional): The number of top results to return. Default is 10.

    Returns:
    - list: List of top articles based on hybrid scoring.
    """
    logging.info("with_hybrid()")

    where_filter = {
        "path": ["lang"],
        "operator": "Equal",
        "valueString": lang
    }

    response = (
        self.weaviate.query.get("Articles", self.WIKIPEDIA_PROPERTIES)
        .with_hybrid(query=query)
        .with_where(where_filter)
        .with_limit(top_n)
        .do()
    )
    return response["data"]["Get"]["Articles"]
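The weighted combination can be pictured with the following sketch, which min-max normalizes two sets of scores and blends them with a weight `alpha`. It illustrates the general idea only and is not Weaviate's internal fusion logic. The document ids, scores, and function names are made up for the example, and both score sets are assumed to be non-empty.

```python
def min_max(scores):
    """Rescale a non-empty {id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(keyword, vector, alpha=0.5):
    """Blend scores: alpha=0 is pure keyword search, alpha=1 is pure vector search."""
    kw, vec = min_max(keyword), min_max(vector)
    doc_ids = set(kw) | set(vec)
    return {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


# Hypothetical raw scores from the two retrievers.
keyword = {"doc1": 12.0, "doc2": 4.0, "doc3": 8.0}
vector = {"doc1": 0.2, "doc2": 0.9, "doc3": 0.8}

fused = hybrid_scores(keyword, vector, alpha=0.5)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking[0])
```

In this toy data, `doc3` is neither the best keyword match nor the best vector match, yet it ranks first once both signals are weighted equally.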
@retry(wait=wait_random_exponential(min=1, max=5), stop=stop_after_attempt(5))
def rerank(self, query, documents, top_n=10, model='rerank-english-v2.0') -> dict:
    """
    Reranks a list of responses using Cohere's reranking API.

    Parameters:
    - query (str): The search query.
    - documents (list): List of documents to be reranked.
    - top_n (int, optional): The number of top reranked results to return. Default is 10.
    - model: The model to use for reranking. Default is 'rerank-english-v2.0'.

    Returns:
    - dict: Reranked documents from Cohere's API.
    """
    return self.cohere.rerank(query=query, documents=documents, top_n=top_n, model=model)
Source: Cohere
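The shape of a rerank step can be sketched locally without calling any API: score each retrieved document against the query, sort by that score, and keep the top results. The token-overlap scorer below is a deliberately crude stand-in for a learned relevance model, and the names `rerank_locally` and `overlap_score` as well as the result keys are hypothetical.

```python
def overlap_score(query, document):
    """Stand-in relevance score: fraction of query tokens present in the document."""
    q = set(query.lower().split())
    d = set(document.lower().split())
    return len(q & d) / len(q) if q else 0.0


def rerank_locally(query, documents, top_n=10, score_fn=overlap_score):
    """Reorder documents by descending score, keeping original order on ties."""
    scored = [(score_fn(query, doc), idx) for idx, doc in enumerate(documents)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [
        {"index": idx, "relevance_score": score, "document": documents[idx]}
        for score, idx in scored[:top_n]
    ]


documents = [
    "Paris is the capital of France",
    "The capital of Canada is Ottawa",
    "Ottawa lies on the Ottawa River",
]
top = rerank_locally("capital of canada", documents, top_n=2)
for result in top:
    print(result["relevance_score"], result["document"])
```

A real reranking model scores the query and document together rather than counting shared tokens, which is why it can reorder results that a first-stage retriever ranked poorly.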
@retry(wait=wait_random_exponential(min=1, max=5), stop=stop_after_attempt(5))
def with_llm(self, context, query, temperature=0.2, model="command", lang="english") -> list:
    prompt = f"""
    Use the information provided below to answer the questions at the end. /
    Include some curious or relevant facts extracted from the context. /
    Generate the answer in the language of the query. If you cannot determine the language of the query use {lang}. /
    If the answer to the question is not contained in the provided information, generate "The answer is not in the context".
    ---
    Context information:
    {context}
    ---
    Question:
    {query}
    """

    return self.cohere.generate(
        prompt=prompt,
        num_generations=1,
        max_tokens=1000,
        temperature=temperature,
        model=model,
    )
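How the `context` argument is assembled is not shown here. One plausible approach, sketched below, is to join the top-ranked articles into a single string while capping its length. The helper `build_context`, the `title` and `text` keys, and the `max_chars` limit are all assumptions made for illustration.

```python
def build_context(documents, max_chars=2000):
    """Join ranked documents into one context string, stopping at max_chars."""
    parts = []
    used = 0
    for doc in documents:
        # Assumed document shape: a dict with 'title' and 'text' keys.
        entry = f"{doc['title']}: {doc['text']}"
        if used + len(entry) > max_chars:
            break
        parts.append(entry)
        used += len(entry)
    return "\n".join(parts)


ranked = [
    {"title": "Ottawa", "text": "Ottawa is the capital of Canada."},
    {"title": "Paris", "text": "Paris is the capital of France."},
]
context = build_context(ranked)
print(context)
```

Capping the context keeps the prompt within the model's input limit and puts the highest-ranked passages first.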
git clone git@github.com:dcarpintero/wikisearch.git
Windows:
py -m venv .venv
.venv\scripts\activate
macOS/Linux:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run ./app.py
The demo web application is deployed on Streamlit Cloud and is available at https://wikisearch.streamlit.app/