Streamlit application for Multilingual Semantic Search over more than 10 million Wikipedia documents vectorized into embeddings and stored in Weaviate. The implementation is based on the Cohere blog post 'Using LLMs for Search' and its corresponding notebook. It makes it possible to compare the performance of keyword search, dense retrieval, and hybrid search for querying the Wikipedia dataset. It further demonstrates the use of Cohere Rerank to improve the accuracy of the results, and of Cohere Generate to provide a response grounded in those reranked results.
Semantic search refers to search algorithms that take the intent and contextual meaning of the search phrase into account when producing results, rather than focusing solely on keyword matching. It returns more accurate and relevant results by understanding the semantics, or meaning, behind the query.
Embeddings are vectors (lists) of floating-point numbers that represent data such as words, sentences, documents, images, or audio. These numerical representations capture the context, hierarchy, and similarity of the data. They can be used for downstream tasks such as classification, clustering, outlier detection, and semantic search.
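For illustration, this is roughly how such an embedding can be computed with the Cohere SDK (a minimal sketch, not code from this project; the model name and API-key handling are assumptions):

import cohere

# Minimal sketch (not part of this repo): turn a sentence into an embedding vector.
co = cohere.Client("YOUR_COHERE_API_KEY")  # placeholder key

response = co.embed(
    texts=["Semantic search considers meaning, not just keywords."],
    model="embed-multilingual-v2.0",  # assumed multilingual model
)
vector = response.embeddings[0]  # a list of floats representing the sentence
print(len(vector))               # embedding dimensionality (768 for embed-multilingual-v2.0)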
Vector databases, such as Weaviate, are purpose-built to optimize storage and query capabilities for embeddings. In practice, a vector database uses a combination of algorithms that all participate in Approximate Nearest Neighbor (ANN) search. These algorithms optimize the search through hashing, quantization, or graph-based search.
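The Wikipedia articles in this project are queried through such an ANN index in Weaviate. A minimal connection sketch, assuming the weaviate-client v3 builder API used in the snippets below (the URL and environment-variable names are placeholders):

import os
import weaviate

# Minimal sketch: connect to a Weaviate instance and forward the Cohere key
# so queries can be vectorized server-side by the configured Cohere module.
client = weaviate.Client(
    url=os.environ["WEAVIATE_URL"],  # placeholder
    auth_client_secret=weaviate.AuthApiKey(os.environ["WEAVIATE_API_KEY"]),
    additional_headers={"X-Cohere-Api-Key": os.environ["COHERE_API_KEY"]},
)
print(client.is_ready())  # True when the instance (and its ANN index) is reachable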
Keyword Matching: searches for objects that contain the search terms in their properties. The results are scored according to the BM25F function:
@retry(wait=wait_random_exponential(min=1, max=5), stop=stop_after_attempt(5))
def with_bm25(self, query, lang='en', top_n=10) -> list:
    """
    Performs a keyword search (sparse retrieval) on Wikipedia Articles using embeddings stored in Weaviate.

    Parameters:
    - query (str): The search query.
    - lang (str, optional): The language of the articles. Default is 'en'.
    - top_n (int, optional): The number of top results to return. Default is 10.

    Returns:
    - list: List of top articles based on BM25F scoring.
    """
    logging.info("with_bm25()")
    where_filter = {
        "path": ["lang"],
        "operator": "Equal",
        "valueString": lang,
    }
    response = (
        self.weaviate.query
        .get("Articles", self.WIKIPEDIA_PROPERTIES)
        .with_bm25(query=query)
        .with_where(where_filter)
        .with_limit(top_n)
        .do()
    )
    return response["data"]["Get"]["Articles"]
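For illustration, such a keyword query could be issued as follows (a hedged sketch: the 'engine' object stands in for whatever class hosts with_bm25, and the property names are assumptions about WIKIPEDIA_PROPERTIES):

# Hypothetical usage of the method above.
articles = engine.with_bm25("Who discovered penicillin?", lang="en", top_n=5)
for article in articles:
    print(article["title"], "-", article["url"])  # property names are assumptions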
Dense Retrieval: find the objects most similar to a raw (un-vectorized) text query:
@retry(wait=wait_random_exponential(min=1, max=5), stop=stop_after_attempt(5))
def with_neartext(self, query, lang='en', top_n=10) -> list:
    """
    Performs a semantic search (dense retrieval) on Wikipedia Articles using embeddings stored in Weaviate.

    Parameters:
    - query (str): The search query.
    - lang (str, optional): The language of the articles. Default is 'en'.
    - top_n (int, optional): The number of top results to return. Default is 10.

    Returns:
    - list: List of top articles based on semantic similarity.
    """
    logging.info("with_neartext()")
    nearText = {
        "concepts": [query]
    }
    where_filter = {
        "path": ["lang"],
        "operator": "Equal",
        "valueString": lang,
    }
    response = (
        self.weaviate.query
        .get("Articles", self.WIKIPEDIA_PROPERTIES)
        .with_near_text(nearText)
        .with_where(where_filter)
        .with_limit(top_n)
        .do()
    )
    return response['data']['Get']['Articles']
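Note that with_near_text sends the raw query text to Weaviate; the vectorizer module configured on the Articles class (typically text2vec-cohere for this Cohere-embedded dataset, which is why the Cohere API key is forwarded in the client headers) embeds the query server-side with the same multilingual model as the documents. The nearText argument also accepts optional constraints, for example (illustrative values):

# Optional nearText constraints in the weaviate-client v3 syntax (values are illustrative).
nearText = {
    "concepts": ["vincent van gogh"],
    "distance": 0.7,  # only keep matches within this vector distance
}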
Hybrid Search: produces results based on a weighted combination of the results from a keyword (bm25) search and a vector search.
@retry(wait=wait_random_exponential(min=1, max=5), stop=stop_after_attempt(5))
def with_hybrid(self, query, lang='en', top_n=10) -> list:
    """
    Performs a hybrid search on Wikipedia Articles using embeddings stored in Weaviate.

    Parameters:
    - query (str): The search query.
    - lang (str, optional): The language of the articles. Default is 'en'.
    - top_n (int, optional): The number of top results to return. Default is 10.

    Returns:
    - list: List of top articles based on hybrid scoring.
    """
    logging.info("with_hybrid()")
    where_filter = {
        "path": ["lang"],
        "operator": "Equal",
        "valueString": lang,
    }
    response = (
        self.weaviate.query
        .get("Articles", self.WIKIPEDIA_PROPERTIES)
        .with_hybrid(query=query)
        .with_where(where_filter)
        .with_limit(top_n)
        .do()
    )
    return response["data"]["Get"]["Articles"]
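The relative weight of the two signals is tunable: in the weaviate-client v3 API, with_hybrid also accepts an alpha parameter (0 = pure BM25, 1 = pure vector search). A minimal sketch, reusing the 'client' connection from the earlier example (values are illustrative, not what the app uses):

# Hybrid search with an explicit keyword/vector weighting (alpha=0.5 gives equal weight).
response = (
    client.query
    .get("Articles", ["title", "text", "url"])  # property names are assumptions
    .with_hybrid(query="neural networks", alpha=0.5)
    .with_limit(5)
    .do()
)
print(response["data"]["Get"]["Articles"])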
@retry(wait=wait_random_exponential(min=1, max=5), stop=stop_after_attempt(5))
def rerank(self, query, documents, top_n=10, model='rerank-english-v2.0') -> dict:
    """
    Reranks a list of responses using Cohere's reranking API.

    Parameters:
    - query (str): The search query.
    - documents (list): List of documents to be reranked.
    - top_n (int, optional): The number of top reranked results to return. Default is 10.
    - model: The model to use for reranking. Default is 'rerank-english-v2.0'.

    Returns:
    - dict: Reranked documents from Cohere's API.
    """
    return self.cohere.rerank(query=query, documents=documents, top_n=top_n, model=model)
Source: Cohere
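Rerank expects plain documents rather than Weaviate result objects, so in practice the article text is extracted from the retrieval hits first, and the returned indices are mapped back to the original articles. A hedged sketch (the 'engine' object and the 'text' property name are assumptions; attribute names follow the Cohere v4 SDK):

# Hedged sketch: feed dense-retrieval hits into Cohere Rerank and map results back.
query = "Who painted The Starry Night?"
hits = engine.with_neartext(query, lang="en", top_n=20)
docs = [hit["text"] for hit in hits]  # 'text' property name is an assumption

reranked = engine.rerank(query=query, documents=docs, top_n=5)
for result in reranked.results:       # .results / .index / .relevance_score per Cohere v4 SDK
    print(round(result.relevance_score, 3), hits[result.index]["title"])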
@retry(wait=wait_random_exponential(min=1, max=5), stop=stop_after_attempt(5))
def with_llm(self, context, query, temperature=0.2, model="command", lang="english") -> list:
    """
    Generates an answer with Cohere's generate API, grounded in the provided context.

    Parameters:
    - context: The (reranked) articles to ground the answer in.
    - query (str): The user question.
    - temperature (float, optional): Sampling temperature. Default is 0.2.
    - model (str, optional): The generation model. Default is 'command'.
    - lang (str, optional): Fallback answer language. Default is 'english'.
    """
    prompt = f"""
Use the information provided below to answer the questions at the end. /
Include some curious or relevant facts extracted from the context. /
Generate the answer in the language of the query. If you cannot determine the language of the query use {lang}. /
If the answer to the question is not contained in the provided information, generate "The answer is not in the context".
---
Context information:
{context}
---
Question:
{query}
"""
    return self.cohere.generate(
        prompt=prompt,
        num_generations=1,
        max_tokens=1000,
        temperature=temperature,
        model=model,
    )
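Putting the pieces together, a request in the app flows retrieval, then rerank, then generate. A hedged end-to-end sketch (the 'engine' object and the 'text' property name are assumptions; error handling omitted):

# Hedged end-to-end sketch: retrieve, rerank, then answer from the reranked context.
query = "What did Marie Curie win the Nobel Prize for?"

hits = engine.with_hybrid(query, lang="en", top_n=20)   # or with_bm25 / with_neartext
docs = [hit["text"] for hit in hits]                    # 'text' is an assumed property

top = engine.rerank(query=query, documents=docs, top_n=5).results
context = "\n".join(docs[r.index] for r in top)         # keep only the best passages

answer = engine.with_llm(context=context, query=query)
print(answer.generations[0].text)                       # response shape per the Cohere v4 SDK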
To run the app locally, clone the repository:
git clone git@github.com:dcarpintero/wikisearch.git
Windows:
py -m venv .venv
.venv\scripts\activate
macOS/Linux:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run ./app.py
The Web App demo is deployed to Streamlit Cloud and available at https://wikisearch.streamlit.app/