DocsScraper.jl

v0.1.0

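DocsScraper: "Efficient RAG Knowledge Pack Creator from Online Julia Documentation"

DocsScraper is a package designed to create "knowledge packs" from online documentation sites for the Julia language.

It scrapes and parses the URLs and, with the help of PromptingTools.jl, creates an index of chunks and their embeddings that can be used in RAG applications. It integrates with AIHelpMe.jl and PromptingTools.jl to offer efficient and relevant query retrieval, ensuring that the responses generated by the system are specific to the content in the created database.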

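Features

URL scraping and parsing: Automatically scrapes and parses input URLs to extract relevant information, paying particular attention to code snippets and code blocks. The chunk sizes can be customized.
URL crawling: Optionally crawls the input URLs to find multiple pages in the same domain.
Knowledge index creation: Uses PromptingTools.jl to create embeddings with a customizable embedding model, size, and type (Bool and Float32).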
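
To make the crawling feature concrete, here is a minimal sketch of a same-domain link filter. It is not DocsScraper's implementation: `url_host` and `same_domain_links` are hypothetical helper names, and the host is extracted with a simple regular expression rather than a full URL parser.

```julia
# Hypothetical helpers (not part of DocsScraper's API): keep only the links
# that share a host with the seed URL.

# Extract the host of an http(s) URL, or `nothing` if the URL does not match.
function url_host(url::AbstractString)
    m = match(r"^https?://([^/:?#]+)", url)
    return m === nothing ? nothing : lowercase(m.captures[1])
end

# Return the links whose host equals the host of `seed`.
function same_domain_links(seed::AbstractString, links::Vector{<:AbstractString})
    seed_host = url_host(seed)
    seed_host === nothing && return similar(links, 0)  # unusable seed: keep nothing
    return filter(link -> url_host(link) == seed_host, links)
end

seed = "https://juliagenai.github.io/DocsScraper.jl/dev"
links = [
    "https://juliagenai.github.io/DocsScraper.jl/dev/home/",
    "https://github.com/JuliaGenAI/DocsScraper.jl",
    "https://docs.julialang.org/en/v1/",
]
kept = same_domain_links(seed, links)
```

Only the first link survives here, because the other two live on different hosts. A real crawler would also need to respect robots.txt, as the log messages in the example run suggest the package does.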

Installation

To install DocsScraper, use the Julia package manager with the repository URL (the package is not yet registered):

    using Pkg
    Pkg.add(url = "https://github.com/JuliaGenAI/DocsScraper.jl")

Prerequisites:

Julia (version 1.10 or later).
Internet connection for API access.
OpenAI API key with available credits. See How to Obtain API Keys.

Building the Index

    using DocsScraper
    crawlable_urls = ["https://juliagenai.github.io/DocsScraper.jl/dev"]

    index_path = make_knowledge_packs(crawlable_urls;
        index_name = "docsscraper", embedding_dimension = 1024, embedding_bool = true, target_path = "knowledge_packs")

    [ Info: robots.txt unavailable for https://juliagenai.github.io:/DocsScraper.jl/dev/home/
    [ Info: Scraping link: https://juliagenai.github.io:/DocsScraper.jl/dev/home/
    [ Info: robots.txt unavailable for https://juliagenai.github.io:/DocsScraper.jl/dev
    [ Info: Scraping link: https://juliagenai.github.io:/DocsScraper.jl/dev
    ...
    [ Info: Processing https://juliagenai.github.io:/DocsScraper.jl/dev...
    [ Info: Parsing URL: https://juliagenai.github.io:/DocsScraper.jl/dev
    [ Info: Scraping done: 44 chunks
    [ Info: Removed 0 short chunks
    [ Info: Removed 1 duplicate chunks
    [ Info: Created embeddings for docsscraper. Cost: $0.001
    a docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5
    [ Info: ARTIFACT: docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.tar.gz
    ┌ Info: sha256:
    └   sha = "977c2b9d9fe30bebea3b6db124b733d29b7762a8f82c9bd642751f37ad27ee2e"
    ┌ Info: git-tree-sha1:
    └   git_tree_sha = "eca409c0a32ed506fbd8125887b96987e9fb91d2"
    [ Info: Saving source URLS in Julia\knowledge_packs\docsscraper\docsscraper_URL_mapping.csv
    "Julia\knowledge_packs\docsscraper\Index\docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5"

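make_knowledge_packs is the entry point to the package. This function takes the URLs to parse and returns the index; in the example above, the returned value is the path to the saved index file. The index can be passed to AIHelpMe.jl to answer queries on the built knowledge packs.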

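Default make_knowledge_packs parameters:

The default embedding type is Float32. Change it to Bool with the optional parameter embedding_bool = true.
The default embedding size is 3072. Change it to a custom size with the optional parameter embedding_dimension = custom_dimension.
The default model is OpenAI's text-embedding-3-large.
The default maximum chunk size is 384 and the default minimum chunk size is 40. Change them with the optional parameters max_chunk_size = custom_max_size and min_chunk_size = custom_min_size.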
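
The minimum chunk size, together with the "Removed ... short chunks" and "Removed ... duplicate chunks" lines in the log above, points to a filtering step after scraping. The sketch below shows one way such a step could look. It is an illustration only: `filter_chunks` is a hypothetical name, and it assumes chunk size is measured in characters, which may differ from how the package measures it.

```julia
# Hypothetical post-processing sketch (not DocsScraper's code): drop chunks
# shorter than `min_chunk_size` characters, then drop exact duplicates.
function filter_chunks(chunks::Vector{String}; min_chunk_size::Int = 40)
    long_enough = filter(c -> length(c) >= min_chunk_size, chunks)
    deduped = unique(long_enough)
    n_short = length(chunks) - length(long_enough)       # removed as too short
    n_duplicate = length(long_enough) - length(deduped)  # removed as duplicates
    return (; chunks = deduped, n_short, n_duplicate)
end

chunks = [
    "DocsScraper scrapes and parses URLs to build knowledge packs.",
    "too short",
    "DocsScraper scrapes and parses URLs to build knowledge packs.",
]
result = filter_chunks(chunks)
```

With the default threshold of 40, this keeps one chunk and reports one short chunk and one duplicate removed.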

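Note: For everyday use, embedding size = 1024 and embedding type = Bool are sufficient. This is compatible with AIHelpMe's :bronze and :silver pipelines ( update_pipeline(:bronze) ). For better results, use embedding size = 3072 and embedding type = Float32. This requires the :gold pipeline (see ?RAG_CONFIGURATIONS for more).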
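
To give a feel for the trade-off in the note above, the following sketch compares the storage per chunk of a 3072-dimensional Float32 embedding with that of a 1024-dimensional Bool embedding. It rests on assumptions: the shortened embedding is obtained by truncating and re-normalizing, the Bool form keeps one bit per dimension (positive or not), and the bits are stored packed. The package and its storage format may do this differently, and `truncate_embedding` and `binarize` are hypothetical names.

```julia
using LinearAlgebra: norm

# Keep the first `dim` entries and re-normalize to unit length (assumed scheme).
function truncate_embedding(v::Vector{Float32}, dim::Int)
    head = v[1:dim]
    return head ./ norm(head)
end

# One bit per dimension: `true` where the entry is positive (returns a BitVector).
binarize(v::Vector{Float32}) = v .> 0

full = Float32.(sin.(1:3072))            # stand-in for a real 3072-dim embedding
short = truncate_embedding(full, 1024)
bits = binarize(short)

bytes_float32_full = sizeof(full)        # 3072 entries * 4 bytes each
bytes_bool_short = cld(length(bits), 8)  # 1024 bits packed into bytes
```

Under these assumptions the full Float32 embedding takes 12288 bytes per chunk and the packed 1024-bit form takes 128 bytes, a 96-fold reduction, at the cost of the retrieval quality that the :gold configuration is meant to recover.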

Using the Index for Questions

    using AIHelpMe
    using AIHelpMe: pprint, load_index!

    # set it as the "default" index, then it will be automatically used for every question
    load_index!(index_path)

    aihelp("what is DocsScraper.jl?") |> pprint

    [ Info: Updated RAG pipeline to `:bronze` (Configuration key: "textembedding3large-1024-Bool").
    [ Info: Loaded index from packs: julia into MAIN_INDEX
    [ Info: Loading index from Julia\DocsScraper.jl\docsscraper\Index\docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5
    [ Info: Loaded index a file Julia\DocsScraper.jl\docsscraper\Index\docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5 into MAIN_INDEX
    [ Info: Done with RAG. Total cost: $0.009
    --------------------
    AI Message
    --------------------
    DocsScraper.jl is a Julia package designed to create a vector database from input URLs. It scrapes and parses the URLs and, with the assistance of
    PromptingTools.jl, creates a vector store that can be utilized in RAG (Retrieval-Augmented Generation) applications. DocsScraper.jl integrates with
    AIHelpMe.jl and PromptingTools.jl to provide efficient and relevant query retrieval, ensuring that the responses generated by the system are specific to the content in the created database.

Tip: Use pprint for nicer output with sources, and last_result for more detailed output (including sources).

    using AIHelpMe: last_result
    # last_result() returns the last result from the RAG pipeline, i.e., same as running aihelp(; return_all=true)
    print(last_result())

Output

make_knowledge_packs creates the following files:

    index_name
    │
    ├── Index
    │   ├── index_name__artifact__info.txt
    │   ├── index_name__vDate__model_embedding_size-embedding_type__v1.0.hdf5
    │   └── index_name__vDate__model_embedding_size-embedding_type__v1.0.tar.gz
    │
    ├── Scraped_files
    │   ├── scraped_hostname-chunks-max-chunk_size-min-min_chunk_size.jls
    │   ├── scraped_hostname-sources-max-chunk_size-min-min_chunk_size.jls
    │   └── ...
    │
    └── index_name_URL_mapping.csv
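
The index file name pattern in the tree above can be read against the example run, where it produced docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5. The sketch below reconstructs that name from its parts. `pack_filename` is a hypothetical helper written for illustration; the date format and the removal of hyphens from the model name are inferred from that single example, so the package may build the name differently.

```julia
using Dates

# Hypothetical helper mirroring the naming pattern
# index_name__vDate__model_embedding_size-embedding_type__v1.0.hdf5
function pack_filename(index_name::AbstractString, date::Date,
        model::AbstractString, dim::Integer, ::Type{T}) where {T}
    model_tag = replace(model, "-" => "")      # "text-embedding-3-large" -> "textembedding3large"
    date_tag = Dates.format(date, "yyyymmdd")  # e.g. "20240823"
    return "$(index_name)__v$(date_tag)__$(model_tag)-$(dim)-$(T)__v1.0.hdf5"
end

name = pack_filename("docsscraper", Date(2024, 8, 23),
    "text-embedding-3-large", 1024, Bool)
```

Reading the name this way makes it easy to tell from a file alone which embedding size and type it holds, and therefore which AIHelpMe pipeline it is compatible with.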