DocsScraper.jl
v0.1.0
DocsScraper is a package designed to create "knowledge packs" from online documentation sites of the Julia language.
It scrapes and parses the URLs and, with the help of PromptingTools.jl, creates an index of chunks and their embeddings that can be used in RAG applications. It integrates with AIHelpMe.jl and PromptingTools.jl to provide efficient and relevant query retrieval, ensuring that the responses generated by the system are specific to the content in the created database.
To install DocsScraper, use the Julia package manager and the repository URL (the package is not yet registered):
using Pkg
Pkg.add(url = "https://github.com/JuliaGenAI/DocsScraper.jl")
Prerequisites: an OpenAI API key, since the embeddings are created with OpenAI models and incur a small cost (visible in the log below).
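A minimal sketch of providing the key (PromptingTools.jl, used under the hood, reads it from the OPENAI_API_KEY environment variable):

# Set the key in your shell, or in the Julia session before scraping:
ENV["OPENAI_API_KEY"] = "sk-..."

With the key in place, a knowledge pack can be built from one or more documentation URLs: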
using DocsScraper
crawlable_urls = ["https://juliagenai.github.io/DocsScraper.jl/dev"]
index_path = make_knowledge_packs(crawlable_urls;
    index_name = "docsscraper", embedding_dimension = 1024, embedding_bool = true, target_path = "knowledge_packs")
[ Info: robots.txt unavailable for https://juliagenai.github.io:/DocsScraper.jl/dev/home/
[ Info: Scraping link: https://juliagenai.github.io:/DocsScraper.jl/dev/home/
[ Info: robots.txt unavailable for https://juliagenai.github.io:/DocsScraper.jl/dev
[ Info: Scraping link: https://juliagenai.github.io:/DocsScraper.jl/dev
...
[ Info: Processing https://juliagenai.github.io:/DocsScraper.jl/dev...
[ Info: Parsing URL: https://juliagenai.github.io:/DocsScraper.jl/dev
[ Info: Scraping done: 44 chunks
[ Info: Removed 0 short chunks
[ Info: Removed 1 duplicate chunks
[ Info: Created embeddings for docsscraper. Cost: $0.001
docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5
[ Info: ARTIFACT: docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.tar.gz
┌ Info: sha256:
└ sha = "977c2b9d9fe30bebea3b6db124b733d29b7762a8f82c9bd642751f37ad27ee2e"
┌ Info: git-tree-sha1:
└ git_tree_sha = "eca409c0a32ed506fbd8125887b96987e9fb91d2"
[ Info: Saving source URLS in Julia\knowledge_packs\docsscraper\docsscraper_URL_mapping.csv
"Julia\knowledge_packs\docsscraper\Index\docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5"
make_knowledge_packs is the entry point of the package. The function takes the URLs to parse and returns the index. This index can be passed to AIHelpMe.jl to answer queries over the created knowledge packs.
The default make_knowledge_packs parameters can be changed through optional keyword arguments: embedding_bool = true, embedding_dimension = custom_dimension, max_chunk_size = custom_max_size and min_chunk_size = custom_min_size.
Note: For everyday use, embedding size = 1024 and embedding type = Bool is sufficient. This is compatible with AIHelpMe's :bronze and :silver pipelines (update_pipeline(:bronze)). For better results, use embedding size = 3072 and embedding type = Float32. This requires the :gold pipeline (see ?RAG_CONFIGURATIONS for more); a sketch of such a call follows below.
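The following is only a sketch of the higher-quality configuration described in the note, reusing the keyword arguments from the example above (the variable name gold_index_path is illustrative):

# Hypothetical example: Float32 embeddings of size 3072, intended for AIHelpMe's :gold pipeline.
gold_index_path = make_knowledge_packs(crawlable_urls;
    index_name = "docsscraper", embedding_dimension = 3072, embedding_bool = false,
    target_path = "knowledge_packs")

The rest of this page continues with the default 1024-dimensional Bool pack built earlier: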
using AIHelpMe
using AIHelpMe: pprint, load_index!

# set it as the "default" index, then it will be automatically used for every question
load_index!(index_path)
aihelp("what is DocsScraper.jl?") |> pprint
[ Info: Updated RAG pipeline to `:bronze` (Configuration key: "textembedding3large-1024-Bool").
[ Info: Loaded index from packs: julia into MAIN_INDEX
[ Info: Loading index from Julia\DocsScraper.jl\docsscraper\Index\docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5
[ Info: Loaded index a file Julia\DocsScraper.jl\docsscraper\Index\docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5 into MAIN_INDEX
[ Info: Done with RAG. Total cost: $0.009
--------------------
AI Message
--------------------
DocsScraper.jl is a Julia package designed to create a vector database from input URLs. It scrapes and parses the URLs and, with the assistance of PromptingTools.jl, creates a vector store that can be utilized in RAG (Retrieval-Augmented Generation) applications. DocsScraper.jl integrates with AIHelpMe.jl and PromptingTools.jl to provide efficient and relevant query retrieval, ensuring that the responses generated by the system are specific to the content in the created database.
Tip: Use pprint for nicer output with sources, and last_result for more detailed output (including the sources).
using AIHelpMe: last_result

# last_result() returns the last result from the RAG pipeline, ie, same as running aihelp(; return_all=true)
print(last_result())
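The two can also be combined; as a small sketch (assuming pprint accepts the result object returned by last_result), the last answer can be re-printed together with its annotated sources:

# pretty-print the last RAG result, including the sources used for the answer
pprint(last_result())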
make_knowledge_packs creates the following files:
index_name
│
├── Index
│ ├── index_name__artifact__info.txt
│ ├── index_name__vDate__model_embedding_size-embedding_type__v1.0.hdf5
│ └── index_name__vDate__model_embedding_size-embedding_type__v1.0.tar.gz
│
├── Scraped_files
│ ├── scraped_hostname-chunks-max-chunk_size-min-min_chunk_size.jls
│ ├── scraped_hostname-sources-max-chunk_size-min-min_chunk_size.jls
│ └── ...
│
└── index_name_URL_mapping.csv
This project was developed as part of the Google Summer of Code (GSoC) program. GSoC is a global program that offers stipends to student developers for contributing code to open-source projects. We are grateful for the support and opportunities provided by Google and the open-source community through this initiative.