DocsScraper is a package designed to create "knowledge packs" from online documentation sites for the Julia language.
It scrapes and parses the URLs and, with the help of PromptingTools.jl, creates an index of chunks and their embeddings that can be used in RAG applications. It integrates with AIHelpMe.jl and PromptingTools.jl to offer efficient and relevant query retrieval, ensuring that the responses generated by the system are specific to the content in the created database.
To install DocsScraper, use the Julia package manager and the repository URL (the package is not yet registered):
using Pkg
Pkg.add(url="https://github.com/JuliaGenAI/DocsScraper.jl")
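Equivalently, the package can be added from the Pkg REPL mode (press ] at the julia> prompt):
pkg> add https://github.com/JuliaGenAI/DocsScraper.jl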
Prerequisites: an OpenAI API key (set as the OPENAI_API_KEY environment variable), since the embeddings are created through PromptingTools.jl and incur a small API cost.
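A minimal sketch of setting the key from within Julia, assuming the default OpenAI-backed embedding model (the value shown is only a placeholder):
# PromptingTools.jl reads the OPENAI_API_KEY environment variable
ENV["OPENAI_API_KEY"] = "sk-..."  # placeholder; prefer exporting it in your shell instead of hard-coding it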
using DocsScraper
crawlable_urls = ["https://juliagenai.github.io/DocsScraper.jl/dev"]
index_path = make_knowledge_packs(crawlable_urls;
    index_name = "docsscraper", embedding_dimension = 1024, embedding_bool = true,
    target_path = "knowledge_packs")
[ Info: robots.txt unavailable for https://juliagenai.github.io:/DocsScraper.jl/dev/home/
[ Info: Scraping link: https://juliagenai.github.io:/DocsScraper.jl/dev/home/
[ Info: robots.txt unavailable for https://juliagenai.github.io:/DocsScraper.jl/dev
[ Info: Scraping link: https://juliagenai.github.io:/DocsScraper.jl/dev
. . .
[ Info: Processing https://juliagenai.github.io:/DocsScraper.jl/dev...
[ Info: Parsing URL: https://juliagenai.github.io:/DocsScraper.jl/dev
[ Info: Scraping done: 44 chunks
[ Info: Removed 0 short chunks
[ Info: Removed 1 duplicate chunks
[ Info: Created embeddings for docsscraper. Cost: $0.001
a docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5
[ Info: ARTIFACT: docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.tar.gz
┌ Info: sha256:
└ sha = "977c2b9d9fe30bebea3b6db124b733d29b7762a8f82c9bd642751f37ad27ee2e"
┌ Info: git-tree-sha1:
└ git_tree_sha = "eca409c0a32ed506fbd8125887b96987e9fb91d2"
[ Info: Saving source URLS in Julia\knowledge_packs\docsscraper\docsscraper_URL_mapping.csv
"Julia\knowledge_packs\docsscraper\Index\docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5"
make_knowledge_packs is the entry point to the package. It takes in the URLs to parse and returns the path to the created index, which can be passed to AIHelpMe.jl to answer queries on the built knowledge packs.
The defaults of make_knowledge_packs can be customized through its keyword parameters, e.g. embedding_bool = true, embedding_dimension = custom_dimension, max_chunk_size = custom_max_size and min_chunk_size = custom_min_size.
Note: For everyday use, embedding size = 1024 and embedding type = Bool is sufficient. This is compatible with AIHelpMe's :bronze and :silver pipelines (update_pipeline(:bronze)). For better results, use embedding size = 3072 and embedding type = Float32, which requires the :gold pipeline (see ?RAG_CONFIGURATIONS for more).
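A hedged sketch of the higher-fidelity setup described in the note, reusing crawlable_urls from above; setting embedding_bool = false to obtain Float32 embeddings and the chunk-size values are assumptions for illustration only:
# Sketch: 3072-dimensional Float32 embeddings for use with AIHelpMe's :gold pipeline
index_path = make_knowledge_packs(crawlable_urls;
    index_name = "docsscraper", embedding_dimension = 3072, embedding_bool = false,  # assumption: false yields Float32 embeddings
    max_chunk_size = 384, min_chunk_size = 40,                                       # illustrative chunk sizes
    target_path = "knowledge_packs")
# Then switch AIHelpMe to the :gold pipeline as mentioned in the note (see ?RAG_CONFIGURATIONS).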
using AIHelpMe
using AIHelpMe: pprint, load_index!
# set it as the "default" index, then it will be automatically used for every question
load_index!(index_path)
aihelp("what is DocsScraper.jl?") |> pprint
[ Info: Updated RAG pipeline to `:bronze` (Configuration key: "textembedding3large-1024-Bool").
[ Info: Loaded index from packs: julia into MAIN_INDEX
[ Info: Loading index from Julia\DocsScraper.jl\docsscraper\Index\docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5
[ Info: Loaded index a file Julia\DocsScraper.jl\docsscraper\Index\docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5 into MAIN_INDEX
[ Info: Done with RAG. Total cost: $0.009
--------------------
AI Message
--------------------
DocsScraper.jl is a Julia package designed to create a vector database from input URLs. It scrapes and parses the URLs and, with the assistance of
PromptingTools.jl, creates a vector store that can be utilized in RAG (Retrieval-Augmented Generation) applications. DocsScraper.jl integrates with
AIHelpMe.jl and PromptingTools.jl to provide efficient and relevant query retrieval, ensuring that the responses generated by the system are specific to the content in the created database.
Tip: Use pprint for nicer outputs with sources, and last_result for more detailed outputs.
using AIHelpMe: last_result
# last_result() returns the last result from the RAG pipeline, i.e., the same as running aihelp(; return_all = true)
print(last_result())
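For example, the two helpers from the tip can be combined to re-print the last answer together with its sources:
# Pretty-print the most recent RAG result, including the retrieved sources
pprint(last_result())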
make_knowledge_packs creates the following files:
index_name
│
├── Index
│   ├── index_name__artifact__info.txt
│   ├── index_name__vDate__model_embedding_size-embedding_type__v1.0.hdf5
│   └── index_name__vDate__model_embedding_size-embedding_type__v1.0.tar.gz
│
├── Scraped_files
│   ├── scraped_hostname-chunks-max-chunk_size-min-min_chunk_size.jls
│   ├── scraped_hostname-sources-max-chunk_size-min-min_chunk_size.jls
│   └── . . .
│
└── index_name_URL_mapping.csv
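As an illustrative sketch (assuming CSV.jl and DataFrames.jl are installed, and the relative target_path = "knowledge_packs" used above), the URL mapping file can be inspected to see the scraped source URLs:
using CSV, DataFrames  # assumption: both packages are added to the active environment
# Load the source-URL mapping written for the "docsscraper" pack built above
mapping = CSV.read(joinpath("knowledge_packs", "docsscraper", "docsscraper_URL_mapping.csv"), DataFrame)
first(mapping, 5)  # preview the first few rows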
This project was developed as part of the Google Summer of Code (GSoC) program. GSoC is a global program that offers student developers stipends to write code for open-source projects. We are grateful for the support and opportunity provided by Google and the open-source community through this initiative.