DocsScraper.jl

v0.1.0

DocsScraper: "Efficient RAG Knowledge Pack Creator from Online Julia Documentation"

DocsScraper is a package designed to create "knowledge packs" from online documentation sites for the Julia language.

It scrapes and parses the URLs and, with the help of PromptingTools.jl, creates an index of chunks and their embeddings that can be used in RAG applications. It integrates with AIHelpMe.jl and PromptingTools.jl to offer efficient and relevant query retrieval, ensuring that the responses generated by the system are specific to the content in the created database.

Features

URL Scraping and Parsing: Automatically scrapes and parses input URLs to extract relevant information, paying particular attention to code snippets and code blocks. Offers an option to customize the chunk sizes.
URL Crawling: Optionally crawls the input URLs to look for multiple pages in the same domain.
Knowledge Index Creation: Leverages PromptingTools.jl to create embeddings with customizable models, sizes, and embedding types (Bool and Float32).
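
To make the difference between the Bool and Float32 embedding types concrete, here is a minimal sketch in plain Julia. It is not the implementation used by DocsScraper or PromptingTools.jl: it assumes that a Bool embedding is obtained by truncating a Float32 vector to the requested size and keeping only the sign of each entry, and the helper names `binarize` and `agreement` are hypothetical.

```julia
# Hypothetical helpers for illustration only; not part of DocsScraper.jl.

# Truncate an embedding to `dim` entries and keep only the sign of each value.
function binarize(embedding::AbstractVector{<:Real}, dim::Integer)
    dim <= length(embedding) || throw(ArgumentError("dim exceeds embedding length"))
    return view(embedding, 1:dim) .> 0   # broadcasting a comparison yields a BitVector
end

# Fraction of positions where two binary embeddings agree (1.0 means identical).
agreement(a::BitVector, b::BitVector) = count(a .== b) / length(a)

# Toy stand-in for a full-size Float32 embedding.
full = Float32[0.12, -0.40, 0.05, -0.01, 0.33, -0.27]
bits = binarize(full, 4)
println(bits)
println(agreement(bits, bits))

# Rough storage per embedding, assuming the Bool vector is bit-packed.
float_bytes = sizeof(Float32) * 3072
bool_bytes = 1024 ÷ 8
println("Float32 x 3072: $float_bytes bytes, Bool x 1024: $bool_bytes bytes")
```

Under these assumptions the smaller Bool embedding needs roughly two orders of magnitude less storage, which is the trade-off behind offering both types.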

Installation

To install DocsScraper, use the Julia package manager with the repository URL (the package is not yet registered):

using Pkg
Pkg.add(url = "https://github.com/JuliaGenAI/DocsScraper.jl")

Prerequisites:

Julia (version 1.10 or later).
Internet connection for API access.
OpenAI API key with available credits. See How to Obtain API Keys.
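
The prerequisites can be checked before building an index. The sketch below is only an illustration: `check_prerequisites` is a hypothetical helper, and it assumes the API key is supplied through the `OPENAI_API_KEY` environment variable.

```julia
# Minimal pre-flight check; the helper name `check_prerequisites` is hypothetical.
# Assumption: the OpenAI key is provided via the OPENAI_API_KEY environment variable.
function check_prerequisites(env = ENV; julia_version = VERSION)
    problems = String[]
    julia_version >= v"1.10" ||
        push!(problems, "Julia 1.10 or later is required, found $julia_version")
    isempty(get(env, "OPENAI_API_KEY", "")) &&
        push!(problems, "OPENAI_API_KEY is not set")
    return problems
end

problems = check_prerequisites()
if isempty(problems)
    println("Prerequisites look fine.")
else
    foreach(println, problems)
end
```

The check does not verify that the key is valid or that credits are available; it only catches the two most common setup mistakes early.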

Building the Index

using DocsScraper
crawlable_urls = ["https://juliagenai.github.io/DocsScraper.jl/dev"]

index_path = make_knowledge_packs(crawlable_urls;
    index_name = "docsscraper", embedding_dimension = 1024, embedding_bool = true, target_path = "knowledge_packs")

[ Info: robots.txt unavailable for https://juliagenai.github.io:/DocsScraper.jl/dev/home/
[ Info: Scraping link: https://juliagenai.github.io:/DocsScraper.jl/dev/home/
[ Info: robots.txt unavailable for https://juliagenai.github.io:/DocsScraper.jl/dev
[ Info: Scraping link: https://juliagenai.github.io:/DocsScraper.jl/dev
...
[ Info: Processing https://juliagenai.github.io:/DocsScraper.jl/dev...
[ Info: Parsing URL: https://juliagenai.github.io:/DocsScraper.jl/dev
[ Info: Scraping done: 44 chunks
[ Info: Removed 0 short chunks
[ Info: Removed 1 duplicate chunks
[ Info: Created embeddings for docsscraper. Cost: $0.001
a docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5
[ Info: ARTIFACT: docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.tar.gz
┌ Info: sha256:
└   sha = "977c2b9d9fe30bebea3b6db124b733d29b7762a8f82c9bd642751f37ad27ee2e"
┌ Info: git-tree-sha1:
└   git_tree_sha = "eca409c0a32ed506fbd8125887b96987e9fb91d2"
[ Info: Saving source URLS in Julia\knowledge_packs\docsscraper\docsscraper_URL_mapping.csv
"Julia\knowledge_packs\docsscraper\Index\docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5"

make_knowledge_packs is the entry point to the package. This function takes the URLs to parse and returns the index. The index can then be passed to AIHelpMe.jl to answer questions about the knowledge packs that were built.

Default make_knowledge_packs parameters:

The default embedding type is Float32. Change it to boolean with the optional parameter: embedding_bool = true.
The default embedding size is 3072. Change it to a custom size with the optional parameter: embedding_dimension = custom_dimension.
The default model is OpenAI's text-embedding-3-large.
The default max chunk size is 384 and the default min chunk size is 40. Change them with the optional parameters: max_chunk_size = custom_max_size and min_chunk_size = custom_min_size.
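
The build log reports how many short and duplicate chunks were removed. The following sketch illustrates that kind of cleanup step in plain Julia; it is not DocsScraper's own code. The helper `clean_chunks` is hypothetical, and measuring chunk size in characters is an assumption made for the example.

```julia
# Hypothetical post-processing step; not part of DocsScraper.jl.
# Assumption: chunk size is measured in characters.
function clean_chunks(chunks::Vector{String}; min_chunk_size::Int = 40)
    long_enough = filter(c -> length(c) >= min_chunk_size, chunks)
    deduplicated = unique(long_enough)
    removed_short = length(chunks) - length(long_enough)
    removed_duplicates = length(long_enough) - length(deduplicated)
    return (; chunks = deduplicated, removed_short, removed_duplicates)
end

chunks = [
    "make_knowledge_packs is the entry point to the package.",
    "too short",
    "make_knowledge_packs is the entry point to the package.",
]

result = clean_chunks(chunks)
println("Removed $(result.removed_short) short chunks")
println("Removed $(result.removed_duplicates) duplicate chunks")
```

Raising min_chunk_size drops more fragments such as stray headings, at the cost of possibly losing short but useful snippets.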

Note: For everyday use, embedding size = 1024 and embedding type = Bool are sufficient. This combination is compatible with AIHelpMe's :bronze and :silver pipelines (update_pipeline(:bronze)). For better results, use embedding size = 3072 and embedding type = Float32. This requires the :gold pipeline (see ?RAG_CONFIGURATIONS for more).
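
The pairing between embedding settings and pipelines in the note can be written down as a small lookup. This is a sketch only: `configuration_key` and `compatible_pipelines` are hypothetical helpers, not AIHelpMe functions, and the key format is inferred from the "Configuration key" string printed in the log output.

```julia
# Hypothetical helpers for illustration; not AIHelpMe's API.

# Build a key such as "textembedding3large-1024-Bool" from the embedding settings.
function configuration_key(model::AbstractString, dimension::Integer, embedding_bool::Bool)
    eltype_name = embedding_bool ? "Bool" : "Float32"
    return string(replace(model, "-" => ""), "-", dimension, "-", eltype_name)
end

# Pipelines that the note describes as compatible with each setting.
function compatible_pipelines(dimension::Integer, embedding_bool::Bool)
    if dimension == 1024 && embedding_bool
        return [:bronze, :silver]
    elseif dimension == 3072 && !embedding_bool
        return [:gold]
    else
        return Symbol[]   # other combinations are not covered by the note
    end
end

println(configuration_key("text-embedding-3-large", 1024, true))
println(compatible_pipelines(1024, true))
println(compatible_pipelines(3072, false))
```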

Using the Index for Questions

using AIHelpMe
using AIHelpMe: pprint, load_index!

# set it as the "default" index, then it will be automatically used for every question
load_index!(index_path)

aihelp("what is DocsScraper.jl?") |> pprint

[ Info: Updated RAG pipeline to `:bronze` (Configuration key: "textembedding3large-1024-Bool").
[ Info: Loaded index from packs: julia into MAIN_INDEX
[ Info: Loading index from Julia\DocsScraper.jl\docsscraper\Index\docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5
[ Info: Loaded index a file Julia\DocsScraper.jl\docsscraper\Index\docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5 into MAIN_INDEX
[ Info: Done with RAG. Total cost: $0.009
--------------------
AI Message
--------------------
DocsScraper.jl is a Julia package designed to create a vector database from input URLs. It scrapes and parses the URLs and, with the assistance of
PromptingTools.jl, creates a vector store that can be utilized in RAG (Retrieval-Augmented Generation) applications. DocsScraper.jl integrates with
AIHelpMe.jl and PromptingTools.jl to provide efficient and relevant query retrieval, ensuring that the responses generated by the system are specific to the content in the created database.

Tip: Use pprint for nicer output with sources, and last_result for more detailed output (with sources).

using AIHelpMe: last_result
# last_result() returns the last result from the RAG pipeline, i.e., same as running aihelp(; return_all=true)
print(last_result())

Output

make_knowledge_packs creates the following files:

 index_name
│
├── Index
│   ├── index_name__artifact__info.txt
│   ├── index_name__vDate__model_embedding_size-embedding_type__v1.0.hdf5
│   └── index_name__vDate__model_embedding_size-embedding_type__v1.0.tar.gz 
│
├── Scraped_files
│   ├── scraped_hostname-chunks-max-chunk_size-min-min_chunk_size.jls
│   ├── scraped_hostname-sources-max-chunk_size-min-min_chunk_size.jls
│   └── ...
│
└── index_name_URL_mapping.csv

Index: contains the .hdf5 and .tar.gz files along with artifact__info.txt. The artifact info contains the sha256 and git-tree-sha1 hashes.
Scraped_files: contains the scraped chunks and sources. These are separated by the hostnames of the URLs.
URL_mapping.csv: contains the scraped URLs mapped to the estimated package name.
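
Because the artifact info records a sha256 hash, a downloaded or copied .tar.gz can be checked against it. The sketch below uses Julia's SHA standard library; `verify_sha256` is a hypothetical helper, and the expected hash is passed in by hand because the exact layout of artifact__info.txt is not assumed here. It runs on a temporary file rather than on a real artifact.

```julia
using SHA  # Julia standard library

# Hypothetical helper: compare a file's SHA-256 digest with an expected hex string.
function verify_sha256(path::AbstractString, expected::AbstractString)
    actual = bytes2hex(open(sha256, path))
    return actual == lowercase(strip(expected))
end

# Demonstration on a temporary file containing the bytes "abc".
path, io = mktemp()
write(io, "abc")
close(io)

expected = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
println(verify_sha256(path, expected))
```

For a real knowledge pack, `path` would point to the .tar.gz in the Index folder and `expected` would be the sha256 value copied from the artifact info.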

Google Summer of Code 2024

This project was developed as part of the Google Summer of Code (GSoC) program. GSoC is a global program that offers student developers stipends to write code for open-source projects. We are grateful for the support and opportunity provided by Google and the open-source community through this initiative.

Additional Information

Version: v0.1.0
Category: AI source code
Updated: 2024-12-25
Size: 36.98KB
Source: GitHub
