Split_markdown4gpt

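split_markdown4gpt is a Python tool designed to split large Markdown files into smaller sections based on a specified token limit. This is particularly useful for processing large Markdown files with GPT models, because it lets the model handle the data in manageable chunks.

Version 1.0.9 (2023-06-19)

Installation

You can install split_markdown4gpt via pip: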

pip install split_markdown4gpt

CLI Usage

After installation, you can use the mdsplit4gpt command to split a Markdown file. The basic syntax is:

mdsplit4gpt path_to_your_file.md --model gpt-3.5-turbo --limit 4096 --separator " === SPLIT === "

This command splits the Markdown file at path_to_your_file.md into sections. Each section contains at most 4096 tokens (as counted by the gpt-3.5-turbo tokenizer). The sections are separated by === SPLIT ===.
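Because the CLI returns all sections as one string, a downstream script has to cut that string apart again. The following is a minimal sketch, assuming the output has already been captured into a Python string; split_cli_output is a hypothetical helper, not part of split_markdown4gpt, and the sample text stands in for real CLI output.

```python
def split_cli_output(output: str, separator: str = "=== SPLIT ===") -> list[str]:
    """Split captured mdsplit4gpt output back into individual sections."""
    # Strip padding from the separator so " === SPLIT === " and
    # "=== SPLIT ===" are treated the same way.
    marker = separator.strip()
    parts = output.split(marker)
    # Drop surrounding whitespace and any empty fragments.
    return [part.strip() for part in parts if part.strip()]


# Sample text standing in for captured CLI output (an assumption for the demo).
captured = "# Intro\nFirst part.\n === SPLIT === \n# Details\nSecond part."
for index, section in enumerate(split_cli_output(captured), start=1):
    print(f"--- section {index} ---")
    print(section)
```

If the Markdown source itself might contain the separator text, pick a more unusual string with --separator, since this sketch cannot tell the two apart.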

All CLI options:

NAME
    mdsplit4gpt - Splits a Markdown file into sections according to GPT token size limits.

SYNOPSIS
    mdsplit4gpt MD_PATH <flags>

DESCRIPTION
    This tool loads a Markdown file, and splits its content into sections
    that are within the specified token size limit using the desired GPT tokenizing model. The resulting
    sections are then concatenated using the specified separator and returned as a single string.

POSITIONAL ARGUMENTS
    MD_PATH
        Type: Union
        The path of the source Markdown file to be split.

FLAGS
    -m, --model=MODEL
        Type: str
        Default: 'gpt-3.5-turbo'
        The GPT tokenizer model to use for calculating token sizes. Defaults to "gpt-3.5-turbo".
    -l, --limit=LIMIT
        Type: Optional[int]
        Default: None
        The maximum number of GPT tokens allowed per section. Defaults to the model's maximum tokens.
    -s, --separator=SEPARATOR
        Type: str
        Default: '=== SPLIT ==='
        The string used to separate sections in the output. Defaults to "=== SPLIT ===".

Python Usage

You can also use split_markdown4gpt from Python code. A basic example:

from split_markdown4gpt import split

sections = split("path_to_your_file.md", model="gpt-3.5-turbo", limit=4096)
for section in sections:
    print(section)

This code does the same thing as the CLI command above, but from within Python.

For advanced usage, see the API documentation.
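A common next step is to save each section to its own file. The sketch below uses a hypothetical helper, write_sections, which is not part of split_markdown4gpt. It assumes each section can be turned into text with str(), and it uses a hard-coded list where the result of split() would normally go.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def write_sections(sections: Iterable[object], out_dir: str, stem: str = "section") -> list[Path]:
    """Write each section to out_dir as <stem>_001.md, <stem>_002.md, ..."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for index, section in enumerate(sections, start=1):
        path = target / f"{stem}_{index:03d}.md"
        # str() is used because the exact type of each section is an assumption here.
        path.write_text(str(section), encoding="utf-8")
        written.append(path)
    return written


# Demo with placeholder sections; in practice, pass the result of split().
demo_sections = ["# Part one\nText.", "# Part two\nMore text."]
with tempfile.TemporaryDirectory() as tmp:
    for path in write_sections(demo_sections, tmp):
        print(path.name)
```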

How It Works

split_markdown4gpt works by tokenizing the input Markdown file with the tokenizer of the specified GPT model (gpt-3.5-turbo by default). It then splits the file into sections, each containing no more than the specified number of tokens.

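The splitting process respects the structure of the Markdown file. It does not split a section (as defined by Markdown headings) across multiple output sections unless that section is longer than the token limit. In that case, it splits the section at the sentence level.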
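The strategy described above can be illustrated with a simplified, standard-library-only sketch. This is not the package's implementation. It counts whitespace-separated words instead of GPT tokens, finds headings and sentences with regular expressions instead of a Markdown parser and a sentence segmenter, and every function name is hypothetical.

```python
import re


def count_tokens(text: str) -> int:
    # Stand-in for a real GPT tokenizer: counts whitespace-separated words.
    return len(text.split())


def split_by_headings(markdown: str) -> list[str]:
    # Start a new block at every line that begins with an ATX heading ("#" to "######").
    # Caveat: a "# comment" line inside a code block would also be treated as a heading.
    blocks = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [block.strip() for block in blocks if block.strip()]


def split_sentences(text: str) -> list[str]:
    # Naive sentence splitter: break after ".", "!" or "?" followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack(pieces: list[str], limit: int, joiner: str) -> list[str]:
    # Greedily merge consecutive pieces while the merged text stays within the limit.
    # A single piece that is already over the limit is passed through unchanged.
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + joiner + piece
        if current and count_tokens(candidate) > limit:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def split_markdown(markdown: str, limit: int) -> list[str]:
    pieces = []
    for block in split_by_headings(markdown):
        if count_tokens(block) <= limit:
            # The whole heading section fits, so keep it intact.
            pieces.append(block)
        else:
            # The section is too long, so fall back to sentence-level splitting.
            pieces.extend(pack(split_sentences(block), limit, " "))
    return pack(pieces, limit, "\n\n")


doc = (
    "# A\nOne two three.\n\n"
    "# B\nFour five. Six seven eight nine. Ten eleven.\n\n"
    "# C\nTwelve."
)
for chunk in split_markdown(doc, limit=6):
    print(count_tokens(chunk), repr(chunk))
```

In this demo, sections A and C fit under the limit and stay whole, while section B is too long and is broken between sentences. A sentence that alone exceeds the limit would remain oversized in this sketch.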

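The tool uses several libraries to accomplish this:

tiktoken for tokenizing text according to the GPT model's rules.
fire for creating the CLI.
frontmatter for parsing the Markdown file's front matter (metadata at the start of the file).
mistletoe for parsing Markdown files into a syntax tree.
syntok for splitting text into sentences.
regex and PyYAML for various utility functions.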

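Use Cases

split_markdown4gpt is particularly useful in scenarios where large Markdown files need to be processed with a GPT model. For example:

Text generation: If you use a GPT model to generate text based on a large Markdown file, you can use split_markdown4gpt to split the file into manageable sections. The model can then process the file chunk by chunk, which avoids token overflow errors.
Data preprocessing: Machine learning projects often need to preprocess data before feeding it to a model. If the data comes as large Markdown files, split_markdown4gpt can split those files into smaller sections according to the model's token limit.
Document analysis: If you are analyzing large Markdown documents (for example, extracting keywords or summarizing content), you can use split_markdown4gpt to break each document into smaller sections. This makes the analysis more manageable and efficient.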