Screw traditional OCR and heavy libraries for getting data out of PDFs: GenAI does a better job!
AIPDF is a stand-alone, minimalistic, yet powerful pure Python library that leverages multi-modal GenAI models (OpenAI, llama3, or compatible alternatives) to extract data from PDFs and convert it into formats such as Markdown or JSON.
pip install aipdf
On macOS, you will also need to install poppler:
brew install poppler
from aipdf import ocr
# Your OpenAI API key
api_key = 'your_openai_api_key'
# Open the PDF in binary mode; the context manager closes it afterwards
with open('somepdf.pdf', 'rb') as file:
    markdown_pages = ocr(file, api_key)
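Once you have the pages, you will probably want to store them. The following is a minimal sketch, assuming `ocr` returns a list with one Markdown string per page (the variable name suggests this, but check the return type of your installed version). The helper `save_pages` is hypothetical and not part of AIPDF, and the sample data stands in for a real result.

```python
import tempfile
from pathlib import Path


def save_pages(pages, out_path, separator="\n\n---\n\n"):
    """Join per-page Markdown strings and write them to one file.

    Assumes `pages` is a list of strings, one per PDF page.
    """
    text = separator.join(page.strip() for page in pages)
    Path(out_path).write_text(text, encoding="utf-8")
    return text


# Stand-in data; in practice pass the value returned by ocr()
sample_pages = ["# Page 1\nHello", "# Page 2\nWorld"]
out_file = Path(tempfile.gettempdir()) / "aipdf_output.md"
combined = save_pages(sample_pages, out_file)
```

The separator is a Markdown horizontal rule so page boundaries stay visible in the combined file; change it if you prefer a plain blank line.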
You can also use it with any Ollama multi-modal model:
ocr(pdf_file, api_key='ollama', model="llama3.2", base_url='http://localhost:11434/v1', prompt=...)
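If you call a local model many times, you could pre-fill those arguments once. This is only a sketch using `functools.partial`: `make_local_ocr` is a hypothetical helper, the keyword names are the ones used in the call above, and `fake_ocr` is a stand-in so the snippet runs without a server.

```python
from functools import partial


def make_local_ocr(ocr_fn, model="llama3.2", base_url="http://localhost:11434/v1"):
    """Return ocr_fn with the Ollama settings from the call above pre-filled."""
    return partial(ocr_fn, api_key="ollama", model=model, base_url=base_url)


# Stand-in for aipdf.ocr that simply echoes the keyword arguments it receives
def fake_ocr(pdf_file, **kwargs):
    return kwargs


local_ocr = make_local_ocr(fake_ocr)
settings = local_ocr(None, prompt="extract all text as markdown")
```

With the real library you would pass `ocr` imported from `aipdf` instead of `fake_ocr`, and a real file object instead of `None`.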
We chose to have you pass a file object because that lets you use AIPDF with any kind of storage: S3, local files, URLs, and so on.
import io
import requests

pdf_file = io.BytesIO(requests.get('https://arxiv.org/pdf/2410.02467').content)
# extract
pages = ocr(pdf_file, api_key, prompt="extract tables, return each table in json")
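When the prompt asks for JSON, models often wrap the answer in a Markdown code fence or add a sentence around it. Below is a hedged sketch of how you might parse such a reply, assuming each returned page is a string. `parse_json_from_page` is a hypothetical helper, not part of AIPDF, and the sample reply is made up.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence marker, built here instead of written literally
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def parse_json_from_page(page_text):
    """Return parsed JSON from a page string, or None if nothing parses.

    Tries the contents of fenced blocks first, then the whole string.
    """
    candidates = FENCED_BLOCK.findall(page_text) + [page_text]
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None


# Made-up model reply: one sentence followed by a fenced JSON table
reply = "Here is the table:\n" + FENCE + 'json\n[{"year": 2023, "value": 1.5}]\n' + FENCE
tables = parse_json_from_page(reply)
```

Returning `None` instead of raising keeps a single malformed page from stopping a loop over all pages; raise instead if you would rather fail fast.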
import io
import boto3
from botocore.config import Config

# bucket_name, object_key, and access_token are assumed to be defined elsewhere
s3 = boto3.client('s3', config=Config(signature_version='s3v4'),
                  aws_access_key_id=access_token,
                  aws_secret_access_key='',  # Not needed for token-based auth
                  aws_session_token=access_token)
pdf_file = io.BytesIO(s3.get_object(Bucket=bucket_name, Key=object_key)['Body'].read())
# extract
pages = ocr(pdf_file, api_key, prompt="extract charts data, turn it into tables that represent the variables in the chart")
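Both examples above come down to the same step: get the PDF bytes from somewhere and wrap them in a file-like object. A small sketch of that idea follows; `to_pdf_file` is a hypothetical helper (not part of AIPDF) that accepts raw bytes, a local path, or an already open file, and the payload is a stand-in rather than a valid PDF.

```python
import io
import os
import tempfile


def to_pdf_file(source):
    """Return a binary file-like object for bytes, a local path, or a file object."""
    if isinstance(source, (bytes, bytearray)):
        return io.BytesIO(bytes(source))
    if isinstance(source, (str, os.PathLike)):
        # Read the whole file into memory so the handle can be closed right away
        with open(source, "rb") as handle:
            return io.BytesIO(handle.read())
    if hasattr(source, "read"):
        return source
    raise TypeError(f"unsupported PDF source: {type(source).__name__}")


# Demonstration with a tiny stand-in payload (not a valid PDF)
payload = b"%PDF-1.4 stand-in"
from_bytes = to_pdf_file(payload)

with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(payload)
from_path = to_pdf_file(tmp.name)
os.remove(tmp.name)
```

For remote sources you would fetch the bytes first, as in the `requests` and `boto3` examples, and pass them to the helper.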
We will keep this super clean, with only 3 required libraries:
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
If you encounter any problems or have any questions, please open an issue on the GitHub repository.
AIPDF makes PDF data extraction simple, flexible, and powerful. Try it out and simplify your PDF processing workflow today!