Pandas on AWS
Easy integration with Athena, Glue, Redshift, Timestream, OpenSearch, Neptune, QuickSight, Chime, CloudWatch Logs, DynamoDB, EMR, Secrets Manager, PostgreSQL, MySQL, SQL Server and S3 (Parquet, CSV, JSON and Excel).
An AWS Professional Service open source initiative | [email protected]
| Source | Installation Command |
|---|---|
| PyPi | pip install awswrangler |
| Conda | conda install -c conda-forge awswrangler |
⚠️ Starting version 3.0, optional modules must be installed explicitly:

➡️ pip install 'awswrangler[redshift]'
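Multiple extras can be combined in a single command. For example (extra names beyond redshift, such as mysql and opensearch, are assumed here to match the package's published extras based on the module list above):

➡️ pip install 'awswrangler[redshift,mysql,opensearch]'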
Quick Start | At Scale | Read The Docs | Getting Help | Logging
Quick Start

Installation command: pip install awswrangler

⚠️ Starting version 3.0, optional modules must be installed explicitly:

➡️ pip install 'awswrangler[redshift]'
```python
import awswrangler as wr
import pandas as pd
from datetime import datetime

df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

# Storing data on the Data Lake
wr.s3.to_parquet(
    df=df,
    path="s3://bucket/dataset/",
    dataset=True,
    database="my_db",
    table="my_table",
)

# Retrieving the data directly from Amazon S3
df = wr.s3.read_parquet("s3://bucket/dataset/", dataset=True)

# Retrieving the data from Amazon Athena
df = wr.athena.read_sql_query("SELECT * FROM my_table", database="my_db")

# Get a Redshift connection from the Glue Catalog and retrieve data from Redshift Spectrum
con = wr.redshift.connect("my-glue-connection")
df = wr.redshift.read_sql_query("SELECT * FROM external_schema.my_table", con=con)
con.close()

# Amazon Timestream Write
df = pd.DataFrame({
    "time": [datetime.now(), datetime.now()],
    "my_dimension": ["foo", "boo"],
    "measure": [1.0, 1.1],
})
rejected_records = wr.timestream.write(
    df,
    database="sampleDB",
    table="sampleTable",
    time_col="time",
    measure_col="measure",
    dimensions_cols=["my_dimension"],
)

# Amazon Timestream Query
wr.timestream.query("""
SELECT time, measure_value::double, my_dimension
FROM "sampleDB"."sampleTable" ORDER BY time DESC LIMIT 3
""")
```
At Scale

AWS SDK for pandas can also run your workflows at scale by leveraging Modin and Ray. Both projects aim to speed up data workloads by distributing processing over a cluster of workers.

Read our docs or head to our latest tutorials to learn more.

⚠️ Ray is currently not available for Python 3.12. While AWS SDK for pandas supports Python 3.12, it cannot be used at scale.
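As an illustration, here is a minimal sketch of the distributed mode, assuming the modin and ray extras are installed and following the pattern from tutorials 034-035; the bucket path is a placeholder:

```python
# Minimal sketch: distributed awswrangler, assuming the "modin" and "ray"
# extras are installed (pip install 'awswrangler[modin,ray]').
import awswrangler as wr

# With Ray and Modin installed, awswrangler selects a distributed engine
# and memory format automatically; these accessors report what is active.
print(f"Execution engine: {wr.engine.get()}")
print(f"Memory format: {wr.memory_format.get()}")

# S3 reads are then distributed over the Ray workers and return a Modin
# DataFrame (s3://bucket/large-dataset/ is a placeholder path).
df = wr.s3.read_parquet(path="s3://bucket/large-dataset/")
```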
Read The Docs

What is AWS SDK for pandas?
Install
PyPi (pip)
Conda
AWS Lambda Layer
AWS Glue Python Shell Jobs
AWS Glue PySpark Jobs
Amazon SageMaker Notebook
Amazon SageMaker Notebook Lifecycle
EMR
From source
At scale
Getting Started
Supported APIs
Resources
Tutorials
001 - Introduction
002 - Sessions
003 - Amazon S3
004 - Parquet Datasets
005 - Glue Catalog
006 - Amazon Athena
007 - Databases (Redshift, MySQL, PostgreSQL, SQL Server and Oracle)
008 - Redshift - Copy & Unload
009 - Redshift - Append, Overwrite and Upsert
010 - Parquet Crawler
011 - CSV Datasets
012 - CSV Crawler
013 - Merging Datasets on S3
014 - Schema Evolution
015 - EMR
016 - EMR & Docker
017 - Partition Projection
018 - QuickSight
019 - Athena Cache
020 - Spark Table Interoperability
021 - Global Configurations
022 - Writing Partitions Concurrently
023 - Flexible Partitions Filter
024 - Athena Query Metadata
025 - Redshift - Loading Parquet files with Spectrum
026 - Amazon Timestream
027 - Amazon Timestream 2
028 - Amazon DynamoDB
029 - S3 Select
030 - Data Api
031 - OpenSearch
033 - Amazon Neptune
034 - Distributing Calls Using Ray
035 - Distributing Calls on Ray Remote Cluster
037 - Glue Data Quality
038 - OpenSearch Serverless
039 - Athena Iceberg
040 - EMR Serverless
041 - Apache Spark on Amazon Athena
API Reference
Amazon S3
AWS Glue Catalog
Amazon Athena
Amazon Redshift
PostgreSQL
MySQL
SQL Server
Oracle
Data API Redshift
Data API RDS
OpenSearch
AWS Glue Data Quality
Amazon Neptune
DynamoDB
Amazon Timestream
Amazon EMR
Amazon CloudWatch Logs
Amazon Chime
Amazon QuickSight
AWS STS
AWS Secrets Manager
Global Configurations
Distributed - Ray
License
Contributing
Getting Help

The best way to interact with our team is through GitHub. You can open an issue and choose from one of our templates for bug reports, feature requests, and more. You may also find help on these community resources:
The #aws-sdk-pandas Slack channel
Ask a question on Stack Overflow and tag it with awswrangler
Runbook for AWS SDK for pandas with Ray
Logging

Enabling internal logging examples:
```python
import logging

logging.basicConfig(level=logging.INFO, format="[%(name)s][%(funcName)s] %(message)s")
logging.getLogger("awswrangler").setLevel(logging.DEBUG)
logging.getLogger("botocore.credentials").setLevel(logging.CRITICAL)
```
Inside an AWS Lambda function:
```python
import logging

logging.getLogger("awswrangler").setLevel(logging.DEBUG)
```
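(The Lambda Python runtime already attaches a handler to the root logger, so setting the awswrangler logger level is enough; a basicConfig call is not needed there.)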