A lightweight, GPU-accelerated SQL engine built on the RAPIDS.ai ecosystem.
Get Started on app.blazingsql.com
BlazingSQL is a GPU-accelerated SQL engine built on top of the RAPIDS ecosystem. RAPIDS is based on the Apache Arrow columnar memory format, and cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.
BlazingSQL is a SQL interface for cuDF, with various features to support large-scale data science workflows and enterprise datasets.
Try our 5-min Welcome Notebook to start using BlazingSQL and RAPIDS AI.
Here are two copy-and-paste reproducible BlazingSQL snippets; keep scrolling to find example Notebooks below.
Create and query a table from a cudf.DataFrame with progress bar:
import cudf
df = cudf.DataFrame()
df['key'] = ['a', 'b', 'c', 'd', 'e']
df['val'] = [7.6, 2.9, 7.1, 1.6, 2.2]
from blazingsql import BlazingContext
bc = BlazingContext(enable_progress_bar=True)
bc.create_table('game_1', df)
bc.sql('SELECT * FROM game_1 WHERE val > 4') # the query progress will be shown
|   | key | val |
|---|---|---|
| 0 | a | 7.6 |
| 1 | c | 7.1 |
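As a quick sanity check on which rows the `val > 4` filter should keep, the same query can be run on the CPU with Python's built-in `sqlite3` module. This is only a stand-in for illustration: it does not use BlazingSQL or cuDF, and the helper name `query_game_1` is hypothetical.

```python
import sqlite3

# Same rows as the cudf.DataFrame in the snippet above.
ROWS = [("a", 7.6), ("b", 2.9), ("c", 7.1), ("d", 1.6), ("e", 2.2)]


def query_game_1():
    """Run the example query against an in-memory SQLite table (CPU only)."""
    conn = sqlite3.connect(":memory:")
    try:
        # "key" is quoted because KEY is also an SQL keyword.
        conn.execute('CREATE TABLE game_1 ("key" TEXT, val REAL)')
        conn.executemany("INSERT INTO game_1 VALUES (?, ?)", ROWS)
        cursor = conn.execute(
            "SELECT * FROM game_1 WHERE val > 4 ORDER BY rowid"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for key, val in query_game_1():
        print(key, val)
```

Only the rows with keys `a` and `c` have `val` greater than 4, so those are the two rows the query keeps.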
Create and query a table from an AWS S3 bucket:
from blazingsql import BlazingContext
bc = BlazingContext()
bc.s3('blazingsql-colab', bucket_name='blazingsql-colab')
bc.create_table('taxi', 's3://blazingsql-colab/yellow_taxi/taxi_data.parquet')
bc.sql('SELECT passenger_count, trip_distance FROM taxi LIMIT 2')
|   | passenger_count | trip_distance |
|---|---|---|
| 0 | 1.0 | 1.1 |
| 1 | 1.0 | 0.7 |
Notebook Title | Description | Try Now |
---|---|---|
Welcome Notebook | An introduction to BlazingSQL Notebooks and the GPU Data Science Ecosystem. | |
The DataFrame | Learn how to use BlazingSQL and cuDF to create GPU DataFrames with SQL and Pandas-like APIs. | |
Data Visualization | Plug in your favorite Python visualization packages, or use GPU accelerated visualization tools to render millions of rows in a flash. | |
Machine Learning | Learn about cuML, which mirrors the Scikit-Learn API and offers GPU accelerated machine learning on GPU DataFrames. | |
You can find our full documentation at docs.blazingdb.com.
BlazingSQL can be installed with conda (miniconda, or the full Anaconda distribution) from the blazingsql channel:
conda install -c blazingsql -c rapidsai -c nvidia -c conda-forge -c defaults blazingsql python=$PYTHON_VERSION cudatoolkit=$CUDA_VERSION
Where $CUDA_VERSION is 11.0, 11.2 or 11.4 and $PYTHON_VERSION is 3.7 or 3.8. For example, for CUDA 11.2 and Python 3.8:
conda install -c blazingsql -c rapidsai -c nvidia -c conda-forge -c defaults blazingsql python=3.8 cudatoolkit=11.2
For nightly versions, only CUDA 11+ is supported; see https://github.com/rapidsai/cudf#cudagpu-requirements
conda install -c blazingsql-nightly -c rapidsai-nightly -c nvidia -c conda-forge -c defaults blazingsql python=$PYTHON_VERSION cudatoolkit=$CUDA_VERSION
Where $CUDA_VERSION is 11.0, 11.2 or 11.4 and $PYTHON_VERSION is 3.7 or 3.8. For example, for CUDA 11.2 and Python 3.8:
conda install -c blazingsql-nightly -c rapidsai-nightly -c nvidia -c conda-forge -c defaults blazingsql python=3.8 cudatoolkit=11.2
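The stable and nightly install commands differ only in the `-nightly` suffix on the first two channel names. The following is a minimal sketch of a helper that assembles either command and rejects versions outside the lists given above. The function name `conda_install_command` and the two constants are hypothetical and not part of BlazingSQL.

```python
# Versions listed in this README; adjust if the supported set changes.
SUPPORTED_CUDA = ("11.0", "11.2", "11.4")
SUPPORTED_PYTHON = ("3.7", "3.8")


def conda_install_command(cuda_version, python_version, nightly=False):
    """Return the conda install command for the given versions as a string."""
    if cuda_version not in SUPPORTED_CUDA:
        raise ValueError(f"unsupported CUDA version: {cuda_version}")
    if python_version not in SUPPORTED_PYTHON:
        raise ValueError(f"unsupported Python version: {python_version}")

    # Only the blazingsql and rapidsai channels have nightly variants.
    suffix = "-nightly" if nightly else ""
    channels = [
        f"blazingsql{suffix}",
        f"rapidsai{suffix}",
        "nvidia",
        "conda-forge",
        "defaults",
    ]

    parts = ["conda", "install"]
    for channel in channels:
        parts += ["-c", channel]
    parts += [
        "blazingsql",
        f"python={python_version}",
        f"cudatoolkit={cuda_version}",
    ]
    return " ".join(parts)


if __name__ == "__main__":
    print(conda_install_command("11.2", "3.8"))
    print(conda_install_command("11.2", "3.8", nightly=True))
```

Passing the versions as strings keeps `11.0` from being collapsed to `11` the way a float would be.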
This is the recommended way of building all of the BlazingSQL components and dependencies from source. It ensures that all the dependencies are available to the build process.
conda create -n bsql python=$PYTHON_VERSION
conda activate bsql
./dependencies.sh 21.08 $CUDA_VERSION
Where $CUDA_VERSION is 11.0, 11.2 or 11.4 and $PYTHON_VERSION is 3.7 or 3.8. For example, for CUDA 11.2 and Python 3.7:
conda create -n bsql python=3.7
conda activate bsql
./dependencies.sh 21.08 11.2
The build process will check out the BlazingSQL repository, then build and install it into the conda environment.
cd $CONDA_PREFIX
git clone https://github.com/BlazingDB/blazingsql.git
cd blazingsql
git checkout main
export CUDACXX=/usr/local/cuda/bin/nvcc
./build.sh
NOTE: You can run ./build.sh -h to see more build options.
$CONDA_PREFIX now has a folder for the blazingsql repository.
For nightly versions, only CUDA 11+ is supported; see https://github.com/rapidsai/cudf#cudagpu-requirements
conda create -n bsql python=$PYTHON_VERSION
conda activate bsql
./dependencies.sh 21.10 $CUDA_VERSION nightly
Where $CUDA_VERSION is 11.0, 11.2 or 11.4 and $PYTHON_VERSION is 3.7 or 3.8. For example, for CUDA 11.2 and Python 3.8:
conda create -n bsql python=3.8
conda activate bsql
./dependencies.sh 21.10 11.2 nightly
The build process will check out the BlazingSQL repository, then build and install it into the conda environment.
cd $CONDA_PREFIX
git clone https://github.com/BlazingDB/blazingsql.git
cd blazingsql
export CUDACXX=/usr/local/cuda/bin/nvcc
./build.sh
NOTE: You can run ./build.sh -h to see more build options.
NOTE: You can perform static analysis with cppcheck by running cppcheck --project=compile_commands.json in any of the C++ project build directories.
$CONDA_PREFIX now has a folder for the blazingsql repository.
To build without the storage plugins (AWS S3, Google Cloud Storage), use the following arguments:
# Disable all storage plugins
./build.sh disable-aws-s3 disable-google-gs
# Disable AWS S3 storage plugin
./build.sh disable-aws-s3
# Disable Google Cloud Storage plugin
./build.sh disable-google-gs
NOTE: If you disable the storage plugins, you don't need to install AWS SDK C++ or Google Cloud Storage (or any of their dependencies) beforehand.
To build without the SQL providers (MySQL, PostgreSQL, SQLite), use the following arguments:
# Disable all SQL providers
./build.sh disable-mysql disable-sqlite disable-postgresql
# Disable MySQL provider
./build.sh disable-mysql
...
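If builds are scripted, the disable flags can be assembled from a list of component names. The sketch below uses only the five flags that appear in the commands above. The short component names and the `build_args` helper are assumptions made for illustration, and `./build.sh -h` remains the authoritative list of options.

```python
# Maps a short component name (hypothetical) to the build.sh flag from this README.
DISABLE_FLAGS = {
    "aws-s3": "disable-aws-s3",
    "google-gs": "disable-google-gs",
    "mysql": "disable-mysql",
    "sqlite": "disable-sqlite",
    "postgresql": "disable-postgresql",
}


def build_args(disabled=()):
    """Return the build.sh argument list with the given components disabled."""
    requested = set(disabled)
    unknown = requested - set(DISABLE_FLAGS)
    if unknown:
        raise ValueError(f"unknown component(s): {sorted(unknown)}")
    # Emit flags in a fixed order so the command is reproducible.
    flags = [flag for name, flag in DISABLE_FLAGS.items() if name in requested]
    return ["./build.sh"] + flags


if __name__ == "__main__":
    print(" ".join(build_args(["aws-s3", "google-gs"])))
```

The returned list can be passed to `subprocess.run(build_args(["mysql"]), check=True)` from the repository root.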
NOTES:
User guides and public API documentation can be found here.
Documentation of our internal code architecture can be built using Sphinx.
conda install -c conda-forge doxygen
cd $CONDA_PREFIX
cd blazingsql/docsrc
pip install -r requirements.txt
make doxygen
make html
The generated documentation can be viewed in a browser at blazingsql/docsrc/build/html/index.html
Have questions or feedback? Post a new GitHub issue.
Please see our guide for contributing to BlazingSQL.
Feel free to join our channel (#blazingsql) in the RAPIDS-GoAi Slack.
You can also email us at [email protected] or find out more details on BlazingSQL.com.
Apache License 2.0
The RAPIDS suite of open source software libraries aims to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
The GPU version of Apache Arrow is a common API that enables efficient interchange of tabular data between processes running on the GPU. End-to-end computation on the GPU avoids unnecessary copying and converting of data off the GPU, reducing compute time and cost for high-performance analytics common in artificial intelligence workloads. As the name implies, cuDF uses the Apache Arrow columnar data format on the GPU. Currently, a subset of the features in Apache Arrow is supported.
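To make "columnar" concrete, here is a toy CPU-only sketch that stores a table as one buffer per column and filters it with a boolean mask. It is not Arrow's actual memory layout and uses neither cuDF nor pyarrow. The helper names `to_columns` and `filter_greater_than` are hypothetical.

```python
from array import array

# Row-oriented view: one tuple per record.
ROWS = [("a", 7.6), ("b", 2.9), ("c", 7.1), ("d", 1.6), ("e", 2.2)]


def to_columns(rows):
    """Convert (key, val) rows into one sequence per column."""
    keys = [key for key, _ in rows]
    # array("d") packs the floats into a single contiguous buffer of doubles.
    vals = array("d", (val for _, val in rows))
    return {"key": keys, "val": vals}


def filter_greater_than(columns, column, threshold):
    """Keep the rows where columns[column] > threshold, column by column."""
    mask = [value > threshold for value in columns[column]]
    return {
        name: [value for value, keep in zip(values, mask) if keep]
        for name, values in columns.items()
    }


if __name__ == "__main__":
    table = to_columns(ROWS)
    print(filter_greater_than(table, "val", 4))
```

The predicate only has to scan the `val` buffer; the other columns are touched only when the mask is applied. That access pattern is the reason columnar layouts suit analytical queries.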