Polars download - download the Polars source code

Documentation: Python - Rust - Node.js - R | StackOverflow: Python - Rust - Node.js - R | User guide | Discord

Polars: Blazingly fast DataFrames in Rust, Python, Node.js, R, and SQL

Polars is a DataFrame interface on top of an OLAP query engine implemented in Rust, using the Apache Arrow Columnar Format as its memory model.

Lazy | eager execution
Multi-threaded
SIMD
Query optimization
Powerful expression API
Hybrid streaming (larger-than-RAM datasets)
Rust | Python | NodeJS | R | ...

To learn more, read the user guide.

Python

>>> import polars as pl
>>> df = pl.DataFrame(
...     {
...         "A": [1, 2, 3, 4, 5],
...         "fruits": ["banana", "banana", "apple", "apple", "banana"],
...         "B": [5, 4, 3, 2, 1],
...         "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
...     }
... )

# embarrassingly parallel execution & very expressive query language
>>> df.sort("fruits").select(
...     "fruits",
...     "cars",
...     pl.lit("fruits").alias("literal_string_fruits"),
...     pl.col("B").filter(pl.col("cars") == "beetle").sum(),
...     pl.col("A").filter(pl.col("B") > 2).sum().over("cars").alias("sum_A_by_cars"),
...     pl.col("A").sum().over("fruits").alias("sum_A_by_fruits"),
...     pl.col("A").reverse().over("fruits").alias("rev_A_by_fruits"),
...     pl.col("A").sort_by("B").over("fruits").alias("sort_A_by_B_by_fruits"),
... )
shape: (5, 8)
┌──────────┬──────────┬──────────────┬─────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ fruits   ┆ cars     ┆ literal_stri ┆ B   ┆ sum_A_by_ca ┆ sum_A_by_fr ┆ rev_A_by_fr ┆ sort_A_by_B │
│ ---      ┆ ---      ┆ ng_fruits    ┆ --- ┆ rs          ┆ uits        ┆ uits        ┆ _by_fruits  │
│ str      ┆ str      ┆ ---          ┆ i64 ┆ ---         ┆ ---         ┆ ---         ┆ ---         │
│          ┆          ┆ str          ┆     ┆ i64         ┆ i64         ┆ i64         ┆ i64         │
╞══════════╪══════════╪══════════════╪═════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 4           ┆ 4           │
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 3           ┆ 3           │
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 5           ┆ 5           │
│ "banana" ┆ "audi"   ┆ "fruits"     ┆ 11  ┆ 2           ┆ 8           ┆ 2           ┆ 2           │
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 1           ┆ 1           │
└──────────┴──────────┴──────────────┴─────┴─────────────┴─────────────┴─────────────┴─────────────┘

SQL

>>> df = pl.scan_csv("docs/assets/data/iris.csv")
>>> ## OPTION 1
>>> # run SQL queries on frame-level
>>> df.sql("""
...     SELECT species,
...       AVG(sepal_length) AS avg_sepal_length
...     FROM self
...     GROUP BY species
... """).collect()
shape: (3, 2)
┌────────────┬──────────────────┐
│ species    ┆ avg_sepal_length │
│ ---        ┆ ---              │
│ str        ┆ f64              │
╞════════════╪══════════════════╡
│ Virginica  ┆ 6.588            │
│ Versicolor ┆ 5.936            │
│ Setosa     ┆ 5.006            │
└────────────┴──────────────────┘
>>> ## OPTION 2
>>> # use pl.sql() to operate on the global context
>>> df2 = pl.LazyFrame({
...     "species": ["Setosa", "Versicolor", "Virginica"],
...     "blooming_season": ["Spring", "Summer", "Fall"]
... })
>>> pl.sql("""
... SELECT df.species,
...     AVG(df.sepal_length) AS avg_sepal_length,
...     df2.blooming_season
... FROM df
... LEFT JOIN df2 ON df.species = df2.species
... GROUP BY df.species, df2.blooming_season
... """).collect()

SQL commands can also be run directly from your terminal using the Polars CLI:

# run an inline SQL query
> polars -c "SELECT species, AVG(sepal_length) AS avg_sepal_length, AVG(sepal_width) AS avg_sepal_width FROM read_csv('docs/assets/data/iris.csv') GROUP BY species;"

# run interactively
> polars
Polars CLI v0.3.0
Type .help for help.

> SELECT species, AVG(sepal_length) AS avg_sepal_length, AVG(sepal_width) AS avg_sepal_width FROM read_csv('docs/assets/data/iris.csv') GROUP BY species;

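Refer to the Polars CLI repository for more information.

Performance

Blazingly fast

Polars is very fast. In fact, it is one of the best performing solutions available. See the PDS-H benchmark results.

Lightweight

Polars is also very lightweight. It comes with zero required dependencies, and this shows in the import times:

polars: 70ms
numpy: 104ms
pandas: 520ms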
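
The README does not say how these import times were measured. One rough way to reproduce the comparison, sketched here using only the standard library, is to time a fresh interpreter that imports each module and subtract the interpreter's own startup cost. The helper names `time_import` and `import_overhead` are hypothetical, and the numbers will differ from machine to machine.

```python
import subprocess
import sys
import time


def time_import(module_name: str) -> float:
    """Wall-clock seconds for a fresh interpreter to start and import the module."""
    start = time.perf_counter()
    subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        check=True,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


def import_overhead(module_name: str) -> float:
    """Approximate import cost with interpreter startup subtracted."""
    baseline = time_import("sys")  # "sys" is always loaded, so this is startup only
    return max(time_import(module_name) - baseline, 0.0)


for name in ("polars", "numpy", "pandas"):
    try:
        print(f"{name}: {import_overhead(name) * 1000:.0f}ms")
    except subprocess.CalledProcessError:
        # The import failed in the child process, most likely because
        # the package is not installed in this environment.
        print(f"{name}: not installed")
```

A single run is noisy. Repeating the measurement and taking the minimum gives a steadier figure.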

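Handles larger-than-RAM data

If you have data that does not fit into memory, Polars' query engine is able to process your query (or parts of your query) in a streaming fashion. This drastically reduces memory requirements, so you might be able to process your 250GB dataset on your laptop. Collect with collect(streaming=True) to run the query streaming. (This might be a little slower, but it is still very fast!)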

Setup

Python

Install the latest Polars version with:

pip install polars

We also have a conda package (conda install -c conda-forge polars), however pip is the preferred way to install Polars.

Install Polars with all optional dependencies.

pip install 'polars[all]'

You can also install a subset of all optional dependencies.

pip install 'polars[numpy,pandas,pyarrow]'

See the user guide for more details on optional dependencies.

To see the current Polars version and a full list of its optional dependencies, run:

pl.show_versions()

Releases happen quite often (weekly / every few days) at the moment, so updating Polars regularly to get the latest bug fixes and features is a good idea.

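Rust

You can take the latest release from crates.io, or if you want to use the newest features and performance improvements, point to the main branch of this repo.

polars = { git = "https://github.com/pola-rs/polars", rev = "<optional git tag>" }

Requires Rust version >=1.80.

Contributing

Want to contribute? Read our contributing guide.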

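Python: compile Polars from source

If you want a bleeding-edge release or maximal performance, you should compile Polars from source.

This can be done by going through the following steps in sequence:

Install the latest Rust compiler
Install maturin: pip install maturin
cd py-polars and choose one of the following:
- make build, slow binary with debug assertions and symbols, fast compile times
- make build-release, fast binary without debug assertions, minimal debug symbols, long compile times
- make build-nodebug-release, same as build-release but without any debug symbols, slightly faster to compile
- make build-debug-release, same as build-release but with full debug symbols, slightly slower to compile
- make build-dist-release, fastest binary, extreme compile times

By default the binary is compiled with optimizations turned on for a modern CPU. Specify LTS_CPU=1 with the command if your CPU is older and does not support e.g. AVX2.

Note that the Rust crate implementing the Python bindings is called py-polars to distinguish it from the wrapped Rust crate polars itself. However, both the Python package and the Python module are named polars, so you can pip install polars and import polars.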

Using custom Rust functions in Python

Extending Polars with UDFs compiled in Rust is easy. We expose PyO3 extensions for the DataFrame and Series data structures. See more at https://github.com/pola-rs/pyo3-polars.

Going big...

Do you expect more than 2^32 (~4.2 billion) rows? Compile Polars with the bigidx feature flag or, for Python users, run pip install polars-u64-idx.

Don't use this unless you hit the row boundary, as the default build of Polars is faster and consumes less memory.

Legacy

Do you want Polars to run on an old CPU (e.g. dating from before 2011), or on an x86-64 build of Python on Apple Silicon under Rosetta? Run pip install polars-lts-cpu. This version of Polars is compiled without AVX target features.