Polars download - Polars source code download

Documentation: Python - Rust - Node.js - R | StackOverflow: Python - Rust - Node.js - R | User guide | Discord

Polars: Blazingly fast DataFrames in Rust, Python, Node.js, R, and SQL

Polars is a DataFrame interface on top of an OLAP query engine, implemented in Rust and using the Apache Arrow Columnar Format as its memory model.

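- Lazy | eager execution
- Multi-threaded
- SIMD
- Query optimization
- Powerful expression API
- Hybrid streaming (larger-than-RAM datasets)
- Rust | Python | NodeJS | R | …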

To learn more, read the user guide.

Python

>>> import polars as pl
>>> df = pl.DataFrame(
...     {
...         "A": [1, 2, 3, 4, 5],
...         "fruits": ["banana", "banana", "apple", "apple", "banana"],
...         "B": [5, 4, 3, 2, 1],
...         "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
...     }
... )

# embarrassingly parallel execution & very expressive query language
>>> df.sort("fruits").select(
...     "fruits",
...     "cars",
...     pl.lit("fruits").alias("literal_string_fruits"),
...     pl.col("B").filter(pl.col("cars") == "beetle").sum(),
...     pl.col("A").filter(pl.col("B") > 2).sum().over("cars").alias("sum_A_by_cars"),
...     pl.col("A").sum().over("fruits").alias("sum_A_by_fruits"),
...     pl.col("A").reverse().over("fruits").alias("rev_A_by_fruits"),
...     pl.col("A").sort_by("B").over("fruits").alias("sort_A_by_B_by_fruits"),
... )
shape: (5, 8)
┌──────────┬──────────┬──────────────┬─────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ fruits   ┆ cars     ┆ literal_stri ┆ B   ┆ sum_A_by_ca ┆ sum_A_by_fr ┆ rev_A_by_fr ┆ sort_A_by_B │
│ ---      ┆ ---      ┆ ng_fruits    ┆ --- ┆ rs          ┆ uits        ┆ uits        ┆ _by_fruits  │
│ str      ┆ str      ┆ ---          ┆ i64 ┆ ---         ┆ ---         ┆ ---         ┆ ---         │
│          ┆          ┆ str          ┆     ┆ i64         ┆ i64         ┆ i64         ┆ i64         │
╞══════════╪══════════╪══════════════╪═════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 4           ┆ 4           │
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 3           ┆ 3           │
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 5           ┆ 5           │
│ "banana" ┆ "audi"   ┆ "fruits"     ┆ 11  ┆ 2           ┆ 8           ┆ 2           ┆ 2           │
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 1           ┆ 1           │
└──────────┴──────────┴──────────────┴─────┴─────────────┴─────────────┴─────────────┴─────────────┘

SQL

>>> df = pl.scan_csv("docs/assets/data/iris.csv")
>>> ## OPTION 1
>>> # run SQL queries on frame-level
>>> df.sql("""
...     SELECT species,
...       AVG(sepal_length) AS avg_sepal_length
...     FROM self
...     GROUP BY species
...     """).collect()
shape: (3, 2)
┌────────────┬──────────────────┐
│ species    ┆ avg_sepal_length │
│ ---        ┆ ---              │
│ str        ┆ f64              │
╞════════════╪══════════════════╡
│ Virginica  ┆ 6.588            │
│ Versicolor ┆ 5.936            │
│ Setosa     ┆ 5.006            │
└────────────┴──────────────────┘
>>> ## OPTION 2
>>> # use pl.sql() to operate on the global context
>>> df2 = pl.LazyFrame({
...     "species": ["Setosa", "Versicolor", "Virginica"],
...     "blooming_season": ["Spring", "Summer", "Fall"]
... })
>>> pl.sql("""
... SELECT df.species,
...     AVG(df.sepal_length) AS avg_sepal_length,
...     df2.blooming_season
... FROM df
... LEFT JOIN df2 ON df.species = df2.species
... GROUP BY df.species, df2.blooming_season
... """).collect()

SQL commands can also be run directly from your terminal using the Polars CLI:

# run an inline SQL query
> polars -c "SELECT species, AVG(sepal_length) AS avg_sepal_length, AVG(sepal_width) AS avg_sepal_width FROM read_csv('docs/assets/data/iris.csv') GROUP BY species;"

# run interactively
> polars
Polars CLI v0.3.0
Type .help for help.

> SELECT species, AVG(sepal_length) AS avg_sepal_length, AVG(sepal_width) AS avg_sepal_width FROM read_csv('docs/assets/data/iris.csv') GROUP BY species;

Refer to the Polars CLI repository for more information.

Performance

Blazingly fast

Polars is very fast. In fact, it is one of the best-performing solutions available. See the PDS-H benchmark results.

Lightweight

Polars is also very lightweight. It comes with zero required dependencies, and this shows in the import times:

polars: 70ms
numpy: 104ms
pandas: 520ms
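
Import times like the ones above depend on the machine and the installed versions, so it can be useful to measure them locally. The following is one possible way to do that, not the method used to produce the figures above: it times the import in a fresh interpreter so that already-loaded modules do not skew the result. The helper name `cold_import_ms` is hypothetical.

```python
import subprocess
import sys


def cold_import_ms(module_name: str) -> float:
    """Time `import module_name` in a fresh interpreter, in milliseconds."""
    code = (
        "import time\n"
        "start = time.perf_counter()\n"
        f"import {module_name}\n"
        "print((time.perf_counter() - start) * 1000)\n"
    )
    # A new process avoids modules that are already cached in sys.modules.
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    # Use the last line in case the imported module prints something itself.
    return float(result.stdout.strip().splitlines()[-1])


if __name__ == "__main__":
    # Standard-library modules are used here so the sketch runs anywhere;
    # swap in "polars", "numpy" or "pandas" if they are installed.
    for name in ("json", "decimal"):
        print(f"{name}: {cold_import_ms(name):.1f}ms")
```

A single run is noisy (disk cache, CPU frequency scaling), so repeating the measurement a few times and taking the minimum gives a steadier number.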

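Handles larger-than-RAM data

If you have data that does not fit into memory, Polars' query engine is able to process your query (or parts of your query) in a streaming fashion. This drastically reduces memory requirements, so you might be able to process your 250GB dataset on your laptop. Collect with collect(streaming=True) to run the query in streaming mode. (This might be a little slower, but it is still very fast!)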
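
To make the idea of streaming execution concrete, here is a conceptual sketch in plain Python. It is not how Polars implements its streaming engine; it only illustrates why processing data in batches and keeping running aggregates bounds memory use. The function name `streaming_group_mean`, the batch size, and the sample data are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict
from itertools import islice


def streaming_group_mean(rows, key, value, batch_size=2):
    """Compute the mean of `value` per `key`, reading `rows` in small batches.

    Only the running sums and counts are kept, never the full input.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    rows = iter(rows)
    while True:
        # Pull at most `batch_size` rows from the source at a time.
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        for row in batch:
            sums[row[key]] += float(row[value])
            counts[row[key]] += 1
    return {k: sums[k] / counts[k] for k in sums}


# Hypothetical sample data standing in for a file too large for memory.
data = io.StringIO(
    "species,sepal_length\n"
    "Setosa,5.0\n"
    "Setosa,5.5\n"
    "Virginica,6.5\n"
    "Virginica,7.0\n"
)
means = streaming_group_mean(csv.DictReader(data), "species", "sepal_length")
print(means)
```

Memory use here grows with the number of distinct groups rather than the number of rows, which is the property that makes larger-than-RAM aggregation feasible.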

Setup

Python

Install the latest Polars version with:

pip install polars

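We also have a conda package (conda install -c conda-forge polars), however pip is the preferred way to install Polars.

Install Polars with all optional dependencies.

pip install 'polars[all]'

You can also install a subset of all optional dependencies.

pip install 'polars[numpy,pandas,pyarrow]'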

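See the user guide for more details on optional dependencies.

To see the current Polars version and a full list of its optional dependencies, run:

pl.show_versions()

Releases currently happen quite often (weekly / every few days), so updating Polars regularly to get the latest bug fixes / features might not be a bad idea.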

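Rust

You can take the latest release from crates.io, or if you want to use the newest features / performance improvements, point to the main branch of this repo.

polars = { git = "https://github.com/pola-rs/polars", rev = "<optional git tag>" }

Requires Rust version >=1.80.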

Contributing

Want to contribute? Read our contributing guide.

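Python: compile Polars from source

If you want a bleeding-edge release or maximal performance, you should compile Polars from source.

This can be done by going through the following steps in sequence: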

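Install the latest Rust compiler
Install maturin: pip install maturin
cd py-polars and choose one of the following:
- make build, slow binary with debug assertions and symbols, fast compile times
- make build-release, fast binary without debug assertions, minimal debug symbols, long compile times
- make build-nodebug-release, same as build-release but without any debug symbols, slightly faster to compile
- make build-debug-release, same as build-release but with full debug symbols, slightly slower to compile
- make build-dist-release, fastest binary, extreme compile times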

By default the binary is compiled with optimizations turned on for a modern CPU. Specify LTS_CPU=1 with the command if your CPU is older and does not support e.g. AVX2.
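
If you are unsure whether your CPU supports AVX2, one way to check on x86 Linux is to look at the feature flags in /proc/cpuinfo. The sketch below is only an illustration under that assumption (Linux, x86, a readable /proc/cpuinfo); the helper names `parse_cpu_flags` and `supports_avx2` are hypothetical and not part of Polars or its build tooling.

```python
import platform
from pathlib import Path


def parse_cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags from the first `flags` line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "flags":
            return set(value.split())
    return set()


def supports_avx2(cpuinfo_text: str) -> bool:
    """Return True if the `avx2` flag is listed."""
    return "avx2" in parse_cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if platform.system() == "Linux" and cpuinfo.exists():
        if supports_avx2(cpuinfo.read_text()):
            print("AVX2 found: the default build should work")
        else:
            print("No AVX2 flag: consider building with LTS_CPU=1")
    else:
        print("This sketch only knows how to read /proc/cpuinfo on Linux")
```

On other platforms the same information is available through different means, which this sketch does not attempt to cover.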

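Note that the Rust crate implementing the Python bindings is called py-polars to distinguish it from the wrapped Rust crate polars itself. However, both the Python package and the Python module are named polars, so you can pip install polars and import polars.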

Using custom Rust functions in Python

Extending Polars with UDFs compiled in Rust is easy. We expose PyO3 extensions for the DataFrame and Series data structures. See https://github.com/pola-rs/pyo3-polars for more information.

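Going big...

Do you expect more than 2^32 (~4.2 billion) rows? Compile Polars with the bigidx feature flag or, for Python users, install pip install polars-u64-idx.

Don't use this unless you hit the row boundary, as the default build of Polars is faster and consumes less memory.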
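
The 2^32 figure above works out as follows. This small sketch just does the arithmetic and wraps it in a check; the helper name `needs_u64_index` is hypothetical, and treating exactly 2^32 rows as still within the default limit is an assumption based on the "more than 2^32" wording.

```python
# Row-count threshold quoted above: 2^32, about 4.2 billion.
U32_ROW_LIMIT = 2**32  # 4_294_967_296


def needs_u64_index(expected_rows: int) -> bool:
    """Return True if the expected row count exceeds the 2^32 threshold."""
    return expected_rows > U32_ROW_LIMIT


# One billion rows fits comfortably; five billion does not.
print(needs_u64_index(1_000_000_000))
print(needs_u64_index(5_000_000_000))
```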

Legacy

Do you want Polars to run on an old CPU (e.g. dating from before 2011), or on an x86-64 build of Python on Apple Silicon under Rosetta? Install pip install polars-lts-cpu. This version of Polars is compiled without AVX target features.