Polars download - Polars source code download

Documentation: Python - Rust - Node.js - R | StackOverflow: Python - Rust - Node.js - R | User guide | Discord

Polars: Blazingly fast DataFrames in Rust, Python, Node.js, R, and SQL

Polars is a DataFrame interface on top of an OLAP query engine, implemented in Rust and using the Apache Arrow Columnar Format as its memory model.

Lazy | eager execution
Multi-threaded
SIMD
Query optimization
Powerful expression API
Hybrid streaming (larger-than-RAM datasets)
Rust | Python | NodeJS | R | ...

To learn more, read the user guide.

Python

>>> import polars as pl
>>> df = pl.DataFrame(
...     {
...         "A": [1, 2, 3, 4, 5],
...         "fruits": ["banana", "banana", "apple", "apple", "banana"],
...         "B": [5, 4, 3, 2, 1],
...         "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
...     }
... )

# embarrassingly parallel execution & very expressive query language
>>> df.sort("fruits").select(
...     "fruits",
...     "cars",
...     pl.lit("fruits").alias("literal_string_fruits"),
...     pl.col("B").filter(pl.col("cars") == "beetle").sum(),
...     pl.col("A").filter(pl.col("B") > 2).sum().over("cars").alias("sum_A_by_cars"),
...     pl.col("A").sum().over("fruits").alias("sum_A_by_fruits"),
...     pl.col("A").reverse().over("fruits").alias("rev_A_by_fruits"),
...     pl.col("A").sort_by("B").over("fruits").alias("sort_A_by_B_by_fruits"),
... )
shape: (5, 8)
┌──────────┬──────────┬──────────────┬─────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ fruits   ┆ cars     ┆ literal_stri ┆ B   ┆ sum_A_by_ca ┆ sum_A_by_fr ┆ rev_A_by_fr ┆ sort_A_by_B │
│ ---      ┆ ---      ┆ ng_fruits    ┆ --- ┆ rs          ┆ uits        ┆ uits        ┆ _by_fruits  │
│ str      ┆ str      ┆ ---          ┆ i64 ┆ ---         ┆ ---         ┆ ---         ┆ ---         │
│          ┆          ┆ str          ┆     ┆ i64         ┆ i64         ┆ i64         ┆ i64         │
╞══════════╪══════════╪══════════════╪═════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 4           ┆ 4           │
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 3           ┆ 3           │
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 5           ┆ 5           │
│ "banana" ┆ "audi"   ┆ "fruits"     ┆ 11  ┆ 2           ┆ 8           ┆ 2           ┆ 2           │
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 1           ┆ 1           │
└──────────┴──────────┴──────────────┴─────┴─────────────┴─────────────┴─────────────┴─────────────┘

SQL

>>> df = pl.scan_csv("docs/assets/data/iris.csv")
>>> ## OPTION 1
>>> # run SQL queries on frame-level
>>> df.sql("""
...     SELECT species,
...       AVG(sepal_length) AS avg_sepal_length
...     FROM self
...     GROUP BY species
...     """).collect()
shape: (3, 2)
┌────────────┬──────────────────┐
│ species    ┆ avg_sepal_length │
│ ---        ┆ ---              │
│ str        ┆ f64              │
╞════════════╪══════════════════╡
│ Virginica  ┆ 6.588            │
│ Versicolor ┆ 5.936            │
│ Setosa     ┆ 5.006            │
└────────────┴──────────────────┘
>>> ## OPTION 2
>>> # use pl.sql() to operate on the global context
>>> df2 = pl.LazyFrame({
...     "species": ["Setosa", "Versicolor", "Virginica"],
...     "blooming_season": ["Spring", "Summer", "Fall"]
... })
>>> pl.sql("""
... SELECT df.species,
...     AVG(df.sepal_length) AS avg_sepal_length,
...     df2.blooming_season
... FROM df
... LEFT JOIN df2 ON df.species = df2.species
... GROUP BY df.species, df2.blooming_season
... """).collect()

SQL commands can also be run directly from your terminal using the Polars CLI:

# run an inline SQL query
> polars -c "SELECT species, AVG(sepal_length) AS avg_sepal_length, AVG(sepal_width) AS avg_sepal_width FROM read_csv('docs/assets/data/iris.csv') GROUP BY species;"

# run interactively
> polars
Polars CLI v0.3.0
Type .help for help.

> SELECT species, AVG(sepal_length) AS avg_sepal_length, AVG(sepal_width) AS avg_sepal_width FROM read_csv('docs/assets/data/iris.csv') GROUP BY species;

Refer to the Polars CLI repository for more information.

Performance

Blazingly fast

Polars is very fast. In fact, it is one of the best performing solutions available. See the PDS-H benchmark results.

Lightweight

Polars is also very lightweight. It comes with zero required dependencies, and this shows in the import times:

polars: 70ms
numpy: 104ms
pandas: 520ms
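
Import times depend on the machine, the Python version, and whether files are already cached, so the figures above are best read as indicative. One rough way to take a comparable measurement is to time a fresh interpreter per module. The helper name time_import is hypothetical, and the figure includes interpreter start-up, so it will be larger than the pure import cost:

```python
import subprocess
import sys
import time


def time_import(module_name: str) -> float:
    """Return wall-clock seconds for a fresh interpreter to import a module."""
    start = time.perf_counter()
    # A new process avoids measuring a module already cached in sys.modules.
    subprocess.run([sys.executable, "-c", f"import {module_name}"], check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # "json" is a stand-in from the standard library; swap in "polars",
    # "numpy", or "pandas" if they are installed.
    print(f"json: {time_import('json') * 1000:.0f}ms")
```

For a per-module breakdown without the start-up overhead, CPython's -X importtime option is an alternative.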

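Handles larger-than-RAM data

If you have data that does not fit into memory, Polars' query engine is able to process your query (or parts of your query) in a streaming fashion. This drastically reduces memory requirements, so you might be able to process your 250GB dataset on your laptop. Collect with collect(streaming=True) to run the query in streaming mode. (This might be a little slower, but it is still very fast!)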

Setup

Python

Install the latest Polars version with:

pip install polars

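We also have a conda package (conda install -c conda-forge polars), however pip is the preferred way to install Polars.

Install Polars with all optional dependencies.

pip install 'polars[all]'

You can also install a subset of all optional dependencies.

pip install 'polars[numpy,pandas,pyarrow]'

See the user guide for more details on optional dependencies.

To see the current Polars version and a full list of its optional dependencies, run:

pl.show_versions()

Releases happen quite often (weekly / every few days) at the moment, so updating Polars regularly to get the latest bugfixes / features might not be a bad idea.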

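Rust

You can take the latest release from crates.io, or if you want to use the newest features / performance improvements, point to the main branch of this repo.

polars = { git = "https://github.com/pola-rs/polars", rev = "<optional git tag>" }

Requires Rust version >=1.80.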

Contributing

Want to contribute? Read our contributing guide.

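Python: compile Polars from source

If you want a bleeding edge release or maximal performance, you should compile Polars from source.

This can be done by going through the following steps in sequence: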

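Install the latest Rust compiler
Install maturin: pip install maturin
cd py-polars and choose one of the following:
- make build, slow binary with debug assertions and symbols, fast compile times
- make build-release, fast binary without debug assertions, minimal debug symbols, long compile times
- make build-nodebug-release, same as build-release but without any debug symbols, slightly faster to compile
- make build-debug-release, same as build-release but with full debug symbols, slightly slower to compile
- make build-dist-release, fastest binary, extreme compile times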

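By default the binary is compiled with optimizations turned on for a modern CPU. Specify LTS_CPU=1 with the command if your CPU is older and does not support e.g. AVX2.

Note that the Rust crate implementing the Python bindings is called py-polars to distinguish it from the wrapped Rust crate polars itself. However, both the Python package and the Python module are named polars, so you can pip install polars and import polars.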

Using custom Rust functions in Python

Extending Polars with UDFs compiled in Rust is easy. We expose PyO3 extensions for the DataFrame and Series data structures. See https://github.com/pola-rs/pyo3-polars for more information.

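Going big...

Do you expect more than 2^32 (about 4.3 billion) rows? Compile Polars with the bigidx feature flag or, for Python users, install pip install polars-u64-idx.

Don't use this unless you hit the row boundary, as the default build of Polars is faster and consumes less memory.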
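
The 2^32 boundary is what an unsigned 32-bit row index can address. A small arithmetic sketch of the check follows; ROW_LIMIT and needs_bigidx are hypothetical names for illustration, not part of Polars:

```python
# Row-count boundary quoted above: 2^32, about 4.29 billion.
ROW_LIMIT = 2**32


def needs_bigidx(expected_rows: int) -> bool:
    """Return True when the expected row count exceeds the 2^32 boundary."""
    return expected_rows > ROW_LIMIT


print(ROW_LIMIT)                    # 4294967296
print(needs_bigidx(1_000_000_000))  # False
print(needs_bigidx(5_000_000_000))  # True
```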

Legacy

Do you want Polars to run on an old CPU (e.g. dating from before 2011), or on an x86-64 build of Python on Apple Silicon under Rosetta? Install pip install polars-lts-cpu. This version of Polars is compiled without AVX target features.
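
To decide between the default and the LTS build, it helps to know whether the CPU reports AVX support. A minimal sketch, assuming an x86-64 Linux system where /proc/cpuinfo exposes a flags line; the function name cpu_has_flag is hypothetical, and other platforms need a different mechanism:

```python
from pathlib import Path


def cpu_has_flag(cpuinfo_text: str, flag: str) -> bool:
    """Return True if `flag` appears in a 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and flag in value.split():
            return True
    return False


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        text = cpuinfo.read_text()
        print("avx:", cpu_has_flag(text, "avx"))
        print("avx2:", cpu_has_flag(text, "avx2"))
    else:
        print("/proc/cpuinfo not found; this sketch only covers Linux.")
```

If neither flag is reported, the polars-lts-cpu package described above is the likelier fit.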