Revisiting Deep Learning Models for Tabular Data (NeurIPS 2021)

Important

Check out the new tabular DL model: TabM


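This is the official implementation of the paper "Revisiting Deep Learning Models for Tabular Data".

TL;DR

In one sentence: MLP-like models are still good baselines, and FT-Transformer is a new, powerful adaptation of the Transformer architecture for tabular data problems.

The paper focuses on architectures for tabular data problems. The results:

A simple tuned MLP is still a good baseline: it performs on par with or even better than most of the sophisticated architectures.
ResNet (an MLP-like model with skip connections and batch normalization) further highlights this point: MLP-like models are good baselines for tabular deep learning, and prior work does not outperform them.
FT-Transformer is a new architecture that changes this status quo:
- on the benchmarks, it shows the best average performance among deep models (including the MLP-like baselines mentioned above);
- on the datasets where GBDT (gradient-boosted decision trees) outperforms DL models, FT-Transformer reduces (but does not fully close) the gap between GBDT and DL;
- FT-Transformer is slower than MLP-like models.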

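Python package

The Python package in the package/ directory is the recommended way to use the paper in practice and in future work.

The rest of the document:

Metrics and hyperparameters
How to reproduce the reported results
How to cite

How to explore metrics and hyperparameters

The output/ directory contains numerous results and (tuned) hyperparameters for the models and datasets used in the paper.

Metrics

For example, let's explore the metrics for the MLP model. First, let's load the reports (the stats.json files):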

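import json
from pathlib import Path

import pandas as pd

# Collect every stats.json of the tuned MLP runs (all datasets, all seeds)
# and flatten the nested JSON into dotted column names.
df = pd.json_normalize([
    json.loads(x.read_text())
    for x in Path('output').glob('*/mlp/tuned/*/stats.json')
])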

Now, for each dataset, let's compute the test score averaged over all random seeds:

print(df.groupby('dataset')['metrics.test.score'].mean().round(3))

The output exactly matches Table 2 from the paper:

 dataset
adult                 0.852
aloi                  0.954
california_housing   -0.499
covtype               0.962
epsilon               0.898
helena                0.383
higgs_small           0.723
jannis                0.719
microsoft            -0.747
yahoo                -0.757
year                 -8.853
Name: metrics.test.score, dtype: float64
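
The per-dataset mean hides how much the score varies between random seeds. The following is a minimal sketch of how the same DataFrame could be summarized with the standard deviation and the number of runs as well. The helper name `summarize_scores` and the toy DataFrame are illustrative assumptions, not part of the repository; only the column names `dataset` and `metrics.test.score` come from the snippet above.

```python
import pandas as pd


def summarize_scores(df: pd.DataFrame, column: str = 'metrics.test.score') -> pd.DataFrame:
    """Return the per-dataset mean, sample standard deviation and run count of `column`."""
    return df.groupby('dataset')[column].agg(['mean', 'std', 'count']).round(3)


# Hypothetical toy data standing in for the DataFrame built from the stats.json files.
toy = pd.DataFrame({
    'dataset': ['a', 'a', 'b', 'b'],
    'metrics.test.score': [0.80, 0.90, -0.50, -0.70],
})
summary = summarize_scores(toy)
print(summary)
```

With the real reports, `summarize_scores(df)` would be called on the DataFrame loaded from `output/`.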

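Hyperparameters

The approach above can also be used to explore hyperparameters and build intuition about typical hyperparameter values for different algorithms. For example, this is how to compute the median tuned learning rate for the MLP model:

print(df[df['config.seed'] == 0]['config.training.lr'].quantile(0.5))
# Output: 0.0002161505605899536

Note

For some algorithms (e.g. MLP), more recent projects provide more results that can be explored in a similar way. For example, see the paper on TabR.

Warning

Use this approach with caution. When studying hyperparameter values:

Beware of outliers.
Look at the raw, unaggregated values to build intuition about typical values.
For a high-level overview, plot the distribution and/or compute multiple quantiles.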
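
A single median can be misleading when the values contain outliers. Below is a minimal sketch that computes several quantiles at once and prints the raw values next to them. The helper name `lr_quantiles` and the toy DataFrame are illustrative assumptions; the column names `config.seed` and `config.training.lr` are the ones used in the snippet above. As in that snippet, only the rows with seed 0 are kept, presumably so that each dataset contributes one tuned value.

```python
import pandas as pd


def lr_quantiles(df: pd.DataFrame, qs=(0.1, 0.25, 0.5, 0.75, 0.9)) -> pd.Series:
    """Return several quantiles of the tuned learning rate, using one row per dataset (seed 0)."""
    first_seed = df[df['config.seed'] == 0]
    return first_seed['config.training.lr'].quantile(list(qs))


# Hypothetical toy data: three datasets at seed 0 (one with an unusually large
# learning rate) and one row with another seed that must be ignored.
toy = pd.DataFrame({
    'config.seed': [0, 0, 0, 1],
    'config.training.lr': [1e-4, 2e-4, 1e-2, 5e-1],
})
print(lr_quantiles(toy))

# The raw, unaggregated values are worth a look as well.
print(toy[toy['config.seed'] == 0]['config.training.lr'].sort_values().tolist())
```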

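How to reproduce the results

Note

This section is long. Use the "Outline" feature on GitHub or in your text editor to get an overview of this section.

Code overview

The code is organized as follows:

bin:
- training code for all the models
- ensemble.py performs ensembling
- tune.py tunes hyperparameters
- code for the section "When FT-Transformer is better than ResNet?" of the paper:
  - analysis_gbdt_vs_nn.py runs the experiments
  - create_synthetic_data_plots.py builds the plots
lib contains common tools used by the programs in bin
output contains configuration files (inputs for the programs in bin) and results (metrics, tuned configurations, etc.)
package contains the Python package for the paper

Set up the environment

PyTorch environment

Install conda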

export PROJECT_DIR=<ABSOLUTE path to the repository root>
# example: export PROJECT_DIR=/home/myusername/repositories/revisiting-models
git clone https://github.com/yandex-research/tabular-dl-revisiting-models $PROJECT_DIR
cd $PROJECT_DIR

conda create -n revisiting-models python=3.8.8
conda activate revisiting-models

conda install pytorch==1.7.1 torchvision==0.8.2 cudatoolkit=10.1.243 numpy=1.19.2 -c pytorch -y
conda install cudnn=7.6.5 -c anaconda -y
pip install -r requirements.txt
conda install nodejs -y
jupyter labextension install @jupyter-widgets/jupyterlab-manager

# if the following commands do not succeed, update conda
conda env config vars set PYTHONPATH=${PYTHONPATH}:${PROJECT_DIR}
conda env config vars set PROJECT_DIR=${PROJECT_DIR}
conda env config vars set LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:${LD_LIBRARY_PATH}
conda env config vars set CUDA_HOME=${CONDA_PREFIX}
conda env config vars set CUDA_ROOT=${CONDA_PREFIX}

conda deactivate
conda activate revisiting-models

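TensorFlow environment

This environment is needed only for experimenting with TabNet. In all other cases, use the PyTorch environment.

The instructions are the same as for the PyTorch environment (including the installation of PyTorch!), except for the following:

python=3.7.10
cudatoolkit=10.0
right before pip install -r requirements.txt, do the following:
- pip install tensorflow-gpu==1.14
- comment out tensorboard in requirements.txt

Data

License: by downloading our dataset, you accept the licenses of all its components. We do not impose any new restrictions in addition to these licenses. You can find the list of sources in the "References" section of our paper.

Download the data: wget "https://www.dropbox.com/s/o53umyg6mn3zhxy/data.tar.gz?dl=1" -O revisiting_models_data.tar.gz
Move the archive to the root of the repository: mv revisiting_models_data.tar.gz $PROJECT_DIR
Go to the root of the repository: cd $PROJECT_DIR
Unpack the archive: tar -xvf revisiting_models_data.tar.gz

Tutorial

This section provides only specific commands with few comments. After completing the tutorial, we recommend reading the next section for a better understanding of how to work with the repository. It will also help you understand the tutorial better.

In this tutorial, we will reproduce the results for MLP on the California Housing dataset. We will cover:

tuning
evaluation
ensembling
comparing models with each other

Note that the chances of getting exactly the same results are rather low; however, they should not differ much from ours. Before running anything, go to the root of the repository and explicitly set CUDA_VISIBLE_DEVICES (if you plan to use a GPU):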

 cd $PROJECT_DIR
export CUDA_VISIBLE_DEVICES=0

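Check the environment

Before we start, let's check that the environment is configured correctly. The following commands should train one MLP on the California Housing dataset: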

mkdir draft
cp output/california_housing/mlp/tuned/0.toml draft/check_environment.toml
python bin/mlp.py draft/check_environment.toml

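The results should be in the directory draft/check_environment. For now, the content of the results is not important.

Tuning

Our config for tuning MLP on the California Housing dataset is located at output/california_housing/mlp/tuning/0.toml. To reproduce the tuning, copy our config and run your own tuning: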

 # you can choose any other name instead of "reproduced.toml"; it is better to keep this
# name while completing the tutorial
cp output/california_housing/mlp/tuning/0.toml output/california_housing/mlp/tuning/reproduced.toml
# let's reduce the number of tuning iterations to make tuning fast (and ineffective)
python -c "
from pathlib import Path
p = Path('output/california_housing/mlp/tuning/reproduced.toml')
p.write_text(p.read_text().replace('n_trials = 100', 'n_trials = 5'))
"
python bin/tune.py output/california_housing/mlp/tuning/reproduced.toml

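The results of your tuning will be located at output/california_housing/mlp/tuning/reproduced; you can compare them with ours at output/california_housing/mlp/tuning/0. The file best.toml contains the best configuration, which we will evaluate in the next section.

Evaluation

Now we have to evaluate the tuned configuration with 15 different random seeds.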

 # create a directory for evaluation
mkdir -p output/california_housing/mlp/tuned_reproduced

# clone the best config from the tuning stage with 15 different random seeds
python -c "
for seed in range(15):
    open(f'output/california_housing/mlp/tuned_reproduced/{seed}.toml', 'w').write(
        open('output/california_housing/mlp/tuning/reproduced/best.toml').read().replace('seed = 0', f'seed = {seed}')
    )
"

# train MLP with all 15 configs
for seed in {0..14}
do
    python bin/mlp.py output/california_housing/mlp/tuned_reproduced/${seed}.toml
done

Our directory with the evaluation results is located right next to yours, namely at output/california_housing/mlp/tuned.
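
Before moving on, it can be useful to confirm that all 15 evaluation runs actually finished. The sketch below assumes that a finished run leaves a `stats.json` file in a directory named after the seed, which is consistent with the `tuned/*/stats.json` pattern used in the Metrics section. The helper name `missing_seeds` is hypothetical and not part of the repository, and the demo uses a temporary directory instead of real results.

```python
import tempfile
from pathlib import Path


def missing_seeds(eval_dir, n_seeds=15):
    """Return the seeds whose run directory does not contain a stats.json file."""
    eval_dir = Path(eval_dir)
    return [
        seed for seed in range(n_seeds)
        if not (eval_dir / str(seed) / 'stats.json').exists()
    ]


# Demo on a temporary directory: pretend that only seeds 0 and 2 (out of 3) finished.
with tempfile.TemporaryDirectory() as tmp:
    for seed in (0, 2):
        run_dir = Path(tmp) / str(seed)
        run_dir.mkdir()
        (run_dir / 'stats.json').write_text('{}')
    demo_missing = missing_seeds(tmp, n_seeds=3)
print(demo_missing)
```

For the tutorial, the call would be `missing_seeds('output/california_housing/mlp/tuned_reproduced')`, and an empty list would mean that every seed produced a report.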

Ensembling

 # just run this single command
python bin/ensemble.py mlp output/california_housing/mlp/tuned_reproduced

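Your results will be located at output/california_housing/mlp/tuned_reproduced_ensemble; you can compare them with ours at output/california_housing/mlp/tuned_ensemble.

"Visualize" the results

Use the approach described in the "Metrics" section above to summarize the results of the conducted experiments (modify the path filter in .glob(...) accordingly: tuned -> tuned_reproduced).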
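
As a minimal sketch of such a summary, the code below loads the reports from the reference directory (`tuned`) and from the reproduced one (`tuned_reproduced`) and puts their test scores side by side. The helper names `load_stats` and `compare_runs` are hypothetical; the directory layout and the `metrics.test.score` column follow the Metrics section. The demo writes fake reports with made-up scores to a temporary directory so that it runs without the repository.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_stats(root, pattern):
    """Load all stats.json files matching `pattern` under `root` into a flat DataFrame."""
    records = [json.loads(p.read_text()) for p in sorted(Path(root).glob(pattern))]
    return pd.json_normalize(records)


def compare_runs(root, dataset='california_housing', model='mlp'):
    """Summarize test scores of the reference ('tuned') and reproduced runs."""
    rows = {}
    for name in ('tuned', 'tuned_reproduced'):
        df = load_stats(root, f'{dataset}/{model}/{name}/*/stats.json')
        scores = df['metrics.test.score']
        rows[name] = {'mean': scores.mean(), 'std': scores.std(), 'n_seeds': len(scores)}
    return pd.DataFrame.from_dict(rows, orient='index')


# Demo with fake reports; the scores below are made up for illustration only.
fake_scores = {'tuned': [-0.50, -0.49], 'tuned_reproduced': [-0.51, -0.50]}
with tempfile.TemporaryDirectory() as tmp:
    for name, values in fake_scores.items():
        for seed, score in enumerate(values):
            run_dir = Path(tmp) / 'california_housing' / 'mlp' / name / str(seed)
            run_dir.mkdir(parents=True)
            report = {'metrics': {'test': {'score': score}}}
            (run_dir / 'stats.json').write_text(json.dumps(report))
    comparison = compare_runs(tmp)
print(comparison)
```

In the repository, the call would be `compare_runs('output')`, run from the repository root.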

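What about other models and datasets?

Similar steps can be performed for all models and datasets. The tuning process is slightly different in the case of grid search: you have to run all the desired configurations and manually choose the best one based on the validation performance. For example, see output/epsilon/ft_transformer.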
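
The manual selection step for grid search could be sketched as follows. This is an assumption-heavy illustration, not the repository's own tooling: the helper `best_config_by_validation` is hypothetical, the key path `metrics -> val -> score` is assumed by analogy with `metrics.test.score`, and "higher is better" is assumed because the regression scores reported above are negative. Check a real stats.json before relying on any of this.

```python
import json
import tempfile
from pathlib import Path


def best_config_by_validation(run_dirs, key=('metrics', 'val', 'score')):
    """Return (run_dir, score) for the run with the highest validation score.

    Assumes every directory in `run_dirs` contains a stats.json file and that
    a higher score is better.
    """
    best_dir, best_score = None, float('-inf')
    for run_dir in run_dirs:
        value = json.loads((Path(run_dir) / 'stats.json').read_text())
        for part in key:  # walk down the nested dictionaries
            value = value[part]
        if value > best_score:
            best_dir, best_score = Path(run_dir), value
    return best_dir, best_score


# Demo with two fake grid-search runs; names and scores are made up.
with tempfile.TemporaryDirectory() as tmp:
    for name, val_score in {'config_a': 0.70, 'config_b': 0.72}.items():
        run_dir = Path(tmp) / name
        run_dir.mkdir()
        report = {'metrics': {'val': {'score': val_score}}}
        (run_dir / 'stats.json').write_text(json.dumps(report))
    best_dir, best_score = best_config_by_validation(sorted(Path(tmp).iterdir()))
print(best_dir.name, best_score)
```

Selecting by the validation score rather than the test score keeps the test set untouched for the final comparison.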

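How to work with the repository

How to run scripts

You should run Python scripts from the root of the repository. Most programs expect a configuration file as the only argument. The output will be a directory with the same name as the config, but without the extension. Configs are written in TOML. The lists of possible arguments for the programs are not provided and should be inferred from the scripts (usually, the config is represented by the args variable in a script). If you want to use CUDA, you must explicitly set the CUDA_VISIBLE_DEVICES environment variable. For example: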

 # The result will be at "path/to/my_experiment"
CUDA_VISIBLE_DEVICES=0 python bin/mlp.py path/to/my_experiment.toml

# The following example will run WITHOUT CUDA
python bin/mlp.py path/to/my_experiment.toml
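
The naming convention for the output directory can be expressed in a few lines. The sketch below only mirrors the rule stated above (same path as the config, without the extension); the helper names `expected_output_dir` and `has_results` are hypothetical and are not taken from the repository's code.

```python
from pathlib import Path


def expected_output_dir(config_path):
    """Return the directory where a script is expected to write its results."""
    return Path(config_path).with_suffix('')


def has_results(config_path):
    """Check whether a stats.json report already exists for the given config."""
    return (expected_output_dir(config_path) / 'stats.json').exists()


print(expected_output_dir('path/to/my_experiment.toml'))
print(has_results('path/to/my_experiment.toml'))
```

A check like `has_results` can help decide whether an experiment still needs to be run.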

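If you are going to use CUDA all the time, you can save the environment variable in the Conda environment:

conda env config vars set CUDA_VISIBLE_DEVICES="0"

The -f (--force) option removes the existing results and runs the script from scratch: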

python bin/whatever.py path/to/config.toml -f  # rewrites path/to/config

bin/tune.py supports continuation:

python bin/tune.py path/to/config.toml --continue

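`stats.json` and other results

For all scripts, stats.json is the most important part of the output. Its content varies from program to program. It can contain:

metrics
the config that was passed to the program
hardware info
execution time
and other information

Predictions for the train, validation and test sets are usually saved as well.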
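
To see what a particular report contains, the nested JSON can be flattened into dotted keys. The following is a minimal sketch; the helper name `flatten_stats` is hypothetical, and the demo report contains only fields that appear earlier in this document (`dataset`, `config.seed`, `config.training.lr`, `metrics.test.score`) with made-up values. A real stats.json may contain more fields and a different layout.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def flatten_stats(path):
    """Load one stats.json file and return its content as a Series with dotted keys."""
    record = json.loads(Path(path).read_text())
    return pd.json_normalize(record).iloc[0]


# Demo with a fake report written to a temporary directory.
fake_report = {
    'dataset': 'california_housing',
    'config': {'seed': 0, 'training': {'lr': 0.001}},
    'metrics': {'test': {'score': -0.5}},
}
with tempfile.TemporaryDirectory() as tmp:
    stats_path = Path(tmp) / 'stats.json'
    stats_path.write_text(json.dumps(fake_report))
    flat = flatten_stats(stats_path)
print(sorted(flat.index))
```

Listing the flattened keys first makes it easier to pick the right column names for the aggregations shown in the Metrics section.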

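Conclusion

Now you know everything you need to reproduce all the results and to extend this repository for your needs. The tutorial should also be clearer now. Feel free to open issues and ask questions.

How to cite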

 @inproceedings{gorishniy2021revisiting,
    title={Revisiting Deep Learning Models for Tabular Data},
    author={Yury Gorishniy and Ivan Rubachev and Valentin Khrulkov and Artem Babenko},
    booktitle={{NeurIPS}},
    year={2021},
}
