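To try TPOT2 (alpha), please go here!
TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your data science assistant. TPOT is a Python automated machine learning tool that optimizes machine learning pipelines using genetic programming.
TPOT automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.
[Figure: An example machine learning pipeline]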
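To make the idea of an evolutionary pipeline search more concrete, here is a toy sketch using only the Python standard library. It is not TPOT's algorithm: the search space, the `toy_score` function and its weights are all invented for illustration, and a real search would score each candidate by cross-validation on your data. The `generations` and `population_size` names simply mirror the arguments used in the TPOT examples in this README.

```python
import random

# Hypothetical search space: one choice per pipeline stage.
SEARCH_SPACE = {
    "preprocessor": ["none", "standard_scaler", "polynomial_features"],
    "selector": ["none", "select_k_best", "variance_threshold"],
    "model": ["logistic_regression", "random_forest", "extra_trees"],
}

# Invented weights; a stand-in for a real cross-validated score.
TOY_WEIGHTS = {
    "none": 0.0, "standard_scaler": 0.02, "polynomial_features": 0.05,
    "select_k_best": 0.01, "variance_threshold": 0.005,
    "logistic_regression": 0.85, "random_forest": 0.90, "extra_trees": 0.88,
}


def toy_score(pipeline):
    """Score a candidate pipeline (higher is better)."""
    return sum(TOY_WEIGHTS[choice] for choice in pipeline.values())


def random_pipeline(rng):
    """Pick one random option for every stage."""
    return {stage: rng.choice(options) for stage, options in SEARCH_SPACE.items()}


def mutate(pipeline, rng):
    """Copy a pipeline and re-draw the choice for one random stage."""
    child = dict(pipeline)
    stage = rng.choice(list(SEARCH_SPACE))
    child[stage] = rng.choice(SEARCH_SPACE[stage])
    return child


def evolve(generations=5, population_size=10, seed=42):
    """Keep the better half each generation and refill with mutated copies."""
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=toy_score, reverse=True)
        survivors = population[: population_size // 2]
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=toy_score)


best = evolve()
print(best, round(toy_score(best), 3))
```

Because the best candidates always survive into the next generation, the best score in this sketch can never get worse as the number of generations grows.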
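Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found, so you can tinker with the pipeline from there.
TPOT is built on top of scikit-learn, so all of the code it generates should look familiar... if you're familiar with scikit-learn, anyway.
TPOT is still under active development, and we encourage you to check back on this repository regularly for updates.
For further information about TPOT, please see the project documentation.
Please see the repository license for the licensing and usage information for TPOT.
Generally, we have licensed TPOT to make it as widely usable as possible.
We maintain the TPOT installation instructions in the documentation. TPOT requires a working installation of Python.
TPOT can be used on the command line or with Python code.
Click on the corresponding links to find more information on TPOT usage in the documentation.
Below is a minimal working example with the optical recognition of handwritten digits dataset.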
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Load the digits dataset and hold out 25% of it for testing
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

# Search for a pipeline, report its test score, and export the best one as Python code
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_digits_pipeline.py')
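Running this code should discover a pipeline that achieves about 98% test accuracy, and the corresponding Python code should be exported to the tpot_digits_pipeline.py file and look similar to the following: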
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import PolynomialFeatures
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: 0.9799428471757372
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    StackingEstimator(estimator=LogisticRegression(C=0.1, dual=False, penalty="l1")),
    RandomForestClassifier(bootstrap=True, criterion="entropy", max_features=0.35000000000000003, min_samples_leaf=20, min_samples_split=19, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
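The exported script calls set_param_recursive to fix the random state of every step. The following is a simplified sketch of what a helper like that might do, not the code shipped in tpot.export_utils, which may differ in its details. The function name `set_param_recursive_sketch` and the small stand-in classes are hypothetical and exist only so the example runs without scikit-learn installed.

```python
class Scaler:
    """Stand-in transformer that has no random_state attribute."""


class Model:
    """Stand-in estimator that exposes a random_state attribute."""

    def __init__(self):
        self.random_state = None


class Wrapper:
    """Stand-in for a step that wraps another estimator."""

    def __init__(self, estimator):
        self.estimator = estimator


class Pipeline:
    """Stand-in container holding (name, object) steps."""

    def __init__(self, steps):
        self.steps = steps


def set_param_recursive_sketch(steps, parameter, value):
    """Set `parameter` to `value` on every object in `steps` that has it."""
    for _name, obj in steps:
        # Descend into nested containers of (name, object) pairs.
        for attr in ("steps", "transformer_list"):
            if hasattr(obj, attr):
                set_param_recursive_sketch(getattr(obj, attr), parameter, value)
        # Handle a wrapped estimator, if there is one.
        inner = getattr(obj, "estimator", None)
        if inner is not None and hasattr(inner, parameter):
            setattr(inner, parameter, value)
        # Finally, set the parameter on the object itself.
        if hasattr(obj, parameter):
            setattr(obj, parameter, value)


scaler, inner, forest = Scaler(), Model(), Model()
nested = Pipeline([("scale", scaler), ("stack", Wrapper(inner))])
steps = [("nested", nested), ("forest", forest)]

set_param_recursive_sketch(steps, "random_state", 42)
print(inner.random_state, forest.random_state)  # prints: 42 42
```

Objects without the parameter, such as the stand-in scaler here, are left untouched, so only the steps that actually use randomness are affected.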
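Similarly, TPOT can optimize pipelines for regression problems. Below is a minimal working example with the practice Boston housing prices dataset.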
from tpot import TPOTRegressor
# NOTE: load_boston was removed in scikit-learn 1.2, so this example needs an older scikit-learn release
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Load the housing dataset and hold out 25% of it for testing
housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

# Search for a pipeline, report its test score, and export the best one as Python code
tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')
This should result in a pipeline that achieves about 12.77 mean squared error (MSE), and the Python code in tpot_boston_pipeline.py should look similar to:
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: -10.812040755234403
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    ExtraTreesRegressor(bootstrap=False, max_features=0.5, min_samples_leaf=2, min_samples_split=3, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
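The CV score in the exported regression script is negative, while the text quotes a positive MSE. A likely explanation is the scikit-learn convention that higher scores are always better, so error metrics are negated (as in its `neg_mean_squared_error` scorer). The sketch below shows that convention in plain Python, assuming the regression score is computed this way. The target and prediction values are made up for illustration and are not taken from the Boston dataset.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def neg_mse_score(y_true, y_pred):
    """Negated MSE, so that a larger (less negative) score means a better fit."""
    return -mean_squared_error(y_true, y_pred)


# Made-up targets and predictions, for illustration only.
y_true = [24.0, 22.0, 35.0, 33.0]
y_pred = [25.0, 21.0, 33.0, 35.0]

mse = mean_squared_error(y_true, y_pred)
score = neg_mse_score(y_true, y_pred)
print(mse, score)  # prints: 2.5 -2.5
```

Under this convention, a score of -10.81 corresponds to an MSE of 10.81, and a score closer to zero indicates a better pipeline.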
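Check the documentation for more examples and tutorials.
We welcome you to check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to TPOT, please file a new issue so we can discuss it.
Before submitting any contributions, please review our contribution guidelines.
Please check the existing open and closed issues to see if your issue has already been addressed. If it hasn't, file a new issue on this repository so we can review it.
If you use TPOT in a scientific publication, please consider citing at least one of the following papers: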
Trang T. Le, Weixuan Fu and Jason H. Moore (2020). Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics. 36(1): 250-256.
BibTeX entry:
@article{le2020scaling,
  title={Scaling tree-based automated machine learning to biomedical big data with a feature set selector},
  author={Le, Trang T and Fu, Weixuan and Moore, Jason H},
  journal={Bioinformatics},
  volume={36},
  number={1},
  pages={250--256},
  year={2020},
  publisher={Oxford University Press}
}
Randal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews, Nicole A. Lavender, La Creis Kidd, and Jason H. Moore (2016). Automating biomedical data science through tree-based pipeline optimization. Applications of Evolutionary Computation, pages 123-137.
BibTeX entry:
@inbook{Olson2016EvoBio,
  author={Olson, Randal S. and Urbanowicz, Ryan J. and Andrews, Peter C. and Lavender, Nicole A. and Kidd, La Creis and Moore, Jason H.},
  editor={Squillero, Giovanni and Burelli, Paolo},
  chapter={Automating Biomedical Data Science Through Tree-Based Pipeline Optimization},
  title={Applications of Evolutionary Computation: 19th European Conference, EvoApplications 2016, Porto, Portugal, March 30 -- April 1, 2016, Proceedings, Part I},
  year={2016},
  publisher={Springer International Publishing},
  pages={123--137},
  isbn={978-3-319-31204-0},
  doi={10.1007/978-3-319-31204-0_9},
  url={http://dx.doi.org/10.1007/978-3-319-31204-0_9}
}
Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore (2016). Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. Proceedings of GECCO 2016, pages 485-492.
BibTeX entry:
@inproceedings{OlsonGECCO2016,
  author={Olson, Randal S. and Bartley, Nathan and Urbanowicz, Ryan J. and Moore, Jason H.},
  title={Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science},
  booktitle={Proceedings of the Genetic and Evolutionary Computation Conference 2016},
  series={GECCO '16},
  year={2016},
  isbn={978-1-4503-4206-3},
  location={Denver, Colorado, USA},
  pages={485--492},
  numpages={8},
  url={http://doi.acm.org/10.1145/2908812.2908918},
  doi={10.1145/2908812.2908918},
  acmid={2908918},
  publisher={ACM},
  address={New York, NY, USA},
}
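Alternatively, you can cite the repository directly with the following DOI:
TPOT was developed in the Computational Genetics Lab at the University of Pennsylvania with funding from the NIH under grant R01 AI117694. We are incredibly grateful for the support of the NIH and the University of Pennsylvania during the development of this project.
The TPOT logo was designed by Todd Newmuis, who generously donated his time to the project.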