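To try TPOT2 (alpha), please go here!
TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your Data Science Assistant. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
TPOT automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.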
[Figure: an example machine learning pipeline]
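To make the idea of exploring many candidate pipelines more concrete, here is a minimal sketch of an evolutionary search loop in plain Python. It is not TPOT's implementation: the step names, the made-up scores in toy_score, and the keep-the-best-half selection rule are all assumptions chosen for illustration. A real search would score each candidate by cross-validation on your data.

```python
import random

# Hypothetical building blocks; these names are illustrative only
# and are not TPOT's actual operator set.
PREPROCESSORS = ["none", "scaler", "polynomial_features", "pca"]
MODELS = ["logistic_regression", "random_forest", "k_neighbors"]

# Made-up numbers standing in for cross-validated accuracy.
PREPROCESSOR_BONUS = {"none": 0.00, "scaler": 0.02,
                      "polynomial_features": 0.04, "pca": 0.01}
MODEL_SCORE = {"logistic_regression": 0.90, "random_forest": 0.93,
               "k_neighbors": 0.91}


def toy_score(pipeline):
    """Return a made-up fitness for a (preprocessor, model) pair."""
    preprocessor, model = pipeline
    return MODEL_SCORE[model] + PREPROCESSOR_BONUS[preprocessor]


def random_pipeline(rng):
    """Draw a random candidate pipeline."""
    return (rng.choice(PREPROCESSORS), rng.choice(MODELS))


def mutate(pipeline, rng):
    """Re-draw one of the two steps at random."""
    preprocessor, model = pipeline
    if rng.random() < 0.5:
        preprocessor = rng.choice(PREPROCESSORS)
    else:
        model = rng.choice(MODELS)
    return (preprocessor, model)


def evolve(generations=5, population_size=10, seed=42):
    """Keep the better half each generation and refill with mutated copies."""
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=toy_score, reverse=True)
        survivors = population[: population_size // 2]
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(population_size - len(survivors))]
        population = survivors + children
    return max(population, key=toy_score)


best = evolve()
print(best, round(toy_score(best), 2))
```

Because the best candidates always survive into the next generation, the best score in this sketch never decreases as the number of generations grows.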
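Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.
TPOT is built on top of scikit-learn, so all of the code it generates should look familiar... if you're familiar with scikit-learn, anyway.
TPOT is still under active development and we encourage you to check back on this repository regularly for updates.
For further information about TPOT, please see the project documentation.
Please see the repository license for the licensing and usage information for TPOT.
Generally, we have licensed TPOT to make it as widely usable as possible.
We maintain the TPOT installation instructions in the documentation. TPOT requires a working installation of Python.
TPOT can be used on the command line or with Python code.
Click on the corresponding links to find more information on TPOT usage in the documentation.
Below is a minimal working example with the optical recognition of handwritten digits dataset.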
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
# Load the digits data and hold out 25% of it for testing
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)
# Search for a pipeline, score it on the held-out data, and export the result
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_digits_pipeline.py')
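Running this code should discover a pipeline that achieves about 98% testing accuracy. The corresponding Python code should be exported to the tpot_digits_pipeline.py file and look similar to the following: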
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import PolynomialFeatures
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive
# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'], random_state=42)
# Average CV score on the training set was: 0.9799428471757372
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    StackingEstimator(estimator=LogisticRegression(C=0.1, dual=False, penalty="l1")),
    RandomForestClassifier(bootstrap=True, criterion="entropy", max_features=0.35000000000000003, min_samples_leaf=20, min_samples_split=19, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
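The exported script reads its data from a file whose outcome column is labeled 'target'. As a minimal sketch of how such a file could be prepared for the digits example, the snippet below writes the dataset to CSV and reads it back the same way the exported script does. The temporary file location, the file name digits.csv, and the ',' separator are assumptions made for illustration; substitute your own path and separator.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.datasets import load_digits

digits = load_digits()

# Put the features in a DataFrame and store the labels under the column
# name 'target', which is the name the exported script looks for.
frame = pd.DataFrame(digits.data)
frame["target"] = digits.target

# Hypothetical output location; any writable path works.
csv_path = os.path.join(tempfile.mkdtemp(), "digits.csv")
frame.to_csv(csv_path, sep=",", index=False)

# Read the file back the same way the exported script does.
tpot_data = pd.read_csv(csv_path, sep=",", dtype=np.float64)
features = tpot_data.drop("target", axis=1)
print(features.shape, tpot_data["target"].nunique())
```

With a file like this in place, the PATH/TO/DATA/FILE and COLUMN_SEPARATOR placeholders in the exported script can be replaced by the real path and separator.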
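Similarly, TPOT can optimize pipelines for regression problems. Below is a minimal working example with the practice Boston housing prices data set. Note that load_boston was removed in scikit-learn 1.2, so this example requires an earlier scikit-learn release.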
from tpot import TPOTRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
# Load the housing data and hold out 25% of it for testing
housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)
# Search for a pipeline, score it on the held-out data, and export the result
tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')
This should result in a pipeline that achieves about 12.77 mean squared error (MSE), and the Python code in tpot_boston_pipeline.py should look similar to:
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from tpot.export_utils import set_param_recursive
# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'], random_state=42)
# Average CV score on the training set was: -10.812040755234403
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    ExtraTreesRegressor(bootstrap=False, max_features=0.5, min_samples_leaf=2, min_samples_split=3, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
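For readers less familiar with the metric, MSE is the average of the squared differences between the true and predicted values. The sketch below computes it in plain Python on made-up numbers; compute_mse is a local helper written for this illustration, not a TPOT or scikit-learn function. scikit-learn's neg_mean_squared_error scorer reports the negated value so that higher is always better, which is presumably why the CV score in the exported file's comment is negative.

```python
def compute_mse(y_true, y_pred):
    """Mean of squared differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values for illustration only; not taken from any real dataset.
y_true = [24.0, 21.5, 34.5, 33.0]
y_pred = [25.0, 20.5, 32.5, 35.0]

mse = compute_mse(y_true, y_pred)  # (1 + 1 + 4 + 4) / 4
neg_mse = -mse                     # sign convention where higher is better
print(mse, neg_mse)
```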
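Check the documentation for more examples and tutorials.
We welcome you to check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to TPOT, please file a new issue so we can discuss it.
Before submitting any contributions, please review our contribution guidelines.
Please check the existing open and closed issues to see if your issue has already been addressed. If it hasn't, file a new issue on this repository so we can review it.
If you use TPOT in a scientific publication, please consider citing at least one of the following papers:
Trang T. Le, Weixuan Fu and Jason H. Moore (2020). Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics. 36(1): 250-256.
BibTeX entry: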
@article{le2020scaling,
  title={Scaling tree-based automated machine learning to biomedical big data with a feature set selector},
  author={Le, Trang T and Fu, Weixuan and Moore, Jason H},
  journal={Bioinformatics},
  volume={36},
  number={1},
  pages={250--256},
  year={2020},
  publisher={Oxford University Press}
}
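Randal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews, Nicole A. Lavender, La Creis Kidd, and Jason H. Moore (2016). Automating biomedical data science through tree-based pipeline optimization. Applications of Evolutionary Computation, pages 123-137.
BibTeX entry: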
@inbook{Olson2016EvoBio,
  author={Olson, Randal S. and Urbanowicz, Ryan J. and Andrews, Peter C. and Lavender, Nicole A. and Kidd, La Creis and Moore, Jason H.},
  editor={Squillero, Giovanni and Burelli, Paolo},
  chapter={Automating Biomedical Data Science Through Tree-Based Pipeline Optimization},
  title={Applications of Evolutionary Computation: 19th European Conference, EvoApplications 2016, Porto, Portugal, March 30 -- April 1, 2016, Proceedings, Part I},
  year={2016},
  publisher={Springer International Publishing},
  pages={123--137},
  isbn={978-3-319-31204-0},
  doi={10.1007/978-3-319-31204-0_9},
  url={http://dx.doi.org/10.1007/978-3-319-31204-0_9}
}
Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore (2016). Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. Proceedings of GECCO 2016, pages 485-492.
BibTeX entry:
@inproceedings{OlsonGECCO2016,
  author={Olson, Randal S. and Bartley, Nathan and Urbanowicz, Ryan J. and Moore, Jason H.},
  title={Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science},
  booktitle={Proceedings of the Genetic and Evolutionary Computation Conference 2016},
  series={GECCO '16},
  year={2016},
  isbn={978-1-4503-4206-3},
  location={Denver, Colorado, USA},
  pages={485--492},
  numpages={8},
  url={http://doi.acm.org/10.1145/2908812.2908918},
  doi={10.1145/2908812.2908918},
  acmid={2908918},
  publisher={ACM},
  address={New York, NY, USA},
}
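Alternatively, you can cite the repository directly with the following DOI:
TPOT was developed in the Computational Genetics Lab at the University of Pennsylvania with funding from the NIH under grant R01 AI117694. We are incredibly grateful for the support of the NIH and the University of Pennsylvania during the development of this project.
The TPOT logo was designed by Todd Newmuis, who generously donated his time to the project.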