TPOT download: download the TPOT source code

TPOT2 (alpha) is available to try; see the project repository for the link.

TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your data science assistant. TPOT is a Python automated machine learning tool that optimizes machine learning pipelines using genetic programming.

[Figure: TPOT demo]

TPOT automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.
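
To make the idea of searching over pipelines concrete, here is a minimal sketch of an evolutionary search loop. It is not TPOT's implementation: it evolves a fixed-length configuration rather than the tree-shaped pipelines that genetic programming works on, and the search space, the score table, and every function name are made up for illustration. In a real run, the fitness would be a cross-validated score on your data.

```python
import random

# Hypothetical search space: one choice per pipeline stage.
SEARCH_SPACE = {
    "scaler": ["none", "standard", "minmax"],
    "selector": ["none", "variance", "kbest"],
    "model": ["tree", "forest", "logistic"],
}

# Made-up per-choice scores standing in for cross-validated accuracy.
TOY_SCORES = {
    "none": 0.0, "standard": 0.2, "minmax": 0.1,
    "variance": 0.1, "kbest": 0.3,
    "tree": 0.2, "forest": 0.4, "logistic": 0.3,
}


def fitness(pipeline):
    """Toy fitness: sum of the made-up scores of the chosen steps."""
    return sum(TOY_SCORES[choice] for choice in pipeline.values())


def random_pipeline(rng):
    """Sample one option for every stage."""
    return {stage: rng.choice(options) for stage, options in SEARCH_SPACE.items()}


def mutate(pipeline, rng):
    """Return a copy with one randomly chosen stage re-sampled."""
    child = dict(pipeline)
    stage = rng.choice(list(SEARCH_SPACE))
    child[stage] = rng.choice(SEARCH_SPACE[stage])
    return child


def crossover(a, b, rng):
    """Build a child by taking each stage from one of the two parents."""
    return {stage: rng.choice([a[stage], b[stage]]) for stage in SEARCH_SPACE}


def evolve(generations=5, population_size=10, seed=42):
    """Run the toy search; return the best pipeline and the best score per generation."""
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: population_size // 2]  # the better half survives
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness), history


best, history = evolve()
print(best, history)
```

Because the better half of each generation survives unchanged, the best score in `history` never decreases from one generation to the next. The `generations` and `population_size` arguments play the same role here as the parameters of the same name in the TPOT examples later in this document.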

[Figure: An example machine learning pipeline]

Once TPOT has finished searching (or you get tired of waiting), it gives you the Python code for the best pipeline it found, so you can tinker with the pipeline from there.

[Figure: An example TPOT pipeline]

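TPOT is built on top of scikit-learn, so all of the code it generates should look familiar, at least if you already know scikit-learn.

TPOT is still under active development, and we encourage you to check back on this repository regularly for updates.

For further information about TPOT, please see the project documentation.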

License

Please see the repository license for the licensing and usage information for TPOT.

Generally, we have licensed TPOT to make it as widely usable as possible.

Installation

The TPOT installation instructions are maintained in the documentation. TPOT requires a working installation of Python.
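
As a quick sanity check before following the installation instructions, the sketch below reports the interpreter version and whether the `tpot` and `sklearn` packages can be found. It uses only the standard library and does not import TPOT itself. The function name `tpot_environment_report` is made up for this example, and which Python versions a given TPOT release supports is not stated here; consult the installation documentation for that.

```python
import importlib.util
import sys


def tpot_environment_report():
    """Summarize whether this interpreter can locate TPOT and scikit-learn."""
    return {
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
        # find_spec returns None when a top-level package is not installed.
        "tpot_installed": importlib.util.find_spec("tpot") is not None,
        "sklearn_installed": importlib.util.find_spec("sklearn") is not None,
    }


if __name__ == "__main__":
    for name, value in tpot_environment_report().items():
        print(f"{name}: {value}")
```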

Usage

TPOT can be used on the command line or from Python code.

The documentation has more detail on both ways of using TPOT.
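
For the command-line route, the sketch below assembles a `tpot` invocation as an argument list without running it. The flags (`-is`, `-target`, `-o`, `-g`, `-p`, `-s`, `-v`) are given as recalled from the TPOT command-line documentation and should be checked against `tpot --help` for your installed version. The data path, the target column name, and the helper `build_tpot_command` are assumptions made for illustration.

```python
import shlex


def build_tpot_command(data_path, target="class",
                       output="tpot_exported_pipeline.py",
                       generations=5, population_size=20, seed=42):
    """Assemble (but do not run) a tpot command line as a list of arguments."""
    return [
        "tpot", data_path,
        "-is", ",",                  # input separator of the data file
        "-target", target,           # name of the outcome column
        "-o", output,                # where to write the exported pipeline
        "-g", str(generations),      # number of generations
        "-p", str(population_size),  # population size
        "-s", str(seed),             # random seed
        "-v", "2",                   # verbosity level
    ]


# "data/mnist.csv" is a placeholder path, not a file shipped with this document.
command = build_tpot_command("data/mnist.csv")
print(shlex.join(command))
```

Once TPOT is installed and the data file exists, the list can be passed to `subprocess.run(command, check=True)`.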

Examples

Classification

Below is a minimal working example using the optical recognition of handwritten digits dataset.

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Load the digits dataset and hold out 25% of it for testing
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

# Search for a pipeline, score it on the held-out data, and export the code
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_digits_pipeline.py')

Running this code should discover a pipeline that achieves about 98% test accuracy. The corresponding Python code is exported to the file tpot_digits_pipeline.py and should look similar to the following:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import PolynomialFeatures
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: 0.9799428471757372
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    StackingEstimator(estimator=LogisticRegression(C=0.1, dual=False, penalty="l1")),
    RandomForestClassifier(bootstrap=True, criterion="entropy", max_features=0.35000000000000003, min_samples_leaf=20, min_samples_split=19, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
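
The exported script reads its data with `pd.read_csv(..., dtype=np.float64)` and drops a column named `target`, so the input file needs a header row, numeric values throughout, and an outcome column with exactly that name. Here is a minimal sketch of producing such a file with the standard library only. The rows, the feature column names, the file name, and the helper functions are all made up for illustration.

```python
import csv
import os
import tempfile

# Made-up rows: two numeric features plus the outcome column.
rows = [
    {"feature_0": 0.0, "feature_1": 1.5, "target": 0},
    {"feature_0": 2.0, "feature_1": 0.5, "target": 1},
    {"feature_0": 1.0, "feature_1": 3.0, "target": 0},
]


def write_training_csv(path, records, separator=","):
    """Write records to a delimited file with a header row."""
    fieldnames = list(records[0])
    if "target" not in fieldnames:
        raise ValueError("the outcome column must be named 'target'")
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter=separator)
        writer.writeheader()
        writer.writerows(records)


def read_header(path, separator=","):
    """Return the column names from the first line of the file."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle, delimiter=separator))


csv_path = os.path.join(tempfile.mkdtemp(), "tpot_input.csv")
write_training_csv(csv_path, rows)
header = read_header(csv_path)
print(header)
```

In the exported script, `'PATH/TO/DATA/FILE'` would then be replaced by the path to this file and `'COLUMN_SEPARATOR'` by the separator used when writing it.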

Regression

Similarly, TPOT can optimize pipelines for regression problems. Below is a minimal working example using the Boston housing prices practice dataset.

from tpot import TPOTRegressor
# NOTE: load_boston was removed in scikit-learn 1.2, so this example
# needs an older scikit-learn release to run as written.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Load the housing dataset and hold out 25% of it for testing
housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

# Search for a pipeline, score it on the held-out data, and export the code
tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')

This should yield a pipeline with a mean squared error (MSE) of about 12.77, and the Python code in tpot_boston_pipeline.py should look similar to the following:

import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: -10.812040755234403
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    ExtraTreesRegressor(bootstrap=False, max_features=0.5, min_samples_leaf=2, min_samples_split=3, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
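
The regression example is scored by mean squared error, and the CV score in the exported comment is negative. That sign follows scikit-learn's convention that scorers are maximized, so an error metric such as MSE is reported negated. A small worked sketch in plain Python, using made-up values rather than the housing data:

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and predictions, for illustration only.
actual = [24.0, 21.5, 30.0, 18.0]
predicted = [22.0, 23.5, 27.0, 19.0]

mse = mean_squared_error(actual, predicted)  # (4 + 4 + 9 + 1) / 4 = 4.5
neg_mse = -mse                               # the "greater is better" form: -4.5
print(mse, neg_mse)
```

Read this way, a CV score of about -10.81 corresponds to an MSE of about 10.81 on the training folds.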

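Check the documentation for more examples and tutorials.

Contributing to TPOT

We welcome you to check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to TPOT, please file a new issue so we can discuss it.

Before submitting any contributions, please review our contribution guidelines.

Having problems or have questions about TPOT?

Please check the existing open and closed issues to see whether your issue has already been addressed. If it has not, file a new issue on this repository so we can review it.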

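Citing TPOT

If you use TPOT in a scientific publication, please consider citing at least one of the following papers:

Trang T. Le, Weixuan Fu, and Jason H. Moore (2020). Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics 36(1): 250-256.

BibTeX entry: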

@article{le2020scaling,
  title={Scaling tree-based automated machine learning to biomedical big data with a feature set selector},
  author={Le, Trang T and Fu, Weixuan and Moore, Jason H},
  journal={Bioinformatics},
  volume={36},
  number={1},
  pages={250--256},
  year={2020},
  publisher={Oxford University Press}
}

Randal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews, Nicole A. Lavender, La Creis Kidd, and Jason H. Moore (2016). Automating biomedical data science through tree-based pipeline optimization. Applications of Evolutionary Computation, pages 123-137.

BibTeX entry:

@inbook{Olson2016EvoBio,
    author={Olson, Randal S. and Urbanowicz, Ryan J. and Andrews, Peter C. and Lavender, Nicole A. and Kidd, La Creis and Moore, Jason H.},
    editor={Squillero, Giovanni and Burelli, Paolo},
    chapter={Automating Biomedical Data Science Through Tree-Based Pipeline Optimization},
    title={Applications of Evolutionary Computation: 19th European Conference, EvoApplications 2016, Porto, Portugal, March 30 -- April 1, 2016, Proceedings, Part I},
    year={2016},
    publisher={Springer International Publishing},
    pages={123--137},
    isbn={978-3-319-31204-0},
    doi={10.1007/978-3-319-31204-0_9},
    url={http://dx.doi.org/10.1007/978-3-319-31204-0_9}
}

Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore (2016). Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. Proceedings of GECCO 2016, pages 485-492.

BibTeX entry:

@inproceedings{OlsonGECCO2016,
    author={Olson, Randal S. and Bartley, Nathan and Urbanowicz, Ryan J. and Moore, Jason H.},
    title={Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science},
    booktitle={Proceedings of the Genetic and Evolutionary Computation Conference 2016},
    series={GECCO '16},
    year={2016},
    isbn={978-1-4503-4206-3},
    location={Denver, Colorado, USA},
    pages={485--492},
    numpages={8},
    url={http://doi.acm.org/10.1145/2908812.2908918},
    doi={10.1145/2908812.2908918},
    acmid={2908918},
    publisher={ACM},
    address={New York, NY, USA},
}