TPOT

TPOT constructs multiple pipelines, each incorporating various data preprocessors, feature constructors, feature selectors, models, and hyperparameters. These pipelines form a population from which TPOT selects the one that delivers the best model performance.

Built on top of scikit-learn, TPOT leverages its built-in functions. It uses a genetic algorithm to search for the optimal pipeline, which makes the computational cost relatively high 😓, despite efforts to improve scalability through the Feature Set Selector (FSS) and Templates.

Regression with TPOT

Let's see how to use TPOT in a regression problem.

  • generations is the number of iterations to run the pipeline selection

  • population_size specifies the number of pipelines to retain in each generation

Both generations and population_size are parameters used in the genetic algorithm. Increasing their values will extend the time required to run the entire TPOT pipeline, but higher values don't necessarily guarantee better results.

  • config_dict allows you to choose among different TPOT configurations. Lady H. chose "TPOT light" so that only simpler, fast-running operators are used in the pipeline; otherwise TPOT takes even longer to run. You can try other configurations, such as

    • "Default TPOT" selects from a broad range of operators for the pipeline

    • "TPOT NN" adds neural network estimators on top of all the choices in "Default TPOT"

    • "TPOT cuML" supports GPU-accelerated search

    • And other choices

By default, TPOT uses k-fold cross validation, and at the end of the pipeline search the best pipeline is retrained on the entire training set, which is good practice.

Even though Lady H. used the light settings, the pipeline still took 4 hours, and the final performance was no better than what FLAML achieved in 5 minutes.

🌻 Look into TPOT regression experiment details >>

One benefit of TPOT is the saved .py file containing the code for the selected pipeline, so in the future you can simply run this Python file to reproduce the optimized results.
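The exported file is a standalone script whose contents depend on which pipeline TPOT actually selected, so the operators below (a scaler feeding a Ridge regressor) are purely illustrative. TPOT's real export reads your data from a CSV path you fill in; here synthetic data stands in so the sketch runs end to end:

```python
# Illustrative shape of a TPOT-exported pipeline script; the chosen
# operators here are a hypothetical example, not a real TPOT result.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# TPOT's actual export reads your data from a CSV path instead.
X, y = make_regression(n_samples=200, n_features=10, random_state=42)
training_features, testing_features, training_target, testing_target = \
    train_test_split(X, y, random_state=42)

# The exported script rebuilds the winning pipeline and refits it.
exported_pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
```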

Classification with TPOT

The classification data is much smaller, and it only took 61 seconds to finish the TPOT pipeline, with decent testing performance.

🌻 Look into TPOT classification experiment details >>

Although TPOT completed the classification task quickly with good performance, it's important to note that real-world datasets are often much larger than those used here (the classification dataset has 14 features, 340 records, and 30 classes; the regression dataset has 18 features and 693,861 records). In such cases, TPOT's speed might become a concern.
