TPOT
TPOT constructs multiple pipelines, each incorporating various data preprocessors, feature constructors, feature selectors, models, and hyperparameters. These pipelines form a population from which TPOT selects the one that delivers the best model performance.
Built on top of scikit-learn, TPOT leverages its built-in functions. It uses a genetic algorithm to select the optimal pipeline, which makes the computational cost relatively high, despite efforts to enhance scalability through the Feature Set Selector (FSS) and Templates.

Regression with TPOT
Let's see how to use TPOT in a regression problem.
- `generations` is the number of iterations to run the pipeline selection
- `population_size` specifies the number of pipelines to retain in each iteration
Both `generations` and `population_size` are parameters of the genetic algorithm. Increasing their values extends the time required to run the entire TPOT pipeline, but higher values don't necessarily guarantee better results.
- `config_dict` allows you to choose different TPOT configurations. In the code below, Lady H. chose "TPOT light" so that only simpler, fast-running operators are used in the pipeline; otherwise TPOT takes even longer to run. You can try other configurations, such as:
  - "Default TPOT" selects from a broad range of operators for the pipeline
  - "TPOT NN" adds neural network estimators on top of all the "Default TPOT" choices
  - "TPOT cuML" supports searching on GPU
  - And other choices
By default, TPOT uses k-fold cross validation, and at the end of the pipeline selection the best pipeline is refit on the entire training set, which is good practice.

In the code above, we can see that even though Lady H. used the light configuration, the pipeline still took 4 hours, and the final performance was no better than what FLAML achieved in 5 minutes.
💻 Look into TPOT regression experiment details >>
One of the benefits of using TPOT is the exported .py file, which contains the code for running the selected pipeline, so that in the future you can simply run this Python file to reproduce the optimized results.
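The exported file (written with `tpot.export("tpot_pipeline.py")`) is plain scikit-learn code with the winning pipeline hard-coded. A hedged sketch of what such a file typically contains follows; the specific dataset and pipeline here are hypothetical, not what TPOT actually selected in this experiment:

```python
# Hypothetical shape of a TPOT-exported pipeline file. The real exported
# file hard-codes whatever pipeline TPOT selected for your data, plus a
# placeholder for loading your dataset.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the data-loading placeholder in a real exported file
X, y = load_diabetes(return_X_y=True)
training_features, testing_features, training_target, testing_target = \
    train_test_split(X, y, random_state=42)

# The "winning" pipeline (illustrative choice of operators)
exported_pipeline = make_pipeline(
    StandardScaler(),
    ExtraTreesRegressor(n_estimators=100, max_features=0.5, random_state=42),
)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
```

Because the file depends only on scikit-learn, reproducing the optimized pipeline later does not require rerunning the expensive genetic search.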

Classification with TPOT
The classification dataset is much smaller, so the TPOT pipeline finished in only 61 seconds, with decent testing performance.


💻 Look into TPOT classification experiment details >>
Although TPOT completed the classification task quickly with good performance, it's important to note that real-world datasets are often much larger than those used here (the classification dataset has 14 features, 340 records, and 30 classes; the regression dataset has 18 features and 693,861 records). In such cases, TPOT's speed might become a concern.