Hyperparameter Optimization (HPO)
To thrive in winter, Lotus Queen had optimized her genes over centuries, allowing her body to adapt perfectly to the cold weather. Similarly, in Applied Data Science, we can fine-tune a model's hyperparameters to achieve the best performance. This process is known as "Hyperparameter Optimization" (HPO).
In this chapter, we will apply the latest HPO technology in classical classification & regression as well as deep learning problems, using FLAML, Optuna and Keras Tuner.
FLAML vs Optuna - HPO for Classical Machine Learning
Optuna is a hyperparameter optimization framework published in 2019. It was designed to improve the cost-effectiveness of HPO with an efficient parameter search strategy and a pruning algorithm. As of 2024, it has become a mature tool that supports many machine learning and deep learning frameworks.
FLAML is an automated HPO library published in 2021. Powered by its own parameter search algorithms, it aims to free users from selecting learners and hyperparameters while delivering fast and economical HPO results.
While working on some garden businesses, Lady H. has experimented with these tools. Let's look at a comparison between FLAML and Optuna through her experiments:

Leaves30 has 14 features and 340 records in total, with 30 different specimens to classify; it is a typical multi-class classification dataset.
Sales data has 18 features and 693,861 records in total; it is used to forecast sales, a regression problem.
In the classification problem, Lady H. used balanced accuracy to measure the percentage of correctly predicted specimens in the testing data; the closer its value is to 1, the better the model performance. In the regression problem, R2 (r-squared) was used to measure how close the forecasted sales are to the real sales; again, the closer its value is to 1, the better the model performance. Meanwhile, computational efficiency is an important metric too.
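Both metrics are available in scikit-learn. Below is a minimal sketch with hypothetical values, just to show how they are computed:

```python
from sklearn.metrics import balanced_accuracy_score, r2_score

# Hypothetical classification labels: true vs. predicted specimens
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]
print(balanced_accuracy_score(y_true, y_pred))   # closer to 1 = better

# Hypothetical regression values: real vs. forecasted sales
sales_true = [120.0, 95.5, 210.3, 87.0]
sales_pred = [118.2, 99.0, 205.7, 90.1]
print(r2_score(sales_true, sales_pred))          # closer to 1 = better
```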
Table 1.1 shows Lady H.'s experiment results. Comparing the baseline model vs FLAML vs Optuna, we can see FLAML has an overall better performance in both classification and regression. Now let's look into the details.
The Baseline Model
The baseline model provides a bottom-line performance to compare against. In Lady H.'s experiments, she used LightGBM (LGBM) with default settings.
LGBM is an ensemble model that is widely used in industry and in data science competitions; it has proven to be an excellent estimator in both model performance and computational efficiency.
Another benefit of choosing LGBM is the effort saved in data preprocessing:
Numerical features on different scales do not need to be scaled, because, as a type of tree model, LGBM is not sensitive to differences in feature scale.
LGBM handles missing values automatically by allocating them to whichever side of a split reduces the loss most.
LGBM offers good accuracy with integer-coded categorical features. Users only need to convert these columns to the "category" data type in Python pandas, as shown in the sketch after this list.
LGBM is a non-parametric method that doesn't make assumptions about the data, so preprocessing steps such as data normalization or reducing feature correlation are not required either.
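Here is a minimal sketch of the "category" trick mentioned above, using a hypothetical toy DataFrame (the column names are illustrative only):

```python
import lightgbm as lgb
import pandas as pd

# Hypothetical toy data with an integer-coded categorical column
df = pd.DataFrame({
    "leaf_width": [2.1, 3.4, 1.8, 2.9, 3.0, 1.5],
    "specimen_family": [0, 1, 0, 2, 1, 2],   # integer codes
    "label": [0, 1, 0, 1, 1, 0],
})
# The only preprocessing needed: mark the column as "category"
df["specimen_family"] = df["specimen_family"].astype("category")

X, y = df.drop(columns="label"), df["label"]
model = lgb.LGBMClassifier()   # default settings, as in the baseline
model.fit(X, y)                # LGBM treats the marked column as categorical
```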
The baseline performance is the average balanced accuracy across cross validation (CV) folds. By using cross validation, we can observe the performance of each fold as well as the performance variance among folds. Because of this variance, we average all folds' results as the final performance, to give a less biased view.
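Here is a minimal sketch of this procedure, using a synthetic stand-in shaped like Leaves30 rather than the real data (Lady H.'s actual results follow below):

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in shaped like Leaves30: 340 records, 14 features, 30 classes
X, y = make_classification(
    n_samples=340, n_features=14, n_informative=10,
    n_classes=30, n_clusters_per_class=1, random_state=0,
)

model = lgb.LGBMClassifier()   # default settings = the baseline model
scores = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy")
print(scores)                  # performance of each fold
print(scores.mean())           # averaged as the final baseline performance
```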
Leaves30 has a small amount of data, so 5-fold CV is used here:

💻 Look into Leaves30 Baseline details >>
Sales data is large enough to use 10-fold CV:

💻 Look into Sales Baseline details >>
Design Overview - FLAML
The overall design of FLAML is shown in Figure 1.2:

It has 2 major components:
ML Layer contains the candidate learners, such as XGBoost, LightGBM, etc.
AutoML Layer includes a Resampling Strategy Proposer, a Learner Proposer, a Hyperparam & Sample Size Proposer and a Controller. This layer controls the core logic of the search strategy, with the goal of minimizing the total cost before finding a model with the optimal test error.
"Total Cost" means the total CPU time of training and validation using cross validation or holdout. This cost is also expected to increase as the test error decreases.
Now let's look into each step:
Step 1 - Resampling Strategy: It's a simple thresholding rule to choose between cross validation and holdout. According to FLAML researchers, cross validation is preferred over holdout for a small sample size or a large time budget.
Step 2 - Learner Proposer: A learner gets a higher priority if it makes improvement at a lower estimated cost. Meanwhile, every learner has a chance to be searched again, since the estimation can be imprecise.
Step 3 - Hyperparam & Sample Size Proposer: In this step, each learner chooses between increasing the sample size and trying out a new parameter set in order to make an improvement. By default, each new parameter set is searched by a randomized direct search strategy, CFO. You will see details soon.
Step 4 - Controller: The controller will invoke the parameter tuning trials using the selected learner and observe both the validation error and the CPU time cost of each trial.
Steps 2 ~ 4 are repeated iteratively until the time budget runs out.
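To make these steps concrete, here is a minimal sketch of FLAML's AutoML API on synthetic data; note it uses FLAML's built-in "accuracy" metric as a stand-in for the balanced accuracy in Lady H.'s experiments:

```python
from flaml import AutoML
from sklearn.datasets import make_classification

# Hypothetical training data
X_train, y_train = make_classification(n_samples=340, n_features=14,
                                       n_informative=10, random_state=0)

automl = AutoML()
automl.fit(
    X_train, y_train,
    task="classification",     # or "regression" for the sales data
    metric="accuracy",
    time_budget=60,            # total budget in seconds
    eval_method="cv",          # resampling strategy (Step 1)
    n_splits=5,
)
print(automl.best_estimator)   # learner picked by the proposer (Step 2)
print(automl.best_config)      # its best hyperparameters (Steps 3 ~ 4)
```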
💻 Learn more from FLAML paper >>
Design Overview - Optuna
The overall design of Optuna is shown as Figure 1.3:

Optuna introduced the define-by-run paradigm to HPO in 2019. The main idea behind define-by-run is that a user can rely on Optuna to decide the hyperparameter values in each trial dynamically (while the program is running), without explicitly defining everything in advance. There are different ways to dynamically construct the parameter search space; Optuna's approach is based on the results of historically evaluated trials. Meanwhile, Optuna provides a highly modularized programming style in which a user-defined objective function receives a live trial as input and evaluates the trial's result, which also enables the parallel computation of multiple trials.
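Here is a minimal define-by-run sketch; the parameter names and ranges are illustrative, and a placeholder expression stands in for real model training:

```python
import optuna

def objective(trial):
    # Hyperparameter values are decided dynamically, per trial
    num_leaves = trial.suggest_int("num_leaves", 8, 256, log=True)
    learning_rate = trial.suggest_float("learning_rate", 1e-3, 0.3, log=True)
    # ... train a model with these values and return a validation score ...
    return learning_rate * num_leaves / 256   # placeholder objective

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```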

Optuna's sampling algorithms serve as its search strategy, supporting both independent sampling (such as TPE) and relational sampling (such as CMA-ES). Independent sampling samples each hyperparameter independently, while relational sampling exploits the correlations between hyperparameters. To achieve cost-effectiveness, Optuna also provides pruning algorithms that terminate unpromising trials based on periodically monitored intermediate objective values.
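A minimal pruning sketch, where a stand-in loop plays the role of a real training loop:

```python
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    score = 0.0
    for step in range(100):
        score += lr                    # stand-in for one training epoch
        trial.report(score, step)      # periodically monitored intermediate value
        if trial.should_prune():       # pruner judges from the history
            raise optuna.TrialPruned() # terminate the unpromising trial
    return score

study = optuna.create_study(
    direction="maximize",
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=10),
)
study.optimize(objective, n_trials=20)
```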
As we can see in Figure 1.3, each Optuna worker executes an instance of the objective function, as well as the sampling and pruning algorithms of a study. This type of design is suitable for distributed environments where workers run in parallel. Furthermore, workers share the progress of the current study via the storage, and an objective function can access the storage to get information about past studies.
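A minimal sketch of sharing a study via storage; the study name and the "sqlite:///hpo.db" URL are illustrative. Running the same script in several processes makes each process join the same study as a worker:

```python
import optuna

study = optuna.create_study(
    study_name="shared_hpo",
    storage="sqlite:///hpo.db",   # workers read/write progress here
    load_if_exists=True,          # attach to the study if it already exists
)
# Toy objective: minimize x^2
study.optimize(lambda trial: trial.suggest_float("x", -10, 10) ** 2,
               n_trials=20)
print(study.best_value)
```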
💻 Learn more from Optuna paper >>
Hyperopt is another popular tool for hyperparameter tuning, but Lady H. prefers Optuna and didn't include Hyperopt in the experiments.

Design Overview - Summary
Table 1.2 compares and summarizes the designs of FLAML and Optuna:
While they share several common strengths, FLAML is designed to be more automated in optimization. The main differences in their core algorithms are that FLAML makes decisions based on estimated evaluations while Optuna relies on historical evaluations, and FLAML's search appears to be more efficient in time complexity.

Now time to show you Lady H.'s experiments with FLAML and Optuna!