Most hyperparameter tuning is done wrong. Not because the search algorithm is hard, but because the evaluation is. You run 100 trials, pick the configuration with the best validation RMSE, and call it your improvement. The problem: you just optimized for that validation score. It’s no longer a neutral measurement. The only honest number left is a holdout set you never touched during tuning.
XGBoost’s defaults are strong, but they’re not universally optimal. A good tuning process can deliver real gains on structured data. A careless one delivers false confidence.
This guide focuses on doing it right:
- a sensible search space (6-8 high-impact parameters, not everything at once)
- fast iteration with early stopping and pruning
- reproducible experiments with seeds and persistent storage
- honest evaluation on a holdout that’s never part of the tuning loop
The Core Idea (in 60 seconds)
Model training and hyperparameter tuning are different optimization problems.
- Training learns model parameters (weights and splits) by minimizing loss on training data.
- Tuning chooses hyperparameters (such as `learning_rate`, `max_depth`, `subsample`) using validation performance.
- Validation performance is a black-box objective: expensive, noisy, and often non-differentiable.
That is why Optuna works well. It learns from past trials and proposes better next trials, instead of blindly enumerating combinations.
A Practical Tuning Recipe
If you only keep one workflow, keep this one:
- Build a baseline and record its metric.
- Tune 6-8 high-impact parameters first.
- Use log-scale sampling where magnitude spans orders of magnitude.
- Use `early_stopping_rounds` inside each trial.
- Add Optuna pruning to stop weak trials early.
- Keep an untouched holdout for final model selection.
End-to-End Example (Iowa Liquor Sales)
1) Load data and create a leak-safe split
```python
from pathlib import Path

import numpy as np
import optuna
import polars as pl

# If this import fails, run: pip install "optuna-integration[xgboost]"
from optuna.integration import XGBoostPruningCallback
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from xgboost import XGBRegressor

RANDOM_SEED = 42
TARGET = "sale_dollars"
CATEGORICAL = ["county", "city", "category_name", "day_of_week"]
NUMERIC = [
    "store_number",
    "vendor_number",
    "item_number",
    "bottles_sold",
    "volume_sold_liters",
    "day_of_week_idx",
    "sale_month",
    "week_of_year",
]
FEATURES = CATEGORICAL + NUMERIC

DATA_DIR = Path("data/artifacts/website_datasets")
dev_df = pl.read_parquet(DATA_DIR / "dev_sample.parquet").to_pandas()
holdout_df = pl.read_parquet(DATA_DIR / "validation_holdout.parquet").to_pandas()

# Ensure stable string categories before encoding
for col in CATEGORICAL:
    dev_df[col] = dev_df[col].astype(str)
    holdout_df[col] = holdout_df[col].astype(str)

# Tuning split (train/validation) from development data only
train_df, val_df = train_test_split(
    dev_df,
    test_size=0.2,
    random_state=RANDOM_SEED,
)


def make_preprocessor() -> ColumnTransformer:
    return ColumnTransformer(
        transformers=[
            (
                "cat",
                OrdinalEncoder(
                    handle_unknown="use_encoded_value",
                    unknown_value=-1,
                ),
                CATEGORICAL,
            ),
            ("num", "passthrough", NUMERIC),
        ],
        sparse_threshold=0.0,
    )


def rmse(y_true, y_pred) -> float:
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


preprocessor = make_preprocessor()
X_train = preprocessor.fit_transform(train_df[FEATURES]).astype(np.float32)
X_val = preprocessor.transform(val_df[FEATURES]).astype(np.float32)
y_train = train_df[TARGET].to_numpy(np.float32)
y_val = val_df[TARGET].to_numpy(np.float32)
```
2) Create a baseline first
```python
baseline = XGBRegressor(
    objective="reg:squarederror",
    eval_metric="rmse",
    tree_method="hist",
    random_state=RANDOM_SEED,
    n_estimators=3000,
    learning_rate=0.1,
    max_depth=6,
    n_jobs=-1,
    early_stopping_rounds=100,
)
baseline.fit(
    X_train,
    y_train,
    eval_set=[(X_val, y_val)],
    verbose=False,
)
best_it = baseline.best_iteration
baseline_iterations = int(best_it) + 1 if best_it is not None else 300
baseline_val_rmse = rmse(y_val, baseline.predict(X_val))
print(f"Baseline validation RMSE: {baseline_val_rmse:,.2f}")
```
This baseline tells you whether tuning is helping, and by how much.
3) Define a search space that makes sense
Start with high-impact parameters only:
| Parameter | Range | Sampling | Why |
|---|---|---|---|
| `learning_rate` | 1e-3 to 0.2 | log | Most sensitive; spans orders of magnitude |
| `max_depth` | 3 to 12 | int | Controls tree complexity |
| `min_child_weight` | 1e-2 to 20 | log | Regularizes split creation |
| `subsample` | 0.6 to 1.0 | uniform | Row sampling for variance reduction |
| `colsample_bytree` | 0.6 to 1.0 | uniform | Feature sampling for robustness |
| `reg_alpha` | 1e-8 to 10 | log | L1 regularization |
| `reg_lambda` | 1e-8 to 10 | log | L2 regularization |
| `gamma` | 0 to 10 | uniform | Minimum gain required to split |
4) Write an Optuna objective with pruning
```python
def objective(trial: optuna.Trial) -> float:
    params = {
        "objective": "reg:squarederror",
        "eval_metric": "rmse",
        "tree_method": "hist",
        "random_state": RANDOM_SEED,
        "n_estimators": 3000,  # intentionally high, controlled by early stopping
        "n_jobs": -1,
        "early_stopping_rounds": 100,
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.2, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "min_child_weight": trial.suggest_float("min_child_weight", 1e-2, 20.0, log=True),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-8, 10.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-8, 10.0, log=True),
        "gamma": trial.suggest_float("gamma", 0.0, 10.0),
    }
    pruning_callback = XGBoostPruningCallback(trial, "validation_0-rmse")
    # Recent XGBoost versions take callbacks in the constructor rather than fit()
    model = XGBRegressor(**params, callbacks=[pruning_callback])
    model.fit(
        X_train,
        y_train,
        eval_set=[(X_val, y_val)],
        verbose=False,
    )
    preds = model.predict(X_val)
    score = rmse(y_val, preds)
    # Save metadata useful for final retraining
    best_iteration = model.best_iteration if model.best_iteration is not None else params["n_estimators"] - 1
    trial.set_user_attr("best_iteration", int(best_iteration))
    return score
```
5) Run the study (and make it resumable)
```python
study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(
        seed=RANDOM_SEED,
        n_startup_trials=20,
    ),
    pruner=optuna.pruners.MedianPruner(
        n_startup_trials=10,
        n_warmup_steps=50,
    ),
    study_name="xgboost_liquor_sales",
    storage="sqlite:///xgboost_optuna_study.db",
    load_if_exists=True,
)
study.optimize(
    objective,
    n_trials=120,
    timeout=60 * 60,  # 1 hour budget
    show_progress_bar=True,
)
print(f"Best trial: {study.best_trial.number}")
print(f"Best validation RMSE: {study.best_value:,.2f}")
print("Best parameters:")
for k, v in study.best_trial.params.items():
    print(f"  {k}: {v}")
```
`load_if_exists=True` makes interruptions much less painful. Stop and resume whenever you need.
6) Refit on all dev data, then evaluate once on holdout
Now retrain a final model with the best hyperparameters and evaluate exactly once on untouched holdout data.
```python
# Refit preprocessor on all dev data (still no holdout fitting)
final_preprocessor = make_preprocessor()
X_dev = final_preprocessor.fit_transform(dev_df[FEATURES]).astype(np.float32)
X_holdout = final_preprocessor.transform(holdout_df[FEATURES]).astype(np.float32)
y_dev = dev_df[TARGET].to_numpy(np.float32)
y_holdout = holdout_df[TARGET].to_numpy(np.float32)

# Baseline model retrained on all dev data
baseline_final = XGBRegressor(
    objective="reg:squarederror",
    eval_metric="rmse",
    tree_method="hist",
    random_state=RANDOM_SEED,
    n_estimators=baseline_iterations,
    learning_rate=0.1,
    max_depth=6,
    n_jobs=-1,
)
baseline_final.fit(X_dev, y_dev, verbose=False)
baseline_holdout_rmse = rmse(y_holdout, baseline_final.predict(X_holdout))

# Tuned model retrained on all dev data
best_params = study.best_trial.params.copy()
best_iteration = int(study.best_trial.user_attrs.get("best_iteration", 300))
best_params.update(
    {
        "objective": "reg:squarederror",
        "eval_metric": "rmse",
        "tree_method": "hist",
        "random_state": RANDOM_SEED,
        "n_estimators": best_iteration + 1,
        "n_jobs": -1,
    }
)
tuned_final = XGBRegressor(**best_params)
tuned_final.fit(X_dev, y_dev, verbose=False)
tuned_holdout_rmse = rmse(y_holdout, tuned_final.predict(X_holdout))

improvement = (baseline_holdout_rmse - tuned_holdout_rmse) / baseline_holdout_rmse * 100
print(f"Baseline holdout RMSE: {baseline_holdout_rmse:,.2f}")
print(f"Tuned holdout RMSE: {tuned_holdout_rmse:,.2f}")
print(f"Relative improvement: {improvement:.2f}%")
```
Why This Workflow Works
- TPE learns from prior trials, so your search becomes smarter over time.
- `early_stopping_rounds` prevents wasting trees inside a trial.
- Pruning prevents wasting full trials on weak configurations.
- Persistent storage makes runs resumable and auditable.
- An untouched holdout protects you from “winning” validation while failing in production.
Common Mistakes (and Better Defaults)
- Tuning too many parameters at once -> start with 6-8 high-impact parameters.
- Using linear sampling for `learning_rate` -> use `log=True`.
- No baseline -> always measure uplift versus defaults.
- No holdout -> keep one dataset split untouched until the end.
- Over-optimizing tiny gains -> stop when the curve plateaus and complexity grows.
Quick FAQ
How many trials should I run?
Start with 50-150 for 6-8 parameters. Increase only if your best score is still improving late in the run.
Should I use cross-validation?
Use CV when data is small or noisy and you need robust estimates. For larger datasets, a strong holdout strategy is often enough.
Can I run this in parallel?
Yes. Point all workers at the same Optuna storage (SQLite for local, PostgreSQL for teams) and run `study.optimize(...)` from multiple processes.
Do I need pruning?
Not always, but it is usually a major speedup when some trials clearly underperform early.
Final Takeaways
When the code runs, you get three numbers: baseline holdout RMSE, tuned holdout RMSE, and relative improvement. That final number is the only one worth reporting. Everything else - validation scores, trial history, best parameters - is scaffolding. The holdout is the verdict.
If the improvement is meaningful, you have a better model and the proof to show for it. If it isn’t, you’ve learned something equally useful: the defaults were already close to optimal for this dataset, and you can ship sooner.
- XGBoost + Optuna remains a top-tier combo for tabular ML.
- Search space design matters more than trying every possible parameter.
- Early stopping + pruning gives faster, cheaper experimentation.
- Reproducibility (seed, storage, consistent splits) is non-negotiable.
- The real win is not just lower RMSE, but a workflow you can trust and repeat.