Most hyperparameter tuning is done wrong. Not because the search algorithm is hard, but because the evaluation is. You run 100 trials, pick the configuration with the best validation RMSE, and call it your improvement. The problem: you just optimized for that validation score. It’s no longer a neutral measurement. The only honest number left is a holdout set you never touched during tuning.
XGBoost’s defaults are strong, but they’re not universally optimal. A good tuning process can deliver real gains on structured data. A careless one delivers false confidence.
This guide focuses on doing it right:
- a sensible search space (6-8 high-impact parameters, not everything at once)
- fast iteration with early stopping and pruning
- reproducible experiments with seeds and persistent storage
- honest evaluation on a holdout that’s never part of the tuning loop
The Core Idea (in 60 seconds)
Model training and hyperparameter tuning are different optimization problems.
- Training learns model parameters (weights and splits) by minimizing loss on training data.
- Tuning chooses hyperparameters (such as `learning_rate`, `max_depth`, `subsample`) using validation performance.
- Validation performance is a black-box objective: expensive, noisy, and often non-differentiable.
That is why Optuna works well. It learns from past trials and proposes better next trials, instead of blindly enumerating combinations.
A Practical Tuning Recipe
If you only keep one workflow, keep this one:
- Build a baseline and record its metric.
- Tune 6-8 high-impact parameters first.
- Use log-scale sampling where magnitude spans orders of magnitude.
- Use `early_stopping_rounds` inside each trial.
- Add Optuna pruning to stop weak trials early.
- Keep an untouched holdout for final model selection.
End-to-End Example (Iowa Liquor Sales)
1) Load data and create a leak-safe split
```python
from pathlib import Path

import numpy as np
import optuna
import polars as pl

# If this import fails, run: pip install "optuna-integration[xgboost]"
from optuna.integration import XGBoostPruningCallback
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from xgboost import XGBRegressor

RANDOM_SEED = 42
TARGET = "sale_dollars"
CATEGORICAL = ["county", "city", "category_name", "day_of_week"]
NUMERIC = [
    "store_number",
    "vendor_number",
    "item_number",
    "bottles_sold",
    "volume_sold_liters",
    "day_of_week_idx",
    "sale_month",
    "week_of_year",
]
FEATURES = CATEGORICAL + NUMERIC

DATA_DIR = Path("data/artifacts/website_datasets")
dev_df = pl.read_parquet(DATA_DIR / "dev_sample.parquet").to_pandas()
holdout_df = pl.read_parquet(DATA_DIR / "validation_holdout.parquet").to_pandas()

# Ensure stable string categories before encoding
for col in CATEGORICAL:
    dev_df[col] = dev_df[col].astype(str)
    holdout_df[col] = holdout_df[col].astype(str)

# Tuning split (train/validation) from development data only
train_df, val_df = train_test_split(
    dev_df,
    test_size=0.2,
    random_state=RANDOM_SEED,
)


def make_preprocessor() -> ColumnTransformer:
    return ColumnTransformer(
        transformers=[
            (
                "cat",
                OrdinalEncoder(
                    handle_unknown="use_encoded_value",
                    unknown_value=-1,
                ),
                CATEGORICAL,
            ),
            ("num", "passthrough", NUMERIC),
        ],
        sparse_threshold=0.0,
    )


def rmse(y_true, y_pred) -> float:
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


preprocessor = make_preprocessor()
X_train = preprocessor.fit_transform(train_df[FEATURES]).astype(np.float32)
X_val = preprocessor.transform(val_df[FEATURES]).astype(np.float32)
y_train = train_df[TARGET].to_numpy(np.float32)
y_val = val_df[TARGET].to_numpy(np.float32)
```
2) Create a baseline first
```python
baseline = XGBRegressor(
    objective="reg:squarederror",
    eval_metric="rmse",
    tree_method="hist",
    random_state=RANDOM_SEED,
    n_estimators=3000,
    learning_rate=0.1,
    max_depth=6,
    n_jobs=-1,
    early_stopping_rounds=100,
)
baseline.fit(
    X_train,
    y_train,
    eval_set=[(X_val, y_val)],
    verbose=False,
)
best_it = baseline.best_iteration
baseline_iterations = int(best_it) + 1 if best_it is not None else 300
baseline_val_rmse = rmse(y_val, baseline.predict(X_val))
print(f"Baseline validation RMSE: {baseline_val_rmse:,.2f}")
```
This baseline tells you whether tuning is helping, and by how much.
3) Define a search space that makes sense
Start with high-impact parameters only:
| Parameter | Range | Sampling | Why |
|---|---|---|---|
| `learning_rate` | 1e-3 to 0.2 | log | Most sensitive; spans orders of magnitude |
| `max_depth` | 3 to 12 | int | Controls tree complexity |
| `min_child_weight` | 1e-2 to 20 | log | Regularizes split creation |
| `subsample` | 0.6 to 1.0 | uniform | Row sampling for variance reduction |
| `colsample_bytree` | 0.6 to 1.0 | uniform | Feature sampling for robustness |
| `reg_alpha` | 1e-8 to 10 | log | L1 regularization |
| `reg_lambda` | 1e-8 to 10 | log | L2 regularization |
| `gamma` | 0 to 10 | uniform | Minimum gain required to split |
4) Write an Optuna objective with pruning
```python
def objective(trial: optuna.Trial) -> float:
    params = {
        "objective": "reg:squarederror",
        "eval_metric": "rmse",
        "tree_method": "hist",
        "random_state": RANDOM_SEED,
        "n_estimators": 3000,  # intentionally high, controlled by early stopping
        "n_jobs": -1,
        "early_stopping_rounds": 100,
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.2, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "min_child_weight": trial.suggest_float("min_child_weight", 1e-2, 20.0, log=True),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-8, 10.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-8, 10.0, log=True),
        "gamma": trial.suggest_float("gamma", 0.0, 10.0),
    }
    pruning_callback = XGBoostPruningCallback(trial, "validation_0-rmse")
    # Recent XGBoost versions take callbacks in the constructor rather than fit()
    model = XGBRegressor(**params, callbacks=[pruning_callback])
    model.fit(
        X_train,
        y_train,
        eval_set=[(X_val, y_val)],
        verbose=False,
    )
    preds = model.predict(X_val)
    score = rmse(y_val, preds)
    # Save metadata useful for final retraining
    best_iteration = model.best_iteration if model.best_iteration is not None else params["n_estimators"] - 1
    trial.set_user_attr("best_iteration", int(best_iteration))
    return score
```
5) Run the study (and make it resumable)
```python
study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(
        seed=RANDOM_SEED,
        n_startup_trials=20,
    ),
    pruner=optuna.pruners.MedianPruner(
        n_startup_trials=10,
        n_warmup_steps=50,
    ),
    study_name="xgboost_liquor_sales",
    storage="sqlite:///xgboost_optuna_study.db",
    load_if_exists=True,
)
study.optimize(
    objective,
    n_trials=120,
    timeout=60 * 60,  # 1 hour budget
    show_progress_bar=True,
)
print(f"Best trial: {study.best_trial.number}")
print(f"Best validation RMSE: {study.best_value:,.2f}")
print("Best parameters:")
for k, v in study.best_trial.params.items():
    print(f"  {k}: {v}")
```
`load_if_exists=True` makes interruptions much less painful. Stop and resume whenever you need.
6) Refit on all dev data, then evaluate once on holdout
Now retrain a final model with the best hyperparameters and evaluate exactly once on untouched holdout data.
```python
# Refit preprocessor on all dev data (still no holdout fitting)
final_preprocessor = make_preprocessor()
X_dev = final_preprocessor.fit_transform(dev_df[FEATURES]).astype(np.float32)
X_holdout = final_preprocessor.transform(holdout_df[FEATURES]).astype(np.float32)
y_dev = dev_df[TARGET].to_numpy(np.float32)
y_holdout = holdout_df[TARGET].to_numpy(np.float32)

# Baseline model retrained on all dev data
baseline_final = XGBRegressor(
    objective="reg:squarederror",
    eval_metric="rmse",
    tree_method="hist",
    random_state=RANDOM_SEED,
    n_estimators=baseline_iterations,
    learning_rate=0.1,
    max_depth=6,
    n_jobs=-1,
)
baseline_final.fit(X_dev, y_dev, verbose=False)
baseline_holdout_rmse = rmse(y_holdout, baseline_final.predict(X_holdout))

# Tuned model retrained on all dev data
best_params = study.best_trial.params.copy()
best_iteration = int(study.best_trial.user_attrs.get("best_iteration", 300))
best_params.update(
    {
        "objective": "reg:squarederror",
        "eval_metric": "rmse",
        "tree_method": "hist",
        "random_state": RANDOM_SEED,
        "n_estimators": best_iteration + 1,
        "n_jobs": -1,
    }
)
tuned_final = XGBRegressor(**best_params)
tuned_final.fit(X_dev, y_dev, verbose=False)
tuned_holdout_rmse = rmse(y_holdout, tuned_final.predict(X_holdout))

improvement = (baseline_holdout_rmse - tuned_holdout_rmse) / baseline_holdout_rmse * 100
print(f"Baseline holdout RMSE: {baseline_holdout_rmse:,.2f}")
print(f"Tuned holdout RMSE: {tuned_holdout_rmse:,.2f}")
print(f"Relative improvement: {improvement:.2f}%")
```
Why This Workflow Works
- TPE learns from prior trials, so your search becomes smarter over time.
- `early_stopping_rounds` prevents wasting trees inside a trial.
- Pruning prevents wasting full trials on weak configurations.
- Persistent storage makes runs resumable and auditable.
- An untouched holdout protects you from “winning” validation while failing in production.
Common Mistakes (and Better Defaults)
- Tuning too many parameters at once -> start with 6-8 high-impact parameters.
- Using linear sampling for `learning_rate` -> use `log=True`.
- No baseline -> always measure uplift versus defaults.
- No holdout -> keep one dataset split untouched until the end.
- Over-optimizing tiny gains -> stop when the curve plateaus and complexity grows.
Quick FAQ
How many trials should I run?
Start with 50-150 for 6-8 parameters. Increase only if your best score is still improving late in the run.
Should I use cross-validation?
Use CV when data is small or noisy and you need robust estimates. For larger datasets, a strong holdout strategy is often enough.
Can I run this in parallel?
Yes. Point all workers at the same Optuna storage (SQLite for local, PostgreSQL for teams) and run `study.optimize(...)` from multiple processes.
Do I need pruning?
Not always, but it is usually a major speedup when some trials clearly underperform early.
Final Takeaways
When the code runs, you get three numbers: baseline holdout RMSE, tuned holdout RMSE, and relative improvement. That final number is the only one worth reporting. Everything else - validation scores, trial history, best parameters - is scaffolding. The holdout is the verdict.
If the improvement is meaningful, you have a better model and the proof to show for it. If it isn’t, you’ve learned something equally useful: the defaults were already close to optimal for this dataset, and you can ship sooner.
- XGBoost + Optuna remains a top-tier combo for tabular ML.
- Search space design matters more than trying every possible parameter.
- Early stopping + pruning gives faster, cheaper experimentation.
- Reproducibility (seed, storage, consistent splits) is non-negotiable.
- The real win is not just lower RMSE, but a workflow you can trust and repeat.