S6E5 Driver's High: Driver Feature Engineering

Kaggle notebook link:
S6E5 Driver’s High: Driver Feature Engineering

The central question of this notebook is narrow:

Does Driver contain measurable signal after race-state, tyre-state, compound, and stint information are already modeled?

In a pit-stop prediction problem, Driver is a tempting feature because it looks like a real strategic actor. But statistically, it is also a high-cardinality categorical variable. That means a strong result must distinguish between actual reusable signal and category memorization.

The notebook therefore treats Driver feature engineering as a sequence of controlled experiments:

\[\Delta_{\text{AUC}}(B) = \operatorname{AUC}\!\left(\hat p_{\text{base}+B}^{\text{oof}}, y\right) - \operatorname{AUC}\!\left(\hat p_{\text{counterfactual}}^{\text{oof}}, y\right)\]

where each candidate block \(B\) is promoted only when it improves on the correct counterfactual under the same OOF protocol.

1. The Modeling Question

The target is the probability of a pit stop on the next lap:

\[y_i = \begin{cases} 1, & \text{pit next lap} \\ 0, & \text{otherwise} \end{cases}\]

The model produces:

\[\hat p_i = P(y_i = 1 \mid x_i)\]

The score is ROC-AUC, which can be interpreted as a ranking probability:

\[\operatorname{AUC} = P(\hat p^+ > \hat p^-) + \frac{1}{2}P(\hat p^+ = \hat p^-)\]

So the Driver question is not whether Driver changes the predicted probability for some rows. The question is whether Driver improves the relative ranking of positive and negative rows.
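To make the ranking interpretation concrete, here is a minimal sketch that computes AUC directly from the pairwise definition above. The function name `pairwise_auc` and the toy labels and scores are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as P(score+ > score-) + 0.5 * P(score+ == score-), by brute force."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # One entry per (positive, negative) pair.
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# Toy data (hypothetical): 4 positives, 4 negatives, one tied pair at 0.40.
y = np.array([0, 0, 1, 0, 1, 1, 0, 1])
p = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.40, 0.05, 0.90])

auc = pairwise_auc(y, p)
# A strictly increasing transform changes the probabilities but not the ranking.
auc_after_transform = pairwise_auc(y, p ** 3)
```

Because only the ordering of scores matters, a Driver feature that shifts every prediction in the same direction cannot move the score; it has to reorder at least one positive/negative pair.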

| Layer | Role | Statistical Risk |
| --- | --- | --- |
| 🧱 Race-state foundation | lap, progress, stint, tyre age, degradation | underfitting if the physics-like state is too thin |
| 🛞 Compound and tyre pressure | tyre age relative to expected compound life | wrong scale if compound effects are treated as raw IDs only |
| 🧑‍✈️ Driver representation | string structure, frequency, target prior, interactions | high-cardinality leakage or memorization |
| 🧪 OOF protocol | same split, one changed block | false lift if validation rows encode their own target |
| 🧩 Stack / residual layer | compare Driver signal against stronger artifacts | over-crediting Driver when the real gain comes from model family or original data |

The key design choice is that Driver is not allowed to enter as one undifferentiated feature. It is split into progressively riskier representations.

2. Driver Inventory

The first diagnostic checks whether Driver behaves like a compact real-world categorical variable or a synthetic high-cardinality one.

| Dataset | Rows | Unique Drivers | D... Code Rate | 3-Letter Rate |
| --- | --- | --- | --- | --- |
| Train | 439,140 | 887 | 59.58% | 40.42% |
| Test | 188,165 | 801 | 59.34% | 40.66% |
| Original | 101,371 | 31 | 0.00% | 100.00% |

This matters because the train/test data contain many D### style synthetic IDs, while the original data is made of 3-letter driver codes. The notebook therefore separates:

| Driver Signal | Meaning | Leakage Risk |
| --- | --- | --- |
| 🧬 String structure | D### vs 3-letter code, numeric ID bins | low |
| 📚 Original vocabulary flag | whether Driver appears in the original data vocabulary | low to moderate |
| 📊 Frequency/support | number of rows behind Driver/Race contexts | low |
| 🎯 Target encoding | smoothed empirical pit-stop prior by category | high unless fold-safe |
| 🔀 Interaction target encoding | Driver-Compound, Race-Stint, Compound-Stint priors | higher variance |

The support distribution is also important:

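| Threshold | Drivers | Covered Rows | Covered Row Rate |
| --- | --- | --- | --- |
| >= 10 rows | 666 | 438,432 | 99.84% |
| >= 50 rows | 535 | 435,485 | 99.17% |
| >= 100 rows | 485 | 431,979 | 98.37% |
| >= 250 rows | 411 | 419,501 | 95.53% |
| >= 500 rows | 337 | 392,580 | 89.40% |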

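This makes frequency features statistically meaningful. Most rows are covered by drivers with enough support (99.17% of training rows belong to drivers with at least 50 rows), but the long tail is still large: 352 of the 887 training drivers have fewer than 50 rows, so raw one-hot identity can overfit.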
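A support table of this kind can be built from per-driver row counts. The sketch below is one plausible way to do it, not the notebook's own code; the helper name `support_table` and the toy driver column are assumptions made for illustration.

```python
import pandas as pd

def support_table(driver, thresholds=(10, 50, 100, 250, 500)):
    """Count drivers at or above each row threshold and the rows they cover."""
    counts = driver.astype(str).value_counts()
    n_rows = int(counts.sum())
    rows = []
    for t in thresholds:
        kept = counts[counts >= t]
        rows.append({
            "Threshold": f">= {t} rows",
            "Drivers": int(len(kept)),
            "Covered_Rows": int(kept.sum()),
            "Covered_Row_Rate": float(kept.sum()) / n_rows,
        })
    return pd.DataFrame(rows)

# Toy column (hypothetical): two well-supported codes and a short synthetic tail.
toy_driver = pd.Series(
    ["HAM"] * 600 + ["VER"] * 300 + ["D001"] * 60 + ["D002"] * 12 + ["D003"] * 3
)
toy_support = support_table(toy_driver)
```

The same pattern applies to any categorical key: the gap between the driver count and the covered-row rate shows how much of the cardinality sits in the thin tail.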

Notebook snippet: data loading and Driver inventory

```python
import numpy as np
import pandas as pd

# train_path, test_path, submission_path, original_path, TARGET_CANDIDATES and
# resolve_target_column are defined earlier in the notebook (not shown here).
train = pd.read_csv(train_path)
test = pd.read_csv(test_path)
submission_template = pd.read_csv(submission_path)
original = pd.read_csv(original_path) if original_path is not None else None

TARGET = resolve_target_column(train, TARGET_CANDIDATES, "train")
SUBMISSION_TARGET = resolve_target_column(submission_template, TARGET_CANDIDATES, "sample_submission")

ID_COL = "id"
base_features = [col for col in test.columns if col != ID_COL]
raw_numeric_cols = [col for col in base_features if pd.api.types.is_numeric_dtype(train[col])]
raw_categorical_cols = [col for col in base_features if col not in raw_numeric_cols]
original_driver_set = set(original["Driver"].astype(str)) if original is not None and "Driver" in original else set()

# Share of synthetic D### IDs versus 3-letter codes in each dataset.
driver_inventory = pd.DataFrame({
    "Dataset": ["Train", "Test", "Original" if original is not None else "Original missing"],
    "Rows": [len(train), len(test), len(original) if original is not None else 0],
    "Unique_Drivers": [
        train["Driver"].nunique(),
        test["Driver"].nunique(),
        original["Driver"].nunique() if original is not None else 0,
    ],
    "D_Code_Rate": [
        train["Driver"].astype(str).str.match(r"^D\d+$").mean(),
        test["Driver"].astype(str).str.match(r"^D\d+$").mean(),
        original["Driver"].astype(str).str.match(r"^D\d+$").mean() if original is not None else np.nan,
    ],
    "Three_Letter_Rate": [
        train["Driver"].astype(str).str.match(r"^[A-Z]{3}$").mean(),
        test["Driver"].astype(str).str.match(r"^[A-Z]{3}$").mean(),
        original["Driver"].astype(str).str.match(r"^[A-Z]{3}$").mean() if original is not None else np.nan,
    ],
})
```

3. Race-State Foundation

Before measuring Driver, the model needs a strong non-Driver baseline. The foundation translates raw lap rows into variables that better represent pit-stop pressure.

The first transformation estimates race length:

\[\widehat L_i = \frac{\text{LapNumber}_i}{\text{RaceProgress}_i}\]

Then it derives remaining race state:

\[R_i = 1 - \text{RaceProgress}_i, \qquad \text{NormalizedTyreLife}_i = \frac{\text{TyreLife}_i}{\widehat L_i}\]

Compound-specific tyre pressure is expressed as:

\[u_i = \frac{\text{TyreLife}_i}{E[\text{Life}\mid \text{Compound}_i]}\]

and a pit-window indicator is:

\[\mathbb{1}\!\left(u_i \ge w_{\text{Compound}_i}\right)\]

This makes the model compare tyre age on a compound-adjusted scale instead of treating all tyre life values as equivalent.
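As a worked example of the formulas above, the sketch below evaluates the same tyre age on two compounds. The expected-life and window-start values are the SOFT and HARD entries of the notebook's `COMPOUND_EXPECTED_LIFE` and `COMPOUND_WINDOW_START` dictionaries; the helper `pit_pressure` and the lap values are hypothetical.

```python
# Values taken from the notebook's compound dictionaries (SOFT and HARD only).
COMPOUND_EXPECTED_LIFE = {"SOFT": 25.0, "HARD": 45.0}
COMPOUND_WINDOW_START = {"SOFT": 0.64, "HARD": 0.72}

def pit_pressure(lap_number, race_progress, tyre_life, compound):
    """Evaluate the race-length, normalized-tyre-life and pit-window formulas."""
    estimated_race_laps = lap_number / race_progress          # L-hat
    tyre_life_pct = tyre_life / COMPOUND_EXPECTED_LIFE[compound]  # u
    return {
        "estimated_race_laps": estimated_race_laps,
        "remaining_progress": 1.0 - race_progress,             # R
        "normalized_tyre_life": tyre_life / estimated_race_laps,
        "tyre_life_pct": tyre_life_pct,
        "in_pit_window": tyre_life_pct >= COMPOUND_WINDOW_START[compound],
    }

# Hypothetical row: lap 25 at 50% progress, tyres 18 laps old.
soft = pit_pressure(lap_number=25, race_progress=0.5, tyre_life=18, compound="SOFT")
hard = pit_pressure(lap_number=25, race_progress=0.5, tyre_life=18, compound="HARD")
```

Under these assumptions the 18-lap-old SOFT tyre sits at 0.72 of its expected life and is inside its window (threshold 0.64), while the HARD tyre of the same age sits at 0.40 and is well short of its 0.72 threshold.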

| Feature Block | Main Variables | Interpretation |
| --- | --- | --- |
| 🏁 Race/tyre algebra | estimated race laps, laps remaining, normalized tyre life | race phase and tyre age scale |
| 🛞 Compound structure | hardness, dry/wet-like flags, tyre-life × hardness | compound-dependent wear pressure |
| ⏱️ Public domain features | pit window, degradation per lap, stop urgency | F1-inspired strategy pressure |
| 🔥 Nonlinear interactions | tyre life × progress, tyre life × stint, degradation × age | threshold-like pit behavior |

Notebook snippet: race-state and compound feature blocks

```python
import numpy as np

# Expected stint length (laps) per compound, used as the denominator of TyreLifePct.
COMPOUND_EXPECTED_LIFE = {
    "SOFT": 25.0,
    "MEDIUM": 35.0,
    "HARD": 45.0,
    "INTERMEDIATE": 30.0,
    "WET": 40.0,
}

# Fraction of expected life at which the pit window is considered open.
COMPOUND_WINDOW_START = {
    "SOFT": 0.64,
    "MEDIUM": 0.68,
    "HARD": 0.72,
    "INTERMEDIATE": 0.62,
    "WET": 0.60,
}

# _safe_divide is a notebook helper that is not shown in this excerpt.
def add_race_tyre_algebra(df):
    out = df.copy()
    estimated_laps = _safe_divide(
        out["LapNumber"].astype(float),
        out["RaceProgress"].astype(float),
    ).replace([np.inf, -np.inf], np.nan)

    out["EstimatedRaceLaps"] = estimated_laps.clip(lower=1, upper=120)
    out["LapsRemainingEstimate"] = (out["EstimatedRaceLaps"] - out["LapNumber"]).clip(lower=-10, upper=120)
    out["RemainingRaceProgress"] = (1.0 - out["RaceProgress"].astype(float)).clip(-0.1, 1.1)
    out["TyreLifeToLapNumber"] = _safe_divide(out["TyreLife"].astype(float), out["LapNumber"].astype(float))
    out["TyreLifeToRaceProgress"] = _safe_divide(out["TyreLife"].astype(float), out["RaceProgress"].astype(float))
    out["NormalizedTyreLifeEstimate"] = _safe_divide(out["TyreLife"].astype(float), out["EstimatedRaceLaps"].astype(float))
    return out

def add_public_domain_features(df):
    out = df.copy()
    # Unknown compounds fall back to the MEDIUM values.
    expected_life = out["Compound"].astype(str).map(COMPOUND_EXPECTED_LIFE).fillna(35.0)
    window_start = out["Compound"].astype(str).map(COMPOUND_WINDOW_START).fillna(0.68)

    tyre_life = out["TyreLife"].astype(float)
    race_progress = out["RaceProgress"].astype(float).clip(0, 1.05)
    lap_delta = out["LapTime_Delta"].astype(float)
    cum_deg = out["Cumulative_Degradation"].astype(float)

    out["TyreLifePct"] = _safe_divide(tyre_life, expected_life).clip(0, 3)
    out["PitWindowStartPct"] = window_start
    out["InCompoundPitWindow"] = (out["TyreLifePct"] >= window_start).astype(int)
    out["WindowOvershootPct"] = (out["TyreLifePct"] - window_start).clip(lower=0, upper=2)
    out["DegPerLap"] = _safe_divide(cum_deg, tyre_life + 1.0)
    out["PositiveLapTimeDelta"] = lap_delta.clip(lower=0)
    out["AgeXDelta"] = tyre_life * out["PositiveLapTimeDelta"]
    out["StopUrgency"] = out["TyreLifePct"] * race_progress * (1.0 + 0.20 * out["Stint"].astype(float).clip(1, 4))
    return out
```

4. Driver Representation

The Driver feature ladder starts with the safest information and moves toward higher-risk encodings.

4.1 String Structure

The first compressed representation avoids raw one-hot identity:

\[\text{Driver} \mapsto \left[ \mathbb{1}_{D\text{-code}}, \mathbb{1}_{3\text{-letter}}, \text{DriverNum}, \text{DriverNumBin}, \mathbb{1}_{\text{missing num}} \right]\]

This captures the synthetic-vs-original vocabulary structure without allocating one dummy variable per driver.

4.2 Frequency Encoding

For a categorical key \(c\), frequency encoding adds:

\[n(c) = \sum_i \mathbb{1}(x_i = c), \qquad f(c) = \log(1+n(c))\]

and a rare-driver flag:

\[\mathbb{1}\{n(\text{Driver}) < 50\}\]

This does not say that frequent drivers are inherently better. It tells the model how much support a category estimate has.

4.3 Context Normalization

Race-relative z-scores compare a row against its local context:

\[z_{i,g,v} = \frac{x_{i,v} - \mu_{g(i),v}}{\sigma_{g(i),v} + \epsilon}\]

where \(g(i)\) can be Race, Race_Compound, Race_Lap, or Compound. This is useful because pit timing is not purely absolute: a tyre age that is old in one race/lap context may be ordinary in another.
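The z-score above can be fitted on one set of rows and applied to another, which is what keeps it fold-safe. The sketch below is an assumption-laden illustration: `fit_group_zstats` and `apply_group_zscore` are hypothetical names (the notebook's own `fit_group_stats` and `apply_group_stats` are not shown), and the global fallback for unseen groups is one possible choice.

```python
import pandas as pd

def fit_group_zstats(fit_df, group_col, value_col):
    """Per-group mean/std from the fitting rows, plus a global fallback pair."""
    stats = fit_df.groupby(group_col, observed=True)[value_col].agg(["mean", "std"])
    fallback = (fit_df[value_col].mean(), fit_df[value_col].std())
    return stats, fallback

def apply_group_zscore(df, stats, fallback, group_col, value_col, eps=1e-6):
    """z = (x - mu_g) / (sigma_g + eps); unseen or single-row groups use the fallback."""
    mu = df[group_col].map(stats["mean"]).fillna(fallback[0])
    sigma = df[group_col].map(stats["std"]).fillna(fallback[1])
    return (df[value_col] - mu) / (sigma + eps)

# Toy fitting rows (hypothetical): race A runs long stints, race B short ones.
fit_rows = pd.DataFrame({
    "Race": ["A", "A", "A", "B", "B", "B"],
    "TyreLife": [10.0, 20.0, 30.0, 2.0, 4.0, 6.0],
})
valid_rows = pd.DataFrame({"Race": ["A", "B", "C"], "TyreLife": [20.0, 6.0, 15.0]})

stats, fallback = fit_group_zstats(fit_rows, "Race", "TyreLife")
z = apply_group_zscore(valid_rows, stats, fallback, "Race", "TyreLife")
```

In this toy data a 20-lap tyre is exactly average in race A, while a 6-lap tyre is about one standard deviation old in race B; race C never appears in the fitting rows, so it is scored against the global statistics.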

4.4 Target Encoding

For a category \(c\), the smoothed target prior is:

\[\operatorname{TE}(c) = \frac{\sum_{i:g_i=c} w_i y_i + \alpha \bar y} {\sum_{i:g_i=c} w_i + \alpha}\]

The smoothing parameter \(\alpha\) controls shrinkage toward the global mean \(\bar y\). A large \(\alpha\) is useful for small-support interactions such as Driver_Compound.
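A small numeric example shows how the shrinkage behaves. It uses unit weights and the notebook's `alpha=80`; the global prior of 0.05 and the two category sizes are hypothetical numbers chosen for illustration.

```python
def smoothed_te(target_sum, weight_sum, prior, alpha=80.0):
    """Smoothed target prior: (sum of w*y + alpha * prior) / (sum of w + alpha)."""
    return (target_sum + alpha * prior) / (weight_sum + alpha)

prior = 0.05  # hypothetical global pit rate

# Both categories have the same raw rate of 0.30, but very different support.
small_support = smoothed_te(target_sum=3, weight_sum=10, prior=prior)
large_support = smoothed_te(target_sum=300, weight_sum=1000, prior=prior)
```

With 10 rows the encoded value is pulled most of the way back to the prior (about 0.078), while with 1,000 rows it stays close to the raw rate (about 0.281). That is the intended effect for thin Driver_Compound cells.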

Notebook snippet: Driver structure, count maps, and target encoding

```python
import numpy as np
import pandas as pd

def add_driver_structure(df):
    out = df.copy()
    driver = out["Driver"].astype(str)
    # Numeric part of synthetic D### IDs; NaN for 3-letter codes.
    extracted = driver.str.extract(r"^D(\d+)$", expand=False)
    out["Driver_is_D_code"] = driver.str.match(r"^D\d+$").astype(int)
    out["Driver_is_3letter"] = driver.str.match(r"^[A-Z]{3}$").astype(int)
    out["Driver_num"] = pd.to_numeric(extracted, errors="coerce").fillna(-1)
    out["Driver_num_missing"] = (out["Driver_num"] < 0).astype(int)
    # The first bin (-2, -0.5] holds the -1 sentinel for non-numeric drivers.
    bins = [-2, -0.5, 99, 199, 399, 699, 9999]
    out["Driver_num_bin"] = pd.cut(out["Driver_num"], bins=bins, labels=False, include_lowest=True).astype(float)
    return out

def fit_count_maps(fit_df, columns, weight=None):
    w = np.ones(len(fit_df), dtype=float) if weight is None else np.asarray(weight, dtype=float)
    maps = {}
    for col in columns:
        tmp = pd.DataFrame({col: fit_df[col].astype(str), "__w": w})
        maps[col] = tmp.groupby(col, observed=True)["__w"].sum()
    return maps

def apply_count_maps(df, count_maps):
    out = df.copy()
    for col, counts in count_maps.items():
        # Categories unseen in the fitting rows get a count of 0.
        mapped = out[col].astype(str).map(counts).fillna(0).astype(float)
        out[f"{col}_count"] = mapped
        out[f"{col}_freq_log1p"] = np.log1p(mapped)
        if col == "Driver":
            out["Driver_is_rare_50"] = (mapped < 50).astype(int)
    return out

def fit_target_maps(fit_df, columns, target, weight=None, alpha=80):
    y = fit_df[target].astype(float).to_numpy()
    w = np.ones(len(fit_df), dtype=float) if weight is None else np.asarray(weight, dtype=float)
    prior = np.average(y, weights=w)
    maps = {}
    tmp = fit_df.copy()
    tmp["__y"] = y
    tmp["__w"] = w
    tmp["__yw"] = y * w
    for col in columns:
        grouped = tmp.groupby(col, observed=True).agg(
            weight_sum=("__w", "sum"),
            target_sum=("__yw", "sum"),
        )
        # Smoothed prior: shrink each category rate toward the global prior.
        maps[col] = ((grouped["target_sum"] + alpha * prior) / (grouped["weight_sum"] + alpha)).to_dict()
    return maps, prior
```

5. Fold-Safe Validation

The most important implementation detail is that all fitted transformations are learned inside the fold.

For fold \(k\):

\[\mathcal{D}_{-k} = \text{training rows outside fold } k, \qquad \mathcal{D}_{k} = \text{validation rows inside fold } k\]

Then any fitted encoder \(T_k\) must satisfy:

\[T_k = \operatorname{fit}(\mathcal{D}_{-k}), \qquad X_k^{\text{valid}} = T_k(\mathcal{D}_k)\]

For target encoding, the training side itself also needs an inner OOF encoding. Otherwise a row can encode its own target.

\[\hat z_i = \operatorname{TE}_{-h(i)}(x_i)\]

where \(h(i)\) is the inner fold containing row \(i\).
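The inner OOF step can be sketched as follows. This is not the notebook's `add_inner_oof_target_encoding`, which is not shown; `inner_oof_te` is a hypothetical stand-in that assigns inner folds by a random permutation (no stratification) and uses unit weights.

```python
import numpy as np
import pandas as pd

def inner_oof_te(df, col, target, alpha=80.0, n_splits=5, seed=0):
    """Encode each row with a smoothed prior fitted on the other inner folds."""
    rng = np.random.default_rng(seed)
    fold_id = rng.permutation(len(df)) % n_splits   # h(i): inner fold of row i
    key = df[col].astype(str).to_numpy()
    y = df[target].to_numpy(dtype=float)
    encoded = np.empty(len(df), dtype=float)

    for h in range(n_splits):
        held = fold_id == h
        prior = y[~held].mean()
        grouped = (
            pd.DataFrame({"key": key[~held], "y": y[~held]})
            .groupby("key")["y"]
            .agg(["sum", "count"])
        )
        te_map = (grouped["sum"] + alpha * prior) / (grouped["count"] + alpha)
        # Keys never seen outside the held fold fall back to the prior.
        encoded[held] = pd.Series(key[held]).map(te_map).fillna(prior).to_numpy()
    return encoded

# Toy frame (hypothetical): one driver appears once and is the only positive row.
toy = pd.DataFrame({"Driver": ["D001"] * 19 + ["SOLO"], "y": [0] * 19 + [1]})
toy["Driver_te"] = inner_oof_te(toy, "Driver", "y", alpha=1.0)
solo_value = float(toy.loc[toy["Driver"] == "SOLO", "Driver_te"].iloc[0])
```

In the toy frame the single `SOLO` row is encoded without ever seeing its own label, so its value is the prior of the remaining rows (0.0 here). An in-sample encoding with the same `alpha` would have leaked the positive label into the feature.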

Notebook snippet: fold-safe feature fitting and OOF runner

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Helpers such as prepare_static_features, fit_group_stats, apply_target_maps,
# build_preprocessor and fit_predict_model, and globals such as N_SPLITS and
# RANDOM_STATE, are defined elsewhere in the notebook (not shown here).
def fit_fold_features(fit_df, valid_df, test_df, blocks, target, fit_weight=None):
    fit_x = prepare_static_features(fit_df, blocks)
    valid_x = prepare_static_features(valid_df, blocks)
    test_x = prepare_static_features(test_df, blocks)

    if "frequency_encoding" in blocks:
        # Counts are learned on the fitting rows only, then applied everywhere.
        count_maps = fit_count_maps(fit_x, FREQUENCY_COLUMNS, weight=fit_weight)
        fit_x = apply_count_maps(fit_x, count_maps)
        valid_x = apply_count_maps(valid_x, count_maps)
        test_x = apply_count_maps(test_x, count_maps)

    if "context_normalization" in blocks:
        value_cols = sorted({value for values in CONTEXT_GROUP_SPECS.values() for value in values})
        global_stats = fit_global_stats(fit_x, value_cols, weight=fit_weight)
        group_stats = fit_group_stats(fit_x, CONTEXT_GROUP_SPECS, weight=fit_weight)
        fit_x = apply_group_stats(fit_x, group_stats, global_stats)
        valid_x = apply_group_stats(valid_x, group_stats, global_stats)
        test_x = apply_group_stats(test_x, group_stats, global_stats)

    if "oof_target_encoding" in blocks:
        # Fitting rows get an inner-OOF encoding so no row sees its own target.
        fit_x = add_inner_oof_target_encoding(
            fit_x,
            target=target,
            columns=TARGET_ENCODING_COLUMNS,
            sample_weight=fit_weight,
            alpha=80,
            n_splits=INNER_TE_SPLITS,
            suffix="te",
        )
        # Validation and test rows use maps fitted on all fitting rows.
        maps, prior = fit_target_maps(fit_x, TARGET_ENCODING_COLUMNS, target, fit_weight, alpha=80)
        valid_x = apply_target_maps(valid_x, maps, prior, suffix="te")
        test_x = apply_target_maps(test_x, maps, prior, suffix="te")

    return fit_x, valid_x, test_x

def run_oof_experiment(config, train_frame=None, test_frame=None, verbose=True):
    train_frame = cv_train if train_frame is None else train_frame
    test_frame = test if test_frame is None else test_frame
    y = train_frame[TARGET].astype(int).to_numpy()
    skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=RANDOM_STATE)

    oof = np.zeros(len(train_frame), dtype=float)
    test_pred = np.zeros(len(test_frame), dtype=float)

    for fold, (fit_idx, valid_idx) in enumerate(skf.split(train_frame, y), start=1):
        fold_fit = train_frame.iloc[fit_idx].copy()
        fold_valid = train_frame.iloc[valid_idx].copy()

        fit_x, valid_x, test_x = fit_fold_features(
            fold_fit,
            fold_valid,
            test_frame,
            config.blocks,
            TARGET,
        )

        numeric_features = select_numeric_features(config.blocks, available_columns=fit_x.columns)
        preprocessor = build_preprocessor(
            numeric_features=numeric_features,
            categorical_features=select_categorical_features(config.include_driver_ohe),
            scale_numeric=(config.model_kind == "logreg"),
        )

        X_fit = preprocessor.fit_transform(fit_x)
        X_valid = preprocessor.transform(valid_x)
        X_test = preprocessor.transform(test_x)

        model = make_model(config.model_kind)
        valid_pred, fold_test_pred, fitted_model = fit_predict_model(
            config.model_kind,
            model,
            X_fit,
            fold_fit[TARGET].astype(int).to_numpy(),
            X_valid,
            fold_valid[TARGET].astype(int).to_numpy(),
            X_test,
        )
        oof[valid_idx] = valid_pred
        # Test predictions are averaged across folds.
        test_pred += fold_test_pred / N_SPLITS

    return {
        "config": config,
        "oof": oof,
        "test_pred": test_pred,
        "auc": roc_auc_score(y, oof),
        "logloss": log_loss(y, np.clip(oof, 1e-6, 1 - 1e-6)),
    }
```

6. Driver Ladder Results

The Driver ladder is split into two branches:

| Branch | Question | Counterfactual |
| --- | --- | --- |
| 🧑‍✈️ Raw identity ladder | Does Driver one-hot help directly? | no-Driver race-state foundation |
| 🧬 Compressed ladder | Can Driver signal survive without raw OHE? | same foundation, adding one Driver representation at a time |
| 🔀 Interaction branch | Are Driver/compound/stint priors stable? | the B04 target-encoding parent |

6.1 Raw Driver OHE

| Experiment | Added Block | OOF AUC | Delta vs No Driver |
| --- | --- | --- | --- |
| A00_no_driver | Race-state foundation without raw Driver ID | 0.946243 | 0.000000 |
| A01_driver_ohe | Add raw Driver one-hot identity | 0.945964 | -0.000279 |

Raw Driver identity did not improve this saved CV run: adding the one-hot block lowered OOF AUC by 0.000279. This is an important negative result because it keeps the notebook from over-trusting a high-cardinality ID.

6.2 Compressed Driver Signal

| Experiment | Added Block | OOF AUC | Delta vs No Driver | CV Read |
| --- | --- | --- | --- | --- |
| B00_no_driver | Race-state foundation | 0.946243 | 0.000000 | baseline |
| B01_driver_string | Driver string structure | 0.946691 | +0.000448 | improves |
| B02_original_vocab | original-driver vocabulary flag | 0.946586 | +0.000343 | below previous |
| B03_driver_frequency | Driver/Race frequency features | 0.946818 | +0.000575 | best compressed step |
| B04_driver_te | inner-OOF Driver/Race target encoding | 0.946670 | +0.000427 | below previous |

The best saved Driver representation is:

\[\texttt{B03\_driver\_frequency}\]

with:

\[\Delta_{\text{AUC}} = 0.946818 - 0.946243 = 0.000575\]

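That is small, but it is directionally consistent in the paired bootstrap, where even the 5th percentile of the resampled delta stays above zero:

| Comparison | Mean Delta | P05 | Median | P95 | Probability Positive |
| --- | --- | --- | --- | --- | --- |
| B03_driver_frequency - B00 | 0.000573 | 0.000384 | 0.000572 | 0.000764 | 1.000 |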

6.3 Interaction Target Encoding

Interaction TE did not beat the B04_driver_te parent:

| Experiment | Parent | Delta vs Parent | Decision |
| --- | --- | --- | --- |
| B05_driver_compound_te | B04_driver_te | -0.000261 | reject |
| B06_race_compound_te | B04_driver_te | -0.000076 | reject |
| B07_race_stint_te | B04_driver_te | -0.000143 | reject |
| B08_compound_stint_te | B04_driver_te | -0.000125 | reject |

This is the expected failure mode for sparse interactions. The conditional prior is conceptually reasonable, but the sample support is thinner and the variance penalty dominates.

Notebook snippet: Driver ladder configuration and bootstrap check

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

# ExperimentConfig, LADDER_MODEL and RANDOM_STATE are defined elsewhere in the
# notebook (not shown here).
DRIVER_FOUNDATION_BLOCKS = [
    "race_tyre_algebra",
    "compound_structure",
    "nonlinear_interactions",
]

identity_driver_configs = [
    ExperimentConfig(
        "A00_no_driver",
        "Race-state foundation without raw Driver ID",
        DRIVER_FOUNDATION_BLOCKS,
        include_driver_ohe=False,
        model_kind=LADDER_MODEL,
    ),
    ExperimentConfig(
        "A01_driver_ohe",
        "Add raw Driver one-hot identity",
        DRIVER_FOUNDATION_BLOCKS,
        include_driver_ohe=True,
        model_kind=LADDER_MODEL,
    ),
]

# The B02_original_vocab step from the results table is not part of this excerpt.
compressed_driver_configs = [
    ExperimentConfig(
        "B00_no_driver",
        "Race-state foundation without raw Driver ID",
        DRIVER_FOUNDATION_BLOCKS,
        include_driver_ohe=False,
        model_kind=LADDER_MODEL,
    ),
    ExperimentConfig(
        "B01_driver_string",
        "Driver string structure without raw OHE",
        DRIVER_FOUNDATION_BLOCKS + ["driver_structure"],
        include_driver_ohe=False,
        model_kind=LADDER_MODEL,
    ),
    ExperimentConfig(
        "B03_driver_frequency",
        "Add Driver/Race frequency features",
        DRIVER_FOUNDATION_BLOCKS + ["driver_structure", "driver_original_vocab", "frequency_encoding"],
        include_driver_ohe=False,
        model_kind=LADDER_MODEL,
    ),
    ExperimentConfig(
        "B04_driver_te",
        "Add inner-OOF Driver/Race target encoding",
        DRIVER_FOUNDATION_BLOCKS + ["driver_structure", "driver_original_vocab", "frequency_encoding", "oof_target_encoding"],
        include_driver_ohe=False,
        model_kind=LADDER_MODEL,
    ),
]

def paired_auc_delta_bootstrap(y, pred_a, pred_b, comparison, n_boot=800, seed=RANDOM_STATE):
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    pred_a = np.asarray(pred_a)
    pred_b = np.asarray(pred_b)
    deltas = []

    for _ in range(n_boot):
        # Paired resample: the same row indices are used for both models.
        idx = rng.integers(0, len(y), size=len(y))
        if len(np.unique(y[idx])) < 2:
            continue
        deltas.append(roc_auc_score(y[idx], pred_b[idx]) - roc_auc_score(y[idx], pred_a[idx]))

    deltas = np.asarray(deltas, dtype=float)
    return pd.DataFrame([{
        "Comparison": comparison,
        "Mean_Delta": deltas.mean(),
        "P05": np.quantile(deltas, 0.05),
        "Median": np.quantile(deltas, 0.50),
        "P95": np.quantile(deltas, 0.95),
        "Prob_Positive": np.mean(deltas > 0),
        "Bootstraps": len(deltas),
    }])
```

7. Score-Oriented Candidate Stack

After the Driver ladder, the notebook tests whether public-stack style preprocessing beats the current Driver-frequency selection.

Two matrix styles are compared:

| Candidate | Feature Count | OOF AUC | Delta vs Current Best |
| --- | --- | --- | --- |
| S05_stack_core_preprocessing | 35 | 0.944899 | -0.0019 |
| S06_stack_domain_preprocessing | 111 | 0.946050 | -0.0008 |

The larger domain matrix improves over the core matrix, but neither beats B03_driver_frequency in this local LGBM run. The interpretation is not that sequence/group features are bad. It is that their value appears more compatible with the heavier XGB/Cat/LGBM stack than with the saved lightweight ladder.

The score gap table makes this explicit:

| Model or Stack | Reference OOF AUC | Gap vs B03_driver_frequency | Read |
| --- | --- | --- | --- |
| B03_driver_frequency | 0.946818 | 0.000000 | current local best |
| Domain LGBM public reference | 0.948120 | +0.001302 | small improvement |
| RealMLP public reference | 0.949510 | +0.002692 | small improvement |
| XGB baseline + original | 0.958160 | +0.011342 | leaderboard-scale gap |
| XGB Optuna + original | 0.959940 | +0.013122 | leaderboard-scale gap |
| XGB + CatBoost ensemble | 0.960760 | +0.013942 | leaderboard-scale gap |

This reframes Driver as an auxiliary signal. The largest score movements come from model family, original data usage, and ensemble structure.

Notebook snippet: stack preprocessing candidates

```python
import numpy as np
import pandas as pd

# ExperimentConfig, LADDER_MODEL and add_stack_row_features are defined
# elsewhere in the notebook (not shown here).
SCORE_FOUNDATION_BLOCKS = [
    "race_tyre_algebra",
    "compound_structure",
    "public_domain_features",
    "nonlinear_interactions",
]

SCORE_SAFE_DRIVER_BLOCKS = SCORE_FOUNDATION_BLOCKS + [
    "driver_structure",
    "driver_original_vocab",
    "frequency_encoding",
    "context_normalization",
]

score_candidate_configs = [
    ExperimentConfig(
        "S00_driver_ohe_domain",
        "Driver OHE + public pit-window domain features",
        SCORE_FOUNDATION_BLOCKS,
        include_driver_ohe=True,
        model_kind=LADDER_MODEL,
    ),
    ExperimentConfig(
        "S01_driver_ohe_context",
        "Add weighted support and field-relative context",
        SCORE_SAFE_DRIVER_BLOCKS,
        include_driver_ohe=True,
        model_kind=LADDER_MODEL,
    ),
    ExperimentConfig(
        "S02_driver_ohe_driver_te",
        "Add inner-OOF Driver/Race target encoding",
        SCORE_SAFE_DRIVER_BLOCKS + ["oof_target_encoding"],
        include_driver_ohe=True,
        model_kind=LADDER_MODEL,
    ),
]

def add_stack_domain_features(train_df, test_df):
    train_base = add_stack_row_features(train_df)
    test_base = add_stack_row_features(test_df)
    # Tag rows so the combined frame can be split back into train and test.
    train_base["__part"] = "train"
    test_base["__part"] = "test"
    train_base["__row"] = np.arange(len(train_base))
    test_base["__row"] = np.arange(len(test_base))
    full = pd.concat([train_base, test_base], ignore_index=True, sort=False)

    sort_cols = [col for col in ["Race", "Driver", "Year", "Stint", "LapNumber"] if col in full.columns]
    full = full.sort_values(sort_cols).copy()
    group_cols = [col for col in ["Race", "Driver", "Year", "Stint"] if col in full.columns]
    g = full.groupby(group_cols, sort=False, observed=True)

    # Within-stint sequence features (lags and first differences).
    full["stack_lap_in_stint"] = g.cumcount() + 1
    full["stack_stint_len"] = g["LapNumber"].transform("count")
    full["stack_pos_prev1"] = g["Position"].shift(1)
    full["stack_pos_delta1"] = full["Position"] - full["stack_pos_prev1"]
    full["stack_lt_prev1"] = g["LapTime (s)"].shift(1)
    full["stack_lt_delta1"] = full["LapTime (s)"] - full["stack_lt_prev1"]
    full["stack_tyre_life_lag1"] = g["TyreLife"].shift(1)
    full["stack_tyre_life_diff1"] = full["TyreLife"] - full["stack_tyre_life_lag1"]
    # Field-relative tyre age on the same race lap.
    full["stack_race_lap_meantyrelife"] = full.groupby(["Race", "LapNumber"], observed=True)["TyreLife"].transform("mean")
    full["stack_tyrelife_vs_field_mean"] = full["TyreLife"] - full["stack_race_lap_meantyrelife"]

    return full
```

8. Driver As A Residual Correction

The second Driver hypothesis is subtler:

Driver may be too weak as a main feature, but still useful as a post-model calibration bias.

The residual layer works in logit space. For a base prediction \(\hat p_i\):

\[\operatorname{logit}(\hat p_i) = \log \frac{\hat p_i}{1-\hat p_i}\]

For a Driver group \(g\), the notebook estimates a smoothed observed rate and a smoothed predicted rate:

\[\tilde y_g = \frac{\sum_{i:g_i=g} y_i + \alpha \bar y}{n_g+\alpha}, \qquad \tilde p_g = \frac{\sum_{i:g_i=g} \hat p_i + \alpha \bar p}{n_g+\alpha}\]

The group residual is:

\[\delta_g = \operatorname{logit}(\tilde y_g) - \operatorname{logit}(\tilde p_g)\]

Then the corrected prediction is:

\[\hat p'_i = \sigma\!\left( \operatorname{logit}(\hat p_i) + \lambda \sum_{g \in G_i} w_g\delta_g \right)\]

where \(\lambda\) is selected by OOF AUC.
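A worked example with made-up numbers may help. All inputs below (a Driver group of 200 rows with 30 observed pit stops, 20 predicted, global rates of 0.05, `alpha=50`, and `lam=0.25`) are hypothetical, and the helper names are illustrative rather than the notebook's.

```python
import math

def logit(p, eps=1e-6):
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def group_delta(y_sum, p_sum, n, global_y, global_p, alpha):
    """delta_g = logit(smoothed observed rate) - logit(smoothed predicted rate)."""
    y_smooth = (y_sum + alpha * global_y) / (n + alpha)
    p_smooth = (p_sum + alpha * global_p) / (n + alpha)
    return logit(y_smooth) - logit(p_smooth)

def correct(p, delta, lam):
    """Shift the base prediction in logit space by lam * delta."""
    return sigmoid(logit(p) + lam * delta)

# Hypothetical group: the base model under-predicts this driver's pit rate.
delta = group_delta(y_sum=30, p_sum=20, n=200, global_y=0.05, global_p=0.05, alpha=50)

# Three rows from the same group receive the same logit shift.
same_group = [0.02, 0.10, 0.40]
corrected = [correct(p, delta, lam=0.25) for p in same_group]
```

Here the smoothed observed rate is 0.13 against a smoothed predicted rate of 0.09, so the shift is positive and every row in the group moves up. Because the shift is constant within a group, it cannot reorder rows of the same driver; any AUC change has to come from reordering rows across groups, which is one plausible reason the measured gains are so small.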

The saved local residual table:

| Experiment | Groups | Lambda | OOF AUC | Delta vs Base | Decision |
| --- | --- | --- | --- | --- | --- |
| R00_base_selected_stack | none | 0.00 | 0.946818 | 0.000000 | base |
| R01_driver_logit_bias | Driver | -0.25 | 0.946820 | +0.000002 | watchlist |
| R02_driver_phase_logit_bias | Driver + Driver_Phase | -0.25 | 0.946820 | +0.000002 | watchlist |
| R05_driver_strategy_logit_bias | Driver + phase + compound + stint | -0.10 | 0.946819 | +0.000001 | watchlist |

The promotion threshold is:

\[\Delta_{\text{AUC}} \ge 0.0003\]

The local residual gains are far below that threshold, so they remain watchlist signals rather than final candidates.

Notebook snippet: logit residual correction

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# apply_logit_residual_signal, sigmoid_array, N_SPLITS and RANDOM_STATE are
# defined elsewhere in the notebook (not shown here).
def logit_clip(p, eps=1e-6):
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return np.log(p / (1 - p))

def fit_logit_residual_maps(residual_frame, y, base_pred, group_alphas):
    maps = {}
    y = np.asarray(y, dtype=float)
    base_pred = np.clip(np.asarray(base_pred, dtype=float), 1e-6, 1 - 1e-6)
    global_y = float(np.mean(y))
    global_p = float(np.mean(base_pred))

    for col, alpha in group_alphas.items():
        tmp = pd.DataFrame({
            "key": residual_frame[col].astype(str),
            "y": y,
            "p": base_pred,
        })
        grouped = tmp.groupby("key", observed=True).agg(
            n=("y", "size"),
            y_sum=("y", "sum"),
            p_sum=("p", "sum"),
        )
        # Smoothed observed and predicted rates, then their logit gap.
        grouped["y_smooth"] = (grouped["y_sum"] + alpha * global_y) / (grouped["n"] + alpha)
        grouped["p_smooth"] = (grouped["p_sum"] + alpha * global_p) / (grouped["n"] + alpha)
        grouped["delta_logit"] = logit_clip(grouped["y_smooth"]) - logit_clip(grouped["p_smooth"])
        maps[col] = grouped[["delta_logit", "n"]]
    return maps

def evaluate_logit_residual_spec(spec, train_resid_frame, test_resid_frame, y, base_oof, base_test):
    skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=RANDOM_STATE + 100)
    delta_oof = np.zeros(len(train_resid_frame), dtype=float)
    delta_test = np.zeros(len(test_resid_frame), dtype=float)

    for fold, (fit_idx, valid_idx) in enumerate(skf.split(train_resid_frame, y), start=1):
        # Residual maps are fitted out-of-fold, like the other encoders.
        maps = fit_logit_residual_maps(
            train_resid_frame.iloc[fit_idx],
            y[fit_idx],
            base_oof[fit_idx],
            spec["alphas"],
        )
        delta_oof[valid_idx] = apply_logit_residual_signal(
            train_resid_frame.iloc[valid_idx],
            maps,
            spec["groups"],
        )
        delta_test += apply_logit_residual_signal(test_resid_frame, maps, spec["groups"]) / N_SPLITS

    base_logit_oof = logit_clip(base_oof)
    lambda_grid = np.array([-1.00, -0.75, -0.50, -0.25, -0.10, 0.0, 0.10, 0.25, 0.50, 0.75, 1.00])

    rows = []
    for lam in lambda_grid:
        corrected_oof = sigmoid_array(base_logit_oof + lam * delta_oof)
        rows.append({
            "Lambda": lam,
            "OOF_AUC": roc_auc_score(y, corrected_oof),
        })

    # The strength lambda is picked on the same OOF predictions it is scored on.
    strength_table = pd.DataFrame(rows)
    best_lambda = float(strength_table.loc[strength_table["OOF_AUC"].idxmax(), "Lambda"])
    corrected_oof = sigmoid_array(base_logit_oof + best_lambda * delta_oof)
    return corrected_oof, strength_table
```

9. Heavy-Stack Overlay

The final section attaches stronger OOF/test artifacts and asks whether Driver residuals add value on top of the real scoring base.

The stack search uses probability and rank blends:

\[\hat p^{\text{prob}}_i = \sum_{m=1}^{M} w_m \hat p_{im}, \qquad w_m \ge 0,\quad \sum_m w_m = 1\]

For rank blending:

\[\hat r_i = \sum_{m=1}^{M} w_m \operatorname{rank}_{01}(\hat p_{im})\]

The discovered base models had these OOF scores:

| Model | OOF AUC |
| --- | --- |
| realmlp_full_fullrows_5fold | 0.952177 |
| lgb_domain_w07_full_fullrows_10fold | 0.950665 |
| xgb_core_w1_full_fullrows_10fold | 0.950331 |
| cat_core_w1_full_fullrows_10fold | 0.949682 |
| xgb_public_te_w1_full_fullrows_10fold | 0.949381 |

The selected heavy-stack artifact reached:

\[\operatorname{AUC}_{\text{heavy stack}} \approx 0.953404\]

The Driver residual overlay reached about:

\[0.953409\]

but the improvement of about 0.000005 AUC is far below the 0.0003 promotion threshold. So the final blend assigned all weight to the heavy-stack artifact:

| Model | Blend Weight |
| --- | --- |
| B03_driver_frequency | 0.000 |
| H00_heavy_stack_artifact | 1.000 |

This is a useful outcome. It says Driver frequency is a valid local signal, but once stronger full-row stack predictions are available, Driver residuals do not yet justify replacing the artifact base.

Notebook snippet: artifact blending and heavy-stack residual overlay

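import numpy as np
from sklearn.metrics import roc_auc_score

def rank01_np(values):
    # Map values to ordinal ranks scaled into (0, 1); the stable sort breaks ties by row order.
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks / (len(values) + 1)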

def simplex_grid(n, step):
    # Yield every length-n weight vector whose entries are multiples of 1/units and sum to 1.
    units = int(round(1 / step))
    current = []

    def rec(left, slots):
        if slots == 1:
            yield current + [left]
        else:
            for value in range(left + 1):
                current.append(value)
                yield from rec(left - value, slots - 1)
                current.pop()

    for parts in rec(units, n):
        yield np.array(parts, dtype=float) / units

def search_stack_base_blends(names, oof_matrix, test_matrix, y):
    rank_oof = np.column_stack([rank01_np(oof_matrix[:, idx]) for idx in range(oof_matrix.shape[1])])
    rank_test = np.column_stack([rank01_np(test_matrix[:, idx]) for idx in range(test_matrix.shape[1])])
    candidates = []

    for mode, matrix_oof, matrix_test in [
        ("probability", oof_matrix, test_matrix),
        ("rank", rank_oof, rank_test),
    ]:
        best_auc = -np.inf
        best_weights = None
        best_oof = None
        best_test = None

        for weights in simplex_grid(len(names), STACK_BLEND_GRID_STEP):
            pred = matrix_oof @ weights
            auc = roc_auc_score(y, pred)
            if auc > best_auc:
                best_auc = auc
                best_weights = weights.copy()
                best_oof = pred
                best_test = matrix_test @ weights

        candidates.append({
            "Mode": mode,
            "OOF_AUC": best_auc,
            "Weights": best_weights,
            "OOF": best_oof,
            "Test": best_test,
        })

    best_candidate = max(candidates, key=lambda item: item["OOF_AUC"])
    probability_candidate = next(item for item in candidates if item["Mode"] == "probability")
    return best_candidate, probability_candidate

def run_heavy_stack_residual_overlay(stack_train_frame, stack_test_frame, y, probability_base, best_artifact_base):
    # Each spec is evaluated on top of the probability-blend base, clipped away from exactly 0 and 1.
    spec_results = [
        evaluate_logit_residual_spec(
            spec,
            build_residual_frame(stack_train_frame),
            build_residual_frame(stack_test_frame),
            y,
            np.clip(probability_base["OOF"], 1e-6, 1 - 1e-6),
            np.clip(probability_base["Test"], 1e-6, 1 - 1e-6),
        )
        for spec in RESIDUAL_SPECS
    ]

    # Promote the best residual only if it clears the best artifact base by the minimum AUC gain.
    best_residual = max(spec_results, key=lambda item: item["auc"])
    if best_residual["auc"] >= best_artifact_base["OOF_AUC"] + RESIDUAL_PROMOTION_MIN_AUC:
        selected_oof = best_residual["oof"]
        selected_test = best_residual["test_pred"]
        selected_decision = "promoted residual"
    else:
        # Otherwise fall back to the unmodified artifact base.
        selected_oof = best_artifact_base["OOF"]
        selected_test = best_artifact_base["Test"]
        selected_decision = "artifact base"

    return selected_oof, selected_test, selected_decision
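
The exhaustive simplex search grows combinatorially with the number of base models, so the grid step matters. The sketch below counts the candidates per blend mode using the stars-and-bars formula. The helper name `simplex_grid_size` and the step values are hypothetical, since the notebook's `STACK_BLEND_GRID_STEP` setting is not shown in this post.

```python
from math import comb

def simplex_grid_size(n_models, step):
    """Number of weight vectors with entries k/units that sum to 1 (stars and bars)."""
    units = int(round(1 / step))
    return comb(units + n_models - 1, n_models - 1)

# Hypothetical step values; every grid point costs one roc_auc_score call per blend mode.
for step in (0.10, 0.05):
    print(step, simplex_grid_size(5, step))
```

With five base models, a step of 0.10 gives 1,001 candidate weight vectors and a step of 0.05 gives 10,626, each evaluated once for the probability mode and once for the rank mode.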

10. Final Submission Logic

The final file is a standard probability submission. The notebook checks row count, ID alignment, probability bounds, and missing values:

\[0 \le \hat p_i \le 1\]

The saved output summary:

| Rows | Mean Prediction | Min Prediction | Max Prediction | Filename |
| --- | --- | --- | --- | --- |
| 188,165 | 0.1982 | 0.0001 | 0.9954 | submission.csv |

Show notebook snippet: final submission checks
submission = submission_template.copy()
submission[SUBMISSION_TARGET] = np.clip(blend_test, 0, 1)

assert submission.shape[0] == test.shape[0]
assert submission[ID_COL].equals(test[ID_COL])
assert submission[SUBMISSION_TARGET].between(0, 1).all()
assert submission[SUBMISSION_TARGET].isna().sum() == 0

submission.to_csv(SUBMISSION_FILENAME, index=False)

submission_summary = pd.DataFrame({
    "Rows": [len(submission)],
    "Mean_Prediction": [submission[SUBMISSION_TARGET].mean()],
    "Min_Prediction": [submission[SUBMISSION_TARGET].min()],
    "Max_Prediction": [submission[SUBMISSION_TARGET].max()],
    "Filename": [SUBMISSION_FILENAME],
})

11. What The Notebook Establishes

The most useful conclusion is not simply “Driver helps” or “Driver does not help.” The more precise reading is:

| Claim | Evidence | Decision |
| --- | --- | --- |
| Raw Driver one-hot is not enough | A01 - A00 = -0.000279 AUC | reject in saved CV |
| Compressed Driver support is useful | B03 - B00 = +0.000575 AUC | keep as local signal |
| Target encoding needs strict guardrails | B04 falls below B03 | do not promote by complexity alone |
| Interaction TE is high variance | all tested interactions fall below the parent | reject interaction branches |
| Public-style stack features need stronger models | local LGBM stack candidates trail B03 | keep for heavier stack |
| Driver residual is not promoted | gains are around a few \(10^{-6}\) AUC | watchlist only |
| Heavy-stack artifact dominates final blend | selected blend weight is 1.0 on artifact stack | final submission base |

11.1 Statistical Meaning Of The Conclusion

The conclusion should not be read as a causal claim about driver skill. The notebook does not establish that a particular driver is intrinsically better at pit strategy. What it establishes is narrower:

After race-state, tyre-state, compound, stint, and degradation features are already present, Driver-derived support and frequency variables still contain a small amount of additional OOF ranking signal.

The relevant marginal effect is:

\[\Delta_{\text{Driver frequency}} = \operatorname{AUC}(\text{race-state} + \text{Driver frequency}) - \operatorname{AUC}(\text{race-state only})\]

In the saved run:

\[\Delta_{\text{Driver frequency}} = 0.946818 - 0.946243 = 0.000575\]

So the statistically careful interpretation is:

Driver frequency has small marginal predictive utility under the controlled OOF protocol.

This is different from saying that raw Driver identity is useful. Raw one-hot Driver identity moved in the wrong direction:

\[\Delta_{\text{raw Driver OHE}} = 0.945964 - 0.946243 = -0.000279\]

That negative result matters. It suggests that a high-cardinality Driver ID can increase variance or memorize unstable category effects, while lower-variance summaries such as support counts and frequency logs are more robust.
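
To make the contrast concrete, here is a minimal sketch of a fold-safe log-frequency feature next to a one-hot encoding. Everything in it is an assumption for illustration: the helper `oof_log_frequency`, the toy driver labels, and the simple alternating fold assignment are hypothetical, and the notebook's actual B03 construction and CV splitter may differ.

```python
import numpy as np
import pandas as pd

def oof_log_frequency(frame, column, n_splits=2):
    """Out-of-fold log1p count of `column`: each row is encoded from the other folds only."""
    encoded = np.zeros(len(frame), dtype=float)
    # Simplified fold assignment (row index modulo n_splits), not the notebook's splitter.
    fold_ids = np.arange(len(frame)) % n_splits
    for fold in range(n_splits):
        valid_mask = fold_ids == fold
        counts = frame.loc[~valid_mask, column].value_counts()
        # Categories unseen in the training folds get count 0, hence log1p(0) = 0.
        valid_counts = frame.loc[valid_mask, column].map(counts).fillna(0)
        encoded[valid_mask] = np.log1p(valid_counts.to_numpy(dtype=float))
    return encoded

toy = pd.DataFrame({"Driver": ["a", "a", "b", "a", "b", "c"]})
toy["driver_log_freq"] = oof_log_frequency(toy, "Driver")

# One-hot adds one column per driver; the frequency feature stays a single numeric column.
one_hot = pd.get_dummies(toy["Driver"])
```

The one-hot matrix grows with the number of distinct drivers, while the frequency feature compresses the same column into one smooth numeric summary, which is the lower-variance representation that the comparison above favors.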

Statistically, the surviving Driver signal is closer to a support-size / reliability proxy than a direct identity effect:

\[\text{Driver} \not\Rightarrow \text{skill claim} \qquad \text{Driver frequency} \Rightarrow \text{category support and generator-context signal}\]

The residual-correction result gives the second half of the interpretation. Once the stronger stack artifact is used as the base, Driver residual corrections do not clear the promotion threshold:

\[\operatorname{AUC}_{\text{corrected}} - \operatorname{AUC}_{\text{base}} < 0.0003\]

This means the small Driver signal is either already absorbed by the stronger model stack or too weak to survive as an independent post-stack correction. Therefore, Driver is best interpreted as a low-amplitude auxiliary feature, not as a primary modeling axis.

The notebook’s statistical structure is therefore:

\[\text{Driver ID} \rightarrow \text{Driver compression} \rightarrow \text{fold-safe Driver prior} \rightarrow \text{residual Driver correction} \rightarrow \text{heavy-stack comparison}\]

The important part is the counterfactual discipline. Every Driver idea must answer:

\[\operatorname{AUC}_{\text{candidate}} - \operatorname{AUC}_{\text{matching counterfactual}} > 0\]

and for residual overlays:

\[\operatorname{AUC}_{\text{corrected}} - \operatorname{AUC}_{\text{base}} \ge 0.0003\]
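
The two gates can be written as a small decision helper. This is a sketch under stated assumptions: the function names are hypothetical, the 0.0003 value is the threshold quoted in this post (the notebook stores its own setting in `RESIDUAL_PROMOTION_MIN_AUC`), and the AUC inputs are the approximate figures reported for the saved run.

```python
RESIDUAL_PROMOTION_MIN_AUC = 0.0003  # threshold quoted in this post

def passes_counterfactual(candidate_auc, counterfactual_auc):
    """Feature-block gate: the candidate must strictly beat its matching counterfactual."""
    return candidate_auc - counterfactual_auc > 0

def passes_residual_promotion(corrected_auc, base_auc, min_gain=RESIDUAL_PROMOTION_MIN_AUC):
    """Residual-overlay gate: the corrected AUC must clear the base by at least min_gain."""
    return corrected_auc >= base_auc + min_gain

# Approximate AUC values reported for the saved run.
driver_frequency_kept = passes_counterfactual(0.946818, 0.946243)
raw_one_hot_kept = passes_counterfactual(0.945964, 0.946243)
residual_promoted = passes_residual_promotion(0.953409, 0.953404)
```

Under these inputs, Driver frequency passes the first gate, raw one-hot Driver fails it, and the heavy-stack residual overlay fails the second gate, matching the decisions described in the text.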

That is why the final interpretation is conservative:

Driver frequency survives as a small, measurable signal. Driver residual correction does not yet survive promotion once the stronger stack artifact is the base.

This post is licensed under CC BY 4.0 by the author.