Example usage

Here we will demonstrate how to use the stepsel package for stepwise selection of variables in a regression model and how to prepare categorical data for such a model.

# Importing the packages
import stepsel
import numpy as np
import statsmodels.api as sm

Load the data

First we load the data and take a look at the first few rows. It is soccer data from Czech First League.

dt = stepsel.datasets.load_soccer_data()
print(dt.shape, dt.columns)
(518, 24) Index(['match_id', 'match_datetime', 'team', 'hga', 'team_opp', 'goals',
       'goals_opp', 'yellow', 'yellow_opp', 'red', 'red_opp', 'penalty',
       'penalty_opp', 'fouls', 'fouls_opp', 'attacks', 'attacks_opp',
       'dangerous_attacks', 'dangerous_attacks_opp', 'ref_interv',
       'ref_interv_opp', 'ref_interv_per_attack', 'ref_interv_per_attack_opp',
       'ref_interv_per_attack_diff'],
      dtype='object')
dt.head()
match_id match_datetime team hga team_opp goals goals_opp yellow yellow_opp red ... fouls_opp attacks attacks_opp dangerous_attacks dangerous_attacks_opp ref_interv ref_interv_opp ref_interv_per_attack ref_interv_per_attack_opp ref_interv_per_attack_diff
0 1 2023-05-14 18:00:00 Bohemians H Slovácko 0.0 0.0 2 2 0 ... 12.0 117.0 142.0 71.0 88.0 20.0 18.0 0.140845 0.153846 -0.013001
1 2 2023-05-14 15:00:00 Jablonec H Baník Ostrava 1.0 1.0 1 1 0 ... 11.0 90.0 116.0 74.0 58.0 14.0 22.0 0.120690 0.244444 -0.123755
2 3 2023-05-14 15:00:00 Teplice H Zlín 2.0 1.0 4 4 0 ... 14.0 118.0 134.0 54.0 62.0 30.0 26.0 0.223881 0.220339 0.003542
3 4 2023-05-14 15:00:00 Zbrojovka Brno H Pardubice 0.0 2.0 1 2 0 ... 14.0 99.0 126.0 57.0 76.0 17.0 36.0 0.134921 0.363636 -0.228716
4 5 2023-05-13 18:00:00 Sparta Praha H Slavia Praha 3.0 2.0 1 2 0 ... 15.0 87.0 145.0 47.0 94.0 15.0 29.0 0.103448 0.333333 -0.229885

5 rows × 24 columns

Binning the data

There are two ways to bin data in stepsel package: by specifying the cut points or optimal binning. We will demonstrate both.

By specifying the cut points

dt["fouls_opp_cat1"] = stepsel.binning.bin_values(dt["fouls_opp"], [8, 12, 16, 20])
dt["fouls_opp_cat1"].value_counts()
fouls_opp_cat1
(8, 12]      195
(12, 16]     167
(-Inf, 8]     84
(16, 20]      58
(20, Inf)     14
Name: count, dtype: int64

Optimal binning

Optimal binning can be time consuming, so when the data is large, and there are many cp values, they are grouped using K-means clustering and only the best clusters are used. This process is controlled by the Flow Parameters (see the docstring). Other parameters are used to control the scikit-learn’s DecisionTreeRegressor, KMeans and GridSearchCV.

X = np.array(dt.loc[:, ["fouls_opp"]])
y = np.array(dt.loc[:, "goals"])

clf = stepsel.binning.optimal.OptimalBinningUsingDecisionTreeRegressor(
    criterion = 'poisson',
    scoring = "neg_mean_poisson_deviance",
    refit = "neg_mean_poisson_deviance",
    max_depth = 3
)
clf.fit(X, y) # Fit the model
clf.set_feature_names(["fouls_opp"]) # Set the feature names for the plot
clf.plot_tree((15,10)) # Plot the tree
Starting cross-validation for 58 cp values at 20:55:12.
Estimated runtime: 0.013480726877848308 minutes.
_images/5631122fb746721fdec1108408707e0d9806d1f4debc1656d4092f155c331fcd.png
# Predicting the values
dt["fouls_opp_cat2"] = clf.predict(X)
dt["fouls_opp_cat2"].value_counts()
fouls_opp_cat2
1.551570    446
1.152778     72
Name: count, dtype: int64
# Creating the bins
dt["fouls_opp_cat2"] = clf.bin_values(dt["fouls_opp"])
dt["fouls_opp_cat2"] = dt["fouls_opp_cat2"].astype("category")
# Function to get the cut points can be used for any scikit-learn tree based model
# The final tree can be obtained from the cv_fit attribute of the model Log object
stepsel.binning.helper.get_tree_cut_points(clf.Log.fit_log["cv_fit"].best_estimator_, feature_names=["fouls_opp"])
{'fouls_opp': array([16.5])}

Setting the reference category

You can order any number of teams, the first one will be the reference category

dt["hga"] = stepsel.modeling.prep.relevel_categorical_variable(dt["hga"], ["H"])
dt["team"] = stepsel.modeling.prep.relevel_categorical_variable(dt["team"], ["Bohemians", "Jablonec"])
dt["team_opp"] = stepsel.modeling.prep.relevel_categorical_variable(dt["team_opp"], ["Bohemians", "Jablonec", "Teplice"])
dt["fouls_opp_cat1"] = stepsel.modeling.prep.relevel_categorical_variable(dt["fouls_opp_cat1"], ["(-Inf, 8]"])
Series is not categorical, converting to categorical.
Series is not categorical, converting to categorical.
Series is not categorical, converting to categorical.
Series is not categorical, converting to categorical.

Stepwise selection

The most important parameters for the selection are formula, include, slentry and slstay. Specifying the formula in R-style format, the variables to always include in the model, and the significance levels of the likelihood ratio test for entering and staying in the model.

# Parameters for the model
formula = "goals ~ team + hga + red_opp + fouls_opp_cat1 + fouls_opp_cat2"
include = ['ref_interv_per_attack * ref_interv_per_attack_opp', "penalty"]
slentry = 0.05
slstay = 0.10
family = sm.families.Poisson()
# Fitting the model
model1 = stepsel.modeling.stepwise_glm.StepwiseGLM(
    formula = formula,
    data = dt,
    include = include,
    slentry = slentry,
    slstay = slstay,
    family = family
)
model1.fit()
Best entry candidates:
          feature   p_value
0            team  0.000002
1             hga  0.000609
2         red_opp  0.005159
3  fouls_opp_cat2  0.013067
4  fouls_opp_cat1  0.060894
>> team entered the model.

Best removal candidates:
  feature   p_value
0    team  0.000002
>> No feature removed from the model.

Best entry candidates:
          feature   p_value
0             hga  0.000608
1         red_opp  0.004282
2  fouls_opp_cat2  0.045388
3  fouls_opp_cat1  0.160092
>> hga entered the model.

Best removal candidates:
  feature   p_value
0     hga  0.000608
1    team  0.000002
>> No feature removed from the model.

Best entry candidates:
          feature   p_value
0         red_opp  0.003997
1  fouls_opp_cat2  0.022037
2  fouls_opp_cat1  0.052854
>> red_opp entered the model.

Best removal candidates:
   feature   p_value
0  red_opp  0.003997
1      hga  0.000569
2     team  0.000002
>> No feature removed from the model.

Best entry candidates:
          feature   p_value
0  fouls_opp_cat2  0.037692
1  fouls_opp_cat1  0.108477
>> fouls_opp_cat2 entered the model.
Best removal candidates:
          feature   p_value
0  fouls_opp_cat2  0.037692
1         red_opp  0.006665
2             hga  0.000305
3            team  0.000004
>> No feature removed from the model.

Best entry candidates:
          feature   p_value
0  fouls_opp_cat1  0.353892
>> No feature entered the model.

Best removal candidates:
          feature   p_value
0  fouls_opp_cat2  0.037692
1         red_opp  0.006665
2             hga  0.000305
3            team  0.000004
>> No feature removed from the model.


Selection stopped because all candidates for removal are significant at the 0.1 level and no candidate for entry is significant at the 0.05 level.

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                  goals   No. Observations:                  518
Model:                            GLM   Df Residuals:                      497
Model Family:                 Poisson   Df Model:                           20
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -754.52
Date:                Tue, 03 Sep 2024   Deviance:                       521.53
Time:                        20:55:13   Pearson chi2:                     442.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.1771
Covariance Type:            nonrobust                                         
=====================================================================================================================
                                                        coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------------------------------------
Intercept                                             0.6056      0.148      4.103      0.000       0.316       0.895
red_opp                                               0.3259      0.116      2.814      0.005       0.099       0.553
penalty                                               0.3239      0.084      3.867      0.000       0.160       0.488
team: Jablonec                                       -0.0641      0.199     -0.323      0.747      -0.453       0.325
team: Mladá Boleslav                                 -0.3061      0.211     -1.453      0.146      -0.719       0.107
team: Slavia Praha                                    0.5141      0.172      2.992      0.003       0.177       0.851
team: Baník Ostrava                                  -0.0717      0.199     -0.360      0.719      -0.462       0.318
team: Slovácko                                       -0.2592      0.210     -1.235      0.217      -0.670       0.152
team: Pardubice                                      -0.4425      0.226     -1.955      0.051      -0.886       0.001
team: Sparta Praha                                    0.2930      0.179      1.639      0.101      -0.057       0.643
team: Teplice                                        -0.3210      0.208     -1.545      0.122      -0.728       0.086
team: Sigma Olomouc                                  -0.0832      0.197     -0.423      0.672      -0.469       0.302
team: Zlín                                           -0.2405      0.211     -1.141      0.254      -0.654       0.173
team: Zbrojovka Brno                                 -0.2573      0.209     -1.231      0.218      -0.667       0.152
team: Hradec Králové                                 -0.2959      0.216     -1.367      0.172      -0.720       0.128
team: Slovan Liberec                                 -0.1045      0.203     -0.515      0.607      -0.502       0.293
team: České Budějovice                               -0.2703      0.213     -1.269      0.204      -0.688       0.147
team: Viktoria Plzeň                                  0.0649      0.189      0.343      0.732      -0.306       0.436
hga: A                                               -0.2616      0.073     -3.597      0.000      -0.404      -0.119
fouls_opp_cat2: (16.5, Inf)                          -0.2456      0.121     -2.024      0.043      -0.483      -0.008
ref_interv_per_attack * ref_interv_per_attack_opp    -1.5845      1.311     -1.208      0.227      -4.154       0.985
=====================================================================================================================
<stepsel.modeling.stepwise_glm.StepwiseGLM at 0x7f33444bee40>

Refitting the model

After the selection it is possible to specify different model formula and use the Log of the previously fitted model to speed up the selection. This is useful when the data is large and the selection is time consuming. The same partial models during the selection won’t be fitted again. Only use this functionality when the data has not changed!

# The new formula including the team_opp variable
formula_new = "goals ~ team + hga + red_opp + fouls_opp_cat1 + fouls_opp_cat2 + team_opp"
model2 = stepsel.modeling.stepwise_glm.StepwiseGLM(
    formula = formula_new,
    data = dt,
    include = include,
    slentry = slentry,
    slstay = slstay,
    model_fit_log = model1.model_fit_log, # Use the previous model fit Log object
    family = sm.families.Poisson()
)
model2.fit()
Best entry candidates:
          feature   p_value
0            team  0.000002
1        team_opp  0.000284
2             hga  0.000609
3         red_opp  0.005159
4  fouls_opp_cat2  0.013067
>> team entered the model.

Best removal candidates:
  feature   p_value
0    team  0.000002
>> No feature removed from the model.

Best entry candidates:
          feature   p_value
0        team_opp  0.000392
1             hga  0.000608
2         red_opp  0.004282
3  fouls_opp_cat2  0.045388
4  fouls_opp_cat1  0.160092
>> team_opp entered the model.

Best removal candidates:
    feature   p_value
0  team_opp  0.000392
1      team  0.000003
>> No feature removed from the model.

Best entry candidates:
          feature   p_value
0             hga  0.000621
1         red_opp  0.005264
2  fouls_opp_cat2  0.078540
3  fouls_opp_cat1  0.131029
>> hga entered the model.
Best removal candidates:
    feature   p_value
0       hga  0.000621
1  team_opp  0.000397
2      team  0.000003
>> No feature removed from the model.

Best entry candidates:
          feature   p_value
0         red_opp  0.005355
1  fouls_opp_cat1  0.044937
2  fouls_opp_cat2  0.049454
>> red_opp entered the model.

Best removal candidates:
    feature   p_value
0   red_opp  0.005355
1       hga  0.000632
2  team_opp  0.000478
3      team  0.000003
>> No feature removed from the model.
Best entry candidates:
          feature   p_value
0  fouls_opp_cat2  0.096376
1  fouls_opp_cat1  0.097127
>> No feature entered the model.

Best removal candidates:
    feature   p_value
0   red_opp  0.005355
1       hga  0.000632
2  team_opp  0.000478
3      team  0.000003
>> No feature removed from the model.


Selection stopped because all candidates for removal are significant at the 0.1 level and no candidate for entry is significant at the 0.05 level.

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                  goals   No. Observations:                  518
Model:                            GLM   Df Residuals:                      483
Model Family:                 Poisson   Df Model:                           34
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -736.76
Date:                Tue, 03 Sep 2024   Deviance:                       486.00
Time:                        20:55:13   Pearson chi2:                     405.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.2316
Covariance Type:            nonrobust                                         
=====================================================================================================================
                                                        coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------------------------------------
Intercept                                             0.6859      0.204      3.367      0.001       0.287       1.085
red_opp                                               0.3446      0.120      2.880      0.004       0.110       0.579
penalty                                               0.3079      0.085      3.633      0.000       0.142       0.474
team: Jablonec                                       -0.1051      0.200     -0.526      0.599      -0.496       0.286
team: Mladá Boleslav                                 -0.3335      0.212     -1.576      0.115      -0.748       0.081
team: Slavia Praha                                    0.4651      0.173      2.691      0.007       0.126       0.804
team: Baník Ostrava                                  -0.1165      0.200     -0.584      0.559      -0.508       0.275
team: Slovácko                                       -0.3340      0.211     -1.585      0.113      -0.747       0.079
team: Pardubice                                      -0.4832      0.227     -2.129      0.033      -0.928      -0.038
team: Sparta Praha                                    0.2698      0.180      1.501      0.133      -0.082       0.622
team: Teplice                                        -0.3210      0.208     -1.541      0.123      -0.729       0.087
team: Sigma Olomouc                                  -0.1164      0.197     -0.590      0.556      -0.503       0.271
team: Zlín                                           -0.2897      0.212     -1.369      0.171      -0.704       0.125
team: Zbrojovka Brno                                 -0.2716      0.210     -1.294      0.196      -0.683       0.140
team: Hradec Králové                                 -0.3747      0.217     -1.727      0.084      -0.800       0.051
team: Slovan Liberec                                 -0.1739      0.204     -0.853      0.393      -0.573       0.226
team: České Budějovice                               -0.2836      0.214     -1.326      0.185      -0.703       0.136
team: Viktoria Plzeň                                  0.0574      0.190      0.302      0.763      -0.315       0.430
hga: A                                               -0.2470      0.073     -3.404      0.001      -0.389      -0.105
team_opp: Jablonec                                    0.1064      0.187      0.570      0.569      -0.260       0.472
team_opp: Teplice                                     0.1953      0.184      1.064      0.287      -0.165       0.555
team_opp: Mladá Boleslav                             -0.1597      0.202     -0.791      0.429      -0.556       0.236
team_opp: Slavia Praha                               -0.6235      0.229     -2.718      0.007      -1.073      -0.174
team_opp: Baník Ostrava                              -0.1468      0.204     -0.720      0.471      -0.546       0.253
team_opp: Slovácko                                   -0.2684      0.205     -1.310      0.190      -0.670       0.133
team_opp: Pardubice                                   0.0812      0.188      0.432      0.666      -0.287       0.449
team_opp: Sparta Praha                               -0.4883      0.223     -2.192      0.028      -0.925      -0.052
team_opp: Sigma Olomouc                              -0.1868      0.202     -0.925      0.355      -0.583       0.209
team_opp: Zlín                                        0.1161      0.188      0.619      0.536      -0.252       0.484
team_opp: Zbrojovka Brno                              0.1574      0.185      0.849      0.396      -0.206       0.521
team_opp: Hradec Králové                             -0.3510      0.208     -1.685      0.092      -0.759       0.057
team_opp: Slovan Liberec                             -0.1634      0.199     -0.820      0.412      -0.554       0.227
team_opp: České Budějovice                            0.0914      0.188      0.485      0.627      -0.278       0.461
team_opp: Viktoria Plzeň                             -0.3707      0.214     -1.732      0.083      -0.790       0.049
ref_interv_per_attack * ref_interv_per_attack_opp    -1.1145      1.309     -0.852      0.394      -3.679       1.450
=====================================================================================================================
<stepsel.modeling.stepwise_glm.StepwiseGLM at 0x7f3342816870>
# Print the summary of the final model. The extracted model is statsmodels object.
final_model = model2.current_model
print(final_model.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                  goals   No. Observations:                  518
Model:                            GLM   Df Residuals:                      483
Model Family:                 Poisson   Df Model:                           34
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -736.76
Date:                Tue, 03 Sep 2024   Deviance:                       486.00
Time:                        20:55:14   Pearson chi2:                     405.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.2316
Covariance Type:            nonrobust                                         
=====================================================================================================================
                                                        coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------------------------------------
Intercept                                             0.6859      0.204      3.367      0.001       0.287       1.085
red_opp                                               0.3446      0.120      2.880      0.004       0.110       0.579
penalty                                               0.3079      0.085      3.633      0.000       0.142       0.474
team: Jablonec                                       -0.1051      0.200     -0.526      0.599      -0.496       0.286
team: Mladá Boleslav                                 -0.3335      0.212     -1.576      0.115      -0.748       0.081
team: Slavia Praha                                    0.4651      0.173      2.691      0.007       0.126       0.804
team: Baník Ostrava                                  -0.1165      0.200     -0.584      0.559      -0.508       0.275
team: Slovácko                                       -0.3340      0.211     -1.585      0.113      -0.747       0.079
team: Pardubice                                      -0.4832      0.227     -2.129      0.033      -0.928      -0.038
team: Sparta Praha                                    0.2698      0.180      1.501      0.133      -0.082       0.622
team: Teplice                                        -0.3210      0.208     -1.541      0.123      -0.729       0.087
team: Sigma Olomouc                                  -0.1164      0.197     -0.590      0.556      -0.503       0.271
team: Zlín                                           -0.2897      0.212     -1.369      0.171      -0.704       0.125
team: Zbrojovka Brno                                 -0.2716      0.210     -1.294      0.196      -0.683       0.140
team: Hradec Králové                                 -0.3747      0.217     -1.727      0.084      -0.800       0.051
team: Slovan Liberec                                 -0.1739      0.204     -0.853      0.393      -0.573       0.226
team: České Budějovice                               -0.2836      0.214     -1.326      0.185      -0.703       0.136
team: Viktoria Plzeň                                  0.0574      0.190      0.302      0.763      -0.315       0.430
hga: A                                               -0.2470      0.073     -3.404      0.001      -0.389      -0.105
team_opp: Jablonec                                    0.1064      0.187      0.570      0.569      -0.260       0.472
team_opp: Teplice                                     0.1953      0.184      1.064      0.287      -0.165       0.555
team_opp: Mladá Boleslav                             -0.1597      0.202     -0.791      0.429      -0.556       0.236
team_opp: Slavia Praha                               -0.6235      0.229     -2.718      0.007      -1.073      -0.174
team_opp: Baník Ostrava                              -0.1468      0.204     -0.720      0.471      -0.546       0.253
team_opp: Slovácko                                   -0.2684      0.205     -1.310      0.190      -0.670       0.133
team_opp: Pardubice                                   0.0812      0.188      0.432      0.666      -0.287       0.449
team_opp: Sparta Praha                               -0.4883      0.223     -2.192      0.028      -0.925      -0.052
team_opp: Sigma Olomouc                              -0.1868      0.202     -0.925      0.355      -0.583       0.209
team_opp: Zlín                                        0.1161      0.188      0.619      0.536      -0.252       0.484
team_opp: Zbrojovka Brno                              0.1574      0.185      0.849      0.396      -0.206       0.521
team_opp: Hradec Králové                             -0.3510      0.208     -1.685      0.092      -0.759       0.057
team_opp: Slovan Liberec                             -0.1634      0.199     -0.820      0.412      -0.554       0.227
team_opp: České Budějovice                            0.0914      0.188      0.485      0.627      -0.278       0.461
team_opp: Viktoria Plzeň                             -0.3707      0.214     -1.732      0.083      -0.790       0.049
ref_interv_per_attack * ref_interv_per_attack_opp    -1.1145      1.309     -0.852      0.394      -3.679       1.450
=====================================================================================================================

Prepering model matrix for other models

The stepsel package can be used to prepare the model matrix for other models. The prepare_model_matrix function is used for this purpose using the convienient formula syntax. Interaction terms of two variables are supported.

# Prepare the model matrix
formula = "goals ~ team + hga * penalty"
y, X, feature_ids = stepsel.modeling.prep.prepare_model_matrix(formula, dt)

Fit statsmodels model with prepared model matrix using stepsel

# Train-test split based on the match_id
train_filter = dt["match_id"] <= 200
y_train = y[train_filter]
X_train = X.loc[train_filter,:].copy()
y_test = y[~train_filter]
X_test = X.loc[~train_filter,:].copy()
# Fit the model
model3 = sm.GLM(endog = y_train,
              exog = X_train,
              family = family)
model3 = model3.fit()
print(model3.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                  goals   No. Observations:                  400
Model:                            GLM   Df Residuals:                      382
Model Family:                 Poisson   Df Model:                           17
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -593.90
Date:                Tue, 03 Sep 2024   Deviance:                       423.76
Time:                        20:55:14   Pearson chi2:                     369.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.1442
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                  0.4030      0.159      2.531      0.011       0.091       0.715
team: Jablonec            -0.0476      0.226     -0.211      0.833      -0.490       0.394
team: Mladá Boleslav      -0.2761      0.244     -1.131      0.258      -0.755       0.202
team: Slavia Praha         0.4624      0.201      2.295      0.022       0.067       0.857
team: Baník Ostrava       -0.0086      0.224     -0.038      0.969      -0.448       0.431
team: Slovácko            -0.2853      0.244     -1.171      0.242      -0.763       0.192
team: Pardubice           -0.3462      0.249     -1.390      0.165      -0.834       0.142
team: Sparta Praha         0.4073      0.201      2.031      0.042       0.014       0.800
team: Teplice             -0.2247      0.234     -0.960      0.337      -0.683       0.234
team: Sigma Olomouc        0.0312      0.221      0.141      0.888      -0.403       0.465
team: Zlín                -0.1886      0.239     -0.790      0.430      -0.657       0.279
team: Zbrojovka Brno      -0.3861      0.248     -1.557      0.120      -0.872       0.100
team: Hradec Králové      -0.2945      0.243     -1.210      0.226      -0.772       0.183
team: Slovan Liberec      -0.1769      0.235     -0.754      0.451      -0.637       0.283
team: České Budějovice    -0.2186      0.239     -0.915      0.360      -0.687       0.250
team: Viktoria Plzeň       0.1407      0.214      0.656      0.512      -0.279       0.561
hga: H * penalty           0.4112      0.120      3.426      0.001       0.176       0.646
hga: A * penalty           0.2397      0.112      2.131      0.033       0.019       0.460
==========================================================================================

Fixing some of the variables coefficients

It is possible to fix some of the variables coefficients to a specific value using offset. This is particularly useful when the variable is a factor and one or more of its levels are not well fitted by the model.

# Fix some of the coefficients - use the names of the coefficients from the summary
adjusted_coeffs = {"team: Baník Ostrava": 0.00,
                   "hga: H * penalty": 0.45 
}
# Adjust the model matrix and offset
X_train, X_test, offset_train, offset_test = stepsel.modeling.prep.model_matrix.adjust_model_matrix([X_train, X_test], adjusted_coeffs)
# Fit the final model using the adjusted model matrix and offset
model_final2 = sm.GLM(endog = y_train,
              exog = X_train,
              family = family,
              offset = offset_train)
model_final2 = model_final2.fit()
print(model_final2.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                  goals   No. Observations:                  400
Model:                            GLM   Df Residuals:                      384
Model Family:                 Poisson   Df Model:                           15
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -593.96
Date:                Tue, 03 Sep 2024   Deviance:                       423.87
Time:                        20:55:14   Pearson chi2:                     370.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.1189
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                  0.3918      0.112      3.492      0.000       0.172       0.612
team: Jablonec            -0.0413      0.197     -0.209      0.834      -0.428       0.345
team: Mladá Boleslav      -0.2652      0.217     -1.223      0.221      -0.690       0.160
team: Slavia Praha         0.4703      0.169      2.785      0.005       0.139       0.801
team: Slovácko            -0.2766      0.217     -1.276      0.202      -0.701       0.148
team: Pardubice           -0.3373      0.223     -1.514      0.130      -0.774       0.099
team: Sparta Praha         0.4092      0.169      2.426      0.015       0.079       0.740
team: Teplice             -0.2206      0.207     -1.065      0.287      -0.627       0.185
team: Sigma Olomouc        0.0376      0.192      0.196      0.845      -0.339       0.414
team: Zlín                -0.1821      0.212     -0.860      0.390      -0.597       0.233
team: Zbrojovka Brno      -0.3819      0.223     -1.716      0.086      -0.818       0.054
team: Hradec Králové      -0.2878      0.217     -1.327      0.185      -0.713       0.137
team: Slovan Liberec      -0.1685      0.207     -0.814      0.416      -0.574       0.237
team: České Budějovice    -0.2099      0.212     -0.992      0.321      -0.625       0.205
team: Viktoria Plzeň       0.1474      0.184      0.802      0.423      -0.213       0.508
hga: A * penalty           0.2439      0.112      2.184      0.029       0.025       0.463
==========================================================================================

Scoring the model

The model can be scored via standard statsmodels API. stepsel package provides a scoring_table object, which handles the fixed variables and the offset in a straightforward way. It is also possible to generate SQL code for scoring the model in the database.

# Create scoring table
# Notice the fixed coefficients are also used in the scoring table
# You can also create the scoring table from csv file or pandas DataFrame.
scoring_table = stepsel.modeling.predict.ScoringTableGLM.from_glm_model(model_final2, adjusted_coeffs)
print(scoring_table)
ScoringTableGLM(scoring_table=         var1     var2        level_var1 level_var2  estimate
0   Intercept     None              None       None  0.391757
1        team     None  České Budějovice       None -0.209868
2        team     None              Zlín       None -0.182095
3        team     None    Zbrojovka Brno       None -0.381924
4        team     None    Viktoria Plzeň       None  0.147379
5        team     None           Teplice       None -0.220638
6        team     None      Sparta Praha       None  0.409163
7        team     None          Slovácko       None -0.276560
8        team     None    Slovan Liberec       None -0.168498
9        team     None      Slavia Praha       None  0.470338
10       team     None     Sigma Olomouc       None  0.037562
11       team     None         Pardubice       None -0.337274
12       team     None    Mladá Boleslav       None -0.265193
13       team     None          Jablonec       None -0.041279
14       team     None    Hradec Králové       None -0.287798
15       team     None     Baník Ostrava       None  0.000000
16        hga  penalty                 H       None  0.450000
17        hga  penalty                 A       None  0.243900)
# For prediction, prepare the full model matrix
y, X_full, feature_ids_full = stepsel.modeling.prep.prepare_model_matrix(formula, dt, drop_first=False)

# There is a method to predict the linear predictor of the model, which can be used to calculate the predictions
dt["preds"] = np.exp(scoring_table.predict_linear(X_full))
# Generate the SQL code for the model (microsoft SQL Server flavor)
print(scoring_table.sql(format=True))
 + 0.39175660628921727
 + case  when TRIM(CAST(team as varchar(999))) = 'České Budějovice' then -0.209868356396289
 when TRIM(CAST(team as varchar(999))) = 'Zlín' then -0.18209532272001622
 when TRIM(CAST(team as varchar(999))) = 'Zbrojovka Brno' then -0.38192399630625423
 when TRIM(CAST(team as varchar(999))) = 'Viktoria Plzeň' then 0.14737941642116714
 when TRIM(CAST(team as varchar(999))) = 'Teplice' then -0.22063773340931692
 when TRIM(CAST(team as varchar(999))) = 'Sparta Praha' then 0.4091626261463444
 when TRIM(CAST(team as varchar(999))) = 'Slovácko' then -0.2765597308949606
 when TRIM(CAST(team as varchar(999))) = 'Slovan Liberec' then -0.16849805557501135
 when TRIM(CAST(team as varchar(999))) = 'Slavia Praha' then 0.4703384015175227
 when TRIM(CAST(team as varchar(999))) = 'Sigma Olomouc' then 0.037561752737359344
 when TRIM(CAST(team as varchar(999))) = 'Pardubice' then -0.3372735198024266
 when TRIM(CAST(team as varchar(999))) = 'Mladá Boleslav' then -0.2651933788212607
 when TRIM(CAST(team as varchar(999))) = 'Jablonec' then -0.041278776042565524
 when TRIM(CAST(team as varchar(999))) = 'Hradec Králové' then -0.28779833962517487
 when TRIM(CAST(team as varchar(999))) = 'Baník Ostrava' then 0.0 else 0.0 end
 + case  when TRIM(CAST(hga as varchar(999))) = 'H' then penalty * 0.45
 when TRIM(CAST(hga as varchar(999))) = 'A' then penalty * 0.24390049563090516 else 0.0 end

Verifying the model

For quick verification of the model, one can use the group_over_columns function, which goes over the specified columns and applies the specified grouping dictionary. This is built on top of the pandas groupby function.

# Evaluate the model. Note you can evaluate any combination of variables.
vars2eval = ["team", "hga", "penalty", "fouls_opp_cat2", ["hga", "penalty"]]
evaluation_train = stepsel.tools.group_over_columns(dt.loc[train_filter,], vars2eval, {"goals": "mean", "preds": "mean" })
evaluation_test = stepsel.tools.group_over_columns(dt.loc[~train_filter,:], vars2eval, {"goals": "mean", "preds": "mean"})
# One can see that rather simple descriptive model fits the data reasonably well, even though it was constructed only to demonstrate the package.
# The biggest discrepancy is in matches with penalty, and for the teams of Slavia Praha and Sparta Praha. This has its reasons,
# and Czech football fans will have their own opinions on that.
print(evaluation_train, evaluation_test, sep="\n\n")
        variable_1           level_1 variable_2  level_2     goals     preds
0             team         Bohemians        NaN      NaN  1.640000  1.636806
1             team          Jablonec        NaN      NaN  1.520000  1.520000
2             team    Mladá Boleslav        NaN      NaN  1.160000  1.160000
3             team      Slavia Praha        NaN      NaN  2.560000  2.560000
4             team     Baník Ostrava        NaN      NaN  1.560000  1.563194
5             team          Slovácko        NaN      NaN  1.160000  1.160000
6             team         Pardubice        NaN      NaN  1.080000  1.080000
7             team      Sparta Praha        NaN      NaN  2.560000  2.560000
8             team           Teplice        NaN      NaN  1.320000  1.320000
9             team     Sigma Olomouc        NaN      NaN  1.640000  1.640000
10            team              Zlín        NaN      NaN  1.291667  1.291667
11            team    Zbrojovka Brno        NaN      NaN  1.080000  1.080000
12            team    Hradec Králové        NaN      NaN  1.160000  1.160000
13            team    Slovan Liberec        NaN      NaN  1.320000  1.320000
14            team  České Budějovice        NaN      NaN  1.240000  1.240000
15            team    Viktoria Plzeň        NaN      NaN  1.807692  1.807692
16             hga                 H        NaN      NaN  1.755000  1.542806
17             hga                 A        NaN      NaN  1.260000  1.472194
18         penalty               0.0        NaN      NaN  1.380403  1.387885
19         penalty               1.0        NaN      NaN  2.361702  2.193053
20         penalty               2.0        NaN      NaN  2.166667  3.055077
21  fouls_opp_cat2      (-Inf, 16.5]        NaN      NaN  1.567251  1.512505
22  fouls_opp_cat2       (16.5, Inf)        NaN      NaN  1.155172  1.477987
23             hga                 H    penalty      0.0  1.641618  1.404335
24             hga                 H    penalty      1.0  2.600000  2.379531
25             hga                 H    penalty      2.0  1.000000  3.061540
26             hga                 A    penalty      0.0  1.120690  1.371529
27             hga                 A    penalty      1.0  2.090909  1.981147
28             hga                 A    penalty      2.0  2.750000  3.051845

        variable_1           level_1 variable_2  level_2     goals     preds
0             team         Bohemians        NaN      NaN  1.625000  1.635771
1             team          Jablonec        NaN      NaN  1.428571  1.535011
2             team    Mladá Boleslav        NaN      NaN  1.428571  1.363987
3             team      Slavia Praha        NaN      NaN  3.375000  2.449881
4             team     Baník Ostrava        NaN      NaN  1.285714  1.479578
5             team          Slovácko        NaN      NaN  1.375000  1.201807
6             team         Pardubice        NaN      NaN  0.571429  1.055995
7             team      Sparta Praha        NaN      NaN  1.500000  2.544079
8             team           Teplice        NaN      NaN  1.142857  1.282971
9             team     Sigma Olomouc        NaN      NaN  1.125000  1.645341
10            team              Zlín        NaN      NaN  1.000000  1.233260
11            team    Zbrojovka Brno        NaN      NaN  1.857143  1.049731
12            team    Hradec Králové        NaN      NaN  1.000000  1.109554
13            team    Slovan Liberec        NaN      NaN  1.714286  1.250144
14            team  České Budějovice        NaN      NaN  1.000000  1.199480
15            team    Viktoria Plzeň        NaN      NaN  1.714286  1.714525
16             hga                 H        NaN      NaN  1.440678  1.545113
17             hga                 A        NaN      NaN  1.474576  1.454344
18         penalty               0.0        NaN      NaN  1.428571  1.411392
19         penalty               1.0        NaN      NaN  1.692308  2.213212
20  fouls_opp_cat2      (-Inf, 16.5]        NaN      NaN  1.500000  1.513460
21  fouls_opp_cat2       (16.5, Inf)        NaN      NaN  1.142857  1.397720
22             hga                 H    penalty      0.0  1.440000  1.400752
23             hga                 H    penalty      1.0  1.444444  2.347115
24             hga                 A    penalty      0.0  1.418182  1.421065
25             hga                 A    penalty      1.0  2.250000  1.911932