stepsel.modeling.prep.model_matrix

Functions

`prepare_model_matrix`(formula, data[, intercept, ...])	Prepare a model matrix based on a formula and a data set.
`adjust_model_matrix`(model_matrices, adjusted_coeffs[, ...])	Adjust model matrix (and offset) based on adjusted coefficients dictionary.

Module Contents

stepsel.modeling.prep.model_matrix.prepare_model_matrix(formula: str, data: pandas.DataFrame, intercept: bool = True, drop_first: bool = True, omit_left_side_variables: bool = False)[source]

Prepare a model matrix based on a formula and a data set.

TODO: If intercept = False, keep all the levels of the first categorical variable.

Parameters:

formula (str) – The formula for the model.
data (pandas.DataFrame) – The data set.
intercept (bool, optional) – Whether to include an intercept in the model matrix. Default is True.
drop_first (bool, optional) – Whether to drop the first level of each categorical variable. Default is True.
omit_left_side_variables (bool, optional) – Whether to omit the left side variables from the output. Default is False. If True, the function will return only the model matrix and the feature IDs.

Returns:

y (pandas.Series) – The response variable. If omit_left_side_variables is True, the function won’t return y.
model_matrix (pandas.DataFrame) – The model matrix.
feature_ids (list) – The feature IDs.

Raises:

ValueError – If an interaction type is not supported.

Notes

The function creates a model matrix from the formula and the data set. Categorical variables are dummy-encoded. Interaction terms are created and, where they involve categorical variables, dummy-encoded as well. The feature IDs are a list of strings giving the variable name that corresponds to each column of the model matrix.
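The encoding described in these notes can be sketched with plain pandas. This is not the stepsel implementation: the column naming ("x2: B", "x1:x2: B") and the use of pandas.get_dummies are assumptions made for illustration only, and the real function may name or order columns differently.

```python
import pandas as pd

# Small illustrative data set: one numeric and one categorical predictor.
data = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": pd.Categorical(["A", "B", "C", "B"], categories=["A", "B", "C"]),
})

# Dummy-encode x2 and drop the first level ("A"), which becomes the
# reference level (analogous to drop_first=True).
dummies = pd.get_dummies(
    data["x2"], prefix="x2", prefix_sep=": ", drop_first=True
).astype(float)

# Numeric-by-categorical interaction: multiply x1 row-wise into each dummy.
# The "x1:" prefix is a hypothetical naming choice.
interaction = dummies.mul(data["x1"], axis=0).add_prefix("x1:")

# Assemble the matrix and prepend an intercept column (analogous to
# intercept=True).
model_matrix = pd.concat([data[["x1"]], dummies, interaction], axis=1)
model_matrix.insert(0, "Intercept", 1.0)

print(model_matrix.columns.tolist())
```

With three levels and the first one dropped, x2 contributes two dummy columns and the x1 interaction contributes two more, so the matrix has six columns including the intercept.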

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from stepsel.modeling.prep import prepare_model_matrix
>>> data = pd.DataFrame({"y": np.random.normal(size=100),
...                      "x1": np.random.normal(size=100),
...                      "x2": np.random.choice(["A", "B", "C"], size=100),
...                      "x3": np.random.choice(["A", "B", "C"], size=100)})
>>> data[["x2", "x3"]] = data[["x2", "x3"]].astype("category")
>>> y, model_matrix, feature_ids = prepare_model_matrix("y ~ x1 + x2 + x3 + x1*x2 + x1*x3", data)

stepsel.modeling.prep.model_matrix.adjust_model_matrix(model_matrices: list, adjusted_coeffs: dict, offsets: list = None)[source]

Adjust model matrix (and offset) based on adjusted coefficients dictionary.

Parameters:

model_matrices (list (of data frames)) – The model matrices.
adjusted_coeffs (dict) –
The adjusted coefficients dictionary, in the format {variable_name: adjusted_coefficient}, where variable_name is the name of the variable in the model. Example: {"ts_new9_g: 06": 0.20, "drpou_cpp_dop3: H": -1.74}
offsets (list (of numpy arrays or pandas Series), optional) – The offsets. Default is None.

Returns:

model_matrices (tuple (of data frames)) – The adjusted model matrices.
offsets (tuple (of numpy arrays or pandas Series)) – The adjusted offsets.

Raises:

Exception – If the number of offsets is not equal to the number of model matrices, or if the number of rows in a model matrix is not equal to the number of values in the corresponding offset.

Notes

The function removes the variables named in the adjusted coefficients dictionary from the model matrices and adds the adjusted coefficients to the offsets. It returns a tuple of the adjusted model matrices and offsets. Adjustments are done in-place, so if both matrices and offsets are provided, re-assigning the returned values is not necessary. To keep the original model matrices and offsets, make a copy of them before calling the function.
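The idea behind the adjustment can be sketched for a single model matrix as follows. The helper adjust_sketch is hypothetical and is not part of stepsel. It assumes that "adding the adjusted coefficient to the offset" means adding coefficient times column value for each row, which is the usual way of fixing a coefficient through an offset. Unlike the documented function, this sketch works on copies rather than in-place.

```python
import numpy as np
import pandas as pd


def adjust_sketch(model_matrix, adjusted_coeffs, offset=None):
    """Hypothetical stand-in: fix coefficients by moving columns into the offset."""
    if offset is None:
        # No offset supplied: start from zeros, one value per row.
        offset = np.zeros(len(model_matrix))
    if len(offset) != len(model_matrix):
        raise ValueError("offset length must equal the number of rows")
    offset = np.asarray(offset, dtype=float).copy()

    for name, coef in adjusted_coeffs.items():
        # Assumed semantics: each row contributes coef * column value.
        offset += coef * model_matrix[name].to_numpy(dtype=float)

    # The adjusted variables no longer need to be estimated, so drop them.
    reduced = model_matrix.drop(columns=list(adjusted_coeffs))
    return reduced, offset


mm = pd.DataFrame({
    "Intercept": [1.0, 1.0, 1.0],
    "ts_new9_g: 06": [1.0, 0.0, 1.0],
    "x1": [0.5, 1.5, 2.5],
})
reduced, offset = adjust_sketch(mm, {"ts_new9_g: 06": 0.20}, offset=[0.0, 0.1, 0.2])
print(reduced.columns.tolist())
print(offset)
```

Because the sketch returns new objects, the caller must use the returned values; the documented function instead modifies its inputs, which is why copying beforehand is recommended when the originals are still needed.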