stepsel.modeling.prep

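The stepsel.modeling.prep module provides functions for preparing data for modeling: parsing model formulas, recognizing variable types, building interaction terms, and assembling model matrices.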

Functions

get_interaction_type(interaction, ...)

Get the interaction type of an interaction.

relevel_categorical_variable(series, new_order)

Relevel a categorical variable.

parse_model_formula(formula)

Parse a model formula into its components.

recognize_variable_types(data, interaction_variables, ...)

Recognize the types of the variables.

interaction_categorical_numerical(series1, series2)

Create an interaction term between a categorical and a numerical variable.

interaction_categorical_categorical(series1, series2)

Create an interaction term between two categorical variables.

interaction_numerical_numerical(series1, series2)

Create an interaction term between two numerical variables.

prepare_model_matrix(formula, data[, intercept, ...])

Prepare a model matrix based on a formula and a data set.

Package Contents

stepsel.modeling.prep.get_interaction_type(interaction: str, interaction_numerical_variables: list, interaction_categorical_variables: list)

Get the interaction type of an interaction.

Parameters:
  • interaction (str) – The interaction to get the type of.

  • interaction_numerical_variables (list) – The numerical variables that are used in the interactions.

  • interaction_categorical_variables (list) – The categorical variables that are used in the interactions.

Returns:

interaction_type – The interaction type. One of “numerical_numerical”, “categorical_categorical”, “numerical_categorical”, “categorical_numerical”.

Return type:

str

Raises:

ValueError – If the interaction does not contain exactly one ‘*’ character. If a variable in the interaction does not appear in exactly one of the two lists.
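The classification described above can be sketched in plain Python. This is a minimal illustration of the documented behavior, not the library's implementation; the name classify_interaction is hypothetical.

```python
def classify_interaction(interaction, numerical_variables, categorical_variables):
    """Hypothetical sketch: classify a two-variable interaction such as "a * b"."""
    if interaction.count("*") != 1:
        raise ValueError("The interaction must contain exactly one '*' character.")

    kinds = []
    for name in (part.strip() for part in interaction.split("*")):
        in_numerical = name in numerical_variables
        in_categorical = name in categorical_variables
        # A variable must be in exactly one list: not in both, not in neither.
        if in_numerical == in_categorical:
            raise ValueError(f"Variable {name!r} must be in exactly one of the two lists.")
        kinds.append("numerical" if in_numerical else "categorical")

    # The order of the two kinds follows the order of the variables.
    return "_".join(kinds)


result = classify_interaction("a * b", ["a"], ["b"])
```

Because the type string follows the order of the variables, swapping the operands ("b * a") would give "categorical_numerical" in this sketch.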

Examples

>>> get_interaction_type("a * b", ["a"], ["b"])
"numerical_categorical"

stepsel.modeling.prep.relevel_categorical_variable(series: pandas.Series, new_order: list)

Relevel a categorical variable.

Parameters:
  • series (pd.Series) – The categorical variable to relevel.

  • new_order (list) – The new order of the categories.

Returns:

series – The relevelled categorical variable.

Return type:

pd.Series

Raises:

ValueError – If the new order is not a subset of the current categories. If the new order contains duplicates.
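A rough idea of what releveling involves, written with pandas only. The function name relevel_sketch is hypothetical, and one detail is an assumption rather than documented behavior: categories missing from new_order are appended after the listed ones in their existing order.

```python
import pandas as pd


def relevel_sketch(series, new_order):
    """Hypothetical sketch: move the categories in new_order to the front."""
    current = list(series.cat.categories)
    if len(set(new_order)) != len(new_order):
        raise ValueError("The new order contains duplicates.")
    if not set(new_order).issubset(current):
        raise ValueError("The new order is not a subset of the current categories.")

    # Assumption: unlisted categories keep their relative order and go last.
    remaining = [category for category in current if category not in new_order]
    return series.cat.reorder_categories(list(new_order) + remaining)


colors = pd.Series(["red", "green", "blue", "green"], dtype="category")
releveled = relevel_sketch(colors, ["green"])
```

The values of the series are unchanged; only the category order differs. That order matters wherever the first level is treated specially, for example when the first level of a categorical variable is dropped during dummy encoding.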

stepsel.modeling.prep.parse_model_formula(formula: str)

Parse a model formula into its components.

Parameters:

formula (str) – The model formula to parse.

Returns:

  • left_side_variables (list) – The variables on the left side of the formula.

  • interaction_variables (list) – The interaction variables on the right side of the formula.

  • non_interaction_variables (list) – The non-interaction variables on the right side of the formula.

Raises:

ValueError – If the formula does not contain exactly one ‘~’ character.
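The splitting described above can be approximated with simple string operations. The sketch below is an illustration under the assumption that terms are separated by '+' and that any term containing '*' is an interaction; parse_formula_sketch is a hypothetical name and the real parser may handle more cases.

```python
def parse_formula_sketch(formula):
    """Hypothetical sketch: split "y ~ a + b + a * b" into its components."""
    if formula.count("~") != 1:
        raise ValueError("The formula must contain exactly one '~' character.")

    left, right = formula.split("~")
    left_side_variables = [term.strip() for term in left.split("+") if term.strip()]
    right_terms = [term.strip() for term in right.split("+") if term.strip()]

    # Assumption: a term is an interaction if and only if it contains '*'.
    interaction_variables = [term for term in right_terms if "*" in term]
    non_interaction_variables = [term for term in right_terms if "*" not in term]
    return left_side_variables, interaction_variables, non_interaction_variables


parsed = parse_formula_sketch("y ~ a + b + a * b")
```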

Examples

>>> parse_model_formula("y ~ a + b + a * b")
(["y"], ["a * b"], ["a", "b"])

stepsel.modeling.prep.recognize_variable_types(data: pandas.DataFrame, interaction_variables: list, non_interaction_variables: list)

Recognize the types of the variables.

Parameters:
  • data (pd.DataFrame) – The data to recognize the variable types from.

  • interaction_variables (list) – The interaction variables to recognize the types from.

  • non_interaction_variables (list) – The non-interaction variables to recognize the types from.

Returns:

dictionary – A dictionary containing the variable types, with the following keys:

  • interaction_numerical_variables (list) – The numerical variables in the interaction variables.

  • interaction_categorical_variables (list) – The categorical variables in the interaction variables.

  • non_interaction_numerical_variables (list) – The numerical variables in the non-interaction variables.

  • non_interaction_categorical_variables (list) – The categorical variables in the non-interaction variables.

  • interaction_variables (list) – The interaction variables.

Return type:

dict

Raises:

ValueError – If an interaction variable is neither numerical nor categorical. If a non-interaction variable is neither numerical nor categorical.
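One way such type recognition could work is by inspecting pandas dtypes. The sketch below is an assumption about the approach, not the library's code: the helper names are hypothetical, "categorical" is taken to mean the pandas category dtype, and the demo DataFrame is made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def split_by_dtype(data, names):
    """Hypothetical helper: split column names into numerical and categorical."""
    numerical, categorical = [], []
    for name in names:
        dtype = data[name].dtype
        if isinstance(dtype, pd.CategoricalDtype):
            categorical.append(name)
        elif is_numeric_dtype(dtype):
            numerical.append(name)
        else:
            raise ValueError(f"Variable {name!r} is neither numerical nor categorical.")
    return numerical, categorical


def recognize_types_sketch(data, interaction_variables, non_interaction_variables):
    """Hypothetical sketch returning the documented dictionary keys."""
    # Collect the distinct variable names used in terms such as "a * b".
    names_in_interactions = []
    for term in interaction_variables:
        for name in (part.strip() for part in term.split("*")):
            if name not in names_in_interactions:
                names_in_interactions.append(name)

    inter_num, inter_cat = split_by_dtype(data, names_in_interactions)
    main_num, main_cat = split_by_dtype(data, non_interaction_variables)
    return {
        "interaction_numerical_variables": inter_num,
        "interaction_categorical_variables": inter_cat,
        "non_interaction_numerical_variables": main_num,
        "non_interaction_categorical_variables": main_cat,
        "interaction_variables": list(interaction_variables),
    }


# Made-up demo data: "a" is numerical, "b" and "c" are categorical.
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": pd.Series(["x", "y", "x"], dtype="category"),
    "c": pd.Series(["u", "u", "v"], dtype="category"),
})
types = recognize_types_sketch(demo, ["a * b"], ["a", "b", "c"])
```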

Examples

>>> recognize_variable_types(data, ["a * b"], ["a", "b", "c"])
{"non_interaction_numerical_variables": ["a"],
 "non_interaction_categorical_variables": ["b", "c"],
 "interaction_numerical_variables": ["a"],
 "interaction_categorical_variables": ["b"],
 "interaction_variables": ["a * b"]}

stepsel.modeling.prep.interaction_categorical_numerical(series1: pandas.Series, series2: pandas.Series)

Create an interaction term between a categorical and a numerical variable.

Parameters:
  • series1 (pandas.Series) – The first series.

  • series2 (pandas.Series) – The second series.

Returns:

interaction_df – A DataFrame with the interaction terms.

Return type:

pandas.DataFrame

Raises:

ValueError – If the two series do not consist of exactly one categorical series and exactly one numerical series.

Notes

The function will create dummy variables for the categorical variable and multiply them by the numerical variable. The dummy variables will be named “categorical_variable: category * numerical_variable”.
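The dummy-and-multiply step described in this note can be reproduced with pandas alone. The following is a sketch of the idea, not the library's implementation; the function name is hypothetical and it assumes the categorical series is passed first.

```python
import pandas as pd


def categorical_numerical_sketch(categorical, numerical):
    """Hypothetical sketch: one column per category, scaled by the numerical values."""
    # One 0/1 indicator column per category.
    dummies = pd.get_dummies(categorical, dtype=float)
    # Multiply every indicator column row-wise by the numerical variable.
    product = dummies.mul(numerical, axis=0)
    # Apply the naming scheme "categorical_variable: category * numerical_variable".
    product.columns = [
        f"{categorical.name}: {level} * {numerical.name}" for level in dummies.columns
    ]
    return product


group = pd.Series(["A", "B", "A"], name="group").astype("category")
x = pd.Series([1.0, 2.0, 3.0], name="x")
interaction_df = categorical_numerical_sketch(group, x)
```

In each row of the result, only the column belonging to that row's category carries the numerical value; the other columns are zero.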

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from stepsel.modeling.prep import interaction_categorical_numerical
>>> categorical_series = pd.Series(np.random.choice(["A", "B", "C"], size=10), name="categorical").astype("category")
>>> numerical_series = pd.Series(np.random.normal(size=10), name="numerical")
>>> interaction_categorical_numerical(categorical_series, numerical_series)
        categorical: A * numerical  categorical: B * numerical  categorical: C * numerical
0                        -0.626453                    0.417258                    0.619825
1                         0.183643                   -0.720788                   -0.720788
2                         0.835979                   -0.632650                   -0.632650
...

stepsel.modeling.prep.interaction_categorical_categorical(series1: pandas.Series, series2: pandas.Series)

Create an interaction term between two categorical variables.

Parameters:
  • series1 (pandas.Series) – The first series.

  • series2 (pandas.Series) – The second series.

Returns:

interaction – A Series with the interaction terms.

Return type:

pandas.Series

Raises:

ValueError – If series1 is not categorical. If series2 is not categorical.

Notes

The function will create an interaction term between the two categorical variables. The interaction term will be named “categorical_variable1 * categorical_variable2”. Its values will have the form “category1 * category2”.
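The naming and value scheme in this note amounts to a row-wise string concatenation. Here is a pandas-only sketch under that assumption; the function name is hypothetical and the library may build the result differently.

```python
import pandas as pd


def categorical_categorical_sketch(series1, series2):
    """Hypothetical sketch: combine two categorical series into one."""
    # Join the labels of both series row by row, e.g. "A * X".
    combined = series1.astype(str) + " * " + series2.astype(str)
    # Name the result after both inputs and convert it back to a categorical.
    return combined.rename(f"{series1.name} * {series2.name}").astype("category")


c1 = pd.Series(["A", "B", "A"], name="c1").astype("category")
c2 = pd.Series(["X", "Y", "Y"], name="c2").astype("category")
interaction = categorical_categorical_sketch(c1, c2)
```

Only combinations that actually occur in the data become categories in this sketch, so a pair such as "B * X" is absent here.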

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from stepsel.modeling.prep import interaction_categorical_categorical
>>> categorical_series1 = pd.Series(["A", "B", "C"], name="categorical1").astype("category")
>>> categorical_series2 = pd.Series(["X", "Y", "Z"], name="categorical2").astype("category")
>>> interaction_categorical_categorical(categorical_series1, categorical_series2)
0    A * X
1    B * Y
2    C * Z
Name: categorical1 * categorical2, dtype: category
Categories (3, object): ['A * X', 'B * Y', 'C * Z']

stepsel.modeling.prep.interaction_numerical_numerical(series1: pandas.Series, series2: pandas.Series)

Create an interaction term between two numerical variables.

Parameters:
  • series1 (pandas.Series) – The first series.

  • series2 (pandas.Series) – The second series.

Returns:

interaction – A Series with the interaction terms.

Return type:

pandas.Series

Raises:

ValueError – If series1 is not numerical. If series2 is not numerical.

Notes

The function will create an interaction term between the two numerical variables. The interaction term will be named “numerical_variable1 * numerical_variable2”.
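For two numerical variables, the interaction is presumably the element-wise product, named as described above. A minimal pandas sketch with a hypothetical function name:

```python
import pandas as pd


def numerical_numerical_sketch(series1, series2):
    """Hypothetical sketch: element-wise product named "name1 * name2"."""
    return (series1 * series2).rename(f"{series1.name} * {series2.name}")


n1 = pd.Series([1.0, 2.0], name="n1")
n2 = pd.Series([3.0, 4.0], name="n2")
product = numerical_numerical_sketch(n1, n2)
```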

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from stepsel.modeling.prep import interaction_numerical_numerical
>>> numerical_series1 = pd.Series(np.random.normal(size=10), name="numerical1")
>>> numerical_series2 = pd.Series(np.random.normal(size=10), name="numerical2")
>>> interaction_numerical_numerical(numerical_series1, numerical_series2)
0    -0.626453
1    -0.720788
2    -0.632650
...
Name: numerical1 * numerical2, dtype: float64

stepsel.modeling.prep.prepare_model_matrix(formula: str, data: pandas.DataFrame, intercept: bool = True, drop_first: bool = True, omit_left_side_variables: bool = False)

Prepare a model matrix based on a formula and a data set.

TODO: If intercept is False, keep all the levels of the first categorical variable.

Parameters:
  • formula (str) – The formula for the model.

  • data (pandas.DataFrame) – The data set.

  • intercept (bool, optional) – Whether to include an intercept in the model matrix. Default is True.

  • drop_first (bool, optional) – Whether to drop the first level of each categorical variable. Default is True.

  • omit_left_side_variables (bool, optional) – Whether to omit the left side variables from the output. Default is False. If True, the function will return only the model matrix and the feature IDs.

Returns:

  • y (pandas.Series) – The response variable. Not returned if omit_left_side_variables is True.

  • model_matrix (pandas.DataFrame) – The model matrix.

  • feature_ids (list) – The feature IDs.

Raises:

ValueError – If an interaction type is not supported.

Notes

The function will create a model matrix based on the formula and the data set. Categories will be dummy-encoded. Interaction terms will be created and dummy-encoded if necessary. The feature IDs will be a list of strings of the variable names corresponding to the columns of the model matrix.
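To make the dummy-encoding step concrete, here is a pandas-only sketch that builds a model matrix for main effects. It is an illustration under several assumptions, not the library's implementation: the function name is hypothetical, interaction terms are left out, and the column names ("Intercept", "x2_B") are pandas defaults or choices made here, not necessarily the feature IDs that stepsel produces.

```python
import pandas as pd


def model_matrix_sketch(data, numerical, categorical, intercept=True, drop_first=True):
    """Hypothetical sketch: intercept + numerical columns + dummy-encoded categoricals."""
    parts = []
    if intercept:
        # Constant column of ones; the name "Intercept" is an assumption.
        parts.append(pd.Series(1.0, index=data.index, name="Intercept"))
    if numerical:
        parts.append(data[numerical].astype(float))
    if categorical:
        # drop_first=True omits the first level of each categorical variable,
        # which then acts as the reference level.
        parts.append(pd.get_dummies(data[categorical], drop_first=drop_first, dtype=float))
    return pd.concat(parts, axis=1)


demo = pd.DataFrame({
    "x1": [0.5, 1.5, 2.5],
    "x2": pd.Series(["A", "B", "C"], dtype="category"),
})
matrix = model_matrix_sketch(demo, numerical=["x1"], categorical=["x2"])
```

With an intercept in the matrix, keeping every level of a categorical variable would make the dummy columns sum to the intercept column, which is why one level is usually dropped.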

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from stepsel.modeling.prep import prepare_model_matrix
>>> data = pd.DataFrame({"y": np.random.normal(size=100),
...                      "x1": np.random.normal(size=100),
...                      "x2": np.random.choice(["A", "B", "C"], size=100),
...                      "x3": np.random.choice(["A", "B", "C"], size=100)})
>>> data[["x2", "x3"]] = data[["x2", "x3"]].astype("category")
>>> y, model_matrix, feature_ids = prepare_model_matrix("y ~ x1 + x2 + x3 + x1*x2 + x1*x3", data)