stepsel.modeling.prep
=====================

.. py:module:: stepsel.modeling.prep

.. autoapi-nested-parse::

   The :mod:`stepsel.modeling.prep` module includes functions for data preparation.


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/stepsel/modeling/prep/helper/index
   /autoapi/stepsel/modeling/prep/interaction/index
   /autoapi/stepsel/modeling/prep/model_matrix/index


Functions
---------

.. autoapisummary::

   stepsel.modeling.prep.get_interaction_type
   stepsel.modeling.prep.relevel_categorical_variable
   stepsel.modeling.prep.parse_model_formula
   stepsel.modeling.prep.recognize_variable_types
   stepsel.modeling.prep.interaction_categorical_numerical
   stepsel.modeling.prep.interaction_categorical_categorical
   stepsel.modeling.prep.interaction_numerical_numerical
   stepsel.modeling.prep.prepare_model_matrix


Package Contents
----------------

.. py:function:: get_interaction_type(interaction: str, interaction_numerical_variables: list, interaction_categorical_variables: list)

   Get the type of an interaction.

   :param interaction: The interaction to get the type of.
   :type interaction: str
   :param interaction_numerical_variables: The numerical variables that are used in the interactions.
   :type interaction_numerical_variables: list
   :param interaction_categorical_variables: The categorical variables that are used in the interactions.
   :type interaction_categorical_variables: list

   :returns: **interaction_type** -- The interaction type. One of "numerical_numerical", "categorical_categorical", "numerical_categorical", or "categorical_numerical".
   :rtype: str

   :raises ValueError: If the interaction does not contain exactly one '*' character.
       If an interaction variable is not in exactly one of the two lists.

   .. rubric:: Examples

   >>> get_interaction_type("a * b", ["a"], ["b"])
   "numerical_categorical"


.. py:function:: relevel_categorical_variable(series: pandas.Series, new_order: list)

   Relevel a categorical variable.

   :param series: The categorical variable to relevel.
   :type series: pd.Series
   :param new_order: The new order of the categories.
   :type new_order: list

   :returns: **series** -- The releveled categorical variable.
   :rtype: pd.Series

   :raises ValueError: If the new order is not a subset of the current categories.
       If the new order contains duplicates.


.. py:function:: parse_model_formula(formula: str)

   Parse a model formula into its components.

   :param formula: The model formula to parse.
   :type formula: str

   :returns: * **left_side_variables** (*list*) -- The variables on the left side of the formula.
             * **interaction_variables** (*list*) -- The interaction variables on the right side of the formula.
             * **non_interaction_variables** (*list*) -- The non-interaction variables on the right side of the formula.

   :raises ValueError: If the formula does not contain exactly one '~' character.

   .. rubric:: Examples

   >>> parse_model_formula("y ~ a + b + a * b")
   (["y"], ["a * b"], ["a", "b"])


.. py:function:: recognize_variable_types(data: pandas.DataFrame, interaction_variables: list, non_interaction_variables: list)

   Recognize the types of the variables.

   :param data: The data to recognize the variable types from.
   :type data: pd.DataFrame
   :param interaction_variables: The interaction variables whose types are to be recognized.
   :type interaction_variables: list
   :param non_interaction_variables: The non-interaction variables whose types are to be recognized.
   :type non_interaction_variables: list

   :returns: **dictionary** -- A dictionary containing the variable types, with the following keys:

             * **interaction_numerical_variables** (*list*) -- The numerical variables in the interaction variables.
             * **interaction_categorical_variables** (*list*) -- The categorical variables in the interaction variables.
             * **non_interaction_numerical_variables** (*list*) -- The numerical variables in the non-interaction variables.
             * **non_interaction_categorical_variables** (*list*) -- The categorical variables in the non-interaction variables.
             * **interaction_variables** (*list*) -- The interaction variables.
   :rtype: dict

   :raises ValueError: If an interaction variable is neither numerical nor categorical.
       If a non-interaction variable is neither numerical nor categorical.

   .. rubric:: Examples

   >>> recognize_variable_types(data, ["a * b"], ["a", "b", "c"])
   {"non_interaction_numerical_variables": ["a"],
    "non_interaction_categorical_variables": ["b", "c"],
    "interaction_numerical_variables": ["a"],
    "interaction_categorical_variables": ["b"],
    "interaction_variables": ["a * b"]}


.. py:function:: interaction_categorical_numerical(series1: pandas.Series, series2: pandas.Series)

   Create an interaction term between a categorical and a numerical variable.

   :param series1: The first series.
   :type series1: pandas.Series
   :param series2: The second series.
   :type series2: pandas.Series

   :returns: **interaction_df** -- A DataFrame with the interaction terms.
   :rtype: pandas.DataFrame

   :raises ValueError: If the two series are not exactly one categorical and one numerical variable.

   .. rubric:: Notes

   The function will create dummy variables for the categorical variable and multiply them by the numerical variable.
   The resulting columns will be named "categorical_variable: category * numerical_variable".

   .. rubric:: Examples

   >>> import pandas as pd
   >>> import numpy as np
   >>> from stepsel.modeling.prep import interaction_categorical_numerical
   >>> categorical_series = pd.Series(np.random.choice(["A", "B", "C"], size=10), name="categorical").astype("category")
   >>> numerical_series = pd.Series(np.random.normal(size=10), name="numerical")
   >>> interaction_categorical_numerical(categorical_series, numerical_series)
      categorical: A * numerical  categorical: B * numerical  categorical: C * numerical
   0                   -0.626453                    0.417258                    0.619825
   1                    0.183643                   -0.720788                   -0.720788
   2                    0.835979                   -0.632650                   -0.632650
   ...


.. py:function:: interaction_categorical_categorical(series1: pandas.Series, series2: pandas.Series)

   Create an interaction term between two categorical variables.

   :param series1: The first series.
   :type series1: pandas.Series
   :param series2: The second series.
   :type series2: pandas.Series

   :returns: **interaction** -- A Series with the interaction terms.
   :rtype: pandas.Series

   :raises ValueError: If series1 is not categorical.
       If series2 is not categorical.

   .. rubric:: Notes

   The function will create an interaction term between the two categorical variables.
   The interaction term will be named "categorical_variable1 * categorical_variable2".
   Its values will have the form "category1 * category2".

   .. rubric:: Examples

   >>> import pandas as pd
   >>> import numpy as np
   >>> from stepsel.modeling.prep import interaction_categorical_categorical
   >>> categorical_series1 = pd.Series(["A", "B", "C"], name="categorical1").astype("category")
   >>> categorical_series2 = pd.Series(["X", "Y", "Z"], name="categorical2").astype("category")
   >>> interaction_categorical_categorical(categorical_series1, categorical_series2)
   0    A * X
   1    B * Y
   2    C * Z
   Name: categorical1 * categorical2, dtype: category
   Categories (3, object): ['A * X', 'B * Y', 'C * Z']


.. py:function:: interaction_numerical_numerical(series1: pandas.Series, series2: pandas.Series)

   Create an interaction term between two numerical variables.

   :param series1: The first series.
   :type series1: pandas.Series
   :param series2: The second series.
   :type series2: pandas.Series

   :returns: **interaction** -- A Series with the interaction terms.
   :rtype: pandas.Series

   :raises ValueError: If series1 is not numerical.
       If series2 is not numerical.

   .. rubric:: Notes

   The function will create an interaction term between the two numerical variables.
   The interaction term will be named "numerical_variable1 * numerical_variable2".

   .. rubric:: Examples

   >>> import pandas as pd
   >>> import numpy as np
   >>> from stepsel.modeling.prep import interaction_numerical_numerical
   >>> numerical_series1 = pd.Series(np.random.normal(size=10), name="numerical1")
   >>> numerical_series2 = pd.Series(np.random.normal(size=10), name="numerical2")
   >>> interaction_numerical_numerical(numerical_series1, numerical_series2)
   0   -0.626453
   1   -0.720788
   2   -0.632650
   ...
   Name: numerical1 * numerical2, dtype: float64


.. py:function:: prepare_model_matrix(formula: str, data: pandas.DataFrame, intercept: bool = True, drop_first: bool = True, omit_left_side_variables: bool = False)

   Prepare a model matrix based on a formula and a data set.

   TODO: If intercept = False, keep all the levels of the first categorical variable.

   :param formula: The formula for the model.
   :type formula: str
   :param data: The data set.
   :type data: pandas.DataFrame
   :param intercept: Whether to include an intercept in the model matrix. Default is True.
   :type intercept: bool, optional
   :param drop_first: Whether to drop the first level of each categorical variable. Default is True.
   :type drop_first: bool, optional
   :param omit_left_side_variables: Whether to omit the left side variables from the output. Default is False.
                                    If True, the function will return only the model matrix and the feature IDs.
   :type omit_left_side_variables: bool, optional

   :returns: * **y** (*pandas.Series*) -- The response variable. Not returned if omit_left_side_variables is True.
             * **model_matrix** (*pandas.DataFrame*) -- The model matrix.
             * **feature_ids** (*list*) -- The feature IDs.

   :raises ValueError: If the interaction type is not supported.

   .. rubric:: Notes

   The function will create a model matrix based on the formula and the data set.
   Categorical variables will be dummy-encoded.
   Interaction terms will be created and dummy-encoded if necessary.
   The feature IDs will be a list of strings giving the variable name that corresponds to each column of the model matrix.

   .. rubric:: Examples

   >>> import pandas as pd
   >>> import numpy as np
   >>> from stepsel.modeling.prep import prepare_model_matrix
   >>> data = pd.DataFrame({"y": np.random.normal(size=100),
   ...                      "x1": np.random.normal(size=100),
   ...                      "x2": np.random.choice(["A", "B", "C"], size=100),
   ...                      "x3": np.random.choice(["A", "B", "C"], size=100)})
   >>> data[["x2", "x3"]] = data[["x2", "x3"]].astype("category")
   >>> y, model_matrix, feature_ids = prepare_model_matrix("y ~ x1 + x2 + x3 + x1*x2 + x1*x3", data)
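

The three interaction encodings described in the Notes sections on this page can be illustrated with plain pandas. The block below is a minimal sketch under stated assumptions, not the stepsel implementation: the ``sketch_*`` helper names and the sample series are hypothetical, and only the naming conventions quoted in the Notes are taken from this page. The actual functions may differ in validation, dtypes, and level ordering.

```python
import pandas as pd


def sketch_categorical_numerical(categorical: pd.Series, numerical: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: dummy-encode the categorical series and scale each dummy by the numerical series."""
    dummies = pd.get_dummies(categorical, dtype=float)
    interaction = dummies.mul(numerical, axis=0)  # row-wise product, aligned on the index
    # Column names follow the "categorical_variable: category * numerical_variable" convention.
    interaction.columns = [
        f"{categorical.name}: {level} * {numerical.name}" for level in dummies.columns
    ]
    return interaction


def sketch_categorical_categorical(first: pd.Series, second: pd.Series) -> pd.Series:
    """Hypothetical helper: combine two categorical series into "category1 * category2" labels."""
    combined = first.astype(str) + " * " + second.astype(str)
    return combined.rename(f"{first.name} * {second.name}").astype("category")


def sketch_numerical_numerical(first: pd.Series, second: pd.Series) -> pd.Series:
    """Hypothetical helper: element-wise product of two numerical series."""
    return (first * second).rename(f"{first.name} * {second.name}")


# Small deterministic inputs (assumed for illustration only).
group = pd.Series(["A", "B", "A"], name="group").astype("category")
site = pd.Series(["X", "Y", "X"], name="site").astype("category")
dose = pd.Series([1.0, 2.0, 3.0], name="dose")
age = pd.Series([10.0, 20.0, 30.0], name="age")

cat_num = sketch_categorical_numerical(group, dose)
cat_cat = sketch_categorical_categorical(group, site)
num_num = sketch_numerical_numerical(dose, age)
```

In this sketch each row of ``cat_num`` is non-zero only in the column for that row's category, which is what multiplying dummy variables by a numerical variable implies.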