stepsel.binning.optimal

Classes

OptimalBinningUsingDecisionTreeRegressor

Class for optimal binning of one variable.

Module Contents

class stepsel.binning.optimal.OptimalBinningUsingDecisionTreeRegressor(criterion: Literal['squared_error', 'friedman_mse', 'absolute_error', 'poisson'] = 'squared_error', splitter: Literal['best', 'random'] = 'best', max_depth: int | None = None, min_samples_split: float | int = 2, min_samples_leaf: float | int = 1, min_weight_fraction_leaf: float = 0, max_features: float | int | Literal['auto', 'sqrt', 'log2'] | None = None, max_leaf_nodes: int | None = None, min_impurity_decrease: float = 0, n_splits: int = 10, scoring: numpy.typing.ArrayLike | tuple | Mapping | None = 'neg_mean_squared_error', n_jobs: int | None = None, refit: str | bool = 'neg_mean_squared_error', verbose_cv: int = 0, return_train_score: bool = False, n_clusters: int = 100, max_grid_length: int = 100, n_best_clusters_for_extensive_search: int = 5, max_cycles_of_extensive_cluster_search: int = 5, verbose_flow: bool = True)[source]

Class for optimal binning of one variable.

Steps:

1. All cp values are found using DecisionTreeRegressor.
2. If the number of cp values is higher than max_grid_length, KMeans is used to reduce the number of cp values.
   - CV is performed for n_clusters of cp values.
   - The best n_best_clusters_for_extensive_search are selected for further CV of cp values in that cluster.
   - If the number of cp values is still higher than max_grid_length, the process is repeated.
3. CV is performed for all cp values.
4. The best cp value is selected.
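
The steps above can be sketched with plain scikit-learn. This is a simplified illustration, not the class's actual code: it assumes that "cp values" correspond to scikit-learn's ccp_alpha (cost-complexity pruning) values, it runs a single KMeans reduction instead of the repeated search within the best clusters, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-variable data with a single true break at x = 5.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 1))
y = np.where(X[:, 0] < 5, 1.0, 3.0) + rng.normal(0, 0.3, size=400)

# Step 1: candidate cp values (assumed here to be ccp_alpha values).
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
cp_values = np.unique(np.clip(path.ccp_alphas, 0.0, None))

# Step 2: if the grid is too long, keep one representative per KMeans cluster.
max_grid_length = 20
n_clusters = 10
if len(cp_values) > max_grid_length:
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    km.fit(cp_values.reshape(-1, 1))
    grid = np.sort(np.clip(km.cluster_centers_.ravel(), 0.0, None))
else:
    grid = cp_values

# Steps 3 and 4: cross-validate the grid and keep the best cp value.
gs = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": grid},
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
gs.fit(X, y)
best_cp = gs.best_params_["ccp_alpha"]
n_bins = gs.best_estimator_.get_n_leaves()  # one bin per leaf
```

Using cluster centers as representatives is one possible choice; picking an actual member of each cluster would work as well.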

fit(X, y, w=None)[source]: Perform optimal binning.

set_feature_names(feature_names)[source]: Set feature names of the input data X. It is used for outputs like plot of the tree and cut points. Feature names must be set after each fit, because the tree is reinitialized.

plot_tree(figsize: tuple = (25, 20), feature_names: ArrayLike | None = None, filled: bool = True, rounded: bool = False, precision: int = 3, fontsize: int = 14)

Plot the final tree.

predict(X)[source]: Predict regression target for X.

bin_values(X)[source]: Bin X into intervals using fitted optimal binning.

TODO:

- cut_points are collected before feature_names are set, so the cut_points dictionary does not contain the feature names as keys.

class Logging[source]

Nested class responsible for logging intermediate results.

fit_log_template

fit_cycle_log_template

log_init()[source]: Logs initialization.

log_cycle(cycle, cp_values_cycle, cv_cp_paths_kmeans_clusters, cp_values_cycle_reduced, cv_results)[source]

Log one cycle of cp values reduction with KMeans.

Params:

cycle: cycle number
cp_values_cycle: all the cp values entering the cycle
cv_cp_paths_kmeans_clusters: assigned cluster for cp values
cp_values_cycle_reduced: reduced cp values based on KMeans for which CV is performed
cv_results: CV results (output of GridSearchCV.cv_results_)

log_final(cp_values, cv_fit, in_cycle)[source]

Log final fit state.

Params:

cp_values: all cp values of the experiment
cv_fit: CV fit object (GridSearchCV)
in_cycle [True/False]: whether KMeans reduction was performed

log_cycle_final(cycle, cp_values_cycle, cv_cp_paths_kmeans_clusters, cp_values_cycle_reduced, cv_results, cv_fit, cp_values, in_cycle=True)[source]

Combines log_cycle() and log_final() to save the final log from inside the KMeans loop.

Params:

cycle: cycle number
cp_values_cycle: all the cp values entering the cycle
cv_cp_paths_kmeans_clusters: assigned cluster for cp values
cp_values_cycle_reduced: reduced cp values based on KMeans for which CV is performed
cv_results: CV results (output of GridSearchCV.cv_results_)
cv_fit: CV fit object (GridSearchCV)
cp_values: all cp values of the experiment
in_cycle [True/False]: whether KMeans reduction was performed

criterion

splitter

max_depth

min_samples_split

min_samples_leaf

min_weight_fraction_leaf

max_features

max_leaf_nodes

min_impurity_decrease

n_splits

scoring

refit

n_jobs

return_train_score

verbose_cv

n_clusters

max_grid_length

n_best_clusters_for_extensive_search

max_cycles_of_extensive_cluster_search

verbose_flow

cv_folds

tree_model

feature_names = None

Log

cut_points = None

final_model = None

_check_feature_names(X)[source]

Check if the feature names are present in the input data.

Parameters: X (pandas.DataFrame) – The input data.
Returns: Saves the feature names in the feature_names attribute.
Return type: None

set_feature_names(feature_names)[source]

Set the feature names. This method is useful when the input data X is a numpy array.

Parameters: feature_names (list) – The feature names.
Returns: Saves the feature names in the feature_names attribute.
Return type: None

static _calculate_available_cp_params(tree_model: sklearn.tree.DecisionTreeRegressor, cv_folds: sklearn.model_selection.KFold, X, y, w)[source]

Collects the available cp parameters from the cross-validation folds.

Parameters:

tree_model (DecisionTreeRegressor) – The decision tree regressor model.
cv_folds (KFold) – The cross-validation folds.
X (numpy.ndarray) – The input data.
y (numpy.ndarray) – The target data.
w (numpy.ndarray) – The sample weights.

Returns:

cv_cp_paths – The available cp parameters.

Return type:

numpy.ndarray
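
A minimal sketch of how candidate cp values could be pooled across folds, assuming they are the ccp_alpha values returned by cost_complexity_pruning_path on each training fold. The helper name collect_cp_values is hypothetical and is not part of the library.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor


def collect_cp_values(tree_model, cv_folds, X, y, w=None):
    """Hypothetical helper: pool ccp_alpha candidates over the CV training folds."""
    alphas = []
    for train_idx, _ in cv_folds.split(X):
        w_train = None if w is None else w[train_idx]
        # The pruning path is computed on a clone, so tree_model is not modified.
        path = tree_model.cost_complexity_pruning_path(
            X[train_idx], y[train_idx], sample_weight=w_train
        )
        alphas.append(path.ccp_alphas)
    pooled = np.concatenate(alphas)
    # Clip tiny negative values caused by floating point error, then deduplicate.
    return np.unique(np.clip(pooled, 0.0, None))


rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)
cv_cp_paths = collect_cp_values(
    DecisionTreeRegressor(random_state=0), KFold(n_splits=5), X, y
)
```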

static _kmeans_cv_cp_paths_reduction(cv_cp_paths: numpy.typing.ArrayLike, n_clusters: int)[source]

Reduces the number of cp parameters for grid search using KMeans.

Parameters:

cv_cp_paths (numpy.ndarray) – The available cp parameters to reduce.
n_clusters (int) – The number of clusters to use for the KMeans algorithm (maximum number of cp parameters to search at once).

Returns:

cv_cp_paths_reduced (numpy.ndarray) – The reduced cp parameters.
cv_cp_paths_kmeans_clusters (numpy.ndarray) – The cluster labels for each cp parameter.
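
The reduction can be illustrated with a one-dimensional KMeans fit. This sketch assumes that each cluster center serves as the reduced cp value; the library may choose representatives differently. The helper name reduce_cp_values is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def reduce_cp_values(cp_values, n_clusters, random_state=0):
    """Hypothetical helper: cluster 1-D cp values, return centers and labels."""
    cp_values = np.asarray(cp_values, dtype=float)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    # KMeans expects a 2-D array, so the values become a single column.
    labels = km.fit_predict(cp_values.reshape(-1, 1))
    reduced = km.cluster_centers_.ravel()
    return reduced, labels


cp = np.array([0.0, 0.001, 0.002, 0.1, 0.11, 0.5])
reduced, labels = reduce_cp_values(cp, n_clusters=3)
```

The labels array has one entry per input cp value, which is what allows the members of the best clusters to be recovered later.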

_x_validation(cv_cp_paths: numpy.typing.ArrayLike, X, y, sample_weight=None)[source]

Performs cross-validation for the given cp parameters.

Parameters:

cv_cp_paths (numpy.ndarray) – The cp parameters to use for cross-validation.
X (numpy.ndarray) – The input data.
y (numpy.ndarray) – The target data.
sample_weight (numpy.ndarray, optional) – The sample weights.

Returns:

gs – The grid search object.

Return type:

GridSearchCV
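
A hedged sketch of the cross-validation step, assuming the grid is searched over the ccp_alpha parameter of DecisionTreeRegressor with a KFold splitter. The function name cross_validate_cp and its defaults are illustrative only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor


def cross_validate_cp(cp_values, X, y, sample_weight=None, n_splits=5):
    """Hypothetical helper: grid search over ccp_alpha candidates."""
    gs = GridSearchCV(
        DecisionTreeRegressor(random_state=0),
        param_grid={"ccp_alpha": list(cp_values)},
        scoring="neg_mean_squared_error",
        cv=KFold(n_splits=n_splits, shuffle=True, random_state=0),
        refit=True,
    )
    # Sample weights, if given, are forwarded to the tree's fit method per fold.
    fit_params = {} if sample_weight is None else {"sample_weight": sample_weight}
    gs.fit(X, y, **fit_params)
    return gs


rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.where(X[:, 0] < 5, 1.0, 3.0) + rng.normal(0, 0.3, size=300)
w = np.ones(300)
cp_grid = [0.0, 0.01, 0.1, 5.0]
gs = cross_validate_cp(cp_grid, X, y, sample_weight=w)
```

The fitted object exposes cv_results_, best_params_ and best_estimator_, which is the information the later steps rely on.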

static _get_cp_params_for_best_xval_clusters(gs, cv_cp_paths, cv_cp_paths_kmeans_clusters, n_best_clusters_for_extensive_search)[source]

Get cp parameters for the best clusters from cross-validation.

Parameters:

gs (GridSearchCV object) – Fitted GridSearchCV object.
cv_cp_paths (numpy.ndarray) – The available cp parameters.
cv_cp_paths_kmeans_clusters (numpy.ndarray) – The cluster labels for each cp parameter.
n_best_clusters_for_extensive_search (int) – The number of best clusters to use for extensive search.

Returns:

best_cp_params – The cp parameters for the best clusters.

Return type:

numpy.ndarray
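
The selection can be sketched with NumPy alone. This version takes the mean CV scores directly rather than a GridSearchCV object, and assumes that score k belongs to the representative of cluster k and that higher scores are better (as with neg_mean_squared_error). The helper name is hypothetical.

```python
import numpy as np


def cp_values_in_best_clusters(mean_test_score, cp_values, cluster_labels, n_best):
    """Hypothetical helper: return all cp values from the n_best scoring clusters."""
    mean_test_score = np.asarray(mean_test_score)
    cp_values = np.asarray(cp_values)
    cluster_labels = np.asarray(cluster_labels)
    # Cluster ids ordered from best to worst score, truncated to n_best.
    best_clusters = np.argsort(mean_test_score)[::-1][:n_best]
    mask = np.isin(cluster_labels, best_clusters)
    return np.sort(cp_values[mask])


scores = [-0.9, -0.2, -0.5]  # one score per cluster
cp = np.array([0.0, 0.001, 0.002, 0.1, 0.11, 0.5])
labels = np.array([0, 0, 0, 1, 1, 2])
best = cp_values_in_best_clusters(scores, cp, labels, n_best=2)
# best contains the members of clusters 1 and 2: [0.1, 0.11, 0.5]
```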

_collect_results(gs)[source]

Collects the results from the grid search.

Parameters: gs (GridSearchCV object) – Fitted GridSearchCV object.
Returns: Saves results as properties of the class.
Return type: None

plot_tree(figsize: tuple = (25, 20), filled: bool = True, rounded: bool = False, precision: int = 3, fontsize: int = 14)[source]

Plot the decision tree.

Parameters:

figsize (tuple of int, default=(25,20)) – The size of the figure to create in matplotlib.
feature_names (list of str, default=None) – The names of the features.
filled (bool, default=True) – When set to True, paint nodes to indicate majority class for classification, extremity of values for regression, or purity of node for multi-output.
rounded (bool, default=False) – When set to True, draw node boxes with rounded corners and use Helvetica fonts instead of Times-Roman.
precision (int, default=3) – The precision for displaying split thresholds and other float values.
fontsize (int, default=14) – The fontsize for node labels.

Return type:

matplotlib figure

predict(X: numpy.typing.ArrayLike) → numpy.ndarray[source]

Predict regression value for X.

Parameters: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.
Returns: y – The predicted values.
Return type: array-like of shape (n_samples,) or (n_samples, n_outputs)

bin_values(data: numpy.typing.ArrayLike) → numpy.ndarray[source]

Bin the values of data into intervals.

Parameters: data (array-like) – The input values to be binned.
Returns: binned_values – The binned values.
Return type: array-like
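
As an illustration of binning by cut points, the sketch below maps values to integer bin indices with numpy.digitize. It assumes right-closed intervals, matching the "x <= threshold goes left" rule of scikit-learn trees; the exact output format of bin_values may differ, and the helper name is hypothetical.

```python
import numpy as np


def bin_with_cut_points(values, cut_points):
    """Hypothetical helper: index of the right-closed interval each value falls in."""
    values = np.asarray(values, dtype=float)
    edges = np.sort(np.asarray(cut_points, dtype=float))
    # right=True gives edges[i-1] < x <= edges[i], i.e. intervals of the form (a, b].
    return np.digitize(values, edges, right=True)


bins = bin_with_cut_points([1, 2.5, 3, 7.0, 9], cut_points=[2.5, 7.0])
# bins -> [0, 0, 1, 1, 2]
```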

fit(X, y, w=None)[source]

Fit the model and get the optimal binning.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csc_matrix.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – The target values (real numbers). Use dtype=np.float64 and order='C' for maximum efficiency.
w (array-like of shape (n_samples,), default=None) – Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node.

Returns:

self – Fitted estimator.

Return type:

OptimalBinningUsingDecisionTreeRegressor
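
To show what the resulting binning looks like, the sketch below reads cut points from a pruned scikit-learn tree. It assumes the cut points are the split thresholds of the final tree and uses a fixed ccp_alpha instead of the cross-validated one; it does not call this class.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-variable data with a single true break at x = 5.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.where(X[:, 0] < 5, 1.0, 3.0) + rng.normal(0, 0.1, size=500)

# ccp_alpha is fixed here for illustration; fit() selects it by cross-validation.
tree = DecisionTreeRegressor(ccp_alpha=0.05, random_state=0).fit(X, y)

# Internal nodes have a non-negative feature index; leaves are marked with -2.
is_split = tree.tree_.feature >= 0
cut_points = np.sort(tree.tree_.threshold[is_split])
```

With this data the pruned tree keeps a single split close to 5, so the variable is binned into two intervals.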