stepsel.binning.optimal
Classes
Class for optimal binning of one variable.
Module Contents
- class stepsel.binning.optimal.OptimalBinningUsingDecisionTreeRegressor(criterion: Literal['squared_error', 'friedman_mse', 'absolute_error', 'poisson'] = 'squared_error', splitter: Literal['best', 'random'] = 'best', max_depth: int | None = None, min_samples_split: float | int = 2, min_samples_leaf: float | int = 1, min_weight_fraction_leaf: float = 0, max_features: float | int | Literal['auto', 'sqrt', 'log2'] | None = None, max_leaf_nodes: int | None = None, min_impurity_decrease: float = 0, n_splits: int = 10, scoring: numpy.typing.ArrayLike | tuple | Mapping | None = 'neg_mean_squared_error', n_jobs: int | None = None, refit: str | bool = 'neg_mean_squared_error', verbose_cv: int = 0, return_train_score: bool = False, n_clusters: int = 100, max_grid_length: int = 100, n_best_clusters_for_extensive_search: int = 5, max_cycles_of_extensive_cluster_search: int = 5, verbose_flow: bool = True)[source]
Class for optimal binning of one variable.
Steps:
1. All cp values are found using DecisionTreeRegressor.
2. If the number of cp values is higher than max_grid_length, KMeans is used to reduce the number of cp values:
   a. CV is performed for n_clusters of cp values.
   b. The best n_best_clusters_for_extensive_search clusters are selected, and the cp values in those clusters go on to further CV.
   c. If the number of cp values is still higher than max_grid_length, the process is repeated.
3. CV is performed for all remaining cp values.
4. The best cp value is selected.
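The first and last steps above can be sketched with plain scikit-learn. This is a minimal illustration, not the class's actual implementation. It assumes that "cp" corresponds to scikit-learn's `ccp_alpha` (the cost-complexity pruning parameter), which the source does not state explicitly. The synthetic data and all variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

# Hypothetical one-variable data set with a single step in the target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.where(X[:, 0] < 5, 1.0, 3.0) + rng.normal(scale=0.3, size=200)

# Assumption: "cp values" are the ccp_alpha values of the pruning path.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
# Clip guards against tiny negative alphas caused by floating-point error.
cp_values = np.unique(np.clip(path.ccp_alphas, 0.0, None))

# Cross-validate every candidate and keep the best one.
gs = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": cp_values},
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
gs.fit(X, y)
best_cp = gs.best_params_["ccp_alpha"]
```

The KMeans reduction described in the steps only matters when the candidate grid is longer than `max_grid_length`; on a small data set like this one, a direct grid search is cheap enough.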
- set_feature_names(feature_names)[source]
Set feature names of the input data X. It is used for outputs like plot of the tree and cut points. Feature names must be set after each fit, because the tree is reinitialized.
- plot_tree(figsize: tuple = (25, 20), feature_names: ArrayLike | None = None, filled: bool = True, rounded: bool = False, precision: int = 3, fontsize: int = 14)
Plot the final tree.
- TODO: cut_points are collected before feature_names are set, so the dictionary does not contain the feature names as keys.
- class Logging[source]
Nested class responsible for logging intermediate results.
- fit_log_template
- fit_cycle_log_template
- log_cycle(cycle, cp_values_cycle, cv_cp_paths_kmeans_clusters, cp_values_cycle_reduced, cv_results)[source]
Log one cycle of cp values reduction with KMeans.
Params:
- cycle
cycle number
- cp_values_cycle
all the cp values entering the cycle
- cv_cp_paths_kmeans_clusters
assigned cluster for cp values
- cp_values_cycle_reduced
reduced cp values based on KMeans for which CV is performed
- cv_results
CV results (output of GridSearchCV.cv_results_)
- log_final(cp_values, cv_fit, in_cycle)[source]
Log final fit state.
Params:
- cp_values
all cp values of the experiment
- cv_fit
CV fit object (GridSearchCV)
- in_cycle [True/False]
indicator whether KMeans reduction was performed
- log_cycle_final(cycle, cp_values_cycle, cv_cp_paths_kmeans_clusters, cp_values_cycle_reduced, cv_results, cv_fit, cp_values, in_cycle=True)[source]
Combines log_cycle() and log_final() to save the final log inside the KMeans loop.
Params:
- cycle
cycle number
- cp_values_cycle
all the cp values entering the cycle
- cv_cp_paths_kmeans_clusters
assigned cluster for cp values
- cp_values_cycle_reduced
reduced cp values based on KMeans for which CV is performed
- cv_results
CV results (output of GridSearchCV.cv_results_)
- cv_fit
CV fit object (GridSearchCV)
- cp_values
all cp values of the experiment
- in_cycle [True/False]
indicator whether KMeans reduction was performed
- criterion
- splitter
- max_depth
- min_samples_split
- min_samples_leaf
- min_weight_fraction_leaf
- max_features
- max_leaf_nodes
- min_impurity_decrease
- n_splits
- scoring
- refit
- n_jobs
- return_train_score
- verbose_cv
- n_clusters
- max_grid_length
- n_best_clusters_for_extensive_search
- max_cycles_of_extensive_cluster_search
- verbose_flow
- cv_folds
- tree_model
- feature_names = None
- Log
- cut_points = None
- final_model = None
- _check_feature_names(X)[source]
Check if the feature names are present in the input data.
- Parameters:
X (pandas.DataFrame) – The input data.
- Returns:
Saves the feature names in the feature_names attribute.
- Return type:
None
- set_feature_names(feature_names)[source]
Set the feature names. This method is useful when the input data X is a numpy array.
- Parameters:
feature_names (list) – The feature names.
- Returns:
Saves the feature names in the feature_names attribute.
- Return type:
None
- static _calculate_available_cp_params(tree_model: sklearn.tree.DecisionTreeRegressor, cv_folds: sklearn.model_selection.KFold, X, y, w)[source]
Collects the available cp parameters from the cross-validation folds.
- Parameters:
tree_model (DecisionTreeRegressor) – The decision tree regressor model.
cv_folds (KFold) – The cross-validation folds.
X (numpy.ndarray) – The input data.
y (numpy.ndarray) – The target data.
w (numpy.ndarray) – The sample weights.
- Returns:
cv_cp_paths – The available cp parameters.
- Return type:
numpy.ndarray
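One plausible way to collect candidate cp values per fold is sketched below. It assumes that cp values are the `ccp_alphas` returned by scikit-learn's `cost_complexity_pruning_path`, computed on each training fold and merged. The function name `collect_cp_values` and the merge-by-unique strategy are assumptions for illustration, not the library's code.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor


def collect_cp_values(tree_model, cv_folds, X, y, w=None):
    """Hypothetical helper: union of pruning-path alphas over all training folds."""
    alphas = []
    for train_idx, _ in cv_folds.split(X):
        w_train = None if w is None else w[train_idx]
        path = tree_model.cost_complexity_pruning_path(
            X[train_idx], y[train_idx], sample_weight=w_train
        )
        alphas.append(path.ccp_alphas)
    # Sorted, de-duplicated, and clipped at zero to absorb rounding noise.
    return np.unique(np.clip(np.concatenate(alphas), 0.0, None))


rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(150, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=150)

folds = KFold(n_splits=5, shuffle=True, random_state=0)
cv_cp_paths = collect_cp_values(DecisionTreeRegressor(random_state=0), folds, X, y)
```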
- static _kmeans_cv_cp_paths_reduction(cv_cp_paths: numpy.typing.ArrayLike, n_clusters: int)[source]
Reduces the number of cp parameters for grid search using KMeans.
- Parameters:
cv_cp_paths (numpy.ndarray) – The available cp parameters to reduce.
n_clusters (int) – The number of clusters to use for the KMeans algorithm (maximum number of cp parameters to search at once).
- Returns:
cv_cp_paths_reduced (numpy.ndarray) – The reduced cp parameters.
cv_cp_paths_kmeans_clusters (numpy.ndarray) – The cluster labels for each cp parameter.
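A reduction of this kind could look like the following sketch. It assumes the reduced cp parameters are the KMeans cluster centres of the one-dimensional cp values. The source does not say which representative is used per cluster, so the function `reduce_cp_values` is a hypothetical stand-in.

```python
import numpy as np
from sklearn.cluster import KMeans


def reduce_cp_values(cv_cp_paths, n_clusters):
    """Hypothetical helper: cluster 1-D cp values and return centres plus labels."""
    values = np.asarray(cv_cp_paths, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(values)
    # reduced[k] is the representative of cluster k, so reduced[labels[i]]
    # is the representative assigned to the i-th original cp value.
    reduced = km.cluster_centers_.ravel()
    return reduced, labels


cp = np.linspace(0.0, 1.0, 50)
cp_reduced, cp_clusters = reduce_cp_values(cp, n_clusters=5)
```

Keeping the centres in label order (rather than sorting them) preserves the mapping between each original cp value and its representative, which the later selection of best clusters relies on.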
- _x_validation(cv_cp_paths: numpy.typing.ArrayLike, X, y, sample_weight=None)[source]
Performs cross-validation for the given cp parameters.
- Parameters:
cv_cp_paths (numpy.ndarray) – The cp parameters to use for cross-validation.
X (numpy.ndarray) – The input data.
y (numpy.ndarray) – The target data.
sample_weight (numpy.ndarray, optional) – The sample weights.
- Returns:
gs – The grid search object.
- Return type:
GridSearchCV
- static _get_cp_params_for_best_xval_clusters(gs, cv_cp_paths, cv_cp_paths_kmeans_clusters, n_best_clusters_for_extensive_search)[source]
Get cp parameters for the best clusters from cross-validation.
- Parameters:
gs (GridSearchCV object) – Fitted GridSearchCV object.
cv_cp_paths (numpy.ndarray) – The available cp parameters.
cv_cp_paths_kmeans_clusters (numpy.ndarray) – The cluster labels for each cp parameter.
n_best_clusters_for_extensive_search (int) – The number of best clusters to use for extensive search.
- Returns:
best_cp_params – The cp parameters for the best clusters.
- Return type:
numpy.ndarray
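The selection step can be illustrated with NumPy alone. The sketch assumes one CV score per cluster, indexed by cluster label, with higher scores being better (as with `neg_mean_squared_error`). How the real method derives those scores from the GridSearchCV object is not shown in the source, so `cluster_scores` and the function name are hypothetical.

```python
import numpy as np


def cp_values_in_best_clusters(cv_cp_paths, clusters, cluster_scores, n_best):
    """Hypothetical helper: keep cp values whose cluster is among the n_best by score."""
    cv_cp_paths = np.asarray(cv_cp_paths)
    clusters = np.asarray(clusters)
    # Cluster labels ordered from best to worst score, truncated to n_best.
    best_clusters = np.argsort(cluster_scores)[::-1][:n_best]
    return cv_cp_paths[np.isin(clusters, best_clusters)]


cp = np.array([0.0, 0.1, 0.2, 1.0, 1.1, 5.0])
clusters = np.array([0, 0, 0, 1, 1, 2])
cluster_scores = np.array([-0.5, -0.2, -0.9])  # one score per cluster label

top_two = cp_values_in_best_clusters(cp, clusters, cluster_scores, n_best=2)
top_one = cp_values_in_best_clusters(cp, clusters, cluster_scores, n_best=1)
```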
- _collect_results(gs)[source]
Collects the results from the grid search.
- Parameters:
gs (GridSearchCV object) – Fitted GridSearchCV object.
- Returns:
Saves results as properties of the class.
- Return type:
None
- plot_tree(figsize: tuple = (25, 20), filled: bool = True, rounded: bool = False, precision: int = 3, fontsize: int = 14)[source]
Plot the decision tree.
- Parameters:
figsize (tuple of int, default=(25,20)) – The size of the figure to create in matplotlib.
feature_names (list of str, default=None) – The names of the features.
filled (bool, default=True) – When set to True, paint nodes to indicate majority class for classification, extremity of values for regression, or purity of node for multi-output.
rounded (bool, default=False) – When set to True, draw node boxes with rounded corners and use Helvetica fonts instead of Times-Roman.
precision (int, default=3) – The precision for displaying split thresholds and other float values.
fontsize (int, default=14) – The fontsize for node labels.
- Return type:
matplotlib figure
- predict(X: numpy.typing.ArrayLike) → numpy.ndarray[source]
Predict regression value for X.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csr_matrix.
- Returns:
y – The predicted values.
- Return type:
array-like of shape (n_samples,) or (n_samples, n_outputs)
- bin_values(data: numpy.typing.ArrayLike) → numpy.ndarray[source]
Bin values of X into intervals.
- Parameters:
data (array-like) – The input values to be binned.
- Returns:
binned_values – The binned values.
- Return type:
array-like
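To illustrate how tree-based cut points can turn into bins, the sketch below fits a small DecisionTreeRegressor on one variable, reads the split thresholds from the fitted tree, and assigns bin indices with `numpy.digitize`. The output format of the real `bin_values` (for example interval labels versus integer codes) is not documented here, so integer bin indices are an assumption, as are the data and variable names.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: the target changes level at x = 3 and x = 7.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
levels = np.select([X[:, 0] < 3, X[:, 0] < 7], [1.0, 3.0], default=6.0)
y = levels + rng.normal(scale=0.2, size=300)

tree = DecisionTreeRegressor(max_leaf_nodes=3, random_state=0).fit(X, y)

# Internal nodes have feature >= 0; leaves are marked with a negative value.
is_split = tree.tree_.feature >= 0
cut_points = np.sort(tree.tree_.threshold[is_split])

# The tree sends x <= threshold to the left child, so right=True matches it.
values = np.array([1.0, 5.0, 9.0])
bin_index = np.digitize(values, cut_points, right=True)
```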
- fit(X, y, w=None)[source]
Fit the model and get the optimal binning.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csc_matrix.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – The target values (real numbers). Use dtype=np.float64 and order='C' for maximum efficiency.
w (array-like of shape (n_samples,), default=None) – Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node.
- Returns:
self – Fitted estimator.
- Return type: