stepsel.binning.optimal
=======================

.. py:module:: stepsel.binning.optimal


Classes
-------

.. autoapisummary::

   stepsel.binning.optimal.OptimalBinningUsingDecisionTreeRegressor


Module Contents
---------------

.. py:class:: OptimalBinningUsingDecisionTreeRegressor(criterion: Literal['squared_error', 'friedman_mse', 'absolute_error', 'poisson'] = 'squared_error', splitter: Literal['best', 'random'] = 'best', max_depth: int | None = None, min_samples_split: float | int = 2, min_samples_leaf: float | int = 1, min_weight_fraction_leaf: float = 0, max_features: float | int | Literal['auto', 'sqrt', 'log2'] | None = None, max_leaf_nodes: int | None = None, min_impurity_decrease: float = 0, n_splits: int = 10, scoring: numpy.typing.ArrayLike | tuple | Mapping | None = 'neg_mean_squared_error', n_jobs: int | None = None, refit: str | bool = 'neg_mean_squared_error', verbose_cv: int = 0, return_train_score: bool = False, n_clusters: int = 100, max_grid_length: int = 100, n_best_clusters_for_extensive_search: int = 5, max_cycles_of_extensive_cluster_search: int = 5, verbose_flow: bool = True)

   Class for optimal binning of one variable.

   Steps:

   - All cp values are found using ``DecisionTreeRegressor``.
   - If the number of cp values is higher than ``max_grid_length``, KMeans is used to reduce the number of cp values:

     - CV is performed for ``n_clusters`` of cp values.
     - The best ``n_best_clusters_for_extensive_search`` clusters are selected for further CV of the cp values in those clusters.
     - If the number of cp values is still higher than ``max_grid_length``, the process is repeated.

   - CV is performed for all cp values.
   - The best cp value is selected.
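
   As a rough illustration of the first and last steps, the sketch below assumes that the cp values correspond to scikit-learn's ``ccp_alpha`` and are obtained from ``cost_complexity_pruning_path``; this page does not confirm either. The synthetic data, the ``random_state`` values, and ``min_samples_leaf=5`` are invented for the example, and the KMeans reduction is left out.

   .. code-block:: python

      import numpy as np
      from sklearn.model_selection import GridSearchCV, KFold
      from sklearn.tree import DecisionTreeRegressor

      # Synthetic single-variable data with one step at x = 5 (illustrative only).
      rng = np.random.default_rng(0)
      X = rng.uniform(0, 10, size=(300, 1))
      y = np.where(X[:, 0] < 5, 1.0, 3.0) + rng.normal(0, 0.3, size=300)

      # Candidate cp values: the effective alphas on the pruning path.
      # Assumption: "cp" is treated here as scikit-learn's ccp_alpha.
      tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
      path = tree.cost_complexity_pruning_path(X, y)
      cp_values = np.unique(np.clip(path.ccp_alphas, 0.0, None))

      # Cross-validate every candidate and keep the best one.
      search = GridSearchCV(
          DecisionTreeRegressor(min_samples_leaf=5, random_state=0),
          param_grid={"ccp_alpha": cp_values},
          scoring="neg_mean_squared_error",
          cv=KFold(n_splits=10, shuffle=True, random_state=0),
      )
      search.fit(X, y)
      best_cp = search.best_params_["ccp_alpha"]

   When the candidate list is long, cross-validating every value becomes expensive, which is the situation the KMeans reduction in the steps above is meant to handle.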

   .. method:: fit(X, y, w = None)

      Perform optimal binning.


   .. method:: set_feature_names(feature_names)

      Set the feature names of the input data X. They are used in outputs such as the tree plot and the cut points.
      Feature names must be set after each fit because the tree is reinitialized.


   .. method:: plot_tree(figsize: tuple=(25,20), feature_names: ArrayLike | None = None, filled: bool=True, rounded: bool=False, precision: int=3, fontsize: int=14)

      Plot the final tree.


   .. method:: predict(X)

      Predict regression target for X.


   .. method:: bin_values(data)

      Bin ``data`` into intervals using the fitted optimal binning.


   .. note::

      TODO: ``cut_points`` are collected before ``feature_names`` are set, so the dictionary does not contain the feature names as keys.


   .. py:class:: Logging

      Nested class responsible for logging intermediate results.


      .. py:attribute:: fit_log_template


      .. py:attribute:: fit_cycle_log_template


      .. py:method:: log_init()

         Logs initialization.


      .. py:method:: log_cycle(cycle, cp_values_cycle, cv_cp_paths_kmeans_clusters, cp_values_cycle_reduced, cv_results)

         Log one cycle of cp values reduction with KMeans.

         :param cycle: Cycle number.
         :param cp_values_cycle: All cp values entering the cycle.
         :param cv_cp_paths_kmeans_clusters: Cluster assigned to each cp value.
         :param cp_values_cycle_reduced: Reduced cp values, based on KMeans, for which CV is performed.
         :param cv_results: CV results (output of ``GridSearchCV.cv_results_``).


      .. py:method:: log_final(cp_values, cv_fit, in_cycle)

         Log final fit state.

         :param cp_values: All cp values of the experiment.
         :param cv_fit: CV fit object (``GridSearchCV``).
         :param in_cycle: Indicator of whether KMeans reduction was performed.
         :type in_cycle: bool


      .. py:method:: log_cycle_final(cycle, cp_values_cycle, cv_cp_paths_kmeans_clusters, cp_values_cycle_reduced, cv_results, cv_fit, cp_values, in_cycle=True)

         Combines ``log_cycle()`` and ``log_final()`` to save the final log inside the KMeans loop.

         :param cycle: Cycle number.
         :param cp_values_cycle: All cp values entering the cycle.
         :param cv_cp_paths_kmeans_clusters: Cluster assigned to each cp value.
         :param cp_values_cycle_reduced: Reduced cp values, based on KMeans, for which CV is performed.
         :param cv_results: CV results (output of ``GridSearchCV.cv_results_``).
         :param cv_fit: CV fit object (``GridSearchCV``).
         :param cp_values: All cp values of the experiment.
         :param in_cycle: Indicator of whether KMeans reduction was performed.
         :type in_cycle: bool


   .. py:attribute:: criterion


   .. py:attribute:: splitter


   .. py:attribute:: max_depth


   .. py:attribute:: min_samples_split


   .. py:attribute:: min_samples_leaf


   .. py:attribute:: min_weight_fraction_leaf


   .. py:attribute:: max_features


   .. py:attribute:: max_leaf_nodes


   .. py:attribute:: min_impurity_decrease


   .. py:attribute:: n_splits


   .. py:attribute:: scoring


   .. py:attribute:: refit


   .. py:attribute:: n_jobs


   .. py:attribute:: return_train_score


   .. py:attribute:: verbose_cv


   .. py:attribute:: n_clusters


   .. py:attribute:: max_grid_length


   .. py:attribute:: n_best_clusters_for_extensive_search


   .. py:attribute:: max_cycles_of_extensive_cluster_search


   .. py:attribute:: verbose_flow


   .. py:attribute:: cv_folds


   .. py:attribute:: tree_model


   .. py:attribute:: feature_names
      :value: None


   .. py:attribute:: Log


   .. py:attribute:: cut_points
      :value: None


   .. py:attribute:: final_model
      :value: None


   .. py:method:: _check_feature_names(X)

      Check if the feature names are present in the input data.

      :param X: The input data.
      :type X: pandas.DataFrame

      :returns: None. The feature names are saved in the ``feature_names`` attribute.
      :rtype: None


   .. py:method:: set_feature_names(feature_names)

      Set the feature names. This method is useful when the input data X is a numpy array.

      :param feature_names: The feature names.
      :type feature_names: list

      :returns: None. The feature names are saved in the ``feature_names`` attribute.
      :rtype: None


   .. py:method:: _calculate_available_cp_params(tree_model: sklearn.tree.DecisionTreeRegressor, cv_folds: sklearn.model_selection.KFold, X, y, w)
      :staticmethod:


      Collects the available cp parameters from the cross-validation folds.

      :param tree_model: The decision tree regressor model.
      :type tree_model: DecisionTreeRegressor
      :param cv_folds: The cross-validation folds.
      :type cv_folds: KFold
      :param X: The input data.
      :type X: numpy.ndarray
      :param y: The target data.
      :type y: numpy.ndarray
      :param w: The sample weights.
      :type w: numpy.ndarray

      :returns: **cv_cp_paths** -- The available cp parameters.
      :rtype: numpy.ndarray
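
      One way such a collection could work is sketched below. It assumes the cp parameters are scikit-learn ``ccp_alpha`` values taken from ``cost_complexity_pruning_path`` on the training part of each fold; the helper name ``collect_cp_values`` and the clipping and deduplication steps are hypothetical and not taken from this module.

      .. code-block:: python

         import numpy as np
         from sklearn.base import clone
         from sklearn.model_selection import KFold
         from sklearn.tree import DecisionTreeRegressor


         def collect_cp_values(tree_model, cv_folds, X, y, w=None):
             """Hypothetical helper: union of pruning-path alphas over all folds."""
             alphas = []
             for train_idx, _ in cv_folds.split(X):
                 w_train = None if w is None else w[train_idx]
                 # Pruning path of an unfitted copy, computed on the training rows only.
                 path = clone(tree_model).cost_complexity_pruning_path(
                     X[train_idx], y[train_idx], sample_weight=w_train
                 )
                 alphas.append(path.ccp_alphas)
             # Clip tiny negative values from floating-point error, then sort and dedupe.
             return np.unique(np.clip(np.concatenate(alphas), 0.0, None))


         # Illustrative data only.
         rng = np.random.default_rng(1)
         X = rng.uniform(0, 10, size=(200, 1))
         y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=200)

         cv_cp_paths = collect_cp_values(
             DecisionTreeRegressor(min_samples_leaf=5, random_state=0),
             KFold(n_splits=5, shuffle=True, random_state=0),
             X,
             y,
         )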


   .. py:method:: _kmeans_cv_cp_paths_reduction(cv_cp_paths: numpy.typing.ArrayLike, n_clusters: int)
      :staticmethod:


      Reduces the number of cp parameters for grid search using KMeans.

      :param cv_cp_paths: The available cp parameters to reduce.
      :type cv_cp_paths: numpy.ndarray
      :param n_clusters: The number of clusters to use for the KMeans algorithm (maximum number of cp parameters to search at once).
      :type n_clusters: int

      :returns: * **cv_cp_paths_reduced** (*numpy.ndarray*) -- The reduced cp parameters.
                * **cv_cp_paths_kmeans_clusters** (*numpy.ndarray*) -- The cluster labels for each cp parameter.
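
      A minimal sketch of this kind of reduction follows. It assumes each cluster is represented by its centre; the actual method may pick representatives differently. The helper name ``reduce_cp_values``, ``n_init=10``, and ``random_state=0`` are choices made for the example.

      .. code-block:: python

         import numpy as np
         from sklearn.cluster import KMeans


         def reduce_cp_values(cv_cp_paths, n_clusters):
             """Hypothetical helper: cluster 1-D cp values, return centres and labels."""
             values = np.asarray(cv_cp_paths, dtype=float)
             km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
             # KMeans expects a 2-D array, so the values become a single column.
             labels = km.fit_predict(values.reshape(-1, 1))
             # reduced[k] is the representative of the cluster labelled k.
             reduced = np.clip(km.cluster_centers_.ravel(), 0.0, None)
             return reduced, labels


         # Three well-separated groups of candidate values (illustrative only).
         values = np.concatenate([
             np.linspace(0.0, 0.01, 50),
             np.linspace(0.5, 0.6, 50),
             np.linspace(2.0, 2.1, 50),
         ])
         reduced, labels = reduce_cp_values(values, n_clusters=3)

      Keeping the labels alongside the reduced values makes it possible to return later to every original cp value in a promising cluster.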


   .. py:method:: _x_validation(cv_cp_paths: numpy.typing.ArrayLike, X, y, sample_weight=None)

      Performs cross-validation for the given cp parameters.

      :param cv_cp_paths: The cp parameters to use for cross-validation.
      :type cv_cp_paths: numpy.ndarray
      :param X: The input data.
      :type X: numpy.ndarray
      :param y: The target data.
      :type y: numpy.ndarray
      :param sample_weight: The sample weights.
      :type sample_weight: numpy.ndarray, optional

      :returns: **gs** -- The grid search object.
      :rtype: GridSearchCV


   .. py:method:: _get_cp_params_for_best_xval_clusters(gs, cv_cp_paths, cv_cp_paths_kmeans_clusters, n_best_clusters_for_extensive_search)
      :staticmethod:


      Get cp parameters for the best clusters from cross-validation.

      :param gs: Fitted GridSearchCV object.
      :type gs: GridSearchCV object
      :param cv_cp_paths: The available cp parameters.
      :type cv_cp_paths: numpy.ndarray
      :param cv_cp_paths_kmeans_clusters: The cluster labels for each cp parameter.
      :type cv_cp_paths_kmeans_clusters: numpy.ndarray
      :param n_best_clusters_for_extensive_search: The number of best clusters to use for extensive search.
      :type n_best_clusters_for_extensive_search: int

      :returns: **best_cp_params** -- The cp parameters for the best clusters.
      :rtype: numpy.ndarray
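
      The selection can be pictured with the sketch below. It assumes one cross-validated score per cluster, ordered by cluster label, with higher scores being better (as with ``neg_mean_squared_error``). With a fitted ``GridSearchCV``, such scores would typically come from ``gs.cv_results_["mean_test_score"]``. The helper name and the toy numbers are hypothetical.

      .. code-block:: python

         import numpy as np


         def cp_values_in_best_clusters(mean_scores, cp_values, labels, n_best):
             """Hypothetical helper: keep cp values that fall in the top-scoring clusters."""
             # Cluster labels sorted from best to worst score (higher is better).
             best_clusters = np.argsort(mean_scores)[::-1][:n_best]
             mask = np.isin(labels, best_clusters)
             return np.asarray(cp_values)[mask]


         # Toy input: seven cp values assigned to three clusters.
         cp_values = np.array([0.0, 0.01, 0.02, 0.5, 0.6, 2.0, 2.1])
         labels = np.array([0, 0, 0, 1, 1, 2, 2])
         mean_scores = np.array([-0.9, -0.2, -0.5])  # score of cluster 0, 1, 2

         best_cp_params = cp_values_in_best_clusters(
             mean_scores, cp_values, labels, n_best=2
         )
         # Clusters 1 and 2 score best, so their four cp values are kept.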


   .. py:method:: _collect_results(gs)

      Collects the results from the grid search.

      :param gs: Fitted GridSearchCV object.
      :type gs: GridSearchCV object

      :returns: None. The results are saved as attributes of the instance.
      :rtype: None


   .. py:method:: plot_tree(figsize: tuple = (25, 20), filled: bool = True, rounded: bool = False, precision: int = 3, fontsize: int = 14)

      Plot the decision tree.

      :param figsize: The size of the figure to create in matplotlib.
      :type figsize: tuple of int, default=(25,20)
      :param feature_names: The names of the features.
      :type feature_names: list of str, default=None
      :param filled: When set to True, paint nodes to indicate majority class for classification, extremity of values for regression, or purity of node for multi-output.
      :type filled: bool, default=True
      :param rounded: When set to True, draw node boxes with rounded corners and use Helvetica fonts instead of Times-Roman.
      :type rounded: bool, default=False
      :param precision: The precision for displaying split thresholds and other float values.
      :type precision: int, default=3
      :param fontsize: The fontsize for node labels.
      :type fontsize: int, default=14

      :rtype: matplotlib figure


   .. py:method:: predict(X: numpy.typing.ArrayLike) -> numpy.ndarray

      Predict regression value for X.

      :param X: The input samples. Internally, it will be converted to
                ``dtype=np.float32`` and if a sparse matrix is provided
                to a sparse ``csr_matrix``.
      :type X: {array-like, sparse matrix} of shape (n_samples, n_features)

      :returns: **y** -- The predicted values.
      :rtype: array-like of shape (n_samples,) or (n_samples, n_outputs)


   .. py:method:: bin_values(data: numpy.typing.ArrayLike) -> numpy.ndarray

      Bin the values of ``data`` into intervals.

      :param data: The input values to be binned.
      :type data: array-like

      :returns: **binned_values** -- The binned values.
      :rtype: array-like
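
      To show the general idea of tree-based binning, the sketch below reads the split thresholds of a fitted scikit-learn tree and assigns each value to an interval with ``numpy.digitize``. It is only an illustration under assumptions: the output format of ``bin_values`` is not specified on this page, and the data and ``max_leaf_nodes=3`` are invented for the example.

      .. code-block:: python

         import numpy as np
         from sklearn.tree import DecisionTreeRegressor

         # Synthetic data with steps at x = 3 and x = 7 (illustrative only).
         rng = np.random.default_rng(0)
         x = rng.uniform(0, 10, size=500)
         y = np.where(x < 3, 0.0, np.where(x < 7, 2.0, 5.0)) + rng.normal(0, 0.1, 500)

         tree = DecisionTreeRegressor(max_leaf_nodes=3, random_state=0)
         tree.fit(x.reshape(-1, 1), y)

         # Internal nodes have a feature index >= 0; leaves are marked with -2.
         is_split = tree.tree_.feature >= 0
         cut_points = np.sort(tree.tree_.threshold[is_split])

         # The tree sends x <= threshold to the left child, so right=True keeps
         # the bin edges consistent: bin 0 holds x <= cut_points[0], and so on.
         bins = np.digitize(x, cut_points, right=True)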


   .. py:method:: fit(X, y, w=None)

      Fit the model and get the optimal binning.

      :param X: The training input samples. Internally, it will be converted to
                ``dtype=np.float32`` and if a sparse matrix is provided
                to a sparse ``csc_matrix``.
      :type X: {array-like, sparse matrix} of shape (n_samples, n_features)
      :param y: The target values (real numbers). Use ``dtype=np.float64`` and
                ``order='C'`` for maximum efficiency.
      :type y: array-like of shape (n_samples,) or (n_samples, n_outputs)
      :param w: Sample weights. If None, then samples are equally weighted. Splits
                that would create child nodes with net zero or negative weight are
                ignored while searching for a split in each node.
      :type w: array-like of shape (n_samples,), default=None

      :returns: **self** -- Fitted estimator.
      :rtype: OptimalBinningUsingDecisionTreeRegressor