
Utilities

Data preprocessing, splitting, and utility functions for DRN workflows.

Data Preprocessing

Split and Preprocess

Main function for data splitting and preprocessing with support for both numerical and categorical features.

drn.utils.split_and_preprocess(features, target, num_features, cat_features, seed=42, num_standard=True)

Data Splitting

drn.utils.split_data(features, target, seed=42, train_size=0.6, val_size=0.2)

Split features and target into train, validation, and test sets based on fractions of the entire dataset.

Args:
- features: DataFrame of predictors.
- target: Series of labels.
- seed: Random seed for reproducibility.
- train_size: Fraction of data for training.
- val_size: Fraction of data for validation. The test fraction is computed as 1 - train_size - val_size.

Returns: x_train_raw, x_val_raw, x_test_raw, y_train, y_val, y_test
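
The fraction-based split described above can be sketched in plain Python. This is only an illustration of the idea, not the drn implementation: the helper name split_by_fraction is hypothetical, it works on lists rather than a DataFrame and Series, and the rounding rule for the split sizes is an assumption.

```python
import random


def split_by_fraction(rows, targets, seed=42, train_size=0.6, val_size=0.2):
    """Hypothetical helper: shuffle once, then cut into train/val/test by fraction."""
    if len(rows) != len(targets):
        raise ValueError("rows and targets must have the same length")
    if train_size + val_size > 1.0:
        raise ValueError("train_size + val_size must not exceed 1")

    n = len(rows)
    indices = list(range(n))
    # A seeded generator makes the shuffle, and so the split, reproducible.
    random.Random(seed).shuffle(indices)

    # Assumption: sizes are rounded; the test set takes whatever remains.
    n_train = round(train_size * n)
    n_val = round(val_size * n)
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (
        pick(rows, train_idx), pick(rows, val_idx), pick(rows, test_idx),
        pick(targets, train_idx), pick(targets, val_idx), pick(targets, test_idx),
    )


rows = list(range(100))
targets = [2 * r for r in rows]
x_train_raw, x_val_raw, x_test_raw, y_train, y_val, y_test = split_by_fraction(rows, targets)
print(len(x_train_raw), len(x_val_raw), len(x_test_raw))
```

Because the fractions refer to the entire dataset, the defaults of 0.6 and 0.2 leave the remaining 0.2 for the test set.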

Data Preprocessing

drn.utils.preprocess_data(x_train_raw, x_val_raw, x_test_raw, num_features, cat_features, num_standard=True)

Fit a ColumnTransformer on x_train_raw and use it to transform all three raw splits.
- Numeric features are optionally standardized (see num_standard).
- Categorical features are one-hot encoded, using the full set of categories detected across the splits.

Returns: x_train, x_val, x_test, fitted ColumnTransformer, all_categories mapping
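
The two rules above (scaling statistics learned from the training split, one-hot columns built from the categories seen across the splits) can be illustrated without scikit-learn. The sketch below is a simplified stand-in, not the drn code: fit_preprocessor and transform are hypothetical names, rows are plain dictionaries, and it assumes "full categories" means the union over train, validation, and test.

```python
from statistics import mean, pstdev


def fit_preprocessor(train, val, test, num_features, cat_features, num_standard=True):
    """Hypothetical helper: learn scaling from train, categories from all splits."""
    stats = {}
    for col in num_features:
        if num_standard:
            values = [row[col] for row in train]
            # Population standard deviation; fall back to 1.0 for constant columns.
            stats[col] = (mean(values), pstdev(values) or 1.0)
        else:
            stats[col] = (0.0, 1.0)  # identity: leave values unscaled

    # Assumption: the category list is the union over every split, so a level
    # that only appears in val or test still gets its own one-hot column.
    all_categories = {
        col: sorted({row[col] for split in (train, val, test) for row in split})
        for col in cat_features
    }
    return stats, all_categories


def transform(rows, stats, all_categories):
    """Apply the fitted scaling and one-hot encoding to a list of row dicts."""
    out = []
    for row in rows:
        encoded = [(row[col] - mu) / sd for col, (mu, sd) in stats.items()]
        for col, cats in all_categories.items():
            encoded.extend(1.0 if row[col] == c else 0.0 for c in cats)
        out.append(encoded)
    return out


train = [{"age": 20.0, "region": "north"}, {"age": 40.0, "region": "south"}]
val = [{"age": 30.0, "region": "east"}]
test = [{"age": 50.0, "region": "north"}]

stats, all_categories = fit_preprocessor(train, val, test, ["age"], ["region"])
x_train = transform(train, stats, all_categories)
x_val = transform(val, stats, all_categories)
x_test = transform(test, stats, all_categories)
print(all_categories)
print(x_val)
```

Fitting the scaler on the training split alone keeps validation and test statistics out of the model inputs, while sharing the category list gives every split the same number of columns.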

Categorical Handling

Replace Rare Categories

drn.utils.replace_rare_categories(df, threshold=10, placeholder='OTHER', cat_features=None)

Replace rare categories in specified categorical columns with a placeholder category.

Parameters:
- df: The input DataFrame.
- threshold: Minimum number of occurrences for a category to be kept.
- placeholder: Name to assign to rare categories.
- cat_features: If specified, only apply to these columns.

Raises: - ValueError: If the placeholder value already exists in any of the target columns.

Returns: - pd.DataFrame: A new DataFrame with rare categories replaced.
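
For a single column, the replacement rule can be sketched with the standard library. This is an illustration under stated assumptions rather than the drn implementation: replace_rare is a hypothetical name, it takes a list instead of a DataFrame, and it assumes a category is kept when its count is at least threshold.

```python
from collections import Counter


def replace_rare(values, threshold=10, placeholder="OTHER"):
    """Hypothetical helper: map categories seen fewer than `threshold` times to `placeholder`."""
    if placeholder in values:
        # Mirrors the documented ValueError: the placeholder must not clash
        # with a category that already exists in the column.
        raise ValueError(f"placeholder {placeholder!r} already present in the data")
    counts = Counter(values)
    # Build a new list so the input is left unchanged.
    return [v if counts[v] >= threshold else placeholder for v in values]


colours = ["red"] * 3 + ["teal"] + ["blue"] * 2
cleaned = replace_rare(colours, threshold=2)
print(cleaned)
```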

Mathematical Utilities

Binary Search for Inverse CDF

drn.utils.binary_search_icdf(distribution, p, l=None, u=None, max_iter=1000, tolerance=1e-07)

Generic binary search implementation for inverse CDF (quantiles).

This function can be used by any distribution that has a cdf method but doesn't have its own icdf implementation.

Args:
- distribution: Distribution object with a cdf method.
- p: Cumulative probability value at which to evaluate the icdf.
- l: Lower bound for the quantile search.
- u: Upper bound for the quantile search.
- max_iter: Maximum number of iterations.
- tolerance: Stopping criterion for convergence.

Returns: A tensor of shape (1, batch_shape) containing the inverse CDF values.
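
The underlying idea is ordinary bisection on a non-decreasing CDF. The scalar sketch below is an assumption-laden illustration, not the drn function: bisect_icdf and normal_cdf are hypothetical names, it works on Python floats rather than batched tensors, it requires explicit bounds instead of the None defaults, and it treats tolerance as a limit on the bracket width.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, used here only as a test function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def bisect_icdf(cdf, p, lower, upper, max_iter=1000, tolerance=1e-7):
    """Hypothetical helper: find x in [lower, upper] with cdf(x) close to p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if not cdf(lower) <= p <= cdf(upper):
        raise ValueError("the bounds do not bracket the requested quantile")

    for _ in range(max_iter):
        if upper - lower < tolerance:
            break  # bracket is narrow enough
        mid = 0.5 * (lower + upper)
        if cdf(mid) < p:
            lower = mid  # quantile lies to the right of mid
        else:
            upper = mid  # quantile lies at or to the left of mid
    return 0.5 * (lower + upper)


median = bisect_icdf(normal_cdf, 0.5, -10.0, 10.0)
upper_tail = bisect_icdf(normal_cdf, 0.975, -10.0, 10.0)
print(median, upper_tail)
```

Each iteration halves the bracket, so a starting interval of width 20 reaches a width below 1e-7 in under 30 iterations, well within the default max_iter.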

Helper Functions

Convert to NumPy

drn.utils._to_numpy(data)

Convert input data to numpy array with float32 precision.