Utilities
Data preprocessing, splitting, and utility functions for DRN workflows.
Data Preprocessing
Split and Preprocess
Main function for data splitting and preprocessing with support for both numerical and categorical features.
drn.utils.split_and_preprocess(features, target, num_features, cat_features, seed=42, num_standard=True)
Data Splitting
drn.utils.split_data(features, target, seed=42, train_size=0.6, val_size=0.2)
Split features and target into train, validation, and test sets based on fractions of the entire dataset.
Args:
features: DataFrame of predictors.
target: Series of labels.
seed: Random seed for reproducibility.
train_size: Fraction of data for training.
val_size: Fraction of data for validation. The test fraction is computed as 1 - train_size - val_size.
Returns: x_train_raw, x_val_raw, x_test_raw, y_train, y_val, y_test
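The fraction-based split described above can be sketched as follows. This is a minimal illustration, not the drn implementation: the function name `split_by_fractions`, the use of a NumPy permutation, and the rounding of split sizes are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def split_by_fractions(features, target, seed=42, train_size=0.6, val_size=0.2):
    """Illustrative three-way split (hypothetical helper, not drn.utils.split_data)."""
    n = len(features)
    # Shuffle row positions reproducibly with the given seed.
    order = np.random.default_rng(seed).permutation(n)
    n_train = int(round(train_size * n))
    n_val = int(round(val_size * n))
    train_idx = order[:n_train]
    val_idx = order[n_train:n_train + n_val]
    # Whatever remains is the test set: 1 - train_size - val_size of the data.
    test_idx = order[n_train + n_val:]
    return (
        features.iloc[train_idx], features.iloc[val_idx], features.iloc[test_idx],
        target.iloc[train_idx], target.iloc[val_idx], target.iloc[test_idx],
    )


# Toy data, invented for this example.
features = pd.DataFrame({"x1": np.arange(10.0), "x2": np.arange(10.0) ** 2})
target = pd.Series(np.arange(10.0), name="y")
x_train_raw, x_val_raw, x_test_raw, y_train, y_val, y_test = split_by_fractions(
    features, target
)
```

Because the fractions refer to the entire dataset, the validation size does not depend on how many rows were already assigned to training.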
Data Preprocessing
drn.utils.preprocess_data(x_train_raw, x_val_raw, x_test_raw, num_features, cat_features, num_standard=True)
Fit a ColumnTransformer on x_train_raw and transform the raw splits.
- Numeric features are optionally standardized.
- Categorical features are one-hot encoded, using the full set of categories detected from the splits.
Returns: x_train, x_val, x_test, fitted ColumnTransformer, all_categories mapping
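A rough sketch of this kind of pipeline with scikit-learn is shown below. It is not the drn implementation: the name `preprocess_sketch` is hypothetical, the sketch assumes categories are collected from all three splits, and it returns dense NumPy arrays, which may differ from what `preprocess_data` returns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def preprocess_sketch(x_train_raw, x_val_raw, x_test_raw,
                      num_features, cat_features, num_standard=True):
    """Hypothetical helper illustrating the idea; not drn.utils.preprocess_data."""
    # Assumption: gather every category seen in any split so that all splits
    # are encoded with the same one-hot columns.
    combined = pd.concat([x_train_raw, x_val_raw, x_test_raw])
    all_categories = {c: sorted(combined[c].unique()) for c in cat_features}

    ct = ColumnTransformer(
        [
            ("num", StandardScaler() if num_standard else "passthrough", num_features),
            ("cat",
             OneHotEncoder(categories=[all_categories[c] for c in cat_features]),
             cat_features),
        ],
        sparse_threshold=0,  # always return dense arrays
    )
    # Scaling statistics are estimated from the training rows only.
    x_train = ct.fit_transform(x_train_raw)
    x_val = ct.transform(x_val_raw)
    x_test = ct.transform(x_test_raw)
    return x_train, x_val, x_test, ct, all_categories


# Toy data, invented for this example; "c" appears only in the validation split.
x_train_raw = pd.DataFrame({"age": [20.0, 30.0, 40.0], "region": ["a", "b", "a"]})
x_val_raw = pd.DataFrame({"age": [25.0], "region": ["c"]})
x_test_raw = pd.DataFrame({"age": [35.0], "region": ["b"]})
x_train, x_val, x_test, ct, all_categories = preprocess_sketch(
    x_train_raw, x_val_raw, x_test_raw, ["age"], ["region"]
)
```

Passing the category lists explicitly is what lets the validation row with the unseen value "c" be encoded without an error, while the scaler still sees only training data.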
Categorical Handling
Replace Rare Categories
drn.utils.replace_rare_categories(df, threshold=10, placeholder='OTHER', cat_features=None)
Replace rare categories in specified categorical columns with a placeholder category.
Parameters:
- df: The input DataFrame.
- threshold: Minimum number of occurrences for a category to be kept.
- placeholder: Name to assign to rare categories.
- cat_features: If specified, only apply to these columns.
Raises:
- ValueError: If the placeholder value already exists in any of the target columns.
Returns:
- pd.DataFrame: A new DataFrame with rare categories replaced.
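The behaviour described above might look roughly like this in pandas. The helper name `replace_rare_sketch` is hypothetical and this is not the library's code. Because the documentation does not say which columns are used when cat_features is None, the sketch makes cat_features a required argument.

```python
import pandas as pd


def replace_rare_sketch(df, cat_features, threshold=10, placeholder="OTHER"):
    """Hypothetical helper; not drn.utils.replace_rare_categories."""
    out = df.copy()  # return a new DataFrame, leave the input untouched
    for col in cat_features:
        if (out[col] == placeholder).any():
            raise ValueError(f"Placeholder {placeholder!r} already exists in column {col!r}")
        counts = out[col].value_counts()
        # A category is kept only if it occurs at least `threshold` times.
        rare = counts[counts < threshold].index
        out[col] = out[col].where(~out[col].isin(rare), placeholder)
    return out


# Toy data, invented for this example: "c" occurs once and is rare at threshold=2.
df = pd.DataFrame({"region": ["a", "a", "a", "b", "b", "b", "c"]})
cleaned = replace_rare_sketch(df, ["region"], threshold=2)
```

Checking for the placeholder before replacing avoids silently merging a genuine category named "OTHER" with the rare ones.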
Mathematical Utilities
Binary Search for Inverse CDF
drn.utils.binary_search_icdf(distribution, p, l=None, u=None, max_iter=1000, tolerance=1e-07)
Generic binary search implementation for inverse CDF (quantiles).
This function can be used by any distribution that has a cdf method but doesn't have its own icdf implementation.
Args:
distribution: Distribution object with a cdf method
p: cumulative probability value at which to evaluate icdf
l: lower bound for the quantile search
u: upper bound for the quantile search
max_iter: maximum number of iterations
tolerance: stopping criterion for convergence
Returns: A tensor of shape (1, batch_shape) containing the inverse CDF values.
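To illustrate the idea, here is a scalar, pure-Python bisection sketch. It is not the drn implementation, which operates on tensors: the names `StdNormal` and `bisect_icdf` are hypothetical, the bounds must be given explicitly, and the sketch assumes the tolerance applies to the width of the search interval.

```python
import math


class StdNormal:
    """Toy distribution exposing only a cdf method (hypothetical, for illustration)."""

    def cdf(self, x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bisect_icdf(distribution, p, l, u, max_iter=1000, tolerance=1e-7):
    """Find x in [l, u] whose cdf is approximately p, by bisection."""
    for _ in range(max_iter):
        mid = 0.5 * (l + u)
        if distribution.cdf(mid) < p:
            l = mid  # the quantile lies to the right of mid
        else:
            u = mid  # the quantile lies at or to the left of mid
        # Assumption: stop once the bracketing interval is narrow enough.
        if u - l < tolerance:
            break
    return 0.5 * (l + u)


median = bisect_icdf(StdNormal(), 0.5, -10.0, 10.0)
q975 = bisect_icdf(StdNormal(), 0.975, -10.0, 10.0)
```

Bisection only requires that the cdf be non-decreasing and that the initial bounds bracket the quantile, which is why it works as a generic fallback when no closed-form icdf is available.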
Helper Functions
Convert to NumPy
drn.utils._to_numpy(data)
Convert input data to numpy array with float32 precision.
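A conversion of this kind can be approximated with plain NumPy, as in the sketch below. The name `to_numpy_sketch` is hypothetical, and the actual helper may handle additional input types that this one-liner does not.

```python
import numpy as np


def to_numpy_sketch(data):
    """Hypothetical stand-in: coerce array-like input to a float32 NumPy array."""
    return np.asarray(data, dtype=np.float32)


arr = to_numpy_sketch([[1, 2], [3, 4]])
```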