The PyePAL API reference¶
The PAL package¶
Core functions¶
Core functions for PAL
Base class¶
Base class for PAL
- class pyepal.pal.pal_base.PALBase(X_design, models, ndim, epsilon=0.01, delta=0.05, beta_scale=0.1111111111111111, goals=None, coef_var_threshold=3, ranges=None)[source]¶
Bases:
object
PAL base class
- __init__(X_design, models, ndim, epsilon=0.01, delta=0.05, beta_scale=0.1111111111111111, goals=None, coef_var_threshold=3, ranges=None)[source]¶
Initialize the PAL instance
- Parameters:
X_design (np.array) – Design space (feature matrix)
models (list) – Machine learning models
ndim (int) – Number of objectives
epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.
delta (float, optional) – Delta hyperparameter. Defaults to 0.05.
beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.
goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.
coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.
ranges (np.ndarray, optional) – Numpy array of length ndim, where each element contains the value range of the given objective. If this is provided, we will use \(\epsilon \cdot ranges\) to compute the uncertainties of the hyperrectangles instead of the default behavior \(\epsilon \cdot |\mu|\).
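The difference between the two tolerance definitions mentioned for ranges can be sketched in plain Python. This is only an illustration of the formulas \(\epsilon \cdot ranges\) and \(\epsilon \cdot |\mu|\), not PyePAL's implementation; the function name epsilon_tolerance is hypothetical.

```python
import math


def epsilon_tolerance(mu, epsilon, ranges=None):
    """Per-objective tolerance (hypothetical helper, not part of PyePAL).

    Uses epsilon * ranges if ranges is given, else epsilon * |mu|.
    """
    if ranges is not None:
        return [e * r for e, r in zip(epsilon, ranges)]
    return [e * abs(m) for e, m in zip(epsilon, mu)]


mu = [2.0, -4.0]          # predicted means of one design point
epsilon = [0.01, 0.01]    # one epsilon per objective

relative = epsilon_tolerance(mu, epsilon)                      # scales with |mu|
fixed = epsilon_tolerance(mu, epsilon, ranges=[10.0, 100.0])   # independent of mu
```

With the fixed variant, the tolerance no longer shrinks for points whose predicted mean is close to zero.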
- __weakref__¶
list of weak references to the object (if defined)
- augment_design_space(X_design, classify=False, clean_classify=True)[source]¶
Add new design points to PAL instance
- Parameters:
X_design (np.ndarray) – Design matrix. Two-dimensional array containing measurements in the rows and the features as the columns.
classify (bool) – Reclassifies the new design space, using the old model. That is, it runs inference, calculates the hyperrectangles, and runs the classification. Does not increase the iteration count. Note, though, that points that have already been classified as Pareto-optimal will not be re-classified (for example, discarded), even if the new design points dominate the existing “Pareto optimal” points. Defaults to False.
clean_classify (bool) – Reclassifies the new design space, using the old model. That is, it runs inference, calculates the hyperrectangles, and runs the classification. Does not increase the iteration count. But, in contrast to classify, it erases all previous classifications before running the new classification. Hence, if some new design point dominates a previously Pareto efficient point, the previous Pareto optimal point will no longer be classified as Pareto efficient. This flag is incompatible with classify. If you choose clean_classify, PyePAL will erase all previous classifications, independent of what you choose for classify. Defaults to True.
- Return type:
None
- property discarded_indices¶
Return the indices of the discarded points
- property discarded_points¶
Return the discarded points
- property hyperrectangle_sizes¶
Return the sizes of the hyperrectangles
- property means¶
Return the means predicted by the model
- property number_design_points¶
Return the number of points in the design space
- property number_discarded_points¶
Return the number of discarded points
- property number_pareto_optimal_points¶
Return the number of Pareto optimal points
- property number_sampled_points¶
Return the number of sampled points
- property number_unclassified_points¶
Return the number of unclassified points
- property pareto_optimal_indices¶
Return the indices of the Pareto optimal points
- property pareto_optimal_points¶
Return the Pareto optimal points
- run_one_step(batch_size=1, pooling_method='fro', sample_discarded=False, use_coef_var=True, replace_mean=True, replace_std=True)[source]¶
Run one step of the PAL loop and propose the next point(s) to measure.
- Parameters:
batch_size (int, optional) – Number of indices that will be returned. Defaults to 1.
pooling_method (str) – Method that is used to aggregate the uncertainty in different objectives into one scalar. Available options are: “fro” (Frobenius/Euclidean norm), “mean”, “median”. Defaults to “fro”.
sample_discarded (bool) – If True, samples from all points and not only from the unclassified and Pareto optimal ones
use_coef_var (bool) – If True, uses the coefficient of variation instead of the unscaled rectangle sizes
replace_mean (bool) – If True, uses the measured means for the sampled points
replace_std (bool) – If True, uses the measured standard deviation for the sampled points
- Raises:
ValueError – In case the PAL instance was not initialized with measurements.
- Returns:
Array of indices if there are unclassified points left that can be sampled; otherwise None.
- Return type:
Union[np.array, None]
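The coefficient of variation referred to by coef_var_threshold and use_coef_var is the standard ratio of standard deviation to absolute mean. The sketch below shows the ratio and a threshold filter in plain Python; the helper names are hypothetical, and how PyePAL handles a zero mean is an assumption here.

```python
def coefficient_of_variation(mean, std):
    """Return std / |mean| (assumption: treated as infinite for a zero mean)."""
    if mean == 0:
        return float("inf")
    return std / abs(mean)


def below_threshold(means, stds, coef_var_threshold=3):
    """Flag predictions whose coefficient of variation is below the threshold."""
    return [
        coefficient_of_variation(m, s) < coef_var_threshold
        for m, s in zip(means, stds)
    ]


# Predictions for one objective at three design points
mask = below_threshold(means=[1.0, 0.1, -2.0], stds=[0.5, 1.0, 1.0])
```

The second point has a standard deviation ten times larger than its mean and would be left out.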
- sample(exclude_idx=None, pooling_method='fro', sample_discarded=False, use_coef_var=True)[source]¶
Runs the sampling step based on the size of the hyperrectangle, i.e., it favors exploration.
- Parameters:
exclude_idx (Union[np.array, None], optional) – Points in design space to exclude from sampling. Defaults to None.
pooling_method (str) – Method that is used to aggregate the uncertainty in different objectives into one scalar. Available options are: “fro” (Frobenius/Euclidean norm), “mean”, “median”. Defaults to “fro”.
sample_discarded (bool) – If True, samples from all points and not only from the unclassified and Pareto optimal ones
use_coef_var (bool) – If True, uses the coefficient of variation instead of the unscaled rectangle sizes
- Raises:
ValueError – In case there are no uncertainty rectangles, i.e., when _predict has not been called successfully.
- Returns:
Index of next point to evaluate in design space
- Return type:
int
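The pooling options named for pooling_method can be illustrated with a small, dependency-free sketch that reduces per-objective rectangle sizes to one scalar and picks the largest. The function names pool and pick_most_uncertain are hypothetical, and the sketch ignores the classification state that the real method takes into account.

```python
import math
import statistics


def pool(sizes, pooling_method="fro"):
    """Aggregate per-objective uncertainties into one scalar."""
    if pooling_method == "fro":
        return math.sqrt(sum(s * s for s in sizes))  # Euclidean norm
    if pooling_method == "mean":
        return statistics.mean(sizes)
    if pooling_method == "median":
        return statistics.median(sizes)
    raise ValueError(f"Unknown pooling method: {pooling_method}")


def pick_most_uncertain(rectangle_sizes, exclude_idx=None, pooling_method="fro"):
    """Return the index of the point with the largest pooled uncertainty."""
    excluded = set(exclude_idx or [])
    candidates = [i for i in range(len(rectangle_sizes)) if i not in excluded]
    if not candidates:
        raise ValueError("No candidate points left to sample")
    return max(candidates, key=lambda i: pool(rectangle_sizes[i], pooling_method))


# Rectangle sizes: one row per design point, one column per objective
sizes = [[3.0, 4.0], [1.0, 1.0], [6.0, 0.0]]
next_fro = pick_most_uncertain(sizes)
next_mean = pick_most_uncertain(sizes, pooling_method="mean")
next_excluding = pick_most_uncertain(sizes, exclude_idx=[2])
```

The example shows that the choice of pooling method can change which point is proposed: the Euclidean norm favors the point with one very uncertain objective, whereas the mean favors the point that is uncertain in both.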
- property sampled_indices¶
Return the indices of the sampled points
- property sampled_mask¶
Create a mask for the sampled points. We count a point as sampled if at least one objective has been measured, i.e., self.sampled is an N × (number of objectives) array in which some columns can be False if a measurement has not been performed.
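The rule "sampled if at least one objective has been measured" amounts to a row-wise any. A minimal plain-Python illustration, with nested lists standing in for the boolean array:

```python
# One row per design point, one column per objective;
# True means that this objective has been measured.
sampled = [
    [True, False],   # only the first objective measured
    [False, False],  # nothing measured yet
    [True, True],    # both objectives measured
]

# A point counts as sampled if any of its objectives was measured
sampled_mask = [any(row) for row in sampled]
sampled_indices = [i for i, flag in enumerate(sampled_mask) if flag]
```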
- property sampled_points¶
Return the sampled points
- property unclassified_indices¶
Return the indices of the unclassified points
- property unclassified_points¶
Return the unclassified points
- update_train_set(indices, measurements, measurement_uncertainty=None)[source]¶
Update training set following a measurement
- Parameters:
indices (np.ndarray) – Indices of design space at which the measurements were taken
measurements (np.ndarray) – Measured values, 2D array. The length must equal the length of the indices array. The second dimension must equal the number of objectives. If an objective is missing, provide np.nan. For example, np.array([1, 1, np.nan])
measurement_uncertainty (np.ndarray) – Uncertainty in the measurements. If not provided (None), it will be zero. If it is not None, it must be an array with the same shape as the measurements. If an objective is missing, provide np.nan. For example, np.array([1, 1, np.nan])
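The shape requirements above can be checked before calling update_train_set. The following sketch uses nested lists and float("nan") in place of NumPy arrays and np.nan so that it runs without dependencies; check_measurements is a hypothetical helper, not a PyePAL function.

```python
import math


def check_measurements(indices, measurements, ndim):
    """Validate shapes and report which objectives were measured."""
    if len(indices) != len(measurements):
        raise ValueError("Need exactly one row of measurements per index")
    for row in measurements:
        if len(row) != ndim:
            raise ValueError("Every row needs one entry per objective")
    # NaN marks an objective that was not measured for this point
    return [[not math.isnan(value) for value in row] for row in measurements]


nan = float("nan")
measured = check_measurements(
    indices=[4, 7],
    measurements=[[1.0, 1.0, nan], [0.5, nan, 2.0]],
    ndim=3,
)
```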
- property uses_fixed_epsilon¶
True if it uses the fixed epsilon \(\epsilon \cdot ranges\)
For GPy models¶
PAL using GPy GPR models
- class pyepal.pal.pal_gpy.PALGPy(*args, **kwargs)[source]¶
Bases:
PALBase
PAL class for a list of GPy GPR models, with one model per objective
- __init__(*args, **kwargs)[source]¶
Construct the PALGPy instance
- Parameters:
X_design (np.array) – Design space (feature matrix)
models (list) – Machine learning models
ndim (int) – Number of objectives
epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.
delta (float, optional) – Delta hyperparameter. Defaults to 0.05.
beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.
goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.
coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.
restarts (int) – Number of random restarts that are used for hyperparameter optimization. Defaults to 20.
n_jobs (int) – Number of parallel processes that are used to fit the GPR models. Defaults to 1.
power_transformer (bool) – If True, use Yeo-Johnson transform on the inputs. Defaults to False.
For coregionalized GPy models¶
PAL for coregionalized GPR models
- class pyepal.pal.pal_coregionalized.PALCoregionalized(*args, **kwargs)[source]¶
Bases:
PALBase
PAL class for a coregionalized GPR model
- __init__(*args, **kwargs)[source]¶
Construct the PALCoregionalized instance
- Parameters:
X_design (np.array) – Design space (feature matrix)
models (list) – Machine learning models
ndim (int) – Number of objectives
epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.
delta (float, optional) – Delta hyperparameter. Defaults to 0.05.
beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.
goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.
coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.
restarts (int) – Number of random restarts that are used for hyperparameter optimization. Defaults to 20.
parallel (bool) – If true, model hyperparameters are optimized in parallel, using the GPy implementation. Defaults to False.
power_transformer (bool) – If True, use Yeo-Johnson transform on the inputs. Defaults to False.
For sklearn GPR models¶
PAL using Sklearn GPR models
- class pyepal.pal.pal_sklearn.PALSklearn(*args, **kwargs)[source]¶
Bases:
PALBase
PAL class for a list of Sklearn (GPR) models, with one model per objective
- __init__(*args, **kwargs)[source]¶
Construct the PALSklearn instance
- Parameters:
X_design (np.array) – Design space (feature matrix)
models (list) – Machine learning models. You can provide a list of GaussianProcessRegressor instances or a list of fitted RandomizedSearchCV/GridSearchCV instances with GaussianProcessRegressor models
ndim (int) – Number of objectives
epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.
delta (float, optional) – Delta hyperparameter. Defaults to 0.05.
beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.
goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.
coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.
n_jobs (int) – Number of parallel processes that are used to fit the GPR models. Defaults to 1.
For quantile regression with LightGBM¶
Implements a PAL class for GBDT models which can predict uncertainty intervals when used with quantile loss. For an example of GBDT with quantile loss see Jablonka, Kevin Maik; Moosavi, Seyed Mohamad; Asgari, Mehrdad; Ireland, Christopher; Patiny, Luc; Smit, Berend (2020): A Data-Driven Perspective on the Colours of Metal-Organic Frameworks. ChemRxiv. Preprint. https://doi.org/10.26434/chemrxiv.13033217.v1
For general information about quantile regression see https://en.wikipedia.org/wiki/Quantile_regression
Note that the scaling of the hyperrectangles has been derived for GPR models (with RBF kernels).
- class pyepal.pal.pal_gbdt.PALGBDT(*args, **kwargs)[source]¶
Bases:
PALBase
PAL class for a list of LightGBM GBDT models
- __init__(*args, **kwargs)[source]¶
Construct the PALGBDT instance
- Parameters:
X_design (np.array) – Design space (feature matrix)
models (List[Iterable[LGBMRegressor, LGBMRegressor, LGBMRegressor]]) – Machine learning models. You need to provide a list of iterables, one iterable per objective, and every iterable contains three LGBMRegressors: the first one for the lower uncertainty limit, the middle one for the median, and the last one for the upper limit. To create appropriate models, you need to use the quantile loss. If you want to parallelize training, we recommend that you use the LightGBM parallelization and fit the models for the different objectives in serial fashion.
ndim (int) – Number of objectives
epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.
delta (float, optional) – Delta hyperparameter. Defaults to 0.05.
beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.
goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.
coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.
interquartile_scaler (float, optional) – Used to convert the difference between the upper and lower quantile into a standard deviation. That is, std = (up-low)/interquartile_scaler. Defaults to 1.35, following Wan, X., Wang, W., Liu, J. et al. Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range. BMC Med Res Methodol 14, 135 (2014). https://doi.org/10.1186/1471-2288-14-135
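The conversion std = (up-low)/interquartile_scaler is a one-liner. The sketch below spells it out with a hypothetical helper name and made-up quantile predictions:

```python
import math


def quantiles_to_std(lower, upper, interquartile_scaler=1.35):
    """Approximate a standard deviation from an upper and a lower quantile."""
    return (upper - lower) / interquartile_scaler


# Made-up predictions of the lower and upper quantile models for one point
std = quantiles_to_std(lower=1.0, upper=3.7)
assert math.isclose(std, 2.0)
```

The default of 1.35 is appropriate if the two outer models predict the 25% and 75% quantiles; for other quantile pairs a different scaler would be needed.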
For GPR with GPFlow¶
PAL using GPflow GPR models
- class pyepal.pal.pal_gpflowgpr.PALGPflowGPR(*args, **kwargs)[source]¶
Bases:
PALBase
PAL class for a list of GPFlow GPR models, with one model per objective. Please consider that there are specific multioutput models (https://gpflow.readthedocs.io/en/master/notebooks/advanced/multioutput.html) for which the train and prediction function would need to be adjusted. You might also consider using streaming GPRs (https://github.com/thangbui/streaming_sparse_gp). In future releases we might support this case automatically (i.e., handle the case in which only one model is provided).
- __init__(*args, **kwargs)[source]¶
Construct the PALGPflowGPR instance
- Parameters:
X_design (np.array) – Design space (feature matrix)
models (list) – Machine learning models
ndim (int) – Number of objectives
epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.
delta (float, optional) – Delta hyperparameter. Defaults to 0.05.
beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.
goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.
coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.
opt (function, optional) – Optimizer function for the GPR parameters. If None (default), then we will use gpflow.optimizers.Scipy()
opt_kwargs (dict, optional) – Keyword arguments passed to the optimizer. If None, PyePAL will pass {“maxiter”: 100}
n_jobs (int) – Number of parallel threads that are used to fit the GPR models. Defaults to 1.
For GPR with BoTorch¶
- class pyepal.pal.pal_botorch.PALBoTorch(*args, **kwargs)[source]¶
Bases:
PALBase
PAL class for a list of BoTorch (GPR) models, with one model per objective
- __init__(*args, **kwargs)[source]¶
Construct the PALBoTorch instance
- Parameters:
X_design (np.array) – Design space (feature matrix)
model_functions (list) – Functions that, when called with x, y, and optionally old_state_dict, return a model and a likelihood. We need to do this due to problems with re-training warm-started models in BoTorch (https://github.com/pytorch/botorch/issues/533).
ndim (int) – Number of objectives
epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.
delta (float, optional) – Delta hyperparameter. Defaults to 0.05.
beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.
goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.
coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.
restarts (int) – Number of random restarts that are used for hyperparameter optimization. Defaults to 20.
n_jobs (int) – Number of parallel processes that are used to fit the GPR models. Defaults to 1.
power_transformer (bool) – If True, use Yeo-Johnson transform on the inputs. Defaults to True.
add_observation_noise (bool) – If True, add observation noise to predicted uncertainties. Defaults to False
- class pyepal.pal.pal_botorch.PALMultiTaskBoTorch(*args, **kwargs)[source]¶
Bases:
PALBase
PAL class for a multioutput BoTorch model
- __init__(*args, **kwargs)[source]¶
Construct the PALMultiTaskBoTorch instance
- Parameters:
X_design (np.array) – Design space (feature matrix)
model_functions (list) – Function that, when called with x, y, and optionally old_state_dict, returns a model and a likelihood. We need to do this due to problems with re-training warm-started models in BoTorch (https://github.com/pytorch/botorch/issues/533).
ndim (int) – Number of objectives
epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.
delta (float, optional) – Delta hyperparameter. Defaults to 0.05.
beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.
goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.
coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.
restarts (int) – Number of random restarts that are used for hyperparameter optimization. Defaults to 20.
n_jobs (int) – Number of parallel processes that are used to fit the GPR models. Defaults to 1.
power_transformer (bool) – If True, use Yeo-Johnson transform on the inputs. Defaults to True.
For GBDT with CatBoost¶
Implements a PAL class for GBDT models with virtual ensembles for the uncertainty estimates.
Note that the scaling of the hyperrectangles has been derived for GPR models (with RBF kernels).
- class pyepal.pal.pal_catboost.PALCatBoost(*args, **kwargs)[source]¶
Bases:
PALBase
PAL class for a list of CatBoost GBDT models
- __init__(*args, **kwargs)[source]¶
Construct the PALCatBoost instance
- Parameters:
X_design (np.array) – Design space (feature matrix)
models (List[CatBoostRegressor]) – Machine learning models. You need to provide a list of CatBoost regressors. The regressors need to use the RMSEWithUncertainty loss.
ndim (int) – Number of objectives
epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.
delta (float, optional) – Delta hyperparameter. Defaults to 0.05.
beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.
goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.
coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.
virtual_ensembles_count (int, optional) – Number of virtual ensemble models. Defaults to 10.
Schedules for hyperparameter optimization¶
Provides some scheduling functions that can be used to implement the _should_optimize_hyperparameters function
- pyepal.pal.schedules.exp_decay(iteration, base=10)[source]¶
Optimize hyperparameters at logarithmically spaced intervals
- Parameters:
iteration (int) – current iteration
base (int, optional) – Base of the logarithm. Defaults to 10.
- Returns:
True if iteration is on the log scaled grid
- Return type:
bool
- pyepal.pal.schedules.linear(iteration, frequency=10)[source]¶
Optimize hyperparameters at equally spaced intervals
- Parameters:
iteration (int) – current iteration
frequency (int, optional) – Spacing between the True outputs. Defaults to 10.
- Returns:
True if iteration can be divided by frequency without remainder
- Return type:
bool
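Both schedules map an iteration number to a yes/no decision. The sketch below shows one plausible reading of "on the log scaled grid" (exact integer powers of the base) and of the linear schedule. The function names are hypothetical, and PyePAL's exp_decay may treat the first few iterations differently.

```python
def on_log_grid(iteration, base=10):
    """True if iteration is an exact integer power of base (1, base, base**2, ...)."""
    if iteration < 1 or base < 2:
        return False
    while iteration % base == 0:
        iteration //= base
    return iteration == 1


def on_linear_grid(iteration, frequency=10):
    """True if iteration can be divided by frequency without remainder."""
    return iteration % frequency == 0


log_hits = [i for i in range(1, 1001) if on_log_grid(i)]
linear_hits = [i for i in range(1, 31) if on_linear_grid(i)]
```

A logarithmic grid spends most hyperparameter optimizations early, when few measurements are available and the models change quickly; the linear grid spreads them evenly.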
Utilities for multiobjective optimization¶
Utilities for dealing with Pareto fronts in general
- pyepal.pal.utils.dominance_check(point1, point2)[source]¶
One point dominates another if it is not worse in any objective and strictly better in at least one. This assumes that all objectives are to be maximized.
- Return type:
bool
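The dominance definition for maximization can be written down directly. This is a plain-Python sketch with a hypothetical function name, not the library's (jitted) implementation:

```python
def dominates(point1, point2):
    """True if point1 dominates point2, assuming all objectives are maximized."""
    not_worse = all(a >= b for a, b in zip(point1, point2))
    strictly_better = any(a > b for a, b in zip(point1, point2))
    return not_worse and strictly_better


better = dominates([2, 2], [1, 2])     # better in objective 0, equal in objective 1
trade_off = dominates([2, 1], [1, 2])  # each point wins in one objective
identical = dominates([1, 2], [1, 2])  # equal points do not dominate each other
```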
- pyepal.pal.utils.dominance_check_jitted(point, array)[source]¶
Check if point dominates any point in array
- Return type:
bool
- pyepal.pal.utils.dominance_check_jitted_2(array, point)[source]¶
Check if any point in array dominates point
- Return type:
bool
- pyepal.pal.utils.dominance_check_jitted_3(array, point, ignore_me)[source]¶
Check if any point in array dominates point. The ignore_me argument is needed because numba does not understand masked arrays.
- Return type:
bool
- pyepal.pal.utils.exhaust_loop(palinstance, y, batch_size=1)[source]¶
Helper function that takes an initialized PAL instance and loops the sampling until there is no unclassified point left. This is useful if all measurements are already taken and one wants to test the algorithm with different hyperparameters.
- Parameters:
palinstance (PALBase) – An initialized instance of a class that inherited from PALBase and implemented the ._train() and ._predict() functions
y (np.array) – Measurements. The number of measurements must equal the number of points in the design space.
batch_size (int, optional) – Number of indices that will be returned. Defaults to 1.
- Returns:
None. The PAL instance is updated in place
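The control flow of such a loop can be sketched with a toy stand-in object. ToyPAL below is hypothetical: it only borrows the member names run_one_step, update_train_set, and number_unclassified_points from PALBase, and it treats every unmeasured point as unclassified, which is a simplification. A real PAL instance also classifies points it never measures.

```python
class ToyPAL:
    """Hypothetical stand-in that mimics a few PALBase member names."""

    def __init__(self, n_points):
        self.n_points = n_points
        self.measured = {}

    @property
    def number_unclassified_points(self):
        # Simplification: a point is "classified" once it has been measured
        return self.n_points - len(self.measured)

    def run_one_step(self, batch_size=1):
        remaining = [i for i in range(self.n_points) if i not in self.measured]
        return remaining[:batch_size] or None

    def update_train_set(self, indices, measurements):
        for index, measurement in zip(indices, measurements):
            self.measured[index] = measurement


def exhaust(pal, y, batch_size=1):
    """Sample until no unclassified point is left, looking up results in y."""
    while pal.number_unclassified_points > 0:
        indices = pal.run_one_step(batch_size=batch_size)
        if indices is None:
            break
        pal.update_train_set(indices, [y[i] for i in indices])


y = [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]]  # precomputed measurements
pal = ToyPAL(len(y))
exhaust(pal, y, batch_size=2)
```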
- pyepal.pal.utils.get_hypervolume(pareto_front, reference_vector, prefactor=-1)[source]¶
Compute the hypervolume indicator of a Pareto front. Since we assume that all objectives are to be maximized, the quantity of interest is the area (in two dimensions) enclosed between the Pareto front and the reference vector.
[Sketch: staircase-shaped Pareto front in the f1-f2 plane]
However, the code we use for the hypervolume indicator assumes that the reference vector is larger than all the points in the Pareto front. For this reason, we flip all the signs using prefactor, whose default of -1 corresponds to multiplying by minus one.
This indicator is not needed for the epsilon-PAL algorithm itself but only to allow tracking a metric that might help the user to see if the algorithm converges.
- Return type:
float
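For two objectives, the hypervolume is the area of the staircase between the front and the reference point. The sketch below computes it directly for maximization, assuming the reference point is smaller than every point in both objectives. It is a didactic two-dimensional version with a hypothetical name and does not use the sign-flipping prefactor described above.

```python
def hypervolume_2d(points, reference):
    """Area dominated by points (maximization) relative to reference."""
    # Sweep from the largest f1 to the smallest; ties put the larger f2 first
    ordered = sorted(points, key=lambda p: (-p[0], -p[1]))
    area = 0.0
    best_f2 = reference[1]
    for f1, f2 in ordered:
        if f2 > best_f2:
            # Add the horizontal slab that only this point contributes
            area += (f1 - reference[0]) * (f2 - best_f2)
            best_f2 = f2
    return area


front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
hv = hypervolume_2d(front, reference=(0.0, 0.0))

# A dominated point does not change the result
hv_with_dominated = hypervolume_2d(front + [(1.0, 1.0)], reference=(0.0, 0.0))
```

Tracking this number over the iterations gives a simple convergence indicator: it can only grow as better points are added to the front.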
- pyepal.pal.utils.get_kmeans_samples(X, n_samples, **kwargs)[source]¶
Get the samples that are closest to the k=n_samples centroids
- Parameters:
X (np.array) – Feature array, on which the KMeans clustering is run
n_samples (int) – number of samples that should be selected
**kwargs – passed to the KMeans
- Returns:
selected_indices
- Return type:
np.array
- pyepal.pal.utils.get_maxmin_samples(X, n_samples, metric='euclidean', init='mean', seed=None, **kwargs)[source]¶
Greedy maxmin sampling, also known as Kennard-Stone sampling (1). Note that a greedy sampling is not guaranteed to give the ideal solution and the output will depend on the random initialization (if this is chosen).
If you need a good solution, you can restart this algorithm multiple times with random initialization and different random seeds and use a coverage metric to quantify how well the space is covered. Some metrics are described in (2). In contrast to the code provided with (2) and (3) we do not consider the feature importance for the selection as this is typically not known beforehand.
You might want to standardize your data before applying this sampling function.
Some more sampling options are provided in our structure_comp (4) Python package. Also, this implementation here is quite memory hungry.
References: (1) Kennard, R. W.; Stone, L. A. Computer Aided Design of Experiments. Technometrics 1969, 11 (1), 137–148. https://doi.org/10.1080/00401706.1969.10490666. (2) Moosavi, S. M.; Nandy, A.; Jablonka, K. M.; Ongari, D.; Janet, J. P.; Boyd, P. G.; Lee, Y.; Smit, B.; Kulik, H. J. Understanding the Diversity of the Metal-Organic Framework Ecosystem. Nature Communications 2020, 11 (1), 4068. https://doi.org/10.1038/s41467-020-17755-8. (3) Moosavi, S. M.; Chidambaram, A.; Talirz, L.; Haranczyk, M.; Stylianou, K. C.; Smit, B. Capturing Chemical Intuition in Synthesis of Metal-Organic Frameworks. Nat Commun 2019, 10 (1), 539. https://doi.org/10.1038/s41467-019-08483-9. (4) https://github.com/kjappelbaum/structure_comp
- Parameters:
X (np.array) – Feature array, this is the array that is used to perform the sampling
n_samples (int) – number of points that will be selected, needs to be lower than the length of X
metric (str, optional) – Distance metric to use for the maxmin calculation. Must be a valid option of scipy.spatial.distance.cdist (‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘wminkowski’, ‘yule’). Defaults to ‘euclidean’
init (str, optional) – either ‘mean’, ‘median’, or ‘random’. Determines how the initial point is chosen. Defaults to ‘mean’
seed (int, optional) – seed for the random number generator. Defaults to None.
**kwargs – passed to cdist
- Returns:
selected_indices
- Return type:
np.array
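The greedy maxmin idea can be sketched in a few lines of plain Python: start from the point closest to the mean, then repeatedly add the point whose distance to the already selected set is largest. The sketch assumes the Euclidean metric and the ‘mean’ initialization only; maxmin_indices is a hypothetical name, and the code needs Python 3.8 or newer for math.dist.

```python
import math


def maxmin_indices(X, n_samples):
    """Greedy maxmin selection with Euclidean distance and 'mean' initialization."""
    if not 0 < n_samples < len(X):
        raise ValueError("n_samples must be positive and lower than len(X)")
    n_features = len(X[0])
    mean = [sum(row[j] for row in X) / len(X) for j in range(n_features)]
    # Start with the point closest to the mean of the data
    first = min(range(len(X)), key=lambda i: math.dist(X[i], mean))
    selected = [first]
    # Distance of every point to its nearest selected point
    min_dist = [math.dist(x, X[first]) for x in X]
    while len(selected) < n_samples:
        nxt = max(range(len(X)), key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [min(d, math.dist(x, X[nxt])) for d, x in zip(min_dist, X)]
    return selected


X = [[0.0], [1.0], [2.0], [10.0]]
selected_indices = maxmin_indices(X, n_samples=3)
```

Keeping only the running minimum distances avoids storing the full pairwise distance matrix, which is the main source of memory use in a naive implementation.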
- pyepal.pal.utils.is_pareto_efficient(costs, return_mask=True)[source]¶
Find the Pareto efficient points. Based on https://stackoverflow.com/questions/32791911/fast-calculation-of-pareto-front-in-python
- Parameters:
costs (np.array) – An (n_points, n_costs) array
return_mask (bool, optional) – True to return a mask, Otherwise it will be a (n_efficient_points, ) integer array of indices. Defaults to True.
- Returns:
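Boolean mask over the points if return_mask is True, otherwise an integer array with the indices of the Pareto efficient points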
- Return type:
np.array
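The two return formats can be illustrated with a quadratic-time sketch in plain Python. It assumes that lower costs are better, as in the linked Stack Overflow answer; check the sign convention of your inputs before relying on this. The name pareto_mask is hypothetical.

```python
def pareto_mask(costs):
    """Boolean mask of Pareto efficient rows, assuming lower costs are better."""

    def dominates(a, b):
        # a dominates b: nowhere worse and strictly better at least once
        return all(x <= y for x, y in zip(a, b)) and any(
            x < y for x, y in zip(a, b)
        )

    return [not any(dominates(other, c) for other in costs) for c in costs]


costs = [(1, 2), (2, 1), (2, 2), (3, 3)]
mask = pareto_mask(costs)                                # like return_mask=True
indices = [i for i, keep in enumerate(mask) if keep]     # like return_mask=False
```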
Utilities for plotting¶
Plotting utilities
- pyepal.plotting.plot_bar_iterations(pareto_optimal, non_pareto_points, unclassified_points, ax=None)[source]¶
Plot stacked barplots for every step of the iteration.
- Parameters:
pareto_optimal (np.ndarray) – Number of pareto optimal points for every iteration.
non_pareto_points (np.ndarray) – Number of discarded points for every iteration
unclassified_points (np.ndarray) – Number of unclassified points for every iteration
- Returns:
- matplotlib axis (the same that was provided as input
or one from a new figure if no axis was provided)
- Return type:
axis
- pyepal.plotting.plot_histogram(y, palinstance, ax=None)[source]¶
Plot histograms, with maxima scaled to one and different categories indicated in color for one objective
- Parameters:
y (np.ndarray) – objective (measurement)
palinstance (PALBase) – instance of a PAL class
ax (ax) – Matplotlib figure axis
- Returns:
- matplotlib axis (the same that was provided as input
or one from a new figure if no axis was provided)
- Return type:
ax
- pyepal.plotting.plot_jointplot(y, palinstance, labels=None, figsize=(8.0, 6.0))[source]¶
Plot a jointplot of the objective space with histograms on the diagonal and 2D-Pareto plots on the off-diagonal.
- Parameters:
y (np.array) – Two-dimensional array with the objectives (measurements)
palinstance (PALBase) – “trained” PAL instance
labels (Union[List[str], None], optional) – Labels for each objective. Defaults to “objective [index]”.
figsize (tuple, optional) – Figure size for joint plot. Defaults to (8.0, 6.0).
- Returns:
matplotlib Figure object.
- Return type:
fig
- pyepal.plotting.plot_pareto_front_2d(y_0, y_1, std_0, std_1, palinstance, ax=None)[source]¶
Plot a 2D pareto front, with the different categories indicated in color.
- Parameters:
y_0 (np.ndarray) – objective 0
y_1 (np.ndarray) – objective 1
std_0 (np.ndarray) – standard deviation objective 0
std_1 (np.ndarray) – standard deviation objective 1
palinstance (PALBase) – PAL instance
ax (axis, optional) – Matplotlib figure axis. Defaults to None.
- Returns:
- matplotlib axis (the same that was provided as input
or one from a new figure if no axis was provided)
- Return type:
ax
- pyepal.plotting.plot_residuals(y, palinstance, labels=None, figsize=(6.0, 4.0))[source]¶
Plot signed residual (on y axis) vs fitted (on x axis) plot of sampled points. Will create subplots for y.ndim > 1.
- Parameters:
y (np.array) – Two-dimensional array with the objectives (measurements)
palinstance (PALBase) – “trained” PAL instance
labels (Union[List[str], None], optional) – Labels for each objective. Defaults to “objective [index]”.
figsize (tuple, optional) – Figure size for each individual residual vs fitted objective plot. Defaults to (6.0, 4.0).
- Returns:
matplotlib Figure object
- Return type:
fig
Input validation¶
Methods to validate inputs for the PAL classes
- pyepal.pal.validate_inputs.base_validate_models(models)[source]¶
Currently no validation, as the predict and train functions are implemented independently of the base class
- Return type:
list
- pyepal.pal.validate_inputs.validate_beta_scale(beta_scale)[source]¶
- Parameters:
beta_scale (Any) – scaling factor for beta
- Raises:
ValueError – If beta_scale is smaller than 0
- Returns:
scaling factor for beta
- Return type:
float
- pyepal.pal.validate_inputs.validate_catboost_models(models, ndim)[source]¶
Make sure that the number of models is equal to the number of objectives. Also make sure that the models are CatBoostRegressor instances with RMSEWithUncertainty loss function
- Return type:
List[Iterable]
- pyepal.pal.validate_inputs.validate_coef_var(coef_var)[source]¶
Make sure that the coef_var makes sense
- pyepal.pal.validate_inputs.validate_coregionalized_gpy(models)[source]¶
Make sure that model is a coregionalized GPR model
- pyepal.pal.validate_inputs.validate_delta(delta)[source]¶
Make sure that delta is in a reasonable range
- Parameters:
delta (Any) – Delta hyperparameter
- Raises:
ValueError – Delta must be in [0,1].
- Returns:
delta
- Return type:
float
- pyepal.pal.validate_inputs.validate_epsilon(epsilon, ndim)[source]¶
Validate epsilon and return a np.array
- Parameters:
epsilon (Any) – Epsilon hyperparameter
ndim (int) – Number of dimensions/objectives
- Raises:
ValueError – If epsilon is a list there must be one float per dimension
ValueError – Epsilon must be in [0,1]
ValueError – If epsilon is an array there must be one float per dimension
- Returns:
Array of one epsilon per objective
- Return type:
np.ndarray
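The checks listed above can be sketched as follows. This is a minimal, dependency-free illustration and not PyePAL's implementation: the name `sketch_validate_epsilon` is hypothetical, and it returns a plain list where the library function returns an np.ndarray.

```python
from numbers import Real
from typing import List, Sequence, Union


def sketch_validate_epsilon(
    epsilon: Union[Real, Sequence[Real]], ndim: int
) -> List[float]:
    """Hypothetical stand-in that mirrors the documented checks."""
    if isinstance(epsilon, Real):
        # A scalar is broadcast to one epsilon per objective.
        values = [float(epsilon)] * ndim
    else:
        values = [float(e) for e in epsilon]
        if len(values) != ndim:
            raise ValueError("There must be one epsilon per objective")
    if any(e < 0 or e > 1 for e in values):
        raise ValueError("Epsilon must be in [0, 1]")
    return values


print(sketch_validate_epsilon(0.01, ndim=3))          # [0.01, 0.01, 0.01]
print(sketch_validate_epsilon([0.01, 0.05], ndim=2))  # [0.01, 0.05]
```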
- pyepal.pal.validate_inputs.validate_gbdt_models(models, ndim)[source]¶
Make sure that the number of iterables is equal to the number of objectives and that every iterable contains three LGBMRegressors. Also, we check that at least the first and last models use quantile loss
- Return type:
List[Iterable]
- pyepal.pal.validate_inputs.validate_goals(goals, ndim)[source]¶
Create a valid array of goals: 1 for objectives that are to be maximized, -1 for objectives that are to be minimized.
- Parameters:
goals (Any) – List of goals, typically provided as strings ‘max’ for maximization and ‘min’ for minimization
ndim (int) – number of dimensions
- Raises:
ValueError – If goals is a list and the length is not equal to ndim
ValueError – If goals is a list and the elements are not strings ‘min’, ‘max’ or -1 and 1
- Returns:
Array of -1 and 1
- Return type:
np.ndarray
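The mapping from goal labels to signs can be sketched as follows. This is an illustration under assumptions, not the library code: `sketch_validate_goals` is a hypothetical name, it returns a list instead of an np.ndarray, and it treats `None` as "maximize every objective", following the default documented for PALBase.

```python
from typing import List, Optional, Sequence, Union

# Accepted inputs and the sign each one maps to.
_GOAL_TO_SIGN = {"max": 1, "min": -1, 1: 1, -1: -1}


def sketch_validate_goals(
    goals: Optional[Sequence[Union[str, int]]], ndim: int
) -> List[int]:
    """Hypothetical stand-in that mirrors the documented checks."""
    if goals is None:
        # Documented default: all objectives are maximized.
        return [1] * ndim
    if len(goals) != ndim:
        raise ValueError("There must be one goal per objective")
    signs = []
    for goal in goals:
        if goal not in _GOAL_TO_SIGN:
            raise ValueError("Goals must be 'min', 'max', -1 or 1")
        signs.append(_GOAL_TO_SIGN[goal])
    return signs


print(sketch_validate_goals(["max", "min"], ndim=2))  # [1, -1]
```

Multiplying each objective by its sign turns every objective into a maximization problem, which is why a single array of -1 and 1 is sufficient.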
- pyepal.pal.validate_inputs.validate_gpy_model(models)[source]¶
Make sure that all elements of the list are GPRegression models
- pyepal.pal.validate_inputs.validate_interquartile_scaler(interquartile_scaler)[source]¶
Make sure that the interquartile_scaler makes sense
- Return type:
float
- pyepal.pal.validate_inputs.validate_ndim(ndim)[source]¶
Make sure that the number of dimensions makes sense
- Parameters:
ndim (Any) – number of dimensions
- Raises:
ValueError – If the number of dimensions is not an integer
ValueError – If the number of dimensions is not greater than 0
- Returns:
the number of dimensions
- Return type:
int
- pyepal.pal.validate_inputs.validate_njobs(njobs)[source]¶
Make sure that njobs is an int > 1
- Return type:
int
- pyepal.pal.validate_inputs.validate_nt_models(models, ndim)[source]¶
Make sure that we can work with a sequence of pyepal.pal.models.nt.NTModel()
- Return type:
Sequence
- pyepal.pal.validate_inputs.validate_number_models(models, ndim)[source]¶
Make sure that there are as many models as objectives
- Parameters:
models (Any) – List of models
ndim (int) – Number of objectives
- Raises:
ValueError – If the number of models does not equal the number of objectives
- pyepal.pal.validate_inputs.validate_optimizers(optimizers, ndim)[source]¶
Make sure that we can work with a Sequence of JaxOptimizer
- Return type:
Sequence
The models package¶
Helper functions for GPR with GPy¶
Wrappers for Gaussian Process Regression models.
We typically use the GPy package as it offers the most flexibility for Gaussian processes in Python. Typically, we use automatic relevance determination (ARD), where one lengthscale parameter per input dimension is used.
If your task requires training on larger training sets, you might consider replacing the models with their sparse versions, but for the epsilon-PAL algorithm this typically shouldn’t be needed.
For kernel selection, you can have a look at https://www.cs.toronto.edu/~duvenaud/cookbook/. Matérn, RBF, and RationalQuadratic kernels are good quick-and-dirty solutions but have their caveats.
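To make the effect of ARD concrete, the following sketch evaluates the textbook Matérn-3/2 kernel with one lengthscale per input dimension. It is written in plain Python for illustration only and does not call GPy; the function name `matern32_ard` and the example lengthscales are assumptions.

```python
import math
from typing import Sequence


def matern32_ard(
    x: Sequence[float],
    y: Sequence[float],
    lengthscales: Sequence[float],
    variance: float = 1.0,
) -> float:
    """Matern-3/2 kernel value with one lengthscale per dimension (ARD)."""
    # Scaled distance: each dimension is divided by its own lengthscale.
    r = math.sqrt(
        sum(((a - b) / ell) ** 2 for a, b, ell in zip(x, y, lengthscales))
    )
    s = math.sqrt(3.0) * r
    return variance * (1.0 + s) * math.exp(-s)


origin = [0.0, 0.0]
lengthscales = [0.5, 5.0]  # short in dimension 0, long in dimension 1

# The same shift decorrelates the points much more along dimension 0.
k_dim0 = matern32_ard(origin, [1.0, 0.0], lengthscales)
k_dim1 = matern32_ard(origin, [0.0, 1.0], lengthscales)
print(k_dim0 < k_dim1)  # True
```

A long fitted lengthscale therefore marks a feature to which the objective is comparatively insensitive, which is the sense in which ARD determines relevance.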
- pyepal.models.gpr.build_coregionalized_model(X_train, y_train, kernel=None, w_rank=1, **kwargs)[source]¶
Wrapper for building a coregionalized GPR, it will have as many outputs as y_train.shape[1]. Each output will have its own noise term
- Return type:
GPCoregionalizedRegression
- pyepal.models.gpr.build_model(X_train, y_train, index=0, kernel=None, **kwargs)[source]¶
Build a single-output GPR model
- Return type:
GPRegression
- pyepal.models.gpr.get_matern_32_kernel(NFEAT, ARD=True, **kwargs)[source]¶
Matern-3/2 kernel. ARD is enabled by default (ARD=True).
- Return type:
Matern32
- pyepal.models.gpr.get_matern_52_kernel(NFEAT, ARD=True, **kwargs)[source]¶
Matern-5/2 kernel. ARD is enabled by default (ARD=True).
- Return type:
Matern52
- pyepal.models.gpr.get_ratquad_kernel(NFEAT, ARD=True, **kwargs)[source]¶
Rational quadratic kernel. ARD is enabled by default (ARD=True).
- Return type:
RatQuad
- pyepal.models.gpr.predict(model, X)[source]¶
Wrapper function for the prediction method of a GPy regression model. It returns the standard deviation instead of the variance
- Return type:
Tuple[array, array]
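The variance-to-standard-deviation conversion that this wrapper performs can be sketched as follows. The example avoids a GPy dependency: `StubModel` is a hypothetical stand-in whose `predict` returns a (mean, variance) pair, `predict_with_std` is an illustrative name, and clipping negative variances to zero is a defensive assumption rather than documented behavior.

```python
import math
from typing import List, Sequence, Tuple


class StubModel:
    """Hypothetical stand-in for a fitted GP that returns (mean, variance)."""

    def predict(
        self, X: Sequence[Sequence[float]]
    ) -> Tuple[List[float], List[float]]:
        means = [sum(row) for row in X]   # dummy predictive mean
        variances = [0.04 for _ in X]     # dummy predictive variance
        return means, variances


def predict_with_std(
    model: StubModel, X: Sequence[Sequence[float]]
) -> Tuple[List[float], List[float]]:
    """Return the mean and the standard deviation instead of the variance."""
    mean, variance = model.predict(X)
    # Assumption: clip tiny negative values from round-off before the sqrt.
    std = [math.sqrt(max(v, 0.0)) for v in variance]
    return mean, std


mean, std = predict_with_std(StubModel(), [[1.0, 2.0], [0.5, 0.5]])
print(mean)  # [3.0, 1.0]
```

Working with the standard deviation keeps the uncertainty in the same units as the objective, which is convenient when it is scaled to build the hyperrectangles.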
Helper functions for GPR with BoTorch¶
- pyepal.models.botorch_gp.build_model(X, y, warped=True, input_scaled=True, wrap_indices=None, scaling_indices=None, covar_module=None)[source]¶
Build a BoTorch model for a single output.
- Parameters:
X (np.ndarray) – features
y (np.ndarray) – targets
warped (bool, optional) – If true, apply Kumaraswamy warping. Defaults to True.
input_scaled (bool, optional) – If true, scale features to unit cube. Defaults to True.
wrap_indices (Optional[Tuple[int]], optional) – Indices to which warping is applied. Defaults to None.
scaling_indices (Optional[Tuple[int]], optional) – Indices to which scaling is applied. Defaults to None.
covar_module (Optional[Module], optional) – Coregionalization model. Defaults to None.
- Returns:
Function that returns model and likelihood when provided with x and y
- Return type:
Callable
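The two input transforms named by the warped and input_scaled options can be illustrated in plain Python. This is a sketch of the underlying math only, not the BoTorch transforms themselves: the function names are hypothetical, and the shape parameters `a` and `b` are fixed here for illustration, whereas a warping transform would typically fit them together with the model.

```python
from typing import List, Sequence


def scale_to_unit_cube(X: Sequence[Sequence[float]]) -> List[List[float]]:
    """Min-max scale every feature column to [0, 1]."""
    columns = list(zip(*X))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in X
    ]


def kumaraswamy_warp(x: float, a: float = 1.0, b: float = 1.0) -> float:
    """Kumaraswamy CDF, a monotonic map of [0, 1] onto itself."""
    return 1.0 - (1.0 - x ** a) ** b


X = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
X_scaled = scale_to_unit_cube(X)
print(X_scaled)  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]

# Warping is applied after scaling because the CDF is defined on [0, 1].
X_warped = [[kumaraswamy_warp(v, a=2.0, b=3.0) for v in row] for row in X_scaled]
```

With a = b = 1 the warp is the identity, so the warping only changes the inputs once the shape parameters move away from 1.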
- pyepal.models.botorch_gp.build_multioutput_model(X, y, warped=True, input_scaled=True, wrap_indices=None, scaling_indices=None, covar_module=None)[source]¶
Build a BoTorch model for multiple outputs.
- Parameters:
X (np.ndarray) – features
y (np.ndarray) – targets
warped (bool, optional) – If true, apply Kumaraswamy warping. Defaults to True.
input_scaled (bool, optional) – If true, scale features to unit cube. Defaults to True.
wrap_indices (Optional[Tuple[int]], optional) – Indices to which warping is applied. Defaults to None.
scaling_indices (Optional[Tuple[int]], optional) – Indices to which scaling is applied. Defaults to None.
covar_module (Optional[Module], optional) – Coregionalization model. Defaults to None.
- Returns:
Function that returns model and likelihood when provided with x and y
- Return type:
Callable