Contents

The PyePAL API reference

The PAL package

Core functions

Core functions for PAL

Base class

Base class for PAL

class pyepal.pal.pal_base.PALBase(X_design, models, ndim, epsilon=0.01, delta=0.05, beta_scale=0.1111111111111111, goals=None, coef_var_threshold=3, ranges=None)[source]

Bases: object

PAL base class

__init__(X_design, models, ndim, epsilon=0.01, delta=0.05, beta_scale=0.1111111111111111, goals=None, coef_var_threshold=3, ranges=None)[source]

Initialize the PAL instance

Parameters:
  • X_design (np.array) – Design space (feature matrix)

  • models (list) – Machine learning models

  • ndim (int) – Number of objectives

  • epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.

  • delta (float, optional) – Delta hyperparameter. Defaults to 0.05.

  • beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.

  • goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.

  • coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.

  • ranges (np.ndarray, optional) – Numpy array of length ndim, where each element contains the value range of the given objective. If this is provided, we will use \(\epsilon \cdot ranges\) to compute the uncertainties of the hyperrectangles instead of the default behavior \(\epsilon \cdot |\mu|\).

__repr__()[source]

Return repr(self).

__weakref__

list of weak references to the object (if defined)

augment_design_space(X_design, classify=False, clean_classify=True)[source]

Add new design points to PAL instance

Parameters:
  • X_design (np.ndarray) – Design matrix. Two-dimensional array containing measurements in the rows and features in the columns.

  • classify (bool) – Reclassifies the new design space using the old model. That is, it runs inference, calculates the hyperrectangles, and runs the classification. Does not increase the iteration count. Note, though, that points that have already been classified as Pareto optimal will not be reclassified (e.g., discarded), even if the new design points dominate the existing “Pareto optimal” points. Defaults to False.

  • clean_classify (bool) – Reclassifies the new design space using the old model. That is, it runs inference, calculates the hyperrectangles, and runs the classification. Does not increase the iteration count. But, in contrast to classify, it erases all previous classifications before running the new classification. Hence, if some new design point dominates a previously Pareto efficient point, the previous Pareto optimal point will no longer be classified as Pareto efficient. This flag is incompatible with classify. If you choose clean_classify, PyePAL will erase all previous classifications, independent of what you choose for classify. Defaults to True.

Return type:

None
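
A minimal sketch of augmenting the design space of a running PAL instance. Here, palinstance is assumed to be an already initialized PAL object and n_features a hypothetical variable holding the number of columns of the original design matrix:

import numpy as np

# hypothetical batch of 50 additional candidate points; the number of
# features must match the original design matrix
X_new = np.random.default_rng(0).random((50, n_features))

# add the candidates and reclassify the whole design space with the old models
palinstance.augment_design_space(X_new, clean_classify=True)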

property discarded_indices

Return the indices of the discarded points

property discarded_points

Return the discarded points

property hyperrectangle_sizes

Return the sizes of the hyperrectangles

property means

Return the means predicted by the model

property number_design_points

Return the number of points in the design space

property number_discarded_points

Return the number of discarded points

property number_pareto_optimal_points

Return the number of Pareto optimal points

property number_sampled_points

Return the number of sampled points

property number_unclassified_points

Return the number of unclassified points

property pareto_optimal_indices

Return the indices of the Pareto optimal points

property pareto_optimal_points

Return the Pareto optimal points

run_one_step(batch_size=1, pooling_method='fro', sample_discarded=False, use_coef_var=True, replace_mean=True, replace_std=True)[source]

Run one iteration of the PAL algorithm: train the models, run inference, update the classification, and return the indices of the next point(s) to be measured.

Parameters:
  • batch_size (int, optional) – Number of indices that will be returned. Defaults to 1.

  • pooling_method (str) – Method that is used to aggregate the uncertainty in different objectives into one scalar. Available options are: “fro” (Frobenius/Euclidean norm), “mean”, “median”. Defaults to “fro”.

  • sample_discarded (bool) – if true, it will sample from all points and not only from the unclassified and Pareto optimal ones

  • use_coef_var (bool) – If True, uses the coefficient of variation instead of the unscaled rectangle sizes

  • replace_mean (bool) – If true, uses the measured means for the sampled points

  • replace_std (bool) – If true uses the measured standard deviation for the sampled points

Raises:

ValueError – In case the PAL instance was not initialized with measurements.

Returns:

Array of indices of the next points to evaluate, or None if no unclassified points that can be sampled are left.

Return type:

Union[np.array, None]
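
A hedged sketch of the main loop around run_one_step. It assumes palinstance is an initialized PAL subclass that has already been seeded with measurements via update_train_set, X_design is the design matrix, and measure() is a hypothetical user-supplied function returning a 2D array of objective values for the requested points:

for _ in range(20):  # at most 20 iterations
    idx = palinstance.run_one_step(batch_size=1)
    if idx is None:  # every design point has been classified
        break
    new_y = measure(X_design[idx])  # user-supplied measurement, shape (len(idx), ndim)
    palinstance.update_train_set(idx, new_y)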

sample(exclude_idx=None, pooling_method='fro', sample_discarded=False, use_coef_var=True)[source]

Runs the sampling step based on the size of the hyperrectangle. I.e., favoring exploration.

Parameters:
  • exclude_idx (Union[np.array, None], optional) – Points in design space to exclude from sampling. Defaults to None.

  • pooling_method (str) – Method that is used to aggregate the uncertainty in different objectives into one scalar. Available options are: “fro” (Frobenius/Euclidean norm), “mean”, “median”. Defaults to “fro”.

  • sample_discarded (bool) – if true, it will sample from all points and not only from the unclassified and Pareto optimal ones

  • use_coef_var (bool) – If True, uses the coefficient of variation instead of the unscaled rectangle sizes

Raises:

ValueError – In case there are no uncertainty rectangles, i.e., when _predict has not been called successfully.

Returns:

Index of next point to evaluate in design space

Return type:

int

property sampled_indices

Return the indices of the sampled points

property sampled_mask

Create a mask for the sampled points. We count a point as sampled if at least one objective has been measured; that is, self.sampled is an N × number-of-objectives array in which some columns can be False if a measurement has not been performed.

property sampled_points

Return the sampled points

should_cross_validate()[source]

Override for more complex cross validation schedules

property unclassified_indices

Return the indices of the unclassified points

property unclassified_points

Return the unclassified points

update_train_set(indices, measurements, measurement_uncertainty=None)[source]

Update training set following a measurement

Parameters:
  • indices (np.ndarray) – Indices of design space at which the measurements were taken

  • measurements (np.ndarray) – Measured values, 2D array. The first dimension must equal the length of the indices array; the second dimension must equal the number of objectives. If an objective is missing, provide np.nan. For example, np.array([1, 1, np.nan])

  • measurement_uncertainty (np.ndarray) – Uncertainty in the measurements. If not provided (None), it is assumed to be zero. If it is not None, it must be an array with the same shape as the measurements. If an objective is missing, provide np.nan. For example, np.array([1, 1, np.nan])
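
A small sketch of reporting measurements with one missing objective; the indices and values are made up and palinstance is assumed to be a two-objective PAL instance:

import numpy as np

indices = np.array([3, 7])
measurements = np.array([
    [1.2, 0.4],     # both objectives measured for design point 3
    [0.9, np.nan],  # second objective not measured for design point 7
])
palinstance.update_train_set(indices, measurements)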

property uses_fixed_epsilon

True if it uses the fixed epsilon \(\epsilon \cdot ranges\)

For GPy models

PAL using GPy GPR models

class pyepal.pal.pal_gpy.PALGPy(*args, **kwargs)[source]

Bases: PALBase

PAL class for a list of GPy GPR models, with one model per objective

__init__(*args, **kwargs)[source]

Construct the PALGPy instance

Parameters:
  • X_design (np.array) – Design space (feature matrix)

  • models (list) – Machine learning models

  • ndim (int) – Number of objectives

  • epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.

  • delta (float, optional) – Delta hyperparameter. Defaults to 0.05.

  • beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.

  • goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.

  • coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.

  • restarts (int) – Number of random restarts that are used for hyperparameter optimization. Defaults to 20.

  • n_jobs (int) – Number of parallel processes that are used to fit the GPR models. Defaults to 1.

  • power_transformer (bool) – If True, use Yeo-Johnson transform on the inputs. Defaults to False.
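
A typical end-to-end sketch for PALGPy with two objectives, using the GPy helpers documented below in pyepal.models.gpr. X and y are assumed to be the full design matrix and the corresponding (already measured) objective values; the initial indices are arbitrary:

import numpy as np
from pyepal.pal.pal_gpy import PALGPy
from pyepal.models.gpr import build_model

initial_idx = np.array([0, 5, 10, 15])

# one single-output GPR per objective, built on the initial samples
model_0 = build_model(X[initial_idx], y[initial_idx], index=0)
model_1 = build_model(X[initial_idx], y[initial_idx], index=1)

palinstance = PALGPy(X, models=[model_0, model_1], ndim=2, epsilon=0.01, restarts=3)
palinstance.update_train_set(initial_idx, y[initial_idx])
next_idx = palinstance.run_one_step()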

For coregionalized GPy models

PAL for coregionalized GPR models

class pyepal.pal.pal_coregionalized.PALCoregionalized(*args, **kwargs)[source]

Bases: PALBase

PAL class for a coregionalized GPR model

__init__(*args, **kwargs)[source]

Construct the PALCoregionalized instance

Parameters:
  • X_design (np.array) – Design space (feature matrix)

  • models (list) – Machine learning models

  • ndim (int) – Number of objectives

  • epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.

  • delta (float, optional) – Delta hyperparameter. Defaults to 0.05.

  • beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.

  • goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.

  • coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.

  • restarts (int) – Number of random restarts that are used for hyperparameter optimization. Defaults to 20.

  • parallel (bool) – If true, model hyperparameters are optimized in parallel, using the GPy implementation. Defaults to False.

  • power_transformer (bool) – If True, use Yeo-Johnson transform on the inputs. Defaults to False.

For sklearn GPR models

PAL using Sklearn GPR models

class pyepal.pal.pal_sklearn.PALSklearn(*args, **kwargs)[source]

Bases: PALBase

PAL class for a list of Sklearn (GPR) models, with one model per objective

__init__(*args, **kwargs)[source]

Construct the PALSklearn instance

Parameters:
  • X_design (np.array) – Design space (feature matrix)

  • models (list) – Machine learning models. You can provide a list of GaussianProcessRegressor instances or a list of fitted RandomizedSearchCV/GridSearchCV instances with GaussianProcessRegressor models

  • ndim (int) – Number of objectives

  • epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.

  • delta (float, optional) – Delta hyperparameter. Defaults to 0.05.

  • beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.

  • goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.

  • coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.

  • n_jobs (int) – Number of parallel processes that are used to fit the GPR models. Defaults to 1.
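
A minimal sketch for PALSklearn with two objectives using scikit-learn Gaussian process regressors (X and y as in the PALGPy sketch above):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from pyepal.pal.pal_sklearn import PALSklearn

initial_idx = np.array([0, 5, 10, 15])

# one (unfitted) GPR per objective; PyePAL takes care of the fitting
gprs = [GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True) for _ in range(2)]

palinstance = PALSklearn(X, models=gprs, ndim=2)
palinstance.update_train_set(initial_idx, y[initial_idx])
next_idx = palinstance.run_one_step()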

For quantile regression with LightGBM

Implements a PAL class for GBDT models which can predict uncertainty intervals when used with the quantile loss. For an example of GBDT with quantile loss, see Jablonka, Kevin Maik; Moosavi, Seyed Mohamad; Asgari, Mehrdad; Ireland, Christopher; Patiny, Luc; Smit, Berend (2020): A Data-Driven Perspective on the Colours of Metal-Organic Frameworks. ChemRxiv. Preprint. https://doi.org/10.26434/chemrxiv.13033217.v1

For general information about quantile regression see https://en.wikipedia.org/wiki/Quantile_regression

Note that the scaling of the hyperrectangles has been derived for GPR models (with RBF kernels).

class pyepal.pal.pal_gbdt.PALGBDT(*args, **kwargs)[source]

Bases: PALBase

PAL class for a list of LightGBM GBDT models

__init__(*args, **kwargs)[source]

Construct the PALGBDT instance

Parameters:
  • X_design (np.array) – Design space (feature matrix)

  • models (List[Iterable[LGBMRegressor, LGBMRegressor, LGBMRegressor]]) – Machine learning models. You need to provide a list of iterables, one iterable per objective, where every iterable contains three LGBMRegressors. The first one is for the lower uncertainty limit, the middle one for the median, and the last one for the upper limit. To create appropriate models, you need to use the quantile loss. If you want to parallelize training, we recommend that you use the LightGBM parallelization and fit the models for the different objectives in serial fashion.

  • ndim (int) – Number of objectives

  • epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.

  • delta (float, optional) – Delta hyperparameter. Defaults to 0.05.

  • beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.

  • goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.

  • coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.

  • interquartile_scaler (float, optional) – Used to convert the difference between the upper and lower quantile into a standard deviation. That is, std = (upper - lower) / interquartile_scaler. Defaults to 1.35, following Wan, X., Wang, W., Liu, J. et al. Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range. BMC Med Res Methodol 14, 135 (2014). https://doi.org/10.1186/1471-2288-14-135
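
A sketch of assembling the quantile models for PALGBDT with two objectives. The quantile levels 0.16/0.5/0.84 are one reasonable choice (roughly corresponding to ±1 standard deviation), not values prescribed by PyePAL:

from lightgbm import LGBMRegressor
from pyepal.pal.pal_gbdt import PALGBDT

def quantile_triplet():
    # lower limit, median, upper limit, all trained with the quantile loss
    return (
        LGBMRegressor(objective="quantile", alpha=0.16),
        LGBMRegressor(objective="quantile", alpha=0.5),
        LGBMRegressor(objective="quantile", alpha=0.84),
    )

models = [quantile_triplet() for _ in range(2)]  # one triplet per objective
palinstance = PALGBDT(X, models, ndim=2)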

For GPR with GPFlow

PAL using GPflow GPR models

class pyepal.pal.pal_gpflowgpr.PALGPflowGPR(*args, **kwargs)[source]

Bases: PALBase

PAL class for a list of GPFlow GPR models, with one model per objective. Please consider that there are specific multioutput models (https://gpflow.readthedocs.io/en/master/notebooks/advanced/multioutput.html) for which the train and prediction function would need to be adjusted. You might also consider using streaming GPRs (https://github.com/thangbui/streaming_sparse_gp). In future releases we might support this case automatically (i.e., handle the case in which only one model is provided).

__init__(*args, **kwargs)[source]

Construct the PALGPflowGPR instance

Parameters:
  • X_design (np.array) – Design space (feature matrix)

  • models (list) – Machine learning models

  • ndim (int) – Number of objectives

  • epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.

  • delta (float, optional) – Delta hyperparameter. Defaults to 0.05.

  • beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.

  • goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.

  • coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.

  • opt (function, optional) – Optimizer function for the GPR parameters. If None (default), then we will use gpflow.optimizers.Scipy()

  • opt_kwargs (dict, optional) – Keyword arguments passed to the optimizer. If None, PyePAL will pass {“maxiter”: 100}

  • n_jobs (int) – Number of parallel threads that are used to fit the GPR models. Defaults to 1.

For GPR with BoTorch

class pyepal.pal.pal_botorch.PALBoTorch(*args, **kwargs)[source]

Bases: PALBase

PAL class for a list of BoTorch (GPR) models, with one model per objective

__init__(*args, **kwargs)[source]

Construct the PALBoTorch instance

Parameters:
  • X_design (np.array) – Design space (feature matrix)

  • model_functions (list) – Functions that, when called with x, y, and optionally old_state_dict, return a model and a likelihood. We need to do this due to problems with re-training warm-started models in BoTorch (https://github.com/pytorch/botorch/issues/533).

  • ndim (int) – Number of objectives

  • epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.

  • delta (float, optional) – Delta hyperparameter. Defaults to 0.05.

  • beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.

  • goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.

  • coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.

  • restarts (int) – Number of random restarts that are used for hyperparameter optimization. Defaults to 20.

  • n_jobs (int) – Number of parallel processes that are used to fit the GPR models. Defaults to 1.

  • power_transformer (bool) – If True, use Yeo-Johnson transform on the inputs. Defaults to True.

  • add_observation_noise (bool) – If True, add observation noise to predicted uncertainties. Defaults to False

class pyepal.pal.pal_botorch.PALMultiTaskBoTorch(*args, **kwargs)[source]

Bases: PALBase

PAL class for a multioutput BoTorch model

__init__(*args, **kwargs)[source]

Construct the PALMultiTaskBoTorch instance

Parameters:
  • X_design (np.array) – Design space (feature matrix)

  • model_functions (list) – Functions that, when called with x, y, and optionally old_state_dict, return a model and a likelihood. We need to do this due to problems with re-training warm-started models in BoTorch (https://github.com/pytorch/botorch/issues/533).

  • ndim (int) – Number of objectives

  • epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.

  • delta (float, optional) – Delta hyperparameter. Defaults to 0.05.

  • beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.

  • goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.

  • coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.

  • restarts (int) – Number of random restarts that are used for hyperparameter optimization. Defaults to 20.

  • n_jobs (int) – Number of parallel processes that are used to fit the GPR models. Defaults to 1.

  • power_transformer (bool) – If True, use Yeo-Johnson transform on the inputs. Defaults to True.

For GBDT with CatBoost

Implements a PAL class for GBDT models with virtual ensembles for the uncertainty estimates.

Note that the scaling of the hyperrectangles has been derived for GPR models (with RBF kernels).

class pyepal.pal.pal_catboost.PALCatBoost(*args, **kwargs)[source]

Bases: PALBase

PAL class for a list of CatBoost GBDT models

__init__(*args, **kwargs)[source]

Construct the PALCatBoost instance

Parameters:
  • X_design (np.array) – Design space (feature matrix)

  • models (List[CatBoostRegressor]) – Machine learning models. You need to provide a list of CatBoost regressors. The regressors need to use the RMSEWithUncertainty loss.

  • ndim (int) – Number of objectives

  • epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.

  • delta (float, optional) – Delta hyperparameter. Defaults to 0.05.

  • beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.

  • goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.

  • coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.

  • virtual_ensembles_count (int, optional) – Number of virtual ensemble models. Defaults to 10.
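
A sketch for PALCatBoost, assuming the catboost package is installed; as noted in the input validation section below, the regressors need to use the RMSEWithUncertainty loss:

from catboost import CatBoostRegressor
from pyepal.pal.pal_catboost import PALCatBoost

models = [
    CatBoostRegressor(loss_function="RMSEWithUncertainty", verbose=False)
    for _ in range(2)  # one regressor per objective
]
palinstance = PALCatBoost(X, models, ndim=2, virtual_ensembles_count=10)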

Schedules for hyperparameter optimization

Provides some scheduling functions that can be used to implement the _should_optimize_hyperparameters function

pyepal.pal.schedules.exp_decay(iteration, base=10)[source]

Optimize hyperparameters at logarithmically spaced intervals

Parameters:
  • iteration (int) – current iteration

  • base (int, optional) – Base of the logarithm. Defaults to 10.

Returns:

True if iteration is on the log scaled grid

Return type:

bool

pyepal.pal.schedules.linear(iteration, frequency=10)[source]

Optimize hyperparameters at equally spaced intervals

Parameters:
  • iteration (int) – current iteration

  • frequency (int, optional) – Spacing between the True outputs. Defaults to 10.

Returns:

True if iteration can be divided by frequency without remainder

Return type:

bool
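
A sketch of plugging a schedule into a PAL subclass by overriding _should_optimize_hyperparameters. The attribute name self.iteration for the iteration counter is an assumption made for this example:

from pyepal.pal.pal_gpy import PALGPy
from pyepal.pal.schedules import linear

class PALGPyEveryFifth(PALGPy):
    """Re-optimize the GPR hyperparameters only every fifth iteration."""

    def _should_optimize_hyperparameters(self) -> bool:
        return linear(self.iteration, frequency=5)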

Utilities for multiobjective optimization

Utilities for dealing with Pareto fronts in general

pyepal.pal.utils.dominance_check(point1, point2)[source]

One point dominates another if it is not worse in any objective and strictly better in at least one. This assumes that we want to maximize all objectives.

Return type:

bool

pyepal.pal.utils.dominance_check_jitted(point, array)[source]

Check if point dominates any point in array

Return type:

bool

pyepal.pal.utils.dominance_check_jitted_2(array, point)[source]

Check if any point in array dominates point

Return type:

bool

pyepal.pal.utils.dominance_check_jitted_3(array, point, ignore_me)[source]

Check if any point in array dominates point. The ignore_me argument exists because numba does not understand masked arrays.

Return type:

bool

pyepal.pal.utils.exhaust_loop(palinstance, y, batch_size=1)[source]

Helper function that takes an initialized PAL instance and loops the sampling until there is no unclassified point left. This is useful if all measurements are already taken and one wants to test the algorithm with different hyperparameters.

Parameters:
  • palinstance (PALBase) – An initialized instance of a class that inherits from PALBase and implements the _train() and _predict() functions

  • y (np.array) – Measurements. The number of measurements must equal the number of points in the design space.

  • batch_size (int, optional) – Number of indices that will be returned. Defaults to 1.

Returns:

None. The PAL instance is updated in place.
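
A short sketch of exhausting the loop when all measurements y are already available; palinstance is assumed to be initialized and seeded with a few measurements via update_train_set:

from pyepal.pal.utils import exhaust_loop

exhaust_loop(palinstance, y, batch_size=1)
print(palinstance.number_pareto_optimal_points)  # inspect the result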

pyepal.pal.utils.get_hypervolume(pareto_front, reference_vector, prefactor=-1)[source]

Compute the hypervolume indicator of a Pareto front. We multiply it by minus one as we assume that we want to maximize all objectives and then calculate the area, as in the following sketch (f1 on the vertical axis, f2 on the horizontal axis):

f1
|
|----|
|     -|
|       -|
------------ f2

But the code we use for the hypervolume indicator assumes that the reference vector is larger than all the points in the Pareto front. For this reason, we then flip all the signs using prefactor.

This indicator is not needed for the epsilon-PAL algorithm itself but only to allow tracking a metric that might help the user to see if the algorithm converges.

Return type:

float

pyepal.pal.utils.get_kmeans_samples(X, n_samples, **kwargs)[source]

Get the samples that are closest to the k=n_samples centroids

Parameters:
  • X (np.array) – Feature array, on which the KMeans clustering is run

  • n_samples (int) – number of samples that should be selected

  • **kwargs – Additional keyword arguments passed to the KMeans constructor

Returns:

selected_indices

Return type:

np.array
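
A sketch of using get_kmeans_samples to pick a diverse initial training set (the choice of 10 samples is arbitrary):

from pyepal.pal.utils import get_kmeans_samples

initial_idx = get_kmeans_samples(X, n_samples=10)
palinstance.update_train_set(initial_idx, y[initial_idx])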

pyepal.pal.utils.get_maxmin_samples(X, n_samples, metric='euclidean', init='mean', seed=None, **kwargs)[source]

Greedy maxmin sampling, also known as Kennard-Stone sampling (1). Note that a greedy sampling is not guaranteed to give the ideal solution and the output will depend on the random initialization (if this is chosen).

If you need a good solution, you can restart this algorithm multiple times with random initialization and different random seeds and use a coverage metric to quantify how well the space is covered. Some metrics are described in (2). In contrast to the code provided with (2) and (3) we do not consider the feature importance for the selection as this is typically not known beforehand.

You might want to standardize your data before applying this sampling function.

Some more sampling options are provided in our structure_comp (4) Python package. Also, this implementation here is quite memory hungry.

References: (1) Kennard, R. W.; Stone, L. A. Computer Aided Design of Experiments. Technometrics 1969, 11 (1), 137–148. https://doi.org/10.1080/00401706.1969.10490666. (2) Moosavi, S. M.; Nandy, A.; Jablonka, K. M.; Ongari, D.; Janet, J. P.; Boyd, P. G.; Lee, Y.; Smit, B.; Kulik, H. J. Understanding the Diversity of the Metal-Organic Framework Ecosystem. Nature Communications 2020, 11 (1), 4068. https://doi.org/10.1038/s41467-020-17755-8. (3) Moosavi, S. M.; Chidambaram, A.; Talirz, L.; Haranczyk, M.; Stylianou, K. C.; Smit, B. Capturing Chemical Intuition in Synthesis of Metal-Organic Frameworks. Nat Commun 2019, 10 (1), 539. https://doi.org/10.1038/s41467-019-08483-9. (4) https://github.com/kjappelbaum/structure_comp

Parameters:
  • X (np.array) – Feature array, this is the array that is used to perform the sampling

  • n_samples (int) – number of points that will be selected, needs to be lower than the length of X

  • metric (str, optional) – Distance metric to use for the maxmin calculation. Must be a valid option of scipy.spatial.distance.cdist (‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘wminkowski’, ‘yule’). Defaults to ‘euclidean’

  • init (str, optional) – either ‘mean’, ‘median’, or ‘random’. Determines how the initial point is chosen. Defaults to ‘mean’

  • seed (int, optional) – seed for the random number generator. Defaults to None.

  • **kwargs – Additional keyword arguments passed to cdist

Returns:

selected_indices

Return type:

np.array

pyepal.pal.utils.is_pareto_efficient(costs, return_mask=True)[source]

Find the Pareto efficient points. Based on https://stackoverflow.com/questions/32791911/fast-calculation-of-pareto-front-in-python

Parameters:
  • costs (np.array) – An (n_points, n_costs) array

  • return_mask (bool, optional) – If True, return a boolean mask. Otherwise, return an (n_efficient_points, ) integer array of indices. Defaults to True.

Returns:

Boolean mask of the Pareto efficient points if return_mask is True, otherwise an integer array of indices of the Pareto efficient points.

Return type:

np.array

Utilities for plotting

Plotting utilities

pyepal.plotting.plot_bar_iterations(pareto_optimal, non_pareto_points, unclassified_points, ax=None)[source]

Plot stacked barplots for every step of the iteration.

Parameters:
  • pareto_optimal (np.ndarray) – Number of Pareto optimal points for every iteration.

  • non_pareto_points (np.ndarray) – Number of discarded points for every iteration

  • unclassified_points (np.ndarray) – Number of unclassified points for every iteration

Returns:

matplotlib axis (the same that was provided as input, or one from a new figure if no axis was provided)

Return type:

axis

pyepal.plotting.plot_histogram(y, palinstance, ax=None)[source]

Plot histograms, with maxima scaled to one and different categories indicated in color for one objective

Parameters:
  • y (np.ndarray) – objective (measurement)

  • palinstance (PALBase) – instance of a PAL class

  • ax (ax) – Matplotlib figure axis

Returns:

matplotlib axis (the same that was provided as input, or one from a new figure if no axis was provided)

Return type:

ax

pyepal.plotting.plot_jointplot(y, palinstance, labels=None, figsize=(8.0, 6.0))[source]

Plot a jointplot of the objective space with histograms on the diagonal and 2D-Pareto plots on the off-diagonal.

Parameters:
  • y (np.array) – Two-dimensional array with the objectives (measurements)

  • palinstance (PALBase) – “trained” PAL instance

  • labels (Union[List[str], None], optional) – Labels for each objective. Defaults to “objective [index]”.

  • figsize (tuple, optional) – Figure size for joint plot. Defaults to (8.0, 6.0).

Returns:

matplotlib Figure object.

Return type:

fig
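
A short plotting sketch, assuming y holds the measured objectives and palinstance has already been run (e.g., via exhaust_loop):

from pyepal.plotting import plot_jointplot

fig = plot_jointplot(y, palinstance)
fig.savefig("pyepal_jointplot.png", dpi=300)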

pyepal.plotting.plot_pareto_front_2d(y_0, y_1, std_0, std_1, palinstance, ax=None)[source]

Plot a 2D pareto front, with the different categories indicated in color.

Parameters:
  • y_0 (np.ndarray) – objective 0

  • y_1 (np.ndarray) – objective 1

  • std_0 (np.ndarray) – standard deviation of objective 0

  • std_1 (np.ndarray) – standard deviation of objective 1

  • palinstance (PALBase) – PAL instance

  • ax (axis, optional) – Matplotlib figure axis. Defaults to None.

Returns:

matplotlib axis (the same that was provided as input, or one from a new figure if no axis was provided)

Return type:

ax

pyepal.plotting.plot_residuals(y, palinstance, labels=None, figsize=(6.0, 4.0))[source]

Plot signed residuals (on the y axis) versus fitted values (on the x axis) for the sampled points. Will create subplots for y.ndim > 1.

Parameters:
  • y (np.array) – Two-dimensional array with the objectives (measurements)

  • palinstance (PALBase) – “trained” PAL instance

  • labels (Union[List[str], None], optional) – Labels for each objective. Defaults to “objective [index]”.

  • figsize (tuple, optional) – Figure size for each individual residual vs fitted objective plot. Defaults to (6.0, 4.0).

Returns:

matplotlib Figure object

Return type:

fig

Input validation

Methods to validate inputs for the PAL classes

pyepal.pal.validate_inputs.base_validate_models(models)[source]

Currently no validation, as the predict and train functions are implemented independently of the base class

Return type:

list

pyepal.pal.validate_inputs.validate_beta_scale(beta_scale)[source]

Parameters:

beta_scale (Any) – scaling factor for beta

Raises:

ValueError – If beta_scale is smaller than 0

Returns:

scaling factor for beta

Return type:

float

pyepal.pal.validate_inputs.validate_catboost_models(models, ndim)[source]

Make sure that the number of models is equal to the number of objectives. Also make sure that the models are CatBoostRegressor instances with the RMSEWithUncertainty loss function

Return type:

List[Iterable]

pyepal.pal.validate_inputs.validate_coef_var(coef_var)[source]

Make sure that the coef_var makes sense

pyepal.pal.validate_inputs.validate_coregionalized_gpy(models)[source]

Make sure that model is a coregionalized GPR model

pyepal.pal.validate_inputs.validate_delta(delta)[source]

Make sure that delta is in a reasonable range

Parameters:

delta (Any) – Delta hyperparameter

Raises:

ValueError – Delta must be in [0,1].

Returns:

delta

Return type:

float

pyepal.pal.validate_inputs.validate_epsilon(epsilon, ndim)[source]

Validate epsilon and return a np.array

Parameters:
  • epsilon (Any) – Epsilon hyperparameter

  • ndim (int) – Number of dimensions/objectives

Raises:
  • ValueError – If epsilon is a list there must be one float per dimension

  • ValueError – Epsilon must be in [0,1]

  • ValueError – If epsilon is an array there must be one float per dimension

Returns:

Array of one epsilon per objective

Return type:

np.ndarray

pyepal.pal.validate_inputs.validate_gbdt_models(models, ndim)[source]

Make sure that the number of iterables is equal to the number of objectives and that every iterable contains three LGBMRegressors. Also, we check that at least the first and last models use quantile loss

Return type:

List[Iterable]

pyepal.pal.validate_inputs.validate_goals(goals, ndim)[source]

Create a valid array of goals: 1 for objectives that are to be maximized and -1 for objectives that are to be minimized.

Parameters:
  • goals (Any) – List of goals, typically provided as strings ‘max’ for maximization and ‘min’ for minimization

  • ndim (int) – number of dimensions

Raises:
  • ValueError – If goals is a list and the length is not equal to ndim

  • ValueError – If goals is a list and the elements are not strings ‘min’, ‘max’ or -1 and 1

Returns:

Array of -1 and 1

Return type:

np.ndarray

pyepal.pal.validate_inputs.validate_gpy_model(models)[source]

Make sure that all elements of the list are GPRegression models

pyepal.pal.validate_inputs.validate_interquartile_scaler(interquartile_scaler)[source]

Make sure that the interquartile_scaler makes sense

Return type:

float

pyepal.pal.validate_inputs.validate_ndim(ndim)[source]

Make sure that the number of dimensions makes sense

Parameters:

ndim (Any) – number of dimensions

Raises:
  • ValueError – If the number of dimensions is not an integer

  • ValueError – If the number of dimensions is not greater than 0

Returns:

the number of dimensions

Return type:

int

pyepal.pal.validate_inputs.validate_njobs(njobs)[source]

Make sure that njobs is an integer and at least 1

Return type:

int

pyepal.pal.validate_inputs.validate_nt_models(models, ndim)[source]

Make sure that we can work with a sequence of pyepal.pal.models.nt.NTModel()

Return type:

Sequence

pyepal.pal.validate_inputs.validate_number_models(models, ndim)[source]

Make sure that there are as many models as objectives

Parameters:
  • models (Any) – List of models

  • ndim (int) – Number of objectives

Raises:

ValueError – If the number of models does not equal the number of objectives

pyepal.pal.validate_inputs.validate_optimizers(optimizers, ndim)[source]

Make sure that we can work with a Sequence of JaxOptimizer

Return type:

Sequence

pyepal.pal.validate_inputs.validate_positive_integer_list(seq, ndim, parameter_name='Parameter')[source]

Can be used, e.g., to validate and standardize the ensemble size and epochs input

Return type:

Sequence[int]

pyepal.pal.validate_inputs.validate_sklearn_gpr_models(models, ndim)[source]

Make sure that there is a list of GPR models, one model per objective

Return type:

List[GaussianProcessRegressor]

The models package

Helper functions for GPR with GPy

Wrappers for Gaussian Process Regression models.

We typically use the GPy package as it offers most flexibility for Gaussian processes in Python. Typically, we use automatic relevance determination (ARD), where one lengthscale parameter per input dimension is used.

If your task requires training on larger training sets, you might consider replacing the models with their sparse version but for the epsilon-PAL algorithm this typically shouldn’t be needed.

For kernel selection, you can have a look at https://www.cs.toronto.edu/~duvenaud/cookbook/. Matérn, RBF, and RationalQuadratic are good quick-and-dirty solutions but have their caveats

pyepal.models.gpr.build_coregionalized_model(X_train, y_train, kernel=None, w_rank=1, **kwargs)[source]

Wrapper for building a coregionalized GPR; it will have as many outputs as y_train.shape[1]. Each output will have its own noise term

Return type:

GPCoregionalizedRegression

pyepal.models.gpr.build_model(X_train, y_train, index=0, kernel=None, **kwargs)[source]

Build a single-output GPR model

Return type:

GPRegression
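
A small sketch of building a GPy model with a non-default kernel; the random training data is only for illustration:

import numpy as np
from pyepal.models.gpr import build_model, get_matern_32_kernel

rng = np.random.default_rng(42)
X_train = rng.random((20, 3))   # 20 samples, 3 features
y_train = rng.random((20, 2))   # 2 objectives

kernel = get_matern_32_kernel(X_train.shape[1])  # ARD Matern-3/2 kernel
model_0 = build_model(X_train, y_train, index=0, kernel=kernel)  # model for objective 0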

pyepal.models.gpr.get_matern_32_kernel(NFEAT, ARD=True, **kwargs)[source]

Matern-3/2 kernel, by default with ARD

Return type:

Matern32

pyepal.models.gpr.get_matern_52_kernel(NFEAT, ARD=True, **kwargs)[source]

Matern-5/2 kernel, by default with ARD

Return type:

Matern52

pyepal.models.gpr.get_ratquad_kernel(NFEAT, ARD=True, **kwargs)[source]

Rational quadratic kernel, by default with ARD

Return type:

RatQuad

pyepal.models.gpr.predict(model, X)[source]

Wrapper function for the prediction method of a GPy regression model. It returns the standard deviation instead of the variance

Return type:

Tuple[array, array]

pyepal.models.gpr.predict_coregionalized(model, X, index=0)[source]

Wrapper function for the prediction method of a coregionalized GPy regression model. It returns the standard deviation instead of the variance

Return type:

Tuple[array, array]

pyepal.models.gpr.set_xy_coregionalized(model, X, y, mask=None)[source]

Wrapper to update a coregionalized model with new data

Helper functions for GPR with BoTorch

pyepal.models.botorch_gp.build_model(X, y, warped=True, input_scaled=True, wrap_indices=None, scaling_indices=None, covar_module=None)[source]

Build a BoTorch model for a single output.

Parameters:
  • X (np.ndarray) – features

  • y (np.ndarray) – targets

  • warped (bool, optional) – If true, apply Kumaraswamy warping. Defaults to True.

  • input_scaled (bool, optional) – If true, scale features to unit cube. Defaults to True.

  • wrap_indices (Optional[Tuple[int]], optional) – Indices to which warping is applied. Defaults to None.

  • scaling_indices (Optional[Tuple[int]], optional) – Indices to which scaling is applied. Defaults to None.

  • covar_module (Optional[Module], optional) – Coregionalization model. Defaults to None.

Returns:

Function that returns a model and a likelihood when provided with x and y

Return type:

Callable

pyepal.models.botorch_gp.build_multioutput_model(X, y, warped=True, input_scaled=True, wrap_indices=None, scaling_indices=None, covar_module=None)[source]

Build a BoTorch model for multiple outputs.

Parameters:
  • X (np.ndarray) – features

  • y (np.ndarray) – targets

  • warped (bool, optional) – If true, apply Kumaraswamy warping. Defaults to True.

  • input_scaled (bool, optional) – If true, scale features to unit cube. Defaults to True.

  • wrap_indices (Optional[Tuple[int]], optional) – Indices to which warping is applied. Defaults to None.

  • scaling_indices (Optional[Tuple[int]], optional) – Indices to which scaling is applied. Defaults to None.

  • covar_module (Optional[Module], optional) – Coregionalization model. Defaults to None.

Returns:

Function that returns a model and a likelihood when provided with x and y

Return type:

Callable