BayesOpt
Many problems in engineering, computer science, economics, etc., require finding the extremum of a real valued function. These functions are typically continuous and sometimes smooth (e.g., Lipschitz continuous). However, those functions may not have a closed-form expression, or they might be multimodal, where some of the local extrema have a much worse outcome than the global extremum. Moreover, evaluating those functions can be costly.
Global optimization is a special case of non-convex optimization where we want to find the global extremum of a real valued function, that is, the target function. The search is performed using pointwise evaluations of the target function.
The objective of a global optimization algorithm is to find the sequence of points

$$x_n \in A \subset \mathbb{R}^m, \qquad n = 1, 2, \ldots$$

which converges to the point $x^*$, that is, the extremum of the target function, when $n$ is large. The algorithm should be able to find that sequence at least for all functions from a given family.
As explained in [Mockus94], this search procedure is a sequential decision making problem where the point $x_{n+1}$ at step $n+1$ is selected by a decision $d_n$ which considers all previous data:

$$x_{n+1} = d_n(x_{1:n}, y_{1:n})$$

where $y_i = f(x_i) + \epsilon_i$. For simplicity, many works assume $\epsilon_i = 0$, that is, function evaluations are deterministic. However, we can easily extend the description to include stochastic functions (e.g., homoscedastic noise $\epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$).
The search method is the sequence of decisions $d = \{d_0, \ldots, d_{n-1}\}$, which leads to the final decision $x_n = x_n(d)$. In most applications, the objective is to optimize the response of the final decision. Then the criterion relies on the optimality error or optimality gap, which can be expressed as:

$$\delta_n(f, d) = f(x_n) - f(x^*)$$
In other applications, the objective may require converging to $x^*$ in the input space. Then we can use, for example, the Euclidean distance error:

$$\delta_n(f, d) = \|x_n - x^*\|_2$$
The previous equations can also be interpreted as variants of the loss function for the decision at each step. Thus, the optimal decision is defined as the function that minimizes the loss function:

$$d_n^* = \arg\min_d \; \delta_n(f, d)$$
This requires full knowledge of the function $f$, which is unavailable. Instead, let us assume that the target function belongs to a family of functions $F$, e.g., continuous functions in $\mathbb{R}^m$. Let us also assume that the function can be represented as a sample from a probability distribution over functions, $f \sim P(f)$. Then, the best response case analysis for the search process is defined as the decision that optimizes the expectation of the loss function:

$$d_n^* = \arg\min_d \; \mathbb{E}_{P(f)}\left[\delta_n(f, d)\right] = \arg\min_d \int_F \delta_n(f, d) \, dP(f)$$

where $P(f)$ is a prior distribution over functions.
However, we can improve on this equation by considering that, at decision $d_n$, we have already observed the actual response of the function at $n$ points, $\{x_{1:n}, y_{1:n}\}$. Thus, the prior information about the function can be updated with the observations and Bayes' rule:

$$P(f \mid x_{1:n}, y_{1:n}) = \frac{P(x_{1:n}, y_{1:n} \mid f) \, P(f)}{P(x_{1:n}, y_{1:n})}$$
In fact, we can rewrite the equation to represent the updates sequentially:

$$P(f \mid x_{1:n}, y_{1:n}) \propto P(x_n, y_n \mid f) \, P(f \mid x_{1:n-1}, y_{1:n-1})$$
Thus, the previous equation can be rewritten as:

$$d_n^* = \arg\min_d \; \mathbb{E}_{P(f \mid x_{1:n}, y_{1:n})}\left[\delta_n(f, d)\right]$$
This equation is the root of Bayesian optimization, where the Bayesian part comes from the fact that we are computing the expectation with respect to the posterior distribution, also called the belief, over functions. Therefore, Bayesian optimization is a memory-based optimization algorithm.
As commented before, although most of the theory of Bayesian optimization concerns deterministic functions, we also consider stochastic functions, that is, we assume there might be a random error in the function output. In fact, evaluations can produce different outputs if repeated. In that case, the target function is the expected output. Furthermore, [Gramacy2012] recently showed that, even for deterministic functions, it is better to assume a certain error in the observations. The main reason is that, in practice, there might be some mismodelling errors which can lead to instability of the recursion if neglected.
In order to simplify the description, we are going to use a special case of the Bayesian optimization model defined previously, which corresponds to the most common application. In subsequent sections we will introduce some generalizations for different applications.
Without loss of generality, consider the problem of finding the minimum of an unknown real valued function $f : \mathbb{X} \to \mathbb{R}$, where $\mathbb{X}$ is a compact space, $\mathbb{X} \subset \mathbb{R}^d$, $d \geq 1$. Let $P(f)$ be a prior distribution over functions represented as a stochastic process, for example, a Gaussian process $\xi(\cdot)$, with inputs $x \in \mathbb{X}$ and an associated kernel or covariance function $k(\cdot, \cdot)$. Let us also assume that the target function is a sample of the stochastic process, $f \sim \xi(\cdot)$.
In order to find the minimum, the algorithm has a maximum budget of $N$ evaluations of the target function $f$. The purpose of the algorithm is to find optimal decisions that provide the best performance at the end.
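To make the structure of this sequential procedure concrete, here is a minimal, self-contained sketch of the loop for a one-dimensional minimization problem. It is only an illustration under simplifying assumptions (a zero-mean Gaussian process surrogate with a squared exponential kernel and a lower confidence bound criterion, optimized over a fixed grid of candidates); it is not the library's implementation.

```python
import numpy as np

# Toy sketch of the Bayesian optimization loop (1-D, minimization).
# Surrogate: zero-mean GP, squared exponential kernel; criterion: LCB.

def kernel(a, b, ell=0.2):
    # Squared exponential kernel on 1-D inputs
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Standard GP regression equations: posterior mean and std at Xs
    K = kernel(X, X) + noise * np.eye(len(X))
    Ks = kernel(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def bayes_opt(f, lb, ub, n_init=5, budget=20, beta=2.0):
    rng = np.random.default_rng(0)
    X = rng.uniform(lb, ub, n_init)              # initial (space-filling) design
    y = np.array([f(x) for x in X])
    grid = np.linspace(lb, ub, 500)              # candidate points
    while len(y) < budget:
        mu, s = gp_posterior(X, y, grid)
        x_next = grid[np.argmin(mu - beta * s)]  # decision: minimize LCB
        X = np.append(X, x_next)
        y = np.append(y, f(x_next))              # evaluate the target function
    best = np.argmin(y)
    return X[best], y[best]

x_best, y_best = bayes_opt(lambda x: (x - 0.3)**2 + 0.1*np.sin(20*x), 0.0, 1.0)
```

The grid search over the criterion is a simplification for readability; in practice the criterion is maximized with an inner (cheap) optimizer.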
One advantage of using Gaussian processes as prior distributions over functions is that new observations of the target function can easily be used to update the distribution over functions. Furthermore, the posterior distribution is also a Gaussian process. Therefore, the posterior can be used as an informative prior for the next iteration in a recursive algorithm.
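For reference, these are the standard Gaussian process regression equations: given observations $(x_{1:n}, y_{1:n})$ and a zero-mean prior with kernel $k$, the posterior at a query point $x$ is Gaussian with

$$\mu_n(x) = \mathbf{k}(x)^T \left(\mathbf{K} + \sigma_n^2 \mathbf{I}\right)^{-1} \mathbf{y}, \qquad s_n^2(x) = k(x, x) - \mathbf{k}(x)^T \left(\mathbf{K} + \sigma_n^2 \mathbf{I}\right)^{-1} \mathbf{k}(x)$$

where $\mathbf{K} = [k(x_i, x_j)]_{ij}$ and $\mathbf{k}(x) = [k(x, x_1), \ldots, k(x, x_n)]^T$. Both quantities are available in closed form, which is what makes the recursive update cheap.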
In a more general setting, many authors have suggested modifying the standard zero-mean Gaussian process with different variations, which include semi-parametric models [Huang06] [Handcock1993] [Jones:1998] [OHagan1992], the use of hyperpriors on the parameters [MartinezCantin09AR] [Brochu:2010c] [Hoffman2011], Student's t processes [Gramacy_Polson_2009] [Sacks89SS] [Williams_Santner_Notz_2000], etc.
We use a generalized linear model of the form:

$$f(x) = \phi(x)^T \mathbf{w} + \epsilon(x)$$

where

$$\epsilon(x) \sim \mathcal{NP}\left(0, \sigma_s^2 \left(\mathbf{K}(\theta) + \sigma_n^2 \mathbf{I}\right)\right)$$
The term $\mathcal{NP}$ means a non-parametric process, which can refer to a Gaussian process $\mathcal{GP}$ or a Student's t process $\mathcal{TP}$. In both cases, $\sigma_n^2$ is the observation noise variance, sometimes called the nugget, and it is problem specific. Many authors decide to fix this value when the function is deterministic, for example, a computer simulation. However, as cleverly pointed out in [Gramacy2012], there might be more reasons to include this term apart from the observation noise, for example, to account for model inaccuracies.
This model has been presented in different ways depending on the field where it was used:

- As a generalized linear model $\phi(x)^T \mathbf{w}$ with heteroscedastic perturbation $\epsilon(x)$.
- As a nonparametric process of the form $\mathcal{NP}\left(\phi(x)^T \mathbf{w}, \sigma_s^2 \left(\mathbf{K}(\theta) + \sigma_n^2 \mathbf{I}\right)\right)$.
- As a semiparametric model $f(x) = f_{par}(x) + f_{nonpar}(x) = \phi(x)^T \mathbf{w} + \mathcal{NP}(\cdot)$.
This library was originally developed as part of a robotics research project [MartinezCantin09AR] [MartinezCantin07RSS], where a Gaussian process with hyperpriors on the mean and signal covariance parameters was used. The metamodel was then constructed using the Maximum a Posteriori (MAP) estimate of the parameters. At that time, it only supported one kernel function, one mean function and one criterion.
However, the library has now grown to support many more surrogate models, with different distributions (Gaussian processes, Student's t processes, etc.), with many kernels and mean functions. It also provides different criteria (even some combined criteria), so the library can be used for any problem involving some bounded optimization, stochastic bandits, active learning for regression, etc.
As seen in Section modopt, this library implements only one general regression model. However, we can assign a set of priors on the parameters of the model, $\mathbf{w}$ and $\sigma_s^2$ (the kernel hyperparameters $\theta$ will be discussed in Section learnker). Thus, the options are:

- "sGaussianProcess": a standard Gaussian process where the hyperparameters are known.
- "sGaussianProcessML": a standard Gaussian process where the hyperparameters are estimated from data using maximum likelihood.
- "sGaussianProcessNormal": a Gaussian process with a Normal prior on the mean function parameters $\mathbf{w}$ and known $\sigma_s^2$.
- "sStudentTProcessJef": a Student's t process with Jeffreys priors on $\mathbf{w}$ and $\sigma_s^2$.
- "sStudentTProcessNIG": a Student's t process with Normal-Inverse-Gamma priors on $\mathbf{w}$ and $\sigma_s^2$.
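For illustration, a minimal sketch of selecting the surrogate through the Python interface is shown below. The dictionary keys ('surr_name', 'noise') follow the Python demos distributed with the library, but take them as assumptions and check the parameters documentation for your version:

```python
# Hypothetical sketch: surrogate selection via the Python interface.
# Key names are assumptions based on the library's Python demos.
params = {}
params['surr_name'] = "sStudentTProcessNIG"  # Student's t process, NIG priors
params['noise'] = 1e-6  # observation noise variance (the "nugget")
```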
Gaussian processes are very general models that can achieve good performance with a reasonable computational cost. However, Student's t processes, thanks to the hierarchical structure of priors, provide an even more general setup at a minor extra cost. Furthermore, the Student's t distribution is robust to outliers and heavy tails in the data.
One of the critical components of Gaussian and Student's t processes is the definition of the kernel function, which defines the correlation between points in the input space. As a correlation function, the kernel must satisfy a set of properties (e.g., being positive definite). All the available kernel models and their combinations satisfy these kernel restrictions.
The functions with "ISO" in their name are isotropic functions, that is, they share a single set of parameters across all dimensions of the input space.
The functions with "ARD" in their name use Automatic Relevance Determination, that is, they use independent parameters for every dimension of the problem. Therefore, they can be used to find the relevance of each feature in the input space. In the limit, this can be used for feature selection.
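As a concrete example, consider the squared exponential (Gaussian) kernel in its isotropic and ARD forms:

$$k_{ISO}(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right), \qquad k_{ARD}(x, x') = \exp\left(-\frac{1}{2} \sum_{i=1}^{d} \frac{(x_i - x_i')^2}{\ell_i^2}\right)$$

The isotropic form has a single length scale $\ell$, while the ARD form learns one length scale $\ell_i$ per dimension; a large $\ell_i$ indicates that dimension $i$ is mostly irrelevant.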
These kernels allow combining some of the previous kernels.
Note that the binary kernels only admit two terms. However, we can combine them for more complex operations. For example, if we write:
"kSum(kMaternISO3,kSum(kRQISO,kProd(kPoly4,kConst))"
it represents the expression: Matern(3) + RationalQuadratic + C*Polynomial^4
In this case, the vector of parameters is split from left to right: 1 for the Matern function, 2 for the RQ function, 2 for the polynomial function and 1 for the constant. If the vector of parameters has more or fewer than 6 elements, the system complains.
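A sketch of this configuration through the Python interface could look as follows; the key names ('kernel_name', 'kernel_hp_mean', 'kernel_hp_std') are assumptions based on the library's Python demos, and the hyperparameter values are arbitrary placeholders:

```python
# Hypothetical sketch: composite kernel configuration (key names assumed).
params = {}
params['kernel_name'] = "kSum(kMaternISO3,kSum(kRQISO,kProd(kPoly4,kConst)))"
# 6 hyperparameters, split left to right:
# 1 (Matern) + 2 (RationalQuadratic) + 2 (Polynomial) + 1 (Constant)
params['kernel_hp_mean'] = [1.0] * 6   # prior means of the hyperparameters
params['kernel_hp_std'] = [10.0] * 6   # prior standard deviations
```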
Although the nonparametric process is able to model a large variety of functions, we can model the expected value of the nonparametric process as a parametric function. This parametric model helps to capture large offsets and global trends.
The usage is analogous to the kernel functions.
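For instance, a constant mean with a prior on its coefficient might be configured as below; again, the key names ('mean_name', 'mean_coef_mean', 'mean_coef_std') are assumptions based on the Python demos:

```python
# Hypothetical sketch: parametric mean function configuration (names assumed).
params = {}
params['mean_name'] = "mConst"       # constant mean function
params['mean_coef_mean'] = [1.0]     # prior mean of the coefficient
params['mean_coef_std'] = [100.0]    # prior std (large value = weak prior)
```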
As discussed in Introduction to Bayesian Optimization, one of the critical aspects of Bayesian optimization is the decision (loss) function. Unfortunately, the loss functions described there are unavailable in practice, because they assume knowledge of the optimum, $x^*$ or $f(x^*)$. However, we can define proxies for those functions.
Some criteria, such as the expected improvement and the lower confidence bound, admit an annealed version "cXXXa". In that version, the parameter that trades off exploration and exploitation changes over time to prioritize exploration at the beginning and exploitation at the end.
Many criteria depend on the prediction function, which can be a Gaussian or a Student's t distribution, depending on the surrogate model. However, the library includes all the criteria for both distributions, and the system automatically selects the correct one.
In this case, the combined criteria admit more than two functions. For example:
"cHedge(cSum(cEI,cDistance),cLCB,cPOI,cOptimisticSampling)"
The posterior distribution of the model, which is necessary to compute the criterion function, cannot be computed in closed form if the kernel hyperparameters are unknown. Thus, we need a way to approximate this posterior distribution conditioned on the kernel hyperparameters.
First, we need to consider whether we are going to use a full Bayesian approach or an empirical Bayesian approach. The former computes the full posterior distribution by propagating the uncertainty of each element and hyperparameter to the posterior. In this case, it can be done by discretization of the hyperparameter space or by using MCMC (not yet implemented). In theory, it is more precise, but the computational burden can be orders of magnitude higher. The empirical approach, on the other hand, computes a point estimate of the hyperparameters based on some score function and uses it as the "true" value. Although the uncertainty estimation in this case might not be as accurate as in the full Bayesian approach, the computational overhead is minimal.
For the score function, we need the likelihood function of the observed data given the parameters. Depending on the model, this function will be a multivariate Gaussian distribution or a multivariate t distribution. In general, we present the likelihood as a log-likelihood function up to a constant factor, that is, we remove the terms independent of $\theta$ from the log likelihood. In practice, whether we use a point estimate (maximum score) or full Bayes MCMC/discrete posterior, the constant factor is not needed.
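For example, in the Gaussian case with a zero mean, the log-likelihood up to a constant is the standard expression (the Student's t case is analogous):

$$\log p(\mathbf{y} \mid x_{1:n}, \theta) = -\frac{1}{2} \mathbf{y}^T \mathbf{K}(\theta)^{-1} \mathbf{y} - \frac{1}{2} \log \left|\mathbf{K}(\theta)\right| + \text{const}$$

where $\mathbf{K}(\theta)$ includes the noise term on its diagonal and the dropped constant, $-\frac{n}{2} \log 2\pi$, does not depend on $\theta$.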
We are going to consider several score functions to learn the kernel hyperparameters, such as maximum likelihood (ML), maximum a posteriori (MAP) and leave-one-out cross-validation (LOOCV).
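A sketch of how these choices might be expressed through the Python interface follows; both the key names ('l_type', 'sc_type') and the string values are assumptions based on the Python demos, so verify them against the parameters documentation:

```python
# Hypothetical sketch: hyperparameter learning configuration (names assumed).
params = {}
params['l_type'] = "L_EMPIRICAL"  # empirical Bayes: point estimate of theta
params['sc_type'] = "SC_MAP"      # score function: maximum a posteriori
```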
In order to build a suitable surrogate function, we need a preliminary set of samples. In Bayesian optimization this is typically performed using alternative experimental design criteria. In this first step, the main criterion is usually space filling. Thus, we have implemented the following designs:

- Latin hypercube sampling
- Sobol sequences
- Uniform sampling
Note: Since we do not assume any structure in the set of discrete points during discrete optimization, only uniform sampling of the discrete set is available in that case.
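As with the previous options, the initial design can presumably be configured through the parameter structure; 'n_init_samples' appears in the Python demos, while the integer coding of 'init_method' is an assumption, so check the parameters documentation:

```python
# Hypothetical sketch: initial design configuration (names assumed).
params = {}
params['n_init_samples'] = 10  # size of the preliminary, space-filling design
params['init_method'] = 1      # selects the design; the integer coding is
                               # version dependent (assumed), see the docs
```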