punjabtechnicaluniversity.blogspot.in
Introduction
1.1 SimulationBased Optimization
Computer simulations are used extensively as models of
real systems to evaluate output responses. Applications of simulation are
widely found in many areas including supply chain management, finance,
manufacturing, engineering design and medical treatment [42, 48, 60, 83, 96].
The choice of optimal simulation parameters can lead to improved operation, but
configuring them well remains a challenging problem. Historically, the
parameters are chosen by selecting the best from a set of candidate parameter
settings. Simulationbased
optimization [3, 40, 41, 47, 69, 90] is an emerging field which integrates
optimization techniques into simulation analysis. The corresponding objective
function is an associated measurement of an experimental simulation. Due to the
complexity of the simulation, the objective function may be difficult and
expensive to evaluate. Moreover, the inaccuracy of the objective function often
complicates the optimization process. Deterministic optimization tools may
lead to inaccurate solutions. Indeed, derivative information
Floating sleeve
Outer conductor
^{3}
7
Sleeve position
Teflon catheter
is typically
unavailable, so many derivativedependent methods are not applicable to these
problems.
A particular example is an engineering design problem
for an microwave coaxial antenna in hepatic tumor ablation. (More details can
be found in Section 4.2.) We aim to determine the optimal dimensions of the
antenna, such as the slot size, the dipole tip length, and the thicknesses of
the Teflon coatings (see Figure 1 for the antenna illustration) to yield a
desired treatment performance. Finite element models (MultiPhysics 3.2) are
used to simulate the electromagnetic (EM) field distribution in liver given a
particular design. Since dielectric properties of the tissue are assumed to
vary within ±10% of average properties, it is important to ensure that the
antenna used in ablation is robust, i.e., relatively insensitive to the
variations in physical properties of the tissue.
Although real world problems have many forms, in the
thesis we consider the following bounded stochastic formulation:
min f (x)= E [F(x,£(u))]
, (1.1)
where
Q = {x e R^{n}
: l <
x
< u}.
Here, l and u are the lower and upper bounds for the input parameter
x, respectively. We focus on the case where the parameter x comes from a continuous set Q,
but not necessarily a finite set. £(u) is a random vector defined on a probability space.
The sample response function F : R^{n}
x R^{d} —► R takes two
inputs, the simulation parameters x e
R^{n} and a random sample of £(u)
in R^{d}. Note that the variables x
correspond to the design parameters (e.g., slot size,
dipole tip length, etc.) in the above example, and the random realizations
correspond to different dielectric properties of individuals. As in the
example, our approaches concentrate on cases where the number of design
parameters is small.
F(x, £(u)) corresponds to a particular evaluation using the finite element
model. Given a random realization £i of £(u), F(x, £i) can be evaluated via a single
simulation run. The underlying objective function f(x) is computed by taking an
expectation over the sample response function and has no explicit form.
(Different formulations other than expectation are acceptable; for example,
maximum, minimum, or quantiles of the sample objectives [4, 32]). A basic
assumption requires that the expectation function f (x) is well defined (for any x G R^{n} the
function F(x,
•) is measurable, and either E[F(x, £(u;))_{+}]
or E[F(x,£(u))]
is finite, see page 57 of [91]).
1.1.1 SimulationBased Optimization Methods
There are several excellent review papers on the
subject of simulationbased optimization methods, such as [3, 40, 41]. These
papers also provide a good survey on optimization addons for discreteevent
simulation software, which have been undergoing fast development over the last
decade; for example, OptQuest (www.optquest.com) for Arena (www.arena.com)
and for
SIMUL8 (www.simul8.com).
A detailed
summary of commercial simulation software and optimization addons can be found
in Table 1 of [41]. Contemporary methods for continuous simulationbased
optimization, which are implemented in these addons, are classified in several
categories.
Response surface methodology The idea of Response Surface Methodology [12, 90] (RSM) is to construct
one (or multiple) mathematical model A, which is called a surrogate model, to
approximate the underlying function f, so that it is can be easily
and cheaply evaluated at each parameter point x. Optimization procedures are
executed over the inexpensive model A, instead of the expensive cost
function. Moreover, the approximating model A could have estimated gradient
values that enable the application of more efficient optimization algorithms.
We determine the model A() by minimizing the difference of A() and the
function F(•) over a representative set
of points S:
^{min} J2x_{i&}s ^{0} ^{(F}^{(x}i^{,} & ^{} ^{A(x}i^{))} _{1} _{2)}
(1.2)
s.t. A(^)
G A
Here 0() is a merit function, which is typically
chosen as an /^{2}norm; ^ is one sample realization; and the set A is a class of tunable
functional forms. The most popular choices of A are linear and quadratic
models. Polynomial models are also compact forms and easy to evaluate, but they
often have limited capacity to model complex functions of arbitrary shape.
Splines are alternative tools, which are more flexible in fitting and are
capable of handling large scale data. But in high dimensional approximation,
the (Tensor product) spline requires sample data on a grid mesh for efficient
implementation. Kriging is a type of interpolative tool which was originally
developed in geostatistics. It is flexible enough to be implemented on
highdimensional spatial data and has been used in many engineering problems.
However, the performance is limited by the total number of design points; for
example, increasing the number of points in S can quadratically increase the
Kriging model construction time.
Heuristic methods Heuristic methods have proven
to be practically useful in many realworld applications. We will briefly
introduce the three most popular methods: genetic algorithms, tabu search and
simulated annealing.
Genetic algorithms [13, 94,
95] are inspired from the process of biological evolution. The algorithm is
initialized with a finite set of potential solutions called the population.
Each potential solution, referred to as an individual, may be coded as a binary
string or a realcoded integer or taken from a fixed alphabet of characters.
These solutions are evaluated by a fitness function (normally the objective
function) and the fit individuals are assigned a high probability to
"reproduce" in the next generation of solutions, a sort of survival
of the fittest scheme. In the reproducing process, new offspring inherit traits
from parents (crossover) and are subject to mutation changes. The new offspring
are accepted using various criteria to form a new population of candidate
solutions. The algorithm proceeds to generate more favorable solutions in each
iteration and eventually converges to a population with a distribution of good
solutions.
Tabu search [45, 46] is a metaheuristic based on the local search method, which iteratively moves the
current iterate to a neighbor solution, until certain criteria are satisfied.
The algorithm allows a move to a neighbor solution that has a worse objective
value. Meanwhile, in order to prevent circling to the previous solutions or
infinite loops, a list of tabu or forbidden moves are kept and updated at each
iteration. A shortterm memory tabu list is the most used type and is often
referred to as a tabu tenure.
Simulated annealing [23, 31]
searches local moves randomly from a list of candidate neighbor points. If a
better neighbor point is encountered, it replaces the current iterate with
probability one, or if a worse point is found, it replaces the iterate with a
probability value strictly less than one. The appropriate probability value is
determined by the difference of the objective values. For the algorithm to
converge, the probability of moving towards a worse point should decrease along
the iterations according to the decrement of a certain 'temperature' value,
which changes based on a cooling schedule. All of the three algorithms
are considered as global optimization methods since they are able to move
iterates out of regions where locally optimal solutions reside.
Stochastic approximation Stochastic approximation [67, 104]
falls into the category of gradientbased approaches. Typically, the method
iteratively updates the current solution by
xfc+i
= x_{k} + a_{k} Vf (x_{k}),
where the estimated gradient V f (x_{k}) can be calculated by various
gradient estimation tools and a_{k} is the step length. Since the method is an extension of
the line search method, it obviously is a local optimization method. Under
proper conditions, such as when the error in the gradient approximation and the
step length converges to zero as a certain rate, the stochastic approximation
method can be shown to converge to a local optimum of the underlying function.
The approximate gradient Vf (x_{k}) is estimated using sample responses F.
Popular gradient estimation methods include
perturbation analysis and the likelihood ratio method [43]. Spall's lab
develops simultaneous perturbation (SP) and finite difference (FD) based
stochastic approximation algorithms: SPSA and FDSA. The lab's web site (www.jhuapl.edu/SPSA)
provides a
good source of references on these two algorithms.
Derivativefree optimization methods Derivativefree optimization
methods [80] are a class of methods that do not try to utilize or directly
estimate the gradient value, thus are a good fit for the optimization problem
(1.1). Compared to stochastic approximation algorithms, the derivativefree
methods avoid the gradient estimation step, which is sensitive and crucial to
the convergence of these algorithms. In many practical examples, we find that the
gradient estimation tools often become incorrect and problematic when the
gradient value gets close to zero (i.e., when near a local solution).
There are two categories of derivativefree methods: the socalled
modelbased approach [20, 21] and the pattern or geometrybased approach. The
modelbased approach typically constructs a chain of local models that approximate
the objective function, and the algorithm proceeds based on model predictions;
an alternative to the modelbased approach is the patternbased approach, which
directly uses the functional output at locations specified by geometric
arguments, such as the pattern search method [70].
One of the
optimization methods we apply in the thesis is the UOBYQA algorithm [85]
(Unconstrained Optimization BY Quadratic Approximation), which is a modelbased
method. It is designed for solving nonlinear problems with a moderate number of
dimensions. The method shares similarities to response surface methodology
(RSM): both of them construct a series of models to the simulation response
during the optimization process. However, many features in UOBYQA, such as the
quadratic model update and local region shift have advantages over the
classical RSM, and UOBYQA is more mathematically rigorous in convergence.
Dynamic programming and neurodynamic programming Dynamic Programming (DP) [6]
problems are a special type of simulationbased optimization problems with
internal time stages and state transitions. The objective function is not a
single blackbox output, but typically is a combination of intermediate costs
during state transitions plus a cost measurement of the final state.
Appropriate controls are determined at each time stage, typically in a
sequential favor.
NeuroDynamic Programming (NDP) [9, 47, 106] is a class
of reinforcement learning methods to solve complex DP problems. The central
idea of NDP is to approximate the optimal costtogo function using simple
structured functions, such as neuro networks or simulation evaluations. NDP obtains
suboptimal controls, trading off expensive computational costs. This feature
distinguishes NDP methods from standard DP methods.
1.1.2 White Noise vs. Common Random Numbers
To handle stochastic functional output, we adopt a
simple and effective approach  sample multiple replications per point and
take the average to reduce the uncertainty. Algorithms without such a procedure
can obtain a good solution but not the exact solution, because a
singlereplication observation is inevitably affected by noise. The trickly
question of setting the replication number is actually algorithm dependent. We
have applied tools from Bayesian inference in a novel way to facilitate this
computation process.
We separate our optimization
approaches into two cases, using an important property of the simulation
system, which is based on properties of the replications.
The white noise case corresponds to the situation when simulation outputs
F(x,£(u))
are independent for different simulation runs, whatever the input x is. The random component £(u)
of the function is not fixable. The simulation outputs are observed to be
biased by white noise. The second case is the implementation of Common Random Numbers (CRN), which assumes the
random factor £(u)
can be fixed; for example, this can be achieved by fixing the random seed in
the computer simulation process. One of the advantageous properties of using
CRN is that given a realized sample the sample response function F(x,^) is a deterministic function.
The introduction of CRN significantly reduces the level of randomness and can
facilitate comparisons between different x.
When CRN is used, the expected
value function f (x)
in (1.1) can be approximated by a deterministic averaged sample response
function
1 ^{N}
f (x) « f^{N}(x):= F(x,&, (1.3)
where is the number of samples. This is the core idea
of the samplepath method [35, 49, 50, 83, 84, 89],
which is a wellrecognized method in simulationbased optimization. The
samplepath method is sometimes called the Monte Carlo sampling approach [99] or the sample average approximation
method [52,
53, 63, 98, 100, 101]. The samplepath method has been applied in many
settings, including buffer allocation, tandem queue servers, network design,
etc. A wealth of deterministic optimization techniques can be employed to solve
the samplepath problem
min f^{N}(x), (1.4)
which serves as a substitute for (1.1). An optimal
solution x*^{N}
to the
problem (1.4) is then treated as an approximation of x*, the solution of
(1.1). Note that the method is not restricted to bounded problems, but in more
general settings it requires appropriate deterministic tools (i.e., constrained
optimization methods) to be used.
^{1}The functions f^{N} epiconverges to a function f if the epigraphs of the
function f^{N }converges to the epigraph of the function f [91].

Convergence proofs of the samplepath method are given in [89, 97]. Suppose
there is a unique solution x^{*} to the problem (1.1), then under the assumption that
the sequence of functions {f^{N}} epiconverges^{1} to
the function f, the optimal solution sequence {x*^{N}} converges to x* almost surely for all sample
paths. See Figure 2 for the illustration of the samplepath optimization
method. Starting from x_{0}, for a given N, a deterministic algorithm is
applied to solve the samplepath problem. Invoking the above analysis then
allows us to use the solution x*^{N} as an approximation to the true solution x^{*}.
1.2 Bayesian Analysis in Simulation
We have extensively used Bayesian analysis in our
optimization methods. Our analysis is inspired by its application in selecting the best system, which is one of the
fundamental problems in discrete simulationbased optimization.
In Bayesian analysis, a posterior distribution for the
simulation output, including both mean and variance information, is derived.
Since the posterior distribution has a parametric form, it can be readily
assembled and used in optimization methods.
Another important usage of Bayesian analysis is Monte Carlo validation. When Bayesian posterior
distributions cannot be used directly in the algorithms, the Monte Carlo
validation step serves as an alternative. This idea is motivated by using Monte
Carlo simulation of random variables to construct statistical estimators for
unknown quantities. For example, in the computation of the expectation of a
random variable, a simple way to accomplish this is to generate a large number
of realizations from the random variable and take the arithmetic mean (or
compute the sample mean). This provides a so called Monte Carlo estimator for the unknown expectation.
Note that one crucial part of the Monte Carlo method necessitates an efficient
way to generate random samples from a given distribution.
We adopt the same idea in Monte Carlo validation.
Assume that Bayesian posterior distributions can capture the real uncertainty
of the unknown random quantity well from a practical standpoint. In order to
valid a given variance criterion, we can extract a vast amount of samples from
the derived posterior distributions and plug them into the criterion to
validate it. For example, we may observe that 1 — a of all the samples satisfy the
criterion, therefore, we can assert that such a criterion can be satisfied with
an accuracy of 1 — a. The Monte Carlo validation step avoids using the forms
of the Bayesian distributions directly.
We will introduce the problem of selecting the best
system in Section 1.2.1 and the role which Bayesian analysis plays. In Section
1.2.2, we describe the construction of Bayesian posterior estimations and in
Section 1.2.3, we illustrate how the analysis is applied to a simple selecting
the best system problem.
1.2.1 Selecting the Best System
The goal of selecting the best system is to identify
the best system from a collection of candidate systems. The type of problem is
encountered many times (in various forms) in our algorithmic designs, as will
be shown later. Selecting the best system is equivalent to solving a discrete
optimization problem:
argmin E (Yi), (1.5)
i
where Y_{i} is a random variable representing the output of system i, i = 1, 2, • • • ,K. (Without loss of generality,
we assume that a desired system has the smallest expected mean.) Let fi_{i} be the unknown mean of Y_{i} and > H[_{2}\ > • •
• > H[k\
be the
ordered sequence of the means but such a sequence is unknown. We prefer to
determine the index [k] and there is a
probability of correct selection (PCS) value that quantifies the
correctness: PCS =
Pr(select Y_{[K}]\> fi[K] ,j = 1, 2, ■■■ ,K  1). (1.6)
The difficulty of selecting the best system is to
design a statistically accurate procedure that identifies the best system with
the condition
PCS > 1 
a,
where a is a threshold value, while using the least number of
realized samples of Yi.
The standard approach to selecting the best system
includes IndifferenceZone
Ranking and Selection (IZR&S). An indifferencezone parameter 8 should first be prescribed.
When the means of two systems are within a tolerance 8, it is indifferent to
choose either system. One of the most influential IZR&S approaches is a
twostage sampling strategy, described by Rinott [88] in 1978. Firstly, r_{0} replications of each system
are simulated to estimate the preliminary sample variance. Secondly, it
evaluates additional replications for each system, whereby numbers of
replications are determined by the corresponding variance information. The best
system is then chosen as the one having the smallest sample mean. Kim and
Nelson [61] include a screen step that is designed to eliminate a set of
inferior systems from consideration at an early stage. A good survey on the
IZR&S procedures can be found in [62].
The Bayesian method is an
alternative to the standard IZR&S [16, 17, 19]. The idea is to construct
posterior distributions for the underlying mean j_{i} using observations and
acquired knowledge. The selection accuracy is accordingly estimated via joint
posterior distributions. There are many practically efficient procedures
including Chen's OCBA [16] (Optimal Computer Budget Allocation) and Chick's
01(B) [17, 19]. The report [56] offers an empirical comparison of these
methods, as well as Rinott's twostage IZR&S method.
The Bayesian approach has advantages over the
traditional IZR&S approach. First, it is not necessary to set the
indifferencezone parameter 5. In fact, it deals with the ranking and selection
problem (1.5) directly. Moreover, it utilizes both the mean and variance
information of the data, while IZR&S only uses the variance information
directly.
1.2.2 Constructing Bayesian Posterior Estimations
The Bayesian approach described in this subsection is
standard and close to that presented in Chick's papers [17, 19].
Given a set of points [x^{1},x^{2},... ,x^{L}}, we assume the simulation output at these points
F = (F(x^{1}, £(u)), F(x^{2}, £(u)),..., F(x^{L}, £(u)))
is a multivariate normal
variable, with mean j = (^(x^{1}),..., fi(x^{L})) and
covariance matrix E:
F ~ N(p, E).
(1.7)
In the white noise case, since simulation outputs are
independent, E should be a diagonal matrix; while for the CRN case, it is
typically not. Suppose we evaluate N replications of simulation runs at each point, the
existing data X can
be accumulated as an N x L matrix, with
Let p and E denote the sample mean and sample covariance
matrix of the data. For simplicity, we introduce the notation s_{i} = (F(x^{1},^),..., F(x^{L}, ^)), i 1,..., N, so that
r
s1
X
S2
^{s}N
The sample mean and sample
covariance matrix are calculated as
1
N
N
i=1
(f^{N} (x^{1}),...,^ (x^{L})),
(1.8)
and
1
N1
N
i=1
(1.9)
Since we do not have any prior
assumption for the distributions of j and E, we assign noninformative prior distributions
for them. In doing this, the joint posterior distributions of j and E are derived as
E\X ~ WishartL(E,N + L  2),
j\E,X ~ N(j, E/N). (1.10)
Here the Wishart distribution Wishart_{p}(v,m) has covariance matrix v and m degrees of freedom [25]. The
Wishart distribution is a multivariate generalization of the x^{2} distribution.
The distribution of the mean value is of most interest
to us. When the sample size is large, we can replace the covariance matrix E in
(1.10) with the sample covariance matrix E, and asymptotically derive the
posterior distribution of \ X as
j\X ~ N(j,E/N). (1.11)
It should be noted that, with an exact computation, the
marginal distribution of \ X inferred by (1.10)
(eliminating E) is,
j\X ~ StL(/,NE^{1},N  1), (1.12)
where a random variable with Student's tdistribution St_{L}(j, K, m) has mean j, precision k, and m degrees of freedom. The normal
formulation (1.11) is more convenient to manipulate than the tversion (1.12).
We have applied both forms in our algorithms when appropriate, but typically
prefer the simpler normal form.
Particularly, in the white noise case, the distribution
in (1.12) is separable. We have the component posterior distribution
j(x^{i})\X ~ N(j(x^{i}),a^{2}(x^{i})/N),
where a^{2}(x^{i}) is the sample mean for the
point x^{%}
and is the
ith component in the diagonal of the matrix E.
While the multivariate normal assumption (1.7) is not
always valid, several relevant points indicate that it is likely to be
satisfied in practice [18].
•
The form (1.7) is only used to derive the (normal)
posterior distribution j\X.
•
Other types of distribution assumptions may be
appropriate in different circumstances. For example, when a simulation output
follows a Bernoulli 01 distribution, then it would be easier to perform parameter
analysis using beta prior and posterior distributions. The normal assumption
(1.7) is the more relevant to continuous simulation output with unknown mean
and variance.
•
The normal assumption is asymptotically valid for many
applications. Many regular distributions, such as distributions from the
exponential family, are normallike distributions. The analysis using normal
distributions is asymptotically correct.
1.2.3 The Bayesian Method for Selecting the Best
System
As an introductory application, we consider a
preliminary case in selecting the best system when there are only two systems
involved, K
= 2. A
simplified analysis on how to compute the PCS is presented as follows. Interested
readers can seek details (e.g., multiple comparisons and procedures) in [17,
19].
Let X = {yij,i = 1, 2,j = 1, 2, • • • ,r_{i}} denote two sequences of output
of both systems. The sample mean ji_{i} and sample variance a^{2}
are defined as Ui =
E^{r}_{3}Li ^{y}i,j/n ^{and}
A decision is drawn by directly comparing the two
sample means. The system evidenced with a smaller average output is selected:
{

1, if Ui < U2; 2, otherwise.
Without loss of generality, we assume that we observe
the order of the sample means U_{i} < U_{2} and select the first system as
the winner.
In the Bayesian framework, the
posterior distribution of fi_{i}\X can be asymptotically
estimated as
Ui\X ~ N(fii,a^{2}/Ti). (1.13)
Following our earlier assumption, the PCS corresponds to the probability
of event {j1 < j^},
PCS ~ Pr(j1 < j2\ X) = Pr(jn\X  j2\X < 0). (1.14)
The difference of two normal random variables j_{1} \X and ji_{2}\X remains a normal random
variable. It is straightforward to show:
Therefore, the PCS defined in (1.14) is a tail probability of a normal
distribution:
PCS ~ Pr(N jj2, a1/n + a\/r_{2}) < 0). (1.15)
The computation corresponds to a single cdf evaluation
of a onedimensional normal distribution. The value can be computed either by
looking up in a standard cdf table or using routines in mathematics software,
i.e., Matlab's function normcdf.
In general, for formulation (1.15), one may expect the PCS value to become higher as the
numbers of replications r_{1} and r_{2} increase.
1.3 The TwoPhase Optimization
Framework
The thesis focuses on a twophase optimization
framework for solving (1.1), which is motivated by response surface
methodology. Recalling the model construction step (1.2), the overall error in
approximation comes from two sources: random error and model error. Random
error is the type of error directly induced by the uncertainty £(u) in the sample response; model
error is due to an inappropriate choice of functional forms to fit the data.
For example, fitting a cubic curve with a quadratic function will result in a
large discrepancy, when the domain of interest is not appropriately restricted.
This type of error exists independent of the random term £(<j). In the global view, the model
error dominates the random error; therefore, constructing a surrogate function
aims at reducing the model error to the maximum extent. However, in the local
view, reducing the random error becomes the primary consideration. For this
reason, we design a twophase framework for simulationbased optimization which
addresses distinct goals:
1.
Phase I is a global exploration and rough search step.
The algorithm explores the entire domain and proceeds to determine potentially
good subregions for future investigation. We assume the observed randomness in
the subregion detection process is insignificant and can be ignored.
Appropriate criteria to determine when a good subregion has been identified are
required.
2.
Phase II is a local exploitation step. Local
optimization algorithms are applied in each subregion to determine the exact
optimum. The algorithms are required to deal with the randomness explicitly,
since in local subregions, random error is considered as the dominating
feature.
Phase I typically produces one or multiple good points
representing centers of good local subregions. These points are used as
starting points for the Phase II local search algorithms. Alternatively, Phase
I may generate geometric information for the location and shape of the
subregions, which are more interesting for analyzing the significance,
interactions and patterns of individual dimensions.
1.3.1 The WISOPT Structure
We design an optimization package WISOPT (which stands
for Wisconsin Simulation OPTimization) incorporating different optimization
methodologies, based on the twophase framework. See Figure 3 for a flow
chart.
Phase I is a global exploration step over the entire
domain. Algorithms in Phase I should be able to generate (and evaluate) densely
distributed samples in promising subregions and sparsely distributed samples in
inferior subregions. The entire set of samples will be passed to a phase
transition procedure between Phase I and Phase II, which implements a
nonparametric statistical method to determine starting points and surrounding
subregions for multistart Phase II optimizations.
One of our Phase I methods
employs classification tools to facilitate the global search process. By
learning a surrogate from existing data the approach identifies promising
subregions and generates dense samples in the subregions. Additional features
of the method are: (a) more reliable predictions obtained using a voting
scheme combining the options of multiple classifiers, (b) a data preprocessing
step that copes with imbalanced training data. Another Phase I method is the
Noisy DIRECT (DIviding RECTangles) algorithm, which is an extension of the
DIRECT optimization algorithm to the noisy case. As with the
classificationbased global search, the method returns a collection of
promising points together with surrounding rectangles.
Phase II performs local
derivativefree optimization based on the UOBYQA (Unconstrained Optimization BY
Quadratic Approximation) algorithm, in each of the subregions identified. We do
consider whether we can implement CRN in the simulator, corresponding to the
VNSPUOBYQA (VariableNumber SamplePath UOBYQA) algorithm for the CRN case and
the Noisy UOBYQA algorithm for the white noise case. Both algorithms apply
Bayesian techniques to guide appropriate sampling strategies while
simultaneously enhancing algorithmic efficiency to obtain solutions of a
desired accuracy. The statistically accurate scheme determines the number of
simulation runs and guarantees the global convergence of the algorithm.
A key component in the extended algorithms is to
incorporate distribution information provided by Bayesian estimations.
Bayesian inference is an effective tool in input modeling and uncertainty
analysis. In particular, in order to prevent unnecessarily exploring
hyperrectangles in inferior regions,
DIRECT tests the accuracy of selecting hyperrectangles
for future partitioning using a Monte Carlo validation process. Trial samples
are extracted from the Bayesian posterior distributions to validate a
predefined variance criterion. In UOBYQA algorithms, we derive the posterior
volatility for the local model construction step, and therefore control the
uncertainty of solutions in the trust region subproblems.
Certainly, the variety of methods in both phases is not
restricted to what we present. Additional methods that satisfy the purposes of
both phases may work as new modules and can be plugged into the twophase
framework.
1.3.2 Introduction to the Methodologies
We give a brief introduction to the methodologies we
use in WISOPT (refer to Figure 3). Details will be discussed in later chapters.
Classificationbased global optimization A good surrogate model for the
entire space (often noted as a metamodel) may require a large amount of
simulations and can be very expensive to compute, but it may be capable of
approximating global behavior of the underlying function f. Since Phase I only attempts
to determine promising subregions of the search space, the model can be
constructed in a coarser manner.
Classificationbased global
optimization
or
Noisy DIRECT
Phase II

Phase transition
VNSPUOBYQA
Noisy UOBYQA
Figure 3: The twophase WISOPT
structure.
We are indeed interested in the behavior of a simple indicator function:
(1.16)
where promising subregions in the method correspond to
level sets. This function gives sufficient information to determine where a
subregion is located. Approximating the indicator function I is simpler than approximating
the underlying function f, especially in a high dimension case. Normally, we
sample dense designs in the subregion and sparse designs out of the subregion.
In the classificationbased global search, we utilize
the predicting power of a classifier: the classifier works as a surrogate
function for the indicator function I, which reveals the location of promising subregions. A
classifier is a cheap mechanism to predict whether new samples are in promising
subregions or not, thus we can generate a dense collection of points in these
subregions. Figure 4 illustrates the local regions (the dotted circles and the
level sets) and the samples (the '+'s). The method fits well to the theme of the
Phase I optimization, which is to identify local regions. In fact, the method
is relatively insensitive to noise, because the simplification step (1.16)
smoothes out the occurrence of noise. That is reason we normally do not use
replicated samples in training the classifiers. Setting the algorithm may require
specific parameters; for example, we have to potentially understand the proper
number of samples to use in training the classifiers.
Classification forms a major
branch in in machine learning/data mining [51, 77]. In the past couple of years, there has been
increasing interest on techniques interfacing optimization and machine
learning. For example, Kim and Ding [60] have implemented data mining tools in
an engineering design optimization problem. Mangasarian [74] has enhanced
optimization robustness in support vector machines.
10, , , , , , , , , , 1
10 8
6 4 2
0 2 4
6 8 10
Figure 4: Predicated local subregions of the Griewank
function. The function has local optimums in each subregion (circles) and the
global optimum at [0, 0].
The Noisy DIRECT
algorithm The DIRECT optimization method
[37, 38, 58, 59] is a deterministic global optimization algorithm for
boundconstrained problems. The algorithm, first motivated by Lipschitz
optimization [59], has proven to be effective in a wide range of application
domains. The algorithm centers around a spacepartitioning scheme that divides
large hyperrectangles into small ones. Promising hyperrectangles are subject to
further division. Figure 5 provides an illustration of the algorithm on the
Goldstein Price function. The algorithm therefore proceeds to gradually explore
promising subregions.
Figure 5: The DIRECT optimization algorithm on the
Goldstein Price function. The contour plot of the Goldstein Price function is
shown in Figure 15.
When the objective function is
subjected to uncertainty, some crucial operational steps of the DIRECT
algorithm are affected. For example, the choice of potentially optimal hyperrectangles
becomes incorrect because of the noisy function values, possibly misleading the
algorithm to search in inferior regions. We modify the original DIRECT
algorithm using a simple approach  multiple replications are sampled to reduce
output uncertainty. We expect the algorithm to proceed correctly as in the
deterministic case. However, we must face the issue of handling the tradeoff
between two design goals: efficiency of the algorithm versus total computational effort. Since the objective function
is often computationally expensive to evaluate, we must be very cautious in
using function evaluations. On the other hand, we need to maintain a certain
precision in the functions for correctness of the algorithm. In our
modification, we apply Bayesian techniques to derive a posterior distribution
for the function output at each point, and incorporate the distribution
information into the algorithm to determine an appropriate number of
replications to be used.
The parameters for this algorithm are easy to set. Since deterministic
DIRECT is designed for both global and local optimization, one should note that
it is desirable to terminate the algorithm at a reasonable time, at which
sufficient information of local regions can be identified. This can avoid unnecessary
function evaluations for handling comparisons that are really dominated by
noise.
The phase transition Using the evaluated samples in
Phase I, the phase transition procedure consists of a nonparametric local
quadratic regression method to determine the appropriate subregion size. Being
different from regular regression methods, which use the entire set of samples
in the domain to construct one model, local regression makes a prediction (at a
point) using a local model based on samples within a 'window size', thus the
approach values the local behavior of a function more. 'Nonparametric' means
the regression model is not from a single parametric family. It is presumed
that the samples outside the local region have a slight relevance to the current
prediction. In our procedure, we treat the resulting 'window size' as our
subregion radius.
A sequence of good starting
points is generated, satisfying the criteria: (a) each starting point is the
center of a subregion, (b) the subregions are mutually separated. The sequence
of starting points and the subregion sizes are passed to Phase II for local
processing, possibly in a parallel setting.
Extended UOBYQA algorithms In Phase II, the deterministic
UOBYQA algorithm is applied as the base local search method and is extended for
noisy function optimization. The method is an iterative algorithm in a trust
region framework [80], but it differs from a classical trust region method in
that it creates quadratic models by interpolating a set of sample points instead
of using the gradient and Hessian values of the objective function (thus making
it a derivativefree tool). Besides UOBYQA, other modelbased software include
WEDGE [75] and NEWUOA [86]. We choose UOBYQA, because it
We developed variants of
the original UOBYQA, called the VNSPUOBYQA and the Noisy UOBYQA, that are
adapted for noisy optimization problems. The extension idea is similar to that
of the Noisy DIRECT algorithm. We sample multiple replications per point to
reduce variance and apply Bayesian techniques to guide appropriate sampling
strategies to estimate the objective function. The two algorithms employ
different mechanisms in the sampling process. The VNSPUOBYQA determines
appropriate replication numbers by whether sufficient reduction is identified
in the trustregion subproblem, while the Noisy UOBYQA determines the number by
whether the quadratic models can be shown to be stable or not. Generally
speaking, when CRN is implemented, the noise is relatively easy to handle
because it is correlated among sites. Therefore, it is shown that the
VNSPUOBYQA has better convergence properties.
1.4 Outline of the Thesis
The general theme of thesis centers around the WISOPT
package, which is designed based on the twophase optimization framework. Phase
I techniques are described in Chapter 2 and Phase II techniques are described
in Chapter 3.
More specifically, the first section of Chapter 2
presents the classificationbased global search. The detailed procedures of
applying several types of classifiers are included, together with two special
features: a voting scheme to assemble multiple classifiers and imbalanced data
handling. The second section introduces and explains the development of the
Noisy DIRECT algorithm. We will first describe the deterministic DIRECT
algorithm, then analyze how to enhance the algorithm under uncertain
conditions. Several numerical examples and a simulation problem are presented.
The third section is on the procedure of the phase transition.
Chapter 3 contains the details about the Phase II noisy versions of the
UOBYQA algorithm. The ideas of the extensions of both the VNSPUOBYQA and the
Noisy UOBYQA are similar: the Bayesian posterior distributions for the
parameters of the quadratic model are derived and further we can estimate the
stability of the algorithms. Since the VNSPUOBYQA is in the CRN setting, the
corresponding section is written in a more rigorous manner, with the
convergence proof of the algorithm provided.
We show in Chapter 4 two
realworld simulation optimization examples. We aim to fine tune the simulation
parameters in the Wisconsin Breast Cancer Epidemiology model, which is from a
collaborative project with researchers in the Medical School at the University
of Wisconsin. In the second example, we optimize the shape of a type of
microwave coaxial antenna for hepatic tumor ablation, which is a current
ongoing project with the Biomedical Engineering Department at the University
of Wisconsin.
In Chapter 5 we demonstrate a special simulationbased problem in dynamic
programming. This type of simulation problem has an internal time structure 
the objective function is not a pure blackbox stochastic function, but is
constituted of a sequence of costs along the time stages. We modify the rollout
algorithm from neurodynamic programming, using a similar Bayesian analysis to
that outline above to improve the simulation efficiency in training and in
deriving optimal controls at each stage. We illustrate the effectiveness of
our new algorithm using an example in fractionated radiotherapy treatment. This
approach to using Bayesian analysis in neurodynamic programming effectively
generalizes the methods of this thesis to time domain problems, and leads to
further possible applications in other optimization contexts.
0 comments:
Post a Comment