As a follow-up to my previous post on a unified framework to stochastic optimisation, here is a 2008 TutORial in Operations Research by Warren Powell and Peter Frazier on Optimal Learning, which seeks to solve the problem of optimally collecting information with which to make decisions.

There are two types of learning: online and offline. Online learning describes a setting where we incur the cost of traveling a link at the same moment as we learn about the cost. ("Measurements occur as we operate the system.") Offline learning refers to a case where we first have a budget to learn the information and then we operate the system with the best design we have found. The authors describe many examples of learning, especially the learning shortest path problem, where the costs of traveling the arcs can be revealed in three different ways: (1) only after traversing the link, (2) when the traveler arrives at the origin node of the arc he is about to travel, (3) at the beginning, before the traveler starts his trip. The more challenging problem is posed by Type-(2) networks.

The authors point out that optimal learning is not the same as stochastic (data-driven) optimization. Optimal learning is about using observations "to improve our distribution of belief about the stochastic process driving the problem". The authors add: "A central feature of optimal learning problems is the ability to make decisions that reveal information, which is then used to reduce the overall level of uncertainty about the system." They describe the elements of a learning problem, including the states of the system, the types of decision, exogenous information. They also discuss transition functions for the frequentist view (data-driven view, where we begin with no information about the state of the system) and the Bayesian view (where we begin with a prior distribution of belief about the unknown parameters, take measurements and update our distribution into a posterior distribution). (When we end up with the same class of distribution that we started with, we say we have a *conjugate* family. Alternatively, we may impose conjugacy as an approximation.)

Then Powell and Frazier discuss objective functions: how to evaluate whether we are making better decisions with our measurements and how we are managing the economics of measurement and the evaluation of the solution. They state: "There are two broad settings in which optimal learning arises: (1) conducting a series of measurements to design a process or a system, where we are restricted by some sort of a measurement budget; (2) controlling a system where we collect information as we are managing the system." The first case is called *offline learning* and the second case is called *online learning*.

The objectives themselves that we might use in optimal learning can belong to a vast range. A possible choice would be the expected value, maximizing the expected value of choice after the measurements. Another one would be the probability of correct selection. Yet another one would be "to maximize the likelihood that we make a choice that is almost as good as the best," which is known as indifference zone selection and is related to some current research ideas of mine. A fourth and fifth types of objective are least square error and entropy minimization. ("Entropy is a measure of uncertainty that can be used for numeric and nonnumeric data... The entropy is largest when the density is closest to the uniform distribution. If the entropy is zero, then we know [the parameter] perfectly. Thus we can try to take measurements that reduce the entropy of the distribution that describes our knowledge about a parameter.")

The authors then discuss measurement policies, i.e., "rules that tell us which action we should take next in order to serve something new." They distinguish between deterministic policies (where the manager decides what he is going to measure before he begins making any measurements) and sequential policies, "where the decision of what to measure next may depend on past measurements." They present Bellman's equations characterizing the optimal action but point out that those equations are virtually impossible to solve even for very small problems. They describe heuristic policies for discrete selection problems:

- Pure exploration (to "estimat[e] the value of each choice, as opposed to making a good economic decision"
- Pure exploitation (making the best decision given our current set of estimates every time)
- Mixed exploration and exploitation: a strategy where we explore with a given probability (that doesn't change with the number of observations) and exploit with the complementary probability
- Epsilon-greedy exploration: this strategy takes into account the number of iterations by noting that "in the beginning, it is better to explore. As we build confidence in our estimates, we would prefer to exploit more."
- Boltzmann exploration (sampling each measurement with a probability that maximises the soft max)
- Interval estimation (my favorite, which maximises the upper limit of a confidence interval of the value of a measurement. This was pioneered by Kaelbling, L.P.,
*Learning in Embedded Systems*, MIT Press, Cambridge, MA, 1993.)

Stopping problems are, in the authors' words, "a simple but important class of learning problems." A well-known classic problem is known as the secretary problem, involves the challenge of interviewing a sequence of candidates for a secretarial position. "The problem involves the tradeoff between observing candidates (which allows us to collect information) and making a decision (exploiting the information)". The authors remind the readers, based on Puterman's work, that if we have N candidates to interview, we should interview m before offering anyone the job. "The first m candidates, then, are used to learn something about the quality of the candidates. After interviewing m candidates, then the best strategy is to take the first candidate who is the best candidate interviewed so far."

Multiarmed bandit problems are one of the best known problems in information collection. (It is motivated by playing slot machines.) This leads to the basic theory of Gittins indices. The authors' Section 8 is devoted to this problem. They then describe ranking and selection problems and provide an overview of methods, going in detail over the knowledge-gradient method, first for independent measurements and then correlated measurements.

Independent measurements arise when measuring one choice does not tell us anything about measurements about another choice. For instance, if two new technologies are radically different, measuring the efficiency of one tells us zero about the efficiency of another. The idea behind the knowledge gradient is to consider a myopic policy where we choose an action now to maximize the expected value of the next state, rather than picking the best option right now. We can think of it as maximising the incremental value (or gradient with respect to the measurement) given the current state. This was first introduced by Gupta and Miescke in 1996 and studied further by Frazier, Powell and Dayanik in 2008 (they are the ones who coined the term knowledge gradient). Powell and Frazier in their tutorial also explain how to compute the normalised influence of a certain decision and the knowledge gradient index. It is possible to show that the knowledge gradient policy is asymptotically optimal.

Correlated measurements refer to the case where making one measurement (for instance demand for an item at a certain price) tells us something about what we will observe if we make other measurements (observing demand at other prices). This setting was investigated by Frazier, Powell and Dayanik in 2009, assuming knowledge of the covariance matrix. They discuss how to write the updated mean from making a measurement using the vector version of the Bayesian updating formulas, the Sherman-Woodbury equations and an analytical form. They provide the equation to compute the knowledge gradient policy for correlated measurements, but the expected value in the formula is a lot harder to compute due to correlation. While an approximation using Monte Carlo methods is possible, the authors describe a procedure by computing a sequence of lines with increasing slopes and finding the range for which a particular choice dominates. This allows for a simple analytical expression to compute the expected value as a sum of well-chosen terms. The authors also discuss how to compute the knowledge gradient in the case of online learning.

**Some references:**

Gittins, JC. Bandit processes and dynamic allocation indices. *Journal of the Royal Statistical Society, Series B*, 14:148-177, 1979.

Frazier, P, W. Powell and S Dayanik. A Knowledge-Gradient Policy for Sequential Information Collection. *SIAM Journal on Control and Optimization.* 47(5):2410-2439, 2008.

Frazier, P, W. Powell and S Dayanik. The Knowledge-Gradient Policy for Correlated Normal Beliefs. *INFORMS Journal on Computing.* 21(4):599-613, 2009. Available here.

## Comments