d
d
d
Data Driven Decisions @ The Ohio State University
Industrial and Systems Engineering
Multi-stage Stochastic Programming,
Stochastic Decomposition, and
Connections to Dynamic Programming:
An Algorithmic Perspective
Suvrajeet Sen
Data Driven Decisions Lab, ISE Department
Ohio State University
Columbus, OH
d
d
d
The Plan – Part I (70 minutes)
∎ The Mathematical Models
❑ 2-stage Stochastic Programs with Recourse
❑ T-stage Stochastic Programs with Recourse
❑ Non-anticipativity and Measurability of Decisions
∎ Energy Infrastructure Applications of SP
∎ Foundations for 2-stage Programs
❑ Subgradients of the Expected Recourse Function
❑ Subgradient and Stochastic Quasi-gradient Methods
❑ Deterministic Decomposition (Kelley/Benders/L-
shaped)
❑ Stochastic Decomposition
d
d
d
The Plan - Part II
∎ Algorithms for T-stage Programs
❑ Scenario Decomposition
∎ Progressive Hedging Algorithm
❑ Tree-traversal (time-staged) methods
∎ Nested Benders�� Decomposition & Variants
❑ Discrete-time Dynamic Systems (not covered)
❑ Simulating Optimization (not covered)
∎ Multi-stage Extensions Stochastic Decomposition
∎ Comparative Remarks and Conclusions
The Plan – Part II (70 minutes)
d
d
d
The Mathematical Models
d
d
d
2-stage Stochastic Program with Recourse
Deterministic Data
Avoids having to deal
with Zero Prob. Events
Stochastic Data
SP as a
Large-Scale
Linear Program
d
d
d
T-stage Stochastic Program with Recourse
a.s.
Nested Expectation Formulation �� a la DP
Deterministic Data
Stochastic Data
Non-anticipative
d
d
d
What do we mean by
Non-anticipativity?
Non-anticipative
Path 1
Path 2
Since paths 1 and 2 share the
same history until t=2
They must also share the
same decisions until t=2
Scenario
Tree
x
11
x
12
x
13
x
23
x
22
x
33
x
43
x
32
x
53
x
63
x
42
x
21
x
73
x
83
d
d
d
Observations of the Decision Process
(x
0
, x
11
, x
12
, x
13
)
(x
0
, x
11
, x
12
, x
23
)
(x
0
, x
11
, x
22
, x
33
)
(x
0
, x
11
, x
22
, x
43
)
(x
0
, x
21
, x
32
, x
53
)
(x
0
, x
21
, x
32
, x
63
)
(x
0
, x
21
, x
42
, x
73
)
(x
0
, x
12
, x
42
, x
83
)
Scenario
Tree
x
11
x
12
x
13
x
23
x
22
x
33
x
43
x
32
x
53
x
63
x
42
x
21
x
73
x
83
d
d
d
Probabilistic Language:
Measurability
is measureable wrt to ��-algebra ℱ
t
If we look at as a
stochastic process
then one can assign
probability measures
(on decisions) that
are consistent with
the stochastic
process embedded in
the scenario tree.
d
d
d
What does this mean for algorithms?
Non-anticipative
For Multi-stage SP Models, it is necessary to track decisions
at each node of the scenario tree. So, for
Multi-stage SD, we will track decisions by node number.
Scenario
Tree
x
11
x
12
x
13
x
23
x
22
x
33
x
43
x
32
x
53
x
63
x
42
x
21
x
73
x
83
d
d
d
Matching mathematical models with algorithms
∎ Stochastic LPs as Linear Programs
❑ Specialized Factorization Methods for Simplex
and Interior Point Methods (few scenarios)
∎ Scenario Decomposition
❑ Progressive Hedging Algorithm
∎ Tree-traversal (time-staged) methods
❑ Nested Benders�� Decomposition & Variants
∎ Simulating Optimization (time permitting)
❑ Stochastic Decomposition
d
d
d
Energy Infrastructure
Applications of SP
d
d
d
Why Stochastic Programming in Power
Systems? Long History
∎ Reliability Metrics in wide-spread use
❑
Loss of Load Probability (LOLP)
❑
Loss of Load Expectation (LOLE)
❑
Expected Loss of Load Duration (ELOLD)
∎ Some SP References Prior to 2000
❑
Murphy, Sen and Soyster (1981) �� Generation Planning
❑
Louveaux and Smeers (1981) �� Generation Planning
❑
Sherali, Sen and Soyster (1984) �� Electricity Prices
❑
Prekopa and Boros (1989) �� System Reliability
❑
Hobbs and Maheshwari (1990) �� System Planning
❑
Frauendorfer, Glavitsch, and Bacher (1992) �� System Operations
❑
Takriti, Birge and Long (1996) �� Stochastic UC �� SMIP
❑
Takriti , Kassenbrink and Wu (2000) �� Electricity Contracts
∎ Conference of Probabilistic Methods Applied to Power Systems
(since the early 1990��s)
d
d
d
Economic Dispatch Problem
❑ Next generation power grids highly dynamic
▪ Distributed Storage
▪ Cogeneration
▪ Large Scale Renewable Generation
▪ Real-time Pricing
❑ Mandates more proactive and fast operational systems like
Economic Dispatch.
❑ The ED system updates the output levels of the committed
generators to match the load demands in a cost-optimal manner.
▪ Present ED systems uses forecast of the order of 2 hours
(myopic).
▪ Steep trends like wind ramping deteriorate the ED system.
▪ Increasing the foresight and resolution of the ED problem comes
at the expense of additional computational complexity.
d
d
d
Economic Dispatch Problem
Loads
Wind
Generator
Thermal
Generators
d
d
d
ED Problem Formulation Features
(V. Zavala, Argonne National Labs)
∎ 2-stage model of a T-period application
❑ Decisions for first stage �� played out over the next T-1 periods
as the second stage
❑ Randomness in second-stage is wind
∎ Each stage has Cost of Generation and following
constraints
❑ Generation capacity constraints
❑ Power Flow
❑ Flow Balance
❑ Power Flow Bounds
❑ Bus-angle Bounds
❑ Wind ramp constraints (randomness)
❑ Generation ramp constraints
∎ First stage only has generation ramp constraints (bounds)
d
d
d
Foundations for 2-stage
Programs: A Review of Basics
d
d
d
∎ Let be a random variable defined on a
probability space
∎ Then ��static�� formulation of a stochastic
program is given by
∎ Why call it ��static��?
❑
(the outcome) is revealed once, and the rest
of the decision-model becomes deterministic
❑ This process may be repeated many times.
The commonly stated 2-stage SLP
(will be stated again, as needed)
d
d
d
The Recourse Function and its Expectation
∎ Usually, the matrix
C
1
is NOT fixed �� For our
presentation, we will make this assumption. This is
called the ��Fixed-Recourse�� assumption.
∎ Assuming that
h(x
0;��) is finite, LP duality implies
∎ Also LP theory =>
h(��;��) is piecewise linear, convex
and moreover,
d
d
d
Subgradient Method
(Shor/Polyak/Nesterov/Nemirovsky��)
∎ At iteration
k let be given
∎ Let
∎ Then,
where denotes
the projection operator on the set of the
decisions and,
∎ Note that is very difficult to compute! ��.
Enter SQG! Use an unbiased estimate of
∎ How? Use a sample size of
N
k
:
Interchange of Expectation
and Subdifferentiation is
required here
d
d
d
Stochastic Quasi-gradient Method (SQG) (Ermoliev/
Gaivoronski/��)
∎ At iteration
k let be given
∎ Replace of the previous slide with its
unbiased estimate
∎ Then,
with
and, in addition:
d
d
d
∎ Strengths
❑ Easy to Program, no master problem, and easily
parallelizable
∎ Weaknesses
❑ Non-adaptive step-sizes (e.g.
Constant/k)
∎
Needs a lot of fine-tuning to determining step-size (e.g.
Constant)
❑ Convergence
∎
Method makes good progress early on, but like other
steepest-descent type methods, there is zig-zagging behavior
❑ Need ways to stop the algorithm
∎
Difficult because upper and lower bounds on objective values
are difficult to obtain
Strengths and Weaknesses of Subgradient Methods
d
d
d
∎ Let be a random variable defined on a
probability space
∎ Then ��static�� formulation of a stochastic
program is given by
Kelley��s Cutting Plane/Benders��/L-shaped
Decomposition for 2-stage SLP (Recall Problem)
d
d
d
KBL Decomposition (J. Benders/Van Slyke/Wets)
∎ At iteration
k let , and
be given. Recall
∎ Then define
∎ Let
∎ Then,
Constant Term of the
Supporting Hyperplane
��Normal�� of the Supporting Hyperplane
(same as the subgradient method)
d
d
d
KBL Graphical Illustration
Expected Recourse
Function
Approximation: f
k-1
Approximation: f
k
d
d
d
Comparing Subgradient Method and KBL Decomposition
∎ Both evaluate subgradients
∎ Expensive Operation (requires solving as many second-stage
LPs as there are scenarios)
∎ Step size in KBL is implicit (user need not worry)
∎ Master program grows without bound and looks unstable in
the early rounds
∎ Stopping rule is automatic (Upper Bound – Lower Bound �� ��)
∎ KBL��s use of master can be a bottleneck for parallelization
d
d
d
Regularization of the Master Problem
(Ruszczynski/Kiwiel/Lemarechal ��)
∎ Addresses the following issue:
❑ Master program grows without bound and looks unstable
in the early rounds
∎ Include an incumbent and a proximity
measure from the incumbent, using �� >0 as a
weight:
∎ Particularly useful in case of Stochastic
Decomposition.
d
d
d
Where do we stand at this point in the Lecture?
Feature\Method
Subgradient
Method
SQG
Algorithm
KBL
Decomposition
Subgradient or
Estimation
Accurate
Estimation
Accurate
Step Length
Choice Required
Yes
Yes
No
Stopping Rules
Unknown
Unknown
Known
Parallel
Computations
Good
Good
Not so good-
Good
Continuous
Random Variables
No
Yes
No
First-stage Integer
Variables
No
No
Yes
Second-stage
Integer Variables
No
No
No
Of course for small instances, we can always try deterministic equivalents!
d
d
d
Stochastic Decomposition (Sequential
Sampling) Higle/Sen
∎ Allow arbitrarily many outcomes (scenarios)
including continuous random variables
∎ Requirement: can provide a simulator
∎ Assume: cost coefficients are deterministic
(although random costs will be allowed soon)
d
d
d
Central Question: Scalability of each
iteration
∎ If the number of scenarios is large, can we
afford to solve
all second-stage LPs to obtain
accurate subgradient estimates?
❑ No!
∎ Put another way: What is the smallest
number of LPs we can solve in each iteration,
and yet guarantee asymptotic convergence?
❑ The SD answer: 1!
∎ How?
d
d
d
Approximating the recourse function
in SD
∎ At the start of iteration
k, sample one more
outcome �� say
��k independently of
∎ Given solve the following LP
∎ Define
and calculate for
∎ Notice the mapping of outcomes
to finitely
many dual vertices.
d
d
d
Forming the approximation of the
Expected Recourse Function.
∎ The estimated ��cut�� in SD is given by
∎ To calculate this ��cut�� requires one LP
corresponding to the most recent outcome
and the ��argmax�� operations at the bottom of
the previous slide
∎ In addition all previous cuts need to be
updated �� to make old cuts consistent with
the changing sample size over iterations.
d
d
d
Many more SD details (are skipped) �� see
Higle and Sen (1999)
∎ Updating previously generated subgradients
∎ Defining incumbents
∎ Using regularized approximations
∎ Dropping subgradients (finite master)
∎ Stopping rules �� three phases
❑ Set of dual vertices stops changing
❑ Incumbent objective stabilizes
❑ Bootstrapped estimate of distribution of duality
gap is acceptably small
d
d
d
Further comparisons including 2-stage SD
Feature\Method
SQG Algorithm
KBL
Decomposition
Stochastic
Decomposition
Subgradient or
Estimation
Estimation
Accurate
Estimation
Step Length Choice
Required
Yes
No Needed
Not Needed
Stopping Rules
Unknown
Well Studied
Partially Solved
Parallel
Computations
Good
Not so good-
Good
Not known
Continuous
Random Variables
Yes
No
Yes
First-stage Integer
Variables
No
Yes
Yes
Second-stage
Integer Variables
No
No
Available in a
Dissertation
Of course for small instances, we can always try deterministic equivalents!
d
d
d
Comparisons between SD and SAA
(Higle/Zhao accepted for publication)
Prob
Approx. Value
Seconds
20Term
(40 rvs)
Reg. SD 254,581
(79)
259.30
(31.85)
20Term
SAA 254,512
(55)
approx.
10,000
Fleet20_3
(200 rvs)
Reg. SD 141,749
(18)
293.57
(2.45)
Fleet20_3
SAA 141,654
(6.5)
approx.
12,000
SSN
(80 rvs)
Reg. SD 10.26
(0.14)
7491.81
(3728.81)
SSN
SAA
10.57
(0.28)
approx.
100,000
d
d
d
Why the difference in computational times
between SD and SAA?
❑ If there are 1000 outcomes in the SAA
approximations, it requires the subproblem LP
to be solved for 1000 outcomes in every
iteration.
❑ Unlike SAA, the subproblem in SD is solved
for only one outcome, while approximations
are used for other outcomes (from previous
iterations). This explains the difference in
computational times.
d
d
d
Comparisons between SQG and SD
∎ Inventory coordination instance (Herer et al 2006)
∎ SQG method to solve a small inventory
transshipment problem
❑ Find order quantities to minimize total cost of
inventory and transshipment
❑ Example has 7 outlets which can ship goods
among themselves, if necessary
❑ Demand is normally distributed
❑ Herer et al ran the SQG method for K=3000
iterations, using subgradient estimates with
N=1000 simulated outcomes in each iteration
d
d
d
SQG trajectory of order quantities
d
d
d
SD Trajectory of Order Quantities
d
d
d
Optimal Values from 20 SD Runs
d
d
d
Iterations may vary depending on the seed
d
d
d
Running times for SD and SQG
d
d
d
Multi-stage Stochastic
Programming Algorithms
d
d
d
Non-anticipative Solutions by Scenario
(Rockafellar/Wets)
(x
0
, x
11
, x
12
, x
13
)
(x
0
, x
11
, x
12
, x
23
)
(x
0
, x
11
, x
22
, x
33
)
(x
0
, x
11
, x
22
, x
43
)
(x
0
, x
21
, x
32
, x
53
)
(x
0
, x
21
, x
32
, x
63
)
(x
0
, x
21
, x
42
, x
73
)
(x
0
, x
21
, x
42
, x
83
)
Scenario
Tree
x
33
x
23
x
13
x
43
x
53
x
63
x
73
x
83
x
11
x
21
x
12
x
22
,
x
32
x
21
d
d
d
0
1
2
3
4
5
6
7
9
8
10
11
12
13
14
Scenario
Tree
Non-anticipative Solutions by Node
Nodal Variables: X
n
d
d
d
Scenario
Tree
Relaxing Nonanticipativity Creates Clairvoyant
Decisions
(x
0
, x
1
, x
2
, x
3
)1
(x
0
, x
1
, x
2
, x
3
)2
(x
0
, x
1
, x
2
, x
3
)3
(x
0
, x
1
, x
2
, x
3
)4
(x
0
, x
1
, x
2
, x
3
)5
(x
0
, x
1
, x
2
, x
3
)6
(x
0
, x
1
, x
2
, x
3
)7
(x
0
, x
1
, x
2
, x
3
)8
d
d
d
Progressive Hedging Algorithm:
Coordination of Clairvoyant Decisions
∎ The constraints may be considered a graph
X
n
– (x
t
)
��
= 0
X
0
X
1
X
2
X
3
X
4
X
5
X
6
No Non-anticipativity
Constraints for the
Terminal Stage
d
d
d
Dualizing for Progressive Hedging
X
0
X
n
– (x
t
)
��
=0
Primal Constraints
X
n
Free
= 0
Dual Constraints
Time of node
n
d
d
d
Dualizing for Progressive Hedging
X
1
X
2
Time of node
n
= 0
Primal Problem
Lagrangian Dual Problem
d
d
d
Regularized of the Lagrangian Dual
Regularized Lagrangian Dual Problem
Where and are given at the start of any iteration.
Also assume that satisfies dual feasibility
The Progressive Hedging Strategy is Really Simple: Fix
Two of the Three Categories of Variables, and Optimize
the Third in the following order: Primal , followed by
the Primal X (the conditional mean) and finally solve for
. . Now Repeat this Procedure until the change in
estimated solution is within acceptable range.
d
d
d
Summary of the PHA Process
∎ Let
∎ Note that this minimization only involves data
for the outcome
∎ Next ��minimizing�� with respect to X, gives a
new estimate for the conditional
expectations. This is simply the conditional
expectation of the new vectors
∎ Finally, update the dual multipliers:
d
d
d
Comments on the PHA Process
∎ Coordination process has no master problem –
making it highly suited for parallelization
∎ The Lagrange Multipliers provide ex-post
estimates of prices or subsides for every
scenario
❑ But very large search spaces because of exponentially
many dual variables.
∎ The method has also been used as a heuristic
for Stochastic Mixed-Integer Programs (see
Watson, Wets and Woodruff, as well as PySP
(part of Coopr at Sandia).
d
d
d
x11
x21
x12
x22
x32
x42
x13
x23
x83
Scenario
Tree
Nested Benders�� Decomposition
(Birge/Gassman/Dempster ��)
Remember: All data and conditional probabilities
of the Multi-stage Stochastic LP are supplied
d
d
d
Nested Benders�� Decomposition
(Birge/Gassman/Dempster ��)
Information Visualization
• Upstream nodes place
��orders�� based on a
local decision (e.g. x
22
)
• Downstream nodes
respond with prices
(i.e. subgradients) and
feasibility facets
x11
x21
x12
x22
x32
x42
x13
x23
x83
Scenario
Tree
d
d
d
Notation:
j is an index for a stage
i is an index of a node in stage j
i- (��i minus��) is an upstream node.
Each Node of the Tree will ��House�� an LP
Prices (i.e. subgradients)
supplied by downstream
nodes
Feasibility facets
supplied by downstream
nodes
��Orders�� from
upstream
d
d
d
��Nested�� Benders�� Method
∎ Traverse the tree solving LPs whenever
feasible. In this case, pass a Subgradient to
the upstream node
∎ If any LP is infeasible, pass a ��Feasibility
Facet�� to the upstream node.
∎ Question?
❑ Can this algorithm be run via asynchrounous
processing, and still converge?
d
d
d
Comments on ��Nested�� Benders��
∎ Has been extended to sampling the tree (but
you still work the same ��probability��). So,
asymptotic convergence does NOT rely on
subgradients that are stochastic (Philpott
and Guan 2008)
∎ Extensions to Stochastic Subgradients are
right around the corner (under revision)