
# Distributed Optimization via Alternating Direction Method of Multipliers

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato (Stanford University)

ITMANET, Stanford, January 2011

## Outline

• precursors
  – dual decomposition
  – method of multipliers
• alternating direction method of multipliers
• applications/examples
• conclusions/big picture

## Dual problem

• convex equality constrained optimization problem

$$\begin{array}{ll} \text{minimize} & f(x) \\ \text{subject to} & Ax = b \end{array}$$

• Lagrangian: $L(x, y) = f(x) + y^T(Ax - b)$
• dual function: $g(y) = \inf_x L(x, y)$
• dual problem: maximize $g(y)$
• recover $x^\star = \operatorname{argmin}_x L(x, y^\star)$
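As a concrete worked instance (an illustrative example, not from the slides), take $f(x) = \tfrac{1}{2}\|x - c\|_2^2$. The x-minimization then has a closed form, which yields an explicit dual function:

$$\operatorname{argmin}_x L(x, y) = c - A^T y, \qquad g(y) = -\tfrac{1}{2}\|A^T y\|_2^2 + y^T(Ac - b)$$

so the dual problem is an unconstrained concave quadratic maximization in $y$.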

## Dual ascent

• gradient method for dual problem: $y^{k+1} = y^k + \alpha^k \nabla g(y^k)$
• $\nabla g(y^k) = A\tilde{x} - b$, where $\tilde{x} = \operatorname{argmin}_x L(x, y^k)$
• dual ascent method is

$$\begin{aligned} x^{k+1} &:= \operatorname{argmin}_x L(x, y^k) && \text{// x-minimization} \\ y^{k+1} &:= y^k + \alpha^k (Ax^{k+1} - b) && \text{// dual update} \end{aligned}$$

• works, with lots of strong assumptions
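A minimal runnable sketch of dual ascent (not from the slides), again for the quadratic case $f(x) = \tfrac{1}{2}\|x - c\|_2^2$, where the x-minimization has the closed form $x = c - A^T y$; the problem data and the fixed step size $\alpha$ below are illustrative choices:

```python
import numpy as np

# small equality-constrained quadratic: minimize (1/2)||x - c||^2  s.t.  Ax = b
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
b = np.array([1.0, 2.0])
c = np.array([0.5, -0.5, 0.5, -0.5])

alpha = 0.4        # fixed step size; must satisfy alpha < 2 / lambda_max(A A^T) = 1 here
y = np.zeros(2)    # dual variable
for k in range(100):
    x = c - A.T @ y                # x-minimization: argmin_x L(x, y)
    y = y + alpha * (A @ x - b)    # dual update (gradient ascent on g)
```

At convergence the primal iterate becomes feasible ($Ax = b$); with a gradient step instead of the closed form for $\nabla g$, the iteration would be identical in structure.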

## Dual decomposition

• suppose $f$ is separable:

$$f(x) = f_1(x_1) + \cdots + f_N(x_N), \qquad x = (x_1, \ldots, x_N)$$

• then $L$ is separable in $x$: $L(x, y) = L_1(x_1, y) + \cdots + L_N(x_N, y) - y^T b$, where

$$L_i(x_i, y) = f_i(x_i) + y^T A_i x_i$$

• x-minimization in dual ascent splits into $N$ separate minimizations

$$x_i^{k+1} := \operatorname{argmin}_{x_i} L_i(x_i, y^k)$$

which can be carried out in parallel

## Dual decomposition

• dual decomposition (Everett, Dantzig, Wolfe, Benders 1960–65)

$$\begin{aligned} x_i^{k+1} &:= \operatorname{argmin}_{x_i} L_i(x_i, y^k), \quad i = 1, \ldots, N \\ y^{k+1} &:= y^k + \alpha^k \Big( \textstyle\sum_{i=1}^N A_i x_i^{k+1} - b \Big) \end{aligned}$$

• scatter $y^k$; update $x_i$ in parallel; gather $A_i x_i^{k+1}$
• solve a large problem
  – by iteratively solving subproblems (in parallel)
  – dual variable update provides coordination
• works, with lots of assumptions; often slow
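The scatter/update/gather pattern can be sketched in a few lines (an illustrative example, not from the slides): take scalar blocks $x_i$ with $f_i(x_i) = \tfrac{1}{2}(x_i - c_i)^2$ and $A_i$ the $i$th column of $A$, so each subproblem has the closed form $x_i = c_i - A_i^T y$:

```python
import numpy as np

# separable objective f(x) = sum_i (1/2)(x_i - c_i)^2 with scalar blocks x_i,
# coupled only through the constraint sum_i A_i x_i = b (A_i = i-th column of A)
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
b = np.array([1.0, 2.0])
c = np.array([0.5, -0.5, 0.5, -0.5])
N = 4

alpha = 0.4
y = np.zeros(2)
for k in range(100):
    # scatter y; each x_i-update is independent and could run on its own processor:
    # x_i = argmin_{x_i} (1/2)(x_i - c_i)^2 + y^T A_i x_i  =  c_i - A_i^T y
    x = np.array([c[i] - A[:, i] @ y for i in range(N)])
    # gather the A_i x_i (here as A @ x) and take a dual gradient step
    y = y + alpha * (A @ x - b)
```

The dual update is the only step that couples the blocks, which is what makes the parallel split possible.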

## Method of multipliers

• a method to robustify dual ascent
• use augmented Lagrangian (Hestenes, Powell 1969), $\rho > 0$

$$L_\rho(x, y) = f(x) + y^T(Ax - b) + (\rho/2)\|Ax - b\|_2^2$$

• method of multipliers (Hestenes, Powell; analysis in Bertsekas 1982)

$$\begin{aligned} x^{k+1} &:= \operatorname{argmin}_x L_\rho(x, y^k) \\ y^{k+1} &:= y^k + \rho(Ax^{k+1} - b) \end{aligned}$$

(note specific dual update step length $\rho$)
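A runnable sketch of the method of multipliers for the same illustrative quadratic problem as above (not from the slides): with $f(x) = \tfrac{1}{2}\|x - c\|_2^2$, the x-update amounts to solving the linear system $(I + \rho A^T A)\,x = c - A^T y + \rho A^T b$:

```python
import numpy as np

# method of multipliers for: minimize (1/2)||x - c||^2  s.t.  Ax = b
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
b = np.array([1.0, 2.0])
c = np.array([0.5, -0.5, 0.5, -0.5])
n = 4

rho = 1.0
y = np.zeros(2)
for k in range(50):
    # x-update: argmin_x L_rho(x, y) = solve (I + rho A^T A) x = c - A^T y + rho A^T b
    x = np.linalg.solve(np.eye(n) + rho * A.T @ A, c - A.T @ y + rho * A.T @ b)
    # dual update, with the specific step length rho
    y = y + rho * (A @ x - b)
```

Note that the quadratic penalty couples all components of $x$ through $A^T A$, which is exactly the loss of splitting discussed on the next slide.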

## Method of multipliers

• good news: converges under much more relaxed conditions ($f$ can be nondifferentiable, take on the value $+\infty$, ...)
• bad news: the quadratic penalty destroys the splitting of the x-update, so we can't do decomposition

## Alternating direction method of multipliers

• a method
  – with the good robustness of the method of multipliers
  – which can support decomposition

"robust dual decomposition" or "decomposable method of multipliers"

• proposed by Gabay, Mercier, Glowinski, Marrocco in 1976

## Alternating direction method of multipliers

• ADMM problem form (with $f$, $g$ convex)

$$\begin{array}{ll} \text{minimize} & f(x) + g(z) \\ \text{subject to} & Ax + Bz = c \end{array}$$

  – two sets of variables, with separable objective

• $L_\rho(x, z, y) = f(x) + g(z) + y^T(Ax + Bz - c) + (\rho/2)\|Ax + Bz - c\|_2^2$

$$\begin{aligned} x^{k+1} &:= \operatorname{argmin}_x L_\rho(x, z^k, y^k) && \text{// x-minimization} \\ z^{k+1} &:= \operatorname{argmin}_z L_\rho(x^{k+1}, z, y^k) && \text{// z-minimization} \\ y^{k+1} &:= y^k + \rho(Ax^{k+1} + Bz^{k+1} - c) && \text{// dual update} \end{aligned}$$
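The three updates can be made concrete with a small illustrative instance (my example, not from the slides): $f(x) = \tfrac{1}{2}\|x - d\|_2^2$, $g(z) = \lambda \|z\|_1$, and constraint $x - z = 0$ (so $A = I$, $B = -I$, $c = 0$). Both minimizations then have closed forms: the x-update is a quadratic solve, and the z-update is elementwise soft thresholding:

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft thresholding: the minimizer of t*||z||_1 + (1/2)||z - v||^2."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# ADMM for: minimize (1/2)||x - d||^2 + lam*||z||_1  s.t.  x - z = 0
d = np.array([1.5, -0.2, 0.7])
lam, rho = 0.5, 1.0

x = np.zeros(3); z = np.zeros(3); y = np.zeros(3)
for k in range(200):
    x = (d - y + rho * z) / (1.0 + rho)          # x-minimization (quadratic, closed form)
    z = soft_threshold(x + y / rho, lam / rho)   # z-minimization (prox of the l1 term)
    y = y + rho * (x - z)                        # dual update
```

The splitting is the point: $f$ and $g$ are never minimized jointly, yet the iterates converge to the minimizer of $\tfrac{1}{2}\|x - d\|_2^2 + \lambda\|x\|_1$.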

## Alternating direction method of multipliers

• if we minimized over $x$ and $z$ jointly, ADMM reduces to the method of multipliers
• instead, we do one pass of a Gauss–Seidel method
• we get splitting since we minimize over $x$ with $z$ fixed, and vice versa

## Convergence

• assume (very little!)
  – $f$, $g$ convex, closed, proper
  – $L_0$ has a saddle point
• then ADMM converges:
  – iterates approach feasibility: $Ax^k + Bz^k - c \to 0$
  – objective approaches the optimal value: $f(x^k) + g(z^k) \to p^\star$

## Related algorithms

• operator splitting methods (Douglas, Peaceman, Rachford, Lions, Mercier, ... 1950s, 1979)
• proximal point algorithm (Rockafellar 1976)
• Dykstra's alternating projections algorithm (1983)
• Spingarn's method of partial inverses (1985)
• Rockafellar–Wets progressive hedging (1991)
• proximal methods (Rockafellar, many others, 1976–present)
• Bregman iterative methods (2008–present)
• most of these are special cases of the proximal point algorithm

## Consensus optimization

• want to solve a problem with $N$ objective terms

$$\text{minimize} \quad \sum_{i=1}^N f_i(x)$$

  – e.g., $f_i$ is the loss function for the $i$th block of training data

• ADMM form:

$$\begin{array}{ll} \text{minimize} & \sum_{i=1}^N f_i(x_i) \\ \text{subject to} & x_i - z = 0 \end{array}$$

  – $x_i$ are local variables
  – $z$ is the global variable
  – $x_i - z = 0$ are consistency or consensus constraints

• $L_\rho(x, z, y) = \sum_{i=1}^N \left( f_i(x_i) + y_i^T(x_i - z) + (\rho/2)\|x_i - z\|_2^2 \right)$

$$\begin{aligned} x_i^{k+1} &:= \operatorname{argmin}_{x_i} \left( f_i(x_i) + y_i^{kT}(x_i - z^k) + (\rho/2)\|x_i - z^k\|_2^2 \right) \\ z^{k+1} &:= \frac{1}{N} \sum_{i=1}^N \left( x_i^{k+1} + (1/\rho)\, y_i^k \right) \\ y_i^{k+1} &:= y_i^k + \rho\, (x_i^{k+1} - z^{k+1}) \end{aligned}$$

• using $\sum_{i=1}^N y_i^k = 0$, the algorithm simplifies to

$$\begin{aligned} x_i^{k+1} &:= \operatorname{argmin}_{x_i} \left( f_i(x_i) + y_i^{kT}(x_i - \bar{x}^k) + (\rho/2)\|x_i - \bar{x}^k\|_2^2 \right) \\ y_i^{k+1} &:= y_i^k + \rho\, (x_i^{k+1} - \bar{x}^{k+1}) \end{aligned}$$

where $\bar{x}^k = (1/N) \sum_{i=1}^N x_i^k$

• in each iteration
  – gather the $x_i^k$ and average to get $\bar{x}^k$
  – scatter the average $\bar{x}^k$ to the processors
  – update $y_i^k$ locally (in each processor, in parallel)
  – update $x_i$ locally
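The gather/scatter loop above can be sketched with quadratic local terms (an illustrative example, not from the slides): with $f_i(x) = \tfrac{1}{2}\|x - c_i\|_2^2$, each local x-update is closed form, and the consensus $\bar{x}$ converges to the average of the $c_i$, which is the global minimizer:

```python
import numpy as np

# consensus ADMM with quadratic local terms f_i(x) = (1/2)||x - c_i||^2;
# the global solution is the average of the c_i
C = np.array([[1.0, 2.0],     # c_1
              [3.0, 0.0],     # c_2
              [2.0, 4.0]])    # c_3
N, n = C.shape
rho = 1.0

X = np.zeros((N, n))    # local variables x_i (one row per processor)
Y = np.zeros((N, n))    # local dual variables y_i
xbar = X.mean(axis=0)
for k in range(100):
    # local x_i-updates, all independent (would run in parallel);
    # for quadratic f_i: x_i = (c_i - y_i + rho*xbar) / (1 + rho)
    X = (C - Y + rho * xbar) / (1.0 + rho)
    xbar = X.mean(axis=0)           # gather and average
    Y = Y + rho * (X - xbar)        # scatter xbar, then local dual updates
```

Only the averaging step touches all blocks; everything else is embarrassingly parallel.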

## Statistical interpretation

• $f_i$ is the negative log-likelihood for parameter $x$ given the $i$th data block
• $x_i^{k+1}$ is the MAP estimate under prior $\mathcal{N}(\bar{x}^k + (1/\rho)\, y_i^k, \rho I)$
• prior mean is the previous iteration's consensus shifted by the "price" of processor $i$ disagreeing with the previous consensus
• processors only need to support a Gaussian MAP method
  – type or number of data in each block not relevant
  – consensus protocol yields the global maximum-likelihood estimate

## Consensus classification

• data (examples) $(a_i, b_i)$, $i = 1, \ldots, N$, $a_i \in \mathbf{R}^n$, $b_i \in \{-1, +1\}$
• linear classifier $\operatorname{sign}(a^T w + v)$, with weight $w$, offset $v$
• margin for the $i$th example is $b_i(a_i^T w + v)$; want the margin to be positive
• loss for the $i$th example is $l(b_i(a_i^T w + v))$
  – $l$ is the loss function (hinge, logistic, probit, exponential, ...)
• choose $w$, $v$ to minimize $\frac{1}{N} \sum_{i=1}^N l(b_i(a_i^T w + v)) + r(w)$
  – $r(w)$ is the regularization term ($\ell_2$, $\ell_1$, ...)
• split the data and use ADMM consensus to solve

## Consensus SVM example

• hinge loss $l(u) = (1 - u)_+$ with $\ell_2$ regularization
• baby problem with $n = 2$, $N = 400$ to illustrate
• examples split into 20 groups, in the worst possible way: each group contains only positive or negative examples

## Iterations 1, 5, and 40

[figures: decision boundaries of the consensus SVM after iterations 1, 5, and 40; axes span −3 to 3 horizontally and −10 to 10 vertically]

## Reference

*Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers* (Boyd, Parikh, Chu, Peleato, Eckstein), available at the Boyd web site.