
The Mathematics of Deep Learning

Page 1
The Mathematics of Deep Learning
ICCV Tutorial, Santiago de Chile, December 12, 2015
 

Joan Bruna (Berkeley), Raja Giryes (Duke), Guillermo Sapiro (Duke), René Vidal (Johns Hopkins)

Page 2
Motivations and Goals of the Tutorial
• Motivation: Deep networks have led to dramatic improvements in performance for many tasks, but the mathematical reasons for this success remain unclear.
• Goal: Review very recent work that aims at understanding the mathematical reasons for the success of deep networks.
• What we will do: Study theoretical questions such as
– What properties of images are being captured/exploited by DNNs?
– Can we ensure that the learned representations are globally optimal?
– Can we ensure that the learned representations are stable?
• What we will not do: Show X% improvement in performance for a particular application.

Page 3
Tutorial Schedule
• 14:00-14:30: Introduction
• 14:30-15:15: Global Optimality in Deep Learning (René Vidal)
• 15:15-16:00: Coffee Break
• 16:00-16:45: Scattering Convolutional Networks (Joan Bruna)
• 16:45-17:30: Stability of Deep Networks (Raja Giryes)
• 17:30-18:00: Questions and Discussion

Page 4
What do we mean by "Deep Learning" in this tutorial?
Disclaimer

Page 5
What do we mean by "Deep Learning" in this tutorial?
• A class of signal representations that are hierarchical.
• The optimization procedure by which these representations are learnt from data end-to-end.
Disclaimer
figure from Raja Giryes

Page 6
Early Hierarchical Feature Models for Vision
• Hubel & Wiesel [60s]: Simple & Complex cells architecture
• Fukushima's Neocognitron [70s]
figures from Yann LeCun's CVPR'15 plenary

Page 7
Early Hierarchical Feature Models for Vision
• Yann LeCun's Early ConvNets [80s]:
– Used for character recognition
– Trained with backpropagation.
figures from Yann LeCun's CVPR'15 plenary

Page 8
Deep Learning pre-2012
• Despite their very competitive performance, deep learning architectures were not widespread before 2012.
– State-of-the-art in handwritten pattern recognition [LeCun et al. '89, Ciresan et al. '07, etc.]
figures from Yann LeCun's CVPR'15 plenary

Page 9
Deep Learning pre-2012
• Despite their very competitive performance, deep learning architectures were not widespread before 2012.
– Face detection [Vaillant et al. '93, '94; Osadchy et al. '03, '04, '07]
(Yann's Family)

Page 10
Deep Learning pre-2012
• Despite their very competitive performance, deep learning architectures were not widespread before 2012.
– Scene Parsing [Farabet et al. '12, '13]
figures from Yann LeCun's CVPR'15 plenary

Page 11
Deep Learning pre-2012
• Despite their very competitive performance, deep learning architectures were not widespread before 2012.
– Scene Parsing [Farabet et al. '12, '13]
figures from Yann LeCun's CVPR'15 plenary

Page 12
Deep Learning pre-2012
• Despite their very competitive performance, deep learning architectures were not widespread before 2012.
– Too many parameters to learn from few labeled examples.
– "I know my features are better for this task."
– Non-convex optimization? No, thanks.
– Black-box model, no interpretability.

Page 13
Deep Learning Golden age in Vision
• 2012-2014 ImageNet results:
• 2015 results: MSRA under 3.5% error (using a CNN with 150 layers!)
[Chart: ImageNet error rates, CNN vs. non-CNN entries; figures from Yann LeCun's CVPR'15 plenary]

Page 14
Puzzling Questions
•What made this result possible?
– Larger training sets (1.2 million high-resolution training samples, 1000 object categories)
– Better hardware (GPUs)
– Better learning regularization (Dropout)
• Is this just for a particular dataset?
• Is this just for a particular task?
• Why are these architectures so efficient?

Page 15
Is it just for a particular dataset?
• No. Nowadays CNNs hold the state-of-the-art on virtually any object classification task.
figures from Yann LeCun's NIPS'15 tutorial

Page 16
Is it just for a particular task?
• No. CNN architectures also obtain state-of-the-art performance on many other tasks:
– Pose estimation [Tompson et al., CVPR'15]
– Object localization [R-CNN, HyperColumns, OverFeat, etc.]
figures from Yann LeCun's CVPR'15 plenary

Page 17
Is it just for a particular task?
• No. CNN architectures also obtain state-of-the-art performance on other tasks:
– Semantic segmentation [Pinheiro, Collobert, Dollár, ICCV'15]
figures from Yann LeCun's CVPR'15 plenary

Page 18
Is it just for a particular task?
• No. CNN architectures also obtain state-of-the-art performance on other tasks:
– Generative models for natural images [Radford, Metz & Chintala '15]

Page 19
Is it just for a particular task?
• No. CNN architectures also obtain state-of-the-art performance on other tasks:
– Generative models for natural images [Radford, Metz & Chintala '15]

Page 20
Is it just for a particular task?
• No. CNN architectures also obtain state-of-the-art performance on other tasks:
– Related work [Kulkarni et al. '15, Dosovitskiy et al. '14]

Page 21
Is it just for a particular task?
• No. CNN architectures also obtain state-of-the-art performance on other tasks:
– Image captioning [Vinyals et al. '14, Karpathy et al. '14, etc.]
– Optical flow estimation [Zontar '15]
– Image super-resolution [MSR '14]

Page 22
• Convolutional Deep Learning models thus appear to capture high-level image properties more efficiently than previous models.
• Highly expressive representations capturing complex geometrical and statistical patterns.
• Excellent generalization: "beating" the curse of dimensionality.

Page 23
• Convolutional Deep Learning models thus appear to capture high-level image properties more efficiently than previous models.
• Which architectural choices might explain this advantage mathematically?
– Role of non-linearities?
– Role of convolutions?
– Role of depth?
– Interplay with geometrical, class-specific invariants?

Page 24
• Convolutional Deep Learning models thus appear to capture high-level image properties more efficiently than previous models.
• Which architectural choices might explain this advantage mathematically?
• Which optimization choices might explain this advantage?
– Presence of local minima or saddle points?
– Equivalence of local solutions?
– Role of stochastic optimization?

Page 25
Deep Learning Approximation Theory
• Deep networks define a class of "universal approximators": Cybenko and Hornik characterization:
Theorem [Cybenko '89, Hornik '91]: Let $\sigma(\cdot)$ be a bounded, non-constant continuous function. Let $I_m$ denote the $m$-dimensional hypercube, and $C(I_m)$ denote the space of continuous functions on $I_m$. Given any $f \in C(I_m)$ and $\varepsilon > 0$, there exist $N > 0$ and $v_i, w_i, b_i$, $i = 1, \ldots, N$, such that
$$F(x) = \sum_{i \le N} v_i \, \sigma(w_i^T x + b_i)$$
satisfies $\sup_{x \in I_m} |f(x) - F(x)| < \varepsilon$.
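As a purely illustrative sketch of the theorem (not part of the tutorial material), the snippet below fits F(x) = Σᵢ vᵢ σ(wᵢᵀx + bᵢ) to a 1-D target by least squares; the sigmoid σ, the target f, the width N, and the random choice of wᵢ, bᵢ are all assumptions made for the example.

```python
# Minimal sketch of F(x) = sum_i v_i * sigma(w_i * x + b_i) approximating a
# continuous target f on the unit interval (illustrative choices throughout).
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))     # bounded, non-constant sigmoid
f = lambda x: np.sin(2 * np.pi * x)            # target f in C(I_1), chosen arbitrarily

N = 50                                         # number of hidden units
w = rng.normal(scale=10.0, size=N)             # random inner weights w_i
b = rng.uniform(-10.0, 10.0, size=N)           # random biases b_i

x = np.linspace(0.0, 1.0, 500)
Phi = sigma(np.outer(x, w) + b)                # hidden activations, shape (500, N)
v, *_ = np.linalg.lstsq(Phi, f(x), rcond=None) # outer weights v_i by least squares

F = Phi @ v
print("sup |f - F| on the grid:", np.max(np.abs(f(x) - F)))
```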

Page 26
Deep Learning Approximation Theory
• Deep networks define a class of "universal approximators": Cybenko and Hornik characterization:
• It guarantees that even a network with a single hidden layer can represent any classification problem in which the boundary is locally linear (smooth).
• It does not inform us about good/bad architectures, or how they relate to the optimization.
Theorem [Cybenko '89, Hornik '91]: Let $\sigma(\cdot)$ be a bounded, non-constant continuous function. Let $I_m$ denote the $m$-dimensional hypercube, and $C(I_m)$ denote the space of continuous functions on $I_m$. Given any $f \in C(I_m)$ and $\varepsilon > 0$, there exist $N > 0$ and $v_i, w_i, b_i$, $i = 1, \ldots, N$, such that
$$F(x) = \sum_{i \le N} v_i \, \sigma(w_i^T x + b_i)$$
satisfies $\sup_{x \in I_m} |f(x) - F(x)| < \varepsilon$.

Page 27
Deep Learning Estimation Theory
Theorem [Barron '92]: The mean integrated squared error between the estimated network $\hat{F}$ and the target function $f$ is bounded by
$$O\!\left(\frac{C_f^2}{N}\right) + O\!\left(\frac{N m}{K} \log K\right),$$
where $K$ is the number of training points, $N$ is the number of neurons, $m$ is the input dimension, and $C_f$ measures the global smoothness of $f$.
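To make the trade-off in the bound concrete, here is a small sketch that evaluates the two terms C_f²/N and (Nm/K)·log K for a few values of N; the values of C_f, m, and K are hypothetical and chosen only to show how the approximation and estimation terms move in opposite directions.

```python
# Illustrative evaluation of the two terms in Barron's bound:
#   approximation error ~ C_f^2 / N,  estimation error ~ (N * m / K) * log K.
# All constants below are assumptions chosen only to show the trade-off in N.
import numpy as np

C_f = 10.0        # global smoothness measure of f (hypothetical)
m = 100           # input dimension (hypothetical)
K = 1_000_000     # number of training points (hypothetical)

for N in [10, 100, 1000, 10000]:
    approx = C_f**2 / N
    estim = N * m / K * np.log(K)
    print(f"N={N:6d}  approx={approx:.4f}  estim={estim:.4f}  total={approx + estim:.4f}")
```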

Page 28
Deep Learning Estimation Theory
• Combines approximation and estimation error.
• Does not explain why online/stochastic optimization works better than batch optimization.
• Does not relate generalization error with choice of architecture.
Theorem [Barron '92]: The mean integrated squared error between the estimated network $\hat{F}$ and the target function $f$ is bounded by
$$O\!\left(\frac{C_f^2}{N}\right) + O\!\left(\frac{N m}{K} \log K\right),$$
where $K$ is the number of training points, $N$ is the number of neurons, $m$ is the input dimension, and $C_f$ measures the global smoothness of $f$.

Page 29
Non-Convex Optimization
• [Choromanska et al., AISTATS'15] (also [Dauphin et al., ICML'15]) use tools from Statistical Physics to explain the behavior of stochastic gradient methods when training deep neural networks.

Page 30
Non-Convex Optimization
• [Choromanska et al., AISTATS'15] (also [Dauphin et al., ICML'15]) use tools from Statistical Physics to explain the behavior of stochastic gradient methods when training deep neural networks (see the toy sketch below).
• Offers a macroscopic explanation of why SGD "works".
• Gives a characterization of the network depth.
• Strong model simplifications, no convolutional specification.
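As a toy illustration only (not the spin-glass model analysed in these papers), the sketch below runs plain SGD from several random initializations on a simple non-convex 1-D loss; the loss function, step size, and gradient noise are all assumptions.

```python
# Toy illustration (not the model of Choromanska et al.): plain SGD on a
# small non-convex function, started from several random points.
import numpy as np

def loss(w):
    # Non-convex 1-D landscape with several local minima (chosen arbitrarily).
    return np.sin(3 * w) + 0.1 * w**2

def grad(w):
    return 3 * np.cos(3 * w) + 0.2 * w

rng = np.random.default_rng(0)
for trial in range(5):
    w = rng.uniform(-5, 5)                       # random initialization
    for _ in range(2000):
        g = grad(w) + rng.normal(scale=0.5)      # noisy (stochastic) gradient
        w -= 0.01 * g                            # SGD step
    print(f"start-dependent solution: w={w:+.3f}, loss={loss(w):.3f}")
```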

Page 31
Tutorial Outline
• Part I: Global Optimality in Deep Learning (René Vidal)
• Part II: Signal Recovery from Scattering Convolutional Networks (Joan Bruna)
• Part III: On the Stability of Deep Networks (Raja Giryes)
[Preview graphics: example critical points of a non-convex function and guarantees of the framework (Part I); original image, single-layer recovery O(log N), scattering recovery O((log N)^2) (Part II); one-layer stable embedding, Theorem 1 (Part III)]

Page 32
Part I: Global Optimality in Deep Learning
• Key Questions
– How to deal with the non-convexity of the learning problem?
– Do learning methods get trapped in local minima?
– Why do many local solutions seem to give about equally good results?
– Why use rectified linear units instead of other nonlinearities?
• Key Results
– Deep learning is a positively homogeneous factorization problem (see the sketch below).
– With proper regularization, local minima are global.
– If the network is large enough, global minima can be found by local descent.
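A minimal numerical sketch of the positive-homogeneity property referred to above, assuming a toy two-layer ReLU network (not the actual factorization framework of Part I): scaling both weight layers by α > 0 scales the output by α².

```python
# Minimal check of positive homogeneity: scaling both weight layers of a
# two-layer ReLU network by alpha > 0 scales the output by alpha**2.
# (Toy illustration only; sizes and weights are arbitrary.)
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

W1 = rng.normal(size=(16, 8))   # first-layer weights (hypothetical sizes)
W2 = rng.normal(size=(1, 16))   # second-layer weights
x = rng.normal(size=8)          # an arbitrary input

def net(W1, W2, x):
    return W2 @ relu(W1 @ x)

alpha = 3.0
out = net(W1, W2, x)
out_scaled = net(alpha * W1, alpha * W2, x)
print(np.allclose(out_scaled, alpha**2 * out))   # True: degree-2 homogeneity
```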
[Slide graphics: example critical points of a non-convex function (shown in red); guarantees of the framework]

Page 33
Part II: Scattering Convolutional Networks
• Key Questions
– What is the importance of "deep" and "convolutional" in CNN architectures?
– What statistical properties of images are being captured/exploited by deep networks?
• Key Results
– Scattering coefficients are stable encodings of geometry and texture (see the sketch below).
– Layers in a CNN encode complex, class-specific geometry.
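As a rough sketch of the scattering idea referenced above (cascaded band-pass filtering, complex modulus, and averaging), the snippet below computes first- and second-order coefficients of a 1-D signal; the Gabor-like filters, the scales, and the crude low-pass are simplifying assumptions, not the construction presented in Part II.

```python
# Rough 1-D sketch of a scattering-style cascade: convolve with band-pass
# filters, take the modulus, convolve again, then low-pass average.
import numpy as np

def gabor(scale, length=64):
    # Complex Gabor-like band-pass filter (simplified, not a proper wavelet).
    t = np.arange(-length // 2, length // 2)
    return np.exp(-t**2 / (2 * scale**2)) * np.exp(1j * 2 * np.pi * t / scale)

def lowpass(x, width=32):
    # Crude moving-average low-pass standing in for the averaging filter.
    return np.convolve(x, np.ones(width) / width, mode="same")

rng = np.random.default_rng(0)
x = rng.normal(size=512)                       # arbitrary test signal

scales = [4, 8, 16]
S1 = [lowpass(np.abs(np.convolve(x, gabor(s), mode="same"))) for s in scales]
S2 = [lowpass(np.abs(np.convolve(np.abs(np.convolve(x, gabor(s1), mode="same")),
                                 gabor(s2), mode="same")))
      for s1 in scales for s2 in scales if s2 > s1]

print("first-order coeffs:", len(S1), " second-order coeffs:", len(S2))
```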
[Figure: original image; single-layer recovery, O(log N); scattering recovery, O((log N)^2)]

Page 34
Part III: On the Stability of Deep Networks
• Key Questions
– Stability: Do small perturbations to the input image cause small perturbations to the output of the network?
– Can I recover the input from the output?
• Key Results
– Gaussian mean width is a good measure of data complexity (see the sketch below).
– DNNs keep important information about the data.
– Deep learning can be viewed as a metric learning problem.
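For concreteness, the Gaussian mean width of a set K is commonly defined as w(K) = E_g sup_{x,y∈K} ⟨g, x−y⟩ with g ~ N(0, I); the sketch below estimates it by Monte Carlo for a small random point cloud, where the choice of K, the dimensions, and the number of trials are illustrative assumptions.

```python
# Monte Carlo estimate of the Gaussian mean width
#   w(K) = E_g  sup_{x, y in K} <g, x - y>,   g ~ N(0, I).
# K here is a small random point cloud purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 200
K = rng.normal(size=(n, d)) / np.sqrt(d)       # hypothetical data set: n points in R^d

def mean_width(K, trials=500):
    widths = []
    for _ in range(trials):
        g = rng.normal(size=K.shape[1])        # Gaussian direction g
        proj = K @ g                           # <g, x> for every x in K
        widths.append(proj.max() - proj.min()) # sup over x, y of <g, x - y>
    return float(np.mean(widths))

print("estimated Gaussian mean width of K:", mean_width(K))
```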
[Slide graphic: one-layer stable embedding; Theorem 1 (existence of a recovery algorithm)]

Page 35
Tutorial Schedule
• 14:00-14:30: Introduction
• 14:30-15:15: Global Optimality in Deep Learning (René Vidal)
• 15:15-16:00: Coffee Break
• 16:00-16:45: Scattering Convolutional Networks (Joan Bruna)
• 16:45-17:30: Stability of Deep Networks (Raja Giryes)
• 17:30-18:00: Questions and Discussion