Topological Features in Locally Connected RBMs_Free Document Search and Download

Page 1

Topological Features in Locally Connected RBMs
Andreas Müller, Hannes Schulz, and Sven Behnke
Abstract—Unsupervised learning algorithms find ways to model latent structure present in the data. These latent struc- tures can then serve as a basis for supervised classification methods. A common choice for unsupervised feature discovery is the Restricted Boltzmann Machine (RBM). Since the RBM is a general purpose learning machine, it is not particularly tailored for image data. Representations found by RBMs are consequently not image-like. Since it is essential to exploit the known topological structure for image analysis, it is desirable not to discard the topology property when learning new repre- sentations. Then, the same learning methods can be applied to the latent representation in a hierarchical manner. In this work, we propose a modification to the learning rule of locally connected RBMs, which ensures that topological image structure is preserved in the latent representation. To this end, we use a Gaussian kernel to transfer topological properties of the image space to the feature space. The learned model is then used as an initialization for a neural network trained to classify the images. We evaluate our approach on the MNIST and Caltech 101 datasets and demonstrate that we are able to learn topological feature maps.
I. INTRODUCTION Many classifiers rely on well-designed features for high performance. Since hand-crafting such features is infeasible for complex data sets, a variety of methods were developed to extract features from data in an unsupervised manner. As unlabeled data is easy to acquire, methods which explain the variance in the data without known labels are preferred. One mean to this end is modeling the data distribution by introducing latent variables. Generative graphical models, such as Restricted Boltzmann Machines (RBM, [1]) are a popular choice for this purpose. RBMs model correlations of observed variables by introducing binary latent variables (features) which are assumed to be conditionally independent given the observed variables. This restriction is useful because, in contrast to general Boltzmann Machines, a fast learning algorithm exists (Contrastive Divergence [1]). RBMs are generic learning machines and have been applied to many domains, including text, speech, motion data, and images. In the most commonly used form, however, they do not take advantage of the topology of the input space. Especially when applied to image data, fully connected RBMs model long-range dependencies which are known to be weak in natural images [2]. Typically (e.g. [3]), a sparsity constraint is introduced, resulting in local receptive fields. On the negative side, this approach wastes space and time, as large parts of the parameter space have to be “unlearned”. Another way to deal with the sparsity problem is to remove long-range parameters from the model entirely. The advantage
University of Bonn – Computer Science VI, Autonomous In- telligent Systems Group, Römerstraße 164, 53117 Bonn, Germany. Email: {amueller,schulz,behnke}@ais.uni-bonn.de
of this approach is two-fold: first, local connectivity serves as a prior that matches well to the properties of natural images and, second, the drastically reduced number of parameters makes learning on larger images feasible. While local receptive fields are well-established in dis- criminative learning, their counterpart in the generative case, which we call “impact area”, is not well understood. In [4], we investigated the capabilities of stacked RBMs and RBM- like graphical models with local impact area (LIRBMs) and lateral connections. We showed that with an equal number of parameters, these models are easier to train and give good classification performance. We further demonstrated that with the help of stacking global consistency can be enforced. We now apply this architecture to much larger images than common in the RBM literature. To obtain representations of the data that are invariant to small transformations, like translations, rotations, or scalings, we introduce a pooling operation, similar to the common convolutional approach. To do this on higher levels, we have to ensure that the learned representations retain structural properties of natural images. This is not the case in standard RBM approaches. To this end, we propose a modification to the contrastive divergence learning rule that enforces local coherence in hidden representations. This learning rule results in locally similar conditional probability distributions of latent variables. This in turn causes filters with impact areas that are close with respect to the image topology to be similar. Thereby, we introduce a method that constitutes a trade-off between the possibility to learn invariant filters (a property of the widely used convolutional neural network) and the possibility to learn locally adjusted filters (a property of the LIRBM). We evaluate our approach on the MNIST and Caltech 101 datasets. We are able to learn meaningful features and representations that exhibit local structural properties similar to those in the original images. We further use the learned topological features as an initialization for a locally connected neural network that is trained for classification. We show that this pretraining improves classification performance and shortens training times. II. BACKGROUND ON BOLTZMANN MACHINES (BM) A BM is an undirected graphical model with binary observed variables v ∈ {0, 1}
n
(visible nodes) and latent variables h ∈ {0, 1}
m
(hidden nodes). The energy function of a BM is given by E(v, h,θ) = −vT Wh − vT Iv − hT Lh − bT v − aT h, where θ = (W, I, L, b, a) are the model parameters, namely pairwise visible-hidden, visible-visible and hidden-hidden interaction weights, respectively, and b, a are the biases
Published in Proc. of International Joint Conference on Neural Networks (IJCNN) 2010 in Barcelona, Spain

Page 2

v c p1 h1
Fig. 1. Overview of our setup. v is the visible layer of a LIRBM, h1 is the hidden layer with multiple topographic maps (2 in this example). A sample variable in h1 is shown to be connected to its respective local impact area in v. p1 is a pooling layer, invariant to small translations and feature transformations. After training a stack of n LIRBMs one by one, we connect all units in all maps of the last pooling layer pn to all class-labelled output units in c and perform supervised finetuning in the resulting network.
of visible and hidden activation potentials. The diagonal elements of I and L are always zero. This yields a probability distribution p(v) p(v; θ) = 1 Z(θ) p∗(v; θ) = 1 Z(θ) ∑
h
e−E(v,h,θ), where Z(θ) is the normalizing constant (partition function) and p∗(·) denotes unnormalized probability. In RBMs, I and L are set to zero. Consequently, the conditional distributions p(v|h) and p(h|v) factorize com- pletely. This makes exact inference of the respective posteriors possible. Their expected values are given by 〈v|h〉p = σ(Wh + b) and 〈h|v〉p = σ(Wv + b), (1) where σ denotes element-wise application of the sigmoid function σ(x) = (1 + exp(−x))−1. In practice, Contrastive Divergence (CD, [1]) is used to approximate the true parameter gradient ∂ ln p(v)/∂wi,j = 〈vT h〉+ − 〈vT h〉− by a Markov Chain Monte Carlo algorithm. Here 〈·〉+ and 〈·〉− refer to the expected values with respect to the data distribution and model distribution, respectively. The expected value 〈vT h〉+ can be calculated in closed form using the formula for p(h|v). This calculation is called the “positive phase”. The model distribution 〈vT h〉−, on the other hand, can only be approximated using a Markov chain starting from the data and using the transition operators given by Equation (1). This process is termed the “negative phase”. Recently, Tieleman [5] proposed a faster alternative, called Persistent Contrastive Divergence (PCD), which employs a persistent Markov chain to approximate 〈·〉−. We use PCD throughout this paper.
Fig. 2. Selected Features from first, second and third hidden layer of LIRBM. The impact area of units in higher layers increases and captures more complex aspects of the input. The first row shows actual filters, the second and third row show features of single hidden nodes projected down to the visible layer.
RBMs can be stacked to build hierarchical models. The training of stacked models proceeds layer-wise by training the high-level models using the activations of the hidden nodes of the already trained layer below as input. III. LOCAL IMPACT RESTRICTED BOLTZMANN MACHINES In [4], we introduced a modification to the architectures of Section II called the Local Impact Restricted Boltzmann Machine (LIRBM). LIRBMS have a restricted impact area for each hidden node. To this end, we arranged the hidden nodes in multiple maps, each of which resembles the visible layer in its geometry. As a result, each hidden node hj can be assigned a position pos(hj) ∈ N2 in the input (pixel) coordinate system. This approach is similar to the common approach in convolutional neural networks [6]. In contrast to the convolutional procedure, we do not require the weights to be equal for all hidden units within one map. A common problem in training convolutional RBMs is that the over-complete representation of the visible nodes by the hidden nodes allows the filters to learn a trivial identity ([7], [8]). The LIRBM does not suffer from this problem for two reasons: First, we use PCD for training. This means even if a training example can be perfectly reconstructed, there is still a non-zero learning signal. This signal stems from the dependency of the approximated gradient on the state of the persistent Markov chain. Second, due to not sharing weights, for a trivial solution to occur, all filters have to learn the identity separately. A. Topological Feature Maps To obtain invariant features, it is common to arrange features in a topological map (e.g. [9], [10], [11]), where similar features are close to each other. All filters are then applied to the input image and responses are grouped together in neighborhoods which are local in the topological map. After

Page 3

grouping, the features are invariant to transformations which are local in the topology of the feature map. In our architecture, we now introduce modifications to learn multiple topological maps. Moreover, we endow the feature maps with a topology which matches the input topology. This is in contrast to previous work, where the topology of the features was chosen arbitrarily. Specifically, we arrange the features in the topology that is naturally given by the geometry of each map defined in Section III. The structure of the LIRBM (or RBMs, for that matter), however, is symmetric with respect to hidden units with same impact area. We break this symmetry by indirectly encouraging local filter similarity within maps. To this end, we propose a modification to the contrastive divergence learning rule. During the positive phase, we simulate local similarity using a local distance kernel. The conditional distribution of the hidden units is given by: p(hi|v) = ∑
j∈M(i)
1 √ 2πs e
− (
d(i,j)2 2s2
)
· σ ( ∑
k
Wkjvk ) . (2) Here, s determines the degree of locality in the topological map, d(i, j) = pos(hi) − pos(hj) denotes the Eucledian distance of two latent variables in the map, and M(i) is the set of all hidden units sharing a map with hi. The negative phase is left unchanged. The idea behind the local distance kernel is that during the positive phase, we pretend that close latent variables have similar conditional distributions. Contrastive divergence learning fits the model distribution to the data distribution and consequently creates features with locally similar activation patterns. Note that it is not necessary to constrain the features directly: we rather encourage neighboring hidden units within a map to react to the same input patterns. The similarity of features then emerges by virtue of the local statistics of natural images. For inference, since filters are already ordered topologically, the distance kernel is not necessary anymore and the hidden activations are given by Equation (1) again. B. Higher-Order Topological Feature Maps Due to the local similarity of features within a map, latent variable activities for a given input have an image- like structure within a map: as in natural images, neighboring positions are strongly correlated. Consequently, a coherent grouping of features that are close within a map is possible. Grouping can, for example, be performed by a simple maximum pooling operation. This operation yields another, more compact representation which is again locally coherent and image-like. As these are the properties of the original input, it is possible to train another LIRBM using the pooled representations as its input. The resulting hierarchy of representations is increasingly invariant to local distortions and translations. Higher-order features can thus represent more complex patterns by combining input from a specific location across all maps in the layer below. C. Initializing Neural Networks with Topological Features In many object recognition applications, it was shown that initializing the weight of a neural network with an unsuper- vised feature extraction algorithm improved the classification results. This procedure is commonly known as unsupervised pretraining and supervised finetuning in the deep learning literature. We use a locally connected neural network (LCNN, e.g. [12], [13]) for classification. Using the weights learned in a Local Impact Restricted Boltzmann Machine for initial- ization is straight-forward, since the arrangement of neurons in a LCNN matches the arrangement of hidden nodes in the LIRBM. This makes it possible to directly copy the weights learned by the LIRBM into a LCNN. We then replace the soft-max layers of the LIRBM by simple maximum pooling layers. There are, however, no standard procedures for the back-propagation of error through max-pooling layers when no weight sharing is employed. Here, we use the simplest way of performing gradient descent across the max-pooling layer, by back-propagating the error only to the input node that had the maximum activity in the forward pass. While in principle, the same path could be taken every time the output is calculated, the pre-training largely rules out these pathological cases by providing sensible default filters. Finally, we connect an output layer to the LIRBM by adding random weights between each output (classification) unit and all units in the last pooling layer of the LIRBM, respectively. We can now proceed with the fine-tuning using gradient de- scent (backpropagation of error). For this work, we employed batch learning using RPROP [14] with default parameters to minimize the cross-entropy error. First, the output layer, which was not pretrained, is adjusted while all other weights remain fixed. This prevents problems induced by large errors in the randomly initialized weights of the output layer destroying the pre-trained weights in earlier layers. After 50 epochs the output weights have converged and we continue training network as a whole. D. GPU Acceleration As the 2D structure of the input maps well to the architecture of Graphics Processing Units (GPUs), we imple- mented the proposed architecture using the NVIDIA CUDA programming framework. To this end, we created a matrix library which, apart from common dense-matrix operations can operate on sparse matrices in diagonal format (DIA). This enables us to efficiently process large images, where the corresponding dense matrix W would not even fit in memory. Specifically, we implemented H ← WT V and V ← WH, where H and V are dense matrices containing hidden/visible activations, respectively, and W is the LIRBM weight matrix. Efficiency is further improved by processing 16 images in parallel. The gradient, ∆W ← V HT is calculated by multiplying all blocks of V and H, where the resulting block is cut by one of the diagonals in ∆W. For maximum pooling operations we made use of routines provided by Alex

Page 4

LCNN With Pretraining Without Pretraining maps 1,2,4; no pooling 1.4 1.6 maps 1,2,4; max pooling 1.4 1.9 maps 1,4,8; no pooling 1.3 1.3 maps 1,4,8; max pooling 1.2 1.3 TABLE I MNIST CLASSIFICATION ERROR IN PERCENT
Krizhevsky1. Our implementation, which focuses on ease of use by exporting all functionality to Python and seamlessly integrating with NumPy, gives an overall speedup of about 50 when compared to an optimized single CPU version. The code is available at our website2. IV. RELATED WORK In the context of generative models of images, little work has been done to exploit local structure. A well known supervised learning approach that makes use of the local structure of images is the convolutional neural network by LeCun etal. [6]. LeCun’s ideas where transfered to the field of generative graphical models by Lee etal. [7] and Norouzi et al. [8]. Their models, which employ weight sharing and max-pooling, discard global image statistics. Our model does not suffer from this restriction. When training landscapes, for example, our model would be able to learn, even on the lowest layer, that there is always sky depicted in the upper half of the image. In the neural networks context, architectures which exhibit a similar local structure to ours were investigated by Behnke [12] and have been applied to natural images on a large scale by Uetz [15]. The latter architecture proved to be competitive to state-of-the-art in object recognition. In our, as well as in convolutional architectures, the activities of each feature map are a filtered version of the input. However, in convolutional architectures, the same filter is applied at each position of an image. Our topological features, on the contrary, are adjusted specifically to the local statistics at a given position. Our model also shares some of the ideas with Kavukcuoglu [9]: The authors learned a set of filters, placed on a two dimensional grid, using a sparse coding algorithm. Features are first learned on a topographical map and then grouped according to proximity in the topology. While the topology in [9] is artificially constructed, with no relation to the input, our topology reflects and is adjusted to the topology of the input space. Other methods for topological filters are based on non- negative matrix factorization (NMF, [16], [11]). However, both methods do not learn features specific to the spatial layout of the input. Pretraining deep neural networks with unsupervised meth- ods is covered extensively in the literature. Good surveys of the state-of-the-art can be found in the papers of Bengio [17] and Erhan et. al. [18].
1http://www.cs.utoronto.ca/˜kriz 2http://www.ais.uni-bonn.de/deep_learning
Fig. 5. MNIST samples and hidden activations. The first row shows sample MNIST digits. The second row displays the respective hidden representations. Each digit is represented with eight maps.
V. RESULTS In our implementation, we used maximum pooling in 2×2 non-overlapping windows. The width s of the distance kernel was set such that only direct neighbors influenced each other. We trained our model on the MNIST dataset of handwritten digits [6] and on the Caltech 101 [19] object categories database. Training on both datasets was purely unsupervised. A. MNIST 1) Unsupervised Feature Extraction: The MNIST dataset consists of 60,000 training and 10,000 test images. The images are centered gray scale images of handwritten digits. The size of each image is 28 × 28 pixels. We trained our hierarchical model with three layers of LIRBMs, where the layers had four, eight and 16 maps of topological features, respectively. The size of the local impact area was set to 9 × 9 pixels on all layers. Training the model took approximately 30 minutes on GPU. We first analyzed the learned feature maps. As expected, the features within a map are locally similar and vary smoothly between locations. In regions where the data exhibits a large variance, each map concentrates on a different aspect of the data. We show two learned feature maps of the lowest layer in Figure 3. Higher-layer features exhibit similar continuity properties. Apart from visualizing the weight matrices directly, we can also project down activation patterns corresponding to a single high-level unit turned on, and the others clamped to zero. To calculate the hidden activations of the lower layer, we have to reverse the max-pooling operation. This is done by supersampling and dividing the resulting activations by the size of the pooling window. After projecting down all filters of a given layer, we subtracted the mean from the resulting activities in the image domain for visualization purposes. We show sample features in Figure 2. Features of the first hidden layer represent simple edge filters, while the filters of the second hidden layer represent parts of numbers. The third-layer features have a impact area large enough to cover the whole image. Note, that despite the ad-hoc reversal of max-pooling, filters are still quite pronounced. 2) Classification: In order to assess the fitness of the extracted features for classification, we train a locally con- nected neural network, as described in Section III-C. Since we aim at comparing different learning methods, we do not augment the training set with deformed or translated versions of the training data, which generally guarantees performance improvements (e. g. [6]).

Page 5

Fig. 3. Two topographic feature maps learned on the MNIST dataset in our LIRBM model. The filters are clearly specific to the properties of the dataset in their respective areas. Features vary smoothly with the topology but differ between maps. Fig. 4. Topological learning on Caltech 101. The three leftmost columns display Caltech images (top) and their first-layer representations (bottom). The four variations are the result of the four different filter maps. The rightmost column shows first-layer activations of the same LIRBM trained without topographic feature maps.
We compare our approach with a vanilla fully connected neural network, a locally connected network without pretrain- ing and a pretrained network where the pretraining did not use topological filters. Using a parameter scan we found that the best result for using a vanilla neural network are achieved with one hidden layer containing 1000 neurons. Using this setup, we obtained a test error of 1.7%, which serves as a baseline. We used locally connected networks with two hidden layers and two, four, eight or 16 maps per layer. The size of the local receptive fields (or impact areas in the LIRBM) was kept fixed to 9 × 9. If pooling was applied, the size of the pooling window was set to 2 × 2. Table I summarizes the obtained results. From the table it is clear that locally connected networks have an advantage over fully connected networks. The use of max-pooling layers does not significantly influence the classification rate. Using maximum pooling, however, makes the model more scalable since the upper layers decrease in size and less parameters have to be learned and stored. We find that the pre-trained neural networks perform best. The influence of topological filters was also analysed. We found

Page 6

that networks pretrained without topological filters performed worse (1.32%) than networks which used topological filters (1.18%). This tendency is expected to strengthen for deeper networks as more dissimilar filters are pooled. B. Caltech 101 Due to the local impact areas of our model, it is feasible to train even on large images, which is not possible using fully connected RBMs. We trained two models on the Caltech 101 dataset. We preprocessed the images by converting them to gray scale and fit them into a 128×128 pixel image, keeping the aspect ration constant. The images were padded to square aspect ratio using their respective average gray level. We further faded the image border into the padding to remove side effects caused by the image edges. To train on natural images, it is necessary to use Gaussian visible units (see [20]). For this purpose, we normalized the images such that each pixel has zero mean and unit variance. The conditional distribution of the visible units after this normalization is given by p(v|h) ∝ N(Wh, 1). We trained one standard LIRBM model and one using the modified learning rule for topological feature maps. Both models consisted of three hidden layers with four, eight and eight maps, respectively. The training time for this model was about one hour. Figure 4 shows sample inputs and activations from both models. Only the hidden activations of the model trained with topographic filter maps retain the image structure and partly preserve local texture. Again, different filter maps concentrate on different aspects of the image. VI. CONCLUSION In this work, we addressed the problem of learning locally coherent representations in locally connected Restricted Boltz- mann Machines. Locality is a desirable property of learning algorithms especially on images, since local algorithms are faster and can retain basic structural properties of the data. However, when training a locally connected RBM in a naıve way, the resulting representations do not reflect local correlations present in natural images. This loss of structure prevents stacking of such models. Here, we present a novel modification of the contrastive divergence learning rule. We map the topology of the input space to the hidden units and thereby encourage the learning algorithm to learn similar features at neighboring positions. The resulting activations then exhibit the desired properties. We evaluated our method on the MNIST and Caltech 101 datasets and found that our topological features are useful for classification and generate significantly more image-like hidden representations when compared to models trained without topological feature maps. In future work, we are going to extend our hierarchical learning architecture to (semi-) supervised fine-tuning using deep Boltzmann machines and third order methods. REFERENCES
[1] G Hinton, S Osindero, and W Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006. [2] J Huang and D Mumford. Statistics of natural images and models. In CVPR, 1999. [3] MA Ranzato, L Boureau, and LeCun Y. Sparse feature learning for deep belief networks. Advances in neural information processing systems, 20, 2007. [4] H Schulz, A Müller, and S Behnke. Exploiting local structure in stacked Boltzmann machines. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), 2010. [5] T Tieleman. Training restricted Boltzmann machines using approxima- tions to the likelihood gradient. In ICML, 2008. [6] Y LeCun, L Bottou, Y Bengio, and P Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, November 1998. [7] H Lee, R Grosse, R Ranganath, and Y Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009. [8] M Norouzi, M Ranjbar, and G Mori. Stacks of Convolutional Restricted Boltzmann Machines for Shift-Invariant Feature Learning. In CVPR, 2009. [9] K Kavukcuoglu, MA Ranzato, R Fergus, and Y Le-Cun. Learning invariant features through topographic filter maps. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2009. CVPR Workshops 2009, pages 1605–1612, 2009. [10] Y Karklin and MS Lewicki. Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457(7225):83–86, 2008. [11] K Hosoda, M Watanabe, H Wersing, E Körner, H Tsujino, H Tamura, and I Fujita. A model for learning topographically organized parts-based representations of objects in visual cortex: Topographic nonnegative matrix factorization. Neural Computation, 21(9):2605–2633, 2009. [12] S Behnke. Hierarchical neural networks for image interpretation. Springer-Verlag, 2003. [13] R Uetz and S Behnke. Large-scale Object Recognition with CUDA- accelerated Hierarchical Neural Networks. In Proc. Intelligent Com- puting and Intelligent Systems, 2009. [14] Martin Riedmiller and Heinrich Braun. A direct adaptive method for faster backpropagation learning: the RPROP algorithm. pages 586–591, 1993. [15] R Uetz and S Behnke. Locally-Connected Hierarchical Neural Networks for GPU-accelerated Object Recognition. In NIPS 2009 Workshop on Large-Scale Machine Learning: Parallelism and Massive Datasets, 2009. [16] T Zhang, B Fang, Y Tang, G He, and J Wen. Topology preserving non- negative matrix factorization for face recognition. IEEE Transactions on Image Processing, 17(4):574, 2008. [17] Y Bengio. Learning deep architectures for AI. Now Publishers Inc, 2009. [18] D Erhan, P Manzagol, Y Bengio, S Bengio, and P Vincent. The difficulty of training deep architectures and the effect of unsupervised pre-training. In Proc. Artificial Intelligence and Statistics (AISTATS’09), pages 153–160, 2009. [19] L Fei-Fei, R Fergus, and P Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1):59–70, 2007. [20] G Hinton and R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504, 2006.

Topological Features in Locally Connected RBMs

Recent Documents:

Recent Search: