Our method was presented in the following paper:
Gil Levi and Tal Hassner, Emotion Recognition in the Wild via Convolutional Neural Networks and Mapped Binary Patterns, Proc. ACM International Conference on Multimodal Interaction (ICMI), Seattle, Nov. 2015
For code, models and examples, please see our project page.
The presented work was developed and co-authored with Prof. Tal Hassner.
I would also like to thank David Beniaguiev for his useful advice regarding this and other research projects.
Automatic facial expression recognition has been the subject of much research in recent years. However, the performance of current algorithms is still far below human performance, and much progress still needs to be made to reach human level. This is especially noteworthy since recent works on the related task of face recognition reported super-human performance [1,2,3].
Motivated by the success of deep learning methods in various classification problems (e.g., object classification [4], human pose estimation [5], face parsing [6], facial keypoint detection [7], action classification [8], age and gender classification [9] and many more), we propose to use Deep Convolutional Neural Networks [10] for facial emotion recognition. Moreover, we present a novel mapping from image intensities to an illumination-invariant 3D space based on the notion of Local Binary Patterns [11,12,13]. Unlike regular LBP codes, our mapping produces values in a metric space which can be used to finetune existing RGB convolutional neural networks (see the figure above).
We apply our LBP mapping with various LBP parameters to the CASIA WebFace collection [14] and use the mapped images (along with the original RGB images) to train an ensemble of CNN models with different architectures. This allows our models to learn general face representations. We then finetune the pretrained models on a much smaller set of emotion-labeled face images. We demonstrate our method on the Emotion Recognition in the Wild Challenge (EmotiW 15), Static Facial Expression Recognition sub-challenge (SFEW) [15]. Our method achieves an accuracy of 54.5%, which is a 15.3% absolute improvement over the baseline provided by the challenge authors (a relative gain of roughly 40%).
First, we give an overview of our method. We assume the input images have been converted to grayscale, cropped to the region of the face, and aligned using the Zhu and Ramanan facial feature detector [16].
Each of those steps is defined in detail below.
Local Binary Patterns [11,12,13] capture local texture by considering local image similarity. To produce an LBP code at a certain pixel location, the intensity values of the 8 adjacent pixels are thresholded by the center pixel intensity. This produces an 8-bit binary vector where each bit is 1 if the corresponding adjacent pixel’s intensity is larger than the center pixel’s intensity and 0 otherwise. This process is depicted in the following illustration:
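The thresholding step described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the paper: the neighbour ordering (clockwise from the top-left pixel) and the helper names `lbp_bits` and `bits_to_decimal` are assumptions made for this example.

```python
import numpy as np

# Offsets of the 8 neighbours, clockwise from the top-left pixel.
# The starting neighbour and direction are assumptions of this sketch.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                     (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_bits(image, row, col):
    """Return the 8-bit LBP vector at an interior pixel (row, col)."""
    center = image[row, col]
    # A bit is 1 when the neighbour is strictly brighter than the center.
    return np.array([1 if image[row + dr, col + dc] > center else 0
                     for dr, dc in NEIGHBOUR_OFFSETS], dtype=np.uint8)


def bits_to_decimal(bits):
    """Interpret the bit vector as a binary number, first bit most significant."""
    return int("".join(str(int(b)) for b in bits), 2)


# A 3x3 patch in which only the top-left neighbour is brighter than the center.
patch = np.array([[9, 1, 1],
                  [1, 5, 1],
                  [1, 1, 1]])
code = lbp_bits(patch, 1, 1)
print(code, bits_to_decimal(code))
```

Only the top-left neighbour exceeds the center value here, so a single bit of the code is set.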
In the common LBP pipeline, LBP codes are extracted at each pixel location and transformed to decimal values. The image is then divided into regions and for each region a histogram of LBP codes is created. Finally, the different histograms are concatenated to produce a single global descriptor for the image. Here, instead, we simply calculate an 8-bit binary vector at each pixel, which we afterward map into a 3-dimensional space.
After extracting LBP codes at each pixel location, we would like to map the 8-channel binary image to a regular 3-channel image. That would allow us to later use pre-trained networks and finetune them to the specific problem at hand. One method for dimensionality reduction is Multi-Dimensional Scaling (MDS) [17,18]. Given a high-dimensional set of data and a matrix describing the distance (dissimilarity) between each pair of high-dimensional data points, MDS produces a mapping which transforms the high-dimensional dataset to the desired low-dimensional space while trying to preserve the distances between each pair of data points. The question that now arises is how to choose a suitable metric between the 8-bit LBP codes in a way that appropriately captures their visual similarity.
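To make the MDS step concrete, here is a small sketch of classical (Torgerson) MDS written with NumPy. The post does not state which MDS variant was used, so treat the choice of classical MDS, the function name `classical_mds`, and the random test points as assumptions of this example. It takes any symmetric distance matrix and returns a low-dimensional embedding.

```python
import numpy as np


def classical_mds(distances, n_components=3):
    """Embed points in n_components dimensions from a symmetric distance matrix."""
    n = distances.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-center the squared distances to obtain a Gram matrix.
    gram = -0.5 * centering @ (distances ** 2) @ centering
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigenvalues[order], 0.0, None))
    return eigenvectors[:, order] * scale


# Sanity check on hypothetical data: distances between 3D points
# should be reproduced by a 3D embedding.
rng = np.random.default_rng(0)
points = rng.normal(size=(20, 3))
distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
embedding = classical_mds(distances, n_components=3)
recovered = np.linalg.norm(embedding[:, None, :] - embedding[None, :, :], axis=-1)
print(np.allclose(distances, recovered))
```

In the setting of this post, the distance matrix would be the 256 x 256 matrix of distances between all possible 8-bit LBP codes, and each code would be assigned the resulting 3D coordinate.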
The most straightforward approach would be to use the standard Euclidean norm, as is done for regular images. That would mean taking the decimal value of each LBP code and computing the absolute value of their difference. However, that would introduce large differences between codes that are actually very similar. We demonstrate this with a toy example. Consider the following three binary LBP codes: a = (1,0,0,0,0,0,0,0), b = (0,0,0,0,0,0,0,0), c = (0,0,0,0,0,0,0,1). Clearly, the visual difference between ‘a’ and ‘b’ is the same as the visual difference between ‘b’ and ‘c’: they differ in only one bit in both cases. Now, let’s compute their Euclidean distance: |a-b| = |128-0| = 128, |b-c| = |0-1| = 1. We get very different Euclidean distances for pairs of LBP codes that have the same ‘visual’ difference.
A different approach would be to compare the binary 8-bit LBP codes directly, which is usually done with the Hamming distance: the number of bits that differ between the two binary vectors, which can be computed as the sum of the result of the XOR operation of the two vectors. This metric, however, completely ignores the locations of the differing bits. This seemingly negligible detail can actually introduce large differences between LBP vectors produced from very similar intensity patterns. To demonstrate this, consider the following three binary 8-bit LBP vectors: a = (1,0,0,0,0,0,0,0), b = (0,1,0,0,0,0,0,0), c = (0,0,0,0,0,0,0,1):
The Hamming distance between pattern a and pattern b is equal to the Hamming distance between pattern a and pattern c (both equal 2, since each pair differs in two bit positions). However, pattern b differs from pattern a by a slight single-pixel rotation around the central pixel, whereas pattern c is a mirror of pattern a. Thus, the local patches that produced patterns a and b are more similar than the local patches that produced patterns a and c. Clearly, the spatial location of the differing bits also needs to be taken into consideration.
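The two toy examples above can be checked numerically. The sketch below uses plain Python; the helper names `to_decimal` and `hamming` are introduced only for this illustration.

```python
def to_decimal(bits):
    """Read a tuple of bits as a binary number, first bit most significant."""
    return int("".join(str(b) for b in bits), 2)


def hamming(u, v):
    """Number of positions in which two equal-length bit tuples differ."""
    return sum(x ^ y for x, y in zip(u, v))


# First toy example: decimal difference of codes one bit apart.
a = (1, 0, 0, 0, 0, 0, 0, 0)
b = (0, 0, 0, 0, 0, 0, 0, 0)
c = (0, 0, 0, 0, 0, 0, 0, 1)
decimal_ab = abs(to_decimal(a) - to_decimal(b))
decimal_bc = abs(to_decimal(b) - to_decimal(c))
print(decimal_ab, decimal_bc)  # very different values for one-bit changes

# Second toy example: the Hamming distance ignores where the bits differ.
b_shifted = (0, 1, 0, 0, 0, 0, 0, 0)
hamming_ab = hamming(a, b_shifted)
hamming_ac = hamming(a, c)
print(hamming_ab, hamming_ac)  # identical, although the patterns are not equally similar
```

The decimal difference exaggerates the change in the most significant bit, while the Hamming distance cannot tell a small rotation from a mirrored pattern.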
To this end, we find the Earth Mover’s Distance [19] to be suitable for our scenario. Originally, the Earth Mover’s Distance was designed to compare histograms in a way that takes into account the spatial location of the bins. The Earth Mover’s Distance can be illustrated by the amount of work needed to align two piles of dirt (representing the two histograms) against each other: the energy of moving dirt takes into consideration both the differences in the piles’ height at each location and how far the dirt needs to be moved.
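As a rough illustration of why the Earth Mover’s Distance is sensitive to bit positions, the sketch below computes the well-known closed form for two equal-mass 1D histograms with unit spacing between bins (the sum of absolute cumulative differences). This is a simplification and an assumption of this example: it treats the 8 bits as lying on a line, whereas the LBP neighbours lie on a ring around the center pixel, and the exact formulation used in the paper is not reproduced here. The function name `emd_1d` is hypothetical.

```python
import numpy as np


def emd_1d(p, q):
    """Earth Mover's Distance between two equal-mass 1D histograms (unit bin spacing)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if not np.isclose(p.sum(), q.sum()):
        raise ValueError("this closed form needs histograms of equal total mass")
    # The running surplus at each bin is the mass that must cross to the next bin.
    return float(np.abs(np.cumsum(p - q)).sum())


a = (1, 0, 0, 0, 0, 0, 0, 0)
b = (0, 1, 0, 0, 0, 0, 0, 0)
c = (0, 0, 0, 0, 0, 0, 0, 1)
print(emd_1d(a, b))  # the single set bit moves one position
print(emd_1d(a, c))  # the single set bit moves across the whole vector
```

Unlike the Hamming distance, this measure assigns a small distance to the one-position shift and a large one to the far-apart pair. On a circular arrangement the first and last bits would be adjacent, which is one reason a cyclic treatment matters for real LBP codes.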
Due to the huge number of model parameters, deep CNNs are prone to overfitting when they are not trained with a sufficiently large training set. The EmotiW challenge contains only 891 training samples, making it dramatically smaller than other image classification datasets commonly used for training deep networks (e.g., the Imagenet dataset [20]). To alleviate this problem, we train our models in two steps. First, we finetune pre-trained object classification networks on a large face recognition dataset, namely the CASIA WebFace dataset [21]. This allows the network to learn general features relevant for face classification problems. Then, we finetune the resulting networks for the problem of emotion recognition using the smaller training set given in the challenge. Furthermore, we apply data augmentation during training by feeding the network with 5 different crops of each image and their mirrors (over-sampling). This is also done at test time and in practice improves the accuracy of the models.
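The over-sampling step can be sketched as follows. The post only says that 5 crops and their mirrors are used; the choice of the four corners plus the center, the image and crop sizes, and the function name `oversample` are assumptions made for this illustration.

```python
import numpy as np


def oversample(image, crop_size):
    """Return 10 crops: four corners and the center, each with its horizontal mirror."""
    height, width = image.shape[:2]
    crop_h, crop_w = crop_size
    tops = [0, 0, height - crop_h, height - crop_h, (height - crop_h) // 2]
    lefts = [0, width - crop_w, 0, width - crop_w, (width - crop_w) // 2]
    crops = []
    for top, left in zip(tops, lefts):
        crop = image[top:top + crop_h, left:left + crop_w]
        crops.append(crop)
        crops.append(crop[:, ::-1])  # horizontal mirror of the same crop
    return np.stack(crops)


# Hypothetical sizes: a 256x256 RGB image cropped to 224x224.
image = np.arange(256 * 256 * 3).reshape(256, 256, 3)
crops = oversample(image, (224, 224))
print(crops.shape)
```

At test time, the class probabilities predicted for the 10 crops of an image would typically be averaged to obtain a single prediction.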
We experimented with 4 different network architectures and 5 different image transformations, giving us a total of 20 deep models which we later ensemble by a weighted average of each model’s predictions. The network architectures that we used are VGG_S, VGG_M_2048, VGG_M_4096 [22] and GoogleNet [23]. We used mapped LBP transformations with radii of 1, 5 and 10 pixels as well as the original RGB values. Once the networks were trained, we used the validation set to learn a weighted average of their predictions.
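A weighted-average ensemble of this kind can be sketched as below. How the weights were learned on the validation set is not described in this post, so the weights and probabilities here are hypothetical values, and `ensemble_predict` is a name introduced only for this example.

```python
import numpy as np


def ensemble_predict(model_probs, weights):
    """Combine per-model class probabilities with a weighted average.

    model_probs has shape (n_models, n_samples, n_classes);
    weights has shape (n_models,).
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    averaged = np.tensordot(weights, model_probs, axes=1)
    return averaged.argmax(axis=1)


# Two hypothetical models, two samples, two classes.
probs = np.array([
    [[0.7, 0.3], [0.3, 0.7]],  # model 0
    [[0.2, 0.8], [0.4, 0.6]],  # model 1
])
weighted_labels = ensemble_predict(probs, weights=[3.0, 1.0])
uniform_labels = ensemble_predict(probs, weights=[1.0, 1.0])
print(weighted_labels, uniform_labels)
```

With these toy numbers the first sample is labeled differently under the weighted and the uniform average, which shows why the choice of weights matters.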
The table below lists our results for all the various combinations of image transformation and network architecture:
The table seems a bit overloaded with numbers, but one important thing we can notice is that the networks trained on the original RGB images did not give the best results — the best accuracy was obtained by the networks trained on the mapped LBP images.
Below is a histogram of the importance each model got in the final ensemble:
Note here that the models trained on the mapped LBP images had the most importance in the final ensemble.
Also see below some of the predictions made by our system. In some cases, the faces that the system misclassified are heavily blurred, not correctly cropped, or shown in a challenging head pose.
We showed that by applying a certain image transformation, mapped LBP in this case, we can force the model to learn different features. This becomes useful when ensembling such models: since each model learned different information, the models complement each other and the accuracy we get when ensembling them is dramatically higher than that of each model alone. This is evident from our results, as our best model achieved an accuracy of 44.73% while our ensemble achieved 51.75% on the validation set. Moreover, we achieved an accuracy of 54.56%, which is a 15.36% improvement over the baseline provided by the challenge authors.
[1] Schroff, Florian, Dmitry Kalenichenko, and James Philbin. “Facenet: A unified embedding for face recognition and clustering.” arXiv preprint arXiv:1503.03832 (2015).
[2] Sun, Yi, et al. “Deepid3: Face recognition with very deep neural networks.” arXiv preprint arXiv:1502.00873 (2015).
[3] Sun, Yi, Xiaogang Wang, and Xiaoou Tang. “Deeply learned face representations are sparse, selective, and robust.” arXiv preprint arXiv:1412.1265 (2014).
[4] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems. 2012.
[5] Toshev, Alexander, and Christian Szegedy. “Deeppose: Human pose estimation via deep neural networks.” Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014.
[6] Luo, Ping, Xiaogang Wang, and Xiaoou Tang. “Hierarchical face parsing via deep learning.” Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.
[7] Sun, Yi, Xiaogang Wang, and Xiaoou Tang. “Deep convolutional network cascade for facial point detection.” Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013.
[8] Sun, Yi, Xiaogang Wang, and Xiaoou Tang. “Deep convolutional network cascade for facial point detection.” Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013.
[9] Gil Levi and Tal Hassner, Age and Gender Classification using Convolutional Neural Networks, IEEE Workshop on Analysis and Modeling of Faces and Gestures (AMFG), at the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Boston, June 2015.
[10] LeCun, Yann, et al. “Backpropagation applied to handwritten zip code recognition.” Neural computation 1.4 (1989): 541-551.
[11] Ojala, Timo, Matti Pietikäinen, and David Harwood. “A comparative study of texture measures with classification based on featured distributions.” Pattern recognition 29.1 (1996): 51-59.
[12] Ojala, Timo, Matti Pietikäinen, and Topi Mäenpää. “A generalized Local Binary Pattern operator for multiresolution gray scale and rotation invariant texture classification.” ICAPR. Vol. 1. 2001.
[13] Ojala, Timo, Matti Pietikäinen, and Topi Mäenpää. “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns.” Pattern Analysis and Machine Intelligence, IEEE Transactions on 24.7 (2002): 971-987.
[14] Yi, Dong, et al. “Learning face representation from scratch.” arXiv preprint arXiv:1411.7923 (2014).
[15] Dhall, Abhinav, et al. “Video and image based emotion recognition challenges in the wild: Emotiw 2015.” Proceedings of the 2015 ACM on International Conference on Multimodal Interaction. ACM, 2015.
[16] Zhu, Xiangxin, and Deva Ramanan. “Face detection, pose estimation, and landmark localization in the wild.” Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.
[17] Borg, Ingwer, and Patrick JF Groenen. Modern multidimensional scaling: Theory and applications. Springer Science & Business Media, 2005.
[18] Seber, George AF. Multivariate observations. Vol. 252. John Wiley & Sons, 2009.
[19] Rubner, Yossi, Carlo Tomasi, and Leonidas J. Guibas. “A metric for distributions with applications to image databases.” Computer Vision, 1998. Sixth International Conference on. IEEE, 1998.
[20] Russakovsky, Olga, et al. “Imagenet large scale visual recognition challenge.” International Journal of Computer Vision 115.3 (2015): 211-252.
[21] Yi, Dong, et al. “Learning face representation from scratch.” arXiv preprint arXiv:1411.7923 (2014).
[22] Chatfield, Ken, et al. “Return of the devil in the details: Delving deep into convolutional nets.” arXiv preprint arXiv:1405.3531 (2014).
[23] Szegedy, Christian, et al. “Going deeper with convolutions.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
The notebook used in the demo is available here and the various deep networks and definition files used to run the demo are available here.
Our method was presented in the following paper:
Gil Levi and Tal Hassner, Age and Gender Classification using Convolutional Neural Networks, IEEE Workshop on Analysis and Modeling of Faces and Gestures (AMFG), at the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Boston, June 2015.
For code, models and examples, please see our project page.
New! TensorFlow implementation of our method.
The presented work was developed and co-authored with my thesis supervisor, Prof. Tal Hassner.
Though age and gender classification plays a key role in social interactions, performance of automatic facial age and gender classification systems is far from satisfactory. This is in contrast to the super-human performance in the related task of face recognition reported in recent works [3,4].
Previous approaches for age and gender classification were based on measuring differences and relations between facial dimensions [5] or on hand-crafted facial descriptors[6,7,8]. Most have designed classification schemes tailored specifically for age or gender estimation, for example [9] and others. Few of the past methods have considered challenging in-the-wild images [6] and most did not leverage the recent rise in availability and scale of image datasets in order to improve classification performance.
Motivated by the tremendous progress made in face recognition research by the use of deep learning techniques [10], we propose a similar approach for age and gender classification. To this end, we train deep convolutional neural networks [11] with a rather simple architecture due to the limited amount of training data available for those tasks.
We test our method on the challenging, recently proposed AdienceFaces benchmark [6] and show it to outperform previous methods by a substantial margin. The AdienceFaces benchmark depicts in-the-wild settings. Example images from this collection are presented in the figure above.
Currently, databases of in-the-wild face images which contain age and gender labels are relatively small in size compared to other popular image classification datasets (for example, the Imagenet dataset[12] and the CASIA WebFace dataset [13]). Overfitting is a common problem when training complex learning models on a limited dataset, therefore we take special care in preventing overfitting in our method. This is done by choosing a relatively “modest” architecture, incorporating two drop-out layers and augmenting the images with random crops and flips in the training phase.
The same network architecture is used for both age and gender classification. The proposed network comprises only three convolutional layers and two fully-connected layers with a small number of neurons. This architecture is relatively shallow compared to the much larger architectures applied, for example, in [14] and [15]. A schematic illustration of the network is below:
The network contains three convolutional layers, each followed by a ReLU operation and a pooling layer. The first two convolutional layers are also followed by a local response normalization (LRN) layer [14]. The first convolutional layer contains 96 filters of 7×7 pixels, the second contains 256 filters of 5×5 pixels, and the third and final convolutional layer contains 384 filters of 3×3 pixels. Finally, two fully-connected layers are added, each containing 512 neurons and each followed by a ReLU operation and a dropout layer.
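To give a feel for how modest this architecture is, the short sketch below counts the learnable parameters implied by the filter counts and sizes listed above. It assumes a 3-channel RGB input, ungrouped convolutions, and one bias per filter or neuron; these are assumptions of this illustration and are not stated in the post. The first fully-connected layer is left out because its size depends on the input resolution, strides and pooling, which are not given here.

```python
def conv_params(in_channels, out_channels, kernel):
    """Weights plus biases of a square, ungrouped convolutional layer."""
    return out_channels * kernel * kernel * in_channels + out_channels


def fc_params(in_features, out_features):
    """Weights plus biases of a fully-connected layer."""
    return in_features * out_features + out_features


# (name, input channels, filters, kernel size) as described in the text;
# the 3 input channels of conv1 are an assumption (RGB input).
conv_layers = [("conv1", 3, 96, 7), ("conv2", 96, 256, 5), ("conv3", 256, 384, 3)]
counts = {name: conv_params(i, o, k) for name, i, o, k in conv_layers}
counts["fc2"] = fc_params(512, 512)  # second fully-connected layer: 512 -> 512

for name, count in counts.items():
    print(name, count)
```

Under these assumptions the three convolutional layers together hold roughly 1.5 million parameters.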
We tested our method on the recently proposed AdienceFaces [6] benchmark for age and gender classification. The AdienceFaces benchmark contains automatically uploaded Flickr images. As the images were automatically uploaded without prior filtering, they depict challenging in-the-wild settings and vary in facial expression, head pose, occlusions, lighting conditions, image quality etc. Moreover, some of the images are of very low quality or contain extreme motion blur. The figure above (first figure in the post) illustrates example images from the AdienceFaces collection. Below is a breakdown of the dataset into the different age and gender classes.
Gender | 0-2 | 4-6 | 8-13 | 15-20 | 25-32 | 38-43 | 48-53 | 60+ | Total |
Male | 745 | 928 | 934 | 734 | 2308 | 1294 | 392 | 442 | 8192 |
Female | 682 | 1234 | 1360 | 919 | 2589 | 1056 | 433 | 427 | 9411 |
Both | 1427 | 2162 | 2294 | 1653 | 4897 | 2350 | 825 | 869 | 19487 |
We experimented with two methods of classification:
The tables below summarize our results compared to previously proposed methods. We report mean accuracy ± standard deviation. In age classification, 1-off means the prediction was either correct or off by one adjacent age class:
Gender:
Method | Accuracy |
Best from [6] | 77.8 ± 1.3 |
Best from [16] | 79.3 ± 0.0 |
Proposed using single crop | 85.9 ± 1.4 |
Proposed using over-sampling | 86.8 ± 1.4 |
Age:
Method | Exact | 1-off |
Best from [6] | 45.1 ± 2.6 | 79.5 ± 1.4 |
Proposed using single crop | 49.5 ± 4.4 | 84.6 ± 1.7 |
Proposed using over-sampling | 50.7 ± 5.1 | 84.7 ± 2.2 |
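The exact and 1-off measures reported in the age table above can be computed as in the following sketch. Age classes are represented by their index (0 for 0-2, 1 for 4-6, and so on); the labels below are made-up values for illustration only, and `age_accuracy` is a hypothetical helper name.

```python
def age_accuracy(true_classes, predicted_classes):
    """Return (exact accuracy, 1-off accuracy) for age-class indices."""
    n = len(true_classes)
    pairs = list(zip(true_classes, predicted_classes))
    exact = sum(t == p for t, p in pairs) / n
    # 1-off counts a prediction as correct if it hits the true class
    # or one of its two immediately adjacent age classes.
    one_off = sum(abs(t - p) <= 1 for t, p in pairs) / n
    return exact, one_off


true_classes = [0, 3, 4, 7, 2]
predicted_classes = [0, 4, 4, 5, 2]
exact, one_off = age_accuracy(true_classes, predicted_classes)
print(exact, one_off)
```

The 1-off measure is always at least as high as the exact measure, since every exact hit also counts as a 1-off hit.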
Evidently, the proposed network, despite its simplicity, outperforms previous methods by a substantial margin. We further present misclassification results for our method, both for age and gender classification.
Gender misclassifications: Top row: Female subjects mistakenly classified as males. Bottom row: Male subjects mistakenly classified as females:
Age misclassifications: Top row: Older subjects mistakenly classified as younger. Bottom row: Younger subjects mistakenly classified as older.
As can be seen from the misclassification examples, most mistakes are due to blur, low image resolution or occlusions. Furthermore, in gender, most of the misclassifications are in babies or in young children where facial gender attributes are not clearly visible.
A few months ago, there was a bit of hype about Microsoft’s new how-old.net webpage, which allows users to upload their images and then tries to automatically determine their age and gender.
We thought it would be interesting to compare MS’s method with ours and measure its accuracy. To this end, we automatically uploaded all of the AdienceFaces images to the how-old.net page and recorded the results. We only collected the age estimation result, and only in cases where MS’s page managed to detect a face in the image (if the image was too hard for face detection, it would probably fail completely on the much more challenging task of age classification).
MS’s how-old.net site reached an accuracy of about 40%. As listed in the tables above, our network reached 50.7% with over-sampling and 49.5% using single-crop. Below are some examples of images which the MS tool misclassified, but our method classified correctly.
We have presented a novel method for age and gender classification in the wild based on deep convolutional neural networks. Taking into account the relatively small amount of training data, we devised a relatively shallow network and took special care to avoid over-fitting (using data augmentation and dropout layers).
We measured our performance on the AdienceFaces benchmark[6] and showed that the proposed approach outperforms previous methods by a large margin. Moreover, we compared our method against Microsoft’s how-old.net webpage.
For paper, code and more details, please see our project page.
[1] Gil Levi and Tal Hassner, LATCH: Learned Arrangements of Three Patch Codes, IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, March, 2016
[2] Gil Levi and Tal Hassner, Age and Gender Classification using Convolutional Neural Networks, IEEE Workshop on Analysis and Modeling of Faces and Gestures (AMFG), at the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Boston, June 2015.
[3] Sun, Yi, Xiaogang Wang, and Xiaoou Tang. “Deep learning face representation from predicting 10,000 classes.” Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014.
[4] Schroff, Florian, Dmitry Kalenichenko, and James Philbin. “Facenet: A unified embedding for face recognition and clustering.” arXiv preprint arXiv:1503.03832 (2015).
[5] Kwon, Young Ho, and Niels Da Vitoria Lobo. “Age classification from facial images.” Computer Vision and Pattern Recognition, 1994. Proceedings CVPR’94., 1994 IEEE Computer Society Conference on. IEEE, 1994.
[6] Eidinger, Eran, Roee Enbar, and Tal Hassner. “Age and gender estimation of unfiltered faces.” Information Forensics and Security, IEEE Transactions on 9.12 (2014): 2170-2179.
[7] Gao, Feng, and Haizhou Ai. “Face age classification on consumer images with gabor feature and fuzzy lda method.” Advances in biometrics. Springer Berlin Heidelberg, 2009. 132-141.
[8] Liu, Chengjun, and Harry Wechsler. “Gabor feature based classification using the enhanced fisher linear discriminant model for face recognition.” Image processing, IEEE Transactions on 11.4 (2002): 467-476.
[9] Chao, Wei-Lun, Jun-Zuo Liu, and Jian-Jiun Ding. “Facial age estimation based on label-sensitive learning and age-oriented regression.” Pattern Recognition 46.3 (2013): 628-641.
[10] Taigman, Yaniv, et al. “Deepface: Closing the gap to human-level performance in face verification.” Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014.
[11] LeCun, Yann, et al. “Backpropagation applied to handwritten zip code recognition.” Neural computation 1.4 (1989): 541-551.
[12] Russakovsky, Olga, et al. “Imagenet large scale visual recognition challenge.” International Journal of Computer Vision (2014): 1-42.
[13] Yi, Dong, et al. “Learning face representation from scratch.” arXiv preprint arXiv:1411.7923 (2014).
[14] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems. 2012.
[15] Chatfield, Ken, et al. “Return of the devil in the details: Delving deep into convolutional nets.” arXiv preprint arXiv:1405.3531 (2014).
[16] Hassner, Tal, et al. “Effective face frontalization in unconstrained images.” arXiv preprint arXiv:1411.7964 (2014).
Our proposed LATCH descriptor was presented in the following paper:
Gil Levi and Tal Hassner, LATCH: Learned Arrangements of Three Patch Codes, IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, March, 2016
Here is a short video of me presenting LATCH at the WACV 16 conference (I apologize for the technical problems in the video).
Our LATCH descriptor has already been officially integrated into OpenCV 3.0 and has even won the CVPR 2015 OpenCV State of the Art Vision Challenge in the Image Registration category!
Also, see CUDA (GPU) implementation of the LATCH descriptor and a cool visual odometry demo, both by Christopher Parker.
For more information, please see LATCH project page.
The LATCH descriptor was developed along with my master’s thesis supervisor, Dr. Tal Hassner. I would like to take this opportunity to thank him again for his support, insight and endless patience.
I would also like to thank David Beniaguiev for his useful advice regarding this and other research projects.
Histogram based descriptors (SIFT and friends)
One of the key problems in Computer Vision, common to the vast majority of Computer Vision applications (for example, 3D reconstruction, Face Recognition, Dense Correspondence, etc.), is that of representing local image patches. Numerous attempts have been made at obtaining a discriminative local patch representation which is both invariant to various image transformations and efficient to extract and match.
Until recently [1], the common approach for representing local patches was through the use of Histograms of Oriented Gradients. One of the early works in this approach is the Scale Invariant Feature Transform (SIFT) [11], which computes an 8-bin histogram of gradient orientations in each cell of a 4×4 grid. Once histograms are computed, they are normalized and concatenated into a single 128-dimensional descriptor that is commonly compared using the L2 metric.
Since the use of gradients is time-consuming, the Speeded Up Robust Features (SURF) [12] descriptor makes use of integral images in order to speed up feature extraction. It is important to note that although SURF speeds up descriptor extraction, the matching phase, which is O(n^2) in the number of image descriptors, still uses the expensive and time-consuming L2 metric.
Other notable examples in this line of work include the Gradient Location and Orientation Histogram (GLOH) [13] descriptor, which replaces SIFT’s 4×4 rectangular grid by a log-polar grid to improve rotation invariance, as well as PCA-SIFT [14], which builds upon SIFT but uses the PCA dimensionality reduction technique to improve its distinctiveness and to reduce its size.
Binary Descriptors
Though much work has been devoted to obtaining strong representations [11, 12, 13, 14], it was only recently that a major improvement in both running time and memory footprint was obtained with the emergence of binary descriptors [1, 2, 3, 4]. Instead of expensive gradient operations, binary descriptors make use of simple pixel comparisons, which result in binary strings of typically short length (commonly 256 bits). The resulting binary representation holds some very useful properties:
Those characteristics are of particular importance for real-world real-time applications that commonly require fast processing of an increasing amount of data, often on mobile devices with limited storage and computational power.
One of the early works in this line of work was the Binary Robust Independent Elementary Features (BRIEF) descriptor [1]. BRIEF operates by comparing the same set of smoothed pixel pairs for each local patch that it describes. For each pair, if the first smoothed pixel intensity is larger than that of the second, BRIEF writes 1 in the final descriptor string and 0 otherwise. The sampling pairs are chosen randomly, initialized only once and used for each image and local patch.
The Oriented FAST and Rotated BRIEF (ORB) descriptor [2] builds upon BRIEF by adding a rotation invariance mechanism that is based on measuring the patch orientation using first-order moments. ORB also uses unsupervised learning to learn the sampling pairs.
Instead of using arbitrary sampling pairs, the Binary Robust Invariant Scalable Keypoints (BRISK) descriptor [3] uses a hand-crafted sampling pattern that is composed of a set of concentric rings. BRISK uses the long-distance sampling pairs to estimate the patch orientation and the short-distance sampling pairs to construct the descriptor itself through pixel intensity comparisons.
The Fast REtina Keypoint (FREAK) descriptor [4] is similar to BRISK in having a hand-crafted sampling pattern. However, its sampling pattern is motivated by the retina structure, having exponentially more sampling points toward the center. Similar to ORB, FREAK also uses unsupervised learning to learn a set of sampling pairs.
Similar to BRIEF, the Local Difference Binary (LDB) descriptor was proposed in [15] where instead of comparing smoothed intensities, mean intensities in grids of 2 × 2, 3 × 3 or 4 × 4 pixels were compared. Also, in addition to the mean intensity values, LDB also compares the mean values of horizontal and vertical derivatives, amounting to 3 bits per comparison.
Building upon LDB, the Accelerated KAZE (A-KAZE) descriptor[10] uses the A-KAZE detector estimation of orientation for rotating the LDB grid to achieve rotation invariance. In addition, A-KAZE also uses the A-KAZE detector’s estimation of patch scale to sub-sample the grid in steps that are a function of the patch scale.
In a different line of work are recent approaches for extracting binary descriptors which, instead of using pixel comparisons, use linear or non-linear projections followed by threshold operations to achieve binary representations. One of the first methods in this line of work is the LDA-Hash [15] descriptor, which first extracts SIFT descriptors in the image, then projects the descriptors to a more discriminant space and finally thresholds the projected descriptors to obtain a binary representation.
Instead of the intermediate step of extracting floating-point descriptors, the DBRIEF [7] descriptor projects the patches directly, where the projections are computed as a linear combination of a small number of simple filters.
Finally, the BinBoost [8,9] descriptor projects the image patch to a binary vector by applying a set of hash functions learned through boosting. The hash functions are a signed operation on a linear combination of a set of gradient-based filters.
Though some works in that line claim improved performance, their drawback is an increase in descriptor extraction time, which might render them unsuitable for real-time applications.
We propose a novel binary descriptor that belongs to the earlier family of binary descriptors, i.e. binary descriptors that use fast pixel comparisons. However, we propose to use small patch comparisons, rather than pixel comparisons, and to use triplets rather than pairs of sampling points. We also present a novel method for learning an optimal set of sampling triplets. The proposed descriptor is appropriately dubbed LATCH – Learned Arrangements of Three patCH codes.
Again, the full details of our approach including evaluation and comparison with other binary descriptors can be found in the following paper:
Levi, Gil, and Tal Hassner. “LATCH: Learned Arrangements of Three Patch Codes.” arXiv preprint arXiv:1501.03719 (2015). (link to project page)
First, we give a reminder of binary descriptor design (the reader may also refer to my earlier post on introduction to binary descriptors). The goal of binary descriptors is to represent local patches by binary vectors that can be quickly matched using the Hamming distance (the sum of the XOR operation). To this end, a binary descriptor uses a set of sampling pairs S = (s1, s2, . . . , sn), where each si is a pair of two pixel locations defining where to sample in a local patch. For each sampling pair, the descriptor compares the intensity at the location of the first pixel to that of the second pixel: if the first intensity is greater, it assigns 1 in the final descriptor, and 0 otherwise. The comparisons result in a binary string which can be compared very efficiently using the Hamming distance.
Even though some variants of binary descriptors smooth the mini-patch around each sampling point, we note that comparing single pixels might cause the representation to be sensitive to noise and slight image distortions. We propose instead to compare the mini-patches themselves and, in order to increase the spatial support of the binary tests, we use triplets of mini-patches rather than pairs. Moreover, we devise a novel method for learning an optimal set of triplets using supervised learning. This is in contrast to the learning methods of ORB [2] and FREAK [4], which are not supervised.
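The general recipe above (sample pairs, compare intensities, match with the Hamming distance) can be sketched with NumPy as follows. The sampling pairs here are drawn uniformly at random and no smoothing is applied, so this is a generic illustration rather than any specific published descriptor; the function names and sizes are assumptions of this example.

```python
import numpy as np


def make_sampling_pairs(n_pairs, patch_size, seed=0):
    """Draw n_pairs random pairs of (row, col) locations inside the patch."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, patch_size, size=(n_pairs, 2, 2))


def describe(patch, pairs):
    """One bit per pair: 1 if the first pixel is brighter than the second."""
    first = patch[pairs[:, 0, 0], pairs[:, 0, 1]]
    second = patch[pairs[:, 1, 0], pairs[:, 1, 1]]
    return (first > second).astype(np.uint8)


def hamming_distance(d1, d2):
    """Number of differing bits (the sum of the XOR of the two descriptors)."""
    return int(np.count_nonzero(d1 != d2))


# A hypothetical 32x32 patch and a 256-bit descriptor.
rng = np.random.default_rng(1)
patch = rng.integers(0, 256, size=(32, 32))
pairs = make_sampling_pairs(256, 32)
descriptor = describe(patch, pairs)
print(descriptor.shape, hamming_distance(descriptor, descriptor))
```

The same set of pairs must be reused for every patch, otherwise the bits of two descriptors would not be comparable.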
From pairs to triplets
As described above, the LATCH descriptor uses triplets of mini-patches. To create a descriptor of 512 bits, LATCH uses 512 triplets, where each triplet defines the locations of three mini-patches of size K x K (here we assume 7×7). In each triplet, one of the mini-patches is denoted the ‘anchor’ and the other two mini-patches are denoted ‘companions’. We will denote the anchor as ‘a’ and the companions as ‘c1’ and ‘c2’.
For each of the 512 triplets, we test if the companion patch ‘c1’ is more similar to the anchor ‘a’ than the second companion ‘c2’. For measuring similarity between patches, we use the sum-of-squared differences (denoted SSD).
To summarize, the LATCH descriptor uses a predefined set of 512 triplets, defining where to sample in a local image patch. Given a local patch, the LATCH descriptor goes over all 512 triplets. For each triplet, if the SSD between the anchor ‘a’ and the companion ‘c1’ is smaller than the SSD between ‘a’ and the second companion ‘c2’, then we write ‘1’ in the resulting bit. If not, we write ‘0’. The distance between LATCH descriptors is computed as their Hamming distance, as with the other binary descriptors.
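The triplet test can be sketched in a few lines of C++. This is only an illustration of the rule described above, not the released LATCH code: the types `GrayPatch`, `Corner` and `Triplet` and the function names are hypothetical, and I assume here that each mini-patch is addressed by its top-left corner.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical grayscale patch, stored row-major.
struct GrayPatch {
    int width;
    int height;
    std::vector<std::uint8_t> pixels;
    int at(int x, int y) const {
        return pixels[static_cast<std::size_t>(y) * width + x];
    }
};

// Top-left corner of a K x K mini-patch (an assumption of this sketch).
struct Corner { int x; int y; };

// One arrangement: an anchor and its two companions.
struct Triplet { Corner anchor; Corner c1; Corner c2; };

// Sum of squared differences between two K x K mini-patches.
long long ssd(const GrayPatch& patch, Corner p, Corner q, int k) {
    assert(p.x >= 0 && p.y >= 0 && p.x + k <= patch.width && p.y + k <= patch.height);
    assert(q.x >= 0 && q.y >= 0 && q.x + k <= patch.width && q.y + k <= patch.height);
    long long sum = 0;
    for (int dy = 0; dy < k; ++dy) {
        for (int dx = 0; dx < k; ++dx) {
            const int diff = patch.at(p.x + dx, p.y + dy) - patch.at(q.x + dx, q.y + dy);
            sum += static_cast<long long>(diff) * diff;
        }
    }
    return sum;
}

// One bit per triplet: 1 when c1 is closer (in SSD) to the anchor than c2.
std::vector<bool> latchBits(const GrayPatch& patch,
                            const std::vector<Triplet>& triplets, int k) {
    std::vector<bool> bits;
    bits.reserve(triplets.size());
    for (const Triplet& t : triplets) {
        bits.push_back(ssd(patch, t.anchor, t.c1, k) < ssd(patch, t.anchor, t.c2, k));
    }
    return bits;
}
```

With 512 triplets and `k = 7` this would produce the 512-bit string discussed in the text.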
Learning patch triplet arrangements
Assume we would like to represent a local patch of size 48 x 48. Considering all the possible pixel triplets in that region will amount to an enormous number of triplet possibilities. Since binary descriptors typically require no more than 256 bits, a method for selecting an optimal set of triplets is required.
To this end, we leverage the labeled data of the same/not-same patch dataset of [16]. The dataset consists of corresponding patches sampled from 3D reconstructions of three popular tourist sites. For each of the three sets, predefined sets of 250K pairs of same and not-same patches were gathered. A ‘same’ patch pair is simply two patches that are visually different (by zoom, illumination or viewpoint), but both correspond to the same real-world 3D point in space. A ‘not-same’ patch pair is two patches that correspond to entirely different 3D locations.
Below are examples of same and not-same pair patches – the left figure presents examples of ‘same’ patch pairs and the right figure presents examples of ‘not-same’ patch pairs. Note that the images in each ‘same’ patch pair indeed belong to the same ‘real-world’ patch, but are different in illumination, blur, view-point, etc.
We randomly draw a set of 56,000 candidate arrangements. Each arrangement defines where to sample the anchor mini-patch and its two companion mini-patches. We then go over all 56K candidate arrangements and apply each one to the set of 500K same and not-same patch pairs. The quality score of an arrangement is then defined as the number of times it yielded the same binary bit for the two patches of a ‘same’ pair plus the number of times it yielded different binary bits for the two patches of a ‘not-same’ pair.
We would want to choose the arrangements with the highest quality scores, but that could lead to choosing highly correlated arrangements. Two arrangements can both have a very high quality score, but if they are highly correlated, the second arrangement will contribute very little additional information on the patch once the first has been computed. To avoid choosing highly correlated arrangements, we follow the method used by ORB [2] and FREAK [4] and add arrangements incrementally, skipping a new arrangement if its correlation with at least one of the previously added arrangements is larger than a predefined threshold.
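As a rough illustration of the scoring, suppose we have already evaluated one candidate arrangement on both patches of every labeled pair. The struct `PairOutcome` and the function `qualityScore` below are hypothetical names for this sketch, which simply counts the pairs on which the arrangement agrees with the label.

```cpp
#include <cstddef>
#include <vector>

// Result of applying one candidate arrangement to one labeled patch pair.
struct PairOutcome {
    bool bitFirst;   // bit produced on the first patch of the pair
    bool bitSecond;  // bit produced on the second patch of the pair
    bool isSame;     // ground-truth label: true for a 'same' pair
};

// Counts 'same' pairs that got equal bits plus 'not-same' pairs that got
// different bits, following the definition in the text.
std::size_t qualityScore(const std::vector<PairOutcome>& outcomes) {
    std::size_t score = 0;
    for (const PairOutcome& o : outcomes) {
        const bool bitsAgree = (o.bitFirst == o.bitSecond);
        if (bitsAgree == o.isSame) {
            ++score;
        }
    }
    return score;
}
```

Running this for each of the candidate arrangements gives one score per candidate, which is the input to the selection step.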
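The incremental selection can be sketched as a greedy loop. Everything below is an illustration under my own assumptions: the `Candidate` struct is hypothetical, the correlation is taken as the absolute Pearson correlation between the 0/1 responses of two arrangements over the training patches, and candidates whose responses are constant are treated as uncorrelated with everything. The exact correlation measure is not specified in this post.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical candidate: its quality score and the bit (0 or 1) it
// produced on each training patch.
struct Candidate {
    std::size_t qualityScore;
    std::vector<int> responses;
};

// Absolute Pearson correlation between two 0/1 response vectors.
double absCorrelation(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size() && !a.empty());
    const double n = static_cast<double>(a.size());
    double sumA = 0.0, sumB = 0.0, sumAB = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sumA += a[i];
        sumB += b[i];
        sumAB += a[i] * b[i];
    }
    const double meanA = sumA / n;
    const double meanB = sumB / n;
    const double varA = meanA * (1.0 - meanA);  // variance of a 0/1 variable
    const double varB = meanB * (1.0 - meanB);
    if (varA <= 0.0 || varB <= 0.0) {
        return 0.0;  // constant responses: correlation undefined, treated as 0
    }
    return std::fabs(sumAB / n - meanA * meanB) / std::sqrt(varA * varB);
}

// Visit candidates from best to worst score; keep one only if it is not too
// correlated with any candidate already kept. Returns indices into candidates.
std::vector<std::size_t> selectArrangements(const std::vector<Candidate>& candidates,
                                            std::size_t wanted,
                                            double maxCorrelation) {
    std::vector<std::size_t> order(candidates.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&candidates](std::size_t l, std::size_t r) {
                         return candidates[l].qualityScore > candidates[r].qualityScore;
                     });
    std::vector<std::size_t> selected;
    for (std::size_t idx : order) {
        if (selected.size() == wanted) break;
        bool tooCorrelated = false;
        for (std::size_t kept : selected) {
            if (absCorrelation(candidates[idx].responses,
                               candidates[kept].responses) > maxCorrelation) {
                tooCorrelated = true;
                break;
            }
        }
        if (!tooCorrelated) selected.push_back(idx);
    }
    return selected;
}
```

The loop stops as soon as the requested number of arrangements has been collected, so lower-scoring candidates are only examined when better ones were rejected.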
We have tested our proposed LATCH descriptor on two standard publicly available benchmarks and compared it to the following binary image descriptors: BRIEF [1], ORB [2], BRISK [3], FREAK [4], A-KAZE [10], LDA-HASH/DIF [6], DBRIEF [7] and BinBoost [8,9]. To also compare against floating point descriptors we have included SIFT [11] and SURF [12] in our evaluation.
If not mentioned otherwise, we extracted LATCH from 48×48 patches using a 32-byte (256 bits) representation with mini-patches of 7×7 pixels. For all the descriptors we used the efficient C++ code available from OpenCV or from the various authors with parameters left unchanged.
Running times
We start with a run time analysis of the various descriptors included in our evaluation. We measured the time (in milliseconds) required to extract a single descriptor (averaged over a large set of patches).
The results are listed in the table below. Evidently, there is a large difference in running times between the ‘pure’ comparison-based descriptors and the projection based binary descriptors.
Descriptor | Running time (ms) |
SIFT [11] | 3.29 |
SURF [12] | 2.11 |
LDA-HASH [6] | 5.03 |
LDA-DIF [6] | 4.74 |
DBRIEF [7] | 8.75 |
BinBoost [8,9] | 3.29 |
BRIEF [1] | 0.234 |
ORB [2] | 0.486 |
BRISK [3] | 0.059 |
FREAK [4] | 0.072 |
A-Kaze [10] | 0.069 |
LATCH [5] | 0.616 |
The Mikolajczyk benchmark
The Mikolajczyk benchmark was introduced in [13] and has since become a standard benchmark for evaluating local image descriptors (I also used it in my post on adding rotation invariance to the BRIEF descriptor).
The Mikolajczyk benchmark is composed of 8 image sets (Bark, Bikes, Boat, Graffiti, Leuven, Trees, UBC and Wall), each containing 6 images that depict an increasing degree of a certain image transformation.
The figure below illustrates example images from each set. In the upper row is the first image from each set, in the bottom row is another image from the set. Note the increasing degree of transformation in each set.
The protocol for the benchmark is the following: in each set, we detect keypoints and extract descriptors from each of the images, then compare the first image to each of the remaining five images and check for correspondences. The benchmark includes known ground truth transformations (homographies) between the images, so we can compute the percent of the correct matches and display the performance of each descriptor using recall vs. 1-precision curves.
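For readers unfamiliar with the two axes of these curves, here is a small sketch of how a single point is computed, using the definitions commonly attributed to [13]: recall is the number of correct matches divided by the number of ground-truth correspondences, and 1-precision is the number of false matches divided by the total number of matches. The struct and function names are hypothetical.

```cpp
#include <cstddef>

// Counts gathered for one image pair at one matching threshold.
struct MatchCounts {
    std::size_t correctMatches;              // matches that agree with the homography
    std::size_t falseMatches;                // matches that do not
    std::size_t groundTruthCorrespondences;  // correspondences that exist at all
};

// Fraction of the existing correspondences that were actually found.
double recall(const MatchCounts& c) {
    if (c.groundTruthCorrespondences == 0) return 0.0;
    return static_cast<double>(c.correctMatches) / c.groundTruthCorrespondences;
}

// Fraction of the reported matches that are wrong.
double oneMinusPrecision(const MatchCounts& c) {
    const std::size_t total = c.correctMatches + c.falseMatches;
    if (total == 0) return 0.0;
    return static_cast<double>(c.falseMatches) / total;
}
```

Sweeping the matching threshold and plotting the resulting points gives one curve per descriptor; the area under that curve is what the tables summarize.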
Below is a table summarizing the area under the recall vs. 1-precision curve for each of the sets, averaged over the five image pairs – higher values mean the descriptor performs better. Recall vs. 1-precision graphs for the sets ‘Bikes’ and ‘Leuven’ are below.
Descriptor | Bark | Bikes | Boat | Graffiti | Leuven |
SIFT [11] | 0.077 | 0.322 | 0.080 | 0.127 | 0.130 |
SURF [12] | 0.071 | 0.413 | 0.088 | 0.133 | 0.300 |
LDA-HASH [6] | 0.199 | 0.466 | 0.269 | 0.155 | 0.303 |
LDA-DIF [6] | 0.197 | 0.472 | 0.278 | 0.170 | 0.435 |
DBRIEF [7] | 0.000 | 0.025 | 0.001 | 0.008 | 0.010 |
BinBoost [8,9] | 0.055 | 0.344 | 0.083 | 0.132 | 0.338 |
BRIEF [1] | 0.055 | 0.353 | 0.050 | 0.102 | 0.227 |
ORB [2] | 0.032 | 0.208 | 0.048 | 0.062 | 0.118 |
BRISK [3] | 0.015 | 0.138 | 0.026 | 0.071 | 0.161 |
FREAK [4] | 0.019 | 0.145 | 0.034 | 0.101 | 0.194 |
A-Kaze [10] | 0.022 | 0.326 | 0.005 | 0.048 | 0.138 |
LATCH [5] | 0.065 | 0.415 | 0.057 | 0.119 | 0.374 |
Descriptor | Trees | UBC | Wall | Average |
SIFT [11] | 0.047 | 0.130 | 0.138 | 0.131 |
SURF [12] | 0.046 | 0.268 | 0.121 | 0.180 |
LDA-HASH [6] | 0.110 | 0.393 | 0.268 | 0.270 |
LDA-DIF [6] | 0.101 | 0.396 | 0.260 | 0.289 |
DBRIEF [7] | 0.001 | 0.031 | 0.002 | 0.010 |
BinBoost [8,9] | 0.037 | 0.217 | 0.119 | 0.166 |
BRIEF [1] | 0.060 | 0.178 | 0.141 | 0.146 |
ORB [2] | 0.027 | 0.121 | 0.050 | 0.083 |
BRISK [3] | 0.018 | 0.131 | 0.038 | 0.075 |
FREAK [4] | 0.026 | 0.147 | 0.041 | 0.089 |
A-Kaze [10] | 0.027 | 0.144 | 0.048 | 0.095 |
LATCH [5] | 0.082 | 0.215 | 0.175 | 0.188 |
Note that, on average, LATCH performs better than most of the alternatives, including the floating point descriptors SIFT and SURF. LATCH does perform worse than LDA-HASH/DIF, but their run time is an order of magnitude slower.
Learning Local Image Descriptors benchmark
Next, we describe our experiments using the data-set from [16], which we also used in learning triplet arrangements. The protocol for this data-set is the following: given a pair of same/not-same patches, extract a single descriptor from each patch and then compute the distance between the two descriptors. By thresholding this distance we can determine if the two patches are considered ‘same’ (distance smaller than the threshold) or ‘not-same’ (distance larger than the threshold). Using the same/not-same labels, we can draw precision vs. recall graphs.
In our experiments, we used the Yosemite set for learning optimal arrangements as well as learning optimal thresholds. Testing was performed on the sets Liberty and Notre-Dame.
The table below presents the results, where we measure accuracy (ACC), area under the ROC curve (AUC) and 95% error-rate (the percent of incorrect matches obtained when 95% of the true matches are found – 95% Err.).
Descriptor | Notre-Dame AUC | Notre-Dame ACC | Notre-Dame 95% Err. | Liberty AUC | Liberty ACC | Liberty 95% Err. |
SIFT [11] | 0.934 | 0.817 | 39.7 | 0.928 | 0.764 | 40.1 |
SURF [12] | 0.935 | 0.866 | 41.1 | 0.911 | 0.833 | 55.0 |
LDA-HASH [6] | 0.916 | 0.830 | 46.7 | 0.910 | 0.798 | 48.1 |
LDA-DIF [6] | 0.934 | 0.857 | 38.5 | 0.921 | 0.836 | 43.1 |
DBRIEF [7] | 0.900 | 0.830 | 55.1 | 0.868 | 0.794 | 61.5 |
BinBoost [8,9] | 0.963 | 0.907 | 21.6 | 0.949 | 0.884 | 29.3 |
BRIEF [1] | 0.889 | 0.823 | 63.2 | 0.868 | 0.798 | 66.7 |
ORB [2] | 0.894 | 0.835 | 66.2 | 0.882 | 0.822 | 69.2 |
BRISK [3] | 0.915 | 0.857 | 57.7 | 0.897 | 0.834 | 62.6 |
FREAK [4] | 0.899 | 0.835 | 61.5 | 0.887 | 0.824 | 65.0 |
A-Kaze [10] | 0.885 | 0.806 | 56.7 | 0.860 | 0.782 | 63.4 |
LATCH [5] | 0.919 | 0.855 | 52.0 | 0.906 | 0.838 | 56.7 |
The results above show the clear advantage of LATCH over the other binary alternatives. Although LATCH performance is worse than BinBoost and LDA-HASH/DIF, as noted above, their improved performance does come at the price of slower running times.
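As a side note on the 95% error-rate column, here is one way such a number could be computed from raw descriptor distances. This is my reading of the definition given above (the share of ‘not-same’ pairs that are accepted at the threshold where 95% of the ‘same’ pairs are accepted), and the function name is hypothetical; the evaluation code behind the table may differ in details such as tie handling.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Percentage of 'not-same' pairs whose distance falls at or below the
// smallest threshold that accepts at least 95% of the 'same' pairs.
double errorRateAt95Recall(std::vector<int> sameDistances,
                           const std::vector<int>& notSameDistances) {
    assert(!sameDistances.empty() && !notSameDistances.empty());
    std::sort(sameDistances.begin(), sameDistances.end());
    // Number of 'same' pairs that must be accepted: ceil(0.95 * n),
    // computed with integers to avoid floating point rounding.
    const std::size_t needed = (95 * sameDistances.size() + 99) / 100;
    const int threshold = sameDistances[needed - 1];
    std::size_t falseAccepts = 0;
    for (int d : notSameDistances) {
        if (d <= threshold) ++falseAccepts;
    }
    return 100.0 * static_cast<double>(falseAccepts) / notSameDistances.size();
}
```

Lower values are better: a descriptor with a low 95% error rate separates the two kinds of pairs well even when almost all true matches must be kept.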
3D reconstruction
One of the common uses of local image descriptors is in structure-from-motion (SfM) applications, where multiple images of a scene are taken from many viewpoints, local descriptors are extracted from each image, and a 3D model is generated by triangulating corresponding image locations. This task requires descriptors to be both discriminative (so that enough correct matches can be made) and fast to match (so that building the model won’t be time consuming). We have chosen to apply LATCH to this application to demonstrate its advantage in both of these aspects. To this end, we have incorporated LATCH into the OpenMVG library [17] using its incremental structure-from-motion chain. We have compared the 3D models obtained using SIFT with the 3D models obtained using LATCH in terms of running times and visual quality.
Below are reconstruction results obtained using standard photogrammetry image sets:
And here are the measured running times for the reconstruction of each model in seconds (descriptor matching time only):
Sequence | SIFT | LATCH |
Sceaux Castle | 381.63 | 39.05 |
Bouteville Castle | 4766.22 | 488.70 |
Mirebeau Church | 3166.35 | 325.31 |
Saint Jacques | 1651.12 | 169.19 |
Notice that although the visual quality of the 3D models obtained with LATCH and SIFT is similar, matching with LATCH is roughly an order of magnitude faster (about 9.7 to 9.8 times faster in each of the four sequences above).
By comparing learned triplets of mini-patches rather than pairs of pixels, our LATCH descriptor achieves improved accuracy at running times similar to those of other binary alternatives.
This was demonstrated both quantitatively on two standard image benchmarks and qualitatively through the real-world application of 3D reconstruction from image sets.
For more details, code and paper, please see the LATCH project page.
[1] Calonder, Michael, et al. “Brief: Binary robust independent elementary features.” Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010. 778-792.
[2] Rublee, Ethan, et al. “ORB: an efficient alternative to SIFT or SURF.” Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
[3] Leutenegger, Stefan, Margarita Chli, and Roland Y. Siegwart. “BRISK: Binary robust invariant scalable keypoints.” Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
[4] Alahi, Alexandre, Raphael Ortiz, and Pierre Vandergheynst. “Freak: Fast retina keypoint.” Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.
[5] Gil Levi and Tal Hassner, LATCH: Learned Arrangements of Three Patch Codes, IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, March, 2016
[6] Strecha, Christoph, et al. “LDAHash: Improved matching with smaller descriptors.” Pattern Analysis and Machine Intelligence, IEEE Transactions on 34.1 (2012): 66-78.
[7] Trzcinski, Tomasz, and Vincent Lepetit. “Efficient discriminative projections for compact binary descriptors.” Computer Vision–ECCV 2012. Springer Berlin Heidelberg, 2012. 228-242.
[8] Trzcinski, Tomasz, et al. “Boosting binary keypoint descriptors.” Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013.
[9] Trzcinski, Tomasz, Mario Christoudias, and Vincent Lepetit. “Learning Image Descriptors with Boosting.” (2014).
[10] P. F. Alcantarilla, J. Nuevo, and A. Bartoli. Fast explicit diffusion for accelerated features in nonlinear scale spaces. In British Machine Vision Conf. (BMVC), 2013
[11] Lowe, David G. “Distinctive image features from scale-invariant keypoints.” International journal of computer vision 60.2 (2004): 91-110.
[12] Bay, Herbert, Tinne Tuytelaars, and Luc Van Gool. “Surf: Speeded up robust features.” Computer vision–ECCV 2006. Springer Berlin Heidelberg, 2006. 404-417.
[13] Mikolajczyk, Krystian, and Cordelia Schmid. “A performance evaluation of local descriptors.” Pattern Analysis and Machine Intelligence, IEEE Transactions on 27.10 (2005): 1615-1630.
[14] Ke, Yan, and Rahul Sukthankar. “PCA-SIFT: A more distinctive representation for local image descriptors.” Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on. Vol. 2. IEEE, 2004.
[15] Yang, Xin, and Kwang-Ting Cheng. “LDB: An ultra-fast feature for scalable augmented reality on mobile devices.” Mixed and Augmented Reality (ISMAR), 2012 IEEE International Symposium on. IEEE, 2012.
[16] Brown, Matthew, Gang Hua, and Simon Winder. “Discriminative learning of local image descriptors.” Pattern Analysis and Machine Intelligence, IEEE Transactions on 33.1 (2011): 43-57.
[17] Moulon, Pierre, Pascal Monasse, and Renaud Marlet. “Adaptive structure from motion with a contrario model estimation.” Computer Vision–ACCV 2012. Springer Berlin Heidelberg, 2013. 257-270.
Just as a reminder, we had a general post on local image descriptors, an introductory post to binary descriptors and a post presenting the BRIEF descriptor. We also had posts on other binary descriptors: ORB[2], BRISK[3] and FREAK[4].
We’ll start with a visual example, displaying the correct matches between a pair of images of the same scene, taken from different angles – once with the original version of BRIEF (first image pair) and once with the proposed rotation invariant version of BRIEF (second image pair):
It can be seen that there are many more correct matches when using the proposed rotation invariant version of the BRIEF descriptor.
The BRIEF descriptor is one of the simplest of the binary descriptors and also the first published. BRIEF operates by comparing the same set of smoothed pixel pairs for each local patch that it describes. For each pair, if the first smoothed pixel’s intensity is larger than that of the second, BRIEF writes 1 in the final descriptor’s string, and 0 otherwise. The sampling pairs are chosen randomly, initialized only once and used for each image and local patch. As usual, the distance between two binary descriptors is computed as the number of different bits, and can be formally written as sum(XOR(descriptor1, descriptor2)).
Our method for adding rotation invariance is straightforward and uses the detector coupled with the descriptor. Many keypoint detectors can estimate the patch’s orientation (e.g. SIFT[5] and SURF[6]) and we can make use of that estimate to properly align the sampling pairs. For each patch, given the angle of the patch, we can rotate the sampling pairs according to the patch’s orientation and thus extract rotation invariant descriptors. The same principle is applied in the original implementation of the other rotation invariant binary descriptors (ORB, BRISK and FREAK), but as opposed to them we just take the orientation of the patch from the keypoint detector instead of devising some orientation measurement mechanism.
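The alignment step boils down to a 2D rotation of every sampling location around the patch center. The sketch below is not the code from my OpenCV contribution; `Offset`, `SamplingPair` and `rotatePairs` are hypothetical names, the angle is assumed to be given in degrees (as in OpenCV’s `KeyPoint::angle`), and the sign convention of the rotation is an assumption that has to match the detector being used.

```cpp
#include <cmath>
#include <vector>

// Sampling location expressed relative to the patch center.
struct Offset { double x; double y; };

// One BRIEF-style test: compare the pixel at 'first' with the pixel at 'second'.
struct SamplingPair { Offset first; Offset second; };

// Rotate a single offset around the origin by the given angle (degrees).
Offset rotate(Offset p, double angleDegrees) {
    const double kPi = 3.14159265358979323846;
    const double t = angleDegrees * kPi / 180.0;
    return Offset{p.x * std::cos(t) - p.y * std::sin(t),
                  p.x * std::sin(t) + p.y * std::cos(t)};
}

// Rotate the whole (fixed) sampling pattern to the keypoint orientation.
std::vector<SamplingPair> rotatePairs(const std::vector<SamplingPair>& pairs,
                                      double angleDegrees) {
    std::vector<SamplingPair> rotated;
    rotated.reserve(pairs.size());
    for (const SamplingPair& p : pairs) {
        rotated.push_back(SamplingPair{rotate(p.first, angleDegrees),
                                       rotate(p.second, angleDegrees)});
    }
    return rotated;
}
```

In practice the rotated offsets are rounded to pixel positions (or the rotated patterns are precomputed for a set of quantized angles) before the intensity comparisons are made.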
Now for the fun part – comparing rotation invariant BRIEF with BRIEF’s original version. I’ll also compare to SIFT to see how binary descriptors compete with some of the floating point descriptors.
For the evaluation, I’ll use the Mikolajczyk benchmark [8] which is a publicly available and standard benchmark for evaluating local descriptors. The benchmark consists of 8 image sets, each containing 6 images that depict an increasing degree of a certain image transformation. Each set depicts a different transformation:
Below are the images of each set in the benchmark. In each set, the images are ordered from left to right and top to bottom (the first row contains images 1-3, the second row contains images 4-6).
Bark (zoom and rotation changes):
Bikes (blur): you can notice that image 6 is far more blurred than image 1.
Boat (zoom and rotation changes):
Graffiti (view point changes):
Leuven (illumination changes):
Trees (blur):
UBC (JPEG compression):
Wall (viewpoint changes):
The protocol for the benchmark is the following: in each set, we detect keypoints and extract descriptors from each of the images, compare the first image to each of the remaining five images and check for correspondences. The benchmark includes known ground truth transformations (homographies) between the images, thus we can compute the percent of the correct matches and display the performance of each descriptor using recall vs. 1-precision curves.
I used the public OpenCV implementation[9] for my experiments. SIFT is used as a keypoint detector and I used the 512 bits version of BRIEF and rotation invariant BRIEF.
Below are tables summarizing the area under the recall vs. 1-precision curve for each of the sets, averaged over the five image pairs – higher values mean the descriptor performs better. For clarity, I also specified the type of image transformation introduced by each set.
Descriptor | Bark (zoom + rotation) | Bikes (blur) | Boat (zoom + rotation) | Graffiti (view point changes) |
BRIEF | 0.007 | 0.677 | 0.048 | 0.097 |
Rotation Invariant BRIEF | 0.055 | 0.353 | 0.05 | 0.103 |
SIFT | 0.077 | 0.322 | 0.08 | 0.128 |
Descriptor | Leuven (illumination) | Trees (blur) | UBC (JPEG compression) | Wall (view point changes) |
BRIEF | 0.457 | 0.258 | 0.421 | 0.285 |
Rotation Invariant BRIEF | 0.228 | 0.061 | 0.178 | 0.146 |
SIFT | 0.131 | 0.048 | 0.13 | 0.132 |
Notice that for sets that depict orientation changes (Bark and Boat), the rotation invariant version of BRIEF performs much better than the original (not invariant) version. However, in sets that depict photometric changes (blur, illumination and JPEG compression) and do not depict orientation changes, the original version of BRIEF performs better than the rotation invariant one. It seems that when orientation changes are not present, trying to compensate for them introduces noise and reduces performance. Notice also that since the set Graffiti introduces some orientation changes (as can be seen from the images above), the rotation invariant version of BRIEF has an advantage over the original version of BRIEF. One can also see that although the “Wall” set exhibits view point changes, the images in the set have very much the same orientation, thus the rotation invariant version of BRIEF performs worse than the original one. On a side note, it is also very interesting to see that in some of the sets, BRIEF and Rotation Invariant BRIEF even outperform the SIFT descriptor (keep in mind that BRIEF is a lot faster to extract and match and also takes much less storage space).
To further illustrate the difference in performance between the original and the rotation invariant version of BRIEF, below are recall vs. 1-precision curves for the sets Bikes, Graffiti and Boat, respectively.
Notice again that BRIEF outperforms its rotation invariant version in the Bikes set, which depicts photometric changes (specifically, blur), while the rotation invariant version of BRIEF outperforms the original version in the sets Graffiti and Boat, which depict rotation changes.
I’m in the process of contributing an implementation of the rotation invariant version of BRIEF to OpenCV. I’ve forked the GitHub OpenCV3.0 repository and implemented the changes under my forked repository.
The code has been further cleaned and is now available under the following pull request: https://github.com/Itseez/opencv_contrib/pull/207
I have presented a rotation invariant version of BRIEF that makes use of the detector’s estimation of the keypoint orientation in order to align the sampling points of the BRIEF descriptor, thus making it rotation invariant. I’ve demonstrated the advantage of the rotation invariant version of BRIEF in scenarios where orientation changes are present and also its disadvantage in dealing with photometric changes (blur, illumination and JPEG compression). Finally, I’ve published a C++ implementation of the proposed descriptor, integrating it into OpenCV3.
[1] Calonder, Michael, et al. “Brief: Binary robust independent elementary features.” Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010. 778-792.
[2] Rublee, Ethan, et al. “ORB: an efficient alternative to SIFT or SURF.” Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
[3] Leutenegger, Stefan, Margarita Chli, and Roland Yves Siegwart. “BRISK: Binary robust invariant scalable keypoints.” Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
[4] Alahi, Alexandre, Raphael Ortiz, and Pierre Vandergheynst. “Freak: Fast retina keypoint.” Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.
[5] Lowe, David G. “Distinctive image features from scale-invariant keypoints.” International journal of computer vision 60.2 (2004): 91-110.
[6] Bay, Herbert, Tinne Tuytelaars, and Luc Van Gool. “Surf: Speeded up robust features.” Computer Vision–ECCV 2006. Springer Berlin Heidelberg, 2006. 404-417.
[7] Rosten, Edward, and Tom Drummond. “Machine learning for high-speed corner detection.” Computer Vision–ECCV 2006. Springer Berlin Heidelberg, 2006. 430-443.
[8] Mikolajczyk, Krystian, and Cordelia Schmid. “A performance evaluation of local descriptors.” Pattern Analysis and Machine Intelligence, IEEE Transactions on 27.10 (2005): 1615-1630.
[9] http://docs.opencv.org/trunk/modules/features2d/doc/common_interfaces_of_descriptor_extractors.html
I’ve searched for tutorials explaining how to install and configure OpenCV 2.4.9 with Cmake, using Visual Studio 2013, but I haven’t found any good ones. As a result, I’ve decided to create my own tutorial, where I explain how to build the OpenCV solution using Cmake and how to create applications in Visual Studio 2013 that use OpenCV. Note that my laptop is running Windows 8.1.
Here is the tutorial:
The tutorial summarizes the following steps:
As promised, I’ve uploaded the sample code that I used as an example application:
#include "stdafx.h"
#include "opencv2/highgui/highgui.hpp"
#include <iostream>

using namespace cv;
using namespace std;

int main(int argc, char* argv[])
{
    VideoCapture cap(0); // open the video camera no. 0

    if (!cap.isOpened()) // if not success, exit program
    {
        cout << "Cannot open the video file" << endl;
        return -1;
    }

    double dWidth = cap.get(CV_CAP_PROP_FRAME_WIDTH); // get the width of frames of the video
    double dHeight = cap.get(CV_CAP_PROP_FRAME_HEIGHT); // get the height of frames of the video

    cout << "Frame size : " << dWidth << " x " << dHeight << endl;

    namedWindow("MyVideo", CV_WINDOW_AUTOSIZE); // create a window called "MyVideo"

    Mat frame;

    while (1)
    {
        bool bSuccess = cap.read(frame); // read a new frame from video

        if (!bSuccess) // if not success, break loop
        {
            cout << "Cannot read a frame from video file" << endl;
            break;
        }

        imshow("MyVideo", frame); // show the frame in "MyVideo" window

        if (waitKey(30) == 27) // wait for 'esc' key press for 30ms. If 'esc' key is pressed, break loop
        {
            cout << "esc key is pressed by user" << endl;
            break;
        }
    }
    return 0;
}
If you have any questions, please comment on this post and I’ll be glad to help.
Gil.
Here you can see an original image vs. a screenshot of the 3D model:
The package that we’ll use is named OSM-bundler. The package implemented 3 algorithms:
Those algorithms were in fact used in two very impressive projects that ran 3D reconstruction on a very large scale. The first, “Building Rome in a Day”, aimed at (sparse) 3D reconstruction of tourist sites from images found on the web. The following video demonstrates the 3D reconstruction of the old city of Dubrovnik:
The reconstruction was done from about 4600 images. The second work, “Towards Internet-scale Multi-View Stereo”, presented the CMVS algorithm and demonstrated how multi-view stereo can be done on a large scale. The following video demonstrates various tourist sites that were reconstructed using images from the web:
In the next section I will explain how to run those packages to produce 3D models.
First, download OSM-Bundler from here. Extract it to whichever location you’d like. In my installation, I extracted it directly to the C drive, so I now have the following directory from which I’ll run Bundler and PMVS: “C:\osm-bundler-pmvs2-cmvs-full-32-64\osm-bundler”.
Next, download and install Python 2.7.6. You should also download and install the Python Imaging Library (download the version for Python 2.7). To view the 3D models, you can use Meshlab, which you can download here.
Congratulations, you’re now done with the installation.
The procedure for building a 3D model is the following:
>>cd C:\<Installation Path>\osm-bundler\osm-bundlerWin32/64.
>>python RunBundler.py --photos="<full path where your images are located>"
>>python RunPMVS.py --bundlerOutputPath="<BundlerOutputPath>"
If the geometry of the model seems broken, the most common reason is that Bundler’s camera database doesn’t contain the specific camera you used. To solve this, first right click on one of the images and click on properties. Search for the “Camera model” field and write down its contents (the camera model). For example, for images taken with the Galaxy S4 camera, the “Camera model” is GT-I9505.
The next steps are a bit tricky. You’ll need to find out the CCD width of your camera. To do that, search for the camera’s sensor size (for example, Google “<camera name> sensor size”). For example, if you google “Galaxy S4 sensor size” and search a bit, you’ll find out that the sensor size is 1/2.33", which according to this page is 6.16 x 4.62 mm. The CCD width is the first value of the two: 6.16.
Now, we need to insert the camera model and the CCD width we found into Bundler’s camera database. To do that, run the following command:
>> python RunBundler.py --photos=<full path where your images are located> --checkCameraDatabase
Now, Bundler will prompt you for the camera’s CCD width. Enter it and press ok.
If the geometry still seems broken, first try running Bundler and PMVS on the example sets provided in the package. Try taking more images of the object, and remember that Bundler and PMVS do not work so well for textureless objects, such as smooth surfaces. If you continue to encounter difficulties, mail me at Gil.levi100@gmail.com and I’ll try to help.
[1] Agarwal, Sameer, et al. “Building rome in a day.” Communications of the ACM 54.10 (2011): 105-112.
[2] Furukawa, Yasutaka, et al. “Towards internet-scale multi-view stereo.” Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.
This is our fifth post in the series about binary descriptors and here we will talk about the FREAK[4] descriptor. This is the last descriptor that we’ll talk about, as the next and final post in the series will give a performance evaluation of the different binary descriptors. Just a reminder – we had an introduction to binary descriptors and posts about BRIEF[5], ORB[3] and BRISK[2].
As you may recall from the previous posts, a binary descriptor is composed out of three parts:
Recall that to build the binary string representing a region around a keypoint we need to go over all the pairs and for each pair (p1, p2) – if the intensity at point p1 is greater than the intensity at point p2, we write 1 in the binary string and 0 otherwise.
In a nutshell, FREAK is similar to BRISK in having a handcrafted sampling pattern and also similar to ORB in using machine learning techniques to learn the optimal set of sampling pairs. FREAK also has an orientation mechanism that is similar to that of BRISK.
Many sampling patterns are possible for comparing pixel intensities. As we’ve seen in the previous posts, BRIEF uses random pairs, ORB uses learned pairs and BRISK uses a circular pattern where points are equally spaced on concentric circles, similar to DAISY[1].
FREAK suggests using a retinal sampling grid, which is also circular but has a higher density of points near the center. The density of points drops exponentially as can be seen in the following figure:
Each sampling point is smoothed with a Gaussian kernel where the radius of the circle illustrates the size of the standard deviation of the kernel.
As can be seen in the following figure, the suggested sampling grid corresponds with the distribution of receptive fields over the retina:
With a few dozen sampling points, thousands of sampling pairs can be considered. However, many of the pairs might not be useful for efficiently describing a patch. A possible strategy can be to follow BRISK’s approach[2] and select pairs according to their spatial distance. However, the selected pairs can be highly correlated and not discriminant. Consequently, FREAK follows ORB’s approach[3] and tries to learn the pairs by maximizing the variance of the pairs and taking pairs that are not correlated. ORB’s approach was explained at length in our post about ORB, so we won’t explain it again here.
Interestingly, there is a structure in the resulting pairs – a coarse-to-fine approach which matches our understanding of the model of the human retina. The first pairs that are selected mainly compare sampling points in the outer rings of the pattern, whereas the last pairs mainly compare points in the inner rings of the pattern. This is similar to the way the human eye operates, as it first uses the perifoveal receptive fields to estimate the location of an object of interest. Then, the validation is performed with the more densely distributed receptive fields in the fovea area.
The sampling pairs are illustrated in the following figure, where each figure contains 128 pairs (from left to right, top to bottom):
FREAKS takes advantage of this coarse-to-fine structure to further speed up the matching using a cascade approach: when matching two descriptors, we first compare only the first 128 bits. If the distance is smaller than a threshold, we further continue the comparison to the next 128 bits. As a result, a cascade of comparisons is performed accelerating even further the matching as more than 90% of the candidates are discarded with the first 128 bits of the descriptor. The following figure illustrates the cascade approach:
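In code, the cascade could look roughly like the sketch below. It assumes a 512-bit descriptor stored as eight 64-bit words with the coarse pairs first, and it applies the same threshold to each 128-bit stage separately; both of these are my assumptions for the illustration (the post only says that the comparison continues when the distance is below a threshold), and `Descriptor` and `cascadeMatch` are hypothetical names.

```cpp
#include <array>
#include <bitset>
#include <cstddef>
#include <cstdint>

// 512-bit descriptor as eight 64-bit words, coarse (outer ring) pairs first.
using Descriptor = std::array<std::uint64_t, 8>;

// Compares two descriptors 128 bits at a time. Returns false as soon as one
// stage reaches the threshold; otherwise returns true and stores the full
// Hamming distance in distanceOut.
bool cascadeMatch(const Descriptor& a, const Descriptor& b,
                  std::size_t stageThreshold, std::size_t& distanceOut) {
    const std::size_t wordsPerStage = 2;  // 2 x 64 bits = 128 bits
    std::size_t distance = 0;
    for (std::size_t start = 0; start < a.size(); start += wordsPerStage) {
        std::size_t stageDistance = 0;
        for (std::size_t w = start; w < start + wordsPerStage; ++w) {
            // XOR marks the differing bits, count() sums them.
            stageDistance += std::bitset<64>(a[w] ^ b[w]).count();
        }
        if (stageDistance >= stageThreshold) {
            return false;  // discard this candidate early
        }
        distance += stageDistance;
    }
    distanceOut = distance;
    return true;
}
```

Since most candidates fail in the first stage, the remaining 384 bits are rarely touched, which is where the speed-up comes from.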
To somewhat compensate for rotation changes, FREAK measures the orientation of the keypoint and rotates the sampling pairs by the measured angle. FREAK’s mechanism for measuring the orientation is similar to that of BRISK[2], except that instead of using long-distance pairs, FREAK uses a predefined set of 45 symmetric sampling pairs:
For more details on orientation assignment, you can refer to the post on BRISK[2].
Though we’ll have a whole post that will compare the performance of the binary descriptors, we should say a few words now.
The next post will be the last in our series about binary descriptors and will give a performance evaluation of the descriptors we have covered.
[1] Tola, Engin, Vincent Lepetit, and Pascal Fua. “A fast local descriptor for dense matching.” Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008.
[2] Leutenegger, Stefan, Margarita Chli, and Roland Y. Siegwart. “BRISK: Binary robust invariant scalable keypoints.” Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
[3] Rublee, Ethan, et al. “ORB: an efficient alternative to SIFT or SURF.” Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
[4] Alahi, Alexandre, Raphael Ortiz, and Pierre Vandergheynst. “Freak: Fast retina keypoint.” Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.
[5] Calonder, Michael, et al. “Brief: Binary robust independent elementary features.” Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010. 778-792.
This is the fourth post in our series about binary descriptors, and it will talk about the BRISK descriptor [1]. So far we have had an introduction to patch descriptors, an introduction to binary descriptors and posts about the BRIEF [2] and the ORB [3] descriptors.
We’ll start by showing the following figure that shows an example of using BRISK to match between real world images with viewpoint change. Green lines are valid matches, red circles are detected keypoints.
As you may recall from the previous posts, a binary descriptor is composed out of three parts:
Recall that to build the binary string representing a region around a keypoint we need to go over all the pairs and for each pair (p1, p2) – if the intensity at point p1 is greater than the intensity at point p2, we write 1 in the binary string and 0 otherwise.
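The rule above fits in a few lines of Python. This is only a sketch: the function name and the representation of a patch as a list of rows (indexed as patch[y][x]) are assumptions for illustration, and a real descriptor would compare smoothed intensities rather than raw pixels.

```python
def binary_descriptor(patch, pairs):
    """Build the bit list for a patch from a list of sampling pairs.

    patch: 2D list of intensities, indexed as patch[y][x].
    pairs: list of ((x1, y1), (x2, y2)) sampling pairs.
    """
    bits = []
    for (x1, y1), (x2, y2) in pairs:
        # Write 1 if the intensity at p1 is greater than at p2, otherwise 0.
        bits.append(1 if patch[y1][x1] > patch[y2][x2] else 0)
    return bits


patch = [[10, 200],
         [50, 50]]
pairs = [((0, 0), (1, 0)), ((1, 0), (0, 1)), ((0, 1), (1, 1))]
descriptor = binary_descriptor(patch, pairs)  # [0, 1, 0]
```

Note that equal intensities produce a 0, since the comparison is strictly "greater than".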
The BRISK descriptor is different from the descriptors we talked about earlier, BRIEF and ORB, in that it has a hand-crafted sampling pattern. BRISK’s sampling pattern is composed of concentric rings:
When considering each sampling point, we take a small patch around it and apply Gaussian smoothing. The red circle in the figure above illustrates the size of the standard deviation of the Gaussian filter applied to each sampling point.
When using this sampling pattern, we distinguish between short pairs and long pairs. Short pairs are pairs of sampling points whose distance is below a certain threshold d_max, and long pairs are pairs of sampling points whose distance is above a certain different threshold d_min, where d_min>d_max, so that no short pair is also a long pair.
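As a small sketch of this split (the function name and the toy points are made up for illustration, and the thresholds are passed in rather than taken from BRISK):

```python
import math
from itertools import combinations


def split_pairs(points, d_max, d_min):
    """Split all pairs of sampling points into short and long pairs.

    Pairs whose distance falls between d_max and d_min belong to neither set.
    """
    short_pairs, long_pairs = [], []
    for p, q in combinations(points, 2):
        d = math.dist(p, q)
        if d < d_max:
            short_pairs.append((p, q))
        elif d > d_min:
            long_pairs.append((p, q))
    return short_pairs, long_pairs


points = [(0, 0), (1, 0), (10, 0)]
short_pairs, long_pairs = split_pairs(points, d_max=2, d_min=5)
```

Because d_min is larger than d_max, the two branches can never both apply to the same pair.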
Long pairs are used in BRISK to determine orientation and short pairs are used for the intensity comparisons that build the descriptor, as in BRIEF and ORB.
To illustrate this and help make things clear, here are figures of BRISK’s short pairs – each red line represents one pair. Each figure shows 100 pairs:
BRISK is equipped with a mechanism for orientation compensation; by trying to estimate the orientation of the keypoint and rotating the sampling pattern by that orientation, BRISK becomes somewhat invariant to rotation.
For computing the orientation of the keypoint, BRISK uses local gradients between the sampling pairs, which are defined by
g(pi,pj) = (pj - pi) * (I(pj,σj) - I(pi,σi)) / ||pj - pi||^2
where g(pi,pj) is the local gradient between the sampling pair (pi,pj), and I(pi,σi) is the intensity at sampling point pi after smoothing with a Gaussian of the appropriate standard deviation σi (see the figure above of the BRISK sampling pattern).
To compute the orientation, we sum up the local gradients over all the long pairs and take arctan(gy/gx) – the arctangent of the y component of the summed gradient divided by its x component. This gives us the angle of the keypoint. Now, we only need to rotate the short pairs by that angle to help the descriptor become more invariant to rotation. Note that BRISK uses only the long pairs for computing orientation, based on the assumption that local gradients cancel each other out and are thus not necessary for determining the global gradient.
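Here is a rough Python sketch of this step. It assumes the local gradient from the BRISK paper [1], g(pi,pj) = (pj - pi) * (I(pj) - I(pi)) / ||pj - pi||^2, and takes the smoothed intensities as a precomputed dictionary. The function names and the data layout are illustrative only. It also uses atan2 rather than a plain arctangent of gy/gx, so that the quadrant is resolved and gx = 0 does not cause a division by zero.

```python
import math


def brisk_orientation(long_pairs, intensity):
    """Estimate the keypoint angle (in radians) from the long pairs.

    long_pairs: list of (pi, pj) sampling points, each an (x, y) tuple.
    intensity:  dict mapping each sampling point to its smoothed intensity.
    """
    gx = gy = 0.0
    for pi, pj in long_pairs:
        dx, dy = pj[0] - pi[0], pj[1] - pi[1]
        # Intensity difference divided by the squared distance between the points.
        scale = (intensity[pj] - intensity[pi]) / (dx * dx + dy * dy)
        gx += dx * scale
        gy += dy * scale
    return math.atan2(gy, gx)


def rotate_point(p, angle):
    """Rotate a sampling point around the keypoint (the origin) by `angle`."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])
```

Rotating every point of every short pair with rotate_point, using the angle returned by brisk_orientation, gives the rotated pairs that are used for the intensity comparisons.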
As with all binary descriptors, building the descriptor is done by performing intensity comparisons. BRISK takes the set of short pairs, rotates the pairs by the orientation computed earlier and makes comparisons of the form:
This means that for each short pair, BRISK takes the smoothed intensities of the two sampling points and checks whether the smoothed intensity of the first point in the pair is larger than that of the second point. If it is, it writes 1 in the corresponding bit of the descriptor, and otherwise 0. Remember that BRISK uses only the short pairs for building the descriptor.
As usual, the distance between two descriptors is defined as the number of bits in which they differ (the Hamming distance), and can be easily computed by XORing the two descriptors and counting the bits that are set in the result.
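For example, with descriptors stored as Python bytes objects, the distance could be computed as in the sketch below (the function name is ours; optimized implementations typically use a hardware popcount instruction instead):

```python
def hamming_distance(d1: bytes, d2: bytes) -> int:
    """Number of bits in which two equal-length descriptors differ."""
    if len(d1) != len(d2):
        raise ValueError("descriptors must have the same length")
    # XOR leaves a 1 exactly where the two descriptors disagree.
    x = int.from_bytes(d1, "big") ^ int.from_bytes(d2, "big")
    return bin(x).count("1")


distance = hamming_distance(b"\xff\x00", b"\x0f\x01")  # 5
```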
You are probably asking about performance. We’ll have a detailed post that will talk all about the performance of the different binary descriptors, but for now I will say a few words comparing BRISK to the previous descriptors we talked about – BRIEF and ORB:
Stay tuned for the next post in the series that will talk about the FREAK descriptor, the last binary descriptor we will focus on before giving a detailed performance evaluation.
Gil.
References:
[1] Leutenegger, Stefan, Margarita Chli, and Roland Y. Siegwart. “BRISK: Binary robust invariant scalable keypoints.” Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
[2] Calonder, Michael, et al. “Brief: Binary robust independent elementary features.” Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010. 778-792.
[3] Rublee, Ethan, et al. “ORB: an efficient alternative to SIFT or SURF.” Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
This is the third post in our series about binary descriptors, and it will talk about the ORB descriptor [1]. So far we have had an introduction to patch descriptors, an introduction to binary descriptors and a post about the BRIEF [2] descriptor.
We’ll start by showing the following figure that shows an example of using ORB to match between real world images with viewpoint change. Green lines are valid matches, red circles indicate unmatched points.
Now, as you may recall from the previous posts, a binary descriptor is composed out of three parts:
Recall that to build the binary string representing a region around a keypoint we need to go over all the pairs and for each pair (p1, p2) – if the intensity at point p1 is greater than the intensity at point p2, we write 1 in the binary string and 0 otherwise.
The ORB descriptor is somewhat similar to BRIEF. It doesn’t have an elaborate sampling pattern like BRISK [3] or FREAK [4]. However, there are two main differences between ORB and BRIEF:
ORB uses a simple measure of corner orientation – the intensity centroid [5]. First, the moments of a patch are defined as:
m_pq = Σ_{x,y} x^p y^q I(x,y)
With these moments we can find the centroid, the “center of mass” of the patch as:
C = (m_10/m_00, m_01/m_00)
We can construct a vector from the corner’s center O to the centroid C, denoted OC. The orientation of the patch is then given by:
θ = atan2(m_01, m_10)
Here is an illustration to help explain the method:
Once we’ve calculated the orientation of the patch, we can rotate it to a canonical rotation and then compute the descriptor, thus obtaining some rotation invariance.
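The intensity centroid computation can be sketched in plain Python as follows. The sketch uses the standard moment definitions (m_pq as the sum of x^p y^q I(x,y) over the patch) with coordinates measured from the patch center. It assumes a square or rectangular patch given as a list of rows, whereas ORB itself restricts the computation to a circular region, so treat it as an illustration rather than ORB's implementation.

```python
import math


def intensity_centroid_orientation(patch):
    """Return (centroid, angle) of a patch given as a 2D list patch[row][col].

    Coordinates are relative to the patch center O, with x pointing right
    and y pointing down, as is usual for images.
    """
    h, w = len(patch), len(patch[0])
    cx, cy = (w - 1) / 2, (h - 1) / 2
    m00 = m10 = m01 = 0.0
    for row in range(h):
        for col in range(w):
            x, y = col - cx, row - cy
            value = patch[row][col]
            m00 += value
            m10 += x * value
            m01 += y * value
    centroid = (m10 / m00, m01 / m00) if m00 else (0.0, 0.0)
    # Angle of the vector from the center O to the centroid C.
    return centroid, math.atan2(m01, m10)
```

A patch that is brighter on its right side yields an angle of 0, and a patch that is brighter at the bottom yields an angle of π/2, since the y axis points down.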
There are two properties we would like our sampling pairs to have. One is low correlation – we would like the sampling pairs to be uncorrelated, so that each new pair brings new information to the descriptor, thus maximizing the amount of information the descriptor carries. The other is high variance of the pairs – high variance makes a feature more discriminative, since it responds differently to different inputs.
The authors of ORB suggest learning the sampling pairs to ensure they have these two properties. A simple calculation [1] shows that there are about 205,000 possible tests (sampling pairs) to consider. From that vast number of tests, only 256 will be chosen.
The learning is done as follows. First, they set up a training set of about 300,000 keypoints drawn from the PASCAL 2006 dataset [6]. Next, they apply the following greedy algorithm:
Once this algorithm terminates, we obtain a set of 256 relatively uncorrelated tests with high variance.
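A simplified sketch of such a greedy selection, following the description in the ORB paper [1], is given below. Tests are ordered by how close their mean response is to 0.5 (which corresponds to high variance for a binary test), and a test is kept only if its absolute correlation with every test already kept does not exceed a threshold. The function names, the data layout and the threshold value are assumptions for illustration. The paper additionally raises the threshold and retries if fewer than 256 tests survive, which this sketch leaves out.

```python
import math


def correlation(a, b):
    """Pearson correlation of two equal-length lists of 0/1 test outcomes."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant test is treated as uncorrelated with everything
    return cov / math.sqrt(var_a * var_b)


def select_tests(results, n_select, threshold):
    """Greedily pick up to n_select high-variance, weakly correlated tests.

    results: dict mapping a test id to its list of 0/1 outcomes,
             one outcome per training patch.
    """
    # Tests whose mean is closest to 0.5 come first.
    ordered = sorted(results, key=lambda t: abs(sum(results[t]) / len(results[t]) - 0.5))
    selected = []
    for t in ordered:
        if all(abs(correlation(results[t], results[s])) <= threshold for s in selected):
            selected.append(t)
            if len(selected) == n_select:
                break
    return selected
```

In the setting described above, results would hold one entry for each of the roughly 205,000 candidate tests, each evaluated on all training patches, and n_select would be 256.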
To conclude, ORB is a binary descriptor that is similar to BRIEF, with the added advantages of rotation invariance and learned sampling pairs. You’re probably asking yourself, how does ORB perform in comparison to BRIEF? Well, under non-geometric transformations (those that depend on the image capture and not on the viewpoint, such as blur, JPEG compression, exposure and illumination) BRIEF actually outperforms ORB. Under affine transformations, BRIEF performs poorly for large rotation or scale changes, as it’s not designed to handle such changes. Under perspective transformations, which are the result of viewpoint change, BRIEF surprisingly slightly outperforms ORB. For further details, refer to [7] or wait for the last post in this tutorial, which will give a performance evaluation of the binary descriptors.
The next post will talk about BRISK [3], which was actually presented at the same conference as ORB. It differs from BRIEF and ORB by using a hand-crafted sampling pattern.
Gil.
References:
[1] Rublee, Ethan, et al. “ORB: an efficient alternative to SIFT or SURF.” Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
[2] Calonder, Michael, et al. “Brief: Binary robust independent elementary features.” Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010. 778-792.
[3] Leutenegger, Stefan, Margarita Chli, and Roland Y. Siegwart. “BRISK: Binary robust invariant scalable keypoints.” Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
[4] Alahi, Alexandre, Raphael Ortiz, and Pierre Vandergheynst. “Freak: Fast retina keypoint.” Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.
[5] Rosin, Paul L. “Measuring corner properties.” Computer Vision and Image Understanding 73.2 (1999): 291-307.
[6] M. Everingham. The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results. http://pascallin.ecs.soton.ac.uk/challenges/VOC/databases.html.
[7] Heinly, Jared, Enrique Dunn, and Jan-Michael Frahm. “Comparative evaluation of binary features.” Computer Vision–ECCV 2012. Springer Berlin Heidelberg, 2012. 759-773.