Lately, I’ve been reading a lot about BOW (Bag of Words) models and I thought it would be nice to write a short post on the subject. The post is based on Li Fei-Fei’s slides from the ICCV 2005 course on object detection:
As the name implies, the concept of BOW is taken from text analysis. The idea is to represent a document as a “bag” of important keywords, ignoring the order of the words (that’s why it’s called a “bag” of words rather than, say, a list).
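To make the text version concrete, here is a tiny Python sketch (the sentences and the stop-word list are made-up examples) that turns short documents into unordered word-count bags:

```python
from collections import Counter

# Two toy "documents" -- made-up sentences just for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

stop_words = {"the", "on"}  # words we choose to ignore

for doc in docs:
    # Keep only the "important" keywords and count them,
    # throwing away any information about word order.
    bag = Counter(w for w in doc.split() if w not in stop_words)
    print(bag)
# Counter({'cat': 1, 'sat': 1, 'mat': 1})
# Counter({'cat': 1, 'dog': 1, 'chased': 1})
```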
In computer vision, the idea is similar: we represent an object as a bag of visual words, i.e. image patches described by a certain descriptor:
We can use the bag of words model for object categorization by constructing a large vocabulary of visual words and representing each image as a histogram of the frequencies of the visual words that appear in it. The following figure illustrates this idea:
How exactly do we construct the model? First, we need to build a visual dictionary. We do that by taking a large set of object images and extracting descriptors from them (SIFT descriptors, for example), either on a dense grid or at detected keypoints (see the post on descriptors if you’re not sure about their usage).
Next, we cluster the set of descriptors (using k-means, for example) into k clusters. The cluster centers act as our dictionary’s visual words, as in the sketch below.
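As a concrete sketch of these two dictionary-building steps, here is one possible Python version using OpenCV; the image paths are hypothetical, the dictionary size K is an arbitrary choice, and SIFT is assumed to be available in the OpenCV build:

```python
import cv2
import numpy as np

K = 200  # dictionary size (number of visual words) -- an arbitrary choice

# Hypothetical paths to the training images.
train_paths = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]

sift = cv2.SIFT_create()  # assumes an OpenCV build that includes SIFT

# Step 1: extract SIFT descriptors from every training image.
all_descriptors = []
for path in train_paths:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, descriptors = sift.detectAndCompute(img, None)
    if descriptors is not None:
        all_descriptors.append(descriptors)
all_descriptors = np.vstack(all_descriptors).astype(np.float32)

# Step 2: cluster the descriptors into K clusters with k-means.
# The K cluster centers are the dictionary's visual words.
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 1e-4)
_, _, dictionary = cv2.kmeans(all_descriptors, K, None, criteria, 3,
                              cv2.KMEANS_PP_CENTERS)
# 'dictionary' is a K x 128 float32 array, one row per visual word.
```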
Given a new image, we represent it using our model in the following manner: first, extract descriptors from the image on a grid or around detected keypoints. Next, for each descriptor extracted compute its nearest neighbor in the dictionary. Finally, build a histogram of length k where the i’th value is the frequency of the i’th dictionary word:
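Continuing the sketch above, a new image can be mapped to a length-k histogram by assigning each of its descriptors to the nearest dictionary word (the dictionary and sift variables come from the previous snippet, and the image path is a placeholder):

```python
def bow_histogram(image_path, dictionary, sift):
    """Represent an image as a length-k histogram of visual word frequencies."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, descriptors = sift.detectAndCompute(img, None)
    k = dictionary.shape[0]
    if descriptors is None:
        return np.zeros(k, dtype=np.float32)

    # Assign every descriptor to its nearest visual word (Euclidean distance).
    dists = np.linalg.norm(descriptors[:, None, :] - dictionary[None, :, :], axis=2)
    nearest_word = dists.argmin(axis=1)

    # The i'th bin counts how many descriptors were assigned to word i;
    # normalizing makes images with different numbers of keypoints comparable.
    hist = np.bincount(nearest_word, minlength=k).astype(np.float32)
    return hist / hist.sum()

# Example: the histogram of a (hypothetical) new image.
hist = bow_histogram("new_image.jpg", dictionary, sift)
```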
This model can be used in conjunction with a Naïve Bayes classifier or with an SVM for object classification.
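For example, taking the SVM route, a minimal sketch (assuming scikit-learn is available and that the histograms and labels below come from the snippets above; the labels themselves are made up) could look like this:

```python
from sklearn.svm import LinearSVC

# Hypothetical labels, one per training image (e.g. 0 = "car", 1 = "face").
train_labels = [0, 0, 1]
train_hists = np.vstack([bow_histogram(p, dictionary, sift) for p in train_paths])

# Train a linear SVM on the BOW histograms ...
clf = LinearSVC()
clf.fit(train_hists, train_labels)

# ... and classify a new image by its histogram.
predicted_class = clf.predict(hist.reshape(1, -1))[0]
```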
There is a nice demonstration in VLFeat of a SIFT-based BOW model with an SVM for object classification on the Caltech101 benchmark.
I ran a tiny example of the code using only 10 classes, with 15 images for training and 15 images for testing, and got the following confusion matrix:
The confusion matrix shows the score each class’s training set got when each class’s classifier was run on it: the (i, j) element is the result of applying the class-j classifier to the class-i training set. As you can see, the top scores are on the diagonal, meaning that each class’s top score was obtained by running its own classifier on its own training data.
OpenCV also has a module for BOW model classification that you can read about here. There are many blog posts about using BOW in OpenCV, so I won’t go into detail, but will just give simple pseudocode that explains the general usage (thanks to Mathieu Barnachon from the OpenCV Q&A forum!):
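In Python, the general flow with cv2.BOWKMeansTrainer and cv2.BOWImgDescriptorExtractor looks roughly like the sketch below (the image paths, vocabulary size, and the choice of SIFT with a brute-force matcher are all placeholder assumptions):

```python
import cv2
import numpy as np

vocab_size = 100                                   # arbitrary vocabulary size
train_paths = ["train_001.jpg", "train_002.jpg"]   # placeholder image paths

sift = cv2.SIFT_create()              # descriptor extractor
matcher = cv2.BFMatcher(cv2.NORM_L2)  # matcher used to assign descriptors to words

# 1. Collect training descriptors and cluster them into a vocabulary.
bow_trainer = cv2.BOWKMeansTrainer(vocab_size)
for path in train_paths:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    if desc is not None:
        bow_trainer.add(np.float32(desc))
vocabulary = bow_trainer.cluster()

# 2. The BOW image descriptor extractor maps an image to a word histogram.
bow_extractor = cv2.BOWImgDescriptorExtractor(sift, matcher)
bow_extractor.setVocabulary(vocabulary)

# 3. Compute the BOW descriptor of an image; this vector is what you feed
#    to the classifier (an SVM, for example) for training and prediction.
img = cv2.imread("test.jpg", cv2.IMREAD_GRAYSCALE)
keypoints = sift.detect(img, None)
bow_descriptor = bow_extractor.compute(img, keypoints)
```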
Stay tuned for the next posts that will talk about binary descriptors.
Csurka, Gabriella, et al. “Visual categorization with bags of keypoints.” Workshop on Statistical Learning in Computer Vision, ECCV. Vol. 1. 2004.
Lowe, David G. “Object recognition from local scale-invariant features.” Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on. Vol. 2. IEEE, 1999.