Lately, I’ve been reading a lot about BOW (Bag of Words) models, and I thought it would be nice to write a short post on the subject. The post is based on Li Fei-Fei’s slides from an ICCV 2005 course about object detection:
As the name implies, the concept of BOW is taken from text analysis. The idea is to represent a document as a “bag” of important keywords, ignoring the order of the words (that’s why it’s called a “bag of words” rather than, say, a list).
Illustration of Bag of words model in documents
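To make the text version concrete, here is a minimal sketch in Python. The function name `bag_of_words`, the tokenizing regular expression, and the optional stopword set are my own illustrative choices, not something taken from the slides. The point is only that two documents with the same words in a different order end up with the same bag.

```python
import re
from collections import Counter

def bag_of_words(text, stopwords=frozenset()):
    """Count word occurrences in `text`, ignoring word order.

    The tokenizer (lowercase runs of letters and apostrophes) and the
    stopword filter are illustrative assumptions.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

doc_a = bag_of_words("the cat chased the dog")
doc_b = bag_of_words("the dog chased the cat")

# Same words, different order: the two bags are identical.
print(doc_a == doc_b)  # True
print(doc_a["the"])    # 2
```

A list would keep the order of the words; the counter keeps only how often each one appears.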
In computer vision, the idea is similar. We represent an object as a bag of visual words – patches that are described by a certain descriptor:
Illustration of Bag of words model in images
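The same idea can be sketched for images. The sketch below assumes that each patch has already been turned into a descriptor vector, and that a small "codebook" of representative descriptors (the visual words) is given. Both the codebook and the nearest-neighbour assignment are assumptions I am making for illustration, and the 2-D vectors are toy data, not real descriptors.

```python
from collections import Counter

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest_word(descriptor, codebook):
    """Index of the codebook entry closest to `descriptor`."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(descriptor, codebook[i]))

def bag_of_visual_words(descriptors, codebook):
    """Histogram of visual-word occurrences; patch positions are discarded."""
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    return [counts[i] for i in range(len(codebook))]

# Hypothetical codebook with three visual words, and four patch descriptors.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
patches = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8)]

print(bag_of_visual_words(patches, codebook))  # [1, 2, 1]
```

Just as with documents, the histogram records which visual words appear and how often, but not where in the image they were found.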
Since the next few posts will talk about binary descriptors, I thought it would be a good idea to post a short introduction to the subject of patch descriptors. This post covers the motivation for patch descriptors and their common usage, and highlights descriptors based on the Histogram of Oriented Gradients (HOG).
I think the best way to start is to consider one application of patch descriptors and to explain the common pipeline in which they are used. Consider, for example, the application of image alignment: we would like to align two images of the same scene taken from slightly different viewpoints. One way of doing so is by applying the following steps:
Compute distinctive keypoints in both images (for example, corners).
Compare the keypoints between the two images to find matches.
Use the matches to find a general mapping between the images (for example, a homography).
Apply the mapping to the first image to align it to the second image.
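Two of these steps, comparing keypoints and applying the mapping, are easy to sketch in plain Python. The sketch assumes that every keypoint already comes with a descriptor vector and that the 3x3 homography has already been estimated; keypoint detection and homography estimation are left out. The function names and the toy numbers are hypothetical, and the brute-force nearest-neighbour matching is just the simplest strategy I could pick, not necessarily what a real system would use.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def match_keypoints(desc_a, desc_b):
    """For each descriptor in image A, find its nearest descriptor in image B.

    Returns a list of (index_in_a, index_in_b) pairs.
    """
    matches = []
    for i, d in enumerate(desc_a):
        j = min(range(len(desc_b)),
                key=lambda k: squared_distance(d, desc_b[k]))
        matches.append((i, j))
    return matches

def apply_homography(H, point):
    """Map a 2-D point through a 3x3 homography H (given as nested lists)."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    # Divide by w to go from homogeneous back to image coordinates.
    return (xh / w, yh / w)

# Toy descriptors for two keypoints in each image.
desc_a = [(0, 0), (10, 10)]
desc_b = [(9, 11), (1, 0)]
print(match_keypoints(desc_a, desc_b))  # [(0, 1), (1, 0)]

# A homography that is a pure translation by (+5, -2).
H = [[1, 0, 5],
     [0, 1, -2],
     [0, 0, 1]]
print(apply_homography(H, (3, 4)))  # (8.0, 2.0)
```

Aligning a whole image amounts to applying the same mapping to every pixel position. Note that the matching step is only as good as the descriptors it compares.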
Using descriptors to compare patches