Scene Recognition with Bag of Words
Class | Instructor | Date | Language | Ta'ed | Code |
---|---|---|---|---|---|
CS 6476 Computer Vision | James Hays | Fall 2015 | MATLAB | No | Code N/A |

Best Performing Tiny Image confusion matrix
The purpose of this project was to implement various image parameterization and classification techniques to assist in the task of scene recognition. From 15 categories of images, we build training and testing sets to train and validate, respectively, the classifiers we implement:
- Tiny Image representation + K-NN Classification
- Bag of SIFT Words + K-NN Classification
- Bag of SIFT Words + Linear SVM Classification
- Bag of SIFT Words + Non-Linear SVM Classification
- Bag of PHOW Words + K-NN/Linear/Non-Linear SVM Classification
Tiny Image representation + K-NN Classification
In an ideal world, we would train and validate our classifiers on entire images, but this is intractable for any kind of thorough image recognition due to the sheer size of the data required. To counter this, image representation mechanisms are implemented to capture the important contextual elements of an image in a reduced space, making classification tasks tractable.
One of the simplest such mechanisms is the tiny image representation. Basically, the image is shrunk to a size sufficiently small that it can itself be used as a feature descriptor of the original image. It is quickly apparent that this is not a desirable solution, as all the detail and high-frequency content in the image is lost - in effect, the process of shrinking the image is equivalent to blurring it.
Equally straightforward is the concept behind the first classifier that I implemented - K-NN. Basically, it says that a sample will probably be classified like its neighbors in feature space (those samples it generally "looks like"). For this calculation I implemented both a hard-threshold and a weighted voting mechanism, where each of the first K neighbors influences the predicted classification of a sample in inverse proportion to its distance. This was intended to minimize the impact of outliers, which may be very close to a sample but whose classes are sparsely represented in the neighborhood. On average, this gave a few percentage points better performance everywhere I used K-NN, as sketched below.
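A minimal sketch of the distance-weighted K-NN voting described above (function and variable names are illustrative, not taken from the original code):

```matlab
% Distance-weighted K-NN voting (illustrative sketch).
% train_feats : D x Ntrain matrix, test_feats : D x Ntest matrix (same class, single/double),
% train_labels : cell array of Ntrain category names.
function predicted = knn_weighted_vote(train_feats, train_labels, test_feats, K)
    categories = unique(train_labels);
    num_test   = size(test_feats, 2);
    predicted  = cell(num_test, 1);
    D = vl_alldist2(train_feats, test_feats);        % pairwise (squared) L2 distances
    for i = 1:num_test
        [dists, idx] = sort(D(:, i), 'ascend');      % nearest training samples first
        votes = zeros(numel(categories), 1);
        for k = 1:K
            c = find(strcmp(categories, train_labels{idx(k)}));
            votes(c) = votes(c) + 1 / (dists(k) + eps);   % closer neighbors vote more
        end
        [~, best] = max(votes);
        predicted{i} = categories{best};
    end
end
```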
For the Tiny Image portion of the project, I shrank the images to 16x16 pixels. I alternated between shrinking only the center square of the image and shrinking the entire image; using the central square improved performance by a few percent, with the tiny images also being normalized. Compared to random guessing, which would be expected to be correct around 6-7% of the time (1 in 15), the tiny images did reasonably well at ~22-23%, but for a real recognition task this performance is severely lacking.
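A minimal sketch of the tiny image feature, assuming a center-square crop, a 16x16 resize, and zero-mean / unit-length normalization as described (names are illustrative):

```matlab
% Tiny image features (illustrative sketch): center-square crop, 16x16
% resize, then zero-mean / unit-length normalization of each vector.
function feats = tiny_image_features(image_paths)
    dim   = 16;
    feats = zeros(numel(image_paths), dim * dim);
    for i = 1:numel(image_paths)
        img = imread(image_paths{i});
        if size(img, 3) == 3, img = rgb2gray(img); end
        img = im2single(img);
        [h, w] = size(img);
        s  = min(h, w);                               % side of the center square
        r0 = floor((h - s) / 2) + 1;
        c0 = floor((w - s) / 2) + 1;
        tiny = imresize(img(r0:r0+s-1, c0:c0+s-1), [dim dim]);
        tiny = tiny(:) - mean(tiny(:));               % zero mean
        feats(i, :) = tiny' / (norm(tiny) + eps);     % unit length
    end
end
```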
The best performance for the Tiny Image representation with the K-NN classifier was:
Avg accuracy: 0.233667 across 10 runs, std 0.005915
Best accuracy: 0.246 with K = 6
K values across the runs: [14, 5, 1, 1, 17, 7, 1, 1, 6, 1]
Bag of SIFT Words + K-NN Classification
Instead of Tiny Images, I next used a Bag of Words representation derived from SIFT descriptors of the training images. To accomplish this, I used vl_dsift to acquire SIFT descriptors from each image, then compiled a descriptor "vocabulary" by clustering the descriptors with vlfeat's k-means and saving each cluster centroid as a visual word. I varied the vocabulary size (# of clusters) from 10 to 800 and the step size from 100 to 8, but ended up using a vocabulary size of 400, a step size of 8, and a SIFT bin size of 4 as my primary BOW settings. I varied vl_kmeans's clustering algorithm between the default Lloyd, Elkan, and ANN, and found that ANN performed as well as Elkan and Lloyd (on average) and was a bit faster. In general, clustering the vocabulary for the Bag of Words generation took the longest of all the components of this project, but this was alleviated somewhat by saving files with names reflecting the hyperparameters used to generate them. A sketch of the vocabulary step is below.
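A minimal sketch of the vocabulary construction with vl_dsift and vl_kmeans, using the hyperparameters quoted above (function and variable names are illustrative):

```matlab
% Vocabulary construction (illustrative sketch): pool dense SIFT descriptors
% from all training images, then cluster them into visual words with vl_kmeans.
function vocab = build_vocabulary(image_paths, vocab_size)
    step = 8; bin_size = 4;                           % hyperparameters discussed above
    all_descrs = [];
    for i = 1:numel(image_paths)
        img = imread(image_paths{i});
        if size(img, 3) == 3, img = rgb2gray(img); end
        [~, descrs] = vl_dsift(im2single(img), 'Step', step, 'Size', bin_size, 'Fast');
        all_descrs = [all_descrs, single(descrs)];    % 128 x N SIFT descriptors
    end
    % ANN was about as accurate as Lloyd/Elkan on average, and faster.
    vocab = vl_kmeans(all_descrs, vocab_size, 'Algorithm', 'ANN');  % 128 x vocab_size centroids
end
```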
I then built normalized histograms for each image in the test and training sets, counting the occurrences of each vocabulary word among that image's SIFT descriptors. The performance of this method was over twice as good as the best Tiny Image performance, which stands to reason: the details of the images are not lost as they are with blurring/shrinking, but rather encoded, albeit without any sense of their location within the image. A sketch of the histogram step follows.
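A minimal sketch of the histogram construction, assuming vocab is the 128 x vocab_size centroid matrix from the vocabulary step above (names are illustrative):

```matlab
% Bag-of-words histograms (illustrative sketch); vocab is the 128 x vocab_size
% centroid matrix produced by the vocabulary step.
function feats = bags_of_sifts(image_paths, vocab)
    vocab_size = size(vocab, 2);
    feats = zeros(numel(image_paths), vocab_size);
    for i = 1:numel(image_paths)
        img = imread(image_paths{i});
        if size(img, 3) == 3, img = rgb2gray(img); end
        [~, descrs] = vl_dsift(im2single(img), 'Step', 8, 'Size', 4, 'Fast');
        D = vl_alldist2(vocab, single(descrs));           % word-to-descriptor distances
        [~, words] = min(D, [], 1);                       % nearest visual word per descriptor
        h = histcounts(words, 1:vocab_size + 1);          % word counts for this image
        feats(i, :) = h / sum(h);                         % L1-normalized histogram
    end
end
```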
The best performance for the Bag of SIFT Words representation with the K-NN classifier was:
Without neighbor voting: Avg accuracy 0.534133 across 10 runs, std 0.010134
Best K = 15 gives accuracy 0.543333
With neighbor voting: Avg accuracy 0.551400 across 10 runs, std 0.012914
Best K = 5 gives accuracy 0.579333
Bag of SIFT Words + Linear SVM Classification
I then replaced the K-NN classifier with a linear SVM classifier, which attempts to partition the feature space into "membership/non-membership" regions for each category using a hyperplane in the n-dimensional feature space. To implement this I used vlfeat's vl_svmtrain function. I found that a lambda of 0.000240 worked best with my bag of SIFT words representation, and this classifier added over 10% to the performance of this configuration.
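A minimal sketch of membership/non-membership classification with vl_svmtrain, trained one SVM per category and using the lambda quoted above (the variable names and overall framing are my own illustration, not the original code):

```matlab
% One-vs-all linear SVM classification with vl_svmtrain (illustrative sketch).
% train_feats / test_feats hold one image per column (double), train_labels is
% a cell array of category names; lambda matches the value quoted above.
function predicted = svm_classify(train_feats, train_labels, test_feats)
    lambda     = 0.000240;
    categories = unique(train_labels);
    num_cats   = numel(categories);
    W = zeros(size(train_feats, 1), num_cats);
    B = zeros(1, num_cats);
    for c = 1:num_cats
        binary = 2 * double(strcmp(train_labels, categories{c})) - 1;   % +1 member / -1 non-member
        [W(:, c), B(c)] = vl_svmtrain(train_feats, binary, lambda);
    end
    scores = W' * test_feats + repmat(B', 1, size(test_feats, 2));      % class confidences
    [~, best] = max(scores, [], 1);
    predicted = categories(best);
end
```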
The best performance for the Bag of SIFT Words representation with the Linear SVM classifier was:
Avg accuracy: 0.662267 across 10 runs, std 0.004436
Best accuracy (mean of diagonal of confusion matrix): 0.684
Bag of SIFT Words + Non-Linear SVM Classification
In an effort to match a more complex feature space topology, I used a non-linear SVM classifier via primal_svm.m from Olivier Chapelle's website. The implementation was similar to using the vl_svmtrain function, except that instead of passing just the labels and training examples, a kernel matrix was built from the training examples using a Gaussian RBF, and a second Gaussian kernel between the testing and training examples was built for evaluation. I used a lambda of 0.000001 and a gamma of 0.5 for the Gaussian RBF, and I implemented my own kernel function, sketched below.
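A minimal sketch of the Gaussian RBF kernel described above (gamma = 0.5). How the resulting kernel matrices are handed to primal_svm.m depends on that code's interface, so only the kernel construction is shown, with assumed row-per-example feature matrices:

```matlab
% Gaussian RBF kernel matrix (illustrative sketch). For training, K_train =
% rbf_kernel(train_feats, train_feats, 0.5); for evaluation, K_test =
% rbf_kernel(test_feats, train_feats, 0.5). X1 : N1 x D, X2 : N2 x D rows.
function K = rbf_kernel(X1, X2, gamma)
    sq1 = sum(X1 .^ 2, 2);                                  % squared norms of X1 rows
    sq2 = sum(X2 .^ 2, 2)';                                 % squared norms of X2 rows
    sq_dists = bsxfun(@plus, sq1, sq2) - 2 * (X1 * X2');    % pairwise squared distances
    K = exp(-gamma * sq_dists);                             % RBF kernel entries
end
```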
The best performance for the Bag of SIFT Words representation with the Non-Linear SVM classifier was:
Avg accuracy: 0.690400 across 10 runs, std 0.008857
Best accuracy (mean of diagonal of confusion matrix): 0.714
Bag of PHOW Words + KNN/Linear/Non-Linear SVM Classification
In an attempt to improve recognition performance, I tried using PHOW (Pyramid Histogram Of Words) descriptors, which are essentially dense SIFT descriptors taken at multiple scales. These descriptors were slower to compute (I used the same hyperparameters as for the regular SIFT Bag of Words descriptors) but performed a little better than the pure SIFT Bag of Words. I used the built-in vlfeat function vl_phow, with a step size of 8 and the default scale ramp.
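A minimal sketch of PHOW extraction with vl_phow, using the step size noted above (the image path is hypothetical):

```matlab
% PHOW (multi-scale dense SIFT) extraction with vl_phow (illustrative sketch;
% the image path is hypothetical). The columns of descrs feed the same
% vl_kmeans vocabulary / histogram pipeline used for the plain SIFT bag of words.
img = im2single(imread('example_scene.jpg'));
[~, descrs] = vl_phow(img, 'Step', 8, 'Color', 'gray');   % default scale ramp
```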
The best performance for the Bag of PHOW Words representation with the K-NN classifier was:
Without neighbor voting: Avg accuracy 0.544200 across 10 runs, std 0.008345
Best K = 5 gives accuracy 0.550000
With neighbor voting: Avg accuracy 0.564333 across 10 runs, std 0.015226
Best K = 7 gives accuracy 0.584667
For the Linear SVM classifier I used a lambda regularization coefficient of 0.00016.
The best performance for the Bag of PHOW Words representation with the Linear SVM classifier was:
Avg accuracy: 0.671200 across 10 runs, std 0.011387
Best accuracy (mean of diagonal of confusion matrix): 0.691
For the Non-linear SVM classifier, I wasn't able to get the PHOW representation to perform as well as the standard Bag of SIFT Words, likely because it needed more parameter tuning.
The best performance for the Bag of PHOW Words representation with the Non-Linear SVM classifier was:
Avg accuracy: 0.664000 across 10 runs, std 0.011496
Best accuracy (mean of diagonal of confusion matrix): 0.681
Extra Implementations
To summarize my extra implementations: n-fold cross-validation (resampling from the combined training and testing sets to build new splits, train new models, and measure their performance; sketched below), multiple k-means algorithm options for clustering the bag of words vocabulary, distance-weighted voting for K-NN, result sets for multiple vocabulary sizes, PHOW bags of words, a non-linear SVM, and my own kernel function.
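A minimal sketch of the cross-validation resampling loop, assuming all_feats / all_labels hold the pooled data, an 80/20 resplit per fold, and reusing the svm_classify sketch from above (all names are illustrative):

```matlab
% Cross-validation resampling loop (illustrative sketch). all_feats (N x D) and
% all_labels (N x 1 cell) are assumed to hold the pooled training + testing data;
% svm_classify is the one-vs-all sketch shown earlier.
num_folds  = 10;
accuracies = zeros(num_folds, 1);
for fold = 1:num_folds
    idx       = randperm(size(all_feats, 1));           % fresh random split each fold
    n_train   = round(0.8 * numel(idx));
    train_idx = idx(1:n_train);
    test_idx  = idx(n_train+1:end);
    predicted = svm_classify(all_feats(train_idx, :)', all_labels(train_idx), ...
                             all_feats(test_idx, :)');
    accuracies(fold) = mean(strcmp(predicted(:), all_labels(test_idx(:))));
end
fprintf('Avg accuracy : %f across %d runs with std %f\n', ...
        mean(accuracies), num_folds, std(accuracies));
```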
Results for the best performing pipeline in my project - Non-linear SVM on Bag of SIFT features: 71.4%
Accuracy (mean of diagonal of confusion matrix) is 0.714
Per-category results (the sample training images, sample true positive thumbnails, and false positive/false negative thumbnails are omitted here; only the labels attached to each misclassified example are listed):

Category name | Accuracy | False positives (true label) | False negatives (predicted label)
---|---|---|---
Kitchen | 0.590 | LivingRoom, LivingRoom | LivingRoom, Bedroom
Store | 0.590 | InsideCity, LivingRoom | Industrial, Coast
Bedroom | 0.390 | Industrial, LivingRoom | Office, OpenCountry
LivingRoom | 0.440 | Office, Bedroom | Kitchen, Bedroom
Office | 0.870 | Kitchen, Bedroom | LivingRoom, Kitchen
Industrial | 0.520 | LivingRoom, Bedroom | Coast, TallBuilding
Suburb | 0.950 | Industrial, Mountain | Office, Office
InsideCity | 0.680 | Industrial, LivingRoom | Street, TallBuilding
TallBuilding | 0.740 | Industrial, InsideCity | Mountain, Forest
Street | 0.880 | InsideCity, Highway | TallBuilding, InsideCity
Highway | 0.820 | InsideCity, Bedroom | Mountain, Store
OpenCountry | 0.600 | Mountain, Coast | Suburb, Forest
Coast | 0.760 | Industrial, OpenCountry | OpenCountry, OpenCountry
Mountain | 0.840 | Industrial, Store | Highway, Forest
Forest | 0.940 | OpenCountry, Store | Mountain, OpenCountry