Experimental Projects

  1. News categorization with Pegasos: Download this portion of the Reuters dataset containing 20K news items. The dataset is multiclass and multilabel (more than one label for each data point). Every line of the file has the following format:
    docId labList = feature:val feature:val feature:val ...
    where docId is a unique identifier for the news item (drop it before training), labList is a space-separated list of numerical labels, and feature:val are feature-value pairs. Each feature is associated with a word, and the value is a function of the frequency of that word in the news item. The encoding is sparse: only features with non-zero values are listed. Implement Pegasos from scratch and use it to train four binary classifiers (one-vs-all encoding), each recognizing the presence of one of the four most frequent labels in the dataset (the other labels can be ignored). Study the classification performance for different values of the parameters λ and T in Pegasos. Use external cross-validation to evaluate accuracy.
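
    As a starting point, here is a minimal sketch of the Pegasos update. The function names are illustrative, and it assumes a dense numpy feature matrix X with labels y in {-1, +1}; with the sparse Reuters format you would instead keep the weight vector as a dictionary and update only the non-zero coordinates of the sampled example.

      import numpy as np

      def pegasos(X, y, lam, T, rng=None):
          # Pegasos sketch: hinge-loss SGD with step size 1/(lam * t).
          rng = rng or np.random.default_rng(0)
          n, d = X.shape
          w = np.zeros(d)
          for t in range(1, T + 1):
              i = rng.integers(n)            # sample one example uniformly
              eta = 1.0 / (lam * t)          # Pegasos step size
              if y[i] * X[i].dot(w) < 1:     # margin violated: hinge gradient
                  w = (1 - eta * lam) * w + eta * y[i] * X[i]
              else:                          # only the regularizer contributes
                  w = (1 - eta * lam) * w
          return w

      def predict(w, X):
          return np.sign(X.dot(w))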

    Extra (for groups only): Implement the Perceptron from scratch and compare its classification performance against Pegasos on the same dataset (one-vs-all encoding as before), considering the predictor obtained by averaging all the Perceptron models. Consider different numbers of epochs on the training set.
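
    A sketch of the averaged variant, under the same dense-array assumptions as the Pegasos sketch above (the averaging weighs every intermediate weight vector equally):

      import numpy as np

      def averaged_perceptron(X, y, epochs, rng=None):
          # Perceptron with averaging: return the mean of all intermediate
          # weight vectors rather than the final one (y in {-1, +1}).
          rng = rng or np.random.default_rng(0)
          n, d = X.shape
          w = np.zeros(d)
          w_sum = np.zeros(d)
          for _ in range(epochs):
              for i in rng.permutation(n):       # shuffle each epoch
                  if y[i] * X[i].dot(w) <= 0:    # mistake: Perceptron update
                      w = w + y[i] * X[i]
                  w_sum += w                     # accumulate for the average
          return w_sum / (epochs * n)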

  2. Forest cover type classification using AdaBoost: Download the Cover Type Dataset. The dataset is in CSV format. The first column is a unique identifier for the data point (drop it before training). The last column is the class label (from 1 to 7). Use AdaBoost with decision stumps (binary classification rules based on single features) as base classifiers to train seven binary classifiers, one for each of the seven classes (one-vs-all encoding), and study the classification performance for different values of the number T of rounds in AdaBoost. Use external cross-validation to evaluate accuracy.
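
    A minimal sketch of AdaBoost with decision stumps, again assuming dense numpy arrays and labels in {-1, +1}. The exhaustive stump search below is meant only to make the logic concrete; on the full Cover Type data you will want a faster search (e.g. sorting each feature once).

      import numpy as np

      def best_stump(X, y, w):
          # Pick the stump h(x) = s * sign(x[j] - theta) minimizing the
          # weighted error over features j, thresholds theta, polarities s.
          n, d = X.shape
          best_err, best = np.inf, None
          for j in range(d):
              for theta in np.unique(X[:, j]):
                  pred = np.where(X[:, j] > theta, 1.0, -1.0)
                  for s in (1.0, -1.0):
                      err = w[(s * pred) != y].sum()
                      if err < best_err:
                          best_err, best = err, (j, theta, s)
          return best

      def stump_predict(stump, X):
          j, theta, s = stump
          return s * np.where(X[:, j] > theta, 1.0, -1.0)

      def adaboost(X, y, T):
          n = len(y)
          w = np.full(n, 1.0 / n)               # example weights
          ensemble = []
          for _ in range(T):
              stump = best_stump(X, y, w)
              pred = stump_predict(stump, X)
              eps = w[pred != y].sum()          # weighted training error
              alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
              ensemble.append((alpha, stump))
              w *= np.exp(-alpha * y * pred)    # up-weight the mistakes
              w /= w.sum()
          return ensemble

      def ensemble_predict(ensemble, X):
          scores = sum(a * stump_predict(h, X) for a, h in ensemble)
          return np.sign(scores)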

    Extra (for groups only): Implement the Bagging algorithm from scratch, also using decision stumps as base classifiers and one-vs-all encoding. Study the classification performance for different numbers of base classifiers.
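
    A sketch of Bagging under the same assumptions, reusing best_stump and stump_predict from the AdaBoost sketch above: each stump sees a bootstrap resample with uniform weights, and the ensemble prediction is an unweighted majority vote.

      import numpy as np

      def bagging(X, y, n_estimators, rng=None):
          rng = rng or np.random.default_rng(0)
          n = len(y)
          stumps = []
          for _ in range(n_estimators):
              idx = rng.integers(n, size=n)     # bootstrap: sample with replacement
              w = np.full(n, 1.0 / n)           # uniform weights on the resample
              stumps.append(best_stump(X[idx], y[idx], w))
          return stumps

      def bagging_predict(stumps, X):
          votes = sum(stump_predict(h, X) for h in stumps)
          return np.sign(votes)                 # majority vote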

  3. Neural network classification with TensorFlow: Go through the TensorFlow tutorial on Convolutional Neural Networks for image classification using the CIFAR-10 dataset. Download the Street View House Numbers dataset and use it instead of CIFAR-10. Try adapting the network architecture to improve predictive performance.
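
    For orientation, a small Keras model in the spirit of the tutorial, adapted to SVHN's 32x32 RGB inputs; the layer sizes are illustrative starting points, not tuned values. Note that in the SVHN .mat files the digit 0 is labeled 10, so remap the labels to the range 0-9 before fitting.

      import tensorflow as tf

      def build_svhn_cnn():
          model = tf.keras.Sequential([
              tf.keras.layers.Input(shape=(32, 32, 3)),
              tf.keras.layers.Rescaling(1.0 / 255),          # pixels to [0, 1]
              tf.keras.layers.Conv2D(32, 3, activation="relu"),
              tf.keras.layers.MaxPooling2D(),
              tf.keras.layers.Conv2D(64, 3, activation="relu"),
              tf.keras.layers.MaxPooling2D(),
              tf.keras.layers.Flatten(),
              tf.keras.layers.Dense(128, activation="relu"),
              tf.keras.layers.Dense(10, activation="softmax"),
          ])
          model.compile(optimizer="adam",
                        loss="sparse_categorical_crossentropy",
                        metrics=["accuracy"])
          return model

      # model = build_svhn_cnn()
      # model.fit(X_train, y_train, epochs=10, validation_split=0.1)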

    Extra (for groups only): Apply a network architecture different from a CNN to the same SVHN dataset and compare the performance of the two architectures.
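
    One possible alternative is a plain fully connected network, sketched below under the same assumptions as the CNN above; it flattens the image and stacks dense layers, so any gap with the CNN gives a sense of how much the convolutional structure helps.

      import tensorflow as tf

      def build_svhn_mlp():
          model = tf.keras.Sequential([
              tf.keras.layers.Input(shape=(32, 32, 3)),
              tf.keras.layers.Rescaling(1.0 / 255),
              tf.keras.layers.Flatten(),                     # discard spatial structure
              tf.keras.layers.Dense(256, activation="relu"),
              tf.keras.layers.Dense(128, activation="relu"),
              tf.keras.layers.Dense(10, activation="softmax"),
          ])
          model.compile(optimizer="adam",
                        loss="sparse_categorical_crossentropy",
                        metrics=["accuracy"])
          return model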