Statistical Learning Theory

Algorithmic Luckiness

Over the last few decades, several frameworks for studying the generalisation performance of learning algorithms have emerged. Among these, the most notable are the VC framework (empirical risk minimisation algorithms), the compression framework (online algorithms and compression schemes) and the luckiness framework (structural risk minimisation algorithms). However, apart from the compression framework, none of these considers the generalisation error of the single hypothesis learned by a given learning algorithm; instead they resort to the more stringent requirement of uniform convergence. The algorithmic luckiness framework is an extension of the powerful luckiness framework which studies the generalisation error of particular learning algorithms relative to some prior knowledge about the target concept, encoded via a luckiness function.
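
To make the contrast concrete, the schematic below compares the shape of a uniform-convergence bound with that of an algorithm-specific, luckiness-style bound. It is a template only, not the exact statement of either theorem: z denotes an i.i.d. training sample of size m, R the true risk, R̂_z the empirical risk on z, A(z) the hypothesis returned by the algorithm A, and L a luckiness function.

    % Schematic comparison only; not the precise theorems.
    % Uniform convergence (VC-style): the guarantee must hold for every h in H
    % simultaneously, so the complexity term depends on the whole class H.
    \[ \Pr\bigl[\, \forall h \in \mathcal{H}:\ R(h) \le \hat{R}_z(h) + \varepsilon(m, \mathcal{H}, \delta) \,\bigr] \ge 1 - \delta \]
    % Luckiness-style, algorithm-specific bound: the guarantee covers only the
    % hypothesis A(z) actually returned, with a data-dependent complexity term
    % driven by the luckiness L(A, z).
    \[ \Pr\bigl[\, R(A(z)) \le \hat{R}_z(A(z)) + \varepsilon(m, L(A, z), \delta) \,\bigr] \ge 1 - \delta \]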

PAC Bayesian Framework

In the Bayesian framework, learning is viewed as an update of the prior belief in the target concept in light of the data. The learning algorithms considered in the PAC-Bayesian framework are the Gibbs classifier (more precisely, a classification strategy) and the Bayes classifier. Thus, once a learning algorithm is expressed as an update of a probability distribution such that the Bayes classifier is equivalent to the classifier at hand, the whole (and powerful) PAC-Bayesian machinery can be applied. We are particularly interested in the study of linear classifiers. A geometrical picture reveals that the margin is only an approximation to the real quantity controlling the generalisation error: the ratio of the volume of consistent classifiers to the whole volume of parameter space. Hence we are able to remove awkward constants as well as persistent complexity terms from known margin bounds. The resulting bound can be considered tight and practically useful for bound-based model selection. Further research aims at revealing the limitations of bound-based model selection and at extending the analysis to unbounded losses, as in regression, and to unsupervised learning techniques such as clustering, PCA or ICA.
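
As an illustration of this volume argument, the sketch below estimates by Monte Carlo the fraction of unit-norm linear classifiers (through the origin) that are consistent with a small training sample. The data, sample sizes and random seed are made up for the example; this is not the bound from the paper below, only a picture of the quantity it is built on.

    # Monte Carlo sketch (illustrative only): estimate the fraction of
    # unit-norm linear classifiers consistent with a separable sample, i.e.
    # the volume ratio that the PAC-Bayesian analysis identifies as the
    # quantity controlling the generalisation error.
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical toy data: two well-separated Gaussian clusters in 2-D.
    X = np.vstack([rng.normal(+2.0, 0.5, size=(50, 2)),
                   rng.normal(-2.0, 0.5, size=(50, 2))])
    y = np.hstack([np.ones(50), -np.ones(50)])

    # Draw random weight vectors uniformly from the unit sphere and count how
    # many classify every training example correctly (zero empirical error).
    n_samples = 20_000
    W = rng.normal(size=(n_samples, 2))
    W /= np.linalg.norm(W, axis=1, keepdims=True)

    consistent = np.all((X @ W.T) * y[:, None] > 0, axis=0)
    volume_fraction = consistent.mean()

    print(f"estimated fraction of consistent classifiers: {volume_fraction:.4f}")
    # The larger this fraction, the smaller the effective complexity term in a
    # PAC-Bayesian bound; the margin is only a proxy for it.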

  • R. Herbrich and T. Graepel, “A PAC-Bayesian margin bound for linear classifiers,” IEEE Transactions on Information Theory, vol. 48, iss. 12, pp. 3140–3150, 2002.
    [BibTeX] [Abstract] [Download PDF]

    We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training sample. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound, which was developed in the luckiness framework, and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and — for maximum margins — to a vanishing complexity term. In contrast to previous results, however, the new bound does depend on the dimensionality of feature space. The analysis shows that the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the fraction of hypothesis space consistent with the training sample. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal with respect to the new bound only if the feature vectors in the training sample are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only. Numerical simulations support this recommendation and demonstrate that the new error bound can be used for the purpose of model selection.

    @article{DBLP:journals/tit/HerbrichG02,
    abstract = {We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training sample. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound, which was developed in the luckiness framework, and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and — for maximum margins — to a vanishing complexity term. In contrast to previous results, however, the new bound does depend on the dimensionality of feature space. The analysis shows that the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the fraction of hypothesis space consistent with the training sample. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal with respect to the new bound only if the feature vectors in the training sample are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only. Numerical simulations support this recommendation and demonstrate that the new error bound can be used for the purpose of model selection.},
    author = {Herbrich, Ralf and Graepel, Thore},
    file = {:Users/rherb/Code/herbrich.me/papers/ieee-pacbayes.pdf:pdf},
    journal = {IEEE Transactions on Information Theory},
    number = {12},
    pages = {3140--3150},
    title = {{A PAC-Bayesian margin bound for linear classifiers}},
    url = {http://www.herbrich.me/papers/ieee-pacbayes.pdf},
    volume = {48},
    year = {2002}
    }

Sparsity and Generalisation

It is generally accepted that inferring a function from only a finite amount of data is possible only if one restricts either the model of the data (descriptive approach) or the model of the dependencies (predictive approach). In recent years, sparse models have become very popular in the field of prediction. Sparse models are additive models f(x) = ∑ᵢ αᵢ k(x, xᵢ), also referred to as kernel models, in which the solution obtained from a finite amount of data has only a few αᵢ different from zero. Surprisingly, Bayesian schemes such as Gaussian processes and ridge regression, which do not enforce such sparsity, also show good generalisation behaviour. We look for an explanation of this fact and, ultimately, for the usefulness of sparsity in machine learning.
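
The sketch below is a minimal illustration of such a kernel expansion, assuming a Gaussian (RBF) kernel and toy sine-curve data: kernel ridge regression, one of the non-sparse schemes mentioned above, yields coefficients α = (K + λI)⁻¹ y in which essentially every αᵢ is non-zero.

    # Illustrative sketch (hypothetical data and parameters): a kernel model
    # f(x) = sum_i alpha_i k(x, x_i) fitted by kernel ridge regression. Its
    # solution alpha = (K + lambda*I)^{-1} y is generically dense, in contrast
    # to sparse schemes that drive most alpha_i to exactly zero.
    import numpy as np

    rng = np.random.default_rng(1)

    def rbf_kernel(A, B, gamma=1.0):
        """Gaussian (RBF) kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    # Toy regression data: y = sin(x) + noise.
    X = rng.uniform(-3, 3, size=(40, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

    lam = 0.1
    K = rbf_kernel(X, X)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # dense coefficients

    def f(x_new):
        """Evaluate the additive kernel model at new points."""
        return rbf_kernel(x_new, X) @ alpha

    print("non-zero coefficients:", np.sum(np.abs(alpha) > 1e-8), "of", len(alpha))
    print("prediction at x=0:", f(np.array([[0.0]]))[0])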

  • T. Graepel, R. Herbrich, and J. Shawe-Taylor, “Generalisation Error Bounds for Sparse Linear Classifiers,” in Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (COLT 2000), June 28 – July 1, 2000, Palo Alto, California, 2000, pp. 298–303.
    [BibTeX] [Download PDF]
    @inproceedings{DBLP:conf/colt/GraepelHS00,
    author = {Graepel, Thore and Herbrich, Ralf and Shawe-Taylor, John},
    booktitle = {Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (COLT 2000), June 28 - July 1, 2000, Palo Alto, California},
    file = {:Users/rherb/Code/herbrich.me/papers/colt00\_sparse.pdf:pdf},
    pages = {298--303},
    title = {{Generalisation Error Bounds for Sparse Linear Classifiers}},
    url = {http://www.herbrich.me/papers/colt00_sparse.pdf},
    year = {2000}
    }

Assessment of Learning Algorithms

In order to rank the performance of machine learning algorithms, many researchers conduct experiments on certain benchmark data sets. Most learning algorithms have domain-specific parameters, and it is a popular custom to adjust these parameters so as to minimise the error on a holdout set. The error on the same holdout set is then used to rank the algorithm, which causes an optimistic bias. We quantify this bias and show why, when, and to what extent this inappropriate experimental setting distorts the results.
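
A small simulation makes this bias visible. The sketch below uses made-up numbers (k equally good parameter settings, all with the same true error rate) and compares the error reported on the holdout set used for tuning with the error measured on an independent test set.

    # Simulation sketch (all numbers hypothetical): picking the best of k
    # parameter settings on a holdout set and reporting the error on that same
    # holdout set is optimistically biased; a fresh test set is not.
    import numpy as np

    rng = np.random.default_rng(2)

    true_error = 0.20      # every parameter setting is equally good
    n_holdout = 200        # holdout examples used for tuning
    n_test = 200           # independent test examples
    k = 20                 # number of parameter settings tried
    n_repeats = 5_000

    reported, unbiased = [], []
    for _ in range(n_repeats):
        # Holdout error of each of the k (equally good) parameter settings.
        holdout_err = rng.binomial(n_holdout, true_error, size=k) / n_holdout
        best = np.argmin(holdout_err)
        reported.append(holdout_err[best])                          # biased estimate
        unbiased.append(rng.binomial(n_test, true_error) / n_test)  # fresh test set

    print(f"true error:                    {true_error:.3f}")
    print(f"mean error on tuning holdout:  {np.mean(reported):.3f}  (optimistic)")
    print(f"mean error on fresh test set:  {np.mean(unbiased):.3f}  (unbiased)")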

  • T. Scheffer and R. Herbrich, “Unbiased Assessment of Learning Algorithms,” in Proceedings of the International Joint Conference on Artificial Intelligence, Nagoya, 1997, pp. 798–803.
    [BibTeX] [Abstract] [Download PDF]

    In order to rank the performance of machine learning algorithms, many researchers conduct experiments on benchmark datasets. Since most learning algorithms have domain-specific parameters, it is a popular custom to adapt these parameters to obtain a minimal error rate on the test set. The same rate is used to rank the algorithm, which causes an optimistic bias. We quantify this bias, showing in particular that an algorithm with more parameters will probably be ranked higher than an equally good algorithm with fewer parameters. We demonstrate this result, showing the number of parameters and trials required in order to pretend to outperform C4.5 or FOIL, respectively, for various benchmark problems. We then describe how unbiased ranking experiments should be conducted.

    @inproceedings{DBLP:conf/ijcai/SchefferH97,
    abstract = {In order to rank the performance of machine learning algorithms, many researchers conduct experiments on benchmark datasets. Since most learning algorithms have domain-specific parameters, it is a popular custom to adapt these parameters to obtain a minimal error rate on the test set. The same rate is used to rank the algorithm, which causes an optimistic bias. We quantify this bias, showing in particular that an algorithm with more parameters will probably be ranked higher than an equally good algorithm with fewer parameters. We demonstrate this result, showing the number of parameters and trials required in order to pretend to outperform C4.5 or FOIL, respectively, for various benchmark problems. We then describe how unbiased ranking experiments should be conducted.},
    address = {Nagoya},
    author = {Scheffer, Tobias and Herbrich, Ralf},
    booktitle = {Proceedings of the International Joint Conference on Artificial Intelligence},
    file = {:Users/rherb/Downloads/10.1.1.2.2973.pdf:pdf},
    pages = {798--803},
    title = {{Unbiased Assessment of Learning Algorithms}},
    url = {http://www.herbrich.me/papers/scheffer97.pdf},
    year = {1997}
    }

Large Deviation Bounds for Ranking

We study generalization properties of the area under the ROC curve (AUC), a quantity that has been advocated as an evaluation criterion for the bipartite ranking problem. The AUC is a different quantity from the error rate used for evaluation in classification problems; consequently, existing generalization bounds for the classification error rate cannot be used to draw conclusions about the AUC. In this paper, we define the expected accuracy of a ranking function (analogous to the expected error rate of a classification function), and derive distribution-free probabilistic bounds on the deviation of the empirical AUC of a ranking function (observed on a finite data sequence) from its expected accuracy. We derive both a large deviation bound, which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on a test sequence, and a uniform convergence bound, which serves to bound the expected accuracy of a learned ranking function in terms of its empirical AUC on a training sequence. Our uniform convergence bound is expressed in terms of a new set of combinatorial parameters that we term the bipartite rank-shatter coefficients; these play the same role in our result as do the standard VC-dimension related shatter coefficients (also known as the growth function) in uniform convergence results for the classification error rate. A comparison of our result with a recent uniform convergence result derived by Freund et al. (2003) for a quantity closely related to the AUC shows that the bound provided by our result can be considerably tighter.
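
For reference, the empirical AUC discussed above is simply the fraction of positive-negative pairs that the ranking function orders correctly, with ties counted as one half. The sketch below computes it for a handful of made-up scores.

    # Sketch of the empirical AUC (hypothetical scores): the fraction of
    # positive-negative pairs ranked correctly, ties counted as one half.
    import numpy as np

    def empirical_auc(scores_pos, scores_neg):
        """Empirical AUC = (1/mn) * sum_{i,j} [1{s_i+ > s_j-} + 0.5 * 1{s_i+ = s_j-}]."""
        s_pos = np.asarray(scores_pos)[:, None]   # shape (m, 1)
        s_neg = np.asarray(scores_neg)[None, :]   # shape (1, n)
        wins = (s_pos > s_neg).mean()
        ties = (s_pos == s_neg).mean()
        return wins + 0.5 * ties

    # Hypothetical ranking scores for 4 positive and 5 negative examples.
    print(empirical_auc([0.9, 0.8, 0.7, 0.4], [0.6, 0.5, 0.4, 0.3, 0.1]))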