Machine Learning Theory

Algorithmic Luckiness

Over the last few decades, several frameworks for studying the generalisation performance of learning algorithms have emerged. Among these, the most notable are the Vapnik-Chervonenkis (VC) framework (empirical risk minimisation algorithms), the compression framework (online algorithms and compression schemes) and the luckiness framework (structural risk minimisation algorithms). However, apart from the compression framework, none of these frameworks considers the generalisation error of the single hypothesis learned by a given learning algorithm; instead, they resort to the more stringent requirement of uniform convergence. The algorithmic luckiness framework is an extension of the powerful luckiness framework: it studies the generalisation error of particular learning algorithms relative to prior knowledge about the target concept, encoded via a luckiness function.
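
As a minimal illustration: the margin of a linear classifier on the training sample is the standard textbook example of a luckiness function in this literature. The sketch below is our own toy code (all names are ours, not taken from the papers) computing that quantity.

```python
import numpy as np

def margin_luckiness(w, X, y):
    """Empirical margin of a linear classifier -- the standard example of a
    luckiness function: the larger the margin on the sample, the 'luckier'
    the draw and the tighter the resulting generalisation bound."""
    scores = y * (X @ w)                  # signed functional margins
    return np.min(scores) / np.linalg.norm(w)

# Toy usage on a linearly separable sample.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0])
print(margin_luckiness(w, X, y))          # positive => consistent, with this margin
```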

1. Herbrich, Ralf; Williamson, Robert C: Algorithmic Luckiness. In: Journal of Machine Learning Research, vol. 3, pp. 175–212, 2002. (Journal Article)

2. Herbrich, Ralf; Williamson, Robert C: Algorithmic Luckiness. In: Advances in Neural Information Processing Systems 14, pp. 391–397, 2001. (Proceedings Article)

PAC Bayesian Framework

In the Bayesian framework, learning is viewed as an update, in light of the data, of a prior belief about the target concept that governs the data-generating process. Once a learning algorithm is expressed as such an update of a probability distribution, with the Bayes classifier equivalent to the classifier at hand, the whole (and powerful) PAC-Bayesian machinery can be applied. We are particularly interested in the study of linear classifiers. A geometrical picture reveals that the margin is only an approximation to the real quantity controlling the generalisation error: the ratio of the volume of consistent classifiers to the whole volume of parameter space. Hence we are able to remove awkward constants as well as superfluous complexity terms from known margin bounds. The resulting bound can be considered tight and practically useful for bound-based model selection.
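
The volume ratio above lends itself to a quick Monte Carlo illustration: sample weight vectors uniformly from the unit sphere and count the fraction that classify every training example correctly. This is our own toy sketch of the geometric picture, not the construction used in the papers.

```python
import numpy as np

def version_space_fraction(X, y, n_samples=100_000, rng=None):
    """Monte Carlo estimate of the fraction of parameter space (the unit
    sphere) occupied by linear classifiers consistent with (X, y)."""
    rng = np.random.default_rng(rng)
    W = rng.standard_normal((n_samples, X.shape[1]))
    W /= np.linalg.norm(W, axis=1, keepdims=True)   # uniform on the sphere
    consistent = np.all((W @ X.T) * y > 0, axis=1)  # correct sign on every example
    return consistent.mean()

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(version_space_fraction(X, y, rng=0))  # larger fraction => smaller PAC-Bayes bound
```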

1. Graepel, Thore; Herbrich, Ralf; Shawe-Taylor, John: PAC-Bayesian Compression Bounds on the Prediction Error of Learning Algorithms for Classification. In: Machine Learning, vol. 59, pp. 55–76, 2005. (Journal Article)

2. Herbrich, Ralf; Graepel, Thore; Williamson, Robert C: The Structure of Version Space. In: Innovations in Machine Learning: Theory and Applications, pp. 257–274, Springer, 2005. (Book Section)

3. Herbrich, Ralf; Graepel, Thore: A PAC-Bayesian Margin Bound for Linear Classifiers. In: IEEE Transactions on Information Theory, vol. 48, no. 12, pp. 3140–3150, 2002. (Journal Article)

4. Herbrich, Ralf; Graepel, Thore: A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work. In: Advances in Neural Information Processing Systems 13, pp. 224–230, 2000. (Proceedings Article)

Sparsity and Generalisation

It is generally accepted that inferring a function from only a finite amount of data is possible only if one restricts either the model of the data (descriptive approach) or the model of the dependencies (predictive approach). Over the years, sparse models have become very popular in the field of machine learning. Sparse models are additive models f(x) = ∑_i α_i k(x, x_i), also referred to as kernel models, in which, at the solution for a finite amount of data, only a few of the α_i are non-zero. Surprisingly, Bayesian schemes (such as Gaussian processes and ridge regression) which do not enforce such sparsity show good generalisation behaviour. We studied an explanation of this fact and, more generally, the usefulness of sparsity in machine learning.
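
A minimal sketch of the contrast, assuming a Gaussian RBF kernel and using our own helper names: kernel ridge regression yields exactly the additive model above, but with a generically dense coefficient vector, i.e. essentially every α_i is non-zero.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge(X, y, lam=0.1, gamma=1.0):
    """Ridge solution alpha = (K + lam*I)^{-1} y -- a non-sparse scheme."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
alpha = kernel_ridge(X, y)
print(np.sum(np.abs(alpha) > 1e-6), "of", len(alpha), "coefficients are non-zero")

# Predictions are the additive model f(x) = sum_i alpha_i k(x, x_i):
X_new = rng.uniform(-3, 3, size=(5, 1))
f = rbf_kernel(X_new, X) @ alpha
```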

1. Graepel, Thore; Herbrich, Ralf; Shawe-Taylor, John: Generalisation Error Bounds for Sparse Linear Classifiers. In: Proceedings of the 13th Annual Conference on Computational Learning Theory, pp. 298–303, 2000. (Proceedings Article)

2. Herbrich, Ralf; Graepel, Thore; Shawe-Taylor, John: Sparsity vs. Large Margins for Linear Classifiers. In: Proceedings of the 13th Annual Conference on Computational Learning Theory, pp. 304–308, 2000. (Proceedings Article)

Generalized Representer Theorem

Wahba’s classical representer theorem states that the solutions of certain risk minimization problems, involving an empirical risk term and a quadratic regularizer, can be written as expansions in terms of the training examples. We generalize the theorem to a larger class of regularizers and empirical risk terms, and give a self-contained proof utilizing the feature space associated with a kernel. The result shows that a wide range of problems have optimal solutions that live in the finite-dimensional span of the training examples mapped into feature space, thus enabling us to carry out kernel algorithms independently of the (potentially infinite) dimensionality of the feature space.
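
Paraphrasing the result in symbols: for a kernel k with reproducing kernel Hilbert space H, an arbitrary empirical risk term c, and a strictly monotonically increasing regularizer g, every minimizer of the regularized risk admits an expansion in the mapped training examples.

```latex
\min_{f \in \mathcal{H}} \;
  c\bigl((x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m))\bigr)
  + g\bigl(\lVert f \rVert_{\mathcal{H}}\bigr)
\quad\Longrightarrow\quad
f^{\star}(\cdot) = \sum_{i=1}^{m} \alpha_i \, k(x_i, \cdot),
\qquad \alpha_1, \ldots, \alpha_m \in \mathbb{R}.
```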

1. Schölkopf, Bernhard; Herbrich, Ralf; Smola, Alexander: A Generalized Representer Theorem. In: Proceedings of the Fourteenth Annual Conference on Computational Learning Theory, pp. 416–426, 2001. (Proceedings Article)

Unbiased Assessment of Learning Algorithms

In order to rank the performance of machine learning algorithms, researchers mostly conduct experiments on benchmark data sets. Most learning algorithms have domain-specific parameters, and it is a popular custom to adjust these parameters so as to minimise the error on a holdout set. The error on the same holdout set is then used to rank the algorithms, which causes an optimistic bias. We quantify this bias and show why, when, and to what extent this inappropriate experimental setting distorts the results.
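
A small simulation (our own, with hypothetical numbers) makes the bias visible: give many parameter settings the same true error, pick the one with minimal holdout error, and compare the reported holdout error against an independent test set.

```python
import numpy as np

rng = np.random.default_rng(0)
true_error, n_holdout, n_settings, n_trials = 0.3, 100, 20, 2000

# Every parameter setting has the SAME true error; observed holdout errors
# are just binomial noise around it. Reporting the minimum over settings on
# the selection set is optimistically biased; a fresh test set is not.
holdout = rng.binomial(n_holdout, true_error, (n_trials, n_settings)) / n_holdout
test = rng.binomial(n_holdout, true_error, (n_trials, n_settings)) / n_holdout

best = holdout.argmin(axis=1)               # pick the "best" setting per trial
rows = np.arange(n_trials)
print("reported (same holdout):", holdout[rows, best].mean())  # well below 0.3
print("fresh test set:         ", test[rows, best].mean())     # close to 0.3
```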

1. Scheffer, Tobias; Herbrich, Ralf: Unbiased Assessment of Learning Algorithms. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 798–803, 1997. (Proceedings Article)

Large Deviation Bounds for Ranking

We study generalization properties of the area under the ROC curve (AUC), a quantity that has been advocated as an evaluation criterion for the bipartite ranking problem. The AUC is a quantity different from the error rate used for evaluation in classification problems; consequently, existing generalization bounds for the classification error rate cannot be used to draw conclusions about the AUC. We define the expected accuracy of a ranking function (analogous to the expected error rate of a classification function), and derive distribution-free probabilistic bounds on the deviation of the empirical AUC of a ranking function (observed on a finite data sequence) from its expected accuracy. We derive both a large deviation bound, which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on a test sequence, and a uniform convergence bound, which serves to bound the expected accuracy of a learned ranking function in terms of its empirical AUC on a training sequence. The uniform convergence bound is expressed in terms of a new set of combinatorial parameters that we term the bipartite rank-shatter coefficients; these play the same role in our result as the standard VC-dimension-related shatter coefficients (also known as the growth function) do in uniform convergence results for the classification error rate. A comparison with a uniform convergence result derived by Freund et al. (2003) for a quantity closely related to the AUC shows that the bound provided by our result can be considerably tighter.
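
For concreteness, the empirical AUC of a ranking function on a finite sample is the fraction of (positive, negative) pairs it orders correctly, with ties counted as one half (the Wilcoxon-Mann-Whitney statistic). A minimal sketch with our own function name:

```python
import numpy as np

def empirical_auc(scores, labels):
    """Empirical AUC of a ranking function: the fraction of
    (positive, negative) pairs that the scores order correctly,
    counting ties as one half."""
    pos = scores[labels == 1][:, None]   # scores of positive examples
    neg = scores[labels == 0][None, :]   # scores of negative examples
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

scores = np.array([0.9, 0.8, 0.4, 0.7, 0.2])
labels = np.array([1, 0, 0, 1, 0])
print(empirical_auc(scores, labels))     # 5/6: one pair is mis-ordered
```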

1. Agarwal, Shivani; Graepel, Thore; Herbrich, Ralf; Har-Peled, Sariel; Roth, Dan: Generalization Bounds for the Area Under the ROC Curve. In: Journal of Machine Learning Research, vol. 6, pp. 393–425, 2005. (Journal Article)

2. Agarwal, Shivani; Graepel, Thore; Herbrich, Ralf; Roth, Dan: A Large Deviation Bound for the Area Under the ROC Curve. In: Advances in Neural Information Processing Systems 17, pp. 9–16, 2004. (Proceedings Article)

3. Hill, Simon; Zaragoza, Hugo; Herbrich, Ralf; Rayner, Peter: Average Precision and the Problem of Generalisation. In: Proceedings of the ACM SIGIR Workshop on Mathematical and Formal Methods in Information Retrieval, 2002. (Proceedings Article)

4. Herbrich, Ralf; Graepel, Thore; Obermayer, Klaus: Large Margin Rank Boundaries for Ordinal Regression. In: Advances in Large Margin Classifiers, pp. 115–132, The MIT Press, 1999. (Book Section)

5. Herbrich, Ralf; Graepel, Thore; Bollmann-Sdorra, Peter; Obermayer, Klaus: Learning Preference Relations for Information Retrieval. In: Proceedings of the International Conference on Machine Learning Workshop: Text Categorization and Machine Learning, pp. 80–84, 1998. (Proceedings Article)