Algorithmic Luckiness
Over the last few decades, several frameworks for studying the generalisation performance of learning algorithms have emerged. Among these, the most notable are the Vapnik-Chervonenkis (VC) framework (empirical risk minimisation algorithms), the compression framework (on-line algorithms and compression schemes) and the luckiness framework (structural risk minimisation algorithms). However, apart from the compression framework, none of these frameworks considers the generalisation error of the single hypothesis learned by a given learning algorithm; instead, they resort to the more stringent requirement of uniform convergence. The algorithmic luckiness framework is an extension of the powerful luckiness framework which studies the generalisation error of particular learning algorithms relative to some prior knowledge about the target concept, encoded via a luckiness function.
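A canonical example of a luckiness function, standard in this literature (an illustrative sketch, not a definition taken from the papers below), is the margin of a linear classifier h_w on the training sample z = ((x_1, y_1), ..., (x_m, y_m)):

L(z, h_w) = \min_{i=1,\ldots,m} \frac{y_i \langle w, x_i \rangle}{\lVert w \rVert}

A large observed luckiness (a large margin) licenses a smaller effective complexity measure and hence a tighter generalisation error bound than a uniform VC-style analysis would give.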
1.
Herbrich, Ralf; Williamson, Robert C
Algorithmic Luckiness Journal Article
In: Journal of Machine Learning Research, vol. 3, pp. 175–212, 2002.
@article{herbrich2002algorithmicluckiness,
title = {Algorithmic Luckiness},
author = {Ralf Herbrich and Robert C Williamson},
url = {https://www.herbrich.me/papers/algoluck.pdf},
year = {2002},
date = {2002-01-01},
journal = {Journal of Machine Learning Research},
volume = {3},
pages = {175--212},
abstract = {Classical statistical learning theory studies the generalisation performance of machine learning algorithms rather indirectly. One of the main detours is that algorithms are studied in terms of the hypothesis class that they draw their hypotheses from. In this paper, motivated by the luckiness framework of Shawe-Taylor et al. (1998), we study learning algorithms more directly and in a way that allows us to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits the margin, the sparsity of the resultant weight vector, and the degree of clustering of the training data in feature space.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Classical statistical learning theory studies the generalisation performance of machine learning algorithms rather indirectly. One of the main detours is that algorithms are studied in terms of the hypothesis class that they draw their hypotheses from. In this paper, motivated by the luckiness framework of Shawe-Taylor et al. (1998), we study learning algorithms more directly and in a way that allows us to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits the margin, the sparsity of the resultant weight vector, and the degree of clustering of the training data in feature space.
2.
Herbrich, Ralf; Williamson, Robert C
Algorithmic Luckiness Inproceedings
In: Advances in Neural Information Processing Systems 14, pp. 391–397, 2001.
@inproceedings{herbrich2001algoluck,
title = {Algorithmic Luckiness},
author = {Ralf Herbrich and Robert C Williamson},
url = {https://www.herbrich.me/papers/nips01_algoluck.pdf},
year = {2001},
date = {2001-01-01},
booktitle = {Advances in Neural Information Processing Systems 14},
pages = {391--397},
abstract = {In contrast to standard statistical learning theory which studies uniform bounds on the expected error we present a framework that exploits the specific learning algorithm used. Motivated by the luckiness framework [Shawe-Taylor et al., 1998] we are also able to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits both the margin and the distribution of the data in feature space.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
In contrast to standard statistical learning theory which studies uniform bounds on the expected error we present a framework that exploits the specific learning algorithm used. Motivated by the luckiness framework [Shawe-Taylor et al., 1998] we are also able to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits both the margin and the distribution of the data in feature space.
PAC Bayesian Framework
In the Bayesian framework, learning is viewed as an update of the prior belief in the target concept that governs the data-generating process in light of the data. Once a learning algorithm is expressed as an update of a probability distribution such that the Bayes classifier is equivalent to the classifier at hand, the whole (and powerful) PAC-Bayesian machinery can be applied. We are particularly interested in the study of linear classifiers. A geometrical picture reveals that the margin is only an approximation to the real quantity controlling the generalisation error: the ratio of the volume of consistent classifiers to the whole volume of parameter space. Hence we are able to remove awkward constants as well as otherwise permanent complexity terms from known margin bounds. The resulting bound can be considered tight and is practically useful for bound-based model selection.
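The following minimal Monte Carlo sketch (illustrative only; all names and data are assumptions, not code from the papers below) estimates, for a toy linearly separable sample, the fraction of parameter space occupied by consistent linear classifiers, the quantity the geometrical picture identifies as controlling the generalisation error:

import numpy as np

rng = np.random.default_rng(0)

# Toy separable data: labels are given by an (unknown) true weight vector.
m, d = 10, 3
X = rng.standard_normal((m, d))
w_true = rng.standard_normal(d)
y = np.sign(X @ w_true)

# Sample unit weight vectors uniformly from the sphere and measure the
# fraction that classify every training point correctly (the version space).
n = 200_000
W = rng.standard_normal((n, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)
consistent = np.all((X @ W.T) * y[:, None] > 0, axis=0)

# Guard against zero hits; ln(1/fraction) plays the role of the
# complexity term, which the margin only approximates.
fraction = max(consistent.mean(), 1.0 / n)
print(f"version-space volume fraction ~ {fraction:.5f}")
print(f"complexity term ln(1/fraction) ~ {np.log(1.0 / fraction):.2f}")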
1.
Graepel, Thore; Herbrich, Ralf; Shawe-Taylor, John
PAC-Bayesian Compression Bounds on the Prediction Error of Learning Algorithms for Classification Journal Article
In: Machine Learning, vol. 59, pp. 55–76, 2005.
@article{graepel2005pacbayesian,
title = {PAC-Bayesian Compression Bounds on the Prediction Error of Learning Algorithms for Classification},
author = {Thore Graepel and Ralf Herbrich and John Shawe-Taylor},
url = {https://www.herbrich.me/papers/graehertay05.pdf},
year = {2005},
date = {2005-01-01},
journal = {Machine Learning},
volume = {59},
pages = {55--76},
abstract = {We consider bounds on the prediction error of classification algorithms based on sample compression. We refine the notion of a compression scheme to distinguish permutation and repetition invariant and non-permutation and repetition invariant compression schemes leading to different prediction error bounds. Also, we extend known results on compression to the case of non-zero empirical risk. We provide bounds on the prediction error of classifiers returned by mistake-driven online learning algorithms by interpreting mistake bounds as bounds on the size of the respective compression scheme of the algorithm. This leads to a bound on the prediction error of perceptron solutions that depends on the margin a support vector machine would achieve on the same training sample. Furthermore, using the property of compression we derive bounds on the average prediction error of kernel classifiers in the PAC-Bayesian framework. These bounds assume a prior measure over the expansion coefficients in the data-dependent kernel expansion and bound the average prediction error uniformly over subsets of the space of expansion coefficients.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
We consider bounds on the prediction error of classification algorithms based on sample compression. We refine the notion of a compression scheme to distinguish permutation and repetition invariant and non-permutation and repetition invariant compression schemes leading to different prediction error bounds. Also, we extend known results on compression to the case of non-zero empirical risk. We provide bounds on the prediction error of classifiers returned by mistake-driven online learning algorithms by interpreting mistake bounds as bounds on the size of the respective compression scheme of the algorithm. This leads to a bound on the prediction error of perceptron solutions that depends on the margin a support vector machine would achieve on the same training sample. Furthermore, using the property of compression we derive bounds on the average prediction error of kernel classifiers in the PAC-Bayesian framework. These bounds assume a prior measure over the expansion coefficients in the data-dependent kernel expansion and bound the average prediction error uniformly over subsets of the space of expansion coefficients.
2.
Herbrich, Ralf; Graepel, Thore; Williamson, Robert C
The Structure of Version Space Incollection
In: Innovations in Machine Learning: Theory and Applications, pp. 257–274, Springer, 2005.
@incollection{herbrich2005versionspace,
title = {The Structure of Version Space},
author = {Ralf Herbrich and Thore Graepel and Robert C Williamson},
url = {https://www.herbrich.me/papers/holmes_mod2.pdf},
year = {2005},
date = {2005-01-01},
booktitle = {Innovations in Machine Learning: Theory and Applications},
pages = {257--274},
publisher = {Springer},
chapter = {9},
abstract = {We investigate the generalisation performance of consistent classifiers, i.e. classifiers that are contained in the so-called version space, both from a theoretical and experimental angle. In contrast to classical VC analysis - where no single classifier within version space is singled out on grounds of a generalisation error bound - the data dependent structural risk minimisation framework suggests that there exists one particular classifier that is to be preferred because it minimises the generalisation error bound. This is usually taken to provide a theoretical justification for learning algorithms such as the well known support vector machine. A reinterpretation of a recent PAC-Bayesian result, however, reveals that given a suitably chosen hypothesis space there exists a large fraction of classifiers with small generalisation error although we cannot readily identify them for a specific learning task. In the particular case of linear classifiers we show that classifiers found by the classical perceptron algorithm have guarantees bounded by the size of version space. These results are complemented with an empirical study for kernel classifiers on the task of handwritten digit recognition which demonstrates that even classifiers with a small margin may exhibit excellent generalisation. In order to perform this analysis we introduce the kernel Gibbs sampler - an algorithm which can be used to sample consistent kernel classifiers.},
keywords = {},
pubstate = {published},
tppubtype = {incollection}
}
We investigate the generalisation performance of consistent classifiers, i.e. classifiers that are contained in the so-called version space, both from a theoretical and experimental angle. In contrast to classical VC analysis - where no single classifier within version space is singled out on grounds of a generalisation error bound - the data dependent structural risk minimisation framework suggests that there exists one particular classifier that is to be preferred because it minimises the generalisation error bound. This is usually taken to provide a theoretical justification for learning algorithms such as the well known support vector machine. A reinterpretation of a recent PAC-Bayesian result, however, reveals that given a suitably chosen hypothesis space there exists a large fraction of classifiers with small generalisation error although we cannot readily identify them for a specific learning task. In the particular case of linear classifiers we show that classifiers found by the classical perceptron algorithm have guarantees bounded by the size of version space. These results are complemented with an empirical study for kernel classifiers on the task of handwritten digit recognition which demonstrates that even classifiers with a small margin may exhibit excellent generalisation. In order to perform this analysis we introduce the kernel Gibbs sampler - an algorithm which can be used to sample consistent kernel classifiers.
3.
Herbrich, Ralf; Graepel, Thore
A PAC-Bayesian Margin Bound for Linear Classifiers Journal Article
In: IEEE Transactions on Information Theory, vol. 48, no. 12, pp. 3140–3150, 2002.
@article{herbrich2002marginbound,
title = {A PAC-Bayesian Margin Bound for Linear Classifiers},
author = {Ralf Herbrich and Thore Graepel},
url = {https://www.herbrich.me/papers/ieee-pacbayes.pdf},
year = {2002},
date = {2002-01-01},
journal = {IEEE Transactions on Information Theory},
volume = {48},
number = {12},
pages = {3140--3150},
abstract = {We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training sample. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound, which was developed in the luckiness framework, and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. In contrast to previous results, however, the new bound does depend on the dimensionality of feature space. The analysis shows that the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the fraction of hypothesis space consistent with the training sample. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal with respect to the new bound only if the feature vectors in the training sample are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only. Numerical simulations support this recommendation and demonstrate that the new error bound can be used for the purpose of model selection.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training sample. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound, which was developed in the luckiness framework, and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. In contrast to previous results, however, the new bound does depend on the dimensionality of feature space. The analysis shows that the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the fraction of hypothesis space consistent with the training sample. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal with respect to the new bound only if the feature vectors in the training sample are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only. Numerical simulations support this recommendation and demonstrate that the new error bound can be used for the purpose of model selection.
4.
Herbrich, Ralf; Graepel, Thore
A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work Inproceedings
In: Advances in Neural Information Processing Systems 13, pp. 224–230, 2000.
@inproceedings{herbrich2000pacbayesiansvm,
title = {A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work},
author = {Ralf Herbrich and Thore Graepel},
url = {https://www.herbrich.me/papers/pacbayes.pdf},
year = {2000},
date = {2000-01-01},
booktitle = {Advances in Neural Information Processing Systems 13},
pages = {224--230},
abstract = {We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets.
Sparsity and Generalisation
It is generally accepted that inferring a function from only a finite amount of data is possible only if one restricts either the model of the data (descriptive approach) or the model of the dependencies (predictive approach). Over the years, sparse models have become very popular in the field of machine learning. Sparse models are additive models f(x)=∑αi k(x,xi) – also referred to as kernel models – for which, at the solution obtained from a finite amount of data, only a few αi are non-zero. Surprisingly, Bayesian schemes (like Gaussian processes and ridge regression) which do not enforce such sparsity show good generalisation behaviour. We studied an explanation of this fact and, more generally, the usefulness of sparsity in machine learning.
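As a minimal sketch of the contrast described above (an assumed toy setup using standard kernel ridge regression, not code from the papers below): the ridge solution α = (K + λI)^{-1} y produces a kernel expansion in which essentially every αi is non-zero, whereas SVM-like algorithms return sparse expansions.

import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D regression data.
m = 30
x = rng.uniform(-3.0, 3.0, m)
y = np.sin(x) + 0.1 * rng.standard_normal(m)

# RBF kernel matrix K_ij = k(x_i, x_j).
gamma = 0.5
K = np.exp(-gamma * (x[:, None] - x[None, :]) ** 2)

# Kernel ridge regression: alpha = (K + lambda I)^{-1} y.
lam = 0.1
alpha = np.linalg.solve(K + lam * np.eye(m), y)
print(f"non-zero coefficients: {np.sum(np.abs(alpha) > 1e-8)} of {m}")

# Prediction at a new point uses the full expansion f(x) = sum_i alpha_i k(x, x_i).
x_new = 0.7
f_new = np.sum(alpha * np.exp(-gamma * (x_new - x) ** 2))
print(f"f({x_new}) ~ {f_new:.3f}")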
1.
Graepel, Thore; Herbrich, Ralf; Shawe-Taylor, John
Generalisation Error Bounds for Sparse Linear Classifiers Inproceedings
In: Proceedings of the 13th Annual Conference on Computational Learning Theory, pp. 298–303, 2000.
@inproceedings{graepel2000sparse,
title = {Generalisation Error Bounds for Sparse Linear Classifiers},
author = {Thore Graepel and Ralf Herbrich and John Shawe-Taylor},
url = {https://www.herbrich.me/papers/colt00_sparse.pdf},
year = {2000},
date = {2000-01-01},
booktitle = {Proceedings of the 13th Annual Conference on Computational Learning Theory},
pages = {298--303},
abstract = {We provide small sample size bounds on the generalisation error of linear classifiers that are sparse in their dual representation given by the expansion coefficients of the weight vector in terms of the training data. These results theoretically justify algorithms like the Support Vector Machine, the Relevance Vector Machine and K-nearest-neighbour. The bounds are a-posteriori bounds to be evaluated after learning when the attained level of sparsity is known. In a PAC-Bayesian style, prior knowledge about the expected sparsity is incorporated into the bounds. The proofs avoid the use of double sample arguments by taking into account the sparsity that leaves unused training points for the evaluation of classifiers. We furthermore give a PAC-Bayesian bound on the average generalisation error over subsets of parameter space that may pave the way for combining sparsity in the expansion coefficients and margin in a single bound. Finally, reinterpreting a mistake bound for the classical perceptron algorithm due to Novikoff, we demonstrate that our new results put classifiers found by this algorithm on a firm theoretical basis.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
We provide small sample size bounds on the generalisation error of linear classifiers that are sparse in their dual representation given by the expansion coefficients of the weight vector in terms of the training data. These results theoretically justify algorithms like the Support Vector Machine, the Relevance Vector Machine and K-nearest-neighbour. The bounds are a-posteriori bounds to be evaluated after learning when the attained level of sparsity is known. In a PAC-Bayesian style, prior knowledge about the expected sparsity is incorporated into the bounds. The proofs avoid the use of double sample arguments by taking into account the sparsity that leaves unused training points for the evaluation of classifiers. We furthermore give a PAC-Bayesian bound on the average generalisation error over subsets of parameter space that may pave the way for combining sparsity in the expansion coefficients and margin in a single bound. Finally, reinterpreting a mistake bound for the classical perceptron algorithm due to Novikoff, we demonstrate that our new results put classifiers found by this algorithm on a firm theoretical basis.
2.
Herbrich, Ralf; Graepel, Thore; Shawe-Taylor, John
Sparsity vs. Large Margins for Linear Classifiers Inproceedings
In: Proceedings of the 13th Annual Conference on Computational Learning Theory, pp. 304–308, 2000.
@inproceedings{herbrich2000sparsitymarginlinear,
title = {Sparsity vs. Large Margins for Linear Classifiers},
author = {Ralf Herbrich and Thore Graepel and John Shawe-Taylor},
url = {https://www.herbrich.me/papers/colt00_sparsemargin.pdf},
year = {2000},
date = {2000-01-01},
booktitle = {Proceedings of the 13th Annual Conference on Computational Learning Theory},
pages = {304--308},
abstract = {We provide small sample size bounds on the generalisation error of linear classifiers that take advantage of large observed margins on the training set and sparsity in the data dependent expansion coefficients. It is already known from results in the luckiness framework that both criteria independently have a large impact on the generalisation error. Our new results show that they can be combined, which theoretically justifies learning algorithms like the Support Vector Machine or the Relevance Vector Machine. In contrast to previous studies we avoid using the classical technique of symmetrisation by a ghost sample, instead directly using the sparsity for the estimation of the generalisation error. We demonstrate that our result leads to practically useful results even in the case of small sample size if the training set witnesses our prior belief in sparsity and large margins.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
We provide small sample size bounds on the generalisation error of linear classifiers that take advantage of large observed margins on the training set and sparsity in the data dependent expansion coefficients. It is already known from results in the luckiness framework that both criteria independently have a large impact on the generalisation error. Our new results show that they can be combined, which theoretically justifies learning algorithms like the Support Vector Machine or the Relevance Vector Machine. In contrast to previous studies we avoid using the classical technique of symmetrisation by a ghost sample, instead directly using the sparsity for the estimation of the generalisation error. We demonstrate that our result leads to practically useful results even in the case of small sample size if the training set witnesses our prior belief in sparsity and large margins.
Generalized Representer Theorem
Wahba’s classical representer theorem states that the solutions of certain risk minimization problems involving an empirical risk term and a quadratic regularizer can be written as expansions in terms of the training examples. We generalize the theorem to a larger class of regularizers and empirical risk terms, and give a self-contained proof utilizing the feature space associated with a kernel. The result shows that a wide range of problems have optimal solutions that live in the finite dimensional span of the training examples mapped into feature space, thus enabling us to carry out kernel algorithms independent of the (potentially infinite) dimensionality of the feature space.
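Stated as a hedged sketch (notation assumed: H the reproducing kernel Hilbert space induced by the kernel k, c an arbitrary loss depending only on the training values, and g a strictly monotonically increasing function), the generalized theorem asserts that any minimizer of

\min_{f \in \mathcal{H}} \; c\bigl((x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m))\bigr) + g\bigl(\lVert f \rVert_{\mathcal{H}}\bigr)

admits a representation of the form

f(\cdot) = \sum_{i=1}^{m} \alpha_i \, k(\cdot, x_i),

so that optimisation over the possibly infinite-dimensional space H reduces to optimisation over the m expansion coefficients αi.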
1.
Schölkopf, Bernhard; Herbrich, Ralf; Smola, Alexander
A Generalized Representer Theorem Inproceedings
In: Proceedings of the Fourteenth Annual Conference on Computational Learning Theory, pp. 416–426, 2001.
@inproceedings{scholkopf2001representer,
title = {A Generalized Representer Theorem},
author = {Bernhard Schölkopf and Ralf Herbrich and Alexander Smola},
url = {https://www.herbrich.me/papers/colt2001.pdf},
year = {2001},
date = {2001-01-01},
booktitle = {Proceedings of the Fourteenth Annual Conference on Computational Learning Theory},
pages = {416--426},
abstract = {Wahba's classical representer theorem states that the solutions of certain risk minimization problems involving an empirical risk term and a quadratic regularizer can be written as expansions in terms of the training examples. We generalize the theorem to a larger class of regularizers and empirical risk terms, and give a self-contained proof utilizing the feature space associated with a kernel. The result shows that a wide range of problems have optimal solutions that live in the finite dimensional span of the training examples mapped into feature space, thus enabling us to carry out kernel algorithms independent of the (potentially infinite) dimensionality of the feature space.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Wahba's classical representer theorem states that the solutions of certain risk minimization problems involving an empirical risk term and a quadratic regularizer can be written as expansions in terms of the training examples. We generalize the theorem to a larger class of regularizers and empirical risk terms, and give a self-contained proof utilizing the feature space associated with a kernel. The result shows that a wide range of problems have optimal solutions that live in the finite dimensional span of the training examples mapped into feature space, thus enabling us to carry out kernel algorithms independent of the (potentially infinite) dimensionality of the feature space.
Unbiased Assessment of Learning Algorithms
In order to rank the performance of machine learning algorithms, researchers mostly conduct experiments on benchmark data sets. Most learning algorithms have domain-specific parameters, and it is a popular custom to adjust these parameters so as to minimise the error on a holdout set. The error on that same holdout set is then used to rank the algorithm, which causes an optimistic bias. We quantify this bias and show why, when, and to what extent this inappropriate experimental setting distorts the results.
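A minimal simulation sketch of this bias (all numbers are assumptions for illustration; this is not an experiment from the paper): if several parameter settings with identical true error are evaluated on one holdout set and the best holdout error is reported, the reported figure is systematically below the true error.

import numpy as np

rng = np.random.default_rng(2)

true_error = 0.20   # every parameter setting has the same true error rate
n_holdout = 200     # size of the holdout set
n_settings = 20     # parameter settings tried on the same holdout set
n_repeats = 5_000   # independent repetitions of the whole experiment

# Holdout error of each setting is Binomial(n_holdout, true_error)/n_holdout;
# reporting the minimum over settings gives the optimistic estimate.
errors = rng.binomial(n_holdout, true_error, size=(n_repeats, n_settings)) / n_holdout
reported = errors.min(axis=1).mean()

print(f"true error          : {true_error:.3f}")
print(f"mean reported error : {reported:.3f} (optimistic bias: {true_error - reported:.3f})")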
1.
Scheffer, Tobias; Herbrich, Ralf
Unbiased Assessment of Learning Algorithms Inproceedings
In: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 798–803, 1997.
@inproceedings{scheffer1997assessment,
title = {Unbiased Assessment of Learning Algorithms},
author = {Tobias Scheffer and Ralf Herbrich},
url = {https://www.herbrich.me/papers/scheffer97.pdf},
year = {1997},
date = {1997-01-01},
booktitle = {Proceedings of the International Joint Conference on Artificial Intelligence},
pages = {798--803},
abstract = {In order to rank the performance of machine learning algorithms, many researchers conduct experiments on benchmark datasets. Since most learning algorithms have domain-specific parameters, it is a popular custom to adapt these parameters to obtain a minimal error rate on the test set. The same rate is used to rank the algorithm which causes an optimistic bias. We quantify this bias, showing in particular that an algorithm with more parameters will probably be ranked higher than an equally good algorithm with fewer parameters. We demonstrate this result, showing the number of parameters and trials required in order to pretend to outperform C4.5 or FOIL, respectively, for various benchmark problems. We then describe how unbiased ranking experiments should be conducted.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
In order to rank the performance of machine learning algorithms, many researchers conduct experiments on benchmark datasets. Since most learning algorithms have domain-specific parameters, it is a popular custom to adapt these parameters to obtain a minimal error rate on the test set. The same rate is used to rank the algorithm which causes an optimistic bias. We quantify this bias, showing in particular that an algorithm with more parameters will probably be ranked higher than an equally good algorithm with fewer parameters. We demonstrate this result, showing the number of parameters and trials required in order to pretend to outperform C4.5 or FOIL, respectively, for various benchmark problems. We then describe how unbiased ranking experiments should be conducted.
Large Deviation Bounds for Ranking
We study generalization properties of the area under the ROC curve (AUC), a quantity that has been advocated as an evaluation criterion for the bipartite ranking problem. The AUC is a different quantity from the error rate used for evaluation in classification problems; consequently, existing generalization bounds for the classification error rate cannot be used to draw conclusions about the AUC. We define the expected accuracy of a ranking function (analogous to the expected error rate of a classification function), and derive distribution-free probabilistic bounds on the deviation of the empirical AUC of a ranking function (observed on a finite data sequence) from its expected accuracy. We derive both a large deviation bound, which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on a test sequence, and a uniform convergence bound, which serves to bound the expected accuracy of a learned ranking function in terms of its empirical AUC on a training sequence. The uniform convergence bound is expressed in terms of a new set of combinatorial parameters that we term the bipartite rank-shatter coefficients; these play the same role in our result as do the standard VC-dimension related shatter coefficients (also known as the growth function) in uniform convergence results for the classification error rate. A comparison of our result with a uniform convergence result derived by Freund et al. (2003) for a quantity closely related to the AUC shows that the bound provided by our result can be considerably tighter.
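For concreteness, a short sketch of the empirical AUC itself (the standard pairwise definition, not code from the papers below): it is the fraction of positive-negative pairs that a scoring function orders correctly, with ties counted as one half.

import numpy as np

def empirical_auc(scores_pos, scores_neg):
    """Fraction of correctly ordered (positive, negative) score pairs; ties count 1/2."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

rng = np.random.default_rng(3)
pos = rng.normal(1.0, 1.0, 40)   # scores of positive examples
neg = rng.normal(0.0, 1.0, 60)   # scores of negative examples
print(f"empirical AUC ~ {empirical_auc(pos, neg):.3f}")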
1.
Agarwal, Shivani; Graepel, Thore; Herbrich, Ralf; Har-Peled, Sariel; Roth, Dan
Generalization Bounds for the Area Under the ROC Curve Journal Article
In: Journal of Machine Learning Research, vol. 6, pp. 393–425, 2005.
@article{agarwal2005roc,
title = {Generalization Bounds for the Area Under the ROC Curve},
author = {Shivani Agarwal and Thore Graepel and Ralf Herbrich and Sariel Har-Peled and Dan Roth},
url = {https://www.herbrich.me/papers/auc.pdf},
year = {2005},
date = {2005-01-01},
journal = {Journal of Machine Learning Research},
volume = {6},
pages = {393--425},
abstract = {We study generalization properties of the area under the ROC curve (AUC), a quantity that has been advocated as an evaluation criterion for the bipartite ranking problem. The AUC is a different term than the error rate used for evaluation in classification problems; consequently, existing generalization bounds for the classification error rate cannot be used to draw conclusions about the AUC. In this paper, we define the expected accuracy of a ranking function (analogous to the expected error rate of a classification function), and derive distribution-free probabilistic bounds on the deviation of the empirical AUC of a ranking function (observed on a finite data sequence) from its expected accuracy. We derive both a large deviation bound, which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on a test sequence, and a uniform convergence bound, which serves to bound the expected accuracy of a learned ranking function in terms of its empirical AUC on a training sequence. Our uniform convergence bound is expressed in terms of a new set of combinatorial parameters that we term the bipartite rank-shatter coefficients; these play the same role in our result as do the standard VC-dimension related shatter coefficients (also known as the growth function) in uniform convergence results for the classification error rate. A comparison of our result with a recent uniform convergence result derived by Freund et al. (2003) for a quantity closely related to the AUC shows that the bound provided by our result can be considerably tighter.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
We study generalization properties of the area under the ROC curve (AUC), a quantity that has been advocated as an evaluation criterion for the bipartite ranking problem. The AUC is a different term than the error rate used for evaluation in classification problems; consequently, existing generalization bounds for the classification error rate cannot be used to draw conclusions about the AUC. In this paper, we define the expected accuracy of a ranking function (analogous to the expected error rate of a classification function), and derive distribution-free probabilistic bounds on the deviation of the empirical AUC of a ranking function (observed on a finite data sequence) from its expected accuracy. We derive both a large deviation bound, which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on a test sequence, and a uniform convergence bound, which serves to bound the expected accuracy of a learned ranking function in terms of its empirical AUC on a training sequence. Our uniform convergence bound is expressed in terms of a new set of combinatorial parameters that we term the bipartite rank-shatter coefficients; these play the same role in our result as do the standard VC-dimension related shatter coefficients (also known as the growth function) in uniform convergence results for the classification error rate. A comparison of our result with a recent uniform convergence result derived by Freund et al. (2003) for a quantity closely related to the AUC shows that the bound provided by our result can be considerably tighter.
2.
Agarwal, Shivani; Graepel, Thore; Herbrich, Ralf; Roth, Dan
A Large Deviation Bound for the Area Under the ROC Curve Inproceedings
In: Advances in Neural Information Processing Systems 17, pp. 9–16, 2004.
@inproceedings{agarwal2004roc,
title = {A Large Deviation Bound for the Area Under the ROC Curve},
author = {Shivani Agarwal and Thore Graepel and Ralf Herbrich and Dan Roth},
url = {https://www.herbrich.me/papers/nips04-auc.pdf},
year = {2004},
date = {2004-01-01},
booktitle = {Advances in Neural Information Processing Systems 17},
pages = {9--16},
abstract = {The area under the ROC curve (AUC) has been advocated as an evaluation criterion for the bipartite ranking problem. We study large deviation properties of the AUC; in particular, we derive a distribution-free large deviation bound for the AUC which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. A comparison of our result with a corresponding large deviation result for the classification error rate suggests that the test sample size required to obtain an epsilon-accurate estimate of the expected accuracy of a ranking function with delta-confidence is larger than that required to obtain an epsilon-accurate estimate of the expected error rate of a classification function with the same confidence. A simple application of the union bound allows the large deviation bound to be extended to learned ranking functions chosen from finite function classes.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
The area under the ROC curve (AUC) has been advocated as an evaluation criterion for the bipartite ranking problem. We study large deviation properties of the AUC; in particular, we derive a distribution-free large deviation bound for the AUC which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. A comparison of our result with a corresponding large deviation result for the classification error rate suggests that the test sample size required to obtain an epsilon-accurate estimate of the expected accuracy of a ranking function with delta-confidence is larger than that required to obtain an epsilon-accurate estimate of the expected error rate of a classification function with the same confidence. A simple application of the union bound allows the large deviation bound to be extended to learned ranking functions chosen from finite function classes.
3.
Hill, Simon; Zaragoza, Hugo; Herbrich, Ralf; Rayner, Peter
Average Precision and the Problem of Generalisation Inproceedings
In: Proceedings of the ACM SIGIR Workshop on Mathematical and Formal Methods in Information Retrieval, 2002.
@inproceedings{hill2002ap,
title = {Average Precision and the Problem of Generalisation},
author = {Simon Hill and Hugo Zaragoza and Ralf Herbrich and Peter Rayner},
url = {https://www.herbrich.me/papers/hill.pdf},
year = {2002},
date = {2002-01-01},
booktitle = {Proceedings of the ACM SIGIR Workshop on Mathematical and Formal Methods in Information Retrieval},
abstract = {In this paper we study the problem of generalisation in information retrieval. In particular we study precision-recall curves and the average precision value. We provide two types of bounds: large-deviation bounds of the average precision and maximum deviation bounds with respect to a given point of the precision recall curve. The first type of bounds are useful to answer the question: how far can true average precision be from the value observed on a test collection? The second is useful for obtaining bounds on average precision when tight bounds on a particular point of the curve can be established, as is the case when training SVMs or Perceptrons for document categorisation.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
In this paper we study the problem of generalisation in information retrieval. In particular we study precision-recall curves and the average precision value. We provide two types of bounds: large-deviation bounds of the average precision and maximum deviation bounds with respect to a given point of the precision recall curve. The first type of bounds are useful to answer the question: how far can true average precision be from the value observed on a test collection? The second is useful for obtaining bounds on average precision when tight bounds on a particular point of the curve can be established, as is the case when training SVMs or Perceptrons for document categorisation.
4.
Herbrich, Ralf; Graepel, Thore; Obermayer, Klaus
Large Margin Rank Boundaries for Ordinal Regression Incollection
In: Advances in Large Margin Classifiers, pp. 115–132, The MIT Press, 1999.
@incollection{herbrich1999largemarginrank,
title = {Large Margin Rank Boundaries for Ordinal Regression},
author = {Ralf Herbrich and Thore Graepel and Klaus Obermayer},
url = {https://www.herbrich.me/papers/nips98_ordinal.pdf},
year = {1999},
date = {1999-01-01},
booktitle = {Advances in Large Margin Classifiers},
pages = {115--132},
publisher = {The MIT Press},
chapter = {7},
abstract = {In contrast to the standard machine learning tasks of classification and metric regression we investigate the problem of predicting variables of ordinal scale, a setting referred to as ordinal regression. This problem arises frequently in the social sciences and in information retrieval where human preferences play a major role. Whilst approaches proposed in statistics rely on a probability model of a latent (unobserved) variable we present a distribution independent risk formulation of ordinal regression which allows us to derive a uniform convergence bound. Applying this bound we present a large margin algorithm that is based on a mapping from objects to scalar utility values thus classifying pairs of objects. We give experimental results for an information retrieval task which show that our algorithm outperforms more naive approaches to ordinal regression such as Support Vector Classification and Support Vector Regression in the case of more than two ranks.},
keywords = {},
pubstate = {published},
tppubtype = {incollection}
}
In contrast to the standard machine learning tasks of classification and metric regression we investigate the problem of predicting variables of ordinal scale, a setting referred to as ordinal regression. This problem arises frequently in the social sciences and in information retrieval where human preferences play a major role. Whilst approaches proposed in statistics rely on a probability model of a latent (unobserved) variable we present a distribution independent risk formulation of ordinal regression which allows us to derive a uniform convergence bound. Applying this bound we present a large margin algorithm that is based on a mapping from objects to scalar utility values thus classifying pairs of objects. We give experimental results for an information retrieval task which show that our algorithm outperforms more naive approaches to ordinal regression such as Support Vector Classification and Support Vector Regression in the case of more than two ranks.
5.
Herbrich, Ralf; Graepel, Thore; Bollmann-Sdorra, Peter; Obermayer, Klaus
Learning Preference Relations for Information Retrieval Inproceedings
In: Proceedings of the International Conference on Machine Learning Workshop: Text Categorization and Machine learning, pp. 80–84, 1998.
@inproceedings{herbrich1998preference,
title = {Learning Preference Relations for Information Retrieval},
author = {Ralf Herbrich and Thore Graepel and Peter Bollmann-Sdorra and Klaus Obermayer},
url = {https://www.herbrich.me/papers/hergraebollober98.pdf},
year = {1998},
date = {1998-01-01},
booktitle = {Proceedings of the International Conference on Machine Learning Workshop: Text Categorization and Machine learning},
pages = {80--84},
abstract = {In this paper we investigate the problem of learning a preference relation from a given set of ranked documents. We show that the Bayes optimal decision function, when applied to learning a preference relation, may violate transitivity. This is undesirable for information retrieval, because it is in conflict with a document ranking based on the user's preferences. To overcome this problem we present a vector space based method that performs a linear mapping from documents to scalar utility values and thus guarantees transitivity. The learning of the relation between documents is formulated as a classification problem on pairs of documents and is solved using the principle of structural risk minimization for good generalization. The approach is extended to polynomial utility functions by using the potential function method (the so-called "kernel trick"), which allows us to incorporate higher order correlations of features into the utility function at minimal computational costs. The resulting algorithm is tested on an example with artificial data. The algorithm successfully learns the utility function underlying the training examples and shows good classification performance.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
In this paper we investigate the problem of learning a preference relation from a given set of ranked documents. We show that the Bayes optimal decision function, when applied to learning a preference relation, may violate transitivity. This is undesirable for information retrieval, because it is in conflict with a document ranking based on the user's preferences. To overcome this problem we present a vector space based method that performs a linear mapping from documents to scalar utility values and thus guarantees transitivity. The learning of the relation between documents is formulated as a classification problem on pairs of documents and is solved using the principle of structural risk minimization for good generalization. The approach is extended to polynomial utility functions by using the potential function method (the so-called "kernel trick"), which allows us to incorporate higher order correlations of features into the utility function at minimal computational costs. The resulting algorithm is tested on an example with artificial data. The algorithm successfully learns the utility function underlying the training examples and shows good classification performance.