Kernel Methods

Semidefinite Programming for Classification Learning

Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Existing approaches are based either on regularisation or on the generation of virtual (transformed) examples. We developed a new framework, based on semidefinite programming, for learning linear classifiers under known transformations: it finds a maximum-margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is sparse in the dual variables and identifies the points on each trajectory with minimal real-valued output as virtual support vectors.
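
The key reduction can be sketched as follows; this is a reconstruction in assumed notation, not a verbatim account of the paper's derivation. Requiring a margin along an entire polynomial trajectory is an infinite family of linear constraints on the weight vector, and non-negativity of a univariate polynomial can be certified by a positive semidefinite matrix:

```latex
% Sketch in assumed notation (not taken verbatim from the paper).
\begin{align*}
  &\text{Each training example is a polynomial trajectory }
    x_i(\theta) = \textstyle\sum_{k=0}^{r} \theta^{k} x_{i,k}. \\
  &\text{A unit margin along the whole trajectory requires, for all } \theta \in \mathbb{R}, \\
  &\qquad p_i(\theta) \;:=\; y_i \langle w, x_i(\theta) \rangle - 1 \;\ge\; 0,
    \quad\text{with the coefficients of } p_i \text{ linear in } w. \\
  &\text{A univariate polynomial of even degree } 2d \text{ is non-negative on } \mathbb{R}
    \text{ iff it is a sum of squares:} \\
  &\qquad p_i(\theta) = z(\theta)^{\top} P_i\, z(\theta), \qquad
    z(\theta) = (1, \theta, \dots, \theta^{d})^{\top}, \qquad P_i \succeq 0. \\
  &\text{Matching coefficients of } \theta \text{ ties } w \text{ linearly to the entries of } P_i, \\
  &\text{so the maximum-margin problem over trajectories becomes a semidefinite program in } (w, P_1, \dots, P_m).
\end{align*}
```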

We also developed a modified version of the perceptron learning algorithm that solves such semidefinite programs in polynomial time. The algorithm rests on three observations: (i) semidefinite programs are linear programs with infinitely many (linear) constraints (here, one per virtually transformed example); (ii) every linear program can be solved by a sequence of constraint satisfaction problems with linear constraints; and (iii) in general, the perceptron learning algorithm solves a constraint satisfaction problem with linear constraints in finitely many updates.
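
A minimal sketch of observations (i) and (iii) in code, assuming the infinitely many constraints are approximated by a finite grid of transformation parameters and a user-supplied transformation function; both assumptions, and all names below, are illustrative rather than the papers' exact algorithm:

```python
# Minimal sketch, not the papers' algorithm: the infinitely many constraints
# y_i * <w, x_i(theta)> > 0 are approximated on a finite grid of thetas, and a
# perceptron update is applied to the most-violated virtual example.
import numpy as np

def perceptron_with_virtual_examples(X, y, transform, thetas, max_epochs=1000):
    """Seek w with y_i * <w, transform(x_i, theta)> > 0 for all i and sampled theta."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        violated = False
        for i in range(n):
            virtual = np.stack([transform(X[i], t) for t in thetas])
            outputs = y[i] * (virtual @ w)      # real-valued outputs along the trajectory
            j = int(np.argmin(outputs))         # minimal output = candidate virtual support vector
            if outputs[j] <= 0:                 # constraint violated
                w += y[i] * virtual[j]          # perceptron update on the violator
                violated = True
        if not violated:                        # all sampled constraints satisfied
            return w
    return w

# Toy usage with a scaling transformation (purely illustrative):
rng = np.random.default_rng(0)
w_true = rng.normal(size=10)
X = rng.normal(size=(50, 10))
y = np.sign(X @ w_true)
scale = lambda x, t: (1.0 + 0.1 * t) * x
w_hat = perceptron_with_virtual_examples(X, y, scale, thetas=[-1.0, 0.0, 1.0])
```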

1. Graepel, Thore; Herbrich, Ralf: Invariant Pattern Recognition by Semidefinite Programming Machines. In: Advances in Neural Information Processing Systems 16, pp. 33–40, 2003.

2. Graepel, Thore; Herbrich, Ralf; Kharechko, Andriy; Shawe-Taylor, John: Semi-Definite Programming by Perceptron Learning. In: Advances in Neural Information Processing Systems 16, pp. 457–464, The MIT Press, 2003.

Adaptive Margin Machines

Previous approaches to learning kernel classifiers, such as Quadratic Programming Machines (SVMs) and Linear Programming Machines, minimize a regularized margin loss in which the margin is treated identically for every training pattern. We propose a reformulation of the minimization problem that assigns an adaptive margin to each training pattern, yielding Adaptive Margin Machines (AMMs). Furthermore, we give bounds on the generalization error of AMMs which justify their robustness against outliers, and we show experimentally that the generalization error of AMMs is comparable to that of QP- and LP-Machines on benchmark datasets from the UCI repository.
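
For concreteness, here is one way such a reformulation can be written down and solved, assuming the adaptive margin takes the form lambda * alpha_i * k(x_i, x_i); that particular term, the RBF kernel, and all function names are illustrative assumptions rather than the exact formulation of the papers:

```python
# Hedged sketch of an adaptive-margin machine as a linear program in (alpha, xi):
#   minimise  sum_i xi_i
#   s.t.      y_i * sum_j alpha_j y_j k(x_j, x_i) >= 1 - xi_i + lam * alpha_i * k(x_i, x_i),
#             alpha_i >= 0,  xi_i >= 0.
# The adaptive term lam * alpha_i * k(x_i, x_i) is an assumed form, not necessarily the papers'.
import numpy as np
from scipy.optimize import linprog

def rbf_gram(X, gamma=1.0):
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

def fit_adaptive_margin(K, y, lam=1.0):
    y = np.asarray(y, dtype=float)
    n = len(y)
    c = np.concatenate([np.zeros(n), np.ones(n)])        # objective: sum of slacks
    A = -(y[:, None] * y[None, :]) * K.T                 # -(y_i y_j K_ji) alpha_j ...
    A[np.arange(n), np.arange(n)] += lam * np.diag(K)    # ... + lam * K_ii * alpha_i
    A_ub = np.hstack([A, -np.eye(n)])                    # ... - xi_i  <=  -1
    res = linprog(c, A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(0, None)] * (2 * n), method="highs")
    return res.x[:n]                                     # expansion coefficients alpha

def decision_function(K_test_train, alpha, y_train):
    return K_test_train @ (alpha * y_train)              # f(x) = sum_j alpha_j y_j k(x_j, x)
```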

1. Herbrich, Ralf; Weston, Jason: Adaptive Margin Support Vector Machines for Classification. In: Proceedings of the 9th International Conference on Artificial Neural Networks, pp. 97–102, 1999.

2. Weston, Jason; Herbrich, Ralf: Adaptive Margin Support Vector Machines. In: Advances in Large Margin Classifiers, pp. 281–296, The MIT Press, 1999.

Ordinal Regression

In contrast to the standard machine learning tasks of classification and metric regression, we investigate the problem of predicting variables of ordinal scale, a setting referred to as ordinal regression. This problem arises frequently in the social sciences and in information retrieval, where human preferences play a major role. Whilst approaches proposed in statistics rely on a probability model of a latent (unobserved) variable, we present a large margin algorithm based on a mapping from objects to scalar utility values, learned by classifying pairs of objects.
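
A minimal sketch of the pairwise route to a utility function, assuming a linear utility U(x) = <w, x> and using an off-the-shelf linear SVM on difference vectors; rank boundaries would then be estimated from the training utilities, and all names here are illustrative:

```python
# Hedged sketch: learn a scalar utility by classifying pairs of objects.
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def pairwise_differences(X, ranks):
    """Every pair with different ranks yields two difference vectors (both signs)."""
    diffs, labels = [], []
    for i, j in combinations(range(len(ranks)), 2):
        if ranks[i] != ranks[j]:
            d, s = X[i] - X[j], np.sign(ranks[i] - ranks[j])
            diffs += [d, -d]
            labels += [s, -s]
    return np.array(diffs), np.array(labels)

def fit_utility(X, ranks, C=1.0):
    D, s = pairwise_differences(X, ranks)
    clf = LinearSVC(C=C, fit_intercept=False)    # difference vectors need no bias term
    clf.fit(D, s)
    return clf.coef_.ravel()                     # w such that U(x) = w @ x

# Usage: rank unseen objects by utility, e.g. order = np.argsort(-(X_test @ w)).
```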

1. Herbrich, Ralf; Graepel, Thore; Obermayer, Klaus: Support Vector Learning for Ordinal Regression. In: Proceedings of the 9th International Conference on Artificial Neural Networks, pp. 97–102, 1999.

2. Herbrich, Ralf; Graepel, Thore; Obermayer, Klaus: Large Margin Rank Boundaries for Ordinal Regression. In: Advances in Large Margin Classifiers, pp. 115–132, The MIT Press, 1999.

Kernel Methods for Measuring Independence

We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. Both quantities are based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent. We also show that, near independence, the kernel mutual information is an upper bound on the Parzen window estimate of the mutual information. Analogous results apply to two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis.
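
As a concrete illustration, an empirical estimate of the constrained covariance can be computed from centred Gram matrices; the Gaussian kernel widths and the 1/n scaling below follow one reading of the construction and should be treated as indicative rather than definitive:

```python
# Hedged sketch of the empirical constrained covariance (COCO).
import numpy as np

def gaussian_gram(X, sigma=1.0):
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma ** 2))

def empirical_coco(X, Y, sigma=1.0):
    """Largest empirical covariance achievable by unit-norm RKHS functions of X and Y."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    Kx = H @ gaussian_gram(X, sigma) @ H
    Ky = H @ gaussian_gram(Y, sigma) @ H
    lam_max = np.max(np.real(np.linalg.eigvals(Kx @ Ky)))
    return np.sqrt(max(lam_max, 0.0)) / n          # (1/n) * largest singular value

# Independent samples give values near zero; dependent ones give clearly larger values.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(200, 1)), rng.normal(size=(200, 1))
print(empirical_coco(a, b))        # close to zero (independent)
print(empirical_coco(a, a ** 2))   # noticeably larger (dependent)
```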

1. Gretton, Arthur; Herbrich, Ralf; Smola, Alexander J.; Bousquet, Olivier; Schölkopf, Bernhard: Kernel Methods for Measuring Independence. In: Journal of Machine Learning Research, vol. 6, pp. 2075–2129, 2005.

2. Gretton, Arthur; Smola, Alexander; Bousquet, Olivier; Herbrich, Ralf; Belitski, Andrei; Augath, Mark; Murayama, Yusuke; Pauls, Jon; Schölkopf, Bernhard; Logothetis, Nikos: Kernel Constrained Covariance for Dependence Measurement. In: Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics (AISTATS), pp. 112–119, 2005.

Kernel Gibbs Sampler

We present an algorithm for sampling from the hypothesis space of kernel classifiers. A uniform prior over normalised weight vectors, combined with a likelihood based on a model of label noise, leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the chosen line. The KGS can be used as an analytical tool for exploring Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. On a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking label noise into account.
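
The sampling step can be sketched as follows, in explicit feature space and with a simple label-flip likelihood eps^errors * (1 - eps)^(n - errors); both choices are one reading of the description above, and the kernelised version would manipulate expansion coefficients rather than an explicit weight vector:

```python
# Hedged sketch of one kernel-Gibbs-sampler step on the unit sphere of weight vectors.
import numpy as np

def kgs_step(w, X, y, eps=0.1, rng=None):
    """Resample w along a random great circle of the unit sphere."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    v = rng.normal(size=w.shape)
    v -= (v @ w) * w
    v /= np.linalg.norm(v)                         # v orthonormal to w
    a, b = X @ w, X @ v                            # <w(t), x_i> = a_i cos t + b_i sin t
    # Angles at which some example's output changes sign cut the circle into
    # arcs on which the posterior is constant.
    base = np.arctan2(b, a)
    cuts = np.sort(np.concatenate([base + np.pi / 2, base - np.pi / 2]) % (2 * np.pi))
    edges = np.concatenate([cuts, [cuts[0] + 2 * np.pi]])
    mids, lengths = (edges[:-1] + edges[1:]) / 2, np.diff(edges)
    errors = np.array([(y * (a * np.cos(t) + b * np.sin(t)) < 0).sum() for t in mids])
    logmass = np.log(lengths) + errors * np.log(eps) + (n - errors) * np.log(1 - eps)
    p = np.exp(logmass - logmass.max())
    k = rng.choice(len(mids), p=p / p.sum())       # pick an arc by its posterior mass
    t = rng.uniform(edges[k], edges[k + 1])        # uniform within the chosen arc
    return np.cos(t) * w + np.sin(t) * v           # remains unit norm

# Usage: start from any unit vector and iterate kgs_step to draw posterior samples.
```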

1. Graepel, Thore; Herbrich, Ralf: The Kernel Gibbs Sampler. In: Advances in Neural Information Processing Systems 13, pp. 514–520, The MIT Press, 2000.

Classification Learning on Proximity Data

We investigate the problem of learning a classification task on data represented in terms of their pairwise proximities. This representation does not rely on an explicit feature representation of the data items and is thus more general than the standard approach of using Euclidean feature vectors, from which pairwise proximities can always be calculated. Our approaches are based on a linear threshold model in the proximity values themselves. We show that prior knowledge about the problem can be incorporated through the choice of distance measure, and we examine different metrics with respect to their generalization performance.
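
A minimal sketch of the "linear model in the proximities" idea: each item is represented by its vector of proximities to the training items and a sparse, L1-regularised linear classifier is trained on that representation. The proximity function, the L1-penalised SVM and all names below are illustrative substitutes, not the LP-machine optimisation problem of the papers:

```python
# Hedged sketch: classify items represented by their proximities to the training set.
import numpy as np
from sklearn.svm import LinearSVC

def proximity_matrix(A, B, proximity):
    """P[i, j] = proximity between A[i] and B[j] under a user-chosen measure."""
    return np.array([[proximity(a, b) for b in B] for a in A])

def fit_proximity_classifier(X_train, y_train, proximity, C=1.0):
    P = proximity_matrix(X_train, X_train, proximity)    # items as proximity vectors
    clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=C)
    return clf.fit(P, y_train)                           # sparse weights over proximities

def predict_proximity(clf, X_test, X_train, proximity):
    return clf.predict(proximity_matrix(X_test, X_train, proximity))

# Prior knowledge enters through the proximity, e.g. a negative Manhattan distance:
manhattan = lambda a, b: -np.abs(a - b).sum()
```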

1. Graepel, Thore; Herbrich, Ralf; Schölkopf, Bernhard; Smola, Alex; Bartlett, Peter; Müller, Klaus-Robert; Obermayer, Klaus; Williamson, Robert C.: Classification on Proximity Data with LP-Machines. In: Proceedings of the 9th International Conference on Artificial Neural Networks, pp. 304–309, 1999.

2. Graepel, Thore; Herbrich, Ralf; Bollmann-Sdorra, Peter; Obermayer, Klaus: Classification on Pairwise Proximity Data. In: Advances in Neural Information Processing Systems 11, pp. 438–444, The MIT Press, 1998.
