Kernel Methods

Semidefinite Programming for Classification Learning

Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Existing approaches are based either on regularisation or on the generation of virtual (transformed) examples. We developed a new framework for learning linear classifiers under known transformations, based on semidefinite programming, which finds a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is sparse in the dual variables and allows us to identify the points on each trajectory with minimal real-valued output as virtual support vectors.
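For a degree-2 trajectory the connection to semidefiniteness can be made concrete: the real-valued output along the trajectory is a quadratic in the transformation parameter, and requiring it to stay above the margin for every parameter value is exactly a 2×2 positive semidefiniteness condition. The sketch below (illustrative only, with our own names; it checks the constraint for a fixed weight vector rather than running the SDPM itself) makes this explicit:

```python
import numpy as np

def margin_holds_everywhere(w, x0, t, s, y, rho=1.0):
    """For the degree-2 trajectory x(theta) = x0 + theta*t + 0.5*theta**2*s,
    the output y*<w, x(theta)> is the quadratic a*theta^2 + b*theta + c.
    It stays >= rho for every theta iff [[a, b/2], [b/2, c - rho]] is
    positive semidefinite -- the kind of constraint an SDPM enforces."""
    a = 0.5 * y * (w @ s)
    b = y * (w @ t)
    c = y * (w @ x0)
    M = np.array([[a, b / 2.0], [b / 2.0, c - rho]])
    return bool(np.all(np.linalg.eigvalsh(M) >= -1e-12))

# a point at distance 2 translated parallel to the hyperplane keeps its margin
ok = margin_holds_everywhere(np.array([1.0, 0.0]), np.array([2.0, 0.0]),
                             np.array([0.0, 1.0]), np.array([0.0, 0.0]), y=1)
# translated towards the hyperplane, the margin is eventually violated
bad = margin_holds_everywhere(np.array([1.0, 0.0]), np.array([2.0, 0.0]),
                              np.array([1.0, 0.0]), np.array([0.0, 0.0]), y=1)
```

For higher polynomial degrees the same idea yields larger semidefinite blocks, which is where a genuine SDP solver becomes necessary.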

We also developed a modified version of the perceptron learning algorithm which solves such semidefinite programs in polynomial time. The algorithm is based on the following three observations: (i) Semidefinite programs are linear programs with infinitely many (linear) constraints (i.e., all the virtually transformed examples); (ii) every linear program can be solved by a sequence of constraint satisfaction problems with linear constraints; and (iii) in general, the perceptron learning algorithm solves a constraint satisfaction problem with linear constraints in finitely many updates.
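Observation (iii) can be illustrated directly: treating each (virtually transformed) example as one linear constraint ⟨w, z⟩ > 0, the classical perceptron update solves the resulting constraint satisfaction problem. The toy sketch below (our own simplified naming, not the paper's rescaled algorithm) runs the perceptron on rotated copies of two labelled points:

```python
import numpy as np

def perceptron_feasibility(Z, max_updates=1000):
    """Find w with <w, z> > 0 for every row z of Z (a linear constraint
    satisfaction problem): while some constraint is violated, add it to w."""
    w = np.zeros(Z.shape[1])
    for _ in range(max_updates):
        violated = Z[Z @ w <= 0]
        if len(violated) == 0:
            return w
        w = w + violated[0]
    return None  # budget exhausted; the problem may be infeasible

def rotate(x, a):
    """Rotate a 2-d point by angle a (the pattern transformation)."""
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]]) @ x

# virtual examples: rotated copies of two labelled points; folding the
# label into each constraint (z = y * x) makes every constraint <w, z> > 0
angles = [-0.2, 0.0, 0.2]
Z = np.array([rotate(np.array([1.0, 1.0]), a) for a in angles] +      # y = +1
             [-rotate(np.array([-1.0, -1.0]), a) for a in angles])    # y = -1
w = perceptron_feasibility(Z)
```

The full algorithm alternates such perceptron runs with the probabilistic rescaling step mentioned above, which is what yields the polynomial running time.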

  • T. Graepel and R. Herbrich, “Invariant Pattern Recognition by Semidefinite Programming Machines,” in Advances in Neural Information Processing Systems 16, 2003, pp. 33–40.
    [BibTeX] [Abstract] [Download PDF]

    Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm – the Semidefinite Programming Machine (SDPM) – which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in dual variables and allows to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods.

    @inproceedings{graepel2003sdm,
    abstract = {Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm - the Semidefinite Programming Machine (SDPM) - which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in dual variables and allows to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods.},
    author = {Graepel, Thore and Herbrich, Ralf},
    booktitle = {Advances in Neural Information Processing Systems 16},
    pages = {33--40},
    title = {Invariant Pattern Recognition by Semidefinite Programming Machines},
    url = {https://www.herbrich.me/papers/sdpm.pdf},
    year = {2003}
    }

  • T. Graepel, R. Herbrich, A. Kharechko, and J. Shawe-Taylor, “Semi-Definite Programming by Perceptron Learning,” in Advances in Neural Information Processing Systems 16, 2003, pp. 457–464.
    [BibTeX] [Abstract] [Download PDF]

    We present a modified version of the perceptron learning algorithm (PLA) which solves semidefinite programs (SDPs) in polynomial time. The algorithm is based on the following three observations: (i) Semidefinite programs are linear programs with infinitely many (linear) constraints; (ii) every linear program can be solved by a sequence of constraint satisfaction problems with linear constraints; (iii) in general, the perceptron learning algorithm solves a constraint satisfaction problem with linear constraints in finitely many updates. Combining the PLA with a probabilistic rescaling algorithm (which, on average, increases the size of the feasible region) results in a probabilistic algorithm for solving SDPs that runs in polynomial time. We present preliminary results which demonstrate that the algorithm works, but is not competitive with state-of-the-art interior point methods.

    @inproceedings{graepel2003sdpwithperceptron,
    abstract = {We present a modified version of the perceptron learning algorithm (PLA) which solves semidefinite programs (SDPs) in polynomial time. The algorithm is based on the following three observations: (i) Semidefinite programs are linear programs with infinitely many (linear) constraints; (ii) every linear program can be solved by a sequence of constraint satisfaction problems with linear constraints; (iii) in general, the perceptron learning algorithm solves a constraint satisfaction problem with linear constraints in finitely many updates. Combining the PLA with a probabilistic rescaling algorithm (which, on average, increases the size of the feasible region) results in a probabilistic algorithm for solving SDPs that runs in polynomial time. We present preliminary results which demonstrate that the algorithm works, but is not competitive with state-of-the-art interior point methods.},
    author = {Graepel, Thore and Herbrich, Ralf and Kharechko, Andriy and Shawe-Taylor, John},
    booktitle = {Advances in Neural Information Processing Systems 16},
    pages = {457--464},
    publisher = {The MIT Press},
    title = {Semi-Definite Programming by Perceptron Learning},
    url = {https://www.herbrich.me/papers/sdpm-pla.pdf},
    year = {2003}
    }

Adaptive Margin Machines

Previous approaches for learning kernel classifiers, such as Quadratic Programming Machines (SVMs) and Linear Programming Machines, were based on minimization of a regularized margin loss where the margin was treated equivalently for each training pattern. We propose a reformulation of the minimization problem such that an adaptive margin is utilized for each training pattern (Adaptive Margin Machines, AMMs). Furthermore, we give bounds on the generalization error of AMMs which justify their robustness against outliers. We show experimentally that the generalization error of AMMs is comparable to that of QP- and LP-machines on benchmark datasets from the UCI repository.
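The Jaakkola–Haussler argument behind these bounds counts the training points that would be misclassified once their own contribution to the kernel expansion is removed; adaptive margin machines build exactly such a term into the training constraints. A minimal sketch of that quantity (our own simplified helper, not the full AMM optimization problem):

```python
import numpy as np

def loo_style_margins(alpha, y, K):
    """For f(x_i) = sum_j alpha_j y_j k(x_j, x_i), return the margin of each
    training point after removing its own term from the kernel expansion,
    y_i * (f(x_i) - alpha_i y_i k(x_i, x_i)). Counting the non-positive
    entries gives a Jaakkola-Haussler-style leave-one-out error bound."""
    f = K @ (alpha * y)
    return y * (f - alpha * y * np.diag(K))

# two opposite points with a linear kernel: removing each point's own
# contribution reduces its margin from 2 to 1
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
K = X @ X.T
m = loo_style_margins(np.ones(2), y, K)
```

A large coefficient on an outlier inflates its own-contribution term, which is why penalising this quantity makes the resulting classifier robust against outliers.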

  • R. Herbrich and J. Weston, “Adaptive Margin Support Vector Machines for Classification,” in Proceedings of the 9th International Conference on Artificial Neural Networks, 1999, pp. 97–102.
    [BibTeX] [Abstract] [Download PDF]

    In this paper we propose a new learning algorithm for classification learning based on the Support Vector Machine (SVM) approach. Existing approaches for constructing SVMs are based on minimization of a regularized margin loss where the margin is treated equivalently for each training pattern. We propose a reformulation of the minimization problem such that adaptive margins for each training pattern are utilized, which we call the Adaptive Margin (AM)-SVM. We give bounds on the generalization error of AM-SVMs which justify their robustness against outliers, and show experimentally that the generalization error of AM-SVMs is comparable to classical SVMs on benchmark datasets from the UCI repository.

    @inproceedings{herbrich1999ann,
    abstract = {In this paper we propose a new learning algorithm for classification learning based on the Support Vector Machine (SVM) approach. Existing approaches for constructing SVMs are based on minimization of a regularized margin loss where the margin is treated equivalently for each training pattern. We propose a reformulation of the minimization problem such that adaptive margins for each training pattern are utilized, which we call the Adaptive Margin (AM)-SVM. We give bounds on the generalization error of AM-SVMs which justify their robustness against outliers, and show experimentally that the generalization error of AM-SVMs is comparable to classical SVMs on benchmark datasets from the UCI repository.},
    author = {Herbrich, Ralf and Weston, Jason},
    booktitle = {Proceedings of the 9th International Conference on Artificial Neural Networks},
    pages = {97--102},
    title = {Adaptive Margin Support Vector Machines for Classification},
    url = {https://www.herbrich.me/papers/icann99_ann.pdf},
    year = {1999}
    }

  • J. Weston and R. Herbrich, “Adaptive Margin Support Vector Machines,” in Advances in Large Margin Classifiers, The MIT Press, 1999, pp. 281–296.
    [BibTeX] [Abstract] [Download PDF]

    In this chapter we present a new learning algorithm, Leave-One-Out (LOO -) SVMs and its generalization Adaptive Margin (AM-) SVMs, inspired by a recent upper bound on the leave-one-out error proved for kernel classifiers by Jaakkola and Haussler. The new approach minimizes the expression given by the bound in an attempt to minimize the leave-one-out error. This gives a convex optimization problem which constructs a sparse linear classifier in feature space using the kernel technique. As such the algorithm possesses many of the same properties as SVMs and Linear Programming (LP-) SVMs. These former techniques are based on the minimization of a regularized margin loss, where the margin is treated equivalently for each training pattern. We propose a minimization problem such that adaptive margins for each training pattern are utilized. Furthermore, we give bounds on the generalization error of the approach which justifies its robustness against outliers. We show experimentally that the generalization error of AM-SVMs is comparable to SVMs and LP-SVMs on benchmark datasets from the UCI repository.

    @incollection{weston1999ann,
    abstract = {In this chapter we present a new learning algorithm, Leave-One-Out (LOO -) SVMs and its generalization Adaptive Margin (AM-) SVMs, inspired by a recent upper bound on the leave-one-out error proved for kernel classifiers by Jaakkola and Haussler. The new approach minimizes the expression given by the bound in an attempt to minimize the leave-one-out error. This gives a convex optimization problem which constructs a sparse linear classifier in feature space using the kernel technique. As such the algorithm possesses many of the same properties as SVMs and Linear Programming (LP-) SVMs. These former techniques are based on the minimization of a regularized margin loss, where the margin is treated equivalently for each training pattern. We propose a minimization problem such that adaptive margins for each training pattern are utilized. Furthermore, we give bounds on the generalization error of the approach which justifies its robustness against outliers. We show experimentally that the generalization error of AM-SVMs is comparable to SVMs and LP-SVMs on benchmark datasets from the UCI repository.},
    author = {Weston, Jason and Herbrich, Ralf},
    booktitle = {Advances in Large Margin Classifiers},
    chapter = {15},
    pages = {281--296},
    publisher = {The MIT Press},
    title = {Adaptive Margin Support Vector Machines},
    url = {https://www.herbrich.me/papers/nips98_ann.pdf},
    year = {1999}
    }

Ordinal Regression

In contrast to the standard machine learning tasks of classification and metric regression, we investigate the problem of predicting variables of ordinal scale, a setting referred to as ordinal regression. This problem arises frequently in the social sciences and in information retrieval, where human preferences play a major role. Whilst approaches proposed in statistics rely on a probability model of a latent (unobserved) variable, we present a large margin algorithm that is based on a mapping from objects to scalar utility values, thus classifying pairs of objects.
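The pairwise reduction can be sketched in a few lines: every pair of objects with different ranks yields a difference vector that the utility function must score positive, and any linear classifier on these differences induces a utility. Here a plain perceptron stands in for the large margin algorithm (names and data are illustrative):

```python
import numpy as np

def fit_utility(X, ranks, epochs=50):
    """Learn a linear utility u(x) = <w, x> from ordinal labels by
    classifying pairs: whenever ranks[i] > ranks[j], the difference
    X[i] - X[j] must score positive under w."""
    n, d = X.shape
    pairs = [(i, j) for i in range(n) for j in range(n) if ranks[i] > ranks[j]]
    w = np.zeros(d)
    for _ in range(epochs):
        for i, j in pairs:
            if (X[i] - X[j]) @ w <= 0:   # pair misordered: perceptron update
                w = w + (X[i] - X[j])
    return w

# four objects whose rank grows with the first feature
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]])
ranks = [0, 1, 2, 3]
w = fit_utility(X, ranks)
u = X @ w   # the learned utilities reproduce the ordering of the ranks
```

The rank boundaries themselves correspond to thresholds on the learned utility axis; the large margin formulation additionally maximises the smallest margin between adjacent ranks.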

  • R. Herbrich, T. Graepel, and K. Obermayer, “Support Vector Learning for Ordinal Regression,” in Proceedings of the 9th International Conference on Artificial Neural Networks, 1999, pp. 97–102.
    [BibTeX] [Abstract] [Download PDF]

    We investigate the problem of predicting variables of ordinal scale. This task is referred to as ordinal regression and is complementary to the standard machine learning tasks of classification and metric regression. In contrast to statistical models we present a distribution independent formulation of the problem together with uniform bounds of the risk functional. The approach presented is based on a mapping from objects to scalar utility values. Similar to Support Vector methods we derive a new learning algorithm for the task of ordinal regression based on large margin rank boundaries. We give experimental results for an information retrieval task: learning the order of documents w.r.t. an initial query. Experimental results indicate that the presented algorithm outperforms more naive approaches to ordinal regression such as Support Vector classification and Support Vector regression in the case of more than two ranks.

    @inproceedings{herbrich1999ordinalregression,
    abstract = {We investigate the problem of predicting variables of ordinal scale. This task is referred to as ordinal regression and is complementary to the standard machine learning tasks of classification and metric regression. In contrast to statistical models we present a distribution independent formulation of the problem together with uniform bounds of the risk functional. The approach presented is based on a mapping from objects to scalar utility values. Similar to Support Vector methods we derive a new learning algorithm for the task of ordinal regression based on large margin rank boundaries. We give experimental results for an information retrieval task: learning the order of documents w.r.t. an initial query. Experimental results indicate that the presented algorithm outperforms more naive approaches to ordinal regression such as Support Vector classification and Support Vector regression in the case of more than two ranks.},
    author = {Herbrich, Ralf and Graepel, Thore and Obermayer, Klaus},
    booktitle = {Proceedings of the 9th International Conference on Artificial Neural Networks},
    pages = {97--102},
    title = {Support Vector Learning for Ordinal Regression},
    url = {https://www.herbrich.me/papers/icann99_ordinal.pdf},
    year = {1999}
    }

  • R. Herbrich, T. Graepel, and K. Obermayer, “Large Margin Rank Boundaries for Ordinal Regression,” in Advances in Large Margin Classifiers, The MIT Press, 1999, pp. 115–132.
    [BibTeX] [Abstract] [Download PDF]

    In contrast to the standard machine learning tasks of classification and metric regression we investigate the problem of predicting variables of ordinal scale, a setting referred to as ordinal regression. This problem arises frequently in the social sciences and in information retrieval where human preferences play a major role. Whilst approaches proposed in statistics rely on a probability model of a latent (unobserved) variable we present a distribution independent risk formulation of ordinal regression which allows us to derive a uniform convergence bound. Applying this bound we present a large margin algorithm that is based on a mapping from objects to scalar utility values thus classifying pairs of objects. We give experimental results for an information retrieval task which show that our algorithm outperforms more naive approaches to ordinal regression such as Support Vector Classification and Support Vector Regression in the case of more than two ranks.

    @incollection{herbrich1999largemarginrank,
    abstract = {In contrast to the standard machine learning tasks of classification and metric regression we investigate the problem of predicting variables of ordinal scale, a setting referred to as ordinal regression. This problem arises frequently in the social sciences and in information retrieval where human preferences play a major role. Whilst approaches proposed in statistics rely on a probability model of a latent (unobserved) variable we present a distribution independent risk formulation of ordinal regression which allows us to derive a uniform convergence bound. Applying this bound we present a large margin algorithm that is based on a mapping from objects to scalar utility values thus classifying pairs of objects. We give experimental results for an information retrieval task which show that our algorithm outperforms more naive approaches to ordinal regression such as Support Vector Classification and Support Vector Regression in the case of more than two ranks.},
    author = {Herbrich, Ralf and Graepel, Thore and Obermayer, Klaus},
    booktitle = {Advances in Large Margin Classifiers},
    chapter = {7},
    pages = {115--132},
    publisher = {The MIT Press},
    title = {Large Margin Rank Boundaries for Ordinal Regression},
    url = {https://www.herbrich.me/papers/nips98_ordinal.pdf},
    year = {1999}
    }

Kernel Methods for Measuring Independence

We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent. We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information. Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis.
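An empirical sketch of the constrained covariance: centre the two Gram matrices and take the largest singular value of the resulting cross-covariance. The normalisation and eigenvalue route below follow one common convention and should be read as an illustration rather than the paper's exact estimator:

```python
import numpy as np

def coco(x, y, sigma=1.0):
    """Empirical constrained covariance (COCO) with Gaussian kernels:
    centre the Gram matrices K, L of x and y and return
    (1/n) * sqrt(lambda_max(Kc @ Lc)), the largest singular value of the
    empirical cross-covariance operator. Values near zero suggest
    independence; for universal kernels the population quantity is zero
    iff the variables are independent."""
    n = len(x)
    K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * sigma ** 2))
    L = np.exp(-((y[:, None] - y[None, :]) ** 2) / (2 * sigma ** 2))
    H = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    lam = np.max(np.real(np.linalg.eigvals(H @ K @ H @ H @ L @ H)))
    return float(np.sqrt(max(lam, 0.0)) / n)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_dep = x + 0.1 * rng.normal(size=200)   # strongly dependent on x
y_ind = rng.normal(size=200)             # independent of x
```

On such data the dependent pair yields a markedly larger COCO value than the independent pair, which is how the functional is used as a contrast function in independent component analysis.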

  • A. Gretton, R. Herbrich, A. J. Smola, O. Bousquet, and B. Schölkopf, “Kernel Methods for Measuring Independence,” Journal of Machine Learning Research, vol. 6, pp. 2075–2129, 2005.
    [BibTeX] [Abstract] [Download PDF]

    We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent. We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information. Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis.

    @article{gretton2005kernelindependence,
    abstract = {We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent. We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information. Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis.},
    author = {Gretton, Arthur and Herbrich, Ralf and Smola, Alexander J and Bousquet, Olivier and Sch\"{o}lkopf, Bernhard},
    journal = {Journal of Machine Learning Research},
    pages = {2075--2129},
    title = {Kernel Methods for Measuring Independence},
    url = {https://www.herbrich.me/papers/gretton05a.pdf},
    volume = {6},
    year = {2005}
    }

  • A. Gretton, A. Smola, O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls, B. Schölkopf, and N. Logothetis, “Kernel Constrained Covariance for Dependence Measurement,” in Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics (AISTATS), 2005, pp. 112–119.
    [BibTeX] [Abstract] [Download PDF]

    We discuss reproducing kernel Hilbert space (RKHS)-based measures of statistical dependence, with emphasis on constrained covariance (COCO), a novel criterion to test dependence of random variables. We show that COCO is a test for independence if and only if the associated RKHSs are universal. That said, no independence test exists that can distinguish dependent and independent random variables in all circumstances. Dependent random variables can result in a COCO which is arbitrarily close to zero when the source densities are highly non-smooth. All current kernel-based independence tests share this behaviour. We demonstrate exponential convergence between the population and empirical COCO. Finally, we use COCO as a measure of joint neural activity between voxels in MRI recordings of the macaque monkey, and compare the results to the mutual information and the correlation. We also show the effect of removing breathing artefacts from the MRI recording.

    @inproceedings{gretton2005kernelcoco,
    abstract = {We discuss reproducing kernel Hilbert space (RKHS)-based measures of statistical dependence, with emphasis on constrained covariance (COCO), a novel criterion to test dependence of random variables. We show that COCO is a test for independence if and only if the associated RKHSs are universal. That said, no independence test exists that can distinguish dependent and independent random variables in all circumstances. Dependent random variables can result in a COCO which is arbitrarily close to zero when the source densities are highly non-smooth. All current kernel-based independence tests share this behaviour. We demonstrate exponential convergence between the population and empirical COCO. Finally, we use COCO as a measure of joint neural activity between voxels in MRI recordings of the macaque monkey, and compare the results to the mutual information and the correlation. We also show the effect of removing breathing artefacts from the MRI recording.},
    author = {Gretton, Arthur and Smola, Alexander and Bousquet, Olivier and Herbrich, Ralf and Belitski, Andrei and Augath, Mark and Murayama, Yusuke and Pauls, Jon and Sch\"{o}lkopf, Bernhard and Logothetis, Nikos},
    booktitle = {Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics (AISTATS)},
    pages = {112--119},
    title = {Kernel Constrained Covariance for Dependence Measurement},
    url = {https://www.herbrich.me/papers/kcc.pdf},
    year = {2005}
    }

Kernel Gibbs Sampler

We present an algorithm that samples the hypothesis space of kernel classifiers. A uniform prior over normalised weight vectors, combined with a likelihood based on a model of label noise, leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the chosen line. The KGS can be used as an analytical tool for exploring Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking label noise into account.
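One KGS-style move can be sketched as follows: pick a random direction orthogonal to the current weight vector, note that the error count (and hence the posterior) is piecewise constant along the resulting great circle, and sample an angle from that density. The sketch below discretises the circle on a grid instead of enumerating the exact breakpoints, so it is an approximation of the actual sampler:

```python
import numpy as np

def kgs_step(w, X, y, eps=0.1, rng=None):
    """One sampling move in the spirit of the KGS: with v a random
    direction orthogonal to w, parametrise the great circle
    w(t) = w*cos(t) + v*sin(t) and sample t from the density proportional
    to eps**m(t) * (1-eps)**(n - m(t)), where m(t) is the number of
    training errors of w(t) under label-noise rate eps."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = len(y)
    v = rng.normal(size=len(w))
    v = v - (v @ w) * w / (w @ w)            # make v orthogonal to w
    v = v / np.linalg.norm(v)
    ts = np.linspace(-np.pi, np.pi, 2000, endpoint=False)
    ws = np.outer(np.cos(ts), w) + np.outer(np.sin(ts), v)
    m = ((ws @ X.T) * y <= 0).sum(axis=1)    # training errors at each angle
    p = eps ** m * (1 - eps) ** (n - m)      # piecewise constant posterior
    t = rng.choice(ts, p=p / p.sum())
    w_new = np.cos(t) * w + np.sin(t) * v
    return w_new / np.linalg.norm(w_new)

X = np.array([[1.0, 0.0], [0.8, 0.2], [-1.0, 0.1], [-0.9, -0.3]])
y = np.array([1, 1, -1, -1])
w = kgs_step(np.array([1.0, 0.0]), X, y)
```

Averaging many such samples approximates the Bayes point; in the kernelised setting the weight vector lives in the span of the training points, so the same step applies to the expansion coefficients.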

  • T. Graepel and R. Herbrich, “The Kernel Gibbs Sampler,” in Advances in Neural Information Processing Systems 13, 2000, pp. 514–520.
    [BibTeX] [Abstract] [Download PDF]

    We present an algorithm that samples the hypothesis space of kernel classifiers. Given a uniform prior over normalised weight vectors and a likelihood based on a model of label noise leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov Chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the line chosen. The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking into account label noise.

    @inproceedings{herbrich2000kernelgibbs,
    abstract = {We present an algorithm that samples the hypothesis space of kernel classifiers. Given a uniform prior over normalised weight vectors and a likelihood based on a model of label noise leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov Chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the line chosen. The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking into account label noise.},
    author = {Graepel, Thore and Herbrich, Ralf},
    booktitle = {Advances in Neural Information Processing Systems 13},
    pages = {514--520},
    publisher = {The MIT Press},
    title = {The Kernel Gibbs Sampler},
    url = {https://www.herbrich.me/papers/kernel-gibbs-sampler.pdf},
    year = {2000}
    }

Classification Learning on Proximity Data

We investigate the problem of learning a classification task on data represented in terms of their pairwise proximities. This representation does not refer to an explicit feature representation of the data items and is thus more general than the standard approach of using Euclidean feature vectors, from which pairwise proximities can always be calculated. Our approaches are based on a linear threshold model in the proximity values themselves. We show that prior knowledge about the problem can be incorporated through the choice of distance measure, and we examine different metrics with respect to their generalization.
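Concretely, the linear threshold model uses each item's row of the proximity matrix as its representation, so explicit feature vectors are never needed. A perceptron-based sketch (illustrative only; the papers use LP and SVM formulations of the same model):

```python
import numpy as np

def fit_proximity_classifier(P, y, epochs=100):
    """Linear threshold model directly in the proximities: item i is
    represented by row P[i] of the proximity matrix (no explicit feature
    vectors), and a perceptron learns f(i) = sign(<alpha, P[i]> + b)."""
    n = P.shape[0]
    alpha, b = np.zeros(n), 0.0
    for _ in range(epochs):
        for i in range(n):
            if y[i] * (alpha @ P[i] + b) <= 0:   # misclassified: update
                alpha = alpha + y[i] * P[i]
                b = b + y[i]
    return alpha, b

# two well-separated clusters described only by pairwise similarities
pts = np.array([0.0, 0.1, 5.0, 5.1])
P = np.exp(-np.abs(pts[:, None] - pts[None, :]))   # choice of proximity measure
y = np.array([1, 1, -1, -1])
alpha, b = fit_proximity_classifier(P, y)
preds = np.sign(P @ alpha + b)
```

The choice of proximity measure (here a simple exponential of the distance) is where prior knowledge about the problem enters, and it need not correspond to any Euclidean embedding.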

  • T. Graepel, R. Herbrich, B. Schölkopf, A. Smola, P. Bartlett, K. R. Müller, K. Obermayer, and R. C. Williamson, “Classification on Proximity Data with LP-Machines,” in Proceedings of the 9th International Conference on Artificial Neural Networks, 1999, pp. 304–309.
    [BibTeX] [Abstract] [Download PDF]

    We provide a new linear program to deal with classification of data in the case of data given in terms of pairwise proximities. This allows to avoid the problems inherent in using feature spaces with indefinite metric in Support Vector Machines, since the notion of a margin is purely needed in input space where the classification actually occurs. Moreover in our approach we can enforce sparsity in the proximity representation by sacrificing training error. This turns out to be favorable for proximity data. Similar to nu-SV methods, the only parameter needed in the algorithm is the (asymptotical) number of data points being classified with a margin. Finally, the algorithm is successfully compared with nu-SV learning in proximity space and K-nearest-neighbors on real world data from Neuroscience and molecular biology.

    @inproceedings{graepel1999proximity,
    abstract = {We provide a new linear program to deal with classification of data in the case of data given in terms of pairwise proximities. This allows to avoid the problems inherent in using feature spaces with indefinite metric in Support Vector Machines, since the notion of a margin is purely needed in input space where the classification actually occurs. Moreover in our approach we can enforce sparsity in the proximity representation by sacrificing training error. This turns out to be favorable for proximity data. Similar to nu-SV methods, the only parameter needed in the algorithm is the (asymptotical) number of data points being classified with a margin. Finally, the algorithm is successfully compared with nu-SV learning in proximity space and K-nearest-neighbors on real world data from Neuroscience and molecular biology.},
    author = {Graepel, Thore and Herbrich, Ralf and Sch\"{o}lkopf, Bernhard and Smola, Alex and Bartlett, Peter and M\"{u}ller, Klaus Robert and Obermayer, Klaus and Williamson, Robert C},
    booktitle = {Proceedings of the 9th International Conference on Artificial Neural Networks},
    pages = {304--309},
    title = {Classification on Proximity Data with {LP}-Machines},
    url = {https://www.herbrich.me/papers/icann99_proxy.pdf},
    year = {1999}
    }

  • T. Graepel, R. Herbrich, P. Bollmann-Sdorra, and K. Obermayer, “Classification on Pairwise Proximity Data,” in Advances in Neural Information Processing Systems 11, 1998, pp. 438–444.
    [BibTeX] [Abstract] [Download PDF]

    We investigate the problem of learning a classification task on data represented in terms of their pairwise proximities. This representation does not refer to an explicit feature representation of the data items and is thus more general than the standard approach of using Euclidean feature vectors, from which pairwise proximities can always be calculated. Our first approach is based on a combined linear embedding and classification procedure resulting in an extension of the Optimal Hyperplane algorithm to pseudo-Euclidean data. As an alternative we present another approach based on a linear threshold model in the proximity values themselves, which is optimized using Structural Risk Minimization. We show that prior knowledge about the problem can be incorporated by the choice of distance measures and examine different metrics w.r.t. their generalization. Finally, the algorithms are successfully applied to protein structure data and to data from the cat’s cerebral cortex. They show better performance than K-nearest-neighbor classification.

    @inproceedings{graepel1998classificationpairwise,
    abstract = {We investigate the problem of learning a classification task on data represented in terms of their pairwise proximities. This representation does not refer to an explicit feature representation of the data items and is thus more general than the standard approach of using Euclidean feature vectors, from which pairwise proximities can always be calculated. Our first approach is based on a combined linear embedding and classification procedure resulting in an extension of the Optimal Hyperplane algorithm to pseudo-Euclidean data. As an alternative we present another approach based on a linear threshold model in the proximity values themselves, which is optimized using Structural Risk Minimization. We show that prior knowledge about the problem can be incorporated by the choice of distance measures and examine different metrics w.r.t. their generalization. Finally, the algorithms are successfully applied to protein structure data and to data from the cat's cerebral cortex. They show better performance than K-nearest-neighbor classification.},
    author = {Graepel, Thore and Herbrich, Ralf and Bollmann-Sdorra, Peter and Obermayer, Klaus},
    booktitle = {Advances in Neural Information Processing Systems 11},
    pages = {438--444},
    publisher = {The MIT Press},
    title = {Classification on Pairwise Proximity Data},
    url = {https://www.herbrich.me/papers/graeherbollober99.pdf},
    year = {1998}
    }