# Bayesian Learning and Decision Making

## Approximate Bayesian Inference

Bayesian inference provides a powerful mechanism for data analysis and learning. However, in real-world situations it is rarely possible to perform exact inference; in fact, exact Bayesian inference is in general NP hard. One of the most successful approaches to address this problem is to exploit a factorial structure of both the sampling distribution and the prior. Then, there are a variety of methods that exploit the factorisation for efficient approximations in inference. The Expectation-Propagation (EP) algorithm is a powerful tool for inference in such factor graphs. For discrete distributions, the EP algorithm is also known as Belief Propagation.

• R. Herbrich, „Distributed, Real-time Bayesian Learning in Online Services,“ in Proceedings of the 6th ACM Conference on Recommender Systems, 2012, p. 203–204.

The last ten years have seen a tremendous growth in Internet-based online services such as search, advertising, gaming and social networking. Today, it is important to analyze large collections of user interaction data as a first step in building predictive models for these services as well as learn these models in real-time. One of the biggest challenges in this setting is scale: not only does the sheer scale of data necessitate parallel processing but it also necessitates distributed models; with over 900 million active users at Facebook, any user-specific sets of features in a linear or non-linear model yields models of a size bigger than can be stored in a single system. In this talk, I will give a hands-on introduction to one of the most versatile tools for handling large collections of data with distributed probabilistic models: the sum-product algorithm for approximate message passing in factor graphs. I will discuss the application of this algorithm for the specific case of generalized linear models and outline the challenges of both approximate and distributed message passing including an in-depth discussion of expectation propagation and Map-Reduce.

@inproceedings{herbrich2012distributed,
abstract = {The last ten years have seen a tremendous growth in Internet-based online services such as search, advertising, gaming and social networking. Today, it is important to analyze large collections of user interaction data as a first step in building predictive models for these services as well as learn these models in real-time. One of the biggest challenges in this setting is scale: not only does the sheer scale of data necessitate parallel processing but it also necessitates distributed models; with over 900 million active users at Facebook, any user-specific sets of features in a linear or non-linear model yields models of a size bigger than can be stored in a single system. In this talk, I will give a hands-on introduction to one of the most versatile tools for handling large collections of data with distributed probabilistic models: the sum-product algorithm for approximate message passing in factor graphs. I will discuss the application of this algorithm for the specific case of generalized linear models and outline the challenges of both approximate and distributed message passing including an in-depth discussion of expectation propagation and Map-Reduce.},
author = {Herbrich, Ralf},
booktitle = {Proceedings of the 6th ACM Conference on Recommender Systems},
pages = {203--204},
title = {Distributed, Real-time {Bayesian} Learning in Online Services},
url = {https://www.herbrich.me/papers/recsys2012.pdf},
organization = {ACM},
year = {2012}
}

• R. Herbrich, „On Gaussian Expectation Propagation,“ Microsoft Research 2005.

In this short note we will re-derive the Gaussian expectation propagation (Gaussian EP) algorithm as presented in Minka (2001) and demonstrate an application of Gaussian EP to approximate multi-dimensional truncated Gaussians.

@techreport{herbrich2005gaussianep,
abstract = {In this short note we will re-derive the Gaussian expectation propagation (Gaussian EP) algorithm as presented in Minka (2001) and demonstrate an application of Gaussian EP to approximate multi-dimensional truncated Gaussians.},
author = {Herbrich, Ralf},
institution = {Microsoft Research},
title = {On {Gaussian} Expectation Propagation},
url = {https://www.herbrich.me/papers/EP.pdf},
year = {2005}
}

• R. Herbrich, „Minimising the Kullback-Leibler Divergence,“ Microsoft Research 2005.

In this note we show that minimising the Kullback–Leibler divergence over a family in the class of exponential distributions is achieved by matching the expected natural statistic. We will also give an explicit update formula for distributions with only one likelihood term.

@techreport{herbrich2005minimizingkl,
abstract = {In this note we show that minimising the Kullback–Leibler divergence over a family in the class of exponential distributions is achieved by matching the expected natural statistic. We will also give an explicit update formula for distributions with only one likelihood term.},
author = {Herbrich, Ralf},
institution = {Microsoft Research},
title = {{Minimising the Kullback-Leibler Divergence}},
url = {https://www.herbrich.me/papers/KL.pdf},
year = {2005}
}

## Poisson Networks

Modelling structured multivariate point process data has wide ranging applications like understanding neural activity, developing faster file access systems and learning dependencies among servers in large networks. We developed the Poisson network model for representing multivariate structured Poisson processes. In our model each node of the network represents a Poisson process. The novelty of our work is that waiting times of a process are modelled by an exponential distribution with a piecewise constant rate function that depends on the event counts of its parents in the network in a generalised linear way. Our choice of model allows to perform exact sampling from arbitrary structures. We adopt a Bayesian approach for learning the network structure. We also develop fixed point and sampling based approximations for performing inference of rate functions in Poisson networks.

• S. Rajaram, T. Graepel, and R. Herbrich, „Poisson-Networks : A Model for Structured Poisson Processes,“ in Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, 2004, p. 277–284.

Modelling structured multivariate point process data has wide ranging applications like understanding neural activity, developing faster file access systems and learning dependencies among servers in large networks. In this paper, we develop the Poisson network model for representing multivariate structured Poisson processes. In our model each node of the network represents a Poisson process. The novelty of our work is that waiting times of a process are modelled by an exponential distribution with a piecewise constant rate function that depends on the event counts of its parents in the network in a generalised linear way. Our choice of model allows to perform exact sampling from arbitrary structures. We adopt a Bayesian approach for learning the network structure. Further, we discuss fixed point and sampling based approximations for performing inference of rate functions in Poisson networks.

@inproceedings{rajaram2004poissonnetworks,
abstract = {Modelling structured multivariate point process data has wide ranging applications like understanding neural activity, developing faster file access systems and learning dependencies among servers in large networks. In this paper, we develop the Poisson network model for representing multivariate structured Poisson processes. In our model each node of the network represents a Poisson process. The novelty of our work is that waiting times of a process are modelled by an exponential distribution with a piecewise constant rate function that depends on the event counts of its parents in the network in a generalised linear way. Our choice of model allows to perform exact sampling from arbitrary structures. We adopt a Bayesian approach for learning the network structure. Further, we discuss fixed point and sampling based approximations for performing inference of rate functions in Poisson networks.},
author = {Rajaram, Shyamsundar and Graepel, Thore and Herbrich, Ralf},
booktitle = {Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics},
pages = {277--284},
title = {{Poisson}-Networks : A Model for Structured {Poisson} Processes},
url = {https://www.herbrich.me/papers/spikenets.pdf},
year = {2004}
}

## Informative Vector Machine

We have developed a framework for sparse Gaussian process methods which uses forward selection with criteria based on information-theoretical principles, previously suggested for active learning. In contrast to most previous work on sparse GPs, our goal is not only to learn sparse predictors (which can be evaluated in O(d) rather than O(n), d<n, n the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most O(nd2), and in large real-world classification experiments we show that it can match prediction performance of the popular support vector machine (SVM), yet it requires only a fraction of the training time. In contrast to the SVM, our approximation produces estimates of predictive probabilities (‚error bars‘), allows for Bayesian model selection and is less complex in implementation.

• N. Lawrence, M. Seeger, and R. Herbrich, „Fast Sparse Gaussian Process Methods: The Informative Vector Machine,“ in Advances in Neural Information Processing Systems 15, 2002, p. 609–616.

We present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on information-theoretical principles, previously suggested for active learning. In contrast to most previous work on sparse GPs, our goal is not only to learn sparse predictors (which can be evaluated in $O(d)$ rather than $O(n)$, $d łl n$, $n$ the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most $O(n \cdot d^2)$, and in large real-world classification experiments we show that it can match prediction performance of the popular support vector machine (SVM), yet it requires only a fraction of the training time. In contrast to the SVM, our approximation produces estimates of predictive probabilities („error bars“), allows for Bayesian model selection and is less complex in implementation.

@inproceedings{lawrence2002ivm,
abstract = {We present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on information-theoretical principles, previously suggested for active learning. In contrast to most previous work on sparse GPs, our goal is not only to learn sparse predictors (which can be evaluated in $O(d)$ rather than $O(n)$, $d \ll n$, $n$ the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most $O(n \cdot d^2)$, and in large real-world classification experiments we show that it can match prediction performance of the popular support vector machine (SVM), yet it requires only a fraction of the training time. In contrast to the SVM, our approximation produces estimates of predictive probabilities ("error bars"), allows for Bayesian model selection and is less complex in implementation.},
author = {Lawrence, Neil and Seeger, Matthias and Herbrich, Ralf},
booktitle = {Advances in Neural Information Processing Systems 15},
pages = {609--616},
title = {Fast Sparse {Gaussian} Process Methods: The Informative Vector Machine},
url = {https://www.herbrich.me/papers/ivm.pdf},
year = {2002}
}

## Learning with Social Priors

Slow convergence and poor initial accuracy are two problems that plague efforts to use very large feature sets in online learning. We show how these problems can be mitigated if a graph of relationships between features is known. We study this problem in a fully Bayesian setting, focusing on the problem of using Facebook user-IDs as features, with the social network giving the relationship structure. Our analysis uncovers significant problems with the obvious regularizations, and motivates a two-component mixture-model social prior that is provably better.

• D. Chakrabarti and R. Herbrich, „Speeding Up Large-Scale Learning with a Social Prior,“ in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013, p. 650–658.

Slow convergence and poor initial accuracy are two problems that plague efforts to use very large feature sets in online learning. This is especially true when only a few features are active in any training example, and the frequency of activations of different features is skewed. We show how these problems can be mitigated if a graph of relationships between features is known. We study this problem in a fully Bayesian setting, focusing on the problem of using Facebook user-IDs as features, with the social network giving the relationship structure. Our analysis uncovers significant problems with the obvious regularizations, and motivates a two-component mixture-model social prior that is provably better. Empirical results on large-scale click prediction problems show that our algorithm can learn as well as the baseline with 12M fewer training examples, and continuously outperforms it for over 60M examples. On a second problem using binned features, our model outperforms the baseline even after the latter sees 5x as much data.

@inproceedings{chakrabarti2013speeding,
abstract = {Slow convergence and poor initial accuracy are two problems that plague efforts to use very large feature sets in online learning. This is especially true when only a few features are active in any training example, and the frequency of activations of different features is skewed. We show how these problems can be mitigated if a graph of relationships between features is known. We study this problem in a fully Bayesian setting, focusing on the problem of using Facebook user-IDs as features, with the social network giving the relationship structure. Our analysis uncovers significant problems with the obvious regularizations, and motivates a two-component mixture-model social prior that is provably better. Empirical results on large-scale click prediction problems show that our algorithm can learn as well as the baseline with 12M fewer training examples, and continuously outperforms it for over 60M examples. On a second problem using binned features, our model outperforms the baseline even after the latter sees 5x as much data.},
author = {Chakrabarti, Deepayan and Herbrich, Ralf},
booktitle = {Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
pages = {650--658},
title = {Speeding Up Large-Scale Learning with a Social Prior},
url = {https://www.herbrich.me/papers/kdd13-socialprior.pdf},
organization = {ACM},
year = {2013}
}

## Kernel Topic Models

Latent Dirichlet Allocation models discrete document data as a mixture of discrete distributions, using Dirichlet beliefs over the mixture weights. We study a variation in which the document’s mixture weight beliefs are replaced with squashed Gaussian distributions. This allows documents to be associated with elements of a Hilbert space, admitting kernel topic models (KTM), modelling temporal, spatial, hierarchical, social and other structure between documents. The main challenge is efficient approximate inference on the latent Gaussian. We present an approximate algorithm cast around a Laplace approximation in a transformed basis. The KTM can also be interpreted as a type of Gaussian process latent variable model, or as a topic model conditional on document features, uncovering links between earlier work in these areas.

• P. Hennig, D. Stern, R. Herbrich, and T. Graepel, „Kernel Topic Models,“ in Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), 2012, p. 511–519.

Latent Dirichlet Allocation models discrete data as a mixture of discrete distributions, using Dirichlet beliefs over the mixture weights. We study a variation of this concept, in which the documents‘ mixture weight beliefs are replaced with squashed Gaussian distributions. This allows documents to be associated with elements of a Hilbert space, admitting kernel topic models (KTM), modelling temporal, spatial, hierarchical, social and other structure between documents. The main challenge is efficient approximate inference on the latent Gaussian. We present an approximate algorithm cast around a Laplace approximation in a transformed basis. The KTM can also be interpreted as a type of Gaussian process latent variable model, or as a topic model conditional on document features, uncovering links between earlier work in these areas.

@inproceedings{henning2012kerneltopic,
abstract = {Latent Dirichlet Allocation models discrete data as a mixture of discrete distributions, using Dirichlet beliefs over the mixture weights. We study a variation of this concept, in which the documents' mixture weight beliefs are replaced with squashed Gaussian distributions. This allows documents to be associated with elements of a Hilbert space, admitting kernel topic models (KTM), modelling temporal, spatial, hierarchical, social and other structure between documents. The main challenge is efficient approximate inference on the latent Gaussian. We present an approximate algorithm cast around a Laplace approximation in a transformed basis. The KTM can also be interpreted as a type of Gaussian process latent variable model, or as a topic model conditional on document features, uncovering links between earlier work in these areas.},
author = {Hennig, Philipp and Stern, David and Herbrich, Ralf and Graepel, Thore},
booktitle = {Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS)},
pages = {511--519},
title = {Kernel Topic Models},
url = {https://www.herbrich.me/papers/ktm.pdf},
year = {2012}
}

## Bayesian Online Learning for Multilabel and Multivariate Performance Measures

Many real world applications employ multi-variate performance measures and each example can belong to multiple classes. We propose a Bayesian online multi-label classification framework which learns a probabilistic linear classifier. The likelihood is modeled by a graphical model similar to TrueSkill, and inference is based on Gaussian density filtering with expectation propagation. Using samples from the posterior, we label the testing data by maximizing the expected F1-score. Our experiments on Reuters1-v2 dataset show that the proposed algorithm compares favorably to the state-of-the-art online learners in macro-averaged F1-score and training time.

• X. Zhang, T. Graepel, and R. Herbrich, „Bayesian Online Learning for Multi-label and Multi-variate Performance Measures,“ in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 2010, p. 956–963.

Many real world applications employ multi-variate performance measures and each example can belong to multiple classes. The currently most popular approaches train an SVM for each class, followed by ad-hoc thresholding. Probabilistic models using Bayesian decision theory are also commonly adopted. In this paper, we propose a Bayesian online multi-label classification framework (BOMC) which learns a probabilistic linear classifier. The likelihood is modeled by a graphical model similar to TrueSkill, and inference is based on Gaussian density filtering with expectation propagation. Using samples from the posterior, we label the testing data by maximizing the expected F1-score. Our experiments on Reuters1-v2 dataset show BOMC compares favorably to the state-of-the-art online learners in macro-averaged F1-score and training time.

@inproceedings{zhang2010bayesian,
abstract = {Many real world applications employ multi-variate performance measures and each example can belong to multiple classes. The currently most popular approaches train an SVM for each class, followed by ad-hoc thresholding. Probabilistic models using Bayesian decision theory are also commonly adopted. In this paper, we propose a Bayesian online multi-label classification framework (BOMC) which learns a probabilistic linear classifier. The likelihood is modeled by a graphical model similar to TrueSkill, and inference is based on Gaussian density filtering with expectation propagation. Using samples from the posterior, we label the testing data by maximizing the expected F1-score. Our experiments on Reuters1-v2 dataset show BOMC compares favorably to the state-of-the-art online learners in macro-averaged F1-score and training time.},
author = {Zhang, Xinhua and Graepel, Thore and Herbrich, Ralf},
booktitle = {Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS)},
pages = {956--963},
title = {{Bayesian} Online Learning for Multi-label and Multi-variate Performance Measures},
url = {https://www.herbrich.me/papers/zhang10b.pdf},
year = {2010}
}

## Bayesian Transduction

We consider the case of binary classification by linear discriminant functions. The simplification of the transduction problem results from the fact that the infinite number of linear discriminants is boiled down to a finite number of equivalence classes on the working set. The number of equivalence classes is bounded from above by the growth function. Each equivalence class corresponds to a polyhedron in parameter space. From a Bayesian point of view, we suggest to measure the prior probability of a labelling of the working set as the volume of the corresponding polyhedron w.r.t. the a-priori distribution in parameter space. Then the maximum a-posterior (MAP) scheme recommends to choose the labelling of maximum volume.

• T. Graepel and R. Herbrich, „The Kernel Gibbs Sampler,“ in Advances in Neural Information Processing Systems 13, 2000, p. 514–520.

We present an algorithm that samples the hypothesis space of kernel classifiers. Given a uniform prior over normalised weight vectors and a likelihood based on a model of label noise leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov Chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the line chosen. The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking into account label noise.

@inproceedings{herbrich2000kernelgibbs,
abstract = {We present an algorithm that samples the hypothesis space of kernel classifiers. Given a uniform prior over normalised weight vectors and a likelihood based on a model of label noise leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov Chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the line chosen. The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking into account label noise.},
author = {Graepel, Thore and Herbrich, Ralf},
booktitle = {Advances in Neural Information Processing Systems 13},
pages = {514--520},
publisher = {The MIT Press},
title = {The Kernel Gibbs Sampler},
url = {https://www.herbrich.me/papers/kernel-gibbs-sampler.pdf},
year = {2000}
}

• T. Graepel, R. Herbrich, and K. Obermayer, „Bayesian Transduction,“ in Advances in Neural Information Processing Systems 12, 1999, p. 456–462.

Transduction is an inference principle that takes a training sample and aims at estimating the values of a function at given points contained in the so-called working sample. Hence, transduction is a less ambitious task than induction which aims at inferring a functional dependency on the whole of input space. As a consequence, however, transduction provides a confidence measure on single predictions rather than classifiers, a feature particularly important for risk-sensitive applications. We consider the case of binary classification by linear discriminant functions (perceptrons) in kernel space. From the transductive point of view, the infinite number of perceptrons is boiled down to a finite number of equivalence classes on the working sample each of which corresponds to a polyhedron in parameter space. In the Bayesian spirit the posteriori probability of a labelling of the working sample is determined as the ratio between the volume of the corresponding polyhedron and the volume of version space. Then the maximum posteriori scheme recommends to choose the labelling of maximum volume. We suggest to sample version space by an ergodic billiard in kernel space. Experimental results on real world data indicate that Bayesian Transduction compares favourably to the well-known Support Vector Machine, in particular if the posteriori probability of labellings is used as a confidence measure to exclude test points of low confidence.

@inproceedings{graepel1999bayesiantransduction,
abstract = {Transduction is an inference principle that takes a training sample and aims at estimating the values of a function at given points contained in the so-called working sample. Hence, transduction is a less ambitious task than induction which aims at inferring a functional dependency on the whole of input space. As a consequence, however, transduction provides a confidence measure on single predictions rather than classifiers, a feature particularly important for risk-sensitive applications. We consider the case of binary classification by linear discriminant functions (perceptrons) in kernel space. From the transductive point of view, the infinite number of perceptrons is boiled down to a finite number of equivalence classes on the working sample each of which corresponds to a polyhedron in parameter space. In the Bayesian spirit the posteriori probability of a labelling of the working sample is determined as the ratio between the volume of the corresponding polyhedron and the volume of version space. Then the maximum posteriori scheme recommends to choose the labelling of maximum volume. We suggest to sample version space by an ergodic billiard in kernel space. Experimental results on real world data indicate that Bayesian Transduction compares favourably to the well-known Support Vector Machine, in particular if the posteriori probability of labellings is used as a confidence measure to exclude test points of low confidence.},
author = {Graepel, Thore and Herbrich, Ralf and Obermayer, Klaus},
booktitle = {Advances in Neural Information Processing Systems 12},
pages = {456--462},
publisher = {The MIT Press},
title = {{Bayesian} Transduction},
url = {https://www.herbrich.me/papers/trans.pdf},
year = {1999}
}

## Bayes Point Machines

From a Bayesian perspective, Support Vector Machines choose the hypothesis corresponding to the largest possible hypersphere that can be inscribed in version space, i.e. in the space of all consistent hypotheses given a training set. Those boundaries of version space which are tangent to the hypersphere define the support vectors. An alternative and potentially better approach is to construct the hypothesis using the whole of version space. This is achieved by using a Bayes Point Machine which finds the midpoint of the region of intersection of all hyperplanes bisecting version space into two halves of equal volume (the Bayes point). It is known that the centre of mass of version space approximates the Bayes point. We investigate estimating the centre of mass by averaging over the trajectory of a billiard ball bouncing in version space. Experimental results indicate that Bayes Point Machines consistently outperform Support Vector Machines.

• E. Harrington, R. Herbrich, J. Kivinen, J. C. Platt, and R. C. Williamson, „Online Bayes Point Machines,“ in Proceedings of Advances in Knowledge Discovery and Data Mining, 2003, p. 241–252.

We present a new and simple algorithm for learning large margin classifiers that works in a truly online manner. The algorithm generates a linear classifier by averaging the weights associated with several perceptron-like algorithms run in parallel in order to approximate the Bayes point. A random subsample of the incoming data stream is used to ensure diversity in the perceptron solutions. We experimentally study the algorithm’s performance on online and batch learning settings. The online experiments showed that our algorithm produces a low prediction error on the training sequence and tracks the presence of concept drift. On the batch problems its performance is comparable to the maximum margin algorithm which explicitly maximises the margin.

@inproceedings{harrington2003onlinebpm,
abstract = {We present a new and simple algorithm for learning large margin classifiers that works in a truly online manner. The algorithm generates a linear classifier by averaging the weights associated with several perceptron-like algorithms run in parallel in order to approximate the Bayes point. A random subsample of the incoming data stream is used to ensure diversity in the perceptron solutions. We experimentally study the algorithm's performance on online and batch learning settings. The online experiments showed that our algorithm produces a low prediction error on the training sequence and tracks the presence of concept drift. On the batch problems its performance is comparable to the maximum margin algorithm which explicitly maximises the margin.},
author = {Harrington, Edward and Herbrich, Ralf and Kivinen, Jyrki and Platt, John C and Williamson, Robert C},
booktitle = {Proceedings of Advances in Knowledge Discovery and Data Mining},
pages = {241--252},
title = {Online {Bayes} Point Machines},
url = {https://www.herbrich.me/papers/OBPM.pdf},
year = {2003}
}

• R. Herbrich, T. Graepel, and C. Campbell, „Bayes Point Machines,“ Journal of Machine Learning Research, vol. 1, p. 245–279, 2001.

Kernel-classifiers comprise a powerful class of non-linear decision functions for binary classification. The support vector machine is an example of a learning algorithm for kernel classifiers that singles out the consistent classifier with the largest margin, i.e. minimal real-valued output on the training sample, within the set of consistent hypotheses, the so-called version space. We suggest the Bayes point machine as a well-founded improvement which approximates the Bayes-optimal decision by the centre of mass of version space. We present two algorithms to stochastically approximate the centre of mass of version space: a billiard sampling algorithm and a sampling algorithm based on the well known perceptron algorithm. It is shown how both algorithms can be extended to allow for soft-boundaries in order to admit training errors. Experimentally, we find that –- for the zero training error case –- Bayes point machines consistently outperform support vector machines on both surrogate data and real-world benchmark data sets. In the soft-boundary/soft-margin case, the improvement over support vector machines is shown to be reduced. Finally, we demonstrate that the real-valued output of single Bayes points on novel test points is a valid confidence measure and leads to a steady decrease in generalisation error when used as a rejection criterion.

@article{herbrich2001bpm,
abstract = {Kernel-classifiers comprise a powerful class of non-linear decision functions for binary classification. The support vector machine is an example of a learning algorithm for kernel classifiers that singles out the consistent classifier with the largest margin, i.e. minimal real-valued output on the training sample, within the set of consistent hypotheses, the so-called version space. We suggest the Bayes point machine as a well-founded improvement which approximates the Bayes-optimal decision by the centre of mass of version space. We present two algorithms to stochastically approximate the centre of mass of version space: a billiard sampling algorithm and a sampling algorithm based on the well known perceptron algorithm. It is shown how both algorithms can be extended to allow for soft-boundaries in order to admit training errors. Experimentally, we find that --- for the zero training error case --- Bayes point machines consistently outperform support vector machines on both surrogate data and real-world benchmark data sets. In the soft-boundary/soft-margin case, the improvement over support vector machines is shown to be reduced. Finally, we demonstrate that the real-valued output of single Bayes points on novel test points is a valid confidence measure and leads to a steady decrease in generalisation error when used as a rejection criterion.},
author = {Herbrich, Ralf and Graepel, Thore and Campbell, Colin},
journal = {Journal of Machine Learning Research},
pages = {245--279},
title = {{Bayes} Point Machines},
url = {https://www.herbrich.me/papers/bpm.pdf},
volume = {1},
year = {2001}
}

• T. Graepel and R. Herbrich, „The Kernel Gibbs Sampler,“ in Advances in Neural Information Processing Systems 13, 2000, p. 514–520.

We present an algorithm that samples the hypothesis space of kernel classifiers. Given a uniform prior over normalised weight vectors and a likelihood based on a model of label noise leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov Chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the line chosen. The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking into account label noise.

@inproceedings{herbrich2000kernelgibbs,
abstract = {We present an algorithm that samples the hypothesis space of kernel classifiers. Given a uniform prior over normalised weight vectors and a likelihood based on a model of label noise leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov Chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the line chosen. The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking into account label noise.},
author = {Graepel, Thore and Herbrich, Ralf},
booktitle = {Advances in Neural Information Processing Systems 13},
pages = {514--520},
publisher = {The MIT Press},
title = {The Kernel Gibbs Sampler},
url = {https://www.herbrich.me/papers/kernel-gibbs-sampler.pdf},
year = {2000}
}

• R. Herbrich and T. Graepel, „Large Scale Bayes Point Machines,“ in Advances in Neural Information Processing Systems 13, 2000, p. 528–534.

The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) – also known as the Bayes point – exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [Herbrich et al., 2000] is restricted to small sample size because it requires O(m*m) of memory and O(N*m*m) computational steps where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes Point Machines are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximative) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1\% generalisation error for a given rejection rate of 10\%.

@inproceedings{herbrich2000largebpm,
abstract = {The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [Herbrich et al., 2000] is restricted to small sample size because it requires O(m*m) of memory and O(N*m*m) computational steps where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes Point Machines are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximative) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1\% generalisation error for a given rejection rate of 10\%.},
author = {Herbrich, Ralf and Graepel, Thore},
booktitle = {Advances in Neural Information Processing Systems 13},
pages = {528--534},
publisher = {The MIT Press},
title = {Large Scale Bayes Point Machines},
url = {https://www.herbrich.me/papers/mnist.pdf},
year = {2000}
}

• R. Herbrich, T. Graepel, and C. Campbell, „Robust Bayes Point Machines,“ in Proceedings of European Symposium on Artificial Neural Networks, 2000, p. 49–54.

Support Vector Machines choose the hypothesis corresponding to the centre of the largest hypersphere that can be inscribed in version space. If version space is elongated or irregularly shaped a potentially superior approach is take into account the whole of version space. We propose to construct the Bayes point which is approximated by the centre of mass. Our implementation of a Bayes Point Machine (BPM) uses an ergodic billiard to estimate this point in the kernel space. We show that BPMs outperform hard margin Support Vector Machines (SVMs) on real world datasets. We introduce a technique that allows the BPM to construct hypotheses with non-zero training error similar to soft margin SVMs with quadratic penelisation of the margin slacks. An experimental study reveals that with decreasing penelisation of training error the improvement of BPMs over SVMs decays, a finding that is explained by geometrical considerations.

@inproceedings{herbrich2000robustbpm,
abstract = {Support Vector Machines choose the hypothesis corresponding to the centre of the largest hypersphere that can be inscribed in version space. If version space is elongated or irregularly shaped a potentially superior approach is take into account the whole of version space. We propose to construct the Bayes point which is approximated by the centre of mass. Our implementation of a Bayes Point Machine (BPM) uses an ergodic billiard to estimate this point in the kernel space. We show that BPMs outperform hard margin Support Vector Machines (SVMs) on real world datasets. We introduce a technique that allows the BPM to construct hypotheses with non-zero training error similar to soft margin SVMs with quadratic penelisation of the margin slacks. An experimental study reveals that with decreasing penelisation of training error the improvement of BPMs over SVMs decays, a finding that is explained by geometrical considerations.},
author = {Herbrich, Ralf and Graepel, Thore and Campbell, Colin},
booktitle = {Proceedings of European Symposium on Artificial Neural Networks},
pages = {49--54},
title = {Robust {Bayes} Point Machines},
url = {https://www.herbrich.me/papers/esann00.pdf},
year = {2000}
}

• R. Herbrich, T. Graepel, and C. Campbell, „Bayes Point Machines : Estimating the Bayes Point in Kernel Space,“ in Proceedings of IJCAI Workshop Support Vector Machines, 1999, p. 23–27.

From a Bayesian perspective Support Vector Machines choose the hypothesis corresponding to the largest possible hypersphere that can be inscribed in version space, i.e. in the space of all consistent hypotheses given a training set. Those boundaries of version space which are tangent to the hypersphere define the support vectors. An alternative and potentially better approach is to construct the hypothesis using the whole of version space. This is achieved by using a Bayes Point Machine which finds the midpoint of the region of intersection of all hyperplanes bisecting version space into two halves of equal volume (the Bayes point). It is known that the center of mass of version space approximates the Bayes point. We suggest estimating the center of mass by averaging over the trajectory of a billiard ball bouncing in version space. Experimental results are presented indicating that Bayes Point Machines consistently outperform Support Vector Machines.

@inproceedings{herbrich1999bpm,
abstract = {From a Bayesian perspective Support Vector Machines choose the hypothesis corresponding to the largest possible hypersphere that can be inscribed in version space, i.e. in the space of all consistent hypotheses given a training set. Those boundaries of version space which are tangent to the hypersphere define the support vectors. An alternative and potentially better approach is to construct the hypothesis using the whole of version space. This is achieved by using a Bayes Point Machine which finds the midpoint of the region of intersection of all hyperplanes bisecting version space into two halves of equal volume (the Bayes point). It is known that the center of mass of version space approximates the Bayes point. We suggest estimating the center of mass by averaging over the trajectory of a billiard ball bouncing in version space. Experimental results are presented indicating that Bayes Point Machines consistently outperform Support Vector Machines.},
author = {Herbrich, Ralf and Graepel, Thore and Campbell, Colin},
booktitle = {Proceedings of IJCAI Workshop Support Vector Machines},
pages = {23--27},
title = {{Bayes} Point Machines : Estimating the {Bayes} Point in Kernel Space},
url = {https://www.herbrich.me/papers/ijcai99.pdf},
year = {1999}
}