Bayesian Learning and Decision Making

Approximate Bayesian Inference

Bayesian inference provides a powerful mechanism for data analysis and learning. However, in real-world situations it is rarely possible to perform exact inference; in fact, exact Bayesian inference is in general NP hard. One of the most successful approaches to address this problem is to exploit a factorial structure of both the sampling distribution and the prior. Then, there are a variety of methods that exploit the factorisation for efficient approximations in inference. The Expectation-Propagation (EP) algorithm is a powerful tool for inference in such factor graphs. For discrete distributions, the EP algorithm is also known as Belief Propagation.

1.

Kurle, Richard; Herbrich, Ralf; Januschowski, Tim; Wang, Yuyang; Gasthaus, Jan

On the detrimental effect of invariances in the likelihood for variational inference Proceedings Article

In: Advances in Neural Information Procesing Systems 36, 2022.

2.

Herbrich, Ralf

Distributed, Real-time Bayesian Learning in Online Services Proceedings Article

In: Proceedings of the 6th ACM Conference on Recommender Systems, pp. 203–204, ACM 2012.

Abstract | Links | BibTeX

3.

Herbrich, Ralf

On Gaussian Expectation Propagation Technical Report

Microsoft Research 2005.

Abstract | Links | BibTeX

4.

Herbrich, Ralf

Minimising the Kullback-Leibler Divergence Technical Report

Microsoft Research 2005.

Abstract | Links | BibTeX

Poisson Networks

Modelling structured multivariate point process data has wide ranging applications like understanding neural activity, developing faster file access systems and learning dependencies among servers in large networks. We developed the Poisson network model for representing multivariate structured Poisson processes. In our model each node of the network represents a Poisson process. The novelty of our work is that waiting times of a process are modelled by an exponential distribution with a piecewise constant rate function that depends on the event counts of its parents in the network in a generalised linear way. Our choice of model allows to perform exact sampling from arbitrary structures. We adopt a Bayesian approach for learning the network structure. We also develop fixed point and sampling based approximations for performing inference of rate functions in Poisson networks.

1.

Rajaram, Shyamsundar; Graepel, Thore; Herbrich, Ralf

Poisson-Networks : A Model for Structured Poisson Processes Proceedings Article

In: Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, pp. 277–284, 2004.

Abstract | Links | BibTeX

Informative Vector Machine

We have developed a framework for sparse Gaussian process methods which uses forward selection with criteria based on information-theoretical principles, previously suggested for active learning. In contrast to most previous work on sparse GPs, our goal is not only to learn sparse predictors (which can be evaluated in O(d) rather than O(n), d<n, n the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most O(nd²), and in large real-world classification experiments we show that it can match prediction performance of the popular support vector machine (SVM), yet it requires only a fraction of the training time. In contrast to the SVM, our approximation produces estimates of predictive probabilities (‘error bars’), allows for Bayesian model selection and is less complex in implementation.

1.

Lawrence, Neil; Seeger, Matthias; Herbrich, Ralf

Fast Sparse Gaussian Process Methods: The Informative Vector Machine Proceedings Article

In: Advances in Neural Information Processing Systems 15, pp. 609–616, 2002.

Abstract | Links | BibTeX

Learning with Social Priors

Slow convergence and poor initial accuracy are two problems that plague efforts to use very large feature sets in online learning. We show how these problems can be mitigated if a graph of relationships between features is known. We study this problem in a fully Bayesian setting, focusing on the problem of using Facebook user-IDs as features, with the social network giving the relationship structure. Our analysis uncovers significant problems with the obvious regularizations, and motivates a two-component mixture-model social prior that is provably better.

1.

Chakrabarti, Deepayan; Herbrich, Ralf

Speeding Up Large-Scale Learning with a Social Prior Proceedings Article

In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 650–658, ACM 2013.

Abstract | Links | BibTeX

Kernel Topic Models

Latent Dirichlet Allocation models discrete document data as a mixture of discrete distributions, using Dirichlet beliefs over the mixture weights. We study a variation in which the document’s mixture weight beliefs are replaced with squashed Gaussian distributions. This allows documents to be associated with elements of a Hilbert space, admitting kernel topic models (KTM), modelling temporal, spatial, hierarchical, social and other structure between documents. The main challenge is efficient approximate inference on the latent Gaussian. We present an approximate algorithm cast around a Laplace approximation in a transformed basis. The KTM can also be interpreted as a type of Gaussian process latent variable model, or as a topic model conditional on document features, uncovering links between earlier work in these areas.

1.

Hennig, Philipp; Stern, David; Herbrich, Ralf; Graepel, Thore

Kernel Topic Models Proceedings Article

In: Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 511–519, 2012.

Abstract | Links | BibTeX

Bayesian Online Learning for Multilabel and Multivariate Performance Measures

Many real world applications employ multi-variate performance measures and each example can belong to multiple classes. We propose a Bayesian online multi-label classification framework which learns a probabilistic linear classifier. The likelihood is modeled by a graphical model similar to TrueSkill, and inference is based on Gaussian density filtering with expectation propagation. Using samples from the posterior, we label the testing data by maximizing the expected F1-score. Our experiments on Reuters1-v2 dataset show that the proposed algorithm compares favorably to the state-of-the-art online learners in macro-averaged F1-score and training time.

1.

Zhang, Xinhua; Graepel, Thore; Herbrich, Ralf

Bayesian Online Learning for Multi-label and Multi-variate Performance Measures Proceedings Article

In: Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 956–963, 2010.

Abstract | Links | BibTeX

Bayesian Transduction

We consider the case of binary classification by linear discriminant functions. The simplification of the transduction problem results from the fact that the infinite number of linear discriminants is boiled down to a finite number of equivalence classes on the working set. The number of equivalence classes is bounded from above by the growth function. Each equivalence class corresponds to a polyhedron in parameter space. From a Bayesian point of view, we suggest to measure the prior probability of a labelling of the working set as the volume of the corresponding polyhedron w.r.t. the a-priori distribution in parameter space. Then the maximum a-posterior (MAP) scheme recommends to choose the labelling of maximum volume.

1.

Graepel, Thore; Herbrich, Ralf

The Kernel Gibbs Sampler Proceedings Article

In: Advances in Neural Information Processing Systems 13, pp. 514–520, The MIT Press, 2000.

Abstract | Links | BibTeX

2.

Graepel, Thore; Herbrich, Ralf; Obermayer, Klaus

Bayesian Transduction Proceedings Article

In: Advances in Neural Information Processing Systems 12, pp. 456–462, The MIT Press, 1999.

Abstract | Links | BibTeX

Bayes Point Machines

From a Bayesian perspective, Support Vector Machines choose the hypothesis corresponding to the largest possible hypersphere that can be inscribed in version space, i.e. in the space of all consistent hypotheses given a training set. Those boundaries of version space which are tangent to the hypersphere define the support vectors. An alternative and potentially better approach is to construct the hypothesis using the whole of version space. This is achieved by using a Bayes Point Machine which finds the midpoint of the region of intersection of all hyperplanes bisecting version space into two halves of equal volume (the Bayes point). It is known that the centre of mass of version space approximates the Bayes point. We investigate estimating the centre of mass by averaging over the trajectory of a billiard ball bouncing in version space. Experimental results indicate that Bayes Point Machines consistently outperform Support Vector Machines.

1.

Harrington, Edward; Herbrich, Ralf; Kivinen, Jyrki; Platt, John C; Williamson, Robert C

Online Bayes Point Machines Proceedings Article

In: Proceedings of Advances in Knowledge Discovery and Data Mining, pp. 241–252, 2003.

Abstract | Links | BibTeX

2.

Herbrich, Ralf; Graepel, Thore; Campbell, Colin

Bayes Point Machines Journal Article

In: Journal of Machine Learning Research, vol. 1, pp. 245–279, 2001.

Abstract | Links | BibTeX

3.

Graepel, Thore; Herbrich, Ralf

The Kernel Gibbs Sampler Proceedings Article

In: Advances in Neural Information Processing Systems 13, pp. 514–520, The MIT Press, 2000.

Abstract | Links | BibTeX

4.

Herbrich, Ralf; Graepel, Thore

Large Scale Bayes Point Machines Proceedings Article

In: Advances in Neural Information Processing Systems 13, pp. 528–534, The MIT Press, 2000.

Abstract | Links | BibTeX

5.

Herbrich, Ralf; Graepel, Thore; Campbell, Colin

Robust Bayes Point Machines Proceedings Article

In: Proceedings of European Symposium on Artificial Neural Networks, pp. 49–54, 2000.

Abstract | Links | BibTeX

6.

Herbrich, Ralf; Graepel, Thore; Campbell, Colin

Bayes Point Machines : Estimating the Bayes Point in Kernel Space Proceedings Article

In: Proceedings of IJCAI Workshop Support Vector Machines, pp. 23–27, 1999.

Abstract | Links | BibTeX