Publications

2014

  • X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, and J. Q. Candela, “Practical lessons from predicting clicks on ads at Facebook,” in Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2014, pp. 1–9.
    [BibTeX] [Abstract] [Download PDF]

    Online advertising allows advertisers to only bid and pay for measurable user responses, such as clicks on ads. As a consequence, click prediction systems are central to most on-line advertising systems. With over 750 million daily active users and over 1 million active advertisers, predicting clicks on Facebook ads is a challenging machine learning task. In this paper we introduce a model which combines decision trees with logistic regression, outperforming either of these methods on its own by over 3%, an improvement with significant impact to the overall system performance. We then explore how a number of fundamental parameters impact the final prediction performance of our system. Not surprisingly, the most important thing is to have the right features: those capturing historical information about the user or ad dominate other types of features. Once we have the right features and the right model (decision trees plus logistic regression), other factors play small roles (though even small improvements are important at scale). Picking the optimal handling for data freshness, learning rate schema and data sampling improve the model slightly, though much less than adding a high-value feature, or picking the right model to begin with.

    @inproceedings{He2014,
    abstract = {Online advertising allows advertisers to only bid and pay for measurable user responses, such as clicks on ads. As a consequence, click prediction systems are central to most on-line advertising systems. With over 750 million daily active users and over 1 million active advertisers, predicting clicks on Facebook ads is a challenging machine learning task. In this paper we introduce a model which combines decision trees with logistic regression, outperforming either of these methods on its own by over 3%, an improvement with significant impact to the overall system performance. We then explore how a number of fundamental parameters impact the final prediction performance of our system. Not surprisingly, the most important thing is to have the right features: those capturing historical information about the user or ad dominate other types of features. Once we have the right features and the right model (decision trees plus logistic regression), other factors play small roles (though even small improvements are important at scale). Picking the optimal handling for data freshness, learning rate schema and data sampling improve the model slightly, though much less than adding a high-value feature, or picking the right model to begin with.},
    author = {He, Xinran and Pan, Junfeng and Jin, Ou and Xu, Tianbing and Liu, Bo and Xu, Tao and Shi, Yanxin and Atallah, Antoine and Herbrich, Ralf and Bowers, Stuart and Candela, Joaquin Quiñonero},
    booktitle = {Proceedings of 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
    pages = {1--9},
    title = {{Practical lessons from predicting clicks on ads at Facebook}},
    url = {http://www.herbrich.me/papers/adclicksfacebook.pdf},
    organization={ACM},
    year = {2014}
    }
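
    The tree-plus-logistic-regression hybrid described in the entry above can be illustrated with a short scikit-learn sketch (my own simplification, not the paper's production system): boosted-tree leaf indices serve as one-hot categorical features for a downstream logistic regression.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OneHotEncoder

    # Synthetic stand-in for impression data with binary click labels.
    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_tree, X_lr, y_tree, y_lr = train_test_split(X, y, test_size=0.5, random_state=0)

    # Stage 1: boosted trees partition the input space.
    gbdt = GradientBoostingClassifier(n_estimators=100, max_depth=4, random_state=0)
    gbdt.fit(X_tree, y_tree)

    # Stage 2: the leaf each example reaches in each tree becomes a categorical
    # feature; a logistic regression is trained on the one-hot encoded leaves.
    encoder = OneHotEncoder(handle_unknown="ignore")
    leaves = gbdt.apply(X_lr)[:, :, 0]                  # shape: (n_samples, n_trees)
    lr = LogisticRegression(max_iter=1000)
    lr.fit(encoder.fit_transform(leaves), y_lr)

    # Prediction: raw features -> leaf indices -> one-hot -> click probability.
    ctr = lr.predict_proba(encoder.transform(gbdt.apply(X_lr)[:, :, 0]))[:, 1]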

2013

  • D. Chakrabarti and R. Herbrich, “Speeding up large-scale learning with a social prior,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013, pp. 650–658.
    [BibTeX] [Abstract] [Download PDF]

    Slow convergence and poor initial accuracy are two problems that plague efforts to use very large feature sets in online learning. This is especially true when only a few features are “active” in any training example, and the frequency of activations of different features is skewed. We show how these problems can be mitigated if a graph of relationships between features is known. We study this problem in a fully Bayesian setting, focusing on the problem of using Facebook user-IDs as features, with the social network giving the relationship structure. Our analysis uncovers significant problems with the obvious regularizations, and motivates a two-component mixture-model “social prior” that is provably better. Empirical results on large-scale click prediction problems show that our algorithm can learn as well as the baseline with 12M fewer training examples, and continuously outperforms it for over 60M examples. On a second problem using binned features, our model outperforms the baseline even after the latter sees 5x as much data.

    @inproceedings{chakrabarti2013speeding,
    abstract = {Slow convergence and poor initial accuracy are two problems that plague efforts to use very large feature sets in online learning. This is especially true when only a few features are “active” in any training example, and the frequency of activations of different features is skewed. We show how these problems can be mitigated if a graph of relationships between features is known. We study this problem in a fully Bayesian setting, focusing on the problem of using Facebook user-IDs as features, with the social network giving the relationship structure. Our analysis uncovers significant problems with the obvious regularizations, and motivates a two-component mixture-model “social prior” that is provably better. Empirical results on large-scale click prediction problems show that our algorithm can learn as well as the baseline with 12M fewer training examples, and continuously outperforms it for over 60M examples. On a second problem using binned features, our model outperforms the baseline even after the latter sees 5x as much data.},
    author = {Chakrabarti, Deepayan and Herbrich, Ralf},
    booktitle = {Proceedings of the 19th ACM SIGKDD international conference on Knowledge Discovery and Data Mining},
    pages = {650--658},
    title = {{Speeding up large-scale learning with a social prior}},
    url = {http://www.herbrich.me/papers/kdd13-socialprior.pdf},
    organization={ACM},
    year = {2013}
    }
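
    As a rough illustration of the social-prior idea in the entry above (the function name, mixture weight, and moment matching are my assumptions, not the paper's exact construction), a cold-start user's Gaussian prior can be built as a two-component mixture of a global prior and a component centred on the friends' posterior means:

    import numpy as np

    def social_prior(friend_means, friend_vars, global_mean=0.0, global_var=1.0, mix=0.7):
        """Prior (mean, var) for a new user-ID weight, mixing a global and a social component."""
        if len(friend_means) == 0:
            return global_mean, global_var                 # no network signal: fall back
        social_mean = float(np.mean(friend_means))
        # Social component spread: average friend uncertainty plus disagreement of the means.
        social_var = float(np.mean(friend_vars) + np.var(friend_means))
        # Moment-match the two-component mixture to a single Gaussian.
        mean = mix * social_mean + (1 - mix) * global_mean
        second = mix * (social_var + social_mean ** 2) + (1 - mix) * (global_var + global_mean ** 2)
        return mean, second - mean ** 2

    # Example: a new user whose three friends have learned weights near +0.8.
    print(social_prior([0.7, 0.9, 0.8], [0.1, 0.2, 0.1]))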

2012

  • M. Gartrell, U. Paquet, and R. Herbrich, “A Bayesian treatment of social links in recommender systems,” CU Technical Report CU-CS-1092-12, 2012.
    [BibTeX] [Abstract] [Download PDF]

    Recommender systems are increasingly driving user experiences on the Internet. This personalization is often achieved through the factorization of a large but sparse observation matrix of user-item feedback signals. In instances where the user’s social network is known, its inclusion can significantly improve recommendations for cold start users. There are numerous ways in which the network can be incorporated into a probabilistic graphical model. We propose and investigate two ways for including a social network, either as a Markov Random Field that describes a user similarity in the prior over user features, or an explicit model that treats social links as observations. State of the art performance is reported on the Flixster online social network dataset.

    @techreport{gartrell2012bayesian,
    abstract = {Recommender systems are increasingly driving user experiences on the Internet. This personalization is often achieved through the factorization of a large but sparse observation matrix of user-item feedback signals. In instances where the user’s social network is known, its inclusion can significantly improve recommendations for cold start users. There are numerous ways in which the network can be incorporated into a probabilistic graphical model. We propose and investigate two ways for including a social network, either as a Markov Random Field that describes a user similarity in the prior over user features, or an explicit model that treats social links as observations. State of the art performance is reported on the Flixster online social network dataset.},
    author = {Gartrell, Mike and Paquet, Ulrich and Herbrich, Ralf},
    institution={CU Technical Report CU-CS-1092-12},
    title = {{A Bayesian treatment of social links in recommender systems}},
    url = {http://www.herbrich.me/papers/ucb511101092internet.pdf},
    year = {2012}
    }

  • L. Dietz, B. Gamari, J. Guiver, E. Snelson, and R. Herbrich, “De-Layering Social Networks by Shared Tastes of Friendships,” in ICWSM, 2012.
    [BibTeX] [Abstract] [Download PDF]

    Traditionally, social network analyses are applied to data from a particular social domain. With the advent of online social networks such as Facebook, we observe an aggregate of various social domains resulting in a layered mix of professional contacts, family ties, and different circles. These aggregates dilute the community structure. We provide a method for de-layering social networks according to shared interests. Instead of relying on changes in the edge density, our shared taste model uses content of users to disambiguate the underlying shared interest of each friendship. We successfully de-layer real world networks from LibraryThing and Boards.ie, obtaining topics that significantly outperform LDA on unsupervised prediction of group membership.

    @inproceedings{dietz2012layering,
    abstract = {Traditionally, social network analyses are applied to data from a particular social domain. With the advent of online social networks such as Facebook, we observe an aggregate of various social domains resulting in a layered mix of professional contacts, family ties, and different circles. These aggregates dilute the community structure. We provide a method for de-layering social networks according to shared interests. Instead of relying on changes in the edge density, our shared taste model uses content of users to disambiguate the underlying shared interest of each friendship. We successfully de-layer real world networks from LibraryThing and Boards.ie, obtaining topics that significantly outperform LDA on unsupervised prediction of group membership.},
    author = {Dietz, Laura and Gamari, Ben and Guiver, John and Snelson, Edward and Herbrich, Ralf},
    booktitle = {ICWSM},
    title = {{De-Layering Social Networks by Shared Tastes of Friendships}},
    url = {http://www.herbrich.me/papers/icwsm2012.pdf},
    year = {2012}
    }

  • K. El-Arini, U. Paquet, R. Herbrich, J. Van Gael, and B. Agüera y Arcas, “Transparent user models for personalization,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 678–686.
    [BibTeX] [Abstract] [Download PDF]

    Personalization is a ubiquitous phenomenon in our daily online experience. While such technology is critical for helping us combat the overload of information we face, in many cases, we may not even realize that our results are being tailored to our personal tastes and preferences. Worse yet, when such a system makes a mistake, we have little recourse to correct it.

    @inproceedings{el2012transparent,
    abstract = {Personalization is a ubiquitous phenomenon in our daily online experience. While such technology is critical for helping us combat the overload of information we face, in many cases, we may not even realize that our results are being tailored to our personal tastes and preferences. Worse yet, when such a system makes a mistake, we have little recourse to correct it.},
    author = {El-Arini, Khalid and Paquet, Ulrich and Herbrich, Ralf and Van Gael, Jurgen and Ag{\"u}era y Arcas, Blaise},
    booktitle = {Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
    pages = {678--686},
    title = {{Transparent user models for personalization}},
    url = {http://www.herbrich.me/papers/kdd2012-personalization.pdf},
    organization={ACM},
    year = {2012}
    }

  • R. Herbrich, “Distributed, real-time Bayesian learning in online services,” in Proceedings of the Sixth ACM Conference on Recommender Systems, 2012, pp. 203–204.
    [BibTeX] [Abstract] [Download PDF]

    The last ten years have seen a tremendous growth in Internet-based online services such as search, advertising, gaming and social networking. Today, it is important to analyze large collections of user interaction data as a first step in building predictive models for these services as well as learn these models in real-time. One of the biggest challenges in this setting is scale: not only does the sheer scale of data necessitate parallel processing but it also necessitates distributed models; with over 900 million active users at Facebook, any user-specific sets of features in a linear or non-linear model yields models of a size bigger than can be stored in a single system. In this talk, I will give a hands-on introduction to one of the most versatile tools for handling large collections of data with distributed probabilistic models: the sum-product algorithm for approximate message passing in factor graphs. I will discuss the application of this algorithm for the specific case of generalized linear models and outline the challenges of both approximate and distributed message passing including an in-depth discussion of expectation propagation and Map-Reduce.

    @inproceedings{herbrich2012distributed,
    abstract = {The last ten years have seen a tremendous growth in Internet-based online services such as search, advertising, gaming and social networking. Today, it is important to analyze large collections of user interaction data as a first step in building predictive models for these services as well as learn these models in real-time. One of the biggest challenges in this setting is scale: not only does the sheer scale of data necessitate parallel processing but it also necessitates distributed models; with over 900 million active users at Facebook, any user-specific sets of features in a linear or non-linear model yields models of a size bigger than can be stored in a single system. In this talk, I will give a hands-on introduction to one of the most versatile tools for handling large collections of data with distributed probabilistic models: the sum-product algorithm for approximate message passing in factor graphs. I will discuss the application of this algorithm for the specific case of generalized linear models and outline the challenges of both approximate and distributed message passing including an in-depth discussion of expectation propagation and Map-Reduce.},
    author = {Herbrich, Ralf},
    booktitle = {Proceedings of the Sixth ACM Conference on Recommender Systems},
    pages = {203--204},
    title = {{Distributed, real-time Bayesian learning in online services}},
    url = {http://www.herbrich.me/papers/recsys2012.pdf},
    organization={ACM},
    year = {2012}
    }

2011

  • A. Passos, J. Van Gael, R. Herbrich, and U. Paquet, “A penny for your thoughts? The value of information in recommendation systems,” in NIPS Workshop on Bayesian Optimization, Experimental Design, and Bandits, 2011, pp. 9–14.
    [BibTeX] [Abstract] [Download PDF]

    Most recommendation systems are trained to predict behavioral data and then used to generate more such data by recommending items and receiving feedback on the quality of these recommendations. This data is then fed back into the training process. This creates a feedback loop: as long as the low-cost way to interact with the service is through the recommender, the recommender will only ever see behavioral data on the items it chooses. This process can lead to hidden biases, as it effectively limits how much information the recommender system will ever see. On the other hand, there is a cost to making exploratory recommendations, as they should, myopically, be worse than the best bets of a recommendation system. In this paper we explore the notion that recommender systems are a special kind of active learning agents, with the peculiarity that the cost of asking for the label of an instance depends on its true label, as the cost of showing a bad recommendation when exploring is higher than the cost of showing a good recommendation.

    @inproceedings{Passos2011,
    abstract = {Most recommendation systems are trained to predict behavioral data and then used to generate more such data by recommending items and receiving feedback on the quality of these recommendations. This data is then fed back into the training process. This creates a feedback loop: as long as the low-cost way to interact with the service is through the recommender, the recommender will only ever see behavioral data on the items it chooses. This process can lead to hidden biases, as it effectively limits how much information the recommender system will ever see. On the other hand, there is a cost to making exploratory recommendations, as they should, myopically, be worse than the best bets of a recommendation system. In this paper we explore the notion that recommender systems are a special kind of active learning agents, with the peculiarity that the cost of asking for the label of an instance depends on its true label, as the cost of showing a bad recommendation when exploring is higher than the cost of showing a good recommendation.},
    author = {Passos, Alexandre and Van Gael, Juergen and Herbrich, Ralf and Paquet, Ulrich},
    booktitle = {NIPS Workshop on Bayesian Optimization, Experimental Design, and Bandits},
    pages = {9--14},
    title = {{A penny for your thoughts? The value of information in recommendation systems}},
    url = {http://www.herbrich.me/papers/passos11penny.pdf},
    year = {2011}
    }

  • W. Cheng, G. Kasneci, T. Graepel, D. H. Stern, and R. Herbrich, “Automated Feature Generation From Structured Knowledge,” in Proceedings of the 20th ACM Conference on Information and Knowledge Management, Glasgow, 2011, pp. 1395–1404.
    [BibTeX] [Abstract] [Download PDF]

    The prediction accuracy of any learning algorithm highly depends on the quality of the selected features; but often, the task of feature construction and selection is tedious and non-scalable. In recent years, however, there have been numerous projects with the goal of constructing general-purpose or domain-specific knowledge bases with entity-relationship-entity triples extracted from various Web sources or collected from user communities, e.g., YAGO, DBpedia, Freebase, UMLS, etc. This paper advocates the simple and yet far-reaching idea that the structured knowledge contained in such knowledge bases can be exploited to automatically extract features for general learning tasks. We introduce an expressive graph-based language for extracting features from such knowledge bases and a theoretical framework for constructing feature vectors from the extracted features. Our experimental evaluation on different learning scenarios provides evidence that the features derived through our framework can considerably improve the prediction accuracy, especially when the labeled data at hand is sparse.

    @inproceedings{ChengKGSH11,
    abstract = {The prediction accuracy of any learning algorithm highly depends on the quality of the selected features; but often, the task of feature construction and selection is tedious and non-scalable. In recent years, however, there have been numerous projects with the goal of constructing general-purpose or domain-specific knowledge bases with entity-relationship-entity triples extracted from various Web sources or collected from user communities, e.g., YAGO, DBpedia, Freebase, UMLS, etc. This paper advocates the simple and yet far-reaching idea that the structured knowledge contained in such knowledge bases can be exploited to automatically extract features for general learning tasks. We introduce an expressive graph-based language for extracting features from such knowledge bases and a theoretical framework for constructing feature vectors from the extracted features. Our experimental evaluation on different learning scenarios provides evidence that the features derived through our framework can considerably improve the prediction accuracy, especially when the labeled data at hand is sparse.},
    address = {Glasgow},
    author = {Cheng, Weiwei and Kasneci, Gjergji and Graepel, Thore and Stern, David H and Herbrich, Ralf},
    booktitle = {Proceedings of the 20th ACM Conference on Information and Knowledge Management},
    pages = {1395--1404},
    title = {{Automated Feature Generation From Structured Knowledge}},
    url = {http://www.herbrich.me/papers/cikmfp0337-cheng.pdf},
    year = {2011}
    }
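
    A toy sketch of deriving features from entity-relationship-entity triples (the path syntax and helper below are hypothetical and far simpler than the paper's graph-based language):

    from collections import defaultdict

    triples = [
        ("Berlin", "locatedIn", "Germany"),
        ("Germany", "memberOf", "EU"),
        ("Munich", "locatedIn", "Germany"),
    ]

    outgoing = defaultdict(list)
    for subj, rel, obj in triples:
        outgoing[subj].append((rel, obj))

    def entity_features(entity, max_hops=2):
        """One- and two-hop relation paths rooted at `entity`, as sparse feature strings."""
        feats = set()
        for rel1, obj1 in outgoing[entity]:
            feats.add(f"{rel1}={obj1}")                       # one-hop feature
            if max_hops >= 2:
                for rel2, obj2 in outgoing[obj1]:
                    feats.add(f"{rel1}/{rel2}={obj2}")        # two-hop path feature
        return sorted(feats)

    print(entity_features("Berlin"))
    # ['locatedIn/memberOf=EU', 'locatedIn=Germany']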

  • P. Hennig, D. Stern, R. Herbrich, and T. Graepel, “Kernel Topic Models,” CoRR, vol. abs/1110.4713, p. 9, 2011.
    [BibTeX] [Abstract] [Download PDF]

    Latent Dirichlet Allocation models discrete data as a mixture of discrete distributions, using Dirichlet beliefs over the mixture weights. We study a variation of this concept, in which the documents’ mixture weight beliefs are replaced with squashed Gaussian distributions. This allows documents to be associated with elements of a Hilbert space, admitting kernel topic models (KTM), modelling temporal, spatial, hierarchical, social and other structure between documents. The main challenge is efficient approximate inference on the latent Gaussian. We present an approximate algorithm cast around a Laplace approximation in a transformed basis. The KTM can also be interpreted as a type of Gaussian process latent variable model, or as a topic model conditional on document features, uncovering links between earlier work in these areas.

    @article{DBLP:journals/corr/abs-1110-4713,
    abstract = {Latent Dirichlet Allocation models discrete data as a mixture of discrete distributions, using Dirichlet beliefs over the mixture weights. We study a variation of this concept, in which the documents' mixture weight beliefs are replaced with squashed Gaussian distributions. This allows documents to be associated with elements of a Hilbert space, admitting kernel topic models (KTM), modelling temporal, spatial, hierarchical, social and other structure between documents. The main challenge is efficient approximate inference on the latent Gaussian. We present an approximate algorithm cast around a Laplace approximation in a transformed basis. The KTM can also be interpreted as a type of Gaussian process latent variable model, or as a topic model conditional on document features, uncovering links between earlier work in these areas.},
    author = {Hennig, Philipp and Stern, David and Herbrich, Ralf and Graepel, Thore},
    journal = {CoRR},
    pages = {9},
    title = {{Kernel Topic Models}},
    url = {http://www.herbrich.me/papers/ktm.pdf},
    volume = {abs/1110.4713},
    year = {2011}
    }

  • P. Kohli, Y. Bachrach, T. Graepel, G. Smyth, M. Armstrong, D. Stillwell, and M. Kearns, “Behavioral Game Theory on Online Social Networks: Colonel Blotto is on Facebook,” in Proceedings of the Second Workshop on Computational Social Science and the Wisdom of Crowds, 2011.
    [BibTeX] [Abstract] [Download PDF]

    We show how online social networks such as Facebook can be used in Behavioral Game Theory research. We report the deployment of a Facebook application ‘Project Waterloo’ that allows users to play the Colonel Blotto game against their friends and strangers. Unlike conventional studies performed in the laboratory environment, which rely on monetary incentives to attract human subjects to play games, our framework does not use money and instead relies on reputation and entertainment incentives. We describe the Facebook application we created for conducting this experiment, and perform a preliminary analysis of the data collected in the game. We conclude by discussing the advantages of our approach and list some ideas for future work.

    @inproceedings{Kohli2011,
    abstract = {We show how online social networks such as Facebook can be used in Behavioral Game Theory research. We report the deployment of a Facebook application ‘Project Waterloo’ that allows users to play the Colonel Blotto game against their friends and strangers. Unlike conventional studies performed in the laboratory environment, which rely on monetary incentives to attract human subjects to play games, our framework does not use money and instead relies on reputation and entertainment incentives. We describe the Facebook application we created for conducting this experiment, and perform a preliminary analysis of the data collected in the game. We conclude by discussing the advantages of our approach and list some ideas for future work.},
    author = {Kohli, Pushmeet and Bachrach, Yoram and Graepel, Thore and Smyth, Gavin and Armstrong, Michael and Stillwell, David and Kearns, Michael},
    booktitle = {Proceedings of the Second Workshop on Computational Social Science and the Wisdom of Crowds},
    title = {{Behavioral Game Theory on Online Social Networks: Colonel Blotto is on Facebook}},
    url = {http://www.herbrich.me/papers/nips_blotto_2011.pdf},
    year = {2011}
    }

  • Y. Xu, X. Cao, A. Sellen, R. Herbrich, and T. Graepel, “Sociable killers: understanding social relationships in an online first-person shooter game,” in Proceedings of the 2011 ACM Conference on Computer Supported Cooperative Work, 2011, pp. 197–206.
    [BibTeX] [Abstract] [Download PDF]

    Online video games can be seen as a medium for the formation and maintenance of social relationships. In this paper, we explore what social relationships mean under the context of online First-Person Shooter (FPS) games, how these relationships influence game experience, and how players manage them. We combine qualitative interview and quantitative game log data, and find that despite the gap between the non-persistent game world and potentially persistent social relationships, a diversity of social relationships emerge and play a central role in the enjoyment of online FPS games. We report the forms, development, and impact of such relationships, and discuss our findings in light of design implications and comparison with other game genres.

    @inproceedings{DBLP:conf/cscw/XuCSHG11,
    abstract = {Online video games can be seen as a medium for the formation and maintenance of social relationships. In this paper, we explore what social relationships mean under the context of online First-Person Shooter (FPS) games, how these relationships influence game experience, and how players manage them. We combine qualitative interview and quantitative game log data, and find that despite the gap between the non-persistent game world and potentially persistent social relationships, a diversity of social relationships emerge and play a central role in the enjoyment of online FPS games. We report the forms, development, and impact of such relationships, and discuss our findings in light of design implications and comparison with other game genres.},
    author = {Xu, Yan and Cao, Xiang and Sellen, Abigail and Herbrich, Ralf and Graepel, Thore},
    booktitle = {Proceedings of the 2011 ACM Conference on Computer Supported Cooperative Work},
    pages = {197--206},
    title = {{Sociable killers: understanding social relationships in an online first-person shooter game}},
    url = {http://www.herbrich.me/papers/CSCW11.pdf},
    year = {2011}
    }

2010

  • Y. Bachrach and R. Herbrich, “Fingerprinting Ratings for Collaborative Filtering – Theoretical and Empirical Analysis,” in Proceedings of the 17th International Symposium on String Processing and Information Retrieval, Los Cabos, 2010, pp. 25–36.
    [BibTeX] [Abstract] [Download PDF]

    We consider fingerprinting methods for collaborative filtering (CF) systems. In general, CF systems show their real strength when supplied with enormous data sets. Earlier work already suggests sketching techniques to handle massive amounts of information, but most prior analysis has so far been limited to non-ranking application scenarios and has focused mainly on a theoretical analysis. We demonstrate how to use fingerprinting methods to compute a family of rank correlation coefficients. Our methods allow identifying users who have similar rankings over a certain set of items, a problem that lies at the heart of CF applications. We show that our method allows approximating rank correlations with high accuracy and confidence. We examine the suggested methods empirically through a recommender system for the Netflix dataset, showing that the required fingerprint sizes are even smaller than the theoretical analysis suggests. We also explore the use of standard hash functions rather than min-wise independent hashes and the relation between the quality of the final recommendations and the fingerprint size.

    @inproceedings{DBLP:conf/spire/BachrachH10,
    abstract = {We consider fingerprinting methods for collaborative filtering (CF) systems. In general, CF systems show their real strength when supplied with enormous data sets. Earlier work already suggests sketching techniques to handle massive amounts of information, but most prior analysis has so far been limited to non-ranking application scenarios and has focused mainly on a theoretical analysis. We demonstrate how to use fingerprinting methods to compute a family of rank correlation coefficients. Our methods allow identifying users who have similar rankings over a certain set of items, a problem that lies at the heart of CF applications. We show that our method allows approximating rank correlations with high accuracy and confidence. We examine the suggested methods empirically through a recommender system for the Netflix dataset, showing that the required fingerprint sizes are even smaller than the theoretical analysis suggests. We also explore the use of standard hash functions rather than min-wise independent hashes and the relation between the quality of the final recommendations and the fingerprint size.},
    address = {Los Cabos},
    author = {Bachrach, Yoram and Herbrich, Ralf},
    booktitle = {Proceedings of 17th International Symposium on String Processing and Information Retrieval},
    pages = {25--36},
    title = {{Fingerprinting Ratings for Collaborative Filtering - Theoretical and Empirical Analysis}},
    url = {http://www.herbrich.me/papers/spire10.pdf},
    year = {2010}
    }

  • T. Graepel, J. Q. Candela, T. Borchert, and R. Herbrich, “Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft’s Bing Search Engine,” in Proceedings of the 27th International Conference on Machine Learning, Haifa, 2010, pp. 13–20.
    [BibTeX] [Abstract] [Download PDF]

    We describe a new Bayesian click-through rate (CTR) prediction algorithm used for Sponsored Search in Microsoft’s Bing search engine. The algorithm is based on a probit regression model that maps discrete or real-valued input features to probabilities. It maintains Gaussian beliefs over weights of the model and performs Gaussian online updates derived from approximate message passing. Scalability of the algorithm is ensured through a principled weight pruning procedure and an approximate parallel implementation. We discuss the challenges arising from evaluating and tuning the predictor as part of the complex system of sponsored search where the predictions made by the algorithm decide about future training sample composition. Finally, we show experimental results from the production system and compare to a calibrated Naïve Bayes algorithm.

    @inproceedings{DBLP:conf/icml/GraepelCBH10,
    abstract = {We describe a new Bayesian click-through rate (CTR) prediction algorithm used for Sponsored Search in Microsoft’s Bing search engine. The algorithm is based on a probit regression model that maps discrete or real-valued input features to probabilities. It maintains Gaussian beliefs over weights of the model and performs Gaussian online updates derived from approximate message passing. Scalability of the algorithm is ensured through a principled weight pruning procedure and an approximate parallel implementation. We discuss the challenges arising from evaluating and tuning the predictor as part of the complex system of sponsored search where the predictions made by the algorithm decide about future training sample composition. Finally, we show experimental results from the production system and compare to a calibrated Na\"{\i}ve Bayes algorithm.},
    address = {Haifa},
    author = {Graepel, Thore and Candela, Joaquin Qui\~{n}onero and Borchert, Thomas and Herbrich, Ralf},
    booktitle = {Proceedings of the 27th International Conference on Machine Learning},
    pages = {13--20},
    title = {{Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine}},
    url = {http://www.herbrich.me/papers/adpredictor.pdf},
    year = {2010}
    }
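
    The core online update of such a Bayesian probit model can be sketched in a few lines (an assumed-density-filtering update in the spirit of the entry above; the constants, naming, and demo data are mine, not the production system's):

    import numpy as np
    from scipy.stats import norm

    BETA = 1.0  # scale of the probit link noise

    def update(mu, sigma2, x, y):
        """One Gaussian belief update for active feature indices x and label y in {-1, +1}.

        mu, sigma2: per-weight Gaussian means/variances (1-D arrays, updated in place).
        """
        s2 = BETA ** 2 + sigma2[x].sum()            # total variance of the score
        s = np.sqrt(s2)
        t = y * mu[x].sum() / s
        v = norm.pdf(t) / norm.cdf(t)               # truncated-Gaussian correction terms
        w = v * (v + t)
        mu[x] += y * (sigma2[x] / s) * v            # shift means toward the observed label
        sigma2[x] *= 1.0 - (sigma2[x] / s2) * w     # beliefs become more certain

    def predict(mu, sigma2, x):
        """Predicted click probability for active feature indices x."""
        return norm.cdf(mu[x].sum() / np.sqrt(BETA ** 2 + sigma2[x].sum()))

    # Tiny demo: repeated clicks on impressions with feature 0 plus a bias feature 2.
    mu, sigma2 = np.zeros(3), np.ones(3)
    for _ in range(50):
        update(mu, sigma2, np.array([0, 2]), +1)
    print(predict(mu, sigma2, np.array([0, 2])))    # well above 0.5 after the clicks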

  • G. Kasneci, J. Van Gael, R. Herbrich, and T. Graepel, “Bayesian Knowledge Corroboration with Logical Rules and User Feedback,” in Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, Barcelona, 2010, pp. 1–18.
    [BibTeX] [Abstract] [Download PDF]

    Current knowledge bases suffer from either low coverage or low accuracy. The underlying hypothesis of this work is that user feedback can greatly improve the quality of automatically extracted knowledge bases. The feedback could help quantify the uncertainty associated with the stored statements and would enable mechanisms for searching, ranking and reasoning at entity-relationship level. Most importantly, a principled model for exploiting user feedback to learn the truth values of statements in the knowledge base would be a major step forward in addressing the issue of knowledge base curation. We present a family of probabilistic graphical models that builds on user feedback and logical inference rules derived from the popular Semantic-Web formalism of RDFS. Through internal inference and belief propagation, these models can learn both, the truth values of the statements in the knowledge base and the reliabilities of the users who give feedback. We demonstrate the viability of our approach in extensive experiments on real-world datasets, with feedback collected from Amazon Mechanical Turk.

    @inproceedings{DBLP:conf/pkdd/KasneciGHG10,
    abstract = {Current knowledge bases suffer from either low coverage or low accuracy. The underlying hypothesis of this work is that user feedback can greatly improve the quality of automatically extracted knowledge bases. The feedback could help quantify the uncertainty associated with the stored statements and would enable mechanisms for searching, ranking and reasoning at entity-relationship level. Most importantly, a principled model for exploiting user feedback to learn the truth values of statements in the knowledge base would be a major step forward in addressing the issue of knowledge base curation. We present a family of probabilistic graphical models that builds on user feedback and logical inference rules derived from the popular Semantic-Web formalism of RDFS. Through internal inference and belief propagation, these models can learn both, the truth values of the statements in the knowledge base and the reliabilities of the users who give feedback. We demonstrate the viability of our approach in extensive experiments on real-world datasets, with feedback collected from Amazon Mechanical Turk.},
    address = {Barcelona},
    author = {Kasneci, Gjergji and {Van Gael}, Jurgen and Herbrich, Ralf and Graepel, Thore},
    booktitle = {Proceedings of European Conference on Machine Learning and Knowledge Discovery in Databases},
    pages = {1--18},
    publisher = {Springer},
    title = {{Bayesian Knowledge Corroboration with Logical Rules and User Feedback}},
    url = {http://www.herbrich.me/papers/ecml10.pdf},
    year = {2010}
    }
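
    A heavily simplified stand-in for the corroboration idea in the entry above (no logical rules, a single reliability parameter per user; all names and priors are illustrative assumptions) alternates between estimating statement truth probabilities and user reliabilities:

    import numpy as np

    # feedback[u][s] = 1 if user u asserted statement s is true, 0 if false.
    feedback = {
        "alice": {"s1": 1, "s2": 1, "s3": 0},
        "bob":   {"s1": 1, "s2": 0, "s3": 0},
        "carol": {"s1": 0, "s2": 1, "s3": 0},
    }
    statements = sorted({s for votes in feedback.values() for s in votes})

    truth = {s: 0.5 for s in statements}          # prior belief per statement
    reliability = {u: 0.7 for u in feedback}      # prior reliability per user

    for _ in range(20):
        # Update truth beliefs: weight each vote by the voter's current reliability.
        for s in statements:
            log_odds = 0.0
            for u, votes in feedback.items():
                if s in votes:
                    r = reliability[u]
                    log_odds += np.log(r / (1 - r)) * (1 if votes[s] == 1 else -1)
            truth[s] = 1 / (1 + np.exp(-log_odds))
        # Update reliabilities: expected agreement with the current truth estimates.
        for u, votes in feedback.items():
            agreement = [truth[s] if v == 1 else 1 - truth[s] for s, v in votes.items()]
            reliability[u] = float(np.clip(np.mean(agreement), 0.05, 0.95))

    print({s: round(p, 2) for s, p in truth.items()})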

  • U. Paquet, J. Van Gael, D. Stern, G. Kasneci, R. Herbrich, and T. Graepel, “Vuvuzelas & Active Learning for Online Classification,” in Proceedings of the Computational Social Science and the Wisdom of Crowds Workshop, Vancouver, 2010.
    [BibTeX] [Abstract] [Download PDF]

    Many online service systems leverage user-generated content from Web 2.0 style platforms such as Wikipedia, Twitter, Facebook, and many more. Often, the value lies in the freshness of this information (e.g. tweets, event-based articles, blog posts, etc.). This freshness poses a challenge for supervised learning models as they frequently have to deal with previously unseen features. In this paper we address the problem of online classification for tweets, namely, how can a classifier be updated in an online manner, so that it can correctly classify the latest “hype” on Twitter? We propose a two-step strategy to solve this problem. The first step follows an active learning strategy that enables the selection of tweets for which a label would be most useful; the selected tweet is then forwarded to Amazon Mechanical Turk where it is labeled by multiple users. The second step builds on a Bayesian corroboration model that aggregates the noisy labels provided by the users by taking their reliabilities into account.

    @inproceedings{Paquet2010,
    abstract = {Many online service systems leverage user-generated content from Web 2.0 style platforms such as Wikipedia, Twitter, Facebook, and many more. Often, the value lies in the freshness of this information (e.g. tweets, event-based articles, blog posts, etc.). This freshness poses a challenge for supervised learning models as they frequently have to deal with previously unseen features. In this paper we address the problem of online classification for tweets, namely, how can a classifier be updated in an online manner, so that it can correctly classify the latest “hype” on Twitter? We propose a two-step strategy to solve this problem. The first step follows an active learning strategy that enables the selection of tweets for which a label would be most useful; the selected tweet is then forwarded to Amazon Mechanical Turk where it is labeled by multiple users. The second step builds on a Bayesian corroboration model that aggregates the noisy labels provided by the users by taking their reliabilities into account.},
    address = {Vancouver},
    author = {Paquet, Ulrich and {Van Gael}, Jurgen and Stern, David and Kasneci, Gjergji and Herbrich, Ralf and Graepel, Thore},
    booktitle = {Proceedings of Computational Social Science and the Wisdom of Crowds Workshop},
    title = {{Vuvuzelas \& Active Learning for Online Classification}},
    url = {http://www.herbrich.me/papers/vuvuzela.pdf},
    year = {2010}
    }
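
    The two-step loop from the abstract above can be sketched as follows (plain uncertainty sampling and a majority vote stand in for the paper's Bayesian corroboration of Mechanical Turk labels; all names and numbers are illustrative):

    import numpy as np

    def select_tweet(probabilities):
        """Pick the unlabeled tweet the classifier is least sure about (max entropy)."""
        p = np.asarray(probabilities)
        entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
        return int(np.argmax(entropy))

    def aggregate_labels(worker_labels):
        """Combine noisy 0/1 labels from several workers (simple majority here)."""
        return int(np.mean(worker_labels) >= 0.5)

    current_probs = [0.95, 0.52, 0.10, 0.80]      # model confidence per unlabeled tweet
    idx = select_tweet(current_probs)             # -> 1, the most ambiguous tweet
    label = aggregate_labels([1, 1, 0])           # three crowd workers answered
    print(idx, label)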

  • D. Stern, H. Samulowitz, R. Herbrich, T. Graepel, L. Pulina, and A. Tacchella, “Collaborative Expert Portfolio Management,” in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, 2010.
    [BibTeX] [Abstract] [Download PDF]

    We consider the task of assigning experts from a portfolio of specialists in order to solve a set of tasks. We apply a Bayesian model which combines collaborative filtering with a feature-based description of tasks and experts to yield a general framework for managing a portfolio of experts. The model learns an embedding of tasks and problems into a latent space in which affinity is measured by the inner product. The model can be trained incrementally and can track non-stationary data, tracking potentially changing expert and task characteristics. The approach allows us to use a principled decision theoretic framework for expert selection, allowing the user to choose a utility function that best suits their objectives. The model component for taking into account the performance feedback data is pluggable, allowing flexibility. We apply the model to manage a portfolio of algorithms to solve hard combinatorial problems. This is a well-studied area and we demonstrate a large improvement on the state of the art in one domain (constraint solving), and in a second domain (combinatorial auctions) we created a portfolio that performed significantly better than any single algorithm.

    @inproceedings{DBLP:conf/aaai/SternSHGPT10,
    abstract = {We consider the task of assigning experts from a portfolio of specialists in order to solve a set of tasks. We apply a Bayesian model which combines collaborative filtering with a feature-based description of tasks and experts to yield a general framework for managing a portfolio of experts. The model learns an embedding of tasks and problems into a latent space in which affinity is measured by the inner product. The model can be trained incrementally and can track non-stationary data, tracking potentially changing expert and task characteristics. The approach allows us to use a principled decision theoretic framework for expert selection, allowing the user to choose a utility function that best suits their objectives. The model component for taking into account the performance feedback data is pluggable, allowing flexibility. We apply the model to manage a portfolio of algorithms to solve hard combinatorial problems. This is a well-studied area and we demonstrate a large improvement on the state of the art in one domain (constraint solving), and in a second domain (combinatorial auctions) we created a portfolio that performed significantly better than any single algorithm.},
    address = {Atlanta},
    author = {Stern, David and Samulowitz, Horst and Herbrich, Ralf and Graepel, Thore and Pulina, Luca and Tacchella, Armando},
    booktitle = {Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence},
    title = {{Collaborative Expert Portfolio Management}},
    url = {http://www.herbrich.me/papers/matchboxqbf.pdf},
    year = {2010}
    }

  • T. R. Zaman, R. Herbrich, J. Van Gael, and D. Stern, “Predicting Information Spreading in Twitter,” in Proceedings of the Computational Social Science and the Wisdom of Crowds Workshop, Vancouver, 2010.
    [BibTeX] [Abstract] [Download PDF]

    We present a new methodology for predicting the spread of information in a social network. We focus on the Twitter network, where information is in the form of 140 character messages called tweets, and information is spread by users forwarding tweets, a practice known as retweeting. Using data of who and what was retweeted, we train a probabilistic collaborative filter model to predict future retweets. We find that the most important features for prediction are the identity of the source of the tweet and retweeter. Our methodology is quite flexible and can be used as a basis for other prediction models in social networks.

    @inproceedings{Zaman2010,
    abstract = {We present a new methodology for predicting the spread of information in a social network. We focus on the Twitter network, where information is in the form of 140 character messages called tweets, and information is spread by users forwarding tweets, a practice known as retweeting. Using data of who and what was retweeted, we train a probabilistic collaborative filter model to predict future retweets. We find that the most important features for prediction are the identity of the source of the tweet and retweeter. Our methodology is quite flexible and can be used as a basis for other prediction models in social networks.},
    address = {Vancouver},
    author = {Zaman, Tauhid R and Herbrich, Ralf and {Van Gael}, Jurgen and Stern, David},
    booktitle = {Proceedings of Computational Social Science and the Wisdom of Crowds Workshop},
    title = {{Predicting Information Spreading in Twitter}},
    url = {http://www.herbrich.me/papers/nips10_twitter.pdf},
    year = {2010}
    }

  • X. Zhang, T. Graepel, and R. Herbrich, “Bayesian Online Learning for Multi-label and Multi-variate Performance Measures,” Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 9, pp. 956–963, 2010.
    [BibTeX] [Abstract] [Download PDF]

    Many real world applications employ multivariate performance measures and each example can belong to multiple classes. The currently most popular approaches train an SVM for each class, followed by ad-hoc thresholding. Probabilistic models using Bayesian decision theory are also commonly adopted. In this paper, we propose a Bayesian online multi-label classification framework (BOMC) which learns a probabilistic linear classifier. The likelihood is modeled by a graphical model similar to TrueSkill, and inference is based on Gaussian density filtering with expectation propagation. Using samples from the posterior, we label the testing data by maximizing the expected F1-score. Our experiments on Reuters1-v2 dataset show BOMC compares favorably to the state-of-the-art online learners in macro-averaged F1-score and training time.

    @article{DBLP:journals/jmlr/ZhangGH10,
    abstract = {Many real world applications employ multivariate performance measures and each example can belong to multiple classes. The currently most popular approaches train an SVM for each class, followed by ad-hoc thresholding. Probabilistic models using Bayesian decision theory are also commonly adopted. In this paper, we propose a Bayesian online multi-label classification framework (BOMC) which learns a probabilistic linear classifier. The likelihood is modeled by a graphical model similar to TrueSkill, and inference is based on Gaussian density filtering with expectation propagation. Using samples from the posterior, we label the testing data by maximizing the expected F1-score. Our experiments on Reuters1-v2 dataset show BOMC compares favorably to the state-of-the-art online learners in macro-averaged F1-score and training time.},
    address = {Sardinia},
    author = {Zhang, Xinhua and Graepel, Thore and Herbrich, Ralf},
    journal = {Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010},
    pages = {956--963},
    title = {{Bayesian Online Learning for Multi-label and Multi-variate Performance Measures}},
    url = {http://www.herbrich.me/papers/zhang10b.pdf},
    volume = {9},
    year = {2010}
    }
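
    The decision step, labelling by maximum expected F1 from posterior samples, can be illustrated with a small Monte Carlo sketch (the marginals and the restriction to top-k label sets are my simplifying assumptions, not the paper's inference):

    import numpy as np

    rng = np.random.default_rng(0)
    marginals = np.array([0.9, 0.6, 0.3, 0.05])          # per-label posterior marginals
    samples = rng.random((2000, 4)) < marginals          # sampled true label vectors

    def expected_f1(pred, samples):
        """Average F1 of the 0/1 prediction `pred` over the sampled truths."""
        tp = (samples & pred).sum(axis=1)
        denom = samples.sum(axis=1) + pred.sum()
        f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 1.0)  # empty/empty -> 1
        return f1.mean()

    order = np.argsort(-marginals)                       # labels sorted by confidence
    best = max(
        (np.isin(np.arange(4), order[:k]) for k in range(5)),
        key=lambda pred: expected_f1(pred, samples),
    )
    print(best.astype(int))   # the top-k label set with the highest expected F1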

2009

  • Y. Bachrach, R. Herbrich, and E. Porat, “Sketching Algorithms for Approximating Rank Correlations in Collaborative Filtering Systems,” in Proceedings of the 16th International Symposium on String Processing and Information Retrieval, Saariselkä, 2009, pp. 344–352.
    [BibTeX] [Abstract] [Download PDF]

    Collaborative filtering (CF) shares information between users to provide each with recommendations. Previous work suggests using sketching techniques to handle massive data sets in CF systems, but only allows testing whether users have a high proportion of items they have both ranked. We show how to determine the correlation between the rankings of two users, using concise “sketches” of the rankings. The sketches allow approximating Kendall’s Tau, a known rank correlation, with high accuracy ε and high confidence 1-δ. The required sketch size is logarithmic in the confidence and polynomial in the accuracy.

    @inproceedings{DBLP:conf/spire/BachrachHP09,
    abstract = {Collaborative filtering (CF) shares information between users to provide each with recommendations. Previous work suggests using sketching techniques to handle massive data sets in CF systems, but only allows testing whether users have a high proportion of items they have both ranked. We show how to determine the correlation between the rankings of two users, using concise “sketches” of the rankings. The sketches allow approximating Kendall’s Tau, a known rank correlation, with high accuracy $\epsilon$ and high confidence $1-\delta$. The required sketch size is logarithmic in the confidence and polynomial in the accuracy.},
    address = {Saariselk\"{a}},
    author = {Bachrach, Yoram and Herbrich, Ralf and Porat, Ely},
    booktitle = {Proceedings of the 16th International Symposium on String Processing and Information Retrieval},
    pages = {344--352},
    title = {{Sketching Algorithms for Approximating Rank Correlations in Collaborative Filtering Systems}},
    url = {http://www.herbrich.me/papers/spire09.pdf},
    year = {2009}
    }
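
    To make the quantity being sketched concrete, here is a tiny Monte Carlo stand-in that estimates Kendall's Tau from a sample of item pairs; it is not the paper's min-wise-hashing construction, just an illustration of the target rank correlation:

    import numpy as np
    from scipy.stats import kendalltau

    rng = np.random.default_rng(1)
    n_items = 1000
    scores_a = rng.normal(size=n_items)                    # user A's item scores
    scores_b = scores_a + 0.5 * rng.normal(size=n_items)   # a correlated user B

    # Sample random item pairs and count concordant ones.
    pairs = rng.integers(0, n_items, size=(20000, 2))
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]
    concordant = ((scores_a[pairs[:, 0]] - scores_a[pairs[:, 1]]) *
                  (scores_b[pairs[:, 0]] - scores_b[pairs[:, 1]])) > 0
    tau_estimate = 2 * concordant.mean() - 1

    # The pair-sampling estimate tracks the exact statistic closely.
    print(round(tau_estimate, 3), round(kendalltau(scores_a, scores_b)[0], 3))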

  • P. Flach, S. Spiegler, B. Golénia, S. Price, J. Guiver, R. Herbrich, T. Graepel, and M. Zaki, “Novel Tools to Streamline the Conference Review Process: Experiences from SIGKDD’09,” SIGKDD Explorations, vol. 11, iss. 2, pp. 63–67, 2009.
    [BibTeX] [Abstract] [Download PDF]

    The SIGKDD’09 Research Track received 537 paper submissions, which were reviewed by a Program Committee of 199 members, and a Senior Program Committee of 22 members. We used techniques from artificial intelligence and data mining to streamline and support this complicated process at three crucial stages: bidding by PC members on papers, assigning papers to reviewers, and calibrating scores obtained from the reviews. In this paper we report on the approaches taken, evaluate how well they worked, and describe some further work done after the conference.

    @article{DBLP:journals/sigkdd/FlachSGPGHGZ09,
    abstract = {The SIGKDD'09 Research Track received 537 paper submissions, which were reviewed by a Program Committee of 199 members, and a Senior Program Committee of 22 members. We used techniques from artificial intelligence and data mining to streamline and support this complicated process at three crucial stages: bidding by PC members on papers, assigning papers to reviewers, and calibrating scores obtained from the reviews. In this paper we report on the approaches taken, evaluate how well they worked, and describe some further work done after the conference.},
    author = {Flach, Peter and Spiegler, Sebastian and Gol\'{e}nia, Bruno and Price, Simon and Guiver, John and Herbrich, Ralf and Graepel, Thore and Zaki, Mohammed},
    journal = {SIGKDD Explorations},
    number = {2},
    pages = {63--67},
    title = {{Novel Tools to Streamline the Conference Review Process: Experiences from SIGKDD'09}},
    url = {http://www.herbrich.me/papers/ReviewerCalibration.pdf},
    volume = {11},
    year = {2009}
    }

  • A. Schwaighofer, J. Q. Candela, T. Borchert, T. Graepel, and R. Herbrich, “Scalable Clustering and Keyword Suggestion for Online Advertisements,” in Proceedings of ADKDD 2009: 3rd Annual International Workshop on Data Mining and Audience Intelligence for Advertising, 2009, pp. 27–36.
    [BibTeX] [Abstract] [Download PDF]

    We present an efficient Bayesian online learning algorithm for clustering vectors of binary values based on a well known model, the mixture of Bernoulli profiles. The model includes conjugate Beta priors over the success probabilities and maintains discrete probability distributions for cluster assignments. Clustering is then formulated as inference in a factor graph which is solved efficiently using online approximate message passing. The resulting algorithm has three key features: a) it requires only a single pass across the data and can hence be used on data streams, b) it maintains the uncertainty of parameters and cluster assignments, and c) it implements an automatic step size adaptation based on the current model uncertainty. The model is tested on an artificially generated toy dataset and applied to a large scale real-world data set from online advertising, the data being online ads characterized by the set of keywords to which they have been subscribed. The proposed approach scales well for large datasets, and compares favorably to other clustering algorithms on the ads dataset. As a concrete application to online advertising we show how the learned model can be used to recommend new keywords for given ads.

    @inproceedings{DBLP:conf/kdd/SchwaighoferCBGH09,
    abstract = {We present an efficient Bayesian online learning algorithm for clustering vectors of binary values based on a well known model, the mixture of Bernoulli profiles. The model includes conjugate Beta priors over the success probabilities and maintains discrete probability distributions for cluster assignments. Clustering is then formulated as inference in a factor graph which is solved efficiently using online approximate message passing. The resulting algorithm has three key features: a) it requires only a single pass across the data and can hence be used on data streams, b) it maintains the uncertainty of parameters and cluster assignments, and c) it implements an automatic step size adaptation based on the current model uncertainty. The model is tested on an artificially generated toy dataset and applied to a large scale real-world data set from online advertising, the data being online ads characterized by the set of keywords to which they have been subscribed. The proposed approach scales well for large datasets, and compares favorably to other clustering algorithms on the ads dataset. As a concrete application to online advertising we show how the learned model can be used to recommend new keywords for given ads.},
    author = {Schwaighofer, Anton and Candela, Joaquin Qui\~{n}onero and Borchert, Thomas and Graepel, Thore and Herbrich, Ralf},
    booktitle = {Proceedings of ADKDD 2009: 3rd Annual International Workshop on Data Mining and Audience Intelligence for Advertising},
    pages = {27--36},
    title = {{Scalable Clustering and Keyword Suggestion for Online Advertisements}},
    url = {http://www.herbrich.me/papers/kdd2009.pdf},
    year = {2009}
    }
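
    The underlying model, a mixture of Bernoulli profiles, can be illustrated with a short batch EM sketch (the paper itself performs online Bayesian inference with Beta priors; the synthetic data and smoothing constants below are my assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic "ads x keywords" binary matrix drawn from two ground-truth profiles.
    true_profiles = np.array([[0.9, 0.8, 0.1, 0.1], [0.1, 0.2, 0.9, 0.7]])
    assignment = rng.integers(0, 2, 500)
    X = (rng.random((500, 4)) < true_profiles[assignment]).astype(float)

    K, D = 2, X.shape[1]
    profiles = rng.uniform(0.3, 0.7, size=(K, D))   # per-cluster keyword probabilities
    weights = np.full(K, 1.0 / K)

    for _ in range(50):
        # E-step: responsibilities from Bernoulli log-likelihoods.
        log_lik = X @ np.log(profiles).T + (1 - X) @ np.log(1 - profiles).T
        log_post = np.log(weights) + log_lik
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step with light Beta(2, 2)-style smoothing of the profiles.
        counts = resp.sum(axis=0)
        profiles = (resp.T @ X + 1.0) / (counts[:, None] + 2.0)
        weights = counts / counts.sum()

    print(np.round(profiles, 2))   # approximately recovers the two profiles (up to order)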

  • D. Stern, R. Herbrich, and T. Graepel, “Matchbox: Large Scale Online Bayesian Recommendations,” in Proceedings of the 18th International Conference on World Wide Web, 2009, pp. 111–120.
    [BibTeX] [Abstract] [Download PDF]

    We present a probabilistic model for generating personalised recommendations of items to users of a web service. The Matchbox system makes use of content information in the form of user and item meta data in combination with collaborative filtering information from previous user behavior in order to predict the value of an item for a user. Users and items are represented by feature vectors which are mapped into a low-dimensional ‘trait space’ in which similarity is measured in terms of inner products. The model can be trained from different types of feedback in order to learn user-item preferences. Here we present three alternatives: direct observation of an absolute rating each user gives to some items, observation of a binary preference (like/don’t like) and observation of a set of ordinal ratings on a user-specific scale. Efficient inference is achieved by approximate message passing involving a combination of Expectation Propagation (EP) and Variational Message Passing. We also include a dynamics model which allows an item’s popularity, a user’s taste or a user’s personal rating scale to drift over time. By using Assumed-Density Filtering (ADF) for training, the model requires only a single pass through the training data. This is an on-line learning algorithm capable of incrementally taking account of new data so the system can immediately reflect the latest user preferences. We evaluate the performance of the algorithm on the MovieLens and Netflix data sets consisting of approximately 1,000,000 and 100,000,000 ratings respectively. This demonstrates that training the model using the on-line ADF approach yields state-of-the-art performance with the option of improving performance further if computational resources are available by performing multiple EP passes over the training data.

    @inproceedings{DBLP:conf/www/SternHG09,
    abstract = {We present a probabilistic model for generating personalised recommendations of items to users of a web service. The Matchbox system makes use of content information in the form of user and item meta data in combination with collaborative filtering information from previous user behavior in order to predict the value of an item for a user. Users and items are represented by feature vectors which are mapped into a low-dimensional ‘trait space’ in which similarity is measured in terms of inner products. The model can be trained from different types of feedback in order to learn user-item preferences. Here we present three alternatives: direct observation of an absolute rating each user gives to some items, observation of a binary preference (like/don’t like) and observation of a set of ordinal ratings on a user-specific scale. Efficient inference is achieved by approximate message passing involving a combination of Expectation Propagation (EP) and Variational Message Passing. We also include a dynamics model which allows an item’s popularity, a user’s taste or a user’s personal rating scale to drift over time. By using Assumed-Density Filtering (ADF) for training, the model requires only a single pass through the training data. This is an on-line learning algorithm capable of incrementally taking account of new data so the system can immediately reflect the latest user preferences. We evaluate the performance of the algorithm on the MovieLens and Netflix data sets consisting of approximately 1,000,000 and 100,000,000 ratings respectively. This demonstrates that training the model using the on-line ADF approach yields state-of-the-art performance with the option of improving performance further if computational resources are available by performing multiple EP passes over the training data.},
    author = {Stern, David and Herbrich, Ralf and Graepel, Thore},
    booktitle = {Proceedings of the 18th International Conference on World Wide Web},
    file = {:Users/rherb/Code/herbrich.me/papers/www09.pdf:pdf},
    pages = {111--120},
    title = {{Matchbox: Large Scale Online Bayesian Recommendations}},
    url = {http://www.herbrich.me/papers/www09.pdf},
    year = {2009}
    }
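
    The bilinear ‘trait space’ scoring described in the abstract above can be illustrated in a few lines of code. This is an editorial sketch only, not code from the paper: the matrices U and V and the bias vectors are assumed to be already-learned point estimates, whereas Matchbox itself maintains posterior distributions over them and infers them by approximate message passing.

        import numpy as np

        def matchbox_score(x_user, y_item, U, V, b_user, b_item):
            """Map user and item feature vectors into a K-dimensional trait
            space and score by an inner product plus linear bias terms."""
            s = U @ x_user            # user trait vector, shape (K,)
            t = V @ y_item            # item trait vector, shape (K,)
            return float(s @ t + b_user @ x_user + b_item @ y_item)

        # Toy example with random stand-in "learned" parameters.
        rng = np.random.default_rng(0)
        K, d_user, d_item = 3, 5, 4
        U, V = rng.normal(size=(K, d_user)), rng.normal(size=(K, d_item))
        b_user, b_item = rng.normal(size=d_user), rng.normal(size=d_item)
        x = np.eye(d_user)[1]         # e.g. one-hot user id plus meta data
        y = np.eye(d_item)[2]         # e.g. one-hot item id plus meta data
        print(matchbox_score(x, y, U, V, b_user, b_item))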

2008

  • T. Graepel and R. Herbrich, “Large Scale Data Analysis and Modelling in Online Services and Advertising,” in Proceedings of the 14th acm sigkdd international conference on knowledge discovery and data mining, Las Vegas, 2008, p. 2.
    [BibTeX] [Download PDF]
    @inproceedings{DBLP:conf/kdd/GraepelH08,
    address = {Las Vegas},
    author = {Graepel, Thore and Herbrich, Ralf},
    booktitle = {Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
    file = {:Users/rherb/Code/herbrich.me/papers/kdd2008.pdf:pdf},
    pages = {2},
    title = {{Large Scale Data Analysis and Modelling in Online Services and Advertising}},
    url = {http://www.herbrich.me/papers/kdd2008.pdf},
    year = {2008}
    }

2007

  • P. Dangauthier, R. Herbrich, T. Minka, and T. Graepel, “TrueSkill Through Time: Revisiting the History of Chess,” in Advances in neural information processing systems 20, Vancouver, 2007, p. 931–938.
    [BibTeX] [Abstract] [Download PDF]

    We extend the Bayesian skill rating system TrueSkill to infer entire time series of skills of players by smoothing through time instead of filtering. The skill of each participating player, say, every year is represented by a latent skill variable which is affected by the relevant game outcomes that year, and coupled with the skill variables of the previous and subsequent year. Inference in the resulting factor graph is carried out by approximate message passing (EP) along the time series of skills. As before the system tracks the uncertainty about player skills, explicitly models draws, can deal with any number of competing entities and can infer individual skills from team results. We extend the system to estimate player-specific draw margins. Based on these models we present an analysis of the skill curves of important players in the history of chess over the past 150 years. Results include plots of players’ lifetime skill development as well as the ability to compare the skills of different players across time. Our results indicate that a) the overall playing strength has increased over the past 150 years, and b) that modelling a player’s ability to force a draw provides significantly better predictive power.

    @inproceedings{DBLP:conf/nips/DangauthierHMG07,
    abstract = {We extend the Bayesian skill rating system TrueSkill to infer entire time series of skills of players by smoothing through time instead of filtering. The skill of each participating player, say, every year is represented by a latent skill variable which is affected by the relevant game outcomes that year, and coupled with the skill variables of the previous and subsequent year. Inference in the resulting factor graph is carried out by approximate message passing (EP) along the time series of skills. As before the system tracks the uncertainty about player skills, explicitly models draws, can deal with any number of competing entities and can infer individual skills from team results. We extend the system to estimate player-specific draw margins. Based on these models we present an analysis of the skill curves of important players in the history of chess over the past 150 years. Results include plots of players' lifetime skill development as well as the ability to compare the skills of different players across time. Our results indicate that a) the overall playing strength has increased over the past 150 years, and b) that modelling a player's ability to force a draw provides significantly better predictive power.},
    address = {Vancouver},
    author = {Dangauthier, Pierre and Herbrich, Ralf and Minka, Tom and Graepel, Thore},
    booktitle = {Advances in Neural Information Processing Systems 20},
    file = {:Users/rherb/Code/herbrich.me/papers/ttt.pdf:pdf},
    pages = {931--938},
    publisher = {The MIT Press},
    title = {{TrueSkill Through Time: Revisiting the History of Chess}},
    url = {http://www.herbrich.me/papers/ttt.pdf},
    year = {2007}
    }
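
    The shift from filtering to smoothing that the paper describes can be illustrated on a much simpler surrogate model: a scalar skill that drifts as a Gaussian random walk from year to year and is observed through Gaussian pseudo-observations. The sketch below is an editorial illustration with made-up numbers, not the paper's EP-based inference over game outcomes; it runs a forward Kalman filter and then a backward Rauch-Tung-Striebel pass, so that each year's skill estimate uses results from both earlier and later years.

        import numpy as np

        def filter_and_smooth(y, obs_var=1.0, drift_var=0.1, m0=0.0, v0=10.0):
            """Scalar Gaussian chain: skill_t = skill_{t-1} + N(0, drift_var),
            pseudo-observation y_t = skill_t + N(0, obs_var). Returns the
            filtered means (past data only) and smoothed means (all data)."""
            T = len(y)
            mf, vf = np.zeros(T), np.zeros(T)      # filtered moments
            m, v = m0, v0
            for t in range(T):
                vpred = v + drift_var              # skill drifts between years
                k = vpred / (vpred + obs_var)      # Kalman gain
                mf[t] = m + k * (y[t] - m)
                vf[t] = (1.0 - k) * vpred
                m, v = mf[t], vf[t]
            ms = mf.copy()                         # backward (RTS) smoothing pass
            for t in range(T - 2, -1, -1):
                g = vf[t] / (vf[t] + drift_var)
                ms[t] = mf[t] + g * (ms[t + 1] - mf[t])
            return mf, ms

        yearly_results = np.array([0.2, 0.5, 1.1, 0.9, 1.6])
        filtered, smoothed = filter_and_smooth(yearly_results)
        print(filtered)
        print(smoothed)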

  • R. Herbrich, T. Graepel, and B. Murphy, “Structure From Failure,” in Proceedings of the second workshop on tackling computer systems problems with machine learning techniques, 2007.
    [BibTeX] [Abstract] [Download PDF]

    We investigate the problem of learning the dependencies among servers in large networks based on failure patterns in their up-time behaviour. We model up-times in terms of exponential distributions whose inverse lifetime parameters λ may vary with the state of other servers. Based on a conjugate Gamma prior over inverse lifetimes we identify the most likely network configuration given that any node has at most one parent. The method can be viewed as a special case of learning a continuous time Bayesian network. Our approach enables us to easily incorporate existing expert prior knowledge. Furthermore our method enjoys advantages over a state-of-the-art rule-based approach. We validate the approach on synthetic data and apply it to five years of data for a set of over 500 servers at a server farm of a major Microsoft web site.

    @inproceedings{Herbrich2007a,
    abstract = {We investigate the problem of learning the dependencies among servers in large networks based on failure patterns in their up-time behaviour. We model up-times in terms of exponential distributions whose inverse lifetime parameters λ may vary with the state of other servers. Based on a conjugate Gamma prior over inverse lifetimes we identify the most likely network configuration given that any node has at most one parent. The method can be viewed as a special case of learning a continuous time Bayesian network. Our approach enables us to easily incorporate existing expert prior knowledge. Furthermore our method enjoys advantages over a state-of-the-art rule-based approach. We validate the approach on synthetic data and apply it to five years of data for a set of over 500 servers at a server farm of a major Microsoft web site.},
    author = {Herbrich, Ralf and Graepel, Thore and Murphy, Brendan},
    booktitle = {Proceedings of the Second Workshop on Tackling Computer Systems Problems with Machine Learning Techniques},
    file = {:Users/rherb/Code/herbrich.me/papers/sysml2007.pdf:pdf},
    publisher = {USENIX},
    title = {{Structure From Failure}},
    url = {http://www.herbrich.me/papers/sysml2007.pdf},
    year = {2007}
    }
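
    The conjugate Gamma-exponential calculation at the heart of the structure search is easy to sketch: partition a server's up-times by the state of a candidate parent, compute the closed-form marginal likelihood of each partition under a Gamma prior on the rate, and pick the parent with the highest total score. The prior values, the binary parent state and the toy data below are illustrative assumptions, not the paper's exact procedure.

        import math

        def log_marginal_exponential(uptimes, a=1.0, b=1.0):
            """Log marginal likelihood of exponential up-times under a conjugate
            Gamma(a, b) prior on the rate: b^a * Gamma(a+n) / (Gamma(a) * (b + sum t)^(a+n))."""
            n, s = len(uptimes), sum(uptimes)
            return (a * math.log(b) + math.lgamma(a + n)
                    - math.lgamma(a) - (a + n) * math.log(b + s))

        def score_parent(uptimes, parent_state):
            """Split up-times by the candidate parent's (binary) state and add
            the per-partition log marginal likelihoods."""
            up = [t for t, s in zip(uptimes, parent_state) if s == 1]
            down = [t for t, s in zip(uptimes, parent_state) if s == 0]
            return log_marginal_exponential(up) + log_marginal_exponential(down)

        # Toy example: one server's up-times and two candidate parents' states.
        uptimes = [5.0, 4.2, 0.3, 0.4, 6.1, 0.2]
        parents = {"A": [1, 1, 0, 0, 1, 0], "B": [1, 0, 1, 0, 1, 0]}
        best = max(parents, key=lambda p: score_parent(uptimes, parents[p]))
        print("most likely parent:", best)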

  • D. Stern, R. Herbrich, and T. Graepel, “Learning to Solve Game Trees,” in Proceedings of the twenty-fourth international conference on machine learning, Corvallis, 2007, p. 839–846.
    [BibTeX] [Abstract] [Download PDF]

    We apply probability theory to the task of proving whether a goal can be achieved by a player in an adversarial game. Such problems are solved by searching the game tree. We view this tree as a graphical model which yields a distribution over the (Boolean) outcome of the search before it terminates. Experiments show that a best-first search algorithm guided by this distribution explores a similar number of nodes as Proof-Number Search to solve Go problems. Knowledge is incorporated into search by using domain-specific models to provide prior distributions over the values of leaf nodes of the game tree. These are surrogates for the unexplored parts of the tree. The parameters of these models can be learned from previous search trees. Experiments on Go show that the speed of problem solving can be increased by orders of magnitude by this technique but care must be taken to avoid over-fitting.

    @inproceedings{DBLP:conf/icml/SternHG07,
    abstract = {We apply probability theory to the task of proving whether a goal can be achieved by a player in an adversarial game. Such problems are solved by searching the game tree. We view this tree as a graphical model which yields a distribution over the (Boolean) outcome of the search before it terminates. Experiments show that a best-first search algorithm guided by this distribution explores a similar number of nodes as Proof-Number Search to solve Go problems. Knowledge is incorporated into search by using domain-specific models to provide prior distributions over the values of leaf nodes of the game tree. These are surrogates for the unexplored parts of the tree. The parameters of these models can be learned from previous search trees. Experiments on Go show that the speed of problem solving can be increased by orders of magnitude by this technique but care must be taken to avoid over-fitting.},
    address = {Corvallis},
    author = {Stern, David and Herbrich, Ralf and Graepel, Thore},
    booktitle = {Proceedings of the Twenty-Fourth International Conference on Machine Learning},
    file = {:Users/rherb/Code/herbrich.me/papers/p839-stern.pdf:pdf},
    pages = {839--846},
    title = {{Learning to Solve Game Trees}},
    url = {http://www.herbrich.me/papers/p839-stern.pdf},
    year = {2007}
    }

2006

  • T. Graepel and R. Herbrich, “Ranking and Matchmaking,” Game developer magazine, iss. 10, 2006.
    [BibTeX] [Abstract] [Download PDF]

    Online game modes are gradually becoming the standard for console games, rather than an added bonus. But how do you get the right players playing against the right people, close enough in skill to be challenging, but not so difficult as to be frustrating?

    @article{Graepel2006,
    abstract = {Online game modes are gradually becoming the standard for console games, rather than an added bonus. But how do you get the right players playing against the right people, close enough in skill to be challenging, but not so difficult as to be frustrating?},
    author = {Graepel, Thore and Herbrich, Ralf},
    journal = {Game Developer Magazine},
    number = {10},
    title = {{Ranking and Matchmaking}},
    url = {http://www.gdmag.com/archive/oct06.htm},
    year = {2006}
    }

  • R. Herbrich, T. Minka, and T. Graepel, “TrueSkill(TM): A Bayesian Skill Rating System,” in Advances in neural information processing systems 19, 2006, p. 569–576.
    [BibTeX] [Abstract] [Download PDF]

    We present a new Bayesian skill rating system which can be viewed as a generalisation of the Elo system used in Chess. The new system tracks the uncertainty about player skills, explicitly models draws, can deal with any number of competing entities and can infer individual skills from team results. Inference is performed by approximate message passing on a factor graph representation of the model. We present experimental evidence on the increased accuracy and convergence speed of the system compared to Elo and report on our experience with the new rating system running in a large-scale commercial online gaming service under the name of TrueSkill.

    @inproceedings{Herbrich2007,
    abstract = {We present a new Bayesian skill rating system which can be viewed as a generalisation of the Elo system used in Chess. The new system tracks the uncertainty about player skills, explicitly models draws, can deal with any number of competing entities and can infer individual skills from team results. Inference is performed by approximate message passing on a factor graph representation of the model. We present experimental evidence on the increased accuracy and convergence speed of the system compared to Elo and report on our experience with the new rating system running in a large-scale commercial online gaming service under the name of TrueSkill.},
    author = {Herbrich, Ralf and Minka, Tom and Graepel, Thore},
    booktitle = {Advances in Neural Information Processing Systems 19},
    file = {:Users/rherb/Library/Application Support/Mendeley Desktop/Downloaded/Herbrich, Minka, Graepel - 2007 - TrueSkill TM A Bayesian Skill Rating System.pdf:pdf},
    keywords = {bayesian,ranking},
    mendeley-tags = {bayesian,ranking},
    pages = {569--576},
    publisher = {The MIT Press},
    title = {{TrueSkill(TM): A Bayesian Skill Rating System}},
    url = {http://www.herbrich.me/papers/trueskill.pdf},
    year = {2006}
    }
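
    As a concrete illustration, here is the widely documented two-player, no-draw special case of the TrueSkill update, in which each skill is a Gaussian and the observed outcome is absorbed by moment matching against a truncated Gaussian. Draws, teams and the full factor-graph message passing are omitted; β is the performance noise and the numbers below follow the usual μ = 25, σ = 25/3, β = 25/6 convention.

        import math

        def _phi(t):   # standard normal pdf
            return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

        def _Phi(t):   # standard normal cdf
            return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

        def trueskill_update(winner, loser, beta=25.0 / 6.0):
            """Two-player, no-draw TrueSkill update. Each player is a (mu, sigma)
            pair; returns the updated pairs after 'winner' beats 'loser'."""
            (mu_w, s_w), (mu_l, s_l) = winner, loser
            c = math.sqrt(2.0 * beta**2 + s_w**2 + s_l**2)
            t = (mu_w - mu_l) / c
            v = _phi(t) / _Phi(t)          # mean correction factor
            w = v * (v + t)                # variance correction factor
            new_w = (mu_w + s_w**2 / c * v, s_w * math.sqrt(1.0 - s_w**2 / c**2 * w))
            new_l = (mu_l - s_l**2 / c * v, s_l * math.sqrt(1.0 - s_l**2 / c**2 * w))
            return new_w, new_l

        # Two previously unrated players; player one wins.
        p1, p2 = (25.0, 25.0 / 3.0), (25.0, 25.0 / 3.0)
        print(trueskill_update(p1, p2))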

  • D. Stern, R. Herbrich, and T. Graepel, “Bayesian Pattern Ranking for Move Prediction in the Game of Go,” in Proceedings of the twenty-third international conference on machine learning, Pittsburgh, 2006, p. 873–880.
    [BibTeX] [Abstract] [Download PDF]

    We investigate the problem of learning to predict moves in the board game of Go from game records of expert players. In particular, we obtain a probability distribution over legal moves for professional play in a given position. This distribution has numerous applications in computer Go, including serving as an efficient stand-alone Go player. It would also be effective as a move selector and move sorter for game tree search and as a training tool for Go players. Our method has two major components: a) a pattern extraction scheme for efficiently harvesting patterns of given size and shape from expert game records and b) a Bayesian learning algorithm (in two variants) that learns a distribution over the values of a move given a board position based on the local pattern context. The system is trained on 181,000 expert games and shows excellent prediction performance as indicated by its ability to perfectly predict the moves made by professional Go players in 34% of test positions.

    @inproceedings{DBLP:conf/icml/SternHG06,
    abstract = {We investigate the problem of learning to predict moves in the board game of Go from game records of expert players. In particular, we obtain a probability distribution over legal moves for professional play in a given position. This distribution has numerous applications in computer Go, including serving as an efficient stand-alone Go player. It would also be effective as a move selector and move sorter for game tree search and as a training tool for Go players. Our method has two major components: a) a pattern extraction scheme for efficiently harvesting patterns of given size and shape from expert game records and b) a Bayesian learning algorithm (in two variants) that learns a distribution over the values of a move given a board position based on the local pattern context. The system is trained on 181,000 expert games and shows excellent prediction performance as indicated by its ability to perfectly predict the moves made by professional Go players in 34\% of test positions.},
    address = {Pittsburgh},
    author = {Stern, David and Herbrich, Ralf and Graepel, Thore},
    booktitle = {Proceedings of the Twenty-Third International Conference on Machine Learning},
    file = {:Users/rherb/Code/herbrich.me/papers/p873-stern.pdf:pdf},
    pages = {873--880},
    title = {{Bayesian Pattern Ranking for Move Prediction in the Game of Go}},
    url = {http://www.herbrich.me/papers/p873-stern.pdf},
    year = {2006}
    }

2005

  • S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, and D. Roth, “Generalization Bounds for the Area Under the ROC Curve,” Journal of machine learning research, vol. 6, p. 393–425, 2005.
    [BibTeX] [Abstract] [Download PDF]

    We study generalization properties of the area under the ROC curve (AUC), a quantity that has been advocated as an evaluation criterion for the bipartite ranking problem. The AUC is a different term than the error rate used for evaluation in classification problems; consequently, existing generalization bounds for the classification error rate cannot be used to draw conclusions about the AUC. In this paper, we define the expected accuracy of a ranking function (analogous to the expected error rate of a classification function), and derive distribution-free probabilistic bounds on the deviation of the empirical AUC of a ranking function (observed on a finite data sequence) from its expected accuracy. We derive both a large deviation bound, which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on a test sequence, and a uniform convergence bound, which serves to bound the expected accuracy of a learned ranking function in terms of its empirical AUC on a training sequence. Our uniform convergence bound is expressed in terms of a new set of combinatorial parameters that we term the bipartite rank-shatter coefficients; these play the same role in our result as do the standard VC-dimension related shatter coefficients (also known as the growth function) in uniform convergence results for the classification error rate. A comparison of our result with a recent uniform convergence result derived by Freund et al. (2003) for a quantity closely related to the AUC shows that the bound provided by our result can be considerably tighter.

    @article{DBLP:journals/jmlr/AgarwalGHHR05,
    abstract = {We study generalization properties of the area under the ROC curve (AUC), a quantity that has been advocated as an evaluation criterion for the bipartite ranking problem. The AUC is a different term than the error rate used for evaluation in classification problems; consequently, existing generalization bounds for the classification error rate cannot be used to draw conclusions about the AUC. In this paper, we define the expected accuracy of a ranking function (analogous to the expected error rate of a classification function), and derive distribution-free probabilistic bounds on the deviation of the empirical AUC of a ranking function (observed on a finite data sequence) from its expected accuracy. We derive both a large deviation bound, which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on a test sequence, and a uniform convergence bound, which serves to bound the expected accuracy of a learned ranking function in terms of its empirical AUC on a training sequence. Our uniform convergence bound is expressed in terms of a new set of combinatorial parameters that we term the bipartite rank-shatter coefficients; these play the same role in our result as do the standard VC-dimension related shatter coefficients (also known as the growth function) in uniform convergence results for the classification error rate. A comparison of our result with a recent uniform convergence result derived by Freund et al. (2003) for a quantity closely related to the AUC shows that the bound provided by our result can be considerably tighter.},
    author = {Agarwal, Shivani and Graepel, Thore and Herbrich, Ralf and Har-Peled, Sariel and Roth, Dan},
    file = {:Users/rherb/Code/herbrich.me/papers/auc.pdf:pdf},
    journal = {Journal of Machine Learning Research},
    pages = {393--425},
    title = {{Generalization Bounds for the Area Under the ROC Curve}},
    url = {http://www.herbrich.me/papers/auc.pdf},
    volume = {6},
    year = {2005}
    }
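
    For reference, the quantity these bounds concern, the empirical AUC of a scoring function on a bipartite sample, is the normalised Wilcoxon-Mann-Whitney statistic; a minimal computation is sketched below (the bounds themselves are stated in the paper and not reproduced here).

        def empirical_auc(pos_scores, neg_scores):
            """Empirical AUC of a scoring function: the fraction of
            positive/negative pairs ranked correctly, counting ties as 1/2."""
            wins = 0.0
            for sp in pos_scores:
                for sn in neg_scores:
                    if sp > sn:
                        wins += 1.0
                    elif sp == sn:
                        wins += 0.5
            return wins / (len(pos_scores) * len(neg_scores))

        print(empirical_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1]))  # 11/12 ≈ 0.917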

  • T. Graepel, R. Herbrich, and J. Shawe-Taylor, “PAC-Bayesian Compression Bounds on the Prediction Error of Learning Algorithms for Classification,” Machine learning, vol. 59, p. 55–76, 2005.
    [BibTeX] [Abstract] [Download PDF]

    We consider bounds on the prediction error of classification algorithms based on sample compression. We refine the notion of a compression scheme to distinguish permutation and repetition invariant and non-permutation and repetition invariant compression schemes leading to different prediction error bounds. Also, we extend known results on compression to the case of non-zero empirical risk. We provide bounds on the prediction error of classifiers returned by mistake-driven online learning algorithms by interpreting mistake bounds as bounds on the size of the respective compression scheme of the algorithm. This leads to a bound on the prediction error of perceptron solutions that depends on the margin a support vector machine would achieve on the same training sample. Furthermore, using the property of compression we derive bounds on the average prediction error of kernel classifiers in the PAC-Bayesian framework. These bounds assume a prior measure over the expansion coefficients in the data-dependent kernel expansion and bound the average prediction error uniformly over subsets of the space of expansion coefficients.

    @article{DBLP:journals/ml/GraepelHS05,
    abstract = {We consider bounds on the prediction error of classification algorithms based on sample compression. We refine the notion of a compression scheme to distinguish permutation and repetition invariant and non-permutation and repetition invariant compression schemes leading to different prediction error bounds. Also, we extend known results on compression to the case of non-zero empirical risk. We provide bounds on the prediction error of classifiers returned by mistake-driven online learning algorithms by interpreting mistake bounds as bounds on the size of the respective compression scheme of the algorithm. This leads to a bound on the prediction error of perceptron solutions that depends on the margin a support vector machine would achieve on the same training sample. Furthermore, using the property of compression we derive bounds on the average prediction error of kernel classifiers in the PAC-Bayesian framework. These bounds assume a prior measure over the expansion coefficients in the data-dependent kernel expansion and bound the average prediction error uniformly over subsets of the space of expansion coefficients.},
    author = {Graepel, Thore and Herbrich, Ralf and Shawe-Taylor, John},
    file = {:Users/rherb/Code/herbrich.me/papers/graehertay05.pdf:pdf},
    journal = {Machine Learning},
    pages = {55--76},
    title = {{PAC-Bayesian Compression Bounds on the Prediction Error of Learning Algorithms for Classification}},
    url = {http://www.herbrich.me/papers/graehertay05.pdf},
    volume = {59},
    year = {2005}
    }

  • A. Gretton, R. Herbrich, A. J. Smola, O. Bousquet, and B. Schölkopf, “Kernel Methods for Measuring Independence,” Journal of machine learning research, vol. 6, p. 2075–2129, 2005.
    [BibTeX] [Abstract] [Download PDF]

    We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent. We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information. Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis.

    @article{DBLP:journals/jmlr/GrettonHSBS05,
    abstract = {We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent. We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information. Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis.},
    author = {Gretton, Arthur and Herbrich, Ralf and Smola, Alexander J and Bousquet, Olivier and Sch\"{o}lkopf, Bernhard},
    file = {:Users/rherb/Code/herbrich.me/papers/gretton05a.pdf:pdf},
    journal = {Journal of Machine Learning Research},
    pages = {2075--2129},
    title = {{Kernel Methods for Measuring Independence}},
    url = {http://www.herbrich.me/papers/gretton05a.pdf},
    volume = {6},
    year = {2005}
    }

  • A. Gretton, A. Smola, O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls, B. Schölkopf, and N. Logothetis, “Kernel Constrained Covariance for Dependence Measurement,” in Proceedings of the tenth international workshop on artificial intelligence and statistics, 2005, p. 112–119.
    [BibTeX] [Abstract] [Download PDF]

    We discuss reproducing kernel Hilbert space (RKHS)-based measures of statistical dependence, with emphasis on constrained covariance (COCO), a novel criterion to test dependence of random variables. We show that COCO is a test for independence if and only if the associated RKHSs are universal. That said, no independence test exists that can distinguish dependent and independent random variables in all circumstances. Dependent random variables can result in a COCO which is arbitrarily close to zero when the source densities are highly non-smooth. All current kernel-based independence tests share this behaviour. We demonstrate exponential convergence between the population and empirical COCO. Finally, we use COCO as a measure of joint neural activity between voxels in MRI recordings of the macaque monkey, and compare the results to the mutual information and the correlation. We also show the effect of removing breathing artefacts from the MRI recording.

    @inproceedings{Gretton2005,
    abstract = {We discuss reproducing kernel Hilbert space (RKHS)-based measures of statistical dependence, with emphasis on constrained covariance (COCO), a novel criterion to test dependence of random variables. We show that COCO is a test for independence if and only if the associated RKHSs are universal. That said, no independence test exists that can distinguish dependent and independent random variables in all circumstances. Dependent random variables can result in a COCO which is arbitrarily close to zero when the source densities are highly non-smooth. All current kernel-based independence tests share this behaviour. We demonstrate exponential convergence between the population and empirical COCO. Finally, we use COCO as a measure of joint neural activity between voxels in MRI recordings of the macaque monkey, and compare the results to the mutual information and the correlation. We also show the effect of removing breathing artefacts from the MRI recording.},
    author = {Gretton, Arthur and Smola, Alexander and Bousquet, Olivier and Herbrich, Ralf and Belitski, Andrei and Augath, Mark and Murayama, Yusuke and Pauls, Jon and Sch\"{o}lkopf, Bernhard and Logothetis, Nikos},
    booktitle = {Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics},
    file = {:Users/rherb/Code/herbrich.me/papers/kcc.pdf:pdf},
    pages = {112--119},
    title = {{Kernel Constrained Covariance for Dependence Measurement}},
    url = {http://www.herbrich.me/papers/kcc.pdf},
    year = {2005}
    }
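
    A rough sketch of how a constrained-covariance-style statistic can be computed from data: build centred Gram matrices for the two samples and take the square root of the largest eigenvalue of their product. The Gaussian kernel, its bandwidth and the exact 1/n normalisation below are assumptions of this illustration rather than the paper's precise estimator, which should be consulted for the definition used in the experiments.

        import numpy as np

        def gram(x, bandwidth=1.0):
            """Gaussian-kernel Gram matrix for a 1-D sample."""
            d = x[:, None] - x[None, :]
            return np.exp(-0.5 * d**2 / bandwidth**2)

        def coco_statistic(x, y, bandwidth=1.0):
            """Constrained-covariance-style statistic (assumed form):
            sqrt(largest eigenvalue of Kc @ Lc) / n, with centred Gram matrices."""
            n = len(x)
            H = np.eye(n) - np.ones((n, n)) / n          # centring matrix
            Kc = H @ gram(x, bandwidth) @ H
            Lc = H @ gram(y, bandwidth) @ H
            lam = np.max(np.real(np.linalg.eigvals(Kc @ Lc)))
            return np.sqrt(max(lam, 0.0)) / n

        rng = np.random.default_rng(0)
        x = rng.normal(size=200)
        print(coco_statistic(x, rng.normal(size=200)))   # independent: small value
        print(coco_statistic(x, np.sin(3 * x)))          # dependent: larger value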

  • R. Herbrich, “On Gaussian Expectation Propagation,” Microsoft Research, July, 2005.
    [BibTeX] [Abstract] [Download PDF]

    In this short note we will re-derive the Gaussian expectation propagation (Gaussian EP) algorithm as presented in Minka (2001) and demonstrate an application of Gaussian EP to approximate multi-dimensional truncated Gaussians.

    @techreport{Herbrich2005b,
    abstract = {In this short note we will re-derive the Gaussian expectation propagation (Gaussian EP) algorithm as presented in Minka (2001) and demonstrate an application of Gaussian EP to approximate multi-dimensional truncated Gaussians.},
    author = {Herbrich, Ralf},
    file = {:Users/rherb/Code/herbrich.me/papers/EP.pdf:pdf},
    institution = {Microsoft Research},
    number = {July},
    title = {{On Gaussian Expectation Propagation}},
    url = {http://www.herbrich.me/papers/EP.pdf},
    year = {2005}
    }
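
    The one-dimensional building block behind approximating truncated Gaussians is the pair of moment formulas for a Gaussian restricted to a half-line, sketched below; the note itself treats the multi-dimensional approximation via Gaussian EP, which is not reproduced here.

        import math

        def truncated_gaussian_moments(mu, sigma, a):
            """Mean and variance of X ~ N(mu, sigma^2) conditioned on X > a."""
            alpha = (a - mu) / sigma
            pdf = math.exp(-0.5 * alpha**2) / math.sqrt(2.0 * math.pi)
            tail = 0.5 * math.erfc(alpha / math.sqrt(2.0))   # P(Z > alpha)
            z = pdf / tail                                   # inverse Mills ratio
            mean = mu + sigma * z
            var = sigma**2 * (1.0 - z * (z - alpha))
            return mean, var

        # Moment-match a Gaussian to N(0, 1) truncated to the positive half-line:
        print(truncated_gaussian_moments(0.0, 1.0, 0.0))  # ~ (0.798, 0.363)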

  • R. Herbrich, “Minimising the Kullback-Leibler Divergence,” Microsoft Research, August, 2005.
    [BibTeX] [Abstract] [Download PDF]

    In this note we show that minimising the Kullback–Leibler divergence over a family in the class of exponential distributions is achieved by matching the expected natural statistic. We will also give an explicit update formula for distributions with only one likelihood term.

    @techreport{Herbrich2005a,
    abstract = {In this note we show that minimising the Kullback–Leibler divergence over a family in the class of exponential distributions is achieved by matching the expected natural statistic. We will also give an explicit update formula for distributions with only one likelihood term.},
    author = {Herbrich, Ralf},
    file = {:Users/rherb/Code/herbrich.me/papers/KL.pdf:pdf},
    institution = {Microsoft Research},
    number = {August},
    title = {{Minimising the Kullback-Leibler Divergence}},
    url = {http://www.herbrich.me/papers/KL.pdf},
    year = {2005}
    }
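
    The note's central statement, that minimising KL(p‖q) over an exponential family amounts to matching expected sufficient statistics, can be checked numerically in the Gaussian case: the best Gaussian approximation to any p simply copies p's mean and variance. The mixture used as p below is an arbitrary illustration.

        import numpy as np

        # Target p: a two-component Gaussian mixture (arbitrary illustration).
        weights = np.array([0.3, 0.7])
        means, sds = np.array([-2.0, 1.0]), np.array([0.5, 1.5])

        # Moment matching: the KL(p||q)-optimal Gaussian q matches E_p[x] and
        # Var_p[x], i.e. the expected sufficient statistics x and x^2.
        m = np.sum(weights * means)
        second_moment = np.sum(weights * (sds**2 + means**2))
        v = second_moment - m**2
        print("q* = N(%.3f, %.3f)" % (m, v))

        # Sanity check by Monte Carlo: sample moments of p agree with q*.
        rng = np.random.default_rng(0)
        comp = rng.choice(2, size=200_000, p=weights)
        samples = rng.normal(means[comp], sds[comp])
        print("MC estimate: N(%.3f, %.3f)" % (samples.mean(), samples.var()))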

  • R. Herbrich, T. Graepel, and R. C. Williamson, “The Structure of Version Space,” in Innovations in machine learning: theory and applications, Springer, 2005, p. 257–274.
    [BibTeX] [Abstract] [Download PDF]

    We investigate the generalisation performance of consistent classifiers, i.e. classifiers that are contained in the so-called version space, both from a theoretical and experimental angle. In contrast to classical VC analysis – where no single classifier within version space is singled out on grounds of a generalisation error bound – the data dependent structural risk minimisation framework suggests that there exists one particular classifier that is to be preferred because it minimises the generalisation error bound. This is usually taken to provide a theoretical justification for learning algorithms such as the well known support vector machine. A reinterpretation of a recent PAC-Bayesian result, however, reveals that given a suitably chosen hypothesis space there exists a large fraction of classifiers with small generalisation error although we cannot readily identify them for a specific learning task. In the particular case of linear classifiers we show that classifiers found by the classical perceptron algorithm have guarantees bounded by the size of version space. These results are complemented with an empirical study for kernel classifiers on the task of handwritten digit recognition which demonstrates that even classifiers with a small margin may exhibit excellent generalisation. In order to perform this analysis we introduce the kernel Gibbs sampler – an algorithm which can be used to sample consistent kernel classifiers.

    @incollection{Herbrich2005,
    abstract = {We investigate the generalisation performance of consistent classifiers, i.e. classifiers that are contained in the so-called version space, both from a theoretical and experimental angle. In contrast to classical VC analysis - where no single classifier within version space is singled out on grounds of a generalisation error bound - the data dependent structural risk minimisation framework suggests that there exists one particular classifier that is to be preferred because it minimises the generalisation error bound. This is usually taken to provide a theoretical justification for learning algorithms such as the well known support vector machine. A reinterpretation of a recent PAC-Bayesian result, however, reveals that given a suitably chosen hypothesis space there exists a large fraction of classifiers with small generalisation error although we cannot readily identify them for a specific learning task. In the particular case of linear classifiers we show that classifiers found by the classical perceptron algorithm have guarantees bounded by the size of version space. These results are complemented with an empirical study for kernel classifiers on the task of handwritten digit recognition which demonstrates that even classifiers with a small margin may exhibit excellent generalisation. In order to perform this analysis we introduce the kernel Gibbs sampler - an algorithm which can be used to sample consistent kernel classifiers.},
    author = {Herbrich, Ralf and Graepel, Thore and Williamson, Robert C},
    booktitle = {Innovations in Machine Learning: Theory and Applications},
    chapter = {9},
    file = {:Users/rherb/Code/herbrich.me/papers/holmes\_mod2.pdf:pdf},
    pages = {257--274},
    publisher = {Springer},
    title = {{The Structure of Version Space}},
    url = {http://www.herbrich.me/papers/holmes\_mod2.pdf},
    year = {2005}
    }

2004

  • S. Agarwal, T. Graepel, R. Herbrich, and D. Roth, “A Large Deviation Bound for the Area Under the ROC Curve,” in Advances in neural information processing systems 17, Vancouver, 2004, p. 9–16.
    [BibTeX] [Abstract] [Download PDF]

    The area under the ROC curve (AUC) has been advocated as an evaluation criterion for the bipartite ranking problem. We study large deviation properties of the AUC; in particular, we derive a distribution-free large deviation bound for the AUC which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. A comparison of our result with a corresponding large deviation result for the classification error rate suggests that the test sample size required to obtain an epsilon-accurate estimate of the expected accuracy of a ranking function with delta-confidence is larger than that required to obtain an epsilon-accurate estimate of the expected error rate of a classification function with the same confidence. A simple application of the union bound allows the large deviation bound to be extended to learned ranking functions chosen from finite function classes.

    @inproceedings{DBLP:conf/nips/AgarwalGHR04,
    abstract = {The area under the ROC curve (AUC) has been advocated as an evaluation criterion for the bipartite ranking problem. We study large deviation properties of the AUC; in particular, we derive a distribution-free large deviation bound for the AUC which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. A comparison of our result with a corresponding large deviation result for the classification error rate suggests that the test sample size required to obtain an epsilon-accurate estimate of the expected accuracy of a ranking function with delta-confidence is larger than that required to obtain an epsilon-accurate estimate of the expected error rate of a classification function with the same confidence. A simple application of the union bound allows the large deviation bound to be extended to learned ranking functions chosen from finite function classes.},
    address = {Vancouver},
    author = {Agarwal, Shivani and Graepel, Thore and Herbrich, Ralf and Roth, Dan},
    booktitle = {Advances in Neural Information Processing Systems 17},
    file = {:Users/rherb/Code/herbrich.me/papers/nips04-auc.pdf:pdf},
    pages = {9--16},
    title = {{A Large Deviation Bound for the Area Under the ROC Curve}},
    url = {http://www.herbrich.me/papers/nips04-auc.pdf},
    year = {2004}
    }

  • T. Graepel, R. Herbrich, and J. Gold, “Learning to Fight,” in Proceedings of the international conference on computer games: artificial intelligence, design and education, 2004, p. 193–200.
    [BibTeX] [Abstract] [Download PDF]

    We apply reinforcement learning to the problem of finding good policies for a fighting agent in a commercial computer game. The learning agent is trained using the SARSA algorithm for on-policy learning of an action-value function represented by linear and neural network function approximators. We discuss the selection and construction of features, actions, and rewards as well as other design choices necessary to integrate the learning process into the game. The learning agent is trained against the built-in AI of the game with different rewards encouraging aggressive or defensive behaviour. We show that the learning agent finds interesting (and partly near optimal) policies in accordance with the reward functions provided. We also discuss the particular challenges arising in the application of reinforcement learning to the domain of computer games.

    @inproceedings{Graepel2004,
    abstract = {We apply reinforcement learning to the problem of finding good policies for a fighting agent in a commercial computer game. The learning agent is trained using the SARSA algorithm for on-policy learning of an action-value function represented by linear and neural network function approximators. We discuss the selection and construction of features, actions, and rewards as well as other design choices necessary to integrate the learning process into the game. The learning agent is trained against the built-in AI of the game with different rewards encouraging aggressive or defensive behaviour. We show that the learning agent finds interesting (and partly near optimal) policies in accordance with the reward functions provided. We also discuss the particular challenges arising in the application of reinforcement learning to the domain of computer games.},
    author = {Graepel, Thore and Herbrich, Ralf and Gold, Julian},
    booktitle = {Proceedings of the International Conference on Computer Games: Artificial Intelligence, Design and Education},
    file = {:Users/rherb/Code/herbrich.me/papers/graehergol04.pdf:pdf},
    pages = {193--200},
    title = {{Learning to Fight}},
    url = {http://www.herbrich.me/papers/graehergol04.pdf},
    year = {2004}
    }
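
    The on-policy SARSA rule mentioned in the abstract is standard; below is a bare tabular version on a three-state toy chain. The paper's agent instead used linear and neural-network function approximators with game-specific features, actions and rewards, so this sketch only shows the learning rule itself.

        import random

        # Toy episodic chain: states 0, 1, 2; state 2 is terminal with reward +1.
        # Actions: 0 = left, 1 = right.
        N_STATES, ACTIONS, GOAL = 3, (0, 1), 2

        def step(s, a):
            s2 = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
            return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

        def epsilon_greedy(Q, s, eps):
            if random.random() < eps:
                return random.choice(ACTIONS)
            return max(ACTIONS, key=lambda a: Q[(s, a)])

        def sarsa(episodes=300, alpha=0.1, gamma=0.9, eps=0.2):
            Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
            for _ in range(episodes):
                s, done = 0, False
                a = epsilon_greedy(Q, s, eps)
                while not done:
                    s2, r, done = step(s, a)
                    a2 = epsilon_greedy(Q, s2, eps) if not done else 0
                    target = r if done else r + gamma * Q[(s2, a2)]
                    # On-policy SARSA update towards r + gamma * Q(s', a').
                    Q[(s, a)] += alpha * (target - Q[(s, a)])
                    s, a = s2, a2
            return Q

        random.seed(0)
        print(sarsa())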

  • S. Rajaram, T. Graepel, and R. Herbrich, “Poisson-Networks: A Model for Structured Poisson Processes,” in Proceedings of the tenth international workshop on artificial intelligence and statistics, 2004, p. 277–284.
    [BibTeX] [Abstract] [Download PDF]

    Modelling structured multivariate point process data has wide ranging applications like understanding neural activity, developing faster file access systems and learning dependencies among servers in large networks. In this paper, we develop the Poisson network model for representing multivariate structured Poisson processes. In our model each node of the network represents a Poisson process. The novelty of our work is that waiting times of a process are modelled by an exponential distribution with a piecewise constant rate function that depends on the event counts of its parents in the network in a generalised linear way. Our choice of model allows to perform exact sampling from arbitrary structures. We adopt a Bayesian approach for learning the network structure. Further, we discuss fixed point and sampling based approximations for performing inference of rate functions in Poisson networks.

    @inproceedings{Rajaram2004,
    abstract = {Modelling structured multivariate point process data has wide ranging applications like understanding neural activity, developing faster file access systems and learning dependencies among servers in large networks. In this paper, we develop the Poisson network model for representing multivariate structured Poisson processes. In our model each node of the network represents a Poisson process. The novelty of our work is that waiting times of a process are modelled by an exponential distribution with a piecewise constant rate function that depends on the event counts of its parents in the network in a generalised linear way. Our choice of model allows to perform exact sampling from arbitrary structures. We adopt a Bayesian approach for learning the network structure. Further, we discuss fixed point and sampling based approximations for performing inference of rate functions in Poisson networks.},
    author = {Rajaram, Shyamsundar and Graepel, Thore and Herbrich, Ralf},
    booktitle = {Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics},
    file = {:Users/rherb/Code/herbrich.me/papers/spikenets.pdf:pdf},
    pages = {277--284},
    title = {{Poisson-Networks: A Model for Structured Poisson Processes}},
    url = {http://www.herbrich.me/papers/spikenets.pdf},
    year = {2004}
    }
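
    A crude simulation of the generative idea: a child node's event rate is piecewise constant and depends, through a log-linear (generalised linear) link, on how many parent events fell inside a recent window. Time is discretised into small bins here, which only approximates the continuous-time model; the window length, weights and rates are arbitrary illustrative choices.

        import numpy as np

        rng = np.random.default_rng(0)
        dt, T, window = 0.01, 50.0, 1.0          # bin width, horizon, count window
        steps = int(T / dt)

        # Parent node: homogeneous Poisson process with rate 2 (binned).
        parent = rng.random(steps) < 2.0 * dt

        # Child node: rate = exp(b0 + b1 * #parent events in the last window).
        b0, b1 = -1.0, 0.5
        w = int(window / dt)
        child = np.zeros(steps, dtype=bool)
        for t in range(steps):
            recent = parent[max(0, t - w):t].sum()
            rate = np.exp(b0 + b1 * recent)
            child[t] = rng.random() < min(1.0, rate * dt)   # Bernoulli-bin approximation

        print("parent events:", int(parent.sum()), "child events:", int(child.sum()))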

2003

  • T. Graepel and R. Herbrich, “Invariant Pattern Recognition by Semidefinite Programming Machines,” in Advances in neural information processing systems 16, Vancouver, 2003, p. 33–40.
    [BibTeX] [Abstract] [Download PDF]

    Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm – the Semidefinite Programming Machine (SDPM) – which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in dual variables and allows to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods.

    @inproceedings{Graepel2003,
    abstract = {Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm - the Semidefinite Programming Machine (SDPM) - which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in dual variables and allows to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods.},
    address = {Vancouver},
    author = {Graepel, Thore and Herbrich, Ralf},
    booktitle = {Advances in Neural Information Processing Systems 16},
    file = {:Users/rherb/Code/herbrich.me/papers/sdpm.pdf:pdf},
    pages = {33--40},
    title = {{Invariant Pattern Recognition by Semidefinite Programming Machines}},
    url = {http://www.herbrich.me/papers/sdpm.pdf},
    year = {2003}
    }

  • T. Graepel, R. Herbrich, A. Kharechko, and J. Shawe-Taylor, “Semi-Definite Programming by Perceptron Learning,” in Advances in neural information processing systems 16, Vancouver, 2003, p. 457–464.
    [BibTeX] [Abstract] [Download PDF]

    We present a modified version of the perceptron learning algorithm (PLA) which solves semidefinite programs (SDPs) in polynomial time. The algorithm is based on the following three observations: (i) Semidefinite programs are linear programs with infinitely many (linear) constraints; (ii) every linear program can be solved by a sequence of constraint satisfaction problems with linear constraints; (iii) in general, the perceptron learning algorithm solves a constraint satisfaction problem with linear constraints in finitely many updates. Combining the PLA with a probabilistic rescaling algorithm (which, on average, increases the size of the feasible region) results in a probabilistic algorithm for solving SDPs that runs in polynomial time. We present preliminary results which demonstrate that the algorithm works, but is not competitive with state-of-the-art interior point methods.

    @inproceedings{DBLP:conf/nips/GraepelHKS03,
    abstract = {We present a modified version of the perceptron learning algorithm (PLA) which solves semidefinite programs (SDPs) in polynomial time. The algorithm is based on the following three observations: (i) Semidefinite programs are linear programs with infinitely many (linear) constraints; (ii) every linear program can be solved by a sequence of constraint satisfaction problems with linear constraints; (iii) in general, the perceptron learning algorithm solves a constraint satisfaction problem with linear constraints in finitely many updates. Combining the PLA with a probabilistic rescaling algorithm (which, on average, increases the size of the feasible region) results in a probabilistic algorithm for solving SDPs that runs in polynomial time. We present preliminary results which demonstrate that the algorithm works, but is not competitive with state-of-the-art interior point methods.},
    address = {Vancouver},
    author = {Graepel, Thore and Herbrich, Ralf and Kharechko, Andriy and Shawe-Taylor, John},
    booktitle = {Advances in Neural Information Processing Systems 16},
    file = {:Users/rherb/Code/herbrich.me/papers/sdpm-pla.pdf:pdf},
    pages = {457--464},
    publisher = {The MIT Press},
    title = {{Semi-Definite Programming by Perceptron Learning}},
    url = {http://www.herbrich.me/papers/sdpm-pla.pdf},
    year = {2003}
    }

  • A. Gretton, R. Herbrich, and A. Smola, “The Kernel Mutual Information,” in Proceedings of ieee international conference on acoustics, speech and signal processing, 2003, p. 880–883.
    [BibTeX] [Abstract] [Download PDF]

    We introduce a new contrast function, the kernel mutual information (KMI), to measure the degree of independence of continuous random variables. This contrast function provides an approximate upper bound on the mutual information, as measured near independence, and is based on a kernel density estimate of the mutual information between a discretised approximation of the continuous random variables. We show that Bach and Jordan’s kernel generalised variance (KGV) is also an upper bound on the same kernel density estimate, but is looser. Finally, we suggest that the addition of a regularising term in the KGV causes it to approach the KMI, which motivates the introduction of this regularisation.

    @inproceedings{Gretton2003,
    abstract = {We introduce a new contrast function, the kernel mutual information (KMI), to measure the degree of independence of continuous random variables. This contrast function provides an approximate upper bound on the mutual information, as measured near independence, and is based on a kernel density estimate of the mutual information between a discretised approximation of the continuous random variables. We show that Bach and Jordan's kernel generalised variance (KGV) is also an upper bound on the same kernel density estimate, but is looser. Finally, we suggest that the addition of a regularising term in the KGV causes it to approach the KMI, which motivates the introduction of this regularisation.},
    author = {Gretton, Arthur and Herbrich, Ralf and Smola, Alexander},
    booktitle = {Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing},
    file = {:Users/rherb/Code/herbrich.me/papers/GreHerSmo03.pdf:pdf},
    pages = {880--883},
    title = {{The Kernel Mutual Information}},
    url = {http://www.herbrich.me/papers/GreHerSmo03.pdf},
    year = {2003}
    }

  • E. Harrington, R. Herbrich, J. Kivinen, J. C. Platt, and R. C. Williamson, “Online Bayes Point Machines,” in Proceedings of advances in knowledge discovery and data mining, Seoul, 2003, p. 241–252.
    [BibTeX] [Abstract] [Download PDF]

    We present a new and simple algorithm for learning large margin classifiers that works in a truly online manner. The algorithm generates a linear classifier by averaging the weights associated with several perceptron-like algorithms run in parallel in order to approximate the Bayes point. A random subsample of the incoming data stream is used to ensure diversity in the perceptron solutions. We experimentally study the algorithm’s performance on online and batch learning settings. The online experiments showed that our algorithm produces a low prediction error on the training sequence and tracks the presence of concept drift. On the batch problems its performance is comparable to the maximum margin algorithm which explicitly maximises the margin.

    @inproceedings{DBLP:conf/pakdd/HarringtonHKPW03,
    abstract = {We present a new and simple algorithm for learning large margin classifiers that works in a truly online manner. The algorithm generates a linear classifier by averaging the weights associated with several perceptron-like algorithms run in parallel in order to approximate the Bayes point. A random subsample of the incoming data stream is used to ensure diversity in the perceptron solutions. We experimentally study the algorithm's performance on online and batch learning settings. The online experiments showed that our algorithm produces a low prediction error on the training sequence and tracks the presence of concept drift. On the batch problems its performance is comparable to the maximum margin algorithm which explicitly maximises the margin.},
    address = {Seoul},
    author = {Harrington, Edward and Herbrich, Ralf and Kivinen, Jyrki and Platt, John C and Williamson, Robert C},
    booktitle = {Proceedings of Advances in Knowledge Discovery and Data Mining},
    file = {:Users/rherb/Code/herbrich.me/papers/OBPM.pdf:pdf},
    pages = {241--252},
    title = {{Online Bayes Point Machines}},
    url = {http://www.herbrich.me/papers/OBPM.pdf},
    year = {2003}
    }
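
    The algorithmic idea can be sketched in a few lines: run several perceptrons in parallel, show each of them a random subsample of the incoming stream so that they settle on different consistent solutions, and predict with the averaged weight vector as an approximation to the Bayes point. The number of perceptrons and the subsampling probability below are illustrative choices, not the settings used in the paper.

        import numpy as np

        class OnlineBPM:
            """K perceptrons trained in parallel on random subsamples of a stream;
            the averaged weight vector approximates the Bayes point (sketch only)."""
            def __init__(self, dim, k=10, sample_prob=0.5, seed=0):
                self.W = np.zeros((k, dim))
                self.p = sample_prob
                self.rng = np.random.default_rng(seed)

            def update(self, x, y):                      # y in {-1, +1}
                seen = self.rng.random(len(self.W)) < self.p
                mistakes = (self.W @ x) * y <= 0
                self.W[seen & mistakes] += y * x         # perceptron update

            def predict(self, x):
                return np.sign(self.W.mean(axis=0) @ x)

        # Stream of linearly separable examples from a known weight vector.
        rng = np.random.default_rng(1)
        w_true = np.array([1.0, -2.0, 0.5])
        model = OnlineBPM(dim=3)
        for _ in range(2000):
            x = rng.normal(size=3)
            model.update(x, np.sign(w_true @ x))
        print(model.predict(np.array([1.0, -1.0, 0.0])))   # expect +1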

  • R. Herbrich and T. Graepel, “Introduction to the Special Issue on Learning Theory,” Journal of machine learning research, vol. 4, p. 755–757, 2003.
    [BibTeX] [Download PDF]
    @article{DBLP:journals/jmlr/HerbrichG03,
    author = {Herbrich, Ralf and Graepel, Thore},
    file = {:Users/rherb/Code/herbrich.me/papers/lt.pdf:pdf},
    journal = {Journal of Machine Learning Research},
    pages = {755--757},
    title = {{Introduction to the Special Issue on Learning Theory}},
    url = {http://www.herbrich.me/papers/lt.pdf},
    volume = {4},
    year = {2003}
    }

2002

  • R. Herbrich, Learning Kernel Classifiers: Theory and Algorithms, 2nd ed., The MIT Press, 2002.
    [BibTeX]
    @book{Herbrich2002a,
    author = {Herbrich, Ralf},
    edition = {2nd},
    publisher = {The MIT Press},
    title = {{Learning Kernel Classifiers: Theory and Algorithms}},
    year = {2002}
    }

  • R. Herbrich and T. Graepel, “A PAC-Bayesian margin bound for linear classifiers,” Ieee transactions on information theory, vol. 48, iss. 12, p. 3140–3150, 2002.
    [BibTeX] [Abstract] [Download PDF]

    We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training sample. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound, which was developed in the luckiness framework, and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and — for maximum margins — to a vanishing complexity term. In contrast to previous results, however, the new bound does depend on the dimensionality of feature space. The analysis shows that the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the fraction of hypothesis space consistent with the training sample. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal with respect to the new bound only if the feature vectors in the training sample are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only. Numerical simulations support this recommendation and demonstrate that the new error bound can be used for the purpose of model selection.

    @article{DBLP:journals/tit/HerbrichG02,
    abstract = {We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training sample. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound, which was developed in the luckiness framework, and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and — for maximum margins — to a vanishing complexity term. In contrast to previous results, however, the new bound does depend on the dimensionality of feature space. The analysis shows that the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the fraction of hypothesis space consistent with the training sample. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal with respect to the new bound only if the feature vectors in the training sample are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only. Numerical simulations support this recommendation and demonstrate that the new error bound can be used for the purpose of model selection.},
    author = {Herbrich, Ralf and Graepel, Thore},
    file = {:Users/rherb/Code/herbrich.me/papers/ieee-pacbayes.pdf:pdf},
    journal = {IEEE Transactions on Information Theory},
    number = {12},
    pages = {3140--3150},
    title = {{A PAC-Bayesian margin bound for linear classifiers}},
    url = {http://www.herbrich.me/papers/ieee-pacbayes.pdf},
    volume = {48},
    year = {2002}
    }

  • R. Herbrich and R. C. Williamson, “Learning and Generalization: Theoretical Bounds,” in Handbook of brain theory and neural networks, 2nd ed., The MIT Press, 2002, p. 619–623.
    [BibTeX] [Download PDF]
    @incollection{Herbrich2002,
    author = {Herbrich, Ralf and Williamson, Robert C},
    booktitle = {Handbook of Brain Theory and Neural Networks},
    edition = {2nd},
    file = {:Users/rherb/Code/herbrich.me/papers/HerWil02.pdf:pdf},
    pages = {619--623},
    publisher = {The MIT Press},
    title = {{Learning and Generalization: Theoretical Bounds}},
    url = {http://www.herbrich.me/papers/HerWil02.pdf},
    year = {2002}
    }

  • R. Herbrich and R. C. Williamson, “Algorithmic Luckiness,” Journal of machine learning research, vol. 3, p. 175–212, 2002.
    [BibTeX] [Abstract] [Download PDF]

    Classical statistical learning theory studies the generalisation performance of machine learning algorithms rather indirectly. One of the main detours is that algorithms are studied in terms of the hypothesis class that they draw their hypotheses from. In this paper, motivated by the luckiness framework of Shawe-Taylor et al. (1998), we study learning algorithms more directly and in a way that allows us to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits the margin, the sparsity of the resultant weight vector, and the degree of clustering of the training data in feature space.

    @article{DBLP:journals/jmlr/HerbrichW02,
    abstract = {Classical statistical learning theory studies the generalisation performance of machine learning algorithms rather indirectly. One of the main detours is that algorithms are studied in terms of the hypothesis class that they draw their hypotheses from. In this paper, motivated by the luckiness framework of Shawe-Taylor et al. (1998), we study learning algorithms more directly and in a way that allows us to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits the margin, the sparsity of the resultant weight vector, and the degree of clustering of the training data in feature space.},
    author = {Herbrich, Ralf and Williamson, Robert C},
    journal = {Journal of Machine Learning Research},
    pages = {175--212},
    title = {{Algorithmic Luckiness}},
    url = {http://www.herbrich.me/papers/algoluck.pdf},
    volume = {3},
    year = {2002}
    }

  • S. Hill, H. Zaragoza, R. Herbrich, and P. Rayner, “Average Precision and the Problem of Generalisation,” in Proceedings of the acm sigir workshop on mathematical and formal methods in information retrieval, 2002.
    [BibTeX] [Abstract] [Download PDF]

    In this paper we study the problem of generalisation in information retrieval. In particular we study precision-recall curves and the average precision value. We provide two types of bounds: large-deviation bounds of the average precision and maximum deviation bounds with respect to a given point of the precision recall curve. The first type of bounds are useful to answer the question: how far can true average precision be from the value observed on a test collection? The second is useful for obtaining bounds on average precision when tight bounds on a particular point of the curve can be established, as is the case when training SVMs or Perceptrons for document categorisation.

    @inproceedings{Hill2002,
    abstract = {In this paper we study the problem of generalisation in information retrieval. In particular we study precision-recall curves and the average precision value. We provide two types of bounds: large-deviation bounds of the average precision and maximum deviation bounds with respect to a given point of the precision recall curve. The first type of bounds are useful to answer the question: how far can true average precision be from the value observed on a test collection? The second is useful for obtaining bounds on average precision when tight bounds on a particular point of the curve can be established, as is the case when training SVMs or Perceptrons for document categorisation.},
    author = {Hill, Simon and Zaragoza, Hugo and Herbrich, Ralf and Rayner, Peter},
    booktitle = {Proceedings of the ACM SIGIR Workshop on Mathematical and Formal Methods in Information Retrieval},
    keywords = {ir,theory},
    title = {{Average Precision and the Problem of Generalisation}},
    url = {http://www.herbrich.me/papers/hill.pdf},
    year = {2002}
    }
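
    For readers who want the quantity being bounded in concrete form, here is a small self-contained sketch (Python, standard library only; the example relevance labels are made up) of average precision computed from the binary relevance labels of a ranked list:

    def average_precision(relevance):
        """Average precision of a ranked list given binary relevance labels (1 = relevant)."""
        hits, precisions = 0, []
        for rank, rel in enumerate(relevance, start=1):
            if rel:
                hits += 1
                precisions.append(hits / rank)   # precision at each relevant rank
        return sum(precisions) / len(precisions) if precisions else 0.0

    # hypothetical ranking with relevant documents at ranks 1, 3 and 6
    print(average_precision([1, 0, 1, 0, 0, 1]))   # (1/1 + 2/3 + 3/6) / 3 = 0.722...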

  • N. Lawrence, M. Seeger, and R. Herbrich, “Fast Sparse Gaussian Process Methods: The Informative Vector Machine,” in Advances in neural information processing systems 15, Vancouver, 2002, p. 609–616.
    [BibTeX] [Abstract] [Download PDF]

    We present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on information-theoretical principles, previously suggested for active learning. In contrast to most previous work on sparse GPs, our goal is not only to learn sparse predictors (which can be evaluated in O(d) rather than O(n), d ≪ n, n the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most O(n d*d), and in large real-world classification experiments we show that it can match prediction performance of the popular support vector machine (SVM), yet it requires only a fraction of the training time. In contrast to the SVM, our approximation produces estimates of predictive probabilities (`error bars’), allows for Bayesian model selection and is less complex in implementation.

    @inproceedings{DBLP:conf/nips/LawrenceSH02,
    abstract = {We present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on information-theoretical principles, previously suggested for active learning. In contrast to most previous work on sparse GPs, our goal is not only to learn sparse predictors (which can be evaluated in O(d) rather than O(n), d $\ll$ n, n the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most O(n d*d), and in large real-world classification experiments we show that it can match prediction performance of the popular support vector machine (SVM), yet it requires only a fraction of the training time. In contrast to the SVM, our approximation produces estimates of predictive probabilities (`error bars'), allows for Bayesian model selection and is less complex in implementation.},
    address = {Vancouver},
    author = {Lawrence, Neil and Seeger, Matthias and Herbrich, Ralf},
    booktitle = {Advances in Neural Information Processing Systems 15},
    pages = {609--616},
    title = {{Fast Sparse Gaussian Process Methods: The Informative Vector Machine}},
    url = {http://www.herbrich.me/papers/ivm.pdf},
    year = {2002}
    }

  • Y. Li, H. Zaragoza, R. Herbrich, J. Shawe-Taylor, and J. Kandola, “The Perceptron Algorithm with Uneven Margins,” in Proceedings of the nineteenth international conference on machine learning, New South Wales, 2002, p. 379–386.
    [BibTeX] [Abstract] [Download PDF]

    The perceptron algorithm with margins is a simple, fast and effective learning algorithm for linear classifiers; it produces decision hyperplanes within some constant ratio of the maximal margin. In this paper we study this algorithm and a new variant: the perceptron algorithm with uneven margins, tailored for document categorisation problems (i.e. problems where classes are highly unbalanced and performance depends on the ranking of patterns). We discuss the interest of these algorithms from a theoretical point of view, provide a generalisation of Novikoff’s theorem for uneven margins, give a geometric description of these algorithms and show experimentally that both algorithms yield equal or better performance than support vector machines, while reducing training time and sparsity, in classification (USPS) and document categorisation (Reuters) problems.

    @inproceedings{DBLP:conf/icml/LiZHSK02,
    abstract = {The perceptron algorithm with margins is a simple, fast and effective learning algorithm for linear classifiers; it produces decision hyperplanes within some constant ratio of the maximal margin. In this paper we study this algorithm and a new variant: the perceptron algorithm with uneven margins, tailored for document categorisation problems (i.e. problems where classes are highly unbalanced and performance depends on the ranking of patterns). We discuss the interest of these algorithms from a theoretical point of view, provide a generalisation of Novikoff's theorem for uneven margins, give a geometrically description of these algorithms and show experimentally that both algorithms yield equal or better performances than support vector machines, while reducing training time and sparsity, in classification (USPS) and document categorisation (Reuters) problems.},
    address = {New South Wales},
    author = {Li, Yaoyong and Zaragoza, Hugo and Herbrich, Ralf and Shawe-Taylor, John and Kandola, Jasvinder},
    booktitle = {Proceedings of the Nineteenth International Conference on Machine Learning},
    pages = {379--386},
    title = {{The Perceptron Algorithm with Uneven Margins}},
    url = {http://www.herbrich.me/papers/paum.pdf},
    year = {2002}
    }
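
    A minimal sketch of the uneven-margin update described in the abstract (Python; the bias is assumed to be folded into the data as a constant feature, and the margin parameters, learning rate and toy data are illustrative choices, not those of the paper):

    import numpy as np

    def paum(X, y, tau_pos=1.0, tau_neg=0.1, eta=1.0, epochs=100):
        """Perceptron with uneven margins; y must be in {-1, +1}."""
        w = np.zeros(X.shape[1])
        tau = {+1: tau_pos, -1: tau_neg}                # margin demanded of each class
        for _ in range(epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                if yi * (w @ xi) <= tau[int(yi)]:       # uneven margin requirement violated
                    w += eta * yi * xi                  # standard additive perceptron update
                    mistakes += 1
            if mistakes == 0:                           # every example meets its margin
                break
        return w

    # toy usage: augment with a constant feature so the learned bias is w[-1]
    X = np.hstack([np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]]), np.ones((4, 1))])
    y = np.array([1, 1, -1, -1])
    print(np.sign(X @ paum(X, y)))

    Demanding a larger margin from the positive class (tau_pos > tau_neg) pushes positive examples further from the decision boundary, which is the intended effect for the unbalanced categorisation problems mentioned in the abstract.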

  • S. E. Robertson, S. Walker, H. Zaragoza, and R. Herbrich, “Microsoft Cambridge at TREC 2002: Filtering track,” in Proceedings of the eleventh text retrieval conference (trec 2002), 2002, p. 361–368.
    [BibTeX] [Abstract] [Download PDF]

    Six runs were submitted for the Adaptive Filtering track, four on the adaptive filtering task (ok11af??), and two on the routing task (msPUM?). The adaptive filtering system has been somewhat modified from the one used for TREC-10, largely for efficiency and flexibility reasons; the basic filtering algorithms remain similar to those used in recent TRECs. For the routing task, a completely new system based on perceptrons with uneven margins was used.

    @inproceedings{Robertson2002,
    abstract = {Six runs were submitted for the Adaptive Filtering track, four on the adaptive filtering task (ok11af??), and two on the routing task (msPUM?). The adaptive filtering system has been somewhat modified from the one used for TREC-10, largely for efficiently and flexibility reasons; the basic filtering algorithms remain similar to those used in recent TRECs. For the routing task, a completely new system based on perceptrons with uneven margins was used.},
    author = {Robertson, Stephen E. and Walker, Stephen and Zaragoza, Hugo and Herbrich, Ralf},
    booktitle = {Proceedings of the Eleventh Text Retrieval Conference (TREC 2002)},
    keywords = {ir},
    pages = {361--368},
    title = {{Microsoft Cambridge at TREC 2002: Filtering track}},
    url = {http://www.herbrich.me/papers/trec2002.pdf},
    year = {2002}
    }

2001

  • T. Graepel, M. Goutrié, M. Krüger, and R. Herbrich, “Learning on Graphs in the Game of Go,” in International conference on artificial neural networks, Vienna, 2001, p. 347–352.
    [BibTeX] [Abstract] [Download PDF]

    We consider the game of Go from the point of view of machine learning and as a well-defined domain for learning on graph representations. We discuss the representation of both board positions and candidate moves and introduce the common fate graph (CFG) as an adequate representation of board positions for learning. Single candidate moves are represented as feature vectors with features given by subgraphs relative to the given move in the CFG. Using this representation we train a support vector machine (SVM) and a kernel perceptron to discriminate good moves from bad moves on a collection of life-and-death problems and on 9×9 game records. We thus obtain kernel machines that solve Go problems and play 9×9 Go.

    @inproceedings{DBLP:conf/icann/GraepelGKH01,
    abstract = {We consider the game of Go from the point of view of machine learning and as a well-defined domain for learning on graph representations. We discuss the representation of both board positions and candidate moves and introduce the common fate graph (CFG) as an adequate representation of board positions for learning. Single candidate moves are represented as feature vectors with features given by subgraphs relative to the given move in the CFG. Using this representation we train a support vector machine (SVM) and a kernel perceptron to discriminate good moves from bad moves on a collection of life-and-death problems and on 9x9 game records. We thus obtain kernel machines that solve Go problems and play 9x9 Go.},
    address = {Vienna},
    author = {Graepel, Thore and Goutri\'{e}, Mike and Kr\"{u}ger, Marco and Herbrich, Ralf},
    booktitle = {International Conference on Artificial Neural Networks},
    pages = {347--352},
    title = {{Learning on Graphs in the Game of Go}},
    url = {http://www.herbrich.me/papers/go.pdf},
    year = {2001}
    }

  • A. Gretton, A. Doucet, R. Herbrich, P. Rayner, and B. Schölkopf, “Support Vector Regression for Black-Box System Identification,” in 11th ieee workshop on statistical signal processing, 2001, p. 341–344.
    [BibTeX] [Abstract]

    In this paper, we demonstrate the use of support vector regression (SVR) techniques for black-box system identification. These methods derive from statistical learning theory, and are of great theoretical and practical interest. We briefly describe the theory underpinning SVR, and compare support vector methods with other approaches using radial basis networks. Finally, we apply SVR to modeling the behaviour of a hydraulic robot arm, and show that SVR improves on previously published results.

    @inproceedings{GrettonArthur2001,
    abstract = {In this paper, we demonstrate the use of support vector regression (SVR) techniques for black-box system identification. There methods derive from statistical learning theory, and are of great theoretical and practical interest. We briefly describe the theory underpinning SVR, and compare support vector methods with other approaches using radial basis networks. Finally, we apply SVR to modeling the behaviour of a hydralic robot arm, and show that SVR improves on previously published results.},
    author = {Gretton, Arthur and Doucet, Arnaud and Herbrich, Ralf and Rayner, Peter and Sch\"{o}lkopf, Bernhard},
    booktitle = {11th IEEE Workshop on Statistical Signal Processing},
    keywords = {kernel},
    pages = {341--344},
    title = {{Support Vector Regression for Black-Box System Identification}},
    year = {2001}
    }
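
    A sketch of how SVR can be used for black-box system identification in this spirit (Python with scikit-learn): build NARX-style regressors from lagged inputs and outputs and fit an epsilon-SVR. The lag order, kernel, hyperparameters and the simulated toy system are assumptions standing in for the hydraulic-arm data of the paper:

    import numpy as np
    from sklearn.svm import SVR

    def lagged_design(u, y, n_lags):
        """Regression matrix of past outputs and inputs (NARX regressors)."""
        X, target = [], []
        for k in range(n_lags, len(y)):
            X.append(np.r_[y[k - n_lags:k], u[k - n_lags:k]])
            target.append(y[k])
        return np.array(X), np.array(target)

    # toy input/output sequence standing in for measured system data
    rng = np.random.default_rng(0)
    u = rng.uniform(-1, 1, 500)
    y = np.zeros(500)
    for k in range(2, 500):                             # an arbitrary nonlinear system
        y[k] = 0.6 * np.tanh(y[k - 1]) + 0.3 * u[k - 1] - 0.1 * u[k - 2]

    X, t = lagged_design(u, y, n_lags=3)
    model = SVR(kernel="rbf", C=10.0, epsilon=0.01).fit(X, t)
    print("one-step-ahead fit R^2:", model.score(X, t))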

  • R. Herbrich, T. Graepel, and C. Campbell, “Bayes Point Machines,” Journal of machine learning research, vol. 1, p. 245–279, 2001.
    [BibTeX] [Abstract] [Download PDF]

    Kernel-classifiers comprise a powerful class of non-linear decision functions for binary classification. The support vector machine is an example of a learning algorithm for kernel classifiers that singles out the consistent classifier with the largest margin, i.e. minimal real-valued output on the training sample, within the set of consistent hypotheses, the so-called version space. We suggest the Bayes point machine as a well-founded improvement which approximates the Bayes-optimal decision by the centre of mass of version space. We present two algorithms to stochastically approximate the centre of mass of version space: a billiard sampling algorithm and a sampling algorithm based on the well known perceptron algorithm. It is shown how both algorithms can be extended to allow for soft-boundaries in order to admit training errors. Experimentally, we find that — for the zero training error case — Bayes point machines consistently outperform support vector machines on both surrogate data and real-world benchmark data sets. In the soft-boundary/soft-margin case, the improvement over support vector machines is shown to be reduced. Finally, we demonstrate that the real-valued output of single Bayes points on novel test points is a valid confidence measure and leads to a steady decrease in generalisation error when used as a rejection criterion.

    @article{DBLP:journals/jmlr/HerbrichGC01,
    abstract = {Kernel-classifiers comprise a powerful class of non-linear decision functions for binary classification. The support vector machine is an example of a learning algorithm for kernel classifiers that singles out the consistent classifier with the largest margin, i.e. minimal real-valued output on the training sample, within the set of consistent hypotheses, the so-called version space. We suggest the Bayes point machine as a well-founded improvement which approximates the Bayes-optimal decision by the centre of mass of version space. We present two algorithms to stochastically approximate the centre of mass of version space: a billiard sampling algorithm and a sampling algorithm based on the well known perceptron algorithm. It is shown how both algorithms can be extended to allow for soft-boundaries in order to admit training errors. Experimentally, we find that — for the zero training error case — Bayes point machines consistently outperform support vector machines on both surrogate data and real-world benchmark data sets. In the soft-boundary/soft-margin case, the improvement over support vector machines is shown to be reduced. Finally, we demonstrate that the real-valued output of single Bayes points on novel test points is a valid confidence measure and leads to a steady decrease in generalisation error when used as a rejection criterion.},
    author = {Herbrich, Ralf and Graepel, Thore and Campbell, Colin},
    journal = {Journal of Machine Learning Research},
    pages = {245--279},
    title = {{Bayes Point Machines}},
    url = {http://www.herbrich.me/papers/bpm.pdf},
    volume = {1},
    year = {2001}
    }

  • R. Herbrich and R. C. Williamson, “Algorithmic Luckiness,” in Advances in neural information processing systems 14, Vancouver, 2001, p. 391–397.
    [BibTeX] [Abstract] [Download PDF]

    In contrast to standard statistical learning theory which studies uniform bounds on the expected error we present a framework that exploits the specific learning algorithm used. Motivated by the luckiness framework [Taylor et al., 1998] we are also able to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits both the margin and the distribution of the data in feature space.

    @inproceedings{DBLP:conf/nips/HerbrichW01,
    abstract = {In contrast to standard statistical learning theory which studies uniform bounds on the expected error we present a framework that exploits the specific learning algorithm used. Motivated by the luckiness framework [Taylor et al., 1998] we are also able to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits both the margin and the distribution of the data in feature space.},
    address = {Vancouver},
    author = {Herbrich, Ralf and Williamson, Robert C},
    booktitle = {Advances in Neural Information Processing Systems 14},
    pages = {391--397},
    title = {{Algorithmic Luckiness}},
    url = {http://www.herbrich.me/papers/nips01\_algoluck.pdf},
    year = {2001}
    }

  • B. Schölkopf, R. Herbrich, and A. Smola, “A Generalized Representer Theorem,” in Proceedings of the fourteenth annual conference on computational learning theory, 2001, p. 416–426.
    [BibTeX] [Abstract]

    Wahba’s classical representer theorem states that the solutions of certain risk minimization problems involving an empirical risk term and a quadratic regularizer can be written as expansions in terms of the training examples. We generalize the theorem to a larger class of regularizers and empirical risk terms, and give a self-contained proof utilizing the feature space associated with a kernel. The result shows that a wide range of problems have optimal solutions that live in the finite dimensional span of the training examples mapped into feature space, thus enabling us to carry out kernel algorithms independent of the (potentially infinite) dimensionality of the feature space.

    @inproceedings{Scholkopf2001,
    abstract = {Wahba's classical representer theorem states that the solutions of certain risk minimization problems involving an empirical risk term and a quadratic regularizer can be written as expansions in terms of the training examples. We generalize the theorem to a larger class of regularizers and empirical risk terms, and give a self-contained proof utilizing the feature space associated with a kernel. The result shows that a wide range of problems have optimal solutions that live in the finite dimensional span of the training examples mapped into feature space, thus enabling us to carry out kernel algorithms independent of the (potentially infinite) dimensionality of the feature space.},
    author = {Sch\"{o}lkopf, Bernhard and Herbrich, Ralf and Smola, Alexander},
    booktitle = {Proceedings of the Fourteenth Annual Conference on Computational Learning Theory},
    keywords = {kernel,theory},
    pages = {416--426},
    title = {{A Generalized Representer Theorem}},
    year = {2001}
    }
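
    Up to notation (paraphrased from memory rather than quoted), the generalised statement has the following shape: for a kernel k with reproducing kernel Hilbert space H, an arbitrary empirical risk term c and a strictly monotonically increasing regulariser g, every minimiser of the regularised risk is a finite kernel expansion over the training points,

    f^\ast \in \operatorname*{argmin}_{f \in \mathcal{H}}
        \Bigl[ c\bigl((x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m))\bigr)
               + g\bigl(\lVert f \rVert_{\mathcal{H}}\bigr) \Bigr]
    \quad \Longrightarrow \quad
    f^\ast(\cdot) = \sum_{i=1}^{m} \alpha_i \, k(x_i, \cdot), \qquad \alpha_i \in \mathbb{R},

    which is exactly what licenses carrying out kernel algorithms in the finite-dimensional span of the mapped training examples.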

2000

  • T. Graepel and R. Herbrich, “The Kernel Gibbs Sampler,” in Advances in neural information processing systems 13, Denver, 2000, p. 514–520.
    [BibTeX] [Abstract]

    We present an algorithm that samples the hypothesis space of kernel classifiers. A uniform prior over normalised weight vectors together with a likelihood based on a model of label noise leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov Chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the line chosen. The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking into account label noise.

    @inproceedings{DBLP:conf/nips/GraepelH00,
    abstract = {We present an algorithm that samples the hypothesis space of kernel classifiers. Given a uniform prior over normalised weight vectors and a likelihood based on a model of label noise leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov Chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the line chosen. The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking into account label noise.},
    address = {Denver},
    author = {Graepel, Thore and Herbrich, Ralf},
    booktitle = {Advances in Neural Information Processing Systems 13},
    pages = {514--520},
    publisher = {The MIT Press},
    title = {{The Kernel Gibbs Sampler}},
    year = {2000}
    }
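
    A simplified, linear-kernel illustration of the sampling step described above (Python): starting from a unit weight vector, draw a random direction, and sample the next weight vector from the piecewise constant posterior along the resulting great circle, where the posterior of each arc is the label-noise likelihood of the classifiers on that arc. The noise rate, toy data and primal-space (non-kernelised) setting are assumptions made to keep the sketch short:

    import numpy as np

    def kgs_step(w, X, y, eps=0.1, rng=None):
        """One sampling step along a random great circle through the unit vector w."""
        rng = rng or np.random.default_rng(0)
        v = rng.standard_normal(w.shape)                 # random direction ...
        v -= (v @ w) * w                                 # ... made orthogonal to w ...
        v /= np.linalg.norm(v)                           # ... and unit length, so that
                                                         # w(theta) = w*cos(theta) + v*sin(theta) stays on the sphere
        # angles at which some training point changes side: w(theta) . x_i = 0
        roots = np.arctan2(-(X @ w), X @ v) % (2 * np.pi)
        breaks = np.sort(np.concatenate([roots, (roots + np.pi) % (2 * np.pi)]))
        starts, ends = breaks, np.concatenate([breaks[1:], breaks[:1] + 2 * np.pi])

        masses = np.empty(len(starts))
        for k, (s, e) in enumerate(zip(starts, ends)):
            mid = (s + e) / 2.0                          # the error count is constant on each arc
            w_mid = w * np.cos(mid) + v * np.sin(mid)
            errors = np.sum(np.sign(X @ w_mid) != y)
            masses[k] = (eps ** errors) * ((1 - eps) ** (len(y) - errors)) * (e - s)

        k = rng.choice(len(masses), p=masses / masses.sum())   # pick an arc proportional to posterior mass
        theta = rng.uniform(starts[k], ends[k])                # then a point uniformly within the arc
        return w * np.cos(theta) + v * np.sin(theta)

    # toy usage on four separable points
    X = np.array([[1.0, 0.2], [0.8, -0.1], [-1.0, 0.3], [-0.7, -0.4]])
    y = np.array([1, 1, -1, -1])
    rng = np.random.default_rng(42)
    w = np.array([1.0, 0.0])
    for _ in range(10):
        w = kgs_step(w, X, y, rng=rng)
    print("sampled weight vector:", w)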

  • T. Graepel, R. Herbrich, and J. Shawe-Taylor, “Generalisation Error Bounds for Sparse Linear Classifiers,” in Proceedings of the thirteenth annual conference on computational learning theory (colt 2000), june 28 – july 1, 2000, palo alto, california, 2000, p. 298–303.
    [BibTeX] [Download PDF]
    @inproceedings{DBLP:conf/colt/GraepelHS00,
    author = {Graepel, Thore and Herbrich, Ralf and Shawe-Taylor, John},
    booktitle = {Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (COLT 2000), June 28 - July 1, 2000, Palo Alto, California},
    pages = {298--303},
    title = {{Generalisation Error Bounds for Sparse Linear Classifiers}},
    url = {http://www.herbrich.me/papers/colt00\_sparse.pdf},
    year = {2000}
    }

  • T. Graepel, R. Herbrich, and R. C. Williamson, “From Margin to Sparsity,” in Advances in neural information processing systems 13, Denver, 2000, p. 210–216.
    [BibTeX] [Abstract] [Download PDF]

    We present an improvement of Novikoff’s perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the dual perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself.

    @inproceedings{DBLP:conf/nips/GraepelHW00,
    abstract = {We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the dual perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself.},
    address = {Denver},
    author = {Graepel, Thore and Herbrich, Ralf and Williamson, Robert C},
    booktitle = {Advances in Neural Information Processing Systems 13},
    pages = {210--216},
    publisher = {The MIT Press},
    title = {{From Margin to Sparsity}},
    url = {http://www.herbrich.me/papers/perc.pdf},
    year = {2000}
    }
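
    For context, the classical statement being reinterpreted is Novikoff's perceptron convergence theorem: if every training point lies in a ball of radius R and some unit-norm weight vector w* separates the sample with margin gamma, then

    \lVert x_i \rVert \le R,\quad y_i \langle w^\ast, x_i \rangle \ge \gamma \ \ \forall i,\quad \lVert w^\ast \rVert = 1
    \quad \Longrightarrow \quad
    \#\{\text{perceptron mistakes}\} \;\le\; \left(\frac{R}{\gamma}\right)^{2}.

    Read as a sparsity statement, at most (R/gamma)^2 training points can carry a non-zero coefficient in the dual (kernel) perceptron expansion, which is the margin-dependent sparsity guarantee the abstract refers to.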

  • R. Herbrich, “Learning Linear Classifiers – Theory and Algorithms,” PhD Thesis, Technical University of Berlin, 2000.
    [BibTeX]
    @phdthesis{Herbrich2000,
    author = {Herbrich, Ralf},
    school = {Technical University of Berlin},
    title = {{Learning Linear Classifiers - Theory and Algorithms}},
    type = {PhD Thesis},
    year = {2000}
    }

  • R. Herbrich and T. Graepel, “Large Scale Bayes Point Machines,” in Advances in neural information processing systems 13, Denver, 2000, p. 528–534.
    [BibTeX] [Abstract] [Download PDF]

    The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) – also known as the Bayes point – exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [Herbrich et al., 2000] is restricted to small sample size because it requires O(m*m) of memory and O(N*m*m) computational steps where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows us to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes Point Machines are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximative) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1% generalisation error for a given rejection rate of 10%.

    @inproceedings{DBLP:conf/nips/HerbrichG00a,
    abstract = {The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [Herbrich et al., 2000] is restricted to small sample size because it requires O(m*m) of memory and O(N*m*m) computational steps where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes Point Machines are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximative) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1\% generalisation error for a given rejection rate of 10\%.},
    address = {Denver},
    author = {Herbrich, Ralf and Graepel, Thore},
    booktitle = {Advances in Neural Information Processing Systems 13},
    pages = {528--534},
    publisher = {The MIT Press},
    title = {{Large Scale Bayes Point Machines}},
    url = {http://www.herbrich.me/papers/mnist.pdf},
    year = {2000}
    }
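
    A primal-space sketch of the idea described above (Python): run the plain perceptron to consistency on several random permutations of the training sample, normalise each resulting weight vector, and average them as an estimate of the centre of mass of version space. The kernelisation, multi-class handling and rejection mechanism of the paper are omitted, and the toy data are an assumption:

    import numpy as np

    def perceptron(X, y, max_epochs=1000):
        """Plain perceptron run until the sample is classified consistently."""
        w = np.zeros(X.shape[1])
        for _ in range(max_epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                if yi * (w @ xi) <= 0:
                    w += yi * xi
                    mistakes += 1
            if mistakes == 0:
                break
        return w

    def bayes_point_estimate(X, y, n_samples=50, rng=None):
        """Average of unit-norm perceptron solutions over random permutations of the data."""
        rng = rng or np.random.default_rng(0)
        ws = []
        for _ in range(n_samples):
            perm = rng.permutation(len(y))
            w = perceptron(X[perm], y[perm])
            ws.append(w / np.linalg.norm(w))        # project each consistent classifier onto the sphere
        w_bp = np.mean(ws, axis=0)
        return w_bp / np.linalg.norm(w_bp)

    # linearly separable toy data standing in for MNIST
    rng = np.random.default_rng(1)
    X = rng.standard_normal((100, 5))
    y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]))
    print(bayes_point_estimate(X, y))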

  • R. Herbrich and T. Graepel, “A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work,” in Advances in neural information processing systems 13, Denver, 2000, p. 224–230.
    [BibTeX] [Abstract] [Download PDF]

    We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and – for maximum margins – to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only – a recommendation that is well supported by our numerical experiments on two benchmark data sets.

    @inproceedings{DBLP:conf/nips/HerbrichG00,
    abstract = {We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and –for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets.},
    address = {Denver},
    author = {Herbrich, Ralf and Graepel, Thore},
    booktitle = {Advances in Neural Information Processing Systems 13},
    pages = {224--230},
    title = {{A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work}},
    url = {http://www.herbrich.me/papers/pacbayes.pdf},
    year = {2000}
    }

  • R. Herbrich, T. Graepel, and C. Campbell, “Robust Bayes Point Machines,” in Proceedings of european symposium on artificial neural networks, Brussels, 2000, p. 49–54.
    [BibTeX] [Abstract] [Download PDF]

    Support Vector Machines choose the hypothesis corresponding to the centre of the largest hypersphere that can be inscribed in version space. If version space is elongated or irregularly shaped a potentially superior approach is to take into account the whole of version space. We propose to construct the Bayes point which is approximated by the centre of mass. Our implementation of a Bayes Point Machine (BPM) uses an ergodic billiard to estimate this point in the kernel space. We show that BPMs outperform hard margin Support Vector Machines (SVMs) on real world datasets. We introduce a technique that allows the BPM to construct hypotheses with non-zero training error similar to soft margin SVMs with quadratic penalisation of the margin slacks. An experimental study reveals that with decreasing penalisation of training error the improvement of BPMs over SVMs decays, a finding that is explained by geometrical considerations.

    @inproceedings{DBLP:conf/esann/HerbichGC00,
    abstract = {Support Vector Machines choose the hypothesis corresponding to the centre of the largest hypersphere that can be inscribed in version space. If version space is elongated or irregularly shaped a potentially superior approach is take into account the whole of version space. We propose to construct the Bayes point which is approximated by the centre of mass. Our implementation of a Bayes Point Machine (BPM) uses an ergodic billiard to estimate this point in the kernel space. We show that BPMs outperform hard margin Support Vector Machines (SVMs) on real world datasets. We introduce a technique that allows the BPM to construct hypotheses with non–zero training error similar to soft margin SVMs with quadratic penelisation of the margin slacks. An experimental study reveals that with decreasing penelisation of training error the improvement of BPMs over SVMs decays, a finding that is explained by geometrical considerations.},
    address = {Brussels},
    author = {Herbrich, Ralf and Graepel, Thore and Campbell, Colin},
    booktitle = {Proceedings of European Symposium on Artificial Neural Networks},
    pages = {49--54},
    title = {{Robust Bayes Point Machines}},
    url = {http://www.herbrich.me/papers/esann00.pdf},
    year = {2000}
    }

  • R. Herbrich, T. Graepel, and J. Shawe-Taylor, “Sparsity vs. Large Margins for Linear Classifiers,” in Proceedings of the thirteenth annual conference on computational learning theory, Palo Alto, 2000, p. 304–308.
    [BibTeX] [Abstract] [Download PDF]

    We provide small sample size bounds on the generalisation error of linear classifiers that take advantage of large observed margins on the training set and sparsity in the data dependent expansion coefficients. It is already known from results in the luckiness framework that both criteria independently have a large impact on the generalisation error. Our new results show that they can be combined which theoretically justifies learning algorithms like the Support Vector Machine or the Relevance Vector Machine. In contrast to previous studies we avoid using the classical technique of symmetrisation by a ghost sample but directly use the sparsity for the estimation of the generalisation error. We demonstrate that our result leads to practically useful results even in the case of small sample size if the training set witnesses our prior belief in sparsity and large margins.

    @inproceedings{DBLP:conf/colt/HerbrichGS00,
    abstract = {We provide small sample size bounds on the generalisation error of linear classifiers that take advantage of large observed margins on the training set and sparsity in the data dependent expansion coefficients. It is already known from results in the luckiness framework that both criteria independently have a large impact on the generalisation error. Our new results show that they can be combined which theoretically justifies learning algorithms like the Support Vector Machine or the Relevance Vector Machine. In contrast to previous studies we avoid using the classical technique of symmetrisation by a ghost sample but directly using the sparsity for the estimation of the generalisation error. We demonstrate that our result leads to practical useful results even in case of small sample size if the training set witnesses our prior belief in sparsity and large margins.},
    address = {Palo Alto},
    author = {Herbrich, Ralf and Graepel, Thore and Shawe-Taylor, John},
    booktitle = {Proceedings of the Thirteenth Annual Conference on Computational Learning Theory},
    pages = {304--308},
    title = {{Sparsity vs. Large Margins for Linear Classifiers}},
    url = {http://www.herbrich.me/papers/colt00\_sparsemargin.pdf},
    year = {2000}
    }

1999

  • T. Graepel, R. Herbrich, and K. Obermayer, “Bayesian Transduction,” in Advances in neural information processing systems 12, Denver, 1999, p. 456–462.
    [BibTeX] [Abstract] [Download PDF]

    Transduction is an inference principle that takes a training sample and aims at estimating the values of a function at given points contained in the so-called working sample. Hence, transduction is a less ambitious task than induction which aims at inferring a functional dependency on the whole of input space. As a consequence, however, transduction provides a confidence measure on single predictions rather than classifiers, a feature particularly important for risk–sensitive applications. We consider the case of binary classification by linear discriminant functions (perceptrons) in kernel space. From the transductive point of view, the infinite number of perceptrons is boiled down to a finite number of equivalence classes on the working sample each of which corresponds to a polyhedron in parameter space. In the Bayesian spirit the posteriori probability of a labelling of the working sample is determined as the ratio between the volume of the corresponding polyhedron and the volume of version space. Then the maximum posteriori scheme recommends to choose the labelling of maximum volume. We suggest to sample version space by an ergodic billiard in kernel space. Experimental results on real world data indicate that Bayesian Transduction compares favourably to the well-known Support Vector Machine, in particular if the posteriori probability of labellings is used as a confidence measure to exclude test points of low confidence.

    @inproceedings{DBLP:conf/nips/GraepelHO99,
    abstract = {Transduction is an inference principle that takes a training sample and aims at estimating the values of a function at given points contained in the so-called working sample. Hence, transduction is a less ambitious task than induction which aims at inferring a functional dependency on the whole of input space. As a consequence, however, transduction provides a confidence measure on single predictions rather than classifiers, a feature particularly important for risk–sensitive applications. We consider the case of binary classification by linear discriminant functions (perceptrons) in kernel space. From the transductive point of view, the infinite number of perceptrons is boiled down to a finite number of equivalence classes on the working sample each of which corresponds to a polyhedron in parameter space. In the Bayesian spirit the posteriori probability of a labelling of the working sample is determined as the ratio between the volume of the corresponding polyhedron and the volume of version space. Then the maximum posteriori scheme recommends to choose the labelling of maximum volume. We suggest to sample version space by an ergodic billiard in kernel space. Experimental results on real world data indicate that Bayesian Transduction compares favourably to the well-known Support Vector Machine, in particular if the posteriori probability of labellings is used as a confidence measure to exclude test points of low confidence.},
    address = {Denver},
    author = {Graepel, Thore and Herbrich, Ralf and Obermayer, Klaus},
    booktitle = {Advances in Neural Information Processing Systems 12},
    pages = {456--462},
    publisher = {The MIT Press},
    title = {{Bayesian Transduction}},
    url = {http://www.herbrich.me/papers/trans.pdf},
    year = {1999}
    }

  • T. Graepel, R. Herbrich, B. Schölkopf, A. Smola, P. Bartlett, K. R. Müller, K. Obermayer, and R. C. Williamson, “Classification on Proximity Data with LP-Machines,” in Proceedings of the ninth international conference on artificial neural networks, Edinburgh, 1999, p. 304–309.
    [BibTeX] [Abstract] [Download PDF]

    We provide a new linear program to deal with classification of data in the case of data given in terms of pairwise proximities. This allows to avoid the problems inherent in using feature spaces with indefinite metric in Support Vector Machines, since the notion of a margin is purely needed in input space where the classification actually occurs. Moreover in our approach we can enforce sparsity in the proximity representation by sacrificing training error. This turns out to be favorable for proximity data. Similar to nu–SV methods, the only parameter needed in the algorithm is the (asymptotical) number of data points being classified with a margin. Finally, the algorithm is successfully compared with nu–SV learning in proximity space and K–nearest-neighbors on real world data from Neuroscience and molecular biology.

    @inproceedings{Graepel1999,
    abstract = {We provide a new linear program to deal with classification of data in the case of data given in terms of pairwise proximities. This allows to avoid the problems inherent in using feature spaces with indefinite metric in Support Vector Machines, since the notion of a margin is purely needed in input space where the classification actually occurs. Moreover in our approach we can enforce sparsity in the proximity representation by sacrificing training error. This turns out to be favorable for proximity data. Similar to nu–SV methods, the only parameter needed in the algorithm is the (asymptotical) number of data points being classified with a margin. Finally, the algorithm is successfully compared with nu–SV learning in proximity space and K–nearest-neighbors on real world data from Neuroscience and molecular biology.},
    address = {Edinburgh},
    author = {Graepel, Thore and Herbrich, Ralf and Sch\"{o}lkopf, Bernhard and Smola, Alex and Bartlett, Peter and M\"{u}ller, Klaus Robert and Obermayer, Klaus and Williamson, Robert C},
    booktitle = {Proceedings of the Ninth International Conference on Artificial Neural Networks},
    pages = {304--309},
    title = {{Classification on Proximity Data with LP-Machines}},
    url = {http://www.herbrich.me/papers/icann99\_proxy.pdf},
    year = {1999}
    }
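
    A minimal sketch of a linear-programming classifier that works directly on a proximity matrix, in the spirit of the machine described above (Python with scipy). It minimises the L1 norm of the expansion coefficients plus a slack penalty; the C-style soft-margin constant replaces the nu parameterisation of the paper, and the toy distance data are an assumption, so this is an illustration rather than the exact formulation:

    import numpy as np
    from scipy.optimize import linprog

    def lp_proximity_classifier(P, y, C=1.0):
        """Sparse classifier f(x) = sum_j alpha_j P(x, x_j) + b trained by linear programming."""
        m = len(y)
        Y = np.diag(y.astype(float))
        # variables: [alpha_plus (m), alpha_minus (m), b_plus, b_minus, slack (m)], all >= 0
        c = np.concatenate([np.ones(2 * m), [0.0, 0.0], C * np.ones(m)])
        A_ub = np.hstack([
            -Y @ P,                                  # alpha_plus columns
            Y @ P,                                   # alpha_minus columns
            -y.reshape(-1, 1).astype(float),         # b_plus column
            y.reshape(-1, 1).astype(float),          # b_minus column
            -np.eye(m),                              # slack columns
        ])
        b_ub = -np.ones(m)                           # encodes y_i * f(x_i) >= 1 - slack_i
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
        alpha = res.x[:m] - res.x[m:2 * m]
        b = res.x[2 * m] - res.x[2 * m + 1]
        return alpha, b

    # toy proximity matrix: pairwise Euclidean distances between four 2-D points
    pts = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
    y = np.array([-1, -1, 1, 1])
    P = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    alpha, b = lp_proximity_classifier(P, y)
    print("predictions:", np.sign(P @ alpha + b))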

  • R. Herbrich, T. Graepel, and C. Campbell, “Bayes Point Machines: Estimating the Bayes Point in Kernel Space,” in Proceedings of ijcai workshop support vector machines, Stockholm, 1999, p. 23–27.
    [BibTeX] [Abstract] [Download PDF]

    From a Bayesian perspective Support Vector Machines choose the hypothesis corresponding to the largest possible hypersphere that can be inscribed in version space, i.e. in the space of all consistent hypotheses given a training set. Those boundaries of version space which are tangent to the hypersphere define the support vectors. An alternative and potentially better approach is to construct the hypothesis using the whole of version space. This is achieved by using a Bayes Point Machine which finds the midpoint of the region of intersection of all hyperplanes bisecting version space into two halves of equal volume (the Bayes point). It is known that the center of mass of version space approximates the Bayes point. We suggest estimating the center of mass by averaging over the trajectory of a billiard ball bouncing in version space. Experimental results are presented indicating that Bayes Point Machines consistently outperform Support Vector Machines.

    @inproceedings{Herbrich1999b,
    abstract = {From a Bayesian perspective Support Vector Machines choose the hypothesis corresponding to the largest possible hypersphere that can be inscribed in version space, i.e. in the space of all consistent hypotheses given a training set. Those boundaries of version space which are tangent to the hypersphere define the support vectors. An alternative and potentially better approach is to construct the hypothesis using the whole of version space. This is achieved by using a Bayes Point Machine which finds the midpoint of the region of intersection of all hyperplanes bisecting version space into two halves of equal volume (the Bayes point). It is known that the center of mass of version space approximates the Bayes point. We suggest estimating the center of mass by averaging over the trajectory of a billiard ball bouncing in version space. Experimental results are presented indicating that Bayes Point Machines consistently outperform Support Vector Machines.},
    address = {Stockholm},
    author = {Herbrich, Ralf and Graepel, Thore and Campbell, Colin},
    booktitle = {Proceedings of IJCAI Workshop Support Vector Machines},
    pages = {23--27},
    title = {{Bayes Point Machines: Estimating the Bayes Point in Kernel Space}},
    url = {http://www.herbrich.me/papers/ijcai99.pdf},
    year = {1999}
    }

  • R. Herbrich, T. Graepel, and K. Obermayer, “Support Vector Learning for Ordinal Regression,” in Proceedings of the ninth international conference on artificial neural networks, Edinburgh, 1999, p. 97–102.
    [BibTeX] [Abstract] [Download PDF]

    We investigate the problem of predicting variables of ordinal scale. This task is referred to as ordinal regression and is complementary to the standard machine learning tasks of classification and metric regression. In contrast to statistical models we present a distribution independent formulation of the problem together with uniform bounds of the risk functional. The approach presented is based on a mapping from objects to scalar utility values. Similar to Support Vector methods we derive a new learning algorithm for the task of ordinal regression based on large margin rank boundaries. We give experimental results for an information retrieval task: learning the order of documents w.r.t. an initial query. Experimental results indicate that the presented algorithm outperforms more naive approaches to ordinal regression such as Support Vector classification and Support Vector regression in the case of more than two ranks.

    @inproceedings{Herbrich1999c,
    abstract = {We investigate the problem of predicting variables of ordinal scale. This task is referred to as ordinal regression and is complementary to the standard machine learning tasks of classification and metric regression. In contrast to statistical models we present a distribution independent formulation of the problem together with uniform bounds of the risk functional. The approach presented is based on a mapping from objects to scalar utility values. Similar to Support Vector methods we derive a new learning algorithm for the task of ordinal regression based on large margin rank boundaries. We give experimental results for an information retrieval task: learning the order of documents w.r.t. an initial query. Experimental results indicate that the presented algorithm outperforms more naive approaches to ordinal regression such as Support Vector classification and Support Vector regression in the case of more than two ranks.},
    address = {Edinburgh},
    author = {Herbrich, Ralf and Graepel, Thore and Obermayer, Klaus},
    booktitle = {Proceedings of the Ninth International Conference on Artificial Neural Networks},
    pages = {97--102},
    title = {{Support Vector Learning for Ordinal Regression}},
    url = {http://www.herbrich.me/papers/icann99\_ordinal.pdf},
    year = {1999}
    }

  • R. Herbrich, T. Graepel, and K. Obermayer, “Large Margin Rank Boundaries for Ordinal Regression,” in Advances in large margin classifiers, The MIT Press, 1999, p. 115–132.
    [BibTeX] [Abstract] [Download PDF]

    In contrast to the standard machine learning tasks of classification and metric regression we investigate the problem of predicting variables of ordinal scale, a setting referred to as ordinal regression. This problem arises frequently in the social sciences and in information retrieval where human preferences play a major role. Whilst approaches proposed in statistics rely on a probability model of a latent (unobserved) variable we present a distribution independent risk formulation of ordinal regression which allows us to derive a uniform convergence bound. Applying this bound we present a large margin algorithm that is based on a mapping from objects to scalar utility values thus classifying pairs of objects. We give experimental results for an information retrieval task which show that our algorithm outperforms more naive approaches to ordinal regression such as Support Vector Classification and Support Vector Regression in the case of more than two ranks.

    @incollection{Herbrich1999d,
    abstract = {In contrast to the standard machine learning tasks of classification and metric regression we investigate the problem of predicting variables of ordinal scale, a setting referred to as ordinal regression. This problem arises frequently in the social sciences and in information retrieval where human preferences play a major role. Whilst approaches proposed in statistics rely on a probability model of a latent (unobserved) variable we present a distribution independent risk formulation of ordinal regression which allows us to derive a uniform convergence bound. Applying this bound we present a large margin algorithm that is based on a mapping from objects to scalar utility values thus classifying pairs of objects. We give experimental results for an information retrieval task which show that our algorithm outperforms more naive approaches to ordinal regression such as Support Vector Classification and Support Vector Regression in the case of more than two ranks.},
    author = {Herbrich, Ralf and Graepel, Thore and Obermayer, Klaus},
    booktitle = {Advances in Large Margin Classifiers},
    chapter = {7},
    pages = {115--132},
    publisher = {The MIT Press},
    title = {{Large Margin Rank Boundaries for Ordinal Regression}},
    url = {http://www.herbrich.me/papers/nips98\_ordinal.pdf},
    year = {1999}
    }
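
    A compact sketch of the pairwise reduction described in this chapter (Python with scikit-learn): map each object to a scalar utility by training a linear classifier on difference vectors of object pairs with different ranks. The explicit rank boundaries are omitted, and the hidden utility, rank thresholds and hyperparameters are made-up assumptions:

    import numpy as np
    from itertools import combinations
    from sklearn.svm import LinearSVC

    def pairwise_differences(X, ranks):
        """Difference vectors x_i - x_j for all pairs with different ranks, labelled by rank order."""
        diffs, labels = [], []
        for i, j in combinations(range(len(ranks)), 2):
            if ranks[i] != ranks[j]:
                diffs.append(X[i] - X[j])
                labels.append(1 if ranks[i] > ranks[j] else -1)
        return np.array(diffs), np.array(labels)

    # toy data whose rank grows with a hidden utility u(x) = 2*x0 - x1
    rng = np.random.default_rng(0)
    X = rng.standard_normal((60, 4))
    u = 2 * X[:, 0] - X[:, 1]
    ranks = np.digitize(u, np.quantile(u, [0.33, 0.66]))    # three ordered ranks 0 < 1 < 2

    D, z = pairwise_differences(X, ranks)
    svm = LinearSVC(fit_intercept=False, C=1.0).fit(D, z)   # no bias: utility differences are translation-free
    utility = X @ svm.coef_.ravel()                         # learned scalar utility of each object
    print("rank/utility correlation:", np.corrcoef(ranks, utility)[0, 1])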

  • R. Herbrich, M. Keilbach, P. Bollmann-Sdorra, and K. Obermayer, “Neural Networks in Economics: Background, Applications and New Developments,” Advances in computational economics, vol. 11, p. 169–196, 1999.
    [BibTeX] [Abstract] [Download PDF]

    Neural Networks were developed in the sixties as devices for classification and regression. The approach was originally inspired by Neuroscience. Its attractiveness lies in the ability to ‘learn’, i.e. to generalize to as yet unseen observations. One aim of this paper is to give an introduction to the technique of Neural Networks and an overview of the most popular architectures. We start from statistical learning theory to introduce the basics of learning. Then, we give an overview of the general principles of neural networks and of their use in the field of Economics. A second purpose is to introduce a recently developed Neural Network Learning technique, so-called Support Vector Network Learning, which is an application of ideas from statistical learning theory. This approach has shown very promising results on problems with a limited amount of training examples. Moreover, utilizing a technique that is known as the kernel trick, Support Vector Networks can easily be adapted to nonlinear models. Finally, we present an economic application of this approach from the field of preference learning.

    @article{Herbrich1999,
    abstract = {Neural Networks were developed in the sixties as devices for classification and regression. The approach was originally inspired from Neuroscience. Its attractiveness lies in the ability to 'learn', i.e. to generalize to as yet unseen observations. One aim of this paper is to give an introduction to the technique of Neural Networks and an overview of the most popular architectures. We start from statistical learning theory to introduce the basics of learning. Then, we give an overview of the general principles of neural networks and of their use in the field of Economics. A second purpose is to introduce a recently developed Neural Network Learning technique, so called Support Vector Network Learning, which is an application of ideas from statistical learning theory. This approach has shown very promising results on problems with a limited amount of training examples. Moreover, utilizing a technique that is known as the kernel trick, Support Vector Networks can easily be adapted to nonlinear models. Finally, we present an economic application of this approach from the field of preference learning.},
    author = {Herbrich, Ralf and Keilbach, Max and Bollmann-Sdorra, Peter and Obermayer, Klaus},
    journal = {Advances in Computational Economics},
    pages = {169--196},
    title = {{Neural Networks in Economics: Background, Applications and New Developments}},
    url = {http://www.herbrich.me/papers/herkeilgraebollober99.pdf},
    volume = {11},
    year = {1999}
    }

  • R. Herbrich and J. Weston, “Adaptive Margin Support Vector Machines for Classification,” in Proceedings of the ninth international conference on artificial neural networks, Edinburgh, 1999, p. 97–102.
    [BibTeX] [Abstract] [Download PDF]

    In this paper we propose a new learning algorithm for classification learning based on the Support Vector Machine (SVM) approach. Existing approaches for constructing SVMs are based on minimization of a regularized margin loss where the margin is treated equivalently for each training pattern. We propose a reformulation of the minimization problem such that adaptive margins for each training pattern are utilized, which we call the Adaptive Margin (AM)-SVM. We give bounds on the generalization error of AM-SVMs which justify their robustness against outliers, and show experimentally that the generalization error of AM-SVMs is comparable to classical SVMs on benchmark datasets from the UCI repository.

    @inproceedings{Herbrich1999a,
    abstract = {In this paper we propose a new learning algorithm for classification learning based on the Support Vector Machine (SVM) approach. Existing approaches for constructing SVMs are based on minimization of a regularized margin loss where the margin is treated equivalently for each training pattern. We propose a reformulation of the minimization problem such that adaptive margins for each training pattern are utilized, which we call the Adaptive Margin (AM)-SVM. We give bounds on the generalization error of AM-SVMs which justify their robustness against outliers, and show experimentally that the generalization error of AM-SVMs is comparable to classical SVMs on benchmark datasets from the UCI repository.},
    address = {Edinburgh},
    author = {Herbrich, Ralf and Weston, Jason},
    booktitle = {Proceedings of the Ninth International Conference on Artificial Neural Networks},
    pages = {97--102},
    title = {{Adaptive Margin Support Vector Machines for Classification}},
    url = {http://www.herbrich.me/papers/icann99\_ann.pdf},
    year = {1999}
    }
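
    As a rough guide to what "adaptive margins" means here, the linear program below is a sketch of the kind of formulation the abstract refers to (reconstructed from memory of the companion book chapter; check the paper for the exact constraint set). The term λ α_i k(x_i, x_i) asks patterns that carry large expansion coefficients to achieve a larger margin, which is what gives the robustness against outliers:

    \[
    \begin{aligned}
      \min_{\alpha,\,\xi}\quad & \sum_{i=1}^{m} \xi_i \\
      \text{s.t.}\quad & y_i \sum_{j=1}^{m} \alpha_j y_j\, k(x_j, x_i) \;\ge\; 1 - \xi_i + \lambda\, \alpha_i\, k(x_i, x_i), \\
      & \alpha_i \ge 0,\quad \xi_i \ge 0, \qquad i = 1, \dots, m.
    \end{aligned}
    \]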

  • J. Weston and R. Herbrich, “Adaptive Margin Support Vector Machines,” in Advances in large margin classifiers, The MIT Press, 1999, p. 281–296.
    [BibTeX] [Abstract] [Download PDF]

    In this chapter we present a new learning algorithm, Leave-One-Out (LOO-) SVMs and its generalization Adaptive Margin (AM-) SVMs, inspired by a recent upper bound on the leave-one-out error proved for kernel classifiers by Jaakkola and Haussler. The new approach minimizes the expression given by the bound in an attempt to minimize the leave-one-out error. This gives a convex optimization problem which constructs a sparse linear classifier in feature space using the kernel technique. As such the algorithm possesses many of the same properties as SVMs and Linear Programming (LP-) SVMs. These former techniques are based on the minimization of a regularized margin loss, where the margin is treated equivalently for each training pattern. We propose a minimization problem such that adaptive margins for each training pattern are utilized. Furthermore, we give bounds on the generalization error of the approach which justifies its robustness against outliers. We show experimentally that the generalization error of AM-SVMs is comparable to SVMs and LP-SVMs on benchmark datasets from the UCI repository.

    @incollection{Weston1999,
    abstract = {In this chapter we present a new learning algorithm, Leave-One-Out (LOO-) SVMs and its generalization Adaptive Margin (AM-) SVMs, inspired by a recent upper bound on the leave-one-out error proved for kernel classifiers by Jaakkola and Haussler. The new approach minimizes the expression given by the bound in an attempt to minimize the leave-one-out error. This gives a convex optimization problem which constructs a sparse linear classifier in feature space using the kernel technique. As such the algorithm possesses many of the same properties as SVMs and Linear Programming (LP-) SVMs. These former techniques are based on the minimization of a regularized margin loss, where the margin is treated equivalently for each training pattern. We propose a minimization problem such that adaptive margins for each training pattern are utilized. Furthermore, we give bounds on the generalization error of the approach which justifies its robustness against outliers. We show experimentally that the generalization error of AM-SVMs is comparable to SVMs and LP-SVMs on benchmark datasets from the UCI repository.},
    author = {Weston, Jason and Herbrich, Ralf},
    booktitle = {Advances in Large Margin Classifiers},
    chapter = {15},
    pages = {281--296},
    publisher = {The MIT Press},
    title = {{Adaptive Margin Support Vector Machines}},
    url = {http://www.herbrich.me/papers/nips98\_ann.pdf},
    year = {1999}
    }
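
    The bound by Jaakkola and Haussler that the chapter starts from can be paraphrased as follows (a standard statement of the bound, given here as a paraphrase rather than a quote from the chapter): for a kernel expansion with non-negative coefficients, removing pattern i can lower that pattern's own score by at most α_i k(x_i, x_i), so

    \[
      \mathrm{err}_{\mathrm{loo}} \;\le\; \frac{1}{m}\,\Big|\big\{\, i \;:\; y_i f(x_i) - \alpha_i\, k(x_i, x_i) \le 0 \,\big\}\Big|,
      \qquad
      f(x) = \sum_{j=1}^{m} \alpha_j y_j\, k(x_j, x), \;\; \alpha_j \ge 0 .
    \]

    Minimizing the right-hand side is what leads to the LOO-SVM; scaling the α_i k(x_i, x_i) term by a parameter λ gives the AM-SVM generalization described in the abstract.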

1998

  • T. Graepel, R. Herbrich, P. Bollmann-Sdorra, and K. Obermayer, “Classification on Pairwise Proximity Data,” in Advances in neural information processing systems 11, Denver, 1998, p. 438–444.
    [BibTeX] [Abstract] [Download PDF]

    We investigate the problem of learning a classification task on data represented in terms of their pairwise proximities. This representation does not refer to an explicit feature representation of the data items and is thus more general than the standard approach of using Euclidean feature vectors, from which pairwise proximities can always be calculated. Our first approach is based on a combined linear embedding and classification procedure resulting in an extension of the Optimal Hyperplane algorithm to pseudo-Euclidean data. As an alternative we present another approach based on a linear threshold model in the proximity values themselves, which is optimized using Structural Risk Minimization. We show that prior knowledge about the problem can be incorporated by the choice of distance measures and examine different metrics w.r.t. their generalization. Finally, the algorithms are successfully applied to protein structure data and to data from the cat’s cerebral cortex. They show better performance than K-nearest-neighbor classification.

    @inproceedings{DBLP:conf/nips/GraepelHBO98,
    abstract = {We investigate the problem of learning a classification task on data represented in terms of their pairwise proximities. This representation does not refer to an explicit feature representation of the data items and is thus more general than the standard approach of using Euclidean feature vectors, from which pairwise proximities can always be calculated. Our first approach is based on a combined linear embedding and classification procedure resulting in an extension of the Optimal Hyperplane algorithm to pseudo-Euclidean data. As an alternative we present another approach based on a linear threshold model in the proximity values themselves, which is optimized using Structural Risk Minimization. We show that prior knowledge about the problem can be incorporated by the choice of distance measures and examine different metrics w.r.t. their generalization. Finally, the algorithms are successfully applied to protein structure data and to data from the cat's cerebral cortex. They show better performance than K-nearest-neighbor classification.},
    address = {Denver},
    author = {Graepel, Thore and Herbrich, Ralf and Bollmann-Sdorra, Peter and Obermayer, Klaus},
    booktitle = {Advances in Neural Information Processing Systems 11},
    pages = {438--444},
    publisher = {The MIT Press},
    title = {{Classification on Pairwise Proximity Data}},
    url = {http://www.herbrich.me/papers/graeherbollober99.pdf},
    year = {1998}
    }
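
    A minimal sketch (Python with numpy and scikit-learn; a toy illustration, not the paper's implementation) of the second approach in the abstract: each object is represented by its vector of proximities to the training objects, and a linear large-margin classifier is fitted directly on those proximity values.

    # Hypothetical toy example: the learner never sees coordinates, only pairwise distances.
    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X_hidden = rng.normal(size=(60, 5))                  # hidden coordinates, used only to generate data
    y = np.sign(X_hidden[:, 0] + 0.1 * rng.normal(size=60))

    # Pairwise proximities (here Euclidean distances) are all the learner gets to see.
    D = np.linalg.norm(X_hidden[:, None, :] - X_hidden[None, :, :], axis=-1)

    train, test = np.arange(40), np.arange(40, 60)
    clf = LinearSVC(C=1.0).fit(D[np.ix_(train, train)], y[train])
    # A test object is described by its proximities to the *training* objects.
    print("test accuracy:", clf.score(D[np.ix_(test, train)], y[test]))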

  • R. Herbrich, T. Graepel, P. Bollmann-Sdorra, and K. Obermayer, “Learning Preference Relations for Information Retrieval,” in ICML-98 workshop: text categorization and machine learning, Madison, 1998, p. 80–84.
    [BibTeX] [Abstract] [Download PDF]

    In this paper we investigate the problem of learning a preference relation from a given set of ranked documents. We show that the Bayes’s optimal decision function, when applied to learning a preference relation, may violate transitivity. This is undesirable for information retrieval, because it is in conflict with a document ranking based on the user’s preferences. To overcome this problem we present a vector space based method that performs a linear mapping from documents to scalar utility values and thus guarantees transitivity. The learning of the relation between documents is formulated as a classification problem on pairs of documents and is solved using the principle of structural risk minimization for good generalization. The approach is extended to polynomial utility functions by using the potential function method (the so called “kernel trick”), which allows to incorporate higher order correlations of features into the utility function at minimal computational costs. The resulting algorithm is tested on an example with artificial data. The algorithm successfully learns the utility function underlying the training examples and shows good classification performance.

    @inproceedings{Herbrich1998,
    abstract = {In this paper we investigate the problem of learning a preference relation from a given set of ranked documents. We show that the Bayes's optimal decision function, when applied to learning a preference relation, may violate transitivity. This is undesirable for information retrieval, because it is in conflict with a document ranking based on the user's preferences. To overcome this problem we present a vector space based method that performs a linear mapping from documents to scalar utility values and thus guarantees transitivity. The learning of the relation between documents is formulated as a classification problem on pairs of documents and is solved using the principle of structural risk minimization for good generalization. The approach is extended to polynomial utility functions by using the potential function method (the so called "kernel trick"), which allows to incorporate higher order correlations of features into the utility function at minimal computational costs. The resulting algorithm is tested on an example with artificial data. The algorithm successfully learns the utility function underlying the training examples and shows good classification performance.},
    address = {Madison},
    author = {Herbrich, Ralf and Graepel, Thore and Bollmann-Sdorra, Peter and Obermayer, Klaus},
    booktitle = {ICML-98 Workshop: Text Categorization and Machine Learning},
    pages = {80--84},
    title = {{Learning Preference Relations for Information Retrieval}},
    url = {http://www.herbrich.me/papers/hergraebollober98.pdf},
    year = {1998}
    }
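
    A minimal sketch (Python with numpy and scikit-learn; a toy illustration, not the paper's code) of the pairs-of-documents idea: every ordered pair (better, worse) becomes a classification example with feature vector (better minus worse), and the learned weight vector induces a transitive utility u(d) = w·d.

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(1)
    docs = rng.normal(size=(30, 8))                      # toy document feature vectors
    ranks = docs @ rng.normal(size=8)                    # hidden utility defines the target ranking

    pairs, labels = [], []
    for i in range(len(docs)):
        for j in range(len(docs)):
            if ranks[i] > ranks[j]:                      # document i is preferred over document j
                pairs.append(docs[i] - docs[j]); labels.append(+1)
                pairs.append(docs[j] - docs[i]); labels.append(-1)

    clf = LinearSVC(fit_intercept=False, C=10.0).fit(np.array(pairs), np.array(labels))
    w = clf.coef_.ravel()                                # utility weights; u(d) = d @ w is transitive
    agreement = np.mean([(docs[i] - docs[j]) @ w > 0
                         for i in range(30) for j in range(30) if ranks[i] > ranks[j]])
    print("pairwise agreement:", round(float(agreement), 3))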

1997

  • R. Herbrich, “Segmentierung mit Gaborfiltern zur Induktion struktureller Klassifikatoren auf Bilddaten” [Segmentation with Gabor filters for the induction of structural classifiers on image data], Master Thesis, Technical University Berlin, 1997.
    [BibTeX]
    @phdthesis{Herbrich1997a,
    author = {Herbrich, Ralf},
    school = {Technical University Berlin},
    title = {{Segmentierung mit Gaborfiltern zur Induktion struktureller Klassifikatoren auf Bilddaten}},
    type = {Master Thesis},
    year = {1997}
    }

  • R. Herbrich and T. Scheffer, “Generation of Task-Specific Segmentation Procedures as a Model Selection Task,” in Proceedings of workshop on visual information processing, Sydney, 1997, p. 11–21.
    [BibTeX] [Abstract] [Download PDF]

    In image segmentation problems, there is usually a vast amount of filter operations available, a subset of which has to be selected and instantiated in order to obtain a satisfactory segmentation procedure for a particular domain. In supervised segmentation, a mapping from features, such as filter outputs for individual pixels, to classes is induced automatically. However, since the sample size required for supervised learning grows exponentially in the number of features it is not feasible to learn a segmentation procedure from a large amount of possible filters. But we argue that automatic model selection methods are able to select a region model in terms of some filters.

    @inproceedings{Herbrich1997,
    abstract = {In image segmentation problems, there is usually a vast amount of filter operations available, a subset of which has to be selected and instantiated in order to obtain a satisfactory segmentation procedure for a particular domain. In supervised segmentation, a mapping from features, such as filter outputs for individual pixels, to classes is induced automatically. However, since the sample size required for supervised learning grows exponentially in the number of features it is not feasible to learn a segmentation procedure from a large amount of possible filters. But we argue that automatic model selection methods are able to select a region model in terms of some filters.},
    address = {Sydney},
    author = {Herbrich, Ralf and Scheffer, Tobias},
    booktitle = {Proceedings of Workshop on Visual Information Processing},
    pages = {11--21},
    title = {{Generation of Task-Specific Segmentation Procedures as a Model Selection Task}},
    url = {http://www.herbrich.me/papers/herbrich97.pdf},
    year = {1997}
    }
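
    The abstract's point, that supervised learning over the full filter bank is infeasible while model selection can pick a small subset of filters, can be illustrated with cross-validation over candidate subsets. The filter responses below are hypothetical stand-ins, not the Gabor bank used in the paper (Python, assuming numpy and scikit-learn):

    from itertools import combinations
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(2)
    n_pixels, n_filters = 500, 8
    responses = rng.normal(size=(n_pixels, n_filters))             # synthetic per-pixel filter responses
    y = (responses[:, 0] + 0.5 * responses[:, 3] > 0).astype(int)  # labels depend on two filters only

    best_subset, best_score = None, -np.inf
    for k in (1, 2, 3):                                            # only small subsets are affordable
        for subset in combinations(range(n_filters), k):
            score = cross_val_score(LogisticRegression(max_iter=1000),
                                    responses[:, list(subset)], y, cv=5).mean()
            if score > best_score:
                best_subset, best_score = subset, score
    print("selected filters:", best_subset, "cv accuracy:", round(float(best_score), 3))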

  • T. Scheffer and R. Herbrich, “Unbiased Assessment of Learning Algorithms,” in Proceedings of the international joint conference on artificial intelligence, Nagoya, 1997, p. 798–803.
    [BibTeX] [Abstract] [Download PDF]

    In order to rank the performance of machine learning algorithms, many researchers conduct experiments on benchmark datasets. Since most learning algorithms have domain-specific parameters, it is a popular custom to adapt these parameters to obtain a minimal error rate on the test set. The same rate is used to rank the algorithm, which causes an optimistic bias. We quantify this bias, showing in particular that an algorithm with more parameters will probably be ranked higher than an equally good algorithm with fewer parameters. We demonstrate this result, showing the number of parameters and trials required in order to pretend to outperform C4.5 or FOIL, respectively, for various benchmark problems. We then describe how unbiased ranking experiments should be conducted.

    @inproceedings{DBLP:conf/ijcai/SchefferH97,
    abstract = {In order to rank the performance of machine learning algorithms, many researchers conduct experiments on benchmark datasets. Since most learning algorithms have domain-specific parameters, it is a popular custom to adapt these parameters to obtain a minimal error rate on the test set. The same rate is used to rank the algorithm, which causes an optimistic bias. We quantify this bias, showing in particular that an algorithm with more parameters will probably be ranked higher than an equally good algorithm with fewer parameters. We demonstrate this result, showing the number of parameters and trials required in order to pretend to outperform C4.5 or FOIL, respectively, for various benchmark problems. We then describe how unbiased ranking experiments should be conducted.},
    address = {Nagoya},
    author = {Scheffer, Tobias and Herbrich, Ralf},
    booktitle = {Proceedings of the International Joint Conference on Artificial Intelligence},
    pages = {798--803},
    title = {{Unbiased Assessment of Learning Algorithms}},
    url = {http://www.herbrich.me/papers/scheffer97.pdf},
    year = {1997}
    }
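
    The optimistic bias the abstract quantifies is easy to reproduce numerically: report the lowest of several test-set error rates obtained while "tuning" an irrelevant parameter, and compare with the error of the chosen setting on a fresh test set (Python with numpy; purely illustrative numbers, not the paper's analysis).

    import numpy as np

    rng = np.random.default_rng(3)
    true_error, n_test, n_trials, n_repeats = 0.20, 200, 20, 2000

    biased, unbiased = [], []
    for _ in range(n_repeats):
        # Every "trial" is an equally good parameter setting with the same true error rate.
        test_errors = rng.binomial(n_test, true_error, size=n_trials) / n_test
        biased.append(test_errors.min())                            # error reported on the tuning test set
        unbiased.append(rng.binomial(n_test, true_error) / n_test)  # chosen setting, fresh test set

    print("true error          :", true_error)
    print("min over trials     :", round(float(np.mean(biased)), 3))    # optimistically biased
    print("fresh-test estimate :", round(float(np.mean(unbiased)), 3))  # unbiased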

1996

  • T. Scheffer, R. Herbrich, and F. Wysotzki, “Efficient Theta-Subsumption Based on Graph Algorithms,” in Lecture notes in artificial intelligence: 6th international workshop on inductive logic programming, Stockholm, 1996, p. 212–228. doi:http://dx.doi.org/10.1007/3-540-63494-0_57
    [BibTeX] [Abstract] [Download PDF]

    The θ-subsumption problem is crucial to the efficiency of ILP learning systems. We discuss two θ-subsumption algorithms based on strategies for preselecting suitable matching literals. The class of clauses, for which subsumption becomes polynomial, is a superset of the deterministic clauses. We further map the general problem of θ-subsumption to a certain problem of finding a clique of fixed size in a graph, and in return show that a specialization of the pruning strategy of the Carraghan and Pardalos clique algorithm provides a dramatic reduction of the subsumption search space. We also present empirical results for the mesh design data set.

    @inproceedings{DBLP:conf/ilp/SchefferHW96,
    abstract = {The $\theta$-subsumption problem is crucial to the efficiency of ILP learning systems. We discuss two $\theta$-subsumption algorithms based on strategies for preselecting suitable matching literals. The class of clauses, for which subsumption becomes polynomial, is a superset of the deterministic clauses. We further map the general problem of $\theta$-subsumption to a certain problem of finding a clique of fixed size in a graph, and in return show that a specialization of the pruning strategy of the Carraghan and Pardalos clique algorithm provides a dramatic reduction of the subsumption search space. We also present empirical results for the mesh design data set.},
    address = {Stockholm},
    author = {Scheffer, Tobias and Herbrich, Ralf and Wysotzki, Fritz},
    booktitle = {Lecture Notes in Artificial Intelligence: 6th International Workshop on Inductive Logic Programming},
    doi = {http://dx.doi.org/10.1007/3-540-63494-0\_57},
    pages = {212--228},
    publisher = {Springer},
    title = {{Efficient Theta-Subsumption Based on Graph Algorithms}},
    url = {http://www.herbrich.me/papers/scheffer96.pdf},
    volume = {1314},
    year = {1996}
    }
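
    The clique view described in the abstract can be illustrated with a toy checker (Python; a sketch of the reduction only, without the literal-preselection or the Carraghan and Pardalos pruning of the paper): candidate matches of each literal of C against D are nodes, two nodes are compatible when their substitutions agree, and C θ-subsumes D exactly when there is a pairwise-compatible choice of one node per literal of C, i.e. a clique of size |C|.

    # Clauses are lists of literals; a literal is (predicate, args); strings starting with
    # an uppercase letter are variables. Substitutions map C's variables to terms of D.
    def is_var(t):
        return isinstance(t, str) and t[:1].isupper()

    def match(lit_c, lit_d, subst):
        """Extend subst so that lit_c (from C) maps onto lit_d (from D), or return None."""
        (p, args_c), (q, args_d) = lit_c, lit_d
        if p != q or len(args_c) != len(args_d):
            return None
        s = dict(subst)
        for a, b in zip(args_c, args_d):
            if is_var(a):
                if s.setdefault(a, b) != b:
                    return None
            elif a != b:
                return None
        return s

    def theta_subsumes(C, D):
        """True iff some substitution theta makes C*theta a subset of D."""
        # Nodes of the graph: (index of literal in C, candidate substitution).
        candidates = [[s for d in D if (s := match(c, d, {})) is not None] for c in C]

        def extend(i, subst):                      # pick one compatible node per literal of C
            if i == len(C):
                return True
            for s in candidates[i]:
                merged = dict(subst)
                if all(merged.setdefault(v, t) == t for v, t in s.items()):
                    if extend(i + 1, merged):
                        return True
            return False

        return extend(0, {})

    # p(X, Y) & p(Y, Z) subsumes p(a, b) & p(b, c) & p(b, b), e.g. via X->a, Y->b, Z->c.
    C = [("p", ("X", "Y")), ("p", ("Y", "Z"))]
    D = [("p", ("a", "b")), ("p", ("b", "c")), ("p", ("b", "b"))]
    print(theta_subsumes(C, D))                    # True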