# Knowledge Bases

## Knowledge Corroboration

Current knowledge bases suffer from either low coverage or low accuracy, yet user feedback can greatly improve the quality of automatically extracted knowledge bases. User feedback could help quantify the uncertainty associated with the stored statements and would enable mechanisms for searching, ranking and reasoning at entity-relationship level. Most importantly, a principled model for exploiting user feedback to learn the truth values of statements in the knowledge base would be a major step forward in addressing the issue of knowledge base curation. We present a family of probabilistic graphical models that builds on user feedback and logical inference rules derived from the popular Semantic-Web formalism of RDFS. Through internal inference and belief propagation, these models can learn both, the truth values of the statements in the knowledge base and the reliabilities of the users who give feedback. We demonstrate the viability of our approach in extensive experiments on real-world datasets, with feedback collected from Amazon Mechanical Turk.

• G. Kasneci, J. V. Gael, R. Herbrich, and T. Graepel, „Bayesian Knowledge Corroboration with Logical Rules and User Feedback,“ in Proceedings of European Conference on Machine Learning and Knowledge Discovery in Databases, 2010, p. 1–18.

Current knowledge bases suffer from either low coverage or low accuracy. The underlying hypothesis of this work is that user feedback can greatly improve the quality of automatically extracted knowledge bases. The feedback could help quantify the uncertainty associated with the stored statements and would enable mechanisms for searching, ranking and reasoning at entity-relationship level. Most importantly, a principled model for exploiting user feedback to learn the truth values of statements in the knowledge base would be a major step forward in addressing the issue of knowledge base curation. We present a family of probabilistic graphical models that builds on user feedback and logical inference rules derived from the popular Semantic-Web formalism of RDFS. Through internal inference and belief propagation, these models can learn both, the truth values of the statements in the knowledge base and the reliabilities of the users who give feedback. We demonstrate the viability of our approach in extensive experiments on real-world datasets, with feedback collected from Amazon Mechanical Turk.

@inproceedings{kasneci2010knowledgecorroboration,
abstract = {Current knowledge bases suffer from either low coverage or low accuracy. The underlying hypothesis of this work is that user feedback can greatly improve the quality of automatically extracted knowledge bases. The feedback could help quantify the uncertainty associated with the stored statements and would enable mechanisms for searching, ranking and reasoning at entity-relationship level. Most importantly, a principled model for exploiting user feedback to learn the truth values of statements in the knowledge base would be a major step forward in addressing the issue of knowledge base curation. We present a family of probabilistic graphical models that builds on user feedback and logical inference rules derived from the popular Semantic-Web formalism of RDFS. Through internal inference and belief propagation, these models can learn both, the truth values of the statements in the knowledge base and the reliabilities of the users who give feedback. We demonstrate the viability of our approach in extensive experiments on real-world datasets, with feedback collected from Amazon Mechanical Turk.},
author = {Kasneci, Gjergji and Gael, Jurgen Van and Herbrich, Ralf and Graepel, Thore},
booktitle = {Proceedings of European Conference on Machine Learning and Knowledge Discovery in Databases},
pages = {1--18},
title = {{Bayesian} Knowledge Corroboration with Logical Rules and User Feedback},
url = {https://www.herbrich.me/papers/ecml10.pdf},
year = {2010},
bdsk-url-1 = {https://www.herbrich.me/papers/ecml10.pdf}}

• U. Paquet, J. {Van Gael}, D. Stern, G. Kasneci, R. Herbrich, and T. Graepel, „Vuvuzelas & Active Learning for Online Classification,“ in Proceedings of Computational Social Science and the Wisdom of Crowds Workshop, 2010.

Many online service systems leverage user-generated content from Web 2.0 style platforms such as Wikipedia, Twitter, Facebook, and many more. Often, the value lies in the freshness of this information (e.g. tweets, event-based articles, blog posts, etc.). This freshness poses a challenge for supervised learning models as they frequently have to deal with previously unseen features. In this paper we address the problem of online classification for tweets, namely, how can a classifier be updated in an online manner, so that it can correctly classify the latest hype on Twitter? We propose a two-step strategy to solve this problem. The first step follows an active learning strategy that enables the selection of tweets for which a label would be most useful; the selected tweet is then forwarded to Amazon Mechanical Turk where it is labeled by multiple users. The second step builds on a Bayesian corroboration model that aggregates the noisy labels provided by the users by taking their reliabilities into account.

@inproceedings{paquet2010,
abstract = {Many online service systems leverage user-generated content from Web 2.0 style platforms such as Wikipedia, Twitter, Facebook, and many more. Often, the value lies in the freshness of this information (e.g. tweets, event-based articles, blog posts, etc.). This freshness poses a challenge for supervised learning models as they frequently have to deal with previously unseen features. In this paper we address the problem of online classification for tweets, namely, how can a classifier be updated in an online manner, so that it can correctly classify the latest hype on Twitter? We propose a two-step strategy to solve this problem. The first step follows an active learning strategy that enables the selection of tweets for which a label would be most useful; the selected tweet is then forwarded to Amazon Mechanical Turk where it is labeled by multiple users. The second step builds on a Bayesian corroboration model that aggregates the noisy labels provided by the users by taking their reliabilities into account.},
author = {Paquet, Ulrich and {Van Gael}, Jurgen and Stern, David and Kasneci, Gjergji and Herbrich, Ralf and Graepel, Thore},
booktitle = {Proceedings of Computational Social Science and the Wisdom of Crowds Workshop},
title = {{Vuvuzelas \& Active Learning for Online Classification}},
url = {https://www.herbrich.me/papers/vuvuzela.pdf},
year = {2010},
bdsk-url-1 = {https://www.herbrich.me/papers/vuvuzela.pdf}}

## Features from Knowledge Bases

The prediction accuracy of learning algorithms highly depends on the quality of the selected features; but often, the task of feature construction and selection is tedious and non-scalable. Over the years, however, there have been numerous projects with the goal of constructing general-purpose or domain-specific knowledge bases with entity-relationship-entity triples extracted from various Web sources or collected from user communities, e.g., YAGO, DBpedia, Free- base, UMLS. We introduce an expressive graph-based language for extracting features from such knowledge bases and a theoretical framework for constructing feature vectors from the extracted features. The experimental evaluation on different learning scenarios provides evidence that the features derived through our framework can considerably improve the prediction accuracy, especially when the labeled data at hand is sparse.

• W. Cheng, G. Kasneci, T. Graepel, D. H. Stern, and R. Herbrich, „Automated Feature Generation From Structured Knowledge,“ in Proceedings of the 20th ACM Conference on Information and Knowledge Management, 2011, p. 1395–1404.

The prediction accuracy of any learning algorithm highly depends on the quality of the selected features; but often, the task of feature construction and selection is tedious and non-scalable. In recent years, however, there have been numerous projects with the goal of constructing general-purpose or domain-specific knowledge bases with entity-relationship-entity triples extracted from various Web sources or collected from user communities, e.g., YAGO, DBpedia, Free- base, UMLS, etc. This paper advocates the simple and yet far-reaching idea that the structured knowledge contained in such knowledge bases can be exploited to automatically extract features for general learning tasks. We introduce an expressive graph-based language for extracting features from such knowledge bases and a theoretical framework for constructing feature vectors from the extracted features. Our experimental evaluation on different learning scenarios provides evidence that the features derived through our framework can considerably improve the prediction accuracy, especially when the labeled data at hand is sparse.

@inproceedings{cheng2011featuresfromknowledge,
abstract = {The prediction accuracy of any learning algorithm highly depends on the quality of the selected features; but often, the task of feature construction and selection is tedious and non-scalable. In recent years, however, there have been numerous projects with the goal of constructing general-purpose or domain-specific knowledge bases with entity-relationship-entity triples extracted from various Web sources or collected from user communities, e.g., YAGO, DBpedia, Free- base, UMLS, etc. This paper advocates the simple and yet far-reaching idea that the structured knowledge contained in such knowledge bases can be exploited to automatically extract features for general learning tasks. We introduce an expressive graph-based language for extracting features from such knowledge bases and a theoretical framework for constructing feature vectors from the extracted features. Our experimental evaluation on different learning scenarios provides evidence that the features derived through our framework can considerably improve the prediction accuracy, especially when the labeled data at hand is sparse.},
author = {Cheng, Weiwei and Kasneci, Gjergji and Graepel, Thore and Stern, David H and Herbrich, Ralf},
booktitle = {Proceedings of the 20th ACM Conference on Information and Knowledge Management},
pages = {1395--1404},
title = {Automated Feature Generation From Structured Knowledge},
url = {https://www.herbrich.me/papers/cikmfp0337-cheng.pdf},
year = {2011},
bdsk-url-1 = {https://www.herbrich.me/papers/cikmfp0337-cheng.pdf}}

## Efficient Graph Matching

The ⍬-subsumption problem is crucial to the efficiency of learning systems on structured knowledge bases, finding a match of a sub-graph with variables in node or edge labels. We present discuss two ⍬-subsumption algorithms based on strategies for preselecting suitable matching literals for the variables. We further map the general problem of ⍬-subsumption to a certain problem of finding a clique of fixed size in a graph, and in turn show that a specialization of the pruning strategy of the Carraghan and Pardalos clique algorithm provides a dramatic reduction of the subsumption search space.

• T. Scheffer, R. Herbrich, and F. Wysotzki, „Efficient $\Theta$-Subsumption Based on Graph Algorithms,“ in Lecture Notes in Artifical Intelligence: 6th International Workshop on Inductive Logic Programming,, 1996, p. 212–228.
The $\theta$-subsumption problem is crucial to the efficiency of ILP learning systems. We discuss two $\theta$-subsumption algorithms based on strategies for preselecting suitable matching literals. The class of clauses, for which subsumption becomes polynomial, is a superset of the deterministic clauses. We further map the general problem of $\theta$-subsumption to a certain problem of finding a clique of fixed size in a graph, and in return show that a specialization of the pruning strategy of the Carraghan and Pardalos clique algorithm provides a dramatic reduction of the subsumption search space. We also present empirical results for the mesh design data set.
@inproceedings{scheffer1996subsumption,
abstract = {The $\theta$-subsumption problem is crucial to the efficiency of ILP learning systems. We discuss two $\theta$-subsumption algorithms based on strategies for preselecting suitable matching literals. The class of clauses, for which subsumption becomes polynomial, is a superset of the deterministic clauses. We further map the general problem of $\theta$-subsumption to a certain problem of finding a clique of fixed size in a graph, and in return show that a specialization of the pruning strategy of the Carraghan and Pardalos clique algorithm provides a dramatic reduction of the subsumption search space. We also present empirical results for the mesh design data set.},
title = {Efficient $\Theta$-Subsumption Based on Graph Algorithms},
bdsk-url-1 = {https://www.herbrich.me/papers/scheffer96.pdf}}