Learning Kernel Classifiers

Overview

Learning Kernel Classifiers – Theory and Algorithms
Ralf Herbrich
The MIT Press 2002
ISBN: 0-262-08306-X

This book provides a comprehensive overview of both the theory and algorithms of kernel classifiers. It begins by describing the major algorithmic advances: kernel perceptron learning, kernel Fisher discriminants, support vector machines, relevance vector machines, Gaussian processes, and Bayes point machines. Then follows a detailed introduction to learning theory, including VC and PAC-Bayesian theory, data dependent structural risk minimization, and compression bounds. Throughout, the book emphasizes the interaction between theory and algorithms: how learning algorithms work and why. The book includes many examples, complete pseudo code of the algorithms presented, and an extensive source code library.

Software

All the software for the book – both figures and example implementations – have been written in R and are available from my GitHub repository at https://github.com/rherbrich74/lkc-book.

Errata

  • p. xvii, line -3: “Since the book uses a very rigorous notation systems, …” should read “Since the book uses a very rigorous notation system, …” (thanks to Neil Lawrence).
  • p. 8, line 5: “… reader should refer Sutton and Barto (1998) …” should read “… reader is referred to Sutton and Barto (1998) …” (thanks to Thore Graepel).
  • p. 33, line 15: “x Є K” should read “x Є X” (thanks to Arthur Gretton).
  • p. 33, line -3: “… where U=(u1;…;ur) is …” should read “… where U=(u1,…,un)=(v1;…;vr) is …”. Accordingly, in page 34, line 2 and 4 each is replaced by a v. Furthermore, the subscript in line on page 34 is omitted (thanks to Malte Kuss).
  • p. 34, line 5: “… and a mapping into it …” should read “… and a mapping into it …” (thanks to Petra Philips).
  • p. 34, line 9: “the nth mapped object xn is” should read “the point ∑Ui,nf(xi)=L½Uun is”. Accordingly, the following line changes to “||L½Uun||2=unULUun=enLen=ln<0″ (thanks to Malte Kuss).
  • p. 43, equation (2.30): The term l|v|-j should read l|v|-t (thanks to Vikas Sindhwani).
  • p. 77, Example 3.5: “PX=Binomial(n,p)” should read “PX|P=p=Binomial(n,p)” (thanks to Jaz Kandola).
  • p. 84, line -5: “C(x,x) = áx,xñ + σt2Ix¹x=C(x,x) = k(x,x) + σt2Ix¹x” should read “C(x,x) = áx,xñ + σt2Ix=x=C(x,x) = k(x,x) + σt2Ix=x” (thanks to Arthur Gretton).
  • p. 86, line 17: “… on the found local maximum.” should read “… on the local maximum found.” (thanks to Thore Graepel).
  • p. 86, line 20: “…is of the ability of …” should read “…is the ability of …” (thanks to Thore Graepel).
  • p. 87, Figure 3.2: “… of the 6 observations …” should read “… of the 7 observations …”. Furthermore, “This local maxima …” should read “This local maximum …” (thanks to Thore Graepel).
  • p. 87, line -12: “σ Є R+” should read “σ Є (R+)N“.
  • p. 92, Figure 3.5: “… this liklihood is not normalizable.” should read “… this likelihood is not normalizable.” (thanks to Neil Lawrence).
  • p. 99, line -7: “… out be the single weight vector …” should read “… out by the single weight vector …” (thanks to Matthias Heiler).
  • p. 101, line 5: “… we first run learning algorithm to find …” should read “… we first run a learning algorithm to find …” (thanks to Matthias Heiler).
  • p. 110, line -11: Closing parenthesis at MacKay (1998) missing.
  • p. 110, line -6: Closing parenthesis at Barber and Williams (1997) missing.
  • p. 122, equation (4.7): The subscript should read “Remp[h,z] =0″ instead of “Remp[h] =0″ (thanks to Petra Philips).
  • p. 123, line 13: “… real-valued loss functions conceptually similar …” should read “… real-valued loss functions is conceptually similar …” (thanks to Petra Philips).
  • p. 123, line -8: “… for all δ Є (0,1], and all training samples sizes m, with probability at least 1-δ over the random draw of the training sample Є Zwe have …” should read “… for the zero-one loss l0-1 given by equation (2.10) and all ε > 0 we have …” as in Theorem 4.7 (thanks to Simon Hill).
  • p. 124, line 10: “… (see equation (4.7)) …” should read “… (see equation (4.8)) …” (thanks to Jürgen Schweiger).
  • p. 125, line 5: “It we denote the maximum number …” should read “If we denote the maximum number …” (thanks to Petra Philips).
  • p. 126, enumeration 1: “If the function NH fulfills NH(m)=2m …” should read “If the function NH fulfills NH(2m)=22m …” (thanks to Petra Philips).
  • p. 127, line 2: “The best constants that can be achieved are 2 as a coefficient of and 1 in the exponent of the exponential term, respectively.” should read “The best constants that can be achieved for the coefficients of the exponent in the exponential term are 2 and 1, respectively.” (thanks to Jaz Kandola).
  • p. 131, line 3: “Clearly, all functions in h are monotonically …” should read “Clearly, all functions in H are monotonically …” (thanks to Petra Philips).
  • p. 135, line -1: In the statement, “R[h]” should read “R[h] – Remp[h,z] ” (thanks to Petra Philips).
  • p. 137, line 12: The function ω is typed ω:N x R x [0,1]  rather than ω:R x [0,1] because it depends on the training sample size m. Hence, in the next line, ω(L((Z1,…,Zm),h),δ) should be replaced by ω(m,L((Z1,…,Zm),h),δ).This change also appears in line 18, -6 and -3 on page 137, line 2 on page 138, line 17 on page 139 and line 7 on page 140 (thanks to Petra Philips!).
  • p. 140, line 5: “… in Appendix C.4 that L(z,h) = -νeff(z) …” should read “… in Appendix C.4 that L(z,h) = -vH(z) …” (thanks to Petra Philips).
  • p. 175, Figure 5.1: “… with (solid line) and without (dashed line) …” should read “… with (dashed line) and without (solid line) …” (thanks to Dongwei Cao).
  • p. 182, Definition 5.20: The update function maps from Y x X x H -> rather than Y x X x  Y -> as stated in the text. This changes the definition of an online learning algorithm. As a consequence, the sentence preceding the displayed equation (“… and the prediction of the current hypothesis hj Є H …”) changes to “… and the current hypothesis hj Є H …” (thanks to Thore Graepel!).
  • p. 183, first equation: Accordingly, “U(x,y,h(x))=h” changes to “U(x,y,h)=h” (thanks to Thore Graepel).
  • p. 183, line 15: Similar to p. 177, line -5, “(compression function Ci)” should read “(compression function C|i|)”.
  • p. 184, line-8: The fourth point should read “If all training examples are correctly classified, it outputs C and classifies according to (5.18).“. Furthermore, the last two sentences of this example are wrong and should be deleted (thanks to Thore Graepel).
  • p. 187, equation (5.19): This is a definition (similar to (2.14) at  page 29) and not an equation.
  • p. 194, line 19: “preceeding Shawe-Taylor and Williamson (1997, p. 4)” should read “preceding Shawe-Taylor and Williamson (1997)”.
  • p. 211, line -8: “… is a Gaussian measures, …” should read “… are Gaussian measures, …” (thanks to Petra Philips).
  • p. 217, line 5: In the definition of ℓnthe second case should read maxi=1,…,n |xi| < ¥ rather than maxi=1,…,n |xi| (thanks to Arthur Gretton).
  • p. 222, line 6: “… is the smallest number ε > 0 such that …” should read “… is the smallest number ε ≥ 0 such that …” (thanks to Petra Philips).
  • p. 240, line -5: “a≠0″ should read “a≥0″ because otherwise exponentiation of 1+a/x with invalidates the inequality (thanks to Peter Bollmann-Sdorra).
  • p. 257, line 6: The braces must not include the summation over t (thanks to Vikas Sindhwani).
  • p. 282, line 3: In the statement on the lhs. P(A) should read PZ(A) (thanks to Petra Philips).
  • p. 283, Lemma C.2: There is a major flaw in the proof of this lemma. At the bottom of this page, it is argued that Theorem A.116 proves that the probability that a binomially distributed variable with mean of at least ε exceeds a value of mε/2 is at least 1/2. This is wrong. In order to prove this statement we need a different theorem which is given in the PDF version of the errata (thanks to Ulrich Kockelkorn, Mingrui Wu and Vu Ha for pointing this out mistake and helping with the additional proof). Note the Mingrui Wu has also provided an alternative proof which makes the least assumptions.
  • p. 283, line 13 “… ε(z) Ù νz(A(z)) = 0 if such a sets exists …” should read “… ε Ù νz(A(z)) = 0 if such a sets exists …” (thanks to Petra Philips).
  • p. 285, line -8: “… given in equation (22) and …” should read “… given in equation (2.11) and …” (thanks to Petra Philips).
  • p. 302, Section C.8: The section heading should read “A PAC-Bayesian Margin Bound” rather than “A PAC-Bayesian Marin Bound” (thanks to Thore Graepel).
  • p. 303, line 4: In the numerator, the integration is up to p rather than 2p (thanks to John Shawe-Taylor).
  • p. 340: The entry Bartlett and Shawe-Taylor (1998) has the wrong year and is identical to Bartlett and Shawe-Taylor (1999). 
  • p. 340: In the entry Bennett (1998) there is one extra “19” in the paper title (thanks to Jaz Kandola).
  • p. 347: In the entry Lauritzen (1981) “t. n. thiele” should read “T. N. Thiele” (thanks to Jaz Kandola).
  • p. 349: In the entry Neal (1997b) “Technical Report” is spelt twice (thanks to Jaz Kandola).
  • p. 350: In the entry Robert (1994) “Ney York” should read “New York” (thanks to Jaz Kandola).
  • p. 354: The entry Watkins (1998) contains (almost) the same content as Watkins (2000) and will be eliminated in future editions.
  • p. 355: In the entry Williams and Seeger (2001) “nystrom” should read “Nyström” (thanks to Jaz Kandola).