Learning Kernel Classifiers – Ralf Herbrich’s Homepage

Overview

Learning Kernel Classifiers – Theory and Algorithms
Ralf Herbrich
The MIT Press 2002
ISBN: 0-262-08306-X

This book provides a comprehensive overview of both the theory and algorithms of kernel classifiers. It begins by describing the major algorithmic advances: kernel perceptron learning, kernel Fisher discriminants, support vector machines, relevance vector machines, Gaussian processes, and Bayes point machines. Then follows a detailed introduction to learning theory, including VC and PAC-Bayesian theory, data dependent structural risk minimization, and compression bounds. Throughout, the book emphasizes the interaction between theory and algorithms: how learning algorithms work and why. The book includes many examples, complete pseudo code of the algorithms presented, and an extensive source code library.

Software

All the software for the book – both figures and example implementations – have been written in R and are available from my GitHub repository at https://github.com/rherbrich74/lkc-book.

Errata

p. xvii, line -3: “Since the book uses a very rigorous notation systems, …” should read “Since the book uses a very rigorous notation system, …” (thanks to Neil Lawrence).
p. 8, line 5: “… reader should refer Sutton and Barto (1998) …” should read “… reader is referred to Sutton and Barto (1998) …” (thanks to Thore Graepel).
p. 33, line 15: “x Є K” should read “x Є X” (thanks to Arthur Gretton).
p. 33, line -3: “… where U=(u‘₁;…;u‘_r) is …” should read “… where U=(u₁,…,u_n)=(v‘₁;…;v‘_r) is …”. Accordingly, in page 34, line 2 and 4 each u is replaced by a v. Furthermore, the subscript K in line on page 34 is omitted (thanks to Malte Kuss).
p. 34, line 5: “… and a mapping L into it …” should read “… and a mapping f into it …” (thanks to Petra Philips).
p. 34, line 9: “the nth mapped object x_n is” should read “the point ∑_iU_i,nf(x_i)=L^½U‘u_nis”. Accordingly, the following line changes to “||L^½U‘u_n||²=u_n‘ULU‘u_n=e_n‘Le_n=l_n<0″ (thanks to Malte Kuss).
p. 43, equation (2.30): The term l^|v|-j should read l^|v|-t (thanks to Vikas Sindhwani).
p. 77, Example 3.5: “P_X=Binomial(n,p)” should read “P_X|P=p=Binomial(n,p)” (thanks to Jaz Kandola).
p. 84, line -5: “C(x,x) = áx,xñ + σ_t²I_x¹x=C(x,x) = k(x,x) + σ_t²I_x¹x” should read “C(x,x) = áx,xñ + σ_t²I_x=x=C(x,x) = k(x,x) + σ_t²I_x=x” (thanks to Arthur Gretton).
p. 86, line 17: “… on the found local maximum.” should read “… on the local maximum found.” (thanks to Thore Graepel).
p. 86, line 20: “…is of the ability of …” should read “…is the ability of …” (thanks to Thore Graepel).
p. 87, Figure 3.2: “… of the 6 observations …” should read “… of the 7 observations …”. Furthermore, “This local maxima …” should read “This local maximum …” (thanks to Thore Graepel).
p. 87, line -12: “σ Є R⁺” should read “σ Є (R⁺)^N“.
p. 92, Figure 3.5: “… this liklihood is not normalizable.” should read “… this likelihood is not normalizable.” (thanks to Neil Lawrence).
p. 99, line -7: “… out be the single weight vector …” should read “… out by the single weight vector …” (thanks to Matthias Heiler).
p. 101, line 5: “… we first run learning algorithm to find …” should read “… we first run a learning algorithm to find …” (thanks to Matthias Heiler).
p. 110, line -11: Closing parenthesis at MacKay (1998) missing.
p. 110, line -6: Closing parenthesis at Barber and Williams (1997) missing.
p. 122, equation (4.7): The subscript should read “R_emp[h,z] =0″ instead of “R_emp[h] =0″ (thanks to Petra Philips).
p. 123, line 13: “… real-valued loss functions conceptually similar …” should read “… real-valued loss functions is conceptually similar …” (thanks to Petra Philips).
p. 123, line -8: “… for all δ Є (0,1], and all training samples sizes m, with probability at least 1-δ over the random draw of the training sample z Є Z^mwe have …” should read “… for the zero-one loss l_0-1given by equation (2.10) and all ε > 0 we have …” as in Theorem 4.7 (thanks to Simon Hill).
p. 124, line 10: “… (see equation (4.7)) …” should read “… (see equation (4.8)) …” (thanks to Jürgen Schweiger).
p. 125, line 5: “It we denote the maximum number …” should read “If we denote the maximum number …” (thanks to Petra Philips).
p. 126, enumeration 1: “If the function N_H fulfills N_H(m)=2^m …” should read “If the function N_H fulfills N_H(2m)=2^2m …” (thanks to Petra Philips).
p. 127, line 2: “The best constants that can be achieved are 2 as a coefficient of and 1 in the exponent of the exponential term, respectively.” should read “The best constants that can be achieved for the coefficients of the exponent in the exponential term are 2 and 1, respectively.” (thanks to Jaz Kandola).
p. 131, line 3: “Clearly, all functions in h are monotonically …” should read “Clearly, all functions in H are monotonically …” (thanks to Petra Philips).
p. 135, line -1: In the statement, “R[h]” should read “R[h] – R_emp[h,z] ” (thanks to Petra Philips).
p. 137, line 12: The function ω is typed ω:N x R x [0,1] rather than ω:R x [0,1] because it depends on the training sample size m. Hence, in the next line, ω(L((Z₁,…,Z_m),h),δ) should be replaced by ω(m,L((Z₁,…,Z_m),h),δ).This change also appears in line 18, -6 and -3 on page 137, line 2 on page 138, line 17 on page 139 and line 7 on page 140 (thanks to Petra Philips!).
p. 140, line 5: “… in Appendix C.4 that L(z,h) = -ν_eff(z) …” should read “… in Appendix C.4 that L(z,h) = -v_H(z) …” (thanks to Petra Philips).
p. 175, Figure 5.1: “… with (solid line) and without (dashed line) …” should read “… with (dashed line) and without (solid line) …” (thanks to Dongwei Cao).
p. 182, Definition 5.20: The update function U maps from Y x X x H -> H rather than Y x X x Y -> H as stated in the text. This changes the definition of an online learning algorithm. As a consequence, the sentence preceding the displayed equation (“… and the prediction of the current hypothesis h_j Є H …”) changes to “… and the current hypothesis h_j Є H …” (thanks to Thore Graepel!).
p. 183, first equation: Accordingly, “U(x,y,h(x))=h” changes to “U(x,y,h)=h” (thanks to Thore Graepel).
p. 183, line 15: Similar to p. 177, line -5, “(compression function C_i)” should read “(compression function C_|i|)”.
p. 184, line-8: The fourth point should read “If all training examples are correctly classified, it outputs C and classifies according to (5.18).“. Furthermore, the last two sentences of this example are wrong and should be deleted (thanks to Thore Graepel).
p. 187, equation (5.19): This is a definition (similar to (2.14) at page 29) and not an equation.
p. 194, line 19: “preceeding Shawe-Taylor and Williamson (1997, p. 4)” should read “preceding Shawe-Taylor and Williamson (1997)”.
p. 211, line -8: “… is a Gaussian measures, …” should read “… are Gaussian measures, …” (thanks to Petra Philips).
p. 217, line 5: In the definition of ℓⁿ_pthe second case should read max_i=1,…,n |x_i| < ¥ rather than max_i=1,…,n |x_i| (thanks to Arthur Gretton).
p. 222, line 6: “… is the smallest number ε > 0 such that …” should read “… is the smallest number ε ≥ 0 such that …” (thanks to Petra Philips).
p. 240, line -5: “a≠0″ should read “a≥0″ because otherwise exponentiation of 1+a/x with x invalidates the inequality (thanks to Peter Bollmann-Sdorra).
p. 257, line 6: The braces must not include the summation over t (thanks to Vikas Sindhwani).
p. 282, line 3: In the statement on the lhs. P(A) should read P_Z(A) (thanks to Petra Philips).
p. 283, Lemma C.2: There is a major flaw in the proof of this lemma. At the bottom of this page, it is argued that Theorem A.116 proves that the probability that a binomially distributed variable with mean of at least ε exceeds a value of mε/2 is at least 1/2. This is wrong. In order to prove this statement we need a different theorem which is given in the PDF version of the errata (thanks to Ulrich Kockelkorn, Mingrui Wu and Vu Ha for pointing this out mistake and helping with the additional proof). Note the Mingrui Wu has also provided an alternative proof which makes the least assumptions.
p. 283, line 13 “… ε(z) Ù ν_z(A(z)) = 0 if such a sets exists …” should read “… ε Ù ν_z(A(z)) = 0 if such a sets exists …” (thanks to Petra Philips).
p. 285, line -8: “… given in equation (22) and …” should read “… given in equation (2.11) and …” (thanks to Petra Philips).
p. 302, Section C.8: The section heading should read “A PAC-Bayesian Margin Bound” rather than “A PAC-Bayesian Marin Bound” (thanks to Thore Graepel).
p. 303, line 4: In the numerator, the integration is up to p rather than 2p (thanks to John Shawe-Taylor).
p. 340: The entry Bartlett and Shawe-Taylor (1998) has the wrong year and is identical to Bartlett and Shawe-Taylor (1999).
p. 340: In the entry Bennett (1998) there is one extra “19” in the paper title (thanks to Jaz Kandola).
p. 347: In the entry Lauritzen (1981) “t. n. thiele” should read “T. N. Thiele” (thanks to Jaz Kandola).
p. 349: In the entry Neal (1997b) “Technical Report” is spelt twice (thanks to Jaz Kandola).
p. 350: In the entry Robert (1994) “Ney York” should read “New York” (thanks to Jaz Kandola).
p. 354: The entry Watkins (1998) contains (almost) the same content as Watkins (2000) and will be eliminated in future editions.
p. 355: In the entry Williams and Seeger (2001) “nystrom” should read “Nyström” (thanks to Jaz Kandola).