With linear methods, we may need a whole lot of features to get a hypothesis space that’s expressive enough to fit our data — there can be orders of magnitude more features than training examples. While regularization can control overfitting, having a huge number of features can make things computationally very difficult, if handled naively. For objective functions of a particular general form, which includes ridge regression and SVMs but not lasso regression, we can “kernelize”, which can allow significant speedups in certain situations. In fact, with the “kernel trick”, we can even use an infinite-dimensional feature space at a computational cost that depends primarily on the training set size.
In more detail, it turns out that even when the optimal parameter vector we’re searching for lives in a very high-dimensional vector space (dimension being the number of features), a basic linear algebra argument shows that for certain objective functions, the optimal parameter vector lives in a subspace spanned by the training input vectors. Thus, when we have more features than training points, we may be better off restricting our search to the lower-dimensional subspace spanned by training inputs. We can do this by an easy reparameterization of the objective function. This result is referred to as the “representer theorem”, and its proof can be given on one slide.
After reparameterization, we’ll find that the objective function depends on the data only through the Gram matrix, or “kernel matrix”, which contains the dot products between all pairs of training feature vectors. This is where things get interesting a second time: Suppose f is our featurization function. Sometimes the dot product between two feature vectors f(x) and f(x’) can be computed much more efficiently than multiplying together corresponding features and summing. In such a situation, we write the dot products in terms of the “kernel function”: k(x,x’)=〈f(x),f(x’)〉, which we hope to compute much more quickly than O(d), where d is the dimension of the feature space. Using this “kernel trick”, together with the reparameterization described above, is the essence of a “kernel method”, and it allows one to use huge (even infinite-dimensional) feature spaces with a computational burden that depends primarily on the size of your training set. In practice, it’s useful for small and medium-sized datasets for which computing the kernel matrix is tractable. Scaling kernel methods to large data sets is still an active area of research.