Supervised Machine Learning: a theoretical (very) minimum

Table of Contents

  1. Optimization

Optimization

Optimization of a cost function is at the core of learning for most supervised machine learning algorithms: it is, in fact, how we infer the model parameters that fit our training data. In this section we give a few definitions and introduce the gradient descent algorithm that we use to fit models in our examples.

Definition: Optimization problem
A problem of the following form \begin{equation} \arg\min_{\theta} J(\theta), \, \text{subject to} \hspace{10px} c_i(\theta) \leq b_i, \, i=1,\cdots,M \label{eq:opt-problem} \end{equation} where
  • $\theta = (\theta_1,\ldots,\theta_N)$ is the optimization variable;
  • $J : \mathbb{R}^N \rightarrow \mathbb{R}$ is the objective function;
  • $c_i : \mathbb{R}^N \rightarrow \mathbb{R}$, $i=1,\ldots,M$, are the constraint functions.
is called an optimization problem.
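
For concreteness, a toy instance of $\eqref{eq:opt-problem}$ with $N=2$ and a single constraint ($M=1$) could be (the specific objective and constraint are of course just an illustration) \begin{equation*} \arg\min_{\theta} \, (\theta_1 - 1)^2 + (\theta_2 + 2)^2, \, \text{subject to} \hspace{10px} \theta_1 + \theta_2 \leq 1 \end{equation*} which asks for the feasible point closest to $(1, -2)$.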


A vector $\theta^{*}$ is said to be optimal if it is a solution for $\eqref{eq:opt-problem}$.

Depending on the properties of the objective function and of the constraints (if any), different classes of problems can be identified, most notably:

  • if $J$ and ${c_i}$ are all linear functions then problem $\eqref{eq:opt-problem}$ is said to be a linear program;
  • if $J$ or any of the ${c_i}$ is a non-linear function then problem $\eqref{eq:opt-problem}$ is a nonlinear program;
  • if $J$ and ${c_i}$ are convex1 the problem is said to be convex.

A solution method for an optimization problem is an algorithm that computes a solution $\theta^*$ of $\eqref{eq:opt-problem}$. Some problems can be solved in closed form, but for most problems this is not the case. Nonetheless, we briefly discuss here a simple problem that does admit a closed-form solution, the least squares problem, which we also refer to as linear regression (discussed here!).

A simple example: the least squares problem

In a least squares problem we try to minimize the Euclidean distance between a target $K$-dimensional vector $y$ and a linear function $X\theta$ of the parameter vector $\theta$ \begin{equation} \arg\min_{\theta} J(\theta) = ||X \theta - y ||_2^2 = \sum_{i=1}^K (\langle x^{(i)},\,\theta\rangle - y^{(i)})^2 \label{eq:least-squares} \end{equation} where $X\in \mathbb{R}^{K \times N}$ is a fixed matrix with rows $x^{(i)}$. Since $J(\theta)$ is a quadratic function of $\theta$, one can differentiate and set the gradient to zero, $\nabla_{\theta}J(\theta) = 2X^T(X\theta - y) = 0$, which yields the normal equations and the problem solution (assuming $X^TX$ is invertible) \begin{equation} \theta^* = (X^TX)^{-1}X^Ty \end{equation}
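
As a sanity check, the normal equations can be solved directly in code. The sketch below is a minimal illustration: the design matrix $X$, the "true" parameters and the noise level are synthetic choices made up for the example, and the result is compared with NumPy's built-in least squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 100, 3                                   # K observations, N parameters
X = rng.normal(size=(K, N))                     # fixed matrix with rows x^(i)
theta_true = np.array([2.0, -1.0, 0.5])         # illustrative "true" parameters
y = X @ theta_true + 0.1 * rng.normal(size=K)   # noisy targets

# Normal equations: solve (X^T X) theta = X^T y instead of forming the inverse.
theta_star = np.linalg.solve(X.T @ X, X.T @ y)

# Sanity check against NumPy's built-in least squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_star, np.allclose(theta_star, theta_lstsq))
```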

In general, even for convex problems, a closed-form solution is not available, so we need different methods to solve the problem at hand: a common class of optimization algorithms is that of descent methods.

Descent methods

Consider a generic convex unconstrained problem \begin{equation} \arg\min_{\theta} J(\theta), \quad J : \mathbb{R}^N \rightarrow \mathbb{R} \label{eq:unc-conv} \end{equation} where $J \in C^2(\Omega)$ is convex with domain $\Omega\subseteq\mathbb{R}^N$. If we assume that $\theta^{*}$ is a solution of $\eqref{eq:unc-conv}$, since $J$ is $C^2$, then we know that $\theta^{*}$ must be a solution of \begin{equation} \nabla_{\theta}J(\theta^*)=0 \label{eq:unc-conv-diff} \end{equation} So, in the convex case, a solution of $\eqref{eq:unc-conv-diff}$ is a solution of $\eqref{eq:unc-conv}$: descent methods aim at defining a sequence of points in $\Omega$ such that the objective function $J$ evaluated at these points converges to the minimum value $J^{*} \equiv \min_{\theta} J(\theta)$, in order to reach condition $\eqref{eq:unc-conv-diff}$.

Definition: Descent method
Descent methods are defined as a collection of iterative algorithms that compute sequences of values \begin{equation*} \theta^{[0]}, \theta^{[1]}, \ldots, \theta^{[M]}, \ldots \in \Omega \end{equation*} such that \begin{equation*} \lim_{M\rightarrow \infty} J(\theta^{[M]}) = J^{*} \end{equation*}


A generic sequence starting at a point $\theta^{[0]}$ can be written in recursive form as $$\theta^{[K+1]} = \theta^{[K]} + \alpha^{[K]} \delta\theta^{[K]}, \quad \alpha^{[K]} > 0 $$ where $\delta\theta^{[K]}$ represents a step, a direction in $\Omega$ connecting $\theta^{[K]}$ and $\theta^{[K+1]}$, and $\alpha^{[K]}$ represents the size or magnitude of the step itself. Of course, not every step will make the algorithm descend and thus converge to the minimum value: for the algorithm to descend, the following condition must hold at each iteration:

\begin{equation} J(\theta^{[K+1]}) < J(\theta^{[K]}) \label{eq:desc-cond} \end{equation}
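
In code, the generic iteration can be sketched as a simple loop. In the sketch below, `step_direction` and `step_size` are placeholders that a concrete descent method (such as gradient descent, introduced next) has to supply, and the stopping rule based on `tol` is just one illustrative choice.

```python
import numpy as np

def descend(J, step_direction, step_size, theta0, max_iter=1000, tol=1e-8):
    """Generic descent iteration theta^[K+1] = theta^[K] + alpha^[K] * dtheta^[K]."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        dtheta = step_direction(theta)       # a direction in Omega
        alpha = step_size(theta, dtheta)     # a positive step size
        theta_next = theta + alpha * dtheta
        # Stop once the objective no longer decreases appreciably.
        if J(theta) - J(theta_next) < tol:
            return theta_next
        theta = theta_next
    return theta
```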

For a convex function, the first-order condition $J(\widetilde{\theta}) \geq J(\theta^{[K]}) + \langle \nabla_{\theta} J(\theta^{[K]}),\, \widetilde{\theta} - \theta^{[K]}\rangle$ immediately implies that, for $\widetilde{\theta}\in\Omega$,

\begin{equation} \langle \nabla_{\theta} J(\theta^{[K]}),\, \widetilde{\theta} - \theta^{[K]}\rangle \geq 0 \quad \Longrightarrow \quad J(\widetilde{\theta}) \geq J(\theta^{[K]}) \end{equation}

so the descent condition $\eqref{eq:desc-cond}$ can only be satisfied if the step is such that \begin{equation} \langle \nabla_{\theta} J(\theta^{[K]}),\, \delta\theta^{[K]} \rangle \leq 0 \label{eq:step-cond} \end{equation}

Definition: Gradient descent
Gradient descent is a descent method defined by \begin{equation} \delta\theta^{[K]} \equiv -\nabla_{\theta} J(\theta^{[K]}) \end{equation} This choice of step satisfies the descent condition $\eqref{eq:step-cond}$, since $\langle \nabla_{\theta} J(\theta^{[K]}),\, -\nabla_{\theta} J(\theta^{[K]}) \rangle = -||\nabla_{\theta} J(\theta^{[K]})||_2^2 \leq 0$.
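
As an illustration, the sketch below applies gradient descent to the least squares objective $\eqref{eq:least-squares}$, whose gradient is $2X^T(X\theta - y)$, and compares the result with the normal-equations solution. The synthetic data, the fixed step size $\alpha$ and the number of iterations are arbitrary choices made for the example; in practice the step size is tuned or selected by a line search.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 100, 3
X = rng.normal(size=(K, N))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=K)

def grad_J(theta):
    # Gradient of J(theta) = ||X theta - y||_2^2.
    return 2.0 * X.T @ (X @ theta - y)

theta = np.zeros(N)
alpha = 1e-3                      # fixed step size, assumed small enough
for _ in range(5000):
    theta = theta - alpha * grad_J(theta)    # delta theta = -grad J(theta)

# Compare with the closed-form solution of the normal equations.
theta_star = np.linalg.solve(X.T @ X, X.T @ y)
print(theta, np.allclose(theta, theta_star, atol=1e-4))
```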


Convergence of gradient descent for convex problems (TODO)

Can we use gradient descent for non-convex problems? (TODO)

Stochastic gradient descent (SGD) and mini-batching (TODO)

References

Optimization

Notes

  1. A function $f$ is said to be convex if \(f(\alpha \theta_1 + \beta \theta_2) \leq \alpha f(\theta_1) + \beta f(\theta_2)\) for all $\theta_1, \theta_2$ and $\alpha, \beta >0$, s.t. $\alpha + \beta = 1$.