Contents
1 Introduction: Machine Learning Models 4
1.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
I Supervised Learning 6
2 Introduction 7
2.1 Overview of Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 Linear Regression 10
3.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Basis functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 Direct approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.4 Discriminative approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.5 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.6 Bayesian Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.7 Notable indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.8 Modeling Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4 Linear Classification 20
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2 Discriminant function (Direct approach) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 Probabilistic Discriminative Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.4 Probabilistic Generative Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.5 Non-parametric methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.6 Performance measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5 Bias-Variance trade-o↵ and model selection 30
5.1 No Free Lunch Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.2 Bias-Variance Tradeo↵ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.3 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.4 Model Ensembles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6 PAC-Learning and VC-Dimension 36
6.1 PAC-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6.2 VC-Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
7 Kernel Methods 40
7.1 Dual representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
7.2 Constructing kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
8 Support Vector Machines 43
8.1 Learning SVMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.2 Handling noisy data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
8.3 Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2
CONTENTS 3
II Reinforcement Learning 47
9 Markov Decision Processes 51
9.1 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
9.2 Goals, rewards and return . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
9.3 Policies and value functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
10 Solving Markov Decision Processes 57
10.1 Policy Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
10.2 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
10.3 Infinite Horizon Linear Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
11 RL in finite MDPs 61
11.1 Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
11.2 Model-free Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
11.3 Temporal Di↵erence Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
11.4 n-step TD Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
11.5 Model-Free Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
12 Multi-Armed Bandit 70
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
12.2 Multi-Armed Bandit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
12.3 Stochastic MAB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
12.4 Adversarial MAB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Chapter 1
Introduction: Machine Learning
Models
1.1 Supervised Learning
• Goal Estimating the unknown model that maps known inputs to known outputs
D {hx, )
Training set: = ti} t = f (x)
• Problems
Classification
Regression
Probability estimation
• Techniques
Artificial Neural Networks
Support Vector Machines
Decision trees
...
1.2 Unsupervised Learning
• Goal Learning a more efficient representation of a set on unknown inputs
D {x} )?
Training set: = = f (x)
• Problems
Compression
Clustering
• Techniques
K-means
Self-organizing maps
Principal Component Analysis
... 4
1.3. REINFORCEMENT LEARNING 5
1.3 Reinforcement Learning
• Goal Learning the optimal policy
0
D {hx, ) ⇤ {Q ⇤
Training set: = u, x , ri} ⇡ (x) = argmax (x, u)} where Q*(x,u) must be
u
estimated
• Problems
Markov Decision Process (MDP)
Partially Observable MDP (POMDP)
Stochastic Games (SG)
• Techniques
Q-learning
SARSA
Fitted Q-iteration
... Part I
Supervised Learning
6
Chapter 2
Introduction
Supervised learning is the largest, most mature and most widely used sub-field of machine learning.
D {hx,
Given training data set including desired outputs = ti} from some unknown function f.
Find a good approximation of f that generalizes well on test data.
Input variables x are also called features or attributes.
Output variables t are also called targets or labels.
• If t is discrete: classification
• if t is continuous: regression
• if t is the probability of x : probability estimation, which is di↵erent from regression because of the
constraints imposed by probability.
Supervised learning is based on an ill-posed problem, because given the training set we are not interested
on the performance on the training set, but we have to optimize and objective function which is not
known.
2.1 Overview of Supervised Learning
D
We want to approximate f given the dataset
The steps are: L,
1. Define a loss function which is a metric of how much a f is far from the target
The loss function cannot be build over the true function f but over the sample. It is a noisy measure
7
8 CHAPTER 2. INTRODUCTION
H,
2. Choose some hypothesis space which is the subspace of all the possible functions that we are
interested in. h
3. Optimize to find and approximate model that minimizes the loss with respect to the target
function.
What happens if we enlarge the hypothesis space?
The loss of the best hypothesis will not increase and probably decrease.
So the hypothesis space can be enlarged until the true model is inside of it, obtaining a loss value equal
to zero.
2.1. OVERVIEW OF SUPERVISED LEARNING 9
If we don’t know f but we only have a few samples taken from it, the loss function is based not based
on the function anymore, but on the samples.
So, the optimal function h based on the new loss function may be far from the true model and be farther
2
as long as we enlarge the hypothesis space.
If there are few samples, the loss function may be very noisy, and the more we enlarge the hypothesis
space, the more we are sensitive to the noise.
With infinite samples we can enlarge the hypothesis space without problems, because it’s like knowing f
(no uncertainty).
Chapter 3
Linear Regression
3.1 Linear Regression
The goal of regression is to learn a mapping from D-dimensional vector x of input variables to a continuous
output t. The simplest form of linear regression models are linear functions of the input variables.
This imposes significant limitations to the problem
D 1
X T
y(b, w) = w + w x = w x
0 j j The simplest form of linear regression models are also linear
j=1 functions of the input variables, however, we can obtain a
much more useful class of functions by taking LINEAR
where x = (1, x , ..., x ) and w is called o↵set or bias parameter.
1 D 1 0 COMBINATIONS of fixed sets of non linear functions of the
input variables, also known as BASIS FUNCTIONS
3.1.1 The loss function
The loss (error) function is a measure to quantify the performance on a task L(t, y(x)).
The average, or expected, loss is given by
Z Z
E[L] = L(t, y(x))p(x, t)dxdt
A common choice is the squared loss function
Z Z 2
E[L] = (t y(x)) p(x, t)dxdt
p(x,t) is the probability distribution under which we want to op-
timize the model. It’s the distribution over the future (test) samples
which is usually equal to the distribution of the train samples. So, if
p(x,t) is known, also the distribution of the true model is known.
E[L].
Our goal is to choose y(x) so as to minimize Hence, the optimal
solution is the conditional average
Z E[t|x]
y(x) = tp(t|x)dt =
The squared loss is not the only possible choice of loss function for
regression. An example is the simple generalization of the squared
loss, called Minkowski loss Figure 3.1: y(x) is given by the
Z Z mean of the conditional distribu-
q
E[L] = (t y(x)) p(x, t)dxdt tion p(t—x)
E[L]
The minimum of is given by:
• the conditional mean for q = 2
1
• the conditional median for q = 1
2
• !
the conditional mode for q 0
Note that in a Gaussian distribution mean=median=mode.
1 The value separating the higher half of the samples from the lower half
2 The most frequent sample 10
3.2. BASIS FUNCTIONS 11
3.2 Basis functions
We extend the class of models by considering linear combinations of fixed nonlinear functions (basis
functions) of the input variables M 1
X T
y(x, w) = w + w (x) = w (x)
0 j j
j=1
T
where (x) = (1, (x), ..., (x)) .
1 M 1
There are multiple examples of basis functions:
j
• Polynomial (x) = x .
j
One limitation is that they are global functions of the input variable, so that changes in one region
of input space a↵ect all other regions. can be solved by dividing the input space in regions and fit a different polynomial in each
⇣ ⌘ region —> spline functions
2
(x µ )
j
• Gaussian (x) = exp .
j 2
2
They are not required to have a probabilistic interpretation, and in particular the normalization
coefficient is unimportant because these basis functions will be multiplied by w . —> Local because the 0 region is well
j defined. mu governs the location of
1
• Sigmoidal (x) = ⇣ ⌘ the basis functions in the input space
j µ x
j
1+exp and sigma governs their spatial scale
If there is no linear relationship between the target and the input, it is possible to change the space
through the basis function to obtain a linear model that can learn nonlinear functions of the input
variable.
In this way the model is linear in the new feature space but nonlinear in the original space.
3.2.1 Discriminative vs Generative
• Direct approach
Find a regression function y(x) directly from the training data by searching in the space of
the model (changing the parameters)
• Discriminative approach
Model directly the conditional density p(t|x) R
E[t|x]
Marginalize to find the conditional mean = tp(t|x)dt
• Generative approach: useful for augmenting data because it can generate new samples (e.g.
new input) and then ask someone to label new data
Model the joint density p(x, t) = p(x|t)p(t)
p(x,t)
Infer conditional density p(t|x) = p(x) R
E[t|x]
Marginalize to find the conditional mean = tp(t|x)dt
12 CHAPTER 3. LINEAR REGRESSION
3.3 Direct approaches
3.3.1 Minimizing Least Squares
Given a data set with N samples, let us consider the following loss function
N
X
1 2
L(w) = (y(x , w) t )
n n
2 n=1
This is (half) the residual sum of squares (RSS), also known as sum of squares error (SSE). It can
3
also be written as the sum of the ` -norm of the vector of the residual errors
2 N
X
22 2
||✏||
RSS(w) = = ✏ i
T T
Let’s write RSS in matrix form with = ( (x ), ..., (x )) and t = (t , ..., t )
1 N 1 N
⇥ ⇥ ⇥ ⇥
The residual ✏ = t w, of size N 1 (N M )(M 1) = N 1
P
22 2 T
||✏||
Given that = ✏ = ✏ ✏
i
i 1 1 T
L(w) = RSS(w) = (t w) (t w)
2 2
To obtain the minimum we need the point with:
• gradient = 0
• curvature has all eigenvalues > 0
We compute the first and second derivative 2
@L(w) @ L(w)
T T
= (t w); =
T
@w @w@w
T
We can observe that, assuming nonsingular, it is symmetric and positive semi-definite, so all the
eigenvalues are 0. If some eigenvalues are = 0 we have infinite solutions.
T T 1 T T 1 T T T
) )
(t w) = 0 ( ) w =( ) t w = t
And we obtain T 1 T
ŵ = ( ) t
OLS
It is not possible to perform the computation in two cases:
T
• we have less samples than features, so is not with full rank, so we have infinite solutions
• some features are linear combination of others, so we have a singular matrix
3.3.2 Geometric interpretation
We consider an N -dimensional space whose axes are given by the t ,
n
so that t = (t , ..., t ) is a vector in this space. Each basis function
1 N
(x ) can be also represented as a vector in this space.
n th
Note that as we can see from Figure 3.2, ' corresponds to the j
j th
column of , whereas (x ) corresponds to the n row of .
n
If the number M of basis functions is smaller than N, than the M
S
vectors (x ) will span a linear subspace of dimensionality M.
n th Figure 3.2: Graphical represen-
Define t̂ as the N -dimensional vector whose n element is given by tation of x ) and '
y(x , w). Because t̂ is an arbitrary linear combination ' , ..., ' , it n j
(
n 1 M
S.
lies in an M -dimensional subspace
The residual sum of squares is then equal to the squared Euclidean distance between t̂ and t. Thus the
S
least-squares solution for w corresponds to that choice of t̂ that lies in the subspace and that is closest
S.
to t. This solution corresponds to the orthogonal projection of t onto the subspace
T 1 T
t̂ = ŵ = ( ) t
T 1 T
where H = ( ) is called hat matrix.
P 1/p
3 p
||x|| |x |
=
p i
i2N
3.4. DISCRIMINATIVE APPROACHES 13
3.3.3 Gradient Optimization
A closed-form solution can be impractical with large data sets. It may be worthwhile to use a sequential
(online) algorithm, in which the gradient is computed one sample at a time. This algorithm can be
applied also when data observations are arriving in a continuous stream, and predictions must be made
before all of the data points are seen.
For this purpose we can apply the stochastic gradient descent. P
If the loss function can be expressed as a sum over samples (L(x) = L(x )) the algorithm updates
n
n
w as follows: (k+1) (k) (k) rL(X
w = w ↵ )
n
T
(k+1) (k) (k) (k)
w = w ↵ (w (x ) t ) (x )
n n n
where k is the iteration and ↵ is the learning rate. In the case we want to consider a set of K = 10 samples we could use the batch update
formulation:
For convergence the learning rate has to satisfy: we should divide alpha by K and then put then the summation from k=1 to K
1
X 1 = +1 (learning converges quickly enough)
(k)
↵
k=0
1
X 1 < +1 (learning converges not too fast)
2
(k)
↵
k=0 1
A possible example of learning rate is .
k
3.4 Discriminative approaches
3.4.1 Maximum Likelihood (ML)
We assume that the target variable t is given by a deterministic function y with a Gaussian noise
2
⇠ N
✏ (0, ) : t = y(x, w) + ✏
2
⇠ N
So, t (y(x, w), ).
Given N samples independent and identically distributed (i.i.d), with inputs X = x , ..., x and outputs
1 N
T
t = (t , ..., t ) , the likelihood function is
1 N N
Y
2 T 2
N |w
p(t|X, w, )= (t (x ), )=
n n
n=1
N
Y 1 y(x,w))2
(t
= e 2
2
2
2⇡
n=1
Now we need to use w to approximate the mean of the Gaussian, because as seen in section 3.1.1, with
a squared loss function, the optimal prediction of a new value x is equal to the conditional mean of the
14 CHAPTER 3. LINEAR REGRESSION
target variable. This can be done by finding the maximum likelihood, which is the probability of the
target values given the assumed model.
We can consider the log-likelihood function, which has the maximum value at the same point.
N
X
2 2
|x
`(w) = ln p(t|X, w, )= ln p(t , w, )
n n
n=1
N 1 N 1
2 2 T
= ln(2⇡ ) RSS(w) = ln(2⇡ ) (t w ) (t w )
2 2
2 2 2 4
Note that the first term does not depend on w, so it will disappear when we compute the gradient; the
second term is basically the loss function, confirming that maximizing the likelihood is equivalent to
minimizing the loss function. T T T
r ln `(W) = (t w) = t + w =0
T 1 T
w = ( ) t
M L
T 1 T
Where ( ) is a generalization of the inv
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
-
Appunti completi di Machine Learning
-
Appunti del corso di Machine Learning
-
Machine Learning - Appunti completi
-
Appunti Machine Learning