Estratto del documento

Contents

1 Introduction: Machine Learning Models 4

1.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.2 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

I Supervised Learning 6

2 Introduction 7

2.1 Overview of Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3 Linear Regression 10

3.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.2 Basis functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.3 Direct approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.4 Discriminative approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.5 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.6 Bayesian Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.7 Notable indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.8 Modeling Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4 Linear Classification 20

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.2 Discriminant function (Direct approach) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.3 Probabilistic Discriminative Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

4.4 Probabilistic Generative Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.5 Non-parametric methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.6 Performance measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

5 Bias-Variance trade-o↵ and model selection 30

5.1 No Free Lunch Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

5.2 Bias-Variance Tradeo↵ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

5.3 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

5.4 Model Ensembles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

6 PAC-Learning and VC-Dimension 36

6.1 PAC-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

6.2 VC-Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

7 Kernel Methods 40

7.1 Dual representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

7.2 Constructing kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

8 Support Vector Machines 43

8.1 Learning SVMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

8.2 Handling noisy data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

8.3 Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

2

CONTENTS 3

II Reinforcement Learning 47

9 Markov Decision Processes 51

9.1 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

9.2 Goals, rewards and return . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

9.3 Policies and value functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

10 Solving Markov Decision Processes 57

10.1 Policy Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

10.2 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

10.3 Infinite Horizon Linear Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

11 RL in finite MDPs 61

11.1 Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

11.2 Model-free Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

11.3 Temporal Di↵erence Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

11.4 n-step TD Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

11.5 Model-Free Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

12 Multi-Armed Bandit 70

12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

12.2 Multi-Armed Bandit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

12.3 Stochastic MAB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

12.4 Adversarial MAB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

Chapter 1

Introduction: Machine Learning

Models

1.1 Supervised Learning

• Goal Estimating the unknown model that maps known inputs to known outputs

D {hx, )

Training set: = ti} t = f (x)

• Problems

Classification

Regression

Probability estimation

• Techniques

Artificial Neural Networks

Support Vector Machines

Decision trees

...

1.2 Unsupervised Learning

• Goal Learning a more efficient representation of a set on unknown inputs

D {x} )?

Training set: = = f (x)

• Problems

Compression

Clustering

• Techniques

K-means

Self-organizing maps

Principal Component Analysis

... 4

1.3. REINFORCEMENT LEARNING 5

1.3 Reinforcement Learning

• Goal Learning the optimal policy

0

D {hx, ) ⇤ {Q ⇤

Training set: = u, x , ri} ⇡ (x) = argmax (x, u)} where Q*(x,u) must be

u

estimated

• Problems

Markov Decision Process (MDP)

Partially Observable MDP (POMDP)

Stochastic Games (SG)

• Techniques

Q-learning

SARSA

Fitted Q-iteration

... Part I

Supervised Learning

6

Chapter 2

Introduction

Supervised learning is the largest, most mature and most widely used sub-field of machine learning.

D {hx,

Given training data set including desired outputs = ti} from some unknown function f.

Find a good approximation of f that generalizes well on test data.

Input variables x are also called features or attributes.

Output variables t are also called targets or labels.

• If t is discrete: classification

• if t is continuous: regression

• if t is the probability of x : probability estimation, which is di↵erent from regression because of the

constraints imposed by probability.

Supervised learning is based on an ill-posed problem, because given the training set we are not interested

on the performance on the training set, but we have to optimize and objective function which is not

known.

2.1 Overview of Supervised Learning

D

We want to approximate f given the dataset

The steps are: L,

1. Define a loss function which is a metric of how much a f is far from the target

The loss function cannot be build over the true function f but over the sample. It is a noisy measure

7

8 CHAPTER 2. INTRODUCTION

H,

2. Choose some hypothesis space which is the subspace of all the possible functions that we are

interested in. h

3. Optimize to find and approximate model that minimizes the loss with respect to the target

function.

What happens if we enlarge the hypothesis space?

The loss of the best hypothesis will not increase and probably decrease.

So the hypothesis space can be enlarged until the true model is inside of it, obtaining a loss value equal

to zero.

2.1. OVERVIEW OF SUPERVISED LEARNING 9

If we don’t know f but we only have a few samples taken from it, the loss function is based not based

on the function anymore, but on the samples.

So, the optimal function h based on the new loss function may be far from the true model and be farther

2

as long as we enlarge the hypothesis space.

If there are few samples, the loss function may be very noisy, and the more we enlarge the hypothesis

space, the more we are sensitive to the noise.

With infinite samples we can enlarge the hypothesis space without problems, because it’s like knowing f

(no uncertainty).

Chapter 3

Linear Regression

3.1 Linear Regression

The goal of regression is to learn a mapping from D-dimensional vector x of input variables to a continuous

output t. The simplest form of linear regression models are linear functions of the input variables.

This imposes significant limitations to the problem

D 1

X T

y(b, w) = w + w x = w x

0 j j The simplest form of linear regression models are also linear

j=1 functions of the input variables, however, we can obtain a

much more useful class of functions by taking LINEAR

where x = (1, x , ..., x ) and w is called o↵set or bias parameter.

1 D 1 0 COMBINATIONS of fixed sets of non linear functions of the

input variables, also known as BASIS FUNCTIONS

3.1.1 The loss function

The loss (error) function is a measure to quantify the performance on a task L(t, y(x)).

The average, or expected, loss is given by

Z Z

E[L] = L(t, y(x))p(x, t)dxdt

A common choice is the squared loss function

Z Z 2

E[L] = (t y(x)) p(x, t)dxdt

p(x,t) is the probability distribution under which we want to op-

timize the model. It’s the distribution over the future (test) samples

which is usually equal to the distribution of the train samples. So, if

p(x,t) is known, also the distribution of the true model is known.

E[L].

Our goal is to choose y(x) so as to minimize Hence, the optimal

solution is the conditional average

Z E[t|x]

y(x) = tp(t|x)dt =

The squared loss is not the only possible choice of loss function for

regression. An example is the simple generalization of the squared

loss, called Minkowski loss Figure 3.1: y(x) is given by the

Z Z mean of the conditional distribu-

q

E[L] = (t y(x)) p(x, t)dxdt tion p(t—x)

E[L]

The minimum of is given by:

• the conditional mean for q = 2

1

• the conditional median for q = 1

2

• !

the conditional mode for q 0

Note that in a Gaussian distribution mean=median=mode.

1 The value separating the higher half of the samples from the lower half

2 The most frequent sample 10

3.2. BASIS FUNCTIONS 11

3.2 Basis functions

We extend the class of models by considering linear combinations of fixed nonlinear functions (basis

functions) of the input variables M 1

X T

y(x, w) = w + w (x) = w (x)

0 j j

j=1

T

where (x) = (1, (x), ..., (x)) .

1 M 1

There are multiple examples of basis functions:

j

• Polynomial (x) = x .

j

One limitation is that they are global functions of the input variable, so that changes in one region

of input space a↵ect all other regions. can be solved by dividing the input space in regions and fit a different polynomial in each

⇣ ⌘ region —> spline functions

2

(x µ )

j

• Gaussian (x) = exp .

j 2

2

They are not required to have a probabilistic interpretation, and in particular the normalization

coefficient is unimportant because these basis functions will be multiplied by w . —> Local because the 0 region is well

j defined. mu governs the location of

1

• Sigmoidal (x) = ⇣ ⌘ the basis functions in the input space

j µ x

j

1+exp and sigma governs their spatial scale

If there is no linear relationship between the target and the input, it is possible to change the space

through the basis function to obtain a linear model that can learn nonlinear functions of the input

variable.

In this way the model is linear in the new feature space but nonlinear in the original space.

3.2.1 Discriminative vs Generative

• Direct approach

Find a regression function y(x) directly from the training data by searching in the space of

the model (changing the parameters)

• Discriminative approach

Model directly the conditional density p(t|x) R

E[t|x]

Marginalize to find the conditional mean = tp(t|x)dt

• Generative approach: useful for augmenting data because it can generate new samples (e.g.

new input) and then ask someone to label new data

Model the joint density p(x, t) = p(x|t)p(t)

p(x,t)

Infer conditional density p(t|x) = p(x) R

E[t|x]

Marginalize to find the conditional mean = tp(t|x)dt

12 CHAPTER 3. LINEAR REGRESSION

3.3 Direct approaches

3.3.1 Minimizing Least Squares

Given a data set with N samples, let us consider the following loss function

N

X

1 2

L(w) = (y(x , w) t )

n n

2 n=1

This is (half) the residual sum of squares (RSS), also known as sum of squares error (SSE). It can

3

also be written as the sum of the ` -norm of the vector of the residual errors

2 N

X

22 2

||✏||

RSS(w) = = ✏ i

T T

Let’s write RSS in matrix form with = ( (x ), ..., (x )) and t = (t , ..., t )

1 N 1 N

⇥ ⇥ ⇥ ⇥

The residual ✏ = t w, of size N 1 (N M )(M 1) = N 1

P

22 2 T

||✏||

Given that = ✏ = ✏ ✏

i

i 1 1 T

L(w) = RSS(w) = (t w) (t w)

2 2

To obtain the minimum we need the point with:

• gradient = 0

• curvature has all eigenvalues > 0

We compute the first and second derivative 2

@L(w) @ L(w)

T T

= (t w); =

T

@w @w@w

T

We can observe that, assuming nonsingular, it is symmetric and positive semi-definite, so all the

eigenvalues are 0. If some eigenvalues are = 0 we have infinite solutions.

T T 1 T T 1 T T T

) )

(t w) = 0 ( ) w =( ) t w = t

And we obtain T 1 T

ŵ = ( ) t

OLS

It is not possible to perform the computation in two cases:

T

• we have less samples than features, so is not with full rank, so we have infinite solutions

• some features are linear combination of others, so we have a singular matrix

3.3.2 Geometric interpretation

We consider an N -dimensional space whose axes are given by the t ,

n

so that t = (t , ..., t ) is a vector in this space. Each basis function

1 N

(x ) can be also represented as a vector in this space.

n th

Note that as we can see from Figure 3.2, ' corresponds to the j

j th

column of , whereas (x ) corresponds to the n row of .

n

If the number M of basis functions is smaller than N, than the M

S

vectors (x ) will span a linear subspace of dimensionality M.

n th Figure 3.2: Graphical represen-

Define t̂ as the N -dimensional vector whose n element is given by tation of x ) and '

y(x , w). Because t̂ is an arbitrary linear combination ' , ..., ' , it n j

(

n 1 M

S.

lies in an M -dimensional subspace

The residual sum of squares is then equal to the squared Euclidean distance between t̂ and t. Thus the

S

least-squares solution for w corresponds to that choice of t̂ that lies in the subspace and that is closest

S.

to t. This solution corresponds to the orthogonal projection of t onto the subspace

T 1 T

t̂ = ŵ = ( ) t

T 1 T

where H = ( ) is called hat matrix.

P 1/p

3 p

||x|| |x |

=

p i

i2N

3.4. DISCRIMINATIVE APPROACHES 13

3.3.3 Gradient Optimization

A closed-form solution can be impractical with large data sets. It may be worthwhile to use a sequential

(online) algorithm, in which the gradient is computed one sample at a time. This algorithm can be

applied also when data observations are arriving in a continuous stream, and predictions must be made

before all of the data points are seen.

For this purpose we can apply the stochastic gradient descent. P

If the loss function can be expressed as a sum over samples (L(x) = L(x )) the algorithm updates

n

n

w as follows: (k+1) (k) (k) rL(X

w = w ↵ )

n

T

(k+1) (k) (k) (k)

w = w ↵ (w (x ) t ) (x )

n n n

where k is the iteration and ↵ is the learning rate. In the case we want to consider a set of K = 10 samples we could use the batch update

formulation:

For convergence the learning rate has to satisfy: we should divide alpha by K and then put then the summation from k=1 to K

1

X 1 = +1 (learning converges quickly enough)

(k)

k=0

1

X 1 < +1 (learning converges not too fast)

2

(k)

k=0 1

A possible example of learning rate is .

k

3.4 Discriminative approaches

3.4.1 Maximum Likelihood (ML)

We assume that the target variable t is given by a deterministic function y with a Gaussian noise

2

⇠ N

✏ (0, ) : t = y(x, w) + ✏

2

⇠ N

So, t (y(x, w), ).

Given N samples independent and identically distributed (i.i.d), with inputs X = x , ..., x and outputs

1 N

T

t = (t , ..., t ) , the likelihood function is

1 N N

Y

2 T 2

N |w

p(t|X, w, )= (t (x ), )=

n n

n=1

N

Y 1 y(x,w))2

(t

= e 2

2

2

2⇡

n=1

Now we need to use w to approximate the mean of the Gaussian, because as seen in section 3.1.1, with

a squared loss function, the optimal prediction of a new value x is equal to the conditional mean of the

14 CHAPTER 3. LINEAR REGRESSION

target variable. This can be done by finding the maximum likelihood, which is the probability of the

target values given the assumed model.

We can consider the log-likelihood function, which has the maximum value at the same point.

N

X

2 2

|x

`(w) = ln p(t|X, w, )= ln p(t , w, )

n n

n=1

N 1 N 1

2 2 T

= ln(2⇡ ) RSS(w) = ln(2⇡ ) (t w ) (t w )

2 2

2 2 2 4

Note that the first term does not depend on w, so it will disappear when we compute the gradient; the

second term is basically the loss function, confirming that maximizing the likelihood is equivalent to

minimizing the loss function. T T T

r ln `(W) = (t w) = t + w =0

T 1 T

w = ( ) t

M L

T 1 T

Where ( ) is a generalization of the inv

Anteprima
Vedrai una selezione di 16 pagine su 73
Appunti Lezione Machine Learning Pag. 1 Appunti Lezione Machine Learning Pag. 2
Anteprima di 16 pagg. su 73.
Scarica il documento per vederlo tutto.
Appunti Lezione Machine Learning Pag. 6
Anteprima di 16 pagg. su 73.
Scarica il documento per vederlo tutto.
Appunti Lezione Machine Learning Pag. 11
Anteprima di 16 pagg. su 73.
Scarica il documento per vederlo tutto.
Appunti Lezione Machine Learning Pag. 16
Anteprima di 16 pagg. su 73.
Scarica il documento per vederlo tutto.
Appunti Lezione Machine Learning Pag. 21
Anteprima di 16 pagg. su 73.
Scarica il documento per vederlo tutto.
Appunti Lezione Machine Learning Pag. 26
Anteprima di 16 pagg. su 73.
Scarica il documento per vederlo tutto.
Appunti Lezione Machine Learning Pag. 31
Anteprima di 16 pagg. su 73.
Scarica il documento per vederlo tutto.
Appunti Lezione Machine Learning Pag. 36
Anteprima di 16 pagg. su 73.
Scarica il documento per vederlo tutto.
Appunti Lezione Machine Learning Pag. 41
Anteprima di 16 pagg. su 73.
Scarica il documento per vederlo tutto.
Appunti Lezione Machine Learning Pag. 46
Anteprima di 16 pagg. su 73.
Scarica il documento per vederlo tutto.
Appunti Lezione Machine Learning Pag. 51
Anteprima di 16 pagg. su 73.
Scarica il documento per vederlo tutto.
Appunti Lezione Machine Learning Pag. 56
Anteprima di 16 pagg. su 73.
Scarica il documento per vederlo tutto.
Appunti Lezione Machine Learning Pag. 61
Anteprima di 16 pagg. su 73.
Scarica il documento per vederlo tutto.
Appunti Lezione Machine Learning Pag. 66
Anteprima di 16 pagg. su 73.
Scarica il documento per vederlo tutto.
Appunti Lezione Machine Learning Pag. 71
1 su 73
D/illustrazione/soddisfatti o rimborsati
Acquista con carta o PayPal
Scarica i documenti tutte le volte che vuoi
Dettagli
SSD
Ingegneria industriale e dell'informazione ING-INF/05 Sistemi di elaborazione delle informazioni

I contenuti di questa pagina costituiscono rielaborazioni personali del Publisher bonadiamatilde di informazioni apprese con la frequenza delle lezioni di Machine learning e studio autonomo di eventuali libri di riferimento in preparazione dell'esame finale o della tesi. Non devono intendersi come materiale ufficiale dell'università Politecnico di Milano o del prof Restelli Marcello.
Appunti correlati Invia appunti e guadagna

Domande e risposte

Hai bisogno di aiuto?
Chiedi alla community