
$\mathrm{Var}(\hat{w}_{OLS}) = (\Phi^T \Phi)^{-1} \sigma^2$

Note that if $\Phi^T \Phi$ is singular, the variance goes to infinity.

The more samples we have, the smaller the variance.

Gauss-Markov Theorem: the least squares estimate of $w$ has the smallest variance among all linear unbiased estimates.

It follows that the least squares estimator has the lowest MSE (Mean Squared Error) among all unbiased linear estimators. However, what ultimately matters is achieving a low MSE, so we can introduce a small bias in order to further reduce the MSE.
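As a minimal numerical sketch of this point (the data, true weights and ridge strength below are made-up choices, not from the notes), the following compares the theoretical OLS variance $(\Phi^T \Phi)^{-1} \sigma^2$ with an empirical estimate, and shows how a slightly biased ridge estimate can trade bias for variance:

```python
import numpy as np

# Synthetic setup (arbitrary choices) to illustrate OLS variance and the
# bias-variance trade-off against a ridge estimator.
rng = np.random.default_rng(0)
n, M = 20, 3
Phi = rng.normal(size=(n, M))            # fixed design matrix
w_true = np.array([0.3, -0.2, 0.1])      # assumed true weights
sigma = 1.0                              # noise standard deviation
lam = 5.0                                # ridge strength (arbitrary)

# Theoretical OLS variance: Var(w_hat_OLS) = (Phi^T Phi)^{-1} sigma^2
var_theory = sigma**2 * np.linalg.inv(Phi.T @ Phi)

ols, ridge = [], []
for _ in range(5000):                    # resample only the noise
    t = Phi @ w_true + sigma * rng.normal(size=n)
    ols.append(np.linalg.solve(Phi.T @ Phi, Phi.T @ t))
    ridge.append(np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t))
ols, ridge = np.array(ols), np.array(ridge)

print("theoretical OLS variances:", np.round(np.diag(var_theory), 4))
print("empirical OLS variances:  ", np.round(ols.var(axis=0), 4))

# OLS is unbiased but has higher variance; ridge accepts a small bias
# in exchange for lower variance, which here gives a lower total MSE.
for name, est in (("OLS", ols), ("ridge", ridge)):
    bias2 = np.sum((est.mean(axis=0) - w_true) ** 2)
    var = np.sum(est.var(axis=0))
    print(f"{name:5s} bias^2={bias2:.4f} var={var:.4f} mse={bias2 + var:.4f}")
```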

Under-fitting VS over-fitting

Under-fitting: with low-order polynomials, so there is too much bias.

Over-fitting: with high-order polynomials, so there is too much variance.

One method to avoid over-fitting is increasing the number of samples:

In green we have the true model, in red the approximation.

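The figure's effect can be reproduced with a quick sketch (assumed synthetic data, not from the notes: noisy samples of a sine curve), fitting polynomials of different degrees and comparing train and test error:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy data: noisy samples of sin(2*pi*x).
def make_data(n):
    x = rng.uniform(0, 1, n)
    t = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)
    return x, t

x_train, t_train = make_data(10)      # few samples -> over-fitting likely
x_test, t_test = make_data(200)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, t_train, degree)     # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - t_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - t_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

# Degree 1 under-fits (high bias); degree 9 drives the train error to ~0 but
# over-fits (high variance). Refitting with many more training samples makes
# the high-degree fit much more stable.
```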

Note that as the model gets more complex the parameters become very large, so a small change in the features results in a large change of the output: the model is very unstable. The answer is to change the loss function.

Regularization

As we said above, one way to reduce the MSE is to change the loss function. How? We can add a term that represents a small bias, in order to keep the parameters small and regularize the loss function:

$L(w) = L_D(w) + \lambda L_W(w)$

So:
- $L_D(w)$ is a regular loss function that models the error on the data, for example the RSS.
- $\lambda$ is the hyperparameter that determines how much we want to regularize the function.
- $L_W(w)$ represents the model complexity.
- $\Rightarrow$ $\lambda L_W(w)$ is the term for the regularization!

Ridge Regression

We can take the following $L_W$:

$L_W(w) = \frac{1}{2} w^T w = \frac{1}{2} \|w\|_2^2$

Note: $\|w\|_2$ is the Euclidean norm, with $\|w\|_2^2 = w_1^2 + w_2^2 + \dots + w_M^2$.

Therefore, the loss function becomes:

$L(w) = \frac{1}{2} \sum_{i=1}^{N} \left(t_i - w^T \phi(x_i)\right)^2 + \frac{\lambda}{2} \|w\|_2^2$

This is called Ridge Regression. Looking at $\lambda$ we can deduce that if it is small, i.e. $\lambda \to 0$, we will have high variance and the function won't be regularized; otherwise, if $\lambda \to \infty$, we will have too much bias. Consequently, we have to find the right value for it.

Note that the loss function is still quadratic in $w$, so the minimizer is available in closed form:

$\hat{w}_{ridge} = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t$

If $\lambda \to 0$, it reduces to the ordinary squared-loss (OLS) solution.
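A minimal sketch of this closed-form solution (the design matrix `Phi`, targets `t` and `lam` below are random placeholders, just to make it runnable):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 50, 5
Phi = rng.normal(size=(N, M))    # placeholder design matrix of basis functions
t = rng.normal(size=N)           # placeholder target vector
lam = 1.0                        # regularization strength

# w_ridge = (lam*I + Phi^T Phi)^{-1} Phi^T t, computed via solve() rather
# than an explicit matrix inverse for numerical stability.
w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
print(w_ridge)
```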

Lasso Approach

With the Lasso approach we can choose the following function:

$L(w) = \frac{1}{2} \sum_{i=1}^{N} \left(t_i - w^T \phi(x_i)\right)^2 + \frac{\lambda}{2} \|w\|_1$

Where $\|w\|_1 = \sum_{j=1}^{M} |w_j|$ → sum of absolute values.

✅ PRO: some weights are equal to zero for sufficiently large values of $\lambda$ (so it is possible to discard the corresponding features!). For this reason this approach produces sparse models.

❌ CON: it is nonlinear in the $t_i$ and no closed-form solution exists.

(Figure: Ridge Regression and Lasso constraint regions.) In blue, the contours of the unregularized error function; in orange, the constraint region introduced by the penalization.

The optimum value for the parameter vector $w$ is denoted by $w^*$. Looking at these graphs we can see that the Lasso gives a sparse solution in which $w_1^* = 0$.
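Since no closed-form solution exists, here is a sketch that solves the Lasso numerically with proximal gradient descent (ISTA, a method not covered in these notes) on assumed synthetic data where only the first two of ten features matter, to show the sparsity effect described above:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 100, 10
Phi = rng.normal(size=(N, M))
w_true = np.zeros(M)
w_true[:2] = [3.0, -2.0]                 # only two features are relevant
t = Phi @ w_true + 0.1 * rng.normal(size=N)

lam = 10.0                               # regularization strength (arbitrary)
eta = 1.0 / np.linalg.norm(Phi, 2) ** 2  # step size <= 1 / Lipschitz constant

def soft_threshold(z, c):
    return np.sign(z) * np.maximum(np.abs(z) - c, 0.0)

# Minimize 1/2 * ||Phi w - t||^2 + (lam/2) * ||w||_1 with ISTA.
w = np.zeros(M)
for _ in range(2000):
    grad = Phi.T @ (Phi @ w - t)                        # gradient of the data term
    w = soft_threshold(w - eta * grad, eta * lam / 2)   # prox of (lam/2)*||w||_1

print(np.round(w, 3))   # most irrelevant weights are exactly zero -> sparse model
```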

Bayesian Linear Regression

Recap: we can use maximum likelihood to set the parameters of a linear regression model
- the model complexity is governed by the number of basis functions but also needs to be controlled according to the size of the data set
- another way to control the model complexity is through regularization, although the choice of the parameters of the basis functions is still important

Now, the issue is deciding the appropriate model complexity for a particular problem (it cannot be determined by maximizing the likelihood function, because this always leads to excessively complex models and over-fitting). The answer is a Bayesian treatment of linear regression, which will avoid the over-fitting problem and will lead to automatic methods of determining model complexity using the training data alone.

So we want to formulate our knowledge about the world in a probabilistic way. We define a model that expresses this knowledge qualitatively; this model will have some unknown parameters, about which we make assumptions through a prior distribution over those parameters, before seeing the data. Then we observe the data and compute the posterior probability distribution for the parameters, given the observed data.

Posterior distribution $P(w \mid D)$

The posterior distribution for the model parameters $w$ can be found by combining the prior with the likelihood for the parameters given the data; in particular, the posterior distribution is proportional to the product of the likelihood and the prior.

Ingredients:
- Prior probability distribution: $P(w) = \mathcal{N}(w \mid m_0, S_0)$
- Likelihood: $p(D \mid w)$, i.e. the probability of the observed training data $D$ given the parameters $w$
- Normalizing constant: $P(D) = \int p(D \mid w) P(w) \, dw$


Result: $P(w \mid D) = \dfrac{p(D \mid w)\, P(w)}{P(D)}$

We want the most probable value of $w$ given the data: Maximum A Posteriori (MAP). It is the mode of the posterior.

Sequential approach

The Bayesian approach can be used in a sequential way: if you have computed the posterior distribution in a given situation and then you receive more data, you can use the previously computed posterior as the prior distribution and calculate the new posterior distribution.

Problem: this approach works well when prior and posterior are in the same family; otherwise it does not work analytically.

Solution: we use a conjugate prior, i.e. a pair of prior and likelihood distributions whose product yields a posterior in the same family as the prior (you have to choose a prior that works well with the likelihood).

OBS: if you don’t want to use the conjugate prior, you can always approximate.

Assuming a Gaussian likelihood model, the conjugate prior is a multivariate Gaussian: $p(w) = \mathcal{N}(w \mid w_0, S_0)$

With:
- $w_0$ → mean vector
- $S_0$ → covariance matrix (if it is diagonal then the weights are independent)

So, as we said before, the posterior is still Gaussian:

$p(w \mid t, \Phi, \sigma^2) \propto \mathcal{N}(w \mid w_0, S_0)\, \mathcal{N}(t \mid \Phi w, \sigma^2 I) = \mathcal{N}(w \mid w_N, S_N)$

$w_N = S_N \left( S_0^{-1} w_0 + \frac{\Phi^T t}{\sigma^2} \right)$

$S_N^{-1} = S_0^{-1} + \frac{\Phi^T \Phi}{\sigma^2}$


This is a probabilistic version of Ridge Regression: in fact, in Gaussian distributions the mode coincides with the mean, so it follows that $w_N$ is the MAP estimator:

- If the prior has infinite variance, $w_N$ reduces to the ML estimator.
- If $w_0 = 0$ and $S_0 = \tau^2 I$, then $w_N$ reduces to the ridge estimate, where $\lambda = \frac{\sigma^2}{\tau^2}$.
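A minimal sketch of this posterior update (data, true weights and the noise variance below are made-up placeholders; the noise variance is assumed known). With $w_0 = 0$ and $S_0 = \tau^2 I$ it mirrors the ridge connection above:

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 20, 3
Phi = rng.normal(size=(N, M))                   # design matrix
w_true = np.array([0.5, -1.0, 2.0])             # assumed true weights
sigma2 = 0.25                                   # noise variance (assumed known)
t = Phi @ w_true + np.sqrt(sigma2) * rng.normal(size=N)

# Prior p(w) = N(w | w0, S0); with w0 = 0 and S0 = tau^2 I this matches
# ridge regression with lambda = sigma^2 / tau^2.
tau2 = 1.0
w0 = np.zeros(M)
S0_inv = np.eye(M) / tau2

# Posterior: S_N^{-1} = S0^{-1} + Phi^T Phi / sigma^2
#            w_N = S_N (S0^{-1} w0 + Phi^T t / sigma^2)
SN_inv = S0_inv + Phi.T @ Phi / sigma2
SN = np.linalg.inv(SN_inv)
wN = SN @ (S0_inv @ w0 + Phi.T @ t / sigma2)

print("posterior mean (MAP estimate):", np.round(wN, 3))
# Sequential use: (wN, SN) can serve as the prior for the next batch of data.
```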

Predictive Distribution

In practice, we are not usually interested in the value of $w$ itself but rather in making predictions of $t$ for new values of $x$. This requires that we evaluate the posterior predictive distribution, defined by

$p(t \mid x, D, \sigma^2) = \int \mathcal{N}(t \mid w^T \phi(x), \sigma^2)\, \mathcal{N}(w \mid w_N, S_N)\, dw = \mathcal{N}(t \mid w_N^T \phi(x), \sigma_N^2(x))$

$\sigma_N^2(x) = \sigma^2 + \phi(x)^T S_N \phi(x)$

where:
- $\sigma^2$ → noise in the target values (usually unknown); as $N \to \infty$ the predictive variance arises only from this term
- $\phi(x)^T S_N \phi(x)$ → uncertainty associated with the parameter values; note that as $N \to \infty$ it goes to zero
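A short sketch of evaluating the predictive mean and variance for a new input; here `wN`, `SN`, `sigma2` and `phi_new` are placeholder values standing in for the posterior quantities and the basis-function vector computed as above:

```python
import numpy as np

rng = np.random.default_rng(5)
M = 3
sigma2 = 0.25                          # noise variance
wN = rng.normal(size=M)                # placeholder posterior mean
A = rng.normal(size=(M, M))
SN = A @ A.T / 10 + 1e-3 * np.eye(M)   # placeholder positive-definite posterior covariance
phi_new = rng.normal(size=M)           # placeholder phi(x) for a new input x

pred_mean = wN @ phi_new                       # w_N^T phi(x)
pred_var = sigma2 + phi_new @ SN @ phi_new     # sigma^2 + phi(x)^T S_N phi(x)
print(f"predictive mean {pred_mean:.3f}, predictive variance {pred_var:.3f}")
```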

Modeling Challenges

1. Specifying:
   a. a suitable model: it should admit all the possibilities that are thought to be at all likely
   b. a suitable prior distribution: it should avoid giving zero or very small probabilities to possible events, but should also avoid spreading the probability too thinly over all possibilities.

To avoid uninformative priors, we may need to model dependencies between parameters: one strategy is to introduce latent variables into the model and hyperparameters into the prior. This allows modeling in a tractable way.

2. Computing the posterior distribution, several approaches:
   a. Analytical integration
   b. Gaussian (Laplace) approximation
   c. Monte Carlo integration
   d. Variational approximation

Pros and cons of fixed basis functions

✅ PRO:
- closed-form solution
- tractable Bayesian treatment
- arbitrary non-linearity with the proper basis functions

❌ CON:
- basis functions are chosen independently from the training set
- curse of dimensionality


Linear Classification

Classification problems

Goal of Classification: assign an input $x$ to one of $K$ discrete classes $C_k$, where $k = 1, \dots, K$.

Typically, each input is assigned to only one class; we assume that we have no noise in the labels.

Linear Classification

The input space is divided into decision regions whose boundaries are called decision boundaries or decision surfaces.

For classification, we need to predict discrete class labels, which lie in the range $(0, 1)$, so we use a nonlinear function:
