$$\mathrm{Var}(\hat{w}_{OLS}) = \left(\Phi^T \Phi\right)^{-1} \sigma^2$$
Note that if $\Phi^T \Phi$ is singular, the variance goes to infinity.
The more samples we have, the smaller the variance.
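As a quick numerical check of this formula, here is a small sketch; the design matrix, noise level and "true" weights below are made up for illustration, not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, sigma = 100, 3, 0.5
Phi = rng.normal(size=(N, M))          # toy design matrix (made up for illustration)
w_true = np.array([1.0, -2.0, 0.5])    # arbitrary "true" weights

# Theoretical covariance of the OLS estimator: sigma^2 (Phi^T Phi)^(-1)
var_theory = sigma**2 * np.linalg.inv(Phi.T @ Phi)

# Empirical check: fit OLS on many noisy datasets and look at the spread of w_hat
estimates = []
for _ in range(5000):
    t = Phi @ w_true + sigma * rng.normal(size=N)
    estimates.append(np.linalg.lstsq(Phi, t, rcond=None)[0])
var_empirical = np.cov(np.array(estimates).T)

print(np.round(var_theory, 4))
print(np.round(var_empirical, 4))      # close to the theoretical covariance
```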
Gauss-Markov Theorem: the least squares estimate of $w$ has the smallest variance among all linear unbiased estimates.
It follows that the least squares estimator has the lowest MSE (Mean Squared Error) among all unbiased linear estimators. However, what really matters is minimizing the MSE itself, so we can introduce a small bias in order to further reduce the MSE.
Under-fitting VS over-fitting
Under-fitting: happens with low-order polynomials, where there is too much bias.
Over-fitting: happens with high-order polynomials, where there is too much variance.
One method to avoid over-fitting is increasing the number of samples:
In green we have the true model, in red the approximation.
Note that when the model gets more complex, the parameters become very large, so a small change in the features results in a large change in the output: the model is very unstable. The answer is to change the loss function.
Regularization
As we said above, one way to reduce the MSE is to change the loss function. How? We can add a term that introduces a small bias, in order to keep the parameters small and regularize the loss function:

$$L(w) = L_D(w) + \lambda L_W(w)$$
So:
- $L_D(w)$ is a regular loss function that models the error on the data, for example the RSS.
- $\lambda$ is the hyperparameter that determines how much we want to regularize the function.
- $L_W(w)$ represents the model complexity.

⟹ $\lambda L_W(w)$ is the regularization term!
Ridge Regression
We can take the following $L_W$:

$$L_W(w) = \frac{1}{2}\|w\|_2^2 = \frac{1}{2} w^T w$$

Note: $\|w\|_2$ is the Euclidean norm, $\|w\|_2^2 = w_1^2 + w_2^2 + ... + w_N^2$.
Therefore, the loss function becomes:

$$L(w) = \frac{1}{2}\sum_{i=1}^{N} \left(t_i - w^T \phi(x_i)\right)^2 + \frac{\lambda}{2}\|w\|_2^2$$
This is called Ridge Regression. Looking at $\lambda$ we can deduce that if it is small ($\lambda \to 0$) we will have high variance, since the function is barely regularized; if instead $\lambda \to \infty$ we will have too much bias. Consequently, we have to find the right value for it.
Note that the loss function is still quadratic in $w$, so a closed-form solution exists:

$$\hat{w}_{ridge} = \left(\lambda I + \Phi^T \Phi\right)^{-1} \Phi^T t$$

If $\lambda \to 0$, it reduces to the ordinary least squares solution.
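A minimal sketch of this closed-form solution in numpy, on made-up polynomial-feature data (the data and the value of λ are only illustrative):

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Closed-form ridge solution: w = (lam * I + Phi^T Phi)^(-1) Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Illustrative data: 1D inputs with polynomial features phi(x) = [1, x, x^2, x^3]
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=30)
t = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=30)
Phi = np.vander(x, 4, increasing=True)

w_ridge = ridge_fit(Phi, t, lam=0.1)   # regularized weights (shrunk toward zero)
w_ols = ridge_fit(Phi, t, lam=0.0)     # lam -> 0 recovers ordinary least squares
print(w_ridge, w_ols)
```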
Lasso Approach
With the Lasso approach we can choose the following function:

$$L(w) = \frac{1}{2}\sum_{i=1}^{N} \left(t_i - w^T \phi(x_i)\right)^2 + \frac{\lambda}{2}\|w\|_1$$

where $\|w\|_1 = \sum_{j=1}^{M} |w_j|$ → sum of the absolute values.
✅ PRO: some weights are exactly zero for sufficiently large values of $\lambda$ (so it is possible to discard the corresponding features!). For this reason this approach produces sparse models.
❌ CON: it is nonlinear in the $t_i$ and no closed-form solution exists.
Ridge Regression vs Lasso: in blue, the contours of the unregularized error function; in orange, the constraint region introduced by the penalization.
The optimum value for the parameter vector $w$ is denoted by $w^*$. Looking at these graphs we can see that the Lasso gives a sparse solution, in which $w_1^* = 0$.
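Since no closed-form solution exists, Lasso is solved numerically in practice. A small sketch with scikit-learn on made-up data (here `alpha` plays the role of λ) shows the sparsity effect:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
# Only the first two features actually matter; the other eight are pure noise.
t = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, t)   # alpha plays the role of lambda
ridge = Ridge(alpha=0.1).fit(X, t)

print(np.round(lasso.coef_, 3))  # most coefficients are exactly 0 -> sparse model
print(np.round(ridge.coef_, 3))  # ridge shrinks coefficients but rarely zeroes them
```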
Bayesian Linear Regression
Recap:
- we can use maximum likelihood to set the parameters of a linear regression model
- the model complexity is governed by the number of basis functions, but it also needs to be controlled according to the size of the data set
- another way to control the model complexity is through regularization, although the choice of the parameters of the basis functions is still important
Now, the issue is deciding the appropriate model complexity for a particular problem (it cannot be determined by maximizing the likelihood function, because this always leads to excessively complex models and over-fitting). The answer is a Bayesian treatment of linear regression, which will avoid the over-fitting problem and will lead to automatic methods of determining model complexity using the training data alone.
So we want to formulate our knowledge about the world in a probabilistic way. We define a model that expresses this knowledge qualitatively; the model will have some unknown parameters, about which we make assumptions through a prior distribution before seeing the data. Then we observe the data and compute the posterior probability distribution for the parameters, given the observed data.
Posterior distribution
The posterior distribution $P(w \mid \mathcal{D})$ for the model parameters can be found by combining the prior with the likelihood for the parameters given the data; in particular, the posterior distribution is proportional to the product of the likelihood and the prior.
Ingredients:
- Prior probability distribution: $P(w) = \mathcal{N}(w \mid m_0, S_0)$
- Likelihood: $p(\mathcal{D} \mid w)$, i.e. the probability of the training data $\mathcal{D}$ given the parameters $w$
- Normalizing constant: $P(\mathcal{D}) = \int p(\mathcal{D} \mid w)\, P(w)\, dw$
Result:

$$P(w \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid w)\, P(w)}{P(\mathcal{D})}$$
We want the most probable value of $w$ given the data: the Maximum A Posteriori (MAP) estimate. It is the mode of the posterior.
Sequential approach
The Bayesian approach can be used in a sequential way. When? If you have computed the posterior distribution in a given situation and then you receive more data, you can use the previously computed posterior as the prior distribution and calculate the new posterior distribution.
Problem: this approach works well when prior and posterior are in the same family; otherwise it does not work analytically.
Solution: we use a conjugate prior, i.e. a prior chosen so that, multiplied by the likelihood, it produces a posterior in the same family as the prior (you have to choose a prior that works well with the likelihood).
OBS: if you don't want to use the conjugate prior, you can always approximate.
Assuming a Gaussian likelihood model, the conjugate prior is a multivariate Gaussian:

$$p(w) = \mathcal{N}(w \mid w_0, S_0)$$

with:
- $w_0$ → mean vector
- $S_0$ → covariance matrix (if it is diagonal, the weights are independent)
So, as we said before, the posterior is still Gaussian:

$$p(w \mid t, \Phi, \sigma^2) \propto \mathcal{N}(w \mid w_0, S_0)\, \mathcal{N}(t \mid \Phi w, \sigma^2 I) = \mathcal{N}(w \mid w_N, S_N)$$

$$w_N = S_N \left( S_0^{-1} w_0 + \frac{\Phi^T t}{\sigma^2} \right)$$

$$S_N^{-1} = S_0^{-1} + \frac{\Phi^T \Phi}{\sigma^2}$$
This is a probabilistic version of Ridge Regression: in fact, in Gaussian distributions the mode coincides with the mean, so $w_N$ is the MAP estimator.
- If the prior has infinite variance, $w_N$ reduces to the ML estimator.
- If $w_0 = 0$ and $S_0 = \tau^2 I$, then $w_N$ reduces to the ridge estimate, with $\lambda = \frac{\sigma^2}{\tau^2}$.
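A sketch of this posterior update on toy data (assumed known noise variance σ² and an assumed prior N(0, I); the helper `posterior` is just illustrative). It also illustrates the sequential approach described above: updating on two batches in turn gives the same posterior as using all the data at once.

```python
import numpy as np

def posterior(Phi, t, sigma2, w0, S0):
    """Gaussian posterior N(w | w_N, S_N) given a Gaussian prior N(w | w0, S0)."""
    S0_inv = np.linalg.inv(S0)
    S_N = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma2)
    w_N = S_N @ (S0_inv @ w0 + Phi.T @ t / sigma2)
    return w_N, S_N

rng = np.random.default_rng(3)
Phi = rng.normal(size=(40, 3))                       # toy design matrix
t = Phi @ np.array([1.0, -1.0, 2.0]) + 0.2 * rng.normal(size=40)
sigma2, w0, S0 = 0.2**2, np.zeros(3), np.eye(3)      # assumed known noise and prior

# All the data at once ...
w_all, S_all = posterior(Phi, t, sigma2, w0, S0)
# ... versus sequentially: the first posterior becomes the prior for the second batch
w_1, S_1 = posterior(Phi[:20], t[:20], sigma2, w0, S0)
w_2, S_2 = posterior(Phi[20:], t[20:], sigma2, w_1, S_1)

print(np.allclose(w_all, w_2), np.allclose(S_all, S_2))  # True True
```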
Predictive Distribution
In practice, we are not usually interested in the value of $w$ itself but rather in making predictions of $t$ for new values of $x$. This requires that we evaluate the posterior predictive distribution, defined by

$$p(t \mid x, \mathcal{D}, \sigma^2) = \int \mathcal{N}(t \mid w^T \phi(x), \sigma^2)\, \mathcal{N}(w \mid w_N, S_N)\, dw = \mathcal{N}(t \mid w_N^T \phi(x), \sigma_N^2(x))$$

$$\sigma_N^2(x) = \sigma^2 + \phi(x)^T S_N\, \phi(x)$$
where:
- $\sigma^2$ → noise in the target values (usually unknown); asymptotically, the predictive variance arises only from this term
- $\phi(x)^T S_N\, \phi(x)$ → uncertainty associated with the parameter values; note that it goes to zero as $N \to \infty$
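A minimal sketch of the predictive mean and variance, again on made-up data, with an assumed known σ² and an assumed prior N(0, I):

```python
import numpy as np

rng = np.random.default_rng(4)

def phi(x):
    # Illustrative polynomial feature map phi(x) = [1, x, x^2]
    return np.array([1.0, x, x**2])

x_train = rng.uniform(-1, 1, size=25)
Phi = np.stack([phi(x) for x in x_train])
sigma2 = 0.1**2                                          # assumed known noise variance
t = Phi @ np.array([0.5, -1.0, 2.0]) + np.sqrt(sigma2) * rng.normal(size=25)

# Posterior N(w | w_N, S_N) with prior N(0, I)
S_N = np.linalg.inv(np.eye(3) + Phi.T @ Phi / sigma2)
w_N = S_N @ (Phi.T @ t / sigma2)

# Predictive distribution at a new input x:
#   mean     = w_N^T phi(x)
#   variance = sigma^2 + phi(x)^T S_N phi(x)
x_new = 0.3
mean = w_N @ phi(x_new)
var = sigma2 + phi(x_new) @ S_N @ phi(x_new)
print(mean, var)  # the second variance term shrinks as the training set grows
```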
Modeling Challenges
1. Specifying:
   a. a suitable model: it should admit all the possibilities that are thought to be at all likely
   b. a suitable prior distribution: it should avoid giving zero or very small probabilities to possible events, but it should also avoid spreading the probability too thinly over all possibilities.

To avoid uninformative priors, we may need to model dependencies between parameters: one strategy is to introduce latent variables into the model and hyperparameters into the prior. This allows the dependencies to be modelled in a tractable way.
2. Computing the posterior distribution, several approaches:
a. Analytical integration
b. Gaussian (Laplace) approximation
c. Monte Carlo integration
d. Variational approximation
Pros and cons of fixed basis functions
✅ PRO:
- closed-form solution
- tractable Bayesian treatment
- arbitrary non-linearity with the proper basis functions

❌ CON:
- basis functions are chosen independently from the training set
- curse of dimensionality
Linear Classification
Classification problems
Goal of classification: assign an input $x$ to one of $K$ discrete classes $C_k$, where $k = 1, ..., K$.
Typically, each input is assigned to only one class; we assume that there is no noise in the labels.
Linear Classification
The input space is divided into decision regions whose boundaries are called decision boundaries or decision surfaces.
For classification, we need to predict discrete class labels, or posterior probabilities that lie in the range $(0, 1)$, so we use a nonlinear function:
(