
Regularization

For each output $t_k$ we have
$$\hat{w}_k = (\Phi^T \Phi)^{-1} \Phi^T t_k$$
where $t_k$ is an N-dimensional column vector. Thus the solution decouples between the different outputs, and we only need to compute a single pseudo-inverse matrix, which is shared by all of the vectors $\hat{w}_k$.

Mean Squared Error: $\frac{1}{N}\sum_n \big(y(x_n, w) - t_n\big)^2$. The covariance matrix has the variances on the diagonal and the covariances in the off-diagonal entries.

3.5 Regularization

3.5.1 Under-fitting vs Over-fitting

A model with low complexity is usually not capable of fitting the data, and so of representing the true model appropriately (under-fitting). A model with high complexity (e.g. high-order polynomials) gives an excellent fit over the training data, but a poor representation of the true function (over-fitting). The test error is a measure of how well we are doing in predicting the values of t for a new observation of x.

The goal is to achieve a good generalization, and we obtain some quantitative insight about that by reserving a part of the data set for testing. By calculating the error also for the test set it is possible to evaluate the generalization. The root-mean-square error (RMS) is used:
$$E_{RMS} = \sqrt{\frac{2\,RSS(\hat{w})}{N}}$$
in which the division by N allows us to compare sets of different sizes, and the square root ensures that the error is on the same scale as the target variable.

Figure 3.3: $E_{RMS}$ on the training and test sets. (The training error goes to 0 because the model uses all the degrees of freedom.)

As we can see from Figure 3.3, beyond a certain model complexity, even though the training error continues decreasing, at some point the test error starts to increase. That is because of over-fitting: the model takes on huge values for the parameters in order to try to fit all the points of the training set, as shown in Figure 3.4.
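
As a purely illustrative sketch of this train/test comparison (not taken from the notes: the data-generating function, noise level, split and polynomial degrees are my own assumptions), the snippet below fits polynomials of increasing degree by ordinary least squares and reports $E_{RMS}$ on both splits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: noisy sine, split into train and test.
x = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)
x_tr, t_tr, x_te, t_te = x[:20], t[:20], x[20:], t[20:]

def design_matrix(x, degree):
    # Polynomial features: Phi[i, j] = x_i ** j, for j = 0..degree
    return np.vander(x, degree + 1, increasing=True)

def rms_error(w, x, t):
    rss = np.sum((design_matrix(x, len(w) - 1) @ w - t) ** 2)
    return np.sqrt(2 * rss / len(t))          # E_RMS = sqrt(2 * RSS / N)

for degree in (1, 3, 9):
    Phi = design_matrix(x_tr, degree)
    w_hat, *_ = np.linalg.lstsq(Phi, t_tr, rcond=None)   # OLS fit
    print(degree, rms_error(w_hat, x_tr, t_tr), rms_error(w_hat, x_te, t_te))
```

Low degrees under-fit both sets, while the highest degree drives the training error toward zero and the test error up, mirroring Figure 3.3.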

Figure 3.4: Table of the coefficients w* for polynomials of various orders.

For a given model complexity, the over-fitting problem becomes less severe as the size of the data set increases. In other words, the larger the data set, the more complex (and flexible) the model we can afford to fit to the data.

3.5.2 Ridge regression

Regularization involves adding a penalty term to the error function to discourage the coefficients from reaching large values and thus make the fitted curve smoother.

$$L(w) = L_D(w) + \lambda L_W(w)$$

where $L_D(w)$ is the empirical loss on the data (e.g. RSS) and $L_W(w)$ is a measure of the size of the parameters (the model complexity).

By taking $L_W(w) = \frac{1}{2}\|w\|_2^2 = \frac{1}{2} w^T w$ we obtain ridge regression (also called weight decay):

$$L(w) = \frac{1}{2}\sum_{i=1}^{N}\big(t_i - w^T \phi(x_i)\big)^2 + \frac{\lambda}{2}\|w\|_2^2$$

Note that if there are a lot of samples, the regularization is useless and it may even worsen the performance of the model.

The loss function is still quadratic in w, which means that we are able to find a unique closed-form solution:
$$\hat{w}_{ridge} = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t$$

As $\lambda$ increases, the variance decreases and the bias increases; as $\lambda$ decreases, the variance increases and the bias decreases.
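
A minimal numpy sketch of this closed-form ridge solution (my own illustration: the design matrix, targets and $\lambda$ values are arbitrary assumptions):

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Closed-form ridge solution: w = (lam*I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    A = lam * np.eye(M) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)    # solve, rather than forming the inverse

# Hypothetical usage with random features and targets.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 5))
t = rng.normal(size=50)
print(ridge_fit(Phi, t, lam=0.0))    # lam = 0 recovers ordinary least squares
print(ridge_fit(Phi, t, lam=10.0))   # a larger lam shrinks the weights
```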


The matrix $\lambda I + \Phi^T \Phi$ is always non-singular: since $\Phi^T \Phi$ is positive semi-definite (eigenvalues $\geq 0$), adding $\lambda I$ with $\lambda > 0$ makes all of its eigenvalues at least $\lambda$, hence strictly positive.
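
A quick numerical check of this statement, on a randomly generated design matrix (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 4))
lam = 0.5

eig_gram = np.linalg.eigvalsh(Phi.T @ Phi)                    # all >= 0 (positive semi-definite)
eig_reg = np.linalg.eigvalsh(lam * np.eye(4) + Phi.T @ Phi)   # all >= lam > 0
print(eig_gram.min(), eig_reg.min())
```
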
Another popular regularization method is the lasso:
$$L(w) = \frac{1}{2}\sum_{i=1}^{N}\big(t_i - w^T \phi(x_i)\big)^2 + \frac{\lambda}{2}\|w\|_1$$
where $\|w\|_1 = \sum_{j=1}^{M} |w_j|$.

Lasso can drive coefficients exactly to zero because its constraint region is a rhomboid with vertices on the axes, so the contact with the regions of constant RSS (whose center is $w_{OLS}$) can happen on an axis, and in that case the corresponding weight is 0. The same does not happen with ridge, whose constraint region is circular: the contact with the regions of constant RSS does not happen on an axis, so the corresponding weight is not zero, but it approaches 0 as the regions of constant RSS expand.

Differently from ridge, the lasso is nonlinear in $t_i$ and no closed-form solution exists (quadratic programming could be used).

Nonetheless, the lasso has the property that if $\lambda$ is sufficiently large, some of the coefficients are driven to zero, leading to a sparse model. For this reason it can also be used for feature selection, excluding the features whose coefficients are equal to zero.

Generally, lasso performs better in a setting where a relatively small number of predictors have substantial coefficients and the remaining ones have coefficients close to 0. Ridge performs better when the response is a function of many predictors, all with coefficients of roughly equal size. Since these settings are not known a priori, we use cross-validation to choose between the two.
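
A hedged scikit-learn sketch of such a comparison (my own example: the sparse data-generating process and the grid of regularization strengths are assumptions, and scikit-learn calls the regularization coefficient `alpha` rather than $\lambda$):

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(0)

# Hypothetical sparse setting: only 3 of 20 predictors actually matter.
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:3] = [4.0, -3.0, 2.0]
y = X @ true_w + rng.normal(0, 1.0, 200)

alphas = np.logspace(-3, 2, 30)
lasso = LassoCV(alphas=alphas, cv=5).fit(X, y)   # picks alpha by 5-fold CV
ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)

print("lasso: chosen alpha =", lasso.alpha_,
      "zero coefficients =", int(np.sum(lasso.coef_ == 0)))
print("ridge: chosen alpha =", ridge.alpha_,
      "zero coefficients =", int(np.sum(ridge.coef_ == 0)))
```

In this sparse setting the lasso sets many coefficients exactly to zero, while ridge only shrinks them.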

3.6 Bayesian Linear Regression

3.6.1 Bayesian approach

In the previous examples of linear regression we have viewed probabilities in terms of the frequencies of random, repeatable events (the frequentist approach). In some cases it is not possible to repeat an event multiple times in order to define a notion of probability. In such circumstances, the solution is to quantify our expression of uncertainty and make precise revisions of the uncertainty after acquiring new evidence. This is the Bayesian approach, which can be divided into steps:

  1. Formulate the knowledge about the world in a probabilistic way:
    • Define the model that expresses our knowledge quantitatively.
    • The model will have some unknown parameters.
    • Capture our assumptions about unknown parameters by specifying the prior distribution p(w) over those parameters before seeing the data.
  2. Observe the data, whose effect is expressed through the conditional probability p(D|w). It can be viewed as a function of the parameter vector (likelihood function).
  3. Compute the posterior probability distribution p(w|D) for the parameters, given the observed data. It expresses the uncertainty in w after D is observed.
  4. Use the posterior distribution to:
    • Make predictions by averaging over the posterior distribution.
    • Examine/Account for uncertainty in the parameter values.
    • Make decisions by minimizing the expected posterior loss.

Note that this new approach is not affected by the problem of over-fitting.

3.6.2 Posterior Distribution

The posterior distribution for the model parameters can be found by combining the prior with the likelihood for the parameters given the data. This is accomplished by using Bayes' rule:
$$p(w|D) = \frac{p(D|w)\,p(w)}{P(D)}$$
where $P(D)$ is the marginal likelihood (normalizing constant):
$$P(D) = \int p(D|w)\,p(w)\,dw$$
Bayes' theorem can be expressed in words in multiple ways:
$$P(\text{parameters}\mid\text{data}) = \frac{P(\text{data}\mid\text{parameters})\,P(\text{parameters})}{P(\text{data})}$$
or
$$\text{posterior} \propto \text{likelihood} \cdot \text{prior}$$
where all of these quantities are viewed as functions of w. The aim is to obtain the most probable value of w given the data (maximum a posteriori, or MAP), which is the mode of the posterior. If new data become available, it is possible to use the current posterior as the prior to compute a new posterior. This sequential approach to learning depends only on the assumption of i.i.d. (independent and identically distributed) data. It can be useful in real-time learning scenarios or with large datasets, because it does not require the whole dataset to be stored or loaded into memory.
For the sequential approach it is important that the prior and the posterior have the same form of distribution. This leads to the concept of conjugate priors: for a given probability distribution, we can seek a prior that is conjugate to the likelihood function, so that the posterior has the same distributional form as the prior. For example, the conjugate prior of a Gaussian likelihood is a Gaussian, and the conjugate prior of a Bernoulli likelihood is a Beta (which gives another Beta as posterior):
$$x_i \mid \theta \sim \text{Bernoulli}(\theta), \qquad \theta \sim \text{Beta}(\alpha, \beta)$$
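
As a small numeric illustration of this conjugacy (the prior parameters and the observations are my own assumptions): a Beta prior combined with Bernoulli observations yields a Beta posterior whose parameters are updated by simply counting ones and zeros, which also shows the sequential prior-to-posterior updating described above.

```python
# Beta-Bernoulli conjugate update: posterior is Beta(alpha + #ones, beta + #zeros).
alpha, beta = 2.0, 2.0           # hypothetical prior Beta(2, 2) over theta
data = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical Bernoulli observations

for x in data:                   # sequential updating: old posterior becomes new prior
    alpha += x
    beta += 1 - x

posterior_mean = alpha / (alpha + beta)
print(f"posterior: Beta({alpha}, {beta}), mean = {posterior_mean:.3f}")
```
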
3.6.3 Predictive Distribution

The prediction for a new data point $x^*$, given the training dataset $D$, can be done by integrating over the posterior distribution:
$$p(x^*|D) = \int p(x^*|w, D)\,p(w|D)\,dw = \mathbb{E}_{w|D}\big[p(x^*|w, D)\big]$$
using the identity $p(x|z) = \int p(x, y|z)\,dy = \int p(x|y, z)\,p(y|z)\,dy$. This is sometimes called the predictive distribution. Note that computing the predictive distribution requires knowledge of the posterior distribution, which is usually intractable.

3.6.4 Bayesian Linear Regression

In the Bayesian approach the parameters of the model are considered as drawn from some distribution.

Assuming a Gaussian likelihood model, the conjugate prior is Gaussian too:
$$p(w) = \mathcal{N}(w \mid w_0, S_0)$$
Given the data $D$, the posterior is still Gaussian:
$$p(w \mid t, \Phi, \sigma^2) \propto \mathcal{N}(w \mid w_0, S_0)\,\mathcal{N}(t \mid \Phi w, \sigma^2 I_N) = \mathcal{N}(w \mid w_N, S_N)$$

where
$$w_N = S_N\Big(\underbrace{S_0^{-1} w_0}_{\text{from prior}} + \underbrace{\frac{\Phi^T t}{\sigma^2}}_{\text{from data}}\Big), \qquad S_N^{-1} = S_0^{-1} + \frac{\Phi^T \Phi}{\sigma^2}$$
When we have a non-informative prior, the Gaussian solution coincides with the OLS one. In a Gaussian distribution the mode coincides with the mean; it follows that $w_N$ is the MAP estimator.
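
A minimal numpy sketch of these update equations (my own illustration: the prior parameters, the noise variance $\sigma^2$ and the design matrix are assumed):

```python
import numpy as np

def bayesian_posterior(Phi, t, w0, S0, sigma2):
    """Gaussian posterior N(w | w_N, S_N) for Bayesian linear regression."""
    S_N_inv = np.linalg.inv(S0) + (Phi.T @ Phi) / sigma2
    S_N = np.linalg.inv(S_N_inv)
    w_N = S_N @ (np.linalg.inv(S0) @ w0 + (Phi.T @ t) / sigma2)
    return w_N, S_N

# Hypothetical data and an isotropic prior centred at zero.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(30, 3))
t = Phi @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.3, 30)
w_N, S_N = bayesian_posterior(Phi, t, w0=np.zeros(3),
                              S0=10.0 * np.eye(3), sigma2=0.3**2)
print(w_N)   # posterior mean = MAP estimate
```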

In many cases we may have little idea of what form the distribution should take. We may then seek a form of prior distribution, called a noninformative prior, which is intended to have as little influence on the posterior distribution as possible. In this case $S_0 \to \infty$ (i.e. $S_0^{-1} \to 0$), and so
$$S_N^{-1} = \frac{\Phi^T \Phi}{\sigma^2}, \qquad w_N = S_N \frac{\Phi^T t}{\sigma^2} = \sigma^2 (\Phi^T \Phi)^{-1} \frac{\Phi^T t}{\sigma^2} = (\Phi^T \Phi)^{-1} \Phi^T t$$
so $w_N$ reduces to the ML (OLS) estimator.

If $w_0 = 0$ and $S_0 = \tau^2 I$, then $w_N$ reduces to the ridge estimate, with $\lambda = \sigma^2 / \tau^2$.

3.6.5 Posterior Predictive Distribution

We are interested in the posterior predictive distribution, which is the distribution over the output variable obtained by taking into account all the models, each one weighted by its probability density:
$$p(t \mid x, D, \sigma^2) = \int \mathcal{N}(t \mid w^T \phi(x), \sigma^2)\,\mathcal{N}(w \mid w_N, S_N)\,dw = \mathcal{N}(t \mid w_N^T \phi(x), \sigma_N^2(x))$$
where
$$\sigma_N^2(x) = \underbrace{\sigma^2}_{\text{noise in the target values}} + \underbrace{\phi(x)^T S_N\, \phi(x)}_{\text{uncertainty associated with the parameter values}}$$
In the limit, as $N \to \infty$, the second term goes to zero and the variance of the predictive distribution arises only from the additive noise governed by the parameter $\sigma^2$. The predictive uncertainty depends on $x$: it is smallest in the neighborhood of the data points, and the level of uncertainty decreases as more data points are observed.
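
Continuing the previous sketch (same assumed quantities, so `w_N`, `S_N` and $\sigma^2$ come from the posterior example above), the predictive mean and variance at a new input x can be computed as follows:

```python
import numpy as np

def predictive(phi_x, w_N, S_N, sigma2):
    """Posterior predictive N(t | w_N^T phi(x), sigma_N^2(x))."""
    mean = w_N @ phi_x
    var = sigma2 + phi_x @ S_N @ phi_x   # noise term + parameter-uncertainty term
    return mean, var

# Hypothetical usage, reusing w_N and S_N from the previous sketch:
# phi_x = np.array([1.0, 0.2, -0.5])
# print(predictive(phi_x, w_N, S_N, sigma2=0.3**2))
```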

Figure 3.5: Example of the predictive distribution. The shaded area spans one standard deviation either side of the mean (red line); the pink area will never go to zero, even asymptotically, because of the intrinsic noise in the samples.

3.7 Notable indices

- Residual Sum of Squares: $RSS(w) = \sum_i (\hat{t}_i - t_i)^2$, representing how much the predictions differ from the observed targets.
I contenuti di questa pagina costituiscono rielaborazioni personali del Publisher bonadiamatilde di informazioni apprese con la frequenza delle lezioni di Machine learning e studio autonomo di eventuali libri di riferimento in preparazione dell'esame finale o della tesi. Non devono intendersi come materiale ufficiale dell'università Politecnico di Milano o del prof Restelli Marcello.