
CROSS VALIDATION:

• hold-out method: part of the data available for training is held out and used as a test set, since these points are actual unseen data. Unfortunately, this method only gives a single estimate of the model's accuracy and "throws away" a lot of information that would be useful for training.

• k-fold CV: divide the dataset into k parts; choose one of the k parts as the test set, use the other k-1 parts to train the model, then test on the held-out part; repeat k times, once per part. This yields k estimates of the prediction error e. A good value of k is between 5 and 10. Bias is estimated by the average of the e's (low is good); variance by their standard deviation (high is bad). A minimal sketch is given after this list.
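
A minimal NumPy sketch of this procedure (the least-squares model and the MSE error metric in the usage example are illustrative assumptions, not part of the notes):

```python
import numpy as np

def k_fold_cv(X, y, fit, predict, k=5, seed=0):
    """Return (mean, std, all k held-out error estimates)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        e = np.mean((predict(model, X[test]) - y[test]) ** 2)  # MSE on the held-out fold
        errors.append(e)
    errors = np.array(errors)
    # the notes use the mean of the k errors as a bias proxy and their std as a variance proxy
    return errors.mean(), errors.std(), errors

# Example usage with an ordinary least-squares model (illustrative):
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda w, X: X @ w
```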

In linear regression, a GENERATIVE MODEL for the output is assumed: the response is a linear combination of (basis functions of) the input plus additive noise.

Linear regression model: the basis functions are the input features themselves.

Other regression models are obtained by changing the basis functions:

• polynomial models

• sigmoidal models

• Gaussian basis function models
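
For reference, the standard forms of these models (assuming the usual linear-in-the-parameters setup with additive Gaussian noise; the exact parameterisation in the course slides may differ):

```latex
% Generative model: linear in the parameters, additive Gaussian noise
y = w^\top \phi(x) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)

% Linear model:            \phi(x) = (1, x_1, \dots, x_d)
% Polynomial model (1-D):  \phi(x) = (1, x, x^2, \dots, x^m)
% Sigmoidal basis:         \phi_j(x) = \sigma\!\big((x - \mu_j)/s\big)
% Gaussian basis:          \phi_j(x) = \exp\!\big(-(x - \mu_j)^2 / (2 s^2)\big)
```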

A common way to estimate the parameters is to compute the MLE (Maximum Likelihood Estimate), which is the same as maximizing the log-likelihood, which is the same as minimising the negative log-likelihood (NLL).

For Gaussian noise the NLL is the sum of a term that depends only on the noise variance and a data-fit term; if the noise variance is known, the MLE is obtained by minimising only the second term.

The RSS (Residual Sum of Squares) is RSS(w) = Σ_i (y_i − w^T φ(x_i))^2.

For a given variance, the MLE is found by minimising the least-squares criterion, i.e. the RSS.

If we solve the least-squares optimisation problem (setting the gradient of the RSS to zero) we obtain the Normal Equations Φ^T Φ w = Φ^T y.

A formal solution of the Normal Equations is w = (Φ^T Φ)^† Φ^T y, using the pseudo-inverse.

In the case Φ is full rank, Φ^T Φ is invertible, thus yielding the LS estimate w = (Φ^T Φ)^(-1) Φ^T y.

This last solution might not be the most efficient if N and/or d are very large.
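
A small sketch contrasting the normal-equations solve with a factorisation-based solver on synthetic data (the sizes and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 20
Phi = rng.normal(size=(N, d))            # design matrix (basis functions of the inputs)
w_true = rng.normal(size=d)
y = Phi @ w_true + 0.1 * rng.normal(size=N)

# Closed form via the Normal Equations: (Phi^T Phi) w = Phi^T y
w_ne = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Numerically preferable for large or ill-conditioned problems: QR/SVD-based solver
w_ls, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(np.allclose(w_ne, w_ls, atol=1e-6))
```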

Gaussian linear regression models can be very sensitive to outliers, since squared errors penalise deviations quadratically. One way to achieve ROBUSTNESS (being less sensitive to outliers) is to choose a noise distribution with heavy tails, so as to assign higher likelihood to outliers without having them perturb the fitted straight line.

One possibility is to use the Laplace distribution.

The robustness arises from the use of |·| instead of (·)².

The ML estimate of w is then found by minimising the SAE (Sum of Absolute Residuals).
The solution of this robust regression problem can be found via Linear Programming, since the absolute values can be replaced by auxiliary slack variables bounded by linear constraints (see the block below).
LP is very fast, very accurate and can deal with huge-scale problems.
Another path in the pursuit of robustness is to minimise the average Huber loss.
The Huber loss penalises small residuals quadratically and large residuals linearly; moreover, it is a smooth and convex function.
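
For reference, writing r_i = y_i − w^T φ(x_i) for the residuals, the standard forms behind this paragraph are (textbook versions; constants may differ from the course slides):

```latex
% Laplace likelihood  ->  minimise the sum of absolute residuals (SAE)
\hat{w}_{\mathrm{SAE}} = \arg\min_w \sum_{i=1}^N |r_i|

% LP reformulation: introduce slacks t_i \ge |r_i|
\min_{w,\,t} \ \sum_{i=1}^N t_i \quad \text{s.t.} \quad -t_i \le y_i - w^\top\phi(x_i) \le t_i

% Huber loss with threshold \delta: quadratic for small residuals, linear for large ones
\ell_\delta(r) = \begin{cases} \tfrac{1}{2}\, r^2 & |r| \le \delta \\[2pt] \delta\big(|r| - \tfrac{\delta}{2}\big) & |r| > \delta \end{cases}
```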
The Gauss-Markov theorem states that the LS estimate is the unbiased (linear) estimator with minimum variance.
There can be biased estimators with a smaller mean squared error: they trade a little bias for a larger reduction in variance —> SHRINKAGE METHODS.
Ridge Regression shrinks the regression coefficients by imposing a penalty on their size; it minimizes the penalized residual sum of squares Σ_i (y_i − w^T φ(x_i))^2 + λ ||w||_2^2.
The ridge regularisation parameter λ is selected via CV and improves the "out of sample" performance (it can be read as controlling the weight of a prior on the coefficients).

λ is the complexity parameter that controls the amount of shrinkage: the larger λ, the greater the shrinkage.

Ridge regression can be rewritten in closed form as w_ridge = (Φ^T Φ + λ I)^(-1) Φ^T y.

Solving a Ridge regression has the same complexity as solving a standard LS regression.
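
A minimal sketch of the ridge solve, showing that it is the same kind of d×d linear system as ordinary least squares (the value of λ passed in would come from CV):

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Solve min_w ||y - Phi w||^2 + lam * ||w||^2 via (Phi^T Phi + lam*I) w = Phi^T y."""
    d = Phi.shape[1]
    # same d x d system as least squares, with lam*I added on the diagonal
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)
```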

Standard solvers for ridge regression add an offset w0 to the problem.

The problem with the offset is equivalent to the problem without the offset, run on centered data (the means of the inputs and of the output are subtracted before fitting).

Normalisation of the data can be addressed by scaling the penalty term by a factor D.

Another shrinkage method is the LASSO (Least Absolute Shrinkage and Selection Operator)

As in the Ridge case, also the LASSO can be computed with an offset w0.

The solution is now nonlinear in y and has no closed-form expression, due to the use of the ℓ1 norm; the ℓ1 norm is the largest convex lower bound of the cardinality function over the unit hypercube.

The LASSO penalty promotes sparsity in the solution; the LASSO problem is a convex QP.

The LASSO also admits a constrained form, with a bound on the ℓ1 norm of the coefficients (see the reference block below).

The elastic-net penalty is a compromise between Ridge and LASSO and has good computational tractability. Group-LASSO can be used when we want to shrink and select the members of a group together.
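
For reference, the standard penalised and constrained LASSO and the elastic-net penalty (textbook forms; the scaling of the penalty in the course slides may differ):

```latex
% LASSO (penalised form)
\min_w \ \tfrac{1}{2}\sum_{i=1}^N \big(y_i - w^\top\phi(x_i)\big)^2 + \lambda \|w\|_1

% LASSO (constrained form)
\min_w \ \sum_{i=1}^N \big(y_i - w^\top\phi(x_i)\big)^2 \quad \text{s.t.} \quad \|w\|_1 \le t

% Elastic-net penalty (compromise between the ridge and LASSO penalties)
\lambda \sum_{j=1}^d \big( \alpha\, w_j^2 + (1-\alpha)\, |w_j| \big)
```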

In the LASSO prediction algorithm a crucial part is the plot of the (cross-validated) MSE against the values of the shrinkage parameter λ.

We adopt the “one-standard-error” rule: we pick the most parsimonious model within one standard error of the minimum.

COORDINATE DESCENT is an optimization algorithm that successively minimizes along coordinate directions to find the minimum of a function. At each iteration, the algorithm determines a coordinate or coordinate block via a coordinate selection rule, then exactly or inexactly minimizes over the corresponding coordinate hyperplane while fixing all other coordinates or coordinate blocks. More generally, block-coordinate descent methods apply to problems where each variable is independently constrained.

The Jacobi method solves all the partial minimisation problems and then updates all the blocks simultaneously. Convergence is not guaranteed.

Gauss-Seidel method (block coordinate minimisation method, BCM) works similarly to Jacobi, but the variable blocks are updated sequentially. The sequential block-coordinate descent method may fail to converge for non-smooth objectives. However, if f0 is a composite function that can be written as the sum of a convex and differentiable function plus a separable convex (but possibly non-smooth) term, convergence of sequential coordinate descent methods is guaranteed. If z is a minimum point of f, it is also a coordinate-wise minimum point; the converse is not true in general. If a BCM converges to a coordinate-wise minimum point, then it also converges to a minimum point.

The LASSO can be expressed as a convex QP. A classical algorithm for solving the LASSO is LAR (Least Angle Regression), but recently CD (Coordinate Descent) methods have also emerged as efficient alternatives. When applying coordinate descent to the LASSO, the i-th coordinate minimisation problem is a univariate problem; its objective is piecewise quadratic (a parabola for each sign of the coordinate), and the univariate LASSO solution can be expressed in closed form using the soft-thresholding function.

The LASSO is solved by cyclic coordinate descent: optimise each variable separately while holding all the others fixed. This is done on a decreasing grid of values of λ using WARM START: we start from the largest λ, which gives all null coefficients, and then move down the grid, initialising each problem with the previous solution. Coordinate descent can achieve impressive speed-ups over all competitors. ACTIVE SET CONVERGENCE: after a full cycle through the d variables we can restrict further iterations to the active set (the non-zero variables) until convergence, plus one more cycle through all d variables to check whether the active set has changed. This helps when d >> N.
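
A minimal sketch of cyclic coordinate descent with soft-thresholding for the LASSO (plain NumPy; the 1/2 scaling of the quadratic term and the fixed iteration count are illustrative choices, and the warm-start loop over the λ grid and the active-set restriction described above are omitted for brevity):

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S_t(z) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iters=200):
    """Cyclic coordinate descent for min_w 0.5*||y - X w||^2 + lam*||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    r = y - X @ w                      # running residual
    col_sq = (X ** 2).sum(axis=0)      # ||x_j||^2 for each column
    for _ in range(n_iters):
        for j in range(d):
            r += X[:, j] * w[j]        # remove coordinate j's contribution
            rho = X[:, j] @ r          # correlation of column j with the partial residual
            w[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * w[j]        # put the updated contribution back
    return w
```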

We can always divide the input space into a collection of regions labelled according to the classification; in an important class of procedures these decision boundaries are linear —> LINEAR CLASSIFIERS. A direct approach is to explicitly model a boundary that separates the classes: in a two-class problem with an n-dimensional input space the boundary is a hyperplane.

Explicit hyperplane separation methods:

  • Rosenblatt's perceptron
  • hyperplane chosen to maximize the distance of the closest points from either class

Linear classification rule: assign x to one class if w^T x + w0 > 0 and to the other class if w^T x + w0 < 0 (i.e. classify according to the sign of the linear function).

In the LOGISTIC REGRESSION problem, if K=2, the model is especially simple, since there is only a single linear function.

The logistic regression model amounts to finding the values of the parameters that minimize the loss function, where the outcome y can be 1 (classification in class 1) or 0 (classification in class 2).

The above is equivalent to finding (w, w0) that maximize the log-likelihood Σ_i [ y_i log μ_i + (1 − y_i) log(1 − μ_i) ], where μ_i = σ(w^T x_i + w0) is the predicted probability of class 1.

Since the Hessian is negative semi-definite, the log-likelihood is a concave function.

Alternatively, we can use formulations where y can be 1 (classification in class 1) or −1 (classification in class 2). The log-likelihood then becomes −Σ_i log(1 + exp(−y_i (w^T x_i + w0))).

Training the LR amounts to minimizing the corresponding loss function, which penalizes misclassified points.
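
A minimal gradient-descent sketch for this training problem, using the y ∈ {−1, +1} formulation above (the learning rate and iteration count are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_gd(X, y, lr=0.1, n_iters=1000):
    """Gradient descent on the logistic loss mean(log(1 + exp(-y*(X w + w0)))),
    with labels y in {-1, +1}."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(n_iters):
        margin = y * (X @ w + w0)
        g = -y * sigmoid(-margin)      # d(loss_i)/d f(x_i)
        w -= lr * (X.T @ g) / n
        w0 -= lr * g.mean()
    return w, w0
```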

CONFUSION MATRIX: is a 2x2 matrix containing the numbers of TP, FP, FN, TN resulting from a classifier's validation experiment.

Naive Bayes (NB):

  • assumes conditional independence of the features, given the class
  • posterior is computed by means of Bayes' rule
  • generative model: assumes a stochastic model (described by a probability distribution) for the way the features are generated given the class, and then uses this model to evaluate the posterior probability via Bayes' rule.
  • mostly used with discrete distributions on features —> Bernoulli or Multinoulli
  • provides linear discrimination surface
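
A minimal sketch of a Bernoulli Naive Bayes classifier along these lines (the Laplace-smoothing parameter `alpha` and the 0/1 feature encoding are illustrative assumptions):

```python
import numpy as np

def bernoulli_nb_fit(X, y, alpha=1.0):
    """Fit Bernoulli Naive Bayes on binary features X (0/1) and labels y."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    # theta[c, j] = P(x_j = 1 | class c), with Laplace smoothing
    theta = np.array([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                      for c in classes])
    return classes, priors, theta

def bernoulli_nb_predict(X, classes, priors, theta):
    """Pick the class with the largest log-posterior (Bayes' rule, up to a constant)."""
    log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
    return classes[np.argmax(np.log(priors) + log_lik, axis=1)]
```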

Logistic Regression (LR):

  • discriminative model: directly models the posterior, learning the input-output (I-O) mapping by minimising a suitable loss function.
  • posterior is learned directly from data samples
  • provides linear discrimination surface

In regression, the average prediction error is a good way to evaluate the performance of an algorithm.

In classification, the counterpart of the average prediction error is the ACCURACY, i.e. the number of correct predictions over the total number of predictions made.

Errors can be:

  • type 1: False positive (FP)
  • type 2: False Negative (FN)

Precision: TP / (TP + FP)

Sensitivity (recall): TP / (TP + FN)

Specificity: TN / (TN + FP)

In addition, the FPR (False Positive Rate) used below is FP / (FP + TN) = 1 − specificity.

A classifier's performance can be evaluated via:

  • ROC (Receiver Operating Characteristic): parametric curve plotting the TPR (vertical axis) vs the FPR (horizontal axis); given a threshold T, an instance is classified as positive if its score X > T and as negative otherwise.

The AUC (Area Under Curve) is a good evaluation metric:

  • AUC close to 1 —> good separability between the classes
  • AUC = 0.5 —> the classifier can't distinguish between positive and negative, NOT GOOD

For T=1 —> TPR=0, FPR=0

For T=0 —> TPR=1, FPR=1

A good value for the threshold T is the point where the ROC curve has a knee: 0.5 could be good.
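
A small sketch computing these metrics and the ROC points by sweeping the threshold T (plain NumPy; the {0, 1} label encoding and the threshold grid are illustrative assumptions):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived metrics, for labels encoded in {0, 1}."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    precision   = tp / (tp + fp) if tp + fp else 0.0
    recall      = tp / (tp + fn) if tp + fn else 0.0   # sensitivity, TPR
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy    = (tp + tn) / len(y_true)
    return precision, recall, specificity, accuracy

def roc_points(scores, y_true, thresholds):
    """(FPR, TPR) pairs obtained by sweeping the decision threshold T."""
    pts = []
    for T in thresholds:
        y_pred = (scores > T).astype(int)
        _, tpr, spec, _ = binary_metrics(y_true, y_pred)
        pts.append((1 - spec, tpr))   # FPR = 1 - specificity
    return pts
```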

Given a linear function f(x) = w^T x + w0, a hyperplane H is the locus of points for which f(x) = 0.

Classifiers that compute a linear combination of the input features and return its sign are called PERCEPTRONS: the perceptron algorithm tries to find a separating hyperplane H by minimising the distance of the misclassified points from the decision boundary.

For misclassified points, the cost contribution is proportional to their distance from the decision boundary.
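
A minimal sketch of the perceptron update rule described above (labels in {−1, +1}; the learning rate and epoch count are illustrative, and no convergence check is done):

```python
import numpy as np

def perceptron(X, y, lr=1.0, n_epochs=100):
    """Rosenblatt's perceptron: update (w, w0) only on misclassified points."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(n_epochs):
        for i in range(n):
            if y[i] * (X[i] @ w + w0) <= 0:    # misclassified (or on the boundary)
                w += lr * y[i] * X[i]
                w0 += lr * y[i]
    return w, w0
```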

MAXIMUM MARGIN CLASSIFIER: among the separating hyperplanes, the one that maximizes the margin between the two classes —> unique solution + better classification performance.

For classes that are not linearly separable the above QP formulation becomes infeasible!

We may allow for some points to be on the wrong side of the margin —> relax constraints.

We can penalise misclassification in the objective, using a trade-off parameter C.

The general linear Support Vector Machine (SVM) problem can then be rewritten as a QP with slack variables; a standard form is sketched below.
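
A standard way to write the soft-margin linear SVM (the course formula was not reproduced in these notes, so this is the usual textbook form):

```latex
\min_{w,\,w_0,\,\xi} \ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^N \xi_i
\quad \text{s.t.} \quad y_i\,(w^\top x_i + w_0) \ge 1 - \xi_i, \quad \xi_i \ge 0

% equivalently, unconstrained hinge-loss form:
\min_{w,\,w_0} \ \sum_{i=1}^N \max\!\big(0,\; 1 - y_i (w^\top x_i + w_0)\big) + \lambda \|w\|^2
```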

Perceptron loss and logistic regression loss are similar for large absolute values of the margin y·f(x); they differ mainly near the decision boundary.
