MACHINE LEARNING 2018/2019
DETAILED PROGRAM
Prof. F. Vandin
Last Update: January 20th, 2019
This document describes, for each topic, the level of detail required for the material presented during the lectures (see the slides). The level of detail refers to what was presented during the lectures: "all details" means that all the details presented during the lectures (see the slides) are required, while "main idea" means that only an understanding of the concepts presented is required, and the details (e.g., details of propositions, proofs, formulas) are not. Note that the background material (probability, linear algebra) is assumed to be known at the level of detail used during the presentation of the topics below. For some propositions, the corresponding proposition in the book is referenced for clarity.
1. Learning Model
- All details presented in class required, including: definitions, propositions’ statements, proof of Corollary 2.3 [UML], definition of agnostic PAC learnability for general loss functions
2. Uniform Convergence
- Lemma 4.2 [UML]: statement and proof with all details;
- Definition of uniform convergence property: main idea (details of definition not required)
- Corollary 4.6 [UML]: only main idea and bound on the number of samples
3. Basics of Statistics
- definition confidence interval, definition rejection region: details
- hypothesis testing: main idea; hypothesis testing rejection rule in detail
- everything else: main idea
4. Linear Models
- linear predictors/models: definitions with all details
- linear classification, perceptron: definitions and algorithm in detail
- proposition on perceptron convergence: only main idea (as in slide “Perceptron: Notes”)
- linear regression: definitions, matrix form, derivation best predictor, use of generalized inverse in detail (derivation generalized inverse: not required)
- logistic regression: definition, loss function, equivalence MLE solution and ERM solution in detail
5. Bias-Complexity
- No Free Lunch (NFL) theorem, NFL and prior knowledge: only main idea
- approximation error + estimation error, complexity and error decomposition: all details
6. VC-dimension
- Restrictions, shattering, VC-dimension: definitions in detail
- Fundamental Theorems of Statistical Learning and bound on generalization: in detail
7. Model Selection and Validation
- validation: main idea
- validation for model selection: main idea
- model-selection curve, train-validation-test split, k-fold cross validation: all details
- what if learning fails: main idea
- model selection with SRM: main idea
8. Regularization and Feature Selection
- Regularized Loss Minimization, Tikhonov Regularization: all details
- Ridge Regression, derivation of optimal solution: all details
- Stability, stability rules do not overfit, Tikhonov Regularization as a stabilizer: main idea
- Fitting-stability tradeoff definition and considerations: all details; guarantee results for the choice of lambda: only idea
- l1 regularization, LASSO: all details
- subset selection, forward selection, backward selection, without and with validation data: all details; pseudocode: main idea and structure required (details not required)
10. SVM
- hard-SVM optimization problem and quadratic formulation: all details (no proof of equivalence between the two formulations)
- soft-SVM optimization problem: all details
- gradient descent (GD): all details; GD guarantees: main idea
- stochastic gradient descent (SGD): main idea
- SGD for solving soft-SVM (algorithm): all details
- Hard-SVM dual formulation: main idea; final optimization problem: all details (no derivation required)
- Definition of Kernel and commonly used kernels: all details
- SVM for regression: all details only for optimization problem and support vectors definition
11. Neural Networks
- Neuron, activation function, network architecture, point of view of one node, hypothesis set: all details
- Expressiveness: main idea of each statement
- Sample complexity, runtime of learning NNs: main idea
- Forward propagation algorithm: all details
- Backpropagation algorithm: main idea (pseudocode: only main structure)
- Regularized NNs: main idea
PAC Learning
Definition
A hypothesis class H is said to be PAC learnable if there exist a function m_H : (0,1)² → ℕ and a learning algorithm such that for every
- δ ∈ (0,1) and ε ∈ (0,1),
- distribution D over X,
- labeling function F : X → {0,1},
if the realizability assumption holds w.r.t. H, D, F, then, when running the learning algorithm on a number m ≥ m_H(ε, δ) of i.i.d. examples (generated according to D and labeled according to F), it produces a hypothesis h such that
L_{D,F}(h) ≤ ε with probability ≥ 1 − δ
Corollary: every finite hypothesis class H (|H| < ∞) is PAC learnable with sample complexity
m_H(ε, δ) ≤ ⌈ log(|H|/δ) / ε ⌉
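As a quick numeric check of this bound, a minimal Python sketch (the helper sample_complexity_bound and the example numbers are ours, purely illustrative):

import math

def sample_complexity_bound(h_size, eps, delta):
    # m_H(eps, delta) <= ceil( log(|H| / delta) / eps )  for a finite class, realizable case
    return math.ceil(math.log(h_size / delta) / eps)

# e.g. |H| = 10**6, eps = 0.1, delta = 0.05
print(sample_complexity_bound(10**6, 0.1, 0.05))  # -> 169 examples suffice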
The realizability assumption is too strong in many applications. Given x, the label y is obtained according to a conditional probability P[y | x].
Bayes Optimal Predictor
Given a probability distribution D over X × {0,1}, the best classifier is
F_D(x) = 1 if P[y = 1 | x] ≥ 1/2, and F_D(x) = 0 otherwise
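A minimal Python sketch of this rule on a toy discrete domain (the values of η(x) = P[y = 1 | x] below are made up for illustration):

# toy conditional probabilities eta(x) = P[y = 1 | x] on a small discrete domain (illustrative)
eta = {0: 0.9, 1: 0.3, 2: 0.5, 3: 0.1}

def bayes_predictor(x):
    # F_D(x) = 1 iff P[y = 1 | x] >= 1/2
    return 1 if eta[x] >= 0.5 else 0

for x, p in eta.items():
    # on each x the Bayes predictor errs with probability min(eta(x), 1 - eta(x))
    print(x, bayes_predictor(x), min(p, 1 - p))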
Hoeffding's Inequality
Let θ_1, ..., θ_m be i.i.d. random variables with E[θ_i] = μ and P[a ≤ θ_i ≤ b] = 1 for every i. Then, for every ε > 0:
P[ | (1/m) Σ_{i=1}^m θ_i − μ | > ε ] ≤ 2 exp( −2mε² / (b − a)² )
Uniform Convergence | 6 November
Proof. [Corollary 4.6]
Fix ε, δ ∈ (0,1). We need to find a sample size m such that, for any D, with probability ≥ 1 − δ over the choice of S = {z_1, ..., z_m}, with z_i = (x_i, y_i) sampled i.i.d. from D, we have:
∀h ∈ H, | L_S(h) − L_D(h) | ≤ ε
That is:
P^m( { S : ∀h ∈ H, | L_S(h) − L_D(h) | ≤ ε } ) ≥ 1 − δ
Equivalently:
P^m( { S : ∃h ∈ H, | L_S(h) − L_D(h) | > ε } ) < δ
We have:
{ S : ∃h ∈ H, | L_S(h) − L_D(h) | > ε } = ⋃_{h ∈ H} { S : | L_S(h) − L_D(h) | > ε }
so, by the union bound:
P^m( { S : ∃h ∈ H, | L_S(h) − L_D(h) | > ε } ) ≤ Σ_{h ∈ H} P^m( { S : | L_S(h) − L_D(h) | > ε } )
Recall:
- L_D(h) = E_{z∼D}[ ℓ(h, z) ]
- L_S(h) = (1/m) Σ_{i=1}^m ℓ(h, z_i)
therefore E[ L_S(h) ] = (1/m) Σ_{i=1}^m E[ ℓ(h, z_i) ] = L_D(h)
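The proof then proceeds, as in [UML], by applying Hoeffding's inequality to θ_i = ℓ(h, z_i) and the union bound over h ∈ H, which yields the sample size bound of Corollary 4.6. A minimal Monte Carlo sketch of the underlying concentration of L_S(h) around L_D(h) for one fixed h (toy Bernoulli 0-1 losses, all numbers illustrative):

import random

# loss of a fixed h on a random example is Bernoulli with mean L_D(h) = 0.3 (toy assumption)
L_D, m, eps, trials = 0.3, 500, 0.05, 10000
random.seed(0)

violations = 0
for _ in range(trials):
    L_S = sum(random.random() < L_D for _ in range(m)) / m   # empirical risk on a sample of size m
    violations += abs(L_S - L_D) > eps
# observed frequency of |L_S - L_D| > eps; well below Hoeffding's bound 2*exp(-2*m*eps**2) ≈ 0.16
print(violations / trials)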
Linear Regression
(X = ℝ^d, Y = ℝ, squared loss)
H_reg = L_d
Empirical risk function: Mean Squared Error
L_S(h) = (1/m) Σ_{i=1}^m ( h(x_i) − y_i )²
How do we find a good hypothesis (via ERM)? ⇒ Least Squares algorithm
Best hypothesis:
argmin_w L_S(h_w)
Since 1/m is a constant, we can drop it from the expression of L_S(h_w).
We then look for the w minimizing the Residual Sum of Squares:
argmin_w Σ_{i=1}^m ( ⟨w, x_i⟩ − y_i )²
Let X be the m × d design matrix whose rows are x_1^T, ..., x_m^T, and let y = (y_1, ..., y_m)^T be the vector of labels. We can then rewrite the function to be minimized as
Σ_{i=1}^m ( ⟨w, x_i⟩ − y_i )² = (y − Xw)^T (y − Xw) ≜ RSS(w)
→ argmin_w RSS(w)
∂RSS/∂w = −2 X^T (y − Xw)
Setting ∂RSS/∂w = 0 ⇔ −2 X^T (y − Xw) = 0 ⇒ w = (X^T X)^{−1} X^T y
(assuming X^T X is invertible; otherwise, see the generalized inverse)
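A minimal NumPy sketch of this closed-form solution on synthetic data (all numbers illustrative; np.linalg.pinv covers the non-invertible case via the generalized, i.e. Moore-Penrose, inverse):

import numpy as np

rng = np.random.default_rng(0)
m, d = 100, 3
X = rng.normal(size=(m, d))                  # design matrix, one example per row
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=m)    # labels with a little noise

w_ls = np.linalg.solve(X.T @ X, X.T @ y)     # w = (X^T X)^{-1} X^T y, assuming X^T X invertible
w_pinv = np.linalg.pinv(X) @ y               # same solution via the generalized inverse

print(w_ls)
print(w_pinv)   # both close to w_true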