1 - REVIEW
- Gaussian distribution
- Mean
- Variance
- Standardization
2 - STATISTICAL LEARNING
- Regression function
- MSE decomposition
- Nearest neighbor averaging
- Linear model
- Model accuracy
- Training/Test MSE
- Bias-variance trade-off
- Bias
- Classification problems
- Conditional class probability
- Misclassification error rate
- K-nearest neighbors
2B - REVIEW 2
- Covariance
- Correlation
- Sample moments
- Mean
- Variance
3 - LINEAR REGRESSION
- Estimation with least squares
- Residual
- RSS
- Accuracy of LS
- Standard error
- Confidence interval
- Hypothesis testing
- Regression basics review
- LS criterion
- Matrix formulation
- Confidence intervals
- σ² is known
- General case
- Comparing nested models
- Fisher's F
- RSE
- R²
- Multiple linear regression
- Forward selection
- Backward selection
- Qualitative predictors
- Interactions
4 - CLASSIFICATION
- Using Linear Regression
- Logistic Regression
- Probability
- Logit
- Maximum likelihood
- Confounding
- Case-control sampling
- Multinomial Regression
- Discriminant analysis
- Probability
- LDA with R
- Discriminant score
- Estimated parameters
- QDA with R
- Discriminant score
- Fisher's discriminant plot
- From g(x) to p
- Probabilities
- Types of errors
- ROC plot
- Quadratic DA
- Naive Bayes
5 - RESAMPLING
- Validation set approach
- K-fold cross validation
- CV
- LOOCV
- Classification
- Loss of predictors
- Bootstrap
- Estimating prediction error
6 - MODEL SELECTION
- Linear model selection
- Feature selection
- Subset selection
- Best subset selection
- Stepwise selection
- Forward
- Backward
- Estimating best error
- Cp
- AIC
- BIC
- Adjusted R²
- Validation/CV
- One-standard-error rule
- Shrinkage methods
- Ridge regression
- Lasso
- Dimension reduction methods
- Principal Components Analysis
- Partial Least Squares
7 - NONLINEAR MODELS
- Polynomial Regression
- Step functions
- Piecewise polynomials
- Linear splines
- Cubic splines
- Natural cubic splines
- Knot placement
- Smoothing splines
- Local Regression
- Generalized Additive Models
8 - TREES
- Regression problems
- Tree building
- Recursive binary splitting
- Pruning
- Cost complexity pruning
- Classification problems
- Gini index
- Deviance
- Bagging
- Out-of-bag error estimation
- Boosting
- B
- λ
- X
9 - SUPPORT VECTOR MACHINES
- Maximal margin classifier
- Non-separable data
- Feature expansion
- Kernels
10 - UNSUPERVISED LEARNING
- Principal Component Analysis
- Proportion of Variance Explained
- K-means clustering
Events
Probability is defined on events. An event is a set, so the usual set operations (union, intersection, complement) apply to events.
Probability
- Kolmogorov axioms:
- 0 ≤ P(A) ≤ 1, and P(Ω) = 1 (Ω is the sure event)
- If AB = ∅ (A and B disjoint), then P(A + B) = P(A) + P(B) (A + B denotes the union A ∪ B)
- In general, P(A + B) = P(A) + P(B) - P(AB) (AB denotes the intersection A ∩ B); a numeric sketch follows.
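A quick numeric illustration of the last rule (a made-up fair-die example, not from the notes), enumerating the events directly in R:

```r
# Made-up fair-die example: A = "even outcome", B = "outcome >= 5"
omega <- 1:6
A <- omega %% 2 == 0
B <- omega >= 5

mean(A | B)                        # P(A + B) computed directly
mean(A) + mean(B) - mean(A & B)    # P(A) + P(B) - P(AB), same value: 2/3
```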
Conditional probability
If P(M) ≠ 0, P(A|M) = P(AM) / P(M)
Total probability theorem
M_1, ..., M_n disjoint events with ∪_{i=1}^{n} M_i ⊇ A and P(M_i) ≠ 0 for every i. Then,
P(A) = Σ_{i=1}^{n} P(A|M_i) P(M_i)
Bayes theorem
P(M|A) = P(A|M)P(M) / P(A)
Using the total probability theorem:
P(M_i|A) = P(A|M_i) P(M_i) / [Σ_{j=1}^{n} P(A|M_j) P(M_j)]
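A minimal numeric sketch of the two theorems together (the prior and the test accuracies below are invented purely for illustration):

```r
# Hypothetical diagnostic-test numbers, for illustration only.
# Partition: M1 = "disease", M2 = "no disease"; A = "positive test".
p_M   <- c(0.01, 0.99)    # priors P(M1), P(M2)
p_A_M <- c(0.95, 0.05)    # likelihoods P(A | M1), P(A | M2)

p_A   <- sum(p_A_M * p_M)     # total probability theorem
p_M_A <- p_A_M * p_M / p_A    # Bayes theorem: posteriors P(Mi | A)
p_M_A                         # p_M_A[1] is P(disease | positive test)
```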
Independence
A and B are independent if P(AB) = P(A)P(B) <=> P(A|B) = P(A)
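With the same made-up die, independence can be checked numerically (a sketch, not from the notes):

```r
# Same made-up die: A = "even outcome", B = "outcome <= 2"
omega <- 1:6
A <- omega %% 2 == 0
B <- omega <= 2

mean(A & B)          # P(AB) = 1/6
mean(A) * mean(B)    # P(A) P(B) = 1/2 * 1/3 = 1/6, so A and B are independent
```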
Mean
m_X = m_1 = ∫_{-∞}^{+∞} x f_X(x) dx
It is the barycenter of f_X(x).
Y = g(X) => m_Y = ∫_{-∞}^{+∞} g(x) f_X(x) dx
All the moments can be interpreted as a first-order moment of the corresponding power of X:
m_k = E[X^k] = ∫_{-∞}^{+∞} x^k f_X(x) dx
Y = aX + b => E[Y] = a E[X] + b => E[·] is a linear operator.
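A quick Monte Carlo check of the linearity of E[·] (a sketch; the Gaussian sample and the constants a, b are arbitrary choices):

```r
set.seed(1)
x <- rnorm(1e5, mean = 2, sd = 3)   # any distribution would do; Gaussian chosen arbitrarily
a <- 4; b <- -1

mean(a * x + b)    # empirical E[aX + b]
a * mean(x) + b    # a E[X] + b: the two agree up to sampling noise
```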
Variance
Var[X] = σ_X^2 = ∫_{-∞}^{+∞} (x - m_X)^2 f_X(x) dx
The variance measures how concentrated the pdf is around its mean.
Standard deviation: σ_X = √Var[X]
Var[X] = m_2 - m_1^2
Var[X] = E[(X - E[X])^2]
Y = aX + b => Var[Y] = Var[aX + b] = ... = a^2 Var[X]: the variance is invariant to translation.
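The same kind of simulation illustrates Var[X] = m_2 - m_1^2 and the translation invariance (again with arbitrary constants):

```r
set.seed(1)
x <- rnorm(1e5, mean = 2, sd = 3)   # arbitrary Gaussian sample, true variance 9
a <- 4; b <- -1

mean(x^2) - mean(x)^2   # m2 - m1^2, close to the true variance
var(a * x + b)          # ~ a^2 * Var[X]: the shift b does not change the spread
a^2 * var(x)
```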
Standardization
Y = (X - E[X]) / σ_X => E[Y] = 0, σ_Y = 1 => Y = Z
X ~ N(m_X, σ_X^2), Z ~ N(0, 1)
F_X(x) = F_Z((x - m_X) / σ_X)
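A minimal sketch of standardization in R (the Gaussian parameters and the check point x0 are arbitrary):

```r
set.seed(1)
x <- rnorm(1e4, mean = 5, sd = 2)   # arbitrary Gaussian sample
z <- (x - mean(x)) / sd(x)          # standardization (same as scale(x))
c(mean(z), sd(z))                   # approximately (0, 1)

# F_X(x) = F_Z((x - m_X) / sigma_X), checked at an arbitrary point x0
x0 <- 6
pnorm(x0, mean = 5, sd = 2)
pnorm((x0 - 5) / 2)                 # same value via the standard normal cdf
```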
K-Nearest Neighbors
Example: KNN with K = 3.
We take the neighborhood formed by the K observations nearest to a given point and assign the class with the highest estimated probability, i.e. the majority class among the neighbors.
The final decision regions need not be consistent with the training set: individual training points can end up misclassified by their own neighborhood.
The choice of the number of neighbors K controls the complexity of the model:
- Large K → low flexibility.
- Small K → high flexibility, but the fit might be noisy.
Using test data we can find the optimal K.
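A minimal sketch of KNN classification with the class package (the two Gaussian classes and K = 3 are illustrative choices):

```r
library(class)   # provides knn()
set.seed(1)

# Two illustrative Gaussian classes in 2D
train <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
               matrix(rnorm(100, mean = 2), ncol = 2))
cl    <- factor(rep(c("A", "B"), each = 50))
test  <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
               matrix(rnorm(40, mean = 2), ncol = 2))

pred <- knn(train, test, cl, k = 3)   # majority vote among the 3 nearest neighbors
table(pred, rep(c("A", "B"), each = 20))
```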
Statistics
Sample moments
X_i, i = 1, ..., n i.i.d. (independent and identically distributed).
Problem: estimating E[X_i^k]. Solution: sample moments M_k = (1/n) Σ_{i=1}^{n} X_i^k
Sample mean: M_1 = X̄_n = (1/n) Σ_{i=1}^{n} X_i
Sample variance: S^2 = 1/(n-1) Σ_{i=1}^{n} (X_i - X̄_n)^2
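In R, mean() and var() compute these quantities (var() already uses the n - 1 denominator); a minimal sketch with an arbitrary Gaussian sample:

```r
set.seed(1)
x <- rnorm(50, mean = 3, sd = 2)   # illustrative i.i.d. sample
n <- length(x)

mean(x)                            # sample mean M1 = X-bar_n
sum(x^2) / n                       # second sample moment M2
sum((x - mean(x))^2) / (n - 1)     # sample variance S^2
var(x)                             # same value, built in
```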
Law of large numbers (proof)
X_i, i = 1, ..., n i.i.d. with E[X_i] = m and Var[X_i] = σ^2 < ∞. Then lim_{n→∞} E[(X̄_n - m)^2] = 0 (the mean squared error of the sample mean tends to zero).
Proof sketch: E[X̄_n] = m and, by independence, Var[X̄_n] = σ^2/n, so E[(X̄_n - m)^2] = Var[X̄_n] = σ^2/n → 0.
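A small simulation illustrating the statement (the exponential distribution and the grid of sample sizes are arbitrary choices):

```r
set.seed(1)
m <- 1                                   # true mean of Exp(1)
sizes <- c(10, 100, 1000, 10000)
mse <- sapply(sizes, function(n) {
  xbar <- replicate(500, mean(rexp(n, rate = 1)))
  mean((xbar - m)^2)                     # Monte Carlo estimate of E[(X-bar_n - m)^2]
})
rbind(n = sizes, mse = mse)              # the MSE shrinks roughly like sigma^2 / n
```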
Mean square convergence
Let θ_0 be deterministic. Then lim_{n→∞} E[(Θ̂_n - θ_0)^2] = 0 iff:
- lim_{n→∞} E[Θ̂_n] = θ_0
- lim_{n→∞} Var[Θ̂_n] = 0
(This follows from the decomposition E[(Θ̂_n - θ_0)^2] = (E[Θ̂_n] - θ_0)^2 + Var[Θ̂_n].)
lim_{n→∞} E[(Θ̂_n - θ_0)^2] = 0 (mean square convergence) => lim_{n→∞} P(|Θ̂_n - θ_0| > ε) = 0 for every ε > 0 (convergence in probability).
Central limit theorem
X_i, i = 1, ..., n i.i.d. with E[X_i] = m and Var[X_i] = σ^2 < ∞. Let X̄_n = (1/n) Σ_{i=1}^{n} X_i and S_n = (X̄_n - E[X̄_n]) / √Var[X̄_n]. Then the cdf of S_n converges to that of N(0, 1).
Consequence: asymptotically, X̄_n ~ N(m, σ^2/n). Regardless of the starting distribution of the X_i, the distribution of the sample mean converges to a Gaussian.
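A minimal sketch of the CLT in action: standardized sample means of exponential data (an arbitrary non-Gaussian choice) already look close to N(0, 1) for moderate n:

```r
set.seed(1)
n <- 30
xbar <- replicate(5000, mean(rexp(n, rate = 1)))   # 5000 sample means of Exp(1) data
s_n  <- (xbar - 1) / sqrt(1 / n)                   # standardize: E[X-bar_n] = 1, Var = 1/n

hist(s_n, breaks = 40, freq = FALSE, main = "Standardized sample means")
curve(dnorm(x), add = TRUE)                        # N(0, 1) density for comparison
```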