$$\mathrm{Var}(\hat{w}_{OLS}) = \left(\Phi^T \Phi\right)^{-1} \sigma^2$$
Note that if $\Phi^T \Phi$ is singular, the variance goes to infinity.
The more samples we have, the smaller the variance.
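As a quick numerical check of this formula, here is a small sketch; the design matrix, noise level and "true" weights below are made up for illustration, not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, sigma = 100, 3, 0.5
Phi = rng.normal(size=(N, M))          # toy design matrix (made up for illustration)
w_true = np.array([1.0, -2.0, 0.5])    # arbitrary "true" weights

# Theoretical covariance of the OLS estimator: sigma^2 (Phi^T Phi)^(-1)
var_theory = sigma**2 * np.linalg.inv(Phi.T @ Phi)

# Empirical check: fit OLS on many noisy datasets and look at the spread of w_hat
estimates = []
for _ in range(5000):
    t = Phi @ w_true + sigma * rng.normal(size=N)
    estimates.append(np.linalg.lstsq(Phi, t, rcond=None)[0])
var_empirical = np.cov(np.array(estimates).T)

print(np.round(var_theory, 4))
print(np.round(var_empirical, 4))      # close to the theoretical covariance
```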
Gauss-Markov Theorem: the least squares estimate of $w$ has the smallest variance among all linear unbiased estimates.
It follows that the least squares estimator has the lowest MSE (Mean Squared Error) among all unbiased linear estimators. However, what really matters is minimizing the MSE itself, so we can introduce a small bias in order to further reduce the MSE.
Under-fitting VS over-fitting
Under-fitting: happens with low-order polynomials, where there is too much bias.
Over-fitting: happens with high-order polynomials, where there is too much variance.
One method to avoid over-fitting is increasing the number of samples:
In green we have the true model, in red the approximation.
Note that when the model gets more complex, the parameters become very large, so a small change in the features results in a large change in the output: the model is very unstable. The answer is to change the loss function.
Regularization
As we said above, one way to reduce the MSE is to change the loss function. How? We can add a term that introduces a small bias, in order to keep the parameters small and regularize the loss function:

$$L(w) = L_D(w) + \lambda L_W(w)$$
So:
- $L_D(w)$ is a regular loss function that models the error on the data, for example the RSS.
- $\lambda$ is the hyperparameter that determines how much we want to regularize the function.
- $L_W(w)$ represents the model complexity.

⟹ $\lambda L_W(w)$ is the regularization term!
Ridge Regression
We can take the following $L_W$:

$$L_W(w) = \frac{1}{2}\|w\|_2^2 = \frac{1}{2} w^T w$$

Note: $\|w\|_2$ is the Euclidean norm, $\|w\|_2^2 = w_1^2 + w_2^2 + ... + w_N^2$.
Therefore, the loss function becomes:

$$L(w) = \frac{1}{2}\sum_{i=1}^{N} \left(t_i - w^T \phi(x_i)\right)^2 + \frac{\lambda}{2}\|w\|_2^2$$
This is called Ridge Regression. Looking at $\lambda$ we can deduce that if it is small ($\lambda \to 0$) we will have high variance, since the function is barely regularized; if instead $\lambda \to \infty$ we will have too much bias. Consequently, we have to find the right value for it.
Note that the loss function is still quadratic in $w$, so a closed-form solution exists:

$$\hat{w}_{ridge} = \left(\lambda I + \Phi^T \Phi\right)^{-1} \Phi^T t$$

If $\lambda \to 0$, it reduces to the ordinary least squares solution.
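A minimal sketch of this closed-form solution in numpy, on made-up polynomial-feature data (the data and the value of λ are only illustrative):

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Closed-form ridge solution: w = (lam * I + Phi^T Phi)^(-1) Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Illustrative data: 1D inputs with polynomial features phi(x) = [1, x, x^2, x^3]
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=30)
t = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=30)
Phi = np.vander(x, 4, increasing=True)

w_ridge = ridge_fit(Phi, t, lam=0.1)   # regularized weights (shrunk toward zero)
w_ols = ridge_fit(Phi, t, lam=0.0)     # lam -> 0 recovers ordinary least squares
print(w_ridge, w_ols)
```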
Lasso Approach
With the Lasso approach we can choose the following function:

$$L(w) = \frac{1}{2}\sum_{i=1}^{N} \left(t_i - w^T \phi(x_i)\right)^2 + \frac{\lambda}{2}\|w\|_1$$

where $\|w\|_1 = \sum_{j=1}^{M} |w_j|$ → sum of the absolute values.
✅ PRO: some weights are exactly zero for sufficiently large values of $\lambda$ (so it is possible to discard the corresponding features!). For this reason this approach produces sparse models.
❌ CON: it is nonlinear in the $t_i$ and no closed-form solution exists.
Ridge Regression vs Lasso: in blue, the contours of the unregularized error function; in orange, the constraint region introduced by the penalization.
The optimum value for the parameter vector $w$ is denoted by $w^*$. Looking at these graphs we can see that the Lasso gives a sparse solution, in which $w_1^* = 0$.
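Since no closed-form solution exists, Lasso is solved numerically in practice. A small sketch with scikit-learn on made-up data (here `alpha` plays the role of λ) shows the sparsity effect:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
# Only the first two features actually matter; the other eight are pure noise.
t = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, t)   # alpha plays the role of lambda
ridge = Ridge(alpha=0.1).fit(X, t)

print(np.round(lasso.coef_, 3))  # most coefficients are exactly 0 -> sparse model
print(np.round(ridge.coef_, 3))  # ridge shrinks coefficients but rarely zeroes them
```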
Bayesian Linear Regression
Recap:
- we can use maximum likelihood to set the parameters of a linear regression model
- the model complexity is governed by the number of basis functions, but it also needs to be controlled according to the size of the data set
- another way to control the model complexity is through regularization, although the choice of the parameters of the basis functions is still important
Now, the issue is deciding the appropriate model complexity for a particular problem (it cannot be determined by maximizing the likelihood function, because this always leads to excessively complex models and over-fitting). The answer is a Bayesian treatment of linear regression, which will avoid the over-fitting problem and will lead to automatic methods of determining model complexity using the training data alone.
So we want to formulate our knowledge about the world in a probabilistic way. We define a model that expresses this knowledge qualitatively; the model will have some unknown parameters, about which we make assumptions through a prior distribution before seeing the data. Then we observe the data and compute the posterior probability distribution for the parameters, given the observed data.
Posterior distribution
The posterior distribution $P(w \mid \mathcal{D})$ for the model parameters can be found by combining the prior with the likelihood for the parameters given the data; in particular, the posterior distribution is proportional to the product of the likelihood and the prior.
Ingredients:
- Prior probability distribution: $P(w) = \mathcal{N}(w \mid m_0, S_0)$
- Likelihood: $p(\mathcal{D} \mid w)$, i.e. the probability of the training data $\mathcal{D}$ given the parameters $w$
- Normalizing constant: $P(\mathcal{D}) = \int p(\mathcal{D} \mid w)\, P(w)\, dw$
Result:

$$P(w \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid w)\, P(w)}{P(\mathcal{D})}$$
We want the most probable value of $w$ given the data: the Maximum A Posteriori (MAP) estimate. It is the mode of the posterior.
Sequential approach
The Bayesian approach can be used in a sequential way. When? If you have computed the posterior distribution in a given situation and then you receive more data, you can use the previously computed posterior as the prior distribution and calculate the new posterior distribution.
Problem: this approach works well when prior and posterior are in the same family; otherwise it does not work analytically.
Solution: we use a conjugate prior, i.e. a prior chosen so that, multiplied by the likelihood, it produces a posterior in the same family as the prior (you have to choose a prior that works well with the likelihood).
OBS: if you don't want to use the conjugate prior, you can always approximate.
Assuming a Gaussian likelihood model, the conjugate prior is a multivariate Gaussian:

$$p(w) = \mathcal{N}(w \mid w_0, S_0)$$

with:
- $w_0$ → mean vector
- $S_0$ → covariance matrix (if it is diagonal, the weights are independent)
So, as we said before, the posterior is still Gaussian:

$$p(w \mid t, \Phi, \sigma^2) \propto \mathcal{N}(w \mid w_0, S_0)\, \mathcal{N}(t \mid \Phi w, \sigma^2 I) = \mathcal{N}(w \mid w_N, S_N)$$

$$w_N = S_N \left( S_0^{-1} w_0 + \frac{\Phi^T t}{\sigma^2} \right)$$

$$S_N^{-1} = S_0^{-1} + \frac{\Phi^T \Phi}{\sigma^2}$$
This is a probabilistic version of Ridge Regression: in fact, in Gaussian distributions the mode coincides with the mean, so $w_N$ is the MAP estimator.
- If the prior has infinite variance, $w_N$ reduces to the ML estimator.
- If $w_0 = 0$ and $S_0 = \tau^2 I$, then $w_N$ reduces to the ridge estimate, with $\lambda = \frac{\sigma^2}{\tau^2}$.
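A sketch of this posterior update on toy data (assumed known noise variance σ² and an assumed prior N(0, I); the helper `posterior` is just illustrative). It also illustrates the sequential approach described above: updating on two batches in turn gives the same posterior as using all the data at once.

```python
import numpy as np

def posterior(Phi, t, sigma2, w0, S0):
    """Gaussian posterior N(w | w_N, S_N) given a Gaussian prior N(w | w0, S0)."""
    S0_inv = np.linalg.inv(S0)
    S_N = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma2)
    w_N = S_N @ (S0_inv @ w0 + Phi.T @ t / sigma2)
    return w_N, S_N

rng = np.random.default_rng(3)
Phi = rng.normal(size=(40, 3))                       # toy design matrix
t = Phi @ np.array([1.0, -1.0, 2.0]) + 0.2 * rng.normal(size=40)
sigma2, w0, S0 = 0.2**2, np.zeros(3), np.eye(3)      # assumed known noise and prior

# All the data at once ...
w_all, S_all = posterior(Phi, t, sigma2, w0, S0)
# ... versus sequentially: the first posterior becomes the prior for the second batch
w_1, S_1 = posterior(Phi[:20], t[:20], sigma2, w0, S0)
w_2, S_2 = posterior(Phi[20:], t[20:], sigma2, w_1, S_1)

print(np.allclose(w_all, w_2), np.allclose(S_all, S_2))  # True True
```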
Predictive Distribution
In practice, we are not usually interested in the value of $w$ itself but rather in making predictions of $t$ for new values of $x$. This requires that we evaluate the posterior predictive distribution, defined by

$$p(t \mid x, \mathcal{D}, \sigma^2) = \int \mathcal{N}(t \mid w^T \phi(x), \sigma^2)\, \mathcal{N}(w \mid w_N, S_N)\, dw = \mathcal{N}(t \mid w_N^T \phi(x), \sigma_N^2(x))$$

$$\sigma_N^2(x) = \sigma^2 + \phi(x)^T S_N\, \phi(x)$$
where:
- $\sigma^2$ → noise in the target values (usually unknown); asymptotically, the predictive variance arises only from this term
- $\phi(x)^T S_N\, \phi(x)$ → uncertainty associated with the parameter values; note that it goes to zero as $N \to \infty$
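A minimal sketch of the predictive mean and variance, again on made-up data, with an assumed known σ² and an assumed prior N(0, I):

```python
import numpy as np

rng = np.random.default_rng(4)

def phi(x):
    # Illustrative polynomial feature map phi(x) = [1, x, x^2]
    return np.array([1.0, x, x**2])

x_train = rng.uniform(-1, 1, size=25)
Phi = np.stack([phi(x) for x in x_train])
sigma2 = 0.1**2                                          # assumed known noise variance
t = Phi @ np.array([0.5, -1.0, 2.0]) + np.sqrt(sigma2) * rng.normal(size=25)

# Posterior N(w | w_N, S_N) with prior N(0, I)
S_N = np.linalg.inv(np.eye(3) + Phi.T @ Phi / sigma2)
w_N = S_N @ (Phi.T @ t / sigma2)

# Predictive distribution at a new input x:
#   mean     = w_N^T phi(x)
#   variance = sigma^2 + phi(x)^T S_N phi(x)
x_new = 0.3
mean = w_N @ phi(x_new)
var = sigma2 + phi(x_new) @ S_N @ phi(x_new)
print(mean, var)  # the second variance term shrinks as the training set grows
```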
Modeling Challenges
1. Specifying:
   a. a suitable model: it should admit all the possibilities that are thought to be at all likely
   b. a suitable prior distribution: it should avoid giving zero or very small probabilities to possible events, but it should also avoid spreading the probability too thinly over all possibilities.

To avoid uninformative priors, we may need to model dependencies between parameters: one strategy is to introduce latent variables into the model and hyperparameters into the prior. This allows the dependencies to be modelled in a tractable way.
2. Computing the posterior distribution, several approaches:
a. Analytical integration
b. Gaussian (Laplace) approximation
c. Monte Carlo integration
d. Variational approximation
Pros and cons of fixed basis functions
✅ PRO:
- closed-form solution
- tractable Bayesian treatment
- arbitrary non-linearity with the proper basis functions

❌ CON:
- basis functions are chosen independently from the training set
- curse of dimensionality
Linear Classification
Classification problems
Goal of classification: assign an input $x$ to one of $K$ discrete classes $C_k$, where $k = 1, ..., K$.
Typically, each input is assigned to only one class; we assume that there is no noise in the labels.
Linear Classification
The input space is divided into decision regions whose boundaries are called decision boundaries or decision surfaces.
For classification, we need to predict discrete class labels, or posterior probabilities that lie in the range $(0, 1)$, so we use a nonlinear function:
(