- Strategy 1: start from the saturated model, estimating the variance/covariance matrix by maximum likelihood, and test each partial correlation coefficient against zero, possibly taking into account the multiple testing issue. E.g.: package SIN in R
- Strategy 2: Stability Selection algorithm: it looks for a stable graph structure using resampling. E.g.: package stabs in R
- Strategy 3: use a regularised estimator that shrinks small partial correlation coefficients towards zero. E.g.: lasso, elastic net, adaptive lasso estimators, package glmnet in R (see the sketch below)
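As a sketch of Strategy 3, here is a minimal neighbourhood-selection example with glmnet: each variable is regressed on all the others with a lasso penalty, and an edge is kept when the corresponding coefficient is non-zero. This is one lasso-based way to estimate the graph, not necessarily the exact procedure of the lecture; the data is simulated purely for illustration.

```r
# Sketch of a lasso-based strategy: nodewise lasso regressions,
# keeping an edge when a coefficient survives the penalty.
# Simulated data; a sketch, not a definitive implementation.
library(glmnet)

set.seed(1)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p)

adj <- matrix(0, p, p)                                # estimated adjacency matrix
for (j in 1:p) {
  fit  <- cv.glmnet(X[, -j], X[, j])                  # lasso of variable j on the rest
  beta <- as.matrix(coef(fit, s = "lambda.min"))[-1]  # drop the intercept
  adj[j, -j] <- as.numeric(beta != 0)                 # selected neighbours of node j
}
adj <- adj * t(adj)   # "AND" rule: keep edges selected in both directions
adj
```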
We can estimate $\Sigma$, then invert it obtaining $\Omega = \Sigma^{-1}$. From that we can obtain the partial correlation between two variables given the rest:
$$\rho_{ij\cdot\text{rest}} = \frac{-\omega_{ij}}{\sqrt{\omega_{ii}\,\omega_{jj}}}$$
obtaining the matrix $\Omega$ with $\frac{p(p-1)}{2}$ parameters, which are the values under the diagonal. Since there are $\frac{p(p-1)}{2}$ different tests, this leads to a multiple testing issue.
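A minimal R sketch of the computation above, on simulated data (names and sizes are illustrative): estimate $\Sigma$ with the sample covariance, invert it, and rescale the off-diagonal entries of $\Omega$:

```r
# From the sample covariance matrix to the matrix of partial correlations.
set.seed(1)
n <- 100; p <- 4
X <- matrix(rnorm(n * p), n, p)   # simulated data, for illustration only

Sigma.hat <- cov(X)               # estimate of Sigma
Omega.hat <- solve(Sigma.hat)     # concentration matrix Omega = Sigma^{-1}

# rho_ij.rest = -omega_ij / sqrt(omega_ii * omega_jj)
parcor <- -Omega.hat / sqrt(diag(Omega.hat) %o% diag(Omega.hat))
diag(parcor) <- 1
parcor
```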
The test for partial correlation is stated by the null hypothesis
$$H_0: \rho_{ij\cdot\text{rest}} = 0 \qquad \text{against} \qquad H_1: \rho_{ij\cdot\text{rest}} \neq 0,$$
and the statistic to compute this test is a Student's $t$:
$$T = r_{ij\cdot\text{rest}} \, \sqrt{\frac{(n-2)-(p-2)}{1 - r^2_{ij\cdot\text{rest}}}} \;\sim\; t_{(n-2)-(p-2)}.$$
This statistic is derived using the sample partial correlation coefficient $r_{ij\cdot\text{rest}}$, in analogy with the test for the pairwise correlation coefficient.
To improve this test there is a transformation to apply, the Fisher transformation, which under $H_0$ asymptotically converges to a standard Gaussian:
$$\sqrt{n-3-k}\;\cdot\;\frac{1}{2}\log\!\left[\frac{1+r_{ij\cdot\text{rest}}}{1-r_{ij\cdot\text{rest}}}\right] \;\sim\; N(0,1) \qquad \text{as } n \to \infty,$$
where $k = p-2$ is the dimension of the conditioning set. It is proven that the test derived using the Fisher transformation converges to the standard Gaussian faster than the previous statistic converges to the $t_{(n-2)-(p-2)}$ distribution.
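A small R sketch of this t-test (the helper name is made up for the example): given the sample partial correlation $r$ between two variables of a $p$-variate sample of size $n$:

```r
# t-test for H0: rho_ij.rest = 0, with df = (n - 2) - (p - 2) = n - p
partial.cor.test <- function(r, n, p) {
  df <- (n - 2) - (p - 2)
  t.stat  <- r * sqrt(df / (1 - r^2))
  p.value <- 2 * pt(-abs(t.stat), df = df)   # two-sided p-value
  c(statistic = t.stat, df = df, p.value = p.value)
}

partial.cor.test(r = 0.25, n = 100, p = 4)
```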
As we defined a statistic for the test, we can also define confidence intervals for the partial correlation. We define the Fisher transformation as
$$z = \frac{1}{2}\log\!\left[\frac{1+r_{ij\cdot\text{rest}}}{1-r_{ij\cdot\text{rest}}}\right]$$
and then from the Fisher transformation we obtain
$$z \;\sim\; N\!\left(\frac{1}{2}\log\!\left[\frac{1+\rho_{ij\cdot\text{rest}}}{1-\rho_{ij\cdot\text{rest}}}\right],\; \frac{1}{n-3-k}\right)$$
where $k = p - 2$ is the dimension of the conditioning set.
The interval for the Fisher-transformed parameter is
$$CI = [L,\; U] \qquad \text{with} \qquad L = z - \frac{z_{\alpha/2}}{\sqrt{n-3-k}}, \qquad U = z + \frac{z_{\alpha/2}}{\sqrt{n-3-k}}.$$
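A companion sketch for this confidence interval, under the same assumptions ($k = p - 2$ conditioning variables; the function name is made up). The interval is built on the $z$ scale and mapped back to the $\rho$ scale with the inverse transform $\tanh$:

```r
# CI for a partial correlation via the Fisher transform,
# z ~ N((1/2)log((1+rho)/(1-rho)), 1/(n-3-k)) with k = p - 2.
fisher.ci <- function(r, n, k, alpha = 0.05) {
  z  <- 0.5 * log((1 + r) / (1 - r))   # Fisher transform of r
  se <- 1 / sqrt(n - 3 - k)
  q  <- qnorm(1 - alpha / 2)
  ci.z <- c(z - q * se, z + q * se)    # interval on the z scale
  tanh(ci.z)                           # back-transform to the rho scale
}

fisher.ci(r = 0.25, n = 100, k = 2)
```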
In the context of graphical models we call the saturated model the one corresponding to a
complete graph.
An adjacency matrix is a matrix composed of 0s and 1s formed this way:
- $a_{ij} = 1$ if there is an edge connecting the nodes $i$ and $j$
- $a_{ij} = 0$ if there is no edge connecting the nodes $i$ and $j$
The adjacency matrix of an undirected graph is symmetric; the converse is not always true.
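For instance, a minimal adjacency matrix for an undirected graph on 4 nodes with edges 1-2, 2-3 and 2-4 (a graph chosen purely for illustration):

```r
# Adjacency matrix of an undirected 4-node graph (edges 1-2, 2-3, 2-4).
A <- matrix(0, 4, 4)
edges <- rbind(c(1, 2), c(2, 3), c(2, 4))
for (e in 1:nrow(edges)) {
  A[edges[e, 1], edges[e, 2]] <- 1
  A[edges[e, 2], edges[e, 1]] <- 1   # undirected graph: mirror each entry
}
A
```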
If we have an independence like $1 \perp 3 \mid 2,4$, the inverse of the variance/covariance matrix will be indicated as $\Sigma^{-1}_{1234}$ (obtained by inverting the matrix $\Sigma_{1234}$) and there will be zeros in the entries $[1,3]$ and $[3,1]$.
In this case, in order to check a sub-independence like $1 \perp 3 \mid 2$, we have to consider the matrix indicated as $\Sigma^{-1}_{123}$, obtained by inverting the matrix $\Sigma_{123}$, which is the matrix obtained by removing the 4th column and the 4th row. Then we compute the same test explained before considering this submatrix. If the null hypothesis is rejected, the partial correlation coefficient is most likely not 0, which means that 1 is not independent from 3 given 2.
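A sketch of this sub-matrix check in R, reusing the simulated `Sigma.hat` and the `partial.cor.test` helper from the sketches above:

```r
# Check the sub-independence 1 _||_ 3 | 2: drop the 4th row and column,
# invert the reduced covariance matrix and test the [1,3] partial correlation.
Sigma.123 <- Sigma.hat[-4, -4]   # remove the 4th column and the 4th row
Omega.123 <- solve(Sigma.123)

r13.2 <- -Omega.123[1, 3] / sqrt(Omega.123[1, 1] * Omega.123[3, 3])
partial.cor.test(r = r13.2, n = 100, p = 3)
```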
Recap:
We have $p$ variables and we want to study their conditional independence structure using a
graph, learning it from the data. In this situation we’re assuming that the data comes from a
multivariate Gaussian distribution and the models resulting from this data are called
Concentration Graph Models.
In this context a missing edge represents conditional independence between those two
variables. We construct the graph based on the Pairwise Markov properties. Since the joint
Gaussian distribution is strictly positive, the Pairwise Markov property will imply the Global
Markov property, so all the conditional independences that we can read off the graph using
the Global Markov properties are implied by the construction of the graph.
There are two cases:
- when we know the graph, we want to do inference, estimating the parameters given the graph → iterative procedures (Iterative Fitting Algorithm); a very rare situation since usually we don't know the graph
- inference when we don't know the graph, 3 possible strategies:
  - start with the maximum likelihood estimator of the variance/covariance matrix for the complete graph and then test $\rho_{ij\cdot\text{rest}} = 0$ with partial correlations, possibly taking into account the multiple testing issue → this way we recover the Pairwise Markov properties
  - Stability Selection algorithm
  - use a regularised estimator that shrinks small partial correlation coefficients towards zero
Multiple testing issue:
the probability of committing an error (first type or second type) on the whole graph, not just on a specific edge. Since the number of possible edges grows as $\frac{p(p-1)}{2}$, the number of tests grows really quickly as the number of variables grows.
Consider for example a scenario in which we have a multivariate regression:
$$Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \varepsilon$$
here a possible solution to know if a given coefficient $\beta_j$ is zero is to compute the Wald $t$-test. If you do this though, you lose the "big picture" since here you need to compute $p$ tests. For this reason you actually need to compute the $F$-test instead, where you compare the model with and the model without the $j$-th variable.
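A small sketch of this model comparison in R (simulated data; the variable names are made up): `anova` on two nested `lm` fits computes the $F$-test for dropping the extra variable:

```r
# Partial F-test: compare the models with and without x3.
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 + rnorm(n)     # x3 has no real effect here

full    <- lm(y ~ x1 + x2 + x3)
reduced <- lm(y ~ x1 + x2)        # model without the 3rd variable
anova(reduced, full)              # F-test comparing the nested models
```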
Multiple testing issue
Type 1 and 2 errors:
- type 1 error = reject $H_0$ when $H_0$ is true. The probability of committing a type 1 error is $\alpha$
- type 2 error = don't reject $H_0$ when $H_0$ is false. The probability of committing a type 2 error is $\beta$
The multiple testing problem concerns a situation in which we want to consider many hypotheses at the same time.
Some notation:
We set $m$ as the number of possible edges:
$$m = \frac{p(p-1)}{2}$$
where $p$ is the number of variables.
Consider then the set of null hypotheses $\{H_{0,1}, \ldots, H_{0,m}\}$; we define the Intersection Null (or Global Null) as
$$H_0 = \bigcap_{i=1}^{m} H_{0,i}.$$
The Global Null is rejected if at least one single $H_{0,i}$ is rejected.
We then denote the set of p-values of the set of tests as $\{p_1, \ldots, p_m\}$.
We compare the scenarios of a single test against multiple tests:
- when we compute a single test we consider the following probabilities:
$$P(\text{reject } H_0 \mid H_0 \text{ true}) = \alpha, \qquad P(\text{don't reject } H_0 \mid H_0 \text{ true}) = 1 - \alpha$$
- when we compute $m$ multiple tests we consider the following probabilities:
$$P(\text{no type 1 errors} \mid H_0) = (1-\alpha)^m, \qquad P(\text{at least one type 1 error} \mid H_0) = 1 - (1-\alpha)^m$$
where we assume each test is independent from the others.
The bigger the value of $\alpha$ (e.g. 0.1, 0.05, 0.01), the faster the probability of at least one type 1 error reaches 1.
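A quick numeric illustration of how fast $1 - (1-\alpha)^m$ approaches 1 as the number of independent tests grows:

```r
# P(at least one type 1 error) among m independent tests at level alpha.
m <- c(1, 10, 45, 190, 4950)   # p(p-1)/2 for p = 2, 5, 10, 20, 100
for (alpha in c(0.1, 0.05, 0.01)) {
  cat("alpha =", alpha, ":", round(1 - (1 - alpha)^m, 3), "\n")
}
```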
The Bonferroni global test is a way to test the Global Null hypothesis:
- choose the overall significance level $\alpha$
- test each null hypothesis $H_{0,i}$ at level $\frac{\alpha}{m}$
- accept $H_0$ if $p_i > \frac{\alpha}{m} \;\; \forall i = 1, \ldots, m$
- reject $H_0$ if $\min_{i \in [1,m]} p_i \leq \frac{\alpha}{m}$
The idea here is to set the significance level of each test to $\frac{\alpha}{m}$ since we compute $m$ tests. The overall significance level is
$$P(\text{at least one type 1 error} \mid H_0) = P\!\left(\bigcup_{i=1}^{m} \left\{ p_i < \frac{\alpha}{m} \right\}\right) \;\leq\; \sum_{i=1}^{m} P\!\left(p_i < \frac{\alpha}{m}\right) = \sum_{i=1}^{m} \frac{\alpha}{m} = \alpha$$
since $p_i \sim U(0,1)$ under $H_{0,i}$.
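A sketch of the Bonferroni rule on a vector of p-values (the values are made up):

```r
# Bonferroni: test each hypothesis at level alpha / m;
# reject the Global Null if any p-value falls below that threshold.
pvals <- c(0.0004, 0.031, 0.18, 0.47, 0.72)
alpha <- 0.05
m <- length(pvals)

pvals <= alpha / m                               # per-hypothesis decisions
any(pvals <= alpha / m)                          # Global Null decision
p.adjust(pvals, method = "bonferroni") <= alpha  # same, via adjusted p-values
```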
This bound is now considered too strong, especially for graphical models, since in some way we are looking for independences. Whenever we put an edge where it is not present, we are considering an overparametrised model (which won't be wrong in any sense, but still). Instead, when we remove an edge which is actually present, estimating a model that puts a 0 on that partial correlation coefficient, we cause a bias on the other coefficients, and so the model will be wrong.
This means that in graphical models a type 2 error (removing an edge which is present) is a much harder error than the opposite kind. So if we fix $\alpha$ to be too small, $\beta$ will be too high, and for graphical models this is a big problem.
In the context of graphical models it is wrong to state more independences than the real amount. In particular we need to control the type 1 errors, but we can't be too hard on $\alpha$, otherwise $\beta$ will be too high.
Multiple testing classical scheme:

                        Actual situation
Decision                H_0 true     H_0 false    Total
Don't reject H_0        U            T            m - R
Reject H_0              V            S            R
Total                   m_0          m - m_0      m

(U and S count the correct decisions in each cell.)
We consider the following notation:
- $m$ → total number of tests → total number of possible edges $\frac{p(p-1)}{2}$
- $m_0$ → number of true null hypotheses → number of independences (missing edges on the graph)
- $m - m_0$ → number of total edges on the graph in the model
- $R$ → number of rejected null hypotheses
- $V$ → number of rejected true null hypotheses → type 1 errors
- $T$ → type 2 errors
Remember that we only know the last column of the table, that is $R$ and $m - R$.