Data Analysis Notebook

The document presents the tools for unsupervised statistical learning, with explanations, formulas, and examples.

Contents:
1) Univariate Statistical Modelling, 2) Basics of Matrices, 3) Basics of Multivariate Statistics, 4) Principal Component Analysis, 5) Cluster Analysis, 6) Cluster Validation, 7) Model-Based Clustering

Exam: Data Analysis

Faculty: Economics

From the course of Prof. Punzo Antonio

University: Università degli Studi di Catania

Publisher: alesplat98

Academic year 2021-2022

140 pages

Notes


Document excerpt


2) Kaiser’s rule

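This rule was first proposed by Kaiser as a selection criterion for PCs derived from a correlation matrix R. The rule suggests retaining as many principal components as there are eigenvalues of R larger than 1. The rule can also be generalized to PCs computed via the eigendecomposition of S.

The criterion: retain the $j$-th PC if $\lambda_j > 1$ when working with R; when working with S, a common generalization is to retain it if $\lambda_j > \bar{\lambda} = \frac{1}{d}\sum_{k=1}^{d} \lambda_k$, the average eigenvalue (which equals 1 for R).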
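As a rough illustration of Kaiser's rule, the sketch below counts the eigenvalues that exceed the threshold. It assumes the USArrests data analysed later in these notes, and it assumes that the generalization to S compares each eigenvalue with the average eigenvalue (the usual textbook form; the formula in the notes is not shown). The names lambda_R, lambda_S, q_R and q_S are illustrative only.

```r
# Kaiser's rule, sketched on the built-in USArrests data.
lambda_R <- eigen(cor(USArrests))$values   # eigenvalues of the correlation matrix R
q_R <- sum(lambda_R > 1)                   # PCs retained under the original rule

lambda_S <- eigen(cov(USArrests))$values   # eigenvalues of the covariance matrix S
q_S <- sum(lambda_S > mean(lambda_S))      # assumed generalization: compare with the mean eigenvalue

q_R
q_S
```

For R the average eigenvalue is exactly 1, so the two versions of the rule coincide in that case.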

3) Scree plot

This rule, which can be applied to PCs extracted from either S or R, is a graphical criterion based on the so-called scree plot. This is simply a plot of the eigenvalues $\lambda_j$ against the corresponding $j$, $j = 1, \ldots, d$.

How should the scree plot be used?

The scree plot suggests selecting, on the $j$-axis, the number of PCs to retain as the value to the left of the point where the curve becomes flat, namely to the left of the point at which the elbow occurs.
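A scree plot can be drawn in a couple of lines. This is a minimal sketch, assuming the USArrests data used later in these notes and PCs extracted from R; the variable name lambda and the axis labels are illustrative choices.

```r
# Scree plot: eigenvalues against their index j.
lambda <- eigen(cor(USArrests))$values     # eigenvalues, returned in decreasing order

plot(seq_along(lambda), lambda, type = "b",
     xlab = "j (principal component)", ylab = "Eigenvalue",
     main = "Scree plot")
```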

Preliminaries for biplot...

Note: we can always compute the correlations between the original variables and the PCs!
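These correlations can be computed directly from the data and the scores. The sketch below assumes the USArrests data used later in these notes, standardized first. It also checks the known identity for standardized variables, where the correlation between variable j and PC k equals the loading times the square root of the k-th eigenvalue. The names X, e, scores, r and r_check are illustrative.

```r
# Correlations between original variables and PCs (standardized data).
X <- scale(USArrests)                  # centered, unit-variance variables
e <- eigen(cov(X))                     # eigendecomposition of R
scores <- X %*% e$vectors              # PC scores, one column per PC

r <- cor(X, scores)                    # rows = variables, columns = PCs

# Identity for standardized data: cor(X_j, PC_k) = loading_jk * sqrt(lambda_k)
r_check <- e$vectors %*% diag(sqrt(e$values))
round(r, 3)
```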


...Biplot

The biplot is a type of exploratory graph allowing information on both the sample

units and the variables of a data matrix X to be displayed graphically on a

(bidimensional) Cartesian plane spanned by (usually the first) two PCs.

• Principal components are displayed as axes.

• Sample units are displayed as points (PC scores).

• Original variables are displayed as arrows (PC loadings).

To understand:

• A point (score) close to the origin means that the corresponding unit has values of the first and second PC close to the mean.

• A point (score) far from the origin in the direction of one of the axes (PCs) means that the corresponding unit presents values which are different from the mean for that PC.

• A point (score) far from the origin in the direction of one of the arrows (variables) means that the corresponding unit presents values which are different from the mean for that variable.

Important!: the biplot is a useful graphical representation only if the (first) two PCs explain a high proportion of variance.
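A biplot can be produced with base R. The following is a sketch rather than the notes' own code: it assumes the USArrests data introduced in the example below, uses prcomp() with standardized variables, and also computes the share of variance explained by the first two PCs, which the remark above asks us to check. The names pca and pve2 are illustrative.

```r
# Biplot of the first two PCs, plus the variance they explain.
pca <- prcomp(USArrests, scale. = TRUE)          # PCA on standardized variables

biplot(pca, scale = 0)                           # points = PC scores, arrows = PC loadings

pve2 <- sum(pca$sdev[1:2]^2) / sum(pca$sdev^2)   # proportion of variance of PC1 + PC2
pve2
```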

Understanding the arrows:

Example: USArrests

The USArrests data are contained in the datasets package for R.

The data set contains d = 4 variables. Three variables represent the number of arrests per 100,000 residents for Assault, Murder, and Rape in each of the n = 50 US states in 1973. The data set also contains the percentage of the population living in urban areas (UrbanPop).

1) First issue: different mean

Each original variable may have a different mean.

It is usually beneficial to center each variable at zero before PCA, because this makes comparing each principal component to the mean straightforward.

2) Second issue: different scale

The variance of Assault is about 6945, while the variance of Murder is only 18.97. The

Assault data isn’t necessarily more variable, it’s simply on a different scale relative to

Murder! Remember that the variance is not a relative measure of variability!

Standardizing (scaling) each variable fixes the issue.

However, keep in mind that there may be instances where scaling is not desirable. An example would be if every variable in the data set had the same units and the analyst wished to capture this difference in variance in his or her results. Since Murder, Assault, and Rape are all measured as occurrences per 100,000 people, this may be reasonable, depending on how you want to interpret the results. But since UrbanPop is measured as a percentage of total population, it wouldn’t make sense to compare the variability of UrbanPop to that of Murder, Assault, and Rape.

Important!: PCA is influenced by the magnitude of each variable; therefore, the results obtained when we perform PCA will also depend on whether the variables have been individually scaled.

Because it is undesirable for the PCs obtained to depend on an arbitrary choice of scaling, we typically scale each variable to have standard deviation one before we perform PCA.
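The two issues above (different means, different scales) can be checked numerically. This is a minimal sketch using base R functions on USArrests; the names v and X are illustrative.

```r
# Variances of the raw variables, then centering and scaling.
v <- sapply(USArrests, var)        # variance of each original variable
round(v, 2)

X <- scale(USArrests)              # center each column at 0 and scale to sd 1
round(colMeans(X), 10)             # column means after centering
apply(X, 2, sd)                    # column standard deviations after scaling
```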

If we perform PCA on the unscaled variables, then the first PC loading vector will have a very large loading for Assault, since that variable has by far the highest variance.

The right-hand plot displays the first two PCs, without scaling the variables to have standard deviation one. As predicted, the first PC loading vector places almost all of its weight on Assault, while the second PC loading vector places almost all of its weight on UrbanPop. Comparing this to the left-hand plot, we see that scaling does indeed have a substantial effect on the results obtained.
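The effect of scaling can be seen by comparing the loadings of the two analyses. The sketch below uses prcomp() as a convenient stand-in for the manual computation described in these notes; the object names pca_raw and pca_scaled are illustrative.

```r
# Loadings of the first two PCs, without and with scaling.
pca_raw    <- prcomp(USArrests, scale. = FALSE)   # unscaled variables
pca_scaled <- prcomp(USArrests, scale. = TRUE)    # variables scaled to sd 1

round(pca_raw$rotation[, 1:2], 3)      # PC1 dominated by Assault, PC2 by UrbanPop
round(pca_scaled$rotation[, 1:2], 3)   # weights spread across the variables
```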

Looking at the arrows

Overall, we see that the crime-related variables (Murder, Assault, and Rape) are

located close to each other, and that UrbanPop is far from the other three. This

indicates that the crime-related variables are correlated with each other – states with

high murder rates tend to have high assault and rape rates – and that the UrbanPop

variable is less correlated with the other three.

Computing PCs

To calculate principal components, we start by using the cov() function to calculate the covariance matrix S (or the correlation matrix R if data have been standardized), followed by the eigen() command to calculate the eigenvectors and eigenvalues of S. eigen() produces an object that contains both the ordered eigenvalues ($values) and the corresponding eigenvector matrix ($vectors).

Just as an example, we take the first two sets of loadings and store them in the matrix called phi.
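The computation just described can be sketched as follows. The name phi comes from the notes; X, S and e are illustrative names, and the data are standardized first so that the covariance matrix of X coincides with R.

```r
# Eigendecomposition of the covariance matrix of the standardized data.
X <- scale(USArrests)          # standardized data, so cov(X) equals R
S <- cov(X)                    # covariance matrix
e <- eigen(S)                  # eigenvalues and eigenvectors of S

e$values                       # ordered eigenvalues
e$vectors                      # eigenvectors, one per column

phi <- e$vectors[, 1:2]        # first two sets of loadings
```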

Principal component loadings

Warning: eigenvectors calculated in any software package are unique only up to a sign flip.

Suggestion: choose the sign of each eigenvector so as to make the interpretation of the corresponding PC easier.

In this example, the eigenvectors returned by R point in the negative direction. We’d prefer the eigenvectors to point in the positive direction, because this leads to a more logical interpretation of the graphical results, as we’ll see shortly. To use the positive-pointing vectors, we multiply the default loadings by -1. This gives the sets of loadings for the first principal component (PC1) and the second principal component (PC2).
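Because the sign returned by eigen() can differ across platforms, the sketch below does not multiply by -1 unconditionally. Instead it uses a different rule, chosen here for illustration: each eigenvector is flipped so that its largest-magnitude entry is positive. For this data set that gives the same orientation as the one described in the notes. The row and column labels are added for readability.

```r
# Orient the first two eigenvectors so that their dominant entry is positive.
X <- scale(USArrests)
phi <- eigen(cov(X))$vectors[, 1:2]

for (k in 1:2) {
  j <- which.max(abs(phi[, k]))                  # entry with the largest magnitude
  if (phi[j, k] < 0) phi[, k] <- -phi[, k]       # flip the whole eigenvector if needed
}

rownames(phi) <- colnames(USArrests)
colnames(phi) <- c("PC1", "PC2")
round(phi, 3)
```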

Each eigenvector (i.e. the first and the second in this example) defines a direction in the original variable space. Because eigenvectors are orthogonal, principal components are uncorrelated with one another, and form a basis of the new space. This holds true no matter how many dimensions are being used.

By examining the loadings we note that:

• The first loading vector places approximately equal weight (from 0.535 to 0.583) on Assault, Murder, and Rape, with much less weight (0.278) on UrbanPop. Hence this component (PC1) roughly corresponds to a measure of overall rates of serious crimes.

• The second loading vector places most of its weight on UrbanPop (0.872) and much less weight on the other three variables. Hence, this component (PC2) roughly corresponds to the level of urbanization of the state, with some opposite, smaller influence by Murder and Assault.

Principal component score

If we project the data points (i.e. the rows of X) onto the first eigenvector, the projected values are called the principal component scores for each observation. Having calculated the first and second principal component scores for each US state, we can plot them against each other and produce a two-dimensional view of the data.
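The projection can be written as a matrix product between the standardized data and the loading matrix. This is a minimal sketch; scores, X, e and phi are illustrative names, and the plotting options are arbitrary choices.

```r
# PC scores as projections of the rows of X onto the first two eigenvectors.
X <- scale(USArrests)
e <- eigen(cov(X))
phi <- e$vectors[, 1:2]

scores <- X %*% phi                        # 50 x 2 matrix of PC scores
colnames(scores) <- c("PC1", "PC2")

# Two-dimensional view of the data, one label per US state
plot(scores, type = "n", xlab = "PC1", ylab = "PC2")
text(scores, labels = rownames(USArrests), cex = 0.6)
```

The variance of each column of scores equals the corresponding eigenvalue, and the two columns are uncorrelated.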

So far, we only looked at two of the four principal components.

Questions

• How did we know to use two principal components?

• How well is the data explained by these two principal components compared

to using the full data set?

The Proportion of Variance Explained

Remark: PCA reduces the dimensionality while explaining most of the variability, but there is a more technical method for measuring exactly what percentage of the variance is retained in these principal components.

A vector of PVE for each principal component can be calculated in R by dividing each eigenvalue by the sum of all the eigenvalues.
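A minimal sketch of the PVE computation, assuming PCs extracted from the correlation matrix of USArrests; the names lambda and PVE are illustrative.

```r
# Proportion of variance explained (PVE) by each PC, and its cumulative sum.
lambda <- eigen(cor(USArrests))$values   # eigenvalues of R

PVE <- lambda / sum(lambda)              # PVE of each principal component
round(PVE, 3)
round(cumsum(PVE), 3)                    # cumulative PVE
```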

[Figure: scree plot, with the elbow annotated]

Remark: The elbow point is not so clear in this example; anyway, according to the cumulative PVE rule, reduction from d = 4 original variables to 2 PCs, while still explaining 86.7% of the variability, is a good compromise.

SLIDE 5: Cluster Analysis

Cluster Analysis (CA), or simply clustering, is one of the most important statistical methods for discovering knowledge in multidimensional data. The goal of CA is to identify patterns (or groups, or clusters) of similar units within a data set X. In the literature, CA is also referred to as “unsupervised machine learning”:

→ unsupervised because we are not guided by a priori ideas about the underlying clusters, which for this reason are often referred to as “latent groups”.

→ learning because the machine algorithm “learns” how to cluster.

Some very general considerations

Clustering observations means partitioning them into distinct groups such that:

• observations within the same group are similar;

• observations from different (between) groups are as different as possible from

each other.

Preliminary requirement: to make CA concrete, we must define what it means for two

or more observations to be similar or different.

Comparison with PCA: both clustering and PCA seek to simplify the data via a small

number of summaries, but their mechanisms are different:

o PCA looks to find a low-dimensional representation of the observations that explains a good fraction of the variance;

o Clustering looks to find homogeneous subgroups among the observations.

The logic

Examples:

• Finding homogeneous groups of users of mobile phones;

• Finding personality types based on questionnaire data;

• Looking for families, species, or groups of animals or plants;

• Looking for search types based on Google search histories;

• Based on observed choices, find customer typologies to introduce new

products;

• Arranging insured persons of an insurance company into groups (risk classes).

Clustering distance/dissimilarity measures

As for PCA, the starting point of CA is an $n \times d$ data matrix X.

Clustering units: in CA we are interested in the units (rows of X).

Where is the information for unit $i$?: with reference to unit $i$, its information is contained in the $d$-dimensional vector $\mathbf{x}_i$ (the $i$-th row of X), with $i = 1, \ldots, n$.

The classification of observations (rows of X) into groups requires some methods for

computing the distance or the dissimilarity between each pair of observations.
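As an illustration, pairwise distances between units can be computed with the base R function dist(). The sketch assumes the USArrests data from the PCA example, standardized first, and the Euclidean distance; neither choice is prescribed by the text at this point. The names D and Dm are illustrative.

```r
# Pairwise Euclidean distances between the rows (units) of the standardized data.
X <- scale(USArrests)
D <- dist(X, method = "euclidean")     # object of class "dist" (lower triangle only)

Dm <- as.matrix(D)                     # full n x n symmetric matrix
round(Dm[1:3, 1:3], 2)                 # distances among the first three states
```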

Dissimilarity or distance matrix: the result of this computation, for each pair of units,

yields the so-called dis
