The criterion: 2) Kaiser's rule
This rule was first proposed by Kaiser as a selection criterion for PCs derived from a correlation matrix R. The rule suggests retaining as many principal components as there are eigenvalues of R larger than 1. The rule can also be generalized to PCs computed via the eigen decomposition of S.
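A minimal sketch of this rule in base R, illustrated with the USArrests data used later in these notes:

lambda <- eigen(cor(USArrests))$values   # ordered eigenvalues of R
k <- sum(lambda > 1)                     # Kaiser's rule: retain PCs with eigenvalue > 1
k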
The criterion: 3) Scree plot
This rule, which can be applied to PCs extracted from either S or R, is a graphical criterion based on the so-called scree plot. This is simply a plot of the eigenvalues λ_j against the corresponding index j, j = 1, ..., d.
How should the scree plot be used?
The scree plot suggests selecting, on the j-axis, as k the value to the left of the point where the curve becomes flat, namely to the left of the point at which the elbow occurs.
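A minimal sketch of a scree plot in base R, under the same assumptions (eigenvalues of R from the USArrests data):

lambda <- eigen(cor(USArrests))$values
plot(seq_along(lambda), lambda, type = "b",
     xlab = "j", ylab = "eigenvalue", main = "Scree plot")
abline(h = 1, lty = 2)   # Kaiser threshold, shown only for reference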
Preliminaries for biplot...
To be noted: we can always compute correlations between the original variables and
the PCs!
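A minimal sketch of this computation, assuming standardized data and loadings obtained from eigen() (again illustrated with the USArrests data used later in these notes):

Z      <- scale(USArrests)                # standardized data
phi    <- eigen(cor(USArrests))$vectors   # loading vectors
scores <- Z %*% phi                       # PC scores
cor(USArrests, scores)                    # correlations between original variables and PCs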
...Biplot
The biplot is a type of exploratory graph allowing information on both the sample
units and the variables of a data matrix X to be displayed graphically on a
(bidimensional) Cartesian plane spanned by (usually the first) two PCs.
• Principal components are displayed as axes.
• Sample units are displayed as points (PC scores).
• Original variables are displayed as arrows (PC loadings).
To understand:
• A point (score) close to the origin means that the corresponding unit has values of both displayed PCs (PC1 and PC2) close to the mean.
• A point (score) far from the origin in the direction of one of the axes (PCs)
means that the corresponding unit presents values which are different from
the mean for that PC.
• A point (score) far from the origin in the direction of one of the arrows
(variables) means that the corresponding unit presents values which are
different from the mean for that variable.
Important!: the biplot is a useful graphical representation if the (first) two PCs
explain a high proportion of variance.
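A minimal sketch of a biplot in base R, using prcomp() as a shortcut for the eigen-decomposition steps described later:

pc <- prcomp(USArrests, scale. = TRUE)    # PCs from the correlation matrix
biplot(pc, scale = 0)                     # points = PC scores, arrows = PC loadings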
Understanding the arrows:
Example: USArrests
The USArrests data are contained in the datasets package for R.
The data set contains d = 4 variables. Three variables represent the number of arrests per 100,000 residents for Assault, Murder, and Rape in each of the n = 50 US states in 1973. The data set also contains the percentage of the population living in urban areas (UrbanPop).
1) First issue: different mean
Each original variable may have a different mean.
It is usually beneficial for PCA to center each variable at zero, because this makes it straightforward to interpret each principal component relative to the mean.
2) Second issue: different scale
The variance of Assault is about 6945, while the variance of Murder is only 18.97. The Assault data are not necessarily more variable; they are simply on a different scale relative to Murder! Remember that the variance is not a relative measure of variability.
Standardizing (scaling) each variable will fix this issue.
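A minimal sketch of checking and fixing the scale issue in base R:

apply(USArrests, 2, var)                  # very different variances (e.g. Assault vs Murder)
Z <- scale(USArrests)                     # center and scale each variable
round(apply(Z, 2, var), 3)                # every variance is now 1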
However, keep in mind that there may be instances where scaling is not desirable. An
example would be if every variable in the data set had the same units and the analyst
wished to capture this difference in variance for his or her results. Since Murder,
Assault, and Rape are all measured on occurrences per 100,000 people this may be
reasonable depending on how you want to interpret the results. But since UrbanPop
is measured as a percentage of total population it wouldn’t make sense to compare
the variability of UrbanPop to Murder, Assault, and Rape.
Important!: the key thing to remember is that PCA is influenced by the magnitude of each variable; therefore, the results obtained when we perform PCA will also depend on whether the variables have been individually scaled.
Because it is undesirable for the PCs obtained to depend on an arbitrary choice of scaling, we typically scale each variable to have standard deviation one before we perform PCA.
If we perform PCA on the unscaled variables, then the first PC loading vector φ1 will have a very large loading for Assault, since that variable has by far the highest variance.
The right-hand plot displays the first two PCs, without scaling the variables to have standard deviation one. As predicted, φ1 places almost all of its weight on Assault, while the second PC loading vector φ2 places almost all of its weight on UrbanPop.
Comparing this to the left-hand plot, we see that scaling does indeed have a substantial effect on the results obtained.
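A minimal sketch reproducing this comparison, with prcomp() biplots standing in for the two panels:

pc_scaled   <- prcomp(USArrests, scale. = TRUE)    # left-hand plot
pc_unscaled <- prcomp(USArrests, scale. = FALSE)   # right-hand plot
par(mfrow = c(1, 2))
biplot(pc_scaled, scale = 0)
biplot(pc_unscaled, scale = 0)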
Looking at the arrows
Overall, we see that the crime-related variables (Murder, Assault, and Rape) are
located close to each other, and that UrbanPop is far from the other three. This
indicates that the crime-related variables are correlated with each other – states with
high murder rates tend to have high assault and rape rates – and that the UrbanPop
variable is less correlated with the other three.
Computing PCs in R
To calculate principal components, we start by using the cov() function to calculate the covariance matrix S (or the correlation matrix R, via cor(), if the data have been standardized), followed by the eigen() command to calculate the eigenvectors and eigenvalues of S. eigen() produces an object that contains both the ordered eigenvalues ($values) and the corresponding eigenvector matrix ($vectors).
Just as an example, we take the first two sets of loadings and store them in the matrix
called phi.
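A minimal sketch of these steps (all object names other than phi are illustrative):

R   <- cor(USArrests)          # correlation matrix (data treated as standardized)
e   <- eigen(R)                # eigen decomposition of R
e$values                       # ordered eigenvalues
phi <- e$vectors[, 1:2]        # first two loading vectors
rownames(phi) <- colnames(USArrests)
colnames(phi) <- c("PC1", "PC2")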
Principal component loadings
Warning: eigenvectors that are calculated in any software package are unique up to
a sign flip.
Suggestion: choose the sign of each eigenvector in order to make the interpretation of the corresponding PC easier.
For these data, the eigenvectors returned by R point in the negative direction. For this example, we'd prefer the eigenvectors to point in the positive direction, because it leads to a more logical interpretation of the graphical results, as we'll see shortly. To use the positive-pointing vectors, we multiply the default loadings by -1. The sets of loadings for the first principal component (φ1) and second principal component (φ2) are shown below:
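A minimal sketch of the sign flip, continuing from the phi defined in the sketch above:

phi <- -phi    # flip the sign of the loadings
phi            # loadings for PC1 and PC2, now pointing in the positive direction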
Each eigenvector (i.e. φ1 and φ2 in this example) defines a direction in the original variable space. Because eigenvectors are orthogonal, principal components are uncorrelated with one another and form a basis of the new space. This holds true no matter how many dimensions are being used.
By examining the loadings we note that:
• The first loading vector φ1 places approximately equal weight (from 0.535 to 0.583) on Assault, Murder, and Rape, with much less weight (0.278) on UrbanPop. Hence this component (PC1) roughly corresponds to a measure of overall rates of serious crimes.
• The second loading vector φ2 places most of its weight on UrbanPop (0.872) and much less weight on the other three variables. Hence this component (PC2) roughly corresponds to the level of urbanization of the state, with some opposite, smaller influence from Murder and Assault.
Principal component scores
If we project the data points (i.e. the rows of X) onto the first eigenvector φ1, the projected values are called the first principal component scores for each observation.
We’ve calculated the first and second principal components for each US state, and
we can plot them against each other and produce a two-dimensional view of the
data.
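A minimal sketch of this plot, assuming the sign-flipped phi from the sketches above:

scores <- scale(USArrests) %*% phi        # n x 2 matrix of PC scores
plot(scores, xlab = "PC1", ylab = "PC2", type = "n")
text(scores, labels = rownames(USArrests), cex = 0.7)   # label each state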
So far, we only looked at two of the four principal components.
Questions
• How did we know to use two principal components?
• How well is the data explained by these two principal components compared
to using the full data set?
The Proportion of Variance Explained
Remark: PCA reduces the dimensionality while explaining most of the variability, but there is a more technical method for measuring exactly what percentage of the variance was retained in these principal components: the proportion of variance explained (PVE).
A vector of PVE values, one for each principal component, can be calculated in R.
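A minimal sketch of this calculation (the original code is not reproduced here):

lambda <- eigen(cor(USArrests))$values
pve    <- lambda / sum(lambda)     # proportion of variance explained by each PC
round(pve, 3)
round(cumsum(pve), 3)              # cumulative PVE; the first two PCs reach about 0.867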
Remark: the elbow point is not so clear in this example; nevertheless, according to the cumulative PVE rule, reducing from the d = 4 original variables to 2 PCs, while still explaining 86.7% of the variability, is a good compromise.
SLIDE 5: Cluster Analysis
Cluster Analysis (CA), simply said clustering, is one of the most important statistical
methods for discovering knowledge in multidimensional data. The goal of CA is to
identify patterns (or groups, or clusters) of similar units within a data set X. In the
literature, CA is also referred to as “unsupervised machine learning”:
→ unsupervised because we are not guided by a priori ideas about the underlying clusters, which for this reason are often referred to as "latent groups";
→ learning because the machine (the algorithm) "learns" how to cluster.
Some very general considerations
Clustering observations means partitioning them into distinct groups such that:
• observations within the same group are similar;
• observations from different (between) groups are as different as possible from
each other.
Preliminary requirement: to make CA concrete, we must define what it means for two
or more observations to be similar or different.
Comparison with PCA: both clustering and PCA seek to simplify the data via a small
number of summaries, but their mechanisms are different:
o PCA looks to find a low-dimensional representation of the observations that explains a good fraction of the variance;
o Clustering looks to find homogeneous subgroups among the observations.
The logic
Examples:
• Finding homogeneous groups of users of mobile phones;
• Finding personality types based on questionnaire data;
• Looking for families, species, or groups of animals or plants;
• Looking for search types based on Google search histories;
• Based on observed choices, find customer typologies to introduce new
products;
• Arranging insured persons of an insurance company into groups (risk classes).
Clustering distance/dissimilarity measures
As for PCA, the starting point of CA is an n × d data matrix X.
Clustering units: in CA we are interested in the units (rows of X).
Where is the information for unit i? With reference to unit i, its information is contained in the d-dimensional vector x_i (the i-th row of X), with i = 1, ..., n.
The classification of observations (rows of X) into groups requires some methods for
computing the distance or the dissimilarity between each pair of observations.
Dissimilarity or distance matrix: the result of this computation, for each pair of units, yields the so-called dissimilarity (or distance) matrix.
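A minimal sketch of this computation in base R, assuming Euclidean distance on standardized data:

D <- dist(scale(USArrests), method = "euclidean")   # pairwise distances between units
as.matrix(D)[1:3, 1:3]                              # portion of the distance matrix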