Applied statistics: multivariate analysis
Types of multivariate analysis
Multivariate analysis can be divided into two types with different goals:
- Analysis of dependence: simple and multiple linear regression, logistic regression.
- Analysis of interdependence: factor analysis, cluster analysis (hierarchical and non-hierarchical).
Factor analysis
Definition: A multivariate technique for analyzing the interdependence among quantitative variables.
Main objective: Reduce the number of variables by summarizing them in a smaller set of new quantitative variables (factors) that carry more aggregate information and have desirable properties. The input variables may suffer from multicollinearity, which factor analysis can address.
Extraction method: Principal component analysis (PCA). This method assumes that the specific (unique) information carried by each input variable is very low, while the information shared among the variables is very high, so the variables can be explained through k common principal factors.
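The extraction step can be sketched with NumPy as an eigendecomposition of the correlation matrix. This is a minimal illustration, not the procedure of any particular statistics package: the function name pca_loadings and the simulated data are assumptions made for the example.

```python
import numpy as np

def pca_loadings(X):
    """Eigenvalues (descending) and loading matrix of a PCA on the correlation matrix of X."""
    R = np.corrcoef(X, rowvar=False)       # p x p correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)   # eigh handles symmetric matrices, ascending order
    order = np.argsort(eigvals)[::-1]      # reorder from largest to smallest eigenvalue
    eigvals = np.clip(eigvals[order], 0.0, None)  # guard against tiny negative round-off
    eigvecs = eigvecs[:, order]
    loadings = eigvecs * np.sqrt(eigvals)  # column j is scaled by sqrt(eigenvalue j)
    return eigvals, loadings

# Simulated data (hypothetical): three variables share one common source, one is independent.
rng = np.random.default_rng(0)
common = rng.normal(size=(200, 1))
X = np.hstack([common + 0.3 * rng.normal(size=(200, 3)),
               rng.normal(size=(200, 1))])

eigvals, loadings = pca_loadings(X)
print("eigenvalues:", np.round(eigvals, 3))
print("loadings on the first component:", np.round(loadings[:, 0], 3))
```

Each loading is the correlation between an input variable and a component, and the eigenvalues sum to the number of variables p because the analysis is run on standardized variables.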
How many factors:
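- Ratio k/p: as a rule of thumb, the number of factors k should be about 30% of the number of initial variables p.
- Scree plot: retain the factors that come before the point (the elbow) where the curve flattens out.
- Percentage of total variance explained: between 60% and 75%. If it is higher, reduce the number of factors.
- Latent roots (eigenvalues): retain the factors with an eigenvalue > 1 (the default criterion).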
- Communalities (sum of the squared component loadings of each variable): > 0.5. A communality indicates the share of a single input variable's variance that is explained by the retained factors.
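The criteria above can be checked numerically. The sketch below uses a hypothetical correlation matrix for p = 6 variables, built from two blocks of three equally correlated variables (correlations 0.6 and 0.5). These values are illustrative assumptions and do not come from real data.

```python
import numpy as np

def equicorrelated_block(r, size=3):
    """Correlation block with 1 on the diagonal and r everywhere else."""
    return np.full((size, size), r) + (1 - r) * np.eye(size)

# Hypothetical correlation matrix: two independent groups of three variables.
R = np.zeros((6, 6))
R[:3, :3] = equicorrelated_block(0.6)
R[3:, 3:] = equicorrelated_block(0.5)

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]            # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

p = R.shape[0]
k = int(np.sum(eigvals > 1.0))               # latent-root criterion: eigenvalue > 1
ratio = k / p                                # rule of thumb: about 30%
cum_var = np.cumsum(eigvals) / p             # cumulative share of total variance
loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])
communalities = np.sum(loadings ** 2, axis=1)  # sum of squared loadings per variable

print("eigenvalues:", np.round(eigvals, 3))
print("factors retained:", k, "ratio k/p:", round(ratio, 2))
print("variance explained by k factors:", round(float(cum_var[k - 1]), 3))
print("communalities:", np.round(communalities, 3))
```

In this constructed case the criteria agree: two eigenvalues exceed 1, k/p is one third, the two factors explain 70% of the total variance, and every communality is above 0.5. On real data the criteria often disagree and have to be weighed against each other.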
How to interpret the components:
- The component matrix: shows the correlation (loading) of each initial variable with each new factor.
- The rotated component matrix, obtained with one of several rotation methods:
  - Varimax: minimizes the number of variables with high loadings (correlations) on each factor.
  - Quartimax: minimizes the number of factors strongly correlated with each variable.
  - Equimax: a compromise between Varimax and Quartimax.
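To illustrate what a rotation does, here is a minimal NumPy sketch of an orthogonal rotation from the orthomax family, using a common SVD-based iteration. The function name orthomax, its parameters, and the loading matrix are assumptions for this example. Under the usual orthomax parametrization, gamma = 1 corresponds to Varimax, gamma = 0 to Quartimax, and gamma = k/2 to Equimax. Statistical packages usually also apply Kaiser normalization, which is omitted here.

```python
import numpy as np

def orthomax(loadings, gamma=1.0, max_iter=500, tol=1e-8):
    """Orthogonally rotate a p x k loading matrix (gamma=1: varimax, gamma=0: quartimax)."""
    p, k = loadings.shape
    rotation = np.eye(k)
    objective = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        col_ss = np.sum(rotated ** 2, axis=0)             # sum of squares per factor
        target = rotated ** 3 - (gamma / p) * rotated * col_ss
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt                                 # nearest orthogonal matrix
        new_objective = np.sum(s)
        if new_objective < objective * (1 + tol):         # stop when it no longer improves
            break
        objective = new_objective
    return loadings @ rotation, rotation

# Hypothetical loadings: a clean two-factor structure, deliberately mixed by a 30 degree rotation.
simple = np.array([[0.9, 0.0], [0.8, 0.0], [0.7, 0.0],
                   [0.0, 0.9], [0.0, 0.8], [0.0, 0.7]])
theta = np.deg2rad(30)
mix = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
unrotated = simple @ mix

rotated, rotation = orthomax(unrotated, gamma=1.0)
print(np.round(rotated, 3))
```

The rotation leaves the communalities unchanged. It only redistributes the loadings so that each variable loads mainly on one factor, which makes the factors easier to name.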
Factor scores: Once an adequate solution is found, the factors can be used as new macro variables, whose values for each statistical unit are called factor scores. They are standardized variables with mean 0 and variance 1, and they can be used as explanatory variables in a regression model or as segmentation variables in cluster analysis.
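A minimal sketch of how such scores can be computed for an unrotated principal components solution, and a check that they have mean 0 and variance 1. The simulated data, the mixing matrix W, and the choice k = 2 are assumptions made for the example. Other score estimation methods exist and are not shown.

```python
import numpy as np

# Simulated data (hypothetical): six variables driven by two latent sources plus noise.
rng = np.random.default_rng(42)
n = 300
latent = rng.normal(size=(n, 2))
W = np.array([[1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])
X = latent @ W + 0.4 * rng.normal(size=(n, 6))

# Standardize the inputs (sample standard deviation, ddof=1).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Principal components of the correlation matrix, largest eigenvalue first.
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2  # number of retained factors (assumed for this example)
scores = Z @ eigvecs[:, :k] / np.sqrt(eigvals[:k])  # one row of scores per statistical unit

print("means:", np.round(scores.mean(axis=0), 6))
print("variances:", np.round(scores.var(axis=0, ddof=1), 6))
print("correlation between the two scores:",
      round(float(np.corrcoef(scores, rowvar=False)[0, 1]), 6))
```

Because the scores are uncorrelated with each other, using them as regressors avoids the multicollinearity present in the original variables.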
Cluster analysis: non-hierarchical algorithm
Definition: Cluster analysis (CA) is an automatic classification technique that classifies statistical units into groups or clusters, which are internally homogeneous but externally very heterogeneous.
Main objective: Create groups that are internally homogeneous but clearly different from one another (high between-group variability); this is also known as segmentation. Factor scores can be used as the variables on which the groups are formed.
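One common way to quantify this objective is to split the total sum of squares into a within-group part (internal homogeneity) and a between-group part (separation between groups). The sketch below assumes Euclidean distances. The function name sum_of_squares and the toy data are hypothetical.

```python
import numpy as np

def sum_of_squares(X, labels):
    """Return the total, within-group, and between-group sums of squares."""
    grand_mean = X.mean(axis=0)
    total = np.sum((X - grand_mean) ** 2)
    within = 0.0
    between = 0.0
    for g in np.unique(labels):
        members = X[labels == g]
        center = members.mean(axis=0)
        within += np.sum((members - center) ** 2)                    # spread inside the group
        between += len(members) * np.sum((center - grand_mean) ** 2)  # distance of the group from the overall mean
    return total, within, between

# Toy data: two compact, well-separated groups of three units each.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])

total, within, between = sum_of_squares(X, labels)
print("share of variability between groups:", round(between / total, 3))
```

A good segmentation has a small within-group sum of squares relative to the total, so the between-group share is close to 1.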
Different types of segmentation: Depending on the type of data used, segmentation can be behavioral, need-based, demographic, or value-based.
Two main types of algorithms:
- Direct classification algorithms (e.g. the k-means algorithm): the number of clusters must be specified in advance.