K-MEANS ALGORITHM
Greedy iterative approach to find a clustering that minimizes: SSE(C) = sum(i=1:k) sum(x_j in C_i) dist(x_j, mu_i)^2, i.e. find C* = argmin(C) SSE(C). Being a greedy iterative algorithm, it can converge to a local optimum instead of a globally optimal clustering.
Initial centroids matter: centroid initialization is not trivial. Sometimes the initial centroids readjust themselves in the "right" way, and sometimes they do not.
Extensions: updating centers incrementally, pre-processing or post-processing.
Strengths: relatively efficient, simple, often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
Limitations: has problems when clusters differ in size and density or have a non-globular shape. It also has problems with outliers and noisy data. It is applicable only when the mean is defined and requires k to be specified in advance.
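A minimal sketch of the k-means loop described above, assuming a plain NumPy implementation with random initialization (function and variable names are illustrative, not from the notes):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initialization
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged, possibly to a local optimum
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()  # SSE(C) of the final clustering
    return centroids, labels, sse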
BFR ALGORITHM
It can handle very large datasets. It assumes that clusters are normally distributed around a centroid in Euclidean space; standard deviations in different dimensions may vary, so clusters are axis-aligned ellipses. Memory usage is O(clusters) rather than O(data).
Points are read one chunk at a time. Most points from previous memory loads are summarized by simple statistics:
- DISCARD SET (DS): Points close enough to a centroid to be summarized
- COMPRESSION SET (CS): Groups of points that are close together but not close to any existing centroid
- RETAINED SET (RS): Isolated points waiting to be assigned to a compression set
Discard set is summarized by the number of points N, the sum vector and the sum-squared vector. When the last chunk of data is processed, the remaining mini-clusters and the points in the retained set (RS) might be labeled as outliers or alternatively can be assigned to one of the centroids.
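A sketch of that discard-set summary, assuming points are NumPy arrays (the helper names are illustrative, not from the notes); the centroid and the per-dimension variance can be recovered from N, SUM and SUMSQ alone:

import numpy as np

def summarize(points):
    points = np.asarray(points, dtype=float)
    N = len(points)
    SUM = points.sum(axis=0)           # componentwise sum vector
    SUMSQ = (points ** 2).sum(axis=0)  # componentwise sum of squared values
    return N, SUM, SUMSQ

def centroid_and_variance(N, SUM, SUMSQ):
    centroid = SUM / N
    variance = SUMSQ / N - centroid ** 2  # per-dimension variance
    return centroid, variance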
MEAN-SHIFT
It is an iterative, non-parametric and versatile algorithm. It searches for the modes of a data distribution. The selection of the window size h is not trivial.
K-MEANS VS MEAN-SHIFT:
- K-means assumes that the number of clusters is already known and that clusters are spherical, while mean-shift makes no assumptions about the clusters and can handle arbitrary shapes.
- K-means is sensitive to initialization and outliers, while mean shift is fairly robust to initialization and less sensitive to outliers.
- K-means is fast, while mean shift is computationally expensive.
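A sketch of a single mean-shift trajectory with a flat kernel of bandwidth h (illustrative code; choosing h is the non-trivial part noted above):

import numpy as np

def mean_shift_point(x, X, h, tol=1e-5, max_iter=300):
    # x is assumed to start at (or near) a data point, so the window is never empty
    for _ in range(max_iter):
        in_window = np.linalg.norm(X - x, axis=1) <= h  # points inside the window of radius h
        shifted = X[in_window].mean(axis=0)             # shift x to the mean of the window
        if np.linalg.norm(shifted - x) < tol:
            break  # x has (approximately) reached a mode of the distribution
        x = shifted
    return x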
EXPECTATION-MAXIMIZATION ALGORITHM (EM)
K-means assigns each point to only one cluster (hard assignment) -> The idea is to consider soft assignment of points to clusters, so that each point has a probability of belonging to each cluster
The goal is to maximize the likelihood: theta* = argmax(theta) P(data | theta)
For each cluster 1,...,k we can use the current estimate of the parameters to compute the posterior probabilities.
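A sketch of that E-step for a Gaussian mixture, assuming SciPy is available; it computes the posterior probability (responsibility) of each cluster for each point under the current parameter estimates:

import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    # unnormalized responsibilities: pi_k * N(x | mu_k, Sigma_k) for each cluster k
    resp = np.column_stack([
        w * multivariate_normal.pdf(X, mean=m, cov=c)
        for w, m, c in zip(weights, means, covs)
    ])
    resp /= resp.sum(axis=1, keepdims=True)  # normalize so each row sums to 1
    return resp  # resp[i, k] = P(cluster k | x_i, current parameters)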
BEYOND CONVEX CLUSTERS: DBSCAN
Representative-based clustering methods are suitable for finding ellipsoid-shaped clusters,
or at best convex clusters.
Density-based methods can mine nonconvex clusters and handle noise.
LIMITATIONS: it is sensitive to the choice of epsilon, in particular when clusters have different densities: if epsilon is too small, sparser clusters are classified as noise, while if epsilon is too large, denser clusters might be merged together.
So, if there are clusters with different local densities, then a single epsilon value may not suffice.
The main cost of DBSCAN is computing the epsilon-neighborhood of each point. If the dimensionality is small, this can be done efficiently using a spatial index structure in O(n log n) time; when the dimensionality is high, it takes O(n^2). Once the neighborhood of each point has been computed, the algorithm only needs a single pass over all the points to find the density-connected clusters, so the overall complexity is O(n^2) in the worst case.
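A usage sketch with scikit-learn (assumed available), where eps and min_samples are the parameters discussed above and the data is made up for illustration:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(200, 2))   # toy data
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
# label -1 marks noise points; the other integers identify density-connected clusters
print(np.unique(labels))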
REGRESSION
Given D input variables, it assumes that the output y can be computed as: y = w_0 + sum_j w_j x_j + eps
RSS = sum( predicted values - true values )^2 = sum( residuals )^2
TSS = sum( true values - mean of true values )^2 => R^2 = 1 - RSS/TSS.
R^2 measures how well the regression line approximates the real data points.
When R^2 is equal to 1 the regression line perfectly fits the data. If it is negative, the model performs worse than simply predicting the mean: the model tries so hard to fit the training data that it becomes useless for predicting unseen cases.
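Computing RSS, TSS and R^2 as defined above (a small sketch; y_true and y_pred are assumed to be NumPy arrays of true and predicted values):

import numpy as np

def r_squared(y_true, y_pred):
    rss = np.sum((y_pred - y_true) ** 2)         # sum of squared residuals
    tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1 - rss / tss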
Gradient descent: w_t = w_{t-1} - eta * grad RSS(w)
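A sketch of that gradient descent update for linear regression; eta and the number of iterations are illustrative choices and may need tuning:

import numpy as np

def gd_linear_regression(X, y, eta=0.001, n_iter=1000):
    Xb = np.hstack([np.ones((len(X), 1)), X])  # prepend a column of ones for w_0
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        residuals = Xb @ w - y        # predicted values minus true values
        grad = 2 * Xb.T @ residuals   # gradient of RSS with respect to w
        w = w - eta * grad            # w_t = w_{t-1} - eta * grad RSS(w)
    return w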
Models should be EVALUATED using data that have not been used to build the model itself. The available data must be split between training (used to build the model) and test (used to evaluate the model) sets.
– HOLDOUT EVALUATION: reserves a certain amount of data for testing and uses the remainder for training: too small a training set might result in poor weight estimates, while too small a test set might result in a poor estimate of future performance.
– K-FOLD CROSSVALIDATION: data is
split into k subsets of equal size, each subset is used in turn for testing and the remainder for training. k-fold cross validation is a pessimistic model validator because it overestimates generalization error as less data is involved in training sets. The error estimates are averaged to yield an overall error estimate.
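A minimal k-fold cross-validation loop, assuming a model object with fit/predict methods and NumPy arrays (illustrative, not a specific library API):

import numpy as np

def k_fold_error(model, X, y, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train_idx], y[train_idx])                # train on k-1 folds
        y_pred = model.predict(X[test_idx])                  # test on the held-out fold
        errors.append(np.mean((y[test_idx] - y_pred) ** 2))  # e.g. mean squared error
    return np.mean(errors)  # averaged overall error estimate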
OVERFITTING vs UNDERFITTING: Having too few parameters leads to under-fitting -> the model isn't powerful enough to describe the data (small variance, high bias), while too many parameters lead to overfitting (high variance, small bias). By increasing the degree of the polynomial we first see improvements in cross-validation error, then the error starts to increase again as the model starts to overfit.
In regression, overfitting is associated with large weights => we add a term to RSS that penalizes large weights.
– RIDGE REGRESSION: RSS + alpha * norm_2 {w}^2
– LASSO REGRESSION: RSS + alpha * norm_1 {w} (implicitly performs feature selection; embedded filter approach)
As alpha increases, for lasso almost all parameters go to zero, while for ridge none of the weights reach zero, although they tend towards it. To choose alpha we need a VALIDATION SET: if we have enough data, we can extract a validation set (or development set) from the training data and use it to select alpha; if we don't have enough data, we should select alpha by applying k-fold cross-validation over the training data, choosing the α corresponding to the lowest average cost over the k folds.
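A sketch of the cross-validation option using scikit-learn (assumed available); the alpha grid is an arbitrary illustrative choice:

import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score

def best_alpha(model_cls, X_train, y_train, alphas=(0.01, 0.1, 1.0, 10.0), k=5):
    scores = [cross_val_score(model_cls(alpha=a), X_train, y_train,
                              cv=k, scoring='neg_mean_squared_error').mean()
              for a in alphas]
    return alphas[int(np.argmax(scores))]  # highest (least negative) score = lowest average cost

# e.g. best_alpha(Ridge, X_train, y_train) or best_alpha(Lasso, X_train, y_train)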
CLASSIFICATION
LOGISTIC REGRESSION
It computes the probability of assigning a class to an example, that is: P(class | data). Logistic regression searches for the weight vector that corresponds to the highest likelihood. Compared with linear regression, logistic regression fits 0/1 data better and its decision boundary is less sensitive to outliers. We can again use lasso and ridge regularization.
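For example, with scikit-learn (assumed available) the L2 and L1 penalties play the roles of ridge and lasso; C is the inverse of the regularization strength:

from sklearn.linear_model import LogisticRegression

l2_model = LogisticRegression(penalty='l2', C=1.0)                      # ridge-style penalty
l1_model = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')  # lasso-style penalty
# after fitting, predict_proba returns P(class | data) for each example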
SVM
It works by searching for the separating hyperplane that maximizes the margin between the classes.
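A tiny usage sketch with scikit-learn (assumed available); the data is made up for illustration:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])  # toy data
y = np.array([0, 0, 1, 1])
clf = SVC(kernel='linear', C=1.0).fit(X, y)  # maximum-margin separating hyperplane
print(clf.support_vectors_)                  # the points that define the margin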
K-NEAREST NEIGHBORS
Non-parametric method: fast to train (no model is built), but slow to predict. To decide the label for an unseen example, consider the k examples from the training data that are most similar to the unknown one and classify the unknown example using their most frequent class. If k is too small, classification might be sensitive to noise points, while if k is too large, the neighbors might include quite dissimilar examples. To determine the class it takes the majority vote of class labels among the k neighbors, or weights the votes according to distance (e.g., w = 1/d^2).
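A sketch of the distance-weighted vote (w = 1/d^2) described above, as plain NumPy code with illustrative names:

import numpy as np
from collections import defaultdict

def knn_predict(x, X_train, y_train, k=5, eps=1e-12):
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]          # indices of the k most similar examples
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (d[i] ** 2 + eps)  # vote weighted by w = 1/d^2
    return max(votes, key=votes.get)     # class with the largest total weight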
KNN VS DECISION TREES
k-NN and decision trees are both non-parametric methods. Decision trees support automatic feature interaction, whereas k-NN does not. Moreover, decision trees are faster at prediction: classifying a case is just a series of boolean comparisons, while k-NN is slower due to its expensive run-time search. k-NN determines neighborhoods, so there must be a distance metric, which may be affected by varying scales between attributes and by high-dimensional spaces.
A decision tree, on the other hand, builds a model in advance and uses it to predict a class for a given input vector. The k-NN classifier does not build any classification model: it learns directly from the training instances (observations) and starts processing data only when it is given a test observation to classify. Thus, k-NN falls into the category of "lazy learner" approaches.
Find nearest neighbors efficiently: KD-TREES, which split the space hierarchically using a tree generated from the data. A possible split direction is the one with the greatest variance; a possible split point is the median value along that direction. They become inefficient when the number of attributes is too large => BALL TREES, which avoid intersections of hyper-rectangles.
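For example, SciPy (assumed available) provides a k-d tree for fast neighbor queries; the tree is built once from the training data and then queried:

import numpy as np
from scipy.spatial import cKDTree

X = np.random.default_rng(0).normal(size=(1000, 3))  # toy data
tree = cKDTree(X)                  # built once from the training data
dist, idx = tree.query(X[0], k=5)  # distances and indices of the 5 nearest neighbors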
NAIVE BAYES
The a priori probability of class C, P(C), is the probability of the event before evidence is seen. The a posteriori probability, P(Class | Data), is the probability of the event after evidence is seen. It assumes that attributes are equally important and statistically independent. P(Class | Data) is proportional to P(Data | Class) P(Class), where P(Data | Class) is the term that is actually computed from the data. Zero-frequency problem -> Laplace estimator. Naive Bayes works well even if the independence assumption is violated, because classification does not require accurate probability estimates as long as the maximum probability is assigned to the correct class.
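A sketch of the Laplace estimator for a categorical attribute, the standard fix for the zero-frequency problem (illustrative helper, not from the notes):

from collections import Counter

def laplace_prob(value, values_in_class, n_distinct_values, alpha=1):
    # P(attribute = value | class), smoothed so that unseen values never get probability 0
    counts = Counter(values_in_class)
    return (counts[value] + alpha) / (len(values_in_class) + alpha * n_distinct_values)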
DECISION TREES
An internal node is a test on an attribute, a branch represents an outcome of the test, a leaf represents a class label or a class label distribution; at each node one attribute is chosen to split training examples into distinct classes. A new case is classified following a matching path to a leaf node.
Attribute for splitting: Information Gain (ID3), Information Gain Ratio (C4.5), Gini Index (CART).
INFORMATION GAIN
It increases with the average purity of the subsets that an attribute produces: choose the attribute that results in the greatest information gain.
info_before_splitting = Entropy(p1, p2, p3, ...) = -p1 log2(p1) - p2 log2(p2) - ...
Information gain = info_before_splitting - info_after_splitting
info_after_splitting = |D1|/|D| * Entropy(D1) + |D2|/|D| * Entropy(D2) + ...
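A sketch of these quantities in code (illustrative NumPy helpers; `splits` is a list of label lists, one per branch of the split):

import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(parent_labels, splits):
    n = len(parent_labels)
    info_after = sum(len(s) / n * entropy(s) for s in splits)  # weighted entropy of the subsets
    return entropy(parent_labels) - info_after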
INFORMATION GAIN RATIO
Highly branching attributes might result in an overfitted tree, so we weight the information gain in such a way that the algorithm does not favor splitting on such attributes.
Gain_Ratio = Gain / Intrinsic_Info
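A sketch of the gain ratio, reusing the entropy and information_gain helpers from the previous block; the intrinsic information is the entropy of the branch proportions |Di|/|D|:

import numpy as np

def gain_ratio(parent_labels, splits):
    n = len(parent_labels)
    props = np.array([len(s) / n for s in splits if len(s) > 0])
    intrinsic = -(props * np.log2(props)).sum()  # entropy of the split proportions
    return information_gain(parent_labels, splits) / intrinsic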
GINI INDEX
The Gini index measures the error rate of a random classifier that assigns classes to instances according to their prior probabilities.
gini(T) = 1 - sum pj^2
The attribute providing the largest reduction in impurity is chosen to split the node
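A sketch matching these definitions (illustrative helpers; `splits` is again a list of label lists, one per branch):

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_reduction(parent_labels, splits):
    # impurity reduction obtained by splitting parent_labels into the given branches
    n = len(parent_labels)
    return gini(parent_labels) - sum(len(s) / n * gini(s) for s in splits)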
CLASSIFICATION ENSEMBLES
Generate a set of classifiers from the training data and predict the class label of unseen examples by aggregating the predictions of multiple classifiers.