KNN VS DECISION TREES
k-NN and decision trees are both non-parametric methods: training is computationally cheap, but prediction can be slow (especially for k-NN, which defers all the work to query time).
Decision trees support automatic feature interaction, whereas k-NN cannot. Moreover, decision trees are faster at prediction time: classifying an instance is just a series of boolean comparisons, while k-NN is slower because of its expensive run-time search for neighbors. k-NN determines neighborhoods, so there must be a distance metric, which may be affected by varying scales between attributes and by high-dimensional spaces. A decision tree, on the other hand, directly predicts a class for a given input vector. The k-NN classifier does not build any classification model: it learns directly from the training instances (observations) and starts processing data only when it is given a test observation to classify. Thus, k-NN falls into the category of "lazy learner" approaches.
Which similarity measures can be used?
- Cosine Distance
- Jaccard Distance
- Hamming Distance: can be interpreted through the percentage of identical attributes, m / n, where n = # of variables and m = # of matching components (see the sketch below)
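A minimal sketch of the matching-based similarity for categorical vectors (the function name and the example values are illustrative):

```python
def matching_similarity(a, b):
    """Fraction of identical attributes: m / n (Hamming-style similarity)."""
    assert len(a) == len(b)
    n = len(a)                                   # n = number of variables
    m = sum(1 for x, y in zip(a, b) if x == y)   # m = number of matching components
    return m / n

# Example: 3 of 4 attributes match -> similarity 0.75
print(matching_similarity(["sunny", "hot", "high", "weak"],
                          ["sunny", "hot", "high", "strong"]))
```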
To determine the class from the list of nearest neighbors:
- Take the majority vote of class labels among the k neighbors
- Or weight the vote according to distance (e.g., w = 1/d^2), as in the sketch below
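A minimal sketch of both voting schemes, assuming `neighbors` is a list of (distance, class label) pairs already produced by the k-NN search:

```python
from collections import Counter, defaultdict

def majority_vote(neighbors):
    """neighbors: list of (distance, label) pairs for the k nearest points."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def weighted_vote(neighbors, eps=1e-12):
    """Weight each neighbor's vote by 1 / d^2 (eps avoids division by zero)."""
    scores = defaultdict(float)
    for d, label in neighbors:
        scores[label] += 1.0 / (d * d + eps)
    return max(scores, key=scores.get)

neighbors = [(0.5, "yes"), (1.0, "no"), (2.0, "no")]
print(majority_vote(neighbors))   # 'no'  (2 votes vs 1)
print(weighted_vote(neighbors))   # 'yes' (closest point dominates: 4.0 vs 1.25)
```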
Normalization and Other Issues
Different attributes are measured on different scales!
We often need to normalize them, using for instance range normalization or standard normalization
- For nominal attributes, the distance is either 0 (the value is the same) or 1 (different)
- Missing values are usually assumed to be maximally distant
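A minimal sketch of the two normalizations mentioned above (range/min-max and standard/z-score), using NumPy; the small matrix is made up just to show the effect:

```python
import numpy as np

X = np.array([[180.0, 30.0],
              [160.0, 45.0],
              [175.0, 22.0]])   # attributes on very different scales

# Range (min-max) normalization: rescale each attribute to [0, 1]
X_range = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standard (z-score) normalization: zero mean, unit standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```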
Finding Nearest Neighbors Efficiently
Basic Approach: Full scan of the data
Classification time O(nd): it can be prohibitive when the training set is large
Nearest-neighbor search can be sped up by using
- KD-Trees
- Ball-Trees
KD-Trees
Split the space hierarchically using a tree generated from the data
Tree construction
- Add points iteratively to the tree
- Each new point falls into a leaf and splits the region around it based on the value of one of its attributes
Search for nearest neighbors
- Navigate the tree down to the leaf that contains the query point and check the points stored there
- Backtrack up the tree to check nearby regions
- Stop when all k nearest neighbors have been found
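A minimal sketch of this kind of search using `scipy.spatial.KDTree`; the random data is only there to show the API:

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
X = rng.random((1000, 3))          # 1000 training points in 3-D

tree = KDTree(X)                   # build the tree once
query = rng.random(3)

# distances and indices of the 5 nearest training points
dist, idx = tree.query(query, k=5)
print(dist, idx)
```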
Effectiveness of KD-trees
Search complexity depends on the depth of the tree (O(log n) for balanced trees) → occasional rebalancing of the tree by randomizing the order of the data is an option
How to build a good tree?
- We need to find a good split point and split direction
- Possible split direction: the direction with greatest variance
- Possible split point: the median value along that direction
NB: using value closest to mean (rather than median) can be better if data is skewed
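A minimal sketch of the splitting heuristic just described (greatest-variance direction, median split point), as a standalone NumPy function rather than a full tree builder; the sample points are made up:

```python
import numpy as np

def choose_split(points):
    """points: (n, d) array. Returns (split_dimension, split_value)."""
    split_dim = int(np.argmax(points.var(axis=0)))      # direction of greatest variance
    split_val = float(np.median(points[:, split_dim]))  # median value along that direction
    return split_dim, split_val

points = np.array([[1.0, 10.0], [2.0, 50.0], [3.0, 20.0], [4.0, 80.0]])
print(choose_split(points))   # splits on the second attribute (larger variance), at its median
```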
Observation: KD-trees can become inefficient when the number of attributes is too large!
Ball Trees
Corners in high-dimensional space may mean that the query ball intersects with many regions, and this is a potential limitation of KD-trees!
Note that there is no need to ensure that regions do not overlap, so they do not need to be hyper-rectangles.
Hyper-spheres (balls) can be used instead of hyper-rectangles. A ball tree organizes the data into a tree of k-dimensional hyper-spheres. While the balls themselves may intersect, each point is assigned to one or the other ball in the partition according to its distance from the ball's center → more efficient search.
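A minimal sketch of the corresponding query with `sklearn.neighbors.BallTree`; as above, the random high-dimensional data is only there to show the API:

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X = rng.random((1000, 50))          # high-dimensional data, where KD-trees tend to degrade

tree = BallTree(X, leaf_size=40)    # hierarchy of hyper-spheres
dist, idx = tree.query(X[:1], k=5)  # 5 nearest neighbors of the first point
print(dist, idx)
```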
LECTURE 12 - NAIVE BAYES & BAYESIAN NETWORKS
- The a priori probability of C, P(C), is the probability of the event before any evidence is seen
- The a posteriori probability of C, P(C|A), is the probability of the event after the evidence A is seen
Bayes' Theorem says: P(C|A) = P(A|C) P(C) / P(A)
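A tiny worked example of the theorem with made-up numbers (all probabilities below are purely illustrative):

```python
# Hypothetical numbers: P(C) = 0.01, P(A|C) = 0.9, P(A|not C) = 0.2
p_c = 0.01
p_a_given_c = 0.9
p_a_given_not_c = 0.2

p_a = p_a_given_c * p_c + p_a_given_not_c * (1 - p_c)   # total probability of the evidence
p_c_given_a = p_a_given_c * p_c / p_a                   # Bayes' theorem
print(round(p_c_given_a, 3))   # ~0.043: the posterior stays small because the prior is small
```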
Naïve Bayes Classifiers assume that:
- attributes are equally important
- attributes are statistically independent
Independence assumption is almost never correct! But the scheme works well in practice…
Training:
Use the counts to compute estimates for the class probability P(y) and the conditional probability P(xi|y)
Testing:
Given an example x, compute the most likely class as y* = argmax_y P(y) ∏_i P(x_i|y)
Example: remember to normalize the class scores so that the posterior probabilities sum to 1!
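A minimal sketch of the testing step, assuming the priors P(y) and the conditionals P(xi|y) have already been estimated from the training counts (the numbers below are illustrative):

```python
# Hypothetical estimates: priors[y] = P(y), cond[y][i][v] = P(x_i = v | y)
priors = {"yes": 9 / 14, "no": 5 / 14}
cond = {
    "yes": [{"sunny": 2 / 9, "overcast": 4 / 9, "rainy": 3 / 9},
            {"high": 3 / 9, "normal": 6 / 9}],
    "no":  [{"sunny": 3 / 5, "overcast": 0 / 5, "rainy": 2 / 5},
            {"high": 4 / 5, "normal": 1 / 5}],
}

def predict(x):
    """x: list of attribute values. Returns (most likely class, normalized posteriors)."""
    scores = {}
    for y in priors:
        score = priors[y]
        for i, v in enumerate(x):
            score *= cond[y][i][v]          # independence assumption
        scores[y] = score
    total = sum(scores.values())            # remember to normalize!
    posteriors = {y: s / total for y, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors

print(predict(["sunny", "high"]))
```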
“Zero-Frequency Problem”
What if an attribute value does not occur with every class value?
The corresponding probability will be zero → the a posteriori probability will also be zero!
The typical remedy is to add 1 to the count of every (attribute value, class) pair.
This process is called Laplace smoothing (Laplace estimator).
Sometimes it is more appropriate to add a constant different from one.
For example, for an attribute with three values we can add the same amount μ/3 to each count (μ is a regularization parameter).
If background knowledge is available on the frequency of certain values, we can add different weights to each count, e.g. μp1, μp2, μp3, where p1 + p2 + p3 = 1 are prior estimates of the probability of each value.
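A minimal sketch of the smoothed estimate, in the common "add μ to every count" parametrization (the μ/3 variant above divides μ across an attribute's three values instead); the counts and names are illustrative:

```python
def smoothed_conditional(count, class_count, v, y, n_values, mu=1.0):
    """P(x_i = v | y) with additive smoothing: add mu to every (value, class) count."""
    return (count.get((v, y), 0) + mu) / (class_count[y] + mu * n_values)

# Example: 'overcast' never occurs with class 'no' (0 out of 5, with 3 possible values)
count = {("sunny", "no"): 3, ("rainy", "no"): 2}
class_count = {"no": 5}
print(smoothed_conditional(count, class_count, "overcast", "no", n_values=3))  # 0.125 instead of 0
```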
How does Naïve Bayes deal with missing values?
The attribute will be omitted from calculation!
Numeric Attributes in Naïve Bayes
So far, we applied Naïve Bayes to categorical data.
What if some (or all) of the attributes are numeric?
Two options:
- Discretize the data to make it binary or discrete
- Compute a probability density for each class:
1. Assume a parametric form for the distribution and estimate its parameters (e.g. Gaussian distribution: estimate mean and variance with the sample mean and sample variance)
2. Directly estimate the probability density from the data (e.g. kernel smoothing)
NB: missing values during training are not included in the calculation of the mean and standard deviation!
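A minimal sketch of option 1 (Gaussian assumption): estimate mean and standard deviation per class during training, then plug the test value into the normal density. The per-class estimates below are made up:

```python
import math

def gaussian_density(x, mean, std):
    """Normal probability density, used as P(x_i | y) for a numeric attribute."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# Hypothetical per-class estimates for a numeric attribute 'temperature'
mean_yes, std_yes = 73.0, 6.2
mean_no,  std_no  = 74.6, 7.9

x = 66.0
print(gaussian_density(x, mean_yes, std_yes))   # density for class 'yes'
print(gaussian_density(x, mean_no,  std_no))    # density for class 'no'
```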
Discussion
Naïve Bayes works surprisingly well, even if the independence assumption is clearly violated! Why? Because classification doesn't require accurate probability estimates, as long as the maximum probability is assigned to the correct class.
However:
- Adding too many redundant attributes will cause problems
- If many numeric attributes are not normally distributed, the Gaussian assumption may be a poor fit and should be replaced (e.g. by kernel density estimation)
Bayesian Belief Networks
Bayesian Belief Networks (BBN) allow us to specify which pairs of attributes are conditionally independent.
They provide a graphical representation of probabilistic relationships among a set of random variables.
Two key elements:
- A directed acyclic graph, encoding the dependence relationships among the variables
- A probability table associating each node with its immediate parent nodes
- If a node X does not have any parents, then the table contains only the prior probability P(X)
- If a node X has only one parent Y, then the table contains the conditional probability P(X|Y)
- If a node X has multiple parents (Y1, ..., Yk), then the table contains the conditional probability P(X|Y1, ..., Yk)
NB: in general, inference is NP-complete, but there are approximate methods (e.g. Monte Carlo)
Final remarks on Bayesian Networks:
- They encode causal dependencies among variables
- They are well suited to deal with incomplete data
- They are robust to overfitting
- When used to build a classifier, they extend Naïve Bayes by removing the conditional independence assumption
Other benefits:
- They can be used to capture prior knowledge
- The graphical model describing the relationships between variables is often useful for understanding/visualizing the data
Building the network is time-consuming (it requires searching a large space of possible conditional independence relations); adding variables to an existing network, however, is simple.
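A minimal sketch of how the probability tables described above can be stored and combined, for a hypothetical three-node network A → C ← B with made-up numbers:

```python
# Prior tables for the parentless nodes, conditional table for C given its parents (A, B)
P_A = {True: 0.3, False: 0.7}
P_B = {True: 0.6, False: 0.4}
P_C_given_AB = {             # P(C=True | A, B); P(C=False | A, B) = 1 - value
    (True, True): 0.9, (True, False): 0.5,
    (False, True): 0.4, (False, False): 0.1,
}

def joint(a, b, c):
    """P(A=a, B=b, C=c) factorized along the DAG: P(A) P(B) P(C | A, B)."""
    p_c = P_C_given_AB[(a, b)] if c else 1 - P_C_given_AB[(a, b)]
    return P_A[a] * P_B[b] * p_c

print(joint(True, False, True))   # 0.3 * 0.4 * 0.5 = 0.06
```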
LECTURE 13 - DECISION TREES
What is a Decision Tree?
- An internal node is a test on an attribute
- A branch represents an outcome of the test
- A leaf node represents a class or a class distribution
- At each node, an attribute is chosen to split the training examples into classes that are as distinct as possible
- A new case is classified by following the corresponding path down to a leaf node
Example: Computing Decision Trees
Top-down tree construction: the examples are split recursively, choosing one attribute at a time
Bottom-up tree pruning: remove subtrees or branches in order to improve the estimated accuracy on new cases.
Which attribute should be used for the split?
A purity (or impurity) measure is used for this purpose
Typical measures used:
- Information Gain (ID3)
- Information Gain Ratio (C4.5)
- Gini Index (CART)
Information Gain
It increases with the average purity of the subsets that an attribute produces
Splitting Strategy: choose the attribute that results in greatest information gain
Information is measured in bits
Given a probability distribution, we compute the distribution's entropy (logs are in base 2): entropy(p1, ..., pn) = -p1 log2(p1) - ... - pn log2(pn)
Example: one of the resulting subsets contains 2 yes and 3 no, so its information is info([2,3]) = entropy(2/5, 3/5) ≈ 0.971 bits
What is Information Gain?
Difference between the information before split and the information after split
- The information before the split, info(D), is the entropy
- The information after the split using attribute A is computed as the weighted sum of the entropies on each split
In the previous example: before splitting on outlook, we had 9 yes and 5 no, so info(D) = entropy(9/14, 5/14) ≈ 0.940 bits
NB: Not all the leaves need to be pure!
Splitting stops when there is nothing to gain by going further, or when there are no remaining attributes for further partitioning.
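A minimal sketch of the computation, using the 9-yes/5-no counts above and assuming the per-branch counts of the classic weather example for the split on outlook:

```python
import math

def entropy(counts):
    """Entropy in bits of a class-count list, e.g. [9, 5]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, branch_counts):
    """info(D) minus the weighted sum of the entropies of the branches."""
    total = sum(parent_counts)
    after = sum(sum(b) / total * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - after

parent = [9, 5]                        # 9 yes, 5 no before the split
outlook = [[2, 3], [4, 0], [3, 2]]     # assumed branch counts: sunny, overcast, rainy
print(entropy(parent))                 # ~0.940 bits
print(information_gain(parent, outlook))  # ~0.247 bits
```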
Many-valued Attributes (or Highly-Branching Attributes)
Attributes with a large number of values are usually problematic (e.g. id, primary keys, ...)
Why?
Subsets are more likely to be pure if there is a large number of values, so Information Gain is likely to choose attributes with a large number of values → this may result in overfitting!
Information Gain Ratio
Modification of the Information Gain that reduces bias toward highly-branching attributes.
Information Gain Ratio takes number and size of branches into account when choosing an attribute.
The correction uses the intrinsic information of a split, which is:
- Large when the data is evenly spread across the branches
- Small when all the data belong to one branch
Intrinsic Information: the entropy of the distribution of instances into branches, IntrinsicInfo(S, A) = -Σi (|Si| / |S|) log2(|Si| / |S|), where S is the original dataset and A is the attribute that splits S into S1, S2, …
Gain Ratio: GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)
The importance of an attribute decreases as the intrinsic information gets larger!
Attention: in practice, tree learners often use both IG and IGR:
- First, only consider attributes with greater than average Information Gain;
- Then, compare them using the Information Gain Ratio.
Why?
Information Gain Ratio may overcompensate and choose an attribute just because its intrinsic information is very low!
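A minimal sketch of the correction, assuming the same weather example where outlook splits the 14 instances into branches of size 5, 4 and 5 and has an information gain of about 0.247 bits:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def intrinsic_info(branch_sizes):
    """Entropy of the distribution of instances across branches."""
    return entropy(branch_sizes)

def gain_ratio(info_gain, branch_sizes):
    return info_gain / intrinsic_info(branch_sizes)

print(intrinsic_info([5, 4, 5]))      # ~1.577 bits
print(gain_ratio(0.247, [5, 4, 5]))   # ~0.157
```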
Numerical Attributes
Evaluating splits on a numeric attribute requires the instances at a node to be sorted by that attribute's value. Repeated sorting can be avoided: the sort order of the instances at a child node can be derived from the sort order of their parent, which improves the efficiency of tree-based algorithms.
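A minimal sketch of that trick, assuming we store, for each numeric attribute, the list of instance indices sorted by that attribute's value (all names below are illustrative):

```python
def child_sort_orders(parent_order, goes_left):
    """parent_order: instance indices sorted by a numeric attribute at the parent node.
    goes_left[i]: True if instance i is routed to the left child by the chosen split.
    Returns the sorted orders of both children in O(n), without re-sorting."""
    left = [i for i in parent_order if goes_left[i]]
    right = [i for i in parent_order if not goes_left[i]]
    return left, right

# Instances sorted by some numeric attribute at the parent; the split sends 1 and 3 left
parent_order = [2, 0, 3, 1]
goes_left = {0: False, 1: True, 2: False, 3: True}
print(child_sort_orders(parent_order, goes_left))   # ([3, 1], [2, 0]) - both still sorted
```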