KNN VS DECISION TREES
k-NN and decision trees are both non-parametric methods: training is computationally cheap, but prediction can be slow (especially for k-NN, which defers all the work to query time).
Decision trees support automatic feature interaction, whereas k-NN cannot. Moreover, decision trees are faster at prediction time: classifying an instance is just a series of boolean comparisons, while k-NN is slower because of its expensive run-time search for neighbors. k-NN determines neighborhoods, so there must be a distance metric, which may be affected by varying scales between attributes and by high-dimensional spaces. A decision tree, on the other hand, directly predicts a class for a given input vector. The k-NN classifier does not build any classification model: it learns directly from the training instances (observations) and starts processing data only when it is given a test observation to classify. Thus, k-NN falls into the category of "lazy learner" approaches.
Which similarity measures can be used?
- Cosine Distance
- Jaccard Distance
- Hamming Distance: can be interpreted through the percentage of identical attributes, m / n, where n = # of variables and m = # of matching components (see the sketch below)
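A minimal sketch of the matching-based similarity for categorical vectors (the function name and the example values are illustrative):

```python
def matching_similarity(a, b):
    """Fraction of identical attributes: m / n (Hamming-style similarity)."""
    assert len(a) == len(b)
    n = len(a)                                   # n = number of variables
    m = sum(1 for x, y in zip(a, b) if x == y)   # m = number of matching components
    return m / n

# Example: 3 of 4 attributes match -> similarity 0.75
print(matching_similarity(["sunny", "hot", "high", "weak"],
                          ["sunny", "hot", "high", "strong"]))
```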
To determine the class from the list of nearest neighbors:
- Take the majority vote of class labels among the k neighbors
- Or weight the vote according to distance (e.g., w = 1/d^2), as in the sketch below
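A minimal sketch of both voting schemes, assuming `neighbors` is a list of (distance, class label) pairs already produced by the k-NN search:

```python
from collections import Counter, defaultdict

def majority_vote(neighbors):
    """neighbors: list of (distance, label) pairs for the k nearest points."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def weighted_vote(neighbors, eps=1e-12):
    """Weight each neighbor's vote by 1 / d^2 (eps avoids division by zero)."""
    scores = defaultdict(float)
    for d, label in neighbors:
        scores[label] += 1.0 / (d * d + eps)
    return max(scores, key=scores.get)

neighbors = [(0.5, "yes"), (1.0, "no"), (2.0, "no")]
print(majority_vote(neighbors))   # 'no'  (2 votes vs 1)
print(weighted_vote(neighbors))   # 'yes' (closest point dominates: 4.0 vs 1.25)
```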
Normalization and Other Issues
Different attributes are measured on different scales!
We often need to normalize them, using for instance range normalization or standard normalization
- For nominal attributes, the distance is either 0 (the value is the same) or 1 (different)
- Missing values are usually assumed to be maximally distant
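A minimal sketch of the two normalizations mentioned above (range/min-max and standard/z-score), using NumPy; the small matrix is made up just to show the effect:

```python
import numpy as np

X = np.array([[180.0, 30.0],
              [160.0, 45.0],
              [175.0, 22.0]])   # attributes on very different scales

# Range (min-max) normalization: rescale each attribute to [0, 1]
X_range = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standard (z-score) normalization: zero mean, unit standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```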
Finding Nearest Neighbors Efficiently
Basic Approach: Full scan of the data
Classification time O(nd): it can be prohibitive when the training set is large
Nearest-neighbor search can be sped up by using
- KD-Trees
- Ball-Trees
KD-Trees
Split the space hierarchically using a tree generated from the data
Tree construction
- Add points iteratively to the tree
- Each new point falls into a leaf and splits the region around it based on the value of one of its attributes
Search for nearest neighbors
- Navigate the tree down to the leaf that contains the query point and check the points stored there
- Backtrack up the tree to check nearby regions
- Stop when all k nearest neighbors have been found
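A minimal sketch of this kind of search using `scipy.spatial.KDTree`; the random data is only there to show the API:

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
X = rng.random((1000, 3))          # 1000 training points in 3-D

tree = KDTree(X)                   # build the tree once
query = rng.random(3)

# distances and indices of the 5 nearest training points
dist, idx = tree.query(query, k=5)
print(dist, idx)
```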
Effectiveness of KD-trees
Search complexity depends on the depth of the tree (O(log n) for balanced trees) → occasional rebalancing of the tree by randomizing the order of the data is an option
How to build a good tree?
- We need to find a good split point and split direction
- Possible split direction: the direction with greatest variance
- Possible split point: the median value along that direction
NB: using value closest to mean (rather than median) can be better if data is skewed
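A minimal sketch of the splitting heuristic just described (greatest-variance direction, median split point), as a standalone NumPy function rather than a full tree builder; the sample points are made up:

```python
import numpy as np

def choose_split(points):
    """points: (n, d) array. Returns (split_dimension, split_value)."""
    split_dim = int(np.argmax(points.var(axis=0)))      # direction of greatest variance
    split_val = float(np.median(points[:, split_dim]))  # median value along that direction
    return split_dim, split_val

points = np.array([[1.0, 10.0], [2.0, 50.0], [3.0, 20.0], [4.0, 80.0]])
print(choose_split(points))   # splits on the second attribute (larger variance), at its median
```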
Observation: KD-trees can become inefficient when the number of attributes is too large!
Ball Trees
Corners in high-dimensional space may mean that the query ball intersects with many regions, and this is a potential limitation of KD-trees!
Note that there is no need to ensure that regions do not overlap, so they do not need to be hyper-rectangles.
Hyper-spheres (balls) can be used instead of hyper-rectangles. A ball tree organizes the data into a tree of k-dimensional hyper-spheres. While the balls themselves may intersect, each point is assigned to one or the other ball in the partition according to its distance from the ball's center → more efficient search.
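A minimal sketch of the corresponding query with `sklearn.neighbors.BallTree`; as above, the random high-dimensional data is only there to show the API:

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X = rng.random((1000, 50))          # high-dimensional data, where KD-trees tend to degrade

tree = BallTree(X, leaf_size=40)    # hierarchy of hyper-spheres
dist, idx = tree.query(X[:1], k=5)  # 5 nearest neighbors of the first point
print(dist, idx)
```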
LECTURE 12 - NAIVE BAYES & BAYESIAN NETWORKS
- The a priori probability of C, P(C), is the probability of the event before any evidence is seen
- The a posteriori probability of C, P(C|A), is the probability of the event after the evidence A is seen
Bayes' Theorem says: P(C|A) = P(A|C) P(C) / P(A)
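A tiny worked example of the theorem with made-up numbers (all probabilities below are purely illustrative):

```python
# Hypothetical numbers: P(C) = 0.01, P(A|C) = 0.9, P(A|not C) = 0.2
p_c = 0.01
p_a_given_c = 0.9
p_a_given_not_c = 0.2

p_a = p_a_given_c * p_c + p_a_given_not_c * (1 - p_c)   # total probability of the evidence
p_c_given_a = p_a_given_c * p_c / p_a                   # Bayes' theorem
print(round(p_c_given_a, 3))   # ~0.043: the posterior stays small because the prior is small
```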
Naïve Bayes Classifiers assume that:
- attributes are equally important
- attributes are statistically independent
Independence assumption is almost never correct! But the scheme works well in practice…
Training:
Use the counts to compute estimates for the class probability P(y) and the conditional probability P(xi|y)
Testing:
Given an example x, compute the most likely class as y* = argmax_y P(y) ∏_i P(x_i|y)
Example: remember to normalize the class scores so that the posterior probabilities sum to 1!
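A minimal sketch of the testing step, assuming the priors P(y) and the conditionals P(xi|y) have already been estimated from the training counts (the numbers below are illustrative):

```python
# Hypothetical estimates: priors[y] = P(y), cond[y][i][v] = P(x_i = v | y)
priors = {"yes": 9 / 14, "no": 5 / 14}
cond = {
    "yes": [{"sunny": 2 / 9, "overcast": 4 / 9, "rainy": 3 / 9},
            {"high": 3 / 9, "normal": 6 / 9}],
    "no":  [{"sunny": 3 / 5, "overcast": 0 / 5, "rainy": 2 / 5},
            {"high": 4 / 5, "normal": 1 / 5}],
}

def predict(x):
    """x: list of attribute values. Returns (most likely class, normalized posteriors)."""
    scores = {}
    for y in priors:
        score = priors[y]
        for i, v in enumerate(x):
            score *= cond[y][i][v]          # independence assumption
        scores[y] = score
    total = sum(scores.values())            # remember to normalize!
    posteriors = {y: s / total for y, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors

print(predict(["sunny", "high"]))
```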
“Zero-Frequency Problem”
What if an attribute value does not occur with every class value?
The corresponding probability will be zero → the a posteriori probability will also be zero!
The typical remedy is to add 1 to the count of every (attribute value, class) pair.
This process is called Laplace smoothing (Laplace estimator).
Sometimes it is more appropriate to add a constant different from one.
For example, for an attribute with three values we can add the same amount μ/3 to each count (μ is a regularization parameter).
If background knowledge is available on the frequency of certain values, we can add different weights to each count, e.g. μp1, μp2, μp3, where p1 + p2 + p3 = 1 are prior estimates of the probability of each value.
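A minimal sketch of the smoothed estimate, in the common "add μ to every count" parametrization (the μ/3 variant above divides μ across an attribute's three values instead); the counts and names are illustrative:

```python
def smoothed_conditional(count, class_count, v, y, n_values, mu=1.0):
    """P(x_i = v | y) with additive smoothing: add mu to every (value, class) count."""
    return (count.get((v, y), 0) + mu) / (class_count[y] + mu * n_values)

# Example: 'overcast' never occurs with class 'no' (0 out of 5, with 3 possible values)
count = {("sunny", "no"): 3, ("rainy", "no"): 2}
class_count = {"no": 5}
print(smoothed_conditional(count, class_count, "overcast", "no", n_values=3))  # 0.125 instead of 0
```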
How does Naïve Bayes deal with missing values?
The attribute will be omitted from calculation!
Numeric Attributes in Naïve Bayes
So far, we applied Naïve Bayes to categorical data.
What if some (or all) of the attributes are numeric?
Two options:
- Discretize the data to make it binary or discrete
- Compute a probability density for each class:
1. Assume a parametric form for the distribution and estimate its parameters (e.g. Gaussian distribution: estimate mean and variance with the sample mean and sample variance)
2. Directly estimate the probability density from the data (e.g. kernel smoothing)
NB: missing values during training are not included in the calculation of the mean and standard deviation!
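A minimal sketch of option 1 (Gaussian assumption): estimate mean and standard deviation per class during training, then plug the test value into the normal density. The per-class estimates below are made up:

```python
import math

def gaussian_density(x, mean, std):
    """Normal probability density, used as P(x_i | y) for a numeric attribute."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# Hypothetical per-class estimates for a numeric attribute 'temperature'
mean_yes, std_yes = 73.0, 6.2
mean_no,  std_no  = 74.6, 7.9

x = 66.0
print(gaussian_density(x, mean_yes, std_yes))   # density for class 'yes'
print(gaussian_density(x, mean_no,  std_no))    # density for class 'no'
```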
Discussion
Naïve Bayes works surprisingly well, even if the independence assumption is clearly violated! Why? Because classification doesn't require accurate probability estimates, as long as the maximum probability is assigned to the correct class.
However:
- Adding too many redundant attributes will cause problems
- If many numeric attributes are not normally distributed, the Gaussian assumption may be a poor fit and should be replaced (e.g. by kernel density estimation)
Bayesian Belief Networks
Bayesian Belief Networks (BBN) allow us to specify which pairs of attributes are conditionally independent.
They provide a graphical representation of probabilistic relationships among a set of random variables.
Two key elements:
- A directed acyclic graph, encoding the dependence relationships among the variables
- A probability table associating each node with its immediate parent nodes
- If a node X does not have any parents, then the table contains only the prior probability P(X)
- If a node X has only one parent Y, then the table contains the conditional probability P(X|Y)
- If a node X has multiple parents (Y1, ..., Yk), then the table contains the conditional probability P(X|Y1, ..., Yk)
NB: in general, inference is NP-complete, but there are approximate methods (e.g. Monte Carlo)
Final remarks on Bayesian Networks:
- They encode causal dependencies among variables
- They are well suited to deal with incomplete data
- They are robust to overfitting
- When used to build a classifier, they extend Naïve Bayes by removing the conditional independence assumption
Other benefits:
- They can be used to capture prior knowledge
- The graphical model describing the relationships between variables is often useful for understanding/visualizing the data
Building the network is time-consuming (it requires searching a large space of possible conditional independence relations); adding variables to an existing network, however, is simple.
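A minimal sketch of how the probability tables described above can be stored and combined, for a hypothetical three-node network A → C ← B with made-up numbers:

```python
# Prior tables for the parentless nodes, conditional table for C given its parents (A, B)
P_A = {True: 0.3, False: 0.7}
P_B = {True: 0.6, False: 0.4}
P_C_given_AB = {             # P(C=True | A, B); P(C=False | A, B) = 1 - value
    (True, True): 0.9, (True, False): 0.5,
    (False, True): 0.4, (False, False): 0.1,
}

def joint(a, b, c):
    """P(A=a, B=b, C=c) factorized along the DAG: P(A) P(B) P(C | A, B)."""
    p_c = P_C_given_AB[(a, b)] if c else 1 - P_C_given_AB[(a, b)]
    return P_A[a] * P_B[b] * p_c

print(joint(True, False, True))   # 0.3 * 0.4 * 0.5 = 0.06
```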
LECTURE 13 - DECISION TREES
What is a Decision Tree?
- An internal node is a test on an attribute
- A branch represents an outcome of the test
- A leaf node represents a class or a class distribution
- At each node, an attribute is chosen to split the training examples into classes that are as distinct as possible
- A new case is classified by following the corresponding path down to a leaf node
Example: Computing Decision Trees
Top-down tree construction: the examples are split recursively, choosing one attribute at a time
Bottom-up tree pruning: remove subtrees or branches in order to improve the estimated accuracy on new cases.
Which attribute should be used for the split?
A purity (or impurity) measure is used for this purpose
Typical measures used:
- Information Gain (ID3)
- Information Gain Ratio (C4.5)
- Gini Index (CART)
Information Gain
It increases with the average purity of the subsets that an attribute produces
Splitting Strategy: choose the attribute that results in greatest information gain
Information is measured in bits
Given a probability distribution, we compute the distribution's entropy (logs are in base 2): entropy(p1, ..., pn) = -p1 log2(p1) - ... - pn log2(pn)
Example: one of the resulting subsets contains 2 yes and 3 no, so its information is info([2,3]) = entropy(2/5, 3/5) ≈ 0.971 bits
What is Information Gain?
Difference between the information before split and the information after split
- The information before the split, info(D), is the entropy
- The information after the split using attribute A is computed as the weighted sum of the entropies on each split
In the previous example: before splitting on outlook, we had 9 yes and 5 no, so info(D) = entropy(9/14, 5/14) ≈ 0.940 bits
NB: Not all the leaves need to be pure!
Splitting stops when there is nothing to gain by going further, or when there are no remaining attributes for further partitioning.
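A minimal sketch of the computation, using the 9-yes/5-no counts above and assuming the per-branch counts of the classic weather example for the split on outlook:

```python
import math

def entropy(counts):
    """Entropy in bits of a class-count list, e.g. [9, 5]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, branch_counts):
    """info(D) minus the weighted sum of the entropies of the branches."""
    total = sum(parent_counts)
    after = sum(sum(b) / total * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - after

parent = [9, 5]                        # 9 yes, 5 no before the split
outlook = [[2, 3], [4, 0], [3, 2]]     # assumed branch counts: sunny, overcast, rainy
print(entropy(parent))                 # ~0.940 bits
print(information_gain(parent, outlook))  # ~0.247 bits
```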
Many-valued Attributes (or Highly-Branching Attributes)
Attributes with a large number of values are usually problematic (e.g. id, primary keys, ...)
Why?
Subsets are more likely to be pure if there is a large number of values, so Information Gain is likely to choose attributes with a large number of values → this may result in overfitting!
Information Gain Ratio
Modification of the Information Gain that reduces bias toward highly-branching attributes.
Information Gain Ratio takes number and size of branches into account when choosing an attribute.
The correction uses the intrinsic information of a split, which is:
- Large when the data is evenly spread across the branches
- Small when all the data belong to one branch
Intrinsic Information: the entropy of the distribution of instances into branches, IntrinsicInfo(S, A) = -Σi (|Si| / |S|) log2(|Si| / |S|), where S is the original dataset and A is the attribute that splits S into S1, S2, …
Gain Ratio: GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)
The importance of an attribute decreases as the intrinsic information gets larger!
Attention: in practice, tree learners often use both IG and IGR:
- First, only consider attributes with greater than average Information Gain;
- Then, compare them using the Information Gain Ratio.
Why?
Information Gain Ratio may overcompensate and choose an attribute just because its intrinsic information is very low!
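A minimal sketch of the correction, assuming the same weather example where outlook splits the 14 instances into branches of size 5, 4 and 5 and has an information gain of about 0.247 bits:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def intrinsic_info(branch_sizes):
    """Entropy of the distribution of instances across branches."""
    return entropy(branch_sizes)

def gain_ratio(info_gain, branch_sizes):
    return info_gain / intrinsic_info(branch_sizes)

print(intrinsic_info([5, 4, 5]))      # ~1.577 bits
print(gain_ratio(0.247, [5, 4, 5]))   # ~0.157
```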
Numerical Attributes
Evaluating splits on a numeric attribute requires the instances at a node to be sorted by that attribute's value. Repeated sorting can be avoided: the sort order of the instances at a child node can be derived from the sort order of their parent, which improves the efficiency of tree-based algorithms.
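A minimal sketch of that trick, assuming we store, for each numeric attribute, the list of instance indices sorted by that attribute's value (all names below are illustrative):

```python
def child_sort_orders(parent_order, goes_left):
    """parent_order: instance indices sorted by a numeric attribute at the parent node.
    goes_left[i]: True if instance i is routed to the left child by the chosen split.
    Returns the sorted orders of both children in O(n), without re-sorting."""
    left = [i for i in parent_order if goes_left[i]]
    right = [i for i in parent_order if not goes_left[i]]
    return left, right

# Instances sorted by some numeric attribute at the parent; the split sends 1 and 3 left
parent_order = [2, 0, 3, 1]
goes_left = {0: False, 1: True, 2: False, 3: True}
print(child_sort_orders(parent_order, goes_left))   # ([3, 1], [2, 0]) - both still sorted
```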