What is data mining? It's the core of Knowledge Discovery in Databases (KDD): this is the full research knowledge process
from data selection Prepiscusing
1 1 Data
Data
Data processed
INPUT Target f Trasformi
Data
output Patterns
Knowledge Transformed
e 1 Data
Interpretation Mining
So KDD includes formulation of the problems, data collection, data cleaning and preprocessing, data mining and analysis of
the results produced by the model, so it transforms raw data into understandable information.
However the data mining step constitutes a so important phase in the overall KDD process to be often identified with the full
KDD process.
In this course, we will focus on Data Mining in Machine Learning. Its main task is to define a learning model to be used in
prediction, description, classification, regression and clustering.
First of all, we have to understand that not everything can be learned or is foreseeable: a good prediction, in some cases, can't
be produced by any model or tools. But in the other hand, when a prediction may be produced, a learning model should be
rich enough to capture important aspects of the problem... We have to be able to collect enough data.
Data are a collection of objects and their attributes (also called features) and are composed of several instances. They are
useful for different goals:
• Predict some attributes for the unknown (Supervised learning)
• Find some features and predict for the unknown (Unsupervised learning)
A supervised learning algorithm analyzes the training data and produces an inferred function (a model), which can be used for
mailing new examples. for
used checking
the accuracy
The training set can be incrementally or batch learning: the first one (called also online learning) is obtained incrementally
during the training process; the second one is available in advance before entering the training process (offline process).
The learning system can be mapped as:
The formal neuron (perceptron) is a simple learning machine with implements the class of function
gi sogni
ex
fa di
g
The output is 1 if the sum is greater than the threshold ( ), or -1 otherwise. The algorithm finds the value of weights and the
a
threshold (so the hyperplane).
The training process is:
It consists in finding a particular value of the parameters which selects a special function in the chosen class. Its aim is
fa
si property)
to model the process in a way that it is able to give the right answer in instances never seen before (generalization
rather than interpolating on the training set. loss function
We need a function that gives us a ”distance” between the output value and the real value: it is called and they
are several. 4
4.4
Il
y IT
yet i i
OUTPUT
REAL Il
1y
fa
I
OUTPUT
est Il
y
l
We want to minimize the expected error: ylpcxlpcxiyidxs.ly
LI fa
RIN
min
But we can't, because we don't know the probability density of input. So we can use the empirical risk: it is like risk expected
error but evaluated on the sample data that we have: È yi
xi
L f
fa
min Rcmp
How can we understand how much data we have to consider for fitting the model? If we use so much of them (overfitting) we
will have an error on training data around zero, but the error will be high for validation set; however, if we use few of them the
error on training set will be huge. We can use some tools for solving this problem: ERM, early stopping rule... The main rule is
try and error!
Does exist a relation between expected error and empirical risk?
We just saw last time the Empirical Risk minimization: f
Empirical risk depends only on data and on the functon . We introduced it because we cannot use the expected risk
a
: in general
RC
fa
Does a relationship exist between empirical risk and expected one? Yes, and it is possible to prove that:
Where “l” is the lenght of data set and “h” is the complexity of the class function. In particular C_VC is called VC confidence
term and increasing “h” the first term will decrease and instead the second one will increase.
So we want to minimize the upper bound of the expected risk:
The learning machinehould be chosen by minimizing both terms. By the way the second one depends only on the number of
samples in the training set, “l”, and on “h”: fixed che class of function, h*, and minimizating the empirical risk, we can evaluate
the upper bound as
Repeating this process for all n class of function h^j, finding for each one:
We can choose the class of functions which minimizes the upper bound:
This principle is called Structural Risk Minimization (SRM). To apply this, we need to be able to calculate the VC confidance
(the measure of the powerful of the class of functions in classifying data).
VC dimension
The VC dimension h is equal to the maximum number of vectors x_i that can be shattered, namely tha ca be separated using
this set of functions { } into two different classes when labeled as In all the Possible ways. An example is:
i
fa 2
f
lineare
fa x
fa fa.in
fa Ix
So stating that the VC dimension of a class is “h” means that we can find at least a set (it is not necessary that we be able for
any set) of “h” points that can be shattered.
Theorem: È
The VC dimension of linear functions (hyperlanes) in is n+1, it means also that all the coulpe of the points EIR
Xp
are linearly indipendent and all points are. Affinely indipendend.
Proof: Xi'ER
For each points with we associete the value for classifying them. Correctly classified means:
i yiet.i.it
e m
So we can rewrite the theorem as:
Assuming that the points are linearly indipendent, we have that the linear system
Xin Yi i vn
i
Inown
vi
Has full rank, so that admits at least a solution 112 Xi'ci
tù i e
il in
V y
yi.cl i m
1
i
So we get, assouming that O 4lb so
l
ii b
ci
4 b o
b i
sisi y
y
So to satisty it, a solution of b can be: beco.is
o
bel i
Vice versa, by contradiction assume that all the points i=2...m are linearly dependent. So we have that a linear combination
from 2 to m points exists: m
ao
Assign that:
Fixing and such that:
cieli bEIR is b
E
We get a contraddiction, inft assuming that :
O I
15
m l'it
b co
5 È
di di
tic
ad
ci X o
g'b so Hyperplane with margin
h x:
The VC dimension may depend on the dimension of the input
h=n+1.
we saw that for the class of linear functions we have that
n?
But what happen if we reduce the number of dimension
Reducing the number of features of the training data may indirectly
h
decreases and in turn limits the VC confidence term. On the
other hand it may lead to an increase of the empirical risk. Enti
n,
So, understanding that the VC dimension of class of function of hyperplanes is how can we choose "the better" one? Let's
hyperplane with margin.
introduce the Consider the hyperplane:
wtxtb.es
d( )
and call with the distance of a point from the hyperplane: I
D.
Assume that data stay in a sphere with diameter p: D,
The VC dimension depends on the value of tollerence if it is small with respect to the diameter it is still possible to shatter
3 points in 2-dimension; if it grows, the number of points that can be shattered decreases. So we can write that:
p, h F.
Increasing the value of we are decreasing and so we are finding a smaller class of function
So the SRM that use the hyperplane with tolerance gap is:
The SRM paradigm is used in two main approches:
Support Vector Machines:
• it fixes the empirical risk to a given value and minimize the VC confidence;
Deep Neural Network:
• it fixes the architecture (like the complexity and VC confidence) and minimize the empirical risk.
In both cases, we need to set up some hyperparametes that can control the trade off between the two terms in the true
objecrive function. We will try different values and see which return the best predicted outcome.
Learning process
The data that we use to start the learning process could be splitted in 2 different sets:
• Training set, the data used for defining the optimization problem and finding the parameters;
• Validation set, the data used to check the predictive performance and to adjust the hyperparameters.
We can have a huge number of data to use for learning process: some of the available data can be used as training set and the
second set of independent data (validation set) to check the predictive performance. How to choose how many goes in the
training or validation set?
k-fold cross validation:
We can use the all the data is divided into k groups; k-1 groups are used as training set and the
remaining group as validation; this procedure is then repeated for all k possible combination. The performances to use in the
model are averaged from the scores of the k runs.
To optimizate the enters we can use:
• for Support Vector Machines, the Constrained optimization (the optimization problem is linear constrained with a convex
quadratic function);
• for Deep Neural Network, the Unconstrained optimization (the optimizaion problem is uncostrained highly nonconvex).
Perceptron
n
i
i i
We have a training set {(x , y ): x R ,y {-1,1} i=1,...l} and we have to find an hyperplane that separate the sets in 2:
e wtxxb.at
lire
112
H iii gilet
Iii yi e
B
gilet
1 yi
A
With the function: if wtx.ba
f
sgnlwtx.to
tw.blxt otherwise
So the perceptron graphically is:
The goal of learning is to determinate w and b using data that we have avaible. To find it we can use the on-line learning
alghoritm:
This alghoritm is on-line because it uses a small set for training: even if it is arraving new data, the algorithm uses the first set
for training and doesn't wait all the data for start iterations. The perceptron algorithm no updates itself if imput is already
P
classified correctly and the classified is linear combination of the inner products (it values 0 if they are ortogonal,
namely they are similar): Perceptron
We just saw the perceptron algorithm, and it is an On-line one: it doesn't consider the entire data set at the same time, it only
ever looks at one exaple; training exaples appear sequentially. With the exaples of logic AND and OR, we saw that it is
convergent for linearly separable sets. Let's see the theorem that demonstates it.
What we want to do with this algorithm is find: f f inizio
wtxtb vite
be R
WEIR o
y
and to do that we update: csnvllx.y.tl
vent ypz.io
w E
The existence condition of the hyperplane with w and b is that the sets A and B should not overlap:
il PER
1
13
1 i
EIR y
y
A A 0
B
ncs.nu
con e
di di
di 20
2
Z
nuca
The theoreis:
È
3 5 steps
Più finite
after
so perception
y converge
zio
yp o b
lui
R 040 4
K min
Fi ER
11
i 1011
The proof is by contraddiction, namely zio
Jp P
Hk so
w
y
E cosa
viii 11W
11W viii il
il Il un
Il il
E
w
e P tz.PE
It twktypw li
Pz 1kt
w w w
w xp
y p
until
li E
E 11
w
Kai Pz
Pz Pll
Il w
11W Il w
2 y
y 2 zyPwktzp
ypz.si
Il
11W li
I Ix'll
you i
11W
li It
11W E
z
Eo
Hp
lini
Riti
11W R
E Il E k i
So we have that
t R
112
11W k
E i R ti
K'pl
HK Il R
K
11 KE
E i
E
w
Il
Il 2 Kai p
w p
So it exists a k finite for the stop of the algorithm. But what can we do if these isn't linearly separeble? Like an example, if the
training set is the logic XOR. A way to generalize a no linearly separable set is another algorithm:
• Voting Perceotron, voting the hyperplane for how much it survived;
• Average Perceptron,it is a modification of the voting one and it votes the vectors w and b (instead the hyperplane). The
outcomes of algorithms are: Feed Forward Neural Network
We want to define a function defined by parameters w EIR
fix cn
p w
such that it is close to the true y . The algorith has to find to minimize empirical risk.
The large-scale supervised machine learning takes like input Big Data (P couple. x , y ), dimension of each sample (features
of x ) and the parametres (d): it chooces a function and minimize approximatibely the empirical risk
wifi
MÌ Rcmp
Output is the extimate empirical risk incinta regularization term:
To find the "correct" function we can penalize its complexity, using a È s
È 21in
yi
llfcxi.ws w
Wp
mia diluito
l
typically the losses function for regression is quadratic square loss, hinge loss or logistic loss:
È lui xD Alla
È 11
y'll
HIP y
Ramp p 411WII
Wta
10
È
10,1 I y
y
max max
Ramp per
logli texpl ypy.si units
A FFN (Feed Forward Neural Network) is composed into (= neurons) that are organized into layers with one-way
connections (top-down): e lei
L N l j l.
A FFN has 3 parameters: number of layer, number of units in layer and the activation function of unit in layer
9J
There are different models depending on: e
L N
• number of layers and/or number of neurons per layer (it affects the dimension of the optimaztion problem)
g w):
• Hidden Unit type (activation function and its hyper-parameters
Multiplayer Perceptron Networks;
◦ Radial Basis Function Networks.
◦ Multiplayer Perceptron Networks
g
Units are generalization of the formal neuron and the activation function acts a trigger (on/off) and may depend on hyper-
parameters:
• L=1 for shallow networks and L>1 for deep network
w
• are the weights on the arcs connecting units and the bias in the units
j l,
Let's see the generalization of an unit. Its internal structure, of neuron of layer is: lei
Ne il
zie 2
l-1
as input from layer there is or
i i
sum to the input the bias byte
the output of sum is a.ie
it trasforms the sum with g.ie e
j l
the output of the unit of layer is z
g
As activation function we can use: et
ce
It
expect
e
giti 9 Ct
e
1
et et
<Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
-
Appunti corso di Diritto amministrativo
-
Appunti corso Obblighi formativi aggiuntivi 2023/2024
-
Appunti corso Diritto Pubblico
-
Appunti corso monografico Agamennone, introduzione