Estratto del documento

What is data mining? It's the core of Knowledge Discovery in Databases (KDD): this is the full research knowledge process

from data selection Prepiscusing

1 1 Data

Data

Data processed

INPUT Target f Trasformi

Data

output Patterns

Knowledge Transformed

e 1 Data

Interpretation Mining

So KDD includes formulation of the problems, data collection, data cleaning and preprocessing, data mining and analysis of

the results produced by the model, so it transforms raw data into understandable information.

However the data mining step constitutes a so important phase in the overall KDD process to be often identified with the full

KDD process.

In this course, we will focus on Data Mining in Machine Learning. Its main task is to define a learning model to be used in

prediction, description, classification, regression and clustering.

First of all, we have to understand that not everything can be learned or is foreseeable: a good prediction, in some cases, can't

be produced by any model or tools. But in the other hand, when a prediction may be produced, a learning model should be

rich enough to capture important aspects of the problem... We have to be able to collect enough data.

Data are a collection of objects and their attributes (also called features) and are composed of several instances. They are

useful for different goals:

• Predict some attributes for the unknown (Supervised learning)

• Find some features and predict for the unknown (Unsupervised learning)

A supervised learning algorithm analyzes the training data and produces an inferred function (a model), which can be used for

mailing new examples. for

used checking

the accuracy

The training set can be incrementally or batch learning: the first one (called also online learning) is obtained incrementally

during the training process; the second one is available in advance before entering the training process (offline process).

The learning system can be mapped as:

The formal neuron (perceptron) is a simple learning machine with implements the class of function

gi sogni

ex

fa di

g

The output is 1 if the sum is greater than the threshold ( ), or -1 otherwise. The algorithm finds the value of weights and the

a

threshold (so the hyperplane).

The training process is:

It consists in finding a particular value of the parameters which selects a special function in the chosen class. Its aim is

fa

si property)

to model the process in a way that it is able to give the right answer in instances never seen before (generalization

rather than interpolating on the training set. loss function

We need a function that gives us a ”distance” between the output value and the real value: it is called and they

are several. 4

4.4

Il

y IT

yet i i

OUTPUT

REAL Il

1y

fa

I

OUTPUT

est Il

y

l

We want to minimize the expected error: ylpcxlpcxiyidxs.ly

LI fa

RIN

min

But we can't, because we don't know the probability density of input. So we can use the empirical risk: it is like risk expected

error but evaluated on the sample data that we have: È yi

xi

L f

fa

min Rcmp

How can we understand how much data we have to consider for fitting the model? If we use so much of them (overfitting) we

will have an error on training data around zero, but the error will be high for validation set; however, if we use few of them the

error on training set will be huge. We can use some tools for solving this problem: ERM, early stopping rule... The main rule is

try and error!

Does exist a relation between expected error and empirical risk?

We just saw last time the Empirical Risk minimization: f

Empirical risk depends only on data and on the functon . We introduced it because we cannot use the expected risk

a

: in general

RC

fa

Does a relationship exist between empirical risk and expected one? Yes, and it is possible to prove that:

Where “l” is the lenght of data set and “h” is the complexity of the class function. In particular C_VC is called VC confidence

term and increasing “h” the first term will decrease and instead the second one will increase.

So we want to minimize the upper bound of the expected risk:

The learning machinehould be chosen by minimizing both terms. By the way the second one depends only on the number of

samples in the training set, “l”, and on “h”: fixed che class of function, h*, and minimizating the empirical risk, we can evaluate

the upper bound as

Repeating this process for all n class of function h^j, finding for each one:

We can choose the class of functions which minimizes the upper bound:

This principle is called Structural Risk Minimization (SRM). To apply this, we need to be able to calculate the VC confidance

(the measure of the powerful of the class of functions in classifying data).

VC dimension

The VC dimension h is equal to the maximum number of vectors x_i that can be shattered, namely tha ca be separated using

this set of functions { } into two different classes when labeled as In all the Possible ways. An example is:

i

fa 2

f

lineare

fa x

fa fa.in

fa Ix

So stating that the VC dimension of a class is “h” means that we can find at least a set (it is not necessary that we be able for

any set) of “h” points that can be shattered.

Theorem: È

The VC dimension of linear functions (hyperlanes) in is n+1, it means also that all the coulpe of the points EIR

Xp

are linearly indipendent and all points are. Affinely indipendend.

Proof: Xi'ER

For each points with we associete the value for classifying them. Correctly classified means:

i yiet.i.it

e m

So we can rewrite the theorem as:

Assuming that the points are linearly indipendent, we have that the linear system

Xin Yi i vn

i

Inown

vi

Has full rank, so that admits at least a solution 112 Xi'ci

tù i e

il in

V y

yi.cl i m

1

i

So we get, assouming that O 4lb so

l

ii b

ci

4 b o

b i

sisi y

y

So to satisty it, a solution of b can be: beco.is

o

bel i

Vice versa, by contradiction assume that all the points i=2...m are linearly dependent. So we have that a linear combination

from 2 to m points exists: m

ao

Assign that:

Fixing and such that:

cieli bEIR is b

E

We get a contraddiction, inft assuming that :

O I

15

m l'it

b co

5 È

di di

tic

ad

ci X o

g'b so Hyperplane with margin

h x:

The VC dimension may depend on the dimension of the input

h=n+1.

we saw that for the class of linear functions we have that

n?

But what happen if we reduce the number of dimension

Reducing the number of features of the training data may indirectly

h

decreases and in turn limits the VC confidence term. On the

other hand it may lead to an increase of the empirical risk. Enti

n,

So, understanding that the VC dimension of class of function of hyperplanes is how can we choose "the better" one? Let's

hyperplane with margin.

introduce the Consider the hyperplane:

wtxtb.es

d( )

and call with the distance of a point from the hyperplane: I

D.

Assume that data stay in a sphere with diameter p: D,

The VC dimension depends on the value of tollerence if it is small with respect to the diameter it is still possible to shatter

3 points in 2-dimension; if it grows, the number of points that can be shattered decreases. So we can write that:

p, h F.

Increasing the value of we are decreasing and so we are finding a smaller class of function

So the SRM that use the hyperplane with tolerance gap is:

The SRM paradigm is used in two main approches:

Support Vector Machines:

• it fixes the empirical risk to a given value and minimize the VC confidence;

Deep Neural Network:

• it fixes the architecture (like the complexity and VC confidence) and minimize the empirical risk.

In both cases, we need to set up some hyperparametes that can control the trade off between the two terms in the true

objecrive function. We will try different values and see which return the best predicted outcome.

Learning process

The data that we use to start the learning process could be splitted in 2 different sets:

• Training set, the data used for defining the optimization problem and finding the parameters;

• Validation set, the data used to check the predictive performance and to adjust the hyperparameters.

We can have a huge number of data to use for learning process: some of the available data can be used as training set and the

second set of independent data (validation set) to check the predictive performance. How to choose how many goes in the

training or validation set?

k-fold cross validation:

We can use the all the data is divided into k groups; k-1 groups are used as training set and the

remaining group as validation; this procedure is then repeated for all k possible combination. The performances to use in the

model are averaged from the scores of the k runs.

To optimizate the enters we can use:

• for Support Vector Machines, the Constrained optimization (the optimization problem is linear constrained with a convex

quadratic function);

• for Deep Neural Network, the Unconstrained optimization (the optimizaion problem is uncostrained highly nonconvex).

Perceptron

n

i

i i

We have a training set {(x , y ): x R ,y {-1,1} i=1,...l} and we have to find an hyperplane that separate the sets in 2:

e wtxxb.at

lire

112

H iii gilet

Iii yi e

B

gilet

1 yi

A

With the function: if wtx.ba

f

sgnlwtx.to

tw.blxt otherwise

So the perceptron graphically is:

The goal of learning is to determinate w and b using data that we have avaible. To find it we can use the on-line learning

alghoritm:

This alghoritm is on-line because it uses a small set for training: even if it is arraving new data, the algorithm uses the first set

for training and doesn't wait all the data for start iterations. The perceptron algorithm no updates itself if imput is already

P

classified correctly and the classified is linear combination of the inner products (it values 0 if they are ortogonal,

namely they are similar): Perceptron

We just saw the perceptron algorithm, and it is an On-line one: it doesn't consider the entire data set at the same time, it only

ever looks at one exaple; training exaples appear sequentially. With the exaples of logic AND and OR, we saw that it is

convergent for linearly separable sets. Let's see the theorem that demonstates it.

What we want to do with this algorithm is find: f f inizio

wtxtb vite

be R

WEIR o

y

and to do that we update: csnvllx.y.tl

vent ypz.io

w E

The existence condition of the hyperplane with w and b is that the sets A and B should not overlap:

il PER

1

13

1 i

EIR y

y

A A 0

B

ncs.nu

con e

di di

di 20

2

Z

nuca

The theoreis:

È

3 5 steps

Più finite

after

so perception

y converge

zio

yp o b

lui

R 040 4

K min

Fi ER

11

i 1011

The proof is by contraddiction, namely zio

Jp P

Hk so

w

y

E cosa

viii 11W

11W viii il

il Il un

Il il

E

w

e P tz.PE

It twktypw li

Pz 1kt

w w w

w xp

y p

until

li E

E 11

w

Kai Pz

Pz Pll

Il w

11W Il w

2 y

y 2 zyPwktzp

ypz.si

Il

11W li

I Ix'll

you i

11W

li It

11W E

z

Eo

Hp

lini

Riti

11W R

E Il E k i

So we have that

t R

112

11W k

E i R ti

K'pl

HK Il R

K

11 KE

E i

E

w

Il

Il 2 Kai p

w p

So it exists a k finite for the stop of the algorithm. But what can we do if these isn't linearly separeble? Like an example, if the

training set is the logic XOR. A way to generalize a no linearly separable set is another algorithm:

• Voting Perceotron, voting the hyperplane for how much it survived;

• Average Perceptron,it is a modification of the voting one and it votes the vectors w and b (instead the hyperplane). The

outcomes of algorithms are: Feed Forward Neural Network

We want to define a function defined by parameters w EIR

fix cn

p w

such that it is close to the true y . The algorith has to find to minimize empirical risk.

The large-scale supervised machine learning takes like input Big Data (P couple. x , y ), dimension of each sample (features

of x ) and the parametres (d): it chooces a function and minimize approximatibely the empirical risk

wifi

MÌ Rcmp

Output is the extimate empirical risk incinta regularization term:

To find the "correct" function we can penalize its complexity, using a È s

È 21in

yi

llfcxi.ws w

Wp

mia diluito

l

typically the losses function for regression is quadratic square loss, hinge loss or logistic loss:

È lui xD Alla

È 11

y'll

HIP y

Ramp p 411WII

Wta

10

È

10,1 I y

y

max max

Ramp per

logli texpl ypy.si units

A FFN (Feed Forward Neural Network) is composed into (= neurons) that are organized into layers with one-way

connections (top-down): e lei

L N l j l.

A FFN has 3 parameters: number of layer, number of units in layer and the activation function of unit in layer

9J

There are different models depending on: e

L N

• number of layers and/or number of neurons per layer (it affects the dimension of the optimaztion problem)

g w):

• Hidden Unit type (activation function and its hyper-parameters

Multiplayer Perceptron Networks;

◦ Radial Basis Function Networks.

◦ Multiplayer Perceptron Networks

g

Units are generalization of the formal neuron and the activation function acts a trigger (on/off) and may depend on hyper-

parameters:

• L=1 for shallow networks and L>1 for deep network

w

• are the weights on the arcs connecting units and the bias in the units

j l,

Let's see the generalization of an unit. Its internal structure, of neuron of layer is: lei

Ne il

zie 2

l-1

as input from layer there is or

i i

sum to the input the bias byte

the output of sum is a.ie

it trasforms the sum with g.ie e

j l

the output of the unit of layer is z

g

As activation function we can use: et

ce

It

expect

e

giti 9 Ct

e

1

et et

<
Anteprima
Vedrai una selezione di 10 pagine su 43
OMML - Appunti corso Pag. 1 OMML - Appunti corso Pag. 2
Anteprima di 10 pagg. su 43.
Scarica il documento per vederlo tutto.
OMML - Appunti corso Pag. 6
Anteprima di 10 pagg. su 43.
Scarica il documento per vederlo tutto.
OMML - Appunti corso Pag. 11
Anteprima di 10 pagg. su 43.
Scarica il documento per vederlo tutto.
OMML - Appunti corso Pag. 16
Anteprima di 10 pagg. su 43.
Scarica il documento per vederlo tutto.
OMML - Appunti corso Pag. 21
Anteprima di 10 pagg. su 43.
Scarica il documento per vederlo tutto.
OMML - Appunti corso Pag. 26
Anteprima di 10 pagg. su 43.
Scarica il documento per vederlo tutto.
OMML - Appunti corso Pag. 31
Anteprima di 10 pagg. su 43.
Scarica il documento per vederlo tutto.
OMML - Appunti corso Pag. 36
Anteprima di 10 pagg. su 43.
Scarica il documento per vederlo tutto.
OMML - Appunti corso Pag. 41
1 su 43
D/illustrazione/soddisfatti o rimborsati
Acquista con carta o PayPal
Scarica i documenti tutte le volte che vuoi
Dettagli
SSD
Scienze matematiche e informatiche MAT/09 Ricerca operativa

I contenuti di questa pagina costituiscono rielaborazioni personali del Publisher andrea22x di informazioni apprese con la frequenza delle lezioni di Optimization method for machine learning e studio autonomo di eventuali libri di riferimento in preparazione dell'esame finale o della tesi. Non devono intendersi come materiale ufficiale dell'università Università degli Studi di Roma La Sapienza o del prof Palagi Laura.
Appunti correlati Invia appunti e guadagna

Domande e risposte

Hai bisogno di aiuto?
Chiedi alla community