
DAGSVM

In DAGSVM, the k-class problem is decomposed into k(k-1)/2 binary (2-class) problems, as in one-against-one

- Training is performed as in one-against-one

- But testing is performed using a Directed Acyclic Graph to reduce the number of SVM classifiers to apply:

 The test process involves only k-1 binary SVM classifiers instead of k(k-1)/2 as in the one-against-one approach

Summary

One-against-all

- cheap in terms of memory requirements

- expensive training but cheap testing

One-against-one

- expensive in terms of memory requirements

- expensive testing but slightly cheaper training

DAGSVM

- expensive in terms of memory requirements

- slightly cheaper training and cheap testing

 One-against-one is the best performing approach, due to the most effective decomposition

 DAGSVM is a faster approximation of one-against-one
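
As a sketch of the one-against-all and one-against-one decompositions, the snippet below uses scikit-learn (assumed available) to count how many binary SVMs each strategy trains; note that scikit-learn does not provide a DAGSVM implementation, so only the two decompositions above are shown.

```python
# Sketch: one-against-all vs one-against-one decompositions with scikit-learn.
# Dataset and estimator choices are illustrative; DAGSVM is not included here.
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)                        # k = 10 classes

ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # one-against-all
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # one-against-one

print(len(ova.estimators_))   # 10 -> k binary SVMs
print(len(ovo.estimators_))   # 45 -> k(k-1)/2 binary SVMs
```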

SVR Support Vector Regression

Support Vector Machines can also be used as a regression method
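
A minimal SVR sketch with scikit-learn's SVR class; the synthetic data and hyperparameters below are illustrative assumptions, not values from the lecture.

```python
# Sketch: Support Vector Regression on a noisy sine curve (illustrative data).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(80)

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)   # epsilon-insensitive loss
svr.fit(X, y)
print(svr.predict([[2.5]]))                    # prediction for a new input
```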

LECTURE 9 MARKOV DECISION PROCESSES

In sequential decision making

- we take a sequence of decisions (or actions) to reach the goal

- the optimal actions are context-dependent

- we generally do not have examples of correct actions

- actions may have long-term consequences

- short-term consequences of optimal actions might seem negative

In a Markov Decision Process (MDP), the one-step dynamics can be described as:

 p(s', r | s, a) = P[S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a]

Markov Property: the future state (s') and reward (r) depend only on the current state (s) and action (a)

When the Markov Property holds and the state and action sets are finite, the problem is a finite Markov Decision Process

To define a finite MDP, you need to define:

- state and action sets

- one-step dynamics: p(s', r | s, a)

 We can derive the next-state distribution and the expected reward as:

 p(s' | s, a) = Σ_r p(s', r | s, a)        r(s, a) = E[R_{t+1} | S_t = s, A_t = a] = Σ_r r Σ_{s'} p(s', r | s, a)

Example Finite MDP: Recycling Robot
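
As a sketch of how the one-step dynamics of a small finite MDP can be encoded, here is a tiny two-state example written as a Python dictionary; the states, actions, probabilities and rewards below are illustrative assumptions, not the actual recycling-robot values.

```python
# Hypothetical two-state finite MDP: P[(s, a)] = [(prob, next_state, reward), ...]
# All numbers below are illustrative, not the recycling-robot parameters.
P = {
    ("high", "search"):  [(0.8, "high", 2.0), (0.2, "low", 2.0)],
    ("high", "wait"):    [(1.0, "high", 1.0)],
    ("low", "search"):   [(0.6, "low", 2.0), (0.4, "high", -3.0)],  # risk of depleting
    ("low", "wait"):     [(1.0, "low", 1.0)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
}

def expected_reward(s, a):
    """r(s, a) = sum over outcomes of prob * reward."""
    return sum(p * r for p, _, r in P[(s, a)])

print(expected_reward("low", "search"))   # 0.6*2.0 + 0.4*(-3.0) = 0.0
```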

Return

The agent should not choose actions on the basis of immediate reward!

In fact, long-term consequences are more important than short-term reward

So, we need to take into account the sequence of future rewards

We define the return G_t as a function of the sequence of future rewards R_{t+1}, R_{t+2}, R_{t+3}, ...

The agent will have to maximize the expected return E[G_t]

Different definitions of return are possible:

1. Total reward: G_t = R_{t+1} + R_{t+2} + ... + R_T

2. Discounted reward: G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... = Σ_{k=0..∞} γ^k R_{t+k+1}

3. Average reward: lim_{n→∞} (1/n) Σ_{k=1..n} R_{t+k}
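
A minimal sketch of computing the discounted return from a finite reward sequence; the rewards and discount factor below are illustrative.

```python
# Sketch: discounted return G_t = sum_k gamma^k * R_{t+k+1} for a finite reward list.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):   # work backwards: G_t = r + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29
```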

Episodic Task

In episodic tasks the agent-environment interaction naturally breaks into chunks called episodes

It is possible to maximize the expected total reward: G_t = R_{t+1} + R_{t+2} + ... + R_T, where T is the final time step of the episode

Continuing Tasks

In continuing tasks the agent-environment interaction goes on continually and there is no terminal state

The total reward is a sum over an infinite sequence and might not be finite:

To solve this issue we can discount the future rewards by a factor γ (0 < γ < 1): G_t = Σ_{k=0..∞} γ^k R_{t+k+1}

Unified Notation

In episodic tasks, we can represent terminal states as absorbing states that always produce zero reward:

 Now we can use the same definition of expected return for episodic and continuing tasks: G_t = Σ_{k=0..∞} γ^k R_{t+k+1}

- γ = 1 can be used only if an absorbing state is always reached

- if γ = 0, the agent only cares about the immediate reward

- as γ → 1, the agent takes future rewards into account more strongly

Goal

A goal is what we want to achieve

The Reward Hypothesis: all of what we mean by goals and purposes can be thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (reward).

Examples of reward design:

- goal-reward representation: 1 for the goal, 0 otherwise

- action-penalty representation: -1 while the goal is not reached, 0 once the goal is reached

Policy

A policy, at any given time step, decides which action the agent selects

 It fully defines the behavior of an agent

Policies can be:

- Markovian / Non Markovian

- Deterministic / Stochastic

- Stationary / Non Stationary

Deterministic Policy π(s)

In the simplest case the policy can be modeled as a function π: S → A

This type of policy can be conveniently represented using a table that assigns one action to each state, e.g., a deterministic policy in a gridworld environment that maps each cell to a single move

Stochastic Policy

A more general approach is to model the policy as a function that maps each state to a probability distribution over the actions: π(a | s) = P[A_t = a | S_t = s]

NB: a stochastic policy can also be used to represent a deterministic policy (by putting all the probability mass on a single action)

(Example: a stochastic policy in a gridworld environment.)
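
A minimal sketch of the two tabular representations just described; the states and actions below are hypothetical.

```python
# Deterministic policy: a table mapping each state to a single action.
deterministic_pi = {"s0": "right", "s1": "right", "s2": "up"}

# Stochastic policy: a table mapping each state to a distribution over actions.
stochastic_pi = {
    "s0": {"right": 0.8, "up": 0.2},
    "s1": {"right": 1.0},            # also represents a deterministic choice
    "s2": {"up": 0.5, "left": 0.5},
}

import random
def sample_action(pi, s):
    """Sample an action from a stochastic policy at state s."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_pi["s0"], sample_action(stochastic_pi, "s0"))
```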

Markovian / Non-Markovian Policy

Policy 1: choose R 50% of the time and L 50% of the time → Markovian

Policy 2: alternate R and L → Non-Markovian (it does not depend only on the current state)

NB: we can overcome this limitation by extending the state definition (e.g., in this case by including the previous action)

Value Functions

Given a policy π we can compute the state-value function as:

 V_π(s) = E_π[G_t | S_t = s]

It represents the expected return from a given state s when following policy π

We can also compute the action-value function as:

 Q_π(s, a) = E_π[G_t | S_t = s, A_t = a]

It represents the expected return from a given state s when a given action a is selected and then policy π is followed

Bellman Expectation Equation

The state-value function can again be decomposed into immediate reward plus discounted value of the successor state:

 V_π(s) = E_π[R_{t+1} + γ V_π(S_{t+1}) | S_t = s] = Σ_a π(a | s) Σ_{s', r} p(s', r | s, a) [r + γ V_π(s')]

The action-value function can be similarly decomposed:

 Q_π(s, a) = E_π[R_{t+1} + γ Q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a] = Σ_{s', r} p(s', r | s, a) [r + γ Σ_{a'} π(a' | s') Q_π(s', a')]

Let's see how the Bellman equations can be used in practice with an example:
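
As an illustration, the Bellman expectation equation for a fixed policy is a linear system V = R_π + γ P_π V, so V_π can be computed exactly by a linear solve; the 3-state transition matrix and rewards below are illustrative assumptions, not the lecture's example.

```python
# Sketch: exact policy evaluation by solving (I - gamma * P_pi) V = R_pi
# for a hypothetical 3-state MDP under a fixed policy pi (numbers are illustrative).
import numpy as np

gamma = 0.9
P_pi = np.array([[0.7, 0.3, 0.0],    # P_pi[s, s'] = transition probability under pi
                 [0.0, 0.5, 0.5],
                 [0.2, 0.0, 0.8]])
R_pi = np.array([1.0, 0.0, 2.0])     # expected immediate reward under pi

V = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)
print(V)                             # state-value function V_pi
```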

Limitations of Bellman Equations

Comparing two policies

We say that π ≥ π' if and only if V_π(s) ≥ V_π'(s), ∀ s ∈ S

Optimal Policy π*

For any Markov Decision Process, there always exists at least one optimal deterministic policy π* that is better than or equal to all the others (π* ≥ π, ∀ π)

Sketch of proof:

Example:

Limitations:

Bellman Optimality Equations

We can compute the optimal state-value function and the optimal action-value function as:

- Bellman Optimality Equation for V*: V*(s) = max_a Σ_{s', r} p(s', r | s, a) [r + γ V*(s')]

- Bellman Optimality Equation for Q*: Q*(s, a) = Σ_{s', r} p(s', r | s, a) [r + γ max_{a'} Q*(s', a')]

From V* and Q* we can easily compute the optimal policy π*: π*(s) = argmax_a Q*(s, a)

Example:

LECTURE 10 DYNAMIC PROGRAMMING

Why Dynamic Programming?

To solve an MDP we need to find an optimal policy

Unfortunately we cannot use a brute-force approach:

- there are |A|^|S| deterministic policies to evaluate

- and |S| linear equations to solve for each policy

Dynamic Programming (DP) is a method that allows solving a complex problem by breaking it down into simpler sub-problems in a recursive manner

How do we use DP to solve an MDP thanks to the Bellman Equations?

Idea:

Iterative Policy Evaluation

We search for the solution of the Bellman expectation equation for V_π

DP solves this problem through the iterative application of the Bellman expectation equation:

 V_{k+1}(s) = Σ_a π(a | s) Σ_{s', r} p(s', r | s, a) [r + γ V_k(s')], ∀ s ∈ S

At each iteration the value function is updated for all states (sweep)

It can be proved that V_k converges to V_π as k → ∞, for any initial value V_0

Algorithm:
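
Since the algorithm itself is not reproduced here, below is a minimal Python sketch of iterative policy evaluation using the dictionary encoding of the dynamics introduced earlier; the two-state MDP and policy are hypothetical.

```python
# Sketch: iterative policy evaluation for a hypothetical finite MDP.
# P[(s, a)] = [(prob, next_state, reward), ...];  pi[s] = {action: probability}.
def policy_evaluation(P, states, pi, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in states}                 # arbitrary initial guess V_0
    while True:
        delta = 0.0
        for s in states:                         # one sweep over all states
            v_new = sum(p_a * sum(p * (r + gamma * V[s2])
                                  for p, s2, r in P[(s, a)])
                        for a, p_a in pi[s].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:                        # stop when a sweep barely changes V
            return V

# Hypothetical two-state example
P = {("a", "stay"): [(1.0, "a", 1.0)], ("a", "go"): [(1.0, "b", 0.0)],
     ("b", "stay"): [(1.0, "b", 2.0)], ("b", "go"): [(1.0, "a", 0.0)]}
pi = {"a": {"go": 1.0}, "b": {"stay": 1.0}}
print(policy_evaluation(P, ["a", "b"], pi))      # roughly {'a': 18.0, 'b': 20.0}
```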

Policy Improvement

Remember that

What happens if we act greedily w.r.t. a non-optimal value function V_π? We obtain a new policy π'

Is π' different from π?

- If not, it means that π is already the optimal policy π*

- Otherwise, π' is better than or as good as π

Theorem (policy improvement): if Q_π(s, π'(s)) ≥ V_π(s) for all s ∈ S, then V_π'(s) ≥ V_π(s) for all s ∈ S

Proof:

Policy Iteration

Algorithm:
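
As the algorithm is not reproduced here, the following self-contained sketch shows the policy iteration loop (complete evaluation followed by greedy improvement, repeated until the policy is stable); the MDP encoding and the two-state example are assumptions for illustration.

```python
# Sketch: policy iteration = repeated (evaluation, greedy improvement) until stable.
# Dynamics encoding: P[(s, a)] = [(prob, next_state, reward), ...] (hypothetical MDP).
def evaluate(P, states, pi, gamma, theta=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = sum(pa * sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
                    for a, pa in pi[s].items())
            delta, V[s] = max(delta, abs(v - V[s])), v
        if delta < theta:
            return V

def policy_iteration(P, states, actions, gamma=0.9):
    pi = {s: {actions[s][0]: 1.0} for s in states}      # arbitrary initial policy
    while True:
        V = evaluate(P, states, pi, gamma)              # complete policy evaluation
        stable = True
        for s in states:                                # greedy policy improvement
            q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
                 for a in actions[s]}
            best = max(q, key=q.get)
            if pi[s] != {best: 1.0}:
                pi[s], stable = {best: 1.0}, False
        if stable:                                      # unchanged policy -> optimal
            return pi, V

P = {("a", "stay"): [(1.0, "a", 1.0)], ("a", "go"): [(1.0, "b", 0.0)],
     ("b", "stay"): [(1.0, "b", 2.0)], ("b", "go"): [(1.0, "a", 0.0)]}
print(policy_iteration(P, ["a", "b"], {"a": ["stay", "go"], "b": ["stay", "go"]}))
```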

Generalized Policy Iteration

Policy iteration alternates complete policy evaluation and policy improvement until convergence

The generalized policy iteration framework also allows finding the optimal policy by interleaving partial evaluation and improvement steps!

In particular, Value Iteration is one of the most popular GPI methods

Value Iteration

Simply iterate the update of the value function using the Bellman optimality equation:

 V_{k+1}(s) = max_a Σ_{s', r} p(s', r | s, a) [r + γ V_k(s')]

It can be proved that V_k → V* as k → ∞

Algorithm:
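
A minimal value iteration sketch under the same hypothetical dictionary encoding of the dynamics; after convergence a greedy policy is extracted from V.

```python
# Sketch: value iteration for a hypothetical finite MDP.
# P[(s, a)] = [(prob, next_state, reward), ...]; numbers below are illustrative.
def value_iteration(P, states, actions, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup: max over actions of the one-step lookahead
            v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
                    for a in actions[s])
            delta, V[s] = max(delta, abs(v - V[s])), v
        if delta < theta:
            break
    # extract a greedy (optimal) deterministic policy from V
    pi = {s: max(actions[s],
                 key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)]))
          for s in states}
    return V, pi

P = {("a", "stay"): [(1.0, "a", 1.0)], ("a", "go"): [(1.0, "b", 0.0)],
     ("b", "stay"): [(1.0, "b", 2.0)], ("b", "go"): [(1.0, "a", 0.0)]}
print(value_iteration(P, ["a", "b"], {"a": ["stay", "go"], "b": ["stay", "go"]}))
```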

Asynchronous Dynamic Programming

All the DP methods described so far require sweeps of the entire state set.

Asynchronous DP does not use sweeps:

- Pick a state at random

- Apply the appropriate backup

- Repeat until convergence

 An agent's experience can act as a guide to select which states to back up intelligently

Efficiency of DP

Finding an optimal policy is polynomial in the number of states and actions:

Unfortunately, the number of states often grows exponentially with the number of state variables (curse of dimensionality)

 Asynchronous DP can be applied to larger problems, and is appropriate for parallel computing

BUT it is easy to come up with MDPs for which DP methods are not practical!

 Linear programming approaches can also be used instead of DP, but they typically do not scale well to larger problems

LECTURE 11 MONTE CARLO METHODS

Why do we need sample-based methods?

Consider the update rule of the Value Iteration algorithm:

Which is the major limitation of this approach?

- We generally do not know the problem dynamics!

 We want methods to learn the optimal policy directly from data!

Monte Carlo methods rely only on experience (data) to learn value functions and policies

Monte Carlo methods can be used in two ways

- model-free: no model necessary and still attains optimality

- simulated: needs only a simulation, not a full model 

Monte Carlo methods learn from complete sample returns, so they are only defined for episodic tasks

Policy Evaluation

- Goal: learn V_π(s)

- Given: some number of episodes under π which contain s

- Idea: average returns observed after visits to s

1. Every-Visit MC: average returns for every time s is visited in an episode

2. First-visit MC: average returns only for first time s is visited in an episode

 Both converge asymptotically

First-visit Monte Carlo policy evaluation
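
A minimal first-visit MC policy evaluation sketch; the episode format (a list of (state, reward) pairs, with the reward received on leaving the state) and the sample episodes are assumptions for illustration.

```python
# Sketch: first-visit Monte Carlo policy evaluation from sampled episodes.
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    returns = defaultdict(list)                  # first-visit returns per state
    for episode in episodes:
        first_visit = {}                         # state -> index of its first visit
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # compute returns backwards, then record them only at first visits
        g, rets = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            g = episode[t][1] + gamma * g
            rets[t] = g
        for s, t in first_visit.items():
            returns[s].append(rets[t])
    return {s: sum(v) / len(v) for s, v in returns.items()}   # V(s) = average return

episodes = [[("s0", 0.0), ("s1", 1.0)], [("s0", 0.0), ("s0", 0.0), ("s1", 1.0)]]
print(first_visit_mc(episodes))   # {'s0': 1.0, 's1': 1.0} with gamma = 1
```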

Policy Iteration

We fixed the evaluation step with Monte Carlo Policy Evaluation

BUT the greedy improvement step over V_π still requires a model of the dynamics

Idea: evaluate action values Q_π(s, a) instead of state values, so the greedy policy can be computed without a model

How?

Action Values Evaluation

We average the returns observed after starting from state s, taking action a, and then following π

 This converges asymptotically to Q_π(s, a) if every state-action pair is visited (infinitely often)

Monte Carlo Policy Iteration with Exploring Starts (ES)

ε-Soft Monte Carlo Policy Iteration

Exploring Starts is a simple idea but it is not always possible!

 We need to keep exploring during the learning process!

This leads to a key problem in Reinforcement Learning: the Exploration-Exploitation Dilemma

ε-Greedy Exploration

It is the simplest solution to the exploration-exploitation dilemma

Instead of searching for the optimal deterministic policy, we search for the optimal ε-soft policy, i.e., a policy that selects each action with a probability of at least ε/|A|

In particular we use the ε-greedy policy:

 π(a | s) = 1 - ε + ε/|A| if a = argmax_{a'} Q(s, a'), and ε/|A| otherwise

Algorithm:
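
A minimal sketch of ε-greedy action selection over a tabular action-value function; the Q values, state names, and ε below are illustrative.

```python
# Sketch: epsilon-greedy action selection from an action-value table Q[(s, a)].
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon pick a uniformly random action, otherwise the greedy
    one; every action thus has probability at least epsilon / |A|."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

Q = {("s0", "left"): 0.2, ("s0", "right"): 0.8}
print(epsilon_greedy(Q, "s0", ["left", "right"], epsilon=0.1))
```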

Theorem

Any ε-greedy policy π' with respect to Q_π is an improvement over any ε-soft policy π

On-Policy vs Off-Policy Learning

On-Policy Learning

The agent learns the value functions of the same policy used to select actions

 We cannot easily learn ...
