DAGSVM
In DAGSVM, the k-class problem is decomposed into k(k-1)/2 binary (2-class) problems, as in one-against-one
- Training is performed as in one-against-one
- But testing is performed using a Directed Acyclic Graph to reduce the number of SVM classifiers to apply: at each node one binary classifier eliminates one of the remaining candidate classes
The test process involves only k-1 binary SVM classifiers instead of the k(k-1)/2 of the one-against-one approach
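As a rough sketch of this test procedure (an illustration, not the lecture's code; the trained pairwise classifiers clf[(i, j)] and their scikit-learn-style predict method are assumptions):

```python
# Minimal sketch of DAGSVM-style testing: at each step one pairwise classifier
# eliminates one candidate class, so only k-1 classifiers are evaluated per sample.
def dagsvm_predict(clf, classes, x):
    candidates = list(classes)
    while len(candidates) > 1:
        i, j = candidates[0], candidates[-1]          # compare first vs last remaining class
        pair = (i, j) if (i, j) in clf else (j, i)    # each pairwise classifier is stored once
        winner = clf[pair].predict([x])[0]            # one binary SVM decision
        candidates.remove(j if winner == i else i)    # the losing class is eliminated
    return candidates[0]
```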
Summary
One-against-all
- cheap in terms of memory requirements
- expensive training but cheap test
One-against-one
- expensive in terms of memory requirements
- expensive test but relatively cheap training
DAGSVM
- expensive in terms of memory requirements
- relatively cheap training and cheap test
One-against-one is the best performing approach, thanks to its more effective decomposition
DAGSVM is a faster approximation of one-against-one
–
SVR Support Vector Regression
Support Vector Machine can also be used as a regression method
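A minimal usage sketch with scikit-learn's SVR (the library call and the toy data are illustrative assumptions, not part of the lecture):

```python
# epsilon-SVR with an RBF kernel on a toy 1-D regression problem:
# deviations smaller than epsilon are not penalized.
import numpy as np
from sklearn.svm import SVR

X = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
y = np.sin(X).ravel()

model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X, y)
y_pred = model.predict(X)
```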
–
LECTURE 9 MARKOV DECISION PROCESSES
In sequential decision making
- we take a sequence of decisions (or actions) to reach the goal
- the optimal actions are context-dependent
- we generally do not have examples of correct actions
- actions may have long-term consequences
- short-term consequences of optimal actions might seem negative
In a Markov Decision Process (MDP), the one-step dynamics can be described as:
p(s', r | s, a) = Pr{S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a}
Markov Property: the future state (s') and reward (r) depend only on the current state (s) and action (a)
When the Markov property holds and the state and action sets are finite, the problem is a finite Markov Decision Process
To define a finite MDP, you need to define:
- state and action sets
- one-step dynamics: p(s', r | s, a)
We can derive the next-state distribution and the expected reward as:
p(s' | s, a) = Σ_r p(s', r | s, a)
r(s, a) = E[R_{t+1} | S_t = s, A_t = a] = Σ_r r Σ_{s'} p(s', r | s, a)
Example Finite MDP: Recycling Robot
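The original figure/table of this example is not reproduced here; as a placeholder, a minimal sketch of how such a finite MDP can be written down as a table of one-step dynamics (the numeric values of alpha, beta and the rewards are illustrative assumptions):

```python
# Recycling Robot sketched as a transition table:
# P[(s, a)] -> list of (probability, next_state, reward)
ALPHA, BETA = 0.7, 0.6          # prob. of keeping a high / low battery while searching (assumed values)
R_SEARCH, R_WAIT = 2.0, 1.0     # expected rewards for searching and waiting (assumed values)

STATES = ["high", "low"]
ACTIONS = {"high": ["search", "wait"], "low": ["search", "wait", "recharge"]}

P = {
    ("high", "search"):  [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
    ("high", "wait"):    [(1.0, "high", R_WAIT)],
    ("low", "search"):   [(BETA, "low", R_SEARCH), (1 - BETA, "high", -3.0)],  # -3: battery ran flat
    ("low", "wait"):     [(1.0, "low", R_WAIT)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
}
```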
Return
Agent should not choose actions on the basis of immediate reward!
In fact, long-term consequences are more important than short-term reward
So, we need to take into account the sequence of future rewards
We define the return G_t as a function of the sequence of future rewards:
G_t = f(R_{t+1}, R_{t+2}, R_{t+3}, ...)
The agent will have to maximize the expected return
Different definitions of return are possible:
1. Total reward
2. Discounted reward
3. Average reward
Episodic Tasks
In episodic tasks the agent-environment interaction naturally breaks into chunks called episodes
It is possible to maximize the expected total reward:
G_t = R_{t+1} + R_{t+2} + ... + R_T, where T is the final time step of the episode
Continuing Tasks
In continuing tasks the agent-environment interaction goes on continually and there are no terminal states
The total reward is a sum over an infinite sequence and might not be finite:
G_t = R_{t+1} + R_{t+2} + R_{t+3} + ...
To solve this issue we can discount future rewards by a factor γ (0 < γ < 1):
G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}
Unified Notation
In episodic tasks, we can model terminal states as absorbing states that always produce zero reward
Now we can use the same definition of expected return for episodic and continuing tasks:
G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1}
- γ = 1 can be used only if an absorbing state is always reached
- if γ = 0, the agent would only care about the immediate reward
- as γ → 1, the agent takes future rewards into account more strongly
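As a small illustration (not from the lecture), the discounted return over a finite list of future rewards can be computed as follows:

```python
# G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):     # fold from the end: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))   # 1 + 0.9**3 * 10 = 8.29
```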
Goal
A goal is what we want to achieve
The Reward Hypothesis: all of what we mean by goals and purposes can be thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (the reward).
Examples of reward design:
- goal-reward representation: 1 for the goal, 0 otherwise
- action-penalty representation: -1 while the goal has not been reached, 0 once the goal is reached
Policy
A policy decides, at any given time step, which action the agent selects
It fully defines the behavior of an agent
Policies can be:
- Markovian / Non Markovian
- Deterministic / Stochastic
- Stationary / Non Stationary
Deterministic Policy
In the simplest case the policy can be modeled as a function π: S → A
This type of policy can be conveniently represented using a table that assigns one action to each state (e.g., a deterministic policy in a gridworld environment assigns one move to each cell)
Stochastic Policy
A more general approach is to model the policy as a function that maps each state to a probability distribution over the actions:
π(a | s) = Pr{A_t = a | S_t = s}
NB: a stochastic policy can also be used to represent a deterministic policy (all the probability mass on a single action)
Example: a stochastic policy in a gridworld environment assigns to each cell a probability for each possible move
Markovian/Non Markovian Policy
- π1: choose R with 50% probability and L with 50% probability → Markovian
- π2: alternate R and L → Non-Markovian (it does not depend only on the current state)
NB: we can overcome this limitation by extending the state definition (e.g., in this case by including the previous action)
Value Functions
Given a policy π we can compute the state-value function as:
V_π(s) = E_π[G_t | S_t = s]
It represents the expected return from a given state when following policy π
We can also compute the action-value function as:
Q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
It represents the expected return from a given state when a given action is selected and then policy π is followed
Bellman Expectation Equation
The state-value function can be decomposed into the immediate reward plus the discounted value of the successor state:
V_π(s) = E_π[R_{t+1} + γ V_π(S_{t+1}) | S_t = s] = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [r + γ V_π(s')]
The action-value function can be similarly decomposed:
Q_π(s, a) = E_π[R_{t+1} + γ Q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a] = Σ_{s',r} p(s', r | s, a) [r + γ Σ_{a'} π(a'|s') Q_π(s', a')]
Let's see how the Bellman equations can be used in practice with an example:
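The lecture's worked example is not reproduced here; as a hedged sketch, the Bellman expectation equation for a small MDP can be solved exactly as a linear system, reusing the P/STATES/ACTIONS tables from the Recycling Robot sketch above (the uniform random policy is an arbitrary illustrative choice):

```python
# Solve V_pi = R_pi + gamma * P_pi * V_pi  <=>  (I - gamma * P_pi) V_pi = R_pi
import numpy as np

def evaluate_policy_exactly(P, STATES, policy, gamma=0.9):
    idx = {s: i for i, s in enumerate(STATES)}
    A = np.eye(len(STATES))          # becomes (I - gamma * P_pi)
    b = np.zeros(len(STATES))        # expected one-step reward under the policy
    for s in STATES:
        for a, pi_sa in policy[s].items():
            for prob, s2, r in P[(s, a)]:
                A[idx[s], idx[s2]] -= gamma * pi_sa * prob
                b[idx[s]] += pi_sa * prob * r
    v = np.linalg.solve(A, b)
    return {s: v[idx[s]] for s in STATES}

# e.g. evaluate a uniform random policy over the available actions
uniform = {s: {a: 1 / len(ACTIONS[s]) for a in ACTIONS[s]} for s in STATES}
v_pi = evaluate_policy_exactly(P, STATES, uniform)
```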
Limitations of Bellman Equations
Comparing two policies
We say that π ≥ π′ if and only if V_π(s) ≥ V_π′(s), ∀s ∈ S
Optimal Policy π*
For any Markov Decision Process, there always exists at least one optimal deterministic policy that is better than or equal to all the others (π* ≥ π, ∀π)
Sketch of proof:
Example:
Limitations:
Bellman Optimality Equations
We can compute the optimal state-value function and the optimal action-value function as:
- Bellman Optimality Equation for V*:
V*(s) = max_a Σ_{s',r} p(s', r | s, a) [r + γ V*(s')]
- Bellman Optimality Equation for Q*:
Q*(s, a) = Σ_{s',r} p(s', r | s, a) [r + γ max_{a'} Q*(s', a')]
Why? Because V*(s) = max_a Q*(s, a) and Q*(s, a) = E[R_{t+1} + γ V*(S_{t+1}) | S_t = s, A_t = a]
From V* and Q* we can easily compute the optimal policy:
π*(s) = argmax_a Q*(s, a)
Example:
–
LECTURE 10 DYNAMIC PROGRAMMING
Why Dynamic Programming?
To solve an MDP we need to find an optimal policy
Unfortunately we cannot use a brute-force approach:
- |A|^|S| deterministic policies to evaluate
- |S| linear equations to solve for each policy
Dynamic Programming (DP) is a method that allows solving a complex problem by breaking it down into simpler sub-problems in a recursive manner
How do we use DP to solve an MDP thanks to the Bellman Equations?
Idea: turn the Bellman equations into iterative update rules
Iterative Policy Evaluation
We search for the solution of the Bellman expectation equation:
V_π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [r + γ V_π(s')]
DP solves this problem through the iterative application of the Bellman equation:
V_{k+1}(s) = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [r + γ V_k(s')], ∀s ∈ S
At each iteration the value function is updated for all states (a sweep)
It can be proved that V_k converges to V_π as k → ∞, for any initial V_0
Algorithm:
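A minimal sketch of the algorithm (reusing the P/STATES tables from the Recycling Robot sketch; policy[s] maps a state to a {action: probability} dictionary):

```python
# Iterative policy evaluation: sweep all states until the value function stops changing.
def iterative_policy_evaluation(P, STATES, policy, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:                                   # one sweep over all states
            v_new = sum(
                pi_sa * p * (r + gamma * V[s2])
                for a, pi_sa in policy[s].items()
                for p, s2, r in P[(s, a)]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:                                  # stop when a sweep barely changes V
            return V
```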
Policy Improvement
Remember that Q_π(s, a) = Σ_{s',r} p(s', r | s, a) [r + γ V_π(s')]
What happens if we act greedily w.r.t. a non-optimal value function V_π, i.e., we take π′(s) = argmax_a Q_π(s, a)?
Is π′ different from π?
- If not, it means that π is already the optimal policy
- Otherwise, π′ is better than or as good as π
Theorem: if Q_π(s, π′(s)) ≥ V_π(s), ∀s ∈ S, then V_π′(s) ≥ V_π(s), ∀s ∈ S
Proof:
Policy Iteration
Algorithm:
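A minimal sketch of the algorithm (reusing the P/STATES/ACTIONS tables and the iterative_policy_evaluation function sketched above):

```python
# Policy iteration: alternate full policy evaluation and greedy policy improvement.
def policy_iteration(P, STATES, ACTIONS, gamma=0.9):
    policy = {s: {ACTIONS[s][0]: 1.0} for s in STATES}     # arbitrary initial deterministic policy
    while True:
        V = iterative_policy_evaluation(P, STATES, policy, gamma)   # evaluation
        stable = True
        for s in STATES:                                             # greedy improvement
            best_a = max(
                ACTIONS[s],
                key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)]),
            )
            if best_a not in policy[s]:
                stable = False
            policy[s] = {best_a: 1.0}
        if stable:                                                   # unchanged policy => optimal
            return policy, V
```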
Generalized Policy Iteration
Policy iteration alternates complete policy evaluation and policy improvement until convergence
The policy iteration framework also allows finding the optimal policy by interleaving partial evaluation and improvement steps!
In particular, Value Iteration is one of the most popular GPI methods
Value Iteration
Simply iterate the update of the value function using the Bellman optimality equation:
V_{k+1}(s) = max_a Σ_{s',r} p(s', r | s, a) [r + γ V_k(s')]
It can be proved that V_k converges to V* as k → ∞
Algorithm:
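A minimal sketch of the algorithm, including the greedy extraction of π* at the end (same assumed P/STATES/ACTIONS tables as above):

```python
# Value iteration: repeatedly apply the Bellman optimality backup, then act greedily.
def value_iteration(P, STATES, ACTIONS, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v_new = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
                for a in ACTIONS[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # pi*(s) = argmax_a Q*(s, a)
    policy = {
        s: max(ACTIONS[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)]))
        for s in STATES
    }
    return V, policy
```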
Asynchronous Dynamic Programming
All the DP methods described so far require sweeps of the entire state set.
Asynchronous DP does not use sweeps:
- Pick a state at random
- Apply the appropriate backup
- Repeat until convergence
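A minimal sketch of this idea (an illustration, not the lecture's pseudocode): in-place Bellman optimality backups applied to randomly picked states, using the same assumed P/STATES/ACTIONS tables as above:

```python
import random

# Asynchronous DP: no sweeps, just repeated backups of randomly chosen states.
def async_value_iteration(P, STATES, ACTIONS, gamma=0.9, n_backups=10_000):
    V = {s: 0.0 for s in STATES}
    for _ in range(n_backups):
        s = random.choice(STATES)                          # pick a state at random
        V[s] = max(                                        # apply one Bellman optimality backup
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
            for a in ACTIONS[s]
        )
    return V
```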
An agent’s experience can act as a guide to select states and backup intelligently
Efficiency of DP
Finding an optimal policy is polynomial in the number of states and actions:
Unfortunately, the number of states often grows exponentially with the number of state variables (curse of dimensionality)
Asynchronous DP can be applied to larger problems, and is appropriate for parallel computing
BUT it is easy to come up with MDPs for which DP methods are not practical!
Linear programming approaches can also be used instead of DP, but they typically do not scale well to larger problems
–
LECTURE 11 MONTE CARLO METHODS
Why do we need sample-based methods?
Consider the update rule of the Value Iteration algorithm:
What is the major limitation of this approach?
- We generally do not know the problem dynamics!
We want methods to learn the optimal policy directly from data!
Monte Carlo methods rely only on experience (data) to learn value functions and policies
Monte Carlo methods can be used in two ways
- model-free: no model necessary and still attains optimality
- simulated: needs only a simulation, not a full model
Monte Carlo methods learn from complete sample returns
They are only defined for episodic tasks
Policy Evaluation
- Goal: learn V_π(s)
- Given: some number of episodes under π which contain s
- Idea: average returns observed after visits to s
1. Every-Visit MC: average returns for every time s is visited in an episode
2. First-visit MC: average returns only for first time s is visited in an episode
Both converge asymptotically
First-visit Monte Carlo policy evaluation
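A minimal sketch of the algorithm (sample_episode is an assumed helper that runs one episode under the given policy and returns a list of (state, reward) pairs, with the reward being the one received after leaving that state):

```python
from collections import defaultdict

# First-visit MC: for each state, average the returns observed after its first visit in each episode.
def first_visit_mc_evaluation(sample_episode, policy, gamma=0.9, n_episodes=10_000):
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for _ in range(n_episodes):
        episode = sample_episode(policy)            # [(s_0, r_1), (s_1, r_2), ...]
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)            # first time step at which s appears
        g = 0.0
        for t in reversed(range(len(episode))):     # walk backwards accumulating G_t
            s, r = episode[t]
            g = r + gamma * g
            if first_visit[s] == t:                 # only the first visit of s contributes
                returns_sum[s] += g
                returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```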
Policy Iteration
We fixed the evaluation with Monte Carlo Policy Evaluation
BUT the improvement step is still a problem: acting greedily w.r.t. V_π requires knowledge of the dynamics
Idea: estimate action values Q_π(s, a) instead of state values, so that the greedy improvement does not need a model
How?
Action Values Evaluation
We average the returns observed after starting from state s, selecting action a and then following π:
Q_π(s, a) ≈ average of the returns G_t observed after visits to (s, a)
It converges asymptotically if every state-action pair is visited (infinitely often)
Monte Carlo Policy Iteration with Exploring Starts (ES)
ε-Soft Monte Carlo Policy Iteration
Exploring Starts is a simple idea but it is not always possible!
We need to keep exploring during the learning process!
This leads to a key problem in Reinforcement Learning: the Exploration-Exploitation Dilemma
ε-Greedy Exploration
It is the simplest solution to the exploration-exploitation dilemma
Instead of searching for the optimal deterministic policy, we search for the optimal ε-soft policy, i.e., a policy that selects each action with a probability of at least ε/|A|
In particular we use the ε-greedy policy:
π(a | s) = 1 - ε + ε/|A| if a = argmax_{a'} Q(s, a'), ε/|A| otherwise
Algorithm:
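A minimal sketch of ε-greedy action selection over a table of action values Q[s][a] (an illustration, not the lecture's pseudocode):

```python
import random

# With probability epsilon pick a uniformly random action (explore),
# otherwise pick the greedy action w.r.t. the current estimate of Q (exploit).
def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[state][a])
```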
Theorem
Any ε-greedy policy π′ with respect to Q_π is an improvement over any ε-soft policy π
On-Policy vs Off-Policy Learning
On-Policy Learning
The agent learns the value functions of the same policy used to select actions
We cannot easily lear