2^E is the set of all the possible subsets of E. The deterministic case is a special case of the non-deterministic one (for every transition there is only one possible state).
Stochastic or Probabilistic

τ : R^Ac → ∆(E)

It returns a probability distribution over the states in E. It is not a generalization of the non-deterministic one; the two are independent concepts.
Examples (r is a possible run):
– Deterministic: τ(r) = e′
– Non-deterministic: τ(r) = {e′, e′′}
– Probabilistic: τ(r) = {e′ : 1/2, e′′ : 1/4, e′′′ : 1/4}
τ describes how the environment works.
• Env = ⟨E, e₀, τ⟩. This tuple describes how the environment is perceived by the agent.
• Agent function

Ag : R^E → Ac

Given the history till some state e, it returns the next agent action α. AG is the set of all possible agent functions.
• Utility function: the way in which we express what we want our agent to do. It is crafted by the designer of the model. We do not consider how the agent will reach the goal.

u : E → ℝ

In this way, the goal is to reach the state with the highest utility.

u : R → ℝ

In this case the utility depends on the whole history (run).
2.2.1 When is an agent a good agent?
We define p(r), r ∈ R, as the probability of a run. We then define the expected utility as

EU(Ag, Env) = Σ_{r∈R} u(r) · p(r)
The ideal optimal agent maximizes the EU, thus

Ag* = arg max_{Ag∈AG} EU(Ag, Env)
These two formulations are not applicable in practice, because the sets involved (the runs R and the agent functions AG) are infinite.
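To make the two definitions concrete, here is a minimal Python sketch for a toy case in which the set of runs is finite and enumerated explicitly; the agents, runs, probabilities and utilities are made up for illustration.

# Toy illustration of EU(Ag, Env) = Σ_r u(r) · p(r) and Ag* = arg max EU.
# Runs, probabilities, utilities and candidate agents are hypothetical;
# in general R and AG are infinite, so this only works for tiny finite cases.

# Each candidate agent induces a probability distribution over runs.
runs_by_agent = {
    "Ag1": {("e0", "a", "e1"): 0.7, ("e0", "a", "e2"): 0.3},
    "Ag2": {("e0", "b", "e1"): 0.2, ("e0", "b", "e3"): 0.8},
}

# Utility of each run (here the utility depends on the whole history).
utility = {
    ("e0", "a", "e1"): 1.0,
    ("e0", "a", "e2"): 0.0,
    ("e0", "b", "e1"): 1.0,
    ("e0", "b", "e3"): 0.5,
}

def expected_utility(agent):
    # EU(Ag, Env) = sum over runs of u(r) * p(r)
    return sum(utility[r] * p for r, p in runs_by_agent[agent].items())

best_agent = max(runs_by_agent, key=expected_utility)  # Ag* over a finite AG
print(best_agent, expected_utility(best_agent))        # Ag1 0.7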
2.2.2 Special cases of this general framework
Purely reactive agents Just react to events, no history involved:

Ag : E → Ac
Planning

ψ : R → {0, 1}

Achievement: if a run contains a particular state, return 1, else 0;
Maintenance: I don't want to get into some particular states.
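A minimal sketch of the two kinds of task specification ψ : R → {0, 1}, with a run represented as a plain tuple of visited states (the state names are made up):

# Achievement and maintenance task specifications psi : R -> {0, 1}.
# A run is represented as a tuple of visited states; names are hypothetical.

def achievement(run, goal_states):
    # 1 if the run contains at least one goal state, else 0
    return 1 if any(s in goal_states for s in run) else 0

def maintenance(run, bad_states):
    # 1 if the run never enters a bad state, else 0
    return 1 if all(s not in bad_states for s in run) else 0

run = ("e0", "e1", "e2")
print(achievement(run, {"e2"}))   # 1: the goal state e2 is reached
print(maintenance(run, {"e9"}))   # 1: the bad state e9 is never entered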
3 Markov Decision Processes - MDP
Environment: Env = ⟨E, e₀, τ⟩               Transitions: T(s, a, s′) ∈ [0, 1]
Agent: Ac, Ag : R^E → Ac                    Policy: π : S → A
States: S, s ∈ S                            Utility: u : S → ℝ
Actions: A, a ∈ A                           EU(u, s, a) = Σ_{s′∈S} T(s, a, s′) · u(s′)
The utility in an MDP is a function that gives the benefit of being in one state: u : S → ℝ, where S is the set of all states. In a Markov Decision Process, S is an instance of E.
How can an environment be described in a Markov decision process? s₁ is the initial state of the environment, s₁ ∈ S. The transition function is T(s, a, s′) ∈ [0, 1], where s is the current state, a the action taken and s′ is the possible following state. It returns a probability value in [0, 1], that represents the probability of going into state s′ from s taking action a. It's a stochastic process.
[Figure: from state s, action a can lead to s′, s′′ or s′′′, with probabilities T(s, a, s′), T(s, a, s′′), T(s, a, s′′′).]

Obvious relation: T(s, a, s′) + T(s, a, s′′) + T(s, a, s′′′) = 1
Why is this process a Markov Decision Process? The transition function τ is history dependent, so given the whole history of interaction between agent and environment it will return the next state. A special case of this general definition is when the next state depends only on the last state: this is the Markov property.
Expected utility is not defined on all the runs, but on the current state. If I start from s and perform action a, I would expect some utility, and this expected utility is calculated as the sum over all possible states s′ of the probability of going in s′ times the utility in state s′:

EU(u, s, a) = Σ_{s′∈S} T(s, a, s′) · u(s′)
The agent's behavior is captured by a policy, which is a mapping from states to actions. We will use π to denote the agent's policy. Given a state s, the policy returns the action a that the agent should perform. The agent's goal then becomes finding its optimal policy, which is the one that maximizes its expected utility, as we might expect a selfish agent would do. This strategy is known as the principle of maximum expected utility. The agent's optimal policy can thus be defined as:

π*(s) = arg max_{a∈A} EU(u, s, a)
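As a sketch, both EU(u, s, a) and π*(s) fit in a few lines of Python, storing the transition function as a nested dictionary T[s][a] = {s′: probability}; the states, actions, probabilities and utilities below are illustrative numbers.

# EU(u, s, a) = Σ_{s'} T(s, a, s') · u(s')   and   π*(s) = arg max_a EU(u, s, a).
# T[s][a] maps each possible next state to its probability; numbers are illustrative.

T = {
    "s1": {"a1": {"s2": 0.8, "s1": 0.2},
           "a2": {"s4": 0.8, "s1": 0.2}},
}
u = {"s1": 0.0, "s2": 0.3, "s4": 0.6}

def expected_utility(T, u, s, a):
    # probability-weighted utility of the possible next states
    return sum(p * u[s_next] for s_next, p in T[s][a].items())

def optimal_action(T, u, s):
    # the action with maximum expected utility in state s
    return max(T[s], key=lambda a: expected_utility(T, u, s, a))

print(expected_utility(T, u, "s1", "a1"))  # 0.8·0.3 + 0.2·0.0 = 0.24
print(optimal_action(T, u, "s1"))          # a2, since 0.8·0.6 + 0.2·0.0 = 0.48 > 0.24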
Def. (Markov Decision Process) An MDP consists of an initial state s₁ taken from a set of states S, a transition function T(s, a, s′) and a reward function r : S → ℝ.
MDPs are usually represented using graphs. Nodes represent the states of the MDP, S = {s₁, s₂, s₃, s₄}; the initial state is s₁. Arcs represent transitions, and the set of actions is A = {a₁, a₂, a₃, a₄}. Some actions can be applied in some states, others not.
Rewards:

  s     r(s)
  s₁    0
  s₂    0
  s₃    1
  s₄    0
Time is discrete, and at every timestep the agent must perform an action. Every time an agent reaches a state, it gets the reward associated with that state. E.g. the agent is in s₃, it performs action a₄ and goes into s₂, getting reward 0.
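One possible encoding of such an MDP in Python is with plain dictionaries. The rewards follow the table above; only the transitions out of s₁ are filled in (they are the ones used in the Bellman-equation example later), since the full graph is not reproduced here.

# Sketch of the example MDP as dictionaries.
# Rewards follow the table above; only the transitions out of s1 are filled in,
# the rest of the graph would be read off the figure.

rewards = {"s1": 0, "s2": 0, "s3": 1, "s4": 0}

# transitions[s][a] = {next_state: probability}
transitions = {
    "s1": {"a1": {"s2": 0.8, "s1": 0.2},
           "a2": {"s4": 0.8, "s1": 0.2}},
    # "s2": {...}, "s3": {...}, "s4": {...}
}

def check_distributions(transitions):
    # For every state s and applicable action a, T(s, a, ·) must sum to 1.
    for s, actions in transitions.items():
        for a, dist in actions.items():
            assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)

check_distributions(transitions)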
Assume we have a fixed policy π and that the agent starts from s₁. Is the sequence of states that it will visit fixed? No, because the actions are not deterministic, but stochastic.
An intelligent agent should always follow the optimal policy:

π*(s) = arg max_{a∈A} Σ_{s′∈S} T(s, a, s′) · u(s′)
The problem is that we don’t know the utility u!
The reward is called immediate reward because the agent gets the reward as soon as it reaches the state. The utility is more complex: it expresses the benefit of being in a state (how good is it to be in that state now? How good is it to be in that state in the future?).
The utility is given by:
• the reward of the current state;
• the reward that I will get in the future, starting from s and following the optimal policy.
If I look only at the immediate reward, being in state s₁ and s₄ is the same, because the reward is zero in both cases. Looking at the utilities, I'm not only looking at the immediate reward, but also at the reward I'll get in the future. Looking at the future, it's better to be in s₄ because from there I could go into s₃ and get the reward 1.

u(s₁) = r(s₁) + reward I'll get in the future
We generally prefer to use discounted rewards which allow us to smoothly
reduce the impact of rewards that are farther off in the future. We do this
by multiplying the agent’s future rewards by an ever-decreasing discount
factor represented by γ, which is a number between zero and one.
For example, if an agent using policy π starts out in state s₁ and visits states s₁, s₂, s₃, . . ., then we say that its discounted reward is given by

γ⁰ · r(s₁) + γ¹ · r(s₂) + γ² · r(s₃) + · · ·

Note that we only know s₁. The rest of the states s₂, s₃, . . . depend on the transition function T.
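A tiny sketch of the discounted-reward computation for one concrete visited sequence (the sequence itself is just one possible outcome of the stochastic transitions; γ = 0.5 as in the worked example below):

# Discounted reward of a visited state sequence:
# γ^0·r(s1) + γ^1·r(s2) + γ^2·r(s3) + ...
rewards = {"s1": 0, "s2": 0, "s3": 1, "s4": 0}
gamma = 0.5
visited = ["s1", "s2", "s3"]   # one possible (hypothetical) prefix of a run

discounted = sum(gamma**t * rewards[s] for t, s in enumerate(visited))
print(discounted)              # 1·0 + 0.5·0 + 0.25·1 = 0.25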
3.1 Bellman Equation
We define the Bellman equation as:

u(s) = r(s) + γ max_{a∈A} Σ_{s′∈S} T(s, a, s′) · u(s′)

The utility of a state is the reward I can get in that state plus the expected utility of being in the next states, discounted by γ.
u(s₁) = 0 + γ · max{ 0.8 · u(s₂) + 0.2 · u(s₁) ; 0.8 · u(s₄) + 0.2 · u(s₁) }
u(s₂) = 0 + · · ·
u(s₃) = 1 + · · ·
u(s₄) = 0 + · · ·
This is a system of four equations with unknowns u(s₁), u(s₂), u(s₃), u(s₄). The system is not linear because of the max operator. We don't have an easy way to solve a system of four equations with four unknowns with nonlinearity. But in principle, if we can solve the system, we'll have the utilities starting from the Bellman equation. For small systems we can actually solve it.
Why do we need utilities? To find the best policy!
For larger problems we cannot solve a large system of nonlinear equations in closed form, but there is an algorithmic way to calculate the utilities: Value Iteration.
3.2 Value Iteration
We do not know the real value of the utility function, but in an iterative way
we can get a good approximation of it, exploiting the following algorithm:
Value-Iteration(T, r, γ, ε)
repeat
    u ← u′;
    δ ← 0;
    for s ∈ S do
        u′(s) ← r(s) + γ max_{a∈A} Σ_{s′∈S} T(s, a, s′) · u(s′);   /* Bellman Update */
        if |u′(s) − u(s)| > δ then
            δ ← |u′(s) − u(s)|;
        end
    end
until δ < ε(1 − γ)/γ;
return u;
As input we have:
• T : transition function,
• r: reward function,
• γ: discount factor,
• ε: error bound.
u is an array that contains the old values of the utilities of the states. u′ is an array with the updated (new) values of the utilities.
In the for loop the new values of the utilities are computed. The initialization stores u′ into u, that is, the past new values become the new old values. The new u′(s) is then calculated according to the old values of the utilities u(s′).
δ is the maximum difference between the old utilities and the new ones at each step.
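A compact Python version of the pseudocode, a sketch that assumes the dictionary encoding used in the earlier snippets (transitions[s][a] = {s′: probability}, rewards[s] = immediate reward) and that every state has at least one applicable action:

# Value iteration following the pseudocode above.
def value_iteration(transitions, rewards, gamma, eps):
    u_new = {s: 0.0 for s in transitions}        # the algorithm works for any init
    while True:
        u = dict(u_new)                          # u <- u'
        delta = 0.0
        for s in transitions:
            # Bellman update: r(s) + γ max_a Σ_{s'} T(s, a, s') · u(s')
            u_new[s] = rewards[s] + gamma * max(
                sum(p * u[s_next] for s_next, p in dist.items())
                for dist in transitions[s].values()
            )
            delta = max(delta, abs(u_new[s] - u[s]))
        if delta < eps * (1 - gamma) / gamma:    # stopping criterion
            return u

With the full transition dictionary of the four-state example and γ = 0.5, it should reproduce the hand computation that follows.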
Example:

       s₁   s₂   s₃   s₄
u       0    0    0    0

Rewards:

  s     r(s)
  s₁    0
  s₂    0
  s₃    1
  s₄    0
The algorithm works for any initialization of u (in this case 0, 0, 0, 0). We
take γ = 0.5. δ is initialized to 0.
First iteration:

u′(s₁) ← 0 + 0.5 · max_{a∈A}{ a₁ : 0.2 · 0 + 0.8 · 0 ; a₂ : 0.2 · 0 + 0.8 · 0 } = 0
u′(s₂) ← 0 + 0.5 · max_{a∈A}{ a₂ : 0.2 · 0 + 0.8 · 0 ; a₃ : 0.2 · 0 + 0.8 · 0 } = 0
u′(s₃) ← 1 + 0.5 · max_{a∈A}{ a₃ : 1 · 0 ; a₄ : 1 · 0 } = 1
u′(s₄) ← 0 + 0.5 · max_{a∈A}{ a₄ : 0.2 · 0 + 0.8 · 0 ; a₁ : 0.1 · 0 + 0.9 · 0 } = 0
       s₁   s₂   s₃   s₄
u       0    0    0    0
u′      0    0    1    0
|u −
Since in s : (s ) u(s )| = 1 > δ = 0 we get δ = 1.
3 3 3
The optimal utility u*(s) satisfies

u(s) − ε ≤ u*(s) ≤ u(s) + ε

Suppose in our case that ε = 1.1; then

−1.1 ≤ u*(s₁) ≤ 1.1        −0.1 ≤ u*(s₃) ≤ 2.1
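As a brief sketch of why the stopping test δ < ε(1 − γ)/γ yields this bound (using the standard fact, assumed here, that the Bellman update is a γ-contraction in the max norm ‖·‖):

‖u′ − u*‖ ≤ γ ‖u − u*‖ ≤ γ ( ‖u − u′‖ + ‖u′ − u*‖ )

so (1 − γ) ‖u′ − u*‖ ≤ γ ‖u′ − u‖, and hence

‖u′ − u*‖ ≤ γ/(1 − γ) · ‖u′ − u‖ < γ/(1 − γ) · ε(1 − γ)/γ = ε.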
Second iteration:
u′(s₁) ← 0 + 0.5 · max{ a₁ : · · ·