2^E is the set of all the possible subsets of E. The deterministic case is a special case of the non-deterministic one (for every transition there is only one possible state).
Stochastic or Probabilistic

τ : R^Ac → ∆(E)

It returns a probability distribution over the states in E. It is not a generalization of the non-deterministic one; the two are independent concepts.
Examples (r is a possible run):
– Deterministic: τ(r) = e′
– Non-deterministic: τ(r) = {e′, e′′}
– Probabilistic: τ(r) = {e′ : 1/2, e′′ : 1/4, e′′′ : 1/4}
τ describes how the environment works.
• Env = ⟨E, e₀, τ⟩. This tuple describes how the environment is perceived by the agent.
• Agent function

Ag : R^E → Ac

Given the history till some state e, it returns the next agent action α. AG is the set of all possible agent functions.
• Utility function: the way in which we express what we want our agent to do. It is crafted by the designer of the model. We do not consider how the agent will reach the goal.

u : E → ℝ

In this way, the goal is to reach the state with the highest utility.

u : R → ℝ

In this case the utility depends on the whole history (run).
2.2.1 When is an agent a good agent?
We define p(r), r ∈ R, as the probability of a run. We then define the expected utility as

EU(Ag, Env) = Σ_{r∈R} u(r) · p(r)
The ideal optimal agent maximizes the EU, thus

Ag* = arg max_{Ag∈AG} EU(Ag, Env)
These two formulations are not applicable in practice, because the sets involved (the runs R and the agent functions AG) are infinite.
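To make the two definitions concrete, here is a minimal Python sketch for a toy case in which the set of runs is finite and enumerated explicitly; the agents, runs, probabilities and utilities are made up for illustration.

# Toy illustration of EU(Ag, Env) = Σ_r u(r) · p(r) and Ag* = arg max EU.
# Runs, probabilities, utilities and candidate agents are hypothetical;
# in general R and AG are infinite, so this only works for tiny finite cases.

# Each candidate agent induces a probability distribution over runs.
runs_by_agent = {
    "Ag1": {("e0", "a", "e1"): 0.7, ("e0", "a", "e2"): 0.3},
    "Ag2": {("e0", "b", "e1"): 0.2, ("e0", "b", "e3"): 0.8},
}

# Utility of each run (here the utility depends on the whole history).
utility = {
    ("e0", "a", "e1"): 1.0,
    ("e0", "a", "e2"): 0.0,
    ("e0", "b", "e1"): 1.0,
    ("e0", "b", "e3"): 0.5,
}

def expected_utility(agent):
    # EU(Ag, Env) = sum over runs of u(r) * p(r)
    return sum(utility[r] * p for r, p in runs_by_agent[agent].items())

best_agent = max(runs_by_agent, key=expected_utility)  # Ag* over a finite AG
print(best_agent, expected_utility(best_agent))        # Ag1 0.7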
2.2.2 Special cases of this general framework
Purely reactive agents Just react to events, no history involved:

Ag : E → Ac
Planning

ψ : R → {0, 1}

Achievement: if a run contains a particular state, return 1, else 0;
Maintenance: I don't want to get into some particular states.
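A minimal sketch of the two kinds of task specification ψ : R → {0, 1}, with a run represented as a plain tuple of visited states (the state names are made up):

# Achievement and maintenance task specifications psi : R -> {0, 1}.
# A run is represented as a tuple of visited states; names are hypothetical.

def achievement(run, goal_states):
    # 1 if the run contains at least one goal state, else 0
    return 1 if any(s in goal_states for s in run) else 0

def maintenance(run, bad_states):
    # 1 if the run never enters a bad state, else 0
    return 1 if all(s not in bad_states for s in run) else 0

run = ("e0", "e1", "e2")
print(achievement(run, {"e2"}))   # 1: the goal state e2 is reached
print(maintenance(run, {"e9"}))   # 1: the bad state e9 is never entered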
3 Markov Decision Processes - MDP
Environment: Env = ⟨E, e₀, τ⟩               Transitions: T(s, a, s′) ∈ [0, 1]
Agent: Ac, Ag : R^E → Ac                    Policy: π : S → A
States: S, s ∈ S                            Utility: u : S → ℝ
Actions: A, a ∈ A                           EU(u, s, a) = Σ_{s′∈S} T(s, a, s′) · u(s′)
The utility in an MDP is a function that gives the benefit of being in one state: u : S → ℝ, where S is the set of all states. In a Markov Decision Process, S is an instance of E.
How can an environment be described in a Markov decision process? s₁ is the initial state of the environment, s₁ ∈ S. The transition function is T(s, a, s′) ∈ [0, 1], where s is the current state, a the action taken and s′ is the possible following state. It returns a probability value in [0, 1], that represents the probability of going into state s′ from s taking action a. It's a stochastic process.
[Figure: from state s, action a can lead to s′, s′′ or s′′′, with probabilities T(s, a, s′), T(s, a, s′′), T(s, a, s′′′).]

Obvious relation: T(s, a, s′) + T(s, a, s′′) + T(s, a, s′′′) = 1
Why is this process a Markov Decision Process? The transition function τ is history dependent, so given the whole history of interaction between agent and environment it will return the next state. A special case of this general definition is when the next state depends only on the last state: this is the Markov property.
Expected utility is not defined on all the runs, but on the current state. If I start from s and perform action a, I would expect some utility, and this expected utility is calculated as the sum over all possible states s′ of the probability of going in s′ times the utility in state s′:

EU(u, s, a) = Σ_{s′∈S} T(s, a, s′) · u(s′)
The agent's behavior is captured by a policy, which is a mapping from states to actions. We will use π to denote the agent's policy. Given a state s, the policy returns the action a that the agent should perform. The agent's goal then becomes finding its optimal policy, which is the one that maximizes its expected utility, as we might expect a selfish agent would do. This strategy is known as the principle of maximum expected utility. The agent's optimal policy can thus be defined as:

π*(s) = arg max_{a∈A} EU(u, s, a)
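As a sketch, both EU(u, s, a) and π*(s) fit in a few lines of Python, storing the transition function as a nested dictionary T[s][a] = {s′: probability}; the states, actions, probabilities and utilities below are illustrative numbers.

# EU(u, s, a) = Σ_{s'} T(s, a, s') · u(s')   and   π*(s) = arg max_a EU(u, s, a).
# T[s][a] maps each possible next state to its probability; numbers are illustrative.

T = {
    "s1": {"a1": {"s2": 0.8, "s1": 0.2},
           "a2": {"s4": 0.8, "s1": 0.2}},
}
u = {"s1": 0.0, "s2": 0.3, "s4": 0.6}

def expected_utility(T, u, s, a):
    # probability-weighted utility of the possible next states
    return sum(p * u[s_next] for s_next, p in T[s][a].items())

def optimal_action(T, u, s):
    # the action with maximum expected utility in state s
    return max(T[s], key=lambda a: expected_utility(T, u, s, a))

print(expected_utility(T, u, "s1", "a1"))  # 0.8·0.3 + 0.2·0.0 = 0.24
print(optimal_action(T, u, "s1"))          # a2, since 0.8·0.6 + 0.2·0.0 = 0.48 > 0.24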
Def. (Markov Decision Process) An MDP consists of an initial state s₁ taken from a set of states S, a transition function T(s, a, s′) and a reward function r : S → ℝ.
MDPs are usually represented using graphs. Nodes represent the states of the MDP, S = {s₁, s₂, s₃, s₄}; the initial state is s₁. Arcs represent transitions, and the set of actions is A = {a₁, a₂, a₃, a₄}. Some actions can be applied in some states, others not.
Rewards:

  s     r(s)
  s₁    0
  s₂    0
  s₃    1
  s₄    0
Time is discrete, and at every timestep the agent must perform an action. Every time an agent reaches a state, it gets the reward associated with that state. E.g. the agent is in s₃, it performs action a₄ and goes into s₂, getting reward 0.
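One possible encoding of such an MDP in Python is with plain dictionaries. The rewards follow the table above; only the transitions out of s₁ are filled in (they are the ones used in the Bellman-equation example later), since the full graph is not reproduced here.

# Sketch of the example MDP as dictionaries.
# Rewards follow the table above; only the transitions out of s1 are filled in,
# the rest of the graph would be read off the figure.

rewards = {"s1": 0, "s2": 0, "s3": 1, "s4": 0}

# transitions[s][a] = {next_state: probability}
transitions = {
    "s1": {"a1": {"s2": 0.8, "s1": 0.2},
           "a2": {"s4": 0.8, "s1": 0.2}},
    # "s2": {...}, "s3": {...}, "s4": {...}
}

def check_distributions(transitions):
    # For every state s and applicable action a, T(s, a, ·) must sum to 1.
    for s, actions in transitions.items():
        for a, dist in actions.items():
            assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)

check_distributions(transitions)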
Assume we have a fixed policy π and that the agent starts from s₁. Is the sequence of states that it will visit fixed? No, because the actions are not deterministic, but stochastic.
An intelligent agent should always follow the optimal policy:

π*(s) = arg max_{a∈A} Σ_{s′∈S} T(s, a, s′) · u(s′)
The problem is that we don’t know the utility u!
The reward is called immediate reward because the agent gets the reward as soon as it reaches the state. The utility is more complex: it expresses the benefit of being in a state (how good is it to be in that state now? How good is it to be in that state in the future?).
The utility is given by:
• the reward of the current state;
• the reward that I will get in the future, starting from s and following the optimal policy.
If I look only at the immediate reward, being in state s₁ and s₄ is the same, because the reward is zero in both cases. Looking at the utilities, I'm not only looking at the immediate reward, but also at the reward I'll get in the future. Looking at the future, it's better to be in s₄ because from there I could go into s₃ and get the reward 1.

u(s₁) = r(s₁) + reward I'll get in the future
We generally prefer to use discounted rewards which allow us to smoothly
reduce the impact of rewards that are farther off in the future. We do this
by multiplying the agent’s future rewards by an ever-decreasing discount
factor represented by γ, which is a number between zero and one.
For example, if an agent using policy π starts out in state s₁ and visits states s₁, s₂, s₃, . . ., then we say that its discounted reward is given by

γ⁰ · r(s₁) + γ¹ · r(s₂) + γ² · r(s₃) + · · ·

Note that we only know s₁. The rest of the states s₂, s₃, . . . depend on the transition function T.
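A tiny sketch of the discounted-reward computation for one concrete visited sequence (the sequence itself is just one possible outcome of the stochastic transitions; γ = 0.5 as in the worked example below):

# Discounted reward of a visited state sequence:
# γ^0·r(s1) + γ^1·r(s2) + γ^2·r(s3) + ...
rewards = {"s1": 0, "s2": 0, "s3": 1, "s4": 0}
gamma = 0.5
visited = ["s1", "s2", "s3"]   # one possible (hypothetical) prefix of a run

discounted = sum(gamma**t * rewards[s] for t, s in enumerate(visited))
print(discounted)              # 1·0 + 0.5·0 + 0.25·1 = 0.25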
3.1 Bellman Equation
We define the Bellman equation as:

u(s) = r(s) + γ max_{a∈A} Σ_{s′∈S} T(s, a, s′) · u(s′)

The utility of a state is the reward I can get in that state plus the expected utility of being in the next states, discounted by γ.
u(s₁) = 0 + γ · max{ 0.8 · u(s₂) + 0.2 · u(s₁) ; 0.8 · u(s₄) + 0.2 · u(s₁) }
u(s₂) = 0 + · · ·
u(s₃) = 1 + · · ·
u(s₄) = 0 + · · ·
This is a system of four equations with unknowns u(s₁), u(s₂), u(s₃), u(s₄). The system is not linear because of the max operator. We don't have an easy way to solve a system of four equations with four unknowns with nonlinearity. But in principle, if we can solve the system, we'll have the utilities starting from the Bellman equation. For small systems we can actually solve it.
Why do we need utilities? To find the best policy!
For larger problems we cannot solve a large system of nonlinear equations in closed form, but there is an algorithmic way to calculate the utilities: Value Iteration.
3.2 Value Iteration
We do not know the real value of the utility function, but in an iterative way
we can get a good approximation of it, exploiting the following algorithm:
Value-Iteration(T, r, γ, ε)
repeat
    u ← u′;
    δ ← 0;
    for s ∈ S do
        u′(s) ← r(s) + γ max_{a∈A} Σ_{s′∈S} T(s, a, s′) · u(s′);   /* Bellman Update */
        if |u′(s) − u(s)| > δ then
            δ ← |u′(s) − u(s)|;
        end
    end
until δ < ε(1 − γ)/γ;
return u;
As input we have:
• T : transition function,
• r: reward function,
• γ: discount factor,
• ε: error bound.
u is an array that contains the old values of the utilities of the states. u′ is an array with the updated (new) values of the utilities.
In the for loop the new values of the utilities are computed. The initialization stores u′ into u, that is, the past new values become the new old values. The new u′(s) is then calculated according to the old values of the utilities u(s′).
δ is the maximum difference between the old utilities and the new ones at each step.
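A compact Python version of the pseudocode, a sketch that assumes the dictionary encoding used in the earlier snippets (transitions[s][a] = {s′: probability}, rewards[s] = immediate reward) and that every state has at least one applicable action:

# Value iteration following the pseudocode above.
def value_iteration(transitions, rewards, gamma, eps):
    u_new = {s: 0.0 for s in transitions}        # the algorithm works for any init
    while True:
        u = dict(u_new)                          # u <- u'
        delta = 0.0
        for s in transitions:
            # Bellman update: r(s) + γ max_a Σ_{s'} T(s, a, s') · u(s')
            u_new[s] = rewards[s] + gamma * max(
                sum(p * u[s_next] for s_next, p in dist.items())
                for dist in transitions[s].values()
            )
            delta = max(delta, abs(u_new[s] - u[s]))
        if delta < eps * (1 - gamma) / gamma:    # stopping criterion
            return u

With the full transition dictionary of the four-state example and γ = 0.5, it should reproduce the hand computation that follows.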
Example:

       s₁   s₂   s₃   s₄
u       0    0    0    0

Rewards:

  s     r(s)
  s₁    0
  s₂    0
  s₃    1
  s₄    0
The algorithm works for any initialization of u (in this case 0, 0, 0, 0). We
take γ = 0.5. δ is initialized to 0.
First iteration:

u′(s₁) ← 0 + 0.5 · max_{a∈A}{ a₁ : 0.2 · 0 + 0.8 · 0 ; a₂ : 0.2 · 0 + 0.8 · 0 } = 0
u′(s₂) ← 0 + 0.5 · max_{a∈A}{ a₂ : 0.2 · 0 + 0.8 · 0 ; a₃ : 0.2 · 0 + 0.8 · 0 } = 0
u′(s₃) ← 1 + 0.5 · max_{a∈A}{ a₃ : 1 · 0 ; a₄ : 1 · 0 } = 1
u′(s₄) ← 0 + 0.5 · max_{a∈A}{ a₄ : 0.2 · 0 + 0.8 · 0 ; a₁ : 0.1 · 0 + 0.9 · 0 } = 0
       s₁   s₂   s₃   s₄
u       0    0    0    0
u′      0    0    1    0
|u −
Since in s : (s ) u(s )| = 1 > δ = 0 we get δ = 1.
3 3 3
The optimal utility u*(s) satisfies

u(s) − ε ≤ u*(s) ≤ u(s) + ε

Suppose in our case that ε = 1.1; then

−1.1 ≤ u*(s₁) ≤ 1.1        −0.1 ≤ u*(s₃) ≤ 2.1
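As a brief sketch of why the stopping test δ < ε(1 − γ)/γ yields this bound (using the standard fact, assumed here, that the Bellman update is a γ-contraction in the max norm ‖·‖):

‖u′ − u*‖ ≤ γ ‖u − u*‖ ≤ γ ( ‖u − u′‖ + ‖u′ − u*‖ )

so (1 − γ) ‖u′ − u*‖ ≤ γ ‖u′ − u‖, and hence

‖u′ − u*‖ ≤ γ/(1 − γ) · ‖u′ − u‖ < γ/(1 − γ) · ε(1 − γ)/γ = ε.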
Second iteration:
u′(s₁) ← 0 + 0.5 · max{ a₁ : · · ·