ROB. Reservation stations now only queue operations to the FUs between the time they issue and the time they begin execution. Results are tagged with ROB entry numbers rather than with RS numbers. All instructions except incorrectly predicted branches commit when they reach the head of the ROB. When an incorrectly predicted branch reaches the head of the ROB, the ROB is flushed and execution restarts at the correct successor of the branch.
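The in-order commit and flush-on-misprediction behavior above can be sketched as follows (a minimal illustration with hypothetical names, not a full ROB model):

```python
from collections import deque

class ROBEntry:
    def __init__(self, tag, is_branch=False, mispredicted=False, correct_target=None):
        self.tag = tag                     # results are tagged with this ROB entry number
        self.is_branch = is_branch
        self.mispredicted = mispredicted
        self.correct_target = correct_target
        self.ready = False                 # set when the FU writes the result back

def commit(rob):
    """Commit ready instructions strictly from the head of the ROB.

    A mispredicted branch reaching the head flushes all younger
    instructions and returns the correct fetch target.
    """
    committed = []
    while rob and rob[0].ready:
        head = rob.popleft()
        if head.is_branch and head.mispredicted:
            rob.clear()                    # flush everything behind the branch
            return committed, head.correct_target
        committed.append(head.tag)
    return committed, None                 # None: no restart needed
```

For example, if entry 1 is a mispredicted branch, entry 0 commits normally, then the flush discards entry 2 even though its result is ready.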
Explicit Register Renaming
Tomasulo provides implicit register renaming (user registers are renamed to reservation-station tags). Explicit register renaming instead uses a physical register file larger than the number of registers specified by the ISA. The key insight is to allocate a new physical destination register for every instruction that writes: this removes every chance of WAR or WAW hazards. Like Tomasulo, it allows full out-of-order completion. Each instruction that writes a result is allocated a new physical register from the free list. When a physical register is dead (no longer live) we free it up. We keep a translation table that records the mapping from ISA registers to physical registers: when a register is written, its entry is replaced with a new register from the free list. Physical registers become free when they are no longer used by any active instruction. With a unified physical register file, all architectural registers are renamed into a single physical register file during the decode stage; no register values are read at rename time. FUs read and write a single unified register file holding both committed and temporary (speculative) values during execution. Commit only updates the mapping of the architectural register to the physical register, without any data movement.
HW register renaming. Renaming map: a simple data structure that supplies the physical register number of the register that currently corresponds to the requested architectural register. Instruction commit: permanently update the renaming map to indicate that the physical register holding the destination value corresponds to the architectural register. The ROB is used to enforce in-order commit.
The physical register file holds both committed and speculative values; values are decoupled from ROB entries (no data is stored in the ROB). A physical register can be reused when the next writer of the same architectural register commits.
The pipeline can be exactly like the standard DLX pipeline. The advantages: it removes all WAR and WAW hazards, allows full out-of-order completion, allows data to be fetched from a single register file, and makes speculative execution and precise interrupts easier.
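The free-list/rename-map mechanism can be sketched in a few lines (register names and the initial mapping are illustrative, not from the source):

```python
# Hypothetical sketch of decode-time renaming: every instruction that
# writes gets a fresh physical register from the free list, which
# removes all WAR and WAW hazards on architectural registers.

free_list = ["p4", "p5", "p6", "p7"]               # physical regs not in use
rename_map = {"r1": "p1", "r2": "p2", "r3": "p3"}  # architectural -> physical

def rename(dst, srcs):
    # Sources read the CURRENT mapping before the destination is remapped.
    phys_srcs = [rename_map[s] for s in srcs]
    old_phys = rename_map[dst]          # freed when the next writer of dst commits
    rename_map[dst] = free_list.pop(0)  # new physical destination register
    return rename_map[dst], phys_srcs, old_phys

# WAW example: two back-to-back writes to r1 get different physical regs,
# and the second instruction's read of r1 picks up the first one's result.
d1, s1, _ = rename("r1", ["r2", "r3"])  # r1 <- r2 op r3
d2, s2, _ = rename("r1", ["r2", "r1"])  # r1 <- r2 op r1
```

Here `d1` is `p4` and `d2` is `p5`: the two writes no longer conflict, and `s2` reads `p4`, the correct producer.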
We can also use explicit renaming with a scoreboard.
The stages of the scoreboard with explicit renaming are:
- Issue: decode the instruction, check for structural hazards, and allocate a new physical register for the result. Instructions are issued in program order; an instruction is not issued if there is no free physical register or if there is a structural hazard.
- Read operands: wait until no hazards remain, then read the operands. All RAW hazards are resolved in this stage, since we wait for the producing instructions to write back their data.
- Execution: operate on the operands. The FU begins execution upon receiving the operands; when the result is ready, it notifies the scoreboard.
- Write result: finish the execution. There is no check for WAR or WAW hazards, since renaming has removed them.
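The four stages above can be modeled as a tiny per-instruction state machine (a sketch with hypothetical field names; only the two stall conditions from the text are checked):

```python
# Scoreboard stages with explicit renaming: an instruction stalls in
# issue without a free physical register, stalls in read_operands on a
# RAW hazard, and flows freely through execute and write_result
# (renaming removed the WAR/WAW checks a classic scoreboard does there).
STAGES = ["issue", "read_operands", "execute", "write_result"]

def advance(instr):
    """Move an instruction to its next stage when its condition holds."""
    i = STAGES.index(instr["stage"])
    if instr["stage"] == "issue" and not instr["free_phys_reg"]:
        return instr          # stall: no free physical register
    if instr["stage"] == "read_operands" and not instr["operands_ready"]:
        return instr          # stall: RAW hazard, wait for write back
    if i < len(STAGES) - 1:
        instr["stage"] = STAGES[i + 1]
    return instr
```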
Register renaming vs ROB
Instruction commit is simpler than with a ROB, but deallocation is more complex. The dynamic mapping of architectural to physical registers complicates design and debugging.
Summary
There are more physical registers than needed by the ISA.
- The rename table tracks the current association between architectural registers and physical registers.
- The translation table performs a compiler-like transformation on the fly.
- All registers are concentrated in a single register file.
- Can utilize a bypass network that looks more like the 5-stage pipeline.
- Introduces a register-allocation problem: branch mispredictions and precise exceptions must be handled differently, requiring something like a reorder buffer.
ILP Limits and Superscalar Architecture
- Superscalar execution aims to go beyond the scoreboard and Tomasulo (ideal CPI = 1) and reach CPI < 1 (e.g., CPI = 1/3).
- The key idea is to fetch more instructions per clock cycle and to decide on data and control dependences (dynamic scheduling and dynamic branch prediction).
- A superscalar issues multiple instructions per clock cycle, with the number of instructions per cycle scheduled either by the compiler or by the HW.
- Modern superscalars use dynamic scheduling + multiple issue + speculation.
- Two approaches:
- Assign reservation stations and update the pipeline control table in half a clock cycle (only supports 2 instructions/cycle, not the best).
- Design logic to handle any possible dependences between the instructions.
- Issue logic is the bottleneck in dynamically scheduled superscalars.
- The superscalar Tomasulo example examines all the dependences among the instructions in the bundle; if dependences exist within the bundle, they are encoded in the reservation stations. It also needs multiple completions/commits per cycle.
To simplify RS allocation, we limit the number of instructions of a given class that can be issued in a bundle.
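The intra-bundle dependence check can be sketched as follows (a simplified illustration: instructions are just (destination, sources) pairs, and only RAW dependences within the bundle are detected):

```python
# For each instruction in the bundle, find the most recent EARLIER
# instruction in the same bundle that writes one of its source
# registers; those are the dependences to encode in the RS.

def intra_bundle_raw(bundle):
    """bundle: list of (dst, [srcs]) in program order.

    Returns (consumer_index, producer_index, register) triples.
    """
    deps = []
    for j, (_, srcs) in enumerate(bundle):
        for s in srcs:
            # scan backwards for the latest earlier writer of s
            for i in range(j - 1, -1, -1):
                if bundle[i][0] == s:
                    deps.append((j, i, s))
                    break
    return deps

bundle = [("r1", ["r2", "r3"]),   # 0: r1 <- r2 op r3
          ("r4", ["r1", "r5"]),   # 1: reads r1 produced by 0
          ("r1", ["r6", "r7"]),   # 2: rewrites r1
          ("r8", ["r1", "r4"])]   # 3: reads r1 from 2 and r4 from 1
print(intra_bundle_raw(bundle))   # [(1, 0, 'r1'), (3, 2, 'r1'), (3, 1, 'r4')]
```

Note that instruction 3 depends on instruction 2's write of `r1`, not instruction 0's: the scan must pick the most recent earlier writer.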
Limits of ILP (assumptions for ideal machine to start):
- Register renaming: infinite virtual registers, so all WAW and WAR hazards are avoided.
- Branch prediction: perfect without mispredictions.
- Jump prediction: all jumps are perfectly predicted.
- Memory-address alias analysis: addresses are known.
- 1-cycle latency for all instructions, with an unlimited number of instructions issued per clock cycle.
Initial assumptions:
- The CPU can issue an unlimited number of instructions at once, looking arbitrarily far ahead in the computation.
- No restrictions on the types of instructions that can be executed in one cycle.
- All functional-unit latencies are 1 cycle.
- Perfect caches: all loads and stores execute in one cycle.
No real CPU satisfying all these assumptions can exist.
A perfect dynamically scheduled CPU should:
- Look arbitrarily far ahead to predict all branches perfectly.
- Rename all register uses (no WAW or WAR hazards).
- Determine whether there are data dependences among issuing instructions and rename if necessary.
- Determine whether memory dependences exist among issuing instructions and handle them.
- Provide enough replicated FUs to allow all ready instructions to issue.
The window size affects the number of comparisons necessary to determine RAW dependences: with n register-to-register instructions in the issue phase and infinite registers, each instruction's two source operands must be compared against the destinations of all earlier instructions in the group, for a total of n² − n comparisons.
Today's CPUs have constraints deriving from the limited number of registers, from the search for dependent instructions, and from in-order issue.
All the instructions in the window must be kept in the processor.
The number of comparisons required at each cycle = maximum completion rate × window size × number of operands per instruction.
Other limits of today's CPUs are the number of functional units, the number of buses, and the number of register-file ports.
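The two counts above can be checked with a quick calculation (assuming 2 source operands per register-to-register instruction):

```python
def raw_comparisons(n):
    # Each of the n instructions compares its 2 sources against the
    # destinations of all earlier instructions in the group:
    # 2*(0 + 1 + ... + (n-1)) = n^2 - n.
    return sum(2 * i for i in range(n))

def per_cycle_comparisons(completion_rate, window_size, operands):
    # Comparisons needed each cycle to wake up waiting instructions.
    return completion_rate * window_size * operands

print(raw_comparisons(4))                # 12 = 4^2 - 4
print(per_cycle_comparisons(6, 200, 2))  # 2400 for a 6-wide, 200-entry window
```

The quadratic growth of `raw_comparisons` is what makes very wide issue impractical, while `per_cycle_comparisons` shows why large windows are expensive even at modest issue widths.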
These limitations mean that the maximum number of instructions that can be issued, executed, or committed in the same clock cycle is much smaller than the window size.
In practice the maximum issue width is about 6 instructions: issuing more takes too long to compute, and the processor's clock frequency would have to decrease. Today the issue rate is 3-6 instructions per clock; doubling it to 6-12 would require the processor to issue 3-4 data memory accesses per cycle, resolve 2-3 branches per cycle, rename and access more than 20 registers per cycle, and fetch 12-24 instructions per cycle.
Most techniques for increasing performance also increase power consumption. We are interested in whether a technique is energy efficient: it is not, if it increases power consumption faster than it increases performance.
To overcome the ILP limits, other forms of parallelism come to the rescue: multi-core and SIMD.
Parallel architectures: SIMD and intro to MIMD
Today's processors run at a frequency of about 3 GHz. With a single core, it is difficult to increase performance and clock frequency further: there are problems with heat dissipation, design, verification, and requirements. Moreover, multiple-issue CPUs are very complex. Parallel architectures are a collection of elements that cooperate and communicate to solve large problems fast. Instead of designing faster processors, we replicate processors to add performance. They extend traditional computer architecture with a communication architecture, abstractions, and different structures to realize those abstractions efficiently.
SISD: is a serial, non-parallel computer, where:
- only one instruction stream is being acted on by the CPU during any one clock cycle (single instruction).
- only one data stream is being used as input during any one clock cycle (single data).
The execution is deterministic. It is the oldest and most common type of computer.
SIMD: is a type of parallel computer, where:
- all processing units execute the same instruction at any given clock cycle (Single instruction).
- multiple data streams are being used as input during any one clock cycle (multiple data).
Each processing unit can operate on a different data element (Multiple data). This architecture is suitable for specialized problems, like graphics/image processing. The same instruction is executed by multiple processors using different data streams. Each processor (computing element) has its own memory. Each one of them can be special purpose. There is a single instruction memory and control processor to fetch and dispatch instructions. The programming model is simple. We have a single controller that dispatches the instructions to multiple processing elements. So we have a single program counter. All computations are fully synchronized. Each unit has its own addressing registers.
The motivations of SIMD are: the cost of the control unit is shared by all execution units, and only one copy of the code in execution is necessary. A real-world example of a SIMD architecture is the Sony PlayStation 2.
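The single-controller, lockstep model can be illustrated with a toy example (plain Python, not real SIMD hardware; the function names are ours):

```python
# One "instruction" applied to whole vectors of data: each lane
# operates on its own element, all lanes in lockstep.
def simd_add(a, b):
    return [x + y for x, y in zip(a, b)]

# SISD equivalent: one add per instruction, issued N separate times.
def sisd_add(a, b):
    out = []
    for i in range(len(a)):
        out.append(a[i] + b[i])
    return out

print(simd_add([1, 2, 3, 4], [10, 20, 30, 40]))  # [11, 22, 33, 44]
```

Both compute the same result; the point is that the SIMD version is conceptually a single instruction over N data elements, while the SISD loop fetches and decodes N separate add instructions.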
The variations of SIMD machines are vector architectures, SIMD extensions, and GPUs. Vector architectures: read sets of data elements into "vector registers" and apply computations on those registers. A single instruction operates on vectors of data.
Vector processing: vector processors have high-level operations that work on linear arrays of numbers. A scalar FU performs 1 operation; a vector FU performs N operations.
A vector processor consists of a pipelined scalar unit (VLIW) + a vector unit. They can be of 2 types: memory-memory vector processors