# Affinity-Driven System Design Exploration for Heterogeneous Multiprocessor SoC

Anteprima

### ESTRATTO DOCUMENTO

## BRANDOLESE ET AL.: AFFINITY-DRIVEN SYSTEM DESIGN EXPLORATION FOR HETEROGENEOUS MULTIPROCESSOR SOC 3

Fig. 1. The proposed high-level flow.

desired system. A procedure-level internal model has been ogy is detailed in Section 6 while the timing cosimulation

defined to represent the system specification. Such a model, methodology considers the proposed heterogeneous multi-

Procedural Interaction Graph

called the [11], [32], is based on processor architecture and a high-level model for the

Procedural Call Graph

the [3], [31]. The procedure-level communication media in order to evaluate the performance

internal model is able to capture information related to the of the SoC by verifying its timing correctness.

computational elements present in an imperative, possibly

object-oriented, synthesizable behavioral specification along 4 T R M

with their relationships. HE EPRESENTATION ODEL

The first step of the flow aims at extracting as much The entry point of the proposed design flow (Fig. 1) is a

information as possible on the system by analyzing the behavioral synthesizable description of the desired system

specification in a static and fast manner. This step is (e.g., the Synopsys SystemC Behavioral Synthesizable

Coanalysis

composed of the (Section 5, [11], [12], and Subset [30], [33]). A procedure-level internal model, called

Coestimation [11] activities. The former one provides a set

Affinity the Procedural Interaction Graph [11], [32], has been

of data expressing the of the system functionalities

toward a set of processing element classes (GPP, DSP, defined to represent such a kind of system specification,

ASIC/FPGA), while the latter provides a set of estimations based on the Procedural Call Graph [3], [31].

of the time required, by each class of processing elements, The procedure-level internal model is able to capture

for the execution of each single operation that composes the information related to the computational elements present

specification. After the static analysis, the system function- in an imperative, possibly object-oriented, synthesizable

Cosimulation

alities are simulated (Functional [11]) to verify behavioral specification and the relationship between these

their correctness with respect to typical input data sets, and elements.

important data characterizing the dynamic behavior of the An intuitive graphical representation of such a graph

Profiling Communication Cost.

system are extracted: and (see Fig. 2) is obtained by associating a node with each

Combining the data provided by the previous steps (timing instance of a method (gray nodes are nonblocking, i.e., they

and profiling data) with a global time constraint allows the

Load

estimation of the [11] associated with the execution of can be executed concurrently with their caller) and an

1-GPP SoC

each system functionality on a (i.e., the worst arrow for each data transfer (black arrows indicate

case). The analysis of such data is useful to evaluate the procedure calls, gray arrows indicate communication/

necessary amount of processor cores, the level of load synchronization operations).

balancing, and the identification of those procedures that The example of Fig. 2 shows a procedure-level internal

probably need an executor more performing than a general- model where the instance of the method calls the

Main

purpose processor. instances of nonblocking methods and The other

F1, F6, F9.

System Design Exploration

Finally, the flow reaches the instances of the method perform only blocking calls, while

task that is constituted by two iterative steps: Partitioning instances and make data transfers.

Architecture Selection Timing Cosimulation

and [11], [34] and F5-F8 F6-F9

Let us formally define the procedural level model.

[11]. The partitioning and architecture selection methodol-

4 IEEE TRANSACTIONS ON COMPUTERS, VOL. 55, NO. 5, MAY 2006

is the set of methods associated with all the

Definition 7. MT S

¼

classes, i.e., . For NOO languages,

## MT MT MT

i

i

is the set of all the procedures.

## MT

0 ik 2 2

where is the instance of a

Definition 8. V I, i; k N, k

o

class .

c

i is the set of all the instances of all the classes,

Definition 9. OB

S ik

¼ .

i.e., o

OB i;k i 2

, where is the instance of the

Definition 10. i; j; k N,

f k;j

method of the instance of class . For NOO languages,

m k c

i;j i i

that allow multiple instances of a method (i.e., procedure), f k;j

is the instance of the method .

k m

0;j

is the set of all the method instances in the

Definition 11. MI S i

¼

specification, i.e., .

MI f

i;j;k k;j

Fig. 2. Internal model graphical representation. ¼ hs; is the data transfer where

Definition 12. w; d; g; fi i,

t

i

2 6¼

with are the source and the destination

s; d MI, s d,

4.1 Procedural Interaction Graph 2 ¼

method instances of the data transfer, respectively, w T Q

The Procedural Interaction Graph is a formalism composed fMC; is a data transfer qualifier where indicates a

CSg MC

of nodes (i.e., instances of methods or procedures) and method call (i.e., a procedure call) while indicates a

## CS

annotated edges (i.e., method or procedure calls, commu- 2

communication (or synchronization) operation, is the

g N

nication, or synchronization operations) that provide 2

exchanged data size in bits, and is the average number

f Z

information on the relationships between the nodes and of times that the data exchange occurs.

on the data exchanged (e.g., size, profiling, etc.). Such a is the set of all data transfers of the

Definition 13. DT

graph is suitable for representing a coarse grain view of the S

¼

specification, i.e., .

DT t

system that takes into account communications, synchroni- i

i

zations and concurrency issues. The following definitions ¼ hMI; i is the graph representing a given

Definition 14. G DT

provide a formalization of the model that is applicable both specification, where the graph nodes are the instances of

to imperative object-oriented (OO) specification languages methods and the graph edges are data transfers.

as well as to not object-oriented (NOO) ones. The first two

definitions relate to the specification language, while the 5 M C

## ETRICS FOR OANALYSIS

next ones formalize the components of the Procedural

Interaction Graph. The coanalysis activity aims at obtaining as much informa-

tion as possible about the system by statically analyzing the

¼ fV g is the adopted specification

Definition 1. LN I; V O; V T specification. The goal of this step is to statically detect the

language, where is the set of valid identifiers of is

V I LN, V O most suitable class of processing elements for the execution

the set of valid operators of and is the set of valid

LN, V T of each system functionality. To this purpose, several

types of LN. subtasks are required [11], [12]: an architectural analysis

is the set of valid specifications that can be

Definition 2. V S of the available processing elements to determine their

expressed with LN. relevant features, the definition of a set of patterns able to

¼ hq; is the method of class (see identify subsets of the specification that could exploit the

Definition 3. m k; ci j c

i;j i identified architectural features, and the definition of a set

2 2

Definition 5), where is the method name,

i; j N, q V I of metrics able to provide meaningful indications to drive

2 ¼ fB; is a method qualifier where indicates a

k MQ NBg B design choices.

blocking method while indicates a nonblocking method,

## NB

2

and is the method body. Note that the term “method”

c V S 5.1 Characterization of Executors

indicates both a method in the OO sense and a procedure in the This section describes the analysis performed to detect the

NOO one. For NOO languages, all the methods belong to a most relevant exploitable architectural features of the

fictitious class 0. identified executors classes.

¼ hq; is the member data of the class (see

Definition 4. d zi j c

i;j i 5.1.1 GPP Architectural Features

2 2

Definition 5), where is the data name, and

i; j N, q V I

2 is the data type.

z V T General Purpose Processors (GPP) have been designed to be

useful in several contexts and, so, it is difficult to detect

¼ hq; i 2

is the class where

Definition 5. c MT ; DT i, i N,

i i i particular features that strongly identify a GPP-suitable

2 is the class name, is the set of the class methods,

q V I MT

i application. They are typically adopted as control elements,

and is the set of the class member data. For

DT i I/O manager, for general computations and in complex

NOO languages, the only class with methods is a fictitious systems that use an operating system.

one (class 0). Therefore, is the set of all the methods, and

MT 0

is the set of global variables. Other classes can exist if

DT c 5.1.2 DSP Architectural Features

0 i

¼ ;

and only if (i.e., they are typically called structures).

## MT

i Digital Signal Processors (DSP) are dedicated to digital

declared in the

is the set of classes

Definition 6. CL c signal processing applications and, so, they present a loss of

i

## S

¼

specification, i.e., . generality with respect to GPP and a higher cost, but they

CL c

i

i

## BRANDOLESE ET AL.: AFFINITY-DRIVEN SYSTEM DESIGN EXPLORATION FOR HETEROGENEOUS MULTIPROCESSOR SOC 5

provide a better performance in the execution of a respect to the elements of its structure that affect its

particular set of instructions [24]. The architectural features implementation. The identified elements are meaningful;

included in a DSP allow concurrent loading of multiple however, only a few of them are supported by an effective

operands, concurrent execution of sums and multiplica- quantification approach and the metrics defined are strictly

tions, fast management of loops, and fast access to bounded to high-level synthesis issues (e.g., area estimation

sequential memory space (e.g., array). of a custom ASIC implementation). A codesign oriented

work is instead the one presented in [28], where the concept

5.1.3 ASIC-Like Devices Architectural Features hardware/software repelling

of is used to drive a hardware/

Application Specific Integrated Circuits (ASIC) are devel- software partitioning algorithm. The approach is based on

oped for specific applications. They generally exhibit high the analysis of the system functionalities; it detects a set of

performance, but their design and development costs are so repelling

features that suggest the of certain functionalities

relevant that they are only affordable for high volumes. toward a certain type of implementation. Unfortunately, the

Field Programmable Devices (FPD) are arrays of logic work considers only one type of software executor and the

blocks with programmable interconnections to shape the set of features considered is not clearly defined. Finally, the

desired functionality. Such devices are a trade-off between work in [4] considers multiprocessor systems synthesis,

processors and ASICs with respect to performance, flex- starting from an object-oriented specification, and it

ibility, and cost (e.g., Field Programmable Gate Arrays [25]). analyzes subsets of such a specification in order to detect

We worked on the identification of a set of features that control dominated, data

features that allow marking them as

allow an early selection of the functionalities able to exploit transformation dominated, memory access dominated.

or How-

ASIC-like devices. Since a mismatch between application ever, it does not consider dedicated hardware devices,

data-path requirements and the characteristics of the works with a too coarse granularity level, and poorly

processor data-path could lead to inefficient use of defines the metrics to be used within the methodology.

processor resources (nonstandard data-path), the ASIC-like Let us describe the metrics defined to compose the

devices are more suitable to perform bit manipulation affinity.

operations (shifting, Boolean operators, etc.). Moreover,

repeated operations of similar types on large regular data 5.2.1 Data Oriented Metrics

sets are also ideal candidates for ASIC-like implementa- The goal of this metric is to take into account the type of

tions. Regularity in operations imposes less demand on the data involved in the execution of a given functionality.

control unit complexity, better exploiting the available

resources. For each method and

Definition 16 (Data Ratio (DR )). m

m;t

It is worth noting that the presented methodology could for each allowed type (e.g., etc.), is

t DR

int, float, m;t

be easily extended in order to consider new technologies defined as the fraction of declarations of type with respect to

t

provided by research and the market (e.g., Tensilica the total number of declarations made in m.

Configurable Processors [35]). This can be obtained by

defining new specific metrics to highlight relevant archi- 5.2.2 Structural Metrics

tectural features and used to identify design choice. The goal of these metrics is to identify the structural

properties of a functionality focusing on the analysis of the

5.2 The Proposed Approach control flow complexity.

Considering the architectural features previously described,

it is possible to define a set of patterns able to identify For each

Definition 17 (Control Flow Complexity (CF )).

## C

m

subsets of the specification that match some executor method is defined as the ratio between the number

m, CF C

m

features and a set of metrics that quantify such a matching. of source lines that contains loop or branch statements and the

Finally, these metrics are combined to build a global metric total number of lines.

Affinity)

(the that suggests the most suitable processing For each method

Definition 18 (Loop Ratio (LR )). m, LR

elements for the execution of a given functionality. m m

is defined as the ratio between the number of source lines that

¼

The affinity

)).

Definition 15 (The Affinity (A A

m m contain loop statements and the total number of lines.

½A of a method (i.e., a procedure) is a

A A m

GP P m DSP m HW m ½0;

triplet of values in the interval that provides a

1 Such a metric allows discriminating between computa-

quantification of the matching between the structural and tional and control oriented functionalities.

functional features of the functionality implemented by the

method and the architectural features for each one of the 5.2.3 DSP Oriented Metrics

considered executor classes (i.e., GPP, DSP, ASIC/FPGA). The goal is to identify functionalities suitable to be executed

by a DSP by considering those characteristics that exploit

An affinity of 1 toward an executor class indicates a perfect the most relevant architectural features of such executor

matching, while a 0 affinity indicates no matching at all. Circular Buffering, MAC operations, Super Harvard

class: and

With respect to previous attempts to perform similar Architecture [24]. For the circular buffering, the goal is to

analysis, the proposed one is more general and accurate identify subsets of the specification accessing a linear data

since it provides a specific quantification of the matching structure (1D-array, row or column of 2D-array) to shift it

through the affinity measure. For example, in [26], the one or more positions.

efficiency of GPPs and FPGAs is evaluated only with For each

Definition 19 (Strong Circularity Degree (SCD )).

respect to the exploitation of the available area evaluating m

method is the ratio between the number of source

m, SCD

spatial efficiency

the of a device. In [27], the authors create a m ¼

lines that contain expressions of the form and

methodology that fully characterizes any algorithm with v½i v½i K

6 IEEE TRANSACTIONS ON COMPUTERS, VOL. 55, NO. 5, MAY 2006

the total number of lines, where is a vector (or a row/column For each

Definition 28 (Bit Manipulation Rate (BMR )).

v m

is the ratio between the number of source

method

of a matrix), and is a constant. m, BMR

K m

lines that contain bit-manipulation operations (e.g., and, or,

For each

)).

Definition 20 (Weak Circularity Degree (W CD

m xor, etc.) and the total number of lines.

method is the ratio between the number of source

m, W CD m

lines that contain expressions of the form v[K] = f(v[i]) The information gathered by means of the metrics

or and the total number of lines, where is a

q = f(v[i]) v previously defined is organized in a global metric that

vector (or a row/column of a matrix), is a constant, and

K allows a straightforward characterization of a functionality

is an expression involving

f(v[i]) v[i]. with respect to each considered executor class. Such a

global metric, called Affinity, is operatively defined as

For the MACs, the goal is to identify subsets of the follows.

specification that express a particular mix of operations (i.e.,

a sum and a multiplication) that a DSP can perform 5.2.6 The Affinity Metric

concurrently. Affinity

The of each method can be expressed by a

m

For each

Definition 21 (Strong MAC Degree (SMD )). normalization function applied to a linear combination of

m

method is the ratio between the number of source the metrics, with weights that depend on the considered

m, SMD

m

lines a loop that contain expressions of the form executor class, as follows:

inside

¼ þ and the total number of lines.

s1 s1 sx sy Tm T

¼ Þ;

fðW C

A m

For each

)).

Definition 22 (Weak MAC Degree (W MD

m where

method is the ratio between the number of source

m, W MD

m

lines that contain, a loop, expressions of the form

outside ¼ ½ ;

## A A A

## A

m GP P m DSP m HW m

¼ þ and the total number of lines.

s1 s1 sx sy 2 3

0 0 0 0 0 0 1 1 0 0 0 1 1 1

6 7

For concurrent memory accesses, the goal is to identify ¼ 4 5

1 1 1 1 1 1 0 0 1 0 0 0 1 0

W ;

subsets of the specification able to exploit concurrent 0 0 0 0 0 0 0 0 1 1 1 0 0 0

memory accesses to instructions and data capability [24]. is a 14-element row vector collecting, in this order,

and C

m

For each

Definition 23 (Strong Harvard Degree (SHD )).

m , , , , , ,

## the costs W CD SHD W HD SMD W MD

## SCD

m m m m m m

## P

method is the ratio between the number of source

m, SHD

m m m

, , , , , ,

## CR LR BMR DR DR

## IOR

m m m m

P z

z2fchar;stringg

bit

lines that contain, a loop, expressions with the

inside m , and . The weights of the matrix

## DR ST R W

m

z

z2fint;readg

hopi hopi

structure or and the total

v[i] w[i] q w[i] are set to 1 when the associated metric is meaningful for a

hopi

number of lines, where and are vectors, and is an

v w given executor class, to 0 otherwise. In this way, the affinity

operator different from the assignment one. represents the sum of all the contributions determined by

For each

Definition 24 (Weak Harvard Degree (W )).

HD each relevant metric normalized by means of the arctangent

m

method is the ratio between the number of source

m, SHD function scaled of a factor. Finally, to better discrimi-

=2

m

lines that contain, a loop, expressions with the

outside nate between low and high affinity values, a quadratic form

hopi hopi

structure or and the total is introduced, leading to the following normalization

v[i] w[i] q w[i] hopi

number of lines, where and are vectors, and is an function:

v w

operator different from the assignment one. 2 Þ

arctanð2x

¼ :

fðxÞ

5.2.4 GPP Oriented Metrics =2

The goal is to identify functionality that significantly relies on

operations that involve conditional dependent control flows, 5.3 Validation

complex data structures and complex I/O management. A meaningful validation process has been set up based on a

C test suite. A tool has been developed and integrated with

For each method

Definition 25 (Conditional Ratio (CR )). m,

m a C/C++ code analyzer (GENOA [29]). The adopted

is equal to , where is the Control

## CR CF C LR CF C

m m m m benchmark suite is composed of 311 procedures, each

and is the

Flow Complexity Loop Ratio.

## LR

m representing a specific functionality. A subset of these

For each instance of method,

)).

Definition 26 (I/O Ratio (IOR

m procedures (about 100) has been selected from applications

is the ratio between the number of source lines that

IOR m oriented to digital signal processing. The remaining

contain I/O operations (e.g., read, write, etc.) and the total procedures are representative of a generic set and contain

number of lines. functions such as encoding/decoding, string manipulation,

For each method

Definition 27 (Structure Ratio (ST )).

R m, and several common operations. During the validation

m

the is the ratio between the number of structures

ST R process, the values of the metrics previously defined have

m

declared and the total number of declarations. been collected to evaluate the affinity.

Interesting considerations can be made by analyzing the

5.2.5 ASIC-Like Oriented Metrics averages of the affinity values on the whole test suite, for

The goal is to identify regular functionalities that signifi- the DSP applications, and for the others (see Table 1). The

cantly rely on bit-manipulation operations. Therefore, the for the DSP applications is fairly large with

affinity A

## DSP

following metric is defined: respect to the other affinity values and with respect to the

## BRANDOLESE ET AL.: AFFINITY-DRIVEN SYSTEM DESIGN EXPLORATION FOR HETEROGENEOUS MULTIPROCESSOR SOC 7

Affinity Index )

have been allocated. To this purpose, the (I A

TABLE 1 of a solution has been defined as follows:

Affinity Average Values per Application Class P P ð1 Þ

x A

e;m e;m

g

m2MI e2fGP P ;DSP ;HW

¼

I ;

A jMIj

where , i.e., , , , are 1 or 0,

x x x x

e;m GP P m DSP m HW m

respectively, if the method instance (of the behavioral,

m

possibly OO specification) is allocated or not to the

values obtained for the other application cases. It is

A , , are the

associated type of executor; A A

## A

DSP GP P m DSP m HW m

has the largest average of the whole

worth noting that A jMIj

affinities of an instance of method; is the number of

## GP P

set, revealing the general purpose nature of the related instances of a method. The Affinity Index falls in the range

indicates, in general, those

executors class, while the A ½0; and values closer to 0 indicate a perfect match between

1

## HW

procedures that exploit some features associated with the functionalities and executors.

ASIC executor class. For each solution, it is very important to

Load Indexes.

consider the balancing of the workload over the available

processing elements. For this reason, two indexes are

6 S -L P

YSTEM EVEL ARTITIONING considered to distinguish software from hardware. The

When the functionalities of the system should be targeted ) is evaluated as

Load Index for software executors (I Lsw

for a specific implementation, the related task is called follows:

system-level partitioning. However, the design space that n

## X X

m

must be explored when considering all the parameters j

1 1

¼ jl j;

## L

I LSW i;j SW

associated with a multiprocessor architecture may easily m n L

j SW

j¼1 i¼1

become unmanageable, thus making an exhaustive analysis where is the number of software executors present in the

m

of all the possible solutions impossible: Efficient heuristics is the number of method

solution (GPP or DSP), n

are needed to manage the task complexity. j is the

instances allocated on the software executor

Existing tools dealing with multiprocessor embedded j, l i;j

estimated load [11] of the functionality on the software

systems codesign start from a heterogeneous specification ith

is the ideal load. It is theoretically

executor and

where the partitioning strongly relies on the experience of j, L SW

100 percent, but it is generally set to about 70 percent (a

the designer [13], [14] or provide a partitioning methodol- typical value that allows considering the presence of a

ogy targeted only to particular applications [37], [40]. Some possible operating system [20]). This index varies in the

approaches focus mainly on communication issues [15], ½0;

interval and captures the average of the differences

1

[16], while a general approach that considers both commu- between measured and ideal load. Values closer to 0

nications and different executors is presented in [17], but indicate that each processing element presents a load

area-delay curves

the considered in such a work are similar to the ideal situation.

impractical. The approaches closer to our proposal are ) avoids

The Load Index for hardware executors (I

[18] and [19]. They perform a clustering of the behavioral LHW

the overloading of hardware devices, i.e., the allocation of a

specification, a task classification based on area and time, number of functionalities whose implementation exceeds

and a design space exploration by means of transforma- the available gates:

tions. However, they are targeted to single processor

systems only. X

m

1 s

j

## P

¼ ;

I Lsw n

6.1 The Proposed Partitioning Approach j

m g

i;j

i¼1

j¼1

The system partitioning methodology proposed here [11], where

clustering

[34] is composed of two phases: a of the system P

refinement

functionalities to achieve a preallocation and a n

j

0 g G

i;j j

## P P

i¼1

¼

s

phase, driven by a genetic algorithm and a global cost n n

j j j

g G g > G

i;j j i;j j

i¼1 i¼1

function. The resulting design space exploration strategy and where is the number of hardware devices (ASIC or

m

works with a variable-granularity to increase efficiency in is the number of method instances allocated on

FPGA), n

allocating the functionality and the cost function includes j is the number of gates needed

the hardware executor j, g

four main components: affinity, communication, load, and i;j

to implement the functionality on the executor and

ith j, G

economical cost. j

is the maximum number of gates on the hardware

6.1.1 Metrics executor The load index values for hardware range in

j.

½0;

The comprehensiveness of the parameters composing the the interval and a value of 0 indicates that no hardware

1

cost function is crucial to drive the design space explora- executors are overloaded. The information on communica-

Communication Index.

tion. In the following, the proposed metrics are detailed.

Given a solution, it is very important to

Affinity Index. tion cost, gathered during the functional cosimulation,

allows the evaluation of the exchanged data size due to the

evaluate the matching between the properties of the interactions between functionalities allocated on different

functionalities and the processing elements on which they

### DESCRIZIONE DISPENSA

Abstract—Continuous advances in silicon technology enable the development of complex System-on-Chip as cooperation among Digital Signal Processors (DPSs), General Purpose Processors (GPPs), and specific hardware components. The impact of this choice is not only limited to the target architecture, but also encompasses the overall system specification. It is thus crucial to manage such a complexity using high-level specification languages and a tool chain supporting the designer throughout a set of strategic decisions, such as the identification of a set of possible target architectures, the verification of the correctness of the specification, and the partitioning of the specification onto a set of computational resources. This paper addresses this type of problem by proposing a design flow supporting the system-level design of heterogeneous multiprocessor system-on-chip (MP-SoC), by extracting information from the system description (e.g., SystemC)—statically and in a fast manner—and by providing a set of quantitative measures correlating the type of executor, the functionality, and a timing estimation. Partitioning and architecture selection are built on top of this data and the final analysis of the selected Hardware-Software solution over the identified candidates is finally submitted to a timing verification via simulation. Note that the possibility of actually performing a comprehensive design space exploration, in general, is tightly influenced by the interaction between partitioning/architecture-selection and timing simulation in the design flow; for this reason, the description of this aspect is particularly emphasized in the presentation of the methodology. To show the applicability of the proposed methodology, two relevant case studies are described in the paper.

I contenuti di questa pagina costituiscono rielaborazioni personali del Publisher Atreyu di informazioni apprese con la frequenza delle lezioni di Sistemi embedded e studio autonomo di eventuali libri di riferimento in preparazione dell'esame finale o della tesi. Non devono intendersi come materiale ufficiale dell'università L'Aquila - Univaq o del prof Pomante Luigi.

Acquista con carta o conto PayPal

Scarica il file tutte le volte che vuoi

Paga con un conto PayPal per usufruire della garanzia Soddisfatto o rimborsato