Estratto del documento

A High-Performance Data-Dependent

Hardware Integer Divider

MAGISTERARBEIT

zur Erlangung des akademischen Grades

Diplom-Ingenieur der Angewandten Informatik

Angefertigt am

Institut für Computerwissenschaften und Systemanalyse

der Naturwissenschaftlichen Fakultät

der Paris-Lodron-Universität Salzburg

Eingereicht von

Rainer Karl Ludwig Trummer

Eingereicht bei

Univ.-Prof. Dr. Peter Zinterhof

Salzburg, Mai 2005

c

° Copyright by Rainer K. L. Trummer, 2005

To my parents Ella and Josef, who made it all possible.

Preface

Numerical precision is the very soul of science.

Sir D’arcy Wentworth Thompson, 1917

On Growth and Form,

ardware dividers are needed in many areas of applications like cryptography, signal

processing, communication systems, computer floating-point units, etc. The perfor-

mance requirements of these applications differ regarding data and architectural issues.

High-speed dividers are usually too large in area to be incorporated in small architectures

like signal processors. This circumstance and the rapidly growing demand for small and

fast hardware dividers greatly encourages the search for more sophisticated solutions.

In this thesis, the basic principles used in hardware integer dividers are shown. As a

contribution to the challenging problem of designing new divider architectures, a high-

performance data-dependent integer divider is proposed. Based on several improvements,

integer division could be speeded up on average for 600%. To investigate the behavior

of different implementations, the Verilog hardware description language was chosen. The

characteristic data-dependent performance of each implementation was simulated on a

parallel computer cluster in order to search a larger number of cases. For this purpose,

different principles for the generation of random numbers were used to emphasize potential

advantages or drawbacks of the proposed dividers.

The author addresses those who are interested in data-dependent divider architectures

and advanced solutions associated with this specific domain. The reader is assumed to

have basic knowledge in computer organization and design.

i

ii

Contents

Preface i

List of Figures v

List of Tables vii

List of Examples ix

Acknowledgements xi

1 Introduction 1

1.1 Aim of the Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Structure of the Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Preliminaries 5

2.1 Elementary Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.2 Addition and Subtraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2.1 Ripple-Carry Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2.2 Carry-Lookahead Adder . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2.3 Carry-Save Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.3 Wired and Logical Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3.1 Wired-Shift Register . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3.2 Logical-Shift Register . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3 Division Arithmetic 15

3.1 Restoring Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.2 Non-Restoring Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.3 Radix-Two Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.3.1 Shifting Over Zeros . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.4 High-Radix Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.4.1 High-Radix Non-Restoring Division . . . . . . . . . . . . . . . . . . 24

3.4.2 Basic Principles of SRT Division . . . . . . . . . . . . . . . . . . . . 25

3.4.3 High-Radix SRT Division . . . . . . . . . . . . . . . . . . . . . . . 26

3.5 Fast Array Dividers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

iii

3.6 A Different Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.7 Self-Aligning Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4 Improved Division 39

4.1 Basic Idea and Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.2 Combinational Aligning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4.3 Direct-Aligning Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.4 Hybrid-Aligning Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5 Performance Analysis 47

5.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

5.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.3 Designing the Parallel Program . . . . . . . . . . . . . . . . . . . . . . . . 51

5.3.1 Abstract Data Type . . . . . . . . . . . . . . . . . . . . . . . . . . 52

5.3.2 Performance Functions . . . . . . . . . . . . . . . . . . . . . . . . . 53

5.3.3 MPI Parallel Program . . . . . . . . . . . . . . . . . . . . . . . . . 53

5.4 Generating Random Numbers . . . . . . . . . . . . . . . . . . . . . . . . . 54

6 Obtained Results 57

6.1 Operand Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

6.2 Simulated Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

6.3 Speedup Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

7 Conclusions 67

7.1 Review of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

7.2 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

7.3 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

A Source Code 71

A.1 Verilog Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

A.2 C++ Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

Bibliography 193

Index 195

iv

List of Figures

2.1 Carry-Propagate and Carry-Save Adder . . . . . . . . . . . . . . . . . . . . 10

2.2 Wired and Logical Left-Shift Register . . . . . . . . . . . . . . . . . . . . . 12

2.3 Wiring Diagram of Logical Left-Shifter . . . . . . . . . . . . . . . . . . . . 13

3.1 Basic Radix-2 Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.2 Radix-Two Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.3 SRT Divider with CSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.4 Controlled Array Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.5 Cellular Array Dividers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.6 Data-Dependent Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.7 Self-Aligning Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.1 Combinational Aligner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.2 Direct-Aligning Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.3 Hybrid-Aligning Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5.1 True and False Circuit Paths . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.2 17-Node Cluster and Its Network . . . . . . . . . . . . . . . . . . . . . . . 52

6.1 Distribution of Data Sizes and Differences . . . . . . . . . . . . . . . . . . 58

6.2 Performance of RT, SA, DA, HA Divider . . . . . . . . . . . . . . . . . . . 61

6.3 Comparison to Radix-4 SRT Divider . . . . . . . . . . . . . . . . . . . . . 63

6.4 Comparison to Radix-2 Array Divider . . . . . . . . . . . . . . . . . . . . . 65

A.1 8-bit D-Type Latch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

A.2 8-bit MS Flip-Flop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

A.3 8-bit 2-to-1 Multiplexor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

A.4 8-bit 3-to-1/4-to-1 Multiplexor . . . . . . . . . . . . . . . . . . . . . . . . . 82

A.5 8-bit Carry-Lookahead Adder . . . . . . . . . . . . . . . . . . . . . . . . . 85

A.6 8-bit Priority Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

A.7 8-bit Logical Left-Shifter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

A.8 8-bit Logical Right-Shifter . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

A.9 8-bit Radix-Two Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

v

A.10 8-bit Self-Aligning Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

A.11 8-bit Direct-Aligning Divider . . . . . . . . . . . . . . . . . . . . . . . . . . 137

A.12 8-bit Hybrid-Aligning Divider . . . . . . . . . . . . . . . . . . . . . . . . . 144

vi

List of Tables

3.1 Radix-4 Quotient-Digit Lookup Table . . . . . . . . . . . . . . . . . . . . . 28

3.2 Data of Different Array Dividers . . . . . . . . . . . . . . . . . . . . . . . . 33

5.1 Synthesized Maximum-Path Delays . . . . . . . . . . . . . . . . . . . . . . 50

6.1 Performance of RT, SA, DA, HA Divider . . . . . . . . . . . . . . . . . . . 60

6.2 Performance of SRT and Array Divider . . . . . . . . . . . . . . . . . . . . 64

6.3 Average Speedup of Different Dividers . . . . . . . . . . . . . . . . . . . . 64

vii

viii

List of Examples

3.1 Paper-and-Pencil Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.2 Digit-at-a-Time Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.3 Restoring Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.4 Non-Restoring Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.5 Radix-Two Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.6 Radix-4 SRT Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.7 Self-Aligning Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.1 Combinational Aligning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.2 Direct-Aligning Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.3 Hybrid-Aligning Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

6.1 SRT4 Critical Path Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

ix

x

Acknowledgements

Everything that is really great and inspiring is created by the individual who

can labor in freedom. Albert Einstein, 1950

Out of My Later Years,

irst, I want to thank Peter Zinterhof for taking the responsibility for this thesis and

granting me absolutely unrestricted freedom and independence while working on it.

His incredible knowledge and experience frequently motivated all sorts of interesting and

very instructive conversations far beyond the scope of computer science.

My special thanks are devoted to Roman Trobec for supervising the present work and

supporting the early publication of an associated paper. His inspirations, comments, and

suggestions were of invaluable worth and contributed greatly to this thesis. I also would

like to thank him for the many fruitful discussions we had during that time.

My kindest thanks go to my parents Ella and Josef for their endless encouragement,

patience, and love. Moreover, without their implicit trust in me, their help and support

throughout the past years, this thesis would probably not exist.

Finally, I would like to thank my sisters and brothers in law for the relaxing time we

spent together and that they never stopped believing in me.

Salzburg, May 2005 Rainer K. L. Trummer

xi

xii

Chapter 1

We are at the very beginning of time for the human race. It is not unreasonable

that we grapple with problems. But there are tens of thousands of years in the

future. Our responsibility is to do what we can, learn what we can, improve

the solutions, and pass them on. Richard Feynman (1918−1988)

Division is the most complex of the four basic arithmetic operations and, in general,

does not produce an exact answer, since the dividend is not necessarily a multiple of

the divisor. Therefore, the corresponding quotient and remainder are usually obtained

through performing a sequence of iterations until the desired precision is reached. This

procedure is called sequential division and serves as the basic principle for many practical

implementations [1, 2, 3, 4, 5].

Based on sequential division, the most prevalent representatives are radix-β dividers,

where β denotes the radix, typically chosen to be a power of 2. In order to compute the

answer, many of these dividers perform a constant number of iterations, which makes

them very attractive for fully pipelined architectures. However, several concepts have

been developed to speedup the exhaustive sequential process. The most efficient method

is called SRT division, named after their inventors Sweeney, Robertson, and Tocher,

who had the same idea about the same time [2, 3, 4, 6]. SRT division is primarily

implemented with radix 4 and higher, generally using a carry-save-addition scheme. The

main principle is to interpret all partial remainders R and the divisor D as normalized

i

fractions that are assumed to satisfy the relations 1/2 ≤ |βR | < 1 and 1/2 ≤ |D| < 1.

i

Depending on the underlying redundant quotient digit set, which can be generalized as

q ∈ {−α, . . . , −1, 0, 1, . . . , α}, where d(β − 1)/2e ≤ α ≤ (β − 1), a certain number of

i

significant bits of the partial remainder and the divisor is used to select a quotient digit

from a lookup table. The selected radix-β digit can be converted to the corresponding

radix-complement bits on the fly, which avoids the need of additional storage place.

1

2 CHAPTER 1. INTRODUCTION

The main drawback of SRT division is the need of large lookup tables that grow

quadratically with increasing radix. Depending on the radix used and quotient digit set,

lookup tables can consist of many thousand to millions of entries. Moreover, generating

all of the required divisor multiples for radix 8 and higher is difficult and raises the cost of

additional hardware. Due to this, practical table implementations are restricted currently

to radix 2 and radix 4 [4, 6]. Another drawback is that during the development process

of a large lookup table a few bits can easily get lost, a consequence that has become very

popular as the ”Pentium Flaw” [7].

According to Moore’s Law , which states that ”the computer performance doubles each

one and a half year”, computer technology is close to be approaching physical limits

such as the speed of light and atomic dimensions. Thus, we cannot depend on ever

increasing processor speed as well as ever decreasing circuit dimensions. This fact and

the rapidly growing demand for small and fast integer dividers, needed in many areas

of applications like cryptography, signal processing, communication systems, computer

floating-point units, etc [8, 9, 10], greatly encourages the search for more sophisticated

solutions. One of them are data-dependent dividers that execute in variable time [11].

The basic principle is to skip all redundant operations and carry out only shifts as long as

there are leading zeros of the remainder or divisor, depending on the algorithm used. This

method is called shifting over zeros and also used in SRT division for keeping the operands

normalized [2, 3, 4, 6, 12]. Data-dependent dividers do not require any area-consuming

lookup tables and, in general, achieve a much higher throughput than conventional low-

radix dividers. Therefore, they are primarily incorporated in small architectures like signal

processors. Unfortunately, the speed of data-dependent dividers is currently not adequate

to be also competitive regarding fully pipelined architectures.

1.1 Aim of the Work

Compared to tons of papers that have been published about SRT division and related

methods [5, 8, 9], very little is available about data-dependent divider architectures. This

was the author’s main motivation to contribute some work in this field. Referring to the

insufficient speed of data-dependent dividers based on the shifting-over-zeros method, the

main contribution is a High-Performance Data-Dependent Hardware Integer Divider that

achieves an average speedup of 600%. A paper that summarizes the content of this thesis

was presented first at the Parallel Numerics ’05 conference in April 2005 and appeared in

the accompanying proceedings [13].

The aim of this work is to make the reader familiar with various principles and advanced

methods used in integer divider architectures. Because a floating-point divider basically

consists of an integer divider operating on the mantissas, as well as an adder operating on

the exponents, the extension from integer to floating-point division is not in the scope of

this thesis. Due to the knowledge imparted, the reader is assumed to get an impression

about the difficulties arising from division arithmetic, and why designing efficient hardware

dividers still remains challenging. 3

1.2. STRUCTURE OF THE WORK

1.2 Structure of the Work

Pleasure in the job puts perfection in the work. Aristotle (384−322 BC)

Chapter 1 provides a general overview of the present work. The preliminary Chapter 2

introduces denotations and basic components used in this thesis. Chapter 3 discusses

different principles and methods for performing binary division, to be continued with our

improvements and contributions presented in Chapter 4. The methods used to analyze

the performance of four implemented dividers are explained in Chapter 5. The obtained

results are illustrated in Chapter 6, followed by our conclusions discussed in Chapter 7.

The thesis closes with the complete source code listed in the Appendix.

Chapter 1: Introduction. The first chapter is intended to give the reader an overview

of the present thesis and considered to motivate the two main categories regarding divider

architectures: constant and variable execution time.

Chapter 2: Preliminaries. In the preliminary chapter we first present a small formal

framework of denotations appearing frequently in this thesis. Thereafter, we discuss some

essential components used in many dividers. Readers familiar with topics of this chapter

may skip some sections without loss of continuity.

Chapter 3: Division Arithmetic. In this chapter we describe various principles and

methods applied in hardware dividers. We begin our excursion of performing sequential

division with examining the

Anteprima
Vedrai una selezione di 10 pagine su 213
I divisori - tesi Pag. 1 I divisori - tesi Pag. 2
Anteprima di 10 pagg. su 213.
Scarica il documento per vederlo tutto.
I divisori - tesi Pag. 6
Anteprima di 10 pagg. su 213.
Scarica il documento per vederlo tutto.
I divisori - tesi Pag. 11
Anteprima di 10 pagg. su 213.
Scarica il documento per vederlo tutto.
I divisori - tesi Pag. 16
Anteprima di 10 pagg. su 213.
Scarica il documento per vederlo tutto.
I divisori - tesi Pag. 21
Anteprima di 10 pagg. su 213.
Scarica il documento per vederlo tutto.
I divisori - tesi Pag. 26
Anteprima di 10 pagg. su 213.
Scarica il documento per vederlo tutto.
I divisori - tesi Pag. 31
Anteprima di 10 pagg. su 213.
Scarica il documento per vederlo tutto.
I divisori - tesi Pag. 36
Anteprima di 10 pagg. su 213.
Scarica il documento per vederlo tutto.
I divisori - tesi Pag. 41
1 su 213
D/illustrazione/soddisfatti o rimborsati
Acquista con carta o PayPal
Scarica i documenti tutte le volte che vuoi
Dettagli
SSD
Ingegneria industriale e dell'informazione ING-INF/05 Sistemi di elaborazione delle informazioni

I contenuti di questa pagina costituiscono rielaborazioni personali del Publisher valeria0186 di informazioni apprese con la frequenza delle lezioni di Architetture Sistemi Elaborazione e studio autonomo di eventuali libri di riferimento in preparazione dell'esame finale o della tesi. Non devono intendersi come materiale ufficiale dell'università Università degli studi di Napoli Federico II o del prof Mazzocca Nicola.
Appunti correlati Invia appunti e guadagna

Domande e risposte

Hai bisogno di aiuto?
Chiedi alla community