A High-Performance Data-Dependent
Hardware Integer Divider
MAGISTERARBEIT
zur Erlangung des akademischen Grades
Diplom-Ingenieur der Angewandten Informatik
Angefertigt am
Institut für Computerwissenschaften und Systemanalyse
der Naturwissenschaftlichen Fakultät
der Paris-Lodron-Universität Salzburg
Eingereicht von
Rainer Karl Ludwig Trummer
Eingereicht bei
Univ.-Prof. Dr. Peter Zinterhof
Salzburg, Mai 2005
c
° Copyright by Rainer K. L. Trummer, 2005
To my parents Ella and Josef, who made it all possible.
Preface
Numerical precision is the very soul of science.
Sir D’arcy Wentworth Thompson, 1917
On Growth and Form,
ardware dividers are needed in many areas of applications like cryptography, signal
processing, communication systems, computer floating-point units, etc. The perfor-
mance requirements of these applications differ regarding data and architectural issues.
High-speed dividers are usually too large in area to be incorporated in small architectures
like signal processors. This circumstance and the rapidly growing demand for small and
fast hardware dividers greatly encourages the search for more sophisticated solutions.
In this thesis, the basic principles used in hardware integer dividers are shown. As a
contribution to the challenging problem of designing new divider architectures, a high-
performance data-dependent integer divider is proposed. Based on several improvements,
integer division could be speeded up on average for 600%. To investigate the behavior
of different implementations, the Verilog hardware description language was chosen. The
characteristic data-dependent performance of each implementation was simulated on a
parallel computer cluster in order to search a larger number of cases. For this purpose,
different principles for the generation of random numbers were used to emphasize potential
advantages or drawbacks of the proposed dividers.
The author addresses those who are interested in data-dependent divider architectures
and advanced solutions associated with this specific domain. The reader is assumed to
have basic knowledge in computer organization and design.
i
ii
Contents
Preface i
List of Figures v
List of Tables vii
List of Examples ix
Acknowledgements xi
1 Introduction 1
1.1 Aim of the Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Structure of the Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Preliminaries 5
2.1 Elementary Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Addition and Subtraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.1 Ripple-Carry Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2 Carry-Lookahead Adder . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 Carry-Save Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Wired and Logical Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 Wired-Shift Register . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.2 Logical-Shift Register . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Division Arithmetic 15
3.1 Restoring Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Non-Restoring Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Radix-Two Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3.1 Shifting Over Zeros . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4 High-Radix Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4.1 High-Radix Non-Restoring Division . . . . . . . . . . . . . . . . . . 24
3.4.2 Basic Principles of SRT Division . . . . . . . . . . . . . . . . . . . . 25
3.4.3 High-Radix SRT Division . . . . . . . . . . . . . . . . . . . . . . . 26
3.5 Fast Array Dividers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
iii
3.6 A Different Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.7 Self-Aligning Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4 Improved Division 39
4.1 Basic Idea and Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 Combinational Aligning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3 Direct-Aligning Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4 Hybrid-Aligning Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5 Performance Analysis 47
5.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3 Designing the Parallel Program . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3.1 Abstract Data Type . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3.2 Performance Functions . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3.3 MPI Parallel Program . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.4 Generating Random Numbers . . . . . . . . . . . . . . . . . . . . . . . . . 54
6 Obtained Results 57
6.1 Operand Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.2 Simulated Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.3 Speedup Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7 Conclusions 67
7.1 Review of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
7.2 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.3 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
A Source Code 71
A.1 Verilog Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
A.2 C++ Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Bibliography 193
Index 195
iv
List of Figures
2.1 Carry-Propagate and Carry-Save Adder . . . . . . . . . . . . . . . . . . . . 10
2.2 Wired and Logical Left-Shift Register . . . . . . . . . . . . . . . . . . . . . 12
2.3 Wiring Diagram of Logical Left-Shifter . . . . . . . . . . . . . . . . . . . . 13
3.1 Basic Radix-2 Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Radix-Two Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 SRT Divider with CSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4 Controlled Array Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5 Cellular Array Dividers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.6 Data-Dependent Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.7 Self-Aligning Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1 Combinational Aligner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Direct-Aligning Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3 Hybrid-Aligning Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.1 True and False Circuit Paths . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2 17-Node Cluster and Its Network . . . . . . . . . . . . . . . . . . . . . . . 52
6.1 Distribution of Data Sizes and Differences . . . . . . . . . . . . . . . . . . 58
6.2 Performance of RT, SA, DA, HA Divider . . . . . . . . . . . . . . . . . . . 61
6.3 Comparison to Radix-4 SRT Divider . . . . . . . . . . . . . . . . . . . . . 63
6.4 Comparison to Radix-2 Array Divider . . . . . . . . . . . . . . . . . . . . . 65
A.1 8-bit D-Type Latch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
A.2 8-bit MS Flip-Flop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
A.3 8-bit 2-to-1 Multiplexor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
A.4 8-bit 3-to-1/4-to-1 Multiplexor . . . . . . . . . . . . . . . . . . . . . . . . . 82
A.5 8-bit Carry-Lookahead Adder . . . . . . . . . . . . . . . . . . . . . . . . . 85
A.6 8-bit Priority Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
A.7 8-bit Logical Left-Shifter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
A.8 8-bit Logical Right-Shifter . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
A.9 8-bit Radix-Two Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
v
A.10 8-bit Self-Aligning Divider . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
A.11 8-bit Direct-Aligning Divider . . . . . . . . . . . . . . . . . . . . . . . . . . 137
A.12 8-bit Hybrid-Aligning Divider . . . . . . . . . . . . . . . . . . . . . . . . . 144
vi
List of Tables
3.1 Radix-4 Quotient-Digit Lookup Table . . . . . . . . . . . . . . . . . . . . . 28
3.2 Data of Different Array Dividers . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1 Synthesized Maximum-Path Delays . . . . . . . . . . . . . . . . . . . . . . 50
6.1 Performance of RT, SA, DA, HA Divider . . . . . . . . . . . . . . . . . . . 60
6.2 Performance of SRT and Array Divider . . . . . . . . . . . . . . . . . . . . 64
6.3 Average Speedup of Different Dividers . . . . . . . . . . . . . . . . . . . . 64
vii
viii
List of Examples
3.1 Paper-and-Pencil Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Digit-at-a-Time Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Restoring Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.4 Non-Restoring Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.5 Radix-Two Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.6 Radix-4 SRT Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.7 Self-Aligning Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1 Combinational Aligning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 Direct-Aligning Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3 Hybrid-Aligning Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.1 SRT4 Critical Path Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
ix
x
Acknowledgements
Everything that is really great and inspiring is created by the individual who
can labor in freedom. Albert Einstein, 1950
Out of My Later Years,
irst, I want to thank Peter Zinterhof for taking the responsibility for this thesis and
granting me absolutely unrestricted freedom and independence while working on it.
His incredible knowledge and experience frequently motivated all sorts of interesting and
very instructive conversations far beyond the scope of computer science.
My special thanks are devoted to Roman Trobec for supervising the present work and
supporting the early publication of an associated paper. His inspirations, comments, and
suggestions were of invaluable worth and contributed greatly to this thesis. I also would
like to thank him for the many fruitful discussions we had during that time.
My kindest thanks go to my parents Ella and Josef for their endless encouragement,
patience, and love. Moreover, without their implicit trust in me, their help and support
throughout the past years, this thesis would probably not exist.
Finally, I would like to thank my sisters and brothers in law for the relaxing time we
spent together and that they never stopped believing in me.
Salzburg, May 2005 Rainer K. L. Trummer
xi
xii
Chapter 1
We are at the very beginning of time for the human race. It is not unreasonable
that we grapple with problems. But there are tens of thousands of years in the
future. Our responsibility is to do what we can, learn what we can, improve
the solutions, and pass them on. Richard Feynman (1918−1988)
Division is the most complex of the four basic arithmetic operations and, in general,
does not produce an exact answer, since the dividend is not necessarily a multiple of
the divisor. Therefore, the corresponding quotient and remainder are usually obtained
through performing a sequence of iterations until the desired precision is reached. This
procedure is called sequential division and serves as the basic principle for many practical
implementations [1, 2, 3, 4, 5].
Based on sequential division, the most prevalent representatives are radix-β dividers,
where β denotes the radix, typically chosen to be a power of 2. In order to compute the
answer, many of these dividers perform a constant number of iterations, which makes
them very attractive for fully pipelined architectures. However, several concepts have
been developed to speedup the exhaustive sequential process. The most efficient method
is called SRT division, named after their inventors Sweeney, Robertson, and Tocher,
who had the same idea about the same time [2, 3, 4, 6]. SRT division is primarily
implemented with radix 4 and higher, generally using a carry-save-addition scheme. The
main principle is to interpret all partial remainders R and the divisor D as normalized
i
fractions that are assumed to satisfy the relations 1/2 ≤ |βR | < 1 and 1/2 ≤ |D| < 1.
i
Depending on the underlying redundant quotient digit set, which can be generalized as
q ∈ {−α, . . . , −1, 0, 1, . . . , α}, where d(β − 1)/2e ≤ α ≤ (β − 1), a certain number of
i
significant bits of the partial remainder and the divisor is used to select a quotient digit
from a lookup table. The selected radix-β digit can be converted to the corresponding
radix-complement bits on the fly, which avoids the need of additional storage place.
1
2 CHAPTER 1. INTRODUCTION
The main drawback of SRT division is the need of large lookup tables that grow
quadratically with increasing radix. Depending on the radix used and quotient digit set,
lookup tables can consist of many thousand to millions of entries. Moreover, generating
all of the required divisor multiples for radix 8 and higher is difficult and raises the cost of
additional hardware. Due to this, practical table implementations are restricted currently
to radix 2 and radix 4 [4, 6]. Another drawback is that during the development process
of a large lookup table a few bits can easily get lost, a consequence that has become very
popular as the ”Pentium Flaw” [7].
According to Moore’s Law , which states that ”the computer performance doubles each
one and a half year”, computer technology is close to be approaching physical limits
such as the speed of light and atomic dimensions. Thus, we cannot depend on ever
increasing processor speed as well as ever decreasing circuit dimensions. This fact and
the rapidly growing demand for small and fast integer dividers, needed in many areas
of applications like cryptography, signal processing, communication systems, computer
floating-point units, etc [8, 9, 10], greatly encourages the search for more sophisticated
solutions. One of them are data-dependent dividers that execute in variable time [11].
The basic principle is to skip all redundant operations and carry out only shifts as long as
there are leading zeros of the remainder or divisor, depending on the algorithm used. This
method is called shifting over zeros and also used in SRT division for keeping the operands
normalized [2, 3, 4, 6, 12]. Data-dependent dividers do not require any area-consuming
lookup tables and, in general, achieve a much higher throughput than conventional low-
radix dividers. Therefore, they are primarily incorporated in small architectures like signal
processors. Unfortunately, the speed of data-dependent dividers is currently not adequate
to be also competitive regarding fully pipelined architectures.
1.1 Aim of the Work
Compared to tons of papers that have been published about SRT division and related
methods [5, 8, 9], very little is available about data-dependent divider architectures. This
was the author’s main motivation to contribute some work in this field. Referring to the
insufficient speed of data-dependent dividers based on the shifting-over-zeros method, the
main contribution is a High-Performance Data-Dependent Hardware Integer Divider that
achieves an average speedup of 600%. A paper that summarizes the content of this thesis
was presented first at the Parallel Numerics ’05 conference in April 2005 and appeared in
the accompanying proceedings [13].
The aim of this work is to make the reader familiar with various principles and advanced
methods used in integer divider architectures. Because a floating-point divider basically
consists of an integer divider operating on the mantissas, as well as an adder operating on
the exponents, the extension from integer to floating-point division is not in the scope of
this thesis. Due to the knowledge imparted, the reader is assumed to get an impression
about the difficulties arising from division arithmetic, and why designing efficient hardware
dividers still remains challenging. 3
1.2. STRUCTURE OF THE WORK
1.2 Structure of the Work
Pleasure in the job puts perfection in the work. Aristotle (384−322 BC)
Chapter 1 provides a general overview of the present work. The preliminary Chapter 2
introduces denotations and basic components used in this thesis. Chapter 3 discusses
different principles and methods for performing binary division, to be continued with our
improvements and contributions presented in Chapter 4. The methods used to analyze
the performance of four implemented dividers are explained in Chapter 5. The obtained
results are illustrated in Chapter 6, followed by our conclusions discussed in Chapter 7.
The thesis closes with the complete source code listed in the Appendix.
Chapter 1: Introduction. The first chapter is intended to give the reader an overview
of the present thesis and considered to motivate the two main categories regarding divider
architectures: constant and variable execution time.
Chapter 2: Preliminaries. In the preliminary chapter we first present a small formal
framework of denotations appearing frequently in this thesis. Thereafter, we discuss some
essential components used in many dividers. Readers familiar with topics of this chapter
may skip some sections without loss of continuity.
Chapter 3: Division Arithmetic. In this chapter we describe various principles and
methods applied in hardware dividers. We begin our excursion of performing sequential
division with examining the
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.
Scarica il documento per vederlo tutto.