FAQ on Xputers - Xputer Lab Kaiserslautern - Reconfigurable Computing with KressArray
FQA about Xputers:
Frequently questioned Answers about Xputers
In the late eighties we had problems
in publishing about Xputers. Although many submissions have been rejected,
in 1989 we succeeded in publishing our first speed-up data - our answers
to questions about the usefulness of the Xputer paradigm. These answers
sometimes have been questioned. Some people did not believe in our performance
figures.
Usually a conference time slot is too short for such a highly innovative
topic, since most of the audience did not have the
background to understand the key issues. Our experience has been, that
about 90 minutes have been needed to convince people, that our performance
figures are realistic. These 90 minutes we have got by colloquium talks
followed by a long discussion. Even better is a tutorial about Xputers.
These FAQ und FQA pages intend to be a kind of interactive fast mini tutorial
for the Web surfer.
Contents
Speed-up Factors by Xputers
Acceleration by multiple sequencers
Why Xputers are more powerful than Software
to Hardware Migration
Optimizing Compilation Techniques for Xputers
Xputers and the Memory Communication Gap
Call
for Partners - New Horizons in R&D
Forthcoming
Books on Xputers
Speed-up Factors by Xputers
Inalmost ten years we have implemented three different generations of
Xputer architectures: the MoM-1, MoM-2, and the MoM3 (also see the section
section about sequencers).
Our first performance figures, from the MoM-1, have been published in 1989,
showing the speed-up against Motorola 68020 by using a MoM-1 Xputer (see
following table). MoM-1 is our first generation
Xputer architecture. MoM stands for "Map-oriented Machine".
<
>
Table 1. Speed-up by MoM-1:
|
Application ( MoM-1 versus Motorola 68020)
|
speed-up factor
|
published
|
|
CMOS design rule check (grid-based)
|
optimized 68020 version
|
>2,000
|
[1] [5]
[6] [6b]
|
|
equivalent algorithms: non- optimized 68020 version
|
>15,000
|
|
grid-based electrical rule check
|
>300
|
[1] [5]
[6] [6b]
|
|
Lee routing (grid-based)
|
>160
|
[1] [5]
[6] [6b]
|
The grid-based design rule check is based on pattern matching. It is
carried out by a single video scan over the entire layout pixel map, where
a 4-by-4 scan window is moved in smallest steps. For a single metal single
poly CMOS process (Fraunhofer IMS, Duisburg, Germany) 800 reference patterns
have been linked to this 4-by-4 scan window. For Mead-&-Conway nMOS
Design rules only 256 reference patterns have been needed.
<
> Table 2. Speed-up by Multiple Scan
Windows:
|
Application ( MoM-1 vs Motorola 68020)
|
number of scan windows
|
speed-up factor
|
published
|
|
10-by-10 vector matrix multiplication
|
1
|
>9
|
[2]
|
|
2
|
150
|
Acceleration by Multiple Sequencers
By the way, the complexity of this grid-based design rule check goes
linear with chip area. This is substantially better than with conventional
design rule check algorithms, which need complex divide and conquer strategies
to obtain halfways reasonable computation speed. We have also investigated
the influence of the number of sequencers used simulatiously on the performance
(see table no. 2). In a 10 by 10 matrix
vector multiplication the use of two sequencers has brought a speed-up
by more than an order of magnitude, compared to the version with only a
single sequencer.
<
> Table 3. Speed-up of MoM-3 versus
SPARC 10/51:
|
Application
( MoM-3 vs SPARC 10/51)
|
no. of scan windows
|
speed-up factor
|
published
|
|
complete JPEG compression
|
MoM-3 execution times have been compared (without reconfiguration
times) to user-process execution times of Sun SPARC 10/51.
|
3
|
3.5
|
[8]
|
|
7
|
(table 2) 10.2
|
[8]
|
|
New present technology is here assumed for MoM-3 hardware.
MoM-3 execution time (without reconfiguration time) has been compared to
user- plus system-process execution times of Sun SPARC 10/51.
|
7
|
(table 10) 24
|
[12]
|
Most of the acceleration has been achieved due to computing 800 boolean
equations in parallel within a single clock cycle. The optimized 68020
version of the algorithm skips certain boolean equation computations depending
on results of other boolean equaltions. Such an optimization does not make
sense for the MoM version. That's why comparing to the non-optimized 68020
algorithm version is more fair, where the acceleration factor is greater
than fifteen thousand (se table above). Another source of speed-up is avoiding
address computation overhead by the Xputer's data sequencer. In the design
rule check example this contributes more than an order of magnitude (see
section on sequencers).
MoM-3 Performance versus SPARC
10/51
A couple of years later we have experimented with two different versions
of the MoM-3 architecture (the MoM-3, and the faster MoM-3NT, where NT
stands for "newer technology"). Table
3 shows the results for the core of the JPEG algorithm (a multi media
application). Also this time the performance can be improcved by using
more sequencers.
<
> Table 4. MoM-3 speed-up in Multi Media
and DSP:
|
Application ( MoM-3 vs Sun SPARC
10/51, 50 MHz, 192MB RAM)
|
different details
|
speed-up factor
|
published
|
|
complete JPEG compression
|
MoM-3 (with 7 scan windows) execution times have been compared
(without reconfiguration times) to user-process execution times of Sun
SPARC 10/51.
|
10.2
|
[8]
|
|
MoM-3 execution time (without reconfiguration time) compared
to user- plus system-process execution times of Sun SPARC 10/51.
|
11,8
|
[9]
|
|
MoM-3 execution- plus reconfiguration- times compared to user-
plus system- process execution times of Sun SPARC 10/51.
|
7,3
|
[10] [14]
|
|
New present technology is here assumed for MoM-3 hardware. MoM-3
execution time (without reconfiguration time) has been compared to user-
plus system-process execution times of Sun SPARC 10/51.
|
24
|
[12]
|
|
2-dimensional FIR filter with 3-by-3 Kernel size, 1280x1024 8-bit
pixels.
|
MoM-3 execution- plus reconfiguration- times compared to user-process
time of Sun SPARC 10/51
|
39.4
|
[11]
|
|
Ising 128 lattice, 1000 iterations
|
MoM-3 execution-time compared to user-process time of Sun SPARC
10/51
|
53.8
|
[14]
|
The Ising model (chemistry, molecular biology)
is used for the analysis of phase transitions by explaining how short-range
interactions between components of a large structure (e.g. molecules in
a crystal) give rise to a large-range, correlating behaviour, for predicting
the potential for a phase transition.
Why Xputers are more powerful than Software
to Hardware Migration
The use of an Xputer as an accelerator co-processor also means (1) software-to-hardware
migration, but not only. Most frequently used loops and their expression
body is moved onto the rALU. But only with Xputers more speed-up phenomena
are available: (2) almost complete avoidance of addressing overhead by
migration into the sequencer(s) (not into the rALU!), and, (3) run time
to compile time migration, where such is not possible on parallel von Neumann
platforms. Parallelism in Xputers does not suffer from the fine granularity
switching explosion, like parallel (von Neumann) computers, since most
of the communication in Xputer systems is defined by the compilation method.
Table 5 shows the run time addressing overhead for von Neumann implementations
of several algorithms. For example, the grid-based design rule check [5]
[6] has a high addressing overhead. It uses
92% of somputation time for addressing. By migration into a data sequencer
due to Xputer use this yields a speed-up of more than ten.
<
> Table 5. Comparative Overhead Analysis:
|
algorithm (on VAX-11/750)
|
data manipulation
|
overhead
|
|
addressing
|
control flow
|
|
design rule check
|
7%
|
93%
|
<1%
|
|
digital filter
|
28%
|
58%
|
14%
|
|
Lee router
|
seek starting point
|
14%
|
74%
|
12%
|
|
wavefront propagation
|
6%
|
92%
|
6%
|
|
backtracking
|
17%
|
67%
|
17%
|
Optimizing Compilation Techniques for Xputers
....
sorry: we are still working at this section
Xputers and the Memory Communication Gap
....
sorry: we are still working at this section <
^
Literature
[1] R.Hartenstein, A.Hirschbiel, M.Weber:
MoM - a partly custom-design architecture compared to standard hardware;
IEEE Comp Euro 89, Hamburg, Germany, 1989, IEEE Press 1989, p 5/7-9, 1989
[2] R.Hartenstein, A. Hirschbiel, M.Weber: Rekonfigurierbare
ALU erlaubt Parallelisierung auf unterster Ebene; VMEbus, 1990
[3] R. Hartenstein, A. Hirschbiel, M.Weber: Xputers
- An Open Family of Non von Neumann Architectures; Proc. of 11th ITG/GI-Conf.
Architektur von Rechensystemen, VDE-Verlag, 1990
[4] R.Hartenstein, A.Hirschbiel, M.Weber:
The Machine Paradigm of Xputers and its Application to Digital Signal Processing
Acceleration; 1990 Int´l Conference on Parallel Processing, St. Charles,
Illinois, 1990
[5] R. Hartenstein, A.Hirschbiel, M. Riedmüller,
K. Schmidt, M.Weber: Automatic Synthesis of Cheap Hardware Accelerators
for Signal Processing and Image Preprocessing; 12. DAGM-Symposium Mustererkennung
(Pattern Recognition), Oberkochen-Aalen, Germany 1990
---- Best Paper and Best Presentation Award: DM 1000.-- Speaker:
M. Weber
----
[6] R. Hartenstein, A. Hirschbiel,
M.Weber: A Novel Paradigm of Parallel Computation and its Use to Implement
Simple High Performance Hardware; InfoJapan'90- Int'l Conf. memorating
30th Anniversary Computer Society of Japan, Tokyo, Japan, 1990
[6b] R. Hartenstein, A. Hirschbiel, M.Weber:
A Novel Paradigm of Parallel Computation and its Use to Implement Simple
High Performance Hardware; Future Generation Computer Systems, no. 7, pp.
181-198 (North-Holland PublishingCo., 1991/92)
[7] R. Hartenstein, R. Kress, H. Reinig: A Dynamically
Reconfigurable Wavefront Array Architecture for Evaluation of Expressions;
Proc. ASAP'94 , Int'l Conf. on Application-Specific Array Processors, San
Francisco, Aug. 1994, IEEE CS Press 1994
[8] R. Hartenstein, J. Becker, R. Kress,
H. Reinig, K. Schmidt: A Reconfigurable Machine for Applications in Image
and Video Compression; Int`l Conf. on Compression Technologies and Standards
for Image and Video Compression, Amsterdam, Holland, March 1995
[9] Reiner W. Hartenstein, Rainer Kress,
Helmut Reinig: A Reconfigurable Accelerator for 32-Bit Arithmetic; International
Parallel Processing Symposium, Santa Barbara, USA, April 1995
[10] J. Becker, R. Hartenstein, R. Kress, H.
Reinig: High-Performance Computing Using a Reconfigurable Accelerator;
Proceedings of Workshop on High Performance Computing, Montreal, Canada,
July 1995
[11] R. Hartenstein, R. Kress: A Scalable,
Parallel, and Reconfigurable Datapath Architecture; Sixth International
Symposium on IC Technology, Systems & Applications, ISIC'95, Singapore,
Sept. 6-8, 1995
[12] R. Hartenstein (opening key note):
Custom Computing Machines - An Overview; Int`l Workshop on Design Methodologies
for Microelectronics, Smolenice Castle, Slovakia, September 1995
[13] J. Becker, R. Hartenstein, R. Kress, H. Reinig:
A Reconfigurable Parallel Architecture to Accelerate Scientific Computation;
Proc. of Int. Conf. on High Performance Computing, New Delhi, India, Dec.
1995
[14] R. Hartenstein, J. Becker, R.Kress, H. Reinig:
High-Performance Computing Using a Reconfigurable Accelerator; CPE Journal,
Special Issue of Concurrency: Practice and Experience, John Wiley &
Sons Ltd., 1996
[15] R. Hartenstein, J. Becker,
R.Kress: Custom Computing Machines versusHardware/Software Co-Design -
a globalized point of view; in [16]
[16] R. Hartenstein, M. Glesner: Field-programmable
Logic, Smart Applications, New Paradigms and Compiler; Springer Verlag,
Lecture Notes in Computer Science 1142, 1996
Do you have more questions? Do you have better questions?
If yes, please, inform our webmaster.
Our goal is the steady improvement of this list of questions.


©
Copyright 1996, Universitaet Kaiserslautern, Kaiserslautern, Germany
---- Webmaster