General methodology for mapping high-level computations into hardware structures
Data flows from memory in a rhythmic fashion, passing through many processing elements before returning to memory.
Can be two dimensional (rectangular, triangular, hexagonal) to make use of higher degrees of parallelism
Data flow may be at multiple speeds in multiple directions – both inputs and (partial) results flow (in classical pipelined systems, only results flow)
Regularity and reconfigurability arise from modularity
Key architectural issues in designing special-purpose systems
Simple and regular design
Special-purpose systems must be cost effective (to justify their limited applicability)
Costs = nonrecurring (design) and recurring (parts). Parts are cheap for both special-purpose and general-purpose systems, so it's the design cost that must be small to make a special-purpose system attractive
Design cost can be avoided by using an appropriate architecture: decompose into a few types of simple building blocks which are used repetitively with simple interfaces.
Simple, regular designs are likely to be modular (can be adjusted for various performance/cost goals, i.e. cost can be made proportional to performance required)
Concurrency and communication
Component speed is improving at a diminishing rate, so concurrent processing elements are needed to continue improving speed.
An increased number of processing elements working simultaneously requires significant amounts of coordination and communication.
Goal: algorithms that support high concurrency while employing simple, regular communication and control.
Balancing computation with I/O
Goal: balance computation rate with available I/O bandwidth
Related: a modular structure can be adjusted to match a variety of I/O bandwidths
Ideally perform multiple computations per I/O access which requires internal storage.
Need to arrange computation with an appropriate memory structure to balance with I/O time
I/O problem is severe when a large computation is performed on a small system (need to transfer results to and from the host)
Possible to compute I/O-imposed performance limits for computations
In practice, problems are “larger” than special-purpose devices, need to decompose computation in a way that minimizes I/O
Can determine how the size of a system and its memory is related to the I/O requirement, and how I/O bandwidth limits the achievable speedup ratio
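A back-of-envelope illustration of this point (not the paper's formal result; the function name and numbers are made up for this sketch):

```python
# Hypothetical sketch: if every result requires `words_per_result` words to
# cross the host I/O link, and the link delivers `bandwidth` words per cycle,
# then throughput is capped at the ratio below no matter how many cells
# compute in parallel.
def io_limited_throughput(bandwidth, words_per_result):
    return bandwidth / words_per_result

# e.g. 2 words/cycle of I/O and 1 word of traffic per result caps the device
# at 2 results per cycle; extra cells beyond that point are wasted.
print(io_limited_throughput(2, 1))
```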
Systolic architectures: the basic principle
Set of interconnected cells, each capable of performing a simple operation
Simple and regular communication and control structures using array or tree interconnects
Pipelined information flow, communication only occurs at the boundary cells
Focused on compute bound computations where multiple operations are performed on each data item in a repetitive manner
Other advantages: modular expansibility, simple and regular data and control flows, use of simple and uniform cells, elimination of global broadcasting and fan-in, and fast response time.
Convolution examples
Problem: combine two data streams (weights and inputs)
Also found in filtering, pattern matching, correlation, interpolation, polynomial evaluation (including DFT), and polynomial multiplication/division
Compute bound since each input x_i is multiplied by each of k weights
Reusing x_i from one memory fetch avoids I/O bottleneck
Can broadcast each x_i to a number of cells or fan-in results
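For concreteness, a plain sequential version of the convolution the designs below compute (a sketch; the function name is not from the paper):

```python
# Reference convolution: y_i = w_1*x_i + w_2*x_{i+1} + ... + w_k*x_{i+k-1}.
# Each x_i contributes to up to k different outputs, which is the reuse the
# systolic designs below exploit.
def convolve_ref(w, x):
    k, n = len(w), len(x)
    return [sum(w[j] * x[i + j] for j in range(k)) for i in range(n - k + 1)]

assert convolve_ref([1, 2, 3], [1, 2, 3, 4, 5]) == [14, 20, 26]
```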
broadcast inputs, move results, weights stay: each weight is stored in a cell, and partial results y_i move systolically from cell to cell, left to right
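A minimal cycle-by-cycle sketch of this design (inputs broadcast, weights resident, partial results shifting right); the code is illustrative, not from the paper:

```python
def convolve_b1(w, x):
    """Broadcast inputs, move results, weights stay (illustrative sketch)."""
    k = len(w)
    cells = [0] * k                          # partial result held by each cell
    out = []
    for t, xt in enumerate(x):               # cycle t: x_t is broadcast to all cells
        cells = [c + wj * xt for c, wj in zip(cells, w)]
        done, cells = cells[-1], [0] + cells[:-1]   # results shift one cell right
        if t >= k - 1:                       # right-most cell emits a finished y
            out.append(done)
    return out

assert convolve_b1([1, 2, 3], [1, 2, 3, 4, 5]) == [14, 20, 26]
```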
broadcast inputs, move weights, results stay
Efficiently uses multiply/accumulate hardware
First weight (w_1) has a tag bit signaling the cell to output and reset its contents
Avoids the possibly wide data path required for communicating the y_i (which may need many bits for numerical accuracy), since the w_i typically have fewer bits
Using MACs can improve precision by keeping extra bits inside the accumulator for modest cost
Does require a separate bus for collecting outputs from individual cells
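A sketch of this design's timing (inputs broadcast, weights circulating, results accumulating in place, the tag on the first weight triggering output onto the bus); the names and exact indexing are illustrative assumptions:

```python
def convolve_b2(w, x):
    """Broadcast inputs, move (circulate) weights, results stay (illustrative sketch)."""
    k, n = len(w), len(x)
    acc = [0] * k                            # one accumulator per cell
    bus = {}                                 # separate output bus: finished y_i by index
    for t in range(n + 1):                   # one extra cycle to flush the last result
        for c in range(k):
            m = (t - c) % k                  # index of the weight visiting cell c at cycle t
            if m == 0:                       # tagged first weight: dump accumulator, reset
                if t >= k:                   # skip the warm-up period
                    bus[t - k] = acc[c]
                acc[c] = 0
            if t < n:
                acc[c] += w[m] * x[t]        # x_t is broadcast to every cell
    return [bus[i] for i in range(n - k + 1)]

assert convolve_b2([1, 2, 3], [1, 2, 3, 4, 5]) == [14, 20, 26]
```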
fan-in results, move inputs, weights stay
formulate convolution as inner product of weight vector and a sliding window of inputs
weights preloaded into cells and stay, x_is move one cell to the right each cycle, multiplication performed in each cell and results are fanned-in and summed using an adder
for large k (number of weights), adder can be pipelined to avoid delays
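A sketch of the fan-in design (weights preloaded, inputs marching through the cells, products summed each cycle); illustrative only:

```python
def convolve_f(w, x):
    """Fan-in results, move inputs, weights stay (illustrative sketch)."""
    k = len(w)
    held = [0] * k                           # the x currently held by each cell
    out = []
    for t, xt in enumerate(x):
        held = held[1:] + [xt]               # inputs move one cell over each cycle
        if t >= k - 1:                       # once the array is full...
            # ...every cell multiplies its weight by its x and the k products
            # are fanned in (in hardware, by a possibly pipelined adder tree)
            out.append(sum(wc * xc for wc, xc in zip(w, held)))
    return out

assert convolve_f([1, 2, 3], [1, 2, 3, 4, 5]) == [14, 20, 26]
```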
Avoiding broadcast/fan-in (global communication)
Avoids the long wires required by the bus/tree structures used for global communication
results stay, inputs and weights move in opposite directions
Each partial result y_i stays at a cell to accumulate its terms
x_is and w_is move systolically in opposite directions
when x meets w at a cell, they are multiplied and accumulated into the y at that cell
Consecutive x_is are separated by two cycles to ensure each x_i is able to meet each w_i
Uses systolic output path
Can interleave independent convolution computations to increase utilization
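A cycle-by-cycle sketch of this pure-systolic design; the position formulas below are one concrete choice that satisfies the two-cycle spacing rule, not the paper's exact layout:

```python
def convolve_r1(w, x):
    """Results stay, inputs and weights move in opposite directions (sketch).
    Each y_i accumulates in place at cell i; x_j moves one cell per cycle in
    one direction (cell 2*j - t at cycle t) while w_m moves one cell per cycle
    the other way (cell t - 2*m), so consecutive items are two cycles apart
    and x_{i+m} meets w_m exactly once, at cell i."""
    k, n = len(w), len(x)
    y = [0] * (n - k + 1)
    for t in range(n + k - 1):               # enough cycles for every meeting
        for m, wm in enumerate(w):
            cell = t - 2 * m                 # where w_m sits this cycle
            j = t - m                        # the x occupying that same cell, if any
            if 0 <= cell < len(y) and 0 <= j < n:
                y[cell] += wm * x[j]         # multiply-accumulate on a meeting
    return y

assert convolve_r1([1, 2, 3], [1, 2, 3, 4, 5]) == [14, 20, 26]
```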
results stay, inputs and weights move in the same direction but at different speeds
x_i moves twice as fast as w_i
requires extra register to hold w, but all cells are working all the time
weights stay, inputs and results move in opposite directions
since results move systolically, avoids need for another systolic output path
each result y_i is output from the left-most cell during the same cycle as its last input x_{i+k-1}
outputs a y_i every two cycle times with constant response time, but only half the cells work at any given time
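A similar sketch for this design (weights resident in cells, inputs and results streaming through in opposite directions); the position formulas are again one illustrative choice:

```python
def convolve_w1(w, x):
    """Weights stay, inputs and results move in opposite directions (sketch).
    Result y_i moves one cell per cycle (cell t - 2*i at cycle t) while the
    inputs stream the other way; y_i picks up w_c * x_{i+c} as it passes cell
    c, exits on the same cycle its last input x_{i+k-1} is used, and finished
    results emerge only every other cycle."""
    k, n = len(w), len(x)
    y = [0] * (n - k + 1)
    for t in range(2 * n - k):               # last result exits at cycle 2n - k - 1
        for i in range(len(y)):
            cell = t - 2 * i                 # y_i's position this cycle
            j = t - i                        # index of the x at that same cell
            if 0 <= cell < k and 0 <= j < n:
                y[i] += w[cell] * x[j]       # the resident weight fires
    return y

assert convolve_w1([1, 2, 3], [1, 2, 3, 4, 5]) == [14, 20, 26]
```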
weights stay, inputs and results move in the same direction but at different speeds
variant of previous that has cells working all the time (but no longer a constant response time)
used for 2D convolution where throughput is more important than response time
Remarks
Other possibility: results, weights, inputs all move
Could also include cell memory and systolic control to store a set of weights (weights can be selected on the fly to implement interpolation or adaptive filtering)
Cell memories + systolic control can make one array implement different functions
Once one systolic design is obtained, likely similar designs can be derived
Criteria and advantages
Design makes multiple use of each input data item
Achieve high throughputs with modest I/O bandwidths
Can use global data communications (broadcast and fan-in), or have each input travel through the array to be used at each cell (preferable for modular expansibility)
Design uses extensive concurrency
Can pipeline the stages involved in computing each single result, process many results in parallel, or both
Sometimes possible to completely overlap I/O and computation times
Can use one or two dimensional arrays
When memory speed (bandwidth) is higher than cell speed, 2D arrays should be used
At each cell cycle, all the I/O ports on the array boundaries can input or output data items to and from memory to fully utilize the available bandwidth
Can compose (chain) systolic arrays
Can pipeline operations inside the cells
Only a few types of simple cells
For performance, likely need a large number of cells, so they should be simple to avoid design and implementation costs
Data and control flows are simple and regular
Avoid long-distance and irregular wires
Usually only global communication is the clock (and power/ground)
Self-timed schemes are possible but may be difficult
For an arbitrary 1D array, a global clock is fine despite large clock skew
Large 2D arrays may need slowdown to compensate for clock skew
Avoid difficult synchronization or resource conflict problems
Summary
von Neumann bottleneck (performance limited by memory bandwidth)
Systolic architectures ensure multiple computations per memory access to speed up regular compute-bound applications
Work on automation
Convert semi-systolic systems (broadcast/fan-in) into pure-systolic (no global communication)
Need a convenient means for incorporating high-performance systolic processors into a complete system
Need building blocks for a variety of systolic processors