# Systolic Architectures

## Why Systolic Architectures? – H. T. Kung

- Introduction
- General methodology for mapping high-level computations into hardware structures
- Data flows from memory in a rhythmic fashion, passing through many processing elements before returning to memory.
- Can be two dimensional (rectangular, triangular, hexagonal) to make use of higher degrees of parallelism
- Data flow may be at multiple speeds in multiple directions – both inputs and (partial) results flow (in classical pipelined systems, only results flow)
- Regularity and reconfigurability arise from modularity

- Key architectural issues in designing special-purpose systems
- Simple and regular design
- Special-purpose systems must be cost effective (to justify their limited applicability)
- Costs = nonrecurring (design) and recurring (parts). Parts are cheap for both special purpose and general purpose systems, so its the design cost that must be small to make a special purpose system attractive
- Design cost can be avoided by using an appropriate architecture: decompose into a few types of simple building blocks which are used repetitively with simple interfaces.
- Simple, regular designs are likely to be modular (can be adjusted for various performance/cost goals, i.e. cost can be made proportional to performance required)

- Concurrency and communication
- Diminishing growth rate for component speed, need to make use of concurrent processing elements to continue to improve speed.
- An increased number of processing elements working simultaneously requires significant amounts of coordination and communication.
- Goal: algorithms that support high concurrency while employing simple, regular communication and control.

- Balancing computation with I/O
- Goal: balance computation rate with available I/O bandwidth
- Related: a modular structure can be adjusted to match a variety of I/O bandwidths
- Ideally perform multiple computations per I/O access which requires internal storage.
- Need to arrange computation with an appropriate memory structure to balance with I/O time
- I/O problem is severe when a large computation is performed on a small system (need to transfer results to and from the host)
- Possible to compute I/O-imposed performance limits for computations
- In practice, problems are “larger” than special-purpose devices, need to decompose computation in a way that minimizes I/O
- Can determine how the size of a system and its memory is related to the I/O requirement, and how I/O bandwidth limits the achievable speedup ratio

- Simple and regular design
- Systolic architectures: the basic principle
- Set of interconnected cells, each capable of performing a simple operation
- Simple and regular communication and control structures using array or tree interconnects
- Pipelined information flow, communication only occurs at the boundary cells
- Focused on compute bound computations where multiple operations are performed on each data item in a repetitive manner
- Other advantages: modular expansibility, simple and regular data and control flows, use of simple and uniform cells, elimination of global broadcasting, and fan-in and fast response time.

- Convolution examples
- Problem: combine two data streams (weights and inputs)
- Also found in filtering, pattern matching, correlation, interpolation, polynomial evaluation (including DFT), and polynomial multiplication/division

- Compute bound since each input x_i is multiplied by each of k weights
- Reusing x_i from one memory fetch avoids I/O bottleneck
- Can broadcast each x_i to a number of cells or fan-in results
- broadcast inputs, move results, weights stay variant
- weights stored in each cell, partial results y_i move systolically from cell to cell from left-to-right

- broadcast inputs, move weights, results stay
- Efficiently uses multiply/accumulate hardware
- First weight (w_1) has a tag bit signaling the cell to output and reset its contents
- Avoids possibly wide path required for communicating y_i (needed for numerical accuracy) since w_i typically has lets bits
- Using MACs can improve precision by keeping extra bits inside the accumulator for modest cost
- Does require a separate bus for collecting outputs from individual cells

- fan-in results, move inputs, weights stay
- formulate convolution as inner product of weight vector and a sliding window of inputs
- weights preloaded into cells and stay, x_is move one cell to the right each cycle, multiplication performed in each cell and results are fanned-in and summed using an adder
- for large k (number of weights), adder can be pipelined to avoid delays

- broadcast inputs, move results, weights stay variant
- Avoiding broadcast/fan-in (global communication)
- Avoids long wires for bus/tree structures to required for the global communication
- results stay, inputs and weights move in opposite directions
- Each partial result y_i stays at a cell to accumulate its terms
- x_is and w_is move systolically in opposite directions
- when x meets w at a cell, they are multiplied and accumulated into the y at that cell
- Consecutive x_is are separate by two cycles to ensure each x_i is able to meet each w_i
- Uses systolic output path
- Can interleave independent convolution computations to increase utilization

- results stay, inputs and weights move in the same direcition but at different speeds
- x_i moves twice as fast as w_i
- requires extra register to hold w, but all cells are working all the time

- weights stay, inputs and results move in opposite directions
- since results move systolically, avoids need for another systolic output path
- each output y_i otuputs from the left most-cell during the same cycle as its last input x_{i+k-1}
- outputs a y_i every two cycle times with constant response time, but only half the cells work at any given time

- weights stay, inputs and results move in the same direction but at different speeds
- variant of previous that has cells working all the time (but no longer a constant response time)
- used for 2D conv where throughput is more important that response time

- Remarks
- Other possibility: results, weights, inputs all move
- Could also include cell memory and systolic control to store a set of weights (weights can be selected on the fly to implement interpolation or adaptive filtering)
- Cell memories + systolic control can make one array implement different functions
- Once one systolic design is obtained, likely similar designs can be derived

- Problem: combine two data streams (weights and inputs)
- Criteria and advantages
- Design makes multiple use of each input data item
- Achieve high throughputs with modest I/O bandwidths
- Can use global data communications (broadcast and fan-in), or have each input travel through the array to be used at each cell (preferable for modular expansibility)

- Design uses extensive concurrency
- Can pipeline stages involved in the computation of each single result, multiprocessing many results in paralell, or both
- Sometimes possible to completely overlap I/O and computation times
- Can use one or two dimensional arrays
- When memory speed is more than cell speed, 2D arrays should be used
- At each cell cycle, all the I/O ports on the array boundaries can input or output data items to and from memory to fully utilize the available bandwidth

- Can compose (chain) systolic arrays
- Can pipeline operations inside the cells

- Only a few types of simple cells
- For performance, likely need a large number of cells, so they should be simple to avoid design and implementation costs

- Data and control flows are simple and regular
- Avoid long-distance and irregular wires
- Usually only global communication is the clock (and power/ground)
- Self-timed schemes are possible but may be difficult
- For an arbitrary 1D array, a global clock is fine despite large clock skew
- Large 2D arrays may need slowdown to compensate for clock skew
- Avoid difficult synchronization or resource conflict problems

- Design makes multiple use of each input data item
- Summary
- von Neumann bottlenecks (performance limited by memory bandwidths)
- Systolic architectures ensure multiple computations per memory access to speed up regular compute-bound applications
- Work on automation
- Convert semi-systolic systems (broadcast/fan-in) into pure-systolic (no global communication)

- Need a convenient means for incorporating high-performance systolic processors into a complete system
- Need building blocks for a variety of systolic processors