Systolic Architectures

Why Systolic Architectures? – H. T. Kung

  • Introduction
    • General methodology for mapping high-level computations into hardware structures
    • Data flows from memory in a rhythmic fashion, passing through many processing elements before returning to memory.
    • Can be two dimensional (rectangular, triangular, hexagonal) to make use of higher degrees of parallelism
    • Data flow may be at multiple speeds in multiple directions – both inputs and (partial) results flow (in classical pipelined systems, only results flow)
    • Regularity and reconfigurability arise from modularity
  • Key architectural issues in designing special-purpose systems
    • Simple and regular design
      • Special-purpose systems must be cost effective (to justify their limited applicability)
      • Costs = nonrecurring (design) and recurring (parts). Parts are cheap for both special-purpose and general-purpose systems, so it's the design cost that must be kept small to make a special-purpose system attractive
      • Design cost can be avoided by using an appropriate architecture: decompose into a few types of simple building blocks which are used repetitively with simple interfaces.
      • Simple, regular designs are likely to be modular (can be adjusted for various performance/cost goals, i.e. cost can be made proportional to performance required)
    • Concurrency and communication
      • Diminishing growth rate for component speed, need to make use of concurrent processing elements to continue to improve speed.
      • An increased number of processing elements working simultaneously requires significant amounts of coordination and communication.
      • Goal: algorithms that support high concurrency while employing simple, regular communication and control.
    • Balancing computation with I/O
      • Goal: balance computation rate with available I/O bandwidth
      • Related: a modular structure can be adjusted to match a variety of I/O bandwidths
      • Ideally perform multiple computations per I/O access which requires internal storage.
      • Need to arrange computation with an appropriate memory structure to balance with I/O time
      • I/O problem is severe when a large computation is performed on a small system (need to transfer results to and from the host)
      • Possible to compute I/O-imposed performance limits for computations
      • In practice, problems are “larger” than special-purpose devices, need to decompose computation in a way that minimizes I/O
      • Can determine how the size of a system and its memory relate to its I/O requirement, and how I/O bandwidth limits the achievable speedup ratio (a toy calculation of this ceiling is sketched below)
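        • A back-of-the-envelope sketch of that I/O-imposed ceiling; the numbers below are made up for illustration, not taken from the paper. For the k-weight convolution discussed later, each input fetched once supports about k multiply-accumulates, so ops_per_word is roughly k:
```python
# Hypothetical figures, only to illustrate the bound: a device that performs
# r operations for every word crossing the host interface can never exceed
# r * (I/O bandwidth), no matter how many cells it contains.
io_bandwidth = 10e6                 # words per second over the host link (assumed)
ops_per_word = 50                   # operations performed per word transferred (assumed)
print(ops_per_word * io_bandwidth)  # I/O-imposed ceiling: 5e8 operations per second
```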
  • Systolic architectures: the basic principle
    • Set of interconnected cells, each capable of performing a simple operation
    • Simple and regular communication and control structures using array or tree interconnects
    • Pipelined information flow, communication only occurs at the boundary cells
    • Focused on compute bound computations where multiple operations are performed on each data item in a repetitive manner
    • Other advantages: modular expansibility, simple and regular data and control flows, use of simple and uniform cells, elimination of global broadcasting and fan-in, and (possibly) fast response time.
  • Convolution examples
    • Problem: combine two data streams (weights and inputs)
      • Also found in filtering, pattern matching, correlation, interpolation, polynomial evaluation (including DFT), and polynomial multiplication/division
    • Compute bound since each input x_i is multiplied by each of the k weights: y_i = w_1 x_i + w_2 x_{i+1} + … + w_k x_{i+k-1}
    • Reusing x_i from one memory fetch avoids I/O bottleneck
    • Can broadcast each x_i to a number of cells or fan-in results
      • broadcast inputs, move results, weights stay (sketched below)
        • weights stored in each cell; partial results y_i move systolically from cell to cell, left to right
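        • A minimal cycle-level sketch of this design (my own Python illustration, not the paper's notation; systolic_b1 and its variable names are invented here; 0-based indexing, y[i] = sum over j of w[j]*x[i+j]):
```python
def systolic_b1(w, x):
    # Design B1 sketch: weights stay (cell j holds w[j]), each x is broadcast
    # to every cell, and partial results march right one cell per cycle.
    k = len(w)
    y_pipe = [0] * k                        # partial result currently held by each cell
    outputs = []
    for t, x_t in enumerate(x):             # one broadcast per cycle
        y_pipe = [0] + y_pipe[:-1]          # shift partial results right, inject a fresh zero on the left
        y_pipe = [y + wj * x_t for y, wj in zip(y_pipe, w)]   # every cell does one multiply-accumulate
        if t >= k - 1:                      # rightmost cell now holds the finished y_{t-k+1}
            outputs.append(y_pipe[-1])
    return outputs

# e.g. systolic_b1([1, 2, 3], [4, 5, 6, 7]) -> [32, 38], matching direct convolution
```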
      • broadcast inputs, move weights, results stay (sketched below)
        • Efficiently uses multiply/accumulate hardware
        • First weight (w_1) has a tag bit signaling the cell to output and reset its contents
        • Avoids the possibly wide data path needed to communicate the y_i (which may carry extra bits for numerical accuracy), since the w_i typically have fewer bits
        • Using MACs can improve precision by keeping extra bits inside the accumulator for modest cost
        • Does require a separate bus for collecting outputs from individual cells
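        • A minimal sketch of this design under the same assumptions as above; the tag-bit/output-bus mechanics are only noted in comments:
```python
def systolic_b2(w, x):
    # Design B2 sketch: each cell accumulates one y_i in place, each x is
    # broadcast, and the weights march right one cell per cycle past every cell.
    k, n = len(w), len(x)
    n_out = n - k + 1
    cells = [0] * n_out                 # stationary results, one y_i per cell
    w_pipe = [None] * n_out             # weight currently visiting each cell
    for t, x_t in enumerate(x):
        w_in = w[t] if t < k else None
        w_pipe = [w_in] + w_pipe[:-1]   # weights move systolically to the right
        for i in range(n_out):
            if w_pipe[i] is not None:
                cells[i] += w_pipe[i] * x_t
    # in hardware each cell would place its y on the shared output bus (and reset)
    # when the tagged w_1 of the next computation arrives; here we just read them out
    return cells

# e.g. systolic_b2([1, 2, 3], [4, 5, 6, 7]) -> [32, 38]
```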
      • fan-in results, move inputs, weights stay (sketched below)
        • formulate convolution as inner product of weight vector and a sliding window of inputs
        • weights preloaded into cells and stay, x_is move one cell to the right each cycle, multiplication performed in each cell and results are fanned-in and summed using an adder
        • for large k (number of weights), adder can be pipelined to avoid delays
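        • A minimal sketch of this design (same assumptions and invented names as above; the fan-in adder tree is modeled as a plain Python sum):
```python
def systolic_f(w, x):
    # Design F sketch: weights are preloaded (cell c holds w[k-1-c]); the input
    # window slides right one cell per cycle; the k products are fanned in and summed.
    k = len(w)
    x_pipe = [None] * k                 # the x currently sitting at each cell
    outputs = []
    for t, x_t in enumerate(x):
        x_pipe = [x_t] + x_pipe[:-1]    # inputs move one cell to the right
        if t >= k - 1:                  # window full: fan-in the k products
            outputs.append(sum(w[k - 1 - c] * x_pipe[c] for c in range(k)))
    return outputs

# e.g. systolic_f([1, 2, 3], [4, 5, 6, 7]) -> [32, 38]
```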
    • Avoiding broadcast/fan-in (global communication)
      • Avoids the long wires of the bus/tree structures required for global communication
      • results stay, inputs and weights move in opposite directions (sketched below)
        • Each partial result y_i stays at a cell to accumulate its terms
        • x_is and w_is move systolically in opposite directions
        • when x meets w at a cell, they are multiplied and accumulated into the y at that cell
        • Consecutive x_is are separated by two cycle times to ensure each x_i is able to meet each w_i
        • Uses systolic output path
        • Can interleave independent convolution computations to increase utilization
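        • A simplified cycle-level sketch of this design (my own illustration: one cell per output, no recirculating weights or systolic output path; the point is the two-cycle spacing in both streams):
```python
def systolic_r1(w, x):
    # Design R1 sketch: each y_i accumulates in place while x's flow right and
    # w's flow left; items are spaced two cycles apart so every x meets every w.
    # Assumes len(x) >= len(w).
    k, n = len(w), len(x)
    n_out = n - k + 1
    cells = [0] * n_out                         # stationary partial results

    def spaced(seq):                            # one blank cycle between consecutive items
        out = []
        for v in seq:
            out += [v, None]
        return out

    x_stream = spaced(x)                                 # enters on the left, moves right
    w_stream = [None] * (n_out - 1) + spaced(w)          # enters on the right, moves left
    x_pipe = [None] * n_out                              # x currently at each cell
    w_pipe = [None] * n_out                              # w currently at each cell
    for t in range(len(x_stream) + len(w_stream)):
        x_in = x_stream[t] if t < len(x_stream) else None
        w_in = w_stream[t] if t < len(w_stream) else None
        x_pipe = [x_in] + x_pipe[:-1]           # shift right
        w_pipe = w_pipe[1:] + [w_in]            # shift left
        for i in range(n_out):
            if x_pipe[i] is not None and w_pipe[i] is not None:
                cells[i] += x_pipe[i] * w_pipe[i]   # x meets w: multiply-accumulate
    return cells[::-1]                          # with this orientation the y's sit in reverse cell order

# e.g. systolic_r1([1, 2, 3], [4, 5, 6, 7]) -> [32, 38]
```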
      • results stay, inputs and weights move in the same direction but at different speeds
        • x_i moves twice as fast as w_i
        • requires extra register to hold w, but all cells are working all the time
      • weights stay, inputs and results move in opposite directions (sketched below)
        • since results move systolically, avoids need for another systolic output path
        • each output y_i leaves the leftmost cell during the same cycle its last input x_{i+k-1} enters
        • outputs a y_i every two cycle times with constant response time, but only half the cells work at any given time
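        • A minimal cycle-level sketch of this design (same assumptions and invented names as above):
```python
def systolic_w1(w, x):
    # Design W1 sketch: weights stay (cell c holds w[k-1-c]); x's move right and
    # partial results move left, with consecutive x's two cycles apart, so each
    # finished y_i leaves the leftmost cell as its last input x_{i+k-1} enters.
    k, n = len(w), len(x)
    n_out = n - k + 1
    x_pipe = [None] * k             # x currently at each cell
    y_pipe = [None] * k             # partial result currently at each cell
    outputs = []
    for t in range(2 * n):
        x_in = x[t // 2] if t % 2 == 0 and t // 2 < n else None          # new x every other cycle (left end)
        i = (t - (k - 1)) // 2
        y_in = 0 if (t - (k - 1)) % 2 == 0 and 0 <= i < n_out else None  # fresh y every other cycle (right end)
        x_pipe = [x_in] + x_pipe[:-1]       # inputs shift right
        y_pipe = y_pipe[1:] + [y_in]        # results shift left
        for c in range(k):
            if x_pipe[c] is not None and y_pipe[c] is not None:
                y_pipe[c] += w[k - 1 - c] * x_pipe[c]
        if y_pipe[0] is not None:           # a y reaching the leftmost cell is complete
            outputs.append(y_pipe[0])
            y_pipe[0] = None
    return outputs

# e.g. systolic_w1([1, 2, 3], [4, 5, 6, 7]) -> [32, 38]; only half the cells fire in any cycle
```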
      • weights stay, inputs and results move in the same direction but at different speeds
        • variant of previous that has cells working all the time (but no longer a constant response time)
        • used for 2D convolution where throughput is more important than response time
    • Remarks
      • Other possibility: results, weights, inputs all move
      • Could also include cell memories and systolic control so each cell stores a set of weights; a weight can be selected on the fly to implement interpolation or adaptive filtering (illustrated below)
      • Cell memories + systolic control can make one array implement different functions
      • Once one systolic design is obtained, likely similar designs can be derived
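      • An illustrative (and entirely hypothetical) variant of the fan-in design above, where each cell holds one weight from every bank and a control value selects the bank per output:
```python
def systolic_f_banked(weight_banks, x, bank_for_output):
    # Sketch of the "cell memory + systolic control" remark: weight_banks is a
    # list of weight vectors of equal length k (one weight per bank per cell),
    # and bank_for_output[i] names the bank used when producing y_i, so a single
    # array can switch filters on the fly.  Names and layout are assumptions.
    k = len(weight_banks[0])
    x_pipe = [None] * k
    outputs = []
    for t, x_t in enumerate(x):
        x_pipe = [x_t] + x_pipe[:-1]                      # inputs slide right, as in design F
        if t >= k - 1:
            w = weight_banks[bank_for_output[t - k + 1]]  # control selects the weight bank
            outputs.append(sum(w[k - 1 - c] * x_pipe[c] for c in range(k)))
    return outputs
```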
  • Criteria and advantages
    • Design makes multiple use of each input data item
      • Achieve high throughputs with modest I/O bandwidths
      • Can use global data communications (broadcast and fan-in), or have each input travel through the array to be used at each cell (preferable for modular expansibility)
    • Design uses extensive concurrency
      • Can pipeline the stages involved in the computation of each single result, multiprocess many results in parallel, or both
      • Sometimes possible to completely overlap I/O and computation times
      • Can use one or two dimensional arrays
        • When the memory speed is higher than the cell speed, 2D arrays should be used
        • At each cell cycle, all the I/O ports on the array boundaries can input or output data items to and from memory to fully utilize the available bandwidth
      • Can compose (chain) systolic arrays
      • Can pipeline operations inside the cells
    • Only a few types of simple cells
      • For performance, likely need a large number of cells, so they should be simple to avoid design and implementation costs
    • Data and control flows are simple and regular
      • Avoid long-distance and irregular wires
      • Usually only global communication is the clock (and power/ground)
      • Self-timed schemes are possible but may be difficult
      • For an arbitrary 1D array, a global clock is fine despite large clock skew
      • Large 2D arrays may need slowdown to compensate for clock skew
      • Avoid difficult synchronization or resource conflict problems
  • Summary
    • von Neumann bottleneck (performance limited by memory bandwidth)
    • Systolic architectures ensure multiple computations per memory access to speed up regular compute-bound applications
    • Work on automation
      • Convert semi-systolic systems (broadcast/fan-in) into pure-systolic (no global communication)
    • Need a convenient means for incorporating high-performance systolic processors into a complete system
    • Need building blocks for a variety of systolic processors