Systolic Architectures

Why Systolic Architectures? – H. T. Kung

  • Introduction
    • General methodology for mapping high-level computations into hardware structures
    • Data flows from memory in a rhythmic fashion, passing through many processing elements before returning to memory.
    • Can be two dimensional (rectangular, triangular, hexagonal) to make use of higher degrees of parallelism
    • Data flow may be at multiple speeds in multiple directions – both inputs and (partial) results flow (in classical pipelined systems, only results flow)
    • Regularity and reconfigurability arise from modularity
  • Key architectural issues in designing special-purpose systems
    • Simple and regular design
      • Special-purpose systems must be cost effective (to justify their limited applicability)
      • Costs = nonrecurring (design) and recurring (parts). Parts are cheap for both special-purpose and general-purpose systems, so it's the design cost that must be kept small to make a special-purpose system attractive
      • Design cost can be avoided by using an appropriate architecture: decompose into a few types of simple building blocks which are used repetitively with simple interfaces.
      • Simple, regular designs are likely to be modular (can be adjusted for various performance/cost goals, i.e. cost can be made proportional to performance required)
    • Concurrency and communication
      • Diminishing growth rate for component speed, need to make use of concurrent processing elements to continue to improve speed.
      • An increased number of processing elements working simultaneously requires significant amounts of coordination and communication.
      • Goal: algorithms that support high concurrency while employing simple, regular communication and control.
    • Balancing computation with I/O
      • Goal: balance computation rate with available I/O bandwidth
      • Related: a modular structure can be adjusted to match a variety of I/O bandwidths
      • Ideally perform multiple computations per I/O access which requires internal storage.
      • Need to arrange computation with an appropriate memory structure to balance with I/O time
      • I/O problem is severe when a large computation is performed on a small system (need to transfer results to and from the host)
      • Possible to compute I/O-imposed performance limits for computations
      • In practice, problems are “larger” than special-purpose devices, need to decompose computation in a way that minimizes I/O
      • Can determine how the size of a system and its memory relate to its I/O requirement, and how I/O bandwidth limits the achievable speedup ratio (a toy calculation of this ceiling is sketched below)
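        • A back-of-the-envelope sketch of that I/O-imposed ceiling; the numbers below are made up for illustration, not taken from the paper. For the k-weight convolution discussed later, each input fetched once supports about k multiply-accumulates, so ops_per_word is roughly k:
```python
# Hypothetical figures, only to illustrate the bound: a device that performs
# r operations for every word crossing the host interface can never exceed
# r * (I/O bandwidth), no matter how many cells it contains.
io_bandwidth = 10e6                 # words per second over the host link (assumed)
ops_per_word = 50                   # operations performed per word transferred (assumed)
print(ops_per_word * io_bandwidth)  # I/O-imposed ceiling: 5e8 operations per second
```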
  • Systolic architectures: the basic principle
    • Set of interconnected cells, each capable of performing a simple operation
    • Simple and regular communication and control structures using array or tree interconnects
    • Pipelined information flow, communication only occurs at the boundary cells
    • Focused on compute bound computations where multiple operations are performed on each data item in a repetitive manner
    • Other advantages: modular expansibility, simple and regular data and control flows, use of simple and uniform cells, elimination of global broadcasting and fan-in, and (possibly) fast response time.
  • Convolution examples
    • Problem: combine two data streams (weights and inputs)
      • Also found in filtering, pattern matching, correlation, interpolation, polynomial evaluation (including DFT), and polynomial multiplication/division
    • Compute bound since each input x_i is multiplied by each of the k weights: y_i = w_1 x_i + w_2 x_{i+1} + … + w_k x_{i+k-1}
    • Reusing x_i from one memory fetch avoids I/O bottleneck
    • Can broadcast each x_i to a number of cells or fan-in results
      • broadcast inputs, move results, weights stay (sketched below)
        • weights stored in each cell; partial results y_i move systolically from cell to cell, left to right
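        • A minimal cycle-level sketch of this design (my own Python illustration, not the paper's notation; systolic_b1 and its variable names are invented here; 0-based indexing, y[i] = sum over j of w[j]*x[i+j]):
```python
def systolic_b1(w, x):
    # Design B1 sketch: weights stay (cell j holds w[j]), each x is broadcast
    # to every cell, and partial results march right one cell per cycle.
    k = len(w)
    y_pipe = [0] * k                        # partial result currently held by each cell
    outputs = []
    for t, x_t in enumerate(x):             # one broadcast per cycle
        y_pipe = [0] + y_pipe[:-1]          # shift partial results right, inject a fresh zero on the left
        y_pipe = [y + wj * x_t for y, wj in zip(y_pipe, w)]   # every cell does one multiply-accumulate
        if t >= k - 1:                      # rightmost cell now holds the finished y_{t-k+1}
            outputs.append(y_pipe[-1])
    return outputs

# e.g. systolic_b1([1, 2, 3], [4, 5, 6, 7]) -> [32, 38], matching direct convolution
```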
      • broadcast inputs, move weights, results stay (sketched below)
        • Efficiently uses multiply/accumulate hardware
        • First weight (w_1) has a tag bit signaling the cell to output and reset its contents
        • Avoids the possibly wide data path needed to communicate the y_i (which may carry extra bits for numerical accuracy), since the w_i typically have fewer bits
        • Using MACs can improve precision by keeping extra bits inside the accumulator for modest cost
        • Does require a separate bus for collecting outputs from individual cells
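        • A minimal sketch of this design under the same assumptions as above; the tag-bit/output-bus mechanics are only noted in comments:
```python
def systolic_b2(w, x):
    # Design B2 sketch: each cell accumulates one y_i in place, each x is
    # broadcast, and the weights march right one cell per cycle past every cell.
    k, n = len(w), len(x)
    n_out = n - k + 1
    cells = [0] * n_out                 # stationary results, one y_i per cell
    w_pipe = [None] * n_out             # weight currently visiting each cell
    for t, x_t in enumerate(x):
        w_in = w[t] if t < k else None
        w_pipe = [w_in] + w_pipe[:-1]   # weights move systolically to the right
        for i in range(n_out):
            if w_pipe[i] is not None:
                cells[i] += w_pipe[i] * x_t
    # in hardware each cell would place its y on the shared output bus (and reset)
    # when the tagged w_1 of the next computation arrives; here we just read them out
    return cells

# e.g. systolic_b2([1, 2, 3], [4, 5, 6, 7]) -> [32, 38]
```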
      • fan-in results, move inputs, weights stay (sketched below)
        • formulate convolution as inner product of weight vector and a sliding window of inputs
        • weights preloaded into cells and stay, x_is move one cell to the right each cycle, multiplication performed in each cell and results are fanned-in and summed using an adder
        • for large k (number of weights), adder can be pipelined to avoid delays
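        • A minimal sketch of this design (same assumptions and invented names as above; the fan-in adder tree is modeled as a plain Python sum):
```python
def systolic_f(w, x):
    # Design F sketch: weights are preloaded (cell c holds w[k-1-c]); the input
    # window slides right one cell per cycle; the k products are fanned in and summed.
    k = len(w)
    x_pipe = [None] * k                 # the x currently sitting at each cell
    outputs = []
    for t, x_t in enumerate(x):
        x_pipe = [x_t] + x_pipe[:-1]    # inputs move one cell to the right
        if t >= k - 1:                  # window full: fan-in the k products
            outputs.append(sum(w[k - 1 - c] * x_pipe[c] for c in range(k)))
    return outputs

# e.g. systolic_f([1, 2, 3], [4, 5, 6, 7]) -> [32, 38]
```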
    • Avoiding broadcast/fan-in (global communication)
      • Avoids the long wires of the bus/tree structures required for global communication
      • results stay, inputs and weights move in opposite directions (sketched below)
        • Each partial result y_i stays at a cell to accumulate its terms
        • x_is and w_is move systolically in opposite directions
        • when x meets w at a cell, they are multiplied and accumulated into the y at that cell
        • Consecutive x_is are separated by two cycle times to ensure each x_i is able to meet each w_i
        • Uses systolic output path
        • Can interleave independent convolution computations to increase utilization
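        • A simplified cycle-level sketch of this design (my own illustration: one cell per output, no recirculating weights or systolic output path; the point is the two-cycle spacing in both streams):
```python
def systolic_r1(w, x):
    # Design R1 sketch: each y_i accumulates in place while x's flow right and
    # w's flow left; items are spaced two cycles apart so every x meets every w.
    # Assumes len(x) >= len(w).
    k, n = len(w), len(x)
    n_out = n - k + 1
    cells = [0] * n_out                         # stationary partial results

    def spaced(seq):                            # one blank cycle between consecutive items
        out = []
        for v in seq:
            out += [v, None]
        return out

    x_stream = spaced(x)                                 # enters on the left, moves right
    w_stream = [None] * (n_out - 1) + spaced(w)          # enters on the right, moves left
    x_pipe = [None] * n_out                              # x currently at each cell
    w_pipe = [None] * n_out                              # w currently at each cell
    for t in range(len(x_stream) + len(w_stream)):
        x_in = x_stream[t] if t < len(x_stream) else None
        w_in = w_stream[t] if t < len(w_stream) else None
        x_pipe = [x_in] + x_pipe[:-1]           # shift right
        w_pipe = w_pipe[1:] + [w_in]            # shift left
        for i in range(n_out):
            if x_pipe[i] is not None and w_pipe[i] is not None:
                cells[i] += x_pipe[i] * w_pipe[i]   # x meets w: multiply-accumulate
    return cells[::-1]                          # with this orientation the y's sit in reverse cell order

# e.g. systolic_r1([1, 2, 3], [4, 5, 6, 7]) -> [32, 38]
```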
      • results stay, inputs and weights move in the same direction but at different speeds
        • x_i moves twice as fast as w_i
        • requires extra register to hold w, but all cells are working all the time
      • weights stay, inputs and results move in opposite directions (sketched below)
        • since results move systolically, avoids need for another systolic output path
        • each output y_i leaves the leftmost cell during the same cycle its last input x_{i+k-1} enters
        • outputs a y_i every two cycle times with constant response time, but only half the cells work at any given time
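        • A minimal cycle-level sketch of this design (same assumptions and invented names as above):
```python
def systolic_w1(w, x):
    # Design W1 sketch: weights stay (cell c holds w[k-1-c]); x's move right and
    # partial results move left, with consecutive x's two cycles apart, so each
    # finished y_i leaves the leftmost cell as its last input x_{i+k-1} enters.
    k, n = len(w), len(x)
    n_out = n - k + 1
    x_pipe = [None] * k             # x currently at each cell
    y_pipe = [None] * k             # partial result currently at each cell
    outputs = []
    for t in range(2 * n):
        x_in = x[t // 2] if t % 2 == 0 and t // 2 < n else None          # new x every other cycle (left end)
        i = (t - (k - 1)) // 2
        y_in = 0 if (t - (k - 1)) % 2 == 0 and 0 <= i < n_out else None  # fresh y every other cycle (right end)
        x_pipe = [x_in] + x_pipe[:-1]       # inputs shift right
        y_pipe = y_pipe[1:] + [y_in]        # results shift left
        for c in range(k):
            if x_pipe[c] is not None and y_pipe[c] is not None:
                y_pipe[c] += w[k - 1 - c] * x_pipe[c]
        if y_pipe[0] is not None:           # a y reaching the leftmost cell is complete
            outputs.append(y_pipe[0])
            y_pipe[0] = None
    return outputs

# e.g. systolic_w1([1, 2, 3], [4, 5, 6, 7]) -> [32, 38]; only half the cells fire in any cycle
```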
      • weights stay, inputs and results move in the same direction but at different speeds
        • variant of previous that has cells working all the time (but no longer a constant response time)
        • used for 2D convolution where throughput is more important than response time
    • Remarks
      • Other possibility: results, weights, inputs all move
      • Could also include cell memories and systolic control so each cell stores a set of weights; a weight can be selected on the fly to implement interpolation or adaptive filtering (illustrated below)
      • Cell memories + systolic control can make one array implement different functions
      • Once one systolic design is obtained, likely similar designs can be derived
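      • An illustrative (and entirely hypothetical) variant of the fan-in design above, where each cell holds one weight from every bank and a control value selects the bank per output:
```python
def systolic_f_banked(weight_banks, x, bank_for_output):
    # Sketch of the "cell memory + systolic control" remark: weight_banks is a
    # list of weight vectors of equal length k (one weight per bank per cell),
    # and bank_for_output[i] names the bank used when producing y_i, so a single
    # array can switch filters on the fly.  Names and layout are assumptions.
    k = len(weight_banks[0])
    x_pipe = [None] * k
    outputs = []
    for t, x_t in enumerate(x):
        x_pipe = [x_t] + x_pipe[:-1]                      # inputs slide right, as in design F
        if t >= k - 1:
            w = weight_banks[bank_for_output[t - k + 1]]  # control selects the weight bank
            outputs.append(sum(w[k - 1 - c] * x_pipe[c] for c in range(k)))
    return outputs
```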
  • Criteria and advantages
    • Design makes multiple use of each input data item
      • Achieve high throughputs with modest I/O bandwidths
      • Can use global data communications (broadcast and fan-in), or have each input travel through the array to be used at each cell (preferable for modular expansibility)
    • Design uses extensive concurrency
      • Can pipeline the stages involved in the computation of each single result, multiprocess many results in parallel, or both
      • Sometimes possible to completely overlap I/O and computation times
      • Can use one or two dimensional arrays
        • When the memory speed is higher than the cell speed, 2D arrays should be used
        • At each cell cycle, all the I/O ports on the array boundaries can input or output data items to and from memory to fully utilize the available bandwidth
      • Can compose (chain) systolic arrays
      • Can pipeline operations inside the cells
    • Only a few types of simple cells
      • For performance, likely need a large number of cells, so they should be simple to avoid design and implementation costs
    • Data and control flows are simple and regular
      • Avoid long-distance and irregular wires
      • Usually only global communication is the clock (and power/ground)
      • Self-timed schemes are possible but may be difficult
      • For an arbitrary 1D array, a global clock is fine despite large clock skew
      • Large 2D arrays may need slowdown to compensate for clock skew
      • Avoid difficult synchronization or resource conflict problems
  • Summary
    • von Neumann bottleneck (performance limited by memory bandwidth)
    • Systolic architectures ensure multiple computations per memory access to speed up regular compute-bound applications
    • Work on automation
      • Convert semi-systolic systems (broadcast/fan-in) into pure-systolic (no global communication)
    • Need a convenient means for incorporating high-performance systolic processors into a complete system
    • Need building blocks for a variety of systolic processors