98 Out-Of-Order Matrix Processor: Implementation and Performance Evaluation
Published: 2009
Mat-Core is a research processor that aims to exploit the increasing number of transistors per IC to improve the performance of a wide range of applications. It extends a general-purpose scalar processor with a matrix unit for processing vector/matrix data, where scalar and vector/matrix instructions are executed out of order. To hide memory latency, the extended matrix unit is decoupled into two components, address generation and data computation, which communicate through data queues and also execute out of order. Like vector architectures, the data computation unit is organized in parallel lanes. On these lanes, however, Mat-Core can execute scalar-matrix, vector-matrix, and matrix-matrix instructions in addition to scalar-vector and vector-vector instructions. By extending the well-known scoreboard algorithm, these instructions are executed out of order on parallel pipelines. This paper describes the SystemC (system-level modeling language) implementation of Mat-Core and evaluates its performance on vector and matrix kernels. With four parallel lanes, matrix registers of size 4×8 (that is, 32 elements each), a queue size of 10, a start-up time of 6 clock cycles, and a memory latency of 10 clock cycles, Mat-Core achieves about 1.4, 2.1, 4.2, 2.6, 4.2, and 6.4 FLOPs per clock cycle on scalar-vector multiplication, SAXPY, Givens, rank-1 update, vector-matrix multiplication, and matrix-matrix multiplication, respectively.
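For readers unfamiliar with the six kernels named above, the following is a minimal reference sketch in plain Python. It is not the paper's SystemC model and says nothing about how Mat-Core schedules the work. The function names, the square problem size `n`, and the FLOP-counting convention (each multiply and each add counted as one operation, with a dot product of length n rounded to 2n) are assumptions made for illustration. The abstract does not state how the paper counts FLOPs.

```python
# Reference (scalar) versions of the six kernels named in the abstract.
# Names and the FLOP-counting convention are illustrative assumptions.

def scalar_vector_mul(a, x):
    """y[i] = a * x[i]"""
    return [a * xi for xi in x]

def saxpy(a, x, y):
    """y[i] = a * x[i] + y[i]"""
    return [a * xi + yi for xi, yi in zip(x, y)]

def givens(c, s, x, y):
    """Apply a plane rotation with cosine c and sine s to each (x[i], y[i])."""
    xr = [c * xi + s * yi for xi, yi in zip(x, y)]
    yr = [c * yi - s * xi for xi, yi in zip(x, y)]
    return xr, yr

def rank1_update(A, x, y):
    """A[i][j] = A[i][j] + x[i] * y[j]"""
    return [[A[i][j] + x[i] * y[j] for j in range(len(y))]
            for i in range(len(x))]

def vector_matrix_mul(x, A):
    """y[j] = sum over i of x[i] * A[i][j]"""
    return [sum(x[i] * A[i][j] for i in range(len(x)))
            for j in range(len(A[0]))]

def matrix_matrix_mul(A, B):
    """C = A * B, computed one row of A at a time."""
    return [vector_matrix_mul(row, B) for row in A]

def flop_counts(n):
    """Conventional FLOP counts for vectors of length n and n-by-n matrices."""
    return {
        "scalar-vector multiplication": n,            # n multiplies
        "SAXPY": 2 * n,                               # n multiplies, n adds
        "Givens": 6 * n,                              # 4 multiplies, 2 adds per pair
        "rank-1 update": 2 * n * n,                   # 1 multiply, 1 add per element
        "vector-matrix multiplication": 2 * n * n,    # n dot products of length n
        "matrix-matrix multiplication": 2 * n ** 3,   # n*n dot products of length n
    }
```

Under this convention, dividing a kernel's FLOP count by a measured cycle count gives a FLOPs-per-cycle figure of the kind the abstract reports. Kernels that reuse each loaded operand more often, such as matrix-matrix multiplication, perform more arithmetic per element fetched, which may help explain why they reach the higher rates quoted above.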