Improving the Performance of DGEMM with MoA and Cache-Blocking
This paper demonstrates performance improvements to DGEMM, the double-precision dense matrix-matrix multiply kernel that vendors widely implement in their Basic Linear Algebra Subprograms (BLAS) libraries. The Mathematics of Arrays (MoA) paradigm, due to Mullin (1988), yields contiguous memory accesses and, in combination with Church-Rosser complete language constructs, code optimized for target processor architectures. Our performance studies show that an MoA implementation of DGEMM combined with optimal cache-blocking strategies achieves at least a 25% performance gain over the vendor-supplied Intel MKL and IBM ESSL libraries on both Intel Xeon Skylake and IBM POWER9 processors.
Results are presented for the NREL Eagle and ORNL Summit supercomputers.
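The cache-blocking idea named in the abstract can be illustrated with a minimal C sketch. It is a generic loop-tiling example, not the authors' MoA implementation: the function names `dgemm_naive` and `dgemm_blocked`, the tile size `BS`, and the row-major layout are all assumptions made for illustration. Both routines compute C += A * B; the blocked version works on tiles and orders its inner loops so that B and C are read with unit stride.

```c
#include <stddef.h>

/* Hypothetical tile edge (an assumption, not a value from the paper).
   In practice it is tuned so the working tiles fit in the target cache. */
#define BS 64

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Reference triple loop: C += A * B, all matrices row-major.
   A is m x k, B is k x n, C is m x n. */
void dgemm_naive(size_t m, size_t n, size_t k,
                 const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t p = 0; p < k; ++p)
                sum += A[i * k + p] * B[p * n + j]; /* strided reads of B */
            C[i * n + j] += sum;
        }
}

/* Cache-blocked variant: same result, computed tile by tile.
   The i-p-j inner ordering walks rows of B and C contiguously. */
void dgemm_blocked(size_t m, size_t n, size_t k,
                   const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < m; ii += BS)
        for (size_t pp = 0; pp < k; pp += BS)
            for (size_t jj = 0; jj < n; jj += BS) {
                /* Clamp tile bounds so sizes need not be multiples of BS. */
                size_t i_end = min_sz(ii + BS, m);
                size_t p_end = min_sz(pp + BS, k);
                size_t j_end = min_sz(jj + BS, n);
                for (size_t i = ii; i < i_end; ++i)
                    for (size_t p = pp; p < p_end; ++p) {
                        double a = A[i * k + p];
                        for (size_t j = jj; j < j_end; ++j)
                            C[i * n + j] += a * B[p * n + j];
                    }
            }
}
```

A caller would allocate the three row-major arrays, initialize C (for example to zero), and call `dgemm_blocked(m, n, k, A, B, C)`. The best value of `BS` depends on the cache hierarchy of the target processor and would need to be measured rather than assumed.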
Extended abstract: ARRAY_2021_paper_4 (revised).pdf (547 KiB)
Mon 21 Jun
Displayed time zone: Eastern Time (US & Canada)
18:00 - 21:00
18:00 | 25m | Talk | Improving the Performance of DGEMM with MoA and Cache-Blocking | ARRAY | Stephen Thomas (National Renewable Energy Laboratory), Lenore Mullin (SUNY Albany, USA), Kasia Swirydowicz (Pacific Northwest National Laboratory)
18:25 | 25m | Talk | Nested Object Support in a Structure-of-Arrays Dynamic Object Allocator | ARRAY
18:50 | 25m | Talk | Data Layouts are Important (Extended Abstract) | ARRAY | Doru Thom Popovici (Lawrence Berkeley National Lab), Andrew Canning (Lawrence Berkeley National Laboratory), Zhengji Zhao (Lawrence Berkeley National Laboratory), Lin-Wang Wang (Lawrence Berkeley National Laboratory), John Shalf (Lawrence Berkeley National Laboratory)