Machine Learning for Autotuning Production Machine Learning Compilers
Search-based techniques have proven effective at solving complex optimization problems that arise in domain-specific compilers for machine learning (ML). Unfortunately, two limitations impede their deployment in production compilers. First, prior work requires factoring a computation graph into smaller subgraphs over which search is applied. This decomposition is not only non-trivial but also significantly limits the scope of optimization. Second, prior work requires search to be applied at a single stage of the compilation flow, which does not fit the multi-stage, layered architecture of most production ML compilers.
I will present an autotuner for production ML compilers that can tune both graph-level and subgraph-level optimizations at multiple compilation stages. The autotuner applies a flexible search methodology that defines a search formulation for joint optimizations by accurately modeling the interactions between different compiler passes. The autotuner tunes tensor layouts, operator fusion decisions, tile sizes, and code generation parameters in XLA, a production ML compiler, using various search strategies. We demonstrate how to incorporate machine learning techniques, such as a learned cost model and various learning-based search strategies, to reduce autotuning time. In an evaluation across 150 ML training and inference models on Tensor Processing Units (TPUs), the autotuner delivers a runtime speedup of up to 2.4x, and 5% on average, over the heavily optimized XLA compiler.
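As a rough illustration of the idea in the abstract, the following is a minimal sketch of search-based autotuning in which a cheap cost model shortlists candidates before the expensive measurement step. It is not the autotuner described in the talk and does not use XLA: the search space, `measure_runtime`, `cost_model`, and `autotune` are all hypothetical names, and both the "measurement" and the "learned" model are stand-in analytic functions.

```python
import itertools
import random

# Hypothetical search space: tile sizes and a layout choice for one kernel.
SEARCH_SPACE = {
    "tile_m": [8, 16, 32, 64],
    "tile_n": [8, 16, 32, 64],
    "layout": ["row_major", "col_major"],
}


def measure_runtime(config):
    """Stand-in for compiling and running on hardware (expensive in practice).

    A made-up function, lowest at tile_m=32, tile_n=16, row_major.
    """
    penalty = 0.0 if config["layout"] == "row_major" else 0.5
    return (abs(config["tile_m"] - 32) / 32
            + abs(config["tile_n"] - 16) / 16
            + penalty + 1.0)


def cost_model(config):
    """Stand-in for a learned cost model: cheap but imperfect.

    It accounts for the tile sizes but ignores the layout.
    """
    return abs(config["tile_m"] - 32) / 32 + abs(config["tile_n"] - 16) / 16


def autotune(num_candidates=16, top_k=4, seed=0):
    """Random search: sample configs, shortlist by model, measure the shortlist."""
    rng = random.Random(seed)
    keys = sorted(SEARCH_SPACE)
    all_configs = [
        dict(zip(keys, values))
        for values in itertools.product(*(SEARCH_SPACE[k] for k in keys))
    ]
    candidates = rng.sample(all_configs, num_candidates)
    # Rank every candidate with the cheap model, then measure only the top_k.
    shortlist = sorted(candidates, key=cost_model)[:top_k]
    best = min(shortlist, key=measure_runtime)
    return best, measure_runtime(best)


if __name__ == "__main__":
    best_config, runtime = autotune()
    print(best_config, runtime)
```

The point of the design is that only `top_k` configurations pay the measurement cost, which is where a cost model can reduce tuning time; the price is that a model blind to some parameter (here, the layout) may rank a poor configuration highly, so the final choice is still made on measured runtime.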
Mangpo is a research scientist at Google Brain, where she leads the Machine Learning for Machine Learning Compilers effort (one of the Google Brain moonshots in 2020). Her research interests include compilers, machine learning for systems, program synthesis, and efficient computing. Mangpo completed her PhD in Computer Science at UC Berkeley. Her dissertation focuses on synthesis-aided compilation and programming models for emerging architectures, ranging from an ultra-low-power processor to a programmable network card.
Mon 21 Jun
Displayed time zone: Eastern Time (US & Canada)
16:45 - 19:15
|Machine Learning for Autotuning Production Machine Learning Compilers
|Pure, Low-Level Tensor Program Rewriting via Access Patterns (Representation Pearl)
Gus Henry Smith University of Washington, Andrew Liu University of Washington, Steven Lyubomirsky University of Washington, USA, Scott Davidson University of Washington, Joseph McMahan University of Washington, Michael Bedford Taylor University of Washington, Luis Ceze University of Washington, Zachary Tatlock University of Washington, Seattle
|ControlFlag: A Self-supervised Idiosyncratic Pattern Detection System for Software Control Structures
|Predictive Data Locality Optimization for Higher-Order Tensor Computations