MLSys

Learning Resource: https://github.com/openmlsys/openmlsys-zh/issues/242

Field Overview

JAX:

ALPA

OpenAI Triton: lets you write GPU kernels at the Python level, but is weak at targeting DSAs (domain-specific accelerators). (https://openai.com/blog/triton/)

Just-in-time compilation (JIT): also called dynamic translation or run-time compilation. It compiles code during execution of a program (at run time) rather than before execution (think Python vs. C++). It pays off when the speedup gained from compiling or recompiling that code outweighs the overhead of compiling it.

TorchDynamo: a Python-level JIT compiler designed to make unmodified PyTorch programs faster by rewriting Python bytecode into an FX graph, which is then compiled by a customizable backend.
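
A minimal sketch of how TorchDynamo is usually invoked, assuming PyTorch 2.x where it is exposed through torch.compile:

```python
# Minimal sketch, assuming PyTorch 2.x where TorchDynamo sits behind torch.compile.
import torch

def f(x):
    # ordinary eager-mode code; TorchDynamo captures its bytecode into an FX graph
    return torch.sin(x) ** 2 + torch.cos(x) ** 2

compiled_f = torch.compile(f, backend="inductor")  # TorchInductor is the default backend

x = torch.randn(1024)
print(torch.allclose(f(x), compiled_f(x)))  # same result; the graph is compiled on the first call
```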

PyTorch:

AOT Autograd: https://www.youtube.com/watch?v=KpH0qPMGGTI&ab_channel=OctoML

TorchScript: a static, high-performance subset of the Python language. https://towardsdatascience.com/pytorch-jit-and-torchscript-c2a77bac0fff

PyTorch JIT: an optimized compiler for PyTorch programs.
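
A minimal TorchScript sketch (the function must stay inside the static Python subset mentioned above):

```python
import torch

@torch.jit.script
def fused_gelu(x: torch.Tensor) -> torch.Tensor:
    # must stay inside the TorchScript subset (static types, limited Python features)
    return 0.5 * x * (1.0 + torch.erf(x / 1.41421356))

x = torch.randn(8)
print(fused_gelu(x))     # runs through the PyTorch JIT rather than the Python interpreter
print(fused_gelu.graph)  # inspect the TorchScript IR the JIT optimizes
```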

TorchInductor

Megatron-LM

NeMo-Megatron

DeepSpeed

JAX

Alpa: Automated Model-Parallel Deep Learning (https://ai.googleblog.com/2022/05/alpa-automated-model-parallel-deep.html)

OpenXLA/StableHLO

Introduction to ML Compiler

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning by Tianqi Chen 2018

MLSys deals with the following problems:

MLSys: layers of abstraction

Abstraction examples

Some examples of possible optimization methods:

Multi-stage lowering: split the problem into multiple sub-problems and exchange information between stages through an intermediate representation (IR), so that each group of people only needs to worry about solving one layer efficiently.

The next competition for ML is in compilers (Soumith Chintala, VentureBeat 2020)

Traditionally, framework and hardware vendors hire optimization engineers who, based on their experience, come up with heuristics on how to best execute the computation graph of a model. For example, NVIDIA might have an engineer or a team of engineers who focuses exclusively on how to make ResNet-50 run really fast on their DGX A100 server. (This is also why you shouldn’t read too much into MLPerf’s results. A popular model running really fast on a type of hardware doesn’t mean an arbitrary model will run really fast on that hardware. It might just be that this model is over-optimized). (from A friendly introduction to machine learning compilers and optimizers)

Hand-designed Rules:

Auto-Tune: torch.backends.cudnn.benchmark=True is an option in PyTorch to enable cuDNN autotune, so that it remembers the fastest convolution algorithm for each input shape.

cuDNN autotune, despite its effectiveness, only works for convolution operators and, AFAIK, is only exposed for PyTorch and MXNet.
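
For reference, a minimal way to enable it in PyTorch (it only pays off when input shapes stay fixed, since every new shape triggers a fresh benchmarking pass):

```python
import torch

# Ask cuDNN to benchmark the available convolution algorithms per input shape
# and cache the fastest one for reuse.
torch.backends.cudnn.benchmark = True

conv = torch.nn.Conv2d(3, 64, kernel_size=3).cuda()
x = torch.randn(32, 3, 224, 224, device="cuda")
y = conv(x)  # first call is slower (benchmarking); later calls reuse the chosen algorithm
```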

A much more general solution is autoTVM, which is part of the open-source compiler stack TVM. autoTVM works with subgraphs instead of just an operator, so the search spaces it works with are much more complex. The way autoTVM works is quite complicated, but here is the gist (a minimal tuning sketch follows the list):

  1. It first breaks your computation graph into subgraphs.
  2. It predicts how big each subgraph is.
  3. It allocates time to search for the best possible path for each subgraph.
  4. It stitches the best possible way to run each subgraph together to execute the entire graph.
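
A rough sketch of this flow, based on TVM's autotvm tutorials (API details vary across TVM versions; `mod` and `params` are assumed to come from a Relay frontend importer such as relay.frontend.from_pytorch):

```python
import tvm
from tvm import relay, autotvm
from tvm.autotvm.tuner import XGBTuner

target = tvm.target.Target("cuda")

# 1. Break the graph into tunable tasks (roughly, the subgraphs/operators worth searching over).
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, timeout=10),
)

# 2.-3. Give each task a trial budget and search for its best schedule.
for task in tasks:
    tuner = XGBTuner(task)
    tuner.tune(
        n_trial=100,
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file("autotvm.log")],
    )

# 4. Compile the whole graph, stitching in the best configuration found for each task.
with autotvm.apply_history_best("autotvm.log"):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
```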

It takes ~70 trials for the ML-based TVM to outperform cuDNN. Experiment by Chen et al.

OctoML showed that the optimization made by autoTVM is almost 30% faster than hand-designed optimization by Apple's Core ML team. Of course, as the M1 matures and hand-designed optimizations become more intensive, it will be hard for auto-optimization to beat hand-designed optimization.

But a combined method will be even better.

TVM (Tensor Virtual Machine) An Automated End-to-End Optimizing Compiler for Deep Learning

TVM: a common intermediate compiler from any framework to any device, similar to how LLVM compiles different languages to any device.

Tianqi was working on MXNet and realized that they couldn't rely solely on MKL-DNN or cuDNN to reach very good performance, so code generation (such as code fusion) was needed. A new approach would combine graph-level and operator-level optimization. Thierry Moreau wanted to use this system to target different specialized hardware architectures.

At the time, there were many possible approaches:

  - build a backend for TensorFlow via XLA?
  - build a backend for PyTorch via Glow or TorchScript?
  - add another execution layer to ONNX Runtime?

Example of a 100x speedup through code optimization: import from PyTorch to TVM, auto-tuning, a block-sparse operator (GRU layers have sparsity), and datatype conversion. TVM is a framework that enables this kind of optimization.
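
A minimal sketch of the import step, based on TVM's PyTorch frontend tutorial (exact API details vary across TVM versions):

```python
import torch
import torchvision
import tvm
from tvm import relay

# Trace an off-the-shelf model so TVM can read its TorchScript graph
model = torchvision.models.resnet18(pretrained=True).eval()
example_input = torch.randn(1, 3, 224, 224)
scripted = torch.jit.trace(model, example_input)

# Convert the TorchScript graph into TVM's Relay IR
mod, params = relay.frontend.from_pytorch(scripted, [("input0", example_input.shape)])

# Compile for a CPU target; "cuda" or other targets work the same way
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)
```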

Dive into How TVM Works

Functional Definition vs Schedule Definition

Before and After, matrix multiplication with 200x performance

Operator-Level Optimization: operators like matrix multiplication can be optimized via tiling and other methods
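
A small sketch of the split between the functional definition and the schedule, using TVM's tensor expression (te) API to tile a matrix multiplication (a schedule-primitive example, not a tuned kernel):

```python
import tvm
from tvm import te

N = 1024
A = te.placeholder((N, N), name="A")
B = te.placeholder((N, N), name="B")
k = te.reduce_axis((0, N), name="k")

# Functional definition: *what* to compute
C = te.compute((N, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

# Schedule definition: *how* to compute it (tile the loops for cache locality)
s = te.create_schedule(C.op)
io, jo, ii, ji = s[C].tile(C.op.axis[0], C.op.axis[1], x_factor=32, y_factor=32)

print(tvm.lower(s, [A, B, C], simple_mode=True))  # inspect the tiled loop nest
func = tvm.build(s, [A, B, C], target="llvm")     # compile it for CPU
```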

Automated Optimization Search: a learned way to search for the best execution schedule.

TVM Graph-level Optimization

Fusion: save DRAM accesses by fusing layers together.

Using IR graph to optimize for operator fusion (automatically generate fused code)

The programming language is Relay (functional), which implements common features of ML frameworks, like quantization and shape inference, as standard optimization passes.
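
A small sketch of running such passes over a hand-built Relay module (pass names follow TVM's relay.transform module and may differ across versions):

```python
import tvm
from tvm import relay

# Build a tiny Relay function by hand (normally `mod` comes from a frontend importer)
x = relay.var("x", shape=(1, 64))
w = relay.var("w", shape=(64, 64))
y = relay.nn.relu(relay.nn.dense(x, w))
mod = tvm.IRModule.from_expr(relay.Function([x, w], y))

seq = tvm.transform.Sequential([
    relay.transform.InferType(),                # shape/type inference as a standard pass
    relay.transform.FoldConstant(),             # constant folding
    relay.transform.FuseOps(fuse_opt_level=2),  # automatic operator fusion
])
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)
print(mod)  # fused groups show up as "Primitive" functions
```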

Quantization

Microcontroller Support

micro-TVM

Condition:

So TVM generates C code and hands it to whatever cross-platform compiler is available on the target (for when LLVM is not available on the target machine).
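
A minimal sketch of C-source generation with the plain `c` target (micro-TVM layers more tooling on top of this; APIs vary across TVM versions):

```python
import tvm
from tvm import te

n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
s = te.create_schedule(B.op)

# target="c" emits C source instead of LLVM machine code
mod = tvm.build(s, [A, B], target="c", name="add_one")
print(mod.get_source())  # hand this C code to whatever toolchain the platform provides
```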

Current Issue

There are two barriers to better optimizations:

  1. hand-crafted rules are not compatible with automatically generated rules
  2. multi-stage lowering enforces a fixed intermediate representation at each stage, which creates barriers between stages

To summarize, the overall technical roadmap comes down to three key points:

Other Compilers

There are other mainstream compilers:

Third-party compilers are, in general, very ambitious (e.g. you must be really confident to think that you can optimize for GPUs better than NVIDIA can).

Other 3rd-party compilers

Another growing technology is WebAssembly (WASM).

You can't avoid JavaScript when running ML models on the web. But with WASM, you can!

If you want to run ML on the web:

Tip: If you decide to try out TVM, use their unofficial conda/pip command for fast installation instead of the instructions found on the Apache site. They only have a Discord server if you need help!

When your model is ready for deployment, it makes sense to try out different compilers to see which one gives you the best performance boost.
