# MLSys

Learning Resource: https://github.com/openmlsys/openmlsys-zh/issues/242

## Field Overview

JAX:

ALPA

OpenAI Triton: lets you write GPU kernels at the Python level, but it is a poor fit for targeting DSAs (domain-specific architectures). (https://openai.com/blog/triton/)

Just-in-time (JIT) compilation: also called dynamic translation or run-time compilation. Code is compiled during the execution of a program (at run time) rather than before execution (as is typical for Python-based tooling, in contrast to ahead-of-time compiled C++). A JIT pays off when the speedup gained from compiling or recompiling a piece of code outweighs the overhead of compiling it.
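A minimal sketch of that trade-off in plain Python, not a real JIT: the hypothetical `make_poly` helper generates source code specialized to one coefficient list at run time, compiles it once with the built-in `compile`, and caches the result so later calls skip the compilation cost. The names `make_poly` and `_cache` are made up for illustration.

```python
# Toy run-time code generation; `make_poly` and `_cache` are hypothetical names.
_cache = {}


def make_poly(coeffs):
    """Return a function computing sum(coeffs[i] * x**i), compiled at run time."""
    key = tuple(coeffs)
    if not key:
        raise ValueError("need at least one coefficient")
    if key not in _cache:
        # Unroll Horner's rule into one expression, e.g. "((3) * x + 2) * x + 1".
        expr = repr(key[-1])
        for c in reversed(key[:-1]):
            expr = f"({expr}) * x + {c!r}"
        source = f"def poly(x):\n    return {expr}\n"
        namespace = {}
        exec(compile(source, "<jit>", "exec"), namespace)  # pay compile cost once
        _cache[key] = namespace["poly"]
    return _cache[key]


poly = make_poly([1, 2, 3])  # 1 + 2x + 3x^2
print(poly(2))  # 17
```

The second call to `make_poly([1, 2, 3])` returns the cached function, which is the part that makes the up-front compilation worthwhile.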

TorchDynamo: a Python-level JIT compiler designed to make unmodified PyTorch programs faster. It rewrites Python bytecode to extract an FX graph, which is then compiled by a customizable backend.
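To make the "capture a graph, then hand it to a pluggable backend" idea concrete, here is a toy sketch in plain Python. Note the loud assumption: it captures the graph by operator overloading on a proxy object, not by rewriting bytecode as TorchDynamo does, and `Proxy`, `capture`, and `interpreter_backend` are hypothetical names, not PyTorch APIs.

```python
class Proxy:
    """Stands in for a tensor and records every operation applied to it."""

    def __init__(self, name, graph):
        self.name, self.graph = name, graph

    def _record(self, op, other):
        out = Proxy(f"v{len(self.graph)}", self.graph)
        rhs = other.name if isinstance(other, Proxy) else other
        self.graph.append((out.name, op, self.name, rhs))
        return out

    def __add__(self, other):
        return self._record("add", other)

    def __mul__(self, other):
        return self._record("mul", other)


def capture(fn, arg_names):
    """Run `fn` on proxies and return (list of graph nodes, output name)."""
    graph = []
    out = fn(*(Proxy(name, graph) for name in arg_names))
    return graph, out.name


def interpreter_backend(graph, output, arg_names):
    """A trivial 'backend': turns the captured graph back into a callable."""
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

    def run(*args):
        env = dict(zip(arg_names, args))
        for name, op, lhs, rhs in graph:
            # Strings are names of earlier values; anything else is a constant.
            right = env[rhs] if isinstance(rhs, str) else rhs
            env[name] = ops[op](env[lhs], right)
        return env[output]

    return run


def model(x, y):  # ordinary, unmodified Python
    return (x + y) * 2


graph, output = capture(model, ["x", "y"])
compiled = interpreter_backend(graph, output, ["x", "y"])
print(graph)
print(compiled(3, 4))  # 14
```

A different backend could take the same `graph` and emit fused or hardware-specific code instead of interpreting it, which is the point of keeping the backend customizable.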

PyTorch:

• Eager Mode: for fast prototyping

• Script Mode: for production

TorchScript: a static, high-performance subset of the Python language (https://towardsdatascience.com/pytorch-jit-and-torchscript-c2a77bac0fff).

PyTorch JIT: an optimizing compiler for PyTorch programs.

TorchInductor

Megatron-LM

NeMo-Megatron

DeepSpeed

JAX

Alpa: Automated Model-Parallel Deep Learning (https://ai.googleblog.com/2022/05/alpa-automated-model-parallel-deep.html)

OpenXLA/StableHLO

## Introduction to ML Compiler

MLSys deals with the following problems:

• Lowering:

  • providing an abstraction over heterogeneous hardware (Intel CPU, NVIDIA GPU, iPhone, TPU) for training and evaluation
  • providing an abstraction over heterogeneous software (PyTorch, TensorFlow, MXNet, scikit-learn) for training and evaluation

• Optimizing: accelerating existing computation by transforming human-written code

• Others:

  • allowing a trade-off between time and precision

Some examples of possible optimization methods:

• Fused Operator: operator fusion improves performance by merging one operator (typically an activation function) into another so that they execute together without a round trip to memory. For example, a ReLU activation can be performed as part of the preceding convolution. The GPU then computes the activation without waiting for the convolution's results to be written to memory and read back, which improves performance.
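A minimal sketch of the idea in plain Python, assuming a 1-D convolution followed by ReLU (the function names are made up for illustration). The unfused version materializes the whole intermediate list before applying the activation; the fused version applies the activation while each result is still in a local variable.

```python
def conv1d(signal, kernel):
    """Valid 1-D convolution (no kernel flip), writes a full intermediate list."""
    n = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(n)
    ]


def relu(values):
    """Second pass over memory: reads the intermediate list back in."""
    return [max(0.0, v) for v in values]


def conv1d_relu_fused(signal, kernel):
    """Same math, but the activation runs inside the convolution loop."""
    out = []
    for i in range(len(signal) - len(kernel) + 1):
        acc = 0.0
        for j in range(len(kernel)):
            acc += signal[i + j] * kernel[j]
        out.append(max(0.0, acc))  # no intermediate buffer is written
    return out


signal = [1.0, -2.0, 3.0, -4.0, 5.0]
kernel = [1.0, 0.5]
assert relu(conv1d(signal, kernel)) == conv1d_relu_fused(signal, kernel)
print(conv1d_relu_fused(signal, kernel))  # [0.0, 0.0, 1.0, 0.0]
```

In pure Python the saving is only the skipped intermediate list; on a GPU the same restructuring avoids a write to and a read from device memory.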

• Loop Transformation: ways to optimize for and while loops. There are many possible transformations (matrix tiling, vectorization, loop unrolling, etc.); see Wikipedia for more.
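As one example of a loop transformation, here is a sketch of matrix tiling in plain Python. The tile size of 2 is an arbitrary assumption; in pure Python this shows only the loop restructuring, not a speedup, since the benefit comes from cache reuse in compiled code.

```python
def matmul_naive(a, b):
    """Textbook triple loop: c[i][j] = sum over p of a[i][p] * b[p][j]."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same computation, iterated block by block so each block stays hot."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Inner loops touch only one tile of a, b, and c at a time.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        for p in range(p0, min(p0 + tile, k)):
                            c[i][j] += a[i][p] * b[p][j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
assert matmul_tiled(a, b) == matmul_naive(a, b)
print(matmul_tiled(a, b))  # [[58, 64], [139, 154]]
```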

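• Graph Optimization: optimize the computational graph as a whole rather than one operator at a time (for example, folding constant subexpressions or removing redundant nodes).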
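A small sketch of one graph-level optimization, constant folding, on a toy expression graph. The nested-tuple graph format and the `fold_constants` name are assumptions made for this example, not any framework's IR.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}


def fold_constants(node):
    """Replace every subgraph whose inputs are all constants by its value."""
    if not isinstance(node, tuple):
        return node  # leaf: a numeric constant or a variable name
    op, lhs, rhs = node
    lhs, rhs = fold_constants(lhs), fold_constants(rhs)
    if isinstance(lhs, (int, float)) and isinstance(rhs, (int, float)):
        return OPS[op](lhs, rhs)  # computed once at compile time
    return (op, lhs, rhs)


# (2 * 3) + (x * (1 + 1))
graph = ("add", ("mul", 2, 3), ("mul", "x", ("add", 1, 1)))
print(fold_constants(graph))  # ('add', 6, ('mul', 'x', 2))
```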

Traditionally, framework and hardware vendors hire optimization engineers who, based on their experience, come up with heuristics on how to best execute the computation graph of a model. For example, NVIDIA might have an engineer or a team of engineers who focuses exclusively on how to make ResNet-50 run really fast on their DGX A100 server. (This is also why you shouldn’t read too much into MLPerf’s results. A popular model running really fast on a type of hardware doesn’t mean an arbitrary model will run really fast on that hardware. It might just be that this model is over-optimized). (from A friendly introduction to machine learning compilers and optimizers)

Hand-designed Rules:

• non-optimal: unlikely to be the best way to compile

• non-adaptive: rules written for one framework or hardware target do not carry over to others

• labor intensive

Auto-Tune: setting torch.backends.cudnn.benchmark=True in PyTorch enables cuDNN autotune, which benchmarks the available convolution algorithms for the inputs it sees and remembers the fastest one for later calls.

cuDNN autotune, despite its effectiveness, only works for convolution operators and, AFAIK, is only exposed for PyTorch and MXNet.

A much more general solution is autoTVM, which is part of the open-source compiler stack TVM. autoTVM works with subgraphs instead of just an operator, so the search spaces it works with are much more complex. The way autoTVM works is quite complicated, but here is the gist:

1. It first breaks your computation graph into subgraphs.
2. It predicts how big each subgraph is.
3. It allocates time to search for the best possible path for each subgraph.
4. It stitches the best possible way to run each subgraph together to execute the entire graph.
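The four steps above can be sketched as a toy pipeline. Everything here is an assumption made for illustration: the fixed-size split, the proportional trial budget, and especially `toy_cost`, which stands in for real measurements on hardware. None of these names are TVM APIs.

```python
def split_into_subgraphs(ops, size=2):
    """Step 1: break the graph (here a flat op list) into subgraphs."""
    return [ops[i:i + size] for i in range(0, len(ops), size)]


def estimate_size(subgraph):
    """Step 2: predict how big a subgraph is (sum of per-op work)."""
    return sum(work for _, work in subgraph)


def allocate_trials(sizes, total_trials):
    """Step 3a: give bigger subgraphs a larger share of the search budget."""
    total = sum(sizes)
    return [max(1, total_trials * s // total) for s in sizes]


def toy_cost(subgraph, tile):
    """Hypothetical cost model: best when tile * tile matches the work."""
    return abs(tile * tile - estimate_size(subgraph))


def search(subgraph, trials, candidates=(1, 2, 4, 8, 16, 32)):
    """Step 3b: try as many candidate configs as the budget allows."""
    return min(candidates[:trials], key=lambda t: toy_cost(subgraph, t))


def tune(ops, total_trials=8):
    subgraphs = split_into_subgraphs(ops)
    sizes = [estimate_size(s) for s in subgraphs]
    budgets = allocate_trials(sizes, total_trials)
    # Step 4: stitch the per-subgraph winners into one plan for the graph.
    return [search(s, b) for s, b in zip(subgraphs, budgets)]


ops = [("conv", 256), ("relu", 16), ("dense", 64), ("softmax", 4)]
print(tune(ops))  # [16, 1]
```

The small second subgraph gets a budget of one trial and ends up with a poor configuration, which illustrates why the time allocation in step 3 matters.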

OctoML showed that the optimization found by autoTVM was almost 30% faster than the hand-designed optimization by Apple’s Core ML team on the M1. Of course, as the M1 matures and hand-designed optimization becomes more intensive, it will be hard for auto-optimization to beat hand-designed optimization.

A method that combines the two should do better still.

## TVM (Tensor Virtual Machine): An Automated End-to-End Optimizing Compiler for Deep Learning

Tianqi Chen was working on MXNet and realized that they couldn't rely solely on MKL-DNN or cuDNN to reach very good performance, so code generation (for example, for fused operators) was needed. The new approach would combine graph-level and operator-level optimization. Thierry Moreau wanted to use this system to target different specialized hardware architectures.

At the time, there were many possible approaches:

• build a backend for TensorFlow via XLA?

• build a backend for PyTorch via Glow or TorchScript?

• add another execution layer to ONNX Runtime?

### Dive into How TVM Works

Operator-Level Optimization: operators like matrix multiplication can be optimized via tiling and other methods

Automated Optimization Search: a learning-based search for the best execution sequence.

#### TVM Graph-level Optimization

Fusion: save DRAM accesses by fusing layers together.

TVM uses the IR graph to find operator fusion opportunities and automatically generates the fused code.

The programming language is Relay (a functional language), which implements common ML framework features such as quantization and shape inference as standard optimization passes.
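To give a feel for what a shape-inference pass does, here is a toy sketch over a tiny graph. It is not Relay's API; the node format `(name, op, inputs)` and the two supported ops are assumptions chosen to keep the example short.

```python
def infer_shapes(nodes, input_shapes):
    """Walk the graph in order and attach an output shape to every node."""
    shapes = dict(input_shapes)
    for name, op, inputs in nodes:
        if op == "matmul":
            (n, k), (k2, m) = shapes[inputs[0]], shapes[inputs[1]]
            if k != k2:
                raise ValueError(f"{name}: inner dimensions {k} and {k2} differ")
            shapes[name] = (n, m)
        elif op == "relu":
            shapes[name] = shapes[inputs[0]]  # elementwise: shape is unchanged
        else:
            raise NotImplementedError(op)
    return shapes


nodes = [
    ("h", "matmul", ["x", "w1"]),
    ("a", "relu", ["h"]),
    ("y", "matmul", ["a", "w2"]),
]
inputs = {"x": (32, 784), "w1": (784, 128), "w2": (128, 10)}
print(infer_shapes(nodes, inputs)["y"])  # (32, 10)
```

Running this as a pass before code generation means a shape mismatch is reported at compile time rather than at run time.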

#### Microcontroller Support

Constraints:

• limited OS

• limited SRAM, Flash

• floating point is very expensive, so integer arithmetic is used instead

• limited ML runtime and interpreter

• limited ML operator library

So they generate C code and pass it to another cross-platform compiler that is available (LLVM is not available on the target machine).

## Current Issue

There are two barriers to better optimization:

1. hand-crafted rules are not compatible with automatically generated rules
2. multi-stage lowering forces a fixed representation at each stage, which creates barriers between stages

• Unify: unify the multiple layers of abstraction

• Interact: interactive, open iteration

• Automate: integrate automatic optimization

## Other Compilers

There are other mainstream compilers:

• NVCC (NVIDIA CUDA Compiler): works only with CUDA. Closed-source.

• XLA (Accelerated Linear Algebra, Google): originally intended to speed up TensorFlow models, but has been adopted by JAX. Open-source as part of the TensorFlow repository.

• PyTorch Glow (Facebook): PyTorch has adopted XLA to enable PyTorch on TPUs, but for other hardware, it relies on PyTorch Glow. Open-source as part of the PyTorch repository.

Third-party compilers are, in general, very ambitious (e.g. you must be really confident to think that you can optimize for GPUs better than NVIDIA can).

Other third-party compilers:

• Apache TVM: works with a wide range of frameworks (including TensorFlow, MXNet, PyTorch, Keras, CNTK) and a wide range of hardware backends (including CPUs, server GPUs, ARMs, x86, mobile GPUs, and FPGA-based accelerators).

• MLIR: developed under the LLVM organization, it lets you build your own compiler. MLIR can run multiple IRs, including TVM’s IRs, as well as LLVM IR and TensorFlow graphs.

Another growing technology is WebAssembly (WASM).

If you want to run ML on the web:

• Emscripten: uses LLVM codegen (only compiles from C and C++ into WASM)

• TVM: the only active compiler that I know of that compiles from ML models into WASM.

Tip: If you decide to try out TVM, use their unofficial conda/pip command for fast installation instead of the instructions found on the Apache site. They only have a Discord server if you need help!

When your model is ready for deployment, it makes sense to try out different compilers to see which one gives you the best performance boost.
