Learning Resource: https://github.com/openmlsys/openmlsys-zh/issues/242
JAX:
ALPA
OpenAI Triton: lets you write GPU kernels at the Python level, but is poorly suited for targeting DSAs (domain-specific accelerators). (https://openai.com/blog/triton/)
Just-in-time compilation (JIT): also called dynamic translation or run-time compilation. It compiles code during execution of a program (at run time) rather than before execution (think Python rather than C++). It pays off when the speedup gained from compilation or recompilation outweighs the overhead of compiling that code.
TorchDynamo: a Python-level JIT compiler designed to make unmodified PyTorch programs faster by rewriting Python bytecode into an FX graph, which is then compiled by a customizable backend (see the torch.compile sketch after this list).
PyTorch:
Eager Mode: for fast prototyping
Script Mode: for production
AOT Autograd: https://www.youtube.com/watch?v=KpH0qPMGGTI&ab_channel=OctoML
TorchScript: a static, high-performance subset of the Python language https://towardsdatascience.com/pytorch-jit-and-torchscript-c2a77bac0fff
PyTorch JIT: an optimized compiler for PyTorch programs.
TorchInductor
Megatron-LM
NeMo-Megatron
DeepSpeed
JAX
Alpa: Automated Model-Parallel Deep Learning (https://ai.googleblog.com/2022/05/alpa-automated-model-parallel-deep.html)
OpenXLA/StableHLO
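A minimal sketch of the TorchDynamo/JIT idea above, using torch.compile (this assumes PyTorch 2.x, where TorchInductor is the default backend):

```python
import torch

def f(x):
    # a small pure-PyTorch function; TorchDynamo captures it as an FX graph
    return torch.sin(x) ** 2 + torch.cos(x) ** 2

# torch.compile uses TorchDynamo to rewrite the Python bytecode into an FX
# graph and hands that graph to a backend (TorchInductor by default).
compiled_f = torch.compile(f)

x = torch.randn(1000)
print(torch.allclose(compiled_f(x), f(x)))  # same result, potentially faster
```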
MLSys deals with the following problems:
Lowering: translate high-level framework code down to hardware-native code through one or more intermediate representations (IRs)
Optimizing: accelerate existing computation by modifying human-written code
Others
Some examples of possible optimization methods:
Fused Operator: operator fusion improves performance by merging one operator (typically an activation function) into another so that they execute together without a round trip to memory. For example, an activation function such as ReLU can be performed as part of the preceding operator (a convolution, say), so the GPU computes the activation without waiting for the preceding operator's result to be written to memory, which improves performance.
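A toy illustration of the idea in plain Python/NumPy (not a real GPU kernel, just a sketch of why fusion avoids the intermediate round trip):

```python
import numpy as np

x = np.random.randn(10_000).astype(np.float32)
w = 0.5

# Unfused: two loops ("kernels"), with an intermediate array written by the
# first operator and read back by the second (the ReLU).
tmp = np.empty_like(x)
for i in range(len(x)):          # operator 1: scale
    tmp[i] = x[i] * w
out_unfused = np.empty_like(x)
for i in range(len(x)):          # operator 2: ReLU
    out_unfused[i] = max(tmp[i], 0.0)

# Fused: one loop, no intermediate array round trip.
out_fused = np.empty_like(x)
for i in range(len(x)):
    out_fused[i] = max(x[i] * w, 0.0)

print(np.allclose(out_unfused, out_fused))
```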
Loop Transformation: ways to optimize for and while loops. There are many possible loop improvements (matrix tiling, vectorization, loop unrolling, etc.); see Wikipedia for more.
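A small sketch of two of these transformations (illustrative only; real compilers apply them to low-level IR, not Python source):

```python
import numpy as np

a = np.random.rand(1024)
b = np.random.rand(1024)

# Baseline loop: one element per iteration.
total = 0.0
for i in range(len(a)):
    total += a[i] * b[i]

# Loop unrolling: process 4 elements per iteration to cut loop overhead
# (assumes the length is divisible by 4 for simplicity).
total_unrolled = 0.0
for i in range(0, len(a), 4):
    total_unrolled += (a[i] * b[i] + a[i + 1] * b[i + 1]
                       + a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3])

# Vectorization: express the whole loop as one array operation so the
# backend can use SIMD instructions / optimized library code.
total_vectorized = float(np.dot(a, b))

print(np.allclose([total, total_unrolled], total_vectorized))
```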
Graph Optimization: optimize computational graph
Traditionally, framework and hardware vendors hire optimization engineers who, based on their experience, come up with heuristics on how to best execute the computation graph of a model. For example, NVIDIA might have an engineer or a team of engineers who focuses exclusively on how to make ResNet-50 run really fast on their DGX A100 server. (This is also why you shouldn’t read too much into MLPerf’s results. A popular model running really fast on a type of hardware doesn’t mean an arbitrary model will run really fast on that hardware. It might just be that this model is over-optimized). (from A friendly introduction to machine learning compilers and optimizers)
Hand-designed Rules:
non-optimal: unlikely to be the best way to compile
non-adaptive: can't run on all frameworks and hardware
labor intensive
Auto-Tune: torch.backends.cudnn.benchmark=True is an option in PyTorch that enables cuDNN autotuning, which benchmarks the available convolution algorithms and caches the fastest one for each input configuration it sees.
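For example (a minimal sketch; requires an NVIDIA GPU and fixed input shapes to benefit):

```python
import torch

# Let cuDNN benchmark the available convolution algorithms on the first call
# and cache the fastest one for each input shape it encounters.
torch.backends.cudnn.benchmark = True

model = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(32, 3, 224, 224, device="cuda")

with torch.no_grad():
    y = model(x)  # first call is slower (benchmarking); later calls reuse the winner
```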
cuDNN autotune, despite its effectiveness, only works for convolution operators and, AFAIK, is only exposed for PyTorch and MXNet.
A much more general solution is autoTVM, which is part of the open-source compiler stack TVM. autoTVM works with subgraphs instead of just an operator, so the search spaces it works with are much more complex. The way autoTVM works is quite complicated, but here is the gist: it breaks the computation graph into subgraphs, defines a search space of possible schedules for each subgraph (tilings, loop orders, etc.), then explores that space by measuring candidate schedules on the target hardware and using those measurements to train a cost model that steers the search toward faster candidates.
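This is not autoTVM itself, but a toy sketch of the core auto-tuning loop: define a search space (here, tile sizes for a blocked matmul), measure candidates on the actual hardware, and keep the fastest. Real autoTVM adds a learned cost model to decide which candidates are worth measuring.

```python
import time
import numpy as np

def blocked_matmul(A, B, tile):
    """Blocked (tiled) matmul; the tile size is the tunable knob."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for k in range(0, n, tile):
            for j in range(0, n, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

n = 512
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)

best = None
for tile in (32, 64, 128, 256):            # the search space
    start = time.perf_counter()
    blocked_matmul(A, B, tile)
    cost = time.perf_counter() - start     # real measurement on this machine
    if best is None or cost < best[1]:
        best = (tile, cost)

print("fastest tile size:", best[0])
```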
OctoML showed that the optimization made by autoTVM is almost 30% faster than the hand-designed optimization by Apple's Core ML team. Of course, as the M1 matures and hand-designed optimization for it becomes more intensive, it will be hard for auto-optimization to beat hand-designed optimization.
But a combined approach would be even better.
Tianqi Chen was working on MXNet and realized that they couldn't rely solely on MKL-DNN or cuDNN to reach very good performance, so code generation (such as code fusion) was needed. A new approach would combine graph-level and operator-level optimization. Thierry Moreau wanted to use this system to target different specialized hardware architectures. At the time, there were many possible approaches:
- build a backend for TensorFlow via XLA?
- build a backend for PyTorch via Glow or TorchScript?
- add another execution layer to ONNX Runtime?
Operator-Level Optimization: operators like matrix multiplication can be optimized via tiling and other methods
Automated Optimization Search: a learned way to search for best execution sequence.
Fusion: save DRAM accesses by fusing layers together.
Using the IR graph to optimize for operator fusion (automatically generating fused code)
The programming language is Relay (functional), which implements common features of ML frameworks, like quantization and shape inference, as standard optimization passes.
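A sketch of what that looks like with TVM's Relay API (assumes a TVM build that still ships Relay; exact API details may differ across versions):

```python
import tvm
from tvm import relay

# Build a tiny conv2d + ReLU graph in Relay.
x = relay.var("x", shape=(1, 64, 56, 56), dtype="float32")
w = relay.var("w", shape=(64, 64, 3, 3), dtype="float32")
y = relay.nn.relu(relay.nn.conv2d(x, w, padding=(1, 1)))
mod = tvm.IRModule.from_expr(relay.Function([x, w], y))

# Standard optimization passes: infer types/shapes, then fuse operators.
mod = relay.transform.InferType()(mod)
mod = relay.transform.FuseOps(fuse_opt_level=2)(mod)
print(mod)  # conv2d and relu now live in one fused primitive function
```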
Conditions on the target devices (embedded/edge hardware):
limited OS
limited SRAM, Flash
floating point is very expensive, use integer instead
limited ML runtime and interpreter
limited ML operator library
So they generate C code and pass it to whatever cross-platform compiler is available (LLVM may not be available on the target machine).
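A toy sketch of that idea (not TVM's actual codegen): emit portable C for a single operator, then hand it to whatever C compiler exists for the device.

```python
def emit_c_vector_add(name: str, length: int) -> str:
    """Emit plain C for an element-wise add that any on-device C compiler can build."""
    return (
        f"void {name}(const float* a, const float* b, float* out) {{\n"
        f"    for (int i = 0; i < {length}; ++i) {{\n"
        f"        out[i] = a[i] + b[i];\n"
        f"    }}\n"
        f"}}\n"
    )

print(emit_c_vector_add("vector_add_128", 128))
```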
There are two barriers to better optimization:
To summarize, the overall technical roadmap involves three key points:
Unify: unify abstractions across multiple layers
Interact: open, interactive iteration
Automate: integrate automatic optimization
There are other mainstream compilers:
NVCC (NVIDIA CUDA Compiler): works only with CUDA. Closed-source.
XLA (Accelerated Linear Algebra, Google): originally intended to speed up TensorFlow models, but has been adopted by JAX. Open-source as part of the TensorFlow repository.
PyTorch Glow (Facebook): PyTorch has adopted XLA to enable PyTorch on TPUs, but for other hardware, it relies on PyTorch Glow. Open-source as part of the PyTorch repository.
Third-party compilers are, in general, very ambitious (e.g. you must be really confident to think that you can optimize for GPUs better than NVIDIA can).
Other third-party compilers:
Apache TVM: works with a wide range of frameworks (including TensorFlow, MXNet, PyTorch, Keras, CNTK) and a wide range of hardware backends (including CPUs, server GPUs, ARMs, x86, mobile GPUs, and FPGA-based accelerators).
MLIR: under the LLVM organization; it lets you build your own compiler. MLIR can work with multiple IRs, including TVM's IRs, as well as LLVM IR and TensorFlow graphs.
Another growing technology is WebAssembly (WASM).
If you want ML to run on the Web:
Emscripten: uses LLVM codegen (only compiles from C and C++ into WASM)
TVM: the only active compiler that I know of that compiles from ML models into WASM.
Tip: If you decide to try out TVM, use their unofficial conda/pip command for fast installation instead of the instructions found on the Apache site. They only have a Discord server if you need help!
When your model is ready for deployment, it makes sense to try out different compilers to see which one gives you the best performance boost.