Introduction to Machine Learning Systems
- The speaker, Horace He, works at Meta on the PyTorch team, at the intersection of compilers and machine learning.
- Key contributions include torch.compile and Flex Attention, which enhance model performance and facilitate research without leaving Python.
"If you've used things like torch.compile, which is a thing in PyTorch that makes your model go like two to four times faster in just one line of code, or if you've used Flex Attention, which lets researchers design fast calculations for attention without leaving the world of Python, he's the person responsible for both of those things."
- Horace He has significantly impacted the field by developing tools that optimize machine learning models and streamline research processes.
Infrastructure and Compute Growth
- The current era is marked by unprecedented infrastructure build-out, with significant investments in data centers and GPUs.
- The scale of operations has grown, requiring massive computational power for machine learning tasks.
"Basically every month, we see a different headline about a new nuclear power plant from Microsoft or a massive 300K GPU data cluster from XAI."
- The industry's growth is driven by the need for extensive computational resources, reflecting in large-scale investments and infrastructure developments.
Floating Point Operations and Model Complexity
- The focus is on achieving high floating point throughput, with frontier models requiring on the order of 1e26 such operations.
- There's a shift in the understanding of machine learning tasks, emphasizing the simplicity of model logic against the complexity of infrastructure.
"The leading-edge models are currently trained with about 1826 floating point operations, which is approximately 100 trillion trillion floating point operations."
- The complexity lies in optimizing these operations rather than the models themselves, highlighting the infrastructure's critical role.
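For a rough sense of scale, the sketch below applies the common ~6 × parameters × tokens estimate for training FLOPs; the specific parameter and token counts are illustrative assumptions, not figures from the talk.

```python
# Back-of-the-envelope training-FLOP estimate using the common ~6 * N * D rule
# of thumb (N = parameters, D = training tokens). Numbers are illustrative.
params = 1e12    # hypothetical 1-trillion-parameter model
tokens = 15e12   # hypothetical 15-trillion-token dataset

flops = 6 * params * tokens
print(f"{flops:.1e} FLOPs")  # ~9.0e+25, i.e. on the order of 1e26
```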
Evolution of Machine Learning Frameworks
- Machine learning frameworks have evolved from declarative to imperative models, improving usability and performance.
- PyTorch's success is attributed to its simple execution model, allowing easy infrastructure development.
"PyTorch was very successful, and I think it's worth talking about why eager execution was so successful."
- The transition from complex to straightforward execution models has facilitated widespread adoption and innovation in the field.
Programming Models and Optimization
- Two approaches to performance improvement: optimization and programming models.
- Optimization focuses on enhancing compiler efficiency, while programming models provide tools for user-driven performance improvements.
"One of the ways generally is basically like optimization... The other way, though, is kind of like programming models."
- The balance between these approaches determines the efficiency and effectiveness of machine learning systems.
- Machine learning models are simple, but performance expectations are high, requiring efficient utilization of computational resources.
- The field's consolidation has led to standardized architectures and fewer players training state-of-the-art models.
"ML models are extremely simple, and as a result, we have very high expectations from the performance of the code."
- The simplicity of models contrasts with the high demands for optimization, driving innovation in computational efficiency.
Impact of Python and PyTorch
- Python's simplicity and growability have made it a preferred language for machine learning, enabling easy infrastructure development.
- PyTorch's design mirrors Python's simplicity, fostering a conducive environment for building and optimizing machine learning frameworks.
"Python is an exceedingly simple language, and so it's also a very growable language."
- The alignment of PyTorch with Python's principles has been instrumental in its success, allowing seamless integration and performance optimization.
Conclusion
- The talk underscores the intricate balance between model simplicity and infrastructure complexity in machine learning.
- The evolution of frameworks and programming models reflects a broader trend toward optimizing computational efficiency while maintaining usability.
"Although the overall problems are extremely simple, the corresponding difficulty just goes into making your models hit this very high performance barrier."
- The ongoing advancements in machine learning infrastructure and frameworks aim to bridge the gap between simplicity and performance, driving the field forward.
CPU and GPU Work Scheduling
- The CPU is responsible for scheduling work on the GPU's work queue, while the GPU executes tasks from this queue.
- As long as the CPU can enqueue work faster than the GPU consumes it, Python's per-operation overhead is hidden, so eager execution in Python can be effectively as fast as the same code driven from C++.
- Eager execution simplifies the programming model for users without sacrificing speed.
"As long as Grommet is able to put down the train tracks faster than the train actually rolls along the train tracks, you can actually kind of view Python as having zero overhead."
- This metaphor illustrates how Python can efficiently manage tasks without adding extra processing time compared to more efficient languages.
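A minimal sketch of this asynchrony in PyTorch (assuming a CUDA GPU is available): eager calls only enqueue kernels on the GPU's stream, so the Python line returns before the work finishes, which is exactly the laying-track-ahead-of-the-train picture.

```python
import torch

x = torch.randn(8192, 8192, device="cuda")
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for _ in range(10):
    y = x @ x            # returns almost immediately; the kernel runs asynchronously
end.record()

torch.cuda.synchronize()  # wait for the queued kernels to actually finish
print(f"GPU time for 10 matmuls: {start.elapsed_time(end):.1f} ms")
```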
Introduction of Tensor Cores by Nvidia
- In 2017, Nvidia introduced tensor cores, specialized hardware units on GPUs designed for matrix multiplications.
- Tensor cores significantly accelerated matrix multiplications, creating a performance gap between matrix operations and other tasks on the GPU.
- This led to the development of machine learning (ML) compilers to optimize non-matrix operations.
"Nvidia realized that deep learning was a big deal because all of a sudden you had this massive 10x gap between how fast matrix multiplications were on the GPU and how fast literally anything else you wanted to run on the GPU was."
- The introduction of tensor cores marked a shift in GPU optimization, highlighting the importance of matrix multiplications in deep learning.
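A hedged micro-benchmark sketch of that gap (assumes an Ampere-or-newer GPU; exact ratios vary by hardware): the same matrix multiplication run in plain fp32 versus bf16, where it can take the tensor-core path.

```python
import time
import torch

def bench_matmul(dtype, n=4096, iters=20):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    secs = (time.perf_counter() - t0) / iters
    return 2 * n**3 / secs / 1e12  # achieved TFLOP/s

print("fp32:", bench_matmul(torch.float32))
print("bf16:", bench_matmul(torch.bfloat16))  # tensor-core path, typically much faster
```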
Evolution of Machine Learning Frameworks
- ML frameworks evolved from graph-based execution models to imperative eager programming models.
- Modern ML frameworks now capture programming models into graphs for optimization, using techniques like JIT compilation.
- This evolution has improved the efficiency and performance of ML tasks on GPUs.
"Modern ML frameworks like pretty much all ML frameworks nowadays use an imperative eager programming model but almost all ML frameworks now also have some way to capture this program model into a graph of some kind so they can perform optimizations."
- The shift to capturing programming models in graphs allows for more sophisticated optimizations, enhancing performance.
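A minimal sketch of that capture step: the same eager function, handed to torch.compile, which traces it into a graph and JIT-compiles an optimized version (the function and shapes here are arbitrary examples).

```python
import torch

def mlp_block(x, w):
    # Ordinary eager PyTorch code: nothing graph-specific in how it is written.
    return torch.nn.functional.gelu(x @ w)

compiled_block = torch.compile(mlp_block)  # captures a graph and compiles it

x = torch.randn(1024, 1024, device="cuda")
w = torch.randn(1024, 1024, device="cuda")
out = compiled_block(x, w)  # first call triggers compilation; later calls reuse it
```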
- ML compilers optimize GPU performance by focusing on compute, memory, and overhead.
- Compute involves floating-point operations, primarily matrix multiplications, which are crucial for maximizing GPU utilization.
- Memory optimization focuses on reducing data transfer times within the GPU.
"In reality, actually nowadays all runtime is either like matrix multiplications or essentially shuffling data."
- Matrix multiplications dominate GPU compute tasks, making their optimization critical for performance.
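A back-of-the-envelope illustration of why that split matters (illustrative sizes, bf16 assumed): a square matmul performs on the order of a thousand FLOPs per byte it moves, while a pointwise op performs well under one, so the former is compute-bound and the latter is memory-bound.

```python
# Arithmetic intensity (FLOPs per byte moved) for a square matmul vs a pointwise add.
N = 4096
bytes_per_elem = 2  # bf16

matmul_flops = 2 * N**3
matmul_bytes = 3 * N * N * bytes_per_elem  # two inputs + one output
add_flops = N * N
add_bytes = 3 * N * N * bytes_per_elem

print("matmul FLOPs/byte:", matmul_flops / matmul_bytes)  # ~1365: compute-bound
print("add    FLOPs/byte:", add_flops / add_bytes)        # ~0.17: memory-bound
```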
Impact of Matrix Content on Performance
- The contents of matrices can affect matrix multiplication performance due to dynamic power usage in GPUs.
- Different data patterns can lead to variations in power consumption and performance.
- Understanding these effects is crucial for optimizing GPU tasks.
"There's this tweet from some of you guys might know a long time ago that really I think when I first saw this I actually was very much reminded of this tweet where you know I thought I knew how a GPU worked but then I was very confused about what could possibly be causing these performance differences."
- The anecdote highlights the unexpected ways in which data patterns can impact GPU performance.
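A hedged sketch of the experiment behind that confusion: the same matmul run on random data versus all zeros. On some GPUs the all-zeros case is measurably faster because flipping fewer bits draws less dynamic power and the clocks can boost higher; the size of the effect varies by hardware and settings.

```python
import time
import torch

def bench(a, b, iters=50):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3  # ms per matmul

N = 8192
rand_a, rand_b = torch.randn(N, N, device="cuda"), torch.randn(N, N, device="cuda")
zero_a, zero_b = torch.zeros(N, N, device="cuda"), torch.zeros(N, N, device="cuda")

print(f"random data: {bench(rand_a, rand_b):.2f} ms")
print(f"all zeros  : {bench(zero_a, zero_b):.2f} ms")
```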
Memory Bandwidth and Operator Fusion
- Memory bandwidth cost is a significant factor in GPU performance, involving data transfer between VRAM and compute units.
- Operator fusion minimizes memory movement by combining multiple operations into a single GPU kernel.
- This optimization is essential for efficient deep learning compilers.
"A very common operation for GPU compilers to do and I'd say what I call the most important optimization in a deep learning compiler by far is called operator fusion."
- Operator fusion reduces unnecessary data transfers, improving both performance and resource utilization.
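A small sketch of what fusion buys: written naively, each pointwise op below reads and writes the whole tensor in GPU memory; fused (here via torch.compile, as one way to get fusion), the three ops become a single kernel that passes over the data once.

```python
import torch

def unfused(x):
    y = x.cos()      # kernel 1: read x, write y
    z = y.sin()      # kernel 2: read y, write z
    return z.relu()  # kernel 3: read z, write the result

fused = torch.compile(unfused)  # the compiler fuses the three pointwise ops

x = torch.randn(100_000_000, device="cuda")
out = fused(x)  # one kernel: roughly one read and one write instead of three of each
```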
Recomputations vs. Reuse Trade-offs
- Decisions on recomputing values versus reusing stored data can impact memory usage and performance.
- Recomputations can reduce memory accesses and improve runtime performance.
- These trade-offs are crucial in optimizing deep learning models.
"By doing some amount of recomputation, you can significantly reduce your number of memory accesses."
- Recomputations can lead to more efficient memory usage and faster execution times, highlighting their importance in ML optimization.
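One concrete instance of this trade-off that PyTorch exposes directly is activation checkpointing; the sketch below (an illustrative block, not code from the talk) recomputes a block during the backward pass instead of keeping its intermediates alive.

```python
import torch
from torch.utils.checkpoint import checkpoint

# Recompute-vs-reuse: do not store the block's intermediates for backward;
# re-run the block during backward instead, trading extra FLOPs for less memory.
block = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda()

x = torch.randn(64, 4096, device="cuda", requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # intermediates are not saved
y.sum().backward()                             # block is re-executed here to rebuild them
```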
Machine Learning Program Structure
- Machine learning models require both a forward and a backward pass, a structure unlike most typical programs.
- During the forward pass, intermediate activations are saved, which are crucial for the backward pass to compute gradients.
- The storage of these intermediates can lead to out-of-memory errors due to their size and volume.
"In machine learning, the typical model will first run the model forwards from layer zero to layer four and then need to run what's called the backwards pass, which will run the layers in reverse."
- The need to save intermediates is a direct consequence of using backpropagation and gradient descent in machine learning models.
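A small sketch of that memory pressure (sizes are arbitrary): the same stack of layers run with autograd, which keeps every layer's input alive for the backward pass, versus under torch.no_grad(), where intermediates are freed as soon as the next layer has consumed them.

```python
import torch

layers = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(8)]).cuda()
x = torch.randn(1024, 4096, device="cuda")

torch.cuda.reset_peak_memory_stats()
y = layers(x)  # activations for all 8 layers are saved for backward
print("with autograd:", torch.cuda.max_memory_allocated() // 2**20, "MiB peak")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    y = layers(x)  # nothing is saved; intermediates are freed as we go
print("no_grad      :", torch.cuda.max_memory_allocated() // 2**20, "MiB peak")
```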
Overhead in Machine Learning
- Overhead in machine learning refers to time that is neither compute nor memory transfer, typically when the GPU sits idle waiting for the CPU to launch the next operation.
- If the CPU cannot keep up with scheduling operations, the GPU remains idle, leading to inefficiencies.
- Techniques such as CUDA Graphs and compiler optimizations can help reduce overhead.
"If a programmer is not able to put down the train tracks faster than the train can go on the train tracks, then sometimes a train is going to be just stuck waiting."
- CUDA Graphs, an Nvidia-provided API, is one method to address overhead: it cuts per-kernel CPU launch cost by replaying a pre-recorded sequence of kernels.
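A hedged sketch of the CUDA Graphs pattern (following the shape of PyTorch's documented capture-and-replay recipe; the model here is a stand-in): record a fixed sequence of kernels once, then replay the whole sequence with a single launch instead of re-dispatching every op from Python each step.

```python
import torch

model = torch.nn.Sequential(*[torch.nn.Linear(256, 256) for _ in range(20)]).cuda()
static_in = torch.randn(32, 256, device="cuda")

# Warm up on a side stream before capture, as the CUDA Graphs recipe requires.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_in)  # kernels are recorded into the graph

static_in.copy_(torch.randn(32, 256, device="cuda"))  # update inputs in place
g.replay()  # one cheap launch replays all 20 layers; the result lands in static_out
```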
Role and Challenges of Compilers
- Compilers optimize code once, allowing multiple users to benefit from these optimizations.
- Despite their potential, many practitioners do not use compilers due to challenges and unpredictability in compiler optimizations.
"In practice, a lot of times people do not use compilers for this reason."
- Compiler optimizations can be unpredictable, lack documentation, and may change with updates, leading to instability in code performance.
Limitations of Compiler Optimizations
- Compiler optimizations may not always work or may not be well-documented, creating challenges for developers.
- The unpredictability of when optimizations apply can make it difficult to rely on them for performance improvements.
"Compiler optimizations don't always work, and there's no real documentation on when a compiler optimization will trigger."
- Developers need a deep understanding of both the compiler and the specific optimizations to ensure desired outcomes.
Auto Vectorization and Programming Models
- Auto vectorization is a compiler technique that can fail, requiring programmers to understand the compiler deeply.
- A reliable programming model should allow users to understand optimizations without needing to delve into compiler internals.
"The problem with an auto vectorizer is that as long as vectorizer can fail and it will, then if you're a programmer that actually cares about what code the compiler generates, you need to deeply understand the compiler."
- A consistent programming model integrates optimizations as inherent features rather than optional enhancements.
Numerical Precision and Compiler Challenges
- Machine learning often deals with low precision data types, leading to numerical instability.
- Operations like FMA (Fused Multiply-Add) can introduce precision issues, especially when compilers optimize differently for each branch of a computation.
"Numerical properties relied on two separate computations of the same value to be exactly identical, and in this case with FMAs can apply them to only one branch but not the other."
- Precision problems can result in significant errors, such as NaNs, when compilers apply optimizations like FMAs.
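A tiny sketch of the failure mode (uses math.fma, available from Python 3.13; the values are arbitrary): with an FMA on one branch and a plain multiply-add on the other, two computations of "the same" quantity no longer produce identical bits.

```python
import math  # math.fma requires Python 3.13+

a, b = 1.0 / 3.0, 3.0
plain = a * b - a * b             # both branches rounded identically: exactly 0.0
mixed = math.fma(a, b, -(a * b))  # FMA keeps the unrounded product on one branch
print(plain, mixed)               # 0.0 vs a tiny nonzero residual

# If downstream code assumed the two branches were bit-identical (say it took a
# sqrt or log of their difference), that tiny residual is how NaNs sneak in.
```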
Algebraic Rewrites and Compiler Limitations
- Algebraic rewrites in machine learning, such as optimizing softmax operations, require mathematical transformations that compilers struggle to perform automatically.
- Flash attention is an example where manual intervention is needed to fuse operations for efficiency.
"Compilers are generally very bad at math, and so it's kind of difficult for them to actually come up with this rewrite by themselves."
- Providing a higher-level API that consolidates multiple operations into one can simplify the programming model but may limit flexibility.
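The sketch below shows the flavor of rewrite involved, in the simplest setting: an "online" softmax normalizer that processes scores chunk by chunk while carrying a running max and running sum, which is the identity flash attention leans on to avoid materializing the full score matrix. This is an illustrative reconstruction, not code from the talk.

```python
import torch

def online_logsumexp(chunks):
    m = torch.tensor(float("-inf"))  # running max
    s = torch.tensor(0.0)            # running sum of exp(x - m)
    for x in chunks:
        new_m = torch.maximum(m, x.max())
        s = s * torch.exp(m - new_m) + torch.exp(x - new_m).sum()
        m = new_m
    return m + s.log()  # log-sum-exp, i.e. the softmax denominator in log space

x = torch.randn(1024)
print(torch.allclose(online_logsumexp(x.chunk(8)), torch.logsumexp(x, dim=0)))  # True
```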
API Design and User Challenges
- Designing APIs for machine learning operations involves balancing between simplicity and flexibility to accommodate various user needs.
- Consolidating operations into a single API can lead to issues when new variants or extensions are required.
"With attention, we do see that people kept on coming out with new attention variants, and as a result, the kind of fused attention kernels that people have keep on accumulating new quirks."
- Once a feature or quirk is added to an API, it becomes challenging to remove, as users may rely on it, complicating future development.
Challenges in Compiler Design and Programming Models
- A single monolithic operator is not always sufficient for complex programming needs.
- Compilers face challenges in generating code from scratch and in allowing easy modifications by users.
- New programming models can offer solutions beyond traditional compiler and API limitations.
"A single monolithic operator is not actually always sufficient, and so we're in a situation where compilers like we don't really want to rely on a compiler to generate from scratch, but it's also painful to do modifications by hand."
- This quote highlights the limitations of relying solely on compilers or monolithic APIs for programming solutions.
Introduction of Flex Attention API
- Flex Attention API provides a new programming model that simplifies the implementation of complex attention mechanisms.
- It guarantees a single fused attention kernel with consistent memory properties, regardless of the attention variant used.
- Users can experiment with various attention variants without needing to understand the underlying API implementation.
"One of the reasons that Flex Attention is so liked, even though it relies on Torch compil to work, is that it's guaranteed to always result in a single fused attention kernel and it's always guaranteed to have the same memory properties as a fuse attention kernel."
- The Flex Attention API is praised for its reliability and ease of use, allowing users to focus on programming rather than implementation details.
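A hedged sketch of the programming model (the flex_attention API ships in recent PyTorch releases; the particular score_mod here is just an illustrative distance-penalty bias): the user writes a few lines of Python, and the library remains responsible for producing one fused attention kernel.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def distance_bias(score, b, h, q_idx, kv_idx):
    # Example attention variant: penalize scores by query/key distance (ALiBi-like).
    return score - (q_idx - kv_idx).abs() * 0.1

B, H, S, D = 2, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda") for _ in range(3))
out = flex_attention(q, k, v, score_mod=distance_bias)
```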
User Innovation with Flex Attention
- Users have creatively applied Flex Attention to unique scenarios, such as converting molecular graphs into attention masks.
- The flexibility of the API allows users to explore applications beyond the original intentions of the developers.
"This is like a very weird mask that you know I definitely did not think about when developing Flex Attention, but when you've developed a good programming model for users, users are oftentimes able to do things that you never considered doing when you develop the abstraction."
- The adaptability of Flex Attention empowers users to innovate and apply the API in novel ways.
Programming Models vs. Optimizations
- Programming models can be likened to mental models that help solve problems more efficiently than brute force optimizations.
- The inherent parallelism in CUDA programming models contrasts with the challenges of autovectorization in Intel compilers.
"If you've written your code in CUDA and in the CUDA programming model, it must be parallel. It's kind of an axiomatic part of the program model."
- The quote emphasizes the built-in parallelism of CUDA, showcasing the importance of choosing the right programming model for specific tasks.
Differences in Distributed ML and Traditional Systems
- ML systems prioritize a single query with frequent global synchronization, unlike traditional systems that handle multiple small queries.
- Performance is critical in ML systems, making it intolerable to lose performance for fault tolerance or uptime.
"In ML systems, we just basically have like a single query, like you're just training your ML job. We have frequent global synchronization across all of our hardware, and performance is extremely important."
- This distinction underscores the unique demands of ML systems compared to traditional distributed systems.
Parallelism in ML Systems
- Different types of parallelism (data, tensor, and pipeline) are used to optimize ML workloads.
- Each type of parallelism addresses specific challenges, such as synchronization and communication-computation overlap.
"For data parallelism, the basic idea is pretty simple: we have a bunch of different tasks, and each GPU just handles a different task."
- Data parallelism splits each batch of data across GPUs, but requires gradient synchronization to keep the replicated model converging as one.
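A simplified single-process sketch of data parallelism (real training would use torch.distributed and an actual all-reduce; this only mimics the structure): each "GPU" processes its own slice of the batch, and the resulting gradients are averaged so every replica takes the same update.

```python
import torch

model = torch.nn.Linear(16, 1)
replicas = [torch.nn.Linear(16, 1) for _ in range(4)]
for r in replicas:
    r.load_state_dict(model.state_dict())  # all replicas start identical

batch, target = torch.randn(32, 16), torch.randn(32, 1)
for r, xb, yb in zip(replicas, batch.chunk(4), target.chunk(4)):
    torch.nn.functional.mse_loss(r(xb), yb).backward()  # local gradients per "GPU"

# The synchronization step ("all-reduce"): average gradients across replicas so
# a single shared update keeps the copies of the model in lockstep.
for name, p in model.named_parameters():
    p.grad = torch.stack(
        [dict(r.named_parameters())[name].grad for r in replicas]
    ).mean(0)
```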
Innovations in Parallelism
- Innovations often expand the search space of parallelization rather than optimizing within existing constraints.
- Compilers struggle to keep up with these innovations due to their limited adaptability.
"Most of the innovation in parallelism doesn't come from searching within your existing search space; it usually comes from expanding your search space along a new dimension."
- This highlights the need for flexible systems that can adapt to new parallelization strategies.
Fault Tolerance in Large-Scale ML Training
- Large-scale GPU training introduces challenges in fault tolerance due to frequent failures at high scales.
- Strategies to address fault tolerance are crucial for maintaining training efficiency and reliability.
"When you're training on like 16,000 GPU hours, you're only getting an error once every 1.8 hours, but now if you scale it up to 131,000 GPUs, you now get a failure like every 15 minutes."
- The quote illustrates the exponential increase in failure rates with scale, emphasizing the importance of robust fault tolerance mechanisms.
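The quoted numbers follow from a simple linear scaling argument, sketched below using the 1.8-hours-at-16K-GPUs figure from the talk as the anchor.

```python
# If each GPU fails independently at a fixed rate, the job's mean time between
# failures shrinks linearly with the number of GPUs.
per_gpu_mtbf_hours = 1.8 * 16_000  # implied by "one error every 1.8 hours on 16K GPUs"

for n_gpus in (16_000, 131_000):
    minutes = per_gpu_mtbf_hours / n_gpus * 60
    print(f"{n_gpus} GPUs -> a failure roughly every {minutes:.0f} minutes")
# ~108 minutes at 16K GPUs, ~13 minutes at 131K (the talk rounds this to ~15).
```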
Conclusion: The Future of ML and Compilers
- ML has become a dominant computational workload, necessitating new approaches to system and compiler design.
- The focus should be on developing programming models that enable user-driven optimizations and innovation.
"To me, the kind of most interesting question about building systems here isn't really about building the right optimizations for systems; it's instead about coming up with the right programming models."
- The future of ML systems lies in creating flexible programming models that adapt to evolving workloads and user needs.