Async/Await on the GPU
At VectorWare, we are building the first
GPU-native software company. Today, we are excited to
announce that we can successfully use Rust's
Future trait and
async/await on the GPU. This milestone marks a significant step towards our vision
of enabling developers to write complex, high-performance applications that leverage the
full power of GPU hardware using familiar Rust abstractions.
GPU programming traditionally focuses on data parallelism. A developer writes a single
operation and the GPU runs that operation in parallel across different parts of the
data.
This model works well for standalone and uniform tasks such as graphics rendering,
matrix multiplication, and image processing.
As GPU programs grow more sophisticated, developers use warp
specialization to introduce more complex
control flow and dynamic behavior. With warp specialization, different parts of the GPU
run different parts of the program concurrently.
Warp specialization shifts GPU logic from uniform data parallelism to explicit
task-based parallelism. This enables more sophisticated programs that make better use of
the hardware. For example, one warp can load data from memory while another performs
computations to improve utilization of both compute and memory.
This added expressiveness comes at a cost. Developers must manually manage concurrency
and synchronization because there is no language or runtime support for doing so.
Similar to threading and synchronization on the CPU, this is error-prone and difficult
to reason about.
There are many projects that aim to provide the benefits of warp specialization without
the pain of manual concurrency and synchronization.
JAX models GPU programs as computation graphs that encode
dependencies between operations. The JAX compiler analyzes this graph to
determine ordering, parallelism, and placement before generating the program that
executes. This allows JAX to manage and optimize execution while presenting a high-level
programming model in a Python-based DSL. The same model supports multiple hardware
backends, including CPUs and TPUs, without changing user code.
Triton expresses computation in terms of blocks
that execute independently on the GPU. Like JAX, Triton uses a Python-based DSL to
define how these blocks should execute. The Triton compiler lowers block definitions
through a multi-level
pipeline of MLIR
dialects, where it applies
block-level data-flow analysis to manage and optimize the generated program.
More recently, NVIDIA introduced CUDA Tile.
Like Triton, CUDA Tile organizes computation around blocks. It additionally introduces
"tiles" as first-class units of data. Tiles make data dependencies explicit rather than
inferred, which improves both performance opportunities and reasoning about correctness.
CUDA Tile ingests code written in existing languages such as Python, lowers it to an
MLIR dialect called Tile IR, and executes on the
GPU.
We are excited and inspired by these efforts, especially CUDA Tile. We think it is a
great idea to have GPU programs structured around explicit units of work and data,
separating the definition of concurrency from its execution. We believe that GPU
hardware aligns naturally with structured
concurrency and changing the
software to match will enable safer and more performant code.
These higher-level approaches to GPU programming require developers to structure code in
new and specific ways. This can make them a poor fit for some classes of applications.
Additionally, a new programming paradigm and ecosystem is a significant barrier to
adoption. Developers use JAX and Triton primarily for machine learning workloads where they
align well with the underlying computation. CUDA Tile is newer and more general but has
yet to see broader adoption. Virtually no one writes their entire application with these
technologies. Instead, they write parts of their application in these frameworks and
other parts in more traditional languages and models.
Code reuse is also limited. Existing CPU libraries assume a conventional language
runtime and execution model and cannot be reused directly. Existing GPU libraries rely
on manual concurrency management and similarly do not compose with these frameworks.
Ideally, we want an abstraction that captures the benefits of explicit and structured
concurrency without requiring a new language or ecosystem. It should compose with
existing CPU code and execution models. It should provide fine-grained control when
needed, similar to warp specialization. It should also provide ergonomic defaults for the
common case.
We believe Rust's Future
trait and async/await provide such an abstraction. They encode structured
concurrency directly in an existing language without committing to a specific execution
model.
A future represents a computation that may not be complete yet. A future does not
specify whether it runs on a thread, a core, a block, a tile, or a warp. It does not
care about the hardware or operating system it runs on. The Future
trait itself is intentionally
minimal. Its core operation is
poll, which
returns either
Ready or
Pending.
Everything else is layered on top. This separation is what allows the same async code to
be driven in different environments. For a more detailed treatment, see the Rust async
book.
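For reference, the trait as it appears in core is essentially the following (attributes omitted):

```rust
use core::pin::Pin;
use core::task::{Context, Poll};

pub trait Future {
    /// The value produced when the future completes.
    type Output;

    /// Attempt to make progress, returning `Poll::Ready(output)` when the
    /// computation is finished or `Poll::Pending` if it must be polled again later.
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
}
```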
Like JAX's computation graphs, futures are deferred and composable. Developers construct programs as values before executing them.
This allows the compiler to analyze dependencies and composition ahead of execution
while preserving the shape of user code.
Like Triton's blocks, futures naturally express independent units of concurrency.
How futures are combined determines whether a block of work runs
serially or in parallel. Developers express concurrency using normal Rust control flow,
trait implementations, and future combinators rather than a separate DSL.
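As a small illustration (the function names below are made up, and the futures crate's join! stands in for any combinator): awaiting one future before constructing the next makes the work serial, while handing both to a combinator lets them make progress concurrently.

```rust
use futures::join; // combinator from the third-party futures crate

async fn load(idx: u32) -> u32 { idx * 2 }   // hypothetical unit of work
async fn compute(x: u32) -> u32 { x + 1 }    // hypothetical follow-up step

// Serial: the second future is not even constructed until the first completes.
async fn pipeline() -> u32 {
    let loaded = load(0).await;
    compute(loaded).await
}

// Concurrent: both futures are polled by the combinator, interleaving their progress.
async fn fan_out() -> (u32, u32) {
    join!(load(0), load(1))
}
```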
Like CUDA Tile's explicit tiles and data dependencies, Rust's ownership model makes data
constraints explicit in the program structure. Futures capture the data they operate on and that captured
state becomes part of the compiler-generated state machine. Ownership, borrowing,
Pin, and bounds such as
Send and
Sync encode how data can be
shared and transferred between concurrent units of work.
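As a rough sketch of how these bounds surface in practice (the spawn signature below is hypothetical, not an API we are announcing), the Send bound is a statement about the data the future has captured:

```rust
use core::future::Future;

// Hypothetical spawn signature: the bounds state, in the type system, how the
// future's captured data may be moved between concurrent units of work.
fn spawn<F>(fut: F)
where
    F: Future<Output = ()> + Send + 'static, // captured state must be safe to move
{
    let _ = fut; // hand the compiler-generated state machine to an executor (elided)
}

async fn worker(mut buf: [u32; 4]) {
    // `buf` is owned by the future; it becomes a field of the generated state
    // machine, so the `Send` bound above is a claim about this captured data.
    buf[0] += 1;
}

// spawn(worker([0; 4])); // the captured state is `Send`, so this type-checks
```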
Warp specialization is not typically described this way, but in effect, it reduces to
manually written task state machines.
Futures compile down to state machines that the Rust compiler generates and manages
automatically.
Because Rust's futures are just compiler-generated state machines there is no reason
they cannot run on the GPU. That is exactly what we have done.
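To make that concrete, here is a simplified sketch of the lowering. The commented async function is what a developer writes (yield_now stands in for any cooperative yield point); the enum and poll implementation below approximate the state machine the compiler generates. The real lowering also handles pinning and multiple suspension points more precisely.

```rust
use core::future::Future;
use core::pin::Pin;
use core::task::{Context, Poll};

// What a developer writes (yield_now is a stand-in for a cooperative yield helper):
//
// async fn count_to(n: u32) -> u32 {
//     let mut i = 0;
//     while i < n {
//         i += 1;
//         yield_now().await;
//     }
//     i
// }

// Roughly what the compiler generates: an enum of suspension points carrying the
// variables live across each await, plus a poll that advances between them.
enum CountTo {
    Running { i: u32, n: u32 },
    Done,
}

impl Future for CountTo {
    type Output = u32;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        // `CountTo` contains no self-references, so it is `Unpin` and `&mut` access is safe.
        let this = self.get_mut();
        match this {
            CountTo::Running { i, n } => {
                if *i < *n {
                    *i += 1;
                    // Suspend at the yield point and request another poll.
                    cx.waker().wake_by_ref();
                    Poll::Pending
                } else {
                    let result = *i;
                    *this = CountTo::Done;
                    Poll::Ready(result)
                }
            }
            CountTo::Done => panic!("future polled after completion"),
        }
    }
}
```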
Running async/await on the GPU is difficult to demonstrate visually because the code
looks and runs like ordinary Rust. By design, the same syntax used on the CPU runs
unchanged on the GPU.
Here we define a small set of async functions and invoke them from a single GPU kernel
using block_on. Together, they exercise the core features of Rust's async model:
simple futures, chained futures, conditionals, multi-step workflows, async blocks, and
third-party combinators.
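The listing below is a representative sketch rather than the exact demo code. The kernel entry point and the block_on call site are elided because they depend on the GPU toolchain, the names are illustrative, and futures::join! stands in for the third-party combinators.

```rust
async fn double(x: u32) -> u32 {
    x * 2 // a simple future
}

async fn add_then_double(x: u32) -> u32 {
    double(x + 1).await // chained futures
}

async fn clamp_or_double(x: u32) -> u32 {
    if x > 100 { x } else { double(x).await } // conditional control flow
}

async fn workflow(x: u32) -> u32 {
    let step = add_then_double(x).await;      // multi-step workflow
    let adjusted = async { step + 3 }.await;  // async block
    let (a, b) = futures::join!(double(adjusted), clamp_or_double(adjusted)); // combinator
    a + b
}

// Inside the kernel body, the whole graph is driven to completion:
// let result = block_on(workflow(input));
```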
Getting this all working required fixing bugs and closing gaps across multiple compiler
backends. We also encountered issues in NVIDIA's ptxas tool, which we reported and
worked around.
Using async/await makes it ergonomic to express concurrency on the GPU. However,
Rust futures do not execute themselves; they must be driven to completion by an executor.
Rust deliberately does not include a built-in executor and instead third parties provide
executors with different features and tradeoffs.
Our initial goal was to prove that Rust's async model could run on the GPU at all. To do
that, we started with a simple
block_on as our
executor. block_on takes a single future and drives it to completion by repeatedly
polling it on the current thread. While simple and blocking, it was sufficient to
demonstrate that futures and async/await could compile to correct GPU code. Although
the block_on executor may seem limiting, futures are lazy and composable, so we
were still able to express complex concurrent workloads via combinators and async
functions.
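A minimal polling block_on of this kind looks roughly like the sketch below; it is not our exact implementation, and it assumes a recent Rust toolchain that provides Waker::noop.

```rust
use core::future::Future;
use core::pin::pin;
use core::task::{Context, Poll, Waker};

// Sketch of a blocking executor: repeatedly poll one future on the current thread.
fn block_on<F: Future>(fut: F) -> F::Output {
    // A polling executor has nothing to notify, so a no-op waker suffices.
    let mut cx = Context::from_waker(Waker::noop());
    let mut fut = pin!(fut);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            // Busy-wait until the future can make progress again.
            Poll::Pending => core::hint::spin_loop(),
        }
    }
}
```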
Once we had futures working end to end, we moved to a more capable executor. The Embassy
executor is designed for embedded systems and operates in Rust's
#![no_std] environment. This makes it a natural fit for GPUs, which lack a traditional
operating system and thus do not support Rust's standard library. Adapting it to run on
the GPU required very few changes. This kind of reuse of existing open source libraries
goes well beyond what other (non-Rust) GPU ecosystems offer.
To demonstrate scheduling, we construct three independent async tasks that loop
indefinitely and increment counters in shared state. The tasks themselves do not perform useful computation. Each task awaits a simple
future that performs work in small increments and yields periodically. This allows the
executor to interleave progress between tasks.
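A sketch of what such a task can look like, assuming the embassy_executor task macro and embassy_futures::yield_now (the real demo code differs, and the executor setup is elided because it depends on the target):

```rust
use core::sync::atomic::{AtomicU32, Ordering};

// Shared counters the tasks increment so that progress is observable from outside.
static COUNTERS: [AtomicU32; 3] = [AtomicU32::new(0), AtomicU32::new(0), AtomicU32::new(0)];

// Each task loops forever, does a small increment of "work", then yields so the
// executor can interleave the other tasks.
#[embassy_executor::task(pool_size = 3)]
async fn ticker(id: usize) {
    loop {
        COUNTERS[id].fetch_add(1, Ordering::Relaxed);
        embassy_futures::yield_now().await;
    }
}

// Spawning (executor setup elided):
// spawner.spawn(ticker(0)).unwrap();
// spawner.spawn(ticker(1)).unwrap();
// spawner.spawn(ticker(2)).unwrap();
```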
Below is an Asciinema recording of the GPU running the async
tasks via Embassy's executor. Performance is not representative as the example runs
empty infinite loops and uses atomics to track activity. The important point is that
multiple tasks execute concurrently on the GPU, driven by an existing, production-grade
executor using Rust's regular async/await.
Taken together, we think Rust and its async model are a strong fit for the GPU. Notably,
similar ideas are emerging in other language ecosystems, such as NVIDIA's
stdexec work for C++. The difference is these
abstractions already exist in Rust, are widely used, and are supported by a mature
ecosystem of executors and libraries.
Futures are cooperative. If a future does not yield, it can starve other work and degrade
performance. This is not unique to GPUs, as cooperative multitasking on CPUs has the
same failure mode.
GPUs do not provide interrupts. As a result, an executor running on the device must
periodically poll futures to determine whether they can make progress. This involves
spin loops or similar waiting mechanisms. APIs such as
nanosleep
can trade latency for efficiency, but this remains less efficient than interrupt-driven
execution and reflects a limitation of current GPU architectures. We have some ideas for
how to mitigate this and are experimenting with different approaches.
Driving futures and maintaining scheduling state increases register pressure. On GPUs,
this can reduce occupancy and impact performance.
Finally, Rust's async model on the GPU still carries the same function coloring
problem
that exists on the CPU.
On the CPU, executors such as Tokio,
Glommio, and
Smol make different tradeoffs around scheduling,
latency, and throughput. We expect a similar diversity to emerge on the GPU. We are
experimenting with GPU-native executors designed specifically around GPU hardware
characteristics.
A GPU-native executor could leverage mechanisms such as CUDA
Graphs
or CUDA Tile for efficient task scheduling or shared memory for fast communication
between concurrent tasks. It could also integrate more deeply with GPU scheduling
primitives than a direct port of an embedded or CPU-focused executor.
At VectorWare, we have recently enabled std on the GPU.
Futures are no_std compatible, so this does not impact their core functionality.
However, having the Rust standard library available on the GPU opens the door to richer
runtimes and tighter integration with existing Rust async libraries.
Finally, while we believe futures and async/await map well to GPU hardware and align
naturally with efforts such as CUDA Tile, they are not the only way to express
concurrency. We are exploring alternative Rust-based approaches with different tradeoffs
and will share more about those experiments in future posts.
We completed this work months ago. The speed at which we are able to make progress on
the GPU is a testament to the power of Rust's abstractions and ecosystem.
As a company, we understand that not everyone uses Rust. Our future products will
support multiple programming languages and runtimes. However, we believe Rust is
uniquely well suited to building high-performance, reliable GPU-native applications and
that is what we are most excited about.
Follow us on X,
Bluesky,
LinkedIn, or subscribe to our
blog to stay updated on our progress. We will be sharing more about our work in
the coming months. You can also reach us at hello@vectorware.com.