Running LLMs in the Browser with Rust + WebGPU

With WebGPU landing on stable Chrome today, billions of internet users will have unimpeded access to GPGPU compute in the browser. This is very exciting and paves the way for new possibilities in web based applications. After assessing the feasibility in my previous post, I set out to leverage WebGPU to run large machine learning models locally. This effort has culminated in my pre-pre-alpha library laserbeak, which allows users to run a 250M parameter large(ish?) language model (LLM) on the web.

To showcase some of the capabilities, I've built a simple document editor on top of the excellent slate which you can see in the demo above. If you've already updated Chrome or are from the future, you can try out the demo here - and if you're interested in learning more, read on!

Applications

RunwayML Green Screen

This post by RunwayML first got the gears turning about the applications of models in the browser. In the post, they break down their AI-powered Green Screening tool, which is built using a chain of models. They expound on the difficulty of setting up a streaming inference system, rapidly shuttling frames to many server side V100s to reach the level of interactivity required for good UX.

Meta AI took this a step further with their Segment Anything Model (SAM). SAM cleverly shifts almost all the model parameters to the encoder, leaving an extremely lightweight decoder which can be conditioned on user selected points to refine the mask (check out the demo). Unfortunately, with the encoder on the server, you still need your GPU farm and an efficient streaming system. For a truly interactive experience, having the entire model locally will prove essential for real-time applications like audio and video.

With more and more applications requiring chains of models, it's no surprise that LangChain has exploded in popularity. LangChain enables developers to harness complex emergent behaviour by using LLMs as a primitive. Unfortunately, repeatedly calling into ~~the Oracle~~ GPT-4 is going to run up a huge bill, and organisations that naïvely build on top of OpenAI's APIs without proper optimization are going to end up out of business. I found this stat from Chip Huyen in her detailed LLMOps post mindblowing:

GPT-4 with 10k tokens in input and 200 tokens in output, it’ll be $0.624 / prediction.

This cost is untenable for a production service, even before we take multiple calls into consideration. Fortunately, not every call requires the mental horsepower of GPT-4. Offloading some parts of the call chain to finetuned local models could dramatically reduce costs while offering additional benefits such as privacy and personalization.
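
To see where a figure like that comes from, here is a back-of-the-envelope sketch. The per-token prices are my assumption (they are not stated in the quote): $0.06 per 1k input tokens and $0.12 per 1k output tokens, which reproduce the quoted number. The chain of five calls is purely hypothetical.

```rust
/// Cost in USD of a single API call, given prices per 1k tokens.
fn call_cost_usd(
    input_tokens: u32,
    output_tokens: u32,
    input_price_per_1k: f64,
    output_price_per_1k: f64,
) -> f64 {
    (input_tokens as f64 / 1000.0) * input_price_per_1k
        + (output_tokens as f64 / 1000.0) * output_price_per_1k
}

fn main() {
    // Assumed prices (not from the quote): $0.06 / 1k input, $0.12 / 1k output.
    let per_call = call_cost_usd(10_000, 200, 0.06, 0.12);
    println!("one call: ${per_call:.3}");

    // Hypothetical chain that makes five such calls per user request.
    println!("chain of 5 calls: ${:.2}", 5.0 * per_call);
}
```

Almost all of the cost sits in the input tokens, which is why long prompts repeated across a chain add up so quickly.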

WebGPU

To effectively run these models and achieve the desired interactivity and distribution, a powerful cross-platform API is needed. Enter WebGPU - a native graphics API landing in Chrome after six years of development. Unlike its predecessor WebGL, WebGPU makes compute shaders a first class citizen on the web, and the performance has proved quite impressive. From my benchmarks, you can expect nearly 1 TFLOP/s on an M1! This post compares WebGPU against WebGL, demonstrating a huge 3.5x improvement for matrix multiplications. With WebGPU also comes a brand new shading language, WGSL, and despite some criticism, I've found the Rust inspired syntax very approachable. Fellow Rust programmers will enjoy WebGPU due to the excellent Rust support, with the wgpu crate implementing the API and naga handling shader compilation (these two crates power the Firefox WebGPU implementation). If you're interested in delving deep into WebGPU, I highly recommend this seminal blog post by Surma.

Laserbeak

Just having access to GPGPU compute is a far cry from plug-and-play web based models. For this, I've created Laserbeak - the frontend for my WebGPU runtime, Rumble. Named after the smallest transformers, Laserbeak & Rumble empower you to integrate GPU accelerated transformers directly into your web or Electron applications. You can install Laserbeak here!

Laserbeak is more than just a thin wrapper over the runtime; it offers some creative solutions to problems with running models in the browser. For example, we want to keep models as small as possible, both over the wire and in memory, so weight sharing for encoder-decoder models is a must. Through some hacking of the ONNX specification and IndexedDB, I've implemented weight sharing by storing tensors individually and reading them from the DB for each model. Storing each of the tensors individually enables some pretty interesting use cases that I will be exploring in the near future.
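
To make the weight sharing idea concrete, here is a minimal sketch of the general pattern, not Laserbeak's actual code. Every name in it (`TensorStore`, `Manifest`, the tensor names) is hypothetical, and an in-memory `HashMap` stands in for IndexedDB. Each model is described by a manifest that lists tensor names, so a tensor used by both the encoder and the decoder is stored only once.

```rust
use std::collections::HashMap;

/// Hypothetical model manifest: refers to tensors by name instead of embedding their bytes.
struct Manifest {
    name: &'static str,
    tensor_names: Vec<&'static str>,
}

/// Stand-in for a key-value store such as IndexedDB; each tensor is stored once.
#[derive(Default)]
struct TensorStore {
    tensors: HashMap<String, Vec<f32>>,
}

impl TensorStore {
    /// Store a tensor unless one with the same name is already present.
    fn put(&mut self, name: &str, data: Vec<f32>) {
        self.tensors.entry(name.to_string()).or_insert(data);
    }

    /// Resolve a manifest to borrowed tensor data; `None` if any tensor is missing.
    fn load<'a>(&'a self, manifest: &Manifest) -> Option<Vec<&'a [f32]>> {
        manifest
            .tensor_names
            .iter()
            .map(|n| self.tensors.get(*n).map(|v| v.as_slice()))
            .collect()
    }

    /// Total number of floats held by the store.
    fn stored_floats(&self) -> usize {
        self.tensors.values().map(Vec::len).sum()
    }
}

fn main() {
    let mut store = TensorStore::default();
    store.put("shared.embedding", vec![0.0; 1024]);
    store.put("encoder.block0.wq", vec![0.0; 256]);
    store.put("decoder.block0.wq", vec![0.0; 256]);

    let encoder = Manifest {
        name: "encoder",
        tensor_names: vec!["shared.embedding", "encoder.block0.wq"],
    };
    let decoder = Manifest {
        name: "decoder",
        tensor_names: vec!["shared.embedding", "decoder.block0.wq"],
    };

    for m in [&encoder, &decoder] {
        let tensors = store.load(m).expect("manifest refers to a missing tensor");
        println!("{}: {} tensors resolved", m.name, tensors.len());
    }
    // The embedding is counted once even though both manifests use it.
    println!("floats stored: {}", store.stored_floats());
}
```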

The first model

If you're going to write a runtime, it's a good idea to have a model in mind that you want to run. FLAN-T5 is an encoder-decoder model trained on text-to-text tasks and released by Google in Oct. 2022 in their Scaling Instruction-Finetuned Language Models paper. I chose FLAN as the inaugural model for my runtime for the following reasons:

It's available in a wide range of sizes, with the 250M and 780M being the most attractive initially.
The 780M parameter model boasts impressive performance, achieving a 45.1 MMLU score, which is comparable to LLaMA 13B (46.9).
It can be implemented in only ~32 operations (Explore the model with this diagram).
Instruction finetuning drastically improves performance and usability.
250M parameters seemed a feasible starting point even in FP32.

Although the model isn't finetuned for any particular task at the moment, its potential is promising, especially when considering the substantial performance increase demonstrated by the 780M parameter variant. Getting to this point will require some more work to reduce memory usage, but with a finetuned and larger model in the pipeline, we should be a strong competitor to OpenAI for narrow use cases.

Implementation

The primary bottleneck in executing large models locally is not processing power, but memory capacity. For instance, the smallest model from Meta's recent LLaMA release is 7B parameters, which would require 14GB of memory if stored in FP16. This is why projects like llama.cpp have caused such a frenzy. Using techniques like 4-bit quantisation, the same 7B weights now take up only 3.5GB (at the cost of higher perplexity), which is much more feasible for consumer hardware. Unfortunately, the GGML family primarily targets the CPU and whilst this gives much more flexibility, it does have a major downside - it will eat all of your cores. Since most modern devices are shipping with capable GPUs/NPUs, we should capitalize on that silicon! The GGML community is very aware of this, and they've already moved whisper.cpp onto the Apple ANE.
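
The memory figures above follow from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below counts weights only and uses 1 GB = 10^9 bytes; real quantised formats also store scales and other metadata, so actual files come out somewhat larger.

```rust
/// Bytes needed to store `params` weights at `bits_per_weight` bits each (weights only).
fn weight_bytes(params: u64, bits_per_weight: u64) -> u64 {
    params * bits_per_weight / 8
}

/// Convert bytes to gigabytes (1 GB = 10^9 bytes).
fn gb(bytes: u64) -> f64 {
    bytes as f64 / 1e9
}

fn main() {
    let models = [("LLaMA 7B", 7_000_000_000u64), ("FLAN-T5 250M", 250_000_000)];
    let formats = [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)];

    for (label, params) in models {
        for (format, bits) in formats {
            println!("{label} in {format}: {:.2} GB", gb(weight_bytes(params, bits)));
        }
    }
}
```

By the same arithmetic, a 250M parameter model in FP32 needs about 1GB for its weights alone.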

Whilst the raw performance of WebGPU is pretty impressive, outperforming hyper optimised CPU implementations written by extremely talented engineers is still a daunting task. One of the most significant challenges stems from most modern models having dynamic input shapes. In the good old days, most models would have static input and output shapes. This makes running something like a ResNet on the GPU trivial - you just precompile all the shaders and execute them in sequence. Unfortunately, this is not the case for LLMs or diffusion models, where the input dimensions change at every time step.

To handle this, you need to just-in-time (JIT) compile the shaders, which makes beating CPU performance slightly more challenging than anticipated. You need to compile all the shaders, move all the data to the GPU, run the model, and read the results back all in the same timespan as the CPU which... simply runs the model. However, all is not lost. Given that models are typically constructed of repeating blocks, you only need to JIT compile the shaders for the first block, then you can just repeatedly fetch them from the cache.
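
The caching idea can be sketched as a map from (operator, input shapes) to compiled shader. This is a toy model rather than Rumble's implementation: `ShaderCache`, `ShaderKey` and `run_model` are hypothetical names, a `String` stands in for a compiled GPU pipeline, and the three operators per block are made up for illustration.

```rust
use std::collections::HashMap;

/// Cache key: the operator plus the concrete input shapes the shader is specialised for.
#[derive(Clone, PartialEq, Eq, Hash)]
struct ShaderKey {
    op: &'static str,
    shapes: Vec<Vec<usize>>,
}

/// Toy cache; a `String` stands in for a compiled GPU pipeline.
#[derive(Default)]
struct ShaderCache {
    cache: HashMap<ShaderKey, String>,
    compiles: usize,
}

impl ShaderCache {
    /// Return the cached shader for `key`, "compiling" it on first use only.
    fn get_or_compile(&mut self, key: ShaderKey) -> &str {
        if !self.cache.contains_key(&key) {
            self.compiles += 1;
            let source = format!("// {} specialised for shapes {:?}", key.op, key.shapes);
            self.cache.insert(key.clone(), source);
        }
        &self.cache[&key]
    }
}

/// Run `blocks` identical blocks; every block issues the same ops on the same shapes.
fn run_model(cache: &mut ShaderCache, seq_len: usize, blocks: usize) {
    for _ in 0..blocks {
        for op in ["layernorm", "matmul", "softmax"] {
            let key = ShaderKey { op, shapes: vec![vec![seq_len, 512]] };
            // A real runtime would dispatch the shader here.
            let _shader = cache.get_or_compile(key);
        }
    }
}

fn main() {
    let mut cache = ShaderCache::default();

    run_model(&mut cache, 7, 12);
    println!("compiles after 12 blocks at seq_len 7: {}", cache.compiles);

    // A new sequence length means new shapes, so the shaders are compiled again.
    run_model(&mut cache, 8, 12);
    println!("compiles after 12 more blocks at seq_len 8: {}", cache.compiles);
}
```

Only the first block at each new shape pays the compile cost; the remaining blocks are all cache hits.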

Limitations

In its current state, my runtime can run a 250M parameter model at ~20 tokens/s without any optimization, except for caching. With further optimization, it is possible to significantly outpace CPU implementations. Moreover, the advantage over CPU will grow in tandem with the model size, making it an increasingly attractive option for running large models in the browser.

The main drawback of Rumble is that it currently only supports FP32(!), as this is what's supported via naga and wgpu, which makes our models quite large. Luckily, the shader-f16 extension is in the spec and will soon be available in Chrome, which should allow us to easily halve our model size. We can take our size reduction even further, with clever tricks to implement INT8 and INT4 in the future. This has recently been done to great effect by the MLC-AI team for their project, web-llm.
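
One common trick of this kind, shown here as a CPU-side sketch and not as what Rumble or web-llm actually do, is to quantise weights to INT8 and pack four of them into each 32-bit word, since WGSL's integer types are 32 bits wide. The shader then unpacks and rescales on the fly. The function names and the symmetric per-tensor scheme are my assumptions for illustration.

```rust
/// Symmetric per-tensor quantisation: maps [-max_abs, max_abs] onto [-127, 127].
fn quantize(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Pack four i8 values into one u32, lowest byte first.
fn pack4(q: [i8; 4]) -> u32 {
    u32::from_le_bytes(q.map(|v| v as u8))
}

/// Inverse of `pack4`.
fn unpack4(word: u32) -> [i8; 4] {
    word.to_le_bytes().map(|b| b as i8)
}

/// Recover approximate f32 weights from a packed word.
fn dequantize(word: u32, scale: f32) -> [f32; 4] {
    unpack4(word).map(|v| v as f32 * scale)
}

fn main() {
    let weights = [0.8f32, -0.31, 0.05, -1.2];
    let (q, scale) = quantize(&weights);
    let word = pack4([q[0], q[1], q[2], q[3]]);

    println!("packed word: {word:#010x}, scale: {scale}");
    println!("restored:    {:?}", dequantize(word, scale));
}
```

Four weights now occupy the space of one FP32 value, at the cost of a rounding error of at most roughly half a quantisation step per weight.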

For models that can fit within the available memory, optimizing GPU utilization is essential for achieving high performance. In my benchmarking, WebGPU is only able to reach ~40% of theoretical maximum FLOPs on native for matrix multiplies. This drops down to ~30% when running in the browser, due to the strict bounds checking implemented to avoid potential abuse. This is a shame, as it is quite easy to write a shader directly in Metal that reaches ~85% of maximum FLOPs, as demonstrated by Geohot in tinygrad. To do this, he leverages lots of intrinsics that aren't currently available through WebGPU, primarily SIMD groups. However, there have been several proposals (e.g. here) to introduce SIMD groups and I am confident they will eventually make their way into the browser.
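
For reference, utilisation figures like these are usually derived as follows: an (M x K) by (K x N) matrix multiply performs 2*M*N*K floating point operations, and utilisation is the achieved rate divided by the hardware's peak. In the sketch below, both the timing and the 2.6 TFLOP/s peak (a commonly quoted figure for the 8-core M1 GPU) are assumptions for illustration, not my measurements.

```rust
/// A dense (M x K) * (K x N) matmul does M*N*K multiply-adds, i.e. 2*M*N*K FLOPs.
fn matmul_flops(m: u64, n: u64, k: u64) -> u64 {
    2 * m * n * k
}

/// Achieved throughput in TFLOP/s for one matmul that took `seconds`.
fn achieved_tflops(m: u64, n: u64, k: u64, seconds: f64) -> f64 {
    matmul_flops(m, n, k) as f64 / seconds / 1e12
}

/// Fraction of the theoretical peak that was actually reached.
fn utilization(achieved_tflops: f64, peak_tflops: f64) -> f64 {
    achieved_tflops / peak_tflops
}

fn main() {
    // Hypothetical timing for a 1024 x 1024 x 1024 matmul, not a measurement.
    let (m, n, k) = (1024, 1024, 1024);
    let seconds = 2.15e-3;

    // Assumed FP32 peak for an 8-core M1 GPU.
    let peak_tflops = 2.6;

    let achieved = achieved_tflops(m, n, k, seconds);
    println!(
        "achieved: {achieved:.2} TFLOP/s, utilisation: {:.0}%",
        100.0 * utilization(achieved, peak_tflops)
    );
}
```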

Conclusions

I am really excited to see what happens in this space as the Rust + WebGPU ecosystem matures. I anticipate an explosion of web apps and Chrome extensions leveraging sub-1B parameter models locally.

With more optimizations, operator support, and quantization tricks, Laserbeak & Rumble should be able to run more useful models - and soon. Delivering this solo is extremely ambitious, and taking it to the next level will not be easy. If you find this interesting, check out the demo, and if you want to get involved - reach out to me.