Most of the experiments on this blog run on CPU, which is fine for API calls and small models. But the moment I needed to do anything with torch, the workflow turned into: open Colab, paste code, wait, copy results, paste more code, wait again. It works, but it completely breaks the agent loop. The agent can't open a browser tab.
I found mcp-server-colab-exec, a Python project that had already reverse-engineered Colab's runtime protocol: the XSRF handshake, the Jupyter WebSocket, credential propagation. But it was essentially one-shot: spin up a runtime, run some code, tear it down. Every execution meant a fresh environment. No state carried over between calls, no way to run a multi-step experiment where step 3 depends on step 2's output. I prototyped on top of it, hit that wall pretty quickly, and ended up rewriting the whole thing in Rust with a different architecture. That became Replicant.
Agent (Claude) ---> MCP (Replicant) ---> Google Colab (GPU)
                         |
                         +--------------> Google Drive (persistence)
The one-shot problem with the Python version came down to state. If every execution tears down the runtime, there's nowhere for intermediate results to live. But even if you keep the runtime alive, Colab's free-tier T4s get reclaimed after about 90 minutes of inactivity, sometimes less. So any architecture where state lives on the runtime is asking for trouble either way.
Replicant splits things into sessions and runtimes. A session is just a folder on Google Drive holding data, notebooks, and execution history. A runtime is the actual GPU box, ephemeral and disposable. When you attach a runtime to a session, Replicant installs packages and pulls Drive files into /content/. When the runtime dies (not if, when), the session folder still has everything. The agent attaches a fresh runtime and keeps going.
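The split is easiest to see in miniature. This is an illustrative sketch, not Replicant's actual types or API: the point is that history lives on the session, so a runtime can die mid-experiment and a fresh one picks up where it left off.

```python
# Sketch of the session/runtime split. Names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Session:
    drive_folder: str                       # durable: a folder on Drive
    history: list = field(default_factory=list)

@dataclass
class Runtime:
    alive: bool = True                      # ephemeral: Colab can reclaim it

def run(session: Session, runtime: Runtime, code: str) -> None:
    if not runtime.alive:
        raise ConnectionError("runtime reclaimed")
    session.history.append(code)            # results land in the session

session = Session(drive_folder="replicant/sessions/demo")
rt1 = Runtime()
run(session, rt1, "step_1 = train()")
rt1.alive = False                           # Colab takes the GPU back

rt2 = Runtime()                             # attach a fresh runtime...
run(session, rt2, "step_2 = evaluate(step_1)")
print(len(session.history))                 # ...history survived the swap
```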
Execution is async too. The agent submits code, gets an ID back, and polls later. I wanted this because GPU work can take minutes, and blocking the agent on a training loop felt wrong.
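The submit-then-poll shape looks roughly like this (illustrative names, not Replicant's actual tool signatures; in the real server a background task drives the kernel, whereas here the completion is filled in by hand to show the state transition):

```python
# Sketch of async execution: submit returns an ID, poll checks on it.
import uuid

_executions: dict[str, dict] = {}

def exec_submit(code: str) -> str:
    """Return an ID immediately instead of blocking on the result."""
    exec_id = uuid.uuid4().hex
    _executions[exec_id] = {"status": "running", "output": None}
    return exec_id

def exec_poll(exec_id: str) -> dict:
    """The agent checks back whenever it likes."""
    return _executions[exec_id]

eid = exec_submit("model.fit(x, y, epochs=10)")
assert exec_poll(eid)["status"] == "running"

# ...minutes later, the kernel reports completion:
_executions[eid] = {"status": "done", "output": "loss: 0.03"}
print(exec_poll(eid)["status"])  # → done
```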
Colab has no public API for any of this. The protocol that mcp-server-colab-exec figured out involves a two-step XSRF handshake (GET then POST to /tun/m/assign), a Jupyter kernel over WebSocket, and keep-alive pings every 60 seconds with an X-Colab-Tunnel: Google header. Miss a ping and the connection drops silently.
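As a sketch, the assignment flow reduces to a request sequence. The endpoint path and the X-Colab-Tunnel header come from the reverse-engineered protocol described above; the base URL, the token header name, and everything else here are illustrative assumptions, not the real wire format.

```python
# Sketch of the two-step handshake: GET first, then POST with the token.
TUNNEL_HEADERS = {"X-Colab-Tunnel": "Google"}
PING_INTERVAL_SECS = 60  # miss a ping and the connection drops silently

def assign_requests(base_url: str, xsrf_token: str) -> list[tuple]:
    get = ("GET", f"{base_url}/tun/m/assign", TUNNEL_HEADERS)
    post = ("POST", f"{base_url}/tun/m/assign",
            {**TUNNEL_HEADERS, "x-xsrf-token": xsrf_token})  # header name assumed
    return [get, post]

flow = assign_requests("https://colab.example", "tok123")
print([method for method, _, _ in flow])  # → ['GET', 'POST']
```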
One thing that tripped me up early: every HTTP response from Colab has an XSSI prefix, )]}'\n, prepended to the body. If you don't strip it, the JSON parser just gives you garbage errors with no indication of what's actually wrong. Authentication is two separate OAuth2 flows: one for Colab (using the VS Code extension's public client ID, which feels a bit like borrowing someone's Netflix login) and one for Drive through your own GCP app. There's also a colab_request mechanism that passes credentials into the runtime, so code on the GPU can access Drive directly.
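The fix is one line once you know the prefix is there. A minimal sketch (the response body here is made up):

```python
# Strip Colab's XSSI guard prefix before parsing; without this,
# json.loads fails on the very first character.
import json

XSSI_PREFIX = ")]}'\n"

def parse_colab_response(body: str):
    if body.startswith(XSSI_PREFIX):
        body = body[len(XSSI_PREFIX):]
    return json.loads(body)

raw = ")]}'\n{\"runtime\": \"t4\", \"state\": \"assigned\"}"
print(parse_colab_response(raw))  # → {'runtime': 't4', 'state': 'assigned'}
```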
Porting this to Rust forced me to handle a lot of things the Python version got away with ignoring. WebSocket message framing, token refresh races, reconnection after network blips. The Python prototype just retried and usually got lucky. The Rust version is more explicit about all of it, which was tedious to write but caught real bugs.
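The token refresh race is the clearest example. A sketch of the fix, in Python rather than the actual Rust, with illustrative names (real code would also handle expiry, not just first fetch): take a lock so concurrent callers share one refresh instead of each firing their own.

```python
# Sketch: serialize token refresh so N concurrent callers -> 1 refresh.
import threading

class TokenCache:
    def __init__(self, fetch):
        self._fetch = fetch            # does the actual OAuth2 refresh
        self._lock = threading.Lock()
        self._token = None

    def get(self):
        with self._lock:               # only one refresh in flight
            if self._token is None:
                self._token = self._fetch()
            return self._token

calls = 0
def refresh():
    global calls
    calls += 1
    return "fresh-token"

cache = TokenCache(refresh)
threads = [threading.Thread(target=cache.get) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(calls)  # → 1: eight concurrent callers, one refresh
```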
runtime_push is a hack. It base64-encodes the file and pipes it through the kernel's stdout over WebSocket. Works for config files and logs, but anything over a few MB will choke. For model checkpoints and datasets, the agent has to submit code that uploads from the runtime via the Drive API instead.
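The size problem is easy to quantify. A sketch of the encoding step (not Replicant's actual code): base64 inflates the payload by a third, and the whole thing travels as kernel stdout over a single WebSocket.

```python
# Why runtime_push chokes on big files: base64 overhead on a 1 MB blob.
import base64

def push_payload(data: bytes) -> str:
    """What the kernel would print so the server can capture the file."""
    return base64.b64encode(data).decode("ascii")

blob = b"\x00" * 1_000_000
encoded = push_payload(blob)
print(len(encoded) / len(blob))  # → 1.333336, the base64 overhead
```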
exec_cancel doesn't actually interrupt the kernel either. It marks the execution as cancelled in Replicant's tracking, but the Python code keeps running. A real interrupt would need a Jupyter interrupt_request, which I haven't gotten to yet.
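For reference, a real cancel would send something shaped like this on the kernel's control channel. The message layout follows the Jupyter messaging protocol; the session ID, username, and IDs here are illustrative.

```python
# Sketch of a Jupyter interrupt_request message (control channel).
import uuid
import datetime

def interrupt_request(session_id: str) -> dict:
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,
            "session": session_id,
            "msg_type": "interrupt_request",
            "version": "5.3",
            "date": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "username": "replicant",
        },
        "parent_header": {},
        "metadata": {},
        "content": {},  # interrupt_request carries no payload
    }

msg = interrupt_request("demo-session")
print(msg["header"]["msg_type"])  # → interrupt_request
```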
And Colab's free tier only gives you one runtime at a time, with aggressive reclamation. Replicant persists the runtime connection state locally so it can reconnect on restart, but there's no getting around Colab deciding it wants the GPU back.
The code is on GitHub. It's three crates: replicant-api for the protocol and Drive integration, replicant-mcp-server exposing 18 tools over stdio, and replicant-mcp-client for manual testing.