How I run AI bots in Discord: a build walkthrough you can follow for free

Start here: a Discord server you can talk to

I run a Discord server where some of the members are bots I built. You type a message in a channel and a bot answers, the way a person would. Ask it a question and it replies. Ask it for a picture, a short video or a piece of music and it makes the file and drops it back in the channel. A few of the bots have their own personalities and run on their own. One of them works in the background on a timer without anyone prompting it.

The brain behind the talking is a language model, and the pictures and video come off a graphics card. The point of this page is that none of it has to cost money. You can run the whole thing on NVIDIA's free hosted models, or you can run it on your own machine with a local model and a local GPU, and the bots behave the same either way. I will show both.

I did not type most of the code. I told the Claude Code agent what I wanted and it wrote the scripts, which is also how you would rebuild any of this. So the page reads two ways. You can follow it to understand how each piece works, or you can copy the prompt at the end of each part and have Claude build your own version.

Nothing here leaks a real key or login. Anywhere a secret would go you will see a placeholder like <YOUR_TOKEN>, and the real values live in one file that never leaves my machine.

Read in order, or jump around

01 What these bots are, and the two ways to run them
02 What you need: accounts, downloads, and the prompt
03 The gateway bridge: a discord.js bot that listens in one channel
04 The brain: turning a message into a reply
05 The personas: Pixel, Mochi, and the Planner daemon
06 Slash commands, buttons, and modals
07 Making images, video and music on your own GPU
08 Posting out, and keeping the bots alive 24/7

What these bots are, and the two ways to run them

There is one main bot, the bridge, that listens in a channel and answers. It is a normal Discord bot built on discord.js: it logs in with a token, watches for messages, and posts replies. The difference from a toy bot is what sits behind it. Each message is handed to a language model, and the model's answer is sent back to the channel. The same bot also runs slash commands that generate images, video and music.

Around the bridge I run a few smaller bots with fixed personalities, named Pixel and Mochi, plus a background worker called the Planner that acts on a timer instead of waiting to be spoken to. They share the same plumbing but each is its own process.

The free cloud way. NVIDIA gives away a generous free tier of hosted models through its NIM API. You make a key, point the bot at integrate.api.nvidia.com, and the brain, the image descriptions and the prompt rewriting all run there at no cost. This is the easiest start and needs no special hardware.

The fully-local way. If you have a GPU, you can run the language model on your own box behind a small local server, and run the image and video models in ComfyUI on the same card. Now nothing leaves your machine and there is no bill at all. The bot code points at a local address instead of the NVIDIA one, and everything else stays the same. You can also use the Claude command-line tool as the brain, which is how my live-chat bot on the Content Factory page works.

Both paths run the identical bot. Start on the free NVIDIA tier, and move the brain local whenever you want.

What you need: accounts, downloads, and the prompt

This is the shopping list for the whole stack. You will not need all of it for a plain chat bot, so skip what does not apply. Everything here has a free path.

Accounts and keys

Each row links to where you sign up. The Discord token and the model key are the only two you need for a talking bot.

Service / API	What it does here	Docs
Discord Developer Portal	Where you create the bot application, get its token, and turn on the message-content intent.	docs ↗
discord.js	The Node library the bot is built on. It connects to the Discord gateway, receives messages, and sends replies, buttons and modals.	docs ↗
NVIDIA NIM API	Hosted inference on a free tier. The bot brain and the vision and prompt-rewrite helpers call it. Swap in a local model and the cost goes to zero.	docs ↗
Claude Code	The agent you hand each build prompt to. It writes and wires the scripts for you.	docs ↗
ComfyUI	Runs image and video models on your own GPU, so generation is free after the hardware.	docs ↗
pm2	Keeps every bot process running across crashes and reboots.	docs ↗
Discord Gateway (Developer Portal)	Where the bot logs in with its token and where the privileged Message Content intent is enabled.	docs ↗
Node.js http	Built-in. Runs the localhost /health server via listenWithRetry.	docs ↗
NVIDIA NIM (integrate.api.nvidia.com)	Hosted inference for the chat brain. The :9340 proxy forwards Nova turns here and the answer comes back from meta/llama-3.3-70b-instruct. The same endpoint serves image description via meta/llama-3.2-90b-vision-instruct and the video prompt-enhance fallback via a minimax model. The bridge refuses to boot without this key, so in practice it is the only live chat route. Free developer key from build.nvidia.com.	docs ↗
OpenShell sandbox CLI	Isolated per-instance sandbox. resolveOpenshell locates the binary; openshell sandbox ssh-config <name> yields an SSH config used to push and pull media for img2img. In the reference design this is also where the OpenClaw agent runs the turn.	(none linked)
Vertex AI (Imagen, grounded search, Grok/Gemini proxy)	Reached with a service-account OAuth2 bearer. It backs the Imagen image proxy on :9339, the grounded knowledge-base search used during context assembly, and the no-key chat fallback, which routes to xAI Grok through Vertex's OpenAI-compatible endpoint. A full OpenAI-to-Gemini generateContent translator with a 30-minute context cache on the system prompt and tools also lives in the :9340 proxy, but it sits below the Grok return and stays dormant in the shipped routing. It is also the default backend (gemini-2.5-flash) for the separate hermes inference shim.	docs ↗
Local OpenAI-compatible LLM server (vLLM / Ollama / TensorRT-LLM)	The fully-local brain option. The proxy has a LOCAL_LLM_URL branch that forwards the OpenAI /v1/chat/completions shape to your own server, so the bot can run with no cloud calls on your own GPU. As written the branch is dormant because the NIM and Grok blocks return before it.	docs ↗
Discord API (discord.js v14)	Each persona is its own gateway client and bot user, with the MessageContent intent so they can read and reply to messages.	docs ↗
Qdrant (via local :7338 HTTP wrapper)	Shared crew memory. Pixel and Mochi POST {cmd:'search'|'store'} to recall and save notes across personas; Qdrant itself runs on :6333 behind the wrapper.	docs ↗
Brave Search API	Optional web lookup for Pixel and Mochi; degrades to nothing when the key is unset.	docs ↗
memegen.link	Free, no-auth meme image generation for Mochi's chatroom via a [MEME: template | top | bottom] token.	docs ↗
Reddit hot.json	No-auth hot-post enrichment for Mochi's chatroom answers.	docs ↗
Tenor GIF (through the Discord bot token)	GIF search for Mochi via Discord's built-in Tenor proxy, using the existing bot token.	docs ↗
Google Drive API	Optional backup for Pixel's separate trend store (pixel-memory.js); skipped unless a folder id is set.	docs ↗
Discord Application Commands API	REST endpoint the bot PUTs its command definitions to (global registration via Routes.applicationCommands)	docs ↗
Discord Interactions (receiving and responding)	The reply/defer/update/showModal lifecycle and the 3-second acknowledge window every handler obeys	docs ↗
NVIDIA NIM (build.nvidia.com)	Free cloud inference fallback at integrate.api.nvidia.com. Hosts FLUX image models and LLMs for prompt expansion on a free key, OpenAI-compatible. The no-GPU path.	docs ↗
LTX-Video 2.3 (Lightricks)	Local text-to-video and image-to-video model run as a ComfyUI workflow. The low-VRAM path uses Q4 GGUF weights plus a distilled LoRA and 2-pass sampling so it fits a consumer card.	docs ↗
ACE-Step	Local text-to-music model run as a ComfyUI workflow. Generates songs with optional lyrics, or instrumental when the lyric string is blank.	docs ↗
Z-Image Turbo	Fast local text-to-image model run as a ComfyUI workflow, a few seconds per still. Drives /zturbo and the /create image option.	(none linked)
FFmpeg	Local compositor in video-editor.js: Ken Burns, crossfades, beat-synced jumpcuts, lyric overlays, audio mix, and a re-encode pass to fit Discord's upload limit.	docs ↗
CapCut Mate API	Local draft builder on http://localhost:30000. The second compose path: assembles a CapCut draft with effects, filters, captions, and beat-synced cuts for desktop export.	(none linked)
Suno (studio-api)	Optional cloud music. Needs a refresh JWT pulled from the browser plus a captcha solver, so the local ACE-Step path is the stable free alternative.	(none linked)
Instagram Graph API (Facebook Graph API v21.0)	Publish images and Reels to an Instagram Business/Creator account via the create-container-then-publish flow, using a long-lived Page access token.	docs ↗
Cloudflare R2 (via a Worker)	Primary public host for the rendered media. The bridge PUTs the bytes to a small Worker that stores them in an R2 bucket and returns a public URL. Auth is a static upload-secret header, no OAuth.	docs ↗
Catbox / Litterbox	Free anonymous fallback host for video (72h TTL). Used when R2 is unavailable.	docs ↗
Imgur API	Free anonymous fallback host for images, via Client-ID auth and a base64 upload.	docs ↗
Google Drive API v3	Backup target for every generated file, plus a 30-minute scripts/env/pm2 tarball. Auth is a personal OAuth refresh token (preferred) or a service-account JWT.	docs ↗
Google OAuth2 token endpoint	Mints short-lived access tokens for Drive from either a refresh token or a service-account JWT (RS256); results cached ~55 minutes.	docs ↗

What to install on your machine

On Windows I run all of this inside Ubuntu through WSL. On a Mac use Homebrew, on Linux use your package manager.

Node, for the bot itself

npm init -y           # create package.json in an empty project folder
npm i discord.js      # the gateway library the bot is built on
npm i -g pm2          # process manager to keep it running

ffmpeg, for stitching and captioning media

winget install Gyan.FFmpeg     # Windows
brew install ffmpeg            # macOS
sudo apt install ffmpeg        # Debian / Ubuntu

ComfyUI, only if you want local image and video on your own GPU

git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI && pip install -r requirements.txt
python main.py                 # serves a local API on 127.0.0.1:8188

Or just point Claude at it

You do not have to wire any of this up by hand. Open Claude Code in an empty folder, paste a prompt like the one below, and answer the questions it asks. It will tell you which keys to make, write the bot, and run it with you.

starter prompt

I want to build a Discord bot that answers messages in one channel using a language model. Use discord.js for the gateway connection and read the bot token and the channel ID from environment variables. For the model, call an OpenAI-compatible chat endpoint whose base URL I can set, so I can point it at NVIDIA's free NIM API now and a local model later. Before you write code, tell me exactly which Discord settings and intents to turn on and which keys I need. I'm on <YOUR_OS>. Then build it and help me run it under pm2.

For a specific piece, copy the matching prompt from the bottom of its section below.

The gateway bridge: a discord.js bot that listens in one channel

discord-bridge.js is the front door of the stack: one Node process under PM2 holding a single gateway WebSocket to Discord. It decides which human messages deserve the agent and posts the reply back to the same channel, while the agent, media services, and memory store all sit behind it and never touch Discord directly.

It listens narrowly. DISCORD_CHANNELS names the guild-and-channel pairs it answers in, with an optional :mention flag so a channel stays quiet until the bot is @-mentioned. Most of the file is about not acting on a message twice, since Discord replays traffic on reconnect and PM2 restarts often. The pieces worth studying are the dedup that survives restarts, a per-user queue so one person cannot open two agent sessions at once (concurrent calls lock the sandbox session), and a small /health endpoint. It began as NVIDIA's NemoClaw reference bridge.

APIs & services

Service / API	What it does here	Docs
discord.js	Gateway client. Opens the WebSocket and exposes the Client / GatewayIntentBits / Partials API.	docs ↗
Discord Gateway (Developer Portal)	Where the bot logs in with its token and where the privileged Message Content intent is enabled.	docs ↗
NVIDIA NIM API	Cloud inference at integrate.api.nvidia.com. In this file it powers image vision (llama-3.2-90b-vision) and the prompt-enhance fallback; the main agent call goes to a localhost OpenAI-compatible proxy. NVIDIA_API_KEY is a hard requirement at boot regardless.	docs ↗
Node.js http	Built-in. Runs the localhost /health server via listenWithRetry.	docs ↗

How it's built, step by step

Create a Discord app and bot in the Developer Portal, enable the Message Content intent, copy the token.
Install discord.js and load secrets from an env file at startup (a process started by PM2 does not read your shell profile, so variables exported there never reach the bot).
Build the Client with the Guilds/GuildMessages/MessageContent/GuildMessageReactions intents and Channel/Message/Reaction/User partials.
Parse DISCORD_CHANNELS into an allow-list of guildId/channelId/mentionOnly entries.
In messageCreate: drop bots, drop anything outside the allow-list, honor mentionOnly.
Run the persisted dedup layer before the agent runs, then forward through the per-user queue and reply.
Add health counters, /health via listenWithRetry, an unhandledRejection guard, then client.login under PM2.
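
The env-file step in the list above can be sketched in a few lines. This is a minimal sketch, assuming a plain KEY=VALUE file at ~/.nemoclaw_env (the path named in the build prompt further down). The function names parseEnvFile and loadEnvFile, and the choices to skip # comments, accept an optional "export " prefix, and strip one pair of surrounding quotes, are my assumptions for illustration, not the bridge's actual loader.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Parse KEY=VALUE lines; ignores blanks and # comments, strips one pair of quotes.
function parseEnvFile(text) {
  const out = {};
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (!line || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq < 1) continue;                        // no "=" at all, or an empty key
    const key = line.slice(0, eq).trim().replace(/^export\s+/, "");
    let value = line.slice(eq + 1).trim();
    const quoted = value.match(/^(["'])(.*)\1$/);
    if (quoted) value = quoted[2];
    out[key] = value;
  }
  return out;
}

// Copy values into process.env without overwriting anything already set.
// A missing file is not an error here; the caller decides what is required.
function loadEnvFile(file = path.join(os.homedir(), ".nemoclaw_env")) {
  let text;
  try { text = fs.readFileSync(file, "utf8"); } catch { return {}; }
  const vars = parseEnvFile(text);
  for (const [k, v] of Object.entries(vars)) {
    if (process.env[k] === undefined) process.env[k] = v;
  }
  return vars;
}
```

Call loadEnvFile() as the first line of the bot, before anything reads process.env.DISCORD_BOT_TOKEN, and keep the existing "exit if the token is missing" check right after it.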

Under the hood

Client and login

MessageContent is privileged: enable it in the portal or msg.content arrives empty. Partials cover reactions and messages discord.js has not cached.

const { Client, GatewayIntentBits, Partials } = require("discord.js");
const client = new Client({
  intents: [GatewayIntentBits.Guilds, GatewayIntentBits.GuildMessages,
    GatewayIntentBits.MessageContent, GatewayIntentBits.GuildMessageReactions],
  partials: [Partials.Channel, Partials.Message, Partials.Reaction, Partials.User],
});
client.login(process.env.DISCORD_BOT_TOKEN).catch((err) => {
  console.error("Discord login failed:", err.message);   // say why before exiting
  process.exit(1);
});

Allow-list and the messageCreate gate

DISCORD_CHANNELS is guild:channel with an optional :mention, comma-separated.

const ALLOWED = (process.env.DISCORD_CHANNELS || "").split(",").filter(Boolean)
  .map((s) => { const [guildId, channelId, flag] = s.trim().split(":");
    return { guildId, channelId, mentionOnly: flag === "mention" }; });

client.on("messageCreate", (msg) => {
  if (msg.author.bot) return;
  const entry = ALLOWED.find((e) => e.guildId === msg.guildId && e.channelId === msg.channelId);
  if (!entry) return;
  if (entry.mentionOnly && !msg.mentions.has(client.user.id)) return;
  handle(msg);
});

Dedup that survives restarts

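Three checks run in order: an exact-id Set (reloaded from disk on boot), a 120-second age cutoff for reconnect replays, and a per-user content key (the first 100 characters, lowercased) that catches edit-and-resend within a 5-minute window.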

const processed = new Set();   // boot: add last ~500 ids from a disk log
const seen = new Map();         // "userId:first-100-chars" -> timestamp
// inside handle(msg):
if (processed.has(msg.id)) return;
if (Date.now() - msg.createdTimestamp > 120000) return;          // reconnect replay
const key = `${msg.author.id}:${(msg.content||"").slice(0,100).toLowerCase()}`;
if (seen.get(key) > Date.now() - 300000) return;                 // edit-and-resend
processed.add(msg.id); seen.set(key, Date.now());                // persist both to disk
setTimeout(() => processed.delete(msg.id), 600000);
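
The snippet above leaves the disk side as comments. Here is one way it could look: a minimal sketch, assuming an append-only text log with one message id per line. The helper names loadProcessed and markProcessed and the log layout are hypothetical, not taken from the bridge.

```javascript
const fs = require("fs");

const KEEP = 500; // how many recent ids to reload on boot

// Read the tail of an append-only log (one message id per line) into a Set.
function loadProcessed(logPath, keep = KEEP) {
  let lines = [];
  try {
    lines = fs.readFileSync(logPath, "utf8").split("\n").filter(Boolean);
  } catch { /* first boot: no log yet */ }
  return new Set(lines.slice(-keep));
}

// Record an id in memory and on disk. Returns false if it was already seen.
function markProcessed(processed, logPath, id) {
  if (processed.has(id)) return false;
  processed.add(id);
  fs.appendFileSync(logPath, id + "\n");
  return true;
}
```

On boot you would call loadProcessed once, then replace the processed.has / processed.add pair in handle(msg) with a single markProcessed call. The log only ever grows in this sketch, so trim it now and then; the content-key Map can be persisted the same way.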

Per-user queue, then reply

.then(fn, fn) keeps a rejected call from stalling the chain; one in-flight call per user prevents concurrent sandbox sessions.

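const queue = new Map();   // userId -> tail of that user's promise chain
const enqueue = (uid, fn) => {
  // fn runs whether the previous call resolved or rejected
  const next = (queue.get(uid) || Promise.resolve()).then(fn, fn);
  queue.set(uid, next);
  // drop the entry once this call is the settled tail; the trailing catch keeps
  // a rejected call from surfacing a second time as an unhandled rejection
  next.finally(() => { if (queue.get(uid) === next) queue.delete(uid); }).catch(() => {});
  return next;
};
// inside handle(msg), which must be an async function:
const reply = await enqueue(msg.author.id, () => runAgent(fullMessage));
await msg.reply(reply.replace(/\b\d{17,19}\b/g, "[user]")); // never echo raw IDs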

Health and listenWithRetry

const health = { msgIn:0, msgOk:0, msgErr:0, agentCalls:0, agentFails:0, dedups:0, rejections:0 };
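function listenWithRetry(server, port, host, label, n = 5) {
  server.on("error", (e) => {
    // port still held by the previous process: wait 2s and try again, up to n times
    if (e.code === "EADDRINUSE" && n-- > 0) return setTimeout(() => server.listen(port, host), 2000);
    console.error(`[${label}] could not listen on ${host}:${port}: ${e.message}`);
  });
  server.listen(port, host);
}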
// GET /health returns { ok, uptime, queue: queue.size, counters: health,
//   discord: { connected: client.ws?.status === 0, ping: client.ws?.ping } }
process.on("unhandledRejection", () => { health.rejections++; });
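
The /health handler itself is only described in the comment above. A minimal sketch of it follows, assuming the response shape from that comment. The names healthPayload and createHealthServer are hypothetical, and treating ok as "the gateway is connected" is my assumption; the source does not say what ok is derived from.

```javascript
const http = require("http");

// Build the JSON body for GET /health from plain values, so it is easy to test.
function healthPayload({ startedAt, queueSize, counters, connected, ping }, now = Date.now()) {
  return {
    ok: connected,                                  // assumption: ok mirrors the gateway state
    uptime: Math.round((now - startedAt) / 1000),   // seconds since boot
    queue: queueSize,
    counters,
    discord: { connected, ping },
  };
}

// Create (but do not start) a server that answers only GET /health.
function createHealthServer(getState) {
  return http.createServer((req, res) => {
    if (req.method !== "GET" || req.url !== "/health") {
      res.writeHead(404);
      res.end();
      return;
    }
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify(healthPayload(getState())));
  });
}
```

In the bridge, getState would return the live values (queue.size, the health counters, client.ws?.status === 0, client.ws?.ping), and the server would be handed to listenWithRetry with a localhost host so the endpoint is never exposed off the machine.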

Run it local & free

Run it for nothing

The gateway side is already free: one Discord app, one bot, one server you own, and discord.js itself. Only inference costs anything, and there are two zero-cost paths.

Free cloud with NVIDIA NIM. The bridge needs NVIDIA_API_KEY at boot and calls integrate.api.nvidia.com directly for image vision and prompt-enhance fallback. Grab a free key at build.nvidia.com; the request is OpenAI-compatible, so you can route the agent call there too:

const res = await fetch("https://integrate.api.nvidia.com/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json",
    "Authorization": `Bearer ${process.env.NVIDIA_API_KEY}` },
  body: JSON.stringify({ model: "meta/llama-3.1-8b-instruct",
    messages: [{ role: "user", content: fullMessage }], max_tokens: 1024 }),
});

Fully local. The main agent call here already POSTs to an OpenAI-compatible proxy on localhost, so you can point that proxy at a local model server (Ollama on http://localhost:11434, llama.cpp, vLLM, or a self-hosted NIM container) and no text inference leaves the machine. The startup check still wants NVIDIA_API_KEY set, so feed it any placeholder or relax that guard. The allow-list, dedup, queue, and /health behave the same either way.
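
Because both paths speak the same OpenAI chat shape, switching between them can come down to one base URL. The sketch below is one way to express that, not how the bridge does it: LLM_BASE_URL and LLM_MODEL are hypothetical variable names I made up for the example, and it relies on the global fetch that ships with Node 18 and later. Ollama exposes its OpenAI-compatible API under /v1 on port 11434.

```javascript
// Pick the chat endpoint from the environment. LLM_BASE_URL and LLM_MODEL are
// hypothetical names for this sketch; NVIDIA_API_KEY is the bridge's own variable.
function chatTarget(env = process.env) {
  const base = (env.LLM_BASE_URL || "https://integrate.api.nvidia.com/v1").replace(/\/+$/, "");
  const local = /^https?:\/\/(localhost|127\.0\.0\.1)(:\d+)?(\/|$)/.test(base);
  const headers = { "Content-Type": "application/json" };
  // never send the cloud key to a local server
  if (!local && env.NVIDIA_API_KEY) headers.Authorization = `Bearer ${env.NVIDIA_API_KEY}`;
  return {
    url: `${base}/chat/completions`,
    headers,
    model: env.LLM_MODEL || "meta/llama-3.1-8b-instruct",
  };
}

// Send one user message and return the reply text.
async function chat(message, env = process.env) {
  const { url, headers, model } = chatTarget(env);
  const res = await fetch(url, {
    method: "POST",
    headers,
    body: JSON.stringify({ model, messages: [{ role: "user", content: message }], max_tokens: 1024 }),
  });
  if (!res.ok) throw new Error(`chat endpoint returned HTTP ${res.status}`);
  const data = await res.json();
  return data.choices?.[0]?.message?.content ?? "";
}
```

With nothing set, chat() goes to NIM. With LLM_BASE_URL=http://localhost:11434/v1 and LLM_MODEL set to a model you have pulled locally, the same call stays on your machine.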

Gotchas & hard-won lessons

MessageContent is privileged. Without the portal toggle, messages arrive with empty content and the bot looks deaf while throwing no errors.
Dedup must survive restarts. Reconnects replay traffic and PM2 cycles often, so reload the id set and content map from disk on boot or the backlog replays every time.
Key the queue by user id and chain with .then(fn, fn), not .then(fn). The two-arg form lets the next message run after a rejected call instead of stalling the chain.
listenWithRetry exists because a fast PM2 restart can fire before the old process released the port; without the EADDRINUSE backoff the health server fails to bind.
The bridge hard-exits if the bot token or NVIDIA_API_KEY is missing, and it masks any 17-to-19 digit run in replies so raw Discord IDs never leak into the channel.

Prompts to build it yourself

The kind of instructions you'd hand an AI coding agent (Claude Code) to build this from scratch.

prompt

Build a Node.js Discord bot in discord-bridge.js with discord.js. Use the Guilds, GuildMessages, MessageContent, and GuildMessageReactions intents and Channel/Message/Reaction/User partials. Parse DISCORD_CHANNELS (comma-separated guildId:channelId with an optional :mention third segment) into objects with guildId, channelId, mentionOnly. In messageCreate, ignore bots, ignore messages outside the allow-list, and when mentionOnly only proceed if the bot is at-mentioned. Load secrets from ~/.nemoclaw_env at startup, exit if the token is missing, and call client.login.

prompt

Add three things to discord-bridge.js. First, restart-safe dedup: a Set of message ids reloaded from its last ~500 disk entries on boot, a content-hash Map keyed by userId plus the first 100 lowercased chars with a 5-minute window persisted to disk, and a hard cutoff dropping messages older than 120 seconds. Second, a per-user queue: a Map from userId to a promise chain appended with .then(fn, fn) and auto-deleted once it is the settled tail. Third, health counters plus a localhost /health JSON endpoint started via a listenWithRetry helper that retries server.listen on EADDRINUSE, and an unhandledRejection handler that increments a counter instead of crashing.

The brain: turning a message into a reply

Every reply this bot sends starts the same way: a message lands in an allowed Discord channel, my handler walks it through a few filters, glues some context onto it, and hands the result to a single function called runAgentInSandbox. That name is a small lie I kept on purpose. The original NemoClaw reference design forwarded each turn to an OpenClaw agent living inside an OpenShell sandbox, reached over SSH. In my build the chat turn does not SSH anywhere. It POSTs an OpenAI-shaped payload to a local model proxy on localhost:9340, and that proxy decides which model actually answers. I left the old name because the file header still describes the sandbox idea and I find it useful to remember where the wiring came from.

The proxy is the interesting part. When an NVIDIA_API_KEY is present it forwards the request straight to NVIDIA's hosted inference at integrate.api.nvidia.com, and the Nova persona answers with meta/llama-3.3-70b-instruct. With no key the very next branch hands the turn to a Grok proxy that runs grok-4.20-non-reasoning through Vertex AI's xAI endpoint. There is more code below that: a full OpenAI-to-Gemini generateContent translator with a 30-minute context cache on the system prompt and tools, plus a hook for a local OpenAI-compatible server. Both of those sit after the Grok block returns, so neither one fires in the shipped routing. And since the bridge process exits at startup when NVIDIA_API_KEY is missing, the NIM path is the route this bot takes every single time.

So what is the sandbox doing if the chat turn skips it? Media. The SSH plumbing is real and it earns its keep: resolveOpenshell finds the openshell binary, openshell sandbox ssh-config hands back a connection config, sshArgs bolts on an ephemeral known_hosts so a container restart does not wedge me, and pushImageToSandbox / pullImageFromSandbox base64-pipe images in and out of the sandbox's /tmp for img2img. The sandbox is where skills and image work run, not where the sentence comes from.

The persona itself is just a system prompt. I read SOUL.md off disk at startup and prepend it as the system message. Before the model ever sees the user's words I stack on crew memory, Mochi's past corrections, a knowledge-base search result, current trend data, and any crew pre-consult, then I serialize per user so two messages from the same person cannot trip over each other inside one session. After the model replies I scrub raw user IDs, run a hallucination filter, and intercept media tags like [ZTURBO:] or [GIF:] before the text reaches the channel.

APIs & services

Service / API	What it does here	Docs
NVIDIA NIM (integrate.api.nvidia.com)	Hosted inference for the chat brain. The :9340 proxy forwards Nova turns here and the answer comes back from meta/llama-3.3-70b-instruct. The same endpoint serves image description via meta/llama-3.2-90b-vision-instruct and the video prompt-enhance fallback via a minimax model. The bridge refuses to boot without this key, so in practice it is the only live chat route. Free developer key from build.nvidia.com.	docs ↗
discord.js	Gateway client. Receives messageCreate events with the MessageContent intent, exposes attachments and embeds, and sends or edits the reply including the progressive typing preview.	docs ↗
OpenShell sandbox CLI	Isolated per-instance sandbox. resolveOpenshell locates the binary; openshell sandbox ssh-config <name> yields an SSH config used to push and pull media for img2img. In the reference design this is also where the OpenClaw agent runs the turn.	(none linked)
Vertex AI (Imagen, grounded search, Grok/Gemini proxy)	Reached with a service-account OAuth2 bearer. It backs the Imagen image proxy on :9339, the grounded knowledge-base search used during context assembly, and the no-key chat fallback, which routes to xAI Grok through Vertex's OpenAI-compatible endpoint. A full OpenAI-to-Gemini generateContent translator with a 30-minute context cache on the system prompt and tools also lives in the :9340 proxy, but it sits below the Grok return and stays dormant in the shipped routing. It is also the default backend (gemini-2.5-flash) for the separate hermes inference shim.	docs ↗
Local OpenAI-compatible LLM server (vLLM / Ollama / TensorRT-LLM)	The fully-local brain option. The proxy has a LOCAL_LLM_URL branch that forwards the OpenAI /v1/chat/completions shape to your own server, so the bot can run with no cloud calls on your own GPU. As written the branch is dormant because the NIM and Grok blocks return before it.	docs ↗

How it's built, step by step

Stand up a discord.js client with the Guilds, GuildMessages, MessageContent, and GuildMessageReactions intents, and load DISCORD_BOT_TOKEN, NVIDIA_API_KEY, and the allowed guild:channel list from env. The process exits if the token or NVIDIA key is missing.
On messageCreate, drop anything outside the allowed channel, anything from a blocked user, and any message ID you have already processed (a persisted dedup set survives restarts).
If the message carries an image, describe it first by calling NVIDIA's vision model (meta/llama-3.2-90b-vision-instruct), so the brain gets text it can reason about.
Read SOUL.md once at startup and keep it as the system prompt for the persona, with a hardcoded fallback if the file is missing.
Assemble the contextual message: user prefix, crew memory, Mochi's past corrections, a knowledge-base search snippet, trend data, optional crew pre-consult, then the user's text.
Serialize per user with an enqueueAgent promise chain so one person's messages run one at a time, then call runAgentInSandbox.
runAgentInSandbox builds an OpenAI chat payload and POSTs it to the local proxy on localhost:9340 with an X-Agent-Id header and a 120s timeout.
The proxy routes by env: NVIDIA_API_KEY present forwards to integrate.api.nvidia.com (Llama-3.3-70B); with no key the next branch forwards to a Grok proxy (grok-4.20 via Vertex's xAI endpoint). A Gemini generateContent translator and a local-LLM branch also exist in the proxy but sit after the Grok return, so they stay dormant.
Stream a throttled progress preview back into Discord as an edited in-progress message while the model thinks.
Clean the reply: replace 17-to-19 digit user IDs with [user], run the hallucination filter, and unwrap any leaked JSON array.
Intercept media tags ([ZTURBO:], [GIF:], [SUNO:], [MAKE_GIF], [ACESTEP:] and friends) and run the matching generator, pushing or pulling images through the sandbox over SSH when img2img is involved.
Send the final text and any attachments to the channel.
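
The media-tag step in the list above is the least obvious one, so here is a rough illustration. It is a sketch under assumptions: the tag names come from the list, but the exact grammar ("[NAME: argument]", or "[NAME]" with no argument) and the function name extractMediaTags are my guesses, not the bridge's parser.

```javascript
// Tag names taken from the step list; the "[NAME: argument]" grammar is an assumption.
const MEDIA_TAGS = new Set(["ZTURBO", "GIF", "SUNO", "MAKE_GIF", "ACESTEP"]);
const TAG_RE = /\[([A-Z_]+)(?::\s*([^\]]*))?\]/g;

// Split a model reply into the text to post and the media jobs to run.
function extractMediaTags(reply) {
  const jobs = [];
  const text = reply.replace(TAG_RE, (whole, name, arg) => {
    if (!MEDIA_TAGS.has(name)) return whole;      // leave unknown brackets alone
    jobs.push({ tag: name, arg: (arg || "").trim() });
    return "";
  });
  // collapse the gaps left behind by removed tags
  return { text: text.replace(/[ \t]{2,}/g, " ").trim(), jobs };
}
```

Each entry in jobs would then be dispatched to the matching generator, and the cleaned text is what reaches the channel. Using an allow-list of names means the [user] placeholder and ordinary bracketed words pass through untouched.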

Under the hood

The handler builds context, then calls one function

The message handler does the boring, important work: filter, describe images, stack context, serialize per user. The actual brain call is a single function with a deliberately misleading name.

// per-user serialization so two messages from one user can't collide
const agentQueue = new Map(); // userId -> promise chain
function enqueueAgent(userId, fn) {
  const prev = agentQueue.get(userId) || Promise.resolve();
  const next = prev.then(fn, fn);            // run after the previous one resolves OR rejects
  agentQueue.set(userId, next);
  next.finally(() => { if (agentQueue.get(userId) === next) agentQueue.delete(userId); });
  return next;
}

// inside the message handler, after context is assembled:
const contextualMessage =
  userPrefix + memoryContext + correctionsContext +
  vertexSearchContext + trendContext + crewPlanContext + fullMessage;

const reply = await enqueueAgent(msg.author.id, () =>
  runAgentInSandbox(contextualMessage, `dc-${msg.author.id}-${msg.id}`, onProgress)
);

The brain call: an HTTP POST to the local proxy

Despite the name, this never opens an SSH session. It loads the persona from SOUL.md and POSTs an OpenAI chat payload to the proxy on localhost:9340.

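const fs = require("fs");
const http = require("http");
// SOUL_PATH and GEMINI_DEFAULT_MODEL are assumed to be constants defined elsewhere in the file.

let systemPrompt = "You are Nova, lead of a small agent crew. Be concise and decisive.";
try { systemPrompt = fs.readFileSync(SOUL_PATH, "utf8"); } catch {}   // keep the fallback if SOUL.md is missing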

function runAgentInSandbox(message, sessionId, onProgress) {
  return new Promise((resolve) => {
    const body = JSON.stringify({
      model: GEMINI_DEFAULT_MODEL,   // the proxy overrides this per route
      messages: [
        { role: "system", content: systemPrompt },
        { role: "user",   content: message },
      ],
      max_tokens: 4096,
      temperature: 0.8,
    });
    const req = http.request({
      hostname: "localhost",
      port: 9340,                    // the local model proxy
      path: "/v1/chat/completions",
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Content-Length": Buffer.byteLength(body),
        "X-Agent-Id": "nova",
      },
    }, (res) => {
      let buf = "";
      res.on("data", (c) => { buf += c; /* throttled onProgress preview elided */ });
      res.on("end", () => {
        try {
          const data = JSON.parse(buf);
          resolve(data.choices?.[0]?.message?.content || "Agent returned nothing");
        } catch (e) { resolve(`Agent error: ${e.message}`); }
      });
    });
    req.setTimeout(120000, () => { req.destroy(); resolve("Agent timed out after 120s"); });
    req.on("error", (e) => resolve(`Agent error: ${e.message}`));
    req.end(body);
  });
}
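
The function above elides the throttled preview. One simple way to throttle a callback is shown below. It is a generic sketch, not the bridge's code: the name throttle, the injectable clock, and the choice to drop (not queue) calls that arrive too early are all my assumptions.

```javascript
// Wrap fn so it runs at most once per intervalMs; calls in between are dropped.
// `now` is injectable so the behaviour can be tested without real timers.
function throttle(fn, intervalMs, now = Date.now) {
  let last = -Infinity;
  return (...args) => {
    const t = now();
    if (t - last < intervalMs) return false;   // too soon: skip this update
    last = t;
    fn(...args);
    return true;
  };
}
```

An onProgress built this way might wrap an edit of the in-progress Discord message, for example throttle((text) => placeholder.edit(text).catch(() => {}), 1500), where placeholder is the message the bot posted when it started thinking and 1500 ms is an arbitrary interval. Dropped previews do not matter because the final reply is sent separately once the model finishes.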

The proxy picks the model

The :9340 server is where the model choice happens. With an NVIDIA key it forwards straight to NIM. With no key the next branch is unconditional and hands the turn to the Grok proxy. The NVIDIA path is the one the bot always takes, because the bridge will not boot without the key.

const http  = require("http");
const https = require("https");

const NIM_KEY         = process.env.NVIDIA_API_KEY || "";
const NIM_NOVA_MODEL  = process.env.NIM_NOVA_MODEL || "meta/llama-3.3-70b-instruct";
const GROK_MODEL      = process.env.GROK_MODEL     || "xai/grok-4.20-non-reasoning";
const GROK_PROXY_PORT = 9342;

// inside the :9340 server, handling POST /v1/chat/completions:
if (NIM_KEY) {
  const payload = JSON.stringify({ ...oai, model: NIM_NOVA_MODEL, stream: false });
  const nimReq = https.request({
    hostname: "integrate.api.nvidia.com",
    path: "/v1/chat/completions",
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Content-Length": Buffer.byteLength(payload),
      "Authorization": `Bearer ${NIM_KEY}`,
    },
  }, nimRes => { res.writeHead(nimRes.statusCode, { "Content-Type": "application/json" }); nimRes.pipe(res); });
  nimReq.on("error", e => { res.writeHead(502); res.end(JSON.stringify({ error: { message: e.message } })); });
  nimReq.write(payload); nimReq.end();
  return;
}

// No NVIDIA key: this block is UNCONDITIONAL and returns. It forwards to the
// Grok proxy, which then hits Vertex AI's xAI endpoint with grok-4.20.
{
  const grokPayload = JSON.stringify({ ...oai, model: GROK_MODEL });
  const grokReq = http.request({
    hostname: "127.0.0.1", port: GROK_PROXY_PORT,
    path: "/v1/chat/completions", method: "POST",
    headers: { "Content-Type": "application/json", "Content-Length": Buffer.byteLength(grokPayload) },
  }, grokRes => { res.writeHead(grokRes.statusCode, { "Content-Type": "application/json" }); grokRes.pipe(res); });
  grokReq.on("error", e => { res.writeHead(502); res.end(JSON.stringify({ error: { message: e.message } })); });
  grokReq.write(grokPayload); grokReq.end();
  return;
}

// Everything below is dead in the current routing:
//   - a LOCAL_LLM_URL branch (forward to your own OpenAI-compatible server)
//   - a full OpenAI -> Vertex Gemini generateContent translator with a
//     30-minute context cache on the system prompt + tools
// To reach either, move it above the two returns above.

The sandbox over SSH carries media, not the sentence

The SSH helpers are genuinely used, just not for the chat turn. resolveOpenshell finds the binary, the CLI hands back an SSH config, and sshArgs adds an ephemeral known_hosts so a sandbox restart does not block me on a changed host key.

const { resolveOpenshell } = require("./lib/resolve-openshell");
const OPENSHELL = resolveOpenshell();
const SANDBOX   = process.env.SANDBOX_NAME || "my-assistant";

function sshArgs(confPath) {
  return ["-T", "-F", confPath,
    "-o", "StrictHostKeyChecking=accept-new",
    "-o", "UserKnownHostsFile=/dev/null",   // ephemeral: restarts rotate host keys
    "-o", "ConnectTimeout=10"];
}

// push the user's image in for img2img
function pushImageToSandbox(imageBuffer) {
  return new Promise((resolve) => {
    let sshConfig;
    try { sshConfig = execFileSync(OPENSHELL, ["sandbox", "ssh-config", SANDBOX], { encoding: "utf-8" }); }
    catch { return resolve(false); }
    const confDir   = fs.mkdtempSync("/tmp/img-push-");
    const confPath  = `${confDir}/config`;
    fs.writeFileSync(confPath, sshConfig, { mode: 0o600 });
    const proc = spawn("ssh", [...sshArgs(confPath), `openshell-${SANDBOX}`,
      "base64 -d > /tmp/input_image.png"], { timeout: 30000, stdio: ["pipe", "pipe", "pipe"] });
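    // remove the temp SSH config once ssh exits, pass or fail
    const finish = (ok) => { fs.rmSync(confDir, { recursive: true, force: true }); resolve(ok); };
    proc.stdin.on("error", () => {}); // ssh died early (EPIPE); close/error report the result
    proc.stdin.end(imageBuffer.toString("base64"));
    proc.on("close", (code) => finish(code === 0));
    proc.on("error", () => finish(false));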
  });
}

SANDBOX_NAME is validated against RFC 1123 label rules before any of this runs, which keeps shell metacharacters and path traversal out of the openshell-<name> host string.
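A minimal sketch of what that check can look like, assuming it is a single regex. The names isValidSandboxName and sandboxHost are hypothetical, and restricting the label to lowercase is my assumption (RFC 1123 itself is case-insensitive), so treat this as an illustration rather than the bridge's actual validator.

```javascript
// Hypothetical helpers: the bridge's real validator may differ.
// RFC 1123 label: 1 to 63 characters, letters, digits and hyphens,
// starting and ending with a letter or digit. Lowercase-only is an assumption.
const RFC1123_LABEL = /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/;

function isValidSandboxName(name) {
  return typeof name === "string" && RFC1123_LABEL.test(name);
}

// Build the SSH host alias only from a name that passed the check.
function sandboxHost(name) {
  if (!isValidSandboxName(name)) {
    throw new Error(`invalid SANDBOX_NAME: ${JSON.stringify(name)}`);
  }
  return `openshell-${name}`;
}
```

Because the pattern allows only letters, digits and hyphens, spaces, semicolons, slashes and dots are all rejected before the name ever reaches an argument list.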

Cleaning the reply before it ships

The model's text is not trusted as-is. I scrub anything that looks like a raw Discord ID, then run a precompiled hallucination filter and a structural check before the message goes out.

// replace raw 17-19 digit user IDs in public replies
response = response.replace(/\b\d{17,19}\b/g, "[user]");

const hallucinationHit = HALLUCINATION_PATTERNS.find(p => p.test(response));
// structural check: only fire when several bold numbered items AND a dramatic
// closer AND an invented module name (that isn't a known-real one) all appear
const boldNumberedItems = (response.match(/^\d+\.\s+\*\*/gm) || []).length;
const structuralHit =
  boldNumberedItems >= 3 &&
  DRAMATIC_CLOSER_RE.test(response) &&
  INVENTED_MODULE_RE.test(response) && !KNOWN_REAL_RE.test(response);

if (hallucinationHit || structuralHit) {
  await msg.reply("My response got filtered, it looked like I was making stuff up. Try again with a bit more context.");
  return;
}
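The snippet above leans on four constants that are not shown. Here is a self-contained sketch with placeholder patterns so the logic can be run on its own. Every regex in it is invented for illustration and is not the bot's real list; scrubIds and shouldBlock are hypothetical names.

```javascript
// Placeholder patterns for illustration only. The bot's real lists are not
// shown in this walkthrough. None use the g flag, so .test() stays stateless.
const HALLUCINATION_PATTERNS = [/\bas an AI language model\b/i];
const DRAMATIC_CLOSER_RE = /\b(game[- ]changer|the future is now)\b/i;
const INVENTED_MODULE_RE = /\b[A-Z][a-z]+(Engine|Core|Matrix)\b/;
const KNOWN_REAL_RE = /\b(ComfyUI|OpenShell)\b/;

// Replace raw 17-19 digit IDs before anything is posted publicly.
function scrubIds(text) {
  return text.replace(/\b\d{17,19}\b/g, "[user]");
}

// True when the reply should be withheld.
function shouldBlock(text) {
  if (HALLUCINATION_PATTERNS.some(p => p.test(text))) return true;
  const boldNumberedItems = (text.match(/^\d+\.\s+\*\*/gm) || []).length;
  return boldNumberedItems >= 3 &&
    DRAMATIC_CLOSER_RE.test(text) &&
    INVENTED_MODULE_RE.test(text) && !KNOWN_REAL_RE.test(text);
}
```

Keeping the patterns free of the g flag matters here: a global regex remembers lastIndex between .test() calls and would give alternating results on consecutive replies.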

Run it local & free

There are two zero-cost ways to run this brain, and they differ in whether any traffic leaves your machine.

Cloud-free: NVIDIA NIM (what the proxy already does)

This is the path the code takes when an NVIDIA_API_KEY is set, and the developer tier is free. It is also the only path the bridge will boot on, since it exits without the key.

Make a free key at https://build.nvidia.com and put it in your env file as NVIDIA_API_KEY=<YOUR_NVIDIA_API_KEY>.
Start the proxy on :9340 and the bridge. The brain answers with meta/llama-3.3-70b-instruct, and image description uses meta/llama-3.2-90b-vision-instruct, both on integrate.api.nvidia.com.
Sanity check the route directly:

curl -s https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer <YOUR_NVIDIA_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"model":"meta/llama-3.3-70b-instruct",
       "messages":[{"role":"user","content":"say hi in five words"}],
       "max_tokens":32}'

Nothing about the bot changes. The proxy is already pointed here, so a free key is the whole setup.

Fully local: your own GPU, no cloud calls

Run any OpenAI-compatible server on your machine and make the proxy talk to it instead of NIM. Ollama is the easiest start; vLLM or TensorRT-LLM give you more throughput on an NVIDIA card.

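# option A: Ollama (OpenAI-compatible API on http://localhost:11434)
ollama serve               # leave this running; do the pull from a second terminal
ollama pull llama3.3:70b   # or a smaller model that fits your VRAM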

# option B: vLLM, OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.3-70B-Instruct --port 8000

Two honest caveats. First, the proxy has a LOCAL_LLM_URL branch, but in the shipped routing the NVIDIA block and the Grok block both return before it, so setting the env var alone does nothing. To go fully local you move that branch to the top of the request handler, ahead of both the NVIDIA and Grok blocks, so a local server becomes the active route. Second, the bridge itself calls process.exit(1) at startup if NVIDIA_API_KEY is unset, so for a truly key-free run you also relax that startup guard.

const LOCAL = process.env.LOCAL_LLM_URL; // e.g. http://localhost:8000
if (LOCAL) {
  const u = new URL(LOCAL.replace(/\/$/, "") + "/v1/chat/completions");
  const body = JSON.stringify({ ...oai, model: process.env.LOCAL_LLM_MODEL || "llama3.3", stream: false });
  const lReq = http.request({ hostname: u.hostname, port: u.port, path: u.pathname, method: "POST",
    headers: { "Content-Type": "application/json", "Content-Length": Buffer.byteLength(body) } },
    lRes => { res.writeHead(lRes.statusCode, { "Content-Type": "application/json" }); lRes.pipe(res); });
  lReq.on("error", e => { res.writeHead(502); res.end(JSON.stringify({ error: { message: e.message } })); });
  lReq.end(body);
  return;
}
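The ordering problem is easier to see in isolation. This is a sketch under my own assumptions: pickRoute is a hypothetical helper that only names the winning route, whereas the real proxy inlines each branch and returns from the request handler.

```javascript
// Hypothetical helper: walks the branches in order and reports which one
// would claim the request. The first available route wins, as with early returns.
function pickRoute(env, order = ["local", "nim", "grok"]) {
  const available = {
    local: Boolean(env.LOCAL_LLM_URL),
    nim: Boolean(env.NVIDIA_API_KEY),
    grok: true, // unconditional: it always takes the turn if it is reached
  };
  return order.find(name => available[name]);
}

// Shipped order: the Grok block sits above the local branch.
const SHIPPED = ["nim", "grok", "local"];
// Promoted order: local first, as described for a fully local run.
const PROMOTED = ["local", "nim", "grok"];
```

With only LOCAL_LLM_URL set, pickRoute(env, SHIPPED) lands on the Grok route and pickRoute(env, PROMOTED) lands on the local one, which is the whole reason the branch has to move.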

For media on the local path, point the image and video tags at a local ComfyUI on your GPU rather than any hosted generator, which keeps the whole loop on one box.

Swapping the brain entirely

If you would rather the reply come from an agent CLI than from an HTTP model, you can run a coding-agent CLI (OpenClaw, or the claude CLI) inside the OpenShell sandbox as the brain and have runAgentInSandbox actually shell into it over the SSH config it already builds. That is closer to the original reference design, and it trades the plain HTTP call for a real agent loop with tools.

Gotchas & hard-won lessons

The function is named runAgentInSandbox but it does not SSH into the sandbox. It POSTs to the proxy on localhost:9340. Read the body, not the name, or you will waste an afternoon looking for an SSH call that handles chat.
The brain default is meta/llama-3.3-70b-instruct, not Nemotron. Nemotron (nvidia/llama-3.1-nemotron-ultra-253b-v1) is only the default the hermes inference shim uses on its NIM path, and that shim is a separate process.
With no NVIDIA key the proxy does not fall back to Gemini. The very next branch is unconditional and forwards to the Grok proxy (grok-4.20 via Vertex's xAI endpoint). The Gemini generateContent translator lives further down and never runs in the shipped order.
The bridge calls process.exit(1) at startup if NVIDIA_API_KEY is unset, so you cannot run it on the Grok-only path without also satisfying that key check. NIM is the de-facto brain every time.
The proxy also has a LOCAL_LLM_URL branch, but the NVIDIA block and the Grok block both return before it, so that branch never runs as written. Going fully local means promoting it above both returns, not just setting the env var.
The hermes inference shim looks local because the container thinks it is talking to llama.cpp, but it proxies to NVIDIA NIM or Vertex Gemini. It is another cloud path, not local inference.
The proxy on :9340 must be listening before the bridge handles its first message, otherwise every brain call returns Agent error: connect ECONNREFUSED. The proxy servers bind at require-time when services/vertex.js loads, so start the bridge process cleanly.
Per-user serialization through enqueueAgent matters. Remove it and concurrent messages from one user can lock a session and produce empty or crossed replies.
known_hosts is intentionally /dev/null with accept-new. Sandbox container restarts rotate host keys, and pinning them would block media transfer after every restart.
The 120s timeout in runAgentInSandbox is the hard ceiling for a reply. A slow cold model start can hit it and the user gets Agent timed out after 120s instead of an answer.
The hallucination structural check is a blunt instrument. A legitimate numbered roadmap with three or more bold items plus a flourish at the end can trip it, so keep the three conditions ANDed together (bold-numbered count, dramatic closer, invented-but-not-known module) as they are.
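
The per-user serialization mentioned above can be done with a promise-chain map. This is a sketch of that pattern, not the bridge's actual enqueueAgent: the internals, the cleanup step and the error handling are my assumptions.

```javascript
// Sketch only: the bridge's enqueueAgent may differ in detail.
const queues = new Map(); // userId -> tail of that user's promise chain

function enqueueAgent(userId, job) {
  const tail = queues.get(userId) || Promise.resolve();
  // Start this job only after the user's previous job has settled.
  const next = tail.then(() => job());
  // Store a tail that never rejects, so one failure cannot poison the chain.
  const settled = next.catch(() => {});
  queues.set(userId, settled);
  // Drop the map entry once the chain is idle so the map does not grow forever.
  settled.then(() => {
    if (queues.get(userId) === settled) queues.delete(userId);
  });
  return next; // the caller still sees the job's own result or error
}
```

Jobs for one user run strictly one after another, while different users get separate chains and run concurrently.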

Prompts to build it yourself

The kind of instructions you'd hand an AI coding agent (Claude Code) to build this from scratch.

prompt

Build a Discord bot in Node with discord.js whose brain is a single function. On each message in an allowed channel, after dedup and channel filters, assemble a system prompt read from SOUL.md plus the user text, then POST an OpenAI-shaped chat payload to a local proxy at http://localhost:9340/v1/chat/completions with a 120 second timeout and an X-Agent-Id header. Serialize calls per user with a promise-chain map so one user's messages run one at a time. Return the assistant text from choices[0].message.content. Then write the proxy itself as a small http server on port 9340 that reads NVIDIA_API_KEY from env: if set, forward the request to https://integrate.api.nvidia.com/v1/chat/completions with model meta/llama-3.3-70b-instruct and a Bearer header; if not set, return a clear 502 for now. Use placeholders for every token and channel id. No hardcoded secrets.

prompt

Add two features to the bot above. First, a reply-cleaning step that replaces any 17 to 19 digit numeric ID in the model output with [user], and blocks replies that either match a small list of hallucination regexes or trip a structural check (three or more bold numbered list items AND a dramatic closing phrase AND an invented-sounding module name that is not on a known-real allowlist). Second, an SSH media channel to an OpenShell sandbox: write resolveOpenshell to find the openshell binary, an sshArgs helper that passes -o StrictHostKeyChecking=accept-new and -o UserKnownHostsFile=/dev/null and -o ConnectTimeout=10, and pushImageToSandbox that runs 'openshell sandbox ssh-config <SANDBOX_NAME>' to get a temp SSH config (mode 0600) then pipes a base64 image into the sandbox at /tmp/input_image.png. Validate SANDBOX_NAME against RFC 1123 label rules first. Keep all ids and tokens as placeholders.

The personas: Pixel, Mochi, and the Planner daemon

I run three personas as their own long-lived processes, kept apart from the main Nova bridge. Each one is a standalone Node script that reads its own bot token from its own .env file, carries a system prompt that defines its character, and points at whatever model I assign it. Pixel handles the social and creative side, Mochi plays the cat-flavored validator that checks logic and odds, and the Planner is the odd one out: a headless compute daemon with no Discord presence at all. That last difference is the one that matters, since Pixel and Mochi show up in the server and talk, while the Planner never touches Discord and only reads and writes files.

The reason I split them up is lane discipline. Pixel answers only when she is addressed by name or @mentioned, and she deliberately bails out of any message aimed at Nova so the two bots don't talk over each other. Mochi runs as a reaction layer that weighs in after Nova, and it also carries a second path for an open chatroom where it banters with other bot peers on a cooldown. Because each persona is its own PM2 process, I can restart one without disturbing the others, point each at a different model, and keep their personalities isolated in their own prompt files.

The Planner is the part worth studying. Mochi's command handlers (!odds, !drop, !validate, !result) compute nothing on their own. The compute ones append a task to a shared JSONL file and then poll a results file for the answer. The Planner daemon sits in a one-second loop reading that task file, sending each new task to an NVIDIA-hosted model under a strict JSON-only system prompt, and appending the parsed result. It is a file-backed work queue between two processes, so the Discord side and the compute side fail independently of each other.

One honest warning before you read the code: the function and comment names do not match what the code does. Mochi's LLM call is named callDeepSeek but actually posts to NVIDIA's llama-3.3-70b. Pixel's is named callNemotron but posts to a configurable OpenAI-compatible endpoint that defaults to a local model. I describe what each request really sends, not what the name suggests.

APIs & services

Service / API	What it does here	Docs
NVIDIA NIM (integrate.api.nvidia.com)	Free-tier hosted inference. Mochi calls meta/llama-3.3-70b-instruct; the Planner calls deepseek-ai/deepseek-v3.1-terminus; Pixel can target it through her nvidia.com branch.	docs ↗
Discord API (discord.js v14)	Each persona is its own gateway client and bot user, with the MessageContent intent so they can read and reply to messages.	docs ↗
Qdrant (via local :7338 HTTP wrapper)	Shared crew memory. Pixel and Mochi POST {cmd:'search'|'store'} to recall and save notes across personas; Qdrant itself runs on :6333 behind the wrapper.	docs ↗
Brave Search API	Optional web lookup for Pixel and Mochi; degrades to nothing when the key is unset.	docs ↗
memegen.link	Free, no-auth meme image generation for Mochi's chatroom via a [MEME: template | top | bottom] token.	docs ↗
Reddit hot.json	No-auth hot-post enrichment for Mochi's chatroom answers.	docs ↗
Tenor GIF (through the Discord bot token)	GIF search for Mochi via Discord's built-in Tenor proxy, using the existing bot token.	docs ↗
Google Drive API	Optional backup for Pixel's separate trend store (pixel-memory.js); skipped unless a folder id is set.	docs ↗

How it's built, step by step

Scaffold one Node script per persona (pixel.js, mochi.js, planner-daemon.js), each loading its own .env file and reading a persona-specific bot token with a generic fallback.
Create a discord.js Client with the Guilds, GuildMessages, MessageContent, and DirectMessages intents for Pixel and Mochi; give the Planner no client at all.
Write each persona's identity as a system prompt: Pixel's creative-director voice, Mochi's terse cat-validator plus a chattier chatroom voice, the Planner's strict JSON spec in a markdown file.
Gate the message handler so each persona replies only when addressed (mention or name) and stays out of other agents' lanes (Pixel ignores nova-addressed and her own messages).
Wire each persona's LLM caller as an OpenAI-compatible /v1/chat/completions POST and point it at NVIDIA NIM or a local endpoint.
Add shared Qdrant memory recall and store over the local :7338 wrapper, and parse the side-channel tokens each persona emits ([REMEMBER], [SEARCH], [ASK_NOVA], [ASK_MOCHI] for Pixel; [REMEMBER], [GIF], [MEME] for Mochi's chatroom) out of the model reply.
Build the file-backed task queue: Mochi's !odds and !drop append to tasks.jsonl and poll results.jsonl; the Planner daemon loops over tasks.jsonl, runs the model, and writes results.jsonl.
Run all three as separate PM2 processes so each can be restarted and configured on its own.

Under the hood

Three processes, three identities

Every persona loads its own env file and its own token, then either spins up a Discord client or doesn't. That single difference is what separates a chatty persona from the headless Planner.

const { Client, GatewayIntentBits, Partials } = require("discord.js");
const fs = require("fs"), path = require("path");

// Each persona loads its own env file so tokens never collide.
const envFile = path.join(__dirname, "..", ".env.pixel"); // .env.mochi / .env.deepseek
if (fs.existsSync(envFile)) {
  for (const line of fs.readFileSync(envFile, "utf8").split("\n")) {
    const m = line.match(/^([A-Z_][A-Z0-9_]*)=(.*)$/);
    if (m && !process.env[m[1]]) process.env[m[1]] = m[2];
  }
}

const DISCORD_BOT_TOKEN = process.env.PIXEL_BOT_TOKEN || process.env.DISCORD_BOT_TOKEN;

const client = new Client({
  intents: [
    GatewayIntentBits.DirectMessages,
    GatewayIntentBits.MessageContent,
    GatewayIntentBits.Guilds,
    GatewayIntentBits.GuildMessages,
  ],
  partials: [Partials.Channel, Partials.Message],
});
client.login(DISCORD_BOT_TOKEN); // the Planner daemon has none of this block

Lane gating

Pixel only speaks when spoken to, refuses to answer herself, and steps aside whenever Nova is the addressee. This is plain regex on the message content, so tune the names to your own bots.

const PIXEL_USER_ID = "<PIXEL_BOT_USER_ID>"; // her own snowflake, kept out of source

client.on("messageCreate", async (msg) => {
  if (msg.author.bot) return;

  // Stay in lane: ignore anything aimed at Nova, and never answer herself.
  if (/\b(nova|sportstwo|bignova)\b/i.test(msg.content)) return;
  if (msg.author.id === PIXEL_USER_ID) return;

  const isMentioned = msg.mentions.has(client.user.id);
  if (!isMentioned && !/\bpixel\b/i.test(msg.content)) return; // addressed-only

  // ...recall memory, call the model, strip side-channel tokens, reply
});

The persona LLM call

Pixel's caller is a generic OpenAI-compatible POST. The endpoint and model come from env, and it flips to HTTPS with a Bearer key the moment the URL points at NVIDIA. That is the whole cloud-or-local switch.

const LLM_URL = process.env.PIXEL_LLM_URL || "http://127.0.0.1:9342";
const MODEL   = process.env.PIXEL_MODEL   || "<LOCAL_MODEL_NAME>";

async function callModel(userMessage, systemPrompt) {
  const payload = JSON.stringify({
    model: MODEL,
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user",   content: userMessage },
    ],
    temperature: 0.85, top_p: 0.95, max_tokens: 512, stream: false,
  });

  const url = new URL(LLM_URL);
  const isNvidia = LLM_URL.includes("nvidia.com");
  const lib = isNvidia ? require("https") : require("http");

  return new Promise((resolve, reject) => {
    const req = lib.request({
      host: url.hostname,
      port: url.port || (isNvidia ? 443 : 80),
      path: "/v1/chat/completions",
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Content-Length": Buffer.byteLength(payload),
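        // PIXEL_NVIDIA_KEY wins, NVIDIA_API_KEY is the fallback
        ...(isNvidia ? { Authorization: `Bearer ${process.env.PIXEL_NVIDIA_KEY || process.env.NVIDIA_API_KEY}` } : {}),
      },
      timeout: 180000,
    }, (res) => {
      let body = ""; res.on("data", c => body += c);
      res.on("end", () => {
        try {
          const choice = JSON.parse(body).choices?.[0]?.message;
          // some instruct models route the answer to reasoning_content
          resolve((choice?.content || choice?.reasoning_content || "").trim());
        } catch (e) {
          reject(new Error(`model endpoint returned non-JSON: ${e.message}`));
        }
      });
    });
    // the timeout option only emits an event; destroy the request so the promise settles
    req.on("timeout", () => req.destroy(new Error("model request timed out after 180s")));
    req.on("error", reject);
    req.write(payload); req.end();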
  });
}

Mochi's caller, despite being named callDeepSeek, hardcodes NVIDIA and adds 429 backoff. Two souls drive it: a terse validator at temperature 0.15 and a looser chatroom voice at 0.85.

// Mochi: posts straight to NVIDIA NIM, llama-3.3-70b.
const payload = JSON.stringify({
  model: "meta/llama-3.3-70b-instruct",
  messages: [{ role: "system", content: soul }, { role: "user", content: userMessage }],
  temperature: opts.chatroom ? 0.85 : 0.15,
  max_tokens:  opts.chatroom ? 600  : 1500,
});
const req = https.request({
  hostname: "integrate.api.nvidia.com",
  path: "/v1/chat/completions",
  method: "POST",
  headers: { "Content-Type": "application/json",
             Authorization: `Bearer ${process.env.NVIDIA_API_KEY}` },
}, /* 429 → wait (attempt+1)*15s, up to 3 tries */);
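
The inline comment compresses the retry rule into one line. Spelled out, it could look like the sketch below. withRateLimitRetry and postOnce are hypothetical names: postOnce stands in for a single HTTPS call and is assumed to resolve to { status, body }. Throwing after the last try is my assumption about what happens when the limit never clears.

```javascript
// Hypothetical wrapper around one request function.
// postOnce must resolve to an object with a numeric `status` field.
const realSleep = ms => new Promise(r => setTimeout(r, ms));

async function withRateLimitRetry(postOnce, { tries = 3, stepMs = 15000, sleep = realSleep } = {}) {
  for (let attempt = 0; attempt < tries; attempt++) {
    const res = await postOnce();
    if (res.status !== 429) return res; // anything but "rate limited" goes back as-is
    // Linear backoff: 15s after the first 429, 30s after the second.
    if (attempt < tries - 1) await sleep((attempt + 1) * stepMs);
  }
  throw new Error(`still rate limited after ${tries} tries`);
}
```

The sleep function is injectable so the wait schedule can be checked without actually waiting 45 seconds.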

Shared crew memory and side-channels

Both personas reach the same memory through a small HTTP wrapper in front of Qdrant on :7338. The model can also emit bracket tokens that the persona acts on and then deletes from the visible reply.

async function memorySearch(query, userId, limit = 2) {
  const r = await fetch("http://localhost:7338", {
    method: "POST", headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ cmd: "search", query, userId, limit }),
  }).then(r => r.json()).catch(() => null);
  return r?.results || [];
}

// Tokens the model emits; acted on, then stripped before sending.
for (const m of reply.matchAll(/\[REMEMBER:\s*([\s\S]*?)\]/gi)) memoryStore(m[1].trim(), userId);
for (const m of reply.matchAll(/\[ASK_NOVA:\s*([\s\S]*?)\]/gi)) swarmCall("ask_nova", { message: m[1].trim() });
reply = reply.replace(/\[(?:REMEMBER|ASK_NOVA|ASK_MOCHI|SEARCH):[\s\S]*?\]/gi, "").trim();

Note that pixel-memory.js is a different store: a social-media trend log written to ~/.nemoclaw/pixel-trends.jsonl with detectPatterns/compareTrends for velocity and clustering, plus an optional Google Drive backup. It ships with the director persona but is not what Pixel queries on each message.

The file-backed task queue and the Planner loop

Mochi's compute commands write a task line and then poll for a matching result line. The Planner is the other end of that file.

// Mochi side: append a task, then poll for its result.
function submitTask(type, payload) {
  const taskId = `task-${Date.now()}-${Math.random().toString(36).slice(2, 9)}`;
  fs.appendFileSync(TASKS_FILE,
    JSON.stringify({ taskId, type, payload, submittedAt: new Date().toISOString() }) + "\n");
  return taskId;
}
function waitForResult(taskId, timeoutMs = 30000) {
  const start = Date.now();
  return new Promise((resolve) => {
    const poll = setInterval(() => {
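      // the results file may not exist until the Planner writes its first line
      const lines = fs.existsSync(RESULTS_FILE)
        ? fs.readFileSync(RESULTS_FILE, "utf8").split("\n").filter(Boolean) : [];
      for (const line of lines) {
        let r;
        try { r = JSON.parse(line); } catch { continue; } // skip a half-written line
        if (r.taskId === taskId) { clearInterval(poll); return resolve(r); }
      }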
      if (Date.now() - start > timeoutMs) { clearInterval(poll); resolve(null); }
    }, 500); // rescans the whole results file every half second
  });
}

// Planner daemon: no Discord, just files and the model.
const processed = new Set();
async function main() {
  while (true) {
    const lines = fs.existsSync(TASKS_FILE)
      ? fs.readFileSync(TASKS_FILE, "utf8").split("\n").filter(Boolean) : [];
    for (const line of lines) {
      const task = JSON.parse(line);
      if (processed.has(task.taskId)) continue;
      const text = await callModel(task);   // NVIDIA deepseek-v3.1-terminus, temp 0.1
      let result;
      try { result = JSON.parse(text); }    // the system prompt demands JSON-only
      catch { result = { taskId: task.taskId, type: task.type,
                         result: "ERROR", details: { error: "non-JSON output" } }; }
      fs.appendFileSync(RESULTS_FILE, JSON.stringify(result) + "\n");
      processed.add(task.taskId);
    }
    await new Promise(r => setTimeout(r, 1000)); // poll every second
  }
}
main();

The Planner's system prompt is a markdown file that names four task types (validateOutcome, calculateOdds, generateDrop, auditLogic), pins a deterministic xorshift32 RNG so anti-cheat re-simulation matches the client, and forbids any prose in the output. Every result line has to carry taskId, type, result, details, and completedAt.
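
For reference, a xorshift32 generator is only a few lines. This sketch assumes the classic 13/17/5 shift triple and a uint32 state; the prompt file's exact variant and seeding rule are not shown here, so check them before relying on matching output. makeXorshift32 is a hypothetical name.

```javascript
// Sketch of xorshift32 with the classic 13 / 17 / 5 shifts (an assumption:
// the Planner's prompt may pin a different triple or seeding rule).
function makeXorshift32(seed) {
  let x = seed >>> 0;
  if (x === 0) x = 1; // a zero state would stay zero forever
  return function next() {
    x ^= x << 13; x >>>= 0; // keep the state as an unsigned 32-bit value
    x ^= x >>> 17;
    x ^= x << 5;  x >>>= 0;
    return x;
  };
}

const rng = makeXorshift32(1);
console.log(rng()); // 270369
console.log(rng()); // 67634689
```

The same seed always yields the same sequence, which is what lets a server-side re-simulation reproduce a client's rolls exactly.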

Run it local & free

Cloud-free with NVIDIA NIM

Mochi and the Planner already call integrate.api.nvidia.com, so this path needs no rewriting. Create a free API key at build.nvidia.com, drop it into the env as NVIDIA_API_KEY (Mochi also accepts MOCHI_NVIDIA_KEY), and the two of them run as-is. Mochi uses meta/llama-3.3-70b-instruct and the Planner uses deepseek-ai/deepseek-v3.1-terminus, both available on the free NIM tier. To put Pixel on the same free tier, set PIXEL_LLM_URL to an integrate.api.nvidia.com URL and PIXEL_MODEL to a NIM model id; her isNvidia branch switches to HTTPS plus Bearer auth on its own (it reads PIXEL_NVIDIA_KEY or falls back to NVIDIA_API_KEY).

Fully local on your own GPU

Pixel's default PIXEL_LLM_URL is http://127.0.0.1:9342, a plain OpenAI-compatible endpoint with no key. Point it at any local server that speaks /v1/chat/completions: llama-server from llama.cpp, Ollama, or LM Studio, all of which run a 7B to 13B model comfortably on a single consumer card like my RTX 5080. To take Mochi and the Planner local too, swap their hardcoded integrate.api.nvidia.com host for your local server's host and port and set a local model id; both callers are otherwise the same OpenAI-compatible shape.

Free supporting services

Shared memory is local Qdrant (on :6333) behind the :7338 wrapper, so nothing leaves the machine. Mochi's media is free and mostly keyless: memegen.link needs no auth, Reddit's hot.json needs none, and GIF search rides your existing Discord bot token through Discord's Tenor proxy. Brave Search is optional and quietly does nothing when BRAVE_SEARCH_API_KEY is unset. Pixel's trend store writes to a local JSONL file and only reaches Google Drive when you set a folder id, so you can run the whole stack without a single paid service.

Gotchas & hard-won lessons

The Planner's processed-task tracking is an in-memory Set, and the loop re-reads the entire tasks.jsonl every cycle. Restart the daemon and it replays every historical task, appending duplicate result lines.
tasks.jsonl and results.jsonl are append-only and never truncated. waitForResult rescans the whole results file every 500ms, so both files grow without bound and polling gets slower over time. Rotate or compact them.
Function and comment names misname their backends. Mochi's callDeepSeek hits NVIDIA llama-3.3-70b, Pixel's callNemotron posts to a configurable endpoint that defaults to a local model, and only the Planner's deepseek name is accurate. Read the request, not the name.
Pixel and Mochi recall the shared Qdrant store on :7338 on every message inside a try/catch. If that wrapper is down they answer with no memory and give the user no error, which can look like amnesia rather than an outage.
pixel-memory.js (the trend store) is not the memory Pixel queries per message. pixel.js requires only the Brave helper, so the two memories are unrelated despite the shared persona name.
Each persona runs an internal HTTP server so the bridge or swarm-mcp can post as them (Pixel on :7701, Mochi on :7702). Launch a second copy of either and you get EADDRINUSE.
Lane gating is regex on raw text. If someone writes the word that matches another bot's name in an unrelated sentence, the persona stays silent by design, so adjust the patterns to your own bot names.
Mochi's compute commands depend on the Planner being alive. With the daemon stopped, !odds and !drop time out after 30s with no computed answer, because nothing ever writes the matching result line. !validate only replies with a static prompt for game-state input and !result just reads the existing results file, so those two keep working regardless.
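
One way to stop the restart replay described above is to rebuild the processed set from the results file at startup. This is a sketch, not the daemon's code: loadProcessedIds is a hypothetical helper, and it assumes every result line carries the taskId of the task it answers.

```javascript
const fs = require("fs");

// Hypothetical startup helper: collect the taskIds that already have a
// result line, so the Planner loop can skip them after a restart.
function loadProcessedIds(resultsFile) {
  const ids = new Set();
  if (!fs.existsSync(resultsFile)) return ids;
  for (const line of fs.readFileSync(resultsFile, "utf8").split("\n")) {
    if (!line.trim()) continue;
    try {
      const r = JSON.parse(line);
      if (r && typeof r.taskId === "string") ids.add(r.taskId);
    } catch {
      // a torn or half-written line is skipped rather than crashing startup
    }
  }
  return ids;
}

// Usage idea: const processed = loadProcessedIds(RESULTS_FILE);
// in place of starting from an empty Set.
```

It does not fix the unbounded file growth, so rotation or compaction is still needed on top.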

Prompts to build it yourself

The kind of instructions you'd hand an AI coding agent (Claude Code) to build this from scratch.

prompt

Build a standalone Node.js Discord persona bot as its own process using discord.js v14. It must: load its own .env file (for example .env.pixel) and read its bot token from a persona-specific variable with a generic DISCORD_BOT_TOKEN fallback; create a Client with the Guilds, GuildMessages, MessageContent, and DirectMessages intents; reply only when it is @mentioned or its name appears in the message, and explicitly ignore messages addressed to other bots and its own messages; hold its whole personality in one system-prompt string; call an OpenAI-compatible /v1/chat/completions endpoint whose URL and model come from env, sending HTTPS plus a Bearer NVIDIA_API_KEY when the URL contains nvidia.com and plain HTTP otherwise; recall and store shared memory by POSTing {cmd:'search'} and {cmd:'store'} to a local HTTP wrapper on port 7338; and support side-channel tokens like [REMEMBER: ...] and [ASK_NOVA: ...] that it parses out of the model reply, acts on, then strips before sending. Use placeholders for every token, channel id, and bot user id, and hardcode no secrets.

prompt

Build a two-process file-backed task queue. Process A is a Discord bot. Its !odds and !drop commands each build a task object {taskId, type, payload, submittedAt}, append it as one JSON line to ~/.nemoclaw/tasks.jsonl, then poll ~/.nemoclaw/results.jsonl every 500ms for a line whose taskId matches, with a 30s timeout; !result looks up an existing result by id in results.jsonl; !validate just replies with the game-state fields it expects. Process B is a headless daemon with no Discord client: in a one-second loop it reads tasks.jsonl, skips taskIds it has already handled, sends each new task to NVIDIA NIM (integrate.api.nvidia.com, model deepseek-ai/deepseek-v3.1-terminus, temperature 0.1) under a strict 'return ONLY valid JSON' system prompt loaded from a markdown file, parses the reply, and appends the result to results.jsonl, wrapping any non-JSON output as an ERROR result. Add a comment noting that the in-memory processed-set means a restart reprocesses the whole file and suggest persisting processed ids. Read NVIDIA_API_KEY from env.

Slash commands, buttons, and modals

This is the part of the bot a user actually touches. Everything visible in Discord (the autocomplete list when you type /, the buttons under a generated image, the little popup form for a GIF clip) lives across five files. slash-commands.js declares the commands and pushes them to Discord over REST. Three handler modules under handlers/ carry the runtime logic: commands.js (slash commands), buttons.js (every button click), and modals.js (which holds both the select-menu handler and the modal-submit handler in one file because they share the /create flow). And ui/components.js builds the button and menu rows. I kept the wiring deliberately boring so I can add a command without re-reading the whole codebase.

The shape is one dispatcher and four handler functions. A single client.on("interactionCreate", ...) block in discord-bridge.js checks what kind of interaction came in and forwards it: isButton() goes to handleButton, isStringSelectMenu() to handleStringSelectMenu, isModalSubmit() to handleModalSubmit, and a plain isChatInputCommand() to handleCommand. Each handler takes (interaction, deps). The ComfyUI and Grok service clients are required directly at the top of each handler module, so deps is reserved for the higher-level helpers: the Instagram poster (postToBuffer), the prompt enhancer (enhanceVideoPrompt), the local ZTurbo image generator, the sandbox agent runner, the backup and segment helpers, and the owner id. One quirk worth knowing is that those deps objects start out empty when the dispatcher is wired, then get filled in with Object.assign further down the file once the rest of the bridge has defined the helpers.
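
The dispatcher described above boils down to four type checks. This is a duck-typed sketch rather than the code in discord-bridge.js: dispatch and the handlers map are hypothetical names, and the interaction only needs the four discord.js type-guard methods.

```javascript
// Sketch: `interaction` is anything exposing the four discord.js type guards.
// The handlers object and this function name are assumptions for illustration.
async function dispatch(interaction, handlers, deps) {
  if (interaction.isButton()) return handlers.handleButton(interaction, deps);
  if (interaction.isStringSelectMenu()) return handlers.handleStringSelectMenu(interaction, deps);
  if (interaction.isModalSubmit()) return handlers.handleModalSubmit(interaction, deps);
  if (interaction.isChatInputCommand()) return handlers.handleCommand(interaction, deps);
  return undefined; // other interaction types are ignored in this sketch
}

// deps starts empty and is filled in later. Handlers receive the same object
// reference, so they see the helpers by the time an interaction arrives.
const deps = {};
Object.assign(deps, { ownerId: "<OWNER_USER_ID>" });

// Wiring idea: client.on("interactionCreate", i => dispatch(i, handlers, deps));
```

Filling deps with Object.assign instead of reassigning it is what makes the late binding work, since a reassignment would leave the handlers holding the old empty object.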

The trick that makes the whole thing hold together is the customId. Discord hands you back nothing but a short string when a component is clicked, so I encode both the route and a context key into it: btn_<action>_<msgId>. The button handler parses that with one regex, then looks up the per-message generation context (the prompt, the image buffer, the video buffer) from an in-memory map keyed by the id baked into that string. That key is the originating interaction id for a fresh generation, and the handler falls back to interaction.message.id and the referenced message id when it does not find a hit. Select menus carry their state the same way with colon-delimited ids, like create_edit_style:short, so a multi-step wizard can carry its choices forward without any session storage. When the context map gets cleared (a restart, say), the handlers fall back to re-downloading the media straight off the Discord message attachment.
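
A sketch of the customId round trip and the context lookup with its fallbacks. The helper names, the exact regex and the shape of the context object are assumptions for illustration; only the btn_<action>_<msgId> convention and the fallback order come from the description above.

```javascript
// Hypothetical helpers for the btn_<action>_<msgId> convention.
// The greedy (.+) lets an action contain underscores, e.g. make_video.
const BUTTON_ID_RE = /^btn_(.+)_(\d+)$/;

function makeButtonId(action, msgId) {
  return `btn_${action}_${msgId}`;
}

function parseButtonId(customId) {
  const m = BUTTON_ID_RE.exec(customId);
  return m ? { action: m[1], msgId: m[2] } : null;
}

// In-memory context map: id -> { prompt, imageBuffer, videoBuffer }
const contexts = new Map();

// Try the id baked into the button, then the clicked message, then the message
// it replied to. null tells the caller to re-download from the attachment.
function findContext(customId, messageId, referencedId) {
  const parsed = parseButtonId(customId);
  for (const key of [parsed && parsed.msgId, messageId, referencedId]) {
    if (key && contexts.has(key)) return contexts.get(key);
  }
  return null;
}
```

Discord caps a customId at 100 characters, so the string has room for a route and a key but not for the prompt itself, which is why the context lives in the map.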

Modals are the one piece with a sharp edge. A modal has to be the very first reply to an interaction, so any button that needs text input opens the modal immediately and does the slow work later, on the modal's submit. That single rule explains most of the control flow you see. A "Make Video" button pops a duration modal (modal_i2v_dur_video_<msgId>), and the actual render runs in handleModalSubmit, where the duration field is read and the clip is generated through ComfyUI.

APIs & services

Service / API	What it does here	Docs
discord.js	Builders (SlashCommandBuilder, ActionRowBuilder, ButtonBuilder, StringSelectMenuBuilder, ModalBuilder, TextInputBuilder) plus the REST client and Routes for registering commands	docs ↗
Discord Application Commands API	REST endpoint the bot PUTs its command definitions to (global registration via Routes.applicationCommands)	docs ↗
Discord Interactions (receiving and responding)	The reply/defer/update/showModal lifecycle and the 3-second acknowledge window every handler obeys	docs ↗
NVIDIA NIM (integrate.api.nvidia.com)	Free OpenAI-compatible chat completions for the prompt-enhance and caption-rewrite steps the handlers call into	docs ↗
ComfyUI	Local GPU backend the generation helpers (image/video/music) talk to over HTTP	docs ↗

How it's built, step by step

Declare each command with SlashCommandBuilder in slash-commands.js: setName, setDescription, then chain addStringOption / addIntegerOption / addBooleanOption / addAttachmentOption, with addChoices for dropdowns and addSubcommand for grouped actions (the /mc command does this).
On the client 'ready' event, call registerCommands(token, clientId): build a REST({ version: '10' }) client and PUT the JSON of all commands to Routes.applicationCommands(clientId) for a global register.
Wire one interactionCreate listener that branches on interaction type (isButton / isStringSelectMenu / isModalSubmit / isChatInputCommand) and forwards to the matching handler with a shared deps object.
In handleCommand, switch on interaction.commandName, read options with interaction.options.getString/getInteger/getAttachment, deferReply() for slow work, run the generation, then editReply with files plus component rows.
Build every button/menu row through a helper in ui/components.js so the customId convention (btn_<action>_<msgId>) and Discord's 5-buttons-per-row / 5-rows-per-message limits live in one place.
Store the per-message generation context (prompt, buffers, type) in the state map under the originating interaction id (the same id the buttons carry); button-driven follow-ups re-key the context under the new message id.
In handleButton, parse customId with /^btn_(\w+)_(.+)$/, resolve context by that key with fallbacks to interaction.message.id and the referenced message, then dispatch by action: reply, defer+editReply, update the message, or showModal for actions that need text input.
In handleStringSelectMenu, advance the wizard by reading interaction.values[0] and either interaction.update() with the next menu or showModal for the final input step, carrying state forward in colon-delimited customIds.
In handleModalSubmit, match the customId, read fields with interaction.fields.getTextInputValue(id), then run the deferred generation and reply with the result plus a fresh row of buttons.

Under the hood

Declaring commands

Commands are plain SlashCommandBuilder objects in an array. Options come in typed flavors, and addChoices turns a free-text option into a dropdown. Subcommands group related actions under one name, which is how the real /mc command exposes start, stop, and status.

const { SlashCommandBuilder, REST, Routes } = require("discord.js");

const commands = [
  new SlashCommandBuilder()
    .setName("imagine")
    .setDescription("Generate an image")
    .addStringOption(o => o.setName("prompt").setDescription("What to generate").setRequired(true))
    .addStringOption(o => o.setName("ratio").setDescription("Aspect ratio")
      .addChoices(
        { name: "1:1 (Square)",     value: "1:1" },
        { name: "16:9 (Landscape)", value: "16:9" },
      )),

  // Subcommands: /mc start | stop | status
  new SlashCommandBuilder()
    .setName("mc")
    .setDescription("Control the agent crew")
    .addSubcommand(s => s.setName("start").setDescription("Start the crew")
      .addIntegerOption(o => o.setName("agents").setDescription("1-4").setMinValue(1).setMaxValue(4)))
    .addSubcommand(s => s.setName("stop").setDescription("Stop the crew"))
    .addSubcommand(s => s.setName("status").setDescription("Show crew status")),
];

Registering over REST

Registration is one PUT. Routes.applicationCommands(clientId) registers globally, which is what I want in production. Global commands can take up to an hour to propagate, so during development I register to a single guild instead (see gotchas).

async function registerCommands(token, clientId) {
  const rest = new REST({ version: "10" }).setToken(token);
  await rest.put(
    Routes.applicationCommands(clientId),
    { body: commands.map(c => c.toJSON()) },
  );
}
// called from the ready handler:
//   await registerCommands(process.env.DISCORD_BOT_TOKEN, client.user.id);
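One way to switch between the two registration modes is to pick the route from an optional guild id. This is a sketch under assumptions: commandsRoute and the DEV_GUILD_ID env var are hypothetical names, and Routes is passed in as a parameter so the snippet runs without discord.js installed. In the bot you would pass the real Routes export.

```javascript
// Sketch: choose the registration route from an optional guild id.
// `Routes` is injected; in the bot it is the Routes export from discord.js.
function commandsRoute(Routes, clientId, guildId) {
  return guildId
    ? Routes.applicationGuildCommands(clientId, guildId) // one guild, for development
    : Routes.applicationCommands(clientId);              // global, for release
}

// Stand-in with the same two method names, only so this snippet runs on its own:
const FakeRoutes = {
  applicationCommands: (id) => `/applications/${id}/commands`,
  applicationGuildCommands: (id, gid) => `/applications/${id}/guilds/${gid}/commands`,
};

// DEV_GUILD_ID is a hypothetical env var: set it while iterating, unset it for release.
console.log(commandsRoute(FakeRoutes, "111", process.env.DEV_GUILD_ID));
```

Inside registerCommands, the result would replace the hard-coded Routes.applicationCommands(clientId) argument to rest.put.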

The one dispatcher

Every interaction lands here and is forwarded by type. The deps objects are how the higher-level helpers reach the handlers without the handlers importing them directly. They start empty and get filled in with Object.assign once the rest of the bridge has defined those helpers.

const { handleCommand }                             = require("./handlers/commands");
const { handleButton }                              = require("./handlers/buttons");
const { handleStringSelectMenu, handleModalSubmit } = require("./handlers/modals");

const commandsCtx = {}, buttonsCtx = {}, modalsCtx = {}; // filled later via Object.assign

client.on("interactionCreate", async (interaction) => {
  try {
    if (interaction.isButton())            { await handleButton(interaction, buttonsCtx); return; }
    if (interaction.isStringSelectMenu())  { await handleStringSelectMenu(interaction, modalsCtx); return; }
    if (interaction.isModalSubmit())       { await handleModalSubmit(interaction, modalsCtx); return; }
    if (!interaction.isChatInputCommand()) return;
    await handleCommand(interaction, commandsCtx);
  } catch (e) {
    const reply = interaction.deferred || interaction.replied
      ? (m) => interaction.editReply(m).catch(() => {})
      : (m) => interaction.reply({ content: m, ephemeral: true }).catch(() => {});
    await reply(`Error: ${e.message.slice(0, 200)}`);
  }
});
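The late-binding trick works because the handler keeps a reference to the same object that Object.assign later mutates. A small standalone sketch, with illustrative names (handleButtonDemo and this generateImage are stand-ins, not the real handlers):

```javascript
// Sketch of the late-binding deps pattern. Names are illustrative.
const buttonsCtx = {}; // handed to the handler at wiring time, still empty

async function handleButtonDemo(action, deps) {
  // The helper is looked up at call time, so it only has to exist
  // by the time the first click arrives, not when the listener is wired.
  return deps.generateImage(`demo:${action}`);
}

// Later, once the helper has been defined elsewhere in the bridge:
async function generateImage(prompt) {
  return Buffer.from(prompt); // stand-in for a real render
}
Object.assign(buttonsCtx, { generateImage }); // mutates the same object in place

handleButtonDemo("regen", buttonsCtx).then((buf) => console.log(buf.toString()));
```

Reassigning the variable (buttonsCtx = { generateImage }) would not work the same way, because anything that captured the old object would keep seeing the empty one.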

A command handler

handleCommand switches on interaction.commandName. The pattern is always the same: read options, deferReply() because generation is slow, do the work, then editReply with the file and a row of buttons. The originating interaction's id becomes the key for the stored context, and the same id is baked into the button customIds. generateImage below stands in for whatever backend you wire (in this stack that is a local ComfyUI workflow or the sandbox agent).

if (cmd === "imagine") {
  const prompt = interaction.options.getString("prompt");
  const ratio  = interaction.options.getString("ratio") || "1:1";
  await interaction.deferReply();

  const imgBuf = await generateImage(prompt, ratio);   // your backend
  const tmp = `/tmp/img-${Date.now()}.png`;
  fs.writeFileSync(tmp, imgBuf);

  await interaction.editReply({
    content: `🎨 *"${prompt.slice(0, 80)}"*`,
    files: [new AttachmentBuilder(tmp, { name: "image.png" })],
    components: imageButtons(interaction.id),           // builder from ui/components.js
  });
  state.setGenerationContext(interaction.id, { prompt, ratio, imageBuf: imgBuf, type: "image" });
  fs.unlink(tmp, () => {});
}

Component builders and the customId convention

ui/components.js is the only place that knows the id format and Discord's layout limits (5 buttons per row, 5 rows per message). Encoding the context key into every id is what lets a click hours later still find its data.

function imageButtons(msgId) {
  return [
    new ActionRowBuilder().addComponents(
      new ButtonBuilder().setCustomId(`btn_video_${msgId}`).setLabel("🎬 Make Video").setStyle(ButtonStyle.Primary),
      new ButtonBuilder().setCustomId(`btn_enhance_${msgId}`).setLabel("✨ Enhance & Video").setStyle(ButtonStyle.Primary),
      new ButtonBuilder().setCustomId(`btn_groki2v_${msgId}`).setLabel("🤖 Grok i2v").setStyle(ButtonStyle.Primary),
    ),
    new ActionRowBuilder().addComponents(
      new ButtonBuilder().setCustomId(`btn_post_${msgId}`).setLabel("📱 Post to IG").setStyle(ButtonStyle.Success),
      new ButtonBuilder().setCustomId(`btn_regen_${msgId}`).setLabel("🔄 Regenerate").setStyle(ButtonStyle.Secondary),
    ),
  ];
}

// Rows can be built conditionally. A "Stitch All" button only appears
// once two or more video segments have accumulated for this message:
function videoButtons(msgId) {
  const items = [ new ButtonBuilder().setCustomId(`btn_gif_${msgId}`).setLabel("🎞 Make GIF").setStyle(ButtonStyle.Secondary) ];
  const segs = state.getStorySegments(msgId);
  if (segs && segs.length >= 2) {
    items.push(new ButtonBuilder().setCustomId(`btn_stitch_${msgId}`).setLabel(`🎬 Stitch All (${segs.length})`).setStyle(ButtonStyle.Danger));
  }
  return [ new ActionRowBuilder().addComponents(...items) ];
}

The grid builder for Grok's four-image output is where the per-row cap actually bites: it packs the selection buttons plus a Regen button into one or two rows depending on count, so it never overflows a single ActionRow.
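The real grid builder is not shown here, but the row-splitting idea can be sketched as a generic chunker. It works on plain values so it runs without discord.js; in the bot each item would be a ButtonBuilder and each chunk would be wrapped in new ActionRowBuilder().addComponents(...chunk). The function name chunkIntoRows is an assumption.

```javascript
// Sketch: pack any number of buttons into rows of at most 5, capped at 5 rows.
const MAX_PER_ROW = 5;
const MAX_ROWS = 5;

function chunkIntoRows(items) {
  if (items.length > MAX_PER_ROW * MAX_ROWS) {
    throw new RangeError(`too many components: ${items.length}`);
  }
  const rows = [];
  for (let i = 0; i < items.length; i += MAX_PER_ROW) {
    rows.push(items.slice(i, i + MAX_PER_ROW));
  }
  return rows;
}

// Four selection buttons plus Regen fit one row; a sixth button spills into a second.
console.log(chunkIntoRows(["pick1", "pick2", "pick3", "pick4", "regen"]).length);
console.log(chunkIntoRows(["pick1", "pick2", "pick3", "pick4", "upscale", "regen"]).length);
```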

The button handler

Parse the id, resolve the context, dispatch on the action. Note the fallback chain for context, and that some actions open a modal rather than acting immediately.

async function handleButton(interaction, deps) {
  const [, action, msgId] = interaction.customId.match(/^btn_(\w+)_(.+)$/) || [];
  if (!action) return;

  const ctx = state.getGenerationContext(msgId)
    || state.getGenerationContext(interaction.message?.id)
    || state.getGenerationContext(interaction.message?.reference?.messageId)
    || {};

  // Action that needs text input -> open a modal as the FIRST response:
  if (action === "video") {
    const modal = new ModalBuilder()
      .setCustomId(`modal_i2v_dur_video_${msgId}`)
      .setTitle("🎬 Make Video")
      .addComponents(new ActionRowBuilder().addComponents(
        new TextInputBuilder()
          .setCustomId("i2v_duration")
          .setLabel("Duration in seconds (2-30, default 10)")
          .setStyle(TextInputStyle.Short)
          .setRequired(false),
      ));
    return interaction.showModal(modal);
  }

  // Action that just acts -> defer, work, edit:
  if (action === "regen") {
    await interaction.deferReply();
    if (!ctx.prompt) return interaction.editReply("⚠️ Lost the prompt for this image.");
    const seed = Math.floor(Math.random() * 2147483647); // fresh seed so the regen differs
    const imgBuf = await deps.generateImageWithZTurbo(ctx.prompt, seed, ctx.style || "none");
    /* write temp, editReply with imageButtons(interaction.message.id), store ctx */
  }

  // Owner-only action:
  if (action === "post") {
    if (interaction.user.id !== deps.OWNER_ID_GLOBAL) {
      return interaction.reply({ content: "⚠️ Owner only.", ephemeral: true });
    }
    /* ... */
  }
}
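The state module itself is not shown on this page. As a rough sketch of what the handlers rely on, here is a Map-backed stand-in that reuses the two method names from the snippets above, plus the fallback chain and the re-keying step. resolveContext and rekeyContext are hypothetical helper names.

```javascript
// Hypothetical stand-in for the state module: a Map behind the two
// method names the handlers call. The real module may differ.
const contexts = new Map();
const state = {
  setGenerationContext: (id, ctx) => contexts.set(id, ctx),
  getGenerationContext: (id) => (id ? contexts.get(id) : undefined),
};

// Resolve context the way handleButton does: the id from the customId first,
// then the message the button sits on, then the message it replies to.
function resolveContext(msgId, message) {
  return state.getGenerationContext(msgId)
    || state.getGenerationContext(message?.id)
    || state.getGenerationContext(message?.reference?.messageId)
    || {};
}

// Re-key after a follow-up so buttons on the new message find the same data.
function rekeyContext(oldId, newId) {
  const ctx = state.getGenerationContext(oldId);
  if (ctx) state.setGenerationContext(newId, ctx);
  return ctx;
}

state.setGenerationContext("100", { prompt: "a neon river", type: "image" });
rekeyContext("100", "200");
console.log(resolveContext("999", { id: "200" }).prompt); // found through the message id fallback
console.log(resolveContext("999", { id: "888" }));        // empty object on a miss or after a restart
```

Returning an empty object rather than undefined keeps every action's ctx.prompt check safe after a restart.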

Select menus as a wizard, ending in a modal

/create opens a type menu; each pick edits the same message into the next menu with interaction.update(); the final pick opens a modal. State rides along in the customId.

async function handleStringSelectMenu(interaction, deps) {
  if (interaction.customId === "create_type") {
    const type = interaction.values[0];                  // "image" | "video" | ...
    if (type === "image") {
      const row = new ActionRowBuilder().addComponents(
        new StringSelectMenuBuilder()
          .setCustomId("create_image_model")
          .setPlaceholder("Choose image model…")
          .addOptions([{ label: "ZImage Turbo", value: "zturbo", emoji: "⚡" }]),
      );
      return interaction.update({ content: "🎨 Pick a model:", components: [row] });
    }
  }

  if (interaction.customId === "create_image_model") {
    const model = interaction.values[0];
    const modal = new ModalBuilder()
      .setCustomId(`modal_create_img_${model}`)          // carries the model choice
      .setTitle("Describe the image")
      .addComponents(new ActionRowBuilder().addComponents(
        new TextInputBuilder().setCustomId("create_prompt").setLabel("Prompt").setStyle(TextInputStyle.Paragraph).setRequired(true),
      ));
    return interaction.showModal(modal);
  }
}
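For the colon-delimited form mentioned earlier (create_edit_style:short), packing and unpacking the choices can be sketched like this. packWizardId and unpackWizardId are hypothetical names, and the sketch assumes no choice value contains a colon.

```javascript
// Sketch: carry wizard choices in a colon-delimited customId.
// Choices must not contain ":" and the result must fit Discord's 100-character cap.
function packWizardId(step, choices = []) {
  const id = [step, ...choices].join(":");
  if (id.length > 100) throw new RangeError("customId exceeds 100 characters");
  return id;
}

function unpackWizardId(customId) {
  const [step, ...choices] = customId.split(":");
  return { step, choices };
}

const wizardId = packWizardId("create_edit_style", ["short"]);
console.log(wizardId);
console.log(unpackWizardId(wizardId));
```

Each step appends its pick before building the next menu, so the final modal submit can read every earlier choice out of its own customId.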

The modal submit

Match the id, pull fields by their input ids, then run the deferred work.

async function handleModalSubmit(interaction, deps) {
  const m = interaction.customId.match(/^modal_create_img_(\w+)$/);
  if (m) {
    const model  = m[1];
    const prompt = interaction.fields.getTextInputValue("create_prompt").trim();
    await interaction.deferReply();
    const seed = Math.floor(Math.random() * 2147483647); // random seed for this render
    const imgBuf = await deps.generateImageWithZTurbo(prompt, seed, "none");
    /* write temp, editReply with imageButtons(interaction.id), store ctx */
  }
}
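The duration modal's submit branch is not shown above. Its field is optional free text, so whatever comes back has to be coerced before it reaches the renderer. A sketch of that step, assuming a hypothetical parseDuration helper and the 2-30 second range with a default of 10 that the modal label states:

```javascript
// Sketch: turn the optional i2v_duration field into a safe number.
// Blank or non-numeric input falls back to the default; anything else is
// rounded and clamped to the range the modal label promises.
function parseDuration(raw, { min = 2, max = 30, fallback = 10 } = {}) {
  const n = Number.parseFloat(String(raw ?? "").trim());
  if (!Number.isFinite(n)) return fallback;
  return Math.max(min, Math.min(max, Math.round(n)));
}

console.log(parseDuration(""));     // blank field
console.log(parseDuration("45"));   // above the range
console.log(parseDuration(" 12 ")); // padded input
```

In the handler this would wrap the field read, along the lines of parseDuration(interaction.fields.getTextInputValue("i2v_duration")).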

Adding one command end to end

Say I want /remix <prompt> that generates and shows a Regenerate button.

In slash-commands.js, add the builder to the commands array and let registerCommands push it:

new SlashCommandBuilder()
  .setName("remix")
  .setDescription("Generate a remix from a prompt")
  .addStringOption(o => o.setName("prompt").setDescription("What to remix").setRequired(true)),

In handlers/commands.js, add a branch:

if (cmd === "remix") {
  const prompt = interaction.options.getString("prompt");
  await interaction.deferReply();
  const buf = await generateImage(prompt, "1:1");
  const tmp = `/tmp/remix-${Date.now()}.png`; fs.writeFileSync(tmp, buf);
  await interaction.editReply({
    content: `🎛 *"${prompt.slice(0, 80)}"*`,
    files: [new AttachmentBuilder(tmp, { name: "remix.png" })],
    components: remixButtons(interaction.id),
  });
  state.setGenerationContext(interaction.id, { prompt, imageBuf: buf, type: "remix" });
  fs.unlink(tmp, () => {});
  return;
}

In ui/components.js, add and export the builder:

function remixButtons(msgId) {
  return [ new ActionRowBuilder().addComponents(
    new ButtonBuilder().setCustomId(`btn_remixregen_${msgId}`).setLabel("🔄 Remix again").setStyle(ButtonStyle.Secondary),
  )];
}

In handlers/buttons.js, handle the remixregen action (the dispatcher already routes every button here, so no change there):

if (action === "remixregen") {
  await interaction.deferReply();
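  if (!ctx.prompt) return interaction.editReply("⚠️ Lost the prompt for this remix."); // ctx is empty after a restart
  const buf = await deps.generateImage(ctx.prompt, "1:1");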
  /* write temp, editReply with remixButtons(interaction.message.id), re-store ctx */
}

Restart the bot. Commands are re-registered on boot, so /remix shows up in Discord with its button working.

Run it local & free

The interactive shell costs nothing on its own. A Discord bot token is free from the Discord Developer Portal, and discord.js plus the REST registration in registerCommands run anywhere Node runs. The only things that can cost money are what the buttons trigger, the LLM calls and the media generation, and both can be free.

Free LLM for the enhance and caption steps

Several handlers call out to a small language model. The "Enhance" path (enhanceVideoPrompt) rewrites a video prompt before rendering, and the Instagram caption path (rewriteTitle / rewriteQuote) turns a raw prompt into a title and a short caption. For a from-scratch rebuild, point those at NVIDIA's free NIM endpoint, which is OpenAI-compatible, so the request body barely changes.

const res = await fetch("https://integrate.api.nvidia.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.NVIDIA_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "meta/llama-3.1-8b-instruct",       // any catalog model from build.nvidia.com
    messages: [{ role: "user", content: `Rewrite this prompt with rich cinematic detail: ${prompt}` }],
    max_tokens: 300,
    temperature: 0.8,
  }),
});
const enhanced = (await res.json()).choices?.[0]?.message?.content?.trim();

Grab a key at build.nvidia.com (the free tier gives you a credit allowance with no card). A fully local option also exists: run any small instruct model through Ollama or another OpenAI-compatible server and change only the base URL and the model name.
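Here is a sketch of how one request builder could serve both backends. LLM_BASE_URL and LLM_MODEL are hypothetical env var names, buildChatRequest is an illustrative helper, and the local URL assumes Ollama's default port 11434 with a model tag you have already pulled.

```javascript
// Sketch: one request builder, two backends, switched by env vars.
function buildChatRequest(prompt, env = process.env) {
  const baseUrl = env.LLM_BASE_URL || "https://integrate.api.nvidia.com/v1";
  const headers = { "Content-Type": "application/json" };
  if (env.NVIDIA_API_KEY) headers.Authorization = `Bearer ${env.NVIDIA_API_KEY}`;
  return {
    url: `${baseUrl}/chat/completions`,
    init: {
      method: "POST",
      headers,
      body: JSON.stringify({
        model: env.LLM_MODEL || "meta/llama-3.1-8b-instruct",
        messages: [{ role: "user", content: `Rewrite this prompt with rich cinematic detail: ${prompt}` }],
        max_tokens: 300,
      }),
    },
  };
}

// Local run (assumed values): LLM_BASE_URL=http://localhost:11434/v1 LLM_MODEL=llama3.1:8b
const { url, init } = buildChatRequest("neon rain over a river", {
  LLM_BASE_URL: "http://localhost:11434/v1",
  LLM_MODEL: "llama3.1:8b",
});
console.log(url);
console.log(JSON.parse(init.body).model);
// Then: const res = await fetch(url, init);
```

The response is read the same way in both cases, through choices[0].message.content.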

Free media generation on your own GPU

The image, video, and music helpers (generateImageWithZTurbo, comfy.generateVideoWithComfyUI, comfy.generateFastT2V, and the ACE-Step music call) are all HTTP calls to a local ComfyUI instance. ComfyUI is open source and runs on a consumer GPU, so generation costs electricity rather than an API bill. Install ComfyUI, load a fast image workflow (a Turbo-style checkpoint) and a text-to-video workflow, expose them over ComfyUI's HTTP API, and have your generation helper queue a prompt and poll for the result. The handler code in this section does not change at all, because it only ever sees a Buffer come back.

Putting it together for zero dollars

Run the bot process locally, register commands globally once on boot, keep ComfyUI running on the same machine for media, and route the text steps to NVIDIA NIM (hosted, on a free key) or a local model (fully local and free). The interaction layer, the buttons, the modals, and the wizard all behave identically whether the backend is a paid cloud service or your own box.

Gotchas & hard-won lessons

Every interaction must be acknowledged within 3 seconds. If the work is slow, deferReply() (or deferUpdate() for a menu) right away and editReply later. Skip the defer and Discord shows 'This interaction failed'.
showModal must be the FIRST response to an interaction. You cannot deferReply and then showModal. That is why buttons that need input open the modal immediately and do the heavy work on the modal's submit instead.
Global command registration via Routes.applicationCommands can take up to an hour to appear. For fast iteration register to one guild with Routes.applicationGuildCommands(clientId, guildId), which updates instantly, then switch to global for release.
customId is capped at 100 characters. The btn_<action>_<msgId> scheme fits, but do not try to stuff a whole prompt in there. Keep ids to a route plus a lookup key and hold the real payload in your state map.
Discord enforces 5 buttons per ActionRow and 5 rows per message. The grid builder splits rows to respect this. A 6th button in one row throws at send time, not at definition time.
Action names are parsed out of customId by a single regex. If two actions would collide on a numeric suffix (grokvid0 versus grokvid), encode the index inside the action name so the parser does not confuse them.
Use reply for a brand-new message, update to edit the message the component lives on (menus do this to advance a wizard), and editReply only after you have deferred. Mixing these up double-replies and errors out.
In-memory generation context is lost on restart. The handlers recover by re-downloading the image or video straight off the Discord message attachments, so design any new action to tolerate an empty ctx.
Owner-only actions compare interaction.user.id against your own user id. Read that id from an env var rather than pasting it inline, or the check becomes a published secret and an awkward thing to rotate.
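The restart-recovery lesson above can be sketched as a small helper. This is an assumption about shape, not the real code: recoverBuffer is a hypothetical name, message is duck-typed to the part of a discord.js Message it touches (attachments.first() returning an object with a url), and fetchImpl defaults to the global fetch in Node 18 and later.

```javascript
// Sketch: rebuild a missing buffer from the message the button sits on.
async function recoverBuffer(ctx, message, fetchImpl = fetch) {
  if (ctx.imageBuf) return ctx.imageBuf;          // context survived, nothing to do
  const attachment = message?.attachments?.first?.();
  if (!attachment?.url) return null;              // nothing to recover from
  const res = await fetchImpl(attachment.url);
  if (!res.ok) throw new Error(`attachment download failed: HTTP ${res.status}`);
  return Buffer.from(await res.arrayBuffer());
}

// Fakes so the sketch runs without Discord or the network:
const fakeMessage = { attachments: { first: () => ({ url: "https://example.invalid/image.png" }) } };
const fakeFetch = async () => ({
  ok: true,
  status: 200,
  arrayBuffer: async () => new Uint8Array([1, 2, 3]).buffer,
});

recoverBuffer({}, fakeMessage, fakeFetch).then((buf) => console.log(buf.length));
```

A handler would call it as recoverBuffer(ctx, interaction.message) and bail out with a short message when it returns null.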

Prompts to build it yourself

The kind of instructions you'd hand an AI coding agent (Claude Code) to build this from scratch.

prompt

Build a Discord.js v14 slash-command layer in Node. Create slash-commands.js that defines commands with SlashCommandBuilder (support string/integer/boolean/attachment options, addChoices dropdowns, and addSubcommand groups) and exports registerCommands(token, clientId) that PUTs them via new REST({ version: '10' }) to Routes.applicationCommands(clientId). Then add a single client.on('interactionCreate') dispatcher that branches on isButton / isStringSelectMenu / isModalSubmit / isChatInputCommand and forwards each to a handler in handlers/ with a shared deps object that I populate later via Object.assign. Wrap the whole dispatcher in try/catch that replies an error, using editReply if already deferred and an ephemeral reply otherwise. Read the bot token from process.env.DISCORD_BOT_TOKEN and never hardcode any id.

prompt

Add interactive components to my Discord bot. Create ui/components.js with builder functions that return arrays of ActionRowBuilder, using the customId convention btn_<action>_<msgId> and respecting the 5-buttons-per-row and 5-rows-per-message limits (split rows when needed, and add a button conditionally based on state). Then write handlers/buttons.js exporting handleButton(interaction, deps): parse customId with /^btn_(\w+)_(.+)$/, look up per-message context from a state map (with fallbacks to interaction.message.id and the referenced message), and dispatch by action. Some actions should deferReply then editReply, one should call interaction.showModal as the first response, and one should be gated to a single owner id read from an env var. Also write handlers/modals.js exporting handleStringSelectMenu (advance a multi-step /create wizard with interaction.update and carry choices in colon-delimited customIds, ending in showModal) and handleModalSubmit (match the customId, read fields with interaction.fields.getTextInputValue, then run the deferred generation). Show me how to add one new command end to end across all four files.

Making images, video and music on your own GPU

The media side is my favorite part to run locally, because once a GPU is in the box the marginal cost of a generated image drops to zero. Video and music ride the same hardware. The shape of every media command is the same: a user types something like /video a neon river at night in Discord, the bot loads a ComfyUI workflow graph, patches the prompt into the right node, submits it to the GPU, waits for the render, and posts the finished file straight back into the channel as an attachment. No per-call billing sits anywhere on that path.

Three kinds of output run through ComfyUI on my own card. Stills come from a ZImage Turbo workflow behind /zturbo. Clips run through LTX Video 2.3: /video for plain text-to-video and /combi for a first-frame-plus-last-frame blend. Songs come from ACE-Step behind /music. The cloud lane is there when I want it: Grok/Aurora for images through a Playwright-driven browser on a local grok-server, and Suno for songs. Those cost money or ride fragile auth, so I treat them as optional and lead with the local path.

Once I have raw clips and a track, two compositors turn them into something postable. One is a plain ffmpeg engine (video-editor.js) that does Ken Burns on stills, crossfades or beat-synced jumpcuts, lyric overlays, an audio mix, and a final size pass to fit Discord's upload cap. The other (capcut-compose.js plus capcut-client.js) builds CapCut drafts over a local API so I can finish by hand in the desktop app. The deeper timeline math and renderer internals are shared with the Content Factory page, so I only cover the bot-specific glue here and point you at /factory.html for the rest.

If you do not have an NVIDIA card, none of this is out of reach. NVIDIA's NIM endpoints at integrate.api.nvidia.com give you a free hosted lane for the inference, and the bot glue stays the same: a command in, a file out.

APIs & services

Service / API	What it does here	Docs
ComfyUI	Local inference server. The bot submits a workflow graph to /prompt, polls /history, and pulls the output from /view. Runs ZImage Turbo, LTX video, and ACE-Step music on one GPU.	docs ↗
NVIDIA NIM (build.nvidia.com)	Free hosted inference fallback at integrate.api.nvidia.com. Hosts FLUX image models and LLMs for prompt expansion on a free key, OpenAI-compatible. The no-GPU path.	docs ↗
LTX-Video 2.3 (Lightricks)	Local text-to-video and image-to-video model run as a ComfyUI workflow. The low-VRAM path uses Q4 GGUF weights plus a distilled LoRA and 2-pass sampling so it fits a consumer card.	docs ↗
ACE-Step	Local text-to-music model run as a ComfyUI workflow. Generates songs with optional lyrics, or instrumental when the lyric string is blank.	docs ↗
Z-Image Turbo	Fast local text-to-image model run as a ComfyUI workflow, a few seconds per still. Drives /zturbo and the /create image option.	n/a
FFmpeg	Local compositor in video-editor.js: Ken Burns, crossfades, beat-synced jumpcuts, lyric overlays, audio mix, and a re-encode pass to fit Discord's upload limit.	docs ↗
CapCut Mate API	Local draft builder on http://localhost:30000. The second compose path: assembles a CapCut draft with effects, filters, captions, and beat-synced cuts for desktop export.	n/a
Suno (studio-api)	Optional cloud music. Needs a refresh JWT pulled from the browser plus a captcha solver, so the local ACE-Step path is the stable free alternative.	n/a
discord.js	Slash command registration plus file upload. The finished Buffer is written to a temp file, wrapped in an AttachmentBuilder, and sent with editReply.	docs ↗

How it's built, step by step

Register a slash command (for example /video, /zturbo, or /music) with a prompt string option in discord.js.
On invoke, call deferReply() right away, since a GPU render takes far longer than Discord's 3-second ack window.
Load the matching ComfyUI workflow JSON from disk at call time, not at boot, so a missing or stale graph surfaces per request.
Patch the input nodes by id: the positive-prompt text, a random seed, the duration, and any uploaded image names.
Call freeComfyMemory() so the GPU unloads the previous model and can load this one.
POST the graph to ComfyUI /prompt and capture the returned prompt_id.
Poll /history/<prompt_id> every few seconds until status.completed, handling the error and timeout cases.
Download the finished file from /view as a Buffer.
Optionally compose: feed stills, clips, and the track through video-editor.js (ffmpeg) or capcut-compose.js, then size-guard the result under Discord's 25MB cap.
Write the Buffer to a temp file and editReply with it wrapped in an AttachmentBuilder, so the file lands back in the channel.

Under the hood

The command-to-ComfyUI-to-Discord loop

Every media command has the same skeleton in handlers/commands.js. Defer first, render, write the Buffer to a temp file, attach it, then clean up. This is the /video text-to-video path, trimmed to the load-bearing lines:

// handlers/commands.js
if (cmd === "video") {
  const prompt = interaction.options.getString("prompt");
  await interaction.deferReply(); // a render outlasts Discord's 3s ack window

  const queue = await comfy.getComfyQueueStatus();
  await interaction.editReply(`Rendering: "${prompt.slice(0, 60)}"` +
    (queue.total ? ` (${queue.total} in queue)` : ""));

  const videoBuf = await comfy.generateVideoWithComfyUI(prompt, null); // T2V, returns a Buffer
  const tmp = `/tmp/video-${Date.now()}.mp4`;
  fs.writeFileSync(tmp, videoBuf);
  await interaction.editReply({
    content: `"${prompt.slice(0, 60)}" via LTX Video 2.3`,
    files: [new AttachmentBuilder(tmp, { name: "video.mp4" })],
  });
  fs.unlinkSync(tmp);
}

The image command /zturbo is the same idea with generateImageWithZTurbo(prompt, seed, style) returning a PNG Buffer, and /music swaps in generateMusicWithAceStep(tags, lyrics, duration). Only the generator changes between commands; the defer-render-attach wrapper stays put.

Patching a workflow graph by node id

A ComfyUI workflow is a JSON object keyed by node id. The bot reads the exported graph, overwrites the inputs it cares about, and submits. Node numbers are specific to the saved file, so the code addresses them directly. This is the fast text-to-video workflow:

// services/comfy.js
async function generateFastT2V(prompt, durationSec = 10) {
  await freeComfyMemory();                 // unload whatever model is resident
  const seed = Math.floor(Math.random() * 2147483647);
  const dur  = Math.max(2, Math.min(30, durationSec || 10));

  const workflow = JSON.parse(fs.readFileSync(COMFY_T2V_FAST_WORKFLOW, "utf-8"));
  workflow["18"].inputs.text       = prompt; // positive prompt node
  workflow["33"].inputs.noise_seed = seed;   // pass 1 seed
  workflow["61"].inputs.noise_seed = seed;   // pass 2 seed
  workflow["48"].inputs.value      = dur;    // duration in seconds

  const promptId = await submitComfyWorkflow(workflow);
  const fileInfo = await waitForComfyResult(promptId, 900000);
  return await downloadComfyFile(fileInfo);  // Buffer of the finished mp4
}

For image-to-video the bot uploads the source frame first with uploadImageToComfyUI() and writes the returned name into the LoadImage node, then runs the same submit-and-poll loop.
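Since the patches address nodes by number, a renumbered graph can take a write in the wrong place without any error. One defensive option, sketched here under assumptions, is to check the node's class_type and the input name before writing. patchWorkflow is a hypothetical helper, not part of ComfyUI or the real services/comfy.js, and the two-node graph is a stand-in.

```javascript
// Sketch: apply patches by node id, but fail loudly when the node is missing,
// is a different kind of node than expected, or lacks the input being set.
function patchWorkflow(workflow, patches) {
  for (const [nodeId, { expect, set }] of Object.entries(patches)) {
    const node = workflow[nodeId];
    if (!node || !node.inputs) throw new Error(`workflow has no node "${nodeId}"`);
    if (expect && node.class_type !== expect) {
      throw new Error(`node "${nodeId}" is ${node.class_type}, expected ${expect}`);
    }
    for (const [key, value] of Object.entries(set)) {
      if (!(key in node.inputs)) throw new Error(`node "${nodeId}" has no input "${key}"`);
      node.inputs[key] = value;
    }
  }
  return workflow;
}

// Tiny stand-in graph in ComfyUI's API format (node id -> class_type + inputs):
const graph = {
  "18": { class_type: "CLIPTextEncode", inputs: { text: "", clip: ["5", 0] } },
  "48": { class_type: "PrimitiveInt", inputs: { value: 10 } },
};
patchWorkflow(graph, {
  "18": { expect: "CLIPTextEncode", set: { text: "a neon river at night" } },
  "48": { set: { value: 6 } },
});
console.log(graph["18"].inputs.text, graph["48"].inputs.value);
```

The check turns a quietly wrong render into an error at submit time, which is a much easier failure to notice.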

Talking to ComfyUI

One small HTTP helper handles every call. The interesting part is host discovery: from WSL, ComfyUI running on the Windows host is reachable at whatever nameserver shows up in /etc/resolv.conf, not localhost. The code also prefers a queue proxy on port 5002 when its health check reports redis ok, and falls back to direct ComfyUI on 8188 otherwise.

function getComfyHost() {
  if (_comfyQueueAvailable === true) return "localhost"; // queue proxy on :5002
  if (process.env.COMFYUI_HOST) return process.env.COMFYUI_HOST;
  // WSL: ComfyUI on the Windows host is the resolv.conf nameserver
  try {
    const m = fs.readFileSync("/etc/resolv.conf", "utf-8").match(/^nameserver\s+(\S+)/m);
    return m ? m[1] : "<WINDOWS_HOST_IP>";
  } catch { return "<WINDOWS_HOST_IP>"; }
}
function getComfyPort() {
  return _comfyQueueAvailable === true ? 5002 : 8188; // proxy vs direct
}

freeComfyMemory() is what lets a single card do stills, video, and music in one session. It POSTs /free with unload_models before each job so the next model has room:

async function freeComfyMemory() {
  await comfyRequest("POST", "/free",
    JSON.stringify({ unload_models: true, free_memory: true }));
}
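The submit, poll, and download helpers are used throughout this section but never shown. Here is a rough sketch of how they could look against ComfyUI's /prompt, /history and /view endpoints. The function names mirror the ones above, but the bodies and the extra base and http parameters are my assumptions, added so the loop can be exercised with a fake fetch.

```javascript
// Sketch of the submit -> poll -> download loop. `http` is any fetch-compatible function.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function submitComfyWorkflow(base, workflow, http = fetch) {
  const res = await http(`${base}/prompt`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt: workflow }),
  });
  if (!res.ok) throw new Error(`ComfyUI rejected the workflow: HTTP ${res.status}`);
  return (await res.json()).prompt_id;
}

async function waitForComfyResult(base, promptId, { timeoutMs = 900000, intervalMs = 3000, http = fetch } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const res = await http(`${base}/history/${promptId}`);
    const entry = res.ok ? (await res.json())[promptId] : undefined; // absent until the run finishes
    if (entry?.status?.status_str === "error") throw new Error("ComfyUI reported a failed run");
    if (entry?.status?.completed) {
      // Outputs are keyed by node id; return the first file descriptor found.
      for (const out of Object.values(entry.outputs || {})) {
        for (const list of Object.values(out)) {
          if (Array.isArray(list) && list[0]?.filename) return list[0];
        }
      }
      throw new Error("run completed but produced no output file");
    }
    await sleep(intervalMs);
  }
  throw new Error(`timed out waiting for ${promptId}`);
}

async function downloadComfyFile(base, { filename, subfolder = "", type = "output" }, http = fetch) {
  const qs = new URLSearchParams({ filename, subfolder, type });
  const res = await http(`${base}/view?${qs}`);
  if (!res.ok) throw new Error(`download failed: HTTP ${res.status}`);
  return Buffer.from(await res.arrayBuffer());
}
```

With base set to something like `http://${getComfyHost()}:${getComfyPort()}`, these slot into generateFastT2V in the same order as the calls shown earlier.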

Music locally with ACE-Step

Same pattern, audio output. Tags go in the caption node, lyrics in the lyrics node, and an empty lyric string flips the workflow to instrumental so the model does not sing gibberish over silence:

async function generateMusicWithAceStep(tags, lyrics, durationSec = 60) {
  await freeComfyMemory();
  const workflow = JSON.parse(fs.readFileSync(ACEMUSIC_WORKFLOW, "utf-8"));
  workflow["4"].inputs.caption      = tags;            // genre / mood prompt
  workflow["3"].inputs.lyrics       = lyrics || "";    // tolerate a missing lyrics option
  workflow["2"].inputs.duration     = durationSec;
  workflow["2"].inputs.seed         = Math.floor(Math.random() * 2147483647);
  workflow["2"].inputs.instrumental = !(lyrics || "").trim();  // blank lyrics => instrumental

  const promptId = await submitComfyWorkflow(workflow);
  const fileInfo = await waitForComfyAudio(promptId);  // prefers the SaveAudioMP3 output
  return await downloadComfyFile(fileInfo);            // Buffer of the mp3
}

Composing the result

After generation I usually want a real edit, not a bare clip. video-editor.js is the ffmpeg engine: it builds a timeline from stills and clips, applies a color style, optionally syncs cuts to detected beats (the brainslop and ludicrous styles use a jumpcut timeline), burns lyric captions, mixes the song, and ends with a size guard that re-encodes anything over 24MB so it clears Discord's 25MB upload limit. The CapCut path in capcut-compose.js and capcut-client.js instead builds a draft over a local API (all times in microseconds, and the *_infos endpoints hand back JSON strings you feed straight into the add_* calls) for finishing in the desktop app. The shared timeline and renderer internals live with the Content Factory, so I keep those out of this section and send you to /factory.html.
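The size guard comes down to simple arithmetic: a byte budget divided by the clip length gives the total bitrate that fits, and whatever the audio does not use is left for video. The real video-editor.js is not shown here, so treat this as a sketch. targetVideoKbps is a hypothetical name and the 128 kbps audio figure is an assumption.

```javascript
// Sketch: the video bitrate a re-encode would need to land under a byte budget.
function targetVideoKbps(durationSec, { budgetMB = 24, audioKbps = 128 } = {}) {
  if (!(durationSec > 0)) throw new RangeError("durationSec must be positive");
  const budgetKbits = (budgetMB * 1024 * 1024 * 8) / 1000; // budget in kilobits
  const totalKbps = budgetKbits / durationSec;             // total bitrate that fits
  return Math.max(0, Math.floor(totalKbps - audioKbps));   // what is left for video
}

console.log(targetVideoKbps(30)); // a 30 second clip
console.log(targetVideoKbps(60)); // a 60 second clip
```

The result would go to ffmpeg as a video bitrate such as -b:v 3227k. Container overhead is not counted, which is one reason to aim at 24MB rather than the full 25MB.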

Run it local & free

Fully local on your own GPU ($0 per generation)

Everything the bot makes by default runs through one ComfyUI install on a single NVIDIA card:

Stills: a ZImage Turbo workflow, a few seconds per image.
Video: LTX Video 2.3 on a low-VRAM path (Q4 GGUF weights, a distilled LoRA, 2-pass sampling) so it fits a consumer card.
Music: an ACE-Step workflow, full songs with optional lyrics.

The piece that lets one card cover all three is freeComfyMemory(), which POSTs /free with unload_models before each job. Install ComfyUI, drop the workflow JSONs in a folder, point the bot at http://<COMFY_HOST>:8188, and the per-render cost is whatever your power bill says.

No GPU? Free in the cloud through NVIDIA NIM

Grab a free key at build.nvidia.com and call integrate.api.nvidia.com. It is OpenAI-compatible, so the prompt-expansion step runs there with no card:

curl https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer <YOUR_NVIDIA_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"model":"meta/llama-3.1-70b-instruct",
       "messages":[{"role":"user",
         "content":"Rewrite into a vivid LTX video prompt: neon rain over a river"}]}'

NIM also hosts FLUX image models in the same catalog, so the /zturbo image step has a hosted fallback. Swap the ComfyUI call for a NIM request and keep the rest of the command identical.

Where both options exist, pick local

Music: ACE-Step in ComfyUI is local and free. Suno (suno.js) is the cloud alternative. It renews short-lived access tokens every minute or so, but it depends on a refresh JWT you pull from browser DevTools plus a captcha solver, so it breaks when that refresh token expires. Local wins on price and reliability.
Images: ZImage Turbo local, NIM FLUX cloud-free, or Grok/Aurora (paid, browser-driven). Default to the first.
Prompt writing: a local LLM through Ollama, or a NIM LLM on the free key. Either is $0.
Chained video frames: the chain mode calls a hosted image model to paint the target last frame between segments. Point that at a local ComfyUI still instead to keep the whole chain free.

Gotchas & hard-won lessons

A GPU render blows past Discord's 3-second interaction ack window. You must deferReply() immediately and editReply with the file later, or the interaction dies before the render finishes.
freeComfyMemory() before every job is what lets one card swap between LTX video, ZImage stills, and ACE-Step music. Skip it and the second model runs out of memory on a consumer card.
Workflows are patched by node id (for example 18 = positive prompt and 48 = duration in the T2V graph; 2/3/4 in the ACE-Step graph). If you re-export the graph from ComfyUI the ids renumber and the patch silently writes to the wrong nodes. Pin the workflow JSON.
From WSL, ComfyUI running on the Windows host is reachable at the nameserver address listed in /etc/resolv.conf, not at localhost. Set COMFYUI_HOST to override this when you run ComfyUI somewhere else.
The bot checks a queue proxy on port 5002 at startup; if its health endpoint reports redis ok it routes jobs there, otherwise it falls back to direct ComfyUI on 8188. Know which one is actually serving when a job stalls.
ACE-Step with an empty lyric string produces gibberish vocals. The code forces instrumental when the lyrics are blank, so keep that guard if you reimplement it.
Discord caps uploads near 25MB. video-editor.js re-encodes anything over 24MB to fit, so render at high quality and let the size guard shrink it on the way out.
The LTX low-VRAM path leans on Q4 GGUF weights, a distilled LoRA, and 2-pass sampling to fit a consumer card. That is what makes the node map dense and fragile to re-export.
CapCut Mate takes all times in microseconds, and its *_infos endpoints return JSON strings you pass directly into add_videos / add_images / add_captions. It also needs a local file server so the CapCut side can fetch your media by URL.
Suno's direct path renews short-lived access tokens about every 60 seconds, but the real fragility is the refresh JWT you pull from the browser plus a captcha solver. When that refresh token expires the whole path breaks. Local ACE-Step has none of that overhead.
A second ComfyUI instance bound to a second GPU (cuda:1) can render stills in parallel while the main instance renders video, via the withComfyPort port-override (set _COMFY_PORT_OVERRIDE for the still job).
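The WSL host lookup from the list above can be sketched as follows. This is an illustration under assumptions, not the bridge's code: parseNameserver and resolveComfyHost are hypothetical names, and the resolv.conf trick only points at the Windows host under WSL's default networking. On ordinary Linux the nameserver is just a DNS server, so keep the COMFYUI_HOST override first.

```javascript
const fs = require("fs");

// Pull the first nameserver address out of resolv.conf text.
function parseNameserver(resolvConfText) {
  for (const line of resolvConfText.split(/\r?\n/)) {
    const m = line.match(/^\s*nameserver\s+(\S+)/);
    if (m) return m[1];
  }
  return null;
}

// Hypothetical resolver: env override first, then WSL discovery, then localhost.
function resolveComfyHost(env = process.env, readFile = p => fs.readFileSync(p, "utf8")) {
  if (env.COMFYUI_HOST) return env.COMFYUI_HOST;
  try {
    return parseNameserver(readFile("/etc/resolv.conf")) ?? "127.0.0.1";
  } catch {
    return "127.0.0.1"; // no resolv.conf (for example on Windows itself)
  }
}

console.log(parseNameserver("# generated by WSL\nnameserver 172.22.0.1\n")); // 172.22.0.1
```

The readFile parameter exists only so the lookup can be exercised without touching the real file.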

Prompts to build it yourself

The kind of instructions you'd hand an AI coding agent (Claude Code) to build this from scratch.

prompt

Build a discord.js slash command /video for an existing bot. It takes a prompt string option. On invoke: call deferReply() first because a GPU render exceeds Discord's 3-second ack window; load a ComfyUI workflow JSON from disk; patch the positive-prompt node and a random seed by node id; before submitting, POST /free with {"unload_models":true,"free_memory":true} so the GPU can load this model; POST the graph to ComfyUI /prompt at http://<COMFY_HOST>:8188 and read the prompt_id; poll /history/<prompt_id> every 5 seconds until status.completed, handling the error and timeout cases; download the finished mp4 from /view; write it to a temp file; then editReply with the file wrapped in an AttachmentBuilder. Add WSL host discovery that reads the nameserver from /etc/resolv.conf when COMFYUI_HOST is unset. Keep all hosts, ports, and ids in env vars or placeholders, with no hardcoded secrets.

prompt

Add a /music command that drives a local ComfyUI ACE-Step workflow: patch the caption (tags), lyrics, duration, and seed nodes by id; force instrumental when the lyric string is blank; submit the graph, poll /history for the audio output (prefer the SaveAudioMP3 output), download the mp3, and post it back to the channel as an attachment. Then add a compose fallback with ffmpeg: take a set of generated images plus the mp3, build a Ken Burns timeline, mix the audio, and re-encode the result to stay under Discord's 25MB upload cap. Reuse the existing ComfyUI HTTP helpers. No secrets in code; use placeholders for any host or key.

Posting out, and keeping the bots alive 24/7

Once a bot can make images, music, and video inside Discord, the next problem is getting that media out into the world and keeping the whole stack running while nobody is watching. This section covers the last mile: pushing a finished render to Instagram, parking the file on a public URL first because Instagram refuses to read a Discord CDN link or a raw buffer, backing every generated file up to Google Drive, and running the bot under pm2 so a crash or a reboot does not take it down for the night.

The piece that surprises most people is that you cannot hand Instagram a file. The Graph API wants a publicly reachable URL that it can fetch on its own, so my bridge uploads the bytes somewhere public, gets back a link, and only then starts the Instagram post. I run a Cloudflare R2 worker as the primary host (no OAuth, just a static upload secret, and a 7-day object lifecycle that is plenty since Instagram fetches within seconds), with Catbox Litterbox for video and Imgur for images as the fallback when R2 is not configured or is down. After the URL exists, Instagram takes two write calls with a wait wedged between them: create a media container, poll it until it reports FINISHED, then publish.

On the ops side, the single most important fact is that pm2 does not source your shell profile. If your secrets live in .bashrc or .profile, a pm2-managed process starts blind. My fix is one plain env file, ~/.nemoclaw_env, that the bridge parses by hand on startup, and that the smoke test reads to confirm the critical keys are present before anything restarts. Everything else about keeping the lights on follows a few hard rules I learned the painful way: restart processes, never delete them, run the smoke test as a gate before every restart, and pm2 save after adding a process so it survives a reboot.

I also auto-back-up every render to Drive. Generation is cheap to trigger and slow to redo, so the moment a buffer comes back larger than a kilobyte I drop it to a temp file and fire it off to a Drive folder in the background. On top of that, a separate cron job bundles the scripts, the env files, and the pm2 state into a tarball and uploads it to Drive every thirty minutes, keeping the last 48 copies.
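The keep-the-last-48 rule is simple enough to sketch. This is a guess at the shape, not the cron script itself: backupsToDelete is a hypothetical helper, and it assumes each tarball name embeds a sortable timestamp so that sorting by name is sorting by age.

```javascript
// Hypothetical retention helper. Assumes each backup name embeds a sortable
// timestamp, e.g. backup-20250101-0930.tar.gz, so lexical order is time order.
function backupsToDelete(names, keep = 48) {
  const sorted = [...names].sort();            // oldest first
  return sorted.slice(0, Math.max(0, sorted.length - keep));
}

const names = [
  "backup-20250101-1000.tar.gz",
  "backup-20250101-0930.tar.gz",
  "backup-20250101-1030.tar.gz",
];
console.log(backupsToDelete(names, 2)); // [ 'backup-20250101-0930.tar.gz' ]
```

At one tarball every thirty minutes, 48 copies works out to 24 hours of history.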

APIs & services

Service / API	What it does here	Docs
Instagram Graph API (Facebook Graph API v21.0)	Publish images and Reels to an Instagram Business/Creator account via the create-container-then-publish flow, using a long-lived Page access token.	docs ↗
Cloudflare R2 (via a Worker)	Primary public host for the rendered media. The bridge PUTs the bytes to a small Worker that stores them in an R2 bucket and returns a public URL. Auth is a static upload-secret header, no OAuth.	docs ↗
Catbox / Litterbox	Free anonymous fallback host for video (72h TTL). Used when R2 is unavailable.	docs ↗
Imgur API	Free anonymous fallback host for images, via Client-ID auth and a base64 upload.	docs ↗
Google Drive API v3	Backup target for every generated file, plus a 30-minute scripts/env/pm2 tarball. Auth is a personal OAuth refresh token (preferred) or a service-account JWT.	docs ↗
Google OAuth2 token endpoint	Mints short-lived access tokens for Drive from either a refresh token or a service-account JWT (RS256); results cached ~55 minutes.	docs ↗
NVIDIA NIM (integrate.api.nvidia.com)	Free-tier hosted inference used as the prompt-enhance fallback (minimaxai/minimax-m2.7) and for the image-description vision call (meta/llama-3.2-90b-vision-instruct). The smoke test pings /v1/models as a liveness check.	docs ↗
pm2	Process manager that keeps the bridge and sibling agents alive, restarts on crash, and resurrects on reboot. Not a web API, but the ops backbone.	docs ↗

How it's built, step by step

Put all secrets in one flat env file (~/.nemoclaw_env) as KEY=value lines, because pm2 will not load your shell profile.
At the top of the bridge, parse that env file line by line into process.env before any other module reads a key.
When a render finishes, back the buffer up to Google Drive in the background if it is larger than ~1KB.
To post, first upload the bytes to a public host: try the Cloudflare R2 worker (PUT with a secret header), fall back to Catbox for video or Imgur for images.
Create an Instagram media container with the public URL and the caption (image_url for stills, media_type=REELS + video_url + share_to_feed for video).
Poll the container's status_code every few seconds until it reads FINISHED (bail on ERROR), giving video a longer interval than images.
Call media_publish with the container's creation id to push it live.
Separately, run a token-expiry check that warns when the Page token is within 7 days of expiring so you re-mint it before it dies.
For ops, gate every restart: node --check the entrypoint and run the bridge unit tests (pre-restart-bridge.sh) or run node scripts/smoke-test.js, and only pm2 restart (never delete) if the gate exits 0.
Run pm2 save after adding any process so the dump includes it and it comes back on reboot.

Under the hood

One env file that pm2 can actually read

pm2 launches processes without sourcing ~/.bashrc or ~/.profile, so a key you export-ed in your shell is invisible to a pm2-managed bot. The bridge solves this by reading one flat file at startup, before any service module touches process.env:

const fs   = require("fs");
const os   = require("os");
const path = require("path");

// Load ~/.nemoclaw_env if present (pm2 doesn't source shell profiles)
const envFile = path.join(os.homedir(), ".nemoclaw_env");
if (fs.existsSync(envFile)) {
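  // Split on \r?\n: with a plain "\n" split, a file saved with Windows line
  // endings leaves a trailing \r on each line and the regex silently skips it
  for (const line of fs.readFileSync(envFile, "utf8").split(/\r?\n/)) {
    // KEY=value only; comments and blank lines do not match. Values are taken verbatim.
    const m = line.match(/^([A-Z_][A-Z0-9_]*)=(.*)$/);
    if (m) process.env[m[1]] = m[2];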
  }
}

The env file is just KEY=value lines, never committed:

DISCORD_BOT_TOKEN=<YOUR_DISCORD_BOT_TOKEN>
NVIDIA_API_KEY=<YOUR_NVIDIA_API_KEY>
IG_USER_ID=<YOUR_IG_BUSINESS_USER_ID>
FB_PAGE_TOKEN=<YOUR_LONG_LIVED_PAGE_TOKEN>
FB_APP_ID=<YOUR_FB_APP_ID>
FB_APP_SECRET=<YOUR_FB_APP_SECRET>
R2_WORKER_URL=https://<your-worker>.workers.dev
R2_UPLOAD_SECRET=<YOUR_R2_UPLOAD_SECRET>
GDRIVE_FOLDER_ID=<YOUR_DRIVE_FOLDER_ID>
GDRIVE_SA_KEY=/home/you/secrets/gdrive-service-account.json

Get the media onto a public URL

Instagram fetches the file itself, so I need a link, not bytes. R2 is the primary host through a tiny Cloudflare Worker that takes a PUT /upload with a secret header and returns { ok, url }:

function uploadToR2(fileBuffer, mimeType) {
  if (!R2_WORKER_URL || !R2_UPLOAD_SECRET) {
    return Promise.reject(new Error("R2 not configured"));
  }
  const url = new URL("/upload", R2_WORKER_URL);
  return new Promise((resolve, reject) => {
    const req = https.request({
      hostname: url.hostname, path: url.pathname, method: "PUT",
      headers: {
        "Content-Type":   mimeType || "application/octet-stream",
        "Content-Length": fileBuffer.length,
        "X-Upload-Secret": R2_UPLOAD_SECRET,
      },
    }, res => {
      let d = ""; res.on("data", c => d += c);
      res.on("end", () => {
        if (res.statusCode !== 200) return reject(new Error(`R2 ${res.statusCode}`));
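        // A non-JSON body must reject the promise, not throw inside the event handler
        let j;
        try { j = JSON.parse(d); } catch { return reject(new Error("R2 returned non-JSON")); }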
        if (!j.ok || !j.url) return reject(new Error("R2 bad response"));
        resolve(j.url);
      });
    });
    req.on("error", reject);
    req.write(fileBuffer);
    req.end();
  });
}

The fallback uses free anonymous hosts: Catbox Litterbox for video (multipart form, 72h TTL) and Imgur for images (base64 with a Client-ID). The Imgur client id is yours, never hardcode mine:

async function getPublicMediaUrl(fileBuffer, mimeType) {
  if (mimeType?.startsWith("video/")) {
    // Catbox Litterbox: multipart POST to litterbox.catbox.moe with
    // reqtype=fileupload and time=72h; returns a plain-text https:// URL
    /* ... build the multipart body, POST, resolve(text.trim()) ... */
  }
  // Images -> Imgur anonymous upload
  const body = `image=${encodeURIComponent(fileBuffer.toString("base64"))}&type=base64`;
  return new Promise((resolve, reject) => {
    const req = https.request({
      hostname: "api.imgur.com", path: "/3/image", method: "POST",
      headers: {
        "Authorization":  "Client-ID <YOUR_IMGUR_CLIENT_ID>",
        "Content-Type":   "application/x-www-form-urlencoded",
        "Content-Length": Buffer.byteLength(body),
      },
    }, res => {
      let d = ""; res.on("data", c => d += c);
      res.on("end", () => {
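        // Reject on a non-JSON body instead of throwing inside the event handler
        let j;
        try { j = JSON.parse(d); } catch { return reject(new Error("Imgur returned non-JSON")); }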
        if (!j.success) return reject(new Error("Imgur upload failed"));
        resolve(j.data.link);
      });
    });
    req.on("error", reject); req.write(body); req.end();
  });
}
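The video branch above is left elided. For readers who want a starting point, here is one way the Litterbox upload could be written with Node 18+'s built-in fetch, FormData, and Blob. Treat it as a sketch: the endpoint path and the fileToUpload field name are my assumptions about Litterbox's upload form, not values taken from the bridge, and the function names are hypothetical.

```javascript
// Sketch only. The endpoint path and the fileToUpload field name are assumptions;
// verify them against Litterbox before relying on this.
const LITTERBOX_API = "https://litterbox.catbox.moe/resources/internals/api.php";

function buildLitterboxForm(fileBuffer, mimeType, fileName = "render.mp4") {
  const form = new FormData();                 // global in Node 18+
  form.append("reqtype", "fileupload");
  form.append("time", "72h");                  // the 72h TTL described above
  form.append("fileToUpload", new Blob([fileBuffer], { type: mimeType }), fileName);
  return form;
}

async function uploadToLitterbox(fileBuffer, mimeType, fetchImpl = fetch) {
  const res = await fetchImpl(LITTERBOX_API, {
    method: "POST",
    body: buildLitterboxForm(fileBuffer, mimeType),
  });
  const text = (await res.text()).trim();      // the host answers with a plain-text URL
  if (!res.ok || !text.startsWith("https://")) {
    throw new Error(`Litterbox upload failed: ${text.slice(0, 80)}`);
  }
  return text;
}
```

Checking that the response starts with https:// matters because an error from a plain-text API would otherwise be passed to Instagram as if it were a link.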

The Instagram container-then-publish dance

Every Graph write call is a POST to graph.facebook.com/v21.0 with the Page token in the query (the status poll is a GET against the same host). One small helper covers the writes:

function graphApiRequest(apiPath, params = {}) {
  return new Promise((resolve, reject) => {
    const qs = new URLSearchParams({ ...params, access_token: FB_PAGE_TOKEN }).toString();
    const req = https.request({
      hostname: "graph.facebook.com",
      path: `/v21.0${apiPath}?${qs}`,
      method: "POST",
    }, res => {
      let d = ""; res.on("data", c => d += c);
      res.on("end", () => { try { resolve(JSON.parse(d)); } catch { resolve({ raw: d }); } });
    });
    req.on("error", reject);
    req.end();
  });
}

In the bridge this lives in one function called postToBuffer: it takes the raw buffer, gets it onto a public URL, then runs the create/poll/publish cycle. Video needs media_type: REELS plus share_to_feed, while a still just needs image_url:

// The real bridge calls this postToBuffer; it uploads the buffer, then posts.
async function postToBuffer({ text, mediaBuffer, mimeType, channels = ["instagram"] }) {
  if (!FB_PAGE_TOKEN) throw new Error("FB_PAGE_TOKEN not set");
  const isVideo = mimeType?.startsWith("video/");

  // 1. Get the bytes onto a public URL: R2 first, Catbox/Imgur fallback
  let mediaUrl;
  try {
    mediaUrl = await uploadToR2(mediaBuffer, mimeType || (isVideo ? "video/mp4" : "image/png"));
  } catch {
    mediaUrl = await getPublicMediaUrl(mediaBuffer, mimeType || (isVideo ? "video/mp4" : "image/png"));
  }
  if (!channels.includes("instagram")) return [{ mediaUrl }];

  // 2. Create the media container
  const containerParams = {
    caption: text || "",
    ...(isVideo
      ? { media_type: "REELS", video_url: mediaUrl, share_to_feed: "true" }
      : { image_url: mediaUrl }),
  };
  const container = await graphApiRequest(`/${IG_USER_ID}/media`, containerParams);
  if (container.error) throw new Error(container.error.message);
  const creationId = container.id;

  // 3. Poll until the container finishes processing (video gets a longer interval)
  let finished = false;
  for (let i = 0; i < 24 && !finished; i++) {
    await new Promise(r => setTimeout(r, isVideo ? 5000 : 3000));
    const qs = new URLSearchParams({
      fields: "status_code,status,error_message",
      access_token: FB_PAGE_TOKEN,
    }).toString();
    const status = await new Promise((resolve, reject) => {
      https.get(`https://graph.facebook.com/v21.0/${creationId}?${qs}`, res => {
        let d = ""; res.on("data", c => d += c);
        // Reject on a non-JSON body instead of throwing inside the event handler
        res.on("end", () => { try { resolve(JSON.parse(d)); } catch (e) { reject(e); } });
      }).on("error", reject);
    });
    if (status.status_code === "ERROR")
      throw new Error(`container failed: ${status.error_message || "unknown"}`);
    finished = status.status_code === "FINISHED";
  }
  // Never publish a container that did not reach FINISHED within the 24 tries
  if (!finished) throw new Error("container still processing after 24 polls");

  // 4. Publish
  const publish = await graphApiRequest(`/${IG_USER_ID}/media_publish`, { creation_id: creationId });
  if (publish.error) throw new Error(publish.error.message);
  return publish.id;
}

The real function returns a per-channel results array and catches each channel's error instead of throwing, so one failed post does not take down the others. The shape above is the teaching version.
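The per-channel behavior described above could look something like this. It is a sketch of the pattern, not the bridge's code: postToChannels and the posters map are hypothetical, and the result row shape is my own choice.

```javascript
// Hypothetical shape: one poster function per channel name.
async function postToChannels(channels, posters, payload) {
  const settled = await Promise.allSettled(
    channels.map(async name => {
      const poster = posters[name];
      if (!poster) throw new Error(`no poster for channel ${name}`);
      return poster(payload);
    })
  );
  // Turn each outcome into a plain result row instead of throwing.
  return settled.map((s, i) =>
    s.status === "fulfilled"
      ? { channel: channels[i], ok: true, id: s.value }
      : { channel: channels[i], ok: false, error: s.reason.message }
  );
}
```

Promise.allSettled waits for every channel and never rejects, which is what keeps one failed post from taking down the others. The caller decides what to do with any row where ok is false.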

Watch the Page token expiry

A long-lived Page token lasts about 60 days. The bridge does not auto-rotate it. The function is named refreshPageTokenIfNeeded, but despite the name it only checks debug_token and logs a warning when there are fewer than 7 days left, so I re-mint it by hand before it dies.

// Despite the name, this only warns; it does not re-mint the token.
async function refreshPageTokenIfNeeded() {
  if (!FB_APP_ID || !FB_APP_SECRET || !FB_PAGE_TOKEN) return;
  const qs = new URLSearchParams({
    input_token: FB_PAGE_TOKEN,
    access_token: `${FB_APP_ID}|${FB_APP_SECRET}`,
  }).toString();
  const res = await new Promise((resolve, reject) =>
    https.get(`https://graph.facebook.com/v21.0/debug_token?${qs}`, r => {
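      let d = ""; r.on("data", c => d += c);
      // Reject on a non-JSON body instead of throwing inside the event handler
      r.on("end", () => { try { resolve(JSON.parse(d)); } catch (e) { reject(e); } });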
    }).on("error", reject));
  const exp = res?.data?.expires_at;
  if (exp && exp > 0) {
    const daysLeft = (exp - Date.now() / 1000) / 86400;
    if (daysLeft < 7) console.warn(`[ig] Page token expires in ${daysLeft.toFixed(1)} days, refresh soon`);
  }
}

Back every render up to Drive

The moment a render comes back, I write it to a temp file and push it to a Drive folder in the background. Anything under ~1KB is skipped as junk.

const MEDIA_FOLDER_ID = process.env.GDRIVE_MEDIA_FOLDER_ID || process.env.GDRIVE_FOLDER_ID || "";
function backupMedia(buf, fileName, mimeType) {
  if (!MEDIA_FOLDER_ID || !buf || buf.length < 1024) return;
  // basename() keeps a fileName containing slashes from escaping /tmp
  const tmp = `/tmp/gdrive-upload-${Date.now()}-${require("path").basename(fileName)}`;
  fs.writeFileSync(tmp, buf);
  gdrive.uploadToDrive(tmp, mimeType, fileName, MEDIA_FOLDER_ID)
    .then(() => console.log(`[gdrive] backed up ${fileName}`))
    .catch(e => console.warn(`[gdrive] backup failed: ${e.message}`))
    .finally(() => { try { fs.unlinkSync(tmp); } catch {} });  // always remove the temp file
}

The Drive module prefers a personal OAuth refresh token when one is set, and falls back to a service-account JWT (signed RS256, no user consent screen). It caches the resulting access token for 55 minutes. The folder you upload to must be shared with the service-account email:

// Service-account JWT -> Google token endpoint -> access_token
async function getServiceAccountToken(scopes) {
  const key = JSON.parse(fs.readFileSync(SA_KEY_PATH, "utf8"));
  const now = Math.floor(Date.now() / 1000);
  const header  = base64url(Buffer.from(JSON.stringify({ alg: "RS256", typ: "JWT" })));
  const payload = base64url(Buffer.from(JSON.stringify({
    iss: key.client_email,
    scope: scopes.join(" "),
    aud: "https://oauth2.googleapis.com/token",
    exp: now + 3600, iat: now,
  })));
  const sign = require("crypto").createSign("RSA-SHA256");
  sign.update(`${header}.${payload}`);
  const jwt = `${header}.${payload}.${base64url(sign.sign(key.private_key))}`;
  // POST grant_type=urn:ietf:params:oauth:grant-type:jwt-bearer&assertion=<jwt>
  // to oauth2.googleapis.com/token, returns access_token
  /* ... */
}
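The snippet above leans on a base64url helper that is not shown. Here is a self-contained sketch of the signing half that can be run without a real service-account file. The base64url implementation is my assumption about what that helper does, signJwt is a hypothetical name, and the key pair is a throwaway generated on the spot.

```javascript
const nodeCrypto = require("crypto");

// Assumed implementation of the base64url helper the snippet above relies on.
function base64url(buf) {
  return Buffer.from(buf).toString("base64url"); // URL-safe alphabet, no padding
}

function signJwt(claims, privateKeyPem) {
  const header  = base64url(Buffer.from(JSON.stringify({ alg: "RS256", typ: "JWT" })));
  const payload = base64url(Buffer.from(JSON.stringify(claims)));
  const signer = nodeCrypto.createSign("RSA-SHA256");
  signer.update(`${header}.${payload}`);
  return `${header}.${payload}.${base64url(signer.sign(privateKeyPem))}`;
}

// Throwaway key pair so the example runs without a real service-account file.
const { privateKey, publicKey } = nodeCrypto.generateKeyPairSync("rsa", {
  modulusLength: 2048,
  publicKeyEncoding:  { type: "spki",  format: "pem" },
  privateKeyEncoding: { type: "pkcs8", format: "pem" },
});

const jwt = signJwt({ iss: "<SA_NAME>@<YOUR_PROJECT>.iam.gserviceaccount.com" }, privateKey);
const [h, p, s] = jwt.split(".");
const verifier = nodeCrypto.createVerify("RSA-SHA256");
verifier.update(`${h}.${p}`);
const valid = verifier.verify(publicKey, Buffer.from(s, "base64url"));
console.log(valid); // true
```

Verifying against the matching public key is a quick local check that the three parts were encoded and joined correctly before you send the assertion to Google.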

Share the target folder with the address printed by the setup helper, which looks like <SA_NAME>@<YOUR_PROJECT>.iam.gserviceaccount.com, then set GDRIVE_FOLDER_ID in the env file.
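The 55-minute access-token cache mentioned above can be sketched in a few lines. This is my illustration, not the Drive module's code: makeTokenCache is a hypothetical name, and fetchToken stands for whichever path mints the token (refresh token or service-account JWT). The JWT above asks for a 3600-second lifetime, so 55 minutes presumably leaves a few minutes of margin.

```javascript
// Hypothetical cache wrapper: fetchToken is whatever mints a fresh access token.
// The now parameter is injectable so the expiry logic can be checked without waiting.
function makeTokenCache(fetchToken, ttlMs = 55 * 60 * 1000, now = Date.now) {
  let token = null;
  let expiresAt = 0;
  return async function getToken() {
    if (token && now() < expiresAt) return token;   // still fresh, skip the round trip
    token = await fetchToken();
    expiresAt = now() + ttlMs;
    return token;
  };
}
```

Usage would be something like const getToken = makeTokenCache(() => getServiceAccountToken(scopes)), then await getToken() before each Drive call.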

Keep it alive: gate the restart, never delete

Before I bounce the bridge I run a pre-restart gate. The real script syntax-checks the entrypoint and runs the bridge unit tests, and aborts if either fails:

#!/usr/bin/env bash
# pre-restart-bridge.sh
set -euo pipefail
cd "$(dirname "$0")/.."

echo "[pre-restart] syntax check"
node --check scripts/discord-bridge.js || { echo "SYNTAX ERROR, aborting restart"; exit 1; }

echo "[pre-restart] running bridge unit tests"
npx vitest run --project bridge --reporter=dot || { echo "TESTS FAILED, aborting restart"; exit 1; }

echo "All checks passed. Safe to restart."

The deeper gate is scripts/smoke-test.js. It node -c syntax-checks every critical script, require()s each sibling module to catch broken exports, asserts DISCORD_BOT_TOKEN and NVIDIA_API_KEY actually loaded from the env file (those two hard-fail, everything else only warns), and curls a few live endpoints including https://integrate.api.nvidia.com/v1/models and the ComfyUI /queue on the LAN host. After the restart, a tiny smoke-bridge.sh curls the bridge's own /health on port 9341 to confirm it came back and Discord reconnected. The operational rules around all this are short and non-negotiable:

node scripts/smoke-test.js && pm2 restart discord-bridge --update-env
pm2 save        # AFTER adding a process, so the dump includes it on reboot
# NEVER: pm2 delete <name>   (wipes env vars and tokens loaded into the process)
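The env-key part of the smoke test, where two keys hard-fail and the rest only warn, might look like the following. It is a sketch under assumptions: checkEnv and runEnvGate are hypothetical names, and the list of warn-only keys is just an example drawn from the env file shown earlier.

```javascript
// Hypothetical check: required keys hard-fail, optional keys only warn.
function checkEnv(env, required, optional = []) {
  const isSet = k => typeof env[k] === "string" && env[k].trim() !== "";
  return {
    missing:  required.filter(k => !isSet(k)),
    warnings: optional.filter(k => !isSet(k)),
  };
}

function runEnvGate(env = process.env) {
  const { missing, warnings } = checkEnv(
    env,
    ["DISCORD_BOT_TOKEN", "NVIDIA_API_KEY"],
    ["FB_PAGE_TOKEN", "R2_WORKER_URL", "GDRIVE_FOLDER_ID"]
  );
  for (const k of warnings) console.warn(`[smoke] WARN ${k} is not set`);
  for (const k of missing)  console.error(`[smoke] FAIL ${k} is not set`);
  return missing.length === 0 ? 0 : 1; // use as the process exit code
}
```

Ending the script with process.exit(runEnvGate()) lets the && in the restart line above act as the gate: a nonzero exit means pm2 restart never runs.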

Run it local & free

What is already free here

Most of this last mile costs nothing. The public-URL step has a fully free path: Catbox Litterbox (anonymous, 72h video) and Imgur (anonymous image upload with a free Client-ID you register) need no paid account. Cloudflare R2 has a free tier that comfortably covers a hobby bot, and the Worker in front of it is free too, so R2 as primary plus Catbox/Imgur as fallback is a zero-dollar hosting chain. Google Drive backup is free on a normal account; a service-account key from a free Google Cloud project signs the JWT with no billing attached, and a personal OAuth refresh token works just as well. pm2 is free and open source.

Inference: cloud-free vs fully local

For the LLM work that surrounds posting, this stack uses NVIDIA's free hosted NIM endpoints as the fallback path. The prompt-enhance step tries Gemini first (through the bridge's vertex proxy) and falls back to minimaxai/minimax-m2.7 on NIM, while the image-description vision call posts straight to NIM with meta/llama-3.2-90b-vision-instruct. Both hit https://integrate.api.nvidia.com/v1/chat/completions with Authorization: Bearer <YOUR_NVIDIA_API_KEY>. That NIM key is the cloud-free option: sign up at build.nvidia.com, grab a key, and you get rate-limited but free access to those models.

curl https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer <YOUR_NVIDIA_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"model":"meta/llama-3.2-90b-vision-instruct","messages":[{"role":"user","content":"hello"}],"max_tokens":64}'

For a fully local alternative, point those same chat calls at a local server. The media side of this stack already runs locally through ComfyUI on a local GPU (the smoke test even curls the ComfyUI /queue on the LAN host), so the renders you are posting cost nothing to produce. For the text calls, run a local LLM that speaks the OpenAI chat shape (an OpenAI-compatible server on, say, http://localhost:11434/v1) and swap the hostname. The bridge already has a primary-then-NVIDIA-fallback seam for prompt enhancement: that is the natural place to drop in your local server, either as the primary or in place of the NVIDIA hostname.
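One way to picture that seam is a list of OpenAI-compatible providers tried in order. This is a sketch, not the bridge's implementation: chatWithFallback and the provider objects are hypothetical, the local model name is a placeholder, and it assumes Node 18+ for the global fetch.

```javascript
// Hypothetical provider list: each entry is an OpenAI-compatible base URL.
async function chatWithFallback(providers, messages, fetchImpl = fetch) {
  let lastError;
  for (const p of providers) {
    try {
      const res = await fetchImpl(`${p.baseUrl}/chat/completions`, {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          ...(p.apiKey ? { Authorization: `Bearer ${p.apiKey}` } : {}),
        },
        body: JSON.stringify({ model: p.model, messages }),
      });
      if (!res.ok) throw new Error(`${p.name} HTTP ${res.status}`);
      const json = await res.json();
      return { provider: p.name, text: json.choices[0].message.content };
    } catch (err) {
      lastError = err; // try the next provider
    }
  }
  throw lastError ?? new Error("no providers configured");
}

// Local server first, NIM on the free key as the fallback.
const providers = [
  { name: "local", baseUrl: "http://localhost:11434/v1", model: "<YOUR_LOCAL_MODEL>" },
  { name: "nim", baseUrl: "https://integrate.api.nvidia.com/v1",
    model: "minimaxai/minimax-m2.7", apiKey: process.env.NVIDIA_API_KEY },
];
```

Reordering the array is all it takes to make NIM the primary and the local server the fallback.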

What still needs a real account

Instagram posting itself is free but not anonymous. You need a Facebook App, an Instagram Business or Creator account, and that account linked to a Facebook Page so you can mint a long-lived Page token. All of that is free to create through Meta, it just takes setup. If you do not want Instagram at all, stop after the public-URL step and let the bot hand you the Catbox or R2 link to post by hand. The hosting, the Drive backup, the smoke-test gate, and pm2 all work without ever touching Meta.

Gotchas & hard-won lessons

pm2 does not source your shell profile. A key you export-ed in bash is invisible to a pm2 process. Put every secret in one flat env file and parse it yourself on startup.
Instagram will not accept a raw buffer or a Discord CDN link. It fetches the URL you give it, so you must host the bytes on a publicly reachable URL first.
The Instagram container is processed asynchronously. If you call media_publish before status_code is FINISHED you get an error. Poll the container until it finishes, and give video a longer interval than images.
Video posts need media_type=REELS plus video_url and share_to_feed=true. Stills use image_url. Mixing these up returns a confusing container error.
The Page-token check is named refreshPageTokenIfNeeded but only warns when expiry is under 7 days; it does not refresh the token for you. A long-lived Page token still dies around 60 days, so re-mint it before then or posting goes silent.
Drive uploads with a service account fail unless the target folder is explicitly shared with the service-account email. The bot can authenticate fine and still see nothing.
R2 objects expire on a 7-day lifecycle in my setup, which is fine only because Instagram fetches within seconds. Do not rely on those URLs as permanent storage; the Drive copy is the archive.
Catbox Litterbox links last 72 hours and Imgur anonymous uploads are rate-limited, so treat both as last-resort fallbacks behind R2.
Never pm2 delete a process. Delete-then-start wipes the env vars and tokens that were loaded into it, and if you forget pm2 save afterward the process vanishes on reboot. Always pm2 restart.
Run pm2 save only when pm2 list looks healthy. Saving during a half-started state writes a broken dump that resurrect will replay.
The gdrive-backup cron calls pm2 by absolute path because cron runs with only a minimal PATH, and it does not silence stderr. An earlier version did silence it, which hid a failure for a day before anyone noticed the backups had stopped.

Prompts to build it yourself

The kind of instructions you'd hand an AI coding agent (Claude Code) to build this from scratch.

prompt

Build a Node module that publishes a media buffer to Instagram through the Facebook Graph API v21.0. Expose one function, postToBuffer({ text, mediaBuffer, mimeType, channels }), that first uploads the bytes to a public URL: try a Cloudflare R2 Worker via PUT /upload with an X-Upload-Secret header, and fall back to Catbox Litterbox for video or Imgur (Client-ID auth, base64) for images. Then create an IG media container (image_url for stills; media_type=REELS + video_url + share_to_feed for video), poll the container's status_code field every few seconds for up to ~24 tries until FINISHED (5s interval for video, 3s for images, bail on ERROR), and call media_publish with the creation id. Add a separate function that calls debug_token and logs a warning when the Page token has under 7 days left (warn only, do not auto-refresh). Read IG_USER_ID, FB_PAGE_TOKEN, FB_APP_ID, FB_APP_SECRET, R2_WORKER_URL, R2_UPLOAD_SECRET from process.env. Use placeholders for every secret, no hardcoded ids.

prompt

Set up the ops layer for a pm2-managed Discord bot. First, at the top of the entrypoint, parse a flat ~/.nemoclaw_env file (KEY=value lines via the regex /^([A-Z_][A-Z0-9_]*)=(.*)$/) into process.env, because pm2 does not source shell profiles. Second, write a smoke-test.js that node -c syntax-checks the critical scripts, require()s each sibling service/handler/ui module to catch broken exports, hard-fails if DISCORD_BOT_TOKEN or NVIDIA_API_KEY did not load from the env file (warn for other missing vars), and curls liveness endpoints including https://integrate.api.nvidia.com/v1/models and a local ComfyUI /queue. Third, write a pre-restart gate script that node --checks the entrypoint and runs the bridge unit tests, and only does pm2 restart <name> --update-env when both pass, never pm2 delete. Add a Google Drive backup helper that prefers a personal OAuth refresh token and falls back to a service-account JWT (RS256), exchanges it at oauth2.googleapis.com/token with a 55-minute token cache, and multipart-uploads a file to a Drive folder id from env. Add a cron that tarballs the scripts, env files, and pm2 state to Drive every 30 minutes and prunes to the last 48 copies.

See the Content Factory →