
A Personal Discord Agent on a 3090

Apr 10, 2026 · AI · Discord, Agents, Local LLM, Qwen, pi

Discord is already where my homelab alerts land and where I talk to friends, so it's a natural home for an agent I can DM, @-mention in a thread, and trust with a shell on my own machines — no SSH, no web UI, just a chat window I already have open.

The whole stack runs on hardware I already had: one RTX 3090, a Qwen 27B distill, pi as the agent runtime, and discord.py as the frontend. This post is the generalized recipe.

The Shape of It

Discord user
    │
    ▼
discord.py bot (Python)  ◄── long-running container
    │
    ▼
pi agent loop (Node/TS)  ◄── tool calls, planning, compaction
    │
    ▼
LiteLLM proxy            ◄── OpenAI-compatible routing layer
    │
    ▼
llama.cpp server         ◄── Qwen 27B distill, GGUF on the 3090

Four moving parts, each one replaceable. The Discord bot only knows how to shovel messages in and out. pi handles the actual agent loop — tool calls, retries, context compaction. LiteLLM gives you one stable URL regardless of what model is loaded behind it. llama.cpp does the inference.
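Under Docker Compose, the diagram maps one-to-one onto services. A sketch — image tags, volume paths, and service names here are illustrative, not a tested file. The service names double as hostnames on the compose network, which is where URLs like http://llamacpp:8080 and http://litellm:4000/v1 come from:

```yaml
# Sketch only — adjust images, paths, and GPU wiring for your setup.
services:
  llamacpp:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    command: >
      --model /models/qwen-27b-distill-q4_k_m.gguf --ctx-size 131072
      --n-gpu-layers 999 --host 0.0.0.0 --port 8080
      --parallel 2 --cont-batching
    volumes: ["./models:/models"]
    gpus: all
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    volumes: ["./litellm-config.yaml:/app/config.yaml"]
    depends_on: [llamacpp]
  bot:
    build: ./bot          # the discord.py bridge; shells out to pi
    volumes: ["./data:/data"]
    depends_on: [litellm]
```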

The Model: Qwen 27B Distill on a 3090

A 3090 has 24GB of VRAM, which is the sweet spot for running a ~27B parameter model at a usable quantization. I use a Qwen 2.5 distill quantized to Q4_K_M — about 17GB on disk, which fits in VRAM with ~6GB left over for the KV cache.

Launched with llama.cpp's server binary:

llama-server \
  --model /models/qwen-27b-distill-q4_k_m.gguf \
  --ctx-size 131072 \
  --n-gpu-layers 999 \
  --host 0.0.0.0 \
  --port 8080 \
  --parallel 2 \
  --cont-batching

A few things worth calling out:

  • --ctx-size 131072 gives you the full 128K context. The Qwen 27B variants support long context and an agent chews through tokens fast — tool call traces, file reads, search results.
  • --parallel 2 lets the server process two requests at once. If the agent is thinking in one Discord thread and a webhook triggers a second, they don't block each other.
  • --cont-batching enables continuous batching so parallel requests share compute efficiently.

You can quantize however aggressively you want here. I started with Q5_K_M, found that Q4_K_M had indistinguishable quality for my workload, and the extra VRAM headroom was worth it for a longer KV cache.
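The arithmetic behind those numbers, assuming Q4_K_M averages roughly 4.85 bits per weight (an approximation — the exact figure varies with the quant's layer mix):

```python
# Back-of-envelope VRAM budget for a 27B model on a 24GB card.
# 4.85 bits/weight is an approximate average for Q4_K_M quants.
params = 27e9
bits_per_weight = 4.85

weights_gb = params * bits_per_weight / 8 / 1e9
headroom_gb = 24 - weights_gb  # left for KV cache, activations, CUDA overhead

print(f"weights: ~{weights_gb:.1f} GB, headroom: ~{headroom_gb:.1f} GB")
# → weights: ~16.4 GB, headroom: ~7.6 GB
```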

The Routing Layer: LiteLLM

Putting LiteLLM in front of llama.cpp is one of those decisions that seems like over-engineering until you've done it once. It gives you:

  • A stable OpenAI-compatible endpoint regardless of what's loaded. Swap the GGUF, the URL your agent uses doesn't change.
  • Virtual API keys so you can scope different clients (the agent, your IDE, other skills) to different quotas.
  • A single place to log, rate-limit, and observe every LLM call across your homelab.

My config.yaml is mostly one block per model:

model_list:
  - model_name: qwen27b
    litellm_params:
      model: openai/qwen
      api_base: http://llamacpp:8080/v1
      api_key: none
  - model_name: embeddings
    litellm_params:
      model: openai/qwen-embed
      api_base: http://llamacpp-embed:8081/v1
      api_key: none

LiteLLM runs in its own container and everything — the agent, search skills, indexers — points at http://litellm:4000/v1. Over Tailscale, even remote machines hit the same URL.

The Agent Loop: pi

The part I don't want to write from scratch is the agent loop. Tool calling, error recovery, context compaction when the conversation gets long, retries on malformed tool output — this is all fiddly and easy to get wrong. pi (from @mariozechner/pi-coding-agent) gives you a solid default loop with a clean extension model.

Install:

npm install -g @mariozechner/pi-coding-agent

Then point it at your local model by adding an entry to ~/.pi/agent/models.json:

{
  "providers": {
    "local": {
      "type": "openai",
      "baseURL": "http://litellm:4000/v1",
      "apiKey": "sk-your-litellm-virtual-key"
    }
  },
  "models": {
    "qwen27b": {
      "provider": "local",
      "model": "qwen27b",
      "contextWindow": 131072
    }
  }
}

Now pi --model qwen27b drops you into an interactive agent that talks to your local model, with the standard set of tools (bash, read, write, edit, web search) already wired up.

The critical feature is that pi is scriptable. You can invoke it non-interactively with a prompt and get back a structured result, which is exactly what a Discord bot needs.

The Frontend: discord.py

The Discord side is thinner than you'd expect. Its entire job is: watch for messages, forward them to the agent, stream the result back.

import asyncio
import os

import discord
from discord.ext import commands

bot = commands.Bot(command_prefix="!", intents=discord.Intents.all())

@bot.event
async def on_message(message: discord.Message):
    # Ignore other bots (including ourselves), and anything that isn't
    # a DM or an explicit @-mention.
    if message.author.bot:
        return
    if not (bot.user in message.mentions or isinstance(message.channel, discord.DMChannel)):
        return

    async with message.channel.typing():
        reply = await run_agent(message.content, session_id=str(message.channel.id))

    # split_for_discord breaks long replies at Discord's 2000-char limit.
    for chunk in split_for_discord(reply):
        await message.reply(chunk)

bot.run(os.environ["DISCORD_TOKEN"])  # token from the Discord developer portal

The bits that matter:

  • Session per channel. Keying the agent session by channel.id means each Discord thread gets its own persistent context. Switch threads, switch conversations — same way you'd use Claude Code projects.
  • DM or mention only. The bot ignores everything else. Agents that auto-reply to every message in a busy server get annoying fast.
  • Typing indicator during work. channel.typing() is the one piece of UX that makes the agent feel alive. The indicator only lasts about ten seconds, so use it as a context manager (as above) and discord.py will keep re-triggering it for long-running agent calls.
  • Chunking for Discord's 2000-char limit. A splitter that respects code fences and paragraph boundaries is worth the 30 lines it takes to write.
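The splitter isn't shown above; a minimal sketch that prefers line boundaries and keeps code fences balanced in each chunk (hard-splitting only pathological overlong lines) might look like:

```python
def split_for_discord(text: str, limit: int = 2000) -> list[str]:
    """Split text into chunks under Discord's message limit, preferring
    line boundaries and keeping code fences balanced in each chunk."""
    margin = limit - 4          # room to append a closing ``` to a chunk
    chunks: list[str] = []
    current, in_fence = "", False
    for line in text.split("\n"):
        if len(current) + len(line) + 1 > margin:
            if in_fence:
                # Close the open fence here and reopen it in the next chunk.
                chunks.append(current + "```")
                current = "```\n"
            else:
                if current.strip():
                    chunks.append(current.rstrip("\n"))
                current = ""
        current += line + "\n"
        # Pathological single lines longer than the limit get hard-split.
        while len(current) > margin:
            chunks.append(current[:margin])
            current = current[margin:]
        if line.strip().startswith("```"):
            in_fence = not in_fence
    if current.strip():
        chunks.append(current.rstrip("\n"))
    return chunks
```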

Bridging to pi

The run_agent function is just a subprocess call into pi with a session file per channel:

async def run_agent(prompt: str, session_id: str) -> str:
    # One session file per Discord channel keeps each thread's context separate.
    session_file = f"/data/sessions/{session_id}.json"
    proc = await asyncio.create_subprocess_exec(
        "pi",
        "--model", "qwen27b",
        "--session", session_file,
        "--print",
        prompt,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await proc.communicate()
    if proc.returncode != 0:
        # Surface a truncated error rather than silently replying with nothing.
        return f"Agent error: {stderr.decode(errors='replace')[:500]}"
    return stdout.decode()

If you'd rather not shell out, pi also exposes its core as a TypeScript library — you can host it in a small Node service and have the Python bot talk to it over HTTP. I went with the subprocess approach because it's one less thing to keep running.

Tools Worth Wiring Up

Out of the box, pi gives you bash, file read/write/edit, and web fetch. The tools I actually added to make the agent useful as a personal assistant:

  • A memory tool — a simple append-only JSON file the agent can read and write to remember preferences, inside jokes, and context that should survive across sessions. It's remarkable how much of "feeling like a real assistant" comes from this one thing.
  • Web search — Tavily's API is cheap and the results are cleaner than scraping DuckDuckGo.
  • A safe shell executor — a wrapper around bash that has an allowlist of directories the agent can write to and a blocklist of commands (no rm -rf, no sudo, no network reconfiguration). The agent can still be destructive inside its sandbox, but it can't nuke my home directory.
  • A self-patch tool — lets the agent modify its own source code, guarded by a "protected files" list (the safety validator itself is protected). This is the difference between an agent you tweak every week and an agent you ask to tweak itself.
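A minimal version of the memory tool is a dozen lines. This is a sketch, not my exact implementation — it uses JSON lines rather than one JSON array, the path is illustrative, and the search is a naive substring match, because the model does the fuzzy part:

```python
import json
from pathlib import Path

MEMORY_FILE = Path("/data/memory.jsonl")  # illustrative path

def remember(fact: str, file: Path = MEMORY_FILE) -> None:
    """Append one fact; append-only means the agent can't clobber history."""
    with file.open("a") as f:
        f.write(json.dumps({"fact": fact}) + "\n")

def recall(query: str, file: Path = MEMORY_FILE) -> list[str]:
    """Naive substring search over all remembered facts."""
    if not file.exists():
        return []
    facts = [json.loads(line)["fact"] for line in file.open() if line.strip()]
    return [f for f in facts if query.lower() in f.lower()]
```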

The self-patch tool is the one that feels most like magic. Ask the agent to add a new tool, and it edits its own Python, restarts its own container, and the next message you send uses the new capability.
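The protected-files check that keeps self-patching sane is tiny. A sketch, with hypothetical paths — the important property is that the validator's own file is in the list it enforces:

```python
from pathlib import Path

# The validator protects itself: the agent cannot edit the files that
# constrain it. Paths are illustrative.
PROTECTED = {Path("/app/safety.py"), Path("/app/protected_files.txt")}

def can_patch(target: str) -> bool:
    """Return False for any protected path (or anything under a protected dir)."""
    p = Path(target).resolve()
    return not any(p == prot or prot in p.parents for prot in PROTECTED)
```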

What I Actually Use It For

The shape of "useful" is whatever you teach it with tools. Mine ended up being the five things I used to juggle across half a dozen apps:

  • Calendar management — the agent has a tool for reading and writing my calendar, so "DM me my week" and "move my 3pm to Thursday" are one-liners instead of a five-tap dance on my phone.
  • Recipe book — a JSON file of recipes I've liked, plus a tool for adding, searching, and scaling them. "What can I make with what's in the fridge" works because the model is good enough at fuzzy matching ingredients.
  • Workout planner — tracks which lifts I did and when, and spits back a suggested session based on whatever program I'm running. The state is just another file; the intelligence is in the prompt.
  • r/LocalLLaMA digest — a cron trigger has the agent pull the top posts from r/LocalLLaMA every morning, summarize anything interesting, and drop a digest in a dedicated channel. I skim it with coffee.
  • Reminders — cron-backed, natural language in, Discord pings out. "Remind me to take the trash out Sunday night" becomes a scheduled job without me having to learn a reminders app.

None of this needs a frontier model. A 27B distill on local hardware handles all of it, the per-query cost is zero, and nothing I say to it leaves my house.

Closing Thoughts

The reason to build this instead of just using a hosted product: it's yours. The memory file is yours. The tool allowlist is yours. The model weights are on your disk. When the agent does something clever, you can look at the exact prompt and tool trace and understand why. When it does something dumb, you can fix it with a git commit.

If you have a 3090 (or any ~24GB GPU), a spare afternoon, and a Discord account, this whole stack is one evening of plumbing. pi handles the hard part of the agent loop, LiteLLM hides the model, llama.cpp does the math, and discord.py is a 200-line bridge. Everything else is yours to shape.