How to Put LLMs into Discord | The Complete 2026 Guide

Learn exactly how to put LLMs into Discord using Python, Ollama, OpenAI, and ready-made tools like llmcord. Step-by-step guide for beginners and developers alike — no fluff, just working methods.
Discord started as a gaming chat app. Today it’s the digital living room for millions of communities — developers, artists, researchers, gamers, students — all gathered in servers buzzing with conversation. And in 2026, there’s one thing nearly every active Discord community is thinking about: what if the AI could just be part of the conversation?
That’s not a fantasy anymore. Putting a large language model (LLM) directly inside Discord is not only possible, it’s surprisingly approachable. Whether you want a bot that answers server questions, summarizes long threads, helps with coding problems, roleplays as a character, or simply chats intelligently with your members — you can build it.
This guide covers every major method: from zero-code tools to full Python bots, from cloud APIs like OpenAI and Claude to locally-run models via Ollama and LM Studio. By the end, you’ll know exactly which path suits your use case and how to walk it.
Understanding the Architecture: How LLMs Connect to Discord
Before diving into code, it helps to understand what’s actually happening under the hood. When a user types a message in a Discord channel and a bot responds with an AI-generated reply, here’s the complete data flow:
- The user sends a message in Discord.
- Discord’s API routes that message to your bot application.
- Your bot script receives the message via a webhook or WebSocket connection.
- The message text is sent as a prompt to an LLM (either a local model or a cloud API).
- The LLM generates a response and returns it.
- Your bot posts that response back into the Discord channel.
The LLM can live anywhere — on OpenAI’s servers, Anthropic’s infrastructure, Google’s cloud, or right on your own laptop running Ollama. The Discord bot is just the bridge. That single insight makes the whole thing click.
llmcord — The Easiest No-Boilerplate Option
If you want to skip writing most of the bot infrastructure yourself, llmcord (github.com/jakobdylanc/llmcord) is the most popular ready-to-run solution available. It supports any OpenAI-compatible API including Ollama, xAI, Gemini, and OpenRouter.
What Makes llmcord Special
- Hot reloading of config — change settings without restarting
- Caches message data in a size-managed global dictionary to minimize Discord API calls
- Reply-chain conversations that build context naturally
- Per-user, per-role, and per-channel permission controls
- Supports vision models (image attachments) and text file attachments
Setup in 5 Steps
- Create a Discord Bot at discord.com/developers/applications → Bot tab → generate token → enable Message Content Intent
- Clone the repo: `git clone https://github.com/jakobdylanc/llmcord.git`
- Install dependencies: `pip install -r requirements.txt`
- Configure your provider — add base_url and optional api_key. OpenAI, OpenRouter, and Ollama are pre-configured. The first model in your list is the default.
- Run the bot — the invite URL prints to your console automatically.
Conversation Experience
@ the bot to start a conversation and reply to continue. You can branch conversations into threads — just create a thread from any message and @ the bot inside. Back-to-back messages from the same user are automatically chained. In DMs, conversations continue without needing to reply each time.
Add a line such as `User messages are prefixed with their Discord ID as <@ID>` to your system prompt so the model understands the user format and can mention users back properly.
Python + Ollama (Local LLM)
If you want full control and zero API costs, running Ollama locally is your move. Ollama lets you run LLMs like Llama 3, Mistral, and Gemma entirely offline — your data never leaves your machine.
Cloud vs. Local: At a Glance
| Feature | Cloud API (OpenAI, Claude) | Local LLM (Ollama) |
|---|---|---|
| Cost | Pay per token | Free (hardware only) |
| Privacy | Data sent to provider | Fully offline |
| Speed | Fast (dedicated servers) | Depends on GPU/CPU |
| Setup | Minimal | Moderate |
| Model Variety | Limited to provider | Hundreds on HuggingFace |
| Internet Required | Yes | No |
Step-by-Step Setup
Step 1 — Install Ollama and pull a model
```shell
ollama pull llama3
ollama run llama3   # verify it works
# Keep this terminal open — your bot needs it running
```

Step 2 — Set up your Python environment

```shell
mkdir discord-llm-bot && cd discord-llm-bot
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install discord.py python-dotenv requests
```

Step 3 — Create your .env file

```
DISCORD_TOKEN=your_bot_token_here
OLLAMA_MODEL=llama3
```

Step 4 — Write bot.py
```python
import discord
import requests
import os
from dotenv import load_dotenv

load_dotenv()
TOKEN = os.getenv("DISCORD_TOKEN")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3")
OLLAMA_URL = "http://localhost:11434/api/generate"

intents = discord.Intents.default()
intents.message_content = True
client = discord.Client(intents=intents)

def ask_ollama(prompt, temperature=0.7):
    payload = {
        "model": OLLAMA_MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature, "num_predict": 500}
    }
    try:
        response = requests.post(OLLAMA_URL, json=payload, timeout=60)
        if response.status_code == 200:
            return response.json().get("response", "").strip()
        return "Something went wrong with the model."
    except requests.exceptions.ConnectionError:
        return "Error: Ollama is not running! Start it with `ollama serve`."

@client.event
async def on_ready():
    print(f"Logged in as {client.user}")

@client.event
async def on_message(message):
    if message.author == client.user:
        return
    if client.user.mentioned_in(message):
        user_input = message.content.replace(f"<@{client.user.id}>", "").strip()
        async with message.channel.typing():
            reply = ask_ollama(user_input)
        await message.reply(reply)

client.run(TOKEN)
```

Step 5 — Run it: `python bot.py` — look for `Logged in as YourBot#1234` in the console, then @ the bot in your server.
Python + OpenAI API (Cloud LLM)
For the highest-quality responses and zero hardware management, connecting your Discord bot to OpenAI’s API (or any OpenAI-compatible provider like Anthropic or Gemini) is the cleanest cloud option.
```shell
pip install discord.py openai python-dotenv
```

```python
import discord
import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client_ai = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
conversation_history = {}  # Per-user memory

intents = discord.Intents.default()
intents.message_content = True
bot = discord.Client(intents=intents)

@bot.event
async def on_message(message):
    if message.author == bot.user:
        return
    if bot.user.mentioned_in(message):
        user_id = str(message.author.id)
        user_input = message.content.replace(f"<@{bot.user.id}>", "").strip()
        if user_id not in conversation_history:
            conversation_history[user_id] = []
        conversation_history[user_id].append({"role": "user", "content": user_input})
        async with message.channel.typing():
            response = client_ai.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant in a Discord server."},
                    *conversation_history[user_id]
                ],
                max_tokens=800
            )
        reply = response.choices[0].message.content
        conversation_history[user_id].append({"role": "assistant", "content": reply})
        await message.reply(reply)

bot.run(os.getenv("DISCORD_TOKEN"))
```

LM Studio + Node.js
If you prefer JavaScript, LM Studio gives you a polished desktop GUI for managing and running local LLMs, and its JavaScript SDK makes integrating with a Discord bot clean and straightforward.
```shell
npm install discord.js @lmstudio/sdk dotenv
```

```javascript
import { LMStudioClient } from '@lmstudio/sdk';
import { Client, GatewayIntentBits } from 'discord.js';
import 'dotenv/config';

const lms = new LMStudioClient();
const discord = new Client({
  intents: [GatewayIntentBits.Guilds, GatewayIntentBits.GuildMessages, GatewayIntentBits.MessageContent]
});

discord.on('ready', () => console.log(`Ready: ${discord.user.tag}`));

discord.on('messageCreate', async (message) => {
  if (message.author.bot) return;
  if (!message.mentions.has(discord.user)) return;
  const prompt = message.content.replace(`<@${discord.user.id}>`, '').trim();
  const model = await lms.llm.get({ path: 'lmstudio-community/gemma-2-2b-it-GGUF' });
  await message.channel.sendTyping();
  const response = await model.respond([{ role: 'user', content: prompt }]);
  await message.reply(response.content);
});

discord.login(process.env.DISCORD_TOKEN);
```

In LM Studio, navigate to the server section, select your model from the dropdown, and start the local API server before running this script. Models like Gemma 2 2B work well on most consumer hardware.
Using Fine-Tuned Models in Discord
If you want your Discord bot to behave like a domain expert — trained on your own data — you can connect fine-tuned LLMs using the same architecture described above. The key difference is in your API call: instead of pointing at gpt-4o or llama3, you point it at your fine-tuned model’s endpoint.
Platforms like Hugging Face Inference Endpoints, Together AI, Fireworks AI, and Replicate all let you host fine-tuned models with an OpenAI-compatible API — meaning the bot code stays identical, only the endpoint URL and model name change.
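As an illustrative sketch of how little changes: the request below targets a hypothetical OpenAI-compatible host (the endpoint URL, API key, and model name are placeholders, not real values). Compare it with the Ollama example above; only the URL, auth header, and model field differ.

```python
import requests

# Placeholder values — substitute whatever your hosting platform gives you.
BASE_URL = "https://your-endpoint.example.com/v1"
API_KEY = "your_api_key_here"
MODEL = "my-fine-tuned-model"

def build_request(prompt):
    """Build an OpenAI-compatible chat request; only BASE_URL and MODEL differ from stock providers."""
    url = f"{BASE_URL}/chat/completions"
    payload = {"model": MODEL, "messages": [{"role": "user", "content": prompt}]}
    return url, payload

def ask_finetuned(prompt):
    url, payload = build_request(prompt)
    headers = {"Authorization": f"Bearer {API_KEY}"}
    r = requests.post(url, json=payload, headers=headers, timeout=60)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]
```

Drop `ask_finetuned` into the `on_message` handler from the earlier examples and the rest of the bot is unchanged.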
Scaling to Multi-Server Bots: Real-World Lessons
Once your bot works locally, scaling it reveals hard truths fast. A Gonzaga University CS team discovered this when they built a production-grade LLM Discord bot: the project required TypeScript code, web API calls, asynchronous and synchronous programming models, Docker containers, GPU configuration, and Discord libraries all working together.
Their key finding: running Ollama on consumer laptops was far too slow for active servers. They ultimately moved to a dedicated GPU research server, which enabled near real-time responses comparable to cloud API services.
Recommended Hardware by Use Case
| Use Case | Recommended Setup | Notes |
|---|---|---|
| Personal / hobby server | CPU + Ollama (7B model) | Slow but completely free |
| Small community (<100 active) | Consumer GPU + Ollama | RTX 3060+ works well |
| Mid-size server (100–1,000) | Cloud API or rented GPU | OpenAI / Together AI |
| Large community (1,000+) | Dedicated server + Docker | GPU cluster recommended |
Adding Memory and Context
Out of the box, most simple implementations are stateless — each message is treated independently. For natural conversation, you need persistent memory. Here are the three tiers:
In-Memory Dict
Python dictionary keyed by user ID. Fast to implement, resets on bot restart. Good for testing.
SQLite Database
Save and retrieve conversation history across restarts. Enough for most real-world programs.
Vector DB (FAISS / Chroma)
Store and retrieve semantically relevant past messages. The bot can recall topics from weeks ago.
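The SQLite tier can be sketched in a few lines (the table layout and helper names here are our own, not a standard): every turn is written to disk, and the last few turns are reloaded per user when building the prompt.

```python
import sqlite3

# One table of conversation turns; survives bot restarts unlike an in-memory dict.
conn = sqlite3.connect("memory.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS history ("
    "  user_id TEXT, role TEXT, content TEXT,"
    "  ts DATETIME DEFAULT CURRENT_TIMESTAMP)"
)

def remember(user_id, role, content):
    """Append one turn (role is 'user' or 'assistant') to the user's history."""
    conn.execute(
        "INSERT INTO history (user_id, role, content) VALUES (?, ?, ?)",
        (user_id, role, content),
    )
    conn.commit()

def recall(user_id, limit=10):
    """Return the last `limit` turns for this user, oldest first."""
    rows = conn.execute(
        "SELECT role, content FROM history WHERE user_id = ? "
        "ORDER BY ts DESC, rowid DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()
    return [{"role": r, "content": c} for r, c in reversed(rows)]
```

In the OpenAI example above, `recall(user_id)` would replace the `conversation_history[user_id]` lookup, and `remember()` would replace the two `append()` calls.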
Giving Your Discord Bot a Personality
A generic “helpful assistant” bot is boring. A bot with a strong, consistent persona is something your community will actually use. The persona lives entirely in your system prompt:
```
You are Axiom, the no-nonsense tech support bot for this server.
You speak in short, direct sentences. You never apologize unnecessarily.
You're an expert in Python, Linux, and self-hosted tools.
When you don't know something, you say "I don't know" — no guessing.
Today is {date}. Current time: {time}.
```

The temperature parameter controls creativity — 0.7 is the sweet spot for most chatbots. Higher values (0.9+) produce more varied, creative responses; lower values (0.3) produce more focused, deterministic answers.
Troubleshooting Common Issues
Bot Is Online But Not Responding
Check that Message Content Intent is enabled on the Bot tab of the Developer Portal and that `intents.message_content = True` is set in your code. Also confirm the bot has permission to view and send messages in the channel, and that you are actually @mentioning it.
Ollama Returns “Connection Refused”
Ollama must be running before you start the bot. Open a separate terminal and run ollama serve. Keep that window open for the duration of your bot’s operation.
Responses Are Cutting Off
Increase max_tokens or num_predict in your API call. Also: Discord has a 2,000-character message limit. Add logic to detect long responses and split them across multiple messages:
```python
if len(reply) > 1900:
    chunks = [reply[i:i+1900] for i in range(0, len(reply), 1900)]
    for chunk in chunks:
        await message.channel.send(chunk)
else:
    await message.reply(reply)
```

Bot Responds to Other Bots
Add if message.author.bot: return at the very top of your on_message handler to filter out all bot-authored messages, including your own bot’s messages.
All Methods Compared Side by Side
| Method | Skill Level | Cost | Privacy | Best For |
|---|---|---|---|---|
| llmcord | Low | Depends on provider | High w/ Ollama | Quick setup, small–medium servers |
| Python + Ollama | Medium | Free (hardware) | Excellent | Privacy-focused, custom logic |
| Python + OpenAI | Medium | Pay per token | Data sent to OpenAI | High quality, low hardware burden |
| Node.js + LM Studio | Medium | Free | Excellent | JS developers, local inference |
| Fine-tuned endpoint | High | Variable | Depends on host | Domain-specific expert bots |
| Docker + GPU server | High | Hardware / cloud | Excellent | Production, large communities |
Security Best Practices
- Never hardcode tokens or API keys — always use `.env` files and add them to `.gitignore`
- Rate-limit users — a single user can flood your bot with expensive API calls or degrade the experience for everyone
- Set a system prompt with explicit limits — without guardrails, users can manipulate the bot into off-topic or harmful content
- Log interactions (without storing sensitive personal data) so you can audit unusual activity
- Restrict which channels the bot responds in using role-based permissions or channel allowlists in your config
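The rate-limiting point above is a few lines of standard-library code. A minimal per-user cooldown might look like this (the 10-second window is an arbitrary example value, not a recommendation):

```python
import time
from collections import defaultdict

COOLDOWN_SECONDS = 10          # illustrative; tune for your server and API budget
_last_request = defaultdict(float)

def allowed(user_id, now=None):
    """Return True if this user may make a request, enforcing one call per cooldown window."""
    now = time.monotonic() if now is None else now
    if now - _last_request[user_id] < COOLDOWN_SECONDS:
        return False               # still inside the cooldown window
    _last_request[user_id] = now   # record this request and let it through
    return True
```

At the top of your `on_message` handler, add `if not allowed(str(message.author.id)): return` before any LLM call.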
Advanced Features Worth Adding
Slash Commands
Register /ask, /summarize, or /explain slash commands using Discord’s application command system. These show up as autocomplete suggestions for users — much more discoverable than @mentions.
Streaming Responses
Instead of waiting for the full response before posting, stream tokens in real time and edit the bot’s message as text arrives. This feels much more natural and eliminates the awkward silence before long responses appear.
Multimodal Support
Models like GPT-4o, Claude 3.5 Sonnet, and LLaVA can process images. Users can attach a screenshot and ask the bot to explain, debug, or describe it. Enable by adding image processing to your message handler.
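For OpenAI-style vision models, the change is only in how the user message is built: text and image URLs go together into one content list. A sketch (the helper name is ours):

```python
def build_vision_messages(text, image_urls,
                          system="You are a helpful assistant in a Discord server."):
    """Build an OpenAI-style chat payload pairing the user's text with attached image URLs."""
    content = [{"type": "text", "text": text}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": content},
    ]
```

In your `on_message` handler, collect the URLs with something like `[a.url for a in message.attachments if a.content_type and a.content_type.startswith("image/")]` and pass the result as `messages=` to the gpt-4o call from the earlier example.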
Thread-Based Conversations
Route each new conversation into its own Discord thread automatically. This keeps main channels clean while preserving full context for each individual chat — ideal for busy servers.
Frequently Asked Questions
Can I use Claude instead of OpenAI?
Yes. The Anthropic Python SDK is straightforward to integrate. Replace the OpenAI client with Anthropic’s, adjust the model name to claude-sonnet-4-5 or claude-opus-4-5, and the rest of your bot code stays largely the same.
Do I need coding experience?
For llmcord, minimal coding is needed — mostly YAML configuration. For a custom bot, basic Python knowledge is required. The barrier is lower than most people expect; the hardest part is usually setting up the Discord Developer Portal correctly.
How much does it cost?
Running local models via Ollama is free (you pay only for electricity and hardware). Cloud API providers charge per token. OpenAI’s GPT-4o pricing is listed on their platform page and varies by input vs. output tokens.
Can the bot talk in voice channels?
Text-only by default. Voice channel support requires an additional TTS (text-to-speech) layer using Discord.py’s voice client and a TTS engine like Coqui TTS or ElevenLabs. It’s possible but significantly more complex to implement.
How do I keep the bot running 24/7?
Deploy to a cloud server (AWS EC2, DigitalOcean, Google Cloud, Railway.app) or use a process manager like systemd or pm2 on a home server. Railway.app even has a one-click Discord bot deployment template for fast cloud hosting.
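As one concrete sketch of the systemd route (the paths, unit name, and user directory below are placeholders for your own setup):

```ini
[Unit]
Description=Discord LLM bot
After=network-online.target
Wants=network-online.target

[Service]
WorkingDirectory=/home/you/discord-llm-bot
ExecStart=/home/you/discord-llm-bot/venv/bin/python bot.py
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Save it as /etc/systemd/system/discord-bot.service, then run `sudo systemctl enable --now discord-bot`; the bot restarts automatically after crashes and reboots.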
Which model is best for a Discord bot?
For local use: Llama 3 8B via Ollama is the best balance of speed and quality on consumer hardware. For cloud: GPT-4o mini is fast and cost-effective for high-volume servers. For premium quality: GPT-4o or Claude Sonnet.
Which Method Should You Use?
Putting an LLM into Discord has never been easier, and the right approach depends on your situation.
If you want to be up and running in under an hour without writing much code, llmcord is your answer. Clone it, configure a YAML file, and you have a production-quality multi-model bot with conversation threading and permissions built in.
If you want full ownership and zero ongoing API costs, the Python + Ollama approach gives you a completely private, locally-run LLM bot. It scales with your hardware and costs nothing after setup.
If you want the highest quality responses and don’t mind paying per token, connect your Python bot to OpenAI, Anthropic Claude, or Google Gemini. The code is minimal and the results are excellent.
Whatever path you choose, the fundamental architecture is the same: Discord receives the message, your bot passes it to an LLM, the response comes back. Once you understand that loop, every enhancement — memory, personas, slash commands, multimodal input — is just one more layer on top.
The AI is already part of your community. Now you know exactly how to make it official.
