Project

Self-Hosted AgentSpace/ChatGPT

Deployed 12-10-2025

Fully Local, Private AgentSpace/CoPilot/ChatGPT Platform with Tools, Agents and Observability

I designed and deployed a self-hosted AgentSpace/CoPilot/ChatGPT-style workspace that also doubles as an agent platform lab—where I can iterate on tools, retrieval patterns, and orchestration while keeping the infrastructure honest. This project is functionally deployed and I’m expanding it feature-by-feature.


The Goal and High-Level Overview

  • Design and control agentic automation (tool use, multi-step workflows, guardrails)
  • Practice LLMOps (telemetry, tracing, evaluation, cost/perf awareness)
  • AgentSpace / ChatGPT-style experience with:
    • Document storage + retrieval patterns (RAG)
    • Artifact creation (interactive outputs, UI/code side panels)
    • Coding environment
    • MCP and custom tool experimentation
    • Context window + memory behavior exploration
  • Real infrastructure:
    • Reverse proxy + TLS
    • Authentication layer (OIDC/SSO)
    • Clean separation of stateless app vs persistent data
    • Self-hosted model runtimes and OpenAI-compatible APIs
    • Containers that make the platform portable, reproducible, and easy to backup
    • VM/LXC hosts in Proxmox for datacenter realism and ease of use
  • Multiple device access (phone + desktops) with one consistent identity, chat history, and tool surface.

This is a learning project, but it’s also a blueprint. The modularity is the point: each component can be swapped later without rewriting the whole stack.


Why LobeChat fits these goals (and why I didn’t stay with OpenWebUI)

OpenWebUI is awesome at “spin it up fast and start chatting”. My project needed something closer to an enterprise-shaped control plane, and LobeChat-Server (DB mode) pushed me into that architecture by design.

LobeChat-Server specifics that mattered
  • Built to be extensible: plugins, a plugin index, and a formal manifest + OpenAI API compatibility.
  • Supports MCP installation and treats “tools” as a first-class surface area for the user.
  • Artifacts are native—the “Claude Artifacts” style experience is not a bolt-on; it’s part of the product identity.
  • Knowledge Base / retrieval features exist as product features, not as an afterthought.
  • The Server-DB deployment assumes a real auth layer instead of a casual, local-only login story. In practice, that means you treat it like a service that must live behind TLS + proper identity.

I wanted a system with a stateless app, external identity, separate persistence, and a reverse proxy in front. OpenWebUI’s own docs/README emphasize rapid self-hosting and broad integrations, which is great, but my objective here is enterprise-style composition and tool surfaces.


The Architecture

The main stack is two planes, an app plane and a data plane, with identity and ingress defining the boundary:

App Layer

Reverse proxy: Nginx Proxy Manager
NPM is perfect for this kind of lab because it gives you:

  • Clean host-based routing
  • Let’s Encrypt certificates
  • Advanced Nginx config capability

Auth: Authentik (OIDC)
Authentik is the identity backbone: users, MFA, policies, OIDC clients, and future “who can do what” controls. It’s not optional: LobeChat-Server DB deployments are designed around an external auth service, and without that integration the web UI will not work.

Data Layer

Postgres
Chat history, metadata, and embeddings/KB indexes (depending on your configuration) all live here, which means the web UI container is disposable; the data is not. That makes data management easier and keeps chat history consistent across client devices.

S3 object storage:
File uploads and larger “blob” assets don’t belong in Postgres; they belong in object storage. LobeChat explicitly supports S3-compatible storage configuration, which lets the knowledge data behind the built-in context features live on a NAS instead of inside the lightweight service environment.
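As a quick illustration of what “S3-compatible” buys you, here’s a minimal sketch of pushing a knowledge file to a self-hosted object store. The endpoint, bucket, and credentials are hypothetical placeholders, not my real deployment values.

```python
# Minimal sketch: uploading a knowledge file to S3-compatible storage (e.g., MinIO on a NAS).
# Endpoint, bucket, and credentials below are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.lab.example",   # hypothetical self-hosted endpoint
    aws_access_key_id="lobe-chat",               # placeholder credentials
    aws_secret_access_key="change-me",
)

# LobeChat points at the same bucket via its S3 settings; the app container stays stateless.
s3.upload_file("handbook.pdf", "lobechat-files", "knowledge/handbook.pdf")
```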


LLM Hosting: Ollama + vLLM (plus local STT/TTS)

A huge part of why this project is fun is that everything speaks at least a variant of OpenAI’s standard protocols at the edges, even when nothing is OpenAI underneath.
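To make that concrete, here’s a minimal sketch of the chat UI’s point of view: a local runtime is “just another OpenAI API.” The base URL and model name are placeholders for my setup.

```python
# Minimal sketch: any OpenAI-style client can talk to the local runtime.
# Base URL and model name are placeholders; Ollama exposes an OpenAI-compatible /v1 endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://ollama-host:11434/v1", api_key="not-needed")
reply = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Say hello from the lab."}],
)
print(reply.choices[0].message.content)
```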

Ollama (fast iteration, GGUF life)

Ollama is the “swap models like LEGO” runtime:

  • Pull a new model, test it, discard it
  • Iterate on quantized GGUF builds
  • Keep the developer experience friction-free

The GGUF story matters because it’s the practical bridge between “model weights” and “actually runnable locally,” and llama.cpp’s toolchain is the backbone for converting and quantizing. Quantized models also make it practical to keep a lightweight LLM server running 24/7 and only wake the power-hungry main LLM server when you need it, if you’re trying to minimize electricity costs.
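That “light server always on, heavy server on demand” pattern is easy to sketch at the client level. This is a toy example, assuming both runtimes expose OpenAI-compatible endpoints at hypothetical hostnames; it simply falls back to the small model when the big box is powered down.

```python
# Toy sketch of the "24/7 light server, on-demand heavy server" pattern.
# Hostnames, ports, and model names are placeholders for this lab.
from openai import OpenAI, APIConnectionError

HEAVY = ("http://vllm-host:8000/v1", "Qwen/Qwen2.5-14B-Instruct")   # power-hungry GPU box
LIGHT = ("http://ollama-host:11434/v1", "llama3.2:3b")               # low-power always-on box

def ask(prompt: str) -> str:
    for base_url, model in (HEAVY, LIGHT):
        client = OpenAI(base_url=base_url, api_key="not-needed")
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content
        except APIConnectionError:
            continue  # heavy box is offline; fall back to the light runtime
    raise RuntimeError("no LLM backend reachable")

print(ask("Summarize why GGUF quantization matters in one sentence."))
```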

vLLM (the high-throughput workhorse)

Features that make it more enterprise-ready:

  • PagedAttention for KV-cache efficiency and better throughput under load
  • Continuous batching / high GPU utilization patterns
  • An always ready model in your GPU VRAM (no loading/unloading wait times)
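One way to feel what continuous batching buys you is to fire a batch of concurrent requests at vLLM’s OpenAI-compatible server. A minimal sketch, with host, port, and model name as placeholders:

```python
# Sketch: concurrent requests against vLLM's OpenAI-compatible server.
# vLLM batches these on the GPU instead of serving them strictly one at a time.
# Host, port, and model name are placeholders.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://vllm-host:8000/v1", api_key="not-needed")

async def one(i: int) -> str:
    resp = await client.chat.completions.create(
        model="Qwen/Qwen2.5-14B-Instruct",
        messages=[{"role": "user", "content": f"Give me fun fact #{i} about GPUs."}],
    )
    return resp.choices[0].message.content

async def main() -> None:
    answers = await asyncio.gather(*(one(i) for i in range(16)))
    for a in answers:
        print(a[:80])

asyncio.run(main())
```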

On this platform:

  • Ollama is the rapid local workshop.
  • vLLM is the full deployment tool.

Local Whisper STT + Wyoming TTS

Whisper STT exposes an OpenAI-compatible endpoint that the chat UI can treat as “just another OpenAI API”. Whisper’s original release is still the canonical reference for what it does and why it works well across languages.

Wyoming protocol TTS likewise exposes an OpenAI-compatible API that keeps the client integration clean.
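Because both services present OpenAI-style endpoints, the client code looks just like the chat calls above. A minimal sketch, with hostnames, ports, and model/voice names as placeholders for my deployment:

```python
# Sketch: local speech services consumed through the standard OpenAI client.
# Hostnames, ports, and model/voice names are placeholders.
from openai import OpenAI

stt = OpenAI(base_url="http://whisper-host:8001/v1", api_key="not-needed")
tts = OpenAI(base_url="http://tts-host:8002/v1", api_key="not-needed")

# Speech -> text via the transcriptions endpoint
with open("note.wav", "rb") as audio:
    text = stt.audio.transcriptions.create(model="whisper-1", file=audio).text

# Text -> speech via the speech endpoint
speech = tts.audio.speech.create(model="tts-1", voice="alloy", input=text)
with open("reply.mp3", "wb") as out:
    out.write(speech.content)
```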

Other OpenAI-compatible TTS servers I’m looking to integrate are Kitten-TTS and Chatterbox-TTS, as swappable voice backends (especially once custom voice training becomes part of the platform).


Self-Hosted Models and Privacy

An important feature of this stack is privacy. When you host the models and control the platform end to end, you have complete control over your data. The models, services, and data sit behind your firewall, so you can shut off any data leaks or phoning home; only the tools you want to reach the internet can, like Firecrawl. The hardware cost can be high, but if Nvidia’s famous research paper holds true, smaller models are going to win out. It also means you can push the models and tools as hard as your hardware will allow without running up an API provider bill while you experiment and break things. Running your own infrastructure also forces you to learn how everything works and fits together, so you know how to troubleshoot issues when they occur. That is crucial given the speed of innovation in the agentic AI space.


Firecrawl, MCP Friction, and Custom LobeChat plugins

Firecrawl is the missing link between “agents” and “the internet”: scraping, crawling, search, extraction, and actions that turn websites into LLM-ready payloads. The self-hosted Firecrawl stack is a great bundle in its own right, including Playwright and SearXNG as part of its features.
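For a sense of what “LLM-ready payloads” means, here’s a minimal sketch against a self-hosted instance using Firecrawl’s v1 REST API. The host, port, and response handling are placeholders for my setup.

```python
# Sketch: asking a self-hosted Firecrawl instance for an LLM-ready markdown version of a page.
# Base URL/port are placeholders; the request/response shape follows Firecrawl's v1 REST API.
import requests

FIRECRAWL = "http://firecrawl-host:3002"  # hypothetical self-hosted endpoint

resp = requests.post(
    f"{FIRECRAWL}/v1/scrape",
    json={"url": "https://example.com", "formats": ["markdown"]},
    timeout=60,
)
resp.raise_for_status()
page = resp.json()["data"]

print(page["markdown"][:500])  # clean markdown, ready for a prompt or a vector store
```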

MCP Issue: “localhost” means inside the container

Firecrawl has an official MCP server that supports a “Streamable HTTP Mode” where the MCP endpoint lives at http://localhost:3000/mcp.

That works on a single machine, like a desktop MCP client, but in a dockerized, multi-service stack a localhost binding like that breaks the integration:

  • localhost from inside the Firecrawl stack container is not your LobeChat host
  • MCP tooling is often intentionally bound to loopback for safety, which prevents it from working as a network service unless you run your entire deployment on that one host.

The Fix: a custom API that installs as a custom LobeChat plugin

LobeChat’s plugin system is clean: a manifest + an OpenAPI spec, invoked via tool/function calling.

That lets me:

  • Front Firecrawl with a small service that speaks the exact operations I care about (sketched below)
  • Keep the Firecrawl stack in a stand-alone portable container on another host.
  • Expose both the LobeChat Plugin Manifest and OpenAI Tool Manifest.
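A stripped-down sketch of that fronting service, using FastAPI (which generates the OpenAPI spec for free). The URLs, route names, and the manifest contents here are illustrative placeholders rather than my production plugin.

```python
# Sketch: a small gateway that fronts Firecrawl and installs as a LobeChat plugin.
# URLs, routes, and the manifest fields are illustrative placeholders.
import httpx
from fastapi import FastAPI
from fastapi.responses import JSONResponse

FIRECRAWL = "http://firecrawl-host:3002"  # Firecrawl stack living on another host

app = FastAPI(title="Web Research Tool", version="0.1.0")  # OpenAPI spec comes for free

@app.get("/manifest.json")
def manifest() -> JSONResponse:
    # Minimal LobeChat-style plugin manifest pointing back at this service.
    return JSONResponse({
        "identifier": "web-research",
        "api": [{"name": "scrapePage", "url": "https://tools.lab.example/scrape",
                 "description": "Fetch a URL and return LLM-ready markdown."}],
        "meta": {"title": "Web Research", "description": "Firecrawl-backed scraping"},
    })

@app.post("/scrape")
async def scrape(url: str) -> dict:
    # Proxy the exact operation we care about to Firecrawl and return a trimmed payload.
    async with httpx.AsyncClient(timeout=60) as client:
        r = await client.post(f"{FIRECRAWL}/v1/scrape",
                              json={"url": url, "formats": ["markdown"]})
        r.raise_for_status()
    return {"markdown": r.json()["data"]["markdown"]}
```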

This is the part that turns “cool self-hosted chat” into a real agent platform lab. You start designing tools the way an enterprise does and you learn how to build and integrate any tool you want your agents or chats to have.


Roadmap: Langfuse, ChromaDB, and agentic workflow control

Observability + tracing: Langfuse (in progress)

I’m adding Langfuse as the tracing/observability spine:

  • Every conversation becomes a trace
  • Tool calls become spans
  • Latency, cost, and error patterns become visible

Self-hosting Langfuse is also wonderfully “real world”: it uses ClickHouse for high-volume event data and a relational store for the rest. That’s the LLMOps lesson: if you can’t see it, you can’t improve it. It’s currently deployed but issues are still being worked out.
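Wiring that in is mostly SDK work on the tool and agent side. A rough sketch, assuming the Langfuse Python SDK’s v2-style trace/span API (the exact calls differ across SDK versions) and a self-hosted instance at a placeholder URL:

```python
# Rough sketch of manual tracing with the Langfuse Python SDK (v2-style API; v3 differs).
# Host and keys are placeholders for the self-hosted instance.
from langfuse import Langfuse

langfuse = Langfuse(
    host="https://langfuse.lab.example",
    public_key="pk-placeholder",
    secret_key="sk-placeholder",
)

# One conversation turn becomes a trace...
trace = langfuse.trace(name="chat-turn", user_id="drew", metadata={"client": "lobechat"})

# ...and each tool call becomes a span with its own latency, input, and output.
span = trace.span(name="firecrawl-scrape", input={"url": "https://example.com"})
span.end(output={"markdown_chars": 18234})

# LLM calls get recorded as generations, so cost and latency per model show up in the dashboard.
gen = trace.generation(name="draft-answer", model="qwen2.5-14b", input="…prompt…")
gen.end(output="…model reply…")

langfuse.flush()  # push buffered events before the process exits
```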

Vector storage + RAG control: ChromaDB (future)

ChromaDB will be an explicit vector layer where I can explore:

  • Chunking strategies (how text becomes retrieval units)
  • Collection boundaries (what knowledge belongs together)
  • Retrieval tuning techniques
  • Custom Built MCP
  • Agent Memory
  • Just-in-Time-Prompting

I’m intentionally keeping pgvector in the vocabulary too, for LobeChat’s native RAG and as an additional option for exploration and comparison. ChromaDB, as one of the canonical vector databases, will be the place for experimenting and pushing the bounds of what a vector store can do for chat and agents.
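The exploration surface is small enough to show. A minimal sketch against a ChromaDB server, with the hostname and collection name as placeholders and deliberately naive chunking:

```python
# Minimal sketch: a ChromaDB collection as an explicit, inspectable retrieval layer.
# Host, port, and collection name are placeholders; chunking is intentionally naive.
import chromadb

client = chromadb.HttpClient(host="chroma-host", port=8000)
docs = client.get_or_create_collection("homelab-notes")

text = open("vllm-notes.md").read()
chunks = [text[i:i + 800] for i in range(0, len(text), 800)]  # crude fixed-size chunking

docs.add(
    ids=[f"vllm-notes-{i}" for i in range(len(chunks))],
    documents=chunks,
    metadatas=[{"source": "vllm-notes.md"} for _ in chunks],
)

hits = docs.query(query_texts=["how does continuous batching work?"], n_results=3)
for doc in hits["documents"][0]:
    print(doc[:120])
```

Swapping the chunk size, collection boundaries, or query strategy is exactly the kind of experiment this layer is for.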

Agentic workflow control: LangChain/Langflow + n8n (future)

This is where the platform becomes an AgentSpace:

  • LangChain / LangGraph for agent orchestration patterns (agents + tools + retrieval as deliberate architecture)
  • Langflow to visually prototype chains/agents and then harden them
  • n8n to operationalize workflows (schedules, triggers, connectors) and treat agents like automation nodes

The best part is that I already have a suite of tools ready to use in these agents (see the sketch after this list):

  • Firecrawl as the web search and reading tool.
  • ChromaDB as the retrieval memory layer.
  • Langfuse for agent observability.
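As a taste of where this goes, here is a rough sketch of a ReAct-style LangGraph agent that reuses those pieces. Endpoints, model names, and the tool body are placeholders for this lab, not a finished workflow.

```python
# Rough sketch: a ReAct-style agent that reuses the platform's existing pieces.
# Endpoints, model names, and the tool body are placeholders.
import requests
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

@tool
def scrape_page(url: str) -> str:
    """Fetch a web page as LLM-ready markdown via the self-hosted Firecrawl stack."""
    r = requests.post("http://firecrawl-host:3002/v1/scrape",
                      json={"url": url, "formats": ["markdown"]}, timeout=60)
    r.raise_for_status()
    return r.json()["data"]["markdown"][:4000]

# The "OpenAI" model is really vLLM behind its OpenAI-compatible endpoint.
llm = ChatOpenAI(base_url="http://vllm-host:8000/v1", api_key="not-needed",
                 model="Qwen/Qwen2.5-14B-Instruct")

agent = create_react_agent(llm, tools=[scrape_page])
result = agent.invoke({"messages": [("user", "What does the vLLM README say about PagedAttention?")]})
print(result["messages"][-1].content)
```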

The LobeChat surface stays the same: one interface where I can chat, build artifacts, call tools, and invoke agents, while building, exploring, and iterating.


Project References