Field Notes

Agents for the Process-Minded: An RPA Developer's Primer on Building AI That Acts

Published 2026-03-22

  • agents
  • automation
  • tooling
  • architecture

You have spent years building automations that follow rules with mechanical precision. Every selector, every queue item, every exception path exists because you put it there. Now the industry is asking you to build systems that make their own decisions, and the pitch sounds suspiciously like magic. It isn’t. Agents are systems, and systems are what you already know how to build. But the engineering discipline is genuinely different, and the gap between “kinda works in a demo” and “works reliably in production” is wider than most people realize.

I have spent over a decade in automation: Blue Prism, Automation Anywhere, and UiPath. The last two years I have been building agentic AI projects and teaching other RPA developers how to think about this shift. What follows is the field guide I wish had existed when I started crossing that gap myself. Not theory. Not hype. The actual engineering discipline that makes agents work.


The Probability Machine You’re Actually Programming

If you have built RPA bots, you have built deterministic systems. Given input X, the bot always does Y, because you wrote every branch. An LLM does not work that way. It predicts the next token from a probability distribution shaped by everything in its context window. Your prompt is the input that determines which region of probability space the output lands in. A vague prompt spreads that distribution wide and you get unpredictable results. A precise, structured prompt narrows it and you get consistency.

The shift is not from “rules” to “intelligence.” It is from writing explicit control flow to constraining probability. That distinction matters enormously for how you debug, test, and trust what you build.

Three mechanical details make this concrete. First, the model does carry knowledge from its training data: general language understanding, reasoning patterns, facts about the world up to its training cutoff date. That baseline knowledge is real and useful. However, it is frozen and incomplete. The model does not know your company’s policies, your customer records, or anything that happened after its training cutoff. For agent work, the context window is where you bridge that gap. Your system prompt, tool definitions, conversation history, and the user’s message compose the working universe for any given call. The model reasons using its trained knowledge plus whatever you put in the window, and the window is the only part you control at runtime. This sounds obvious, but it changes how you architect data access in ways that surprise people coming from stateful automation platforms. The model is not a blank slate; it is a knowledgeable collaborator with no access to your specific world unless you provide it.

Second, training knowledge has a cutoff date, and that date matters for production agents. If your agent needs to reference current events, recent product changes, or live data, you cannot rely on the model’s training alone. This is one of the core reasons context grounding and retrieval-augmented generation exist: they inject current, company-specific information into the window at runtime, supplementing what the model already knows with what it cannot.

Third, temperature is your most immediate control lever. Temperature governs how much randomness the model introduces when sampling from its probability distribution. For classification, extraction, and routing, which is most of what production agents do, you want temperature at or near zero. Increasing variability in the output is something you opt into deliberately. Some slight variability will always remain, driven by output structure, data sources, and prompting, but when you need more, temperature is the dial to turn.
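To make the temperature mechanic concrete, here is a small self-contained sketch (not a real LLM; the logits are invented) of how temperature rescales a next-token distribution before sampling. Low temperature concentrates nearly all probability on the top token; high temperature spreads it out.

```python
# Illustrative sketch: how temperature reshapes a next-token
# distribution before sampling. The logits are hypothetical.
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities, scaled by temperature.
    Lower temperature sharpens the distribution; higher flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                  # model's raw preference for 3 tokens

low = softmax_with_temperature(logits, 0.1)   # near-deterministic sampling
high = softmax_with_temperature(logits, 2.0)  # deliberately varied sampling

print(round(low[0], 4))    # top token takes almost all the probability mass
print(round(high[0], 4))   # mass is spread across the alternatives
```

At temperature near zero the top token dominates almost completely, which is why classification and routing prompts behave consistently there.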


Prompts Are Code, Not Conversation

This is the single biggest mental shift for developers coming from RPA. In traditional automation, you write code that is the logic. In agent development, you write natural language that instructs a probability machine. The same software engineering principles still apply: modularity, single responsibility, task decomposition, version control. The medium is different enough that the analogies require honest qualification.

A production agent prompt is not a typical paragraph of friendly instructions. It is a structured specification: a role definition, explicit decision criteria, labeled examples, edge case handling, and output format constraints, each in its own section. You can think of each section as a module, and you should organize them so that the logic is easy to find and maintain. This is where the analogy to independently testable code units breaks down: the entire prompt shapes the probability distribution over all outputs simultaneously. Changing the edge case section can shift how the model interprets your role definition. Adding an example can reweight the decision criteria. The sections are not independent functions with clean interfaces; they are regions of a single input that the model processes holistically.

This means prompt engineering lives in a tension. You structure for maintainability, the same way you would organize any codebase, but you accept that changes can have non-local effects. Sometimes you can bolt on a new rule additively and everything holds. Sometimes the accumulated weight of instructions pushes the model in a direction where you need to restructure the whole prompt to restore coherent behavior. Anyone who has written long-form prose will recognize the feeling: sometimes you can edit a paragraph in isolation, and sometimes fixing one section means rewriting the chapter to make the argument hold together. The discipline is in knowing which situation you are in, and the only way to know is to test.

That said, the toolkit for writing effective prompts is concrete and learnable.

Few-shot examples. Show the model three correct classifications and it will pattern-match against those examples more reliably than it follows abstract instructions. Labeled data beats rules almost every time.

XML structure. Tags like <instructions>, <examples>, <rules>, and <edge_cases> give the model parsing anchors. They are not just formatting; the model uses them to organize its processing, and they make your own maintenance work tractable when a prompt grows past a few hundred words. The same goes for Markdown: these formats are tools for conveying the meaning and relative importance of your instructions.

Structured output. Constraints like requiring JSON with specific keys force parseable responses and make downstream integration reliable. If your agent’s output feeds another system, this is non-negotiable.

Positive framing. The right framing matters more than you might expect. “Classify into exactly one of these four categories” is a stronger instruction than “don’t misclassify.” The model is better at doing things than at avoiding things. Say what to do. Remember: vague instructions yield variable output.

Descriptions are prompts. Every tool description, every context source label, every field name that the model sees is effectively a tiny prompt. A vague tool description means the wrong tool gets called. Treat every text field the model reads as a first-class piece of your specification.

The contrast with how people naturally write instructions is stark. A human SOP says “use good judgment when handling escalations.” An agent specification says “ESCALATE if ANY: confidence below 0.7, customer mentions legal action, refund exceeds $500, three or more failed resolution attempts. Otherwise RESOLVE with: acknowledgment of issue, specific fix, maximum two follow-up questions.” The model does not have judgment. It has probability. Give it decision criteria. There are many more techniques to study that will improve your prompt writing; in your prompts, your audience is the LLM.
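A useful side effect of writing criteria that precisely is that you can also enforce them deterministically, outside the model. Here is the ESCALATE spec above expressed as a guardrail you can run on the agent's own structured output; the field names are hypothetical.

```python
# The ESCALATE criteria from the text, as a deterministic guardrail
# applied to the agent's structured output. Field names are hypothetical.

def should_escalate(result: dict) -> bool:
    """ANY matching criterion escalates; mirrors the prompt's spec."""
    return (
        result.get("confidence", 0.0) < 0.7
        or result.get("mentions_legal_action", False)
        or result.get("refund_amount", 0.0) > 500
        or result.get("failed_attempts", 0) >= 3
    )

print(should_escalate({"confidence": 0.9, "refund_amount": 120}))   # False
print(should_escalate({"confidence": 0.9, "refund_amount": 620}))   # True
print(should_escalate({"confidence": 0.6}))                         # True
```

Running the same criteria in both places, in the prompt so the model aims for them and in code so violations cannot slip through, is a cheap belt-and-suspenders pattern for high-stakes decisions.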


The Configuration Stack: Five Layers That Determine Everything

When an RPA bot misbehaves, you open the workflow, find the activity, and read the logic. When an agent misbehaves, there is no single place to look because behavior emerges from the interaction of multiple configuration surfaces. I think of it as a five-layer stack, and walking it systematically is how you debug.

Layer 1: System Prompt. The agent’s operating instructions. Its persona, constraints, decision logic. This is the script that defines what the agent does, how it decides, and when it stops. Everything else supports this layer.

Layer 2: Tool Descriptions. What the agent can do, described in natural language. Each tool has a description, input schema, and output schema. The description tells the model when to use the tool. The schema tells it how. Bad descriptions lead to wrong tool selection. Too many tools create decision overhead; the sweet spot is five to eight focused tools. If you need more surface area, split into focused sub-agents: a classification agent, a lookup agent, a resolution agent, each with three to five tools, will outperform a single agent juggling fifteen.
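A concrete tool definition makes the description/schema split visible. The exact format varies by platform; this sketch follows the common JSON-Schema-style convention, and the tool name and fields are hypothetical.

```python
# A sketch of one focused tool definition. The description tells the
# model WHEN to call the tool; the input schema tells it HOW.
# Tool name, fields, and the sibling tool mentioned are hypothetical.

lookup_order = {
    "name": "lookup_order",
    "description": (
        "Fetch a single order by its order ID. Use when the customer "
        "references a specific order. Do NOT use for general account "
        "questions; use lookup_account for those."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Order ID, e.g. 'ORD-12345'",
            }
        },
        "required": ["order_id"],
    },
}

print(lookup_order["name"])
```

Note that the description does double duty: it says when to use the tool and, just as importantly, when not to, which is what steers the model away from near-miss tool selections.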

Layer 3: Model Settings. Temperature, top-p, output token limits, and model choice. Temperature at zero for consistency. Model choice is a cost-accuracy tradeoff; start with the smaller, cheaper model, upgrade only if evaluations fail. And pin to a specific model version in production. Model updates from providers can silently change behavior in ways that break your carefully tuned prompts without changing a single character of your configuration.

Layer 4: Context Grounding. What data the agent can access beyond its training. Knowledge bases for static documents, just-in-time retrieval for dynamic data, multi-document synthesis for complex research questions. The choice is not which approach is “best” but which matches your data freshness requirements. More on this in the next section.

Layer 5: Escalation Rules. When the agent asks for help. Low confidence, policy violations, novel situations, high-stakes decisions. Escalation is not failure; it is architecture. A well-designed escalation policy with memory means the agent learns from every human resolution, steadily narrowing the cases that require intervention.

The debugging implication is direct. When something goes wrong, walk the stack. Is the system prompt ambiguous? Are tool descriptions leading the model to the wrong tool? Is the temperature set for creativity when you need consistency? Is context grounding returning irrelevant chunks? Are escalation thresholds too high or too low? The answer lives in one of these five layers. This is the agent equivalent of stepping through a debugger, and once you internalize the stack, diagnosis becomes systematic rather than guesswork.


Getting Your Data on Stage

Your company’s proprietary data does not exist inside the model. Policies, customer records, product catalogs, process documentation: the LLM has never seen any of it. If you want the agent to use that data, you have to put it into the context window at runtime. This is what retrieval-augmented generation does, and the mechanism underneath is vector search. In RPA terms, think of it as a dynamic data lookup, except instead of querying a database by key, you are querying by meaning.

The pipeline has two phases. First is indexing, which is a setup operation. You take your documents, chunk them into pieces of 256 to 512 tokens at natural boundaries, and run each chunk through an embedding model that converts text into a vector: a list of numbers that represents semantic meaning. “Invoice” and “bill” map to similar vectors. “Invoice” and “puppy” map to vectors far apart. Those vectors get stored in a vector database alongside the original text.

Second is retrieval, which happens on every agent query. The user’s question gets embedded with the same model into the same vector space. Cosine similarity search finds the stored vectors closest to the query vector. The top three to five matching chunks get pulled out and injected into the context window. Now the agent has your company’s data “on stage.”
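The retrieval step can be sketched end to end with toy data. Real systems use a learned embedding model and a vector database; here a tiny hand-made "embedding" stands in so the cosine-similarity mechanics are visible. All vectors and chunk texts are invented.

```python
# Toy retrieval sketch: hand-made vectors stand in for a real embedding
# model so the cosine-similarity ranking is easy to follow.
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Pretend vector index: (chunk text, its stored embedding)
index = [
    ("Refunds are processed within 5 business days.", [0.9, 0.1, 0.0]),
    ("Invoices are emailed on the 1st of each month.", [0.1, 0.9, 0.0]),
    ("Password resets require a verified email.",      [0.0, 0.1, 0.9]),
]

# Stand-in for embed("How do I get my money back?") with the same model
query_vec = [0.8, 0.2, 0.1]

# Rank chunks by similarity to the query and keep the top k
top_k = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)[:2]
for text, _ in top_k:
    print(text)
```

The refund chunk ranks first even though the query never says "refund": proximity in the vector space is what carries the match, which is the whole point of semantic search.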

This is semantic search: it finds meaning, not keywords. “How do I process refunds?” matches documents about refund workflows even if those documents use completely different terminology. That is a significant upgrade over keyword-based lookups, but it comes with its own engineering discipline.

Chunking strategy matters more than model choice for RAG quality. Split at natural boundaries, not mid-thought. Use 10 to 20 percent overlap between chunks so that context spanning a boundary does not get lost. Bad chunking produces incomplete retrieval, which produces bad decisions, and no amount of prompt engineering will fix data that arrives in the context window already broken.
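The overlap mechanic is simple to sketch. Production chunkers split at natural boundaries (headings, paragraphs, sentences); this minimal version shows only the sliding-window overlap, counting words as a rough stand-in for tokens.

```python
# A minimal word-based chunker with percentage overlap. Words stand in
# for tokens; real chunkers also respect natural boundaries.

def chunk_words(text: str, chunk_size: int = 300, overlap: float = 0.15):
    words = text.split()
    # Advance by less than a full chunk so consecutive chunks share words
    step = max(1, int(chunk_size * (1 - overlap)))
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_words(doc, chunk_size=300, overlap=0.15)

print(len(chunks))                                   # 4 chunks for 1000 words
# Consecutive chunks share a band of words, so context that spans a
# boundary survives in at least one chunk intact
print(chunks[0].split()[-1], chunks[1].split()[0])   # w299 w255
```

With 15 percent overlap, each boundary carries a 45-word shared band here; that band is what keeps a sentence that straddles two chunks retrievable as a whole.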

The other discipline is the token budget. Every piece of information competing for the context window has a cost. Fifteen tool descriptions at 200 tokens each: 3,000 tokens. Five context results at 400 tokens each: 2,000 tokens. System prompt at 500 tokens. That is 5,500 tokens consumed before the user even speaks. Retrieve three to five chunks, not twenty. Every token you save is response capacity for the agent to actually reason and answer.
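The back-of-envelope budget above is worth writing down explicitly; the counts are the illustrative figures from the text, and the context limit is a hypothetical round number.

```python
# The token budget from the text, as arithmetic. Counts are the
# article's illustrative figures; the context limit is hypothetical.
context_limit = 8_000

fixed_overhead = {
    "system_prompt": 500,
    "tool_descriptions": 15 * 200,   # fifteen tools at ~200 tokens each
    "retrieved_chunks": 5 * 400,     # five chunks at ~400 tokens each
}

consumed = sum(fixed_overhead.values())
print(consumed)                      # 5500 tokens gone before the user speaks
print(context_limit - consumed)      # 2500 left for the user turn and the answer
```

Cutting to five tools and three chunks would reclaim roughly 2,800 of those tokens, which is why trimming tool count and retrieval depth is usually the first budget lever to pull.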


The Development Cycle: Build, Trace, Fix, Evaluate

In RPA, your development cycle might be build, test in Studio, deploy to Orchestrator, monitor. Agent development follows a similar loop, but with critical additions: traces and systematic evaluation. Not as afterthoughts. As requirements at every stage of development.

A trace shows you everything the agent saw, reasoned, decided, called, and returned. It is the agent equivalent of a stack trace, and it is the only way to debug from evidence instead of assumptions. The question is never “what did I expect the agent to do?” It is “what did the trace show the agent actually did?” When a classification comes back wrong, the trace tells you whether the model misread the context, called the wrong tool, received bad data from retrieval, or applied your prompt logic in an unexpected way. Each of those is a different fix in a different layer of the configuration stack.

But tracing alone does not catch regressions. This is where evaluation becomes essential, and I mean essential in the most literal sense: you should not be developing without it. A golden test set of 30 to 50 or more input-and-expected-output pairs is your regression suite. After every prompt change, run the full set. Not a spot check on the case you just fixed. The full set. Every time.
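A golden-set runner can be very small. In this sketch, `classify` is a stand-in for your real agent call and the cases are invented; the only non-negotiable part is that every change re-runs the full set.

```python
# Minimal golden-set regression runner. `classify` is a stand-in for
# the real agent call; the cases are invented for illustration.

GOLDEN_SET = [
    {"input": "I was double charged.",    "expected": "billing"},
    {"input": "App crashes on startup.",  "expected": "technical"},
    {"input": "Change my email address.", "expected": "account"},
    # ...in practice, 30-50+ cases covering normal paths and edge cases
]

def classify(text: str) -> str:
    """Stand-in for the agent under test; canned answers here."""
    canned = {
        "I was double charged.": "billing",
        "App crashes on startup.": "technical",
        "Change my email address.": "account",
    }
    return canned.get(text, "other")

def run_regression(golden):
    """Run the FULL set and report every failure, not just the last fix."""
    failures = [c for c in golden if classify(c["input"]) != c["expected"]]
    print(f"{len(golden) - len(failures)}/{len(golden)} passed")
    return failures

failures = run_regression(GOLDEN_SET)
```

Wire this into the same place you keep your prompt under version control, so a prompt change and its full-set result land in the same commit.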

The reason is something every developer has felt: fixing one thing breaks another. In deterministic code, a change in one function is usually isolated unless there is a shared dependency. In an agent, every prompt change reshapes the probability distribution across all inputs. Adding a rule for one edge case can shift how the model weights evidence on a completely unrelated case. The prompt is a holistic system, not a collection of independent parts, and the only way to verify that your fix did not introduce a regression is to test comprehensively.

The techniques that mitigate this are worth internalizing. Additive rules over rewrites: do not restructure what is working; bolt on the exception. Sectioned prompts for isolation: group related logic so that changes are at least locally contained, even if not perfectly independent. Scoped examples: when adding an edge case example, pair it with a normal case so the model sees the boundary explicitly. And progressive specificity: general rules first, exceptions after, so the model’s baseline behavior is established before you layer in special cases.

Three evaluator types cover most production needs. LLM-as-judge for semantic assessment, asking whether a response is accurate and complete. Trajectory evaluation for the full execution path, checking whether the agent called the right tools in the right order with the right arguments. Deterministic evaluation for exact matches, verifying that output JSON contains the expected fields and values.

The iteration mindset is the connective tissue. The first version of your prompt will not be the last. Structure your prompt so it is easy to modify. Log what you change and why. Run evaluations after every change. This is software engineering applied to natural language, and the developers who internalize this cycle build agents that survive contact with production.


What You Already Know That Transfers

The narrative in the industry sometimes frames the move from RPA to agents as learning something entirely foreign. That undersells what process-minded developers bring to the table.

Process engineering, the discipline of mapping and understanding a workflow before you automate it, is exactly what agent design demands. In RPA, you would never start building a bot without understanding the full process: the inputs, the decision points, the exceptions, the downstream systems. Agent development requires the same thoroughness, but the output is different. You are not mapping the process to if-then logic in a workflow designer. You are constructing an LLM-specific SOP: a prompt that captures the process with enough precision that a probability machine can execute it reliably, but with enough flexibility to handle the natural ambiguity that deterministic automation could never touch. Understanding the full process end-to-end is how you write a prompt that handles the messy, real-world cases where simple branching logic falls short. This is where your RPA background gives you a genuine advantage; you already know how to extract and document the operational knowledge that most agent builders are guessing at.

Exception handling, the instinct to ask “what happens when this fails?” before you ship, maps directly to escalation architecture and guardrails. The difference is that in RPA, you enumerate every exception explicitly. In agent development, you define escalation thresholds and trust the model to recognize the boundary. But the discipline of thinking about failure modes proactively is identical, and it is rarer than it should be among people who come to agents from a pure software engineering background without the operational automation mindset.

The documentation habit transfers too. Building PDDs and SDDs that capture decisions, constraints, and process logic becomes prompt version control and evaluation set maintenance. If you have spent years writing process design documents, you are already practiced at translating messy human processes into structured specifications. That is exactly what a production agent prompt is.

What is genuinely new is the probabilistic nature of the execution engine. You cannot step through an agent line by line. You cannot guarantee the same output from the same input if the context window contains anything dynamic. You have to build comfort with a system that, given dynamic inputs, is correct 97% of the time instead of 100%, and design your architecture, your escalation policies, and your human oversight to account for that gap. This requires new instincts, and it takes practice.

The engineering discipline underneath is not new. The rigor you already bring is the foundation.


The craft of building reliable agents is not about abandoning what you know. It is about extending it into a domain where the execution engine is probabilistic instead of deterministic. The same obsession with edge cases, the same insistence on testing, the same refusal to ship something you cannot explain: that is exactly what separates agents that survive production from agents that impress in a demo and then quietly get turned off. The tools changed. The discipline did not.