What comes after prompt engineering?
The AI world is saturated with advice about prompts.
We have system prompts, custom instructions, skills, rules files, CLAUDE.md, AGENTS.md, copilot-instructions.md, spec-driven development, superpowers, and endless Markdown files designed to tell AI agents how to behave before they start working.
That work certainly matters. Prompt engineering is genuinely helpful. Good instructions help agents understand the project, respect conventions, use the right tools, ask better questions, and avoid obvious mistakes. Planning documents, reference materials, system instructions, and troubleshooting guides all contribute to repeatable workflows that reliably produce the desired results.
But today I want to discuss a different aspect of working successfully with AI agents: Tool responses. Unlike chatbots that simply return a response to the user, long-running AI agents travel far beyond the initial prompt before finally returning control. A coding agent, for instance, might read files, search the Git commit history, check files again, edit code, run tests, parse errors, make more edits, rerun tests, and so on, all triggered by a single prompt.
After the model is selected and the prompt is supplied, the rest of the workflow is dominated not by the contents of the system instruction, plan documents, and chat prompt, but by what happens next: Tool calls, tool results, intermediate state, errors, warnings, partial successes, retries, previews, and recovery decisions.
That is where prompt engineering runs out of leverage, and where tool-response engineering begins.
The question this post answers: What is tool-response engineering?
TIP
Tool-response engineering is the practice of designing what tools return to AI agents after each action so the next step is safer, clearer, more efficient, and easier to verify.
A well-engineered tool response should help the agent understand:
- what changed,
- what did not change,
- what state the workflow is now in,
- what risks remain,
- what should happen next,
- what recovery paths are available,
- and what should not be done yet.
If AI agents were not involved, a tool response could be treated like any other function's return value, and could be as simple as an HTTP status code or a console log. Most AI agent tools are indeed built in precisely this manner. But inside an agentic workflow, the tool response has a far more important role to play: It is the engineer's chance to steer the next step.
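To make this concrete, here is a minimal sketch (not taken from any real tool or standard) of a response shape that carries each of those signals; every field name is an assumption introduced purely for illustration:

```typescript
// Hypothetical response shape illustrating the signals listed above.
// None of these field names come from a real tool or specification.
interface EngineeredToolResponse {
  status: "applied" | "staged" | "partial" | "failed"; // what state the workflow is now in
  changed: string[];     // what changed (files or regions touched)
  unchanged: string[];   // what was deliberately left alone
  risks: string[];       // risks that remain after this call
  nextSteps: string[];   // what should happen next
  recovery: string[];    // recovery paths if the result is wrong
  blockedUntil?: string; // what should not be done yet, and why
}
```

The exact fields matter far less than the categories of information they represent.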
What tool-response engineering is
Every tool an AI coding assistant calls returns a structured response object (for tools added through the Model Context Protocol, a JSON-RPC result). That is true whether the tool was built by the model vendor, shipped with the IDE, or added through MCP: The call goes out as a request, the tool does its work, and a structured object comes back.
So at its most basic, tool-response engineering is the deliberate design of those response objects: What fields they contain, what shape they take, what the tool chooses to tell the agent, and what it chooses to omit.
Notice that tool-response engineering often involves far more discretion than tool-input engineering. Tools should have a simple, well-defined input schema (say, a rename operation that takes a file path, an old name, and a new name), with every optional parameter and extra argument scrutinized skeptically for genuine necessity and all but the essential arguments omitted. Optimizing an input schema is mostly a matter of minimizing what the agent must supply to call the tool successfully.
By contrast, the tool response object is an open design space. A tool author must decide, at minimum:
- Whether to return a boolean success flag, a status code, a status string, or something richer.
- Whether to include a count of matches, replacements, or affected lines.
- Whether to describe what changed, or only that something changed.
- Whether to report the before-and-after state of the file, or leave the agent to discover it.
- Whether to flag the scope of the operation in any way — files touched, lines changed, unrelated regions potentially affected.
- Whether to name the next reasonable action, or leave the agent to infer it.
- Whether the write is final, or held in some intermediate state the agent can still inspect, refine, or cancel.
- Whether errors, partial successes, and recoverable failures are distinguishable from each other.
- Whether the response is structured for the agent, for the human reviewer, or for both.
None of these choices is obvious from the input schema. Two tools with identical inputs, serving functionally equivalent roles, can return profoundly different response objects, and the difference will ripple through every subsequent decision the agent makes in that workflow. At a minimum, the tool response must inform the agent whether the input was syntactically valid, and should return an error if the request was malformed. A better response to a malformed request will supply the agent with information on how to retry the request successfully, such as by giving an example.
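As one hedged illustration, a response to a malformed call to a hypothetical rename tool might both report the error and show the agent how to retry; the field names and example text below are assumptions, not any real tool's schema:

```typescript
// Sketch of an error response for a malformed call to a hypothetical rename tool.
// The fields and the example retry text are illustrative only.
const malformedRequestResponse = {
  status: "error",
  error: "Missing required parameter 'newName'.",
  received: { path: "src/app.ts", oldName: "userId" },
  howToRetry:
    "Call rename again with all three parameters, for example: " +
    '{ "path": "src/app.ts", "oldName": "userId", "newName": "accountId" }',
};
```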
The most interesting problems do not arise when tool calls are malformed, however; those are in a sense the more straightforward problems to solve. The deeper challenge of tool-response engineering is to accommodate the vast array of possible "success" scenarios in which the agent's tool call, though syntactically correct, is semantically or logically wrong.
TIP
Good tool-response engineering, then, is the craft of designing response objects that give the agent exactly what it needs to make its next decision safely — and nothing it does not.
The key word is decision. Every tool call creates a new decision point for the agent, and the question is what information the agent has available after calling the tool. If the response is sparse, the agent still has to decide, but it decides with less information. Perhaps it determines that it must now spend more tool calls gathering information that could have been returned in the first response, creating inefficiency. Worse, perhaps it concludes, based on a terse { "success": true } response, that its overall mission is complete. The benefits of brevity in tool-input design do not necessarily carry over to tool-response design. When tool responses are under-engineered, the outcome resembles a prompt that lacks the context needed to guide the agent to the desired end state. The difference is that, while we may craft new prompts and instructions on a regular basis, tool architecture requires stability and universality across all response possibilities.
A well-engineered response does not always need to be lengthy. Verbosity is not the point. A helpful response contains what the agent needs to decide the next step, but not more.
Consider a common coding scenario: The agent is prompted by the user to rename a variable across a file. After the rename runs, next steps might depend on answers to questions like these:
- Did the agent's tool call match only the intended occurrences, or did it catch others too?
- How many places changed?
- What is the current state of the file — written to disk, or held for review?
- Did the change touch regions not within the scope of the user's request?
- If the tool call was executed but something is wrong, is there a recovery path that does not require restoring the file from backup?
A terse response of "success" leaves every one of those questions unanswered. The agent might still choose to answer some or all of them, by re-reading the file, performing new searches, or comparing against a remembered state, but each follow-up is its own small gamble. The quality of verification depends not only on whether such checks are feasible for the agent to perform, but also on whether the agent chooses to perform them at all. System instructions that direct the agent to check its work after a complex operation can help, but they do not tell the agent, in the moment the tool response arrives, whether further tool calls are: 1. desirable, 2. recommended, 3. essential, 4. unnecessary, or 5. something else altogether. The agent's decision about subsequent tool calls does not need to be left to chance, yet an under-engineered tool response does just that.
An engineered response answers those questions inside the same exchange that performed the action. A good response provides details that have been designed and validated to inform the agent's next step, across the entire range of possible tool outcomes.
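A minimal sketch of what such a response might look like for the rename scenario above, with every field name a hypothetical choice rather than a real tool's schema:

```typescript
// Hypothetical response from a rename tool, answering the questions above
// in the same exchange that performed the action.
const renameResponse = {
  status: "staged",              // held for review, not yet written to disk
  matchesFound: 14,
  matchesReplaced: 12,
  matchesSkipped: [              // occurrences intentionally left alone
    { line: 88, reason: "inside a string literal" },
    { line: 203, reason: "different symbol with the same name" },
  ],
  linesChanged: 12,
  regionsOutsideRequestScope: 0, // nothing unrelated was touched
  nextSteps: ["inspect the staged diff", "save or cancel the staged change"],
  recovery: "cancel discards the staged change without touching the file on disk",
};
```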
Tool-response engineering asks, for every tool the agent might call: What information should be returned to the agent to boost the likelihood that the agent makes its next decision correctly? The best answer goes far beyond evaluating whether the tool itself executes correctly, or whether it properly returns a success message when the tool execution succeeded and a failure or error message to the agent if not. The best tool responses are designed to provide whatever the agent needs to know before it acts again, and can handle the full array of possible scenarios when that tool response lands in the agent's context.
Why the post-prompt part matters
Most prompt-engineering advice focuses on the beginning of the workflow: Define the role, explain the goal, provide constraints, add examples, specify the output format, ask the model to verify its work.
That is useful. But in agentic coding, the most important decisions usually happen after the initial instruction.
Diagram: The agent loop. Tool-response engineering is the design of the highlighted node — the tool response that feeds the agent's next decision. This edge runs dozens or hundreds of times in a long workflow.
- The agent reads a file. What context does the read tool return?
- The agent searches the repository. Does the result explain what was omitted, truncated, or de-ranked?
- The agent edits a file. Does the edit tool write directly to disk, or stage the change for inspection?
- The agent hits an error. Does the tool return a generic failure, or explain the likely cause and point to a repair path?
- The agent changes forty files. Does the tool warn about blast radius before saving?
- The agent partially succeeds. Can it refine the failed operation without discarding everything else?
These problems cannot be solved by prompt engineering, because they must be answered between the user's prompt and the agent's final response. Answering them requires thoughtful design of the agent's runtime feedback surfaces and careful consideration of all the ways things can go wrong.
In a long-running workflow, the prompt is the opening move, and the last time the human user intervenes (other than possibly canceling midstream) before the workflow ends. The tool responses are the continuing conversation between the environment and the agent.
A sparse response is not merely unhelpful. It is dangerous, because it creates a false sense of completion. { "success": true, "message": "File updated." } may be technically accurate, yet it hides whether the file update was small or large, whether unrelated sections were touched, whether the change is reviewable, or whether the next step should be inspection rather than another edit.
Why coding agents make the issue obvious
Tool-response engineering matters in all kinds of agentic systems, but coding agents make the problem especially visible.
Coding workflows involve stateful artifacts. Files can be partially modified, syntactically valid but semantically wrong, or correct but too noisy to review. A change may "work" while still producing a diff far larger than the user intended.
The most obvious failure mode is file corruption: The agent breaks syntax, deletes the wrong block, mangles formatting, or leaves the file inconsistent. This mode is annoying when it happens, but at least the problem is obvious: Unit tests now fail, the compiler and linter are complaining, and the file clearly looks wrong.
A more challenging and arguably more important failure mode to handle is subtler: Diff bloat. The agent makes a small intended change, but the resulting diff touches too much surrounding content. It rewrites nearby lines, reflows formatting, alters unrelated sections, or changes more of the file than the task required.
The final code may run, and the rewritten code might be correct. But the human reviewer is left with a wall of text, forced to determine not only whether the requested change was made, but whether anything unrelated was altered by accident. The cost is an expanded review surface, multiplied across every AI-assisted PR, every day.
A good coding agent should not merely produce working code. It should produce minimal, reviewable, well-scoped changes. Getting such optimizations from coding agents requires far more than a better prompt: Agents need midstream feedback from their tool responses about scope, state, risk, and reviewability, while the workflow is still in progress, and without requiring human intervention.
Writing tools for coding agents also underscores another distinctive feature of response engineering that may feel alien: A tool response is good if it helps the agent, even if the same response would not help a human. Very frequently, content that a human might find duplicative or distracting is precisely what the agent needs in order to continue along the correct path. Humans remember why they did something; agents do not. Sight gives (most) humans the ability to perceive changes without reading a file linearly from start to finish, whereas coding agents do not see in any human sense of the word. Humans recall prior mistakes and gain experience, but every turn and tool call is a brand-new day for AI coding assistants. Prompts only go so far under these constraints. AI assistants need guidance that accommodates their distinctive limitations: Guidance embedded in tool responses can mitigate these weaknesses; under-engineered tool responses can exacerbate them.
Prompt engineering sets intent; tool-response engineering governs execution
A prompt can say:
Make small, surgical edits. Do not rewrite unrelated code. Review your changes before finishing.
That is helpful.
But a tool response can say:
You have staged a 12-line localized edit in one file. Nothing has been written to disk. Inspect the staged result before saving.
Or:
This operation would touch 240 lines across 6 files. That exceeds the expected blast radius for the requested change. Consider narrowing the edit and review carefully before saving.
Or:
Four operations succeeded and one failed. The successful operations remain staged. Refine the failed operation instead of restarting the batch.
Those are different kinds of guidance.
The prompt expresses a general norm. The tool response applies that norm to the current state.
That is the core distinction. Prompts tell the agent what good behavior looks like. Tool responses tell the agent what good behavior requires right now.
Relationship to context engineering and tool design
Tool-response engineering is not a rejection of prompt engineering, context engineering, or tool design. It sits alongside them.
| Discipline | What it designs |
|---|---|
| Prompt engineering | The instructions, examples, constraints, and reasoning patterns supplied before or during the task. |
| Context engineering | The information available to the model: Retrieved documents, file snippets, memory, workspace state, tool definitions, prior messages, and prior tool results. |
| Tool engineering | The action surface (what the agent can do): Tool names, descriptions, permissions, input schemas, output schemas, and availability. |
| Tool-response engineering | The in-flight guidance surface (what the agent should consider during a workflow before making its next decision): Structured results, previews, warnings, progress updates, recovery hints, next-step suggestions, staged states, and human-review affordances. |
These categories overlap. Tool responses are part of context. Tool schemas are part of tool design. Tool-result formatting and content are influenced by prompt engineering.
But the distinction is useful because tool responses recur throughout the agent's execution loop, and are different from the question of how the tool works. Responses, built properly, are repeated steering events.
Every tool call provides the tool-response engineer with an opportunity to help the agent recover and stay oriented.
Practical principles for tool builders
The principles that make tool-response engineering successful are not specific to any one tool category. They apply as much to a database query tool as to a file-editing tool, and as much to a browser-automation tool as to a system shell or a cloud SDK. The underlying question is always the same: After this tool finishes, what does the agent need to know to decide the next step safely and successfully?
We offer the following suggestions when building tool responses in your own workflows.
Return state, not just status. A SQL write that reports { "success": true } tells the agent less than one that reports how many rows were affected, whether the change is inside an uncommitted transaction, and whether any constraint warnings were raised. A browser tool that reports "clicked" tells the agent less than one that reports which element was actually clicked, what the page state became after the click, and whether the click triggered an unexpected navigation. The word success is rarely enough on its own. Status is necessary but frequently insufficient to inform next steps.
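As a rough sketch, assuming a hypothetical SQL write tool, the contrast might look like this; the richer object is illustrative, not any real adapter's schema:

```typescript
// Status-only result versus a state-rich result for a hypothetical SQL write tool.
const sparse = { success: true }; // status only: the agent learns almost nothing

const stateful = {
  success: true,
  rowsAffected: 37,
  inUncommittedTransaction: true, // nothing is durable yet
  constraintWarnings: ["fk_orders_customer deferred until commit"],
  nextSteps: ["verify the affected rows", "COMMIT or ROLLBACK"],
};
```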
Make next actions explicit, especially when they are not obvious. If the agent is about to drift toward a dangerous continuation (committing before inspection, retrying a query that will compound the error, following a redirect loop), the response is the last chance to point out the safer path. Don't assume that human intervention is a reliable backstop against incorrect continuations. Identify the plausible actions that might follow a given tool call, and consider how the shape and content of the response can help the agent understand its options and remind it what to consider in that moment.
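One hedged sketch of a response that names the safer path after a failed query; the fields and wording are invented for illustration:

```typescript
// Hypothetical response that steers the agent away from a compounding retry.
const failedQueryResponse = {
  status: "error",
  error: "Deadlock detected while updating 'orders'.",
  doNotRetryYet: true,
  reason: "Immediate retries keep hitting the same deadlock.",
  saferPath: [
    "inspect currently held locks",
    "narrow the update to a smaller key range",
    "retry after the conflicting transaction completes",
  ],
};
```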
Quantify scope. Tool responses should tell the agent how much state changed and, ideally, should summarize the scope of the affected content. Whether the measure of change is rows returned, files edited, requests sent, tokens consumed, dollars billed, DOM nodes modified, or something else, scope quantities give the agent a critical shorthand for judging whether the result is the right size and shape.
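A small sketch, assuming a hypothetical batch cloud-SDK tool, of what scope quantities might look like in practice; the fields are illustrative:

```typescript
// Hypothetical scope summary for a batch cloud-SDK operation.
// The point is the quantities, not these particular fields.
const scopeSummary = {
  requestsSent: 18,
  resourcesModified: 6,
  resourcesSkipped: 2,
  estimatedCostUsd: 0.42,
  regionsAffected: ["us-east-1"],
};
```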
Preserve recoverable work. Failure is not always total. If four of five operations succeed, the response should say so and tell the agent how to refine the failing one without completely discarding what already worked. A tool that throws everything away on partial failure pushes agents into "start over" loops that burn tokens and amplify errors.
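A minimal sketch of a partial-success response for a hypothetical batch edit tool, with invented field names:

```typescript
// Hypothetical partial-success response that preserves recoverable work.
const batchResult = {
  status: "partial",
  succeeded: 4,
  failed: 1,
  staged: true, // the four successful edits remain staged
  failedOperation: {
    index: 3,
    error: "Pattern matched zero occurrences in src/util.ts.",
    howToRefine: "Adjust the pattern and resubmit only operation 3.",
  },
};
```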
Distinguish audiences. Some information aids the agent's reasoning. Some information is most helpful to the human reviewer who must approve the work. Well-designed tool responses treat those audiences distinctly. The primary audience of a tool response is the AI agent, who needs it in real time to decide on the next step in an ongoing workflow. Response designs that assume a human reviewer can and will intervene in the moment often overlook the reality of how tool responses are consumed. The best tool-response designs will materially improve outcomes even if a human user never reviews them.
Use structured tool outputs that vary with context. JSON is not the point, but structure is. Structured responses compound well; prose responses drift. Consider breaking up generic fields like "message" with conditional logic that emits special fields, such as "warning" or "suggestion", only when circumstances call for them. Varying the structure of the tool response takes more effort, but the rewards are often significant: AI models are excellent at spotting patterns, and a change in shape helps redirect their attention to the details that matter.
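A minimal sketch of that conditional logic, with invented thresholds and field names:

```typescript
// Extra fields appear only when they matter, so the agent notices the
// deviation from the usual response shape. Thresholds are assumptions.
interface EditStats { filesTouched: number; linesChanged: number; }

function buildEditResponse(stats: EditStats) {
  const response: Record<string, unknown> = {
    status: "staged",
    filesTouched: stats.filesTouched,
    linesChanged: stats.linesChanged,
  };
  if (stats.linesChanged > 100 || stats.filesTouched > 3) {
    response.warning =
      "Blast radius is larger than typical for a single request.";
    response.suggestion =
      "Inspect the staged diff and consider narrowing the edit before saving.";
  }
  return response;
}
```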
Avoid flooding context. Verbose responses are not better responses. It can be tempting to return a comprehensive response that includes every detail that might be useful across the entire spectrum of possibilities, so as to avoid the complexity of adjusting the response schema dynamically, but the verbose strategy usually backfires. Every field returned costs context budget and reading time. When a response must be long, consider chunking, paginating, or summarizing. The principle of progressive disclosure plays a major role in guiding good tool-response engineering. At a minimum, agents should get fair warning before a massive output is returned, so they can evaluate whether a more surgical approach is warranted.
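One hedged sketch of progressive disclosure for a hypothetical search tool, with illustrative fields:

```typescript
// Warn first, return a manageable slice, and tell the agent how to get more.
const searchResponse = {
  totalMatches: 4180,
  returned: 50,
  truncated: true,
  note: "Only the first 50 matches are included to protect the context budget.",
  howToContinue: "Narrow the query, or request the next page with offset=50.",
};
```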
Emphasize end-to-end validation and real-world testing. It can be tempting to rely on comprehensive unit tests to validate that a new tool behaves as expected, but automated tests, though necessary, are insufficient to evaluate whether tool-response engineering is actually promoting the intended outcomes. Focus on observing how an agent responds to the tool's responses across repeated runs and realistic scenarios, and compare conditions, such as whether a particular tool is available or whether changing the content or structure of the response alters the results. If you are analyzing results rigorously in benchmark studies, consider a within-subjects, paired design to evaluate whether the tool responses really are providing the expected benefits.
Consider agent UX. We do not want this final point to be taken as an endorsement of the view that AI agents are sentient beings with feelings, goals, intentions, or morals. But when it comes to tool-response engineering, much can be gained by assuming that agents need a good experience in order to respond optimally to the results of a tool call. If the agent struggles to respond in the desired manner after calling a particular tool, even though the tool performed as expected (making the change, retrieving the data, or whatever else it was designed to do), the problem is likely an agent UX issue. Traditional UX (user experience) focuses not merely on what an interface such as a web app does in response to user actions like hovering or clicking, but on providing users with meaningful and intuitive experiences. Agent UX focuses not merely on what a tool interface allows the agent to do, but also on how the tool response guides the rest of the workflow.
These principles are domain-agnostic. This essay grew out of work we did over the past year building HIC Mouse, a precision file-editing toolkit for AI coding agents. Taking inspiration from front-end UX design principles, our response-engineering strategy for HIC Mouse included a Dialog Box pattern that lets agents review staged changes (inspect, refine, save, or cancel) and embedded guidance mechanisms in tool responses. However, the same principles and best practices apply anywhere an agent calls a tool: Database adapters, browser drivers, API clients, cloud SDKs, shell wrappers, payment systems, observability agents, and categories of AI tools that do not yet exist.
Conclusion
The AI community is focused on each new model: How it performs on benchmarks, how much it costs, how fast it is, and whether it represents a transformative change in the landscape. New techniques in prompt engineering attract similarly widespread attention. Developers seem to recognize, however, that the harness plays as significant a role as the model, if not more so, in whether the agent completes its workflow successfully. What I think has gone under-explored in these discussions is everything in tool engineering beyond the prompt. Enormous effort goes into honing the system instructions, the context, the tool instruction metadata, and the user's natural-language request itself, but little public discussion has addressed the responses generated by an AI agent's tools. Tool-response engineering is arguably the least explored frontier in AI engineering, and one of the most interesting levers of performance. I hope the principles described in this article are helpful to anyone building AI tools, or to anyone who simply wants to make AI workflows more likely to succeed, safer, more secure, and more reliable.
A model can only reason over what it is given. In the middle of an agent workflow, most of what the model is handed between the opening prompt and its final response is tool output. Every tool call is a decision point, and every tool response is the input to the next decision.
Prompt engineering is the opening move. It sets intent, defines the work, shapes the plan, and describes the expectations of the user. Prompts will continue to matter.
But once the agent starts acting, the leverage shifts. What the tool gives back, whether structured or unstructured, stateful or stateless, auditable or opaque, defines the most important content that the agent will consider when deciding what to do next. A thoughtful response object can turn a tool call from a one-shot event into a workflow stage: Inspect, refine, save, or cancel, for instance. An under-engineered response can cause failure modes based on poor, seemingly inexplicable decision-making at the precise moment when the agent needs context to react appropriately.
For anyone building AI tools, we hope you will agree that the response object is a uniquely important part of the tool's architecture, worthy of serious and dedicated effort. By treating tool responses as a first-class priority in building AI workflows, we retain influence over agent workflows long after a prompt is typed and Enter is pressed.