Technical Writing

Why Traditional Test Automation Breaks for Agentic AI

26-02-2026

An agentic AI system is built around autonomy: it must be able to understand a goal, reason about it, plan, and execute a series of tasks to achieve that goal. That is exactly what an autonomous agentic system does – it reasons → plans → executes.

Now, when it comes to testing such systems, where non-determinism is deeply involved, the black-box testing approach we (QAs) have relied on so far is not sufficient on its own. Treating an agentic system as a black box and testing it in the traditional way is like proving that my application accepted a refund request but having no idea whether it used the correct bank details to process the refund.

If you look closely at how an agentic system works, this concern becomes obvious.

Let’s say Amazon builds an agent to help its customers. A customer comes in and says, "I don’t want to continue with my last purchase. Please cancel the delivery and refund the amount."

The agent performs all the intermediate steps and then confirms:
"Your order has been canceled. You will receive the refund within 7 days."

For a QA engineer, this looks simple, right?

Input: Cancel my last order and refund the amount.
Output: It’s done. You will receive the refund within 7 days.

Test passed.

But what if the agent didn’t capture the correct bank details and processed the refund to someone else’s account?

That’s the problem. A big problem.
This changes the entire approach to testing.

The UI/API testing that QAs have been doing for years is no longer enough to prove the correctness of an agentic system. With these systems, validating only the input and final output does not guarantee that the internal reasoning and execution were correct.

Now we need visibility into the intermediate paths taken by the agent to truly verify correctness.

To process a refund request, the agent might follow a trajectory like this:

user query
→ fetch user’s order details
→ fetch user’s bank account information
→ fetch promotion details
→ check delivery status
→ perform many additional validation checks
→ update the refund status in the system
→ confirm to the user

If any one of these intermediate steps is wrong, incomplete, or based on incorrect assumptions, the final confirmation message can still look perfectly fine.
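To make a trajectory like this testable, the agent first has to record it. A minimal sketch of such a recorder is below; the class, step names, and field names are assumptions for illustration, not a real framework API.

```python
from dataclasses import dataclass, field


@dataclass
class TrajectoryRecorder:
    """Collects the intermediate steps an agent takes, in order."""
    steps: list = field(default_factory=list)

    def record(self, step_name, **details):
        # Each intermediate action becomes a structured entry a test can inspect
        self.steps.append({"step": step_name, **details})


# Simulated refund flow, mirroring the trajectory above
recorder = TrajectoryRecorder()
recorder.record("fetch_order_details", order_id="A123")
recorder.record("fetch_bank_details", account_verified=True)
recorder.record("check_delivery_status", status="in_transit")
recorder.record("update_refund_status", status="initiated")

step_names = [s["step"] for s in recorder.steps]
print(step_names)
```

A trajectory test can then compare `step_names` against the expected sequence instead of trusting the final confirmation message.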

And that is exactly why black-box testing is not enough for agentic systems.

What does testing actually look like in an agentic world?

We learned that input vs. output is no longer a reliable sign of correctness. The intermediate steps (trajectory) are now more important. A test suite for an agentic system must include trajectory verification as well.

But the question is, how do we get access to the trajectories?

If we treat an agentic system purely as a black box and expect trajectories to be exposed as part of the final outcome, that is not an ideal system design. I doubt any team would expose the trajectories taken by their system in production. So for QAs, the only realistic option is to have them available through internal APIs or structured logs and use that information for testing.
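One way to make trajectories available internally is structured logging: the agent emits one JSON object per step, and tests parse the log back into a trajectory. This is a sketch under assumptions; the logger name and field names are illustrative, and in practice the stream would be a log file or log aggregator rather than an in-memory buffer.

```python
import io
import json
import logging

# A dedicated logger that writes JSON lines to a stream
# (an in-memory buffer here, a log file or collector in practice)
log_stream = io.StringIO()
logger = logging.getLogger("agent.trajectory")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.StreamHandler(log_stream))


def log_step(step, **fields):
    # One JSON object per line keeps the log trivially machine-parseable
    logger.info(json.dumps({"step": step, **fields}))


log_step("fetch_order_details", order_id="A123")
log_step("fetch_bank_details", account_match=True)

# A test reconstructs the trajectory from the structured log
trajectory = [json.loads(line)["step"] for line in log_stream.getvalue().splitlines()]
print(trajectory)
```

The key design choice is that trajectories never leave the system boundary: they exist only in internal logs that tests and observability tooling can read.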

But is it a good idea to reveal trajectories in API responses?

No. And it’s not just trajectories. There are many other elements that need to be tested, such as intermediate reasoning signals, token consumption, and the cost of each query. These cannot and should not be exposed externally. That means these tests should live within the same codebase where the agents are built. Trajectory tests can simply be functional tests owned and maintained by developers, with QA collaborating on defining the validation criteria.

The prompt for such an LLM-as-judge evaluation might look like this (query, response, and criteria are assumed to be defined in the test, with json imported):

evaluation_prompt = f"""
You are an expert evaluator for AI agents. Evaluate the following response on a scale of 1-5 for each metric.
Customer Query: {query}
Agent Response: {response}

Evaluate on these metrics (1=Poor, 2=Below Average, 3=Average, 4=Good, 5=Excellent):

1. HELPFULNESS: Does the response address the user's needs and provide useful information?
2. ACCURACY: Is the information provided factually correct and reliable?
3. CLARITY: Is the response clear, well-structured, and easy to understand?
4. PROFESSIONALISM: Does the response maintain appropriate tone and professionalism?
5. COMPLETENESS: Does the response fully address all aspects of the query?

Expected criteria: {json.dumps(criteria, indent=2)}

Respond with ONLY a JSON object in this format:
{{
    "helpfulness": <score>,
    "accuracy": <score>,
    "clarity": <score>,
    "professionalism": <score>,
    "completeness": <score>,
    "reasoning": "Brief explanation of scores"
}}
"""

[Test]
user_query = "cancel my last order and refund the amount"
user_jwt = get_jwt_for_logged_in_user("testuser@user.com")
response = agent.invoke(query=user_query, jwt=user_jwt)

# Deterministic check: the agent must have taken the expected trajectory
assert_equals(response.trajectories, [
    "fetch_user_details",
    "fetch_user_bank_details",
    "fetch_promotion_details",
    "check_delivery_status",
    "update_refund_request"
])

# Non-deterministic check: an LLM judge scores the final reply
llm_as_judge_scores = llm_as_judge_invoke(evaluation_prompt, user_query, response.reply)
assert_all_scores_at_least(llm_as_judge_scores, 3)

What Is the Role of Automation QA Engineers Now?

As discussed earlier, in the traditional testing world, QA engineers primarily treated the system as a black box and wrote end-to-end automated tests around it. But that is no longer an effective way to test an agentic system. That does not mean the QA role is redundant.

QAs should still focus on CLEAR: Cost, Latency, Efficacy, Assurance (safety and policy), and Reliability, all of which can be evaluated from the final system outcome while treating it as a black box.
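A few of the CLEAR dimensions can be checked directly against the final response envelope without any internal access. The sketch below assumes the system exposes latency and cost metadata alongside the reply; the field names and budget values are illustrative assumptions.

```python
# Assumed response envelope returned by the black-box system under test
response = {
    "reply": "Your order has been canceled. You will receive the refund within 7 days.",
    "latency_ms": 1800,
    "cost_usd": 0.004,
}

# Illustrative budgets a QA team might agree on
LATENCY_BUDGET_MS = 5000
COST_BUDGET_USD = 0.01

checks = {
    "latency_ok": response["latency_ms"] <= LATENCY_BUDGET_MS,   # Latency
    "cost_ok": response["cost_usd"] <= COST_BUDGET_USD,          # Cost
    "reply_non_empty": bool(response["reply"].strip()),          # Efficacy (minimal proxy)
}
print(checks)
```

Efficacy, assurance, and reliability need richer checks (LLM-as-judge scoring, policy filters, repeated runs), but even these simple budget assertions catch regressions that pure output comparison would miss.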

Beyond that, QA engineers must partner closely with developers to design evaluation frameworks, define invariants, and heavily contribute to building adversarial scenarios. We did this in traditional software teams as well, but now it becomes far more proactive and central to the system’s design.

Observability is another critical area where QA should invest time, especially in validating operational metrics.

In summary, testing is no longer a final validation step in the agentic world. It becomes part of the system’s architecture. QA engineers are not becoming less important; they are becoming more critical. Their role is evolving from writing external test scripts to designing quality guardrails for autonomous systems in partnership with developers.