Technical Writing

Solving the Blind Spot in AI Testing with Evaliphy

09-04-2026

I've been building a tool called Evaliphy for a while now. Not because existing AI testing tools are lacking, but because I kept running into a gap that was too risky to leave unaddressed for my product.


LLM Testing != System Testing

Right now, if you look at the AI testing space, it's largely driven by AI developers and ML engineers. You'll find only a small number of QA engineers actively contributing. And to be fair - that's not a bad thing. For years, we've been advocating for shift-left testing. Move quality closer to development. Let developers own testing earlier in the lifecycle. So in many ways, this shift in AI systems feels like a natural evolution.

But I kept noticing something missing. The moment testing moved beyond the model or pipeline, things started to feel… incomplete.

We have solid test coverage at the model and pipeline level, but the moment you step outside the model or pipeline, that coverage falls away. Because real systems just don't begin at the LLM.

Before a request even reaches the model, there are multiple layers involved: authentication, request shaping, routing logic, retrieval orchestration, prompt construction, and so on.

And after the model responds, post-processing, formatting, filtering, API response handling, and many similar steps happen before the output reaches the end user.
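To make that request path concrete, here's a minimal sketch of the layers a chat request might pass through. Every function name here (authenticate, buildPrompt, callModel, postProcess) is hypothetical, and the model call is a stub:

```typescript
// Hypothetical layers around an LLM call. All names are illustrative;
// the model itself is stubbed so the flow is visible end-to-end.

type ChatRequest = { token: string; message: string };

function authenticate(req: ChatRequest): void {
  // Layer before the model: reject unauthenticated requests.
  if (req.token !== "valid-token") throw new Error("401 Unauthorized");
}

function buildPrompt(message: string, retrievedContext: string): string {
  // Layer before the model: prompt construction from retrieved context.
  return `Context: ${retrievedContext}\n\nUser: ${message}`;
}

function callModel(prompt: string): string {
  // Stub standing in for the real LLM inference call.
  return `Answer based on: ${prompt.split("\n")[0]}`;
}

function postProcess(raw: string): string {
  // Layer after the model: formatting/filtering before the user sees it.
  return raw.trim();
}

function handleChat(req: ChatRequest): string {
  authenticate(req);                                    // before the model
  const ctx = "Items can be returned within 30 days.";  // retrieval (stubbed)
  const prompt = buildPrompt(req.message, ctx);         // before the model
  const answer = callModel(prompt);                     // the model itself
  return postProcess(answer);                           // after the model
}

console.log(handleChat({ token: "valid-token", message: "What is the return policy?" }));
```

A bug or regression in any one of these layers changes what the user sees, even if the model itself is untouched.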

All of these steps influence what the user actually experiences. Imagine the LLM's accuracy is good enough to ship to production, but your APIs are slow and downstream systems introduce extra latency.

In that case, LLM testing alone is not the deciding factor for shipping the product.

You need another layer of testing that verifies the product works just as well when integrated with the other components. That's where the need for end-to-end testing arises.
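One way to see why this matters: measure the latency the user experiences, not just the model's. A toy sketch, where the "model" and a slow downstream step are both simulated with timers:

```typescript
// Toy latency comparison: the simulated model is fast, but a simulated
// downstream hop (formatting, extra API call) dominates the user's wait.

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function modelCall(): Promise<string> {
  await sleep(50); // pretend the LLM itself answers quickly
  return "accurate answer";
}

async function endToEnd(): Promise<{ answer: string; totalMs: number }> {
  const start = Date.now();
  const answer = await modelCall();
  await sleep(200); // simulated slow downstream system
  return { answer, totalMs: Date.now() - start };
}

async function main() {
  const { answer, totalMs } = await endToEnd();
  // A model-only eval sees a fast, accurate answer; an end-to-end check
  // sees the total latency the user actually pays.
  console.log(answer, `${totalMs}ms end-to-end`);
}

main();
```

A model-only evaluation would score this system well; only a check against the real endpoint surfaces the latency problem.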


The current AI testing tooling landscape and gaps

Today, we have strong, developer-friendly tools like Ragas, DeepEval, and Promptfoo. They're powerful. They're well-designed. And they fit naturally into a developer workflow. Most of these tools are built to live right next to your application code. Which means:

  • you import your pipeline components
  • you construct test cases programmatically
  • you evaluate behavior with full control over inputs and context

And that works really well… as long as you're testing from inside the system. For example:
  • DeepEval tests your pipeline internals
  • Ragas evaluates retrieval and generation quality within your setup
  • Promptfoo helps you iterate and compare prompts at the model level

They give you precision. Control. Debuggability. But they also assume something important: You have access to the internals of the system. That assumption quietly limits who can participate in testing - and what kind of testing gets prioritized.

Because once you move to an end-to-end perspective:

  • you may not have access to the retriever
  • you may not know how context is constructed
  • you may only have an API endpoint

And suddenly, the testing approach needs to change.

That's what Evaliphy does.

Instead of importing retrievers and constructing context, I did something simpler:

  • Make an HTTP request
  • Get the response
  • Validate the answer against the context

It doesn't need internal access. Just reality.
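Stripped of any framework, those three steps are just this. The endpoint here is stubbed in-process, and the word-overlap score is a deliberately crude stand-in for the LLM-judged metrics a real tool would use:

```typescript
// Black-box check with no framework: hit an endpoint, read the answer,
// validate it against the expected context. The endpoint is a stub, and
// overlapScore is a toy proxy for a proper faithfulness metric.

async function fakeEndpoint(_message: string): Promise<{ answer: string }> {
  // Stand-in for an HTTP POST to a real /api/chat endpoint.
  return { answer: "Items can be returned within 30 days of purchase." };
}

function overlapScore(answer: string, context: string): number {
  // Fraction of context words that appear in the answer.
  const words = (s: string) => new Set(s.toLowerCase().match(/[a-z0-9]+/g) ?? []);
  const a = words(answer);
  const c = words(context);
  let hits = 0;
  for (const w of c) if (a.has(w)) hits++;
  return c.size ? hits / c.size : 0;
}

async function run() {
  const res = await fakeEndpoint("What is the return policy?");   // 1. request
  const score = overlapScore(                                     // 3. validate
    res.answer,                                                   // 2. response
    "Items can be returned within 30 days."
  );
  console.log(score >= 0.7 ? "PASS" : "FAIL", score.toFixed(2));
}

run();
```

No retriever imports, no pipeline internals: the whole test surface is the HTTP boundary.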

Anatomy of an Evaliphy test

import { evaluate, expect } from 'evaliphy';

const sample = {
  query: "What is the return policy?",
  expectedContext: "Items can be returned within 30 days."
};

evaluate("Return Policy Chat", async ({ httpClient }) => {
  // 1. Hit your RAG endpoint
  const res = await httpClient.post('/api/chat', { message: sample.query });
  const data = await res.json();

  // 2. Assert in plain English
  await expect({
    query: sample.query,
    response: data.answer,
    context: sample.expectedContext
  }).toBeFaithful();

  await expect({
    query: sample.query,
    response: data.answer,
    context: sample.expectedContext
  }).toBeRelevant({ threshold: 0.7 });
});

Am I trying to replace tools like DeepEval, Ragas, or Promptfoo?

No. Evaliphy is not built to replace tools like DeepEval, Ragas, or Promptfoo. Those tools solve a different problem, and they solve it really well. They focus on:

  • evaluating model and pipeline behavior
  • giving developers deep control and visibility
  • helping iterate on prompts, retrieval, and generation

Evaliphy comes in at a different layer. It focuses on:

  • testing the system as a whole
  • validating what the user actually experiences
  • working without access to internals

If anything, they complement each other. You might use:

  • DeepEval or Ragas to improve your pipeline quality
  • Promptfoo to refine prompts
  • Evaliphy to validate that everything works correctly in production, end-to-end

So this isn't about replacing existing tools. It's about filling a gap that shows up once those tools have already done their job.