How Munchkin and Artera Ship AI Fast — Without Compromising Safety or Quality

See how both teams combine AI-powered testing, human oversight and clear KPIs to ship reliable systems in production.

Written by Taylor Rose
Published on Feb. 02, 2026
REVIEWED BY
Justine Sullivan | Feb 03, 2026

Every engineering team has a north star when it comes to AI and automation.  

For some, it’s a meticulous testing process before shipping. For others, it’s meeting customers’ needs with speed. Then there are the few that can balance both. 

“At Munchkin, speed and safety aren’t trade-offs — they’re designed to work together,” AI Solutions Manager Vrej Sanati said. “Our approach to releasing any new technology is built on a thoughtful blend of automation and human expertise, with a clear goal in mind.” 

For Artera, a healthcare SaaS company that creates AI-powered virtual agents to help patients, the true north is solving problems before they start — all with the help of AI. 

“We built an automated testing framework that runs simulated patient conversations against our healthcare AI agents before every deployment,” Senior Staff Software Engineer Anav Sanghvi said. 

Sanghvi explained that every AI agent has to pass a rigorous test, hitting a quality score of 95 percent or better. 

“The system uses an LLM to play the role of patients with different personas — appointment seekers, confused patients and prescription inquiries — then a separate LLM judge evaluates the agent’s performance.” 

Built In spoke with Sanati and Sanghvi in detail about how the two automation experts guide their teams to ship AI fast, safely and meaningfully.

 

Image of Vrej Sanati
Vrej Sanati
AI Solutions Manager

Munchkin is a children’s products manufacturer that has expanded to include a curated line of home goods inspired by curiosity.

 

What’s your rule for fast, safe releases — and what KPI proves it works?

At Munchkin, speed and safety aren’t trade-offs — they’re designed to work together. Our approach to releasing any new technology is built on a thoughtful blend of automation and human expertise, with a clear goal in mind: delivering a higher-quality, more reliable experience for our customers. As an example, we leverage AI to automate a significant portion of our website quality review to catch issues earlier and more consistently, before they ever reach users. This automation helps us proactively identify potential gaps in functionality, performance and usability — directly enhancing the customer experience.

But automation is only the first layer. Every release includes a human-in-the-loop review, where our QA partners and engineers validate results, review code and assess security. Rather than replacing human oversight, we use AI to strengthen it. Automation handles repetitive and time-intensive checks, allowing our teams to focus on higher-impact reviews that consider real-world usage, edge cases and customer impact.

This approach allows Munchkin to continue advancing AI automation responsibly, while keeping people — and the customer experience — at the center of everything.
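Munchkin did not share implementation details, but as a rough illustration of the "automation first, human review second" pattern Sanati describes, a sweep like the sketch below could run repetitive website checks and queue anything suspicious for QA. The URLs, checks and thresholds here are hypothetical.

    # Hypothetical sketch of an automated website quality sweep that queues
    # findings for human review; URLs, checks and thresholds are illustrative.
    import requests

    PAGES = [
        "https://www.example.com/",          # placeholder URLs, not a real sitemap
        "https://www.example.com/products",
    ]

    def check_page(url: str) -> list[str]:
        """Run basic automated checks; anything flagged goes to a human reviewer."""
        issues = []
        resp = requests.get(url, timeout=10)
        if resp.status_code >= 400:
            issues.append(f"HTTP {resp.status_code}")
        if resp.elapsed.total_seconds() > 2.0:
            issues.append(f"slow response: {resp.elapsed.total_seconds():.1f}s")
        if "<title>" not in resp.text.lower():
            issues.append("missing <title> tag")
        return issues

    if __name__ == "__main__":
        for url in PAGES:
            issues = check_page(url)
            if issues:
                # Automation surfaces the issue; QA partners and engineers make the call.
                print(f"[needs human review] {url}: {', '.join(issues)}")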

 

What standard or metric defines “quality” in your stack?

At Munchkin, quality is not an afterthought — it’s engineered into everything we build. We define quality as the intersection of reliability, security, usability and real-world adoption. As a technology-forward company, we rely on AI automation to raise the bar on how software is built and released. This automation strengthens our foundation, but it doesn’t replace human judgment. Every release is reviewed and validated by experienced engineers and QA partners to ensure accuracy, accountability and customer readiness.

 

“As a technology-forward company, we rely on AI automation to raise the bar on how software is built and released. This automation strengthens our foundation, but it doesn’t replace human judgment.”

 

Quality is equally measured through a business and user lens. Adoption and engagement are critical signals — if a solution ships but isn’t actively used, it doesn’t meet our standard. High usage, strong internal feedback and minimal post-release issues demonstrate that a product is not only technically sound, but thoughtfully designed for real-world needs. Ultimately, quality isn’t defined by a single metric or tool. It’s defined by our ability to combine advanced AI automation with human expertise to deliver secure, dependable systems that people trust, use and rely on every day.

 

Name one AI or automation that shipped recently and its impact on your team or the business.

One of the most impactful AI automations recently launched at Munchkin is an intelligent, always-on security monitoring system designed to keep pace with an increasingly complex cybersecurity landscape. The system continuously scans global security news, breach disclosures and vulnerability feeds, then uses AI to instantly cross-reference that intelligence against our real application and vendor ecosystem.
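The real system is internal, but the cross-referencing step Sanati describes could look roughly like the sketch below, which matches a public vulnerability feed against an application and vendor inventory. The inventory names are made up, and the NVD endpoint stands in for whatever feeds the actual system consumes.

    # Illustrative sketch: match a public vulnerability feed against an internal
    # application/vendor inventory. The inventory is hypothetical, and the NVD
    # API is used here only as an example source.
    import requests

    INVENTORY = ["acme-commerce", "examplepay", "shipfast"]  # hypothetical vendors/apps

    def fetch_recent_cves(keyword: str) -> list[dict]:
        resp = requests.get(
            "https://services.nvd.nist.gov/rest/json/cves/2.0",
            params={"keywordSearch": keyword, "resultsPerPage": 20},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json().get("vulnerabilities", [])

    if __name__ == "__main__":
        for vendor in INVENTORY:
            for item in fetch_recent_cves(vendor):
                cve = item["cve"]
                summary = cve["descriptions"][0]["value"][:120]
                # Matches are routed to the security team for prioritization,
                # not acted on automatically.
                print(f"[{vendor}] {cve['id']}: {summary}")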

The results have been transformative. The automation dramatically reduced manual effort, accelerated response times and moved our security posture from reactive to proactive. Potential risks are identified earlier, prioritized more intelligently and addressed with confidence before they escalate.

Today, this AI-powered system serves as a trusted internal intelligence layer across the organization — enabling faster awareness, smarter prioritization and more decisive action. It’s a clear example of how Munchkin uses AI not just to automate tasks, but to elevate how we protect our technology, our teams and our customers.

 


 

Image of Anav Sanghvi
Anav Sanghvi
Senior Staff Software Engineer

Artera, a SaaS leader in digital health, guides the patient experience with AI-powered virtual agents (voice and text) for every step of the patient journey. 

 

What’s your rule for fast, safe releases — and what KPI proves it works?

Our rule: No AI agent is deployed without passing automated LLM-as-judge evaluation with a quality score of 95 percent or greater. Before any agent release, our CI/CD pipeline runs simulated patient conversations through the agent using realistic personas — appointment seekers, confused patients and emergency callers. An LLM judge then evaluates each conversation across multiple dimensions: Was information accurate? Was the tone appropriate for healthcare? Were safety protocols followed?
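Artera's pipeline is proprietary, but a minimal sketch of that kind of LLM-as-judge gate, assuming a judge model on Amazon Bedrock that returns a JSON verdict, might look like the following; the prompt, criteria, model ID and threshold wiring are illustrative, not Artera's actual implementation.

    # Minimal sketch of an LLM-as-judge deployment gate (illustrative only).
    import json
    import boto3

    bedrock = boto3.client("bedrock-runtime")
    QUALITY_THRESHOLD = 0.95  # the 95 percent gate described above

    JUDGE_PROMPT = """You are grading a healthcare support conversation.
    Score it from 0.0 to 1.0 on accuracy, healthcare-appropriate tone,
    completeness and safety-protocol compliance.
    Return only JSON: {{"score": <float>, "reasons": "<short explanation>"}}

    Conversation transcript:
    {transcript}"""

    def judge(transcript: str, model_id: str = "anthropic.claude-3-5-sonnet-20240620-v1:0") -> dict:
        """Ask a separate judge model to score one conversation."""
        body = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": JUDGE_PROMPT.format(transcript=transcript)}],
        }
        resp = bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
        text = json.loads(resp["body"].read())["content"][0]["text"]
        return json.loads(text)

    def gate(transcripts: list[str]) -> bool:
        """Block the release when the mean judge score falls below the threshold."""
        scores = [judge(t)["score"] for t in transcripts]
        mean = sum(scores) / len(scores)
        print(f"mean judge score: {mean:.3f} -> {'PASS' if mean >= QUALITY_THRESHOLD else 'BLOCK'}")
        return mean >= QUALITY_THRESHOLD  # CI fails the pull request when this is False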

The system runs headlessly in GitHub Actions on every pull request touching agent code. Patient simulation uses Bedrock Claude to generate realistic multi-turn conversations, while a separate LLM instance evaluates the agent’s responses against configurable criteria. Config-driven seeding ensures clean, reproducible test environments. This catches regressions before they reach patients. An agent that starts giving unclear medication instructions or fails to escalate emergencies gets blocked at the pull request level — not discovered in production.
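A rough sketch of the persona-driven simulation side, again assuming Bedrock Claude as the simulator: the personas, the model ID and the call_agent() stand-in below are placeholders for Artera's internal pieces, not its real code.

    # Sketch of persona-driven patient simulation (placeholders throughout).
    import json
    import boto3

    bedrock = boto3.client("bedrock-runtime")
    PERSONAS = {
        "appointment_seeker": "You want to book a check-up next week.",
        "confused_patient": "You are unsure which clinic your referral was sent to.",
        "prescription_inquiry": "You are asking whether your refill was approved.",
    }

    def simulate_patient_turn(history: list[dict], persona: str) -> str:
        """Ask the simulator model for the patient's next message."""
        body = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 256,
            "system": f"Role-play a patient contacting a clinic. {PERSONAS[persona]} Reply with one short message.",
            "messages": history,
        }
        resp = bedrock.invoke_model(modelId="anthropic.claude-3-haiku-20240307-v1:0",
                                    body=json.dumps(body))
        return json.loads(resp["body"].read())["content"][0]["text"]

    def call_agent(message: str) -> str:
        # Stand-in for the agent under test (in practice a WebSocket or API call).
        return "Thanks for reaching out. Can you confirm your date of birth so I can verify your identity?"

    def run_simulation(persona: str, turns: int = 3) -> list[dict]:
        transcript = []
        # From the simulator's point of view, the agent's replies are "user" turns
        # and the simulated patient's lines are "assistant" turns.
        history = [{"role": "user", "content": "Start the conversation."}]
        for _ in range(turns):
            patient_msg = simulate_patient_turn(history, persona)
            agent_reply = call_agent(patient_msg)
            transcript.append({"patient": patient_msg, "agent": agent_reply})
            history += [{"role": "assistant", "content": patient_msg},
                        {"role": "user", "content": agent_reply}]
        return transcript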

KPIs include an LLM-as-judge score of 95 percent or greater, which is our deployment gate; time-to-first-token; tool use accuracy of 100 percent — agents must call the right APIs; and a safety check pass rate of 100 percent.
 

What metrics define “quality” in Artera’s stack?

The 4 Evaluation Types That Measure the Quality of Artera’s AI Agents

  • LLM-as-Judge Evaluation (95 percent or greater required) — An LLM evaluates conversation quality against specific criteria: accuracy of information, appropriate healthcare tone, completeness of answers and safety compliance. This catches subtle issues, like technically correct but awkwardly phrased responses, that traditional tests miss.
  • Tool Use Accuracy (100 percent required) — Agents must invoke the correct tools in the correct sequence. If a patient asks to reschedule, the agent must call “get_appointment_slots” before “reschedule_appointment.” A wrong tool selection equals an automatic failure. (A minimal ordering check is sketched after this list.)
  • System Prompt Adherence — Agents must follow their configured workflows: proper patient verification before accessing records, handoff protocols for out-of-scope requests and HIPAA-compliant information handling.
  • Performance Metrics — Time-to-first-token, average response time and total conversation time, tracked per test suite.
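As referenced in the tool use bullet above, an ordered-sequence check like this minimal sketch can catch a wrong or out-of-order tool call; how the tool-call trace is captured from the agent is left abstract here.

    # Minimal sketch of a tool-use accuracy check: verify the expected tools
    # were called in the expected order. Tool names mirror the example above.
    def tool_sequence_correct(called: list[str], expected: list[str]) -> bool:
        """True only if every expected tool appears in `called`, in order."""
        it = iter(called)
        return all(name in it for name in expected)

    # A reschedule request must fetch slots before rescheduling.
    expected = ["get_appointment_slots", "reschedule_appointment"]

    assert tool_sequence_correct(
        ["verify_patient", "get_appointment_slots", "reschedule_appointment"], expected)
    assert not tool_sequence_correct(
        ["reschedule_appointment", "get_appointment_slots"], expected)  # wrong order: automatic failure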

For voice agents, we’re implementing Speech-to-Speech (S2S) LLM-as-judge evaluation — evaluating the actual audio output, not just text transcripts.

 

Name one AI or automation that shipped recently and its impact on your team or the business.

One was the AI agent CI/CD benchmarking system with LLM-as-judge evaluation. We built an automated testing framework that runs simulated patient conversations against our healthcare AI agents before every deployment. The system uses an LLM to play the role of patients with different personas — appointment seekers, confused patients and prescription inquiries — then a separate LLM judge evaluates the agent’s performance against a quality threshold of 95 percent or greater.

 

How the AI agent CI/CD benchmarking system works

  • Patient simulator (Bedrock Claude) generates realistic multi-turn healthcare conversations
  • Agent test runner orchestrates WebSocket conversations with sub-second latency tracking (a minimal sketch follows this list)
  • LLM judge evaluates responses against criteria: accuracy, tone, safety and workflow compliance
  • For voice agents: S2S-based evaluation validates actual speech output quality
  • Results posted as PR comments with pass/fail status
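For the latency-tracking piece, one test-runner turn over WebSocket could be sketched as follows, using the websockets library; the endpoint URL, message format and "[DONE]" end-of-response marker are assumptions, not Artera's actual protocol.

    # Sketch of a single test-runner turn with time-to-first-token tracking.
    import asyncio
    import time
    import websockets

    async def run_turn(uri: str, patient_msg: str) -> dict:
        async with websockets.connect(uri) as ws:
            start = time.perf_counter()
            await ws.send(patient_msg)

            first_token_latency, chunks = None, []
            while True:
                chunk = await ws.recv()
                if first_token_latency is None:
                    first_token_latency = time.perf_counter() - start  # time to first token
                if chunk == "[DONE]":  # assumed end-of-response marker
                    break
                chunks.append(chunk)

            return {
                "reply": "".join(chunks),
                "time_to_first_token_s": round(first_token_latency, 3),
                "total_time_s": round(time.perf_counter() - start, 3),
            }

    # Example usage (endpoint is hypothetical):
    # result = asyncio.run(run_turn("wss://agent.example.internal/ws",
    #                               "I need to reschedule my appointment."))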

 

The team impact: engineers now validate agent changes in CI before merging. Previously, testing conversational AI required manual QA calls, which were time-consuming and inconsistent. Now we have 35-plus automated test suites covering patient verification, scheduling, rescheduling and cancellation across multiple EHR systems.

 

Responses have been edited for length and clarity. Images provided by Shutterstock or listed companies.