Why We Built Our Own Workflow Engine

Most teams that need a compliance workflow engine start by evaluating established engines such as Camunda, Temporal, or Apache Airflow. We did too. Then we ran into regulatory requirements that made customizing an existing engine more expensive than building our own:

  • Every state transition must be auditable with 21 CFR Part 11 compliant records — not just logged, but cryptographically hashed and attributable.
  • E-signature gates must be hard blocks, not optional steps that an administrator can disable.
  • Tenant isolation must extend to workflow definitions — Tenant A’s workflow rules cannot affect Tenant B’s instances.
  • AI agents need a “preview” mechanism to evaluate what will happen before committing to a transition.

This is a trade-off we should be honest about: building in-house gives us more control over regulatory-specific features, but it also means more maintenance burden for a small team. Off-the-shelf engines have larger communities, better documentation, and more battle-tested edge cases. Our bet is that the regulatory specificity justifies the investment.

Architecture: State Machines with Compliance Gates

The V5 engine is built on a state machine architecture with three state types:

  • START — initial state when a workflow instance is created
  • INTERMEDIATE — states between start and end, with subtypes: DECISION (branching logic), FORK (parallel branches), LOOP (iterative processing), WAIT (external event), COMPENSATE (rollback on failure)
  • END — terminal states (completed, cancelled, rejected)

Transitions between states have three components:

  1. Conditions — rules that must be satisfied (field values, expressions, AI classification results)
  2. Actions — operations that execute during the transition (notifications, data updates, trigger invocations)
  3. Gates — hard blocks that prevent the transition until a human acts (e-signature approval, manual review confirmation)
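
To make the three transition components concrete, here is a minimal Python sketch. It is not the V5 engine's code: the class names, the context dictionary, and the evaluation order (conditions, then gates, then actions) are all assumptions made for illustration. The one property it tries to capture from the description above is that a gate cannot be skipped; the transition either clears it or is rejected.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical context passed to conditions, gates, and actions.
Context = Dict[str, object]


class TransitionBlocked(Exception):
    """Raised when a condition is unmet or a gate has not been cleared."""


@dataclass
class Gate:
    name: str
    # Returns True once the required human action (e.g. an e-signature) exists.
    is_cleared: Callable[[Context], bool]


@dataclass
class Transition:
    from_state: str
    to_state: str
    conditions: List[Callable[[Context], bool]] = field(default_factory=list)
    gates: List[Gate] = field(default_factory=list)
    actions: List[Callable[[Context], None]] = field(default_factory=list)


def advance(current_state: str, transition: Transition, ctx: Context) -> str:
    """Return the new state, or raise TransitionBlocked without side effects."""
    if current_state != transition.from_state:
        raise TransitionBlocked(
            f"instance is in {current_state}, not {transition.from_state}"
        )
    if not all(condition(ctx) for condition in transition.conditions):
        raise TransitionBlocked("unmet condition")
    for gate in transition.gates:
        if not gate.is_cleared(ctx):
            raise TransitionBlocked(f"gate not cleared: {gate.name}")
    # Actions run only after every condition and gate has passed.
    for action in transition.actions:
        action(ctx)
    return transition.to_state


# Usage: an e-signature gate blocks the advance until a signer is recorded.
notifications: List[str] = []
to_review = Transition(
    from_state="DRAFT",
    to_state="READY_FOR_REVIEW",
    conditions=[lambda c: c.get("documents_complete") is True],
    gates=[Gate("e-signature", lambda c: c.get("signed_by") is not None)],
    actions=[lambda c: notifications.append("reviewer")],
)

try:
    advance("DRAFT", to_review, {"documents_complete": True, "signed_by": None})
    blocked_reason = None
except TransitionBlocked as exc:
    blocked_reason = str(exc)

new_state = advance(
    "DRAFT", to_review, {"documents_complete": True, "signed_by": "j.doe"}
)
```

In this sketch the blocked attempt runs no actions, so a rejected transition leaves no partial side effects behind.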

The 50+ Endpoint API Surface

The workflow service exposes endpoints across six categories:

Instance Lifecycle

  • POST /instance/start — create a new workflow instance for an entity
  • POST /instance/advance — execute a specific transition by ID
  • POST /instance/advanceToState — advance to a target state (server resolves the transition)
  • POST /instance/upsertAndAdvance — the unified entry point: find or create an instance, then advance. This is what most callers use.
  • POST /instance/suspend, resume, cancel, retry — lifecycle management

Task Management

  • POST /task/list — list tasks filtered by assignee, team, status, tenant
  • POST /task/{id}/complete, delegate, escalate, clarify, comment

Schedule, Dead Letter Queue, Webhooks, Email Actions

The remaining four categories cover scheduling (cron triggers), a dead letter queue for failed instances (list, retry, resolve, discard), webhook registration for event-driven integrations, and tokenized email action links that let users complete tasks directly from an email.

The preCheck Pattern: Look Before You Leap

This is the feature we think matters most for AI agents. When an AI agent wants to advance a workflow, it can call POST /instance/preCheck first:

  • Input: instance ID, target transition or state
  • Output: list of warnings (severity: WARN or BLOCK), whether the transition is blocked, and what conditions are unmet

A WARN-severity result means the transition will succeed but conditions aren’t ideal. A BLOCK-severity result means the transition will be rejected. The AI agent can read the block reasons and explain them to the human before attempting the advance.

This matters because AI agents shouldn’t fail silently. A human clicking a button and getting an error message can read the screen. An AI agent getting a 400 response needs structured information about why the action failed and what the human needs to do to unblock it.
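
As an illustration of how an agent might consume a preCheck result, the sketch below parses a response body and decides whether to proceed or explain the blockers. The JSON field names (`blocked`, `warnings`, `severity`, `message`, `unmetConditions`) are assumptions; the article only states that the output contains warnings with a WARN or BLOCK severity, a blocked flag, and the unmet conditions.

```python
import json
from dataclasses import dataclass


@dataclass
class PreCheckDecision:
    proceed: bool
    message_for_human: str


def interpret_precheck(response_body: str) -> PreCheckDecision:
    """Turn a (hypothetical) preCheck JSON body into a go/no-go decision."""
    result = json.loads(response_body)
    warnings = result.get("warnings", [])
    blocks = [w["message"] for w in warnings if w["severity"] == "BLOCK"]
    warns = [w["message"] for w in warnings if w["severity"] == "WARN"]
    unmet = result.get("unmetConditions", [])

    if result.get("blocked") or blocks:
        reasons = blocks + [f"unmet condition: {c}" for c in unmet]
        return PreCheckDecision(False, "Blocked: " + "; ".join(reasons))
    if warns:
        return PreCheckDecision(
            True, "Can proceed with warnings: " + "; ".join(warns)
        )
    return PreCheckDecision(True, "No blockers found.")


# Usage with a made-up response body.
sample = json.dumps({
    "blocked": True,
    "warnings": [{"severity": "BLOCK", "message": "e-signature required"}],
    "unmetConditions": ["reviewer assigned"],
})
decision = interpret_precheck(sample)
```

Here `decision.proceed` is false, and `decision.message_for_human` carries the block reasons the agent can relay instead of attempting the advance and failing.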

Handling State Drift

In a system where both humans and AI agents interact with the same entities, state drift is inevitable. The entity’s own storage (e.g., SubmissionPlan.Status in GLB_DATA) and the workflow instance’s CURRENT_STATE_ID can disagree when:

  • A human changes status in the UI while an AI agent has a stale view
  • A previous failed advance left the instance at the wrong state
  • Legacy code updated the entity status without going through the workflow

The upsertAndAdvance endpoint handles this with an initialStateID parameter. If the caller passes the entity’s known state and it differs from the workflow instance’s current state, the engine reconciles — syncing the instance to match the entity, logging the drift, then proceeding with the advance.

The entity’s own storage is always the source of truth. The workflow instance tracks it, not the other way around.
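
The reconciliation rule can be sketched in a few lines of Python. This is an illustrative model, not the engine's implementation: the `WorkflowInstance` class, the logger name, and the function signatures are hypothetical, and the advance step is reduced to a plain assignment where the real engine would resolve and run a transition.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("workflow.drift")  # hypothetical logger name


@dataclass
class WorkflowInstance:
    instance_id: str
    current_state_id: str


def reconcile(instance: WorkflowInstance, initial_state_id: Optional[str]) -> bool:
    """Sync the instance to the entity's known state; return True on drift."""
    if initial_state_id is None or initial_state_id == instance.current_state_id:
        return False
    logger.warning(
        "state drift on %s: instance=%s entity=%s",
        instance.instance_id,
        instance.current_state_id,
        initial_state_id,
    )
    # The entity's storage wins: the instance is moved to match it.
    instance.current_state_id = initial_state_id
    return True


def upsert_and_advance(
    instance: WorkflowInstance,
    initial_state_id: Optional[str],
    target_state_id: str,
) -> bool:
    """Reconcile first, then advance. Returns whether drift was detected."""
    drifted = reconcile(instance, initial_state_id)
    # Placeholder for the real transition (conditions, gates, actions).
    instance.current_state_id = target_state_id
    return drifted


# Usage: the entity says IN_REVIEW while the instance still says DRAFT.
instance = WorkflowInstance("wf-1", "DRAFT")
drift_found = upsert_and_advance(instance, "IN_REVIEW", "APPROVED")
```

Reconciling before advancing means the transition is evaluated from the state the entity is actually in, rather than from a stale instance state.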

Trigger Types

Transitions can be triggered by:

  • Cron — time-based scheduling
  • Webhook — external event notification
  • File drop — document upload to a watched location
  • Inbound email — watched inbox processing
  • Workflow chain — completion of another workflow instance
  • SLA breach — deadline exceeded
  • Calendar — business calendar events
  • AI classification — AI model output triggers a transition

Idempotency and Failure Handling

The engine uses two layers of idempotency protection:

  • In-memory LRU cache — catches duplicate requests within a short window
  • Database-level check — catches duplicates across service restarts

Failed instances land in the dead letter queue rather than being silently dropped. Each entry captures the failure context: which transition failed, what error occurred, the state at failure time. Operators can retry, resolve, or discard from the DLQ.
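
The two idempotency layers could look roughly like the sketch below. It makes several assumptions: callers supply an idempotency key, SQLite stands in for the real database, the table name is hypothetical, and the cache is bounded by size rather than by the time window the engine may actually use.

```python
import sqlite3
from collections import OrderedDict


class IdempotencyGuard:
    """Size-bounded LRU cache in front of a database uniqueness check."""

    def __init__(self, conn: sqlite3.Connection, cache_size: int = 1024):
        self._conn = conn
        self._cache: "OrderedDict[str, None]" = OrderedDict()
        self._cache_size = cache_size
        conn.execute(
            "CREATE TABLE IF NOT EXISTS processed_requests "
            "(idempotency_key TEXT PRIMARY KEY)"
        )

    def _remember(self, key: str) -> None:
        self._cache[key] = None
        self._cache.move_to_end(key)
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)  # evict least recently used

    def first_time(self, key: str) -> bool:
        """Return True only the first time a key is seen."""
        # Layer 1: in-memory cache, cheap but lost on restart.
        if key in self._cache:
            self._cache.move_to_end(key)
            return False
        # Layer 2: the primary key constraint survives restarts.
        try:
            with self._conn:  # commits on success, rolls back on error
                self._conn.execute(
                    "INSERT INTO processed_requests (idempotency_key) VALUES (?)",
                    (key,),
                )
            is_new = True
        except sqlite3.IntegrityError:
            is_new = False
        self._remember(key)
        return is_new


# Usage: a second guard on the same database simulates a service restart.
conn = sqlite3.connect(":memory:")
guard = IdempotencyGuard(conn)
first = guard.first_time("advance-123")
repeat = guard.first_time("advance-123")
after_restart = IdempotencyGuard(conn).first_time("advance-123")
```

In this sketch the duplicate is caught by the cache on the second call and by the database after the simulated restart, when the cache is empty.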

What This Enables for AI Agents

An AI agent connected to DnXT via MCP can:

  1. Query — “What workflows are running for this dossier? What state are they in?”
  2. Preview — “If I try to advance this submission to ‘Ready for Review’, what will happen? Are there blockers?”
  3. Act — “Advance this submission to ‘Ready for Review’.” (If blocked by e-signature gate: “This requires a human signature. I’ll notify the reviewer.”)
  4. Monitor — “Are there any tasks assigned to the regulatory team that are overdue?”
  5. Recover — “What’s in the dead letter queue? Can I retry the failed classification?”

A compliance workflow engine for AI agents isn’t just about exposing endpoints. It’s about giving the agent enough information to make good decisions — including the decision to stop and ask a human.

This article was written by the DnXT Solutions team. We’ve aimed to present both the capabilities and trade-offs of our workflow architecture honestly. Questions are welcome at se******@***********ns.com.