We Built a Three-Layer AI Classification System for Regulatory Documents — Here’s What We Learned

If you’ve ever been part of a regulatory team, you know the drill: the frantic rush to assemble a submission, the mountains of documents, and the painstaking, often soul-crushing task of manually classifying each one into its correct eCTD module section. It’s tedious, it’s error-prone, and it’s a bottleneck that can bring an entire publishing workflow to a grinding halt. I’ve seen it firsthand, countless times, across organizations big and small.

When we started DnXT Solutions, we were determined to solve this problem. We looked at the market, hoping to find an off-the-shelf solution that truly understood the nuances of regulatory document structure, one that could handle the sheer volume and complexity. What we found was a lot of buzzwords and a lot of “AI-powered” claims, but very little that actually delivered on the promise of intelligent, reliable AI document classification for regulatory submissions.

So, we did what builders do: we decided to build it ourselves. What emerged from that journey wasn’t a single, monolithic AI model, but a pragmatic, three-layered system. This wasn’t some theoretical design cooked up in a lab; it evolved directly from the messy reality of regulatory operations, from countless hours spent looking at documents, understanding user workflows, and iterating on what worked and what didn’t. And what we learned along the way fundamentally reshaped our understanding of how AI can, and should, serve the life sciences industry.

The Problem We Set Out to Solve: The Manual Classification Bottleneck

Let’s be blunt: manual document classification is a huge drain on resources. Imagine a large Module 5 clinical study report package: potentially hundreds of individual documents – protocols, amendments, statistical analysis plans, patient narratives, case report forms. Each one needs to be placed precisely within the eCTD hierarchy – perhaps 5.3.5.1 for a controlled clinical study report, or 5.3.5.2 for an uncontrolled one. These aren’t just arbitrary folders; they’re mandated structures that impact how reviewers interact with the submission, how quickly they can find critical information, and ultimately, how efficiently a product moves through the approval process.

The consequences of misclassification are significant: delays, rejections, rework, and increased costs. Regulatory specialists, highly skilled professionals, spend hours on this administrative task, taking them away from more strategic work. It’s not just about speed; it’s about accuracy, auditability, and ensuring compliance. This isn’t just a “nice-to-have” problem; it’s a critical operational challenge that directly impacts product timelines and patient access.

Why Off-the-Shelf AI Missed the Mark for Regulatory

Before we committed to building, we rigorously evaluated existing “AI” solutions. The problem was clear: most general-purpose AI classification tools, even those marketed for enterprise document management, simply don’t grasp the intricate context of regulatory documents. They might do a decent job of identifying a “contract” or an “invoice,” but they fall flat when faced with a “Summary of Clinical Efficacy” (2.7.3) versus a “Summary of Clinical Safety” (2.7.4), or when differentiating a “Nonclinical Overview” (2.4) from the “Nonclinical Written and Tabulated Summaries” (2.6).

Here’s why they failed:

  • Lack of Domain Specificity: Regulatory documents have unique structures, terminology, and relationships that generic models aren’t trained on.
  • Context is King: It’s not just about keywords. A document might contain the word “protocol” but be a protocol deviation report, not the actual protocol itself. The surrounding context, the document type, and its relationship to other documents are crucial.
  • Auditability and Explainability: In a regulated environment, you can’t just have a black box spit out a classification. You need to understand *why* a decision was made. Most generic AI provides little to no explanation.
  • Proprietary Data: Training robust AI requires vast amounts of data. Regulatory documents are highly confidential, making it impossible to use public datasets or share internal documents for generic model training.

This gap in the market, coupled with the critical need, convinced us that a purpose-built solution for AI document classification in regulatory submissions was not just an opportunity, but a necessity.

Our Three-Layer AI Classification System: Built from the Trenches

The journey to our three-layer system was one of iterative refinement, driven by practical experience. We quickly realized that no single AI technique was a silver bullet. Some documents are straightforward; others are incredibly ambiguous. Our system evolved to tackle this spectrum of complexity.

Layer 1: Rule-Based Classification – The Unsung Hero

This is where we start, and frankly, it’s often overlooked in the rush to implement “advanced AI.” Rules are boring, predictable, and incredibly effective for a large percentage of documents. If a document’s filename clearly states “Protocol” and it’s tagged as a “clinical document,” you can, with very high confidence, classify it as a Module 5 document. If it’s a “Stability Report,” it’s likely Module 3.2.P.8.3.

This layer leverages:

  • Filename patterns: Regular expressions to match common naming conventions.
  • Metadata: Information from upstream systems (e.g., document type in a DMS, author, project ID).
  • Simple keyword matching: For highly specific, unambiguous terms.

In our experience, this layer reliably handles about 60-70% of documents with near-perfect accuracy. It’s fast, auditable (you can see the rule that triggered the classification), and forms a solid foundation. It might not be glamorous, but it drastically reduces the volume of documents that need more complex processing.
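To make the idea concrete, here is a minimal sketch of what such a rule layer might look like. The patterns, metadata fields, and section mappings are illustrative only, not our production rule set; the point is that every decision carries the rule that produced it, which is what makes this layer auditable.

```python
import re

# Illustrative rules: (filename regex, required metadata doc_type or None,
# eCTD section). Ordered; the first match wins.
RULES = [
    (re.compile(r"protocol", re.IGNORECASE), "clinical", "5.3.5.1"),
    (re.compile(r"stability[ _-]?report", re.IGNORECASE), None, "3.2.P.8.3"),
    (re.compile(r"statistical[ _-]?analysis[ _-]?plan", re.IGNORECASE), None, "5.3.5.1"),
]

def classify_by_rules(filename, metadata):
    """Return (section, audit_trail) if a rule fires, else None.

    Returning the matched rule alongside the section keeps the decision
    auditable: a reviewer can see exactly which pattern triggered it.
    """
    for pattern, required_type, section in RULES:
        # Skip rules whose metadata precondition isn't met.
        if required_type and metadata.get("doc_type") != required_type:
            continue
        if pattern.search(filename):
            return section, f"filename matched /{pattern.pattern}/"
    return None  # fall through to the retrieval layer
```

Documents that no rule matches simply pass through to the next layer, which is what keeps this layer safe: it only ever acts when it is essentially certain.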

Layer 2: Few-Shot Retrieval – Learning from History

Once the rule-based layer has done its job, we’re left with documents that aren’t immediately obvious. This is where our few-shot retrieval layer comes into play. The core idea here is to leverage the organization’s own historical data. Every time a document has been correctly classified (either manually or by the system and confirmed by a user), it becomes a valuable data point.

Here’s how it works:

  1. When a new, unclassified document comes in, we generate a numerical representation (an embedding) of its content.
  2. We then compare this embedding to the embeddings of all previously classified documents in the customer’s historical dataset.
  3. If the new document is “similar enough” to a set of historically classified documents (e.g., three documents previously classified as Module 2.7.1), the system suggests that classification.

This layer is incredibly powerful because it adapts to an organization’s specific document types, internal naming conventions, and even unique content patterns. It catches another 20-25% of documents that rules alone couldn’t handle, particularly those with similar content but varying filenames. This layer is also the primary beneficiary of our self-learning mechanism, which I’ll discuss shortly.
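The three steps above can be sketched in a few lines. This is a toy version, assuming embeddings already exist and using plain cosine similarity with a k-nearest-neighbour vote; the threshold, vote rule, and function names are illustrative, not our actual retrieval implementation.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def suggest_by_retrieval(doc_embedding, history, k=3, threshold=0.8):
    """Suggest a section only when the k nearest historical documents agree.

    `history` is a list of (embedding, section) pairs from previously
    confirmed classifications. We decline to suggest unless the neighbours
    are both similar enough and unanimous, mirroring the "similar enough
    to a set of historically classified documents" idea.
    """
    scored = sorted(
        ((cosine_similarity(doc_embedding, emb), section) for emb, section in history),
        reverse=True,
    )[:k]
    if not scored or scored[-1][0] < threshold:
        return None  # not similar enough; escalate to the LLM layer
    votes = Counter(section for _, section in scored)
    section, count = votes.most_common(1)[0]
    return section if count == k else None  # require unanimity
```

Requiring unanimity among the neighbours is one conservative choice; a production system might instead weight votes by similarity or return a confidence score for the reviewer.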

Layer 3: LLM Classification – The Contextual Arbiter

After rules and retrieval have had their say, we’re left with the truly ambiguous cases – typically 5-15% of documents. These are the ones that could genuinely fit into multiple categories, or that have highly novel content. This is where Large Language Models (LLMs) shine.

For these documents, we feed the LLM not just the document’s content, but also crucial contextual information:

  • The potential eCTD module sections (e.g., “Is this 2.5, 2.7.1, or 2.7.2?”).
  • Descriptions of what each of those sections typically contains.
  • Any metadata or partial classifications from previous layers.

The LLM, with its vast general knowledge and ability to understand nuanced language, can then make a judgment call based on the provided context. It acts like a highly intelligent regulatory specialist, weighing the evidence and making a best-fit decision. This layer is invaluable for those tricky documents that defy simple pattern matching or historical comparison, providing a level of contextual understanding that was previously impossible without human intervention.

The key insight here is that the LLM isn’t doing all the heavy lifting. It’s reserved for the hardest problems, making its application efficient and targeted. This layered approach ensures that we use the right tool for the right job, optimizing for both accuracy and computational cost.
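A sketch of how the contextual prompt for this layer might be assembled follows. The section descriptions, field names, and prompt wording are illustrative assumptions (and the actual model call is omitted); the point is that the candidate sections, their definitions, and upstream signals are all passed explicitly so the model arbitrates with full context.

```python
# Illustrative guide mapping candidate eCTD sections to short descriptions.
SECTION_GUIDE = {
    "2.5": "Clinical Overview: a critical analysis of the clinical data.",
    "2.7.1": "Summary of Biopharmaceutic Studies and Associated Analytical Methods.",
    "2.7.2": "Summary of Clinical Pharmacology Studies.",
}

def build_classification_prompt(doc_text, candidates, metadata):
    """Assemble the context the LLM layer needs to arbitrate a hard case."""
    guide = "\n".join(f"- {s}: {SECTION_GUIDE[s]}" for s in candidates)
    return (
        "You are classifying a regulatory document into an eCTD section.\n"
        f"Candidate sections:\n{guide}\n"
        f"Upstream metadata and partial classifications: {metadata}\n"
        "Answer with the single best section number and a one-sentence rationale.\n\n"
        f"Document excerpt:\n{doc_text[:2000]}\n"  # truncate for the context window
    )
```

Asking for a rationale alongside the label is what keeps even this layer explainable: the model's stated reasoning is stored with the classification for review.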

“No single AI approach works for all regulatory documents. The magic happens when you intelligently layer different techniques – rules for the obvious, retrieval for the historical, and LLMs for the ambiguous. It’s about pragmatism, not purism.”

The Self-Learning Engine: Getting Smarter with Every Correction

This is where our system truly differentiates itself and becomes a practical AI solution for regulatory teams. We knew that no pre-trained model, no matter how sophisticated, could perfectly understand every customer’s unique document ecosystem from day one. Regulatory teams have their own internal naming conventions, their own document templates, and their own historical quirks.

Our solution incorporates a robust feedback loop:

  1. User Correction: When a user reviews a system-suggested classification and makes a correction, that correction isn’t just a one-off override.
  2. System Learning: The system immediately registers this correction. The document and its *correct* classification are added to that customer’s specific historical dataset for the few-shot retrieval layer.
  3. Continuous Improvement: The next time a similar document comes through, the retrieval layer has a new, accurate data point to compare against. Over time, with just a few hundred corrections, the system’s accuracy for that specific customer’s document patterns becomes remarkably high.

This isn’t a generic model that works “out-of-the-box” for everyone. It’s a system that gets smarter for *your* specific document types, *your* naming patterns, and *your* historical data. This practical AI approach means that the investment in correcting classifications pays dividends almost immediately, creating a virtuous cycle of improved accuracy and reduced manual effort.
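The loop above can be illustrated with a deliberately tiny sketch. Here "retrieval" is reduced to an exact fingerprint match so the mechanics are visible; the class and method names are hypothetical, and a real system would use the embedding-based retrieval layer described earlier.

```python
class SelfLearningClassifier:
    """Toy sketch of the correction loop: suggestions come from a history
    of confirmed examples, and every user correction enlarges that history."""

    def __init__(self):
        self.history = []  # list of (fingerprint, section) pairs

    def suggest(self, fingerprint):
        # Naive stand-in for retrieval: exact match against history.
        for fp, section in self.history:
            if fp == fingerprint:
                return section
        return None

    def confirm(self, fingerprint, correct_section):
        # Step 2 above: the correction is registered immediately, so the
        # next similar document benefits from it (step 3).
        self.history.append((fingerprint, correct_section))
```

The essential property is that `confirm` feeds directly into the data `suggest` reads from, with no retraining step in between: the correction pays off on the very next similar document.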

What the Industry Often Gets Wrong About AI in Regulatory

Building this system exposed some fundamental misconceptions about AI in regulated environments:

  • “AI-Powered” as a Buzzword: Many vendors slap “AI-powered” on what is essentially glorified keyword matching or a simple rules engine. While rules are vital (as we learned), they aren’t the sum total of intelligent classification. Real AI document classification for regulatory submissions requires deeper contextual understanding.
  • Ignoring Document Context: It’s not just about the text within a document. It’s about the document type, its relationship to other documents, its origin, and its place within the eCTD hierarchy. Generic AI often misses this crucial context.
  • Lack of Auditability: In regulatory, you need to know *why* a decision was made. A black box AI that just gives a label without an explanation is a non-starter. Our layered approach, especially with the rule-based and retrieval layers, provides inherent explainability. Even the LLM layer can be prompted to provide a rationale for its decision.
  • AI as a Replacement, Not Augmentation: The goal of AI in regulatory should never be to fully replace the human specialist. It should be to augment them, to offload the repetitive, high-volume tasks, allowing them to focus on critical thinking, quality review, and strategic decision-making. The human remains in the loop, making the final call, with AI providing a highly accurate, pre-classified starting point.

The Hard-Won Lessons: Building AI for Regulated Environments

Building this system was far from easy. We faced unique challenges inherent to the life sciences regulatory space:

  • Getting Training Data: This was perhaps the biggest hurdle. Regulatory documents are highly confidential and proprietary. You can’t just scrape the internet for eCTD submissions. We had to develop strategies for creating synthetic data, leveraging anonymized client data under strict agreements, and building robust data governance frameworks. This scarcity of public, labeled data forces a different approach to AI development, emphasizing few-shot learning and human-in-the-loop validation.
  • Handling Edge Cases: What about a document that legitimately contains elements of both Module 2.5 (Clinical Overview) and Module 2.7 (Clinical Summary)? Or a document that is a protocol *amendment* but is often mislabeled? Our layered approach helps, with the LLM layer being particularly adept at these nuanced judgments. But it also reinforced that humans are still essential for the truly ambiguous cases, and the system needs to gracefully flag these for review rather than making a confident, but wrong, guess.
  • Multi-Tenant Learning: Every customer is unique. A global top-20 pharma might have different document patterns than a mid-size biopharma. Our system needed to learn *per-tenant* without cross-contamination of data or learning. This required careful architectural design to ensure each customer’s self-learning engine was isolated and optimized for their specific environment. It meant that while the core algorithms were universal, the learned knowledge base was highly personalized.
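The tenant-isolation point can be sketched simply: shared algorithms, separate learned state. The registry below is illustrative (names and structure are assumptions, not our actual architecture), but it captures the invariant that one customer's corrections never influence another's suggestions.

```python
class TenantStoreRegistry:
    """Per-tenant example stores: universal algorithms, isolated learning."""

    def __init__(self):
        self._stores = {}  # tenant_id -> list of (embedding, section)

    def store_for(self, tenant_id):
        # Each tenant gets its own isolated store, created on first use.
        if tenant_id not in self._stores:
            self._stores[tenant_id] = []
        return self._stores[tenant_id]

    def record(self, tenant_id, embedding, section):
        # Corrections are written only to the owning tenant's store.
        self.store_for(tenant_id).append((embedding, section))
```

In production this boundary would of course be enforced at the storage and access-control layers too, not just in application code.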

What I’d Do Differently Next Time

Hindsight is 20/20, and building something this complex always comes with lessons learned:

  • Start with Rules Earlier: In our initial excitement about cutting-edge AI, we probably under-emphasized the power of simple, robust rules. They’re boring, yes, but they handle the majority of cases reliably and with high auditability. If I were to start over, I’d invest even more heavily in building out the rule-based layer from day one. It provides an immediate, tangible win and dramatically reduces the workload for the more complex AI layers.
  • Invest in the Feedback Loop from Day One: The self-learning capability, where user corrections improve the system, is what makes our solution truly powerful long-term. We knew it was important, but if I could go back, I’d make it an even higher priority in the initial design and development phases. It’s the engine that drives continuous improvement and makes the AI truly practical and adaptable for each unique customer environment. Without a strong feedback loop, even the most sophisticated AI will eventually stagnate.

“Don’t chase the shiny LLM object first. Start with robust rules. They’re boring, but they reliably handle the majority of cases and give you a strong foundation. Then, layer on the more sophisticated AI for the harder problems.”

The Future of AI Document Classification in Regulatory Submissions

What we’ve built isn’t just a product; it’s a testament to how practical, well-engineered AI can genuinely transform regulatory operations. It’s about taking the drudgery out of classification, freeing up highly skilled professionals, and ultimately accelerating the path of vital medicines to patients.

We continue to refine our three-layer system, exploring how to further enhance the contextual understanding of LLMs, improve the explainability of complex classifications, and broaden the scope of documents it can handle. The journey of AI document classification for regulatory submissions is ongoing, but we believe we’ve laid a strong, pragmatic foundation.

Conclusion & Call to Action

Building an effective AI document classification system for regulatory submissions is a complex undertaking. It demands a deep understanding of the regulatory domain, a pragmatic approach to AI development that combines multiple techniques, and a relentless focus on auditability and continuous learning.

We’ve walked that path, from frustrating manual processes to building a self-learning, three-layered system that truly understands regulatory documents. The lessons learned – from the power of simple rules to the necessity of a robust feedback loop – have been invaluable. We believe this approach represents the future of AI in regulatory operations: intelligent, adaptable, and always with the human expert in control.

Ready to see how our classification system can transform your regulatory submissions workflow?

See How Our Classification Works

About DnXT Solutions

DnXT Solutions provides cloud-native eCTD publishing, review, and regulatory compliance tools for life sciences companies. With 340+ submissions published and 20+ customers, DnXT is the regulatory platform purpose-built for speed and accuracy.