LLM-Aided Intelligent Data Consolidation for Regulatory & Legal Knowledge Archives

Executive Summary

Engagement with emerging technologies, including recent advancements in Generative AI, is increasingly important for organizations seeking to remain competitive. We have applied our experience with large language models to address a client and partner need. Specifically, we developed a workflow to automate the complex process of collecting, distilling, analyzing, and consolidating large volumes of legal data scattered across diverse sources. We built a robust and reliable LLM-powered pipeline, with the end results neatly showcased in a content platform. Our results demonstrate proficiency with Generative AI, integrating GenAI use cases into applications, and building software with an AI-first approach.

 


 

Introduction

Our firm leverages Large Language Models (LLMs) and Generative AI to deliver innovative solutions tailored to our clients’ needs. We enhance software products with AI, making them more dynamic and responsive to user requirements. Our expertise in Generative AI allows us to streamline processes across various domains, from content creation to elaborate development projects, while also tackling challenging issues and improving documentation practices.

A prime example is our development of an AI-driven pipeline that automates the collection of legal information for construction projects. This solution demonstrates our ability to address both longstanding and emerging business challenges, tailor AI technologies to specific industry needs, and significantly reduce time and resources spent on due diligence processes.

 

Problem Description

The United States has a vast repository of legal documents, with the added challenge of variation between laws at the federal and state levels. Sifting through a corpus of this size for meaningful content is a tedious and extremely time-consuming manual process. The task is made harder still by how widely this regulatory data is dispersed; there is no single resource that hosts these documents.

Furthermore, these regulations and permits are often mandated to be revised after a set period of time has elapsed. The collected data therefore has an expiration date, and as a consequence, part of the aforementioned process will ultimately have to be repeated.

Finally, not all of the information in a larger document will be of consequence or value for a given use case. These challenges hinder effective research, compliance efforts, and decision-making for individuals and organizations alike. An efficient solution is needed to streamline and accelerate the collection, analysis, and maintenance of relevant legal information across regions and archives.

 

What outcome do we propose?

Our answer to these problems is a single resource platform hosting an array of legal information guides, curated to answer the most persistent questions in their respective domains. The platform leverages a robust automation layer, powered by LLMs, to assist in collecting the required information, reducing the time spent on the manual alternative by an average of 80%.

To maintain maximum accuracy in LLM responses, instead of expecting the model to produce one elaborate response to a broad topic, we identify distinct problem areas, break them down into a set of questions, and aggregate the answers to those questions. The question set is designed so that we can repeatedly look up reliable answers to each question, where 'look up' means querying the LLM's pre-trained knowledge directly for a response. This approach is more reliable because these answers have largely remained the same over the past several years, so narrowing the LLM's focus to a particular domain yields the highest degree of output accuracy. Breaking down the problem also inherently gives structure to our content.
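
The sketch below illustrates this question-set approach, assuming the Anthropic Python SDK and a Claude 3 model (the content-generation model named later in Technical Details). The topic and example questions are illustrative placeholders, not our production question set.

# A minimal sketch of the question-set approach: a broad topic is broken into
# narrow, repeatable questions, each answered independently and then aggregated.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

QUESTIONS = [
    "Which projects does this permit apply to?",
    "What environmental considerations does it cover?",
    "What exemptions are available?",
]

def answer_question(topic: str, question: str) -> str:
    """Look up a single, narrowly scoped question against the model's knowledge."""
    message = client.messages.create(
        model="claude-3-sonnet-20240229",  # illustrative model ID
        max_tokens=1024,
        system=f"Answer questions about {topic} concisely and factually.",
        messages=[{"role": "user", "content": question}],
    )
    return message.content[0].text

def build_guide(topic: str) -> dict:
    """Aggregate the per-question answers into one structured guide."""
    return {question: answer_question(topic, question) for question in QUESTIONS}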

'Structure' in our context means dividing our requirements into segments. By doing so, we can deploy plug-and-play pipelines independently and parallelize our operations, greatly speeding up the process. For example, imagine automating data collection for Environmental Permit Guides. Since there is a plethora of permits under this umbrella, one could not feasibly develop a custom pipeline for each one. At the same time, articles covering such permits typically incorporate similar details, such as project applicability requirements, environmental considerations, and permit exemptions. These are the details that form the question set, and their answers form the larger part of the final output.
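
As a rough sketch, independent guide pipelines can be fanned out in parallel, one per permit type, all sharing the same question set. The permit names below are examples, and build_guide stands in for the LLM-backed question-and-answer pipeline sketched earlier.

# Running independent "plug-and-play" guide pipelines in parallel.
from concurrent.futures import ThreadPoolExecutor

PERMIT_TYPES = [
    "Stormwater Construction General Permit",
    "Section 404 Dredge and Fill Permit",
    "Air Quality Construction Permit",
]

def build_guide(permit_type: str) -> dict:
    # Placeholder: in practice this runs the per-question LLM lookups.
    return {"permit": permit_type, "sections": {}}

def run_all_pipelines() -> list:
    # Each guide is produced independently, so the work parallelizes cleanly.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(build_guide, PERMIT_TYPES))

if __name__ == "__main__":
    for guide in run_all_pipelines():
        print(guide["permit"])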

Furthermore, identifying the facets of our required data that do not change regularly over time is necessary to streamline revision requests. This absolves us from rerunning our pipelines over the larger information deck and proves essential for managing runtimes and keeping costs low.

We use Storyblok, a headless content management system, to store and later deliver our content with agility. Its support for structured content storage, version control, and a rich API helps reduce friction in the pipeline to a great extent. Algolia is used for its powerful indexing capabilities to provide performant search functionality on our website. The data is now ready for consumption.
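
The snippet below is a hedged sketch of this delivery step: publishing a generated guide to Storyblok and pushing a search record to Algolia. The space ID, tokens, index name, and content schema are placeholders, and the Storyblok Management API payload is simplified for illustration.

# Publish a consolidated guide to Storyblok, then index it in Algolia.
import requests
from algoliasearch.search_client import SearchClient  # algoliasearch v3-style client

STORYBLOK_SPACE_ID = "000000"          # placeholder
STORYBLOK_TOKEN = "management-token"   # placeholder

def publish_to_storyblok(guide: dict) -> dict:
    """Create and publish a story holding one consolidated guide."""
    url = f"https://mapi.storyblok.com/v1/spaces/{STORYBLOK_SPACE_ID}/stories/"
    payload = {
        "story": {
            "name": guide["title"],
            "slug": guide["slug"],
            "content": {"component": "guide", "body": guide["sections"]},
        },
        "publish": 1,
    }
    resp = requests.post(url, json=payload, headers={"Authorization": STORYBLOK_TOKEN})
    resp.raise_for_status()
    return resp.json()

def index_in_algolia(guide: dict) -> None:
    """Push a search record so the guide is discoverable on the site."""
    client = SearchClient.create("APP_ID", "ADMIN_API_KEY")  # placeholders
    index = client.init_index("permit_guides")
    index.save_objects([{
        "objectID": guide["slug"],
        "title": guide["title"],
        "summary": guide.get("summary", ""),
    }])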

The final step is to host our consolidated information on an SEO-friendly medium so it can generate maximum traffic for stakeholders. Particular attention has been paid to the user experience by integrating visually appealing UI elements and maintaining an easy-to-follow, highly navigable content layout.

Figure 1: High-Level Architecture

 

Technical Details

Our pipeline leverages multiple Large Language Models, each selected for its individual strengths: Claude 3 is used for content generation, while GPT-4 is used for formatting. Claude was chosen as our aid in content automation after a thorough analysis of outputs and a comparison of responses with other LLMs, most notably GPT. Having manually verified the correctness of the generated content, we find that Claude exceeds our expectations and is considerably more capable in this regard than its competitors. GPT-4 remains the stronger choice for transforming input data: its JSON mode is highly useful for producing segmented content, and its structural guarantee makes guardrails easy to enforce.
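
To illustrate the formatting half of this split, the sketch below uses GPT-4's JSON mode to reshape a Claude-generated draft into structured sections. The model ID and the target schema are assumptions for illustration, not the exact production values.

# Reformat a content draft into structured JSON using GPT-4's JSON mode.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def format_as_sections(draft: str) -> dict:
    """Enforce structured JSON output so downstream guardrails are easy to apply."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "Return JSON with keys 'summary' and 'sections', where 'sections' "
                    "is a list of objects with 'heading' and 'body'. Use only the "
                    "provided draft; do not add new facts."
                ),
            },
            {"role": "user", "content": draft},
        ],
    )
    return json.loads(response.choices[0].message.content)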

Several measures are also implemented to counter hallucinations. 'Hallucination' refers to the phenomenon of an AI model outputting (often quite confidently) a false or inaccurate response to a prompt. Due to the fundamental nature of how these models work, processing vast amounts of sometimes inconsistent data, hallucinations cannot be completely eliminated. AI responses are probabilistic rather than deterministic: the model generates outputs based on learned patterns and probabilities derived from its training data, which can sometimes lead to inaccuracies. To reduce the risk of hallucinations in our pipeline, we mandate format coherence by requiring JSON outputs. Sanity checks are also baked in by requiring references in LLM responses and then validating those references. Certain knowledge is additionally extracted via Retrieval-Augmented Generation (RAG), search engine knowledge panels, and LLM-aided web search. The OpenAI Assistants API is used to query a corpus of legal documents by transforming it into a vector database, thereby facilitating RAG.
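
A minimal sketch of the reference sanity check is shown below: every generated section must carry references, and each reference URL is checked before the content is accepted. The acceptance logic here is simplified for illustration.

# Reject sections that lack references or whose references do not resolve.
import requests

def validate_references(section: dict) -> bool:
    """Return True only if the section has references and each one resolves."""
    references = section.get("references", [])
    if not references:
        return False
    for url in references:
        try:
            resp = requests.head(url, allow_redirects=True, timeout=10)
            if resp.status_code >= 400:
                return False
        except requests.RequestException:
            return False
    return True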

Content is also sourced, where permissible, using Puppeteer for JavaScript and BeautifulSoup for Python. We also employ third-party tools such as SERP and FireCrawl for obtaining hosted data and analyzing search trends. Our chosen framework for the hosted platform is Astro, a JavaScript framework especially suited to content-driven websites. Astro's support for dynamic server-side rendering and the SEO-friendly defaults built into the framework make it an easy choice for the task at hand.
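
As a hedged example of the BeautifulSoup path, the helper below fetches a page and extracts its visible paragraph text. The URL and selectors are placeholders; in practice, each source page gets its own simple extraction rules.

# Fetch a page and return its visible paragraph text, where scraping is permitted.
import requests
from bs4 import BeautifulSoup

def fetch_page_text(url: str) -> str:
    """Download a page and join its non-empty paragraph texts."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    return "\n".join(p for p in paragraphs if p)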

 


 

Conclusion

Our efforts demonstrate that LLMs are remarkably capable, that they can be engineered to be reliable for automation tasks, and that their use with RAG and tool calling has matured for the enterprise. Our clients have vouched for our solutions, reporting accuracy as high as 95% for the generated micro-blogs, which far exceeds the expectations usually associated with leveraging Generative AI for tasks where factuality is paramount. Our results are paving the way for onboarding new clientele and also serve the community at large by consolidating information into a single platform, which will greatly help the decision-making process for all concerned parties.

 


 

Interested in learning more?

At Conrad Labs, we are committed to pushing the boundaries of innovation through our research and development efforts. As we continue to explore new frontiers and share our findings, we invite you to join us on this journey. Head over to https://conradlabs.substack.com/conrad-labs for more content. Stay connected with us for more groundbreaking research and be a part of the future we are building.