Better Than TAR. Nearly Expert: What a Major Study Shows About Gen AI and TAR in a Complex Document Review
For more than a decade, technology-assisted review (TAR) has helped legal teams conduct large-scale document review more efficiently and defensibly than traditional search-term or linear-review approaches. Courts, practitioners, and discovery professionals have developed familiar ways to validate results, measure recall and elusion, and explain why a review process was reasonable and proportional. The question for generative AI-based review tools is not whether they can equal TAR, but whether they can outperform it.
TAR has repeatedly shown that it can perform well in finding documents related to specific topics. The harder question for generative AI (gen AI) is whether it can go further: whether it can understand what a document says, apply a complex review protocol to it, and determine whether the document is responsive.
We recently conducted an empirical study examining this question on an intentionally complex document review. One workflow was a managed TAR review using Relativity Active Learning (RAL), a continuous active learning tool that uses reviewer coding decisions to help prioritize likely responsive documents, supported by a 24-person review team provided by Cimplifi, a managed review vendor. That workflow required 1,123 hours of human effort. Separately, we developed and ran a gen AI workflow using Relativity aiR for Review, a gen AI tool powered by a large language model (LLM) that uses a natural-language prompt to classify documents and explain its reasoning. That workflow required just 18 hours of human effort. After both reviews were complete, an independent subject-matter expert (SME) performed a blinded random sample review to support an impartial comparison of results.
The results were striking. aiR for Review found substantially more responsive documents, and left fewer behind, than the active learning workflow. The tradeoff was modest: aiR for Review flagged more documents as potentially responsive, and a lower proportion of those documents ultimately proved responsive than under RAL. Separately, when the SME later reviewed aiR for Review’s reasoning for certain documents he had initially coded not responsive, he changed 7% of those calls to responsive. Those overturns came after the blinded comparison and were not counted toward the reported metrics, but they suggest that aiR for Review’s document-level reasoning can help even an expert reviewer find more responsive documents.
Our Study: A Complex Review, Measured Against Independent Ground Truth
The review population consisted of 45,004 documents from the public Mallinckrodt corpus in the UCSF Opioid Industry Documents Archive, including emails, Word documents, spreadsheets, presentations, and other common business records. We used one of the initial complaints in the opioid litigation to create a review protocol that simulated a new document request against a produced population.
The responsiveness standard was intentionally demanding: a document was not responsive merely because it mentioned opioids, sales activity, drug promotion, or the company’s business. It also had to contain evidence of compliance with, violation of, or reckless disregard of federal requirements governing pharmaceutical marketing or controlled substances. That standard made the exercise a closer simulation of large document reviews in actual legal matters. The goal was not simple topic tagging: gen AI needed to apply a complex review protocol to a large document population and deliver reliable results.
After both workflows were complete, an experienced SME reviewed 1,000 randomly selected documents without seeing aiR for Review’s scores, aiR for Review’s rationale, or the RAL workflow’s prediction. His determinations supplied the ground truth for the primary analysis. Of those 1,000 documents, he coded 73 as responsive.
We then compared the TAR and gen AI outputs against that same ground truth using three common validation metrics: recall, or how many truly responsive documents the workflow found; elusion, or how many responsive documents were left behind in the documents treated as not responsive; and precision, or how many documents flagged as responsive actually were responsive. These metrics do not answer every legal or strategic question, but they make the tradeoffs visible.
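For readers who want to check the arithmetic, the three metrics can be sketched from the validation-sample counts reported in this article (1,000 documents sampled, 73 coded responsive by the expert, 224 and 120 documents flagged by the two workflows, 64 and 47 of them correctly). This is an illustrative sketch, not the study's own code: the class and field names are hypothetical, and it assumes elusion is computed as missed responsive documents divided by the documents the workflow did not flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleCounts:
    """Validation-sample counts for one workflow (hypothetical helper)."""
    sample_size: int        # documents in the validation sample
    truly_responsive: int   # documents the expert coded responsive
    flagged: int            # documents the workflow treated as responsive
    true_positives: int     # flagged documents the expert also coded responsive

    @property
    def recall(self) -> float:
        # Share of truly responsive documents the workflow found.
        return self.true_positives / self.truly_responsive

    @property
    def precision(self) -> float:
        # Share of flagged documents that were actually responsive.
        return self.true_positives / self.flagged

    @property
    def elusion(self) -> float:
        # Assumption: responsive documents left in the unflagged set,
        # divided by the size of that unflagged set.
        missed = self.truly_responsive - self.true_positives
        unflagged = self.sample_size - self.flagged
        return missed / unflagged


# Counts as reported in the article for the 1,000-document sample.
air = SampleCounts(sample_size=1000, truly_responsive=73, flagged=224, true_positives=64)
ral = SampleCounts(sample_size=1000, truly_responsive=73, flagged=120, true_positives=47)

for name, counts in (("aiR for Review", air), ("RAL", ral)):
    print(
        f"{name}: recall={counts.recall:.1%} "
        f"precision={counts.precision:.1%} elusion={counts.elusion:.1%}"
    )
```

Rounded to whole percentages, these counts give the recall, precision, and elusion figures quoted in the article.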
The Core Result: aiR for Review Found 88% of Responsive Documents, vs. 64% for TAR
Compared with the expert’s independent review, aiR for Review identified 64 of the 73 responsive documents in the sample, or 88% recall (95% CI: 78.2%–93.4%). The active learning managed-review workflow identified 47, or 64% recall (95% CI: 52.9%–74.4%). aiR for Review also had a lower elusion rate: 1%, compared with 3%. In practical terms, aiR for Review found more responsive material and left fewer responsive documents behind.
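The article does not say which method produced the 95% confidence intervals. One common choice for proportions from a modest sample is the Wilson score interval, and the sketch below assumes that method; the function name is hypothetical. Under that assumption, the recall counts above (64 of 73 and 47 of 73) give bounds that appear consistent with the reported intervals.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximately 95% confidence level.
    Assumption: this may or may not be the method used in the study.
    """
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Recall counts reported for the validation sample.
air_low, air_high = wilson_interval(64, 73)
ral_low, ral_high = wilson_interval(47, 73)

print(f"aiR for Review recall CI: {air_low:.1%} to {air_high:.1%}")
print(f"RAL recall CI: {ral_low:.1%} to {ral_high:.1%}")
```

The width of these intervals reflects the small number of responsive documents in the sample (73), which is why the recall estimates carry roughly ten points of uncertainty in either direction.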
That result matters. Continuous active learning remains a proven and widely used tool in modern discovery. But on this difficult, low-richness review requiring application of a complex legal standard, gen AI found more responsive documents and missed fewer than TAR. The result should not be treated as a universal recall benchmark for every matter, but it is concrete evidence that gen AI review can perform well on nuanced responsiveness decisions that have historically been difficult to automate.
Lower Precision and the Effect of Low Richness
The aiR for Review workflow did not outperform on every metric, however. Its precision was moderately lower: 29% (95% CI: 23.1%–34.8%), compared with 39% for the active learning workflow (95% CI: 30.9%–48.1%). In practical terms, aiR for Review identified more documents for potential review: 224 documents in the sample, compared with 120 for the RAL workflow.
Extrapolated to the full 45,004-document population, aiR for Review would have identified roughly 9,971 documents for potential review, compared with 5,019 for the RAL workflow. In return, it would have missed only an estimated 376 responsive documents, compared with 1,177 for active learning—roughly 801 additional responsive documents recovered.
The precision numbers for both workflows need additional context because this was a low-richness review. In discovery, “richness” means the percentage of documents in a population that are actually responsive. Here, only 73 of the 1,000 documents in the validation sample were responsive, a richness rate of about 7%.
Low richness depresses precision across the board: when responsive documents are rare, even a small error rate on the much larger non-responsive population produces many false positives relative to the true positives available to be found. At 20% richness, closer to the range often targeted in TAR validation studies, projected precision would rise to 56% for aiR for Review and 67% for the active learning workflow. Those projected figures put the precision results in a more familiar range for validated TAR workflows.
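The article does not spell out how the 20% richness projection was calculated. One plausible approach, sketched here as an assumption, holds each workflow's recall and false-positive rate fixed at the values observed in the validation sample and recomputes precision at a different richness. The function and variable names are hypothetical; the counts are those reported in this article.

```python
def projected_precision(recall: float, false_positive_rate: float, richness: float) -> float:
    """Precision at a given richness, holding recall and false-positive rate fixed.

    Assumption: this is one standard way to project precision to a different
    prevalence; the study may have used a different method.
    """
    true_positive_share = recall * richness
    false_positive_share = false_positive_rate * (1 - richness)
    return true_positive_share / (true_positive_share + false_positive_share)


# Sample counts reported in the article.
SAMPLE_SIZE = 1000
RESPONSIVE = 73
NOT_RESPONSIVE = SAMPLE_SIZE - RESPONSIVE  # 927

# (documents flagged, flagged documents the expert coded responsive)
workflows = {"aiR for Review": (224, 64), "RAL": (120, 47)}

projections = {}
for name, (flagged, true_positives) in workflows.items():
    recall = true_positives / RESPONSIVE
    # False-positive rate: flagged but not responsive, over all non-responsive.
    fpr = (flagged - true_positives) / NOT_RESPONSIVE
    projections[name] = projected_precision(recall, fpr, richness=0.20)
    print(f"{name}: projected precision at 20% richness = {projections[name]:.0%}")
```

Passing the sample's own richness (73 of 1,000) to the same function returns each workflow's observed precision, which is a useful sanity check on the formula.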
The Secondary Result: Human Plus AI
The head-to-head comparison was not the only meaningful result. The study also showed what can happen when AI reasoning is put in front of a human expert.
After the blinded review, we conducted a one-directional informed re-review. The expert was shown documents he had coded not responsive but aiR for Review had predicted responsive, along with aiR for Review’s rationale, considerations, and document citations. The expert changed 10 of those calls to responsive, a 7% overturn rate among the documents re-reviewed. Because the re-review was a secondary inquiry, these overturns did not change the ground truth or the metrics reported for this study.
The re-review was intentionally limited: it did not re-examine every expert call, and it did not replace the primary analysis. But the result is still important. aiR for Review surfaced responsive material that an experienced SME initially missed, and the expert agreed after reviewing the AI’s analysis.
This highlights one of the unique—and often overlooked—strengths of gen AI review. aiR for Review did not just classify documents. It also provided reasoning that helped the human reviewer evaluate difficult calls. The point is not that the expert deferred to the machine—for 141 other documents, he did not. But in some cases, the reasoning and citations changed his view.
What Legal Teams Should Take Away
On a complex, low-richness review, gen AI achieved higher recall and lower elusion than TAR, with only moderately lower precision as a tradeoff. Additional research across different matters and document types will refine these findings, but this study shows that gen AI can outperform TAR in identifying responsive documents. Just as important, aiR for Review’s reasoning helped the human expert identify responsive documents he had initially missed.
Traditional TAR tools are not going away—and they should not. They remain proven, important, and defensible. Gen AI appears capable of extending that toolkit to harder responsiveness decisions involving complex rule sets—and, in some difficult reviews, may help legal teams find documents that other workflows or even expert reviewers might otherwise miss.
A preprint version of the detailed study paper, including the full methodology and statistical analysis, is available on the Redgrave website.
Author Bios
Robert Keeling is a co-managing partner of Redgrave LLP and a nationally recognized authority on eDiscovery. He serves as discovery counsel in complex, data-intensive matters, including litigation, investigations, and regulatory reviews.
Ray Mangum is a partner at Redgrave LLP whose practice focuses on eDiscovery, data analytics, and information governance. He advises clients on litigation, investigations, regulatory reviews, and the defensible use of machine learning and generative AI in discovery.
Eli Nelson is senior counsel at Redgrave LLP. He is a litigator and data scientist who focuses on complex data, eDiscovery, litigation strategy, and the use of analytics and technology in discovery and fact development.
Kevin Reiss is counsel at Redgrave LLP. His practice focuses on eDiscovery in government investigations and civil litigation, including document review strategy, reviewer management, and discovery workflows for complex matters.
The views expressed in this article are those of the authors and do not necessarily represent the views of their law firm or any of its clients.