Moving From Good Intentions To Gold Standards – Artificial Lawyer

By Jim Wagner, CEO, Contract Network
A new legal AI benchmarking project is planned, and it is bound to cause debate. By many accounts, 'Vals 1' was far from smooth – companies withdrew and participants questioned the methodology. I support transparency in legal AI, but past benchmarking efforts in our industry, including Vals 1, have at best produced mixed results.
When executed well, benchmarking offers genuine insight. It allows legal professionals to choose tools with confidence, nudges vendors to improve and, in rare cases, moves the whole field forward (thanks again, Maura Grossman and Gordon Cormack).
However, truly effective benchmarking in legal AI remains elusive. Early efforts, both from inside and outside the legal industry, have often generated more heat than light, leaving buyers unsure of what these tools can actually accomplish in real-world conditions.
The dangers of 'pretty good' – when benchmarks fall short
Even well-intentioned studies can mislead. We saw that with the first Vals Legal AI Report ('Vals 1'). Whatever its merits, it struggled with three issues:
Timeliness of results. The AI landscape moves at high speed. As Vecflow pointed out in its commentary on Vals 1, its product had "significantly advanced" over the six months separating data collection from publication. By the time readers saw the numbers, they already trailed reality.
Sample size and scope. Some tasks faced questions about whether the dataset was large enough to support broad conclusions. Noah Waisberg's 'MFN-gate' analysis argued that data-extraction tests should include a far larger document set to paint a reliable picture of accuracy.
Perceived conflicts of interest. Vals 1 disclosed that 'Vals AI has customer relationships with one or more of the participants'. Transparency helps, but such connections naturally raise doubts about impartiality and underscore the need for genuinely independent evaluation.
The frontier model factor
Most specialized legal AI tools rely on the same frontier models – Claude, OpenAI's GPT models, or their peers. That means macro-level performance will tend to advance with those underlying models more than with vendors' incremental improvements. When OpenAI or Anthropic releases a model with stronger reasoning, the tools built on it improve at the same time, regardless of domain tuning.
Implementation still matters. Effective retrieval-augmented generation, domain-specific data, and thoughtful prompt engineering can deliver significant gains. We see it every day in the clinical research agreements we handle. But we should all admit that the ultimate ceiling is set by the foundation models themselves. An ideal, if unattainable, benchmark would separate what stems from the base model from the value created by the legal AI layer.
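To make that 'legal AI layer' concrete, here is a minimal sketch of retrieval-augmented generation in Python. It is illustrative only: the clause library is invented, the retriever is a toy bag-of-words ranker, and call_foundation_model() is a hypothetical stand-in for a vendor API call, not any benchmarked product's pipeline.

```python
# Minimal sketch of a retrieval-augmented "legal layer" over a foundation model.
# All names and data are illustrative, not a real product's implementation.
from collections import Counter
import math

CLAUSE_LIBRARY = [
    "Either party may terminate this agreement upon thirty days written notice.",
    "The sponsor shall indemnify the institution against third-party claims.",
    "All study data shall remain the confidential information of the sponsor.",
]

def call_foundation_model(prompt: str) -> str:
    # Hypothetical stand-in for the real API call; swap in a vendor SDK here.
    return f"[model response to {len(prompt)} chars of prompt]"

def _vector(text: str) -> Counter:
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, k: int = 2) -> list[str]:
    """Rank library clauses by similarity to the question (toy retriever)."""
    q = _vector(question)
    return sorted(CLAUSE_LIBRARY, key=lambda c: _cosine(q, _vector(c)), reverse=True)[:k]

def answer(question: str) -> str:
    """Build a grounded prompt from retrieved clauses, then call the model."""
    context = "\n".join(retrieve(question))
    prompt = (
        "You are reviewing a clinical research agreement.\n"
        f"Relevant clauses:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return call_foundation_model(prompt)

print(answer("What are the termination rights?"))
```

Two tools wrapping the same foundation model can differ in retrieval quality and prompting like this, yet both will jump together when the underlying model improves – which is precisely the attribution problem an ideal benchmark would untangle.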
How we communicate the results
Beyond the reports themselves lies a wider problem: marketing. Some vendors trumpet headline accuracy figures as 'verified XX% correct'. Within a tightly controlled test – known documents, known issues, and expert users – the claim is probably true. The problem comes when that number, stripped of context, morphs into a promise of universal performance. In messy reality – unfamiliar contract types, novel clauses, users with uneven skills – results will vary, sometimes wildly. Overstated claims breed skepticism and, ultimately, distrust across the whole legal AI community.
It is particularly important to note that courts and regulatory bodies increasingly cite benchmarking studies when weighing AI-assisted work. Our homework will face on-the-record scrutiny.
Real-world benchmarking challenges: hard-won lessons
After years of building and evaluating legal AI tools, I keep running into two persistent obstacles:
- Bandwidth. Proper evaluation requires great effort. Experts need to label data. Vendors must dedicate engineers. Evaluators must design and run rigorous protocols. The temptation to cut corners – fewer documents, simpler tasks, automated annotation – is ever-present, but yielding to it undermines the value of the exercise.
- A moving gold standard. Legal professionals often disagree on what is 'correct'. Clause identification, risk assessment, and even basic interpretation invite debate. Put three experts in one room and you may get four answers. When half approve an AI output and half oppose it, is that success or failure?
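One way to ground that problem is to measure how often the experts themselves agree before scoring any tool. The sketch below uses invented labels and plain pairwise agreement; a real study would likely prefer chance-corrected statistics such as Cohen's or Fleiss' kappa.

```python
# Toy measure of expert disagreement about what is "correct": average pairwise
# agreement across three reviewers. All labels are invented for illustration.
from itertools import combinations

# Each reviewer labels the same five contract clauses as risky (1) or not (0).
reviews = {
    "expert_a": [1, 0, 1, 1, 0],
    "expert_b": [1, 1, 1, 0, 0],
    "expert_c": [0, 1, 1, 1, 0],
}

def pairwise_agreement(a: list[int], b: list[int]) -> float:
    """Fraction of items on which two reviewers gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

pairs = list(combinations(reviews.values(), 2))
avg = sum(pairwise_agreement(a, b) for a, b in pairs) / len(pairs)
print(f"average pairwise agreement: {avg:.2f}")  # 0.60 on this toy data
# If experts agree only ~60% of the time, a benchmark that scores an AI
# against any single expert's labels is measuring a moving target.
```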
The blueprint: evolving toward gold-standard legal AI benchmarking
If we want benchmarking to fulfill its promise, we must treat it as an ongoing discipline. Key priorities:
- Transparency and independence. Publish the methodology, data characteristics, scoring rubrics, funding sources, and governance. Show how conflicts of interest are managed.
- Durability and realism. Use large, diverse datasets and tasks that reflect real practice, not synthetic edge cases.
- Objective, contextual metrics. Move beyond a single accuracy number. Balanced metrics – precision, recall, F-score – plus task-specific qualitative notes give a fuller picture (see the sketch after this list).
- Continuous, accessible evaluation. AI evolves rapidly; benchmarks must keep pace. Evaluate current models on rolling schedules and share the findings widely.
- Guard against 'teaching to the test'. Rotate blind test sets, vary tasks, and focus on underlying reasoning skills to prevent overfitting.
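As a toy illustration of the balanced metrics named above, the following script scores a hypothetical clause-extraction run against expert gold labels. The labels are invented, and this is a sketch of the arithmetic rather than any benchmark's actual harness.

```python
# Toy scoring of a clause-extraction benchmark: precision, recall, F-score
# computed against expert "gold" labels. Data is invented for illustration.

gold      = {"termination", "indemnity", "confidentiality", "mfn"}
predicted = {"termination", "indemnity", "governing_law"}

tp = len(predicted & gold)  # correctly extracted clauses
fp = len(predicted - gold)  # extracted but wrong
fn = len(gold - predicted)  # missed by the tool

precision = tp / (tp + fp)  # how much of the output is right
recall    = tp / (tp + fn)  # how much of the truth was found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
# precision=0.67 recall=0.50 F1=0.57 – a single headline "accuracy" number
# would hide the gap between these two very different failure modes.
```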
A pragmatic route ahead
Legal AI benchmarking is in its infancy. Missteps are inevitable, but poorly designed evaluations can do more harm than good by distorting perceptions of what these tools can do.
As Vals 2 and other initiatives move forward, I hope we see benchmarks that own their limitations, emphasize real-world utility, and move us toward fair, contextual evaluation. The goal is not to crown champions, but to deepen the industry's understanding, foster responsible innovation, and help legal professionals adopt AI wisely. The future of AI in the law depends on it.
–
About the author:
Jim Wagner is co-founder and CEO of The Contract Network, where he and his colleagues are tackling the wasted effort in contracting. Before founding TCN, he served as Vice President of Agreement Cloud Strategy at DocuSign after DocuSign acquired Seal Software, where he had been president. A serial founder in legal technology, Jim holds numerous patents on the use of AI and analytics for legal documents.
–
(This is a thought leadership article written by Jim Wagner for Artificial Lawyer.)
–
Legal Innovators California Conference, San Francisco, June 11 + 12
If you are interested in the cutting edge of legal AI and innovation – and where we are all heading – then come along to Legal Innovators California in San Francisco, June 11 and 12, where speakers from leading law firms, in-house teams, and technology companies will share their insights and experience of what is really happening and where we are all heading.
We already have an exceptional roster of companies to hear from. This includes: Legora, Harvey, StructureFlow, Ivo, Flatiron Law Group, PointOne, Centari, LexisNexis, eBrevia, Legatics, Recognition, DraftWise, Newcode.ai, Riskaway, SimpleClosure and more.
See you all there!
More information and tickets here.