S3 Releases – LLM Eval ‘For Any Jurisdiction, Language + Model’ – Artificial Lawyer
Raymond Blyd, a renowned legal technology expert, has launched S3, a new LLM evaluation framework for legal use cases, which focuses on ‘identifying fundamental deficiencies rather than capabilities’.

As Blyd explained to Artificial Lawyer, S3 was created to calibrate and compare open-source models during the development of Sabaio (his early-stage AI company), with a focus on accuracy and hallucinations.

S3 provides:

  • ‘Standardized Evaluation Metrics: implements industry-standard and custom metrics tailored for legal tasks.
  • Reproducible Workflows: ensures that evaluation processes can be repeated and verified by others.
  • Extensible Architecture: easily add new evaluation modules or integrate with other legal tech tools.
  • Transparent Reporting: generates clear, auditable reports for regulatory and internal review.’

Blyd commented: ‘I needed a reliable method to evaluate improvements in base model capabilities. For example, many models failed to cite the correct articles or reference numbers. To prove this, I ran a simple ‘strawberry’ test, transposing the digits of legal article numbers to check the model’s accuracy. Most models failed, exposing their inconsistency.

‘That insight led to creating a prompt template for model testing. The template uses a fixed structure – jurisdiction, code, article number, transposition and legal topic – to ensure consistency. This allows measurable, reproducible comparisons of model performance across languages and legal systems.’
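The fixed-structure prompt Blyd describes could be sketched roughly as follows. This is a minimal illustration, not S3’s actual template: the field names and question wording are assumptions.

```python
# Hypothetical sketch of a fixed-structure eval case, as described above.
# Field names and prompt wording are illustrative assumptions, not S3's code.
from dataclasses import dataclass

@dataclass
class EvalCase:
    jurisdiction: str    # e.g. "NL"
    code: str            # e.g. "Burgerlijk Wetboek (Civil Code)"
    article: str         # the article number presented to the model
    is_transposed: bool  # True if digits were deliberately swapped
    topic: str           # the legal topic the article should cover

def build_prompt(case: EvalCase) -> str:
    """Render the fixed-structure question; the expected answer is yes/no."""
    return (
        f"In the {case.jurisdiction} {case.code}, does article "
        f"{case.article} govern {case.topic}? Answer yes or no."
    )

case = EvalCase("NL", "Burgerlijk Wetboek", "6:162", False, "unlawful acts (tort)")
print(build_prompt(case))
```

Because every case carries the same five fields, the same test set can be regenerated for another jurisdiction or language by swapping the field values, which is what makes the comparisons reproducible.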

The framework uses a ‘direct quantitative approach: each model answers a fixed set of objective questions and the correct answers are counted’. Performance is reported as a ratio (e.g. 12/12), enabling transparent and reproducible comparisons across models and test prompts, he explained.
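The scoring described above is simple enough to sketch in a few lines. The grading rule and sample answers below are illustrative assumptions:

```python
# Minimal sketch of S3's quantitative scoring as described in the article:
# count correct answers over a fixed question set, report a ratio like "12/12".
# The grading rule (exact match after normalisation) is an assumption.
def score(answers: list[str], expected: list[str]) -> str:
    correct = sum(a.strip().lower() == e for a, e in zip(answers, expected))
    return f"{correct}/{len(expected)}"

model_answers = ["yes", "no", "no", "yes"]   # fabricated example outputs
gold          = ["yes", "no", "yes", "yes"]  # fabricated expected answers
print(score(model_answers, gold))  # → 3/4
```

Reporting a plain ratio rather than a percentage keeps the size of the question set visible, so a 3/4 on a small set is not mistaken for a 75% on a large one.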

Below is a more in-depth interview with Blyd about the how and why of the project.

Why did you build this?

For Sabaio, I was looking for a way to check whether a large language model could accurately reference Dutch civil legislation, specifically identifying the right article on tort law. None of the open-source models I ran locally managed this. So, I wanted to see if any model out there had this basic legal capability. Other evaluation frameworks look at model skills or specific product capabilities, while the S3 framework looks only at deficiencies in base models.

How do you tell what is correct or not?

By deliberately including a wrong article number and then asking the model to verify whether the number is correct. This results in a direct ‘true or false’ test – a legal version of the ‘strawberry’ test. This works for codified law, as well as for references to case law.
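The perturbation step can be illustrated with a small helper that transposes two digits of an article number to produce a known-wrong reference. The specific transposition rule (swapping the last two digits) is an assumption for demonstration, not necessarily how S3 perturbs numbers:

```python
# Illustrative sketch of the "legal strawberry test" perturbation: transpose
# two digits of an article number, then ask the model whether it is correct.
# Swapping the last two digits is an assumed rule, chosen for demonstration.
def transpose_digits(number: str) -> str:
    """Swap the last two digits, e.g. '6:162' -> '6:126'."""
    digits = [c for c in number if c.isdigit()]
    if len(digits) < 2 or digits[-1] == digits[-2]:
        return number  # no meaningful swap possible
    chars = list(number)
    idx = [i for i, c in enumerate(chars) if c.isdigit()]
    chars[idx[-1]], chars[idx[-2]] = chars[idx[-2]], chars[idx[-1]]
    return "".join(chars)

print(transpose_digits("6:162"))  # → 6:126
```

Because the ground truth is known by construction (the perturbed number is wrong by definition), grading the model’s yes/no answer requires no external law library.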

What metric do you use?

We use a simple ratio, like 12/12, to provide clear and reproducible comparisons across different models and test prompts. This also helps evaluate consistency when repeating tests with identical inputs. For example, a first run might score 12/12, while a second run scores 10/12. Some models perform better consistently. Therefore, we see vendors and firms using S3 evals as an MCP service or tool call to verify results. S3 provides essential infrastructure for model stability in production, making legal AI reliable.
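The repeat-run consistency check Blyd describes could be summarised like this. The run scores below are fabricated placeholders, and the report fields are my own illustrative choices:

```python
# Sketch of the repeat-run consistency check described above: run the same
# fixed question set several times and compare scores across runs.
# The scores used here are fabricated placeholders, not real model results.
def consistency(scores: list[int], total: int) -> dict:
    return {
        "runs": [f"{s}/{total}" for s in scores],
        "stable": len(set(scores)) == 1,  # same score on every run?
        "worst": f"{min(scores)}/{total}",
    }

report = consistency([12, 10, 12], total=12)
print(report)
```

Tracking the worst run alongside stability matters here: a model that alternates between 12/12 and 10/12 on identical inputs is exactly the kind of production risk the article says S3 is meant to expose.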

Which models have you tested?

We tested DeepSeek R1 0528 on Dutch and Jordanian law, in the Dutch and Arabic languages respectively. The Arabic legal test was run in Egypt to help create a new tool for judges. S3 allows us to test any model, in any language, for any legal jurisdiction.

What data are you testing on?

We generate our tests using local legislative texts and case law databases.

If you’re testing for citation accuracy, which law library do you use for comparison?

Currently, we are not testing citation formats. Our tests are limited to verifying whether a case reference number correctly matches the name of the case. In those cases, S3 tests will have to rely on the client’s access to case law databases. However, we see opportunities to add citation format checks as an additional S3 rating.

How can you measure accuracy in more subjective areas, such as drafting and redlining?

In short, we do not currently test subjective areas. We do not believe drafting and redlining can be measured objectively unless approached from the perspective of a judge. Each party usually wants to strengthen its own arguments in a matter or contract negotiation. In litigation, this may well be the cause of incorrect citations in court filings. That said, understanding these conditions allows us to create custom evals for specific use cases.

Special thanks to Emma Kelly and Khrizelle Lascano for their key contributions. We invite the legal community to help build a more reliable future for legal AI. If you are a legal expert, vendor or law firm, connect via Emma@legalcomplex.com.

You can see more about S3 here on GitHub.

Legal Innovators New York and UK Conferences – Both in November ’25

If you want to stay ahead of the legal curve… then come to Legal Innovators New York, November 19 + 20, where the brightest minds will share their insights on where we are now and where we are going.

And also, Legal Innovators UK – November 4 + 5 + 6.

Both events, as always, are organized by the wonderful Cosmonauts team!

Please contact them if you wish to attend.

