15 March 2021
Considerable research has been done on the Natural Language Processing (NLP) of generic texts such as blogs and works of literature. NLP is a branch of AI in which algorithms interpret written text and spoken language. Existing models are not entirely suitable for legal texts such as contracts, because they have simply not been optimised for this purpose, at least not yet. Still, they can be of much use here. Even today, barristers and solicitors do a great deal of old-fashioned ‘manual work’ when they research their cases, for example when searching for relevant sections of statutory law and pertinent case law.
Rossi’s research, under the supervision of Prof. Evangelos Kanoulas (Faculty of Science, Informatics Institute), began in 2018. He is using BERT, an NLP model launched in that year, as his point of departure. What characterises such models is their ‘pre-training’ on huge amounts of text data. The absence of legal documents in that data means the models are not quite ready for the intended purpose. However, developing a model from scratch himself is not an option. ‘Reproducing the pre-training phase requires deep financial pockets. Just think: the pre-training of BERT cost the creators about 8,000 euros; OpenAI’s GPT-2, a more powerful model, cost as much as 250,000 euros; and GPT-3 will probably cost even more. You’re now talking about hundreds of billions of parameters. What’s nice about these models is that you can use them free of charge and then think about how to improve them. And this doesn’t only happen with legal texts; there are also projects involving medical texts that we’re following with great interest.’
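Using a published model free of charge and then improving it typically means continuing its training on domain-specific text. The sketch below illustrates that idea with the Hugging Face transformers and datasets libraries and a hypothetical legal_corpus.txt file; neither the libraries nor the file is named in the article, so treat it as an assumption about how such a step could look, not as Rossi’s actual pipeline.

```python
# Sketch only: adapt a publicly available pre-trained BERT to legal text by
# continuing its masked-language-modelling training on a domain corpus.
# "legal_corpus.txt" is a hypothetical plain-text file of legal documents.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

dataset = load_dataset("text", data_files={"train": "legal_corpus.txt"})

def tokenize(batch):
    # BERT-style models accept only a limited number of tokens per input.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens, the same objective used in BERT's pre-training.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)

args = TrainingArguments(output_dir="legal-bert-adapted",
                         num_train_epochs=1,
                         per_device_train_batch_size=8)

Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```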
Powerful NLP models such as BERT and OpenAI’s GPT-3 cannot simply be applied to this work as they are. As Rossi explains: ‘There are several challenges. Even if the texts in the data sets used to train the models are written in the same language as the legal documents, legal language is still different. These documents exhibit long sentences, complex structures and words that may have different connotations or that aren’t even used anywhere else. And the meaning of sentences depends on their context.’
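The vocabulary problem Rossi describes is easy to demonstrate. In the small illustration below, a general-purpose BERT tokenizer, built from everyday English, is applied to a sentence containing legal terminology; the example sentences are ours, not the article’s.

```python
# Illustration: a general-purpose BERT tokenizer tends to fragment rare legal
# terms into several word pieces, whereas everyday words stay intact, which
# makes domain-specific vocabulary harder for the model to represent.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("The court heard the case last week."))
print(tokenizer.tokenize("The appellant seeks certiorari on the estoppel issue."))
# Terms such as 'certiorari' or 'estoppel' are unlikely to appear in the
# tokenizer's vocabulary and are therefore split into multiple sub-word pieces.
```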
Rossi is training his model on additional legal texts, including judgments by Canadian courts. And why not the Dutch courts? 'English is a language spoken by many more people and the available data sets are much larger. For a new section of the research, I also use a comprehensive data set of all court decisions in the United States over a period of several years. It’s rough material that we need to prepare first. Having done so, we’ll try to find interesting problems that we can solve with the data and then we’ll see how we can translate this into a working product.'
The challenges Rossi runs into are sometimes unexpected. ‘BERT, for instance, can only process texts of around 500 tokens at most. But the bulk of the documents we work with have a word count at least five times as high. You then have to split a text into parts but, at the same time, you’re dealing with an overall context. So all the results uncovered about one part need to take into account any findings about the other parts. And you also need to be realistic about what you want to achieve. The fact is that, whenever you double the amount of data, you also double the amount of time it takes to process the data. Even with access to specialised hardware, not everything is possible overnight. Some calculations would take days, months or even centuries.’
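One common workaround for this length limit, sketched below under the assumption of the Hugging Face transformers library, is to split a long document into overlapping windows, encode each window separately and then pool the per-window results into a single document-level representation. The article does not describe Rossi’s actual aggregation method, so the mean pooling here is purely illustrative.

```python
# Sketch only: handle a document far longer than BERT's input limit by
# encoding overlapping 512-token windows and pooling the results.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def encode_long_document(text: str) -> torch.Tensor:
    # Split the text into overlapping windows; the 128-token stride keeps
    # some shared context between neighbouring chunks.
    enc = tokenizer(text,
                    truncation=True,
                    max_length=512,
                    stride=128,
                    return_overflowing_tokens=True,
                    padding=True,
                    return_tensors="pt")
    with torch.no_grad():
        out = model(input_ids=enc["input_ids"],
                    attention_mask=enc["attention_mask"])
    # Average each chunk's [CLS] vector into one document embedding.
    # Real systems use more careful aggregation; this only shows the
    # split-then-combine idea Rossi mentions.
    return out.last_hidden_state[:, 0, :].mean(dim=0)

document = "The parties entered into an agreement ... " * 500  # long text
print(encode_long_document(document).shape)  # torch.Size([768])
```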
Rossi’s own background is in software development and IT management. One of the things he was involved in was the development of the OV-chipkaart (public transport chip card). After completing an MBA course on big data at the Amsterdam Business School, he was invited to teach there and simultaneously do research on AI. ‘My goal from the outset was to work with NLP. Moreover, I considered it a challenge to work with legal texts because there was still so much to be gained in this area. We’ve collaborated with the UvA law faculty (Faculteit der Rechten), among others. Case law retrieval, i.e. sifting through documents of relevant cases to reinforce your own arguments, is an extremely time-consuming task for lawyers all over the world. That’s an area where we can add considerable value with a model like this.’
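To make the retrieval task concrete: a simple baseline, sketched below under assumptions of our own (the bert-base-uncased encoder, cosine similarity and invented example texts, none of which come from the article), embeds a query and a set of candidate judgments with the same model and ranks the judgments by similarity. In practice a model fine-tuned for retrieval on legal text would be used; the raw encoder here only shows the mechanics.

```python
# Sketch only: rank candidate judgments by their similarity to a query,
# a simple baseline for the case law retrieval task described above.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(text: str) -> torch.Tensor:
    # Truncate to the model's input limit and return the [CLS] vector.
    enc = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    return out.last_hidden_state[0, 0, :]

query = "Employer liability for negligence of an independent contractor."
candidates = {  # invented snippets standing in for full judgments
    "case_a": "The court considered the employer's vicarious liability ...",
    "case_b": "A dispute over the interpretation of a shipping contract ...",
}

query_vec = embed(query)
ranked = sorted(candidates,
                key=lambda name: torch.cosine_similarity(
                    query_vec, embed(candidates[name]), dim=0).item(),
                reverse=True)
print(ranked)  # candidate cases ordered from most to least similar
```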