BlackCat Solutions (for Thomson Reuters)

ETL project for document deduplication

Project Synopsis

Thomson Reuters had an urgent requirement for a new web application to map duplicative documents between the legal and risk sides of the business, so that duplicates could be found and removed ahead of an upcoming release of a client-facing system. De-duplicating this content was critical to that release.

The Challenge

BlackCat Solutions, the sole technical partner to Thomson Reuters, engaged Yobibyte Solutions to assist with the de-duplication of documents by way of a new web application. This application had to map duplicative documents between the legal and risk sides of the business across more than 89,000 documents, so that editors could remove the duplicates from the system.

The web application had to automatically identify as many of the duplicative documents as possible to reduce the manual effort required of the editors. Identifying duplicates manually across 89,000-odd documents would have been a very time-consuming process; the more that could be automated, the quicker and more cost-effective the exercise would be.

Finding duplicate content was never going to be easy, owing to differences in document contents, structure and metadata between the different types of documents. Many of the documents were also versioned, and to complicate matters further, the effective start/end dates of the documents on the risk and legal sides were often not aligned.

What was required was a flexible application in which the matching algorithm could be configured for each document set.

The Process

Working as part of a small Agile (Scrum) team, we began by understanding the initial requirements of the system and populating the backlog with the first stories. As this was a greenfield project, we started with a skeleton of the application and built up from the core building blocks.

Yobibyte Solutions drew on previous experience with the Quartz Scheduler to quickly add support for scheduling matching jobs. These jobs were responsible for finding duplicate documents through a matching process, along the lines of the sketch below.
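For illustration only, here is a minimal sketch of how such a matching job might be scheduled with Quartz. The MatchingJob class, the "documentSet" job data key and the cron expression are assumptions made for the example, not the production code.

import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

// Illustrative Quartz job: the real matching logic is represented by a placeholder.
public class MatchingJob implements Job {

    @Override
    public void execute(JobExecutionContext context) {
        // The document set to process is passed in via the job data map (assumed key).
        String documentSet = context.getMergedJobDataMap().getString("documentSet");
        // Placeholder: run the duplicate-matching process for this document set.
        System.out.println("Running matching for document set: " + documentSet);
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(MatchingJob.class)
                .withIdentity("riskLegalMatch", "matching")
                .usingJobData("documentSet", "risk-legal")
                .build();

        // Run the matching job every night at 2am (illustrative schedule).
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("nightlyMatch", "matching")
                .withSchedule(CronScheduleBuilder.cronSchedule("0 0 2 * * ?"))
                .build();

        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}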

Working with the product owner at Thomson Reuters, we added functionality to the matching process to support the first document set. Once it was understood how duplicate documents could be found, we were able to create the basics of a framework supporting different algorithms for different document sets.
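To illustrate the idea (this is a sketch under assumed names and types, not the actual codebase), such a framework might register one matching algorithm per document set behind a common interface, with a hypothetical "practice-notes" set shown as an example:

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified representation of a document and its metadata.
record Document(String id, String title, Map<String, String> metadata, String content) {}

// Each document set gets its own matching strategy.
interface MatchingAlgorithm {
    // Returns the id of the legal-side duplicate of the given risk-side document, if one is found.
    Optional<String> findDuplicate(Document riskDocument, Collection<Document> legalDocuments);
}

// Registry that routes each document set to its configured algorithm.
class MatchingFramework {
    private final Map<String, MatchingAlgorithm> algorithmsByDocumentSet = new HashMap<>();

    void register(String documentSet, MatchingAlgorithm algorithm) {
        algorithmsByDocumentSet.put(documentSet, algorithm);
    }

    Optional<String> match(String documentSet, Document riskDocument, Collection<Document> legalDocuments) {
        MatchingAlgorithm algorithm = algorithmsByDocumentSet.get(documentSet);
        if (algorithm == null) {
            throw new IllegalArgumentException("No matching algorithm registered for " + documentSet);
        }
        return algorithm.findDuplicate(riskDocument, legalDocuments);
    }
}

// Example wiring: a hypothetical "practice-notes" set matched on title alone.
class Example {
    public static void main(String[] args) {
        MatchingFramework framework = new MatchingFramework();
        framework.register("practice-notes", (risk, legals) -> legals.stream()
                .filter(legal -> legal.title().equalsIgnoreCase(risk.title()))
                .map(Document::id)
                .findFirst());
    }
}

Keeping the algorithm behind a single interface like this is what allows a new document set, with its own metadata and content quirks, to be supported without reworking the rest of the application.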

Yobibyte Solutions became responsible for most of the back-end design and development. As the sprints progressed and more document sets were brought on board, the matching side of the application was refactored to handle the ever-growing complexity of matching different document sets using different metadata and document contents.

The Success

Yobibyte Solutions was part of a small team recognised for making quick, productive progress and delivering a web application that solved the business problem. The application was delivered to the editors ahead of the release of the client-facing system, giving them time to remove the duplicate documents.

The result?

Let’s take a look at the numbers…

The matching algorithms proved flexible enough to support all the required document sets, giving an overall auto-match rate of 86% across the 89,180 documents.

Even that figure undersells the result. Some of the documents were table-of-contents documents with no equivalent on the legal side, so there was never a duplicate to find, meaning the true auto-match rate was higher than 86%. Unfortunately, precise statistics for these table-of-contents documents are not available, as identifying them is a manual process.

Tech Stack