As a software provider in the mortgage and loan sector, we started investigating automated document processing techniques more than ten years ago. We have come a long way, and data extraction techniques are now incorporated in most of our solutions. Below is the story of our quest for the ultimate data extraction quality from digitized documents.
When you apply for a mortgage, banks require many supporting documents to prove your income, identity, status and more: for example ID cards, passports, salary slips, bank statements, and various contracts. In the old-fashioned mortgage origination process, the documents were reviewed and validated, and the data was processed manually. These processes took days. Banks often used external companies for document processing, which is costly and error-prone. We wanted to make the process more effective: not only through digitization of scanned documents and automated extraction of the required data, but also by validating, cross-checking, and applying bank-specific rules and scoring to guarantee low fallout and incomparably quicker access to bank products.
Ten years ago, we started to search the market for tools for automated data extraction. We found that international companies used big scanner machines (like Kofax) to extract data from scanned documents. These scanners were bundled with software for Optical Character Recognition (OCR); the text extraction was then followed by template-based scanning. These scanners were able to extract text from scanned documents at a reasonable quality because they had low-level access to the scanning hardware: the OCR results were acceptable only because the OCR engine knew exactly how the scanner and the scanning configuration were physically set up. So the first bottleneck was that the extraction quality was only acceptable for documents scanned on a particular scanner machine, with software developed for it.
We couldn’t guarantee such laboratory conditions – our documents came from various sources (home scanners of various brands, photos, etc.) in varying quality, formats, DPI, and so on. And there was still the additional step of extracting specific data out of the extracted text.
This revealed another bottleneck of the existing software packages: they used template-based data extraction techniques and lacked any intelligence under the hood. Specifying the exact positions of the data could not handle document types such as salary slips, because almost every company uses its own layout. There were some approaches to minimize the number of templates, but for our scenario this would have led to an extreme initial effort and an unmaintainable number of templates. We needed to understand the document structure beyond the layout and extract complex data structures.
We needed to find a way to automatically extract structured data from unstructured documents, while the documents come from thousands of different sources and were digitized in various ways, formats and quality. Furthermore, documents are often scanned in batches, in which a single file contains multiple documents. In other words, we have to handle dozens of document types in dozens of variations, and on top of that a number of other challenges were still to come.
With regard to document processing automation, it is important to realize that it is better to process 70% of the documents with 100% data extraction quality than to process 100% of the documents with 99% data extraction quality. For example, with 100 documents to process, you have to manually review 30 documents in the first case, but all 100 in the second case, because any of them might contain an error.
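This trade-off can be made concrete with a small calculation (a sketch; the function name and parameters are illustrative, not part of any real system):

```python
def documents_to_review(total, automated_fraction, extraction_accuracy):
    """Number of documents a human must still look at.

    If extraction on the automated documents is not guaranteed correct,
    every document must be reviewed; otherwise only the fallout.
    """
    if extraction_accuracy < 1.0:
        return total  # any document may contain an error
    return round(total * (1 - automated_fraction))

# 70% processed with 100% quality: only the 30% fallout needs review
print(documents_to_review(100, 0.70, 1.00))  # 30
# 100% processed with 99% quality: every document must be checked
print(documents_to_review(100, 1.00, 0.99))  # 100
```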
We started to use the ABBYY OCR engine. ABBYY was able to extract text with good quality from documents regardless of their source or format. In addition to the extracted text, we could also access metadata about the extracted text and the extraction itself, such as character positions, text blocks, format, font, etc. We used this metadata to create a model of the extracted document, so we could start to work with structures like lines, columns and text blocks. But we needed more: a tool that can extract data based on configuration and dynamically adapt to layout mutations of a specific document type. This couldn’t be a simple template-based tool.
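To illustrate what such a document model looks like, here is a minimal sketch that groups OCR characters into lines using their positions. It assumes a simplified metadata shape (just a character with x/y coordinates); the real ABBYY metadata is much richer:

```python
from dataclasses import dataclass

@dataclass
class Char:
    text: str
    x: int   # horizontal position on the page
    y: int   # vertical position (baseline)

def build_lines(chars, y_tolerance=5):
    """Group OCR characters into text lines by their vertical position."""
    lines = {}
    for ch in sorted(chars, key=lambda c: (c.y, c.x)):
        # attach to an existing line whose baseline is close enough
        key = next((y for y in lines if abs(y - ch.y) <= y_tolerance), ch.y)
        lines.setdefault(key, []).append(ch)
    # within each line, order characters left to right
    return ["".join(c.text for c in sorted(cs, key=lambda c: c.x))
            for _, cs in sorted(lines.items())]

chars = [Char("H", 10, 100), Char("i", 22, 102),
         Char("O", 10, 140), Char("K", 24, 139)]
print(build_lines(chars))  # ['Hi', 'OK']
```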
We started a project called DTA (Document Text Analyzer): a configurable, pluggable engine that we can use in our solutions as a black box, feeding it documents of various types, formats and quality. DTA then returns extracted, structured, validated data of high quality.
Data extraction consists of multiple steps; calling ABBYY OCR is just one of them. Other steps either precede the data extraction (document splitting, normalization, enhancement, classification, etc.) or follow it (such as scoring of the data extraction quality).
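The chain of steps can be sketched as a simple pipeline. The step functions below are illustrative stand-ins, not the actual DTA interfaces:

```python
def run_pipeline(document, steps):
    """Pass a document through an ordered chain of processing steps."""
    for step in steps:
        document = step(document)
    return document

def split(doc):      # one uploaded file may hold several documents
    return {"pages": doc["pages"]}

def enhance(doc):    # graphic pre-processing before OCR
    doc["pages"] = [p.strip() for p in doc["pages"]]
    return doc

def classify(doc):   # pick the configuration for this document type
    doc["type"] = "salary_slip" if "salary" in doc["pages"][0] else "unknown"
    return doc

def score(doc):      # score the extraction quality afterwards
    doc["score"] = 1.0 if doc["type"] != "unknown" else 0.0
    return doc

result = run_pipeline({"pages": [" salary slip page "]},
                      [split, enhance, classify, score])
print(result["type"], result["score"])  # salary_slip 1.0
```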
The basis for good data extraction is a good configuration for a specific document type. Currently we use a DTA-specific language to configure document types. As there are multiple configurations, document type classification becomes a crucial step: DTA has to decide quickly what kind of document is being processed, and then use the most suitable configuration for the data extraction. We started out with rule-based classification. Although that worked very well for a limited number of document types, maintaining the rules became a burden as the number of document types increased. Therefore, we introduced classification based on machine learning, where we train a classification model by presenting it with a number of manually classified sample documents. Once the training is done, the classifier recognizes the various document types on its own, maximizing the rate of properly classified documents.
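As a sketch of learning-based classification from labeled samples, here is a tiny multinomial naive Bayes classifier over document words. This is a simplification for illustration, not the production classifier:

```python
import math
from collections import Counter, defaultdict

class DocumentClassifier:
    """Naive Bayes text classifier trained on labeled sample documents."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # per document type
        self.doc_counts = Counter()              # samples per type

    def train(self, doc_type, text):
        self.doc_counts[doc_type] += 1
        self.word_counts[doc_type].update(text.lower().split())

    def classify(self, text):
        words = text.lower().split()
        vocab = len({w for c in self.word_counts.values() for w in c})

        def log_prob(doc_type):
            counts = self.word_counts[doc_type]
            total = sum(counts.values())
            score = math.log(self.doc_counts[doc_type] /
                             sum(self.doc_counts.values()))
            for w in words:
                # Laplace smoothing so unseen words don't zero the score
                score += math.log((counts[w] + 1) / (total + vocab))
            return score

        return max(self.doc_counts, key=log_prob)

clf = DocumentClassifier()
clf.train("salary_slip", "gross salary net salary tax employer")
clf.train("bank_statement", "balance debit credit account iban")
print(clf.classify("net salary and tax for this month"))  # salary_slip
```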
For example, one of the things we used to boost the number of successfully processed documents is the fact that we often have to process multiple documents referring to the same persons or companies. This means that during extraction we build up knowledge about the entity we extract data for, and can feed it back to the engine. Among other things, we can give it hints about what we are looking for, which can be used to resolve OCR errors or correct scores for extracted data based on defined fuzzy logic and thresholds. For example, when we expect ‘Stoop’ as a last name but an OCR mistake makes the engine find something like ‘Steop’, fuzzy logic can tell us whether it is close enough to be considered a correctly matching last name.
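One common way to implement such a fuzzy comparison is edit distance; the sketch below uses Levenshtein distance with a similarity threshold (the threshold value is an assumption for illustration):

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_match(expected, found, threshold=0.8):
    """Accept the OCR result when it is 'close enough' to the hint."""
    similarity = 1 - levenshtein(expected, found) / max(len(expected), len(found))
    return similarity >= threshold

print(fuzzy_match("Stoop", "Steop"))   # True (one substitution, similarity 0.8)
print(fuzzy_match("Stoop", "Miller"))  # False
```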
We also developed many different recognizers and validators to localize and verify extracted data. These can be used in the DTA configuration language to simplify configurations and boost the output data quality. There are multiple ways to define the location of particular data to be extracted, from absolute positions (textual or graphical, as in the template-based approaches) down to more relative approaches: fuzzy or machine learning techniques, localization by regexes, labels, dictionaries, mathematical formulas, Coala***, etc. This way, a single configuration can handle even document types that exist in hundreds of layout variations, guaranteeing high data extraction quality even for documents we encounter for the first time.
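To make the recognizer/validator idea concrete, here is a hypothetical pair for dates in `dd-mm-yyyy` format: a regex recognizer locates candidates anywhere in the text, and a validator rejects OCR artefacts. The real DTA building blocks and configuration language differ:

```python
import re
from datetime import datetime

# Recognizer: locate candidate dates (dd-mm-yyyy) anywhere in the text
DATE_PATTERN = re.compile(r"\b(\d{2})-(\d{2})-(\d{4})\b")

def recognize_dates(text):
    return [m.group(0) for m in DATE_PATTERN.finditer(text)]

def validate_date(candidate):
    """Validator: reject OCR artefacts such as '45-13-2016'."""
    try:
        datetime.strptime(candidate, "%d-%m-%Y")
        return True
    except ValueError:
        return False

text = "Employed since 01-03-2015, payslip issued 45-13-2016."
valid = [c for c in recognize_dates(text) if validate_date(c)]
print(valid)  # ['01-03-2015']
```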
***Coala is a model-based extractor for data in natural language documents, such as contracts. It uses machine learning techniques. By marking data as interesting during the learning phase, the system creates models which can then be used in the extraction phase to locate the same data, by applying the models to the currently processed document and searching for the best match.
There were still a number of factors in the real world we had to deal with in order to make our solution usable:
When we needed to tackle the image quality problem, we searched the market for a suitable product that could enhance image quality for OCR. There were a few, but they were again tightly coupled to specific scanning hardware. So we started to develop our own graphic image pre-processing. It uses a number of graphical image manipulations and smart filters which, in combination with statistical models and machine learning algorithms, help us graphically classify an image; perform smart cropping, deskewing and auto-rotation; clean up backgrounds, watermarks, holograms and pictures; and enhance the text. It is used as a pre-processor plugin before the OCR step for documents of poor quality.
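As a minimal flavor of such pre-processing, the sketch below binarizes a grayscale page with a global mean threshold so faint backgrounds and watermarks drop out before OCR. Real pre-processing uses far smarter, adaptive filters; this is only an illustration of the principle:

```python
def binarize(pixels):
    """Binarize a grayscale image (rows of 0-255 values) with a
    global mean threshold: dark text -> 0, light background -> 255."""
    flat = [p for row in pixels for p in row]
    threshold = sum(flat) / len(flat)
    return [[0 if p < threshold else 255 for p in row] for row in pixels]

# dark text (30) on a noisy light background (200-230)
page = [[200, 30, 220],
        [30, 230, 30],
        [210, 30, 200]]
print(binarize(page))  # [[255, 0, 255], [0, 255, 0], [255, 0, 255]]
```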
A challenge we took on just recently is processing files that contain multiple business documents, in which even the pages are often in random order. Think of a mortgage consultant who saves the salary slip and the driving license of his customer in one file, and also wants to extract information from this file in one go. Until recently this meant a lot of manual work; now it can be done automatically. We developed a program which is able (with a certain reliability) to decide which pages belong to each business document and in what order they should be.
The machine learning module automatically recognizes the type of an incoming document. It can do so because it has seen many examples of these document types and has learned by itself which words best distinguish between them. We wanted to extend this principle to the automatic reorganization of dossiers.
The first attempt was to determine the document type for each page, and then to assume that groups of adjacent pages with the same document type belong to the same document. That was a good start, but it had two major disadvantages: (1) pages in the incorrect order kept that incorrect order, and (2) if a document followed another document of the same type, it was seen as part of the previous document.
The next attempt was to teach the machine learning application to recognize the page number. We did this by showing it examples of first pages, second pages, third pages, etc., so it could get an idea of what a ‘second page’ typically looks like. This went better than expected; of course it used the page numbers in the corners of the pages, but there turned out to be many other useful clues on the documents. For example, the presence of ‘<<<<<’ always indicates a second page, because it is used on the back side of identity cards. With this page number approach we don’t have the challenges of the document type approach, but it does introduce a few new ones. For instance, if you find a page 1, a page 2, and then a page 1 again, does that page 2 belong to the document before it or the one after it?
In the end, a hybrid of both approaches turned out to work best: for every page, both the document type and the page number are determined. Next, we glue together consecutive pages with the same document type but different, increasing page numbers. If we then count in how many of the places where there should have been a cut between documents an actual cut was made, we see that this is the case for 86% of them, and 98% of all cuts made are correct! These experiments show this is a promising method to recognize and reorganize dossiers, preventing both mistakes and manual work.
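The gluing step of the hybrid approach can be sketched as follows, given a per-page prediction of document type and page number. This is a simplification; the real system also has to weigh prediction confidence and reorder shuffled pages:

```python
def group_pages(pages):
    """Split a list of (doc_type, page_number) predictions into documents:
    a page continues the current document only if it has the same type
    and a higher page number; otherwise a cut is made."""
    documents = []
    for doc_type, page_no in pages:
        if documents:
            last_type, last_no = documents[-1][-1]
            if doc_type == last_type and page_no > last_no:
                documents[-1].append((doc_type, page_no))
                continue
        documents.append([(doc_type, page_no)])  # start a new document
    return documents

pages = [("salary_slip", 1), ("salary_slip", 2),
         ("salary_slip", 1),               # same type again -> a cut
         ("id_card", 1), ("id_card", 2)]
print(len(group_pages(pages)))  # 3 documents
```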
We constantly search for ways to improve the developed configurations to get the best possible extraction results. At the moment, we have a thesis project on signature recognition with the University of Žilina (Slovakia).
We are aware of the fact that we can improve the extracted data results only to a certain level; data extraction quality will never be 100%. Nowadays, documents are increasingly digitized, and processing software for digitized data (also provided by Davinci) will be needed more and more. But there will always be a need for OCR-based extraction techniques.

Published: 16 Nov 2016, 17:23