TableNet: A Deep Learning Model

Bharanish Kumar
8 min read · Apr 27, 2021


For end-to-end table detection and tabular data extraction from scanned document images

With the surge in Deep Learning research, and its success in achieving high accuracy when trained on large amounts of data, Deep Learning models are gaining more and more popularity across applications such as healthcare, visual recognition, self-driving cars, and beyond.

In this blog, we look at one such application: using the Deep Learning model TableNet for table detection and tabular data extraction from scanned document images.

Table of Contents:

  1. Business/Real-world problem
  2. Problem Statement
  3. Source of data and data Overview
  4. Mapping to DL problem
  5. Business Constraints and Performance metrics
  6. Exploratory Data Analysis
  7. Data Preparation and Preprocessing
  8. Model explanation
  9. Tabular Data Extraction
  10. Deployment
  11. Future Work
  12. References

1. Business problem:

Mobile phones and other electronic devices are now widely used to photograph and upload documents such as insurance claims and financial invoices. This makes it essential to detect the underlying table and extract the tabular information accurately in an automated fashion, both to reduce the human errors that easily occur in manual processing of these documents and to reduce labor costs. The TableNet model performs both tasks, detecting the table in the document and identifying the table structure, via a single end-to-end model.

2. Problem Statement:

This problem is taken from the published research work on table detection: https://arxiv.org/pdf/2001.01469.pdf. The task is to detect the table present in a scanned document image and extract the tabular information from it.

3. Source of Data and Data overview:

We train our model on the Marmot dataset, for which we have annotations for the table and table-column masks. Data source: https://drive.google.com/drive/folders/1QZiv5RKe3xlOBdTzuTVuYRxixemVIODp

3.1 Data overview:

The Marmot dataset contains 509 document images in BMP format, each containing a table. Of these, 494 images have annotations for the table and table columns/rows in XML format.

4. Mapping to DL problem:

The problem we are solving is table detection and the extraction of table information where a table exists. We can frame this as a pixel-wise classification (semantic segmentation) problem: for each pixel of the input image, we predict whether or not it belongs to a table (and, likewise, to a table column).

5. Business Constraints and Performance metrics:

There is no strict low-latency requirement; extracting the information from the tables can take a few seconds. The cost of making errors is high, i.e., information extraction needs to be accurate. Interpretability has no significance in this case.

We use the classification metrics Precision, Recall, and F1-score, computed pixel-wise over the predicted masks, as the performance metrics for our model.
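As a rough sketch, these pixel-wise scores for a predicted binary mask can be computed with NumPy as follows (the thresholding and averaging conventions here are illustrative):

```python
import numpy as np

def mask_scores(pred_mask, true_mask):
    """Pixel-wise precision, recall, and F1 for a binary mask,
    treating 'table pixel' as the positive class."""
    pred = pred_mask.astype(bool).ravel()
    true = true_mask.astype(bool).ravel()
    tp = np.sum(pred & true)                     # true positives
    precision = tp / max(pred.sum(), 1)          # of predicted table pixels
    recall = tp / max(true.sum(), 1)             # of actual table pixels
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f1
```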

6. Exploratory Data Analysis:

The dataset we are working on is the Marmot dataset, which contains scanned document images of different sizes with tables in different layouts. We are also provided with annotations for the table columns of each image in XML format; from these we can generate the segmentation masks for the table and columns of each image.

6.1. Data overview:

Let us first check the original image. This is a random image taken from the Marmot dataset.

Original Image with table present in it
Table annotation in xml format of the above Image

From the XML annotation for each image, which gives the table coordinates (xmin, ymin, xmax, ymax), we map the coordinates onto the image to extract the masks for the table and its columns.

6.2. Annotation of the table coordinates and mapping to Images:

First, we write a function that takes an XML file name as its argument, converts the annotation into a JSON-like structure, and returns the width, height, and table coordinates (along with the column coordinates) for the image.
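A minimal sketch of such a parser, assuming VOC-style XML with `<object>`/`<bndbox>` tags (which the coordinate names above suggest); the exact tag names may differ in the actual annotation files:

```python
import xml.etree.ElementTree as ET

def parse_annotation(xml_path):
    """Parse one annotation file and return the image size plus the
    bounding boxes of the table(s) and columns it contains."""
    root = ET.parse(xml_path).getroot()
    width = int(root.find("size/width").text)
    height = int(root.find("size/height").text)
    table_boxes, column_boxes = [], []
    for obj in root.findall("object"):
        name = obj.find("name").text.lower()
        box = obj.find("bndbox")
        coords = [int(float(box.find(tag).text))
                  for tag in ("xmin", "ymin", "xmax", "ymax")]
        (table_boxes if name == "table" else column_boxes).append(coords)
    return width, height, table_boxes, column_boxes
```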

Then, by calling the above function for each XML file whose corresponding image is present, we store the returned width, height, and table coordinates in a DataFrame.
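A sketch of that loop; the file paths and DataFrame column names here are illustrative:

```python
import glob
import os
import pandas as pd

rows = []
for xml_path in glob.glob("Marmot_data/*.xml"):
    img_path = xml_path.replace(".xml", ".bmp")
    if not os.path.exists(img_path):      # keep only annotated images
        continue
    w, h, tables, columns = parse_annotation(xml_path)
    rows.append({"image": img_path, "width": w, "height": h,
                 "table_boxes": tables, "column_boxes": columns})

df = pd.DataFrame(rows)
```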

6.3. Checking that the annotation of table coordinates and mapping to images is performed correctly:

Here we resize each image to 1024 x 1024, since the proposed TableNet model uses a fixed input image size of 1024 x 1024.

Original image display plotted using pyplot.

Once we have resized the image, we adjust the table coordinates to the modified size and display the generated table and column masks for the image.

Table mask and Column mask

6.4. Generating table and table-column masks and saving them.

Now, for each image, we generate the table and table-column masks from its annotation's table coordinates and save them.
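A minimal sketch of the mask generation, assuming the boxes come from the parser above; the output file layout is illustrative:

```python
import numpy as np
from PIL import Image

TARGET = 1024  # TableNet's fixed input size

def make_masks(width, height, table_boxes, column_boxes):
    """Rasterize bounding boxes into binary table/column masks,
    rescaled from the original image size to TARGET x TARGET."""
    table_mask = np.zeros((TARGET, TARGET), dtype=np.uint8)
    column_mask = np.zeros((TARGET, TARGET), dtype=np.uint8)
    sx, sy = TARGET / width, TARGET / height
    for xmin, ymin, xmax, ymax in table_boxes:
        table_mask[int(ymin * sy):int(ymax * sy),
                   int(xmin * sx):int(xmax * sx)] = 255
    for xmin, ymin, xmax, ymax in column_boxes:
        column_mask[int(ymin * sy):int(ymax * sy),
                    int(xmin * sx):int(xmax * sx)] = 255
    return table_mask, column_mask

# For one image parsed earlier (w, h, tables, columns), save the mask pair.
t_mask, c_mask = make_masks(w, h, tables, columns)
Image.fromarray(t_mask).save("masks/sample_table_mask.png")
Image.fromarray(c_mask).save("masks/sample_column_mask.png")
```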

7. Data Preparation and Preprocessing:

7.1. Train-Test data preparation:

We take the first 80% of the data points as training data for the model and validate its performance on the remaining 20%.

7.2. Data Preparation:

We prepare our train and test data using a tf.data pipeline, in which we transform, preprocess, and batch the data in one place. We also repeat the dataset indefinitely using repeat and apply prefetch to load data onto the CPU/GPU ahead of time. The preprocessing performed on each image is: a) convert the pixel values to floats in [0, 1], and b) resize the image to 1024 x 1024.
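A sketch of such a pipeline, assuming train_imgs, train_table_masks, and train_column_masks are lists of file paths prepared earlier:

```python
import tensorflow as tf

IMG_SIZE = 1024

def load_example(img_path, table_mask_path, column_mask_path):
    def load(path, channels):
        # a) decode and convert pixel values to floats in [0, 1]
        img = tf.io.decode_image(tf.io.read_file(path), channels=channels,
                                 expand_animations=False)
        img = tf.image.convert_image_dtype(img, tf.float32)
        # b) resize to the model's fixed 1024 x 1024 input
        return tf.image.resize(img, [IMG_SIZE, IMG_SIZE])
    return load(img_path, 3), (load(table_mask_path, 1),
                               load(column_mask_path, 1))

train_dataset = (
    tf.data.Dataset.from_tensor_slices(
        (train_imgs, train_table_masks, train_column_masks))
    .map(load_example, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(2)                     # 1024 x 1024 inputs keep batches small
    .repeat()                     # repeat indefinitely, as described above
    .prefetch(tf.data.AUTOTUNE))  # overlap data loading with training
```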

8. Model Explanation:

TableNet, the end-to-end deep learning model proposed in the paper, exploits the inherent interdependence between the twin tasks of table detection and table structure identification. The model uses a base network initialized with pre-trained VGG-19 features, followed by two decoder branches: 1) segmentation of the table region, and 2) segmentation of the columns within a table region.

TableNet architecture

8.1. Using VGG19 as a base network:

The TableNet model, as shown in the figure above, uses a pre-trained VGG-19 network as its base. The fully connected layers of VGG-19 (the layers after pool5) are replaced with two (1x1) convolution layers. Each of these convolution layers (conv6) uses ReLU activation followed by a dropout layer. Two decoder branches are appended after this layer, and the output of the (conv6 + dropout) layer is fed to both of them. In the table branch of the decoder network, an additional (1x1) convolution layer, conv7 table, is used before a series of fractionally strided convolution layers upscale the image. The output of the conv7 table layer is up-scaled using fractionally strided convolutions and appended to the pool4 pooling layer of the same dimension. Similarly, the combined feature map is again up-scaled and the pool3 pooling layer is appended to it. Finally, the feature map is upscaled to the original image dimensions.

In the other branch, which detects columns, there is an additional convolution layer (conv7 column) with ReLU activation and a dropout layer with the same dropout probability. After a (1x1) convolution layer (conv8 column), the feature maps are up-sampled using fractionally strided convolutions. The up-sampled feature maps are combined with the pool4 pooling layer, and the combined feature map is up-sampled again and combined with the pool3 pooling layer of the same dimension. After this layer, the feature map is up-scaled to the original image dimensions.

TableNet model
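A minimal Keras sketch of this architecture. The filter counts, the dropout rate, the merged conv7/conv8 column layers, and the single-channel sigmoid output are illustrative simplifications (the paper uses a two-class softmax):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_tablenet(input_shape=(1024, 1024, 3), dropout=0.8):
    # Pre-trained VGG19 encoder; pool3/pool4 feed the skip connections.
    base = tf.keras.applications.VGG19(
        include_top=False, weights="imagenet", input_shape=input_shape)
    pool3 = base.get_layer("block3_pool").output   # 128 x 128
    pool4 = base.get_layer("block4_pool").output   # 64 x 64
    pool5 = base.get_layer("block5_pool").output   # 32 x 32

    # conv6: two (1x1) convolutions replacing VGG19's fully connected layers.
    x = layers.Conv2D(512, 1, activation="relu", name="conv6_1")(pool5)
    x = layers.Dropout(dropout)(x)
    x = layers.Conv2D(512, 1, activation="relu", name="conv6_2")(x)
    x = layers.Dropout(dropout)(x)

    def decoder(name):
        # Branch-specific (1x1) conv, then fractionally strided convolutions
        # with pool4/pool3 skip connections, as described above.
        d = layers.Conv2D(512, 1, activation="relu", name=f"conv7_{name}")(x)
        d = layers.Dropout(dropout)(d)
        d = layers.Conv2DTranspose(512, 3, strides=2, padding="same")(d)  # 64
        d = layers.Concatenate()([d, pool4])
        d = layers.Conv2DTranspose(256, 3, strides=2, padding="same")(d)  # 128
        d = layers.Concatenate()([d, pool3])
        d = layers.Conv2DTranspose(128, 3, strides=2, padding="same")(d)  # 256
        d = layers.Conv2DTranspose(64, 3, strides=2, padding="same")(d)   # 512
        # Final upscale back to the input resolution, one sigmoid channel.
        return layers.Conv2DTranspose(
            1, 3, strides=2, padding="same",
            activation="sigmoid", name=f"{name}_mask")(d)

    return Model(base.input, [decoder("table"), decoder("column")])

model = build_tablenet()
model.compile(optimizer="adam", loss="binary_crossentropy")
```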

The model is trained on the train_dataset prepared from the Marmot dataset; after a few epochs, a sample output obtained on the test_dataset is shown below.

Sample output displayed with Predicted Table and Column mask

8.2. Using ResNet50 as a base network:

The architecture remains the same except for the encoder, where we use ResNet50 (initialized with ImageNet weights) instead of VGG19. Comparing the results of the two models, the TableNet model with VGG19 as the base network outperforms the other: on the Marmot test data, the F1-score of the table masks is 0.832 with the VGG19 encoder, compared to 0.763 for the ResNet50 model.
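Swapping the encoder amounts to picking analogous skip-connection layers from ResNet50; a sketch (the layer names below are from tf.keras's ResNet50):

```python
base = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet", input_shape=(1024, 1024, 3))
# Skip connections analogous to VGG19's pool3/pool4/pool5:
pool3 = base.get_layer("conv3_block4_out").output   # 128 x 128
pool4 = base.get_layer("conv4_block6_out").output   # 64 x 64
pool5 = base.get_layer("conv5_block3_out").output   # 32 x 32
```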

9. Tabular Data Extraction:

After predicting the table_mask with the TableNet model, we crop the table from the original image as shown below.

We pass the table region, cropped from the input image using the predicted table mask, to pytesseract (a wrapper around the Tesseract OCR engine) to extract the tabular information.
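A sketch of this step, assuming a PIL image and a predicted mask of the same spatial size:

```python
import numpy as np
import pytesseract
from PIL import Image

def extract_table_text(image, table_mask, threshold=0.5):
    """Keep only the region the model marked as 'table': crop the image
    to the mask's bounding box and run Tesseract OCR on the crop."""
    mask = np.squeeze(table_mask) > threshold
    ys, xs = np.where(mask)
    if len(ys) == 0:
        return ""                 # no table detected
    crop = np.array(image)[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return pytesseract.image_to_string(Image.fromarray(crop))
```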

The text extracted from the above image is as follows:

Text extracted using pytesseract

10. Deployment:

  1. After training the TableNet model, we store the model weights and deploy the model behind a Flask API built around the final pipeline, which returns the table mask, column mask, and extracted table image to an HTML page (a minimal sketch of such an endpoint is shown below).
  2. Using pytesseract and the Tesseract OCR tool on the extracted table image, we write the tabular information to a CSV/TXT file.
  3. With the predicted table and column masks, we extract the table in the image along with the tabular information in it, as shown below.
Screenshot of the extracted table and its information for the given input image
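A minimal Flask sketch of such an endpoint; the route, template name, and weights path are illustrative, and it reuses the OCR helper from section 9:

```python
import numpy as np
import tensorflow as tf
from flask import Flask, request, render_template
from PIL import Image

app = Flask(__name__)
model = tf.keras.models.load_model("tablenet_model")  # stored model weights

@app.route("/predict", methods=["POST"])
def predict():
    # Read the uploaded document image and match the model's input format.
    img = Image.open(request.files["file"]).convert("RGB").resize((1024, 1024))
    x = np.asarray(img, dtype=np.float32)[None] / 255.0
    table_mask, column_mask = model.predict(x)
    # OCR the predicted table region and render it into the HTML page.
    text = extract_table_text(img, table_mask[0])
    return render_template("result.html", text=text)

if __name__ == "__main__":
    app.run()
```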

Demo video explaining the results

11. Future Work:

  1. We observe that the single end-to-end TableNet model can detect the table even in low-resolution images, but the quality of the extracted tabular information depends entirely on the Tesseract OCR tool, which returns better results for high-resolution images. Hence, high-resolution scanned document images are needed to extract the text accurately.
  2. We trained the TableNet model with limited compute resources and a comparatively small amount of data. With more compute and a larger collection of tabular document images with proper annotations for the table coordinates, we could train more powerful models (e.g., with a larger number of decoder branches), obtaining better results and enabling transfer learning to similar datasets.
  3. We currently get the tabular information as TXT files. To store the results in a proper CSV format, the text returned by pytesseract could be post-processed so that the tabular structure is preserved.

12. References:

  1. https://arxiv.org/pdf/2001.01469.pdf
  2. https://www.mygreatlearning.com/blog/fcn-fully-convolutional-network-semantic-segmentation/
  3. https://levelup.gitconnected.com/a-beginners-guide-to-tesseract-ocr-using-pytesseract-23036f5b2211
  4. https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course

The complete project can be found here on GitHub. For any queries regarding this project, you can contact me on LinkedIn.
