Extracting tables from images can be a tedious and time-consuming task, especially if you have a large number of images to process. However, with the right tools and techniques, you can automate this process and extract tables from images quickly and easily.
In this article, we will explore how to extract tables from images using Python. We will cover a library that can be used to identify and extract tables from images, along with sample code and explanations. Whether you are working with scanned documents, photos, or other types of images, this article will provide you with the tools and knowledge you need to extract tables efficiently and accurately.
What is img2table?
Img2Table is a straightforward, user-friendly Python library for table extraction and identification that is based on OpenCV image processing and supports PDF files in addition to the majority of popular image file formats.
Because of this design, it offers a practical, lighter-weight alternative to solutions based on neural networks, especially when running on CPU.
It supports the following file formats:
JPEG files - *.jpeg, *.jpg, *.jpe
Portable Network Graphics - *.png
JPEG 2000 files - *.jp2
Windows bitmaps - *.bmp, *.dib
WebP - *.webp
Portable image format - *.pbm, *.pgm, *.ppm, *.pxm, *.pnm
PFM files - *.pfm
OpenEXR Image files - *.exr
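To make the list above concrete, here is a minimal sketch of how one might pick out the supported images in a folder before handing them to img2table. The extension set is copied from the list above; the names SUPPORTED_EXTENSIONS, is_supported and find_supported_images are hypothetical helpers, not part of img2table.

```python
from pathlib import Path

# Extensions copied from the list of supported formats above.
SUPPORTED_EXTENSIONS = {
    ".jpeg", ".jpg", ".jpe", ".png", ".jp2", ".bmp", ".dib", ".webp",
    ".pbm", ".pgm", ".ppm", ".pxm", ".pnm", ".pfm", ".exr",
}


def is_supported(path):
    """Return True if the file extension is in the supported list (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def find_supported_images(folder):
    """Return the supported image files directly inside `folder`, sorted by name."""
    return sorted(p for p in Path(folder).iterdir() if p.is_file() and is_supported(p))
```

Matching on the lowercased suffix means files such as scan.JPG are picked up as well.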
Its main features are:
Table identification for images and PDF files, including cell-level bounding boxes.
Handling of intricate table structures, such as merged cells.
Extraction of table titles.
Extraction of table content through supported OCR tools and services.
Extracted tables returned as a simple object that includes a Pandas DataFrame representation.
Export of extracted tables to an Excel file, preserving their original structure.
Compared with deep learning solutions, the package is simple and needs little or no training. Some limitations remain, though: the more complicated identification of borderless tables is not yet supported and may call for CNN-based approaches.
Like most Python packages, img2table can be installed with pip:
pip install img2table
Working with Images
from img2table.document import Image

image = Image(src, dpi=200, detect_rotation=False)
Image takes the following parameters:
src is the path to the image (required),
dpi is an optional int used to adapt OpenCV algorithm parameters (default is 200),
detect_rotation is a boolean that controls whether skew or rotation of the image is detected and corrected (False in the call above).
Let's look at an example:
from img2table.document import Image

# Instantiation of the image
img = Image(src="image.jpg")

# Table identification
image_tables = img.extract_tables()

# Result of table identification
image_tables

# Output
[ExtractedTable(title=None, bbox=(10, 8, 745, 314), shape=(6, 3)),
 ExtractedTable(title=None, bbox=(936, 9, 1129, 111), shape=(2, 2))]
Working with PDF
from img2table.document import PDF

pdf = PDF(src, dpi=200, pages=[0, 2])
This works the same way as with images, with one new parameter: pages, a list of the PDF page indexes to be processed. If no indexes are specified in pages, all the pages are processed.
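As a small illustration of the pages parameter, the sketch below converts the 1-based page numbers shown in a PDF viewer into the list of indexes passed to PDF. It assumes page indexes are zero-based, which the pages=[0, 2] call above suggests but does not state. Both to_page_indexes and open_pdf_pages are hypothetical helpers.

```python
def to_page_indexes(page_numbers):
    """Convert 1-based page numbers to sorted, unique 0-based indexes."""
    indexes = set()
    for number in page_numbers:
        if number < 1:
            raise ValueError(f"page numbers start at 1, got {number}")
        indexes.add(number - 1)
    return sorted(indexes)


def open_pdf_pages(src, page_numbers=None):
    """Build a PDF document restricted to the given pages, or all pages when None."""
    # Imported lazily so to_page_indexes can be used without img2table installed.
    from img2table.document import PDF

    # Assumption: indexes are zero-based, and pages=None means "process every page".
    pages = to_page_indexes(page_numbers) if page_numbers else None
    return PDF(src, pages=pages)
```

For instance, open_pdf_pages("tablesfile.pdf", [1, 3]) would request the same pages as the pages=[0, 2] call above.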
Working with OCR
To parse the content of tables, img2table offers an interface for various OCR tools and services.
from img2table.ocr import TesseractOCR

ocr = TesseractOCR(n_threads=1, lang="eng", tessdata_dir="...")
n_threads is an int giving the number of concurrent threads used to call Tesseract (1 in the call above),
lang is the optional language Tesseract uses for text extraction, and finally
tessdata_dir is the directory containing Tesseract traineddata files.
Note: Usage of Tesseract-OCR requires prior installation.
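Because Tesseract has to be installed separately, a quick check that its executable is reachable can save some debugging. This is a minimal sketch using only the standard library. It assumes the executable is named tesseract and is exposed on the PATH, and tesseract_available is a hypothetical helper.

```python
import shutil


def tesseract_available(executable="tesseract"):
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(executable) is not None


if not tesseract_available():
    print("Tesseract was not found on the PATH; install it before using TesseractOCR.")
```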
Let's have a look at an example.
from img2table.document import PDF
from img2table.ocr import TesseractOCR

# Instantiation of the pdf
pdf = PDF(src="tablesfile.pdf")

# Instantiation of the OCR, Tesseract, which requires prior installation
ocr = TesseractOCR(lang="eng")

# Table identification and extraction
pdf_tables = pdf.extract_tables(ocr=ocr)

# We can also create an excel file with the tables
pdf.to_xlsx('tables.xlsx', ocr=ocr)
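The extracted tables can then be read back as DataFrames. The sketch below rests on two assumptions. First, each ExtractedTable exposes its content through a df attribute (the Pandas DataFrame representation mentioned earlier). Second, extract_tables returns a list for an image and a dictionary keyed by page index for a PDF. collect_dataframes is a hypothetical helper.

```python
def collect_dataframes(extracted):
    """Flatten the result of extract_tables into (page, table_index, df) tuples.

    Accepts either a list of tables (image) or a dict mapping a page index
    to a list of tables (PDF). Tables from an image are reported as page 0.
    """
    if isinstance(extracted, dict):
        pages = sorted(extracted.items())
    else:
        pages = [(0, extracted)]

    results = []
    for page, tables in pages:
        for table_index, table in enumerate(tables):
            # Assumption: `df` holds the table content as a pandas DataFrame.
            results.append((page, table_index, table.df))
    return results
```

With the pdf_tables from the example above, looping over collect_dataframes(pdf_tables) and printing each df would display every table in tabular form.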
Extracting Multiple tables
The extract_tables method of a document allows multiple tables to be extracted simultaneously from a PDF page or an image.
from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src, dpi=200)

# Table extraction
extracted_tables = doc.extract_tables(ocr=ocr,
                                      implicit_rows=True,
                                      borderless_tables=False,
                                      min_confidence=50)
Most of these parameters were covered earlier when working with images and PDFs, but extract_tables takes a few new ones:
ocr is the OCR instance used to parse the document text,
implicit_rows is a boolean indicating whether implicit rows should be identified,
borderless_tables is a boolean indicating whether borderless tables are extracted, and lastly,
min_confidence is the minimum OCR confidence level required for text to be processed, from 0 (the worst) to 99 (the best).
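To illustrate what a confidence threshold like min_confidence does, here is a toy filter over made-up OCR words. It is not img2table's implementation: the OcrWord type, the sample words and their scores are all hypothetical, and keeping words that sit exactly at the threshold is an assumption.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str
    confidence: int  # OCR confidence, from 0 (worst) to 99 (best)


def keep_confident(words, min_confidence=50):
    """Keep only the words whose confidence reaches the threshold."""
    return [w for w in words if w.confidence >= min_confidence]


# Hypothetical OCR output for one table row.
words = [OcrWord("Total", 96), OcrWord("l2,5O0", 31), OcrWord("USD", 50)]
print([w.text for w in keep_confident(words)])
```

Raising the threshold trades recall for cleaner text: fewer garbled words get through, but faint yet correct text may be dropped too.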
All of the image processing is done with OpenCV, through the opencv-python library. The algorithm is built on the Hough Transform, which detects lines in an image and so lets the library find the image's horizontal and vertical lines. There is not much more to the library than that: the intention was to keep it as straightforward as possible and avoid the complications that other approaches might introduce.
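The idea behind the Hough Transform can be sketched in a few lines of pure Python. This is a toy illustration restricted to horizontal and vertical lines, not the code that img2table or OpenCV actually runs; hough_axis_lines and the sample frame are hypothetical.

```python
import math
from collections import Counter


def hough_axis_lines(pixels, min_votes):
    """Find horizontal and vertical lines among (x, y) foreground pixels.

    Each pixel votes for the lines rho = x*cos(theta) + y*sin(theta) that pass
    through it. Only theta = 0 (vertical lines, rho = x) and theta = 90 degrees
    (horizontal lines, rho = y) are considered in this simplified version.
    """
    thetas = {"vertical": 0.0, "horizontal": math.pi / 2}
    votes = Counter()
    for x, y in pixels:
        for orientation, theta in thetas.items():
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(orientation, rho)] += 1
    # A line is detected when enough pixels voted for the same (theta, rho) cell.
    return sorted(line for line, count in votes.items() if count >= min_votes)


# A rectangular frame: horizontal borders at y = 0 and y = 3,
# vertical borders at x = 0 and x = 5.
frame = {(x, y) for x in range(6) for y in (0, 3)} | {(x, y) for x in (0, 5) for y in range(4)}
print(hough_axis_lines(frame, min_votes=4))
```

A full Hough Transform sweeps many angles rather than two, but the voting step is the same: pixels that lie on a common line pile their votes into one accumulator cell.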
View the project's documentation on GitHub.
Top comments (5)
Hi buddy, hope you are doing well. I'm facing an issue while calling TesseractOCR. The scope of my task is to extract tables from a PDF file that contains scanned images. I have added the sample file, and the error page is also shared below. Can you please help me resolve this?
Hi, what operating system are you using?
You can check this out: stackoverflow.com/questions/509519...
This should solve it for you.
I'm using a Windows machine, and I'm following the same steps to call pytesseract locally, as I do when using other libraries. Is it mandatory to run 'pip install tesseract-ocr' as well? I tried to install tesseract-ocr, but it depends on Visual Studio tools, and the problem is that they are 7.0 GB in size.
You should pip install tesseract-ocr
Let me check and give you feedback.
I am getting output similar to this:
[ExtractedTable(title=None, bbox=(10, 8, 745, 314), shape=(6, 3)),
ExtractedTable(title=None, bbox=(936, 9, 1129, 111), shape=(2, 2))]
How can we get these in table format?