Extracting tables from images can be a tedious and time-consuming task, especially if you have a large number of images to process. However, with the right tools and techniques, you can automate this process and extract tables from images quickly and easily.
In this article, we will explore how to extract tables from images using Python. We will cover a library that can be used to identify and extract tables from images, along with sample code and explanations. Whether you are working with scanned documents, photos, or other types of images, this article will provide you with the tools and knowledge you need to extract tables efficiently and accurately.
What is img2table?
Img2Table is a straightforward, user-friendly Python library for table extraction and identification that is based on OpenCV image processing and supports PDF files in addition to the majority of popular image file formats.
Because it relies on classical image processing rather than neural networks, it offers a practical, lightweight alternative to deep learning solutions, especially when running on CPU only.
It supports the following file formats:
JPEG files - *.jpeg, *.jpg, *.jpe
Portable Network Graphics - *.png
JPEG 2000 files - *.jp2
Windows bitmaps - *.bmp, *.dib
WebP - *.webp
Portable image format - *.pbm, *.pgm, *.ppm, *.pxm, *.pnm
PFM files - *.pfm
OpenEXR Image files - *.exr
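If you are processing a whole folder, it can help to sort files by extension before handing them to the library. The snippet below is only a sketch: the helper name document_kind is made up for this article, and it checks the file extension only, using the list above plus .pdf. It does not inspect the file contents.

```python
from pathlib import Path

# Extensions copied from the list above (hypothetical helper, not part of img2table).
IMAGE_EXTENSIONS = {
    ".jpeg", ".jpg", ".jpe", ".png", ".jp2", ".bmp", ".dib", ".webp",
    ".pbm", ".pgm", ".ppm", ".pxm", ".pnm", ".pfm", ".exr",
}


def document_kind(path):
    """Return "image", "pdf" or None, based on the file extension only."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    return None


print(document_kind("scan_001.JPG"))  # image
print(document_kind("report.pdf"))    # pdf
print(document_kind("notes.docx"))    # None
```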
img2table Features
Table cell-level bounding boxes and table identification for images and PDF files.
Dealing with intricate table structures, like merged cells.
Extraction of table titles.
Extracting table content while supporting OCR tools and services.
A Pandas DataFrame representation and a simple object representing the extracted tables are returned.
Preserve the original structure of extracted tables by exporting them to an Excel file.
The package is simple compared with deep learning solutions and needs little or no training. It still has some limitations, though: identifying more complex borderless tables is not yet supported and may call for CNN-based approaches.
Implementation
Installation
Just like every other Python package, img2table can be installed via pip.
pip install img2table
Working with Images
from img2table.document import Image
image = Image(src, dpi=200, detect_rotation=False)
We instantiate Image, where src is the path to the image (required), dpi is an optional int (default 200) used to adapt the OpenCV algorithm parameters, and detect_rotation is a boolean (default False) that detects and corrects skew or rotation of the image.
Let's have an example:
from img2table.document import Image
# Instantiation of the image
img = Image(src="image.jpg")
# Table identification
image_tables = img.extract_tables()
# Result of table identification
image_tables
#output
[ExtractedTable(title=None, bbox=(10, 8, 745, 314),shape=(6, 3)),
ExtractedTable(title=None, bbox=(936, 9, 1129, 111),shape=(2, 2))]
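The output above only shows a summary of each table. To get at the data itself, each ExtractedTable exposes its content as a Pandas DataFrame through its df attribute, next to title and bbox. Here is a small sketch: summarize_tables and extract_image_tables are names I made up for illustration, and the stand-in object at the bottom only imitates the attribute names so the helper can be tried without the library installed. Note that without an OCR instance I would expect the cells to be empty, since no text is parsed.

```python
from types import SimpleNamespace


def summarize_tables(tables):
    """Collect the title, bounding box and DataFrame of each extracted table."""
    return [
        {"title": table.title, "bbox": table.bbox, "data": table.df}
        for table in tables
    ]


def extract_image_tables(path):
    """Run img2table on one image (requires img2table; no OCR is used here)."""
    from img2table.document import Image  # imported lazily on purpose

    return summarize_tables(Image(src=path).extract_tables())


# Stand-in with the same attribute names as an extracted table.
# Real code would call extract_image_tables("image.jpg") instead.
fake_table = SimpleNamespace(title=None, bbox=(10, 8, 745, 314), df=[["a", "b"]])
print(summarize_tables([fake_table])[0]["bbox"])  # (10, 8, 745, 314)
```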
Working with PDF
from img2table.document import PDF
pdf = PDF(src, dpi=200, pages=[0, 2])
This works the same way as with images, except for a new parameter, pages, which is a list of PDF page indexes to be processed. If no indexes are specified in pages, all the pages are processed.
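One thing to watch out for: the example pages=[0, 2] suggests that page indexes start at 0, so the first page of the PDF is index 0. Assuming that is the case, a tiny helper like the one below (to_page_indexes is a hypothetical name, not part of img2table) converts the page numbers you see in a PDF viewer into indexes.

```python
def to_page_indexes(page_numbers):
    """Convert 1-based page numbers into sorted, de-duplicated 0-based indexes."""
    indexes = set()
    for number in page_numbers:
        if number < 1:
            raise ValueError(f"page numbers start at 1, got {number}")
        indexes.add(number - 1)
    return sorted(indexes)


# Pages 1 and 3 of the document become indexes 0 and 2.
print(to_page_indexes([1, 3]))  # [0, 2]
```

The result can then be passed as the pages argument of PDF.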
Working with OCR
To parse the content of tables, img2table offers an interface for various OCR tools and services.
from img2table.ocr import TesseractOCR
ocr = TesseractOCR(n_threads=1, lang="eng", tessdata_dir="...")
Here, n_threads is the number of concurrent threads used to call Tesseract (an int, default 1), lang is the language Tesseract uses for text extraction (optional), and tessdata_dir is the directory containing the Tesseract traineddata files.
Note: Usage of Tesseract-OCR requires prior installation.
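A quick way to confirm that Tesseract is installed and reachable is to look for its executable on your PATH before creating the OCR instance. This sketch uses only the standard library; find_tesseract is a made-up helper name, and I am assuming the executable is called tesseract, which is the usual name on Linux, macOS and Windows.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(executable)


path = find_tesseract()
if path is None:
    print("Tesseract not found on PATH; install it before using TesseractOCR.")
else:
    print(f"Tesseract found at {path}")
```

If the check fails on Windows even though Tesseract is installed, the install folder is probably missing from the Path environment variable.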
Let's have a look at an example.
from img2table.document import PDF
from img2table.ocr import TesseractOCR
# Instantiation of the pdf
pdf = PDF(src="tablesfile.pdf")
# Instantiation of the OCR, Tesseract, which requires prior installation
ocr = TesseractOCR(lang="eng")
# Table identification and extraction
pdf_tables = pdf.extract_tables(ocr=ocr)
# We can also create an excel file with the tables
pdf.to_xlsx('tables.xlsx', ocr=ocr)
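As far as I know, for a PDF the extract_tables method returns a dictionary that maps each page index to the list of tables found on that page, rather than a flat list. Assuming that shape, the sketch below walks through the result page by page. The iter_pdf_tables helper and the stand-in data are made up for illustration, so the snippet runs without a real PDF.

```python
from types import SimpleNamespace


def iter_pdf_tables(tables_by_page):
    """Yield (page_index, table_number, table) in page order."""
    for page_index in sorted(tables_by_page):
        for table_number, table in enumerate(tables_by_page[page_index]):
            yield page_index, table_number, table


# Stand-in data shaped like the assumed return value of pdf.extract_tables().
fake_result = {
    2: [SimpleNamespace(title=None, df="table C")],
    0: [
        SimpleNamespace(title="Sales", df="table A"),
        SimpleNamespace(title=None, df="table B"),
    ],
}

for page_index, table_number, table in iter_pdf_tables(fake_result):
    print(page_index, table_number, table.title)
```

With real results you would replace fake_result with pdf_tables and read table.df to get each table as a DataFrame.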
Extracting Multiple tables
The extract_tables method of a document allows multiple tables to be extracted simultaneously from a PDF page or an image.
from img2table.ocr import TesseractOCR
from img2table.document import Image
# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")
# Instantiation of document, either an image or a PDF
doc = Image(src, dpi=200)
# Table extraction
extracted_tables = doc.extract_tables(ocr=ocr,
                                      implicit_rows=True,
                                      borderless_tables=False,
                                      min_confidence=50)
Most of the parameters were discussed earlier when working with images and PDFs, but a few are new. ocr is the OCR instance used to parse the document text, implicit_rows is a boolean indicating whether implicit rows should be identified, borderless_tables indicates whether borderless tables should be extracted, and min_confidence is the minimum OCR confidence level, from 0 (the worst) to 99 (the best), required for text to be processed.
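To get a feel for what min_confidence does, here is a toy illustration of confidence filtering. It is not the library's code: the function name, the sample words and the choice of >= as the comparison are all my own assumptions. The idea is simply that low-confidence OCR output, which is often noise, gets dropped before the table content is built.

```python
def keep_confident_words(words, min_confidence=50):
    """Keep OCR words whose confidence is at least min_confidence (0-99 scale)."""
    return [text for text, confidence in words if confidence >= min_confidence]


# Made-up OCR output as (text, confidence) pairs.
ocr_words = [("Total", 96), ("l|", 12), ("1,250", 88), ("~", 31)]

print(keep_confident_words(ocr_words))                     # ['Total', '1,250']
print(keep_confident_words(ocr_words, min_confidence=10))  # ['Total', 'l|', '1,250', '~']
```

Raising the threshold gives cleaner cells at the risk of losing faint but valid text; lowering it does the opposite.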
Conclusion
All of the image processing is done with OpenCV, through the opencv-python package. The algorithm is built on the Hough Transform, which detects lines in an image and lets us identify the image's horizontal and vertical lines. The library deliberately does not do much more than that: the intention was to keep it as straightforward as possible and avoid the complications that other approaches might bring.
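If the Hough Transform is new to you, the toy version below shows the voting idea in plain Python. It is only a sketch for intuition, not the library's implementation (which relies on OpenCV): every edge pixel (x, y) votes for the lines rho = x*cos(theta) + y*sin(theta) that pass through it, and lines with enough votes are kept. I restrict theta to 0 and 90 degrees, which correspond to vertical and horizontal lines, and the function name and threshold are my own choices.

```python
import math


def hough_lines(points, thetas_deg=(0, 90), min_votes=4):
    """Vote in (rho, theta) space and return the lines with enough votes.

    theta = 0 describes vertical lines (x = rho),
    theta = 90 describes horizontal lines (y = rho).
    """
    votes = {}
    for x, y in points:
        for theta in thetas_deg:
            angle = math.radians(theta)
            rho = round(x * math.cos(angle) + y * math.sin(angle))
            votes[(rho, theta)] = votes.get((rho, theta), 0) + 1
    return sorted(key for key, count in votes.items() if count >= min_votes)


# Edge pixels of a tiny drawing: a horizontal line at y = 2 and a vertical line at x = 4.
points = {(x, 2) for x in range(6)} | {(4, y) for y in range(6)}

print(hough_lines(points))  # [(2, 90), (4, 0)]
```

The two results read as "a horizontal line at y = 2" and "a vertical line at x = 4", which is exactly the kind of information needed to find the borders of a table.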
View the project's documentation on GitHub.
Let's connect on Twitter and on LinkedIn. You can also subscribe to my YouTube channel.
Happy Coding!
Top comments (6)
Hi buddy, hope you are doing well. I'm facing an issue when calling TesseractOCR. My task is to extract tables from a PDF file that contains scanned images. I have added the sample file, and the error page is shared below. Can you please help me resolve this?
Hi, what operating system are you using?
You can check this out: stackoverflow.com/questions/509519...
This should solve it for you.
I'm using a Windows machine, and I follow the same steps to call pytesseract locally when using other libraries too. Is it mandatory to also run 'pip install tesseract-ocr'? I tried to install tesseract-ocr, but it depends on the Visual Studio tools, which are 7.0 GB in size.
You should pip install tesseract-ocr
Let me check and give you feedback.
This happens because the additional local packages are not installed in Program Files. If you want to use the extraction features of Tesseract-OCR, you first need to download the Tesseract engine from here digi.bib.uni-mannheim.de/tesseract... and install it. Then open your environment variables (search for "edit environment variables" in the search bar of the PC), select Path, click Edit, and paste the path of Tesseract-OCR, either from Program Files on the C drive or from wherever you installed Tesseract. Then try again; it should work.
I am getting output similar to this:
[ExtractedTable(title=None, bbox=(10, 8, 745, 314),shape=(6, 3)),
ExtractedTable(title=None, bbox=(936, 9, 1129, 111),shape=(2, 2))]
How can we get these in table format?