DEV Community

loading...
Cover image for How I Managed to Pre-processing Musshaf Images

How I Managed to Pre-processing Musshaf Images

naruaika profile image Naufan Rusyda Faikar ・12 min read

You know, just another random story ...

Behind the Scenes

If you have been following me from the very beginning of my journey on this wonderful blogging platform, at DEV.to for sure, you have already noticed that I am using the Windows Subsystem for Linux aka WSL2 at the moment. But, starting from three months ago, or four, I have decided to go back to Fedora. Eh? Uh, what? So, today I am going to tell you another story, rather than a technical blog post, about a personal project started on February 15th which I have committed to.

Wait, what is the relationship between moving to Fedora and developing this project? You may be wondering, are not you? No? It is okay, LOL. Anyhow, one of the main reasons I always end up with Fedora is that I get the original flavor of the GNOME desktop environment. No custom themes, nothing changes at all except for a few extensions that I installed later, GSConnect for example.

I just love the whole GNOME ecosystem, which made me run into installing almost all GNOME applications every time I found a new one, despite being unused, LOL. I mean, all the applications that are written in GTK, especially since they all have a beautiful header bars. That very reason was bring me to think when I need to find an application for specific purposes, that has to be GTK application on the first place. Similarly, when I got to develop a desktop application. Err ... Even though, this consideration has nothing to do with this post. Ha-ha-ha ... Well, enough talking then! Okay, let us move on!

A Background

Meanwhile, this discussion will only be limited to the main issues that I have experienced with. Do you know how the Quran, or Koran, has always been written down today? Exempli gratia, to the best of my knowledge, the Medina Quran is always written, by King Fahd Glorious Quran Printing Complex at Saudi Arabia, by hand. We may have heard Uthman Taha, the calligrapher of Musshaf Al-Madina, or we even have KFGQPC Uthman Taha Naskh installed.

I am pretty sure that the Quran which everyone owns was written by someone has hand. I meant to write: someone's; does it sound differently? Please tell me, O, you good. And then the copies were reproduced and distributed around the world. Nowadays, for some reason, we can easily get the Quran in our pockets as a digital application. So, we can read wherever and whenever we want to. For those messages of kindness can be searched, copied, and shared, many people struggle to meet unicode standards, instead of providing the Quran as images only. Unfortunately, this manner happened to bring a new issue. Before continuing, you can try to figure out the difference between the Zekr and Ayat application.

The pagination, layouting, typeface, et cetera, in a digital world often look differently. If we used to read and memorise the Quran from the Musshaf, this could be so annoying, at least for someone doing a halaqa. Many scholars who are great at memorising the Quran also memorise, things that people thought, including me before, were not really important, such as where the page which this specific ayah, or verse, is belong to. Even more specific things: on which line and in which word position. So, each time got a question, they can quickly turn to the related pages. Hence, I have decided to display the Quran as images based on the official printed Medina Quran. Meanwhile, these ayah can still be searched and copied, as we have included a database of the Quran in machine-encoded texts.

But the true story has not been told ... I know, I know, I am about to tell you. Using images, how does the application know that the area of the page that we click on belongs to which surah and ayah? Ah, got it! So, the problem is how do we do the mapping. In a simpler way, how to select an ayah on a picture. Exactly! Once we managed to find the best method, we could apply it to other standards, and other Qira'at printings as well. However, better go with a cross-platform graphical user interface (GUI) framework or toolkit, if you want to target that broad audience. Yeah, that was under my consideration, and seems to remain. That is merely my ego at all.

The Main Story

The story began with downloading a Quran PDF file from the official Quran Complex website. I had better take a cup of tea, for it took a while for my potato ...

Along the way, I got some ideas in my head to solve that. Most of them were not possible for me at the moment, as it turned out that I need to learn some machine learning techniques which I would be stupid to learn myself. I know best, if I try hard to learn it, I will make up my own opportunities! But since I am doing this to provide only a proof-of-concept of my digital Quran application, Grapik Quran, I am not willing to take too much effort into it yet. Not to mention that I just started learning GTK+; there is a lot to learn, so I have to manage my time as best I can.

Some of them required me to learn about computer vision and image processing stuff, which I am enjoy with. But I am still pretty beginner at this, as well. And well, let me start writing a true story of mine ...

Too lazy to begin with manual work of writing computer vision algorithms or even to just use them since they could easily be imported by simply calling import cv2, yes, I am. So, I was wondering if I could make predictions with some kind of machine learning model; someone else's model, of course. After looking for it for a while, I found nothing.

Okay, let me have a think of it. I was too naive not to write down what I exactly needed. It happened to someone who really thought of him as a pretty smart, or to someone who is too hasty in doing something; I have learnt a lesson from it. In order to segment ayahs from the image pages, we need to:

  1. Extract all the lines with ayah(s), excluding all other information such as page numbers, juz, hizb, and all ornaments. The lines should be sorted by appearance from top to bottom;
  2. Find all surah and ayah markers on each lines, if any; and
  3. Iterate over all the markers from the very first line to the end. Doing this sequentially from the first page of the Quran, where Surah Al-Fathihah is located, we will able to map them correctly to each surah and ayah.

It sounds simple, is not it? To get a better idea, let me show you how the final results should look like:

A Preview of Preprocessed Quran Images (1)

A Preview of Preprocessed Quran Images (2)

They look great, right? Let us jump into the main story now ...

1. Extracting Page Lines

First trial; My first intention was to write a decent code to extract all lines from any type of Musshaf images. It has to be worked for pages that have a different number of rows without specifying them first. After doing a research, I finally came up with several algorithms.

As I already said, I was lazy and still I am ... Therefore, I was thinking of looking for Python libraries to achieve it. Then I was made to understand that it has something to do with optical character recognition. So, I had been playing around with PyTesseract:

import cv2
from pytesseract import image_to_data, Output


image = cv2.imread('dataset/quran-images/3.png')
d = image_to_data(image, 'ara', output_type=Output.DICT)
print(d.keys())

n_boxes = len(d['text'])
for i in range(n_boxes):
    if int(d['conf'][i]) > 60:
        (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
        image = cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imshow('Result', image)
cv2.waitKey(0)
Enter fullscreen mode Exit fullscreen mode

and also EasyOCR:

import easyocr


reader = easyocr.Reader(['ar'])
result = reader.readtext('dataset/quran-images/3.png')
Enter fullscreen mode Exit fullscreen mode

Unfortunately, they were not capable of. PyTesseract was the worst; EasyOCR at least could give me some insights, but it was neither stable nor accurate. The real challenge here is that the texts are handwritten and they all have diacritics. The idea behind this approach is that I would like to extract as much information as possible, including the surah, ayah, juz, and hizb numbers.

I had also played with a number of different machine learning approaches downloaded from their GitHub repositories without considering modifying them even a line of code! As lazyness had came to me. In short, they were all doomed to fail. You know, and moreover, there is not any artificial general intelligence!

Second trial; Learning from the previous failure, I had decided to focus on a specific thing and it should be extracting page lines. By reading some papers, either talking on extracting page lines in general or only on extracting from arabic pages, I have tried several approaches of theirs. Either just did git clone, again, or implemented them by myself.

First of all, since they all need a page containing only the texts we want to extract from, we have to get rid of everything else outside the page border and the border itself for sure. You already know, as for by doing a simple code should be enough. Uh, wait, we need to install all required libraries and their dependencies:

$ sudo dnf install --yes poppler-utils
...

$ pip install pdf2image opencv-python matplotlib
...
Enter fullscreen mode Exit fullscreen mode

Then, inside my Jupyter notebook, I wrote these:

from pdf2image import convert_from_path


pages = convert_from_path('dataset/quran-images/hafs-standard39.pdf', dpi=190)
Enter fullscreen mode Exit fullscreen mode

Somehow, it has eaten up all my memory;

Screenshot from 2021-03-08 17-01-33

Thus, I did this instead:

from pdf2image import convert_from_path
import tempfile


with tempfile.TemporaryDirectory() as outdir:
    filepath = 'dataset/quran-images/hafs-standard39.pdf'
    pages = convert_from_path(filepath, dpi=190, output_folder=outdir)
Enter fullscreen mode Exit fullscreen mode

Or even do a sampling to get 5–8 images only.

Screenshot from 2021-03-24 00-02-37

Next, these lines:

import cv2
import numpy as np
from matplotlib import pyplot as plt


img = cv2.cvtColor(np.array(pages[5]), cv2.COLOR_RGB2BGR)
img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img_mod = cv2.GaussianBlur(img_gray, (15, 15), 0)
_, img_mod = cv2.threshold(
    img_mod, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

w = 456  # image preview width in pixels
scale = w / img.shape[1]
h = int(img.shape[0] * scale)
plt.rcParams.update({
    'figure.figsize': (h / (96 / 3), w / (96 / 3)),
    'axes.spines.left': False,
    'axes.spines.right': False,
    'axes.spines.top': False,
    'axes.spines.bottom': False,
    'xtick.bottom': False,
    'xtick.labelbottom': False,
    'ytick.labelleft': False,
    'ytick.left': False
})

plt.imshow(img_mod)
plt.show()
Enter fullscreen mode Exit fullscreen mode

will result:

Alt Text

Got it, ya? Let us continue writing ...

img_mod = cv2.Canny(img_mod, 100, 200)
kernel = np.ones((2, 2), np.uint8)
img_mod = cv2.dilate(img_mod, kernel)
cv2.floodFill(img_mod, None, (0, 0), 255)
cv2.floodFill(img_mod, None, (0, 0), 0)

plt.imshow(img_mod)
plt.show()
Enter fullscreen mode Exit fullscreen mode

And ta-da!

Alt Text

Then we could do something like:

bboxes = cv2.findContours(img_mod, cv2.RETR_EXTERNAL,
                          cv2.CHAIN_APPROX_SIMPLE)
bboxes = bboxes[0] if len(bboxes) == 2 else bboxes[1]
x, y, w, h = cv2.boundingRect(max(bboxes, key=cv2.contourArea))
Enter fullscreen mode Exit fullscreen mode

and finally:

img_mod[y:y+h, x:x+w] = img_thres[y:y+h, x:x+w]
Enter fullscreen mode Exit fullscreen mode

to obtain the page border and all of its content inside. Great! Next, let us talk about several algorithms to segment page lines. For that, you shall let me write this as is in the Google search box:

line segmentation site:github.com
Enter fullscreen mode Exit fullscreen mode

LOL! Well, you do not judge me at all.

After doing a screening, I came up with a conclusion:

  • I did not want any path finding approaches as they do not seem to give me straight lines;
  • I did not want any codes written in anything other than Python, because I was not familiar with;
  • I did not want this and that ...

After all, I needed to implement my own by combining all those approaches to-get-her. Until a week later, I had been such the happiest person in the world when it seemed to be working for Quran images from the King Saud University, which had been used for the Ayat application. But, come on, the code was messy (you will not even like seeing its related public commits in my GitHub repository; it is still there) and I had no luck in applying for other Musshafs.

Third trial; Make it simple, dude! I lost my hope while still being lazy. It had me realised: sometimes, it is okay to be lazy, because it might turn us into creative people, just like what Bill Gates has said (if I am not wrong). But indeed, sometimes:

Programmers like to come up with clever ways to solve non-existing problems—https://news.ycombinator.com/item?id=18180017

Since then, I have come back to the easiest ways to do this and that;

  1. defining the number of lines in that page;
  2. obtaining the page border; and then
  3. dividing the area equally.

Dots.

2. Finding Ayah Markers

Okay, just do a template matching for it. Simply,

  1. open the PDF using Evince or any PDF reader;
  2. set the zoom to 100%;
  3. press the shift and prt sc (print screen) keys, then use a mice to collect a screenshot of the surah, bismillah, and ayah markers as samples; and
  4. Google on how to do template matching in Python!

Nah, as simple as that!

alt text

3. Iterating Over Lines

Yeah, the next steps to take are (1) iterating each line while incrementing the surah and ayah numbers, (2) segmenting each line based on the appearance of the ayah markers in there if any, and (3) saving all the information into a single CSV file. We may want to apply some post-processing later on. Well, I am too tired to elaborate. Sigh ... I hope that you got the idea. That CSV file might contain something like:

page,sura,aya,x1,y1,x2,y2
4,1,1,248,537,578,591
4,1,2,158,537,248,591
4,1,2,202,591,578,645
4,1,3,158,591,202,645
4,1,3,384,645,578,699
4,1,4,158,645,384,699
4,1,5,238,699,578,753
4,1,6,158,699,238,753
4,1,6,354,753,578,807
4,1,7,158,753,354,807
4,1,7,158,807,578,861
4,1,7,158,861,578,921
Enter fullscreen mode Exit fullscreen mode

If you feel interested, since all of these stuff has been uploaded, you can always access all my random things on here:

GitHub logo naruaika / my-playground

My random researches on the subject of interest

After Story

I figured out that there was wasted space around the page border;

Screenshot from 2021-03-16 15-29-58

ending up taking up my screen for no better reason than having page number, juz, and hizb boxes, and others, next to the page border. For I want to deal all those in my application navigation system. Come on, let us fix it!

$ ls
alasbahani-mad.pdf  hafs-standard39.pdf     kasr-elbadl.pdf       qiraat-naskh.pdf   qiraat-sousi.pdf
albahrain.pdf       hafs-standardthree.pdf  muyassar-ghareeb.pdf  qiraat-qaloon.pdf  qiraat-warsh39.pdf
hafs-jawamee39.pdf  hafs-wasat39.pdf        qiraat-douri.pdf      qiraat-shuba.pdf   tawassot-elbadl.pdf

# Convert PDF to JPEG images
$ mkdir -p hafs-standard39
$ pdftoppm -jpeg -r 190 -e hafs-standard39.pdf hafs-standard39/even
$ pdftoppm -jpeg -r 190 -o hafs-standard39.pdf hafs-standard39/odd

# Optimise the output images
$ sudo dnf install --yes jpegoptim
$ mkdir -p hafs-standard39/optimized
$ jpegoptim hafs-standard39/*.jpg --dest hafs-standard39/optimized/
... 
Enter fullscreen mode Exit fullscreen mode

Then, using GIMP;

  1. Open even-*.jpg as layers (ctrl+alt+O);
  2. Crop all the even pages at (57, 127) in the size 774 by 1190 pixels manually. Do not forget to enable the "Delete cropped pixel" option when cropping;
  3. Export all of them as separate JPEG images using a plugin Export Layers; and
  4. Repeat for all the odd pages.

Since the first two pages and before apparently have a different alignment, repeat for them separately. Although we need additional manual operations such as aligning and resizing the images over layers. You might decide not to include all the pages that did not contain any ayahs. But since I wanted to maintain the number of the pages as the original, I have cut them too.

Afterward, using Nautilus we are able to batch rename them to remove the even- and odd- prefixes. To remove the zero prefix, rename them by using a template: [1, 2, 3].

As we now have JPEG images instead of PDF files, it so happens that we will have to change the initial few lines of our code. But that is all for now. See you again soon!

Discussion (0)

pic
Editor guide