Necessity is the mother of invention

#python #pdf #webscrapping #pydroid3

During this lockdown my mother came to me asking for her Gujarati newspaper. As newspapers weren't allowed to be distributed in our residential complex, I tried to find an e-paper for her. The paper was available for free on the newspaper's website, but the issue was that it was stored as separate PDF files, one file per page.

I am a big fan of Python's approach to solving practical problems and of its ever-growing list of modules and libraries for anything under the sun.

I took an experimental approach and combined different methods until I got the required result.

I used BeautifulSoup to scrape the page for links and PyPDF2 to read and merge the files, with tips from Stack Overflow :)

Below is the code. It gives me a single PDF in a directory named after today's date. She is able to run it from her smartphone using Pydroid3.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import requests
import os
from datetime import datetime

# Create a directory named after today's date (dd-mm-yyyy) and work inside it.
today = datetime.today().strftime('%d-%m-%Y')
if not os.path.exists(today):
    os.mkdir(today)

os.chdir(os.path.join(os.getcwd(), today))

req = Request("http://www.newspapaersomething.com/frmEPShow.aspx")
html_page = urlopen(req)

soup = BeautifulSoup(html_page, "lxml")

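# Collect the href of every <a> tag on the page.
links = []

for link in soup.find_all('a'):
    links.append(link.get('href'))

# Drop the first 15 links on the page; the rest are treated as PDF page links.
mylist = links[15:]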

# Download every linked page, naming each file after the last part of its URL.
for url in mylist:
    r = requests.get(url, allow_redirects=True)
    name = url.split("/")[-1]
    with open(name, 'wb') as f:
        f.write(r.content)

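# PdfFileMerger is the class name in PyPDF2 releases before 3.0;
# later releases call it PdfMerger.
from PyPDF2 import PdfFileMerger

def mergeIntoOnePDF(path):
    # Merge the page PDFs in file-name order, skipping a previously merged file.
    pdf_files = sorted(
        fileName for fileName in os.listdir(path)
        if fileName.endswith('.pdf') and fileName != "merged_full.pdf"
    )
    merger = PdfFileMerger()
    for filename in pdf_files:
        merger.append(os.path.join(path, filename))
    merger.write(os.path.join(path, "merged_full.pdf"))
    merger.close()

mergeIntoOnePDF(os.getcwd())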
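
The script skips a fixed number of links and merges whatever PDF files it finds, so it is tied to how this one page happens to be laid out. As a rough sketch only, the two helpers below show one way this could be loosened: the first keeps only the links that end in .pdf and turns relative links into absolute ones, the second is a sort key that puts page2 before page10. The names pdf_links and natural_key, and the example.com URLs and file names, are hypothetical placeholders and are not taken from the original script or the newspaper's site.

```python
import re
from urllib.parse import urljoin, urlparse


def pdf_links(hrefs, base_url):
    """Return absolute URLs of the hrefs that point at .pdf files, in page order."""
    result = []
    for href in hrefs:
        if not href:  # link.get('href') gives None for <a> tags without href
            continue
        absolute = urljoin(base_url, href)  # resolves relative links
        if urlparse(absolute).path.lower().endswith(".pdf"):
            result.append(absolute)
    return result


def natural_key(name):
    """Sort key that compares digit runs as numbers, so page2 sorts before page10."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


# Hypothetical inputs, only to show the behaviour.
example_hrefs = [None, "#top", "frmHome.aspx", "epaper/page1.pdf",
                 "http://cdn.example.com/p/page2.PDF?v=3"]
example_urls = pdf_links(example_hrefs, "http://www.example.com/frmEPShow.aspx")
example_order = sorted(["page10.pdf", "page2.pdf", "page1.pdf"], key=natural_key)
```

With something like this, the list returned by pdf_links could stand in for the position-based link list, and natural_key could be passed as the key when sorting file names before merging. Whether that fits the real site depends on how its links and file names actually look.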

DEV Community
