DEV Community

Alan Stocco

Scraping payslips with Python | Selenium

Scenario:

I work at a company whose payslips can be downloaded from an ASPX portal, but only one at a time, not in bulk.
I needed all of them, both for bureaucratic reasons and to archive them.

How: Python and Selenium. I tried BeautifulSoup first, but it didn't work.

Explanation and Code

Web Driver

I used the Chrome WebDriver with a few options so that PDF files are saved to disk instead of being opened in the browser. See prefs in the code below.

Creating a class

from selenium import webdriver

class PaylipsScaper:
    def __init__(self, username, password):
        self.username = username
        self.password = password
        # Chrome options
        chrome_options = webdriver.ChromeOptions()
        prefs = {
            "plugins.always_open_pdf_externally": True,  # download PDFs instead of opening the viewer
            "download.default_directory": "C:\\tmp",     # folder where files are saved
            "download.prompt_for_download": False,
            "download.directory_upgrade": True,
            "safebrowsing.enabled": True,
        }
        chrome_options.add_experimental_option("prefs", prefs)
        chrome_options.add_argument("--headless")  # hide the browser window; remove to watch it run
        # executable_path is the Selenium 3 form; Selenium 4 replaces it with a Service object
        self.driver = webdriver.Chrome(executable_path='chromedriver.exe', options=chrome_options)

Login

The login phase is quite easy: select each field by id and fill in the value. The only thing to watch out for is the iframe. The login form lives inside one, so you have to call switch_to.frame before looking for the fields.

    # Manage login page
    # (needs: import time; from selenium.webdriver.common.by import By)
    def login(self, url):
        driver = self.driver
        driver.get(url)
        # The login form lives inside an iframe, so switch to it first
        driver.switch_to.frame("FunArea")
        username = driver.find_element(By.ID, "login")
        password = driver.find_element(By.ID, "pwd")
        username.send_keys(self.username)
        time.sleep(1)
        password.send_keys(self.password)
        driver.find_element(By.ID, "CmdInvia").click()

Loop table using XPATH

I created a class that wraps the Selenium driver to keep everything clean.
The script simply reproduces the clicks I would do by hand.
At first I tried CSS selectors, but given the structure of the pages, XPath turned out to be the better solution
(to get the XPath with Chrome see here).
By the way, I don't like time.sleep, but it was useful to avoid navigation problems during the process.

# Inside PaylipsScaper class
    def get_num_rows(self):
        driver = self.driver
        self.click_to_payslips_area()
        # Each <tr> of the inner table is one payslip
        rows = driver.find_elements(By.XPATH, "//table[@id='ContTab']/tbody/tr/td/div/table/tbody/tr")
        return len(rows)


[...other stuff...]
bot = PaylipsScaper(username, password)
try:
    bot.login(url_website)
    wait = WebDriverWait(bot.driver, 20)
    num_rows = bot.get_num_rows()
    for row in range(1, num_rows + 1):
        paylip_year  = bot.get_val_in_cedolino_row(row, 4)
        paylip_month = bot.get_val_in_cedolino_row(row, 5)
        paylip_type  = bot.get_val_in_cedolino_row(row, 7)
        # Click the download icon in column 10 of the current row
        icon_xpath = "/html/body/form/div/table/tbody/tr/td/div/table/tbody/tr[" + str(row) + "]/td[10]/img"
        icon = wait.until(EC.element_to_be_clickable((By.XPATH, icon_xpath)))
        bot.driver.execute_script("arguments[0].click();", icon)
        time.sleep(2)  # give the download time to finish
        # The most recently created PDF in the download folder is the one just saved
        list_of_files = glob.glob(os.path.join(dirpath, "*.pdf"))
        file_name = max(list_of_files, key=os.path.getctime)
        current_paylip = Paylip(paylip_year, paylip_month, paylip_type, file_name)
        bot.rename_and_move(current_paylip)
        print("Downloaded:")
        print(current_paylip)
finally:
    bot.driver.quit()  # close the browser even if a download fails
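The fixed time.sleep(2) followed by "take the newest PDF" can fail when a download is slow. One possible alternative is to poll the download folder until a new PDF shows up. This is only a sketch: wait_for_new_pdf is a hypothetical helper that is not part of the original script, and it assumes Chrome's default behaviour of writing in-progress downloads with a .crdownload suffix.

```python
import time
from pathlib import Path


def wait_for_new_pdf(download_dir, known_files, timeout=30.0, poll=0.25):
    """Return the first PDF in download_dir that is not in known_files.

    known_files is a set of Path objects taken before the click.
    The function keeps polling while Chrome's temporary *.crdownload
    files are present, so a half-written file is never returned.
    """
    folder = Path(download_dir)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        in_progress = any(folder.glob("*.crdownload"))
        new_pdfs = sorted(set(folder.glob("*.pdf")) - set(known_files))
        if new_pdfs and not in_progress:
            return new_pdfs[0]
        time.sleep(poll)
    raise TimeoutError(f"no new PDF appeared in {folder} within {timeout}s")
```

In the loop, the idea would be to take a snapshot with set(Path(dirpath).glob("*.pdf")) before clicking the download icon and then call wait_for_new_pdf(dirpath, snapshot) instead of sleeping.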

Save pdf file in folder and rename it

It's quite a brute-force solution: I take the most recent PDF saved in the download folder and rename it using the information from the website.
Then I move the file into a sub-folder named after the year.

    # Inside PaylipsScaper class
    def rename_and_move(self, current_paylip):
        year = str(current_paylip.paylip_year)
        month = str(current_paylip.paylip_month)
        # Build the new file name from the data read on the website
        if current_paylip.paylip_month == "":
            # No month: name it as a Cud document
            paylip_type = str(current_paylip.paylip_type).replace(" ", "_").replace("Completo", "").replace("NORMALE", "")
            new_file_name = 'Cud_' + year + '_' + paylip_type + '.pdf'
        elif "TREDICESIMA" in current_paylip.paylip_type:
            new_file_name = 'Cedolino_' + year + '_' + month + '_Tredicesima.pdf'
        else:
            new_file_name = 'Cedolino_' + year + '_' + month + '.pdf'
        print(new_file_name)
        # Rename the downloaded file inside the download folder
        renamed_path = os.path.join(dirpath, new_file_name)
        os.rename(current_paylip.file_name, renamed_path)
        current_paylip.file_name = renamed_path
        # Create the year directory if it does not exist yet
        year_dir = os.path.join(dirpath, year)
        if not os.path.exists(year_dir):
            os.mkdir(year_dir)
        # Move the file into the year directory, replacing any existing copy
        destination = os.path.join(year_dir, new_file_name)
        if os.path.exists(destination):
            os.remove(destination)
        shutil.move(current_paylip.file_name, destination)
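The naming rules can also be isolated in a small function that only computes the destination path, which makes them easy to check without a browser. The following is a sketch under some assumptions: the real Paylip class is not shown in this post, so the dataclass below is a stand-in with the four fields used above, and target_path and move_paylip are hypothetical names. It uses pathlib instead of string concatenation.

```python
import shutil
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Paylip:
    # Stand-in for the class used in the post (its real definition is not shown)
    paylip_year: str
    paylip_month: str
    paylip_type: str
    file_name: str


def target_path(base_dir, paylip):
    """Build <base_dir>/<year>/<name>.pdf following the naming rules above."""
    year = str(paylip.paylip_year)
    month = str(paylip.paylip_month)
    kind = str(paylip.paylip_type)
    if month == "":
        cleaned = kind.replace(" ", "_").replace("Completo", "").replace("NORMALE", "")
        name = f"Cud_{year}_{cleaned}.pdf"
    elif "TREDICESIMA" in kind:
        name = f"Cedolino_{year}_{month}_Tredicesima.pdf"
    else:
        name = f"Cedolino_{year}_{month}.pdf"
    return Path(base_dir) / year / name


def move_paylip(base_dir, paylip):
    """Move the downloaded file to its final place, replacing an older copy."""
    dest = target_path(base_dir, paylip)
    dest.parent.mkdir(parents=True, exist_ok=True)  # create the year folder if needed
    if dest.exists():
        dest.unlink()
    shutil.move(str(paylip.file_name), str(dest))
    paylip.file_name = str(dest)
    return dest
```

Keeping the path computation separate from the file operations means the rules for the three name formats can be verified with plain strings.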

Final situation

Got it. I now have a folder with one subfolder per year, each containing all the payslips with a standard name format.

What I learned:

  • How to use Selenium in Python.
  • Simple automation can save a lot of time and avoid boring manual tasks.
  • How to write my first article here (it's a personal task, so not so useful for you, but better than nothing after all).

Future improvements:

  • input parameters
  • (re)try to use CSS selectors instead of XPath selectors
  • (re)try to use BeautifulSoup
  • remember which payslips were already saved, so that the next run downloads only the new ones
  • read the PDFs and export the data to a file (e.g. Google Sheets)

Of course, the code itself is only useful to me and my colleagues. Still, I hope the idea and the process can be helpful to someone else.
