Ronan Azarias

Posted on Dec 31, 2020 • Edited on Jan 2, 2021

Scraping Skoob's Bookshelf

#scraping #python #selenium #datascience

About the Skoob site:

Skoob is a collaborative social network for Brazilian readers, launched in January 2009 by developer Lindenberg Moreira. Without advertising, the site became a meeting point for readers and writers who exchange tips about books and organize meetings in bookstores. The network allows interaction with other social networks, such as Facebook and Twitter, as well as popular e-commerce stores in Brazil, such as Americanas.com, Saraiva, and Submarino.

Source: Wikipedia

Skoob is the best-known site where Brazilian readers interact with each other.

It's an amazing and easy place to manage your bookshelf without installing any program or building a spreadsheet. The Skoob community itself registers the books in the system, so usually you just search for the book you want and add it to your bookshelf. Only occasionally have I had to register a book manually.

It's like Goodreads.

The issue

I have 806 books at home and I use Skoob to manage my personal library, but it's not easy to look up more than a couple of books for quick tasks like sending a list of books to a friend by text message or social media.

For this reason I decided to build a CSV file with some information from my bookshelf, so I can query it easily and copy and paste when necessary.

The Project

This code just finds and stores the title, author and publisher of the books in my Skoob account. It does not save reviews, ratings or other information from the site, because that information is not essential to my purpose.

It's important to say that this code only takes books from your own bookshelf, so you need to have an account and a filled bookshelf.

This is the very first version of this code, so I'm not worried about error handling or scraping more data, because that is not my priority for now. Maybe when I implement the synopsis feature I will add error handling or something like that.

I am using Jupyter Notebook and Selenium to scrape the site. I will not explain how to use Selenium or how to install it because there are already many sites about that. For more information, I suggest these sites: here or here

First of all, we import all the packages we need and then open Chrome on the site's login page.

import pandas as pd
import re
import requests
from requests import get
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from time import sleep
from pandas import DataFrame as df

url="https://www.skoob.com.br/login/"
driver = webdriver.Chrome()
driver.get(url)
sleep(3)
action = ActionChains(driver)

You cannot access your bookshelves, or see a specific user's page, unless you are logged in. So let's log in:

#user data
usr= input('user e-mail')
psw= input('password')

#finding form
username = driver.find_element_by_id("UsuarioEmail")
password = driver.find_element_by_id("UsuarioSenha")

#filling form
username.send_keys(usr)
password.send_keys(psw)
sleep(2)

#clicking "submit" button
driver.find_element_by_xpath('//*[@id="login-box-col02"]/form/div/div/input').click()

sleep(5)
print("Login Ok")

You are then sent to your feed page. If you look at the browser's address bar, you will see that it shows your account ID and your first name.

skoob.com.br/usuario/ID-FIRSTNAME

By clicking on your avatar you can access your bookshelf. Taking a close look, it seems that the site stores all of its users' bookshelves under a single path indexed by user ID.

skoob.com.br/estante/livros/todos/USER_ID

So what we will do is use a regular expression to collect the ID number from the feed page URL.

By default, the site shows books by cover. This is not useful for us, so we need to change it to the cascade layout. We will do this by clicking the specific button on the bookshelf page.

id_number= re.findall('[0-9]+',driver.current_url)
id_number=str(id_number[0])

# load the bookshelf page
url= "https://www.skoob.com.br/estante/livros/todos/" + id_number
driver.get(url)
driver.refresh()
sleep(3)

# we need the cascade layout to scrape the books, so:
driver.find_element_by_xpath('/html/body/div/div[2]/div[2]/div/div/div[2]/div/div[1]/a[2]').click()

print("Successful screen adjustment ")
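
The ID lookup takes the first run of digits found in the URL. A slightly stricter variant, sketched here under the assumption that the feed URL always has the skoob.com.br/usuario/ID-FIRSTNAME form shown earlier, anchors the pattern on the /usuario/ segment. The helper name extract_user_id and the sample URL are made up for illustration and are not part of the original notebook.

```python
import re

def extract_user_id(feed_url):
    """Return the numeric user ID from a feed URL, or None if absent."""
    # Hypothetical helper: assumes the ".../usuario/ID-FIRSTNAME" layout.
    match = re.search(r"/usuario/(\d+)", feed_url)
    return match.group(1) if match else None

# Made-up example URL, not a real account
print(extract_user_id("https://www.skoob.com.br/usuario/1234567-ronan"))
```

Returning None when nothing matches makes it easy to stop early if the login did not land on the feed page.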

I found that just clicking the next button results in an infinite loop and, for now, I don't want to handle that (maybe in the next upgrade), so I decided to calculate the number of pages I need to scrape.

# I need to calculate the number of pages because just clicking the next button results in an infinite loop.
qty_books=driver.find_element_by_xpath('//*[@id="corpo"]/div/div[4]/div[2]/div[1]/div[1]/div/span[1]').text
qty_books= re.findall("[0-9]+", qty_books)
# ceiling division: 36 books per page, and a partially filled last page still counts
qty_pages=(int(qty_books[0])+35)//36
qty_pages
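
The page count is a ceiling division: a partially filled last page still counts as a page. Here is a minimal sketch of that arithmetic, assuming the list shows 36 books per page as in the code above; pages_needed is a hypothetical helper name.

```python
import math

BOOKS_PER_PAGE = 36  # page size assumed from the code above

def pages_needed(total_books, per_page=BOOKS_PER_PAGE):
    """Number of pages required to list total_books, rounding up."""
    return math.ceil(total_books / per_page)

# A few bookshelf sizes, including an exact multiple of the page size
for n in (36, 54, 806):
    print(n, "books ->", pages_needed(n), "pages")
```

Rounding up rather than rounding to nearest matters for small shelves: a shelf of exactly 36 books fits on one page, not two.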

Finally I can scrape the data I'm looking for: title, author and publisher. All this information is stored in a list called l_books.

#list of books
l_books=[]

#counter
qty=1


while qty<=qty_pages:
    books=driver.find_elements_by_class_name('clivro')
    for b in books:
        d=b.find_element_by_class_name('livro-conteudo')
        title= d.find_element_by_class_name('ng-binding').text
        author_publisher=d.find_element_by_tag_name('p').text
        # this returns one string holding both values, so we use str.split() with the newline character (\n) as the delimiter
        author_publisher=  author_publisher.split('\n')
        # now I can separate them
        author=author_publisher[0]
        publisher= author_publisher[1]
        b_details={'title':title,'author': author,'publisher': publisher}
        l_books.append(b_details)
    qty+=1
    # next page
    driver.find_element_by_xpath('//*[@id="corpo"]/div/div[4]/div[2]/div[1]/div[2]/ul/li[8]/a').click()
    sleep(3)

len(l_books)
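
Splitting on the newline assumes every book shows both an author and a publisher. A more defensive sketch, assuming some entries might have only one line of text (I have not checked whether Skoob ever renders such entries), could look like this; split_author_publisher and the sample values are made up for illustration.

```python
def split_author_publisher(text):
    """Split text with one value per line into an (author, publisher) pair."""
    # Drop blank lines and surrounding whitespace
    parts = [p.strip() for p in text.split("\n") if p.strip()]
    author = parts[0] if parts else ""
    publisher = parts[1] if len(parts) > 1 else ""
    return author, publisher

# Made-up sample values for illustration
print(split_author_publisher("Author One\nPublisher X"))
print(split_author_publisher("Author One"))
```

A missing value becomes an empty string instead of raising an IndexError in the middle of the scrape.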

Now I export the data as a CSV file.

#saving the file
df_books = pd.DataFrame(l_books)

arcv="my_books.csv"

df_books.to_csv(arcv)

len(df_books)
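
Once the file exists, the quick lookups that motivated this project become short. Below is a rough sketch using only the standard library's csv module instead of pandas. It assumes the file has title, author and publisher columns; books_by_author and the sample rows are invented for illustration.

```python
import csv
import io

# Made-up sample with the title, author and publisher columns
SAMPLE_CSV = """title,author,publisher
Book A,Author One,Publisher X
Book B,Author One,Publisher Y
Book C,Author Two,Publisher Z
"""

def books_by_author(csv_file, name):
    """Return 'title (publisher)' strings for authors containing name."""
    rows = csv.DictReader(csv_file)
    return [f"{r['title']} ({r['publisher']})"
            for r in rows if name.lower() in r["author"].lower()]

# With the real file: open("my_books.csv", newline="", encoding="utf-8")
print("\n".join(books_by_author(io.StringIO(SAMPLE_CSV), "one")))
```

The result is plain text, one book per line, ready to paste into a message.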

And that's it!
I hope you enjoy it.

If you wish, the GitHub repository is here

Top comments (1)

Shailesh Vasandani • Jan 1 '21

Cool program! Scraping data from online is always super fun, and given how popular the site you're scraping from is, I'd definitely recommend making it into a library or even a website that people can use to scrape their own data.

Awesome post, and thanks for sharing!