DEV Community

Vicente G. Reyes

The output of the site I scrape includes html elements

I need to scrape only the table for the letter 'A'. This is my code so far:

import scrapy

class ChallengeSpider(scrapy.Spider):
    name = "challenge"
    allowed_domains = ["laws.bahamas.gov.bs"]
    start_urls = ["http://laws.bahamas.gov.bs/cms/en/legislation/acts.html"]

The problem is that when I parse the page, HTML elements appear in the output. This is my parse function:

    def parse(self, response):
        css_selector

Top comments (7)

Stokry

You can modify your parse function:

import scrapy

class ChallengeSpider(scrapy.Spider):
    name = "challenge"
    allowed_domains = ["laws.bahamas.gov.bs"]
    start_urls = ["http://laws.bahamas.gov.bs/cms/en/legislation/acts.html"]

    def parse(self, response):
        css_selector = ".hasTip"

        rows = response.css(css_selector)
        for row in rows:
            # Extract the relative PDF URL
            pdf_url_relative = row.css(".hasTip::attr(href)").get()
            if pdf_url_relative and pdf_url_relative.endswith(".pdf"):
                # Build the complete PDF URL by joining with the base URL
                pdf_url = response.urljoin(pdf_url_relative)
            else:
                pdf_url = None

            # Clean up the title, source_url, and date data
            title = row.css(".hasTip::text").get()
            source_url = response.url
            date = row.css(".hasTip::attr(title)").get()

            # Yield cleaned up data in a dictionary
            yield {
                "title": title.strip() if title else None,
                "source_url": source_url,
                "date": date.strip() if date else None,
                "pdf_url": pdf_url,
            }

response.urljoin builds the complete PDF URL by joining the relative href with the page's base URL, and strip() removes surrounding whitespace from the extracted title and date. pdf_url is also added to the output dictionary. You'll need to test this yourself, though.
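To see what those two steps do without running the spider, here is a minimal standalone sketch. It uses the standard library's urljoin as a stand-in for response.urljoin, and the relative PDF path and title are made-up sample values, not data taken from the site. The helper name build_item is also hypothetical.

```python
from urllib.parse import urljoin

PAGE_URL = "http://laws.bahamas.gov.bs/cms/en/legislation/acts.html"


def build_item(page_url, href, title, date):
    """Mirror the cleanup done inside parse() for a single row."""
    # Only treat the link as a PDF if the href actually ends with .pdf
    if href and href.endswith(".pdf"):
        pdf_url = urljoin(page_url, href)
    else:
        pdf_url = None
    return {
        "title": title.strip() if title else None,
        "source_url": page_url,
        "date": date.strip() if date else None,
        "pdf_url": pdf_url,
    }


# Sample values are assumptions for illustration only.
item = build_item(PAGE_URL, "/cms/images/example.pdf", "  Example Act \n", None)
print(item)
```

A missing title or date becomes None instead of raising an error, which is why the strip() calls are guarded.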

Vicente G. Reyes • Edited

Appreciate your help man! Did it work on your end? Tried it just now but it didn't work.

Stokry

Is there an error? If not, what is the output?

Vicente G. Reyes

Nothing shows in the output.json file lol

Stokry

The website may have mechanisms to block or limit scraping.
Before troubleshooting further, it would help to verify whether the Scrapy spider is getting any data at all from the website. You can do this by adding a print statement to your parse method:

def parse(self, response):
    css_selector = ".hasTip"
    rows = response.css(css_selector)
    print("Total rows: ", len(rows))
    for row in rows:
        ...  # same loop body as in the spider above

Then try running: scrapy runspider challenge_spider.py -o output.json
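If the row count comes back as zero, it can help to check whether the selector is wrong or the site returned a different page. A rough sketch of that check, run against saved HTML rather than the live site: it uses the standard library's html.parser instead of Scrapy selectors, and both HTML snippets below are invented samples, not the real page.

```python
from html.parser import HTMLParser


class HasTipCounter(HTMLParser):
    """Count start tags whose class attribute includes 'hasTip'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if "hasTip" in classes:
            self.count += 1


def count_has_tip(html):
    parser = HasTipCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Invented samples: one page with a matching link, one "blocked" page.
SAMPLE_HTML = '<table><tr><td><a class="hasTip" href="/a.pdf" title="2001">Act A</a></td></tr></table>'
BLOCK_PAGE = "<html><body><h1>Access denied</h1></body></html>"

print("Rows in sample page:", count_has_tip(SAMPLE_HTML))
print("Rows in block page:", count_has_tip(BLOCK_PAGE))
```

If a saved copy of the page from the browser contains matches but the spider reports none, the spider is probably receiving something other than the page you see.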

Vicente G. Reyes

Yeah, something stopped me from browsing the site when I visited it.

Stokry

You could use a proxy to change your IP address, or something similar.
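In Scrapy, the built-in HttpProxyMiddleware sends a request through a proxy when the request's meta dict carries a "proxy" key. A minimal sketch of preparing that dict follows. The proxy address and the helper name with_proxy are hypothetical, and whether a proxy actually gets past the block on this site is untested.

```python
# Hypothetical proxy endpoint; replace with a proxy you are allowed to use.
PROXY_URL = "http://proxy.example.com:8080"


def with_proxy(meta=None, proxy_url=PROXY_URL):
    """Return a copy of a request meta dict with the 'proxy' key set."""
    new_meta = dict(meta or {})
    new_meta["proxy"] = proxy_url
    return new_meta


# Inside a spider this would be passed along as, for example:
#     yield scrapy.Request(url, meta=with_proxy(), callback=self.parse)
print(with_proxy({"page": "acts"}))
```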