Vicente G. Reyes

Posted on Jul 26, 2023

The output of the site I scrape includes html elements

#help #python

I need to scrape only the table for the letter 'A'. This is my code so far:

class ChallengeSpider(scrapy.Spider):
    name = "challenge"
    allowed_domains = ["laws.bahamas.gov.bs"]
    start_urls = ["http://laws.bahamas.gov.bs/cms/en/legislation/acts.html"]

The problem is that when I parse the page, HTML elements appear in the output. This is my parse function:

    def parse(self, response):
        css_selector

…

Top comments (7)

Stokry • Jul 27 '23

You can modify your parse function:

import scrapy

class ChallengeSpider(scrapy.Spider):
    name = "challenge"
    allowed_domains = ["laws.bahamas.gov.bs"]
    start_urls = ["http://laws.bahamas.gov.bs/cms/en/legislation/acts.html"]

    def parse(self, response):
        css_selector = ".hasTip"

        rows = response.css(css_selector)
        for row in rows:
            # Extract the relative PDF URL
            pdf_url_relative = row.css(".hasTip::attr(href)").get()
            if pdf_url_relative and pdf_url_relative.endswith(".pdf"):
                # Build the complete PDF URL by joining with the base URL
                pdf_url = response.urljoin(pdf_url_relative)
            else:
                pdf_url = None

            # Clean up the title, source_url, and date data
            title = row.css(".hasTip::text").get()
            source_url = response.url
            date = row.css(".hasTip::attr(title)").get()

            # Yield cleaned up data in a dictionary
            yield {
                "title": title.strip() if title else None,
                "source_url": source_url,
                "date": date.strip() if date else None,
                "pdf_url": pdf_url,
            }

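response.urljoin builds the complete URL for the PDF file by joining the relative href with the base URL, and strip() cleans up the extracted title and date. The ::text and ::attr(...) selectors return only the text or attribute value, so no HTML elements end up in the output. pdf_url is also added to the output dictionary. I haven't run this myself, so you'll need to test it.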
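To see what those two cleanup steps do in isolation, here is a minimal sketch that runs without Scrapy. It uses the standard library's urllib.parse.urljoin as a stand-in for response.urljoin (which resolves a link against the response's base URL in a similar way). The helper names build_pdf_url and clean_text and the PDF path are hypothetical and only for illustration.

```python
from urllib.parse import urljoin

# The page URL comes from the spider above; the PDF path is a made-up example.
page_url = "http://laws.bahamas.gov.bs/cms/en/legislation/acts.html"
relative_href = "/cms/images/LEGISLATION/example.pdf"  # hypothetical path


def build_pdf_url(base_url, href):
    """Return an absolute URL for .pdf links, or None for anything else."""
    if href and href.endswith(".pdf"):
        return urljoin(base_url, href)
    return None


def clean_text(value):
    """Strip surrounding whitespace; pass None through unchanged."""
    return value.strip() if value else None


if __name__ == "__main__":
    print(build_pdf_url(page_url, relative_href))
    print(repr(clean_text("  Some Act \n")))
```

Keeping the None checks in small helpers like these avoids an AttributeError when a row has no href or no text.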

Vicente G. Reyes • Jul 27 '23 • Edited

Appreciate your help man! Did it work on your end? Tried it just now but it didn't work.

Stokry • Jul 27 '23

Is there an error or, what is the output?

Vicente G. Reyes • Jul 27 '23

Nothing shows in the output.json file lol

Stokry • Jul 27 '23

The website may have mechanisms to block or limit scraping.
Before troubleshooting further, it would help to verify whether the Scrapy spider is getting any data from the website at all. You can do this by adding a print statement to your parse function:

    def parse(self, response):
        css_selector = ".hasTip"
        rows = response.css(css_selector)
        print("Total rows: ", len(rows))
        for row in rows:
            # ... rest of the loop unchanged

Then try running: scrapy runspider challenge_spider.py -o output.json
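A slightly more informative version of that check could also record the HTTP status, since a block page often comes back with a non-200 status or with zero matching rows. This is only a sketch: describe_response is a hypothetical helper, and it assumes it is handed the Scrapy response object from parse (it only relies on response.url, response.status and response.css).

```python
import logging

logger = logging.getLogger("challenge.debug")


def describe_response(response, css_selector=".hasTip"):
    """Summarise what the spider actually received for one response.

    Works with any object exposing .url, .status and .css(selector),
    which a Scrapy response does.
    """
    rows = response.css(css_selector)
    summary = {
        "url": response.url,
        "status": response.status,
        "rows": len(rows),
    }
    logger.info(
        "Fetched %s status=%s rows=%d",
        summary["url"], summary["status"], summary["rows"],
    )
    return summary
```

Calling describe_response(response) as the first line of parse shows whether the page was fetched at all and whether the .hasTip selector matched anything.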

Vicente G. Reyes • Jul 27 '23

Yeah, something stopped me from browsing the site when I visited it.

Stokry • Jul 27 '23

You could use a proxy to change your IP, or something similar.
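As a rough sketch of the proxy idea: Scrapy's built-in HttpProxyMiddleware reads the proxy address from the "proxy" key of a request's meta dict. The helper below only builds that dict. PROXY_URL is a placeholder, not a working proxy, and proxy_meta is a hypothetical name.

```python
# Placeholder address; replace with a proxy you are allowed to use.
PROXY_URL = "http://127.0.0.1:8080"


def proxy_meta(proxy_url):
    """Build the meta dict that tells Scrapy which proxy to use.

    Returns an empty dict when no proxy is configured, so the request
    goes out directly.
    """
    return {"proxy": proxy_url} if proxy_url else {}


# Inside the spider, the dict would be passed along with each request,
# for example: scrapy.Request(url, callback=self.parse, meta=proxy_meta(PROXY_URL))
```

Whether this gets past the block depends on how the site decides to reject requests, so it is worth checking the response status first.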
