I need to scrape only the table for the letter 'A'. This is my code so far:
import scrapy


class ChallengeSpider(scrapy.Spider):
    name = "challenge"
    allowed_domains = ["laws.bahamas.gov.bs"]
    start_urls = ["http://laws.bahamas.gov.bs/cms/en/legislation/acts.html"]
The problem is that when I parse the page, HTML elements appear in the output. This is my parse function:
def parse(self, response):
    css_selector
    …
Top comments (7)
You can modify your parse function: response.urljoin will construct the complete URL for the PDF file by joining it with the base URL, and strip() will clean up the extracted title and date. Also add pdf_url to the output dictionary. But you need to test this out.

Appreciate your help, man! Did it work on your end? I tried it just now, but it didn't work.
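
A minimal sketch of the two cleanup steps mentioned in the comment about response.urljoin and strip(), written with only the standard library so it runs without Scrapy. Here urllib.parse.urljoin stands in for response.urljoin (which resolves a link against the response's own URL). The helper build_item, its parameters, and the sample title, date, and PDF path are all hypothetical and not taken from the actual page:

```python
from urllib.parse import urljoin

# Page URL from the spider's start_urls.
BASE_URL = "http://laws.bahamas.gov.bs/cms/en/legislation/acts.html"


def build_item(raw_title, raw_date, href, base_url=BASE_URL):
    """Clean the extracted text and resolve the PDF link (hypothetical helper)."""
    return {
        # strip() removes leading/trailing whitespace and newlines
        "title": raw_title.strip(),
        "date": raw_date.strip(),
        # urljoin turns a relative href into an absolute URL
        "pdf_url": urljoin(base_url, href),
    }


if __name__ == "__main__":
    # Made-up sample values, only to show the shape of the output.
    item = build_item("\n  Sample Act  ", " 2010 \t", "/cms/images/sample.pdf")
    print(item)
```

Inside a spider, the equivalent call would be response.urljoin(href), with the raw strings coming from selectors such as row.css("a::text").get() (the "a" selector is an assumption about the page). The ::text part selects the text node rather than the whole element, which may be why HTML elements show up in the output.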
Is there an error? If not, what is the output?
Nothing shows in the output.json file lol

The website may have mechanisms to block or limit scraping activities.
Before troubleshooting further, it would help to verify whether the Scrapy spider is getting any data at all from the website. You can do this by adding a print statement to your parse function:

def parse(self, response):
    css_selector = ".hasTip"
    rows = response.css(css_selector)
    # If this prints 0, the selector matched nothing in the response
    print("Total rows: ", len(rows))
    for row in rows:
        ...  # existing per-row extraction logic
Then try running:

scrapy runspider challenge_spider.py -o output.json

Yeah, something stopped me from browsing the site when I visited it.
You could use a proxy to change your IP address, or something similar.
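
If the site is throttling or blocking the spider, a few per-spider settings and a proxy are common things to try. This is a minimal sketch, assuming Scrapy's documented setting names and its built-in HttpProxyMiddleware, which reads the proxy from request.meta["proxy"]. The user-agent string, the delay and retry values, the proxy address, and the helper request_kwargs are hypothetical placeholders:

```python
# Candidate value for the spider's `custom_settings` class attribute.
CUSTOM_SETTINGS = {
    "USER_AGENT": "Mozilla/5.0 (compatible; challenge-spider)",  # placeholder
    "DOWNLOAD_DELAY": 2,           # base delay in seconds between requests
    "AUTOTHROTTLE_ENABLED": True,  # adapt the delay to server latency
    "RETRY_TIMES": 3,              # retry failed requests a few times
}


def request_kwargs(url, proxy_url=None):
    """Build keyword arguments for scrapy.Request (hypothetical helper)."""
    kwargs = {"url": url, "meta": {}}
    if proxy_url:
        # HttpProxyMiddleware routes the request through this proxy.
        kwargs["meta"]["proxy"] = proxy_url
    return kwargs


if __name__ == "__main__":
    print(request_kwargs(
        "http://laws.bahamas.gov.bs/cms/en/legislation/acts.html",
        proxy_url="http://proxy.example.com:8080",  # placeholder address
    ))
```

In the spider, this would be wired in as custom_settings = CUSTOM_SETTINGS on the class and scrapy.Request(**request_kwargs(url, proxy_url)) when yielding requests. Whether any of it helps depends on how the site actually limits access.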