I had previously written about scraping some stats from Basketball Reference, but when I went back to do it again my links no longer worked, which meant redoing some of the work. I didn't particularly love my last exploration with Nokogiri and scraping, so I took this as an opportunity to learn and expand my knowledge.
I'm not going to write this up as a walkthrough, since that blog already walks you through it well; I'm really just posting this to show different ways you can get the information you need.
I feel like finding elements with CSS selectors is a more stable approach than hardcoding the XPath. As long as the class name or id stays the same, it won't matter if the site ends up nesting the element deeper and breaking your hardcoded path.
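To illustrate what I mean (the HTML and selectors below are made up for illustration, not taken from the real page), a class-based CSS selector keeps working even if the element gets wrapped in another div, while a positional XPath breaks as soon as the nesting changes:

    require 'nokogiri'

    html = <<~HTML
      <div class="wrapper">
        <div class="product-actions">
          <h1 itemprop="name">Fortaleza Blanco</h1>
        </div>
      </div>
    HTML

    doc = Nokogiri::HTML(html)

    # positional XPath: breaks the moment the site wraps the heading in another div
    doc.xpath('/html/body/div/div/h1').text              # => "Fortaleza Blanco"

    # CSS by class/attribute: keeps working as long as the class and attribute stay put
    doc.css('div.product-actions h1[itemprop="name"]').text  # => "Fortaleza Blanco"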
One thing I spent a lot of time troubleshooting was that my variables were coming up empty despite seemingly doing everything correctly. In the end I realized it was because the site hadn't finished loading before I grabbed the response. I'm assuming there's a safer way to handle this than just calling sleep 2 or something.
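If I had to guess at that safer way (just a sketch, assuming Kimurai's browser is the Capybara session it normally exposes with the Selenium engines), it would be waiting for a specific element instead of a fixed sleep, since has_css? polls until the element shows up or the timeout runs out:

    def scrape_page
      # wait up to 10 seconds for the product section instead of sleeping a flat 2
      return unless browser.has_css?('div.product-actions', wait: 10)

      doc = browser.current_response
      # ... the rest of the parsing stays the same
    end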
My Code
require 'kimurai'
require 'json'

class TequilaScraper < Kimurai::Base
  @name = 'tqdb_scrap'
  @start_urls = ['https://tequilamatchmaker.com/tequilas/2325-fortaleza-blanco']
  @engine = :selenium_chrome

  @@tequilas = []

  def scrape_page
    # give the page time to finish loading before grabbing the response
    sleep 2
    doc = browser.current_response

    # top of the page: name, type, ratings, and price
    tequila = doc.css('div.product-actions')
    teq_name = tequila.css('h1[itemprop="name"]').text.gsub(/\n/, "")
    teq_type = tequila.css('div.product-type a').text.gsub(/\n/, "")
    teq_rating_p = tequila.css('ul.product-list__item__ratings li')[0].text.gsub(/\D/, '').gsub(/\n/, "")
    teq_rating_c = tequila.css('ul.product-list__item__ratings li')[1].text.gsub(/\D/, '').gsub(/\n/, "")

    # price isn't always listed, so fall back to 'n/a'
    teq_price_check = tequila.css('div.commerce-price-container div span')[1]
    if teq_price_check
      teq_price = teq_price_check.text.gsub(/\n/, "")
    else
      teq_price = 'n/a'
    end

    # production details table further down the page
    doc_mid = doc.css('div.container')
    teq_image = doc_mid.css('img.product-image').attr('src')
    rows = doc_mid.css('div.production-details_product table tbody tr')
    teq_nom = rows[0].css('td a').text.gsub(/\n/, "")

    # remove the screen-reader-only spans since they just hold commas
    doc_mid.search('span.sr-only').each do |span|
      span.remove
    end

    teq_agave        = rows[1].css('td').text.gsub(/\n/, "")
    teq_agave_region = rows[2].css('td').text.gsub(/\n/, "")
    teq_region       = rows[3].css('td').text.gsub(/\n/, "")
    teq_cooking      = rows[4].css('td').text.gsub(/\n/, "")
    teq_extraction   = rows[5].css('td').text.gsub(/\n/, "")
    teq_water        = rows[6].css('td').text.gsub(/\n/, "")
    teq_fermentation = rows[7].css('td').text.gsub(/\n/, "")
    teq_distillation = rows[8].css('td').text.gsub(/\n/, "")
    teq_still        = rows[9].css('td').text.gsub(/\n/, "")
    teq_aging        = rows[10].css('td').text.gsub(/\n/, "")
    teq_abv          = rows[11].css('td').text.gsub(/\n/, "")
    teq_other        = rows[12].css('td').text.gsub(/\n/, "")

    tequila = { name: teq_name, type: teq_type, rating_p: teq_rating_p, rating_c: teq_rating_c,
                price: teq_price, image_url: teq_image, nom: teq_nom, agave: teq_agave,
                agave_region: teq_agave_region, region: teq_region, cooking: teq_cooking,
                extraction: teq_extraction, water: teq_water, fermentation: teq_fermentation,
                distillation: teq_distillation, still: teq_still, aging: teq_aging,
                abv: teq_abv, other: teq_other }

    @@tequilas << tequila # if !@@tequilas.include?(tequila)
  end

  def parse(response, url:, data: {})
    scrape_page

    # write everything collected so far out to a JSON file
    File.open("tequila.json", "w") do |f|
      f.write(JSON.pretty_generate(@@tequilas))
    end

    @@tequilas
  end
end

TequilaScraper.crawl!
puts 'done scraping'
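Side note: Kimurai also has a parse! class method (assuming I'm remembering the API right) that runs a single parse method against one URL, which is handy for testing selector changes without kicking off a full crawl:

    # run only the parse method against one page while tweaking selectors
    TequilaScraper.parse!(:parse, url: 'https://tequilamatchmaker.com/tequilas/2325-fortaleza-blanco')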
Initially I was going to pull mock data from a React site, but I opted for this tequila database just because it was easy to test against. Using Selenium is a bit overkill for this, but all in all it worked really well. I think next time I'll have to dig into some Python or JS scraping.
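And if the data you're after is already in the raw HTML with no JavaScript rendering, my understanding is Kimurai can skip the browser entirely and use its Mechanize engine instead, with the rest of the spider staying the same; the only change would be something like:

    # plain HTTP engine, no browser needed (only works if the page doesn't rely on JavaScript)
    @engine = :mechanize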