DEV Community

Data Scraping in Rails by Processing CSV.

RailsCarma

This Ruby on Rails application scrapes the links uploaded in a CSV file and
counts the occurrences of a given link on each scraped page.

In the application, the user needs to pass a CSV file and a list of user emails to which the parsed CSV will be sent.

The CSV has two columns:
• refferal_link
• home_link

Each row holds the URL of the page to scrape (refferal_link) and the link to look for on that page (home_link).
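As an illustration, a file with these two columns can be read with Ruby's standard CSV library as sketched here. The example.com and example.org URLs are placeholders, not values from the original post; only the header names match what the job code in this post reads.

```ruby
require 'csv'

# Hypothetical sample input; the URLs are placeholders.
sample = <<~ROWS
  refferal_link,home_link
  https://example.com/blog,https://example.org/
  https://example.com/about,https://example.org/pricing
ROWS

# headers: true treats the first line as column names;
# header_converters: :symbol turns them into symbols such as :refferal_link.
rows = CSV.parse(sample, headers: true, header_converters: :symbol).map(&:to_h)

rows.each do |row|
  puts "scrape #{row[:refferal_link]} and look for #{row[:home_link]}"
end
```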

First, we create the Rails application:

$ rails new scrape_data

$ cd scrape_data

Then we generate the UploadCsv scaffold by running the command below:

$ rails g scaffold UploadCsv generated_csv:string csv_file:string

That creates all the required model, controller, and migration files for UploadCsv.

Next, we upload the file and store it in the DB.

Replace the contents of app/views/upload_csvs/_form.html.erb with the code below, which adds a file field to the form:

<%= form_with(model: upload_csv, local: true) do |form| %>
  <% if upload_csv.errors.any? %>
    <div id="error_explanation">
      <h2><%= pluralize(upload_csv.errors.count, "error") %> prohibited this upload_csv from being saved:</h2>

      <ul>
        <% upload_csv.errors.full_messages.each do |message| %>
          <li><%= message %></li>
        <% end %>
      </ul>
    </div>
  <% end %>

  <div class="field">
    <%= form.label :csv_file %>
    <%= form.file_field :csv_file %>
  </div>

  <div class="actions">
    <%= form.submit %>
  </div>
<% end %>

Then we add a gem for uploading the CSV file.

Add the line below to the Gemfile:

gem 'carrierwave', '~> 2.0'

$ bundle install

Then we create an uploader with CarrierWave:

$ rails generate uploader Avatar

Next, we attach the uploader to the model in app/models/upload_csv.rb:

class UploadCsv < ApplicationRecord
  mount_uploader :csv_file, AvatarUploader
end

Before moving further, check that your application works.
Run the commands below:

$ rake db:create db:migrate

Update the routes in config/routes.rb:

Rails.application.routes.draw do
  resources :upload_csvs
  root 'upload_csvs#index'
end

$ rails s

Then we create a job that reads the CSV file and scrapes the links in it.
The path of the generated file is saved in the generated_csv column of that record.

Generate the job as follows:

$ rails generate job genrate_csv

Add the gems below to the Gemfile and run bundle install:

gem 'httparty'
gem 'nokogiri'

Then replace the code in app/jobs/genrate_csv_job.rb with the following:

require 'csv'

class GenrateCsvJob < ApplicationJob
  queue_as :default

  OUTPUT_HEADERS = %w[referal_link home_link count].freeze

  def perform(upload_csv)
    rows = processed_csv(upload_csv)

    # Write the results to public/product_data.csv and store the path on the record
    path = Rails.root.join('public', 'product_data.csv').to_s
    CSV.open(path, 'w', write_headers: true, headers: OUTPUT_HEADERS) do |writer|
      rows.each { |row| writer << row }
    end
    upload_csv.update(generated_csv: path)

    # Requires the NotificationMailer that is generated later in this post
    NotificationMailer.send_csv(upload_csv).deliver_now! if rows.present?
  end

  # Returns one [refferal_link, home_link, count] row per input row, where count
  # is how often home_link appears among the anchors of the refferal_link page
  # ("0" when it does not appear)
  def processed_csv(upload_csv)
    rows = []
    CSV.foreach(upload_csv.csv_file.path, headers: true, header_converters: :symbol) do |row|
      page = HTTParty.get(row[:refferal_link])
      hrefs = Nokogiri::HTML(page.body).css('a').map { |link| link['href'] }
      rows << [row[:refferal_link], row[:home_link], hrefs.count(row[:home_link]).to_s]
    end
    rows
  end
end
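The counting step in processed_csv can be looked at separately from the HTTP request. The sketch below assumes the href values have already been extracted (the job does this with Nokogiri) and uses made-up URLs; count_link is a hypothetical helper name, not part of the original code. It also shows that the comparison is an exact string match, so a link without the trailing slash is not counted.

```ruby
require 'csv'

# Hypothetical helper: how many times does target appear among the hrefs?
# Anchors without an href attribute yield nil, so drop those first.
def count_link(hrefs, target)
  hrefs.compact.count(target)
end

# Placeholder hrefs standing in for the values extracted from a scraped page.
hrefs = ['https://example.org/', '/about', nil, 'https://example.org/', 'https://example.org']

count = count_link(hrefs, 'https://example.org/')

# Build the same three-column layout the job writes to product_data.csv.
output = CSV.generate(write_headers: true, headers: %w[referal_link home_link count]) do |csv|
  csv << ['https://example.com/blog', 'https://example.org/', count]
end

puts output
```

If near-duplicates such as a missing trailing slash should count as the same link, the URLs would need to be normalized before comparing; the post does not do this.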

Then we enqueue the job from an after_create callback on UploadCsv and add a validation that requires csv_file.

Update app/models/upload_csv.rb as follows:

class UploadCsv < ApplicationRecord
  mount_uploader :csv_file, AvatarUploader
  validates :csv_file, presence: true
  after_create :processed_csv

  # Enqueue the scraping job once the record has been created
  def processed_csv
    GenrateCsvJob.perform_later(self)
  end
end

After uploading a file, the generated CSV with the scrape results is written to
/scrape_data/public/product_data.csv, where you can inspect it.

We can also send the file by email using the steps below.

First, generate the mailer:

$ rails generate mailer NotificationMailer

Update the code of app/mailers/notification_mailer.rb:

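class NotificationMailer < ApplicationMailer
  def send_csv(upload_csv)
    @greeting = 'Hi'
    # generated_csv holds the path of the file written by the job
    attachments['parsed.csv'] = File.read(upload_csv.generated_csv)
    mail(to: "sample@gmail.com", subject: 'CSV is parsed successfully.')
  end
end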
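The introduction mentions sending the result to a list of user emails, while send_csv uses a single hard-coded address. One possible way to bridge that, assuming the addresses arrive as a comma-separated string (the post does not say how they are collected), is a small parsing helper. parse_recipients is a hypothetical name and the addresses are placeholders.

```ruby
require 'uri'

# Hypothetical helper: turn "a@example.com, b@example.com" into a clean array.
# Splits on commas and whitespace, keeps only strings that look like email
# addresses, and removes duplicates.
def parse_recipients(raw)
  raw.to_s
     .split(/[,\s]+/)
     .reject(&:empty?)
     .select { |address| address.match?(URI::MailTo::EMAIL_REGEXP) }
     .uniq
end

recipients = parse_recipients('first@example.com, second@example.com,, not-an-email first@example.com')
puts recipients.inspect
```

Action Mailer's mail method accepts an array for to:, so the result could be passed as mail(to: recipients, subject: ...) instead of the fixed address.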

Also configure mail delivery in config/environments/development.rb or production.rb.

Add the lines below to the file:

# host takes a bare hostname; the scheme goes in protocol
config.action_mailer.default_url_options = { host: 'sample-scrape.herokuapp.com', protocol: 'https' }
config.action_mailer.delivery_method = :smtp
config.action_mailer.smtp_settings = {
  user_name: 'sample@gmail.com',
  password: '*******123456',
  domain: 'gmail.com',
  address: 'smtp.gmail.com',
  port: 587,
  authentication: :plain
}
# Set this to true while debugging so that SMTP failures are not silently ignored
config.action_mailer.raise_delivery_errors = false
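Committing a real password to development.rb or production.rb is risky. A common alternative is to read the credentials from environment variables. The sketch below assumes variables named SMTP_USERNAME and SMTP_PASSWORD; these names, and the smtp_settings_from helper, are illustrative and not something the original post defines.

```ruby
# Hypothetical helper that builds the smtp_settings hash from an ENV-like object.
# fetch raises KeyError when a variable is missing, so a misconfigured
# environment fails early instead of at delivery time.
def smtp_settings_from(env)
  {
    user_name: env.fetch('SMTP_USERNAME'),
    password: env.fetch('SMTP_PASSWORD'),
    domain: 'gmail.com',
    address: 'smtp.gmail.com',
    port: 587,
    authentication: :plain
  }
end

# Placeholder values for demonstration; in the app you would pass ENV itself.
settings = smtp_settings_from({ 'SMTP_USERNAME' => 'sample@gmail.com', 'SMTP_PASSWORD' => 'secret' })
puts settings[:user_name]
```

In the environment file this could then be used as config.action_mailer.smtp_settings = smtp_settings_from(ENV), provided the helper is defined somewhere that is loaded before the configuration runs.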

Finally, update the view app/views/notification_mailer/send_csv.html.erb:

<h1>CSV has been processed, Thanks!</h1>

<p><%= @greeting %>, please find the generated CSV attached to this email.</p>

<p>Thanks!</p>
