Need help with python

#help

How would you achieve the following logic using Python?

Take a search query, for example, "why do I like dogs?"
Open a browser, navigate to DuckDuckGo (or another search engine), and search for the query.
Save the HTML of the search results page.
Open each URL on the first results page in a new tab.
Save the HTML of each opened URL.

Top comments (2)

rhymes • May 27 '19

Why do you need to open the pages in the browser? Wouldn't it be easier to just download the HTML?

Open the URL https://duckduckgo.com/?q=dogs with requests
Save the HTML
Parse it with html.parser from the standard library
Download every page the links point to

This is the simplest version I can think of. There are other ways to scrape pages and links.
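The steps above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it swaps requests for urllib.request so that it only needs the standard library, and the names LinkCollector, extract_links, fetch, save_search_and_links, and the output file names are all made up for illustration. It also assumes the result links appear as plain <a href="..."> tags in the downloaded HTML, which may not hold for a search page that builds its results with JavaScript.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen

# Assumption: a browser-like User-Agent, since some sites reject urllib's default.
HEADERS = {"User-Agent": "Mozilla/5.0"}


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)
        # Keep only http(s) links and skip duplicates.
        if urlparse(absolute).scheme in ("http", "https") and absolute not in self.links:
            self.links.append(absolute)


def extract_links(html, base_url):
    """Return the absolute http(s) links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def fetch(url):
    """Download a URL and return the raw response body as bytes."""
    with urlopen(Request(url, headers=HEADERS), timeout=10) as response:
        return response.read()


def save_search_and_links(search_url, out_dir):
    """Save the search page, then every page it links to, under out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    body = fetch(search_url)
    (out / "search.html").write_bytes(body)
    links = extract_links(body.decode("utf-8", errors="replace"), search_url)
    for index, link in enumerate(links):
        try:
            (out / f"result_{index}.html").write_bytes(fetch(link))
        except OSError as error:  # URLError and socket timeouts are OSError subclasses
            print(f"skipped {link}: {error}")
    return links
```

Calling save_search_and_links("https://duckduckgo.com/?q=dogs", "pages") would write search.html plus one result_N.html per link into a pages directory. Every link on the page is followed, so in practice you would probably want to filter out navigation and footer links first.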

If you truly need to "drive" the browser instead, you probably want to look into something like pyppeteer, which drives headless Chrome/Chromium.
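For the browser-driven route, a rough sketch with pyppeteer might look like the following. It assumes pyppeteer is installed (pip install pyppeteer); the function name save_rendered_html and the output file name are hypothetical, and error handling is kept to a minimum.

```python
import asyncio
from pathlib import Path


async def save_rendered_html(url, filename):
    """Load url in headless Chromium and save the rendered HTML to filename."""
    # Imported here so the module still loads when pyppeteer is not installed.
    from pyppeteer import launch

    browser = await launch()  # headless by default
    try:
        page = await browser.newPage()
        await page.goto(url)
        html = await page.content()  # HTML after the page's scripts have run
    finally:
        await browser.close()
    Path(filename).write_text(html, encoding="utf-8")
    return html


# Usage (not run here, it needs pyppeteer and a Chromium download):
# asyncio.run(save_rendered_html("https://duckduckgo.com/?q=dogs", "search.html"))
```

The difference from a plain download is that page.content() returns the HTML after JavaScript has executed, which matters for pages that build their results client-side.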

Areahints • May 29 '19

@rhymes

This is what I've tried to do:

import logging
from urllib.parse import quote_plus
from urllib.request import Request, urlopen

# Global variables

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
headers = {'User-Agent': user_agent}
url_google = 'https://www.google.com/search?q='
url_duck = 'https://duckduckgo.com/?q='


def set_custom_log_info(filename):
    logging.basicConfig(filename=filename, level=logging.INFO)


def report(e: Exception):
    # Must be called from an except block so the traceback is logged
    logging.exception(str(e))


def write_webpage_as_html(filename, data=b''):
    # data must be bytes because the file is opened in binary mode
    try:
        with open(filename, 'wb') as fobj:
            fobj.write(data)
    except Exception as e:
        print(e)
        report(e)
        return False
    else:
        return True


class Search:
    def __init__(self, url):
        self._url = url
        self._data = b''

    def retrieve_webpage(self):
        # Headers go on a Request object; the third positional
        # argument of urlopen() is the timeout, not the headers
        try:
            request = Request(self._url, headers=headers)
            with urlopen(request) as response:
                self._data = response.read()
        except Exception as e:
            print(e)
            report(e)
            return False
        else:
            if len(self._data) > 0:
                print('Retrieved successfully')
            return True

    def write_webpage_as_html(self, filename):
        # Delegate to the module-level helper with the downloaded bytes
        return write_webpage_as_html(filename, self._data)


if __name__ == '__main__':
    # Get user's search query and URL-encode it
    query = quote_plus(input('What are you searching for?: '))

    # Use user's choice to pick the search engine
    choice = int(input('Select Search Engine, Google = 1, Duckduckgo = 2: '))
    search_url = (url_google if choice == 1 else url_duck) + query

    # 'search.log' and 'search.html' are placeholder file names
    set_custom_log_info('search.log')
    search_scrap = Search(search_url)
    if search_scrap.retrieve_webpage():
        search_scrap.write_webpage_as_html('search.html')

I am still getting errors; any advice is welcome.
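Two things worth checking in a script like this are how the query is encoded and where the headers are passed. The sketch below is only an illustration under those assumptions: build_search_request is a made-up helper name and the short User-Agent string is a placeholder. It builds the request without sending it, so the resulting URL and headers can be inspected first.

```python
from urllib.parse import quote_plus
from urllib.request import Request

SEARCH_URL = "https://duckduckgo.com/?q="  # same base URL as in the post


def build_search_request(query, user_agent="Mozilla/5.0"):
    """Return a Request for the search page without sending it."""
    # quote_plus turns spaces into '+' and percent-encodes characters such as '?'.
    url = SEARCH_URL + quote_plus(query)
    # Headers belong on the Request; urlopen's third positional argument is the timeout.
    return Request(url, headers={"User-Agent": user_agent})


request = build_search_request("why do I like dogs?")
print(request.full_url)  # https://duckduckgo.com/?q=why+do+I+like+dogs%3F
```

Replacing punctuation with + by hand changes the query itself, whereas quote_plus keeps the question mark as %3F so the search engine receives the original text. The request can then be passed to urllib.request.urlopen(request).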

DEV Community
