DEV Community

Web Scraper & Data Extraction with Python | Upwork Series #1

Rashid on January 02, 2020

This post is cross-published with OnePublish. Welcome to the first post of the Upwork Series. In this series, we are going to work on real-world applicati...

Narendra Kumar Vadapalli • Jan 3 '20

Just a couple of minor points that might make the code cleaner.

If you define a helper like the following

def get_match_from_info(pattern):
    return re.search(pattern, information).group(1).strip()

then each extraction line can become

operating = get_match_from_info('Operating Status:(.*)Out')

Of course, you need to declare information as a global variable.
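One way to avoid the global is to pass the page text in as an argument. Here is a minimal sketch, assuming a hypothetical helper named get_match and a made-up sample string. Returning None when the pattern is absent is my own choice here, not something the original post does.

```python
import re


def get_match(pattern, text):
    """Return the first capture group of pattern in text, stripped, or None."""
    match = re.search(pattern, text)
    # re.search returns None when nothing matches, so guard before .group().
    return match.group(1).strip() if match else None


# Made-up sample text, shaped like the page text the post parses.
sample = "Operating Status: ACTIVE  Out of Service Date: None"
operating = get_match(r"Operating Status:(.*)Out", sample)
missing = get_match(r"Legal Name:(.*)DBA", sample)
```

With this shape, a missing field shows up as None instead of raising AttributeError, so the caller can decide whether to skip the record.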

Also, if you want to use pandas (pandas.pydata.org/), you can make use of DataFrame.to_csv, which writes the CSV file directly, header included, without the need for additional header kung fu.
Ref: pandas.pydata.org/pandas-docs/stab...
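As a rough illustration of the pandas route, the sketch below builds a DataFrame from a list of dicts and writes it out with DataFrame.to_csv. The rows and column names are made up for the example; in the scraper each dict would hold the fields extracted for one carrier. It writes to an in-memory buffer so it runs without touching the disk.

```python
import io

import pandas as pd

# Hypothetical rows; the values are placeholders, not real scraped data.
rows = [
    {"date": "01/02/2020", "operating": "ACTIVE", "legal_name": "EXAMPLE CARRIER LLC"},
    {"date": "01/02/2020", "operating": "INACTIVE", "legal_name": "SAMPLE TRUCKING INC"},
]

df = pd.DataFrame(rows)

# to_csv takes the header row from the column names;
# index=False leaves out the DataFrame's row index.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Passing a file path instead of the buffer writes the same content straight to a file.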

Ben Halpern • Jan 2 '20

This is a really interesting concept for a series!

Rashid • Jan 3 '20

Thanks🙌🚀

rpopovwex • Aug 13 '20 • Edited

For some reason, this code gave me an AttributeError when a DoT number was not found. I figured out that this was because bs.find('center') returns None when the field is missing (it doesn't exist on the page for a non-existent or outdated DoT number). I solved the problem by changing this:

except AttributeError:
    pass

to this:

except AttributeError:
    continue

so that instead of doing nothing (pass), the code moves on to the next DoT number. I also had to move the whole block of code starting with "information" one tab to the right so that it's only executed when the try block completes without errors. This way only valid DoT numbers are crawled and saved. Hope this helps!
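The difference between the two keywords can be seen in a small toy loop. This is only a sketch: parse_all and its on_error argument are hypothetical, and it uses ValueError from int() rather than the AttributeError from the scraper, purely to keep the example self-contained.

```python
def parse_all(values, on_error):
    """Parse values as integers, handling failures with 'pass' or 'continue'."""
    results = []
    for value in values:
        number = None
        try:
            number = int(value)
        except ValueError:
            if on_error == "continue":
                continue  # skip the rest of this iteration
            # on_error == "pass": do nothing and fall through
        # Reached on success, and also after a failure in "pass" mode.
        results.append(number)
    return results


with_pass = parse_all(["1", "x", "3"], "pass")
with_continue = parse_all(["1", "x", "3"], "continue")
```

In "pass" mode the failed item still reaches the code after the try block, which is where the scraper crashed; in "continue" mode it is skipped entirely.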

Here's how the code looks in the final form:

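import re
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# write_csv is the helper defined in the original post.
def crawl_data(url):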
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(req).read()
    bs = BeautifulSoup(html, 'html.parser')
    bold_texts = bs.find_all('b')
    for b in bold_texts:
        try:
            date = re.search('The information below reflects the content of the FMCSA management information systems as of(.*).', b.get_text(strip=True, separator='  ')).group(1).strip()
            if len(date) > 11:
                date = date.split(".",1)[0]
            print(date)
        except AttributeError:
            continue

        information = bs.find('center').get_text(strip=True, separator='  ')

        operating = re.search('Operating Status:(.*)Out', information).group(1).strip()
        legal_name = re.search('Legal Name:(.*)DBA', information).group(1).strip()
        physical_address = re.search('Physical Address:(.*)Phone', information).group(1).strip()
        mailing_address = re.search('Mailing Address:(.*)USDOT', information).group(1).strip()
        usdot_address = re.search('USDOT Number:(.*)State Carrier ID Number', information).group(1).strip()
        power_units = re.search('Power Units:(.*)Drivers', information).group(1).strip()
        drivers = re.search('Drivers:(.*)MCS-150 Form Date', information).group(1).strip()

        write_csv(date, operating, legal_name, physical_address, mailing_address, usdot_address, power_units, drivers)

rpopovwex • Aug 13 '20

Also, it'd be convenient to add some sort of progress bar that shows which DoT number is being crawled at the moment and how many are left, as well as a short message when a DoT number is not found.
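A plain-text version of that idea could look like the sketch below. Everything here is an assumption for illustration: crawl_all is a hypothetical wrapper, and crawl_one stands in for a crawler that returns True when data was found and False otherwise (the crawl_data function above does not return anything as written).

```python
def crawl_all(dot_numbers, crawl_one):
    """Call crawl_one for each DoT number, printing progress along the way.

    crawl_one is a hypothetical callable that returns True when data was
    found and saved, and False when the DoT number does not exist.
    """
    total = len(dot_numbers)
    not_found = []
    for position, dot in enumerate(dot_numbers, start=1):
        remaining = total - position
        print(f"[{position}/{total}] crawling DoT {dot} ({remaining} left)")
        if not crawl_one(dot):
            print(f"DoT {dot} not found, skipping")
            not_found.append(dot)
    return not_found


# Stand-in crawler for demonstration: pretends only these numbers exist.
known = {"1000001", "1000003"}
missing = crawl_all(["1000001", "1000002", "1000003"], lambda dot: dot in known)
```

Returning the list of missing numbers makes it easy to log or retry them after the run.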