DEV Community

foodspark

Part 3: How to Perform an EDA on Yelp Extracted Data?

This is the third article in a series that uses BeautifulSoup to scrape Yelp restaurant reviews and then applies Machine Learning to extract insights from the data. In this article, you will use the following script to collect all the reviews into a list:

```python
import time

import requests
import pandas as pd
from bs4 import BeautifulSoup
from textblob import TextBlob

# We use these arguments to scrape the website
rest_dict = [
    {
        "name": "the-cortez-raleigh",
        "link": "https://www.yelp.com/biz/the-cortez-raleigh?osq=Restaurants&start=",
        "pages": 3,
    },
    {
        "name": "rosewater-kitchen-and-bar-raleigh",
        "link": "https://www.yelp.com/biz/rosewater-kitchen-and-bar-raleigh?osq=Restaurants&start=",
        "pages": 3,
    },
]

# Scraping function
def scrape(rest_list):
    all_comment_list = list()
    for rest in rest_list:
        comment_list = list()
        for pag in range(1, rest['pages']):
            try:
                time.sleep(5)  # be polite between requests
                # Yelp paginates reviews in steps of 10: &start=10, 20, ...
                URL = rest['link'] + str(pag * 10)
                print(rest['name'], 'downloading page ', pag * 10)
                page = requests.get(URL)
                # Next step: parsing
                soup = BeautifulSoup(page.content, 'lxml')
                for comm in soup.find("yelp-react-root").find_all(
                        "p", {"class": "comment_373c0_Nsutg css-n6i4z7"}):
                    comment_list.append(comm.find("span").decode_contents())
                    print(comm.find("span").decode_contents())
            except Exception:
                print("could not work properly!")
        all_comment_list.append([comment_list, rest['name']])
    return all_comment_list

# Store all reviews in a list
reviews = scrape(rest_dict)
```
The output of the function is saved in a variable called reviews. Printing the variable gives:

The nested list's structure follows this pattern:

```
[[[review1, review2], restaurant1], [[review1, review2], restaurant2]]
```
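A placeholder sketch of that shape (a hypothetical sample variable standing in for the scraped data) shows how to index into it:

```python
# Placeholder mirroring the shape of the scraper's output
sample = [[["review1", "review2"], "restaurant1"],
          [["review3", "review4"], "restaurant2"]]

# sample[i][0] is the list of reviews, sample[i][1] the restaurant name
print(sample[0][0])  # ['review1', 'review2']
print(sample[0][1])  # restaurant1
```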
It will now be converted into a DataFrame using pandas.

Converting the Data into a DataFrame
You will need to develop a DataFrame to hold all of the information now that you have established a list using the ratings and their respective restaurants.

```python
df = pd.DataFrame(reviews)
```
If we convert this hierarchical list into a DataFrame directly, we end up with one column full of lists and another column holding a single restaurant title per row. To structure the data correctly, we will use the explode function, which creates one row for each element of the list in the column where it is applied, in this example column 0.

```python
df = df.explode(0)
```
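As a self-contained illustration (toy data standing in for the scraped reviews), explode turns each list element into its own row while repeating the restaurant name:

```python
import pandas as pd

# Toy version of the scraper output: column 0 holds lists of reviews
toy = pd.DataFrame([[["review1", "review2"], "restaurant1"],
                    [["review3"], "restaurant2"]])

# One row per review; the restaurant name in column 1 is duplicated
toy = toy.explode(0)
print(toy)
```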
The dataset is now appropriately structured: each review has a restaurant associated with it.

Because explode repeats the original index values (only 0 and 1 here), the only thing left to do is reset the index.

```python
df = df.reset_index(drop=True)
df[0:10]
```
Performing Sentiment Analysis to Classify Reviews
The star ratings that Yelp attaches to each review were not extracted from the website, so we will use sentiment analysis to reconstruct the missing information: the NLP model's inferred values will take the place of each review's star rating. Keep in mind that this is an experiment; the sentiment score depends on the model we employ, which is not always precise.

We will use TextBlob, a simple library that already includes a pre-trained algorithm for the task. Because you will have to apply it to every review, we will first develop a function that returns the estimated sentiment of a paragraph in a range of -1 to 1.

```python
def perform_sentiment(x):
    # TextBlob's sentiment property returns a named tuple
    # Sentiment(polarity, subjectivity); polarity ranges from -1
    # (negative) to 1 (positive)
    testimonial = TextBlob(x)
    return testimonial.sentiment.polarity
```
After defining the function, we will use the pandas apply method to add a new column to our dataset holding the algorithm's results. The sort_values method will then be used to sort all of the reviews, starting with the negative ones.
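Sketched on toy data, that step looks like this (a hand-rolled toy_polarity function stands in for the TextBlob-based perform_sentiment so the snippet runs on its own):

```python
import pandas as pd

def toy_polarity(text):
    # Stand-in for perform_sentiment: counts a few positive/negative cue words
    pos = sum(w in text.lower() for w in ("great", "nice", "good"))
    neg = sum(w in text.lower() for w in ("bad", "awful", "slow"))
    return (pos - neg) / max(pos + neg, 1)

demo = pd.DataFrame({0: ["Great food, nice staff",
                         "Awful service",
                         "Good menu"]})

# apply() fills the new column; sort_values() puts negative reviews first
demo['sentiment'] = demo[0].apply(toy_polarity)
demo = demo.sort_values(by='sentiment').reset_index(drop=True)
```

In the article's pipeline you would pass perform_sentiment to apply instead of the toy function.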

The final dataset will be:

Extracting Word Frequency
To continue with the experiment, we will now extract the most frequently used words in a portion of the dataset. However, there is a problem. Although certain words share the same root, such as "eating" and "ate," the algorithm will not automatically place them in the same category, because as raw strings they are different tokens. As a solution to this difficulty, we will employ lemmatization, an NLP pre-processing approach.

Lemmatization isolates the core of any existing word, removing inflectional variation and normalizing the data. Lemmatizers are models that must be pre-trained before they can be used. To import a lemmatizer, we will use the spacy library.
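To see why this normalization matters, here is a toy sketch with a hand-written lemma map standing in for a trained lemmatizer (spaCy's models perform this lookup, plus much more, for real vocabulary):

```python
from collections import Counter

# Hypothetical miniature lemma map; a real lemmatizer covers the whole language
LEMMAS = {"eating": "eat", "ate": "eat", "eats": "eat"}

tokens = ["eating", "ate", "menu", "eats"]
counts = Counter(LEMMAS.get(t, t) for t in tokens)
# Without lemmatization the three verb forms would be counted separately
print(counts["eat"])  # 3
```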

```
!pip install spacy
```
Spacy is an open-source NLP library that includes a lemmatizer and many pre-trained models. The following function will lemmatize the words in a text and compute the frequency of each term (the number of times it appears). We will arrange the results in descending order to indicate which words have appeared most frequently in a set of reviews.

```python
import spacy
from collections import Counter

def top_frequent(text, num_words):
    # Frequency of the most common words.
    # Note: spacy.load("en") is deprecated; download the model first with
    # python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    # Lemmatization: rebuild the text from each token's lemma
    doc = nlp(text)
    token_list = [token.lemma_ for token in doc]
    lemmatized = ' '.join(token_list)

    # Remove stopwords and punctuation, then count the remaining words
    doc = nlp(lemmatized)
    words = [token.text for token in doc
             if not token.is_stop and not token.is_punct]
    word_freq = Counter(words)
    common_words = word_freq.most_common(num_words)
    return common_words
```
We will extract the most common words from the worst-rated reviews, rather than the complete list of reviews. The data has already been sorted with the worst ratings first, so all that remains is to build a single string containing those reviews. To convert the review list into a string, we will use the join function.

```python
text = ' '.join(list(df[0].values[0:20]))
top_frequent(text, 100)
```

```
[('great', 22),
 ('<', 21),
 ('come', 16),
 ('order', 16),
 ('place', 14),
 ('little', 10),
 ('try', 10),
 ('nice', 10),
 ('food', 10),
 ('restaurant', 10),
 ('menu', 10),
 ('day', 10),
 ('butter', 9),
 ('drink', 9),
 ('dinner', 8),
 ...]
```
If you are looking to perform an EDA on Yelp data, contact Foodspark today!

Know more : https://www.foodspark.io/part-3-how-to-perform-an-eda-on-yelp-extracted-data.php
