Consumer Demand Prediction for Fast-Food Sector

This is a software system that predicts consumer demand in the fast-food sector.

Note: This solution was developed from scratch in Python. The aim is to predict consumer demand statistics for the fast-food sector with as much accuracy as possible. The application does not reach 100% accuracy, but its predictions come close to the actual data. The purpose of this blog is to explain the approach, and it should be useful for anyone working on a similar problem.

Link to the Application Demonstration Video
https://youtu.be/qMAt5fyCOyQ

Link to the Source Code
https://github.com/DevXtreme0/consumer-demand-prediction-for-fast-food-sector

Introduction

Shortages of fast food, surpluses over the estimated demand, and lost profit caused by inaccurate demand prediction are common problems in fast-food businesses. This project therefore proposes a solution that avoids these problems by predicting consumer demand for fast food using machine learning techniques. A forecasting algorithm known as CatBoost was implemented along with a data categorization technique. Fast-food demand is affected by several independent variables such as seasonality, trend, price fluctuation, and the length of the historical data. A combination of these selected variables was used to predict demand, with parameter tuning applied to the CatBoost algorithm and to the other algorithms used for experimentation (Linear Regression, LightGBM, and XGBoost). CatBoost was the best-performing model and was therefore selected. A standalone application was then developed to generate fast-food demand predictions and statistics.

Explanation of the Proposed Solution

Dataset Selection and Configuration

The dataset was derived from the Kaggle platform (https://www.kaggle.com/ghoshsaptarshi/av-genpact-hack-dec2018). It is a combination of three files: one contains historical demand information for each center, another contains center information, and the third contains meal information. An auxiliary file with test information was also used for the demand prediction.

  1. Historical Demand Information file – "trainForLearnInformation.csv"

This file contains the historical demand information for each center. The variables are listed below with brief descriptions:

base_price - Average price of the meal
checkout_price - Sold price of the meal
meal_id - Id of the meal
center_id - Id of the meal center
week - Week number of the sold meal
id - Id of the record
emailer_for_promotion - Removed this variable from the implementation
homepage_featured - Removed this variable from the implementation
num_orders - Demand for the meal

  2. Historical Center Information file – "centerInformation.csv"

This file contains historical data for each center. The variables are listed below with brief descriptions:

center_id - Id of the meal center
city_code - Code of the center located city
region_code - Removed this variable from the implementation
center_type - Type of the center
op_area - Removed this variable from the implementation

  3. Historical Meal Information file – "mealInformation.csv"

This file contains historical meal information. The variables are listed below with brief descriptions:

meal_id - Specific Id for the meal
category - Categorized name for the meal
cuisine - Type of the cuisine for the meal

  4. Test Information file (data from the 146th week to the 155th week) – "testInformation.csv"

This file contains test data for model validation purposes. It includes the same variables as the historical demand information file (trainForLearnInformation), except for the target variable “num_orders”.

Use of Data Preprocessing from the Extracted Data

When the user inputs files into the system, all information from the submitted files is merged into a single dataset. Therefore, the files must not contain null or missing values. After that, a validation process is initiated to verify that the merge keys match: when merging the “trainForLearnInformation” file, the values of a key variable must match the values of the same variable in the other file. For example, the “meal_id” values in the trainForLearnInformation file must match the “meal_id” values in the mealInformation file. Figure 1 shows how this is achieved.

Figure 1. Data Preprocessing Code
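
The code in figure 1 is a screenshot, so below is a minimal sketch of how the merge and key validation could look, assuming pandas and the file names listed above (the DataFrame names such as merged_df are my own, not necessarily those used in the project):

```python
import pandas as pd

# Load the three information files (file names as described in the dataset section).
train_df = pd.read_csv("trainForLearnInformation.csv")
center_df = pd.read_csv("centerInformation.csv")
meal_df = pd.read_csv("mealInformation.csv")

# The merge assumes complete records, so reject files with missing values.
for name, df in [("train", train_df), ("center", center_df), ("meal", meal_df)]:
    if df.isnull().values.any():
        raise ValueError(f"{name} file contains missing values")

# Validate the merge keys: every meal_id / center_id in the train file
# must exist in the meal / center information files.
assert set(train_df["meal_id"]).issubset(set(meal_df["meal_id"]))
assert set(train_df["center_id"]).issubset(set(center_df["center_id"]))

# Merge the three files into a single dataset on the shared keys.
merged_df = (
    train_df
    .merge(center_df, on="center_id", how="left")
    .merge(meal_df, on="meal_id", how="left")
)
```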

Use of Exploratory Data Analysis for the Dataset

Exploratory data analysis is the process of performing an initial investigation of the data: identifying patterns and anomalies, testing hypotheses, and validating assumptions with the help of graphical or statistical summaries. At the beginning of this process, it is necessary to identify and remove variables that do not contribute to the prediction. Four variables were identified for removal: region_code, op_area, emailer_for_promotion and homepage_featured. The emailer_for_promotion and homepage_featured columns were dropped by updating the file, while region_code and op_area were kept in the dataset for the time being to evaluate whether they were needed. The number of entries for each variable and their data types in each data file are shown in figures 2 to 5.

Figure 2. Train Information Dataset Content

Figure 3. Center Information Dataset Content

Figure 4. Meal Information Dataset Content

Figure 5. Test Information Dataset Content

The next step of the exploratory data analysis is the standardization of features. Standardization is a technique that rescales the values of numeric columns to a common scale without distorting the differences in their ranges. Standardizing the created features was necessary because the model's accuracy was not acceptable and a high deviation of values was observed during implementation. A sample of the standardized dataset is shown in figure 6.

Figure 6. Sample of Standardized Dataset
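
The post does not show which scaler was used; the sketch below assumes scikit-learn's StandardScaler applied to the price columns of the merged DataFrame from the earlier sketch:

```python
from sklearn.preprocessing import StandardScaler

# Rescale the numeric price columns to zero mean and unit variance.
numeric_cols = ["base_price", "checkout_price"]

scaler = StandardScaler()
standardized_df = merged_df.copy()
standardized_df[numeric_cols] = scaler.fit_transform(standardized_df[numeric_cols])
print(standardized_df[numeric_cols].head())
```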

The final step of the exploratory data analysis is to identify the correlation between variables. Correlation is a statistical technique that measures the relationship between variables. Three methods are commonly used to calculate correlation: the Pearson correlation coefficient, the Kendall rank correlation coefficient, and Spearman's rank correlation coefficient. The Pearson correlation coefficient was chosen since it is widely used and the most suitable for this study. The heatmap of pairwise Pearson correlations is shown in figure 7.

Figure 7. Generated Heatmap to Identify Variable Correlation
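
A rough sketch of how a heatmap like figure 7 can be produced with pandas and seaborn (the plotting details are assumptions, not the project's exact code):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlation of the numeric columns in the merged dataset.
corr = merged_df.select_dtypes("number").corr(method="pearson")

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Pairwise Pearson correlation")
plt.tight_layout()
plt.show()
```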

According to figure 7, “base_price” and “checkout_price” are highly correlated with each other, so these variables alone might be enough to create the final model. Before adding them to the final model, however, it was decided to validate them against the “num_orders” variable. Therefore, the “base_price” value was subtracted from the “checkout_price” value and the result was stored as a special price. The relationship between the “num_orders” variable and the special price variable is displayed in the scatterplot shown in figure 8.

Figure 8. Scatter Plot to Observe the Relationship of the Selected Variables
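
A sketch of the special price calculation and the scatterplot in figure 8, under the same assumptions as above (the column name special_price is mine):

```python
import matplotlib.pyplot as plt

# Special price: base_price subtracted from checkout_price, as described above.
merged_df["special_price"] = merged_df["checkout_price"] - merged_df["base_price"]

plt.figure(figsize=(8, 6))
plt.scatter(merged_df["special_price"], merged_df["num_orders"], s=5, alpha=0.3)
plt.xlabel("special_price")
plt.ylabel("num_orders")
plt.title("Special price vs. number of orders")
plt.show()
```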

However, as figure 8 shows, the relationship between these variables is non-linear and not a good basis for the final model. Therefore, it was decided to focus on other techniques rather than depending on variable selection based on linearity and correlation alone.

Use of Feature Engineering from the Extracted Information

Feature engineering is a method of creating features based on domain knowledge in order to enhance the performance and accuracy of the machine learning models built on the dataset. The table below describes the features created from the available dataset (a sketch of how they might be derived follows the table).

Created Feature - Description

Special Price - Price obtained by subtracting the “base_price” value from the “checkout_price” value.
Special Price Percent - The special price expressed as a percentage.
Special Price T/F - Indicates whether a special price applies: -1 if a special price exists, 0 otherwise.
Weekly Price Comparison - A week-to-week comparison of the rise or fall in the price of a meal at a specific center.
Weekly Price Comparison T/F - Indicates whether the price increased: -1 if it increased, 0 otherwise.
Year - The year, derived from the week number in the dataset.
Quarter - The quarter (one-fourth of a year), derived from the week number in the dataset.
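
The feature code is shown only as screenshots later on, so the sketch below illustrates how features like these could be derived with pandas; the column names, the groupby keys and the diff-based weekly comparison are my assumptions, and the Year and Quarter features are sketched separately after figure 12:

```python
# Special price, as computed in the exploratory analysis above.
merged_df["special_price"] = merged_df["checkout_price"] - merged_df["base_price"]
merged_df["special_price_percent"] = merged_df["special_price"] / merged_df["base_price"] * 100

# -1 when a special price exists, 0 otherwise (the encoding described in the table).
merged_df["special_price_tf"] = (merged_df["special_price"] != 0).map({True: -1, False: 0})

# Week-over-week checkout price change per center/meal pair.
merged_df = merged_df.sort_values(["center_id", "meal_id", "week"])
merged_df["weekly_price_comparison"] = (
    merged_df.groupby(["center_id", "meal_id"])["checkout_price"].diff().fillna(0)
)
merged_df["weekly_price_comparison_tf"] = (
    (merged_df["weekly_price_comparison"] > 0).map({True: -1, False: 0})
)
```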

Use of Data Transformation to Eliminate Outliers

In the demand prediction context, outliers in the target variable “num_orders” must be eliminated. This is achieved using the interquartile range method. In addition, log transformation is the most popular of the transformations used in feature engineering to make skewed data approximately normal. Since the target variable “num_orders” does not follow a normal distribution, leaving it untransformed would reduce the performance of the model. Therefore, a log transformation was applied to the target variable “num_orders” so that its values approximately conform to normality. The implementation is shown in figure 9.

Figure 9. Outlier Detection Code
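
A minimal sketch of the common 1.5 × IQR rule on the target variable; the exact thresholds used in figure 9 are not visible in the post, so treat this as one possible implementation of the interquartile range method:

```python
# Interquartile range (IQR) outlier detection on the target variable.
q1 = merged_df["num_orders"].quantile(0.25)
q3 = merged_df["num_orders"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = merged_df[(merged_df["num_orders"] < lower) | (merged_df["num_orders"] > upper)]
print(f"Outliers detected: {len(outliers)} of {len(merged_df)} rows")
```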

Elaboration of the Machine Learning Algorithms Used for Demand Prediction

The data was modeled using gradient boosting algorithms (XGBoost, LightGBM and CatBoost) and a linear regression algorithm. These algorithms were combined with feature extraction, data transformation and data preprocessing to achieve better accuracy in the predicted result. This section elaborates on how the algorithms were implemented.

Figure 10. Get Data through DataFrame
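
Figure 10 is a screenshot; the training files were already loaded and merged in the first sketch, so a plausible equivalent here is to load the test file and give it the same columns as the merged training data (the DataFrame names are mine):

```python
import pandas as pd

# Load the test weeks and merge in the same center and meal information
# so the test DataFrame has the same columns as the training DataFrame.
test_df = pd.read_csv("testInformation.csv")
test_df = (
    test_df
    .merge(center_df, on="center_id", how="left")
    .merge(meal_df, on="meal_id", how="left")
)
```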

After the above step, the categorical features were encoded and the data received through that process was prepared for standardization, as shown in figure 11. The “astype” method is used to convert the column data types, which helps reduce memory usage.

Figure 11. Encoding Categorical Features
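
The encoding code itself is an image; the sketch below shows one way to encode the categorical columns with astype, using the pandas category dtype and its integer codes (the column list is my assumption):

```python
# Categorical columns present after the merge (assumed list).
cat_cols = ["center_type", "category", "cuisine"]

for col in cat_cols:
    # astype("category") stores the labels as small integer codes,
    # which keeps the information while reducing memory usage.
    # In practice the categories should be aligned between train and test
    # so the same label always maps to the same code.
    merged_df[col] = merged_df[col].astype("category").cat.codes
    test_df[col] = test_df[col].astype("category").cat.codes
```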

After the step displayed in figure 11, the “week” values in the dataset were categorized into the created features “Quarter” and “Year”, as shown in figure 12. The reasoning is as follows: the train dataset contains 146 weeks of data, which is approximately 11 quarters, and one quarter consists of approximately 13 weeks; the week number is therefore divided by 13, and for the purpose of the calculation 12 quarters were defined. A year consists of approximately 52 weeks, so the dataset covers roughly 3 years, and the week number is divided by 52 for the year. The map method is used to return the result of this calculation for each record, and the resulting data is then used for outlier detection.

Figure 12. Categorizing Year and Quarter Aspects
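
My reading of “divided by 13” and “divided by 52” as a ceiling division is an assumption; a sketch of that mapping with the map method could look like this:

```python
import math

# Map each week number to a quarter (13 weeks per quarter, giving 12 quarters
# for 146 weeks) and to a year (52 weeks per year, giving 3 years of data).
for df in (merged_df, test_df):
    df["Quarter"] = df["week"].map(lambda w: math.ceil(w / 13))
    df["Year"] = df["week"].map(lambda w: math.ceil(w / 52))
```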

Before the outlier detection, it was observed that a log transformation of the target feature in the training dataset was necessary. The log transformation of the target feature is shown in figure 13.

Figure 13. Applying Log Transformation on the Target Feature
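
Whether plain log or log1p was used in figure 13 is not visible in the screenshot; the sketch below assumes numpy's log1p so that rows with zero orders remain finite:

```python
import numpy as np

# Log-transform the target so its distribution is closer to normal.
merged_df["num_orders"] = np.log1p(merged_df["num_orders"])
```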

The reason for the outlier detection is that, without the log transformation, a large deviation was observed in the trained dataset. Outlier detection was therefore implemented with the interquartile range method as mentioned above. The result of the outlier detection is shown in figure 14.

Figure 14. Result of Outlier Detection

With the data tuned as explained in the sections above, CatBoostRegressor was chosen as the final model for demand prediction. Before finalizing the model, its accuracy was tested using the training data. The train dataset contains 146 weeks of data, so it was split into a training set (week 1 to week 136) and a test set (week 136 to week 146), as shown in figure 15.

Figure 15. Dataset Splitting into Train Set and Test Set
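
A sketch of that split by week number; whether week 136 falls in the training or the test portion is not visible in figure 15, so the boundary below is an assumption:

```python
# Hold out the last weeks of the training data for validation:
# roughly weeks 1-135 for training and weeks 136-146 for testing.
train_part = merged_df[merged_df["week"] < 136]
test_part = merged_df[merged_df["week"] >= 136]
```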

To improve the prediction result, it was necessary to drop some variables that do not affect the prediction: “id” and “city_code” were identified as irrelevant for training; “num_orders” is the prediction target; “special price” is calculated from the base price and checkout price but showed a lack of correlation with the target; “week” was already categorized into quarter and year; and “special price percent” was also removed.

Figure 16. Removing Variables
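
A sketch of the column removal, using the column names from my earlier sketches (the project's actual names may differ):

```python
# Columns removed before training, as listed above.
drop_cols = ["id", "city_code", "num_orders", "special_price", "week", "special_price_percent"]

X_train = train_part.drop(columns=drop_cols)
y_train = train_part["num_orders"]
X_test = test_part.drop(columns=drop_cols)
y_test = test_part["num_orders"]
```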

After removing the irrelevant variables, the CatBoostRegressor model was fitted to the training data using the fit method, and predictions were then produced with the predict method, as shown in figure 17.

Figure 17. Model Training and Data Prediction
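
A minimal fit/predict sketch with CatBoostRegressor; the hyperparameters below are placeholders, not the tuned values used in the project:

```python
from catboost import CatBoostRegressor

# Fit CatBoost on the training split and predict the held-out weeks.
model = CatBoostRegressor(iterations=1000, learning_rate=0.1, depth=8, verbose=100)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
```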

The predicted result was evaluated with the implemented standard evaluation metrics, as shown in figure 18.

Figure 18. Used Evaluation Metrics for Model Evaluation
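
The post does not list which metrics were implemented; common regression metrics such as RMSE, MAE and R² would look like this with scikit-learn (computed on the log-transformed target used for training):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.4f}  MAE: {mae:.4f}  R^2: {r2:.4f}")
```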

The results of the evaluation metrics are shown in figure 19.

Figure 19. Model Evaluation Result from Metrics

In figure 19, the model training time and prediction time are reported in the format HH:MM:SS:NS, where HH represents hours, MM minutes, SS seconds and NS nanoseconds. The predictions of the implemented model were found to be very close to the actual results. To confirm this, a scatterplot was created to visualize the relationship between the actual values and the predicted values (refer to figure 20).

Figure 20. Scatterplot to Observe the Relationship Between the Actual Values and Predicted Values
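
A sketch of such a scatterplot; the expm1 call assumes the log1p transformation from the earlier sketch and undoes it so the comparison is in actual order counts:

```python
import numpy as np
import matplotlib.pyplot as plt

# Back-transform the log-scale values to actual order counts.
actual = np.expm1(y_test)
predicted = np.expm1(y_pred)

plt.figure(figsize=(8, 6))
plt.scatter(actual, predicted, s=5, alpha=0.3)
plt.xlabel("Actual num_orders")
plt.ylabel("Predicted num_orders")
plt.title("Actual vs. predicted demand")
plt.show()
```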

As shown in figure 20, the actual values tend to increase as the predicted values increase, so there is a positive linear correlation between them with a small number of outliers. This model was therefore adopted for the demand prediction process. Only two modifications were needed: the selected week range of the train data was adjusted to week 1 through week 146, and the selected week range of the test data was adjusted to week 146 through week 156. Optionally, a time-period selection was implemented so the user can limit the prediction window.

Considering the possibility of sharing the data, two kinds of output can be stored on the user's local storage:

- Store the generated graph in local storage - Matplotlib's pyplot provides a mechanism for saving a generated graph in PNG format, so that functionality was used rather than a customized graph-saving routine.
- Generate numeric statistics as a CSV file and store it in local storage - When the prediction process finishes, the results are stored in a defined location on local storage (refer to figure 21).

Figure 21. CSV File Generation Code
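
Figure 21 is a screenshot; a minimal equivalent using pyplot's savefig and pandas' to_csv might look like this (the output paths and column names are placeholders):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Save the generated graph as a PNG (call this before plt.show() on the figure to keep).
plt.savefig("output/demand_prediction.png", dpi=150, bbox_inches="tight")

# Write the numeric prediction statistics to a CSV file for sharing.
results = pd.DataFrame({
    "week": test_part["week"].values,
    "predicted_num_orders": predicted,
})
results.to_csv("output/demand_prediction.csv", index=False)
```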
