Dasbang, F. Joseph

Posted on Sep 27, 2023

Predicting Poverty Reduction in Nigeria: A Machine Learning Approach

#nigeria #poverty #machinelearning #deprivation

Introduction

Multidimensional poverty and child deprivation severely affect the health, education, and general well-being of Nigeria's population, especially its young people. In Nigeria, poverty is defined as having insufficient access to essential goods and opportunities. This data science project set out to answer one research question: "Can machine learning methods effectively predict and prescribe poverty reduction in Nigeria with limited datasets?" My approach covered data collection, preprocessing, model creation, and evaluation. In the final sections, I evaluate the model's performance, draw conclusions, and offer suggestions for further work.

Significance of Machine Learning
Utilizing cutting-edge machine learning methods offers a crucial chance to thoroughly understand and handle these complex problems. Modern algorithms can power predictive and prescriptive studies, which can offer priceless insights for resource allocation and wise policy decisions.

Research Question
"Can machine learning methods effectively predict and prescribe poverty reduction in Nigeria with limited datasets?"

Objectives
The main goal of this project is to develop an innovative framework that smoothly combines recurrent neural network (RNN) and Ensemble learning techniques in order to achieve the following:

Create predictive models using ensemble learning and recurrent neural networks (RNN) to predict multidimensional poverty in Nigerian subnational regions using a limited dataset.
Assess the models' accuracy and efficacy as well as the insight they generate.

Data Sources and Preprocessing
In this project, I used two datasets:

The subnational multidimensional poverty data for Nigeria, taken from the global Multidimensional Poverty Index (MPI) published on the Humanitarian Data Exchange by the Oxford Poverty and Human Development Initiative (OPHI), University of Oxford.
The Multiple Indicator Cluster Survey (MICS) 2016–17, which is a household survey conducted by UNICEF that covers various indicators related to health, education, water and sanitation.

Installing Libraries:
In addition to Python, you'll need pandas for data analysis, Matplotlib for visualization, scikit-learn for preprocessing, PrettyTable for printing tables, and TensorFlow for modeling. The code is run in JupyterLab. You can set the packages up using:

conda install -c conda-forge pandas matplotlib scikit-learn prettytable tensorflow

#or

pip install pandas matplotlib scikit-learn prettytable tensorflow

Importing datasets:

# Import the CSV files (raw strings keep the backslashes in Windows paths literal)
import pandas as pd
mpi_df = pd.read_csv(r"C:\Users\Documents\MPI.csv")
mics_df = pd.read_csv(r"C:\Users\Documents\MICS.csv")

Viewing the head of the MPI and MICS datasets:

# Viewing the head of the MPI dataset
mpi_df.head()

# Viewing the head of the MICS dataset
mics_df.head()

#Output
+--------------------+------+----------------+---------------------------+
| Subnational Region | Year | MPI of Nigeria | Population size by region |
+--------------------+------+----------------+---------------------------+
|        Abia        | 2018 |     0.254      |           2,966           |
|      Adamawa       | 2018 |     0.254      |           4,412           |
|     Akwa Ibom      | 2018 |     0.254      |           4,487           |
|      Anambra       | 2018 |     0.254      |           6,915           |
+--------------------+------+----------------+---------------------------+

┌────────────────────┬───────────┬──────────┐
│ Subnational Region │ Nutrition │   Health │
│          String15? │ String15? │ Float64? │
├────────────────────┼───────────┼──────────┤
│               Abia │      34.6 │     46.5 │
│            Adamawa │     35.0  │     64.6 │
│          Akwa Ibom │      29.3 │     77.4 │
└────────────────────┴───────────┴──────────┘

Dropping unwanted columns and converting the Year column from float to integer:

mpi_df.drop("Unnamed: 16", axis=1, inplace=True)
mics_df.drop("Unnamed: 17", axis=1, inplace=True)

# Replace NaN values in the "Year" column with 0
# (assigning the result avoids the deprecated chained inplace fill)
mpi_df['Year'] = mpi_df['Year'].fillna(0)

# Convert the "Year" column to integers
mpi_df['Year'] = mpi_df['Year'].astype(int)
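
Filling missing years with 0 keeps the column integer-typed, but it also introduces a year that never occurred. As a hedged alternative, pandas' nullable `Int64` dtype can hold missing values directly. The sketch below uses a small made-up frame (`demo_df` is illustrative, not the MPI data) to compare the two approaches.

```python
import numpy as np
import pandas as pd

# Illustrative data only: a float "Year" column with one missing value
demo_df = pd.DataFrame({"Year": [2018.0, np.nan, 2018.0]})

# Approach used in this post: replace NaN with 0, then cast to int
filled = demo_df["Year"].fillna(0).astype(int)

# Alternative: the nullable Int64 dtype keeps the value missing (<NA>)
nullable = demo_df["Year"].astype("Int64")

print(filled.tolist())
print(nullable.isna().tolist())
```

Which option fits better depends on whether later steps can handle missing values; the 0 placeholder is simpler but should not be read as a real year.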

Generating a bar chart with Matplotlib to show how multidimensional poverty is spread across the subnational regions:

import matplotlib.pyplot as plt
import numpy as np

# Create a list of unique colors for each state
colors = [
    'red', 'green', 'blue', 'yellow', 'orange', 'purple', 'brown', 'pink',
    'cyan', 'magenta', 'lime', 'teal', 'lavender', 'turquoise', 'gold', 'maroon',
    'olive', 'navy', 'chocolate', 'limegreen', 'peru', 'indigo', 'deeppink', 'darkcyan',
    'lightcoral', 'mediumblue', 'darkgreen', 'darkred', 'darkviolet', 'saddlebrown', 'seagreen',
    'dodgerblue', 'lightgray', 'crimson', 'royalblue', 'indianred', 'darkslategray', 'skyblue'
]

# Create a list of Subnational Regions and their corresponding Number of MPI poor by region
subnational_regions = [
    'Abia', 'Adamawa', 'Akwa Ibom', 'Anambra', 'Bauchi', 'Bayelsa', 'Benue', 'Borno', 
    'Cross River', 'Delta', 'Ebonyi', 'Edo', 'Ekiti', 'Enugu', 'FCT', 'Gombe', 'Imo', 
    'Jigawa', 'Kaduna', 'Kano', 'Katsina', 'Kebbi', 'Kogi', 'Kwara', 'Lagos', 'Nasarawa', 
    'Niger', 'Ogun', 'Ondo', 'Osun', 'Oyo', 'Plateau', 'Rivers', 'Sokoto', 'Taraba', 'Yobe', 'Zamfara'
]

mpi_poor_values = [
    277, 2759, 933, 595, 5520, 503, 2208, 4504, 791, 936, 2400, 553, 551, 904, 406, 3085, 676, 
    5930, 6499, 8993, 9049, 4906, 890, 1428, 511, 1213, 4739, 554, 766, 997, 1407, 2236, 1096, 
    4053, 2846, 5955, 5031
]

# Create the bar chart with unique colors for each state
plt.figure(figsize=(12, 8))
bars = plt.barh(subnational_regions, mpi_poor_values, color=colors)

# Adding data labels at the tip of each bar
for bar, num_mpi_poor in zip(bars, mpi_poor_values):
    width = bar.get_width()
    plt.text(width + 20, bar.get_y() + bar.get_height() / 2, num_mpi_poor, va='center', fontsize=10)

# Remove x-axis and ticks
plt.gca().get_xaxis().set_visible(False)

plt.title('Number of MPI Poor by Subnational Region (Year 2018)')

# Save the chart as a PNG file
plt.savefig('mpi_poor_by_region.png', bbox_inches='tight')

# Display the chart
plt.show()

Chart Output:

Generating a stacked bar chart showing how deprivation from multiple factors such as nutrition, housing, sanitation, and education contributes to poverty:

import matplotlib.pyplot as plt

# Plot the stacked bar chart
ax = mics_df.plot(kind="bar", stacked=True, figsize=(12, 8))

# Create a dictionary to map numeric values to state names
state_names = {
    0: "Abia", 1: "Adamawa", 2: "Akwa Ibom", 3: "Anambra", 4: "Bauchi", 5: "Bayelsa", 6: "Benue", 7: "Borno",
    8: "Cross River", 9: "Delta", 10: "Ebonyi", 11: "Edo", 12: "Ekiti", 13: "Enugu", 14: "FCT", 15: "Gombe",
    16: "Imo", 17: "Jigawa", 18: "Kaduna", 19: "Kano", 20: "Katsina", 21: "Kebbi", 22: "Kogi", 23: "Kwara",
    24: "Lagos", 25: "Nasarawa", 26: "Niger", 27: "Ogun", 28: "Ondo", 29: "Osun", 30: "Oyo", 31: "Plateau",
    32: "Rivers", 33: "Sokoto", 34: "Taraba", 35: "Yobe", 36: "Zamfara"
}

# Create a custom legend for Subnational Region names
custom_legend = [
    plt.Line2D([0], [0], marker='o', color='w', label=f'{i} - {state_names[i]}',
               markersize=10, markerfacecolor='C0')
    for i in range(len(mics_df.index))
]

# Create a legend for the Subnational Regions and pin it to the axes,
# because a second ax.legend() call would otherwise replace it
regions_legend = ax.legend(handles=custom_legend, title="Subnational Regions", bbox_to_anchor=(1.05, 1), loc='upper left')
ax.add_artist(regions_legend)

# Add a legend for the Factors to the right
ax.legend(title="Factors", bbox_to_anchor=(1.05, 0), loc='lower left')

# Add title and labels
plt.title("Multiple Indicator Cluster by Subnational Region")
plt.xlabel("Subnational Region")
plt.ylabel("")

# Show the chart
plt.show()

Chart Output:

Merging data frame:

# Merging both data frames on their common header
from prettytable import PrettyTable

merged_df = pd.merge(mpi_df, mics_df, on="Subnational Region")

# Viewing the merged data in tabular form by selecting the desired headers
selected_headers = ['Subnational Region', 'Year', 'MPI of Nigeria', 'Education']
num_rows_to_display = 4
table = PrettyTable(selected_headers)
for _, row in merged_df[selected_headers][:num_rows_to_display].iterrows():
    table.add_row(row.tolist())
print(table)

# Output indicating both data frames were successfully merged
+--------------------+------+----------------+-----------+
| Subnational Region | Year | MPI of Nigeria | Education |
+--------------------+------+----------------+-----------+
|        Abia        | 2018 |     0.254      |    9.9    |
|      Adamawa       | 2018 |     0.254      |    49.9   |
|     Akwa Ibom      | 2018 |     0.254      |    15.6   |
|      Anambra       | 2018 |     0.254      |    7.6    |
+--------------------+------+----------------+-----------+
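
Printing the first rows shows that the merge ran, but not whether every region found a match. A minimal sketch of a stricter check, assuming made-up frames (`left`, `right`, and their values are illustrative, not the MPI or MICS data), uses the `indicator` and `validate` options of `pd.merge`:

```python
import pandas as pd

# Illustrative frames only; the real data comes from MPI.csv and MICS.csv
left = pd.DataFrame({"Subnational Region": ["Abia", "Adamawa", "Lagos"],
                     "Year": [2018, 2018, 2018]})
right = pd.DataFrame({"Subnational Region": ["Abia", "Adamawa", "Kano"],
                      "Education": [9.9, 49.9, 60.0]})

# how="outer" with indicator=True keeps unmatched rows and labels their origin;
# validate="one_to_one" raises if a region is duplicated on either side
check = pd.merge(left, right, on="Subnational Region",
                 how="outer", indicator=True, validate="one_to_one")

unmatched = check.loc[check["_merge"] != "both", "Subnational Region"].tolist()
print(sorted(unmatched))
```

An empty `unmatched` list would suggest that the default inner merge did not silently drop any region.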

Importing libraries for training model:
At this point, I'll train a machine learning model using a Recurrent Neural Network (RNN) to predict and prescribe poverty reduction in Nigeria, using the merged dataset containing various socio-economic indicators. I'll then use an ensemble method to improve the predictive performance and robustness of the model, and finally apply early stopping to prevent overfitting.

import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.utils.class_weight import compute_class_weight

# Define categorical and numerical features
categorical_features = [
    'Subnational Region',
    'Year of the survey',
]
numerical_features = [
    'Population 2019',
    'Population 2020',
    'Population share by region',
    'Population size by region',
    'Number of MPI poor by region',
    'Nutrition',
    'Health',
    'Water',
    'Sanitation',
    'Housing',
    'Information',
    'Education',
    'Water.1',
    'Sanitation.1',
    'Housing.1',
    'Information.1',
    'Education.1',
    'Water.2',
    'Sanitation.2',
    'Housing.2',
    'Information.2',
    'Intensity of deprivation among the poor',
    'Vulnerable to poverty',
    'In severe poverty',
]

Encode categorical features:
I'll need to encode the categorical features because machine learning algorithms can only process them once they are converted to a numerical format.

# Encode categorical features
label_encoders = {}
for feature in categorical_features:
    if feature != 'Year of the survey':
        le = LabelEncoder()
        merged_df[feature] = le.fit_transform(merged_df[feature])
        label_encoders[feature] = le
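
To make the encoding step concrete, here is a small sketch of what `LabelEncoder` does to a list of region names. The `regions` list is illustrative; the point is that classes are numbered in sorted order and that the mapping can be reversed later.

```python
from sklearn.preprocessing import LabelEncoder

regions = ["Kano", "Abia", "Lagos", "Abia"]  # illustrative values

encoder = LabelEncoder()
codes = encoder.fit_transform(regions)

# Classes are sorted alphabetically, so Abia -> 0, Kano -> 1, Lagos -> 2
print(codes.tolist())

# inverse_transform recovers the original names from the integer codes
print(encoder.inverse_transform(codes).tolist())
```

Keeping the fitted encoders in a dictionary, as the loop above does, is what makes this reverse lookup possible after training.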

Split the data into features and target:
This step separates the input features (independent variables) from the target variable (dependent variable).

# Split the data into features and target
X = merged_df.drop(columns=['Multidimensional Poverty Index'])
y = merged_df['Multidimensional Poverty Index']

Standardize numerical features:
Standardization makes different numerical features comparable by scaling them to have a mean of 0 and a standard deviation of 1.

# Standardize numerical features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X[numerical_features] = scaler.fit_transform(X[numerical_features])

# Specify all the columns you want from 'merged_df'
selected_columns = ['Year', 'Subnational Region', 'MPI of Nigeria',
       'Multidimensional Poverty Index',
       'Population in multidimensional poverty',
       'Intensity of deprivation among the poor', 'Vulnerable to poverty',
       'In severe poverty', 'Year of the survey', 'Population 2019',
       'Population 2020', 'Population share by region',
       'Population size by region', 'Number of MPI poor by region',
       'Nutrition', 'Health', 'Water', 'Sanitation', 'Housing', 'Information',
       'Education', 'Water.1', 'Sanitation.1', 'Housing.1', 'Information.1',
       'Education.1', 'Water.2', 'Sanitation.2', 'Housing.2', 'Information.2']

# Create a new DataFrame using all the selected columns from 'merged_df'
# (.copy() avoids SettingWithCopyWarning when new_df is modified later)
new_df = merged_df[selected_columns].copy()
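
As a quick illustration of the standardization step, the sketch below scales two columns that sit on very different scales and confirms that each ends up with mean 0 and standard deviation 1. The numbers are taken from the sample rows shown earlier (population size and nutrition) and serve only as an example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Example values only: two features on very different scales
values = np.array([[2966.0, 34.6],
                   [4412.0, 35.0],
                   [4487.0, 29.3]])

scaled = StandardScaler().fit_transform(values)

# Each column now has (approximately) zero mean and unit standard deviation
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```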

Identify the target variable:
This step determines the variable you want to predict or model.

# Identify the target variable
target_variable = 'Multidimensional Poverty Index'

# Convert the target variable to string type
new_df[target_variable] = new_df[target_variable].astype(str)

# Check the unique values in the target variable
unique_values = new_df[target_variable].unique()

Check if any class has only one member:
This check flags classes with a single member, which cannot be divided between the training and test sets during stratified splitting.

# Check if any class has only one member
classes_with_one_member = [val for val in unique_values if new_df[target_variable].value_counts()[val] == 1]

Now that I've identified the classes with only one member, I need to combine these rare classes. This requires a logical rule that specifies which rare classes are combined and how. Here, every flagged class whose frequency falls below a threshold is merged into a single class called "Combined Class".

if len(classes_with_one_member) > 0:
    print(f"Classes with only one member: {classes_with_one_member}")

    # Combine Rare Classes
    rare_class_threshold = 5
    for val in classes_with_one_member:
        if new_df[target_variable].value_counts()[val] < rare_class_threshold:
            # Combine the rare class with a more frequent class
            new_df[target_variable] = new_df[target_variable].replace(val, 'Combined Class')

# Output
Classes with only one member: 
['0.33', '0.45', '0.17', '0.34', '0.08', '0.19', '0.31', '0.4', '0.59', '0.15', '0.21', '0.02', '0.16', '0.38', '0.26', '0.06', '0.56', '0.54', '0.48']
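
The same grouping can be written without a loop. The following is a minimal sketch on made-up labels (the `labels` series is illustrative), not a replacement for the code above:

```python
import pandas as pd

# Illustrative labels only
labels = pd.Series(["0.09", "0.09", "0.12", "0.12", "0.33", "0.45"])

counts = labels.value_counts()
rare = counts[counts == 1].index  # classes with a single member

# Replace every rare label in one pass; frequent labels are left untouched
grouped = labels.where(~labels.isin(rare), "Combined Class")
print(grouped.value_counts().to_dict())
```

Computing `value_counts()` once, rather than inside a loop, also avoids recounting the column for every class.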

Split the data into training and test sets:
Create separate datasets for model training and evaluation to assess model performance.

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    new_df.drop(columns=[target_variable]),  # Features excluding the target variable
    new_df[target_variable],  # Target variable
    test_size=0.2,  # Adjust the test size as needed
    random_state=0,
    stratify=new_df[target_variable]  # Ensure stratified splitting
)
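
To show what `stratify` does, the sketch below splits a toy label list with a 3:1 class ratio and counts the classes on each side. `X_demo` and `y_demo` are illustrative placeholders.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Illustrative labels only: 30 samples of class "a" and 10 of class "b"
y_demo = ["a"] * 30 + ["b"] * 10
X_demo = [[i] for i in range(len(y_demo))]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Both splits keep the original 3:1 ratio between the classes
print(Counter(y_tr), Counter(y_te))
```

With stratification, a class needs at least two members to appear on both sides of the split, which is why the single-member classes were combined first.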

Now I can proceed with model training and evaluation.
Check the distribution of the combined target variable:
Understand the distribution of the target variable to identify potential issues or biases in the dataset.

combined_target = new_df[target_variable]

# Check the distribution of the combined target variable
class_distribution = combined_target.value_counts()

print(class_distribution)

# Output
Combined Class    19
0.09               7
0.12               3
0.04               2
0.49               2
0.05               2
0.37               2
Name: Multidimensional Poverty Index, dtype: int64

The "Combined Class" has 19 samples, more than any of the remaining individual classes ('0.09', '0.12', '0.04', '0.49', '0.05', and '0.37'). It is therefore no longer the smallest class, and no class is left with a single member. I decided not to calculate class weights at this point and went ahead with training the model.

Next, I define a function that builds the RNN model's architecture. It's a straightforward RNN model with an embedding layer, an LSTM layer, and a dense layer with sigmoid activation.

# Define the RNN Model:

def create_rnn_model(input_dim):
    model = tf.keras.models.Sequential([
        tf.keras.layers.Embedding(input_dim, 128),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(1, activation="sigmoid")
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

Create an Ensemble of RNNs:
In this section, I'll use stratified k-fold cross-validation to build an ensemble of RNNs. Multiple RNN models are trained on distinct subsets of the training data by the code.

# Create an Ensemble of RNNs
from sklearn.preprocessing import LabelEncoder

# Encode the string class labels as integers
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Expected range of the input values: 0 - 29
min_expected_value = 0
max_expected_value = 29
new_embedding_size = 65  # You can adjust this size as needed

# Assumption: the fold variables come from a stratified k-fold split of the
# training set; the number of folds (5) is not stated in the original post
ensemble_models = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X_train, y_train_encoded):
    X_train_fold = X_train.iloc[train_idx]
    y_train_fold = y_train_encoded[train_idx]

    # Adjust values outside the range to the nearest boundary value
    X_train_fold = np.clip(X_train_fold, min_expected_value, max_expected_value)

    # Create a new model with the adjusted embedding size and train it
    model = create_rnn_model(input_dim=new_embedding_size)
    model.fit(X_train_fold, y_train_fold, epochs=10)
    ensemble_models.append(model)

# Output
Epoch 1/10
1/1 [==============================] - 6s 6s/step - loss: 0.8241 - accuracy: 0.0435
Epoch 2/10
1/1 [==============================] - 0s 21ms/step - loss: 0.6181 - accuracy: 0.0435
Epoch 3/10
1/1 [==============================] - 0s 30ms/step - loss: 0.4171 - accuracy: 0.0435
Epoch 4/10
1/1 [==============================] - 0s 26ms/step - loss: 0.2095 - accuracy: 0.0435
Epoch 5/10
1/1 [==============================] - 0s 31ms/step - loss: -0.0125 - accuracy: 0.0435
Epoch 6/10
1/1 [==============================] - 0s 25ms/step - loss: -0.2572 - accuracy: 0.0435
Epoch 7/10
1/1 [==============================] - 0s 39ms/step - loss: -0.5360 - accuracy: 0.0435
Epoch 8/10
1/1 [==============================] - 0s 32ms/step - loss: -0.8657 - accuracy: 0.0435
Epoch 9/10
1/1 [==============================] - 0s 24ms/step - loss: -1.2690 - accuracy: 0.0435
Epoch 10/10
1/1 [==============================] - 0s 31ms/step - loss: -1.7772 - accuracy: 0.0435

The model has been trained with the adjusted data. Training completed for 10 epochs, and the accuracy and loss values were computed for each epoch. Note that the loss turns negative from epoch 5 onward, which binary cross-entropy can only do when the labels fall outside the 0 to 1 range, so these numbers should be read with caution. I can now proceed with the ensemble of models.
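
The negative loss values in the log can be reproduced by hand. The sketch below evaluates the textbook binary cross-entropy formula for a single prediction, once with a valid binary label and once with an integer label of the kind `LabelEncoder` produces. The probability `p` and the label value 5 are illustrative choices, and Keras' own implementation adds numerical safeguards that are omitted here.

```python
import numpy as np

def binary_cross_entropy(y_true, p):
    """Per-sample binary cross-entropy for a predicted probability p."""
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

p = 0.9  # a confident sigmoid output (illustrative)

loss_valid_label = binary_cross_entropy(1, p)    # label inside {0, 1}
loss_encoded_label = binary_cross_entropy(5, p)  # integer class label above 1

print(loss_valid_label, loss_encoded_label)
```

With a label of 1 the loss is positive, as expected. With a label of 5 the `(1 - y_true)` term flips sign and the loss becomes negative, so the optimizer can keep lowering it without the predictions getting any closer to the labels.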

# Check for values less than 0
min_values = np.min(X_train, axis=0)
values_below_0 = min_values < 0
print("Columns with values below 0:", np.where(values_below_0)[0])

# Check for values greater than or equal to 65
max_values = np.max(X_train, axis=0)
values_above_or_equal_to_65 = max_values >= 65
print("Columns with values above or equal to 65:", np.where(values_above_or_equal_to_65)[0])

# Output
Columns with values below 0: [ 4  5  6 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28]
Columns with values above or equal to 65: [0 3 7]

The output makes it clear that several columns contain out-of-range values. One way to address this is to adjust the clipping range. (Clipping limits the values of your data to a specified range, so adjusting the range controls how values outside it are handled.)
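
A short sketch of how `np.clip` treats its bounds, using made-up values. It also shows one detail that matters for the checks that follow: the upper bound is inclusive, so clipping to 65 still leaves values equal to 65, whereas an embedding layer created with `input_dim=65` accepts indices 0 to 64 only.

```python
import numpy as np

values = np.array([-1.5, 0.0, 12.0, 64.0, 65.0, 120.0])  # illustrative

# The upper bound is inclusive, so 65 survives clipping to [0, 65]
clipped_inclusive = np.clip(values, 0, 65)

# Clipping to input_dim - 1 keeps every value inside the range 0..64
clipped_for_embedding = np.clip(values, 0, 64)

print(clipped_inclusive.max(), clipped_for_embedding.max())
```

Clipping keeps the code from failing, but it also discards information: every value above the bound collapses to the same number.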

# Check X_val_clipped
max_value_val = np.max(X_val_clipped.values)
if max_value_val >= 65:
    print("Warning: X_val_clipped contains values greater than or equal to 65.")
else:
    print("All values in X_val_clipped are less than 65.")

# Check X_train_clipped
max_value_train = np.max(X_train_clipped.values)
if max_value_train >= 65:
    print("Warning: X_train_clipped contains values greater than or equal to 65.")
else:
    print("All values in X_train_clipped are less than 65.")

# Output
Warning: X_train_clipped contains values greater than or equal to 65.
Warning: X_val_clipped contains values greater than or equal to 65.

# Adjust clipping range for X_val
X_val_clipped = np.clip(X_val, 0, 65)

# Adjust clipping range for X_train
X_train_clipped = np.clip(X_train, 0, 65)

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_clipped)
X_val_scaled = scaler.transform(X_val_clipped)

for model in ensemble_models:
    model.fit(X_train_scaled, y_train, epochs=10, batch_size=32)  # Adjust epochs and batch_size as needed

# Output
Epoch 1/10
1/1 [==============================] - 2s 2s/step - loss: 0.8182 - accuracy: 0.0000e+00
Epoch 2/10
1/1 [==============================] - 0s 34ms/step - loss: 0.8372 - accuracy: 0.0000e+00
Epoch 3/10
1/1 [==============================] - 0s 23ms/step - loss: 0.8499 - accuracy: 0.0000e+00
Epoch 4/10
1/1 [==============================] - 0s 30ms/step - loss: 0.8544 - accuracy: 0.0000e+00
Epoch 5/10
1/1 [==============================] - 0s 27ms/step - loss: 0.8493 - accuracy: 0.0000e+00
Epoch 6/10
1/1 [==============================] - 0s 38ms/step - loss: 0.8339 - accuracy: 0.0000e+00
Epoch 7/10
1/1 [==============================] - 0s 30ms/step - loss: 0.8096 - accuracy: 0.0000e+00
Epoch 8/10
1/1 [==============================] - 0s 27ms/step - loss: 0.7786 - accuracy: 0.0000e+00
Epoch 9/10
1/1 [==============================] - 0s 34ms/step - loss: 0.7442 - accuracy: 0.0000e+00
Epoch 10/10
1/1 [==============================] - 0s 35ms/step - loss: 0.7093 - accuracy: 0.0000e+00

I have trained the ensemble of models on the scaled training data. However, the training accuracy is very low (0.0000e+00), which could indicate an issue with the model or the dataset. To mitigate this, I'll implement early stopping, which monitors a validation metric (e.g., validation loss or accuracy) during training and stops training if the metric doesn't improve for a certain number of epochs. Before implementing it, I'll run some checks on the data: inspect the data shapes and types, verify that the data is shuffled, and take a random sample to confirm visually that it looks as expected.

#Data Shape and Types:Check the shapes and data types of your input features (X_train_scaled) and labels (y_train). Ensure they match your expectations.
print("X_train_scaled shape:", X_train_scaled.shape)
print("y_train shape:", y_train.shape)
print("Data types - X_train_scaled:", X_train_scaled.dtype, "y_train:", y_train.dtype)

#Output
X_train_scaled shape: (29, 29)
y_train shape: (29,)
Data types - X_train_scaled: float64 y_train: float64

#Shuffling Data:Verify that your data is properly shuffled. You can check this by examining the first few rows of your training data to see if they appear in a random order.
print("First 10 labels in y_train:", y_train[:10])

# Output
First 10 labels in y_train: 24    0.02
16    0.05
8     0.12
15    0.49
12    0.09
32    0.06
9     0.08
19    0.37
0     0.04
31    0.26
Name: Multidimensional Poverty Index, dtype: float64

#Random Sample:Take a random sample of your data to inspect it visually and ensure that it looks as expected.
random_sample_idx = np.random.randint(0, X_train_scaled.shape[0], size=5)
print("Random sample from X_train_scaled:", X_train_scaled[random_sample_idx])
# .iloc selects by position, so the labels line up with the sampled rows
print("Corresponding labels from y_train:", y_train.iloc[random_sample_idx])

# Output
Random sample from X_train_scaled: [[0.         0.58333333 0.         1.         1.         0.
  1.         0.         0.         0.         0.         0.
  0.35622583 0.66777519 0.97879899 0.33426738 0.20822011 1.
  1.         0.93903243 0.34751037 0.44649022 1.         1.
  0.81675733 0.27351325 0.23961753 1.         1.        ]
 [0.         1.         0.         1.         0.76026385 0.
  0.71200836 0.         0.         0.         0.13272364 0.13319718
  0.37564938 0.58164283 1.         0.47887324 0.74235734 0.73489543
  0.54481547 0.78783285 0.53056176 0.72628637 0.70538342 0.39344262
  0.62149874 0.54258242 0.72541744 0.70660793 0.35122921]
 [0.         0.88888889 0.         0.17630628 0.         0.30709683
  0.         0.         0.         0.         0.23793421 0.23699529
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.        ]
 [0.         0.08333333 0.         0.07328294 0.         0.
  0.         0.         0.         0.         0.17537658 0.17520745
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.        ]
 [0.         0.77777778 0.         0.29444627 0.         0.65612108
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.42815897 0.
  0.         0.         0.         0.44040769 0.         0.
  0.         0.         0.         0.         0.        ]]
Corresponding labels from y_train: 14    0.12
16    0.05
5     0.12
18    0.31
28    0.09
Name: Multidimensional Poverty Index, dtype: float64

Based on the various data check outputs:

Data Shape and Types: The shapes of X_train_scaled and y_train match expectations. Both have a data type of float64, which is fine.
Shuffling Data: The first ten labels in y_train do not appear in any particular sequence. This implies that the data has been shuffled, which is beneficial for training models.
Random Sample: The random sample's feature values all lie in the 0 to 1 range produced by the scaler, and the labels are in the anticipated format (floats).

Based on these checks, the data appears to be correctly prepared, shuffled, and in the expected format for continued training. I can now implement early stopping.

#Early Stopping: Implement early stopping to prevent overfitting. Monitor a validation metric (e.g., validation loss or accuracy) during training and stop training if the metric doesn't improve for a certain number of epochs.
from keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
model.fit(X_train_scaled, y_train, epochs=100, batch_size=32, validation_data=(X_val_scaled, y_val), callbacks=[early_stopping])

# Output
Epoch 1/100
1/1 [==============================] - 1s 931ms/step - loss: 0.6762 - accuracy: 0.0000e+00 - val_loss: 0.6599 - val_accuracy: 0.0000e+00
Epoch 2/100
1/1 [==============================] - 0s 76ms/step - loss: 0.6463 - accuracy: 0.0000e+00 - val_loss: 0.6430 - val_accuracy: 0.0000e+00
Epoch 3/100
1/1 [==============================] - 0s 79ms/step - loss: 0.6202 - accuracy: 0.0000e+00 - val_loss: 0.6296 - val_accuracy: 0.0000e+00
Epoch 4/100
1/1 [==============================] - 0s 72ms/step - loss: 0.5979 - accuracy: 0.0000e+00 - val_loss: 0.6193 - val_accuracy: 0.0000e+00
Epoch 5/100
1/1 [==============================] - 0s 93ms/step - loss: 0.5793 - accuracy: 0.0000e+00 - val_loss: 0.6118 - val_accuracy: 0.0000e+00
Epoch 6/100
1/1 [==============================] - 0s 82ms/step - loss: 0.5638 - accuracy: 0.0000e+00 - val_loss: 0.6066 - val_accuracy: 0.0000e+00
Epoch 7/100
1/1 [==============================] - 0s 83ms/step - loss: 0.5511 - accuracy: 0.0000e+00 - val_loss: 0.6033 - val_accuracy: 0.0000e+00
Epoch 8/100
1/1 [==============================] - 0s 85ms/step - loss: 0.5407 - accuracy: 0.0000e+00 - val_loss: 0.6016 - val_accuracy: 0.0000e+00
Epoch 9/100
1/1 [==============================] - 0s 84ms/step - loss: 0.5322 - accuracy: 0.0000e+00 - val_loss: 0.6013 - val_accuracy: 0.0000e+00
Epoch 10/100
1/1 [==============================] - 0s 99ms/step - loss: 0.5255 - accuracy: 0.0000e+00 - val_loss: 0.6021 - val_accuracy: 0.0000e+00
Epoch 11/100
1/1 [==============================] - 0s 83ms/step - loss: 0.5202 - accuracy: 0.0000e+00 - val_loss: 0.6039 - val_accuracy: 0.0000e+00
Epoch 12/100
1/1 [==============================] - 0s 102ms/step - loss: 0.5161 - accuracy: 0.0000e+00 - val_loss: 0.6065 - val_accuracy: 0.0000e+00
Epoch 13/100
1/1 [==============================] - 0s 105ms/step - loss: 0.5131 - accuracy: 0.0000e+00 - val_loss: 0.6098 - val_accuracy: 0.0000e+00
Epoch 14/100
1/1 [==============================] - 0s 81ms/step - loss: 0.5110 - accuracy: 0.0000e+00 - val_loss: 0.6135 - val_accuracy: 0.0000e+00
<keras.src.callbacks.History at 0x22d6fb9e950>

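Early stopping is now part of the training process. In the log above, the validation loss reaches its minimum of 0.6013 at epoch 9 and then rises for five consecutive epochs, so training halts at epoch 14 and, because restore_best_weights=True, the weights from epoch 9 are restored. This limits overfitting, although it does not by itself guarantee that the model generalizes well to new data. Now let's finalize.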
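
The stopping point can be checked by replaying the patience rule on the validation losses from the training log above. This is a simplified sketch of the idea, not Keras' implementation (which also supports options such as `min_delta`); the helper name `replay_early_stopping` is made up for this example.

```python
# Validation losses copied from the training log above (epochs 1-14)
val_losses = [0.6599, 0.6430, 0.6296, 0.6193, 0.6118, 0.6066, 0.6033,
              0.6016, 0.6013, 0.6021, 0.6039, 0.6065, 0.6098, 0.6135]

def replay_early_stopping(losses, patience):
    """Return (best_epoch, stop_epoch), 1-indexed; stop_epoch is None if never triggered."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            # New best value: remember it and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, None

best_epoch, stop_epoch = replay_early_stopping(val_losses, patience=5)
print(best_epoch, stop_epoch)
```

The replay gives epoch 9 as the best epoch and epoch 14 as the stopping point, which matches where the log ends.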

Model Evaluation

Model Performance Metrics
Using a constrained dataset encompassing multiple socioeconomic indices, I constructed a machine learning model to forecast and prescribe poverty reduction in Nigeria. To avoid overfitting, the model was trained using early stopping. I used the following metrics to assess its performance:

Loss: The loss function (binary cross-entropy, as set in create_rnn_model) measures the discrepancy between the predicted and actual values. It quantifies how well the model fits the data.

Accuracy: While accuracy is not a typical metric for regression tasks, we calculated it to provide a general idea of the model's performance.

Evaluation Results

Both training and validation datasets were used to train and evaluate the model. The following are the primary evaluation findings:

Training Loss: The training loss decreased consistently over the epochs, reaching a value of 0.5110.

Validation Loss: The validation loss decreased to a minimum of 0.6013 at epoch 9, then rose to 0.6135 by the final epoch (14).

Training Accuracy: The training accuracy was reported as 0.0 due to the regression nature of the task.

Validation Accuracy: The validation accuracy was also reported as 0.0, as accuracy is not a suitable metric for regression tasks.
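
Since accuracy is not informative for a continuous target, regression metrics are a natural alternative. The sketch below computes three common ones with scikit-learn. The `y_true` and `y_pred` values are illustrative only and are not results from this project.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative values only, not results from this project
y_true = [0.02, 0.05, 0.12, 0.49]
y_pred = [0.04, 0.05, 0.10, 0.45]

mse = mean_squared_error(y_true, y_pred)   # penalizes large errors more
mae = mean_absolute_error(y_true, y_pred)  # average error in MPI units
r2 = r2_score(y_true, y_pred)              # share of variance explained

print(f"MSE={mse:.4f} MAE={mae:.4f} R2={r2:.3f}")
```

MAE is the easiest of the three to interpret here, because it is expressed in the same units as the Multidimensional Poverty Index itself.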

Conclusions

Based on the evaluation results, we can draw the following conclusions:

Model Training: The decreased training loss indicates that the machine learning model effectively learned from the training data. This implies that the model is capable of detecting patterns in the data.

Validation Performance: While the validation loss decreased during training, it remained high and stayed above the training loss. This suggests that the model may not generalize effectively to new, previously unseen data. The poor accuracy scores underline the difficulty of applying typical classification metrics to regression problems.

Objective Achievement: Due to the relatively substantial validation loss, the primary goal of forecasting and prescribing poverty reduction in Nigeria may not have been entirely realized. The performance of the model implies that more advanced approaches or extra data may be necessary to enhance predictions.

Recommendations

Based on the findings and conclusions, here are some recommendations for further work and improvements:

Hyperparameter Tuning: Experiment with various hyperparameters such as model architecture, learning rate, and epoch count to uncover configurations that result in superior model performance.

Feature Engineering: Investigate additional features or technical strategies for extracting more useful information from data. Methods for feature selection and dimensionality reduction may also be useful.

Collect More Data: A bigger sample size and more relevant features in the dataset could increase model generalization. Furthermore, gathering data unique to poverty reduction activities in Nigeria may improve projections.

Time-Series Analysis: Investigate time-series analysis tools for incorporating temporal trends into poverty alleviation activities.

Ensemble Models: Experiment with ensemble models, such as random forests or gradient boosting, to see if they can better capture complicated relationships in data.

External Data Sources: To increase the dataset's richness and diversity, incorporate data from external sources such as government publications, surveys, or satellite photography.

Interdisciplinary Collaboration: Collaborate with experts in economics, social sciences, and poverty reduction to obtain a better understanding of the variables that contribute to poverty and viable policy solutions.

Ethical Considerations: When applying machine learning to societal concerns such as poverty alleviation, keep ethical implications in mind. Make certain that the models do not introduce bias or unfairness into decision-making.
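
The ensemble-model recommendation could be prototyped roughly as follows. This is a sketch on synthetic data, assuming 37 regions with five indicators each; `X_demo`, `y_demo`, and the model settings are placeholders rather than tuned choices, and nothing here reflects results on the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for 37 regions with 5 indicators each (not real data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(37, 5))
y_demo = 0.004 * X_demo[:, 0] + rng.normal(0, 0.02, size=37)

model = RandomForestRegressor(n_estimators=100, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports negated MAE so that higher is always better
scores = cross_val_score(model, X_demo, y_demo, cv=cv,
                         scoring="neg_mean_absolute_error")
print(-scores.mean())
```

Cross-validation matters with so few rows, because a single train/test split of 37 samples can give a very noisy estimate of performance.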

Final Remarks

The study made substantial progress in its investigation of the use of machine learning approaches to forecast and prescribe poverty reduction in Nigeria. However, some critical considerations must be addressed in order to adequately answer the research question:

Limited Datasets: When training machine learning models, using restricted datasets can be difficult. While the experiment used available data, the model's performance and generalization capabilities may have suffered due to the small dataset size.

Model Performance: The evaluation findings show that the model had difficulty obtaining high accuracy while minimizing validation loss. This shows that the model's prediction performance may need to be improved further.

Regression Task: Predicting and prescribing poverty reduction, as posed in the research question, is a regression task. Traditional classification criteria, such as accuracy, may not be ideal for evaluating regression models. The emphasis should be on minimizing the loss function and enhancing the model's ability to predict accurately.

Generalization: Machine learning models strive to generalize well to previously unseen data. The project's findings indicate that the model may not generalize well to new, previously unseen data. This is an important consideration when using machine learning approaches in real-world applications.

Data Limitations: Data restrictions, such as data quality, representativeness, and the availability of important features, also have an impact on the project's success in answering the research question.

In conclusion, while the project produced useful contributions and provided insights into the application of machine learning approaches for poverty prediction in Nigeria, additional work may be required to adequately answer the research question, particularly given the limits of restricted datasets. Extending the dataset, refining the model, and investigating new variables and approaches may improve the project's ability to predict and prescribe poverty-reduction strategies.
