Akmal Chaudhri for SingleStore

Quick tip: Using Apache Spark with SingleStore Notebooks for Fraud Detection

Abstract

In a previous article, we saw the ease with which we could install and use Apache Spark within the SingleStore notebook environment. Continuing our series on Spark, we'll now use it to classify fraudulent credit card transactions.

The notebook file used in this article is available on GitHub.

Fraud dataset selection

We can find actual credit card data on Kaggle. The data are anonymised credit card transactions containing genuine and fraudulent cases.

The transactions occurred over two days during September 2013, and the dataset includes a total of 284,807 transactions, of which 492 are fraudulent, representing just 0.172% of the total.

This dataset, therefore, presents some challenges for analysis as it is highly unbalanced.

The dataset consists of the following fields:

  • Time: The number of seconds elapsed between a transaction and the first transaction in the dataset
  • V1 to V28: Numerical features produced by a PCA transformation; the original details are not available for confidentiality reasons
  • Amount: The monetary value of the transaction
  • Class: The response variable (0 = no fraud, 1 = fraud)

One method to prepare the data for analysis is to keep all the fraudulent transactions and randomly sample 1% of the non-fraudulent transactions without replacement. Sorting the result on the Time column gives a total of 3265 rows. However, many other approaches are possible.
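As a rough sketch, this preparation could be done in Pandas along the following lines. The file name and random seed are illustrative assumptions, not taken from the original notebook:

import pandas as pd

# Assumes the full Kaggle dataset has been downloaded locally as creditcard.csv
df = pd.read_csv("creditcard.csv")

# Keep every fraudulent transaction
fraud = df[df["Class"] == 1]

# Randomly sample ~1% of the genuine transactions without replacement
genuine = df[df["Class"] == 0].sample(frac = 0.01, replace = False, random_state = 42)

# Combine and sort on the Time column
reduced_df = pd.concat([fraud, genuine]).sort_values("Time").reset_index(drop = True)

The exact row count will vary with the seed used for sampling.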

We'll show the following metrics, based on the confusion matrix below (a short sketch computing them from the raw counts follows the definitions):

                 |       Predicted       |
                 | Positive  | Negative  |
-----------------+-----------+-----------+
 Actual Positive |    TP     |    FN     |
-----------------+-----------+-----------+
 Actual Negative |    FP     |    TN     |
-----------------+-----------+-----------+
  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

Where

  • Accuracy: Measures the proportion of correctly classified instances among all instances
  • Precision: Quantifies the proportion of correctly identified positive cases out of all cases identified as positive
  • Recall: Evaluates the proportion of correctly identified positive cases out of all actual positive cases
  • F1 Score: Combines precision and recall into a single metric, balancing both measures to provide a comprehensive evaluation of a model's performance
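As a quick sanity check, these four metrics follow directly from the confusion-matrix counts. A minimal sketch, using made-up counts purely for demonstration:

def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts, for illustration only
print(classification_metrics(tp = 120, fp = 5, fn = 10, tn = 865))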

Create a SingleStore Cloud account

A previous article showed the steps to create a free SingleStore Cloud account. We'll use the following settings:

  • Workspace Group Name: Spark Demo Group
  • Cloud Provider: AWS
  • Region: US East 1 (N. Virginia)
  • Workspace Name: spark-demo
  • Size: S-00

Create a new notebook

From the left navigation pane in the cloud portal, we'll select Develop > Notebooks.

In the top right of the web page, we'll select New Notebook > New Notebook, as shown in Figure 1.

Figure 1. New Notebook.

We'll call the notebook spark_fraud_demo, select a Blank notebook template from the available options, and save it in the Personal location.

Fill out the notebook

First, let's install Spark:

!pip cache purge --quiet
!conda install -y --quiet -c conda-forge openjdk pyspark
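To confirm the installation succeeded, we can optionally check the PySpark version:

import pyspark

print(pyspark.__version__)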

Next, we'll obtain the reduced dataset, already prepared, and load it into a Pandas DataFrame:

import pandas as pd

url = "https://raw.githubusercontent.com/VeryFatBoy/gpt-workshop/main/data/creditcard.csv"

pandas_df = pd.read_csv(url)

We can check the number of rows:

pandas_df.shape[0]

The output should be:

3265

We can check the distribution of the Class column:

pandas_df.groupby("Class").size()

The output should be:

Class
0    2773
1     492
dtype: int64

We can also output the first 5 rows, as follows:

pandas_df.head(5)

Since the details for the columns V1 to V28 are not available, we can only check the Amount:

pandas_df["Amount"].describe()

The output should be:

count    3265.000000
mean       86.715210
std       195.568876
min         0.000000
25%         4.490000
50%        21.900000
75%        80.310000
max      2917.640000
Name: Amount, dtype: float64

We can produce a quick plot of the Amount values using the following:

import plotly.express as px
import warnings

warnings.filterwarnings("ignore", category = FutureWarning)

fig = px.scatter(
    pandas_df,
    y = "Amount",
    color = pandas_df["Class"].astype(str),
    hover_data = ["Amount"]
)

fig.update_layout(
    # yaxis_type = "log",
    title = "Amount and Class"
)

fig.show()

The output is shown in Figure 2.

Figure 2. Amount and Class.

Another way we can look at the data is as a histogram:

fig = px.histogram(
    pandas_df,
    x = "Amount",
    nbins = 50
)

fig.show()

The output is shown in Figure 3.

Figure 3. Histogram.

Figures 2 and 3 show that the vast majority of transactions were small in value.

Next, let's create a SparkSession:

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("Fraud Detection").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

and then use Logistic Regression:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler

# Convert pandas DataFrame to Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)

# Select features and labels
features = spark_df.columns[1:30]
labels = "Class"

# Assemble features into vector
assembler = VectorAssembler(
    inputCols = features,
    outputCol = "features"
)

spark_df = assembler.transform(spark_df).select("features", labels)

# Split the data into training and testing sets
train, test = spark_df.cache().randomSplit([0.7, 0.3], seed = 42)

# Initialise logistic regression model
lr = LogisticRegression(
    maxIter = 1000,
    featuresCol = "features",
    labelCol = labels
)

# Train the logistic regression model
train_model = lr.fit(train)

# Make predictions on the test set
predictions = train_model.transform(test)

# Calculate the accuracy, precision, recall, and F1 score of the model
accuracy = predictions.filter(predictions.Class == predictions.prediction).count() / float(test.count())

evaluator = MulticlassClassificationEvaluator(
    labelCol = labels,
    predictionCol = "prediction"
)

precision = evaluator.evaluate(
    predictions,
    {evaluator.metricName: "precisionByLabel"}
)

recall = evaluator.evaluate(
    predictions,
    {evaluator.metricName: "recallByLabel"}
)

f1 = evaluator.evaluate(
    predictions,
    {evaluator.metricName: "fMeasureByLabel"}
)
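One detail worth noting: the ByLabel metrics above use the evaluator's default metricLabel of 0.0, so they describe the genuine class. To report precision and recall for the fraud class instead, one option (assuming a Spark 3.x build, where the metricLabel parameter is available) would be:

# Optional: evaluate precision and recall for the fraud class (label 1.0)
precision_fraud = evaluator.evaluate(
    predictions,
    {evaluator.metricName: "precisionByLabel", evaluator.metricLabel: 1.0}
)

recall_fraud = evaluator.evaluate(
    predictions,
    {evaluator.metricName: "recallByLabel", evaluator.metricLabel: 1.0}
)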

Next, we'll create a Confusion Matrix:

# Create confusion matrix
cm = predictions.select("Class", "prediction")
cm = cm.groupBy("Class", "prediction").count()
cm = cm.toPandas()

# Pivot the confusion matrix
cm = cm.pivot(
    index = "Class",
    columns = "prediction",
    values = "count"
)

# Generate and plot the confusion matrix
fig = px.imshow(
    cm,
    x = ["Genuine (0)", "Fraudulent (1)"],
    y = ["Genuine (0)", "Fraudulent (1)"],
    color_continuous_scale = "Reds",
    labels = dict(x = "Predicted Label", y = "True Label")
)

# Add annotations to the heatmap
for i in range(len(cm)):
    for j in range(len(cm)):
        fig.add_annotation(
            x = j,
            y = i,
            text = str(cm.iloc[i, j]),
            font = dict(color = "white" if cm.iloc[i, j] > cm.values.max() / 2 else "black"),
            showarrow = False
        )

fig.update_layout(
    title_text = "Confusion Matrix - Logistic Regression",
    coloraxis_showscale = False
)

fig.show()

The output is shown in Figure 4.

Figure 4. Confusion Matrix.

Overall, the model classifies most transactions correctly, with relatively few misclassifications.

We can also print some metrics:

# Print the accuracy, precision, recall and f1 of the model
print("Accuracy: %.4f" % accuracy)
print("Precision: %.4f" % precision)
print("Recall: %.4f" % recall)
print("F1: %.4f" % f1)

Example output:

Accuracy: 0.9817
Precision: 0.9862
Recall: 0.9924
F1: 0.9893

Finally, we'll stop Spark:

spark.stop()

Summary

In this short article, we've been able to use Apache Spark to build the first iteration of a fraud detection model using SingleStore notebooks. In the next article in this series, we'll use the SingleStore Spark Connector to read and write data using the SingleStore Data Platform. Stay tuned.
