Akmal Chaudhri for SingleStore

Posted on Mar 31 • Edited on Dec 18

Quick tip: Using Apache Spark with SingleStore Notebooks for Fraud Detection

#singlestoredb #apachespark #frauddetection

Abstract

In a previous article, we saw the ease with which we could install and use Apache Spark within the SingleStore notebook environment. Continuing our series on Spark, we'll now use it to classify fraudulent credit card transactions.

The notebook file used in this article is available on GitHub.

Fraud dataset selection

We can find actual credit card data on Kaggle. The data are anonymised credit card transactions containing genuine and fraudulent cases.

The transactions occurred over two days during September 2013, and the dataset includes a total of 284,807 transactions, of which 492 are fraudulent, representing just 0.172% of the total.

This dataset, therefore, presents some challenges for analysis as it is highly unbalanced.
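
To see why the imbalance matters, here is a minimal sketch using only the counts quoted above. The "always genuine" classifier is a hypothetical baseline, not something from the dataset or the notebook; it simply shows that accuracy alone can look excellent while no fraud is detected.

```python
# Counts quoted in the dataset description above.
total_transactions = 284_807
fraudulent = 492

fraud_rate = fraudulent / total_transactions

# A hypothetical "model" that labels every transaction as genuine is
# correct on every non-fraudulent row and wrong on every fraudulent one.
baseline_accuracy = (total_transactions - fraudulent) / total_transactions
baseline_recall = 0 / fraudulent  # no fraudulent transaction is ever caught

print(f"Fraud rate:        {fraud_rate:.4%}")
print(f"Baseline accuracy: {baseline_accuracy:.4%}")
print(f"Baseline recall:   {baseline_recall:.4%}")
```

This is one reason to look at precision and recall as well as accuracy when the classes are this unbalanced.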

The dataset consists of the following fields:

Time: The number of seconds elapsed between a transaction and the first transaction in the dataset
V1 to V28: Details not available due to confidentiality reasons
Amount: The monetary value of the transaction
Class: The response variable (0 = no fraud, 1 = fraud)

One method to prepare the data for analysis is to keep all the fraudulent transactions and randomly sample about 1% of the non-fraudulent transactions without replacement. The resulting data, sorted on the Time column, contain a total of 3265 rows. However, many other approaches are possible.
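
The sampling approach described above could be sketched with Pandas roughly as follows. This is an illustration under assumptions, not necessarily how the reduced file used later was produced: the function name `downsample_genuine`, the seed, and the synthetic stand-in data are all hypothetical, and only the column names Time, Amount, and Class come from the dataset.

```python
import numpy as np
import pandas as pd


def downsample_genuine(df, fraction=0.01, seed=42):
    """Keep every fraud row plus a random fraction of the genuine rows."""
    fraud = df[df["Class"] == 1]
    genuine = df[df["Class"] == 0].sample(
        frac=fraction, replace=False, random_state=seed
    )
    # Recombine and restore chronological order.
    return pd.concat([fraud, genuine]).sort_values("Time").reset_index(drop=True)


# Small synthetic stand-in for the full Kaggle file (hypothetical values).
rng = np.random.default_rng(0)
n_rows = 10_000
full_df = pd.DataFrame({
    "Time": np.arange(n_rows),
    "Amount": rng.exponential(scale=80.0, size=n_rows).round(2),
    "Class": (rng.random(n_rows) < 0.002).astype(int),
})

reduced_df = downsample_genuine(full_df)
print(reduced_df.groupby("Class").size())
```

Downsampling the majority class keeps the example small and makes the classes less lopsided, at the cost of discarding most of the genuine transactions.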

We'll evaluate the model using a confusion matrix and the following metrics:

                  |       Predicted       |
                  | Positive  | Negative  |
------------------+-----------+-----------+
  Actual Positive |    TP     |    FN     |
------------------+-----------+-----------+
  Actual Negative |    FP     |    TN     |
------------------+-----------+-----------+

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

Where

Accuracy: Measures the proportion of correctly classified instances among all instances
Precision: Quantifies the proportion of correctly identified positive cases out of all cases identified as positive
Recall: Evaluates the proportion of correctly identified positive cases out of all actual positive cases
F1 Score: Combines precision and recall into a single metric, balancing both measures to provide a comprehensive evaluation of a model's performance
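
As a worked example, the formulas above can be written as a small Python function. The function name `classification_metrics` and the counts passed to it are made up for illustration and are not results from the fraud dataset.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * (precision * recall) / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=80, fp=10, fn=20, tn=890)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy is 970 / 1000, precision is 80 / 90, and recall is 80 / 100, so the accuracy is noticeably higher than the recall even though one in five positive cases is missed.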

Create a SingleStore Cloud account

A previous article showed the steps to create a free SingleStore Cloud account. We'll use the following settings:

Workspace Group Name: Spark Demo Group
Cloud Provider: AWS
Region: US East 1 (N. Virginia)
Workspace Name: spark-demo
Size: S-00

Create a new notebook

From the left navigation pane in the cloud portal, we'll select DEVELOP > Data Studio.

In the top right of the web page, we'll select New Notebook > New Notebook, as shown in Figure 1.

Figure 1. New Notebook.

We'll call the notebook spark_fraud_demo, select a Blank notebook template from the available options, and save it in the Personal location.

Fill out the notebook

First, let's install Java:

!conda install -y --quiet -c conda-forge openjdk=8

Next, we'll obtain the reduced dataset, already prepared, and load it into a Pandas DataFrame:

import pandas as pd

url = "https://raw.githubusercontent.com/VeryFatBoy/gpt-workshop/main/data/creditcard.csv"

pandas_df = pd.read_csv(url)

We can check the number of rows:

pandas_df.shape[0]

The output should be:

3265

We can check the number of rows in each Class:

pandas_df.groupby("Class").size()

The output should be:

Class
0    2773
1     492
dtype: int64

We can also output the first 5 rows, as follows:

pandas_df.head(5)

Since the details for the columns V1 to V28 are not available, we can only check the Amount:

pandas_df["Amount"].describe()

The output should be:

count    3265.000000
mean       86.715210
std       195.568876
min         0.000000
25%         4.490000
50%        21.900000
75%        80.310000
max      2917.640000
Name: Amount, dtype: float64

We can produce a quick plot of the Amount values using the following:

import plotly.express as px

fig = px.scatter(
    pandas_df,
    y = "Amount",
    color = pandas_df["Class"].astype(str),
    hover_data = ["Amount"]
)

fig.update_layout(
    # yaxis_type = "log",
    title = "Amount and Class"
)

fig.show()

The output is shown in Figure 2.

Figure 2. Amount and Class.

Another way we can look at the data is as a histogram:

fig = px.histogram(
    pandas_df,
    x = "Amount",
    nbins = 50
)

fig.show()

The output is shown in Figure 3.

Figure 3. Histogram.

Figures 2 and 3 show that the vast majority of transactions were small in value.
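
The same observation can be checked numerically rather than visually. The sketch below uses a tiny hypothetical frame, `sample_df`, with the same column names as `pandas_df`; the threshold of 100 is an arbitrary assumption, and in the notebook the same expressions could be applied to `pandas_df` directly.

```python
import pandas as pd

# Hypothetical miniature frame with the same column names as pandas_df.
sample_df = pd.DataFrame({
    "Amount": [1.00, 4.49, 21.90, 80.31, 2917.64, 0.00, 99.99, 529.00],
    "Class": [0, 0, 0, 0, 0, 1, 1, 1],
})

# Share of transactions at or below a chosen threshold (assumed value).
threshold = 100.0
share_small = (sample_df["Amount"] <= threshold).mean()

# Median amount per class; the median is less sensitive to a few
# very large transactions than the mean.
median_by_class = sample_df.groupby("Class")["Amount"].median()

print(f"Share of amounts <= {threshold}: {share_small:.2%}")
print(median_by_class)
```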

Next, let's create a SparkSession:

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("Fraud Detection").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

and then use Logistic Regression:

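from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler

# Convert pandas DataFrame to Spark DataFrame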
spark_df = spark.createDataFrame(pandas_df)

# Select features (V1 to V28 and Amount; Time is excluded) and the label column
features = spark_df.columns[1:30]
labels = "Class"

# Assemble features into vector
assembler = VectorAssembler(
    inputCols = features,
    outputCol = "features"
)

spark_df = assembler.transform(spark_df).select("features", labels)

# Split the data into training and testing sets
train, test = spark_df.cache().randomSplit([0.7, 0.3], seed = 42)

# Initialise logistic regression model
lr = LogisticRegression(
    maxIter = 1000,
    featuresCol = "features",
    labelCol = labels
)

# Train the logistic regression model
train_model = lr.fit(train)

# Make predictions on the test set
predictions = train_model.transform(test)

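# Calculate the accuracy, precision, recall, and F1 score of the model.
# Note: the "...ByLabel" metrics are computed for the evaluator's metricLabel,
# which defaults to 0.0 (the genuine class). Pass metricLabel = 1.0 to the
# evaluator to score the fraud class instead.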
accuracy = predictions.filter(predictions.Class == predictions.prediction).count() / float(test.count())

evaluator = MulticlassClassificationEvaluator(
    labelCol = labels,
    predictionCol = "prediction"
)

precision = evaluator.evaluate(
    predictions,
    {evaluator.metricName: "precisionByLabel"}
)

recall = evaluator.evaluate(
    predictions,
    {evaluator.metricName: "recallByLabel"}
)

f1 = evaluator.evaluate(
    predictions,
    {evaluator.metricName: "fMeasureByLabel"}
)

Next, we'll create a Confusion Matrix:

# Create confusion matrix
cm = predictions.select("Class", "prediction")
cm = cm.groupBy("Class", "prediction").count()
cm = cm.toPandas()

# Pivot the confusion matrix
cm = cm.pivot(
    index = "Class",
    columns = "prediction",
    values = "count"
)

# Generate and plot the confusion matrix
fig = px.imshow(
    cm,
    x = ["Genuine (0)", "Fraudulent (1)"],
    y = ["Genuine (0)", "Fraudulent (1)"],
    color_continuous_scale = "Reds",
    labels = dict(x = "Predicted Label", y = "True Label")
)

# Add annotations to the heatmap
for i in range(len(cm)):
    for j in range(len(cm)):
        fig.add_annotation(
            x = j,
            y = i,
            text = str(cm.iloc[i, j]),
            font = dict(color = "white" if cm.iloc[i, j] > cm.values.max() / 2 else "black"),
            showarrow = False
        )

fig.update_layout(
    title_text = "Confusion Matrix - Logistic Regression",
    coloraxis_showscale = False
)

fig.show()

The output is shown in Figure 4.

Figure 4. Confusion Matrix.

Overall, most test transactions fall on the diagonal of the matrix, meaning the model classified them correctly, with comparatively few false positives and false negatives.
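
The counts behind a confusion matrix like this can also be turned into fraud-class precision and recall with plain Pandas. The sketch below starts from a hypothetical long-format table shaped like `cm` before the pivot; the counts are invented for illustration and are not the values in Figure 4.

```python
import pandas as pd

# Hypothetical counts in the same long format as cm before the pivot.
counts = pd.DataFrame({
    "Class": [0, 0, 1, 1],
    "prediction": [0.0, 1.0, 0.0, 1.0],
    "count": [820, 5, 15, 130],
})

# Pivot, then force a full 2 x 2 layout so that a missing cell
# (for example, no false positives at all) becomes 0 instead of NaN.
matrix = (
    counts.pivot(index="Class", columns="prediction", values="count")
    .reindex(index=[0, 1], columns=[0.0, 1.0])
    .fillna(0)
)

# Treat fraud (Class 1) as the positive class.
tp = matrix.loc[1, 1.0]
fp = matrix.loc[0, 1.0]
fn = matrix.loc[1, 0.0]

fraud_precision = tp / (tp + fp)
fraud_recall = tp / (tp + fn)

print(f"Fraud precision: {fraud_precision:.4f}")
print(f"Fraud recall:    {fraud_recall:.4f}")
```

Reindexing before reading the cells is a defensive choice: if one combination of Class and prediction never occurs, the pivot would otherwise drop that row or column.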

We can also print some metrics:

# Print the accuracy, precision, recall and f1 of the model
print("Accuracy: %.4f" % accuracy)
print("Precision: %.4f" % precision)
print("Recall: %.4f" % recall)
print("F1: %.4f" % f1)

Example output:

Accuracy: 0.9817
Precision: 0.9862
Recall: 0.9924
F1: 0.9893

Finally, we'll stop Spark:

spark.stop()

Summary

In this short article, we've been able to use Apache Spark to build the first iteration of a fraud detection model using SingleStore notebooks. In the next article in this series, we'll use the SingleStore Spark Connector to read and write data using the SingleStore Data Platform. Stay tuned.
