
Ryan Nazareth for AWS Community Builders

Originally published at ryannazareth.com

AWS Fraud Detector for classifying fraudulent online registered accounts - Part 1

Amazon Fraud Detector is a fully managed service that can identify potentially fraudulent online activities, such as the creation of fake accounts or online payment fraud. It automates the time-consuming and expensive steps of building, training, and deploying an ML model for fraud detection, and customizes each model it creates to your dataset, making its accuracy higher than that of one-size-fits-all ML solutions. Since you pay only for what you use, you can avoid large upfront expenses.

This blog is split into two parts. We will create a workflow that reads data from S3, performs an ETL job, and trains a Fraud Detector model, which will then be deployed and used to generate predictions for a sample of batch data as well as real-time predictions via a custom API. The code used in the snippets, along with additional scripts, can be found in Github. We will use simulated train and test datasets from Kaggle, which can be downloaded from this page.

The datasets _fraudTest.csv_ and _fraudTrain.csv_ contain, for each online account registration event, the variables required for creating an event in AWS Fraud Detector, as described here. They contain the following variables:

  • index - Unique Identifier for each row
  • trans_date_trans_time - Transaction DateTime
  • cc_num - Credit Card Number of Customer
  • merchant - Merchant Name
  • category - Category of Merchant
  • amt - Amount of Transaction
  • first - First Name of Credit Card Holder
  • last - Last Name of Credit Card Holder
  • gender - Gender of Credit Card Holder
  • street - Street Address of Credit Card Holder
  • city - City of Credit Card Holder
  • state - State of Credit Card Holder
  • zip - Zip of Credit Card Holder
  • lat - Latitude Location of Credit Card Holder
  • long - Longitude Location of Credit Card Holder
  • city_pop - Credit Card Holder's City Population
  • job - Job of Credit Card Holder
  • dob - Date of Birth of Credit Card Holder
  • trans_num - Transaction Number
  • unix_time - UNIX Time of transaction
  • merch_lat - Latitude Location of Merchant
  • merch_long - Longitude Location of Merchant
  • is_fraud - Fraud Flag <--- Target Class

In the first part of this blog, we will focus mainly on the training workflow starting from uploading raw data to S3, crawling the data and then transforming it with AWS Glue. Finally, we will invoke a Fraud Detector training job, which uses this processed data as input to train a model.

Creating the Resource Stack

fraud_train_architecture

We will use AWS Cloudformation to create and manage the resource stacks for the architecture depicted in the diagram above. The template can be accessed here. It is a large file, which will create a stack to deploy the following resources:

  • Fraud Detector and Role: This includes all the Fraud Detector resources (Detector, Variables, EntityType, EventType, Outcome, and Label), plus an IAM role for Fraud Detector to access data from the S3 bucket.
  • Glue Classifier, Crawler and Job: Raw data from S3 to be crawled into Glue Data Catalog. A Glue Job for running the transformation steps to add transformed data into S3.
  • 3 x Lambda Functions, Roles and Event Source Mapping: A Lambda function for triggering a glue job in response to an event from EventBridge, and two Lambda resources for triggering the training and prediction jobs respectively, invoked synchronously from separate SQS queues. Each Lambda function needs a role attached to it, with policies granting permission to access the upstream and downstream resources. In the Cloudformation template we have used AWS managed policies, but these can be adapted depending on the architecture.
  • 2 x SQS Queues and Policies: These decouple the train and prediction workflows. A data upload to the S3 bucket publishes a message to the respective SQS queue, which prompts a Lambda function to run and trigger the training or prediction job (depending on the workflow).
  • EventBridge Rule and Permission: An EventBridge rule monitors the state of the glue crawler and triggers a Lambda function when the crawler completes. EventBridge also needs permission to invoke the Lambda function target.

A more detailed explanation of the various template sections is outside the scope of this article; please refer to the AWS docs for what the different sections do.

The stack can be created by running the following AWS CLI command, and passing in the stack-name and the path to the Cloudformation template yaml file.

aws cloudformation create-stack --stack-name FraudDetectorGlue \ 
--template-body file://cloudformation/fraud_detector.yaml \
--capabilities CAPABILITY_NAMED_IAM

The additional capabilities argument is required because the template contains IAM resources with custom names. Otherwise, an InsufficientCapabilities error is raised during stack creation. For more information, refer to the docs.

Once completed, we can check that a stack named FraudDetectorGlue and the required resources are created as expected.
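The check can also be scripted with boto3. Below is a minimal sketch; `check_stack` assumes valid AWS credentials and the stack name used above, and a status of CREATE_COMPLETE indicates all resources were created:

```python
def stack_status(describe_response):
    """Pull the status out of a DescribeStacks response."""
    stacks = describe_response.get("Stacks", [])
    return stacks[0]["StackStatus"] if stacks else None


def check_stack(stack_name="FraudDetectorGlue"):
    """Call DescribeStacks for the stack created above (needs AWS credentials)."""
    import boto3

    cf = boto3.client("cloudformation")
    return stack_status(cf.describe_stacks(StackName=stack_name))
```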

Upload data to S3

We will now upload the Kaggle datasets downloaded earlier to S3. The script below creates a boto3 client and resource for S3, creates an S3 bucket (skipping this step if it already exists), and uploads the files to the bucket.
The script also configures the command line arguments --local_dir and --bucket_name, which need to be set to the local folder containing the downloaded Kaggle datasets and the name of the bucket to be created, respectively. Optionally, we can also pass the argument --policy_filepath with the path to an S3 bucket resource policy, in case we want to attach one to the bucket.

from pathlib import Path
import json
import argparse
import logging
import boto3
from botocore.exceptions import ClientError
import os

logging.basicConfig(
    format="%(asctime)s %(name)s %(levelname)s:%(message)s", level=logging.INFO
)
logger = logging.getLogger(__name__)
s3_client = boto3.client("s3")
s3_resource = boto3.resource("s3")


def create_bucket(s3_client, bucket_name, policy_path=None):
    response = s3_client.list_buckets()["Buckets"]
    bucket_list = [bucket["Name"] for bucket in response]
    if bucket_name in bucket_list:
        logger.info(
            f"Bucket '{bucket_name}' already exists, so skipping bucket create step"
        )
    else:
        logger.info(f"Creating new bucket with name:{bucket_name}")
        s3_client.create_bucket(Bucket=bucket_name)
        if policy_path is not None:
            with open(policy_path, "rb") as f:
                policy = json.load(f)
            logger.info("Creating bucket policy")
            bucket_policy_str = json.dumps(policy)
            s3_client.put_bucket_policy(Bucket=bucket_name, Policy=bucket_policy_str)


def upload_files(path, bucket):
    for subdir, dirs, files in os.walk(path):
        for file in files:
            full_path = os.path.join(subdir, file)
            object_name = os.path.relpath(full_path, path)
            try:
                s3_client.upload_file(full_path, bucket, object_name)
            except ClientError as e:
                logger.error(e)
                return False


def add_arguments(parser):
    """
    Adds command line arguments to the parser.
    :param parser: The command line parser.
    """

    parser.add_argument(
        "--bucket_name", help="Name of bucket to create or upload data to"
    )

    parser.add_argument("--local_dir", help="Local folder path to upload")
    parser.add_argument("--policy_filepath", help="filepath of resource policy")


def main():
    # get command line arguments
    parser = argparse.ArgumentParser(usage=argparse.SUPPRESS)
    add_arguments(parser)
    args = parser.parse_args()
    policy_filepath = args.policy_filepath
    if policy_filepath is not None:
        create_bucket(s3_client, args.bucket_name, policy_path=policy_filepath)
    else:
        create_bucket(s3_client, args.bucket_name)
    dataset_path = os.path.join(str(Path(__file__).parents[1]), args.local_dir)
    upload_files(dataset_path, args.bucket_name)
    logger.info(
        f"Successfully uploaded all files in {args.local_dir} to S3  bucket {args.bucket_name}"
    )

if __name__ == "__main__":
    main()

Running the script above from the command line, with values passed to arguments --bucket_name and --local_dir, should stream the following to stdout. In this case, we have set the value for bucket name to fraud-sample-data.

2022-05-15 01:21:55,390 botocore.credentials INFO:Found credentials in shared credentials file: ~/.aws/credentials
2022-05-15 01:21:55,982 __main__ INFO:Creating new bucket with name:fraud-sample-data
0it [00:00, ?it/s]2022-05-15 01:21:56,733 __main__ INFO:Starting upload ....
0it [00:00, ?it/s]
2022-05-15 01:21:57,163 __main__ INFO:Successfully uploaded all files in datasets/fraud-sample-data/dataset1 to S3  bucket fraud-sample-data

If this is successful, we should have the train and test csv files in the same folder, as in the screenshot below.

s3 bucket screenshot

We also need to upload the following pyspark script to S3 to run the glue job for transforming the dataset. It will be uploaded to another S3 bucket, which is referenced by the glue job resource in the Cloudformation template. If this bucket does not exist, create your own and modify the Cloudformation template accordingly, so the glue job can reference the script in the correct location. Upload the script into a scripts folder in the bucket via the console or CLI, so that if the script is named fraud-etl-glue.py, the object key is scripts/fraud-etl-glue.py.
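This upload can also be scripted with boto3. A minimal sketch, where the bucket name passed to `upload_glue_script` is whichever bucket your template references:

```python
from pathlib import Path


def script_object_key(script_path, prefix="scripts"):
    """Build the object key the glue job resource expects,
    e.g. glue/fraud-etl-glue.py -> scripts/fraud-etl-glue.py."""
    return f"{prefix}/{Path(script_path).name}"


def upload_glue_script(script_path, bucket):
    """Upload the pyspark script under the scripts/ prefix (needs AWS credentials)."""
    import boto3

    boto3.client("s3").upload_file(script_path, bucket, script_object_key(script_path))
```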

Configure S3 event notifications to SQS

We also need to configure S3 to send notifications to SQS when the data from the glue job is written to the bucket. The SQS messages will then be polled by Lambda to start the training job in AWS Fraud Detector. To do this, go to the S3 bucket in the console and select Properties. On the Properties page, navigate to the Event Notifications section, choose Create event notification, and specify a descriptive name for your event notification. We will include a prefix and a suffix to limit notifications to objects added to a specific folder (glue_transformed/) whose keys end in the specified characters, for example fraudTrain.csv. In the Event types section, we will select All object create events.

event_notification_train_data

In the Destination section, choose the event notification destination and select SQS Queue for the destination type. Specify the Arn of the queue, which can be obtained from the SQS console

event-notification-s3-destination-sqs

Repeat the same for the next event configuration, this time sending batch prediction notifications to the prediction queue in SQS, so that Lambda is invoked to create a batch prediction job. The only change is a different prefix and suffix, as we want notifications to be sent when an object is added to the batch_predict folder in the bucket with the key batch_predict/fraudTest.csv. So we set the prefix to batch_predict and the suffix to fraudTest.csv, or just csv, since this is the only object in this folder.
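The same console steps can be expressed with the S3 `put_bucket_notification_configuration` API. The sketch below builds one queue configuration per workflow; the rule ids are illustrative and the queue ARNs are whatever your stack created:

```python
def queue_rule(rule_id, queue_arn, prefix, suffix):
    """One S3-to-SQS notification rule, filtered by key prefix and suffix."""
    return {
        "Id": rule_id,
        "QueueArn": queue_arn,
        "Events": ["s3:ObjectCreated:*"],
        "Filter": {
            "Key": {
                "FilterRules": [
                    {"Name": "prefix", "Value": prefix},
                    {"Name": "suffix", "Value": suffix},
                ]
            }
        },
    }


def apply_notifications(bucket, train_queue_arn, predict_queue_arn):
    """Attach both rules to the bucket (needs AWS credentials)."""
    import boto3

    config = {
        "QueueConfigurations": [
            queue_rule("train-data", train_queue_arn, "glue_transformed/", ".csv"),
            queue_rule("batch-predict-data", predict_queue_arn, "batch_predict/", ".csv"),
        ]
    }
    boto3.client("s3").put_bucket_notification_configuration(
        Bucket=bucket, NotificationConfiguration=config
    )
```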

event-notification-batch-predict-data-sqs

Once we have configured both event notifications for the two SQS queues, we should see them in the Event Notifications section of the bucket properties, as in the screenshot below.

s3_fraud_bucket_config_events

In each SQS queue, we should see an access policy, already configured via the Cloudformation template, which grants the Amazon S3 principal the necessary permission to publish messages to the queue.
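The policy the template attaches is equivalent in shape to the following sketch (the ARNs passed in are placeholders for your queue and bucket):

```python
def s3_to_sqs_policy(queue_arn, bucket_arn):
    """Access policy letting the S3 service principal send messages to the
    queue, restricted to notifications originating from our bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "s3.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                "Condition": {"ArnLike": {"aws:SourceArn": bucket_arn}},
            }
        ],
    }
```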

access-policy-sqs

Running the ETL workflow with AWS Glue

A Glue crawler, run by the user, crawls the train and test csv files in the S3 bucket and creates a combined table with all the data. The crawler uses a custom classifier; both are created automatically via Cloudformation and configured as below. The S3 path for the crawler is set to s3://fraud-sample-data/input, which should include both the train and test csv files.

custom-classifier

If the crawler runs successfully, you should see a table in the glue data catalog, and we can confirm that the headers and types have been crawled correctly.

glue-catalog-table

An EventBridge rule is configured to listen to the glue crawler state change event (i.e. when the crawler status is 'Succeeded'), as configured in the event pattern in the screenshot below. This uses the default EventBridge bus.

eventbridge-rule-trigger-lambda

The EventBridge target is the lambda function which starts the glue job. The glue job, created from Cloudformation using the script in the S3 path, applies the pyspark and glue transforms and writes the transformed dynamic dataframe back to S3. Using glue, we transform the train and test datasets to conform to the AWS Fraud Detector requirements. For example, Fraud Detector model training requires some mandatory variables in the dataset:

  • EVENT_LABEL : A label that classifies the event as 'fraud' or 'legit'.
  • EVENT_TIMESTAMP : The timestamp when the event occurred. The timestamp must be in ISO 8601 standard in UTC.

These column names and exact event values are not present in the original raw dataset and need to be added via the glue script. If the glue job is successful, we should see the transformed train and test files in each of the folder locations in the bucket shown below.
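The real job does this with pyspark/glue transforms, but the per-row logic can be sketched in plain Python, using the unix_time and is_fraud columns from the raw dataset:

```python
from datetime import datetime, timezone


def transform_row(row):
    """Sketch of the transformation the glue script applies: map the raw
    label and timestamp to the EVENT_LABEL / EVENT_TIMESTAMP fields Fraud
    Detector expects, keeping both event metadata columns together at the end."""
    out = {k: v for k, v in row.items() if k not in ("is_fraud", "unix_time")}
    # ISO 8601 timestamp in UTC, as required by Fraud Detector
    out["EVENT_TIMESTAMP"] = datetime.fromtimestamp(
        int(row["unix_time"]), tz=timezone.utc
    ).strftime("%Y-%m-%dT%H:%M:%SZ")
    out["EVENT_LABEL"] = "fraud" if int(row["is_fraud"]) == 1 else "legit"
    return out
```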

s3-glue-output-folders

The transformed data written to S3 will then trigger an S3 event notification (configured in the previous section) to SQS, which will invoke the lambda function synchronously to start the Fraud Detector model training job described in the next section.

Model Training

The lambda function, invoked from the S3 event messages received via SQS, will instantiate a model via the CreateModel operation, which acts as a container for your model versions. If the model already exists, it progresses directly to the next step, the CreateModelVersion operation. This starts the training process, which results in a specific version of the model. Please refer to the AWS docs for more details. The script fetches the variables for the training job from the S3 path containing the csv file with the training data.
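As a sketch of what the lambda does (the model id, role ARN, S3 path, and variable list below are illustrative, not the exact values from the stack), the CreateModelVersion request can be assembled like this:

```python
def model_version_request(model_id, role_arn, data_location, variables):
    """Parameters for CreateModelVersion: the event variables plus the
    mapping of our EVENT_LABEL values to Fraud Detector's fraud/legit labels."""
    return {
        "modelId": model_id,
        "modelType": "ONLINE_FRAUD_INSIGHTS",
        "trainingDataSource": "EXTERNAL_EVENTS",
        "trainingDataSchema": {
            "modelVariables": variables,
            "labelSchema": {"labelMapper": {"fraud": ["fraud"], "legit": ["legit"]}},
        },
        "externalEventsDetail": {
            "dataLocation": data_location,
            "dataAccessRoleArn": role_arn,
        },
    }


def start_training(params):
    """Kick off training (needs AWS credentials and a prior CreateModel call)."""
    import boto3

    return boto3.client("frauddetector").create_model_version(**params)
```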

train-model-lambda-logs

Note: The event metadata columns, e.g. EVENT_TIMESTAMP and EVENT_LABEL, need to be ordered together. The glue script reorders the columns in the transformed csv files so that all the event variables come first, followed by the event metadata columns EVENT_TIMESTAMP and EVENT_LABEL at the end. Otherwise, the following exception is seen:

botocore.errorfactory.ResourceNotFoundException: An error occurred (ResourceNotFoundException) when calling the CreateModelVersion operation: VariableIds: [EVENT_TIMESTAMP] do not exist.

The lambda function is also configured with a large enough memory capacity (1024 MB), given the training file size (approx 300 MB). If the default value (128 MB) is used, we will see a memory error, as the lambda memory limit is exceeded when the data is loaded in from S3.

lambda-low-memory-error-runtime

By default, the lambda environment variables MODE and MODEL_VERSION are set to 'create' and '1.0' when created via the cloudformation stack. This makes the script create a new model with the specified version. This is a major version change, so we need to specify a version which does not already exist, usually an increment of the existing version (e.g. if 1.0 already exists, a major version bump would be 2.0). The MODE variable also accepts an update value, which can be set if we need to update an existing model version incrementally, e.g. from 1.0 to 1.0.1; this creates a minor version. We can update the environment variables either via the cloudformation parameters or from the lambda environment variables configuration in the console. For example, to create a new model version 2.0, we would update the MODEL_VERSION variable to 2.0 as below and trigger the glue crawler to execute the workflow and train a new model.

lambda-env-vars-config-verison2-create

We will then see the model training start in the Fraud Detector console in the Model tab.

fraud-train-new-model

To update an existing model version, we set the MODE variable to update and MODEL_VERSION to the major version which needs updating. In the example below, we leave it as 1.0, as we already have an active 1.0 major version which we want to update to 1.0.1.

lambda-train-env-variables

Once the workflow is complete, we can then see the minor version 1.0.1 training job start.

fraud-model-update-version-101-training-console

We cannot train two models for the same major version at once. So if minor version training was already in progress and we executed the workflow again to update the same major version, we would see the following exception:

botocore.errorfactory.ValidationException: An error occurred (ValidationException) when calling the CreateModelVersion operation: Simultaneous training for the same major version not allowed

However, the script catches this exception, prints the message Model Version already training in the logs, and exits without raising another exception. Once the model training for 1.0.1 completes, if we want to re-train again, we can trigger the workflow with MODE as update and it should automatically start a minor version 1.0.2, as it knows that 1.0.1 already exists.
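The versioning behaviour described above can be summarised as a small helper. This is a sketch of the script's logic only (Fraud Detector itself assigns minor versions when updating), with `next_model_version` a hypothetical name:

```python
def next_model_version(existing_versions, mode):
    """MODE=create bumps the major version (1.0 -> 2.0); MODE=update adds a
    minor version under the latest major (1.0.1 -> 1.0.2)."""
    majors = sorted(int(v.split(".")[0]) for v in existing_versions)
    latest_major = majors[-1] if majors else 0
    if mode == "create":
        return f"{latest_major + 1}.0"
    # collect existing minor versions under the latest major, e.g. "1.0.1" -> 1
    minors = [
        int(v.split(".")[2])
        for v in existing_versions
        if v.startswith(f"{latest_major}.") and v.count(".") == 2
    ]
    next_minor = (max(minors) + 1) if minors else 1
    return f"{latest_major}.0.{next_minor}"
```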

Model Performance

From the console, we can compare the performance of different model versions, and click on each version to inspect the score distribution, confusion matrix, and model variable importance. For example, in model version 1 the AUC score is 0.94. The model variable importance plot gives an understanding of how each variable contributes to the model's performance: it lists the model input variables in order of their importance to the model, indicated by the number next to each.

model-versions-performance

A variable with a much higher number relative to the rest could indicate that the model might be overfitting on it, while variables with relatively low numbers could just be noise. Here it shows the model may be overfitting to the amt variable, as it has such a high score relative to the others, while most of the rest contribute noise in this sample dataset. This could be because the model is overfitting to a particular fraud pattern (e.g. all fraud events being related to high amt values), or because there is label leakage if the variable depends on the fraud labels. This version only uses two months of data (Dec 2019 to Jan 2020), so in the next iteration we can include more data and see if it makes a difference.

model-v1

We train a new model (version 2) with over a year of data, from 2019 to mid 2020. This increases training time, but we can see an improvement in the AUC score (0.99) as well as in the variable importance plot.

model-version2-featureimportance

However, we should still check whether we can reduce overfitting. In subsequent iterations, we could also remove the amt variable and see how the model performs, or add some extra variables to diversify the dataset.
We can also check the model performance metrics, which are generated from the 15% of data that Fraud Detector holds out for validation after training is completed.

This includes the following charts:

  • Score distribution chart to review the distribution of model scores for your fraud and legitimate events. Ideally, you will have a clear separation between the fraud and legitimate events. This indicates the model can accurately identify which events are fraudulent and which are legitimate.
  • Confusion matrix which summarizes the model accuracy for a given score threshold by comparing model predictions versus actual results. Depending on your selected model score threshold, you can see the simulated impact based on a sample of 100,000 events (refer to the AWS docs for more information). The distribution of fraud and legitimate events simulates the fraud rate in your businesses. You can use this information to find the right balance between true positive rate and false positive rate.
  • ROC chart which plots the true positive rate as a function of false positive rate over all possible model score thresholds. The ROC curve can help you fine-tune the tradeoff between true positive rate and false positive rate.
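The tradeoff these charts visualize boils down to two numbers per score threshold, which can be computed from the confusion matrix counts (the counts below are made up for illustration):

```python
def rates(tp, fp, tn, fn):
    """True/false positive rates at a given score threshold, as plotted on
    the ROC chart: TPR = TP/(TP+FN), FPR = FP/(FP+TN)."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr
```

Raising the threshold flags fewer events as fraud, lowering both the FPR and the TPR; the ROC chart traces this pair over all thresholds.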

In the screenshots below, I have selected model score thresholds of 500 and 305 in the score distribution chart. You can see how adjusting the model score threshold impacts the TPR and FPR; the ROC and confusion matrix are updated as the threshold is adjusted on the score distribution chart.

modelv2-threshold500

modelv2-threshold-305

In the second part of this article, we will deploy the model and configure API Gateway to set up a REST API to send requests to the model endpoint and make predictions.
