Eduardo Rabelo for AWS Community Builders

Posted on Dec 6, 2023

S3 virus scanning with TypeScript and Node.js 20.x AWS Lambda Container

#aws #node #typescript #docker

Searching the internet, you can find several guides showing how to build serverless virus scanning with ClamAV.

Even on the AWS Blog, you can find good ideas:

Virus scan S3 buckets with a serverless ClamAV based CDK construct
AWS SAM application to keep your S3 objects safe from viruses using ClamAV Open Source software

However, a few aspects fell short of my expectations:

1) They are not using the latest version of ClamAV
The examples above are installing ClamAV directly from the OS package registry (e.g., yum install clamav).

OS package registries aren't always in sync with the latest ClamAV release, and if you don't update or upgrade the registry, you will install even older versions.

Using yum install clamav on Amazon Linux 2, we get version 0.100.x, whereas the current ClamAV release is already in the 1.x series.

2) Outdated Node.js runtime
Examples using Node.js use the outdated Node.js 14.x runtime. This runtime is already under Deprecation (Phase 1) in the AWS timeline for Node.js supported runtimes and is not part of the maintenance window of Node.js releases.

Using outdated and unsupported versions is a risk I try to avoid!

3) Using plain JavaScript
While not a deal-breaker, I don't use plain JavaScript in my production projects.

How would a TypeScript example look in the latest AWS Lambda runtime for Node.js?

4) How to use top-level await in Node.js handlers
Top-level await support in AWS Lambda is not new, but how can we configure it with everything else? (e.g., TypeScript, esbuild, etc). How can we output ESM code in the lambda handler?

Here comes Amazon Linux 2023

Starting with the Node.js 20.x runtime, the default operating system for the AWS Lambda base image is Amazon Linux 2023.

AL2023 brings a new package management tool called dnf.

dnf is the successor to yum, the package management tool in Amazon Linux 2.

Many of the commands are compatible. For example, take the following Amazon Linux 2 yum commands:

$ sudo yum install packagename
$ sudo yum search packagename
$ sudo yum remove packagename

In AL2023, they become these commands:

$ sudo dnf install packagename
$ sudo dnf search packagename
$ sudo dnf remove packagename

Not everything stays the same. Be aware of changes! 🚨

You can check the documentation page on changes in the dnf CLI compared to yum.

Let's build a newer example

We'll keep the example project similar to the guides listed at the beginning:

An S3 bucket with an object notification to an AWS Lambda function
When an object is created in S3, a notification triggers the Lambda
The Lambda reads the file from the bucket, writes it to /tmp, and runs clamscan on it
The exit code returned by clamscan is used to determine the file status

There are a few constraints I want to define for our newer example:

Because we need to ship the ClamAV virus database alongside our Lambda code, the 250 MB limit on the uncompressed size of a zip deployment package can be an issue.
We will use an AWS Lambda container image instead, which allows up to 10 GB of uncompressed image size.
The Docker build process should take care of transpiling TypeScript to JavaScript and installing production-only dependencies.
We want to download and install the latest ClamAV during the Docker build process and update its virus definitions.

The Dockerfile

We can use Docker multi-stage builds to create these steps:

# ========================================
# Builder Image
# ========================================
FROM --platform=linux/x86_64 public.ecr.aws/lambda/nodejs:20 as builder

COPY package.json package-lock.json index.ts  ./

#
# 1) install dependencies with dev dependencies
# 2) build the project
# 3) remove dev dependencies
# 4) install dependencies without dev dependencies
#
RUN npm install && \
    npm run build && \
    rm -rf node_modules && \
    npm install --omit=dev

# ========================================
# Runtime Image
# ========================================
FROM --platform=linux/x86_64 public.ecr.aws/lambda/nodejs:20 as runtime

ENV CLAMAV_PKG=clamav-1.2.1.linux.x86_64.rpm
RUN <<-EOF
    set -ex

    #
    # install glibc-langpack-en to support english language and utf-8
    # this was required by clamscan to avoid error "WARNING: Failed to set locale"
    #
    dnf install wget glibc-langpack-en -y

    # 
    # 1) download latest ClamAV from https://www.clamav.net/downloads
    # 2) install using `rpm` and it requires full path for local packages
    # 3) remove the downloaded package and clean up for smaller runtime image
    # 
    wget https://www.clamav.net/downloads/production/${CLAMAV_PKG}
    rpm -ivh "${LAMBDA_TASK_ROOT}/${CLAMAV_PKG}"
    rm -rf ${CLAMAV_PKG}
    dnf remove wget -y
    dnf clean all

    #
    # the current working directory is "/var/task" as defined in the base image:
    # https://github.com/aws/aws-lambda-base-images/blob/nodejs20.x/Dockerfile.nodejs20.x
    #
    # 1) "lib/database" is the path to download the virus database
    # 2) "freshclam.download.log" and "freshclam.conf.log" are the log files for freshclam CLI
    #
    mkdir -p ${LAMBDA_TASK_ROOT}/lib/database
    touch ${LAMBDA_TASK_ROOT}/lib/{freshclam.download.log,freshclam.conf.log}
    chmod -R 777 ${LAMBDA_TASK_ROOT}/lib

    #
    # default configuration path for freshclam is "/usr/local/etc/freshclam.conf"
    # we create a symbolic link to the default configuration path and copy our custom config file
    #
    ln -s /usr/local/etc/freshclam.conf ${LAMBDA_TASK_ROOT}/lib/freshclam.conf
EOF

COPY freshclam.conf /var/task/lib/freshclam.conf

#
# freshclam CLI is a virus database update tool for ClamAV, documentation:
# https://linux.die.net/man/1/freshclam
#
RUN <<-EOF
    set -ex
    export LOG_FILE_PATH="${LAMBDA_TASK_ROOT}/lib/freshclam.conf.log"

    freshclam --verbose --stdout --user root \
        --log=${LOG_FILE_PATH} \
        --datadir=${LAMBDA_TASK_ROOT}/lib/database

    if grep -q "Can't download daily.cvd\|Can't download main.cvd\|Can't download bytecode.cvd" ${LOG_FILE_PATH}; then
        echo "ERROR: Unable to download ClamAV database files - your request may be being rate limited"
        exit 1;
    fi
EOF

#
# copy application files from the builder image
# 
COPY --from=builder /var/task/dist/* /var/task/
COPY --from=builder /var/task/node_modules /var/task/node_modules

CMD [ "index.handler" ]

The above Dockerfile covers:

Building the TypeScript Lambda into JavaScript with a multi-stage build. The builder stage creates the dist folder and the production-only node_modules folder, and the runtime stage copies both in
Downloading ClamAV from its release page, installing it with rpm, and cleaning the package cache for a smaller final image
Downloading the ClamAV virus database definitions with freshclam
You can change CLAMAV_PKG to stay in sync with the latest version of ClamAV

🚨 Important: To update your database definition, you need to re-build this image every once in a while

The required freshclam.conf file contains the following:

CompressLocalDatabase yes
DatabaseDirectory /var/task/lib/database
DatabaseMirror database.clamav.net
DNSDatabaseInfo current.cvd.clamav.net
ScriptedUpdates no
UpdateLogFile  /var/task/lib/freshclam.conf.log

🚨 Important: The full file paths (e.g., /var/task/*) must match the Dockerfile definitions

The TypeScript AWS Lambda Handler

For an S3 notification event, we can write a handler similar to:

import { S3CreateEvent } from "aws-lambda";
import { GetObjectCommand, S3Client } from "@aws-sdk/client-s3";
import { spawnSync } from "node:child_process";
import { mkdir, unlink, writeFile } from "node:fs/promises";

const s3Client = new S3Client({});

//
// directories for clamscan
// "/tmp/files_to_scan" where we will store the files from s3 to scan
// "/tmp/clamscan_tmp" required by clamscan to store temporary files during the virus scan
//
await mkdir("/tmp/files_to_scan", { recursive: true });
await mkdir("/tmp/clamscan_tmp", { recursive: true });

async function handler(event: S3CreateEvent) {
  console.log(JSON.stringify(event, null, 2));

  for (const record of event.Records) {
    const bucketName = record.s3.bucket.name;
    const objectKey = record.s3.object.key;

    const getObjectCommand = new GetObjectCommand({
      Bucket: bucketName,
      Key: objectKey,
    });
    const s3Object = await s3Client.send(getObjectCommand);
    // read the body as raw bytes so binary files reach the disk unchanged
    const s3ObjectContent = (await s3Object.Body?.transformToByteArray()) as Uint8Array;

    const tmpFilePath = `/tmp/files_to_scan/${objectKey}`;
    await writeFile(tmpFilePath, s3ObjectContent);

    //
    // clamscan CLI documentation:
    // https://linux.die.net/man/1/clamscan
    //
    const clamavScan = spawnSync(
      "clamscan",
      ["--verbose", "--stdout", `--database=/var/task/lib/database`, `--tempdir=/tmp/clamscan_tmp`, tmpFilePath],
      {
        encoding: "utf-8",
        stdio: "pipe",
      },
    );
    console.log(JSON.stringify(clamavScan, null, 2));

    // You can find the return codes here:
    // https://linux.die.net/man/1/clamscan
    if (clamavScan.status === 0) {
      console.log("no virus found");
    } else if (clamavScan.status === 1) {
      console.log("virus found");
    } else if (clamavScan.status === 2) {
      console.log("some error(s) occurred in clamscan");
    }

    await unlink(tmpFilePath);
  }
}

export { handler };

We use the top-level await feature and create two folders when the lambda container starts.

Later, we use spawnSync to trigger the clamscan binary installed via the Dockerfile.
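The handler above only logs for exit codes 0, 1, and 2. As a minimal sketch, the mapping could live in a small function that also treats every other outcome as an error, including the `null` status that `spawnSync` reports when the child process did not exit normally. The `ScanVerdict` type and the `classifyScanStatus` name are hypothetical and not part of the original project:

```typescript
// Hypothetical helper, not part of the original project.
// Maps the exit status reported by spawnSync for clamscan to a verdict.
export type ScanVerdict = "clean" | "infected" | "error";

export function classifyScanStatus(status: number | null): ScanVerdict {
  // 0: no virus found (see the clamscan man page for return codes)
  if (status === 0) return "clean";
  // 1: virus(es) found
  if (status === 1) return "infected";
  // 2, any unexpected code, or null (process did not exit normally):
  // treat as an error so an unscanned file is never reported as clean
  return "error";
}
```

In the handler, this would be called as `classifyScanStatus(clamavScan.status)`. Failing closed on unknown codes is a design choice for this sketch; you may prefer to retry instead.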

Ensure you use full path definitions in the clamscan parameters, for example: /var/task/lib/database, to load the correct virus definitions.

We can test the ClamAV detection using any EICAR test file. When the file is detected, clamscan exits with status 1 and the handler logs "virus found".
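One detail the handler above glosses over: object keys in S3 event notifications arrive URL-encoded (a space shows up as `+`), and a key may contain `/`, which would point at a subfolder that does not exist under /tmp/files_to_scan. A minimal sketch of one way to derive a flat, safe local path follows. The `toLocalScanPath` name and the hash-prefix scheme are assumptions for illustration, not part of the original project:

```typescript
import { createHash } from "node:crypto";
import { posix } from "node:path";

// Hypothetical helper, not part of the original project.
// Turns the raw key from an S3 event record into a flat file path
// inside the scan directory.
export function toLocalScanPath(rawKey: string, scanDir = "/tmp/files_to_scan"): string {
  // S3 event notifications URL-encode the key and use "+" for spaces
  const decodedKey = decodeURIComponent(rawKey.replace(/\+/g, " "));
  // A short hash of the full key keeps "a/x.txt" and "b/x.txt" distinct
  const prefix = createHash("sha256").update(decodedKey).digest("hex").slice(0, 12);
  // basename drops any folder part, so no subdirectories are needed
  return posix.join(scanDir, `${prefix}-${posix.basename(decodedKey)}`);
}
```

Note that the decoded key, not the raw one, is also what `GetObjectCommand` expects in its `Key` parameter.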

Now, we have our Dockerfile, ClamAV configuration, and Lambda handler.

Where do we deploy all of that?

The CDK TypeScript Project

Because Docker builds our Lambda handler, the handler gets its own package.json with its dependencies:

{
  "name": "clamav-scan",
  "version": "1.0.0",
  "type": "module",
  "scripts": {
    "build": "rimraf dist && esbuild index.ts --format=esm --outfile=dist/index.mjs"
  },
  "devDependencies": {
    "@types/aws-lambda": "^8.10.130",
    "esbuild": "^0.19.8",
    "rimraf": "^5.0.5",
    "typescript": "^5.3.2"
  },
  "dependencies": {
    "@aws-sdk/client-s3": "^3.465.0"
  }
}

Using "type": "module" will tell TypeScript and Node.js that we are aiming to use ECMAScript Modules in our source code (ESM).

The build command asks esbuild to output our source code in the ESM format with the --format=esm flag.

The last piece of the puzzle is the tsconfig.json:

{
  "compilerOptions": {
    "esModuleInterop": true,
    "forceConsistentCasingInFileNames": true,
    "isolatedModules": true,
    "module": "NodeNext",
    "moduleResolution": "NodeNext",
    "noEmit": true,
    "preserveConstEnums": true,
    "skipLibCheck": true,
    "sourceMap": false,
    "strict": true,
    "target": "ESNext"
  },
  "exclude": ["node_modules"]
}

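Using NodeNext for module / moduleResolution and ESNext for target tells the TypeScript compiler (tsc) to treat our source code as ESM when type-checking. Because noEmit is set, tsc does not write any files; esbuild is the tool that produces the JavaScript output.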

The complete example can be found on GitHub:

oieduardorabelo / s3-virus-scanning-typescript-aws-lambda-container


🚨 WARNING: You are being rate-limited

This is super important and caught me off guard multiple times.

Pay attention to the number of viruses your database is loading (clamscan reports it as "Known viruses" in its scan summary).

During the build process of your Docker image, the ClamAV database mirror can rate limit your IP address and block you from downloading the virus definitions.
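The scan summary printed by clamscan includes a "Known viruses" line, so the handler could read that number from the captured stdout and log or alert when it looks suspiciously low. A minimal sketch, where the `parseKnownViruses` name is hypothetical and what counts as "too low" is left to you:

```typescript
// Hypothetical helper, not part of the original project.
// Extracts the signature count from the "Known viruses: N" line of the
// clamscan scan summary. Returns null when the line is missing.
export function parseKnownViruses(clamscanStdout: string): number | null {
  const match = /^Known viruses:\s*(\d+)\s*$/m.exec(clamscanStdout);
  const digits = match?.[1];
  return digits === undefined ? null : Number.parseInt(digits, 10);
}
```

In the handler, `clamavScan.stdout` is already a string because `spawnSync` is called with `encoding: "utf-8"`, so it can be passed in directly.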

For example, visiting https://database.clamav.net/main.cvd while rate limited can return an error response instead of the database file.

Ensure your freshclam is downloading and loading the definitions daily.cvd, main.cvd, and bytecode.cvd.

By default, the freshclam CLI will NOT throw an error when that happens.

That's why in the Dockerfile we are grepping the log file generated by the CLI and looking for errors:

    if grep -q "Can't download daily.cvd\|Can't download main.cvd\|Can't download bytecode.cvd" ${LOG_FILE_PATH}; then
        echo "ERROR: Unable to download ClamAV database files - your request may be being rate limited"
        exit 1;
    fi

And we manually throw an error when any of the rate-limiting messages are detected! 🏁
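If you would rather run the same check from a Node.js build script than from the shell, the grep above can be mirrored in a few lines. This is only a sketch: the `findFailedDownloads` name is hypothetical, and it looks for the same "Can't download ..." strings the Dockerfile greps for.

```typescript
// Hypothetical helper, not part of the original project.
// Mirrors the Dockerfile grep: returns the database files that the
// freshclam log reports as failed downloads.
const REQUIRED_DATABASES = ["daily.cvd", "main.cvd", "bytecode.cvd"] as const;

export function findFailedDownloads(freshclamLog: string): string[] {
  return REQUIRED_DATABASES.filter((database) =>
    freshclamLog.includes(`Can't download ${database}`),
  );
}
```

A build script could read the log file, call `findFailedDownloads` on its contents, and exit with a non-zero code when the returned list is not empty.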
