DEV Community

Cover image for Type-safe S3 Select queries with Kysely
Thomas Aribart for Serverless By Theodo

Posted on

Type-safe S3 Select queries with Kysely

S3 Select is an Amazon S3 feature that enables retrieving subsets of S3 Objects content via SQL expressions. You can use clauses like SELECT and WHERE to fetch data from CSVs, JSONs, or Apache Parquet files, even if they are compressed with GZIP and/or server-side encrypted.

S3 Select is simple to use, cost-effective and can drastically improve the performances of your application depending on your query. It is overall a great addition to the Serverless developer toolkit, particularly when:

  • You need to access a large amount of data (which would make it unfit for other storage solutions like DynamoDB)
  • You need to query and filter it in a complex and/or dynamic way
  • You don’t need to use a JOIN clause (as S3 Select can only query a single file) or to update a single record of the data (it is called S3 Select, not S3 Insert 🙂)

In this article, we’ll learn how to run S3 Select commands in Typescript, and how we can improve our DevX and type-safety using Kysely.

Querying with S3 Select

Let’s say that we have a DB of Pokemons in the shape of a CSV, stored somewhere in a S3 Bucket:

id ; name      ; customName ; type     ; level ; generation
1  ; pikachu   ;            ; electric ; 42    ; 1
2  ; charizard ;            ; fire     ; 54    ; 1
3  ; meganium  ; plantyDino ; grass    ; 26    ; 2
...
Enter fullscreen mode Exit fullscreen mode

What if we want to retrieve the fire Pokemons from generation 1 and 2? Well, S3 Select let us do that with the following query:

import { S3Client, SelectObjectContentCommand } from '@aws-sdk/client-s3';

export const s3Client = new S3Client({
  region: 'us-east-1', // <= Your region here
});

const { Payload } = await s3Client.send(
  new SelectObjectContentCommand({
    // 👇 Those first params are required but don't mind them
    ExpressionType: 'SQL',
    OutputSerialization: {
      JSON: {
        RecordDelimiter: ',',
      },
    },
    // 👇 Those depends on the CSV
    InputSerialization: {
      CSV: {
        FileHeaderInfo: 'USE',
        FieldDelimiter: ';',
        QuoteCharacter: '"',
      },
    },
    // 👇 Those are the most important
    Bucket: 'my-super-bucket-name',
    Key: 'pokedex.csv',
    Expression: `
            select "id", "name", "customName", "type" as "pokemonType", "level"
                from "S3Object"
                where
                    "generation" in ('1', '2')
                    and "type" = 'fire'
      `,
  }),
);

// Note that this command requires the s3:GetObject permission
Enter fullscreen mode Exit fullscreen mode

This is great and all but we can make it better:

  • 📝 We need to add some custom parsing to actually use the response Payload
  • 👍 We can use Kysely to produce the SQL expression in a type-safe and devX-friendly way
  • 🌈 We can also use it to type the command result

Parsing the query response

The Payload is not a plain Javascript object but an instance of AsyncInterable. Now what the hell is an AsyncIterable you ask me? Well, you could look for the official Async Iterator documentation on MDN… or you can just stop asking questions and use the following helper:

import type { SelectObjectContentEventStream } from '@aws-sdk/client-s3';

// 👇 TextDecoder is a native Node/Browser class
const textDecoder = new TextDecoder();

const parseS3SelectEventStream = async (
  s3SelectEventStream:
    | AsyncIterable<SelectObjectContentEventStream>
    | undefined,
): Promise<unknown[]> => {
  if (!s3SelectEventStream) {
    return [];
  }

  const stringifiedJSONOutputs: string[] = [];

  for await (const event of s3SelectEventStream) {
    if (event.Records) {
      const stringifiedJSONOutput = textDecoder.decode(event.Records.Payload);
      stringifiedJSONOutputs.push(stringifiedJSONOutput);
    }
  }

  const rows = JSON.parse(
    '[' + stringifiedJSONOutputs.join('').slice(0, -1) + ']',
  ) as unknown[];

  return rows;
};
Enter fullscreen mode Exit fullscreen mode

Now we can parse the output like this:

const { Payload: s3SelectEventStream } = await s3Client.send(
  new SelectObjectContentCommand({
    // ...
  }),
);

// 🎉 Ta-da!
const myPokemons: unknown[] = await parseS3SelectEventStream(
  s3SelectEventStream,
);
Enter fullscreen mode Exit fullscreen mode

Building the query with Kysely

Let’s face it, writing SQL queries by hand is a pain and very error prone. Besides, wouldn't it be nice to have some type error if the CSV suddenly changes shape?

That’s where Kysely comes to the rescue: Kysely is a type-safe and devX-friendly typescript SQL query builder. It was designed to work with PostgreSQL and MySQL, but it exposes a few classes that can let us write queries without being connected to an actual relational database.

Let’s begin by designing our CSV type:

enum PokemonType {
  Water = 'water',
  Grass = 'grass',
  Fire = 'fire',
  // ...
}

interface PokemonCSV {
  id: string;
  name: string;
  customName?: string;
  type: PokemonType;
  // 👇 In CSVs everything is a string
  level: string;
  generation: '1' | '2'; // ...up to 9
}
Enter fullscreen mode Exit fullscreen mode

Next, let’s install Kysely and instanciate a Kysely database:

# npm
npm install kysely

# yarn
yarn add kysely
Enter fullscreen mode Exit fullscreen mode
import {
  Kysely,
  DummyDriver,
  SqliteAdapter,
  SqliteIntrospector,
  SqliteQueryCompiler,
} from 'kysely';

interface Database {
  S3Object: PokemonCSV;
}

const db = new Kysely<Database>({
  dialect: {
    createAdapter: () => new SqliteAdapter(),
    createDriver: () => new DummyDriver(),
    createIntrospector: ($db: Kysely<unknown>) => new SqliteIntrospector($db),
    createQueryCompiler: () => new SqliteQueryCompiler(),
  },
});

// That’s it 🎉
Enter fullscreen mode Exit fullscreen mode

Simple, no? Now, let’s write the same query, but while enjoying some type-safety and auto-completion:

const kyselyQuery = db
  .selectFrom('S3Object')
  .select([
    'id',
    'name',
    'customName',
    // 🙌 You can rename columns as you like
    'type as pokemonType',
    'level',
    // 💥 Will trigger an error:
    'unexistingColumn',
  ])
  // 🙌 Every method is type-safe!
  .where('generation', 'in', ['1', '2'])
  .where('type', '=', PokemonType.Fire);
Enter fullscreen mode Exit fullscreen mode

To protect us from nasty SQL injections, Kysely doesn’t directly provide us with the SQL expression but with a sql string and parameters array:

const {
  sql, // 👈 SQL query with '?' as placeholders
  parameters, // 👈 Array of parameters
} = kyselyQuery.compile();
Enter fullscreen mode Exit fullscreen mode

Because S3 Select doesn’t accept parameters in its API, we have to hydrate the parameters ourselves:

const dangerouslyHydrateSQLParameters = (
  sql: string,
  parameters: readonly unknown[],
): string => {
  for (const parameter of parameters) {
    sql = sql.replace('?', `'${String(parameter)}'`);
  }

  return sql;
};

const { sql, parameters } = kyselyQuery.compile();

// ⛔️ **BE SURE TO VALIDATE DYNAMIC PARAMETERS FIRST** ⛔️
const sqlExpression = dangerouslyHydrateSQLParameters(sql, parameters);
console.log(sqlExpression);

// 👇 We retrieve the same expression as above:
// select "id", "name", "customName", "type" as "pokemonType", "level"
//   from "S3Object"
//   where
//     "generation" in ('1', '2')
//     and "type" = 'fire'
Enter fullscreen mode Exit fullscreen mode

We just have to provide it to our S3 Select command and voilà! We’re done!

Well almost: Notice that we only typed our SQL query expression! What about the response that comes from S3?

Inferring the response type

Initially, Kysely was not only made to build queries but also execute them. We can just benefit from the inferred type by inspecting the execute property of our Kysely query. That can be done with the help of some TS wizardry:

import type { SelectQueryBuilder } from 'kysely';

type QueryResponseRow<
  // 👇 Add a large type constraint
  KyselyQuery extends SelectQueryBuilder<
    Record<string, unknown>,
    string,
    unknown
  >,
> =
  // 👇 Remove the Promise wrapper
  Awaited<
    // 👇 Get the return type of the execute method
    ReturnType<KyselyQuery['execute']>
    // 👇 Unpack the array
  >[number];

type Pokemon = QueryResponseRow<typeof kyselyQuery>;

// 👇 Equivalent to:
type Pokemon = {
  id: string;
  name: string;
  // 👍 customName is indeed possibly undefined
  customName: string | undefined;
  level: string;
  // 🙌 "type" property has been renamed
  pokemonType: PokemonType;
};
Enter fullscreen mode Exit fullscreen mode

Now we can just use this type the response of our parseS3SelectEventStream util and voilà! We’re done! For good this time 🙂

Conclusion

Both S3 Select and Kysely are awesome tools. By joining both of them, we can run performant, scalable and cost-effective queries while benefitting from strong and DRY type-safety and inference 🥳

Type-safe S3 Select queries with Kysely

Note that there is one minor drawback, though: Kysely will add 120KB in your Lambdas bundles (props to for serverless-analyze-bundle-plugin for helping me out with this 🙌). It is not a lot, but not negligible either as NodeJS Lambdas bundles above 5MB negatively impacts their cold starts. So you might want to re-evaluate adding Kysely to your bundles if your query is not changing often.

Top comments (1)

Collapse
 
thomasaribart profile image
Thomas Aribart • Edited

EDIT: You can actually use the CAST sql clause to cast the pokemon levels as number. Is is both supported by Kysely and S3 Select 🙂 See: