James Moschou

Posted on Mar 29, 2023 • Originally published at criteria.sh

How to write JSON Schemas for your API Part 1: validating data

All APIs should treat data coming from the client as untrusted until it can be validated.

By validating incoming data, you can ensure that it is safe, and can be interpreted and processed correctly by your server. Your validation rules, therefore, need to be communicated to the client via documentation so that they know how to format their data properly.

A common challenge many API teams face is keeping their documentation up to date as their API evolves. By using JSON Schema as both the validation engine and the communication tool, the two always stay in sync, since there is only one source of truth.

This series of articles will explain what each JSON Schema keyword means with realistic examples so that you can communicate as much information to your customers as possible.

Background on JSON

JSON is typically used to describe data as a collection of labelled fields or key-value pairs.

A form with labelled fields may be represented as an object in JSON like this:

{
  "api_name": "Example API",
  "api_version": "1.0.0",
  "description": "An example API to demonstrate JSON Schema."
}

JSON can also describe a sequence of unlabelled values called an array.

[
  {
    "message": "This is an object value in an array."
  },
  "This is a plain string value in an array."
]

JSON also defines the data types of individual values: strings (text), numbers (both whole and fractional), booleans (true/false) and the empty value null. Unlike JavaScript, JSON has no undefined value.

Objects and arrays are the compositional features of JSON. Objects can contain arrays and vice versa. This allows complex data structures to be composed of smaller structures.

General purpose validation keywords

type

JSON Schema answers the question: what type of JSON data do we expect?

We know that JSON data can be an object, array, string, number, boolean or empty value. This brings us to our first and most important validation keyword: type.

In most cases type literally says the type that we expect: "array", "boolean", "integer", "null", "number", "object", "string".

In practice, most JSON schemas that are used to validate requests and responses will specify that the payload must be an object. This is good practice from an API design point of view because it allows for the API to evolve incrementally by adding new fields without breaking existing clients.

This JSON Schema says that we always expect the JSON value to be an object.

{
  "type": "object"
}

If the data is allowed to take more than one form we can also specify more than one type in an array.

A use case for this might be to describe a configuration option. For example, the user could specify the value of each setting individually or select a preset via its name:

POST /print_jobs

{
  "printer_settings": "black_and_white"
}

POST /print_jobs

{
  "printer_settings": {
    "paper_size": "A4",
    "color": true,
    "double_sided": true
  }
}

The JSON Schema for the value of printer_settings would look like this:

{
  "type": ["string", "object"]
}
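
To make the behaviour of type concrete, here is a minimal sketch in plain Python of the check a validator performs for this keyword. It is a hand-rolled illustration, not the API of any validation library, and the function names matches_type and _matches_one are hypothetical.

```python
def matches_type(value, expected):
    """Return True if `value` satisfies a JSON Schema "type" keyword.

    `expected` is either a single type name or a list of type names.
    """
    names = [expected] if isinstance(expected, str) else expected
    return any(_matches_one(value, name) for name in names)


def _matches_one(value, name):
    # bool is a subclass of int in Python, so rule it out before number checks.
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if name == "null":
        return value is None
    if name == "boolean":
        return isinstance(value, bool)
    if name == "string":
        return isinstance(value, str)
    if name == "array":
        return isinstance(value, list)
    if name == "object":
        return isinstance(value, dict)
    if name == "number":
        return is_number
    if name == "integer":
        # A number with a zero fractional part, such as 1.0, counts as an integer.
        return is_number and (isinstance(value, int) or value.is_integer())
    raise ValueError(f"unknown type name: {name}")


printer_settings_type = ["string", "object"]
print(matches_type("black_and_white", printer_settings_type))     # True
print(matches_type({"paper_size": "A4"}, printer_settings_type))  # True
print(matches_type(42, printer_settings_type))                    # False
```

When type is a list, a value only has to match one of the listed names.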

enum

The printer example exposes a potential issue in the printer settings schema. What happens if the user specifies a paper size that doesn't exist?

We can describe all the possible values upfront using the enum keyword.

{
  "type": "string",
  "enum": ["A3", "A4", "A5", "US Legal", "US Letter"]
}

Usually enum is used with string values, which translates well to many programming languages. However, JSON Schema is more flexible and the enum keyword can be used with any type of value.
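
As a rough sketch of what this schema enforces, the check below mirrors it in plain Python. The name is_allowed_paper_size is made up for illustration. Note that enum compares exact values, so a differently cased string does not match.

```python
PAPER_SIZES = ["A3", "A4", "A5", "US Legal", "US Letter"]


def is_allowed_paper_size(value):
    """Mimic {"type": "string", "enum": [...]} for the paper size field."""
    return isinstance(value, str) and value in PAPER_SIZES


print(is_allowed_paper_size("A4"))  # True
print(is_allowed_paper_size("a4"))  # False, the comparison is case-sensitive
print(is_allowed_paper_size("B5"))  # False, not in the list
```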

const

Occasionally your API may have fields that, when specified, always have the same value no matter what. This can be described using const.

{
  "type": "string",
  "const": "must be this value"
}

Using const is equivalent to using enum with only one value.

Why would you have this? A field where you already know the value doesn't convey any new salient information.

Constant values can be useful in requests where you want the user to be explicit about the effect of the API call. They also allow you to expand the use cases of the field into an enum in the future.

For example, a POST /shareable_links API will make a resource available to anyone on the Internet. Calling this API would have drastic consequences if the user never meant for everyone to be able to see the resource. Making the user explicit about the effect can reduce confusion and misuse of APIs.

POST /shareable_links

{
  "visibility": "public"
}

Later on, a "private" visibility option could be added to the API request in a backwards compatible manner.

Validating textual values

minLength and maxLength

If your API expects textual data, you can specify the minimum and maximum number of characters that are allowed.

Some APIs may restrict text to a maximum length based on what the underlying storage layer has the capacity for.

An example of a minimum length might be a phone number with a full area code where there is always a certain number of digits. If there are not enough digits, you know that the supplied value is not a valid phone number.
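
The length check can be sketched in a few lines of Python. This is an illustration only: check_length is a hypothetical helper, and the 10-character limit is an assumed example rather than a rule from any particular phone numbering plan. As far as the specification goes, length is counted in characters rather than bytes, which matches how Python's len treats strings.

```python
def check_length(text, min_length=0, max_length=None):
    """Mimic minLength / maxLength for a string value."""
    count = len(text)  # Python counts Unicode code points here, not bytes
    if count < min_length:
        return False
    if max_length is not None and count > max_length:
        return False
    return True


# Assumed example: a phone number field that must hold exactly 10 characters.
print(check_length("0412345678", min_length=10, max_length=10))  # True
print(check_length("12345", min_length=10, max_length=10))       # False

# One accented character is one character, even though it is two bytes in UTF-8.
print(len("caf\u00e9"), len("caf\u00e9".encode("utf-8")))        # 4 5
```

The distinction matters when maxLength is derived from a storage limit measured in bytes.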

pattern

The phone number field example highlights another way we can check the integrity of the data. Phone numbers only contain digits and perhaps spaces or punctuation marks. We can forbid disallowed characters by using pattern, whose value is a regular expression.

Regular expressions are too big a topic to go into here. The following schema specifies that the string value must be a series of zero or more characters, where each character is a digit, space, parenthesis or dash.

{
  "pattern": "^[\\d ()-]*$"
}

Regular expressions can also go beyond describing what characters are allowed, and also specify what order they must appear in.

This example describes a US phone number formatted as (XXX) XXX-XXXX.

{
  "pattern": "^\\(\\d{3}\\) \\d{3}-\\d{4}$"
}
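
Two details are easy to miss here: the doubled backslashes are JSON string escaping, and a pattern is only anchored if it contains ^ and $ itself. The sketch below shows both using Python's re module as a stand-in for the regular expression dialect a real validator would use. The helper matches_pattern is hypothetical, and the re.ASCII flag is an assumption made so that \d only matches the digits 0 to 9.

```python
import json
import re

# The schema exactly as it appears in a JSON document (note the doubled backslashes).
schema_text = r'{"pattern": "^\\(\\d{3}\\) \\d{3}-\\d{4}$"}'
schema = json.loads(schema_text)

# After JSON decoding, each doubled backslash becomes a single one.
print(schema["pattern"])  # ^\(\d{3}\) \d{3}-\d{4}$


def matches_pattern(text, pattern):
    # re.search rather than re.fullmatch: the pattern may match anywhere
    # in the string unless it is anchored explicitly.
    return re.search(pattern, text, flags=re.ASCII) is not None


print(matches_pattern("(555) 123-4567", schema["pattern"]))  # True
print(matches_pattern("555-123-4567", schema["pattern"]))    # False
print(matches_pattern("call 555 now", r"\d{3}"))             # True, unanchored
```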

format

There are common use cases where the text value actually has a specific, well-known meaning such as an email address. We can use the format keyword to describe this.

{
  "type": "string",
  "format": "email"
}

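This says that the text value must be an email address, not just any sequence of characters. Be aware that in recent drafts of JSON Schema, validators may treat format as informational by default and only enforce it when configured to do so.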

Validating numeric values

minimum, maximum, exclusiveMinimum and exclusiveMaximum

If a field can be set to a number, we can define the range that we expect the number to be within.

A very common use case is to disallow negative numbers:

{
  "type": "number",
  "minimum": 0
}

The keywords minimum, maximum, exclusiveMinimum and exclusiveMaximum correspond to the mathematical inequalities >=, <=, > and < respectively. You can use these keywords in any combination, though typically you would use either the exclusive or the non-exclusive variant for each bound.
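
The four inequalities can be written out as a small Python sketch. The function in_range and its parameter names are hypothetical; they simply map each keyword onto its comparison.

```python
def in_range(value, minimum=None, maximum=None,
             exclusive_minimum=None, exclusive_maximum=None):
    """Each keyword that is present must hold; absent keywords impose nothing."""
    if minimum is not None and value < minimum:
        return False  # minimum means value >= minimum
    if maximum is not None and value > maximum:
        return False  # maximum means value <= maximum
    if exclusive_minimum is not None and value <= exclusive_minimum:
        return False  # exclusiveMinimum means value > bound
    if exclusive_maximum is not None and value >= exclusive_maximum:
        return False  # exclusiveMaximum means value < bound
    return True


print(in_range(0, minimum=0))            # True, zero is allowed
print(in_range(0, exclusive_minimum=0))  # False, must be strictly positive
```

The difference between the two variants only shows up at the boundary value itself.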

multipleOf

You can specify that the number must be a multiple of something using multipleOf.

Say your Calendar API accepts a meeting duration in minutes, but only in 15-minute increments. You can use multipleOf to disallow 20-minute meetings. This means your API can avoid having an awkward number_of_quarter_hours field and instead have an easier-to-use minutes field that still respects the constraints of your application.

{
  "type": "integer",
  "multipleOf": 15
}
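
For the 15-minute increments described above, the check amounts to a remainder test. The sketch below is an illustration with a hypothetical helper name, is_multiple_of. It goes through Decimal as a design choice, because the naive remainder on binary floating-point numbers gives surprising results for fractional multiples.

```python
from decimal import Decimal


def is_multiple_of(value, multiple):
    """Sketch of multipleOf: value divided by multiple must be a whole number."""
    # Decimal(str(...)) sidesteps binary floating-point noise such as 0.3 % 0.1.
    return Decimal(str(value)) % Decimal(str(multiple)) == 0


print(is_multiple_of(45, 15))  # True, a 45-minute meeting is allowed
print(is_multiple_of(20, 15))  # False, a 20-minute meeting is rejected
```

With integer minutes the plain % operator would do; the Decimal route only matters if the field ever accepts fractional values.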

Validating arrays

items

When a field value is an array, you can specify the data structure of each item in the array using items.

Say your API allows adding multiple tags to an element. You can validate that each tag is a non-empty string.

{
  "type": "array",
  "items": {
    "type": "string",
    "minLength": 1
  }
}

uniqueItems

Continuing with the tags example, it doesn't make sense for an element to have two tags that are identical. We can use uniqueItems to disallow duplicate items in an array.

{
  "type": "array",
  "items": {
    "type": "string",
    "minLength": 1
  },
  "uniqueItems": true
}
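
Put together, the tags schema performs three checks: every item is a string, no item is empty, and no item repeats. A minimal Python sketch of those checks follows. The function tag_errors and its error messages are made up for illustration, and it only handles the string case that this schema describes.

```python
def tag_errors(tags):
    """Return a list of problems found in a tags array (empty list means valid)."""
    if not isinstance(tags, list):
        return ["value is not an array"]
    errors = []
    seen = set()
    for index, tag in enumerate(tags):
        if not isinstance(tag, str):  # items: {"type": "string"}
            errors.append(f"item {index} is not a string")
            continue
        if len(tag) < 1:  # items: {"minLength": 1}
            errors.append(f"item {index} is empty")
        if tag in seen:  # uniqueItems: true
            errors.append(f"item {index} duplicates an earlier tag")
        seen.add(tag)
    return errors


print(tag_errors(["api", "json"]))  # []
print(tag_errors(["api", "api"]))   # ['item 1 duplicates an earlier tag']
```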

minItems and maxItems

Just as we can limit the length of strings and the value of numbers to a certain range, we can limit the number of items in an array too.

Say in a survey builder we want to allow the creation of a new survey of questions, but each survey must have at least one question. We can disallow empty arrays by setting minItems to 1.

{
  "type": "array",
  "minItems": 1
}

Similarly, we can set an upper limit on the number of questions using maxItems.

prefixItems

Say we want to introduce an additional constraint on new surveys, where the first few questions are always the same, but the rest of the questions can vary.

We can use prefixItems to specify a sequence of schemas that the first items in an array must satisfy.

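This example fixes the first two questions of every survey to ask for name and address. To guarantee that both questions are actually present, you would also set minItems to 2 and list question under required, a keyword covered later in this article.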

{
  "type": "array",
  "prefixItems": [
    {
      "type": "object",
      "properties": {
        "question": {
          "type": "string",
          "const": "What is your name?"
        }
      }
    },
    {
      "type": "object",
      "properties": {
        "question": {
          "type": "string",
          "const": "What is your address?"
        }
      }
    }
  ],
  "items": {
    "type": "object",
    "properties": {
      "question": {
        "type": "string"
      }
    }
  }
}

contains

We can use the contains keyword to specify that at least one item in the array must satisfy an additional constraint.

For example, we can specify that a team that consists of team members must contain at least one member who is an admin.

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "name": {
        "type": "string"
      },
      "isAdmin": {
        "type": "boolean"
      }
    }
  },
  "contains": {
    "type": "object",
    "required": ["isAdmin"],
    "properties": {
      "isAdmin": {
        "type": "boolean",
        "const": true
      }
    }
  }
}

maxContains and minContains

We can also specify the number of times the contains keyword would match using maxContains and minContains.

For example, if a team must have exactly one admin user, we can specify this.

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "name": {
        "type": "string"
      },
      "isAdmin": {
        "type": "boolean"
      }
    }
  },
  "contains": {
    "type": "object",
    "required": ["isAdmin"],
    "properties": {
      "isAdmin": {
        "type": "boolean",
        "const": true
      }
    }
  },
  "minContains": 1,
  "maxContains": 1
}
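
The intended rule, exactly one member whose isAdmin is true, can be sketched in Python as a count compared against two bounds. The helper names count_admins and has_allowed_admin_count are hypothetical. The sketch counts only members that explicitly set isAdmin to true.

```python
def count_admins(team):
    """Count the members that the contains subschema is meant to match."""
    return sum(
        1
        for member in team
        if isinstance(member, dict) and member.get("isAdmin") is True
    )


def has_allowed_admin_count(team, min_contains=1, max_contains=1):
    """minContains <= number of matching items <= maxContains."""
    return min_contains <= count_admins(team) <= max_contains


team = [
    {"name": "Ada", "isAdmin": True},
    {"name": "Grace", "isAdmin": False},
]
print(has_allowed_admin_count(team))  # True, exactly one admin
print(has_allowed_admin_count(team + [{"name": "Linus", "isAdmin": True}]))  # False
```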

unevaluatedItems

The unevaluatedItems keyword defines a schema that applies to any items in an array that were not already evaluated by prefixItems, items or contains.

A very simple use-case of this keyword is to disallow any unexpected values.

This schema describes an array with at most two values, both of which must be strings. Add minItems if the array must contain exactly two.

{
  "type": "array",
  "prefixItems": [{ "type": "string" }, { "type": "string" }],
  "unevaluatedItems": false
}

You typically wouldn't use items and unevaluatedItems together, since items will evaluate every value in the array.

Validating objects

A JSON object is made up of pairs of keys and values. Each pair is called a property, which can be separately validated using nested schemas.

properties

Validate each value in an object using properties.

This example describes a calendar event structure with a name, start time and end time:

{
  "type": "object",
  "properties": {
    "name": {
      "type": "string"
    },
    "start_time": {
      "type": "string",
      "format": "date-time"
    },
    "end_time": {
      "type": "string",
      "format": "date-time"
    }
  }
}

required

By default, all properties in an object are treated as optional. To make some properties mandatory, use required.

Let's update the calendar event structure to add an optional location property, while making the other properties mandatory.

{
  "type": "object",
  "required": ["name", "start_time", "end_time"],
  "properties": {
    "name": {
      "type": "string"
    },
    "start_time": {
      "type": "string",
      "format": "date-time"
    },
    "end_time": {
      "type": "string",
      "format": "date-time"
    },
    "location": {
      "type": "string"
    }
  }
}
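
What required checks is presence, nothing more. A short Python sketch makes that visible; the function missing_required is a hypothetical name and the event values are invented sample data.

```python
REQUIRED = ["name", "start_time", "end_time"]


def missing_required(event, required=REQUIRED):
    """Return the required property names that are absent from the object."""
    return [name for name in required if name not in event]


event = {"name": "Stand-up", "start_time": "2023-03-29T09:00:00Z"}
print(missing_required(event))  # ['end_time']

# A property that is present but null still satisfies required;
# rejecting null is the job of the type keyword on that property.
print(missing_required({"name": None, "start_time": None, "end_time": None}))  # []
```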

dependentRequired

Whether a property is required or not cannot always be decided upfront. The dependentRequired keyword allows you to specify scenarios where a property becomes mandatory if another optional property is also included.

For example, you may have a Food Delivery API that allows specifying a delivery address when making a food delivery order. If an address is not specified it is a pick-up order. If the delivery address is specified, then an additional property specifying whether the courier has the authority to leave the delivery in a safe place must also be specified.

{
  "type": "object",
  "dependentRequired": {
    "delivery_address": ["authority_to_leave"]
  },
  "properties": {
    "delivery_address": {
      "type": "object"
    },
    "authority_to_leave": {
      "type": "boolean"
    }
  }
}
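
The conditional rule reads naturally as code: for each trigger property that is present, its dependents must be present too. The sketch below is illustrative only, with a hypothetical function name missing_dependents and invented order data.

```python
DEPENDENT_REQUIRED = {"delivery_address": ["authority_to_leave"]}


def missing_dependents(order, dependent_required=DEPENDENT_REQUIRED):
    """Return dependent properties that are absent although their trigger is present."""
    missing = []
    for trigger, dependents in dependent_required.items():
        if trigger in order:
            missing.extend(name for name in dependents if name not in order)
    return missing


pickup = {}
delivery = {"delivery_address": {"street": "1 Example St"}}
print(missing_dependents(pickup))    # [], a pick-up order needs nothing extra
print(missing_dependents(delivery))  # ['authority_to_leave']
```

The dependency runs in one direction only: supplying authority_to_leave without a delivery address is not rejected by this keyword.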

additionalProperties

Sometimes your data structures require more flexibility than what is normally possible by specifying fixed fields upfront.

The additionalProperties keyword allows you to validate the values of properties whose names you don't know upfront. This is useful to allow consumers of your API to define their own properties, such as with a custom metadata field.

This example lets clients include custom fields in a metadata object, but only for simple values like strings, numbers and booleans.

{
  "type": "object",
  "properties": {
    "metadata": {
      "type": "object",
      "additionalProperties": {
        "type": ["string", "number", "boolean"]
      }
    }
  }
}

patternProperties

While you don't always know the property names upfront, as with the custom metadata field, you can still constrain them in some way.

This is especially useful to group custom properties in a certain "namespace" to maintain forwards compatibility with future versions.

This example allows additional custom properties on a contact record, but requires them to start with the user_defined_ prefix:

{
  "type": "object",
  "properties": {
    "name": {
      "type": "string"
    },
    "email": {
      "type": "string",
      "format": "email"
    }
  },
  "patternProperties": {
    "^user_defined_": {}
  },
  "additionalProperties": false
}
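
It can help to see how a validator sorts each property name into the keyword that covers it. The following Python sketch does only that sorting; classify_properties is a hypothetical helper and the contact record is invented. Anything that lands in the additional bucket is governed by additionalProperties, which permits such properties when it is omitted and rejects them when it is set to false.

```python
import re


def classify_properties(instance, properties, pattern_properties):
    """Sort each property name of `instance` by the keyword that covers it."""
    buckets = {"properties": [], "patternProperties": [], "additional": []}
    for name in instance:
        declared = name in properties
        # Patterns are unanchored unless they say otherwise, hence re.search.
        matched = any(re.search(pattern, name) for pattern in pattern_properties)
        if declared:
            buckets["properties"].append(name)
        if matched:
            buckets["patternProperties"].append(name)
        if not declared and not matched:
            buckets["additional"].append(name)
    return buckets


contact = {"name": "Ada", "user_defined_tier": "gold", "nickname": "A"}
result = classify_properties(
    contact,
    properties={"name": {}, "email": {}},
    pattern_properties={"^user_defined_": {}},
)
print(result["additional"])  # ['nickname']
```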

propertyNames

If you just want to validate the names of properties, but allow any values, you can use propertyNames.

A very common use case is to only allow property names made up of certain characters, excluding spaces, punctuation and other non-alphanumeric characters.

{
  "type": "object",
  "propertyNames": {
    "pattern": "^[A-Za-z_][A-Za-z0-9_]*$"
  }
}

minProperties and maxProperties

For flexible data structures where you don't know all the properties up front, you can limit the number of custom properties to a certain range using minProperties and maxProperties.

This example describes a color palette structure that contains between 3 and 5 color objects, keyed by their hex code.

{
  "type": "object",
  "patternProperties": {
    "^#[0-9a-f]{6}$": {
      "type": "object",
      "properties": {
        "name": { "type": "string" }
      }
    }
  },
  "minProperties": 3,
  "maxProperties": 5
}

unevaluatedProperties

A JSON Schema is a system of constraints. Everything is allowed by default, unless you explicitly disallow it.

Using unevaluatedProperties, you can specify constraints that apply to every property not already evaluated by properties, patternProperties or additionalProperties, including where those keywords appear inside subschemas.

Of course, if you simply want to disallow unknown properties, set unevaluatedProperties to false.

{
  "type": "object",
  "properties": {
    "name": {
      "type": "string"
    }
  },
  "unevaluatedProperties": false
}

Validating binary data

contentMediaType

Sometimes the data your API works with cannot be expressed by JSON. However, to be consistent with the rest of the API, it may be encoded as a string and wrapped inside an outer JSON structure.

You can use contentMediaType to specify that a string value should be interpreted as a different type of data.

For example, you may have a Content Management System that returns a post's content as HTML that can be rendered directly on the client's website.

The JSON schema might look like this:

{
  "type": "object",
  "properties": {
    "post_slug": {
      "type": "string"
    },
    "post_content": {
      "type": "string",
      "contentMediaType": "text/html"
    }
  }
}

contentEncoding

With the embedded HTML example, the HTML text could be represented as a string using the same encoding as the enclosing JSON document, which would typically be UTF-8.

For other MIME types, e.g. images, the binary data must be converted to a string representation using a different encoding, which can be specified using contentEncoding.

The following schema indicates that a string contains a PNG image, encoded using Base64:

{
  "type": "string",
  "contentEncoding": "base64",
  "contentMediaType": "image/png"
}
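
These two keywords describe the content, and a validator will not necessarily decode it for you, so an application that wants the guarantee may need to check it itself. Here is a minimal sketch of such a check using only the Python standard library. The function looks_like_base64_png is a hypothetical name, and it only inspects the leading signature bytes rather than parsing the whole image.

```python
import base64
import binascii

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the fixed 8 bytes that open every PNG file


def looks_like_base64_png(text):
    """Decode per contentEncoding, then sniff the type per contentMediaType."""
    try:
        raw = base64.b64decode(text, validate=True)
    except (binascii.Error, ValueError):
        return False  # not valid Base64 at all
    return raw.startswith(PNG_SIGNATURE)


# Assumed sample data: the signature followed by placeholder bytes.
encoded = base64.b64encode(PNG_SIGNATURE + b"placeholder").decode("ascii")
print(looks_like_base64_png(encoded))        # True
print(looks_like_base64_png("not base64!"))  # False
```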

contentSchema

There are some scenarios where a data structure can be expressed as a JSON data structure, but is nonetheless transmitted as an encoded string.

An example is JSON Web Tokens, which are used to store authentication and authorization information about a user. Different JWTs can make different claims about the user, so we can require certain claims to be present using contentSchema.

This example describes a JWT that requires the issuer (iss) and expiration time (exp) fields in its claim set.

{
  "type": "string",
  "contentMediaType": "application/jwt",
  "contentSchema": {
    "type": "array",
    "minItems": 2,
    "prefixItems": [
      {
        "const": {
          "typ": "JWT",
          "alg": "HS256"
        }
      },
      {
        "type": "object",
        "required": ["iss", "exp"],
        "properties": {
          "iss": { "type": "string" },
          "exp": { "type": "integer" }
        }
      }
    ]
  }
}

Conclusion

We've seen how JSON Schema can be used to specify and communicate validation rules for JSON data, including text, numbers, arrays, objects and even binary data.

Future articles in this series will cover how you can use JSON Schema to include human-readable metadata for documentation purposes, specify dynamic data structures that vary based on certain conditions, and organize schemas into a library of reusable elements.