An irreverent yet positively innovative approach regarding data typing
I love programming. I discovered computers and wrote my first programs when I was a kid, and I've never stopped since.
For some time now, I've been working on a Web App framework, and this led me to think about a lot of things, amongst which data typing.
It took me quite a long time to realize that the conventional approach of using variable types in most programming languages leads developers to create programs that require tremendous efforts to deal with validation, adaptation, storage and rendering of the manipulated values.
The reason is that traditional data typing in programs is very permissive, tends to prioritize technical needs over human reasoning, often lacks contextual validation, and is rarely explicit in terms of human representation.
This article presents a solution that aims to tackle these limitations by suggesting a more complete, flexible and unambiguous way to define and to handle data.
TL;DR
If, when declaring... | You intend to deal with... |
---|---|
let country:string | "a two-letter country code that complies with ISO 3166" |
float amount | "an amount of money with a precision of 4 decimal digits" |
string email | "an email address that complies with IETF RFC 5322 & RFC 6854" |
string password | "the blowfish encrypted version of a password complying with NIST recommendations" |
Then you're using ambiguous typing and you probably have to deal with validations and conversions manually.
This article presents the advantages of using an explicit typing, and suggests that a simple syntax inspired by the Media Type notation could be used as a supplement or even as a replacement to traditional data types.
type [ ["/" subtype[.variation][":" length]] ]["{" min, max "}"]
Example: "A money amount with up to 12 integer digits and 4 decimal digits" (FASAB/GAAP compliant) :
amount/money:12.4
Did you say Variable ?
As a developer, I deal with variables and types all the time. It is so common that it has become difficult to explain what exactly a variable is.
Let's try anyway.
All programming languages rely on data types in order to define how each piece of data has to be stored in memory, and to guarantee that operations performed on the data are consistent.
Programs use names to identify the memory locations that are assigned for storing the data. These names are called "variables".
In most cases, the value stored in a variable can change during the execution of the program (which is why it is called a "variable").
A variable can be used in computations, assigned to other variables, or passed as an argument to functions or methods.
Most programming languages use types as a classification of variables based on the kind of data they represent.
The most common types, used in nearly all programming languages, are: bool(ean), int(eger), float(ing-point number), and string.
The whole picture
Whatever the purpose of the processing, when a program handles variables there are 4 key considerations from the developer's perspective:
- Data Model (Validation): Ensuring that the submitted value complies with the Model definition and the application logic.
- Data Storage (Allocation): Determining the amount of memory required to store the variable (i.e. how many bytes).
- Data Exchange (Adaptation): Manipulating the data to convert it from one format to another (example: data exchange through API).
- Data Representation (Formatting): Specifying how the data should be displayed for human reading. (There are sometimes conventions for representing a piece of data: a phone number, an email address, a monetary amount, a percentage value, ... This aspect might also depend on the user's language & locale).
In most cases, programming languages deal with a single concept of "type" to manage all these considerations, which often turns out to be incomplete or even impossible. And some of (and sometimes all of) these tasks are left to the discretion of the developer on a case-by-case basis.
Data Model (validation)
Data modeling consists of telling the kind of data (human concept) that is stored into a variable (computer concept), and specifying if some constraints must be applied on a variable. The questions it addresses are "how to validate it" and "what operations can be performed on it".
In most situations, the type of a variable is used to control the operations that can be performed on it. It is also used to check that the variable that is passed as parameter to a function has the expected type, based on the function signature.
This is because the intended use of a piece of data always has some implicit restrictions. As a consequence, some operators might be forbidden for variables of a certain type. For instance, the operator `/` is valid for numbers (means "perform a division") but doesn't make sense for strings.
Also, depending on the language, the same operator might have distinct meanings depending on the type of the operands. For instance, in JS or C++, using the operator `+` on numbers means "perform an arithmetic addition on these values" while using it on strings means "concatenate these strings".
Observations :
Most languages allow developers to define variables amongst the following models : character, number, string, binary value and array.
But these types can describe a lot of things:
- a string can be used for storing a name, a country, an address, ...
- a number can represent a quantity, an index, or even a date
- a binary value might be a document, an image, an archive, ...
The Model should not only tell "What it is" but also "What it is intended for". Doing so makes it possible to bind some constraints to a variable. And the advantage of having constraints bound to data types is that, when a submitted value does not comply with what is expected, it is easy to automatically give the user accurate feedback about the reason why their submission was rejected.
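To make the idea of constraints bound to a type more concrete, here is a minimal sketch in Python. The `RULES` table and the `validate` function are hypothetical names chosen for illustration, and the checks only cover the shape of the value (a real validator would also check membership in the ISO lists).

```python
import re

# Hypothetical rule set: each kind of data is bound to a constraint
# and to a human-readable description of what is expected.
RULES = {
    "country": (re.compile(r"[A-Z]{2}"), "two upper-case letters (ISO 3166-1 alpha-2)"),
    "language": (re.compile(r"[a-z]{2}"), "two lower-case letters (ISO 639-1)"),
}


def validate(kind: str, value: str) -> list[str]:
    """Return a list of error messages; an empty list means the value is accepted."""
    pattern, expectation = RULES[kind]
    if pattern.fullmatch(value) is None:
        return [f"{value!r} rejected: expected {expectation}"]
    return []
```

Because the expectation is stored next to the constraint, the rejection message can be produced without any per-field code.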
Data Storage (allocation)
Data storage relates to the amount of memory the computer must reserve in order to store the variable. The question it addresses is "what amount of memory must be allocated".
Types are almost always used to tell the computer how much memory (what amount of bytes) should be allocated for storing the value of a variable.
In the early days of computing, when every byte was valuable, resource allocation efficiency was a crucial consideration. Rigor in choosing the most appropriate type in terms of memory was decisive in order to make use of every last available bit. But it also resulted in notations that are not very intuitive and sometimes ambiguous. For instance, in C and C++, the `int` type can be stored on 2 or 4 bytes depending on the environment.
Here are the elementary computer memory units:
- bit: 1 binary digit
- byte: a group of 8 bits (the smallest addressable unit of memory in most architectures)
- string: a series of bytes of an arbitrary length
- integer: a block of bytes used to represent an integer value, whose length depends on the computer architecture
The way types are defined depends mostly on machine architecture (CPU registers capacity: 16 bits, 32 bits, 64 bits, ...) and on common values expected to be manipulated for different levels of precision (min, max) according to Byte multiples (int32, unsigned int, long long int, ...)
Here are a few examples of variable declaration (as it can be found in common programming languages) that are ambiguous for humans in terms of memory allocation:
Declaration statement | Comments |
---|---|
const N:u8 (Rust) | an unsigned integer stored on 8 bits (i.e. a natural number between 0 and 255) |
Dim n As Byte (VB) | (same as above) |
double n (C++) | a real number with "double" precision (unclear about min and max values) |
let n:number (TS) | any integer or floating point number (also unclear about min and max values) |
let n:i32 (Rust) | not easy for humans to tell the limits (something like "a range between -2 billion and +2 billion") |
unsigned long long n (C++) | (who the hell knows the result of 2^64-1 ?) |
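For reference, the limits hidden behind these declarations can be computed rather than memorized. The sketch below assumes two's complement signed integers and a 64-bit `unsigned long long` (the C++ standard only guarantees at least 64 bits); `integer_range` is a hypothetical helper name.

```python
def integer_range(bits: int, signed: bool) -> tuple[int, int]:
    """Return the (min, max) values representable on the given number of bits."""
    if signed:
        # Two's complement: one more negative value than positive values.
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


for name, bits, signed in [("u8", 8, False), ("i32", 32, True), ("unsigned long long", 64, False)]:
    low, high = integer_range(bits, signed)
    print(f"{name}: {low} .. {high}")
```

The last line of output answers the question in the table: 2^64-1 is 18446744073709551615, which is hardly something a reader can infer from the declaration alone.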
Observations :
- Telling how a variable must be stored by using an integer number of bytes might have been necessary in the past, but it is not the case anymore: typing should not be confused with data allocation.
- Any application uses a hardware-software stack which has implications in terms of limits (maximum number of digits, precision of the floating number). However, this stack is likely to evolve over time, and so are these limitations. Therefore, the logic of an application should be decoupled from the stack. That way, it would be easy for the developer to ensure that the storage/memory allocation remains consistent with the App logic.
- Typing should not only focus on how to store the data but also on how to translate human concepts to something that a computer can (easily) handle. If the resulting source code remains readable by humans, it greatly eases the developers' work when adapting, maintaining or improving the code.
Data Exchange (adaptation)
Data Exchange relates to the transfer of data and variables from one environment to another. The question it addresses is "How to convert a piece of data received from the outside into a local variable and convert it back for sending a response".
Each environment might store variables using its own types and representations.
When the application receives data from the outside, nothing guarantees that it can be converted from the input format to its native format (programming language) in a consistent way and without exceeding the capacity of the underlying layers (available/allocated memory).
The same goes for sending data: we have no clue about how to format the data so that it is correctly interpreted, apart from the notation suggested by official norms like ISO. But again, nothing guarantees that the system we're communicating with follows these standards.
Observations :
- A variable should be seamlessly convertible to another format according to the target specifics (ex. JS, PHP, SQL)
- A type should allow distinct programs or computers to handle the data in a consistent way.
- An application should communicate about the parameters it expects along with the inherent limitations (expected format, maximum size, etc.)
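As an illustration of adaptation, the sketch below converts a date received as an ISO 8601 string into a native value and back into two target formats. The function names `adapt_in` and `adapt_out` and the target names `"json"` and `"timestamp"` are assumptions made for this example, not part of any existing API.

```python
from datetime import date, datetime, timezone


def adapt_in(raw: str) -> date:
    """Convert an ISO 8601 date string received from the outside into a native date.

    Raises ValueError when the input cannot be converted consistently.
    """
    return date.fromisoformat(raw)


def adapt_out(value: date, target: str):
    """Convert a native date to the representation expected by the target."""
    if target == "json":
        # ISO 8601 string, e.g. "2000-01-01"
        return value.isoformat()
    if target == "timestamp":
        # Number of seconds since the Unix epoch, at midnight UTC
        midnight = datetime(value.year, value.month, value.day, tzinfo=timezone.utc)
        return int(midnight.timestamp())
    raise ValueError(f"unsupported target: {target!r}")
```

Keeping one inbound and one outbound function per kind of data is what lets the rest of the application work with a single native representation.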
Data Representation (formatting)
Data representation relates to the encoding of a value in a way that makes it understandable by humans and valid within a given context. The question it addresses is "How to display it within a given context?".
For some types of variables, outputting data is not trivial and can involve a lot of possible variations (for instance when preferences or user settings are involved). In such cases, it is necessary to be explicit about how the data must be displayed.
An additional difficulty is that some representations might vary depending on the "locale" settings (e.g. character for decimal separator) or international variations of notations (e.g. number of digits of a phone number or digit grouping convention).
Dates are a good example: various strategies can be found depending on the tools and environments (timestamps, formatted strings, ISO notation). And this matter is still the source of many issues in modern software.
Observations :
- Typing should allow both computers and humans to know how to present the value of that variable.
- For types that may be represented in many different ways (ex. dates / datetimes), an additional piece of information is necessary for disambiguation.
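The locale dependency can be illustrated with a small formatting helper. This is only a sketch: `format_amount` is a hypothetical function, and the separators are passed explicitly instead of being read from the system locale so that the example stays self-contained.

```python
from decimal import Decimal


def format_amount(value: Decimal, decimals: int, decimal_sep: str, group_sep: str) -> str:
    """Render a number with the given number of decimals and separators."""
    # Format with the "," grouping and "." decimal conventions first...
    raw = f"{value:,.{decimals}f}"
    # ...then swap both separators in a single pass.
    swap = str.maketrans({",": group_sep, ".": decimal_sep})
    return raw.translate(swap)
```

For example, `format_amount(Decimal("1234567.891"), 2, ",", ".")` gives `1.234.567,89`, while the same value with `"."` and `","` gives `1,234,567.89`: the stored value is identical, only its representation changes.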
Explicit typing
Unlike computers, humans know that a real is a number and that a string is a piece of text. On the other hand, unlike humans, computers need to know beforehand how much memory must be allocated in order to store a real number or a string.
The problem is that the human definitions of "text" and "number" are quite vague (respectively "a coherent set of signs that transmits some kind of informative message" and "symbols that represent an amount") and give no clue about storage or rendering.
To enable a computer to handle a value accurately, without requiring the developer to program in every detail how to store and manipulate it based on its intended use, the associated variable should have an explicit type.
In other words, explicit typing describes a variable in a way that covers the 4 aspects presented above: modeling, allocation, adaptation and formatting.
Proposition
The suggestion made in this article is that explicit typing can be achieved by using a single descriptor.
Below is a proposal, in the form of a proof of concept, that uses a notation close to the Media Type syntax (MIME) for building such descriptors.
Using the MIME notation, in conjunction with concepts for which an unambiguous definition or an international standard exists, provides an explicit typing. For instance, we can assume that a "language" is a value that holds 2 or 3 lowercase letters and whose possible values are provided by the ISO standards ISO-639-1 and ISO-639-2.
The term "usage" is used to distinguish explicit types from primitive types.
Here is the full syntax of a "usage" descriptor:
type [ ["/" subtype[.variation][":" length]] ]["{" min, max "}"]
- `type` is the name of the non-primitive type;
- `subtype` is the mandatory categorization of the non-primitive type (example: number/integer). It always has a default value so it can be omitted (ex. text defaults to text/plain);
- `variation` is optional additional meta information that identifies a specific standard or norm (example: hash/md.4);
- `length` is either a number of digits (or glyphs), or a length under the form `precision.scale`:
  - for numeric values, the length tells the number of digits (required bytes can be found with `ceil(log2(x) / 8)`);
  - for strings, the length indicates the number of glyphs (which might differ from the number of bytes);
  - for usages that imply floating point numbers, the length can be composed of several parts separated with dots (e.g. `precision[.scale]`).
- `min` and `max` are optional parameters intended for numbers and can be used for defining custom limits. By default, there are always limits, set based on the min and max implied by the usage. If set, these values must be consistent with the length (for instance number/natural:2 cannot have a min lower than 0 nor a max higher than 99).
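The syntax above is regular enough to be parsed with a single regular expression. The following Python sketch is one possible reading of the grammar, not the reference implementation: the `Usage` dataclass, the `parse_usage` function and the exact character classes allowed in each part are assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import Optional

# One possible reading of: type [ ["/" subtype[.variation][":" length]] ]["{" min, max "}"]
USAGE_PATTERN = re.compile(
    r"""^
    (?P<type>[a-z]+)
    (?:/
        (?P<subtype>[a-z0-9-]+)
        (?:\.(?P<variation>[A-Za-z0-9]+))?
        (?::(?P<length>\d+(?:\.\d+)?))?
    )?
    (?:\{(?P<min>-?\d+(?:\.\d+)?),\s*(?P<max>-?\d+(?:\.\d+)?)\})?
    $""",
    re.VERBOSE,
)


@dataclass(frozen=True)
class Usage:
    type: str
    subtype: Optional[str] = None
    variation: Optional[str] = None
    precision: Optional[int] = None  # first part of the length
    scale: Optional[int] = None      # optional second part of the length
    min: Optional[float] = None
    max: Optional[float] = None


def parse_usage(descriptor: str) -> Usage:
    """Split a usage descriptor into its parts, or raise ValueError."""
    match = USAGE_PATTERN.match(descriptor)
    if match is None:
        raise ValueError(f"invalid usage descriptor: {descriptor!r}")
    parts = match.groupdict()
    precision = scale = None
    if parts["length"] is not None:
        head, _, tail = parts["length"].partition(".")
        precision = int(head)
        scale = int(tail) if tail else None
    return Usage(
        type=parts["type"],
        subtype=parts["subtype"],
        variation=parts["variation"],
        precision=precision,
        scale=scale,
        min=float(parts["min"]) if parts["min"] is not None else None,
        max=float(parts["max"]) if parts["max"] is not None else None,
    )
```

With this sketch, `parse_usage("amount/money:12.4")` yields the type `amount`, the subtype `money`, a precision of 12 and a scale of 4, and `parse_usage("number/integer{-10,10}")` yields the boundaries -10 and 10. Default subtypes and default boundaries are deliberately left out.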
The primitive types can coexist, but become special cases of the "usage".
The proposition being that:
- A usage always relates to a generic type (by example, all variations of "number/real" relate to the type "float");
- All generic types are associated with one default usage;
- The default boundaries depend on the local environment and off-limit values should be rejected by it.
Examples
Below is presented a non-exhaustive list of descriptors using such notation.
Usage | Primitive Type | Description | Examples |
---|---|---|---|
TEXTS | | | |
text/plain | string | A regular string composed of unicode chars (UTF-8 4-byte). Alias: text, text/plain.short, text/plain:255. Variations: text/plain.small (65KB), text/plain.medium (16MB), text/plain.long (4GB) | Hello world. |
text/xml | | An XML formatted string (equivalent to MIME "application/xml") | `<Person> <Name>Joe</Name> </Person>` |
text/html | | An HTML formatted string. | `<p>HTML rocks!</p>` |
text/markdown | | A piece of text using MarkDown notation. | `## Title` |
text/wiki (markup/wikitext) | | A piece of text using Wiki markup. | `''italic''` |
NUMBERS | | | |
amount/money | | A financial amount. Alias: amount/money:9.2, amount/money:2. Variations: amount/money:9.4 (FASAB/GAAP) | 886.90 EUR |
amount/percent | | A percentage. Alias: amount/percent:2 | 85.13% |
amount/rate | | A rate amount (with no units). Alias: amount/rate:4. Variations: amount/rate:6 | 1.0789 |
number/boolean | bool | Alias: number/boolean:1 | |
number/natural | | A positive integer (ISO 80000-2: [0;99]). Alias: number/natural:9 (32 bits) or number/natural:19 (64 bits). Example: number/natural:2 (2 digits positive integer) | 0, 25, 999 |
number/integer | int | Alias: number, number/integer.decimal. Variations: number/integer.hexadecimal, number/integer.octal. Example: number/integer:1 (a single digit integer: [-9;+9]) | -32767, 1, 65535 |
number/real | float | Alias: number/real:10.2. Example: number/real:5.2 (a float number with 2 decimal digits and max 3 digits for the integer part) | 123.45 |
URI | | | |
uri/url.mailto | | A "mailto" URI (RFC6068) for writing a message to an email address. | mailto:cedric@equal.run |
uri/url.payto | | A "payto" URI (RFC8905) for payments to a bank account. Alias: url/payto.iban. Variations: url.payto.iban:16 | payto://iban/DE75512108001245126199?amount=EUR:200.0 |
uri/url.tel | | A "tel" URI (RFC3966) for phone numbers. | tel:+324567890 |
uri/url.http | | A URL using the HTTP or HTTPS scheme. | https://equal.run |
uri/url.ftp | | A URL using the FTP or FTPS scheme. | ftp://ftp.example.com |
uri/urn.isbn | | An International Standard Book Number for identifying a book. Alias: urn/isbn:13 or urn/isbn.13. Variations: uri/urn.isbn:10 (ISO 2108) | |
uri/urn.isan | | International Standard Audiovisual Number for identifying an audiovisual work | ISAN 0000-0000-3A8D-0000-Z-0000-0000-6 |
uri/urn.iban | | International bank account number (ISO-13616). Variations: urn/iban.BE (or urn/iban.16 or urn/iban:16) | |
uri/urn.ean | | International Article Number, EAN (ISO-15420). Alias: urn/ean.13 or urn/ean:13 | |
DATES | | | |
date/plain | | An ISO 8601 date with date part only (@00:00:00UTC). Alias: date | 1955-11-05 |
date/time | | A full ISO 8601 date holding a UTC time. Alias: datetime | 1955-11-13T05:04:00Z |
date/year | | Alias: date/year:4 (integer 0-9999). Variations: date/year:2 | 1999 |
date/month | | An integer representing a month within the year (ISO-8601: 1 to 12, 1 being January) | 11 |
date/weekday | | An integer representing a day within the week (ISO-8601: 1 to 7, 1 being Monday). Alias: date/weekday.mon. Variations: date/weekday.sun (0 to 6, 0 is Sunday) | 3 |
date/monthday | | An integer representing a day within the month (1-31) (ISO-8601) | 28 |
date/yearweek | | An integer representing the number of a week within a year (1-53) | 51 |
date/yearday | | An integer representing a day within the year (1-366) (ISO-8601) | 130 |
time/plain | | An integer representing the time of the day (number of seconds within range [0-86400], displayed as h:m:s) | 24200 |
MISC | | | |
email | | A string describing an email address as defined by IETF RFC 5322 & RFC 6854 | cedric@equal.run |
language | | A language represented with 2 lower case letters (ISO 639-1). Alias: language/iso-639:2. Variations: language/iso-639:3 (ISO 639-2) | en, de, zh |
country | | A country represented with 2 upper case letters (ISO 3166-1 alpha-2). Alias: country/iso-3166:2. Variations: country/iso-3166:3 (ISO 3166-1 alpha-3) | FR, GB |
image | | A binary value representing an image. Variations: image/jpeg, image/gif, image/png, image/tiff, image/webp | .PNG........IHDR..............wS.....IDAT..c.................IEND.B. |
password | | A password complying with NIST 800-63 (minimum length of 8 chars, unicode chars). Alias: password/nist, password/enisa | |
coordinate/latitude | | Alias: coordinate/latitude.decimal. Variations: coordinate/latitude.dms (ex. 13/30/25.760/S) | 50.81321991 |
coordinate/longitude | | Alias: coordinate/longitude.decimal. Variations: coordinate/longitude.dms (ex. 91/1/12.469/W) | 4.42880891 |
currency | | Alias: currency/iso-4217, currency/iso-4217.alpha. Variations: currency/iso-4217.numeric | USD, CHF, EUR |
hash/md | | An MD5 message digest as a hexadecimal string (/[a-f0-9]{32}/). Alias: hash/md.5. Variations: hash/md.4 (/[a-f0-9]{32}/), hash/md.6 (/[a-f0-9]{64}/) | 8747e564eb53cb2f1dcb9aae0779c2aa |
hash/sha | | Alias: hash/sha.256 (/[a-f0-9]{64}/). Variations: hash/sha.1 (/[a-f0-9]{40}/), hash/sha.512 (/[a-f0-9]{128}/) | dcb8649a334a05fd641c6f15f42426fe284bf1a672e412d3fef5583fa0907856 |
color/css | | A CSS color keyword (https://www.w3.org/wiki/CSS/Properties/color/keywords) | |
color/rgb | | A color descriptor as a string with 3 integer values separated by commas. | 255,255,255 |
color/rgba | | A color descriptor as a string with 4 numeric values separated by commas. | 255,128,255,0.8 |
color/hexadecimal | | A color descriptor as a string starting with '#', followed by 3 (#rgb), 6 (#rrggbb) or 8 (#rrggbbaa) hexadecimal characters. | #ffffff |
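Several of the usages listed above come with a pattern that can be checked mechanically. As a rough sketch, the patterns given in the table for hashes and hexadecimal colors can be gathered in a lookup table; the `PATTERNS` dictionary and the `matches_usage` function are hypothetical names, and only a handful of usages are covered.

```python
import re

# Patterns taken from the table above; only a few usages are covered.
PATTERNS = {
    "hash/md.5": re.compile(r"[a-f0-9]{32}"),
    "hash/sha.1": re.compile(r"[a-f0-9]{40}"),
    "hash/sha.256": re.compile(r"[a-f0-9]{64}"),
    "color/hexadecimal": re.compile(r"#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})"),
}


def matches_usage(usage: str, value: str) -> bool:
    """Tell whether the whole value matches the pattern bound to the usage."""
    return PATTERNS[usage].fullmatch(value) is not None
```

The example values from the table (`8747e564eb53cb2f1dcb9aae0779c2aa` for hash/md.5, `#ffffff` for color/hexadecimal) pass these checks, while a 4-character color such as `#ffff` does not.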
Numbers boundaries
When it comes to numbers, it is common for a variable not to relate to any standard or external convention, but instead to have boundaries (i.e. minimum and maximum possible values).
For that, we can use a notation similar to the one used in regular expressions for {n,m} quantifiers (https://docs.oracle.com/javase/tutorial/essential/regex/quant.html)
Here are a few examples:
general | with boundaries | description |
---|---|---|
number/integer | number/integer{-10,10} | an integer within the range [-10;10] |
amount/money:6.4 | amount/money:6.4{0,300000} | a money amount between $0.0 and $300,000 examples: 299999.9999 |
This technique can be advantageously extended to strings to define specific lengths:
general | with boundaries | description |
---|---|---|
text/plain | text/plain{2,30} | a string having minimum 2 glyphs and maximum 30 glyphs |
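Custom boundaries have to stay consistent with the length of the usage (number/natural:2 cannot go above 99, number/integer:1 is limited to [-9;+9]). A minimal sketch of that check, with hypothetical helper names `default_bounds` and `check_bounds`, could look like this:

```python
def default_bounds(digits: int, signed: bool) -> tuple[int, int]:
    """Return the (min, max) implied by a number of decimal digits."""
    high = 10 ** digits - 1
    return (-high if signed else 0), high


def check_bounds(digits: int, signed: bool, low: int, high: int) -> tuple[int, int]:
    """Accept custom boundaries only if they fit within the implied ones."""
    implied_low, implied_high = default_bounds(digits, signed)
    if low > high:
        raise ValueError(f"min ({low}) is greater than max ({high})")
    if low < implied_low or high > implied_high:
        raise ValueError(
            f"boundaries {{{low},{high}}} exceed the implied range "
            f"[{implied_low};{implied_high}]"
        )
    return low, high
```

Rejecting inconsistent descriptors when they are declared, rather than when a value is submitted, keeps this kind of error out of the runtime path.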
Handling Arrays
Array is a kind of super type. It has a length (that can be dynamic), and holds elements of a specific type.
Various notations for arrays can be found amongst popular programming languages. The most usual notations are :
- `Array<Type>`
- `Type[]` or `Type()`
- `[Type; size]` or `Type[size]`
In order to respect the assumptions made earlier, and make it easily usable in controllers (validation only), we should be able to determine the maximum length of the array, along with the kind of elements it holds.
For arrays as well, Usage can helpfully complete the information about the handled data: we know the variable holds a series of items and we know how to store, validate, display and convert those items.
// Array holding 3 integers having max 2 digits
// (#memo - type is a special case of a broader usage)
[
'type' => 'array',
'usage' => 'number[3]/integer:2'
];
// Array holding 10 email addresses
[
'type' => 'array',
'usage' => 'email[10]'
];
There is a special case of non-typed arrays that may be used in order to accept arbitrary values or maps.
// special case non-typed array
// (#memo - applies only to input params not meant to be stored)
[
'type' => 'array',
'usage' => 'array'
];
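The array notation used in these descriptors (a size between square brackets, right after the type) can be split from the usage of the items with a small helper. This is a sketch under the assumption that the bracketed number is the maximum number of items; `split_array_usage` is a hypothetical name.

```python
import re

# type "[" size "]" followed by the remainder of the item usage
ARRAY_USAGE = re.compile(r"^(?P<type>[a-z]+)\[(?P<size>\d+)\](?P<rest>.*)$")


def split_array_usage(descriptor: str) -> tuple[int, str]:
    """Return (number of items, usage of each item) for an array descriptor."""
    match = ARRAY_USAGE.match(descriptor)
    if match is None:
        raise ValueError(f"not an array usage: {descriptor!r}")
    return int(match["size"]), match["type"] + match["rest"]
```

For instance, `number[3]/integer:2` splits into a size of 3 and the item usage `number/integer:2`, which can then be handled like any other usage.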
Additional benefits
This approach can be used for identifying (or guessing) the kind of data that a field stores, which can be useful in various situations :
Sensitive data
When using production data within a DEV or Staging environment, this approach makes it possible to identify sensitive or privacy-related data.
Having a Usage telling the kind of value we're dealing with (ex. name, birthdate, gender, address, email, iban, password) allows developers to identify which values must be obfuscated (withdrawn or randomized) before importing data to the test environment, and/or with what sample data they can be replaced.
Also, for record-based backups, it can be used to tell which columns must be encrypted.
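As a rough illustration, obfuscation driven by usages can be reduced to a lookup. The set of sensitive usages, the `obfuscate` function and the `"***"` placeholder below are all assumptions made for this sketch.

```python
# Hypothetical list of usages considered sensitive.
SENSITIVE_USAGES = {"email", "password", "uri/urn.iban"}


def obfuscate(record: dict, usages: dict) -> dict:
    """Return a copy of the record where values of sensitive usages are masked.

    `usages` maps each field name to its usage descriptor.
    """
    masked = {}
    for field, value in record.items():
        if usages.get(field) in SENSITIVE_USAGES:
            masked[field] = "***"
        else:
            masked[field] = value
    return masked
```

The function never needs to know the field names in advance: the usage attached to each field is enough to decide what to mask.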
Sample data (DB seeding)
In the same way, coupling properties with Usages makes it easy to generate sample sets for seeding a database with dummy data in a consistent way.
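A seeding helper can follow the same logic and derive a plausible value from the usage alone. The sketch below handles only three usages, picks countries from the two examples given in the table above, and uses a seeded generator so that runs are repeatable; `sample_value` is a hypothetical name.

```python
import random


def sample_value(usage: str, rng: random.Random):
    """Generate a dummy value that complies with the given usage."""
    if usage == "number/natural:2":
        return rng.randint(0, 99)
    if usage == "date/month":
        return rng.randint(1, 12)
    if usage == "country":
        # Tiny sample list, for illustration only.
        return rng.choice(["FR", "GB"])
    raise ValueError(f"no sample generator for usage {usage!r}")


rng = random.Random(42)
row = {name: sample_value(usage, rng) for name, usage in
       [("quantity", "number/natural:2"), ("month", "date/month"), ("country", "country")]}
```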
Next
If you're curious about this or would like to see an example of implementation, you can have a look at the eQual framework.
Here are the direct links to the involved files in the repository :
- https://github.com/equalframework/equal/blob/master/lib/equal/data/adapt/DataAdapterInterface.class.php
- https://github.com/equalframework/equal/blob/master/lib/equal/orm/usages/Usage.class.php
- https://github.com/equalframework/equal/tree/master/lib/equal/orm/usages
Valuable sources consulted over time
https://en.wikipedia.org/wiki/Data_type
https://en.wikipedia.org/wiki/Media_type
https://www3.ntu.edu.sg/home/ehchua/programming/java/datarepresentation.html
https://www1.icsi.berkeley.edu/~sather/Documentation/EclecticTutorial/node5.html
https://www.lehigh.edu/~ineng2/notes/datatypes
https://doc.rust-lang.org/nomicon/repr-rust.html
https://docs.oracle.com/javase/tutorial/essential/regex/