When developing software, we sometimes can't get around handling data in some way.
No matter whether you work on the frontend (where the UI you are building is basically a function of the data) or on the backend (backends tend to be rather data-heavy).
Typical tasks when processing data include (but are not limited to):
- Filtering datasets (like a Facebook search that should only give me matches for my search term, or redacting information for privacy reasons - like bank account numbers, email addresses or even passwords)
- Mapping binary data into a human readable format or vice versa
...well, okay. Facebook might not be the best example for demonstrating applied data privacy...
TL;DR: You can skip the theory if you're not interested or already know the basics.
Theory of Data and Transformations
Data
Before working on data, we need to have an idea of what the data looks like. In general, the structure can be defined as a collection of items, for example:
- Rows in a database table - where the table is the collection and rows are the items
- Elements in a set - with elements as items, the set as the collection
- Fields in an array
- Elements in a list
- Entries in a dictionary
- Nodes in a tree
Any collection could be embedded as an item into another collection. Let's formalize this statement:
-- an Item is defined as either:
-- 1. a Primitive value
-- 2. a Collection of Items
data Item = Primitive | Collection Item
Note: I also tried to find a representation for this in TypeScript notation, but TypeScript's type system doesn't seem to be powerful enough to describe this kind of recursive type.
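To make this a bit more tangible, here is a made-up JavaScript value that fits the definition - primitives and collections of items, nested arbitrarily:
// an illustrative value matching the Item definition
const item = [
  42,                                   // primitive
  "hello",                              // primitive
  [1, 2, ["deeper", [true]]],           // a collection of items is itself an item
  { tags: ["even", "objects", "work"] } // dictionaries are collections too
];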
Transformable Data
Since our definition of data consists of just two kinds of types (items and collections), we can only have transformations on the collection level or on the item level. Filtering (deciding for each element whether or not to include it in the result) is a transformation on the collection level.
A collection that can be filtered is called Filterable.
Mapping is the process of taking each item from its container (the collection), applying a transform to the item, and putting it back into a container of the same kind as the one it came from. Mapping is a transformation on the item level.
A collection (or container) whose contents you can map over is called a Functor.
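In JavaScript terms, plain arrays are both: filter works on the collection level, map on the item level.
const numbers = [1, 2, 3, 4];

// collection level: decide for each element whether it stays (Filterable)
numbers.filter(n => n % 2 === 0); // [2, 4]

// item level: transform every element, keep the same kind of container (Functor)
numbers.map(n => n * 10); // [10, 20, 30, 40]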
In Practice
The Common Approach
JavaScript comes with native array methods for mapping and filtering array data. Most notably:
Array.prototype.map(mapFn)
Array.prototype.filter(predicateFn)
Array.prototype.reduce(reducerFn [, init])
Now let's make up a simple example to see how each of them works.
We have a collection of bibliographical data. Each item represents a book or article that has a unique id, a title, was written by one or more authors and has a publication date (in the form of a Unix timestamp in ms since epoch).
type BookOrArticle = {
id: string,
title: string,
authors: Array<string>,
publicationDate: number,
};
We have an array of (let's say) 10,000 books and articles and we need to get all items that contain the word "guide" in the title, were written by J. Doe and were published in 2007. The result items should be in the form:
type Result = {
title: string,
author: string,
date: string
};
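To make the two shapes concrete, a single (made-up) input item and the result we would expect for it could look like this:
// made-up sample data, just to illustrate input and output shapes
const sampleItem = {
  id: "42",
  title: "A Guide to Data Transformations",
  authors: ["J. Doe", "A. Nyone"],
  publicationDate: 1193875200000 // some day in 2007, in ms since epoch
};

const expectedResult = {
  title: "A Guide to Data Transformations",
  author: "J. Doe, A. Nyone",
  date: new Date(1193875200000).toDateString() // e.g. "Thu Nov 01 2007", depending on time zone
};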
1. Naive Approach
Let's take a first approach:
const booksAndArticlesArray = [
/* our array of books and articles */
];
function getResults(inputArray) {
return inputArray
.map(item => ({
...item,
year: new Date(item.publicationDate).getFullYear()
}))
.filter(({ year }) => year === 2007)
.filter(({ title }) => /guide/i.test(title))
.map(({ title, authors, publicationDate }) => ({
title,
author: authors.join(", "),
date: new Date(publicationDate).toDateString()
}))
.filter(({author}) => author.includes("J. Doe"))
}
It might not be immediately visible, but each filter or map step creates an intermediate array containing its result, which is then used as the input for the next filter/map step.
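Here is the same function with every intermediate array made explicit - each const below is a brand-new array that the next step has to iterate over again:
// the naive chain with its intermediate arrays spelled out
function getResultsExplicit(inputArray) {
  const withYear = inputArray.map(item => ({
    ...item,
    year: new Date(item.publicationDate).getFullYear()
  }));
  const from2007 = withYear.filter(({ year }) => year === 2007);
  const guides = from2007.filter(({ title }) => /guide/i.test(title));
  const shaped = guides.map(({ title, authors, publicationDate }) => ({
    title,
    author: authors.join(", "),
    date: new Date(publicationDate).toDateString()
  }));
  return shaped.filter(({ author }) => author.includes("J. Doe"));
}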
2. Optimized Approach
To reduce the number of intermediate data structures created, one could try and reduce the number of transformation steps by collapsing consecutive map and filter operations:
// collapse consecutive filter operations
function getResults(inputArray) {
return inputArray
.map(item => ({
...item,
year: new Date(item.publicationDate).getFullYear()
}))
.filter(
({ year, title }) => year === 2007 && /guide/i.test(title)
)
.map(({ title, authors, publicationDate }) => ({
title,
author: authors.join(", "),
date: new Date(publicationDate).toDateString()
}))
.filter(({ author }) => author.includes("J. Doe"));
}
We could further optimize this by moving the mapping of the authors array field to the author string from the second map operation into the first map operation. This would allow us to collapse the final filter operation with the other filter:
// 1. move mapping authors array to author string into first map operation
// 2. collapse final filter operation with previous filter operation
function getResults(inputArray) {
return inputArray
.map(({ publicationDate, authors, ...item }) => ({
...item,
author: authors.join(", "),
year: new Date(publicationDate).getFullYear()
}))
.filter(
({ year, title, author }) =>
year === 2007 &&
/guide/i.test(title) &&
author.includes("J. Doe")
)
.map(({ title, author, publicationDate }) => ({
title,
author,
date: new Date(publicationDate).toDateString()
}));
}
We reduced the number of intermediate data structures created from 5 to 3, but at the cost of readability. Further reduction is possible by moving the author and year transforms from the first map operation into the filter step and removing the first map operation (this comes at the cost of reducing readability, too).
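For completeness, a sketch of that further-collapsed version (note how the year and author values now have to be computed inside the predicate):
// sketch: one filter, one map - shorter pipeline, harder to read
function getResults(inputArray) {
  return inputArray
    .filter(
      ({ title, authors, publicationDate }) =>
        new Date(publicationDate).getFullYear() === 2007 &&
        /guide/i.test(title) &&
        authors.join(", ").includes("J. Doe")
    )
    .map(({ title, authors, publicationDate }) => ({
      title,
      author: authors.join(", "),
      date: new Date(publicationDate).toDateString()
    }));
}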
Mapping and Filtering via Reduce
What if we could express filter and map in a way that allows us to compose them directly?
Reminder: Compose or function composition means that you call one function with the result of another function.
I.e.: Given the functions f: X -> Y and g: Y -> Z, we call g.f(x) := g(f(x)) the composition of f with g, where g.f: X -> Z (that means g.f takes values of X as arguments and returns values of Z as results).
Example: Composing f(x) => x + 1 and g(x) => x * x to h(x) = g(f(x)) results in h(x) => (x + 1) * (x + 1). We get this result by substituting x in g(x) with the function body of f(x).
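The same worked example in JavaScript:
const f = x => x + 1;
const g = x => x * x;
const h = x => g(f(x)); // the composition of f with g

h(2); // (2 + 1) * (2 + 1) = 9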
So it looks like function composition is the right tool to express all our map and filter transformation steps at once. As reduce is one of the most versatile operations on arrays (or any other iterable structure), let's try to express map and filter as reducers.
Reminder: Reducers are functions that can be used as a callback in Array.prototype.reduce(). The first argument of a reducer is the accumulator (the return value of the last reducer call). The second argument is the element we are looking at (just like the first argument in filter and map). Thus a reducer has the following signature: reducer(acc: A, elem: B) => A, or in Hindley-Milner notation: reducer :: A -> B -> A.
(The HM notation assumes that every function is curried. So a function that takes two arguments is expressed as a function A -> B -> C:
- that takes one argument (A) -> returns a function (B -> C)
- that takes one argument (B) -> returns the result (C))
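A small example of such a reducer, where the accumulator type (A = number) differs from the element type (B = string):
// totalLength :: number -> string -> number
const totalLength = (acc, str) => acc + str.length;

["foo", "ba", "r"].reduce(totalLength, 0); // 6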
Let's try to rebuild map and filter as reducers. For map we need a function that takes a mapping function as argument and returns a reducer function as result. For filter we need a function that takes a filter predicate and also returns a reducer function as result:
const map = mapFn => (acc, elem) => acc.concat(mapFn(elem));
const filter = predicateFn => (acc, elem) => {
if (predicateFn(elem)) {
/*
* append the element
* if predicate evaluates to a truthy result
*/
return acc.concat(elem);
}
/*
* don't append anything otherwise
*/
return acc;
};
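These drop straight into Array.prototype.reduce, with an empty array as the initial accumulator:
[1, 2, 3].reduce(map(x => x * 10), []);     // [10, 20, 30]
[1, 2, 3].reduce(filter(x => x !== 2), []); // [1, 3]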
If you are now wondering 'How would one compose this?', you are wondering right. Composing two functions requires that the argument type of the outer function matches the return type of the inner function. In the above functions, the result of map(mapFn) would fit into neither map nor filter, and vice versa. There is simply no way to do so.
What we would need is that map(mapFn) (or filter(predicateFn)) returns a function that expects a function of its own type (I know it becomes very convoluted here, but stay with me: the resolution is simple).
We resolve this problem by further generalising map and filter. The above implementations are only suited to work with arrays as output. But one could imagine having any type of data structure as output that allows adding elements (like trees, heaps, maps, sets, linked lists, etc.). So let's get rid of the acc.concat(x) calls in the above code and replace them with combine(acc, x), where combine is provided via an argument of the initially returned function:
const map = mapFn => combine => (acc, elem) => combine(acc, mapFn(elem));
const filter = predicateFn => combine => (acc, elem) => {
if (predicateFn(elem)) {
/*
* add the element to the result data structure
* if predicate evaluates to a truthy result
*/
return combine(acc, elem);
}
/*
* don't add anything otherwise
*/
return acc;
};
Now take a close look at how combine is being used:
- First argument: the accumulator (i.e. the result data structure)
- Second argument: the element that should be added to the result data structure
- Return value: the result data structure, containing the element
This not only looks like a reducer, it is a reducer!
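A quick check with plain arrays and concat as the combine function:
const concatCombine = (acc, elem) => acc.concat(elem);

[1, 2, 3, 4].reduce(map(x => x * 2)(concatCombine), []);          // [2, 4, 6, 8]
[1, 2, 3, 4].reduce(filter(x => x % 2 === 0)(concatCombine), []); // [2, 4]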
xform (1)
Let's start using our new map and filter and build our example transform from above:
const booksAndArticlesArray = [
/* our array of books and articles */
];
const xform = (acc, elem) =>
map(item => ({
...item,
year: new Date(item.publicationDate).getFullYear()
}))(
filter(({ year }) => year === 2007)(
filter(({ title }) => /guide/i.test(title))(
map(({ title, authors, publicationDate }) => ({
title,
author: authors.join(", "),
date: new Date(publicationDate).toDateString()
}))(
filter(({ author }) => author.includes("J. Doe"))(
(acc, elem) => (acc.push(elem), acc) // push returns the new length, so hand back the accumulator explicitly
)
)
)
)
)(acc, elem);
const result = booksAndArticlesArray.reduce(xform, []);
... I don't know about you, but I find this horrible. I wouldn't approve any PR that contains a thing like this.
To simplify readability, we introduce a general notion of compose that composes n functions:
const compose = (...functs) => (...args) =>
functs.length > 1
? compose(...functs.slice(0, -1))(
functs[functs.length - 1](...args)
)
: functs[0](...args);
Note: The above implementation of compose is tail-recursive. So you could rewrite the recursion into a loop. Feel free to take this as an exercise. I'm happy to review your solution in the comments.
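A quick sanity check of the ordering - compose applies its functions from right to left:
const inc = x => x + 1;
const square = x => x * x;

compose(square, inc)(2); // square(inc(2)) = 9
compose(inc, square)(2); // inc(square(2)) = 5
Note that although compose works right to left, a composed xform still processes each element in top-to-bottom order: every transducer wraps the reducer it receives, so the outermost (first listed) transformation touches the element first.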
We now have a proper implementation for compose. Let's refactor our xform from above and bring it into a readable form:
xform (2)
const xform = compose(
map(item => ({
...item,
year: new Date(item.publicationDate).getFullYear()
})),
filter(({ year }) => year === 2007),
filter(({ title }) => /guide/i.test(title)),
map(({ title, authors, publicationDate }) => ({
title,
author: authors.join(", "),
date: new Date(publicationDate).toDateString()
})),
filter(({ author }) => author.includes("J. Doe"))
);
Now that we have a clean xform, we can define the function that will use it to apply our transforms. We call the "framework" for running xforms against data xduce. xduce takes four arguments:
- the xform
- a combiner (combine)
- an initializer (getInitial)
- the input data (inputArr)
/*
* xduce draft
*/
const xduce = (xform, combine, getInitial, inputArr) =>
inputArr.reduce(xform(combine), getInitial());
As we might want to reuse the transform, we curry the last argument and default the combine and getInitial arguments (for our convenience):
/*
* xduce (reworked - curried-data-last)
*/
const xduce = (
xform,
combine = (acc, elem) => (acc.push(elem), acc),
getInitial = () => []
) => inputArr => inputArr.reduce(xform(combine), getInitial());
Note: The above code uses Array.prototype.push as the combiner. Since push is a mutating operation, we have to make sure to create a new initial collection with every call to xduce. Otherwise subsequent calls would add results to the same array. This is usually not what we want and produces unexpected results. Alternatively you could use (acc, elem) => acc.concat(elem) as the combiner. Array.prototype.concat does not mutate its source, but will be considerably slower (especially with large arrays).
Besides the provided default, you could use any data structure you like for your results. Just make sure that the combiner and your initial collection fit together (e.g. for an ES Set, use (acc, elem) => acc.add(elem) as the combiner and () => new Set() as your getInitial argument).
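For example, collecting the results into a Set instead of an array could look like this:
// same xform, different result data structure
const intoSet = xduce(
  xform,
  (acc, elem) => acc.add(elem), // Set.prototype.add returns the set itself
  () => new Set()
);

const resultSet = intoSet(booksAndArticlesArray);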
The next step is to use our xform with xduce:
/*
* reminder: xduce(xform) returns a reusable transform
* that just expects input data
*/
const bookFilterMapTransform = xduce(xform);
/*
* using bookFilterMapTransform
*/
const result = bookFilterMapTransform(booksAndArticlesArray);
The Concept
The concept explained here is also known as transducers. As "transducers" is such a complicated-sounding word, I chose a more descriptive name for the article.
The first transducer implementation was provided in Clojure. The concept gained popularity and was ported to other languages like JavaScript, Python, ReasonML/OCaml, Elixir and many more.
There are some notable library implementations of transducers available:
If you want to know more about transducers, you'll find a reading list at the end of the article.
Benchmarks
The real power of transducers will show up when using them with really large sets of data.
I did some benchmarking to give you an idea of the performance advantage of the transducers approach. The xduce, map, filter and compose implementations are exactly the ones I provided in this article.
The following transformations were used for the benchmark:
Transducer
const benchmarkTransducer = xduce(
compose(
map(function(x) {
return x + 10;
}),
map(function(x) {
return x * 2;
}),
filter(function(x) {
return x % 5 === 0;
}),
filter(function(x) {
return x % 2 === 0;
})
)
);
Native/Classic Transform
const classicTransform = arr =>
arr
.map(function(x) {
return x + 10;
})
.map(function(x) {
return x * 2;
})
.filter(function(x) {
return x % 5 === 0;
})
.filter(function(x) {
return x % 2 === 0;
});
For the time values at each array length, I took the average time from running each implementation 16 times (both on the exact same array). I created 64 equidistant steps. The smallest array had a length of 10417 items, the largest a length of 489583 items.
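A measurement loop of that shape could look roughly like this (a sketch; the performance.now() based timing, the measure helper and the testArray name are assumptions):
// hypothetical helper: average runtime of a transform over n runs on one array
const measure = (transform, arr, runs = 16) => {
  let total = 0;
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    transform(arr);
    total += performance.now() - start;
  }
  return total / runs; // average time in ms
};

// e.g. measure(benchmarkTransducer, testArray) vs. measure(classicTransform, testArray)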
Both implementations behave very similarly in their runtime characteristics below an array size of 60k values (with transducers being only minimally faster than the classical transformation chain). Between 60k and 70k we see an almost immediate increase in the running time of the classical approach. I don't know the exact reasons for this jump (if someone has an idea, please let me know in a comment).
Here is a more detailed image of this region:
Transducers also perform great in comparison with other libraries (e.g. Immutable.js):
Further Reading
- https://codeburst.io/simpler-transducers-for-javascript-4d02a0926648
- https://tgvashworth.com/2014/08/31/csp-and-transducers.html
- https://medium.freecodecamp.org/efficient-data-transformations-using-transducers-c779043ba655
- https://jlongster.com/Transducers.js--A-JavaScript-Library-for-Transformation-of-Data
Top comments (4)
Hi, thanks for the great post. I'm still learning and I really want to start using transducers soon :) but I have a question that sounds a bit too "old-school": how does that compare to an imperative approach with for loops?
I like and use a functional approach as much as possible, for various reasons (reusability, immutability, readability etc.), but I always have some doubts when it comes to performance.
How does that transducer function compare to an imperative approach with a plain for loop?
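Something along these lines (a rough, illustrative sketch of an imperative version of the benchmark transform):
// illustrative imperative equivalent of the benchmark transform
const imperativeTransform = arr => {
  const result = [];
  for (let i = 0; i < arr.length; i++) {
    const x = (arr[i] + 10) * 2;
    if (x % 5 === 0 && x % 2 === 0) {
      result.push(x);
    }
  }
  return result;
};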
For sure it cannot be used for streams, but for a static data array it can outperform the transducer version. Where and when do you think one approach can be used instead of another?
In the field of animations and interactions, where every ms counts, having optimized performance is a must. What do you think?
Hi Marco,
thanks for your comment.
The imperative approach you are describing is of course the fastest of all alternatives. It gives you roughly a 50% performance plus. So depending on your collection sizes and performance requirements, it is worth considering the imperative approach. I'd recommend reading this article if you want to get more insight.
Most of the performance penalty comes from the function call overhead. I know that there are/were some efforts going on to inline functions at build time (via a Babel plugin or the Closure Compiler). These optimizations are not very reliable (at the current time). Babel needs all functions in the same file for this optimization, and the Closure Compiler only inlines when it would save space to inline a function.
A more reliable alternative would be to write your data transforms in ReasonML and let the BuckleScript compiler do the heavy lifting. The optimizations applied by the BuckleScript compiler are more advanced than the optimizations of any JS-to-JS compiler out there. So if your transforms tend to be rather complex but you don't want to sacrifice readability or maintainability, then I'd recommend trying ReasonML.
Maybe V8 is unrolling the loops or doing some other static wizardry?
Did you try it on different JS engines?
Hi Benny,
Thanks for the comment.
I'll try with different JS engines and post the results here.