Transducers - a Generalized Concept for Data Transformations

misterwhat profile image Jonas Winzen ・10 min read

When developing software, we can rarely get around handling data in some way.
This is true whether you work on the frontend (where the UI you are building is basically a function of the data) or on the backend (backends tend to be rather data-heavy).

Typical tasks when processing data include (but are not limited to):

  • Filtering out datasets (like a Facebook search that should only return matches for my search term, or redacting information for privacy reasons - like bank account numbers, email addresses or even passwords)
  • Mapping binary data into a human-readable format, or vice versa

...well, okay. Facebook might not be the best example for demonstrating applied data privacy...

TL;DR: You can skip the theory if you're not interested or already know the basics.

Theory of Data and Transformations

Data

Before working on data, we need to have an idea of what the data looks like. In general, the structure can be defined as a collection of items, like:

  • Rows in a database table - where the table is the collection and rows are the items
  • Elements in a set - with elements as items, the set as the collection
  • Fields in an array
  • Elements in a list
  • Entries in a dictionary
  • Nodes in a tree

Any collection could be embedded as an item into another collection. Let's formalize this statement:

-- an Item is defined as either:
-- 1. a Primitive value
-- 2. a Collection of Items
data Item = Primitive | Collection Item

Note: I also tried to find a representation for this in TypeScript notation. A recursive type alias such as type Item = Primitive | Array<Item> comes close for the array case, but TypeScript's type system lacks the higher-kinded types needed to abstract over arbitrary collection types.

Transformable Data

Since our definition of data consists of just two kinds of types (Item and Collection), we can only have transformations on the collection level or on the item level. Filtering (deciding for each element whether or not to include it in the result) is a transformation on the collection level.
A collection that can be filtered is called Filterable.

Mapping is the process of taking each item from its container (the collection), applying a transform to the item, and putting it back into a container of the same kind of collection where it came from. Mapping is a transformation on the item level.
A collection (or container), where you can map over its contents is called Functor.
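A plain Javascript array is both Filterable and a Functor, so both kinds of transformation can be sketched directly on it:

```javascript
// filter: a collection-level transformation (decides which items stay)
const evens = [1, 2, 3, 4].filter(x => x % 2 === 0); // → [2, 4]

// map: an item-level transformation (transforms each item,
// but keeps the kind of container - array in, array out)
const tenfold = [1, 2, 3].map(x => x * 10); // → [10, 20, 30]
```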

In Practice

The Common Approach

Javascript comes with native array methods for mapping and filtering array data. Most notably:

  • Array.prototype.map(mapFn)
  • Array.prototype.filter(predicateFn)
  • Array.prototype.reduce(reducerFn [, init])

Now let's make up a simple example, to see how each of them works.
We have a collection of bibliographical data. Each item represents a book or publication that has a unique id, a title, was written by one or more authors, and has a publication date (in the form of a Unix timestamp in ms since epoch).

type BookOrArticle = {
  id: string,
  title: string,
  authors: Array<string>,
  publicationDate: number,
};

We have an array of (let's say) 10,000 books and articles and we need to get all items that contain the word "guide" in the title, were written by J. Doe, and were published in 2007. The result items should be in the form:

type Result = {
  title: string,
  author: string,
  date: string
};

1. Naive Approach

Let's take a first approach:

const booksAndArticlesArray = [
  /* our array of books and articles */
];

function getResults(inputArray) {
  return inputArray
    .map(item => ({
      ...item,
      year: new Date(item.publicationDate).getFullYear()
    }))
    .filter(({ year }) => year === 2007)
    .filter(({ title }) => /guide/i.test(title))
    .map(({ title, authors, publicationDate }) => ({
      title,
      author: authors.join(", "),
      date: new Date(publicationDate).toDateString()
    }))
    .filter(({author}) => author.includes("J. Doe"))
}

It might not be immediately visible, but each filter or map step creates an intermediate array containing its result, which is then used as the input for the next filter/map step.

2. Optimized Approach

To reduce the number of intermediate data structures created, one could try and reduce the number of transformation steps by collapsing consecutive map and filter operations:


// collapse consecutive filter operations

function getResults(inputArray) {
  return inputArray
    .map(item => ({
      ...item,
      year: new Date(item.publicationDate).getFullYear()
    }))
    .filter(
      ({ year, title }) => year === 2007 && /guide/i.test(title)
    )
    .map(({ title, authors, publicationDate }) => ({
      title,
      author: authors.join(", "),
      date: new Date(publicationDate).toDateString()
    }))
    .filter(({ author }) => author.includes("J. Doe"));
}

We could further optimize this by moving the mapping from the authors array field in the second map operation into the first map operation. This would allow us to collapse the final filter operation with the other filter:


// 1. move mapping authors array to author string into first map operation
// 2. collapse final filter operation with previous filter operation

function getResults(inputArray) {
  return inputArray
    .map(({ publicationDate, authors, ...item }) => ({
      ...item,
      author: authors.join(", "),
      year: new Date(publicationDate).getFullYear()
    }))
    .filter(
      ({ year, title, author }) =>
        year === 2007 &&
        /guide/i.test(title) &&
        author.includes("J. Doe")
    )
    .map(({ title, author, publicationDate }) => ({
      title,
      author,
      date: new Date(publicationDate).toDateString()
    }));
}

We reduced the number of intermediate data structures created from 5 to 3, but at the cost of readability. Further reduction is possible by moving the author and year transforms from the first map operation into the filter step and removing the first map operation (this, too, comes at the cost of readability).

Mapping and Filtering via Reduce

What if we could express filter and map in a way that allows us to compose them directly?

Reminder: Compose or function composition means that you call one function with the result of another function.
I.e.: given the functions f: X -> Y and g: Y -> Z, we call g.f(x) := g(f(x)) the composition of f with g, with g.f: X -> Z (that means g.f takes values of X as argument and returns values of Z as result).
Example: Composing f(x) => x + 1 and g(x) => x * x to g(f(x)) = h(x) results in h(x) => (x + 1) * (x + 1). We get this result by substituting x in g(x) with the function body of f(x).
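In Javascript, this substitution is a one-liner (f, g, and h here are just the illustrative functions from the example above):

```javascript
const f = x => x + 1;
const g = x => x * x;

// h is the composition of f with g: first apply f, then g
const h = x => g(f(x));

h(2); // → g(3) = 9, i.e. (2 + 1) * (2 + 1)
```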

So it looks like function composition is the right tool to express all our map and filter transformation steps at once. As reduce is one of the most versatile operations on arrays (or any other iterable structure), let's try to express map and filter as reducers.

Reminder: Reducers are functions that can be used as a callback in Array.prototype.reduce(). The first argument of a reducer is the accumulator (the return value of the previous reducer call). The second argument is the element we are currently looking at (just like the first argument in filter and map). Thus a reducer has the following signature: reducer(acc: A, elem: B) => A, or in Hindley-Milner notation: reducer :: A -> B -> A. The HM notation assumes that every function is curried, so a function that takes two arguments is expressed as a function A -> B -> C:

  • that takes one argument (A) -> returns a function (B -> C)
    • that takes one argument (B) -> returns the result (C)
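In Javascript, currying looks like this (add is just a hypothetical example function):

```javascript
// uncurried: (A, B) -> C
const add = (a, b) => a + b;

// curried: A -> B -> C
const addCurried = a => b => a + b;

// partially applying the first argument yields a function B -> C
const addTen = addCurried(10);
addTen(5); // → 15
```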

Let's try to rebuild map and filter as reducers. For map we need a function that takes a mapping function as argument and returns a reducer function as result. For filter we need a function that takes a filter predicate and also returns a reducer function as result:

const map = mapFn => (acc, elem) => acc.concat(mapFn(elem));

const filter = predicateFn => (acc, elem) => {
  if (predicateFn(elem)) {
    /*
     * append the element
     * if predicate evaluates to a truthy result
     */
    return acc.concat(elem);
  }
  /*
   * don't append anything otherwise
   */
  return acc;
};
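These are ordinary reducers, so each of them already works with Array.prototype.reduce on its own (the definitions are the ones from above):

```javascript
const map = mapFn => (acc, elem) => acc.concat(mapFn(elem));

const filter = predicateFn => (acc, elem) =>
  predicateFn(elem) ? acc.concat(elem) : acc;

// each returned function is a valid reducer callback
[1, 2, 3, 4].reduce(map(x => x * 2), []);          // → [2, 4, 6, 8]
[1, 2, 3, 4].reduce(filter(x => x % 2 === 0), []); // → [2, 4]
```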

If you are now wondering 'How would one compose this?', you are wondering right. Composing two functions requires that the argument type of the outer function matches the return type of the inner function. With the above functions, the result of map(mapFn) would fit neither into map nor into filter (or vice versa). There is simply no way to compose them.

What we would need is that map(mapFn) (or filter(predicateFn)) returns a function that expects a function of its own kind (I know it becomes very convoluted here, but stay with me: the resolution is simple).

We resolve this problem by further generalising map and filter. The above implementations are only suited to producing arrays as output. But one could imagine any data structure that allows adding elements as output (like trees, heaps, maps, sets, linked lists, etc.). So let's get rid of the acc.concat(x) calls in the above code and replace them with combine(acc, x), where combine is provided via an argument of the initially returned function:

const map = mapFn => combine => (acc, elem) => combine(acc, mapFn(elem));

const filter = predicateFn => combine => (acc, elem) => {
  if (predicateFn(elem)) {
    /*
     * add the element to the result data structure
     * if predicate evaluates to a truthy result
     */
    return combine(acc, elem);
  }
  /*
   * don't add anything otherwise
   */
  return acc;
};

Now take a close look at how combine is being used:

  • First argument: the accumulator (i.e. the result data structure)
  • Second argument: the element that should be added to the result data structure
  • Return value: the result data structure, containing the element

This not only looks like a reducer, it is a reducer!
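In other words, map(mapFn) and filter(predicateFn) now take a reducer and return a new reducer. For example, handing a push-based reducer to the generalised map gives us back a plain reducer we can pass to reduce (a small sketch using the map from above):

```javascript
const map = mapFn => combine => (acc, elem) => combine(acc, mapFn(elem));

// a reducer that appends an element to an array
const pushCombine = (acc, elem) => (acc.push(elem), acc);

// map(mapFn)(pushCombine) is again a plain reducer
[1, 2, 3].reduce(map(x => x + 1)(pushCombine), []); // → [2, 3, 4]
```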

xform (1)

Let's start using our new map and filter to build our example transform from above:

const booksAndArticlesArray = [
  /* our array of books and articles */
];

const xform = (acc, elem) =>
  map(item => ({
    ...item,
    year: new Date(item.publicationDate).getFullYear()
  }))(
    filter(({ year }) => year === 2007)(
      filter(({ title }) => /guide/i.test(title))(
        map(({ title, authors, publicationDate }) => ({
          title,
          author: authors.join(", "),
          date: new Date(publicationDate).toDateString()
        }))(
          filter(({ author }) => author.includes("J. Doe"))(
            (acc, elem) => (acc.push(elem), acc)
          )
        )
      )
    )
  )(acc, elem);

const result = booksAndArticlesArray.reduce(xform, []);

... I don't know about you, but I find this horrible 🤮. I wouldn't approve any PR that contains a thing like this.
To improve readability, we introduce a general notion of compose, which composes n functions:

const compose = (...functs) => (...args) =>
  functs.length > 1
    ? compose(...functs.slice(0, -1))(
        functs[functs.length - 1](...args)
      )
    : functs[0](...args);
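A quick sanity check: compose applies its functions from right to left, so the rightmost function receives the arguments first (inc and double are just hypothetical helpers):

```javascript
const compose = (...functs) => (...args) =>
  functs.length > 1
    ? compose(...functs.slice(0, -1))(
        functs[functs.length - 1](...args)
      )
    : functs[0](...args);

const inc = x => x + 1;
const double = x => x * 2;

compose(inc, double)(5); // → inc(double(5)) = 11
compose(double, inc)(5); // → double(inc(5)) = 12
```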

Note: The above implementation of compose is tail-recursive. So you could rewrite the recursion into a loop. Feel free to take this as an exercise. I'm happy to review your solution in the comments 🤓.

We now have a proper implementation of compose. Let's refactor our xform from above and bring it into a readable form:

xform (2)


const xform = compose(
  map(item => ({
    ...item,
    year: new Date(item.publicationDate).getFullYear()
  })),
  filter(({ year }) => year === 2007),
  filter(({ title }) => /guide/i.test(title)),
  map(({ title, authors, publicationDate }) => ({
    title,
    author: authors.join(", "),
    date: new Date(publicationDate).toDateString()
  })),
  filter(({ author }) => author.includes("J. Doe"))
);

Now that we have a clean xform, we can define the function that will use it to apply our transforms. We call the "framework" for running xforms against data xduce. xduce takes four arguments:

  • the xform
  • a combiner (combine)
  • an initializer (getInitial)
  • the input data (inputArr)

/*
 * xduce draft
 */
const xduce = (xform, combine, getInitial, inputArr) =>
  inputArr.reduce(xform(combine), getInitial());

As we might want to reuse the transform, we curry the last argument and default the combine and getInitial arguments (for our convenience 😇):

/*
 * xduce (reworked - curried-data-last)
 */
const xduce = (
  xform,
  combine = (acc, elem) => (acc.push(elem), acc), 
  getInitial = () => []
) => inputArr => inputArr.reduce(xform(combine), getInitial());

Note: The above code uses Array.prototype.push as the combiner. Since push is a mutating operation, we have to make sure to create a new initial collection with every call to xduce. Otherwise subsequent calls would add their results to the same array. This is usually not what we want and produces unexpected results. Alternatively you could use (acc, elem) => acc.concat(elem) as the combiner. Array.prototype.concat does not mutate its source, but is considerably slower (especially with large arrays).

Besides the provided defaults, you can use any data structure you like for your results. Just make sure that the combiner and your initial collection fit together (e.g. for an ES Set, use (acc, elem) => acc.add(elem) as combine and () => new Set() as getInitial).
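For example, here is a small self-contained sketch (reusing the filter and xduce implementations from above) that collects unique even numbers into a Set instead of an array:

```javascript
const filter = predicateFn => combine => (acc, elem) =>
  predicateFn(elem) ? combine(acc, elem) : acc;

const xduce = (
  xform,
  combine = (acc, elem) => (acc.push(elem), acc),
  getInitial = () => []
) => inputArr => inputArr.reduce(xform(combine), getInitial());

// combiner and initial collection fit together: Set.add + new Set
const uniqueEvens = xduce(
  filter(x => x % 2 === 0),
  (acc, elem) => acc.add(elem),
  () => new Set()
);

uniqueEvens([1, 2, 2, 3, 4]); // → Set { 2, 4 }
```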

The next step is to use our xform with xduce:

/*
 * reminder: xduce(xform) returns a reusable transform
 * that just expects input data
 */
const bookFilterMapTransform = xduce(xform);


/*
 * using bookFilterMapTransform
 */
const result = bookFilterMapTransform(booksAndArticlesArray);
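To see the whole pipeline end to end, here is a self-contained sketch using all the pieces from this article, run against a tiny hypothetical dataset (two made-up records, of which only the first matches all criteria; the dates are mid-year to avoid timezone edge cases):

```javascript
const map = mapFn => combine => (acc, elem) => combine(acc, mapFn(elem));

const filter = predicateFn => combine => (acc, elem) =>
  predicateFn(elem) ? combine(acc, elem) : acc;

const compose = (...functs) => (...args) =>
  functs.length > 1
    ? compose(...functs.slice(0, -1))(functs[functs.length - 1](...args))
    : functs[0](...args);

const xduce = (
  xform,
  combine = (acc, elem) => (acc.push(elem), acc),
  getInitial = () => []
) => inputArr => inputArr.reduce(xform(combine), getInitial());

const xform = compose(
  map(item => ({
    ...item,
    year: new Date(item.publicationDate).getFullYear()
  })),
  filter(({ year }) => year === 2007),
  filter(({ title }) => /guide/i.test(title)),
  map(({ title, authors, publicationDate }) => ({
    title,
    author: authors.join(", "),
    date: new Date(publicationDate).toDateString()
  })),
  filter(({ author }) => author.includes("J. Doe"))
);

// hypothetical sample data matching the BookOrArticle shape
const booksAndArticlesArray = [
  {
    id: "1",
    title: "A Guide to Everything",
    authors: ["J. Doe"],
    publicationDate: Date.UTC(2007, 5, 15)
  },
  {
    id: "2",
    title: "Another Story",
    authors: ["A. Nother"],
    publicationDate: Date.UTC(2001, 5, 15)
  }
];

const bookFilterMapTransform = xduce(xform);
bookFilterMapTransform(booksAndArticlesArray);
// → one result: { title: "A Guide to Everything", author: "J. Doe", date: ... }
// (the exact date string depends on the local timezone)
```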

The Concept

The concept explained here is also known as transducers. Since "transducers" is such a complicated-sounding word, I chose a more descriptive name for the article.
The first transducers implementation was provided in Clojure. The concept gained popularity and was ported to other languages like Javascript, Python, ReasonML/OCaml, Elixir and many more.

There are some notable library implementations of transducers available:

If you want to know more about transducers, you'll find a reading list at the end of the article.

Benchmarks

The real power of transducers shows up when they are used with really large sets of data.
I ran some benchmarks to give you an idea of the performance advantage of the transducers approach. The xduce, map, filter and compose implementations are exactly the ones I provided in this article.

The following transformations were used for the benchmark:

Transducer

const benchmarkTransducer = xduce(
  compose(
    map(function(x) {
      return x + 10;
    }),
    map(function(x) {
      return x * 2;
    }),
    filter(function(x) {
      return x % 5 === 0;
    }),
    filter(function(x) {
      return x % 2 === 0;
    })
  )
);

Native/Classic Transform

const classicTransform = arr =>
  arr
    .map(function(x) {
      return x + 10;
    })
    .map(function(x) {
      return x * 2;
    })
    .filter(function(x) {
      return x % 5 === 0;
    })
    .filter(function(x) {
      return x % 2 === 0;
    });

For each array length, I took the average time from running each implementation 16 times (both on the exact same array). I created 64 equidistant steps. The smallest array had a length of 10,417 items, the largest a length of 489,583 items.

Benchmark Results

Both implementations behave very similarly in their runtime characteristics below an array size of 60k values (with transducers being only minimally faster than the classical transformation chain). Between 60k and 70k we see an almost immediate increase in the running time of the classical approach. I don't know the exact reason for this jump (if someone has an idea, please let me know in a comment 🙃).
Here is a more detailed image of this region:

Detail View

Transducers also perform great in comparison with other libraries (e.g. Immutable.js):

Immutable.js

Further Reading

Discussion (4)

markov02 profile image
Marco Vettorello

Hi, thanks for the great post. I'm still learning and I really want to start using transducers soon :) but I have a question that sounds a bit too "old-school": how does that compare to an imperative approach with for loops?
I like and use a functional approach as much as possible, for various reasons (reusability, immutability, readability etc.), but I always have some doubts when it comes to performance.
How does the transducer function

const benchmarkTransducer = xduce(
  compose(
    map(function(x) {
      return x + 10;
    }),
    map(function(x) {
      return x * 2;
    }),
    filter(function(x) {
      return x % 5 === 0;
    }),
    filter(function(x) {
      return x % 2 === 0;
    })
  )
);

compare to an imperative approach like

let i = 0;
const updatedData = []
for(;i< data.length;i++) {
   const value = data[i];
   const updatedValue = (value + 10) * 2
   if (updatedValue % 5 === 0 && updatedValue % 2 === 0){
     updatedData.push(updatedValue)
   }
}

For sure it cannot be used for streams, but for a static data array it can outperform the transducer version. Where and when do you think one approach should be used instead of the other?
In the field of animations and interactions, where every ms counts, having optimized performance is a must - what do you think?

misterwhat profile image
Jonas Winzen Author

Hi Marco,
thanks for your comment 😃

The imperative approach you are describing is of course the fastest of all the alternatives. It gives you roughly a 50% performance gain. So depending on your collection sizes and performance requirements, it is worth considering the imperative approach. I'd recommend reading this article if you want to get more insight.

Most of the performance penalty comes from the function call overhead. I know that there are/were some efforts to inline functions at build time (via a Babel plugin or the Closure Compiler). These optimizations are not very reliable at the current time: Babel needs all functions in the same file for this optimization, and the Closure Compiler only inlines when inlining a function would save space.

A more reliable alternative would be to write your data transforms in ReasonML and let the BuckleScript compiler do the heavy lifting. The optimizations applied by the BuckleScript compiler are more advanced than the optimizations of any JS-to-JS compiler out there. So if your transforms tend to be rather complex, but you don't want to sacrifice readability or maintainability, then I'd recommend trying ReasonML 😉

bennypowers profile image
Benny Powers 🇮🇱🇨🇦

Between 60k and 70k we see an almost immediate increase in the running time of the classical approach.

Maybe v8 is unrolling the loops or doing some other static wizardry?

Did you try on different js engines?

misterwhat profile image
Jonas Winzen Author

Hi Benny,
Thanks for the comment 😃
I'll try with different JS engines and post the results here 👍.