DEV Community

Paula Gearon
Paula Gearon

Posted on

Clojure Transducers

With a personal life that gets in the way of writing, I've been on a short hiatus, but I do plan to return to my overall story. In the meantime, I was playing with Clojure transducers last night and thought I'd write some notes.

Sudoku

During a discussion on query algebra for graph databases with a friend, he mentioned constraint propagation and pointed me at Peter Norvig's Sudoku Solver. I'm a little rusty on Python so I worked my way through it, which was a fun thing to do over my New Year break.

As an advocate of Functional Programming (FP), the statefulness of the solution made me feel particularly uncomfortable, though it did not look to be too tied to the solution. I was tempted to rewrite it in more familiar terms until I saw that the post includes a link to Justin Kramer's Clojure version (thank you, Justin!).

Short Circuits

As a sometimes-lazy functional language, Clojure has many operations for processing sequences of data. Using these as much as possible is idiomatic for Clojure, as it tends to isolate concerns (reducing the likelihood of bugs), creates abstractions that are favorable for refactoring, and on occasion gives libraries the latitude to improve performance. So I often look for opportunities to use sequence processing, even when the implementation may be less obvious. That doesn't mean I would necessarily do this in production code, but the exploration is pedagogical, and occasionally leads me to new insights.

One aspect of Norvig's code is that it terminates processing early when it discovers an inconsistency in the data (e.g. 7 may be the only option left for this square, but it is already used elsewhere in the same column). This allows the algorithm to move to the next possibility. To do this, instead of returning an updated data structure from the search functions, Norvig's code returns the value False. (Cue the howls from advocates of statically typed languages. After all, he doesn't even use an Option monad! πŸ˜‰).

In general, Clojure's sequence processing does not have an option to terminate early. Of course, exceptions can always be thrown, but this is not really appropriate for handling expected data, and it is definitely not "good form" for FP.

In Justin's case, he wrote a function called reduce-true which performs the usual operation of reduce, but terminates early whenever the result becomes a false value (false or nil). This function uses a loop/recur construct, which is straightforward to follow. This is exactly how I've always done such things in the past.

But this phrase "terminates early" is one that is specifically referenced by Clojure Transducers. This made me wonder if there was something I should consider here.

Transducers

Clojure Transducers are a mechanism introduced in Clojure 1.7. From the changelog in 2014:

Transducers is a new way to decouple algorithmic transformations from their application in different contexts. Transducers are functions that transform reducing functions to build up a "recipe" for transformation.

The transducers mechanism supplies the "separation of concerns" that I mentioned earlier and results in several benefits, most especially in performance when data goes through multiple transformations. But the thing that always stood out for me is that transducers can process a sequence to contain more or fewer elements than the initial sequence. This is a requirement that keeps coming up for me, and to date, I have been avoiding using transducers to fill this gap.

Multiple Elements

The main place where standard map operations let me down is where a single value in an input sequence may result in multiple values in the output sequence. Consider an example of a modified FizzBuzz. For an input sequence of:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]

Then the output should be:
[1, 2, :fizz, 4, :buzz, :fizz, 7, 8, :fizz, :buzz, 11, :fizz, 13, 14, :fizz, :buzz, 16]

The numbers smaller than 15 can all be managed with a simple condition-based function:

(defn fizzbuzz1 [x]
  (condp #(zero? (mod %2 %1)) x
    3 :fizz
    5 :buzz
    x))
(map fizzbuzz1 (range 1 15))

=> (1 2 :fizz 4 :buzz :fizz 7 8 :fizz :buzz 11 :fizz 13 14)

But what about the case for 15? We can't just return 2 values, as this will result in a pair object in the result stream:

(defn fizzbuzz2 [x]
  (condp #(zero? (mod %2 %1)) x
    15 [:fizz :buzz]
    3 :fizz
    5 :buzz
    x))
(map fizzbuzz2 (range 1 17))

=> (1 2 :fizz 4 :buzz :fizz 7 8 :fizz :buzz 11 :fizz 13 14 [:fizz :buzz] 16)

One approach is to use a loop instead, note that the test for 15 could be avoided, but I'm keeping it in place to parallel the above function:

(loop [[n & ns] (range 1 17) result []]
  (if-not n
    result
    (let [appended (fizzbuzz2 n)
          result (if (sequential? appended)
                   (into result appended)
                   (conj result appended))]
      (recur ns result))))

=> (1 2 :fizz 4 :buzz :fizz 7 8 :fizz :buzz 11 :fizz 13 14 :fizz :buzz 16)

This works, but it mixes the mechanism (the loop/recur) with the specific operation that we want. It also has a specific test for the datatype being returned, to determine how to add the data.

We can fix the second issue, by using the same datatype (a vector) for everything. This wraps scalars as a single-element vector...

(defn fizzbuzz3 [x]
  (condp #(zero? (mod %2 %1)) x
    15 [:fizz :buzz]
    3 [:fizz]
    5 [:buzz]
    [x]))
(loop [[n & ns] (range 1 17) result []]
  (if-not n
    result
    (recur ns (into result (fizzbuzz3 n)))))

But with consistent return types, we can now go one better with the fizzbuzz function and skip the test for 15:

(defn fizzbuzz4 [x]
  (or
    (seq
      (cond-> []
        (zero? (mod x 3)) (conj :fizz)
        (zero? (mod x 5)) (conj :buzz)))
    [x]))
(loop [[n & ns] (range 1 17) result []]
  (if-not n
    result
    (recur ns (into result (fizzbuzz3 n)))))

Finally, we can revisit the mechanism, which is the loop, and consider the map operation again. If we just look a the simple application of this operation, we get the intermediate structure:

(map fizzbuzz4 (range 1 17))

=> ([1] [2] (:fizz) [4] (:buzz) (:fizz) [7] [8] (:fizz) (:buzz) [11] (:fizz) [13] [14] (:fizz :buzz) [16])

This sequence of sequences can be joined into a single sequence with mapcat:

(mapcat fizzbuzz4 (range 1 17))

=> (1 2 :fizz 4 :buzz :fizz 7 8 :fizz :buzz 11 :fizz 13 14 :fizz :buzz 16)

This does better than the previous iterations:

  • fizzbuzz4 only tests for divisibility by 3 and 5, and automatically handles 15.
  • There is little consideration of the mechanism for processing the list and building the result.

The only thing that may be of concern here is that all results for the fizzbuzz4 function are sequences, even when the result is a single value. This creates a minor overhead and is also exposing something of how the data is joined in the mapcat operation. Nevertheless, this works well, and it is the approach that I have used numerous times in the past.

Transducer Approach

Since transducers can output multiple values in a sequence for a given input element, is this something that we can use here?

To reiterate the documentation for transducers, a transducer is just an operation that returns a function with 3 arities. These follow the pattern we see in a reduce operation. Indeed, the transducer function is given a reducing function to create a context for the multiple arity function to use:

(fn [reducing-function]
  (fn
   ([] ...)               ;; initialize
   ([result] ...)         ;; completion
   ([result input] ...))) ;; step

The initialize function gets called first and is used for any initialization required. If initialization starts without an initial value, then the reducing function would be called with no arguments here. This reducing function varies according to the use of the transducer, but it can be provided directly for the transduce function.

e.g. if the code that calls the transducer was provided the + function for reduction as in:
(transduce xf + [1 2 3 4])
Then the init function would call (+) (which is 0). If the xf function is identity, then this will give the same result as:
(reduce + [1 2 3 4])

The step function is where the work happens. It processes the input argument and chooses whether or not to append to the result argument.

Finally, the completion function flushes any data and provides the final result.

Step

The step function takes the current result and the next element to be processed. If the transducer is a filter, then it may choose to not add something to the result. If the transducer adds multiple elements (as in the FizzBuzz example above), then all of these can be added to the result here.

While learning about building a transducer, I was curious about what had been accumulated in the result already. This depends entirely on the type of operation that is being done with the transducer.

  • If it is called with sequence then the result will be nil, and the reducing function will buffer the data, as a lazy sequence.
  • If (into [] ...) is used, then a TransientVector is created and populated.
  • If transduce is called, then the initial value for the result is either provided or will be the result of calling the 0-arity form of the reducing function.
  • If the transducer is called via eduction, then the result will depend on the way that the iterator is called. Printing to repl will provide a result of nil. Calling via reduce will have a result of the data being reduced into.

In other worse, there is no guarantee of what kind of result will be provided, nor on the reduction function that will be used to interact with it. These values must be considered opaque.

If we want to know what has been added to the result already, then the appropriate way to do it is with state of our own. For instance, the dedupe transducer uses a volatile value to save the last thing appended.

Take 2

Mixing in the operation of fizzbuzz4 with the transducer operation, we get:

(defn fizzbuzz*
  []
  (fn [rf]
    (fn
      ([] (rf))
      ([result] (rf result))
      ([result input]
       (let [fb (cond-> []
                  (zero? (mod input 3)) (conj :fizz)
                  (zero? (mod input 5)) (conj :buzz))]
         (if (seq fb)
           (reduce rf result fb)
           (rf result input)))))))

It would be possible to use the existing fizzbuzz4 function and reduce the returned value into the result, but this approach avoids the creation of that wrapping vector.

The first few lines of this transducer are all boilerplate, which is always an unfortunate thing, but it is not too onerous here. The interesting work all occurs in the final 2-arity function. First of all, the fb value is set to an empty vector which is then appended to if the input is divisible by 3 and/or 5. If this fb vector is not empty, then each of the values are put into the result with the reducing function (via reduce), or else the reducing function is used to add the current input number.

How can this be used? Like any other reducer, it can be comp-ed into a transducer pipeline, or else it can be given to a function that provides a transducer context:

(transduce (fizzbuzz*) conj [] (range 1 17))
(into [] (fizzbuzz*) (range 1 17))
(sequence (fizzbuzz*) (range 1 17))
(eduction (fizzbuzz*) (range 1 17))

The first 2 will provide a vector with the correct output, the third (sequence) will create a lazy sequence, and the final one (eduction) will create an object that can be iterated on. All of them will result in the desired seq.

Truncating Results

Recall that I started this because I wanted to create a reduction operation that terminated early. So how is that managed?

Transducers can terminate early by calling reduced on the result being returned from the step function. This indicates to the calling process that it must not process any more data, and call the completion function.

Let's return to Justin's reduce-true function. This operates just like a reduce operation, unless the reduction function returns a false value at any point, in which case it will immediately return that value with no further processing.

To do this we need to know what data the reducing function generates before adding it to the result, since we are not able to look at the result value. We also don't know what the reducing function will look like, since different contexts can use a different function.

I decided to opt for the approach used by dedupe which keeps some state using a volatile value. This gives us the option of checking the result when invoking the user-supplied function and setting an exit flag if it is nil. When this happens, the step function can use reduced on the result to indicate the end of the reduction.

So this is my approach:

(defn reduce-true
 [f val coll]
 (let [end (volatile! false)
       terminating (fn [rf]
                     (fn
                       ([] (rf))
                       ([result] (rf result))
                       ([result input]
                        (let [nxt (rf result input)]
                          (if @end
                           (reduced nxt)
                           nxt)))))
       ff (fn
           ([a] a)
           ([a b]
            (let [r (f a b)]
              (when-not r (vreset! end true))
              r)))]
  (transduce terminating ff val coll)))

The first line of the let creates the termination flag and initializes it to false. The terminating value is then set to our terminating transducer.

We can see here that the step function gets the result of calling the reducing function, and then checks the flag to determine if the final result needs to have reduced called or not. If this reduced function gets called, then the process will finish.

Next, we can see that the user-supplied reducing function is wrapped in a function called ff that will set this exit flag if the user-supplied function ever returns a false value. A single arity function is also provided for cleanup, though this is just a no-op.

With this transducer set up, we can now do a terminating reduction using transduce, starting with the user-supplied val and processing the provided collection coll.

Wrap Up

Is this better?

For the provided example, I would say no. Justin's reduce-true function is clearer in operation, despite revealing the mechanism for processing the collection. Given that the loop/recur construct of this function has no construction overhead, then there won't be any performance benefit either.

An approach like this transducer-based function may be more flexible and integrate with other transducers a little better (if the function returns the transducer directly and does not call transduce itself). But for the provided application there isn't any benefit.

However, I now have a better idea of where things go when building my own transducer. When I first started playing with this, I was focused on the transduce function, which gave me the wrong idea of what the reduction function would be and the datatype of the result arguments. Writing this post helped me fill in some of the edges of my knowledge that I hadn't realized were lacking.

So now I know how to avoid mapcat and how to create early-terminating sequence processing. That's a little more about transducers than I was comfortable with when I started. πŸ™‚

Happy New Decade everyone!

Discussion (0)