DEV Community

Paula Gearon
Paula Gearon

Posted on

Seqing an Answer

Clojure List Structures

I recently needed to process data structures in Clojure, handling "List-Like" structures one way, and "Vector-Like" another. I ran into an issue with this and thought it might be worth exploring this properly, rather than the 95% solution that I've applied in the past.

The following may be a bit dense, but the purpose here is to clarify this process for me and to leave notes to refer back to in the future. Everyone else is free to read, but please don't expect much of it! 🙂

Despite being a dynamically typed language with very little explicit type information, Clojure is built on types and often runs on the Java Virtual Machine (JVM) which is an implicitly typed system. Also, one of Clojure's strengths that sets it apart from other variants of Lisp is that it has a strong set of functional data structures as a core part of the language implementation and syntax. For JVM-targetted Clojure, these structures participate fully in the type system.

I only have a passing familiarity with Common Lisp (CL), but I know a little more about Scheme, having learned it while going through the SICP text and lectures from MIT. (Incidentally, I highly recommend these lectures for learning the principles of functional programming (FP), but especially for anyone who writes Clojure). While I like these languages, Clojure's inclusion of vectors, sets, maps, and lazy structures sets it apart from other Lisps. I particularly like the simple syntactic form for each them:

  • (1 2 3 4) is a list of 4 numbers
  • [1 2 3 4] is a vector of 4 numbers
  • #{1 2 3 4} is a set of 4 numbers
  • {1 2 3 4} is a map with 2 entries, 1 mapping to 2, and 3 mapping to 4.

However, the list form is different from the others in that evaluating it will execute code, while the other forms will evaluate to themselves. For those who know Clojure and other Lisps, evaluating a list will treat the first element as a function that will be applied with the remainder of the list as its arguments. The list (1 2 3 4) will, therefore, lead to an error, as 1 is not a function. Instead, if the symbol + were put in place at the start of the list, then it would evaluate by adding all the digits, giving 10.

#_=> (1 2 3 4)
Execution error (ClassCastException) at user/eval1 (REPL:1).
class java.lang.Long cannot be cast to class clojure.lang.IFn (java.lang.Long is in module java.base of loader 'bootstrap'; clojure.lang.IFn is in unnamed module of loader 'app')
#_=> (+ 1 2 3 4)
10
Enter fullscreen mode Exit fullscreen mode

We can also prevent a list from being executed by quoting it, and then save it to evaluate later using the eval operation:

#_=> (def addition-list '(+ 1 2 3 4))
#'user/addition-list
#_=> addition-list
(+ 1 2 3 4)
#_=> (eval addition-list)
10
Enter fullscreen mode Exit fullscreen mode

This should be elementary for anyone familiar with Clojure, but it then leads to the question… what is a list in Clojure?

In Scheme, a list is built from a series of pairs-of-values, or tuples of size 2, called a "Cons" cell. These form a "Linked List", and it is this data structure that Lisp both manipulates, and is syntactically built from. This, along with the syntactic use of parentheses leads to the names: List Pprocessing, or alternatively, Lots of Insane, Superfluous Parentheses.

The fact that the language is built out of these structures means that all code is a valid data structure. But which data structures form valid code?

Evaluable Structures

While Scheme uses a linked list built from cons-cells to represent executable code, Clojure has other options. Building lists from cons-cells does actually work:

#_=> (def cons-list (cons '+ (cons 1 (cons 2 (cons 3 (cons 4 nil))))))
#'user/cons-list
#_=> cons-list
(+ 1 2 3 4)
#_=> (eval cons-list)
10
Enter fullscreen mode Exit fullscreen mode

I am using a quote before the + so that the symbol is stored rather than the full function, since it looks nicer to print, even though it evaluates the same way.

But this is not the same type of list as the one we used earlier:

#_=> (type cons-list)
clojure.lang.Cons
#_=> (type addition-list)
clojure.lang.PersistentList
Enter fullscreen mode Exit fullscreen mode

I like using type instead of class, though class would probably be more strictly correct when I know it's a class.

We may also have a list-like structure that only gets evaluated when you need it:

#_=> (def concat-list (concat '(+) (range 1 5)))
#'user/concat-list
#_=> concat-list
(+ 1 2 3 4)
#_=> (type concat-list)
clojure.lang.LazySeq
#_=> (eval concat-list)
10
Enter fullscreen mode Exit fullscreen mode

This list looks the same, and evaluates to the same value, but is another data type.

So far, we've seen Cons based lists, PersistentLists and LazySeqs, all being evaluated to execute their contents. We also know that vectors evaluate to themselves:

#_=> (def addition-vector ['+ 1 2 3 4])
#'user/addition-vector
#_=> addition-vector
[+ 1 2 3 4]
#_=> (eval addition-vector)
[#object[clojure.core$_PLUS_ 0x777c9dc9 "clojure.core$_PLUS_@777c9dc9"] 1 2 3 4]
Enter fullscreen mode Exit fullscreen mode

Note that the + symbol has been evaluated to the value of the function object. As I mentioned earlier, these aren't very pretty to read.

From simple observation, it appears that the main thing that determines if a Clojure expression will be executed on evaluation is if the REPL prints it in parentheses. So how can we find this programmatically? (Aside from using pr-str and parsing the resulting string).

One possibility might be to use the list? function:

#_=> (list? addition-list)
true
#_=> (list? addition-vector)
false
Enter fullscreen mode Exit fullscreen mode

However, this does not work for the other types we've seen:

#_=> (list? cons-list)
false
#_=> (list? concat-list)
false
Enter fullscreen mode Exit fullscreen mode

Seq

"Way back when" I read Stuart Halloway's book Programming Clojure (1st Ed.), it introduced us to the fundamentals of the language before quickly diving into a chapter on Sequences.

Note: Programming Clojure was updated to a 2nd edition by Aaron Bedra in 2012, and to the 3rd edition by Alex Miller in 2018.

Stu explains how the fundamental abstraction in Clojure is the sequence or seq (pronounced "seek"). Much of the core library is based on this abstraction, taking seqs as arguments and/or returning them. All of the built-in data structures in Clojure can be treated as a seq, including maps, where the elements of the sequence are the key/value pairs in the map.

Seqs have 3 main capabilities:

  • You can get the first item with first
  • You can get a seq of all items after the first item with rest
  • You can add something to the beginning of a seq using cons

Another important function is seq which will convert an object to a seq, and returning nil (a valid seq) when the sequence is empty. In practice, this function is mostly used to check if a sequence type has elements in it since nil will evaluate as the equivalent of false, and otherwise the function returns something that behaves like the original structure.

Importantly, Stu specifically mentions that seqs are always printed using parentheses. Could this be the thing that indicates that an object will be executed on evaluation?

One thing that I think led to some confusion for me was the comment on page 93 of this edition:

You can treat vectors as seqs:
(first [1 2 3])
1
(rest [1 2 3])
(2 3)
(cons 0 [1 2 3])
(0 1 2 3)

This suggests that vectors are seqs, and indeed, I've used them for 10 years as if they are. Almost every function that accepts objects that are abstractly seqs will effectively call seq for you, but this step is kept hidden from the user. Checking if something needs to be converted to a seq is not something the programmer usually has to do. A useful exception is when testing if a collection is not empty. Indeed, the definition of empty? uses this and even states it in the docstring:

(defn empty?
  "Returns true if coll has no items - same as (not (seq coll)).
  Please use the idiom (seq x) rather than (not (empty? x))"
  {:added "1.0"
   :static true}
  [coll] (not (seq coll)))
Enter fullscreen mode Exit fullscreen mode

A Taxonomy

Clojure includes several functions to identify sequence types:

  • list? - Is the argument a list type?
  • vector? - Is the argument a vector type?
  • sequential? - Is the argument a sequential type?
  • coll? - Is the argument a collection type?
  • seqable? - Can seq be called on this type?
  • seq? - Is the argument a seq type?

But which is appropriate and in which circumstances? What is the difference between seq? and sequential?? Does seq? just return true for anything that can be validly called with the seq function?

My presumption to that last question was "yes", yet it's actually the opposite. Which is why I decided to write this.

The best way to figure out what is happening with these functions is to look:

#_=> (source list?)
(defn list?
  "Returns true if x implements IPersistentList"
  {:added "1.0"
   :static true}
  [x] (instance? clojure.lang.IPersistentList x))
#_=> (source vector?)
(def
 ^{:arglists '([x])
   :doc "Return true if x implements IPersistentVector"
   :added "1.0"
   :static true}
 vector? (fn ^:static vector? [x] (instance? clojure.lang.IPersistentVector x)))
#_=> (source sequential?)
(defn sequential?
 "Returns true if coll implements Sequential"
 {:added "1.0"
  :static true}
  [coll] (instance? clojure.lang.Sequential coll))
#_=> (source coll?)
(defn coll?
  "Returns true if x implements IPersistentCollection"
  {:added "1.0"
   :static true}
  [x] (instance? clojure.lang.IPersistentCollection x))
#_=> (source seqable?)
(defn seqable?
  "Return true if the seq function is supported for x"
  {:added "1.9"}
  [x] (clojure.lang.RT/canSeq x))
#_=> (source seq?)
(def
 ^{:arglists '([x])
   :doc "Return true if x implements ISeq"
   :added "1.0"
   :static true}
 seq? (fn ^:static seq? [x] (instance? clojure.lang.ISeq x)))
Enter fullscreen mode Exit fullscreen mode

This is a bit verbose, so let's simplify them to something that is basically equivalent:

(defn list? [x]
 (instance? clojure.lang.IPersistentList x))

(defn vector? [x]
 (instance? clojure.lang.IPersistentVector x))

(defn sequential? [coll]
 (instance? clojure.lang.Sequential coll))

(defn coll? [x]
 (instance? clojure.lang.IPersistentCollection x))

(defn seqable? [x]
 (clojure.lang.RT/canSeq x))

(defn seq? [x]
 (instance? clojure.lang.ISeq x))
Enter fullscreen mode Exit fullscreen mode

So with the exception of seqable? (which I will look at later) we're looking for instances of classes and interfaces. We can also see most of the interfaces begin with the letter I, since this is a common naming approach in the Clojure sources, though there are a few interfaces that don't do this. Describing multiple interfaces like this raises the question, what is the taxonomy of these objects? We can start looking using some reflection:

#_=> (type addition-list)
clojure.lang.PersistentList
#_=> (supers (type addition-list))
#{java.util.List clojure.lang.ASeq java.lang.Iterable clojure.lang.IPersistentList clojure.lang.ISeq clojure.lang.Counted clojure.lang.IPersistentStack clojure.lang.IReduceInit java.io.Serializable clojure.lang.IHashEq java.util.Collection clojure.lang.IReduce clojure.lang.IPersistentCollection clojure.lang.Obj clojure.lang.Seqable clojure.lang.Sequential java.lang.Object clojure.lang.IMeta clojure.lang.IObj}
Enter fullscreen mode Exit fullscreen mode

I'm going to ignore the interfaces that are related to operations that are not related to collections. So that includes IHashEq, Obj, and even IReduce (an operation that is specifically for collections, but not related to the taxonomy of collections).

Vectors have similarly large families:

#_=> (type addition-vector)
clojure.lang.PersistentVector
#_=> (supers (type addition-vector))
#{clojure.lang.Indexed java.util.List java.lang.Runnable java.lang.Iterable clojure.lang.AFn clojure.lang.Counted clojure.lang.IPersistentStack clojure.lang.IReduceInit java.io.Serializable clojure.lang.IHashEq java.util.Collection clojure.lang.IReduce clojure.lang.ILookup clojure.lang.IKVReduce clojure.lang.IPersistentCollection java.util.RandomAccess clojure.lang.IEditableCollection clojure.lang.APersistentVector clojure.lang.Seqable clojure.lang.Sequential java.lang.Object clojure.lang.IPersistentVector java.util.concurrent.Callable clojure.lang.Reversible clojure.lang.IMeta clojure.lang.IObj clojure.lang.Associative java.lang.Comparable clojure.lang.IFn}
Enter fullscreen mode Exit fullscreen mode

This is getting a lot to understand, so let's look at it visually.

Unbalanced tree

The 4 concrete classes I've included are described by the grey boxes. The thicker boxes (ASeq and APersistentVector) are for abstract classes that implement some of the behavior for their associated interface.

Following this diagram, we can quickly see that:

  • vector? only applies to vectors.
  • sequential? applies to every type of Clojure sequence.
  • list? only applies to list types (the only one in this diagram is PersistentList. PersistentQueue also implements this).
  • coll? applies to all of the Clojure collections.
  • seqable? is not in the diagram since it does not test if something is of type Seqable.
  • seq? applies to all of the list-like types, and not vectors.

Evaluation

Each of the types that implement ISeq meets this "list-like" requirement that appears to represent an operation to execute a function. So is this interface what indicates that a function should be executed?

We can check this out by looking at what happens when eval is called on an ISeq object.

#_=> (source eval)
(defn eval
  "Evaluates the form data structure (not text!) and returns the result."
  {:added "1.0"
   :static true}
  [form] (. clojure.lang.Compiler (eval form)))
Enter fullscreen mode Exit fullscreen mode

This is using Java interop and since Java code is much more verbose I will link to the sources while describing what happens. The eval function is calling the method which we can find at clojure.lang.Compiler.eval(Object). This goes straight into the eval(Object,boolean) method, which then branches, depending on the type of form that was presented.

Line 7170 will identify that the argument is an IPersistentCollection, and it's not a def expression. This calls the analyze method on line 6748 to get back an ObjExpr object that can have eval() called on it.

At this point, we are hoping to see that analyze will create an object that will execute code if the type is an ISeq or evaluate to itself for anything else.

The first thing that analyze does is to look to see if the form is a LazySeq and if it is, then it gets replaced with a realized version. It then compares the type of form to various types (skipping the IPersistentCollection match on line 6777 because the count of the collection is non-zero), until it matches ISeq on line 6788. At this point, the Expr to be returned comes from calling analyzeSeq(C,ISeq,String). We can see immediately after (line 6791), that an IPersistentVector form will instead call VectorExpr.parse.

The code in analyzeSeq looks for various things like metadata, macro-expansion, fn expressions, and so on, but for a simple ISeq it will get down to line 7109, where it calls InvokeExpr.parse(C,ISeq). The name here tells us that it about invoking something, but we can have a quick look at the code to be sure.

Most of the parse method looks for various forms, but in the general case it identifies the function by evaluating the first expression in the sequence, and the args will be everything after it. These are then used to create a new InvokeExpr object, which saves those values. We already saw back in eval(Object,boolean) that this object is about to have eval() called on it, and by going there we can see that the function object is being applied to the arguments on line 3702.

For anyone who kept up with that, we can now be satisfied that it's when we call eval on an object of type ISeq that code is evaluated. This is exactly what we expected to see. Along the way, we saw that vectors are evaluated using their own strategy.

Non-Seqs

What about other data types that don't implement the ISeq interface? What happens when we treat these objects as seqs?

Consider calling first on a vector. The source for first is:

#_=> (source first)
(def
 ^{:arglists '([coll])
   :doc "Returns the first item in the collection. Calls seq on its
    argument. If coll is nil, returns nil."
   :added "1.0"
   :static true}
 first (fn ^:static first [coll] (. clojure.lang.RT (first coll))))
Enter fullscreen mode Exit fullscreen mode

This calls into the clojure.lang.RT.first(Object) function at line 690. In that source file we can also see the next(Object) function defined right after.

static public Object first(Object x){
  if(x instanceof ISeq)
    return ((ISeq) x).first();
  ISeq seq = seq(x);
  if(seq == null)
    return null;
  return seq.first();
}

static public ISeq next(Object x){
  if(x instanceof ISeq)
    return ((ISeq) x).next();
  ISeq seq = seq(x);
  if(seq == null)
    return null;
  return seq.next();
}
Enter fullscreen mode Exit fullscreen mode

This means that the appropriate methods will be called on anything that is already an ISeq. For anything else, it calls the seq method to convert the object to an ISeq. This happens to be the same method that the Clojure seq function calls:

(def
 ^{:arglists '(^clojure.lang.ISeq [coll])
   :doc "Returns a seq on the collection. If the collection is
    empty, returns nil.  (seq nil) returns nil. seq also works on
    Strings, native Java arrays (of reference types) and any objects
    that implement Iterable. Note that seqs cache values, thus seq
    should not be used on any Iterable whose iterator repeatedly
    returns the same mutable object."
   :tag clojure.lang.ISeq
   :added "1.0"
   :static true}
 seq (fn ^:static seq ^clojure.lang.ISeq [coll] (. clojure.lang.RT (seq coll))))
Enter fullscreen mode Exit fullscreen mode

So calling:
(first [1 2 3])
Is exactly equivalent to calling:
(first (seq [1 2 3]))

The implementation of clojure.lang.RT.seq(Object) is:

static public ISeq seq(Object coll){
  if(coll instanceof ASeq)
    return (ASeq) coll;
  else if(coll instanceof LazySeq)
    return ((LazySeq) coll).seq();
  else
    return seqFrom(coll);
}

// N.B. canSeq must be kept in sync with this!
static ISeq seqFrom(Object coll){
  if(coll instanceof Seqable)
    return ((Seqable) coll).seq();
  else if(coll == null)
    return null;
  else if(coll instanceof Iterable)
    return chunkIteratorSeq(((Iterable) coll).iterator());
  else if(coll.getClass().isArray())
    return ArraySeq.createFromObject(coll);
  else if(coll instanceof CharSequence)
    return StringSeq.create((CharSequence) coll);
  else if(coll instanceof Map)
    return seq(((Map) coll).entrySet());
  else {
    Class c = coll.getClass();
    Class sc = c.getSuperclass();
    throw new IllegalArgumentException("Don't know how to create ISeq from: " + c.getName());
  }
}
Enter fullscreen mode Exit fullscreen mode

This looks at all the known variations for a sequence and wraps it in an appropriate object. In the case of a vector, it will match the Seqable interface, and call the seq method there. The implementation of this just creates a wrapper around the vector to return it as an ISeq:

public IChunkedSeq chunkedSeq(){
  if(count() == 0)
    return null;
  return new ChunkedSeq(this,0,0);
}

public ISeq seq(){
  return chunkedSeq();
}
Enter fullscreen mode Exit fullscreen mode

Did anyone notice the comment about the canSeq method? This is the method that the core function seqable? is invoking. Recalling how this was the only function that was not checking if an object was an instance of a known interface, it seems like this is the time to look at the implementation.

static public boolean canSeq(Object coll){
    return coll instanceof ISeq
            || coll instanceof Seqable
            || coll == null
            || coll instanceof Iterable
            || coll.getClass().isArray()
            || coll instanceof CharSequence
            || coll instanceof Map;
}
Enter fullscreen mode Exit fullscreen mode

This looks for Seqable (as I would have expected) but also includes:

  • null - so that nil can be treated as an empty seq.
  • Java arrays - wrapped in ArraySeq.
  • java.lang.Iterable - picks up anything that Java can iterate over. Uses a LazySeq over ChunkedCons to wrap the Iterable in an efficient way.
  • java.lang.CharSequence - Java strings (which are natively strings in Clojure), StringBuffer, StringBuilder, and so on. Uses StringSeq to wrap the sequence of chars.
  • java.util.Map - So that Java maps can be iterated over just like Clojure maps.

Many of these are particularly useful for Java interop as they allow seq or into to be called to convert a Java collection object into something usable by Clojure.

I wasn't really aware of this function before, but in looking at the source for seqable? I see that it was introduced in Clojure 1.9 (for use by spec), along with a number of other predicates.

Seq Summary

If something is already a seq, then it can be treated as a seq with no extra steps. Otherwise, it will be wrapped in a simple class that presents that data as a seq. In this way, everything that looks like a collection can be represented the same way, without the user needing to know how it's done.

We've also established that anything that looks like a seq can have eval called on it to execute a function.

To show this, we can create a seqable object with a function and arguments, convert it to a seq, and then call eval on it.

#_=> (def java-array
       (doto (object-array 3)
             (aset 0 '+)
             (aset 1 2)
             (aset 2 3)))
#'user/java-array
#_=> java-array
#object["[Ljava.lang.Object;" 0x4a55a6e8 "[Ljava.lang.Object;@4a55a6e8"]
#_=> (seq java-array)  ;; inspect the contents
(+ 2 3)
#_=> (eval (seq java-array))
5
#_=> (eval (seq [* 6 7]))
42
Enter fullscreen mode Exit fullscreen mode

Oldest comments (2)

Collapse
 
gneissguise profile image
Justin Frost

Thank you for the deep dive into Clojure collections and the ISeq interface! This is really helpful for understanding how seqable collections work.

Collapse
 
kapitaali profile image
Teppo Saari

This is actually very helpful, because in Scheme they don't have classes as such, and everything is just a list. So you can cons, car and cdr everything.

I keep frequently running into casting errors with Clojure.