Language Features: Best and Worst

Andrew (he/him)

I'm interested in building my own programming language and I want to know: what are your most-loved and most-hated features of any programming languages?

Here are a few of mine:


When I create my own language (soon, hopefully), I would love for it to emulate R's paradigm where scalars are just length-1 vectors:

> x <- 3
> length(x)
[1] 1
> x <- c(1,2,3)
> length(x)
[1] 3

...this means (as you can see above) that you can use methods like length() on scalars, which are actually just length-1 vectors. I would like to extend this to matrices of any dimensionality and length, so that every bit of data is actually an N-dimensional matrix. This would unify data processing across data of any dimensionality. (Though performance would take a hit, of course.)
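As a sketch, that unification might look like this in Python (a toy `Val` type with R-style recycling; purely illustrative, not R's actual implementation):

```python
# Hypothetical sketch: every value is a vector, so a "scalar" is just a
# length-1 vector and the same operations work uniformly on both.
class Val:
    def __init__(self, *items):
        self.items = list(items)

    def __len__(self):
        return len(self.items)

    def __add__(self, other):
        # element-wise addition, recycling the shorter operand R-style
        n = max(len(self), len(other))
        return Val(*(self.items[i % len(self)] + other.items[i % len(other)]
                     for i in range(n)))

x = Val(3)            # a "scalar"
v = Val(1, 2, 3)      # a vector
print(len(x))         # 1
print(len(v))         # 3
print((x + v).items)  # [4, 5, 6] -- the scalar recycles across the vector
```

The same `Val` machinery would generalize to N-dimensional data by storing a shape alongside the items.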


I also love Haskell's Integer type, which is an easy-to-use, infinite-precision integer:

ghci> let factorial n = product [1..n]

ghci> factorial 100
93326215443944152681699238856266700490715968264381621468592963895217599993229915608941463976156518286253697920827223758251185210916864000000000000000000000000

Java has BigInteger and BigDecimal which are arbitrary-precision integers and floating point numbers, respectively. Since a user never enters an infinite-precision floating point number, it should be possible to keep track of the numbers entered and used, and only round / truncate the result when the user prints or exports the data to a file. You could also keep track of significant digits and use that as the default precision when printing.

Imagine, for instance, that instead of calculating x = 1/9 and truncating the result at some point to store it in the variable x, you instead keep a record of the formula which was used to construct x. If you then declare a variable y = x * 3, you could either store the formula in memory as y = (1/9) * 3 or, recognize that the user entered 3 and 9 as integers, and simplify the formulaic representation of y internally to 1/3.
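Python's stdlib `fractions.Fraction` already behaves this way, so the idea can be sketched directly (a minimal illustration, not a proposal for the new language's syntax):

```python
# Keep 1/9 as an exact ratio instead of a truncated float; multiplying by 3
# simplifies the internal representation to 1/3 with no precision lost.
from fractions import Fraction

x = Fraction(1, 9)
y = x * 3
print(y)         # 1/3
print(float(y))  # rounding happens only when a decimal is actually requested
```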

(The way I see it, if this were implemented in a programming language, it would mean that there's no such thing as a "variable", really. Every variable would instead be a small function, which is called and evaluated every time it is used.)

Forgoing that simplification, you could have y refer to x whenever it's calculated and implement a spreadsheet-like functionality, where updating one variable can have a ripple effect to other variables. When print-ing the variable, you could display the calculated value, but when inspect-ing it, you could display the formula used to calculate it. Or something.
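A minimal sketch of that thunk idea in Python (hypothetical names; a real implementation would hide the lambdas behind syntax):

```python
# A "variable" is a zero-argument function: its formula is re-evaluated on
# every use, so updating an upstream value ripples downstream.
cells = {"x": 1 / 9}
y = lambda: cells["x"] * 3   # y stores its formula, not a computed value

print(y())       # evaluated on demand from the current value of x
cells["x"] = 2   # update the upstream "cell"
print(y())       # 6 -- y is recomputed, spreadsheet-style
```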


Finally, one language feature which I would never wish to emulate is Java's number hierarchy.

In Java, there are primitives like int, float, boolean, which are not part of Java's class hierarchy. These are meant to emulate C's basic data types and can be used for fast calculations. They are some of the only types not descended from Java's overarching Object class. As Java does not support operator overloading, the arithmetical operations +, -, *, /, and so on are only defined for primitives (and + is overloaded internally for Strings). So if you want to do any math, you need one of these primitive types... got it?

Well, Java also has wrapper classes for each of the primitive types: Integer, Float, Boolean, and so on. (Note also that int is wrapped by Integer and not Int. Why? I don't know. If you do, please let me know in the comments.) These wrapper classes allow you to perform proper OOP with numbers. Java "boxes" and "unboxes" the number types automatically so you don't have to convert Integer to int manually to do arithmetic.

It gets worse! The number class hierarchy in Java is flat, meaning that all of the numeric classes (Byte, Integer, Double, Short, Float, Long) descend directly from Number, and Number implements no methods to do anything other than convert a given Number to a particular primitive type. This means that if you want to, for instance, do something as simple as define a method which finds the maximum of two Numbers, you need to first convert each number to a Double and then use Double.max() for comparison. You need to convert to Double so you don't lose precision converting to a "smaller" type (assuming you're not also accepting BigIntegers or BigDecimals, which makes this even more complicated). Number (for the love of god) doesn't even implement Java's Comparable interface, which means you can't even compare a Float to a Double without Java unboxing them to primitives, implicitly casting the float to a double and then performing the comparison.

I wouldn't wish Java's Number hierarchy on my worst enemy.

Latest comments (66)

Reece Dunham

Lambdas are the best

Hunter Drum

No semicolons.... That is all

A built in GUI library like JavaFX would be awesome too! And maybe Go's error handling style. As well as Python's strict policy on clean code

Anton

Pattern matching is a must.
Haskell like syntax (small but expressive).

Fred Ross

R conflating length 1 vectors and scalars is something to avoid. MATLAB does the same thing, and it was a bad idea there, too. Perl had the same error where it autoflattened an array of arrays unless you carefully inserted reference marks. So much of programming is getting data into the appropriate structure, and anything that gets in the way is a problem.

Boxing and unboxing is much more complicated than you would first think because of arrays. Say I have a class A with a subclass B. I can put an instance of B in an array of A. But when you say double[], you probably want a contiguous hunk of memory directly storing values. C++ fully exposes this semantic difference. Julia began its type system by insisting on unboxed arrays of doubles. Java compromised between making numeric computing not ridiculously inefficient and not complicating its semantics enormously.

Your gripe about Number not having any methods is spot on, and is why Stepanov invented the math that led to the Standard Template Library in C++ and part of why Common Lisp went with multiple dispatch in CLOS. This is a ubiquitous problem with single dispatch object systems.

Andrew (he/him)

How does it work in MATLAB and Perl? In R, they're pretty upfront about the fact that there's really no such thing as a scalar, but you can emulate it with a length-1 vector.

I'll definitely have to think about how I want data laid out for matrices and things. Lots of things to consider when you want to balance performance and syntax, etc.

What are your opinions on single vs. multiple dispatch mechanisms?

Fred Ross

MATLAB does the same thing as R (or really, historically, R does the same thing as MATLAB): scalars are length-1 vectors. Perl 5 doesn't do that, but if you write (1, 2, (3, 4)), which in most languages would give you a list of length three containing two scalars and another list, you instead get a flattened list of length four.

These choices make certain programming tasks more straightforward at the expense of all other programming tasks. Which is fine if you know the language is meant for just those tasks.

For single vs. multiple dispatch, multiple dispatch is the clear winner in all respects except familiarity.

For balancing syntax, I remember the BitC folks saying that their best decision was using S-expressions for syntax until the language semantics stabilized, because they ended up changing deep things that would have been a real pain if they had a more structured syntax. Then they built a syntax besides S-expressions after the semantics stabilized.

Adam Marshall

Both Perl 6 and Common Lisp have a concept of Rational Number, which exactly preserves the ratio of two integers. They also boast accurate floating point math. Perl 6 is relatively new -- and not much like Perl 5, if you have heard bad things about the latter.

Jesse Phillips

Well, much of what is in D. Mostly it would be nice to have different defaults.

  • static typing
  • templates
  • compile time evaluation
  • C calling convention support

On the other hand there is Lua. It embeds nicely into D, but there are things I don't like.

  • 1 based indexing
  • blocks using words over brackets
  • lack of ranges, iteration isn't very nice.
Andrew (he/him)

Ranges and slices are two things I definitely want to implement from the get-go.

1-based indexing isn't even on the table!

Larry Foobar • Edited

1) A native not operator that can be prepended to any bool expression.

if not contains() { ... }

A simple ! is so unremarkable and easy to miss when reading. But using false == ...() is ugly.

2) defer like in golang. And an option to defer a loop, not only a function, so an action can be executed right after break (not just after return).

3) a crazy idea, but I'm always thinking about having a break-from-if operator. When you have complex if-logic, this option lets you make it easier, flatter, and therefore more readable.

Andrew (he/him)

Literally today I tried to break from an if in Java and got a compiler error. It would be a super useful addition.

Isaac Yonemoto

For your first suggestion, I would seriously give this talk a watch, and learn from the very well-fought struggles of another PL:

youtube.com/watch?v=C2RO34b_oPM

I think julia's typesystem is quite fascinating, because it's used in a totally different way from pretty much every other PL.

Elixir is a language that really optimizes for programmer joy, one of the best PL features is the pipe operator.

I can do this:

result = value
|> IO.inspect(label: "value")
|> function_1
|> IO.inspect(label: "result 1")
|> function_2(with_param)
|> IO.inspect(label: "result 2")
|> function_3
|> IO.inspect(label: "result")

instead of:

value
println("value: $value")
r1 = function_1(value)
println("result 1: $r1")
r2 = function_2(r1, with_param)
println("result 2: $r2")
result = function_3(r2)

Indispensable for easy-to-read code and println debugging.

Andrew (he/him)

That is pretty neat. Thanks for the suggestions! I'll check out that video ASAP.

Validark

A difference between method syntax and calling a member function. Enums that are not numbers or strings (unless you say so), like enum class in C++. If I were making a language I would consider the idea of minimizing keywords by using built-in function calls instead. Just a thought.

Andrew (he/him)

Can you give an example of what you don't want to see vs. what you want to see? I'm not sure I understand.

Validark

In JavaScript, a function call implicitly passes this depending on the call site. In Lua there is a difference between a method call and a function being called from a namespace. This is accomplished through syntactic sugar.

In Lua:

Object:Move(1) is equivalent to
Object.Move(Object, 1)

This function can be declared like so:

function Object:Move(Amount)

The syntactic sugar puts a self as the first parameter. You could do this yourself as well:

function Object.Move(self, Amount) (you could name self whatever you want this way)

That means method calls can be localized and called later if necessary, because self is actually a parameter:

local f = Object.Move
f(Object)

You can't do this in JS, but it isn't just that. There isn't a need to implicitly pass this into namespaces.

The other solution could be to wrap localized functions with an implicit this, but that would make each instance of a class have a unique version of each method.
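Python works much like the Lua scheme described here: a method is just a function in the class namespace whose first parameter is the receiver, so it can be detached and called with self passed explicitly (toy `Object`/`move` names):

```python
class Object:
    def __init__(self):
        self.pos = 0

    def move(self, amount):
        self.pos += amount

obj = Object()
obj.move(1)       # sugar for Object.move(obj, 1)

f = Object.move   # detach the plain function from the class namespace
f(obj, 2)         # pass the receiver explicitly, like Lua's Object.Move
print(obj.pos)    # 3
```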

I would recommend using a data-oriented approach under the hood. Instances of classes wouldn't hold all their data next to each other; an instance of a class would just be an integer, the index at which its data can be retrieved from a set of arrays that each hold a single member of every instance. This is smarter design because it is much more common to iterate through one particular property of all instances, and with the way CPU caching works you can load up just the right array and use only that data, without fetching unnecessary data you aren't using. Ex:

Names = ["First", "Second"]
IDs = [1, 2]

new Class() // reference to index 0 in each of these arrays

new Class() // reference to index 1 in each of these arrays

(Obviously this example is a simplification)
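A hedged Python sketch of that struct-of-arrays layout (hypothetical `new_instance` helper; a real runtime would do this invisibly behind normal class syntax):

```python
# An "instance" is just an index into parallel arrays, one array per field.
names = []
ids = []

def new_instance(name, id_):
    names.append(name)
    ids.append(id_)
    return len(names) - 1   # the instance IS this index

a = new_instance("First", 1)
b = new_instance("Second", 2)

# Iterating one field touches one contiguous array -- cache-friendly.
print([n.upper() for n in names])
print(ids[b])
```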

Andrew (he/him)

Thanks for all the advice! All of these are good points which I'll definitely have to consider when designing my language.

Cameron Martin • Edited

One thing that's pretty similar to your "storing the formula used to calculate the number and then calculating the precision on-demand" idea is exact real arithmetic. Several implementations exist for Haskell. One of the downsides of this approach, besides performance, is that equality is undecidable. The best you can do is determine that two numbers are within a certain distance of each other.

Andrew (he/him)

Thanks for the link!

That's true, but it's also true with floating-point numbers in any programming language. Doing something like setting the default precision to the number of significant digits would eliminate this problem, I would think?

If you set x = 3.0 * 0.20 (= 0.60 @ 2 sig digits) and y = 0.599 * 1.0 (= 0.60 @ 2 sig digits) then y and x are equivalent when only significant figures are considered. Doing something like y - x would yield 0.599 * 1.0 - 3.0 * 0.20 = -0.001, which, to 2 significant figures, is zero. That's equality.
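That comparison can be sketched with a hypothetical `round_sig` helper that rounds both sides to two significant figures before testing equality:

```python
import math

def round_sig(value, digits=2):
    """Round to `digits` significant figures (hypothetical helper)."""
    if value == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(value)))
    return round(value, digits - 1 - exponent)

x = 3.0 * 0.20   # 0.60 at 2 significant figures
y = 0.599 * 1.0  # also 0.60 at 2 significant figures
print(round_sig(x) == round_sig(y))  # True: equal once rounded
```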

What do you think?

Christopher Durham

I think the main thing that people look for in a modern "alternative" language is convenience and clarity.

Some specific possibilities:

  • Pick a "host" language and offer first-class interop. Immediate library ecosystem! (If non-idiomatic.)
  • Along the same lines, first-class project build and dependency management. Either integrate with an existing tool or make it as or more convenient than your favorite.
  • REPL. Almost a must-have for quickly understanding a new tool today.
  • LSP host and Jupyter kernel. Between the two you can support every development environment and awesome tooling in O(1) effort.

And the fun part, some anti-features:

  • Syntax overload. You'll pick up a language faster if you don't have to relearn everything.
  • Syntax uncanny valley. If it's too similar to a more popular language, people will just use that instead (or accidentally try to use it instead of yours).
  • Near-miss paradigms. Similar to your Java number hierarchy, adopting a paradigm almost everywhere but making concessions in some corner cases just makes everything feel rougher.

And one more thing: I think there's the most available space around asynchronous-by-default. Play with an async runtime once you're up and running. There's potential there I haven't seen anyone fully hit.

Andrew (he/him)

REPL is a good shout. That will definitely go hand-in-hand with developing the language at the beginning.

Gavin Fernandes • Edited

Imagine, for instance, that instead of calculating x = 1/9 and truncating the result at some point to store it in the variable x, you instead keep a record of the formula which was used to construct x. If you then declare a variable y = x * 3, you could either store the formula in memory as y = (1/9) * 3 or, recognize that the user entered 3 and 9 as integers, and simplify the formulaic representation of y internally to 1/3.

I think Wolfram Alpha does something similar to this, and iirc most Computer Algebra Systems do this as well, or at least they achieve the same goal, possibly with a different implementation.

Also are you going to be using flex and bison for the language? Or is it going to be a PEG parser?

Andrew (he/him)

Also are you going to be using flex and bison for the language? Or is it going to be a PEG parser?

I don't know what any of this means. (Sorry, I'm new to language design.) Can you elaborate?

Validark

A PEG is a parsing expression grammar. It is a way of specifying all valid strings in a language. It looks kinda like this

Statement <- FunctionCall
FunctionCall <- Identifier ( Arguments )

These would be converted into a finite state automaton that can parse a string LL(1), in O(n). It's basically a program that takes one character after another and changes state depending on the character. Each state can accept different characters leading to different states. This way, a state can encompass multiple non-terminals.

Let's say you define a language as having members "hero" and "hello".

MyLang <- hello | hero

The way to parse a given string without backtracking or lookahead is to construct this finite state machine. Opening state accepts an h, leading to state 1. State 1 accepts e, leading to state 2. State 2 accepts l, leading to state 3, or r leading to state 4. State 3 accepts l, leading to state 5, which accepts o, leading to a state which it is valid to end on. State 4 also accepts an o and is valid to end on. In this case you could reuse state 4 for 5 if desired.
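The walkthrough above translates directly into a transition table. A minimal Python sketch (a hand-built dict, not the output of any parser generator):

```python
# States and transitions for the language { "hello", "hero" }, accepting in
# one left-to-right pass with no backtracking or lookahead.
TRANSITIONS = {
    (0, "h"): 1, (1, "e"): 2,
    (2, "l"): 3, (3, "l"): 4, (4, "o"): 5,  # h-e-l-l-o
    (2, "r"): 6, (6, "o"): 5,               # h-e-r-o (reuses accept state 5)
}
ACCEPT = {5}

def matches(s):
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:      # no valid transition: reject immediately
            return False
    return state in ACCEPT     # must end in an accepting state

print(matches("hello"))  # True
print(matches("hero"))   # True
print(matches("help"))   # False
```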

Validark

Bison and Yacc are tools that let you write a grammar and will spit out a finite state machine like I described.

Gavin Fernandes

Alright, as far as I know, there are two general types of parsers: top down and bottom up. Bottom up is more performant, iirc, but is harder to get used to, while top down is less performant (O(n³), if you know what I mean, and iirc), but I hear they're much simpler to work with.

Ohh before I elaborate further, I'll need to know what sorts of systems and tools you're used to working with. What language, OS, compiler?

Andrew (he/him)

In grad school I programmed in C/C++ for 5 years, now I'm mostly Java and R, but I'm teaching myself Haskell, as well. I work on Windows, Ubuntu, and macOS. If you're asking which compiler I use to compile my own code, just gcc or whatever's available. I haven't yet looked into anything for my language project.

I found this article while researching this and it mentions yacc, which I have heard of. I wasn't aware that bison is its successor. I don't know how yacc works, though, I just recognise the name.

Isaac Yonemoto

Before you move forward with your project, I strongly suggest you study Julia (as a replacement for R) and Elixir (as a replacement for Java). They're more modern languages, so they've worked through a lot of issues in their recent, early pre-1.0 phases and have thought, hard and heatedly, about a lot of the things you're looking at. Both are functional, and both are dynamic. (I think strict adherence to staticity is a bit of a religion: Julia kills the idea that static = performance, because of its amazing compiler strategy, and Elixir + dialyzer kills the idea that static = correctness. I basically have no runtime errors with a static typechecking engine, and the rock-solid dynamic stability of OTP makes it well worth it. I'm a relatively inexperienced dev; they just promoted me from junior directly to senior, with a lot of freedom, and in Elixir I wrote a testing harness in three days for a senior coworker's Go program (6 months in the making) that triggered both Linux and Go program panics/errors, while my harness, living on the same node and dispatching and monitoring tens of thousands of concurrent processes, was fine.)

Moreover, both PLs focus on developer productivity and joy. I don't know if that's what you're optimizing for, but I suggest learning from some of the best!

Andrew (he/him)

Thanks, Isaac! I'll have to explore a few more languages before I make a decision one way or the other. I'll definitely check out Julia and Elixir.

Darren Burns • Edited

Features I like:

  • First-class support for package management. When I install a programming language and can't work out within the first hour how to install a 3rd party library I really lose interest. Being able to do something like pip install library_name and for it to just work™ is awesome.
  • Error message output for humans. Elm and Rust spring to mind.
  • Expression-based languages that let you do something like let x = if something { 1 } else { 2 }.
  • Pattern matching (Elixir, Rust, Scala, etc.).
  • Languages that explicitly avoid the concept of exceptions and try/catch. Managing a single means of passing values up to a caller is hard enough, so I like the idea of encoding errors in the return value. Rust does this by way of the Result type. In Elixir, you return a tuple which includes information on whether an error occurred.
  • match statements like in Rust and Scala that support pattern matching. Bonus points if the compiler enforces that matches are exhaustive.
  • Any form of null-safety (e.g. Option types, Elvis operator, etc.). I see a lot of Java code with nested if (variable != null) { ... } checks and find it really hard to read.
  • Solid abstractions around concurrency (Actor model, Goroutines etc, compiler-enforced safety guarantees like those in Rust, etc.)
  • Quality of life features around debugging (for example, in the latest Rust version, you can do dbg!(something) to print out an object and all of its data without having to implement a toString or similar)
  • Native async/await syntax
  • If the language is dynamically typed, some form of type annotation syntax to aid static analysis is really helpful. Python 3 has a typing module which you can use in conjunction with static analysis and I've found it to make code significantly more readable and correct.
  • The ability to compile to a native binary.

Features I dislike:

  • null, NullPointerException, etc.
  • Bloated standard libraries. With a first-class, well supported package management system, “official” libraries could be pulled in when they’re needed.
  • Inconsistent standard libraries (i.e. lack of convention).
  • Excessive verbosity. I’m personally not a huge fan of the do/end syntax in languages like Elixir and Ruby. This is super subjective though and with modern text editors/IDEs it just becomes an aesthetic thing.

I could probably go on all day, but those are the first things that spring to mind!

saint4eva

async/await e.g. C#

Idan Arye

In Elixir, you return a tuple which includes information on whether an error occurred.

But... Elixir has exceptions... Maybe you are thinking of Go?

Darren Burns

It has exceptions, but it's considered more idiomatic to return a tuple

Idan Arye

OK, I see now. It's not Go's abomination like the word "tuple" implies, but more like the dynamically typed version of enum types.

At any rate, I find it a weird design choice to add exceptions and then encourage a different method of error handling. One of the main complaints about C++'s exceptions was that, due to its legacy, it had three types of error handling:

  1. Function returned value + errno.
  2. Setting a field on an object.
  3. Exceptions.

I guess Elixir wanted to go with pattern matching for error handling, but had to support exceptions for Erlang interop?

Isaac Yonemoto • Edited

I guess Elixir wanted to go with pattern matching for error handling, but had to support exceptions for Erlang interop?

No. Elixir inherits from erlang's "let it fail" mentality. The PL itself supports supervision trees, restart semantics, etc. So in some cases, you want to just "stop what the thread is doing in its tracks and either throw it away or let the restart semantic kick in". In those cases, you raise an exception and monitor logs. The failure will be contained to the executing thread. You can then make a business decision as to whether or not you REALLY want to bother writing a handler for the error. Does it happen once in 10 thousand? 10 million? Once in a trillion? The scale and importance of the task will dictate whether or not you need to deal with it.

Other times you might want to raise an exception: 1) when you're scripting short tasks. Elixir lets you create "somewhat out of band tasks". Example: commands attached to your program that create/migrate/drop your databases.

In this case, most failures are 'total failures', and you don't care about the overall stability of the program, since it's "just an out-of-band task". So explicit full fledged error handling is more of a boilerplate burden.

2) when you're writing unit or integration tests. The test harness will catch errors anyways, so why bother with boilerplate. Use exceptions instead of error tuples.

Idan Arye

Yes, exceptions makes sense. I agree with that. Using the returned value for error handling also makes sense - but only if you don't use exceptions. If the language supports exceptions, and not just aborts/panics - actual exceptions you are expected to catch because they indicate things that can reasonably happen and you need to handle - then you can't argue that using the returned value makes everything clear and deterministic and safe and surprises-free - because some function down the call-chain can potentially throw an exception.

So, if that function can potentially throw, you already need to code in exception-friendly way - specifically put all your cleanup code in RAII/finally/defer/whatever the language offers. And if you already have to do this for all cases - why not just use exceptions in all cases and not suffer from the confusion that is multiple error handling schemes?

Isaac Yonemoto • Edited

You don't have to clean up. That's the point. I can't explain it except to say, if you watch enough IT Crowd, you might start to agree that sometimes it's okay to just "turn it off and back on again".

If your system is designed to tolerate thread aborts, it's really refreshing and liberating. Let's say I was making a life-critical application. In the event of a cosmic ray flipping a bit, I would much rather have a system that was architected where, say, of the less important subprocesses just panics and gets automatically restarted from a safe state, with the critical subprocesses still churning along, than a system that brings down everything because it expects to have everything exactly typechecked at runtime.

Idan Arye

There is still cleanup going on. Something has to close the open file descriptors and network connections. You may not need to write the cleanup code yourself, as it happens behind the scenes, but as you have said, you need to design your code in a way that does not jam that automatic cleanup. For example, you must prevent a crashing subprocess from leaving corrupted permanent state that the subprocess launched to replace it won't be able to handle.

One of the main arguments of the "exceptions are evil" movement is that having all these hidden control paths makes it hard to reason about the program's flow, especially cleanup code that needs to run in case of error. But... if you already need to design your program to account for the possibility of exceptions, you are losing the benefit of explicit flow control while paying the price of extra verbosity.

This convention in Elixir to prefer returning a tuple seems to me as more trendy than thoughtful...

Isaac Yonemoto • Edited

You really don't have to worry about it. The VM takes care of it for you. Unlike go, there are process listeners that keep track of what's going on. File descriptors are owned by a process id, and if the id goes down it gets closed.

As an FP language, most stuff is stateless, and in order to use state you have to be very careful about it, so there usually isn't a whole lot of cleanup to do in general. As I said, I wrote some sloppy code in three days as a multinode networked testbench for an internal product, and it was (it had to be) more stable than the code shipped by a senior dev (not in a BEAM language).

There is zero extra verbosity because you write zero lines of code to get these features.

As for the tuples, I wouldn't call it trendy since it's inherited from erlang, which has had it since the 80s.

I think you have been misinformed about elixir or erlang and suggest you give it a try before continuing to make assertions about it.

Idan Arye

You really don't have to worry about it. The VM takes care of it for you. Unlike go, there are process listeners that keep track of what's going on. File descriptors are owned by a process id, and if the id goes down it gets closed.

Yup - higher level languages do that basic stuff for you. But it can't do all cleanup for you. For example, if a subprocess needs to write two files, and it crashed after writing the first file due to some exception, there will only be one file on the disk. You need to either account for the possibility there will only be one file (when you expected there to be zero or two) or do something to clean up that already-written file.

There is zero extra verbosity because you write zero lines of code to get these features.

I talked about verbosity in the no-exceptions style error handling, not the one in the exceptions style.

As for the tuples, I wouldn't call it trendy since it's inherited from erlang, which has had it since the 80s.

Erlang had tuples, but didn't use them for returned value based error handling. At least, not from what I could see with a quick google. Elixir does use them for error handling.

I think you have been misinformed about elixir or erlang and suggest you give it a try before continuing to make assertions about it.

My "assertions" about Elixir is that it uses both exceptions and pattern-matching-on-returned-values for error handling. Is this incorrect?

Isaac Yonemoto • Edited

At least, not from what I could see with a quick google

ok tuples and error tuples are literally everywhere in erlang. The result type for gen_server start function, for example, is {ok, Pid} | ignore | {error, Error}.

Isaac Yonemoto

I find the Erlang/Elixir treatment of null to be acceptable.

It (nil) is an atom (as are false, and true), definitely not conflatable with zero.

The only thing that is "dangerously" affected is "if", which fails on "false" and "nil" exclusively. Everywhere else you have to treat nil as its own entity.

Idan Arye

Erlang and Elixir are dynamically typed languages. The billion dollar mistake does not apply to dynamically typed languages. Guaranteeing that a variable cannot be null is not very helpful when you can't guarantee that variable's type.

Isaac Yonemoto

You can definitely guarantee a variable's type in Erlang and Elixir.

Idan Arye

By doing explicit checks. How do these differ from null checks?

Andrew (he/him)

Maybe something similar to this is the best approach?

Continuing with my idea of making all data N-dimensional matrices, `nil` or `null` would just be an empty matrix. Then a statement like `if ([])` wouldn't make any sense, because an empty matrix shouldn't be truthy or falsy. It should throw a compiler error.
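A runtime sketch of that rule in Python (the `Matrix` class and `null` name are hypothetical, just to show the behavior):

```python
class Matrix:
    """Toy sketch: every value is an N-dimensional matrix, 'null' is the
    empty matrix, and a matrix is never truthy or falsy."""
    def __init__(self, *values):
        self.values = list(values)

    def __len__(self):
        return len(self.values)

    def __bool__(self):
        # refuse to be used in a boolean context at all
        raise TypeError("a matrix has no truth value; test len() explicitly")

null = Matrix()          # the empty matrix plays the role of nil/null
assert len(null) == 0    # length-style methods still work on "nothing"
```

In Python this only surfaces at runtime, when `if null:` forces a truth test; a statically typed language could reject the same expression at compile time.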

Harvey Thompson

Add to that:

  • Proper static type system with generics, abstract type members, variadic types, tuple types, function types, a sound and complete type system with set-like operations (and, or, not).
  • Flow typing: a variable's deduced type is refined through control flow.
  • Garbage collection (of some kind)
  • I agree on no exceptions, use types.
  • Compiler-as-a-library
  • Macros and other meta-programming
  • Incremental compilation
  • Interactive prompt
  • JIT compilation, AOT compilation and scripting
  • Support Language Server Protocol (and check it works in vim, emacs, vscode)
  • Support the Debug Adapter Protocol (which requires a whole debugger and debug-symbol system)
  • Support memory, CPU, and cache profiling tools out of the box (e.g. Valgrind et al.)
  • Support and/or built-in testing
  • Support and/or built-in quality metrics
  • Make it open and freely available
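As a concrete illustration of flow typing from the list above, here is a Python sketch; a static checker such as mypy refines the declared type inside each branch (the function name is just an example):

```python
def describe(x: "int | str") -> str:
    # With flow typing, the checker narrows x's declared type
    # through each branch of the isinstance() check:
    if isinstance(x, int):
        return f"number: {x + 1}"    # here x is known to be an int
    return f"text: {x.upper()}"      # here x is known to be a str
```

The same idea appears as "smart casts" in Kotlin and type narrowing in TypeScript.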

I've been designing and building languages (now full time) for many years. You'll find you can end up adding an infinite list of things.

My advice, try writing a very simple Lisp interpreter: it can be done in under a day. Then try adding a few things.
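For a sense of scale, a Lisp-style arithmetic evaluator really does fit in a few dozen lines. A minimal Python sketch (tokenizer, parser, evaluator; integers and four operators only):

```python
import operator as op

def tokenize(src):
    # pad parentheses with spaces so split() separates them
    return src.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    token = tokens.pop(0)
    if token == "(":
        expr = []
        while tokens[0] != ")":
            expr.append(parse(tokens))
        tokens.pop(0)  # discard the closing ")"
        return expr
    try:
        return int(token)
    except ValueError:
        return token  # a symbol

ENV = {"+": op.add, "-": op.sub, "*": op.mul, "/": op.truediv}

def evaluate(expr):
    if isinstance(expr, int):
        return expr          # numbers evaluate to themselves
    if isinstance(expr, str):
        return ENV[expr]     # symbols look up their value
    fn = evaluate(expr[0])   # a list is a function call
    return fn(*[evaluate(arg) for arg in expr[1:]])

def lisp(src):
    return evaluate(parse(tokenize(src)))
```

From here, adding `define`, `lambda`, and conditionals is where the real lessons start.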

You might also want to check out LLVM, which has a tutorial Implementing A Language With LLVM

My other advice: as soon as possible, "Eat Your Own Dogfood". The programming language, its compiler services, and all its libraries should be written in the language itself. To do this, write a bare-minimum compiler from your language to C, C++, Java, or whatever (C++ did this initially with "cfront"). Then rewrite that simple pre-processor in your new language. Then add more features.

This is the best and most efficient way to validate your work: if you like using your own language more than some other, you are possibly on the right track.

Andrew (he/him)

Thanks for all the advice. I'm going to have a huge list of things to research before I even think about starting this project. I'm sure I'll come back to your comment more than a few times.

Harvey Thompson

I forgot to say: the most influential books for me are:

  • "Programming Languages: An Interpreter-Based Approach" by Samuel N. Kamin. Though it says it's about interpreters, it's really looking at how to implement language features for various languages. It makes you feel like you could do it yourself, because it explains each feature and gives example code. One of the first books I read on the subject.

  • "Types and Programming Languages" by Benjamin C. Pierce. Totally opposite and quite heavy reading; it assumes you can do degree-level set-theoretic logic, but it basically tells you how to build a proper type system from a mathematical perspective. The concepts are key, so it's possible to read it and gloss over the maths. Academia often has the future, or the bleeding edge, hidden in research papers, so it's worth reading those too, even if the maths goes way over one's head.

As I mentioned, Lisp is a great place to start because it has a very simple lexical and syntactic grammar, and its semantics can be expressed very minimally. A quick Google search gave me: Lisp in Less Than 200 Lines Of Code

Don't research absolutely everything to begin with; it's too overwhelming a subject. The basics are covered in the "Dragon Book":

  • Lexical analysis (see flex, antlr, re2c)
  • Syntax/Parsing (bison, antlr)
  • Semantic analysis (you're on your own here; it's too language-specific)
  • Type systems (if you're going static typing, which I would highly recommend)
  • Code Generation (see LLVM, it does everything you'd need)

Another thing I tend to do is read and use a lot of languages and steal... err... leverage... ideas. Most also have their compilers and libraries open-sourced. Some interesting languages: Swift, C#, Kotlin, Rust, Julia, Scala, Clojure, Erlang.

Don't be daunted; the subject, like most, is quite deep and involved when you really look into it.

Andrew (he/him)

Wow! Thanks a lot, Harvey! This is a great list of books. I really appreciate it. I'll definitely have a look at Lisp.

Andrew (he/him)

Wow! Thanks, Darren. I think I agree with most of the points you raise here. Getting a package management system up and running is clearly a step toward making the language accessible to the general public. I have next to no idea what that would entail, but it's certainly something I'd have to think about as the language matures.

pretzelhands

The struct tags in Go:

```go
type Invitation struct {
    InvitationToken string `json:"invitationToken" db:"invitation_token"`
}
```

The fact that I can have a variable called `InvitationToken` in Go that is called `invitationToken` in JSON and `invitation_token` in my database is just absolutely amazing. It resolves one of my biggest pet peeves in programming, which is mixing naming cases.

Andrew (he/him)

This is basically just metadata that you can attach to a variable, right?

related: stackoverflow.com/questions/108587...
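For comparison, the same field-level metadata idea can be sketched in Python with the standard `dataclasses` module (the field and key names are illustrative):

```python
from dataclasses import dataclass, field, fields

@dataclass
class Invitation:
    # metadata here plays the role of Go's struct tags: one field,
    # different names for JSON and for the database
    invitation_token: str = field(
        metadata={"json": "invitationToken", "db": "invitation_token"}
    )

def as_json_dict(obj):
    """Rename each field using the 'json' key from its metadata."""
    return {f.metadata["json"]: getattr(obj, f.name) for f in fields(obj)}
```

Unlike Go, nothing in the standard library consumes this metadata automatically; a serializer has to read it explicitly, as `as_json_dict` does.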

pretzelhands

That's the gist of it, yes. I think you could implement the same thing with tagged template literals in JS, but Go just offers it natively. 🤓

Valentin Berlier

In JS you would use decorators, which are native language features:

```js
class Invitation {
  @Serialize({ json: 'invitationToken', db: 'invitation_token' })
  token
}
```
Ben Halpern

Ruby has a lot of expressive language features that seem like aliases for other things but are actually a bit different.

For example, `and`, `or`, and `not` exist, which could be interesting alternatives to `&&`, `||`, and `!` (which are still more common in the language), except they have subtly different behavior, so you can't really interchange them.

Lots of little ways to cut yourself in that way.

Ben Halpern

Oh, and that's also the best feature of the language. It's been crafted to be expressive and intuitive, not settling for inelegant solutions. Tools like `[].empty?` and many, many more are really nice to have.

Of course, this all comes with performance and memory-bloat concerns, but it's still a great tool for many jobs.

Andrew (he/him)

One thing I would like to emphasize in my to-be-created language is that there should, generally speaking, only be one correct way to do something. I think it would make the syntax more uniform and make things less difficult to document, etc.*

So I suppose at some point I'll have to decide between `and` vs. `&&`, `not` vs. `!` vs. `~`, and so on. I'm leaning toward the English-like options.


*One interesting effect this would have is that there would only be one kind of loop. No `for` vs. `while`, just some kind of `loop` feature.

Thread Thread
 
ben profile image
Ben Halpern

Yeah, Ruby pretty much goes the other way completely with that. I try not to indulge it too much, but this mentality jibes with my personality, which probably helps create the right coder-language fit.

Ben Halpern