Andrew (he/him)

Posted on May 4, 2020

Bits of Syntax: String Literals

#design #healthydebate #watercooler

Have you ever wondered why we can write something like

x = 42

...in just about any programming language, and the compiler understands what we're trying to say (the number 42), but if we try to save some text to a variable like

y = The Answer to the Great Question... is forty-two.

...we'll probably get all sorts of syntax errors? The answer has to do with how string literals are defined in code.

Literals

In most programming languages, decimal numbers are literals. This means that the source code which defines the number -- in this case the digits 4 and 2 placed next to each other -- is literally interpreted as a number.

"But how could a number be interpreted as anything other than a number?", you might ask. Well, in the vast majority of cases, it can't be. That's why numbers and other characters are treated differently in source code. To explain this difference, let's take a look at alphabetic characters instead of digits.

The character x is used above to represent a variable. When we refer to x later in the program, we expect it to be replaced, upon execution, with the value that it represents, namely the integer 42. In this case, x is not interpreted literally as the character x, but rather symbolically, and we -- and the compiler -- understand that symbol x to represent that value 42.

When we create a string literal, we're telling the compiler that we want some chunk of characters to be interpreted literally, and not to try to parse the sequence of characters as source code. So while we can write something like

a = 3
b = 4
c = a + b
print(c)

...in most languages, and expect that the number 7 will be printed to the console, writing something like

a = 3
b = 4
c = "a + b"
print(c)

...would instead print the literal sequence of characters a + b. By delimiting the string with double-quote characters on each side, we're telling the compiler that we don't want it to evaluate the terms on the right-hand side of the third line, but rather we just want to assign to c the literal sequence of characters inside the quotes.

Trying to treat literal sequences of characters and literal sequences of digits on an equal footing leads to ambiguities. When you say

forty-two

...do you mean the literal string of characters "forty-two", or do you mean the value represented by the variable named two, subtracted from the value represented by the variable named forty? And if you said

w = 42-19

...are you trying to subtract the number 19 from the number 42, and assign the result to w? Or are you just trying to define a sequence of characters that should be assigned to w as-is?
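To see how a real compiler front end resolves this, here is a small sketch using Python's standard `tokenize` module. The helper name `token_kinds` is made up for this illustration, and other languages' tokenizers will differ in the details.

```python
import io
import tokenize

def token_kinds(source):
    """Return (kind, text) pairs for the meaningful tokens in one line of source."""
    keep = {tokenize.NAME, tokenize.NUMBER, tokenize.OP, tokenize.STRING}
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [(tokenize.tok_name[t.type], t.string) for t in tokens if t.type in keep]

# Unquoted text is read as code: two names joined by a minus sign.
print(token_kinds("forty-two"))

# Unquoted digits are read as two numbers joined by a minus sign.
print(token_kinds("42-19"))

# Quoted, the same characters become a single string token.
print(token_kinds('"42-19"'))
```

Only the quoted version survives as one uninterpreted chunk; everything else is split into names, numbers, and operators.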

In order to avoid interpreting text as special characters and variable names, we must somehow define where a string literal begins and ends, so the compiler can figure out when it should stop trying to interpret source code characters as language constructs and when it should start again. In essence, a quoted string is really shorthand for a function of the form literal(...), which takes an arbitrary sequence of characters as its only argument and returns them as-is, without any interpretation.

Another way to think about this is that there are different modes we're typing in when we program. In the "default mode", we want the compiler to interpret everything we write as code -- every - is a minus sign, every sequence of contiguous alphabetic characters is a variable or function name, and so on. We shift to the "literal mode" with a toggle character, like ", and often shift back using the same character. This is not unlike when you hold down the shift key to temporarily move from lowercase to uppercase letters.
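The "two modes" idea can be sketched in a few lines of Python. This is a toy, not how any particular compiler works: the function name `split_modes` is invented here, and it assumes a single toggle character with no escape sequences.

```python
def split_modes(source, quote='"'):
    """Split source into ('code', text) and ('literal', text) chunks.

    Assumes one quote character toggles literal mode on and off,
    and that there are no escape sequences.
    """
    chunks = []
    mode = "code"
    current = ""
    for ch in source:
        if ch == quote:
            # The toggle character itself belongs to neither chunk.
            # An empty literal ("") is still recorded; empty code is not.
            if current or mode == "literal":
                chunks.append((mode, current))
            current = ""
            mode = "literal" if mode == "code" else "code"
        else:
            current += ch
    if mode == "literal":
        raise ValueError("unterminated string literal")
    if current:
        chunks.append((mode, current))
    return chunks

print(split_modes('c = "a + b"'))
# [('code', 'c = '), ('literal', 'a + b')]
```

Everything in a "code" chunk would go on to be parsed as source; everything in a "literal" chunk is passed through untouched.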

Common Methods for Delimiting Strings

So how is this accomplished in different programming languages? In most modern languages, typewriter quotes (a.k.a. "straight" or "dumb" quotes) are used to delimit the beginning and end of a string:

"like so"

Using the same character (ASCII #34) to mark both the start and the end of the string can lead to difficulties when the delimiting character is embedded in the string itself (known as delimiter collision):

"where does "this" string start and end?"

Many languages allow delimiters to be escaped -- either by doubling ("like ""this""") or by preceding with a backslash \ or caret ^ character ("like \"this\"" or "like ^"this^""). Some languages (like JavaScript and Python) also allow apostrophes to delimit strings, which means strings can be nested at most two levels deep without escape sequences:

"like 'in this' example"

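Here is how a few of these escaping styles look in practice, as a short Python sketch. Python itself uses backslash escapes and alternate quote characters; for the doubling style, the example leans on the standard `csv` module, whose default dialect doubles embedded quotes.

```python
import csv

# Backslash escapes let the delimiter appear inside the string.
escaped = "like \"this\""

# Switching the outer delimiter avoids the escapes altogether.
alternate = 'like "this"'

print(escaped == alternate)  # True

# CSV fields escape an embedded quote by doubling it.
field = next(csv.reader(['"like ""this"""']))[0]
print(field)  # like "this"
```

All three spellings end up as the same eleven characters; only the source-level notation differs.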
But some languages use different characters to mark the beginning and end of a string, like

PostScript, which uses open-and-close parentheses to delimit strings (...),
m4, which uses a backtick-apostrophe pair `...'*, and
Tcl, which allows open-and-closed braces to delimit strings {...}

* If you're wondering how I did this in DEV's Markdown editor, you can define inline code blocks using two backticks instead of a single one. Just make sure you put whitespace around the double-backtick delimiters, like so:

`` `...' ``

Having different characters for beginning and ending a string means that the opening character no longer needs to be escaped within the string, but the closing character must be treated specially. For example, for a string delimited by open-and-close parentheses,

does (this (string end here?) or here?)

If we truly ignore all non-) characters within the string, then it should end at the first ) character. But if we want to allow nested strings (or nested comments, notoriously unavailable in HTML), our parser needs to track every non-escaped ( and ) character within the string and count how many "levels deep" we are, and things quickly get messy.
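The depth-tracking approach can be sketched as follows. This is a simplified illustration rather than a faithful PostScript reader: the function name `read_paren_string` is hypothetical, and it assumes a backslash simply makes the next character literal.

```python
def read_paren_string(source, start=0):
    """Read a (...)-delimited string beginning at source[start].

    Returns (body, index_after_closing_paren). Balanced, unescaped
    parentheses inside the body are kept as ordinary characters.
    Assumption: a backslash makes the following character literal.
    """
    if source[start] != "(":
        raise ValueError("expected '(' at start of string")
    depth = 1
    body = []
    i = start + 1
    while i < len(source):
        ch = source[i]
        if ch == "\\" and i + 1 < len(source):
            body.append(source[i + 1])  # escaped character, taken as-is
            i += 2
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return "".join(body), i + 1
        body.append(ch)
        i += 1
    raise ValueError("unterminated string")

source = "(this (string end here?) or here?)"

# Naive rule: stop at the first closing parenthesis.
print(source[1:source.index(")")])   # this (string end here?

# Depth-tracking rule: stop when the nesting level returns to zero.
print(read_paren_string(source)[0])  # this (string end here?) or here?
```

The two rules disagree on where the same source text ends, which is exactly the ambiguity described above.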

Uncommon Methods for Delimiting Strings

Other languages have tackled this problem in interesting and unusual ways. FORTRAN avoided the delimiter problem altogether by using Hollerith notation, in which the string is preceded by its length in characters and a literal H character:

33HI am a string with 33 characters.

This method, of course, is clunky and error-prone when characters must be manually counted by programmers, or if using characters beyond ASCII, which require more than 1 byte per character.
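A length-prefixed reader is short to write, which is part of the appeal. The sketch below is a rough Python illustration, not a FORTRAN implementation; `read_hollerith` is a name invented for this example.

```python
import re

def read_hollerith(source):
    """Split a Hollerith constant like '5HHELLO' into (string, remainder)."""
    match = re.match(r"(\d+)H", source)
    if match is None:
        raise ValueError("expected <count>H at start of constant")
    count = int(match.group(1))
    start = match.end()
    body = source[start:start + count]
    if len(body) < count:
        raise ValueError("constant is shorter than its declared length")
    # No delimiter search: whatever the next `count` characters are, they are the string.
    return body, source[start + count:]

body, rest = read_hollerith("33HI am a string with 33 characters.")
print(repr(body))  # 'I am a string with 33 characters.'
print(repr(rest))  # ''

# Beyond ASCII, "length" depends on whether you count characters or bytes.
word = "na\u00efve"
print(len(word), len(word.encode("utf-8")))  # 5 6
```

Note that the string may contain any character at all, including quotes and parentheses, because the reader never looks for a closing delimiter.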

Some languages allow whitespace-delimited strings, like YAML:

myString: |
  This is a long string, containing various special characters,
  like " < \" ^ ) }
  This is fine.

In a YAML literal block scalar, like the one above, the indentation defines the scope of the string. In MediaWiki template parameters, one can define a string using newlines as delimiters, like so:

{{Navbox
 |name=Nulls
 |title=[[wikt:Null|Nulls]] in [[computing]]
 }}

Above, the parameter name has the sequence of five characters Nulls as its value.
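Returning to the YAML block scalar for a moment, the "indentation is the delimiter" rule can be sketched roughly like this. It is a deliberately simplified toy, not a YAML parser: `read_block_scalar` is a hypothetical helper, and it ignores chomping indicators, tabs, and the other rules in the real specification.

```python
def read_block_scalar(lines, start):
    """Collect the indented lines that follow the header at lines[start].

    Simplified: the first content line sets the indentation, and the
    block ends at the first non-blank line indented less than that.
    """
    body = []
    indent = None
    for line in lines[start + 1:]:
        stripped = line.lstrip(" ")
        if stripped == "":
            body.append("")  # blank lines stay inside the block
            continue
        width = len(line) - len(stripped)
        if indent is None:
            if width == 0:
                break  # nothing was indented, so the block is empty
            indent = width
        if width < indent:
            break  # dedent: the string is over
        body.append(line[indent:])
    return "\n".join(body)

doc = [
    "myString: |",
    "  This is a long string, containing various special characters,",
    '  like " < \\" ^ ) }',
    "  This is fine.",
    "other: 1",
]
print(read_block_scalar(doc, 0))
```

None of the quote, backslash, or bracket characters need special handling here, because the reader only looks at leading whitespace.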

Finally, in some languages, in restricted contexts, strings can be inferred with no delimiters at all, like when defining property names in a JavaScript object:

var myObj = {
  red: 0xf00,
  blue: 0x00f,
  green: 0x0f0
};

Above, red, blue, and green are strings, not variables, but in this narrow context, the quotes can be omitted. (JSON is stricter, however: its property names must always be wrapped in double quotes.)
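Python has a loosely analogous corner: keyword arguments to `dict()` are written as bare names but end up as string keys. The sketch below shows that, along with the standard `json` module rejecting an unquoted property name. The variable names are made up for illustration.

```python
import json

# Bare names in keyword position become string keys.
my_obj = dict(red=0xF00, blue=0x00F, green=0x0F0)
quoted = {"red": 0xF00, "blue": 0x00F, "green": 0x0F0}

print(my_obj == quoted)  # True
print(list(my_obj))      # ['red', 'blue', 'green']

# JSON has no such shortcut: property names must be double-quoted.
try:
    json.loads("{red: 3840}")
    accepted = True
except json.JSONDecodeError:
    accepted = False
print(accepted)  # False
```

In both languages the shortcut only works because the grammar at that exact position allows nothing but a name, so there is no ambiguity to resolve.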

Conclusion

A literal string is really just a sequence of characters in source code which we do not want to be interpreted. It should be treated literally, as-is (as just a sequence of characters).

To demarcate the beginning and end of this "no interpretation zone", however, we need some kind of flag, or signal, to let the compiler know where we want code interpretation to temporarily pause, and where we want it to pick back up again.

This is made more difficult by the fact that those flags are embedded in the source code itself, using the same medium (characters) as the rest of the source code and the literal characters contained within the string.

This means that we can't be completely insensitive to the characters which appear after the delimiter at the beginning of the string; we need to step through the source, character-by-character, "watching" for the delimiter that ends the string. This leads to issues with nested strings, embedded delimiters, and more.

Maybe someday, someone will come up with an easier, more intuitive and robust, less error-prone method than the one we currently have for embedding literal sequences of characters within source code. But for now, literal strings in code seem to be held together by an ad-hoc collection of duct tape, escape sequences, and, well, string.

Thanks for reading!