DEV Community

Mike Samuel
Mike Samuel

Posted on • Updated on

A web security story from 2008: silently securing JSON.parse

My 8 year old is doing a report for school on cyber security so I thought I'd dig up an old report for a break-the-web level security vulnerability that we silently fixed and which, as far as I know, has never been disclosed.

Back in 2008, JSON.parse was not part of the JavaScript language. It was a separate library downloaded from that used JavaScript eval to unpack data.

// In the third stage we use the eval function to compile the text into a
// JavaScript structure. The "{" operator is subject to a syntactic ambiguity
// in JavaScript: it can begin a block or an object literal. We wrap the text
// in parens to eliminate the ambiguity.

                j = eval("(" + text + ")");
Enter fullscreen mode Exit fullscreen mode

That code from a modern version of json2.js wraps the JSON source text in parentheses, because {} is a statement block in JavaScript but ({}) is an object constructor.

But a side-effect of using eval is that, if the source text is something like doVeryBadThings() then the JavaScript engine will happily do those very bad things, a classic arbitrary code execution vulnerability.

In computer security, arbitrary code execution (ACE) is an attacker's ability to run any commands or code of the attacker's choice on a target machine or in a target process.

Luckily, json2.js did a bunch of regular expression checks to make sure that text contained valid JSON, allowing commands that construct a value but not more powerful commands.

Unluckily, JSON is not a subset of JavaScript in the semantic sense. There are important differences.

To understand those differences, let's begin at the end with a change to the JavaScript language definition that I argued for based on this research.

Screenshot of draft EcmaScript 3.1 specification quoted below

Here's the changed specification text. It mirrors some explanatory text from Unicode §6.2.

7.1 Unicode Format-Control Characters
The Unicode format-control characters (i.e., the characters in category “Cf” in the Unicode Character Database such as LEFT-TO-RIGHT MARK or RIGHT-TO-LEFT MARK) are control codes used to control the formatting of a range of text in the absence of higher-level protocols for this (such as mark-up languages).
It is useful to allow these in source text to facilitate editing and display.
The format control characters maybe used in identifiers, within comments, and within string literals and regular expression literals.

And to the right of that is some deleted specification text.

7/2/2008 Deleted: anywhere in the source text of an ECMAScript program. These characters are removed from the source text before applying the lexical grammar. Since these characters are removed before processing string and regular expression literals, one must use a Unicode escape sequence (see 7.6) to include a Unicode format-control character inside a string or regular expression literal

JavaScript allows these control flow characters, known as [Cf], in identifiers because they're important for proper presentation of identifiers, especially those in cursive writing systems like عنصر in the Perso-Arabic alphabet. But they're a bit of a blind spot for many developers.

They were "removed from the source text before applying the lexical grammar." That means before the JavaScript parser has broken the source code into tokens, so

  • before it pairs quotation marks (") that start and end quoted string values, and
  • before it pairs delimiters like /* and */ or groups sequences like // and += that have no analogue in JSON.

JSON's specification does not have an equivalent clause. simply says:

A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes.

I realized that, by putting a [Cf] character between a backslash and a quote character I could get JavaScript's eval to find a different end of string than the JSON grammar would as expressed in the regular expressions that vet the JSON text. And that would allow me to sneak code into a JSON string that eval would execute.

I sent an email to the JSON maintainer, Douglas Crockford, with a proof of concept:

Email to Douglas Crockford dated Mar 14 2008 / On firefox 2, the below alerts "hello world" after "created string about to parse" using a version of downloaded earlier today. \<html\> \<head\> \<script src=json2.js\>\</script\> \</head\> \<body onload=\" var s = '\&quot;\\\\\\u200D\\\\\&quot;, alert\(\\'hello world\\'\) //\&quot;\n'; alert\(\'created string about to parse\'\); JSON.parse\(s\); \"\> \</body\> \</html\>

I didn't do a good job explaining it the first time though, but here is my second attempt:

Email quoted below

The proof of concept was, where | stands in for Unicode's zero-width joiner character (U+200D):

"\|\", alert('hello world') //"
Enter fullscreen mode Exit fullscreen mode

So a valid JSON parser would see just a single quoted string that contained two escaped characters, one of which was an escaped quote. And the regular expressions that json2.js used to approximate had the same interpretation.

So json2.js's security checks let that string through to JavaScript's eval.

But JavaScript's tokenizer saw something different because the control character is stripped out before tokenization:

"\\", alert('hello world') //"
Enter fullscreen mode Exit fullscreen mode

There, the two backslashes (which previously had a [Cf] character between them) combine to form one escaped backslash.
The previously escaped double quote now ends the string. As the email explains (with bullets added for clarity):

Firefox sees ...

  1. a string literal containing only a backslash followed by
  2. a comma followed by
  3. a call to alert followed by
  4. a line comment.

(That line comment hides the final double quote, so that eval doesn't stop with a syntax error.)

Douglas realized he needed to change the regular expression used by json2.js.

Email quoted below

So I think I have to do the same search for restricted characters that I do for ADsafe:

cx = /[\u0000-\u001f\u007f-\u009f\u00ad\u0600-\u0604\u070f\u17b4\u17b5\u200c-\u200f\u2028-\u202f\u2060-\u206f\ufeff\ufff0-\uffff]/,


(Bother, indeed.)

But then he realized that we didn't want attackers to be able to reverse engineer the vulnerability from that targeted change, so he unilaterally changed the definition of JSON on the fly.

If I put cx in json.js and announce them[sic] problem, it will be giving a pretty clear signal to the miscreants about how to exploit but. But if instead, the fix is that I replace the regexp


with one that only matches \ followed by a printable ASCII character, then the the text will be rejected. My fix will be less informative.

So that's how an arbitrary code execution that affected almost all of the JavaScript code using JSON in 2008 was silently closed with, afaik, no-one the wiser.

Top comments (1)

kurtextrem profile image
Jacob "kurtextrem" Groß

Now if that isn't a story to tell your kids about - I don't know what is.