Hello, my name is Dmitriy Karlovskiy and I.. ride bicycles.. off-road.. against the wind.. uphill.. on skis. And today I invite you to take a ride with me along and across textual data formats and design the ideal format together.
I already talked about it 5 years ago, which led to heated debates that resulted in minor syntax changes. Therefore, let me tell you from scratch what it is at the moment.
Meta
Speech
Speaker \Dmitry Karlovsky
Place \PiterJS #47
Time 2020-05-20
This is an extended text version of the speech of the same name on PiterJS#47. You can read it as an article or open it in the presentation interface or watch video.
Plan
- Analyze popular text data formats π©
- From scratch, develop a new format without flaws π½
- Show examples of applying the new format πΎ
Formats
We will compare 5 formats.
Format |
---|
XML |
JSON |
YAML |
TOML |
tree |
Only the deaf have not heard about the first three. But the last two are dark horses for many. Well, nothing, today I will shed light on them.
XML example
XML - once the most popular format, you can say "technological standard". But despite all its power, it is now becoming obsolete, as it is too complicated for a modern web developer.
<!DOCTYPE svg
PUBLIC "-//W3C//DTD SVG 1.1//EN"
"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"
>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<svg version="1.1" xmlns="http://www.w3.org/2000/svg">
<circle r="30" cx="50" cy="50" fill="orange" />
</svg>
JSON example
XML is being replaced by a simpler and more daring data format - JSON.
{
"name": "example",
"version": "1.0.0",
"description": "example package",
"main": "index.js",
"repository": "https://example.org",
"author": "anonymous",
"license": "MIT"
}
If you think that this is the ideal, then I ask you to excuse me in advance, as I will upset you further.
YAML example
Someone is already prophesying YAML to replace JSON.
Date: 2001-11-23 15:03:17-5
User: ed
fatal:
Unknown variable "bar"
Where:
file: TopClass.py
line: 23
code: |
x = MoreObject("345\n")
Due to its better human readability, it has already gained popularity in the field of manually writing configuration files.
TOML Example
Few have heard of TOML. However, take a look at the example and it will become clear why I mention it at all.
[servers]
[servers.alpha]
ip="10.0.0.1"
dc="eqdc10"
[servers.beta]
ip="10.0.0.2"
dc="eqdc10"
Yes, it's actually a standardized INI config bitten by JSON. As a result, he absorbed the worst of both worlds.
Example Tree
Finally, as a spoiler, let me show you the minimal non-empty tree file that we will develop next.
spoiler
Data models
Different formats are based on different data models. The chosen model answers the following two questions.
- What data can we write and read without a tambourine? π₯
- How to record data that does not fit into the model? π
No single format is able to support the entire variety of types of subject areas, so the need inevitably arises for packing data into a certain format and then unpacking it back.
XML Model
XML is based on a typed element model that contains one dictionary of attributes and one list of nested typed nodes.
- NodeList
- Element Node (
<br/>
) - Attribute Node (
tabindex="1"
) - Text Node(
Hello, World!
) - CDATA Node (
<![CDATA[ ... ]]>
) - Processing Instruction Node (
<? ... ?>
) - Comment Node (
<!-- ... -->
) - Document Node
- Document Type Node (
<!DOCTYPE html>
)
Disadvantages of the XML Model
This model is quite flexible, but it has a number of limitations: only strings can be attribute values, and there can be only one nested list of nodes. Despite the fact that the XML format is already not the simplest, a banal dictionary with subtrees as values ββrequires additional agreements. For example, this: some elements are used to describe the keys in the parent element and such elements in the parent should be only in one instance.
<panel>
<head>Are you sure?</head>
<body>
<button>Yes</button>
<button>No</button>
</body>
</panel>
Here panel
is a component, and body
is no longer a component, but a parameter. It would have a place in the attributes, but only the strings can be placed in the attributes and nothing more.
XML model extensibility
Thanks to namespaces, many languages ββcan be mixed up within one XML document without breaking the interpretation of each other.
<xsl:stylesheet
version="1.0"
xmlns="http://www.w3.org/1999/xhtml"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<html>
<head>
<link rel="stylesheet" href="web.css" />
</head>
<body>
<xsl:apply-templates select="*" />
</body>
</html>
</xsl:template>
</xsl:stylesheet>
This is a very powerful technique that is lacking in younger formats.
JSON Model
The JSON model is based on the fact that the entire tree consists of untyped lists and dictionaries. Plus a limited set of primitives as tree leaves.
- Null
- Boolean
- Number
- String
- Array
- Dictionary
Disadvantages of the JSON model
It would be naive to believe that two types of structural nodes are enough for everything. For example, let's take a dictionary. The keys in it are not ordered, that is, they can be returned by the parser in any order.
{
"foo": 777
"bar": 666
}
What if we need a dictionary with ordered keys?
[
[ "foo" , 777 ],
[ "bar" , 666 ]
]
We had to radically change the syntax and stick arrays of arrays. But this is just another type of dictionary.
Non-extensible JSON model
Well, the main drawback of the JSON model is its non-extensibility, which is why you have to introduce a bunch of tricky rules to cram all the variety of application types of their relations. Take, for example, a query to MongoDB, whose authors decided that JSON is a great fit for the role of a query language.
{
"$or": [
{
"sex": "female",
"age": { "$gt": 16 },
},
{
hobby: {
"$regex": "\\b(?:java|type)script\\b"
}
}
]
}
We see that the paired logical operations OR and AND have a completely different syntax. The equality predicate is sorely lacking, because we still need the predicates "greater than", "less than" and even "matches the regular expression". And by the way, regular expressions themselves cannot be represented in JSON except as a string and an agreement that if it is in the dictionary for a key named "$regexp", then this is a serialized regular expression and when parsing, you need to create the corresponding object.
YAML Model
The YAML model is similar in many ways to the JSON model. Unless there is support for time and internal links.
- !!null
- !!bool
- !!int
- !!float
- !!str
- !!timestamp
- !!seq
- !!map
- Anchor & Alias
- Document
- TypeTags
YAML model extensibility
The main advantage of YAML is in type annotations, which allow you to explain to the processor which algorithm to use to unpack the data.
--- !!omap
- foo:777
- bar: 666
In this example, we are telling the parser to "take this list of key-value pairs" and convert it to an OrderedMap object (an ordered dictionary).
TOML Model
The TOML model is like JSON, but a little more mundane. For example, integers and real numbers are distinguished here, which is important for compiled languages, and there is also time support.
- Boolean
- Integer
- Float
- String
- datetime
- Array
- Dictionary
With extensibility, everything is just as bad here as in JSON.
Model Tree
Whatever set of basic types we choose, it will not be enough for everything. This means that some packing and unpacking code will inevitably be required. And it is easiest to work with such code when the number of different types of nodes is minimal, since for each type you need to write a separate branch of logic. At the same time, maximum flexibility is required. Therefore, only two types of nodes will suffice for us.
- Structure Node
- Data Node
Structural nodes serve to describe the hierarchy, while data nodes store raw binary data. Any node can store a list of any other nodes, which achieves flexibility unattainable in other formats.
Model extensibility
Total, in terms of extensibility, everything is very bad. Popular formats are either extensible, but incredibly overcomplicated, or simple, but not extensible at all.
XML | json | YAML | TOML | tree | |
---|---|---|---|---|---|
Extendability | β | β | β | β | β |
Number of patterns | 90 | 30 | 210 | 90 | 10 |
Pay attention to YAML. Its grammar has two hundred patterns. It is so complex that you will most likely not find any complete and correct implementation of its parser. Why, even two identically working JSON parsers you still need to search, but there would seem to be 30 patterns in total.
Our goal will be to create an extremely simple, unambiguous, but at the same time maximally extensible format.
Readability
Syntax clarity is important in a variety of scenarios for working with the format: when writing, when reviewing code, when resolving conflicts, when debugging, when studying.
The speed of your work and the predictability of its results directly depends on how the format is serialized. However, some formats have serious problems with this.
XML | json | YAML | TOML | tree | |
---|---|---|---|---|---|
Readability | β | β | β | β | β |
XML readability
XML is built around text with tags interspersed with additional information. As long as there is not a lot of this information, everything is fine, but the more it is, the more difficult it is to perceive the text, which eliminates the usefulness of this feature.
Hello Alice!
How are you?
Could you bring me coffee now?
<message>
<greeting>
Hi <a href="http://example.org/user/alice">Alice</a>!
</greeting>
<body>
<s>How are you?</s><br/>
Could you bring me
<time datetime="1979-10-14T12:00:00.001-04:00">now</time>
coffee?
</body>
</message>
JSON readability
XML at least supports multiline text, but JSON, for example, can no longer boast of this. Formats of this type come from an information structure, in which text and not only text values ββare already interspersed.
{ "greetings": "Hi Alice!\nHow are you?\nCould you bring me some coffee?\n" }
Severity
As a rule, there are no problems with understanding what is written. But YAML excelled here.
XML | json | YAML | TOML | tree | |
---|---|---|---|---|---|
Unambiguous syntax | β | β | β | β | β |
YAML lax
a: true # boolean
b: tru # string
c: :-) # string
d: (-: # error
There are quite a few such jokes in YAML.
Escaping
A topic close to readability is escaping. The presence of this in one way or another inevitably leads to a decrease in readability. When designing escaping, the following points should be kept in mind.
- It is necessary to distinguish format constructs from actual data π΅
- It is advisable not to lose data in visibility π€
- It is advisable not to overcomplicate editing π€¬
Escaping in XML
XML is a wonderful example of how not to do escaping.
foo > 0 && foo < 10
From a simple and visual text, some kind of cryptotext is obtained, which has to be mentally interpreted in order to understand what is written here.
<code>foo > 0 && foo < 10</code>
Escaping in JSON
There is a similar problem with JSON, albeit to a lesser extent. If you have ever written plugins for VSCode syntax highlighting, then you know that grammars are described there in JSON format, where regular expressions are written.
/"[\s\S]*"/
The regulars themselves are not the most visual things, but escaped ones are even worse. It is very easy to make a mistake in them in such conditions, and it is not very easy to debug them.
"\"[\\s\\S]*\""
Escaping in YAML
In YAML, the escaping problem is generally solved, but at what cost.
- 5 types of strings π£
- 4 whitespace handling modifiers π₯
And all this you need to know in order to correctly read any YAML file.
Escaping in Tree
No π€ͺ
The most readable escaping is no escaping. Therefore, we will not have it. You might think that I'm crazy, but a little later I'll show you how to achieve this.
Minification
Many formats support different ways of formatting the same data. But it's always a trade-off between size and readability.
- Readable formatting weighs a lot π
- Compact formatting is hard to read π
XML minification
<users>
<user>
<name>Alice</name>
<age>20</age>
</user>
</users>
If you minify XML, you can save several tens of percent in size, but the result is even more difficult to read.
<!-- 13% less -->
<users><user><name>Alice</name><age>20</age></user></users>
JSON minification
{
"users": [
{
"name": "Alice",
age: 20
}
]
}
With JSON, the savings are slightly greater, but readability suffers more - instead of closing tags, we see a string of square and curly brackets.
// 30% less
{"users":[{"name":"Alice","age":20}]}
Tree minification
No π²
Our path is uncompromising - the format must be both extremely compact and easily perceived by a person.
Statistics on minification
XML | json | YAML | TOML | tree | |
---|---|---|---|---|---|
Readable | 195% | 140% | 125% | 110% | 100% |
Minified | 170% | 101% | - | - | - |
Download sample files.
As you can see, it is possible to make a format that in a readable form weighs less than any other, even if they are minified. The whole secret is that readability is achieved by the structure of the format itself, and does not require additional formatting that bloats the volume.
Holy Wars
A common problem when working with different formats is endless arguments about seemingly trifles.
- Tabs or spaces? π€ΌββοΈ
- 2 or 4 spaces? π€ΌββοΈ
- Do you need a carriage return? β‘
- Do we do alignment? π€Ί
- linter/format rules? π₯
- when saving/committing/pushing? π§
These arguments take time and emotions, but they are completely meaningless. It is better if the format has uniform, clearly defined rules that are equally understood by any tool and person. Therefore, our format will be extremely rigid, without any liberties.
Processing speed
Simplicity, rigidity and lack of escaping potentially gives a much higher possible processing speed.
For example, in JSON, to write an arbitrary string, you need to go through each character and output a backslash to the output buffer before certain ones. That is, we cannot even know in advance how much memory we can allocate for the output buffer. And during parsing, you need to do the reverse operation with the formation of a new line. We cannot reuse the original piece of memory.
serialization: foo\bar => "foo\\bar"
parsing: "foo\\bar" => foo\bar
When we do not have escaping, we can simply take chunks of memory and send them to the output stream during serialization, which is very fast. Conversely, when parsing, we can simply refer to pieces of the original buffer and not make extra memory allocations.
In my knee - length benchmark in the D language, the following results were obtained:
Tree: 299 ms
JSON: 421 ms
For comparison, I used the naive implementation of the tree parser and json parser from the standard library.
Error coordinates
During parsing, information about the original location of the nodes obtained from the format is often lost. For example, we received JSON, started processing it, and somewhere in the depths we suddenly realized that in the database we do not have the user specified in the file. At this moment, we must show an error, but in the text of this error, we cannot indicate in which place of which file it was made. This is because this information is lost during parsing. And this is a very common problem.
XML | json | YAML | TOML | tree | |
---|---|---|---|---|---|
Address | β | β | β | β | β |
Position | β | β | β | β | β |
Range | β | β | β | β | β |
In XML nodes there is a link to the resource from which it was obtained, but where it is in this resource - look with your eyes. To solve this problem, there are special parsers that give the output not arrays and dictionaries, but an Abstract Syntax Tree. But working with him is no longer so easy, and even slowly this business.
Well, this information is important, and I suggest not to lose it. Never lose. Saving node coordinates will still come in handy when it comes to AST and sourcemaps.
Stream processing
It happens that there is a lot of data and little memory, but you need to work with data quickly. And it happens that the data does not end at all. For example, you need to continuously process logs as they come in. In these cases, the ability to stream data processing saves.
XML | json | YAML | TOML | tree | |
---|---|---|---|---|---|
Streaming | β | β | β | β | β |
As you can see, the most common formats do not have streaming support. They require you to have exactly one complete document root, otherwise it's a parsing error. In the case of constantly arriving data such as logs, for example, adding them to a document while maintaining its correctness is not an easy task.
This does not mean that stream processing cannot be fastened to them. For example, for XML, there are lower-level SAX parsers that allow you to work not with a tree of elements, but with a stream of tags: such and such a tag opened, a string arrived, such and such a tag closed. And for JSON, there is a whole bunch of message streaming protocols. The main problem here is that not every format-supporting tool will be able to digest your data without additional gestures.
Formats that support stream processing can be easily supplemented by appending data to the end. You can glue several data streams into one and, conversely, cut into pieces. Can be processed in parts without waiting for the transfer to complete. And all this without losing the correctness of working with the format.
Tree format
Well, summarizing what was said earlier, let's formulate all the requirements for our new format.
- Easy syntax β
- No escaping π€
- No liberties π€
- No minification π
- Minimum size π
- Guaranteed readability π
- Stream processing πͺ
- Exact coordinates of nodes β
Just a tree node
So, we need to create a node named "house". What is the minimum code for this?
house
We just write this name and that's it.
List of tree nodes
And if we need not one node, but a whole list?
house
roof
wall
door
window
floor
We just write them on separate lines.
Nesting tree nodes
But what if we want to add hierarchies and put the list of nodes inside the first one?
house
roof
wall
door
window
floor
We just write nested nodes with a tab as an indent. Those familiar with the Python language may notice a similar approach here - using good code formatting style as the basis of the syntax, rather than an optional feature.
Deep tree hierarchy
By continuing to add padding, we can create hierarchies of any nesting.
house
roof
wall
door
window
glass
floor
Alone at home
Often there are situations when there is only one nested node, and then it will be somehow wasteful to increase the indentation level for all nested nodes because of it.
street
house
wall
door
window
Therefore, we simply line up such nodes in one line, separating them with spaces.
street house wall
window
door
Indented nodes are already nested in the last node on the previous line.
Raw data
When we need to write arbitrary data, the characters in which should not be processed in any special way, we simply write them after the backslash without any escaping.
\Any data \(^_^)/
The backslash is chosen to be associated with escaping. It sort of escapes the entire text to the end of the line. But, to be precise, it is rather not escaping, but a kind of quotation marks. The backslash is the opening mark, and the newline character is the trailing mark.
Multiline data
But how to write all the same multi-line text containing, among other things, newlines? It's simple: we take a data node and put a list of other data nodes in it.
\
\Here π±βπ»
\ many π±βπ
\ cats π±βπ€
When requesting the string content of the root data node, all nested data nodes will be concatenated via a newline character.
Different types of nodes
Finally, we can use both types of nodes mixed in any combination. For example, let's describe some user.
user
name \Jin
age \35
hobby
\kendo π±βπ€
\dance πΊπ½
\roleplay π
default
As you can see, everything is quite simple. To create the most advanced data format, we needed only 2 types of nodes and 4 special characters.
Languages ββbased on formats
So far, we have only talked about formats, that is, about serialization methods. On their basis, languages ββare already being designed that add semantics to abstract format nodes.
Format | Languages |
---|---|
XML | XHTML, SVG, XSLT, ... |
json | JSON Schema, json:api, ... |
YAML | yaml.org/type |
TOML | - |
tree | xml.tree, json.tree, view.tree, ... |
Any language is some subset of the format data model with restrictions on the possible types of nodes, their relative position and content.
Next, I will show some examples of such languages ββfor the tree format.
Language grammar.tree
Language grammar.tree - designed to describe formal grammars. For example, let's write a complete formal grammar for the tree format itself.
tree .is .optional .list_of line
line .is .sequence
.optional indent
.optional nodes
new_line
nodes .is .sequence
.optional .list_of struct
.optional data
.with_delimiter space
struct .is .list_of .byte
.except special
data .is .sequence
data_prefix
.optional .list_of .byte
.except new_line
special .is .any_of
new_line
data_prefix
indent
space
new_line .is .byte \0A
indent .is .list_of .byte \09
data_prefix .is .byte \5C
space .is .list_of .byte \20
As you can see, the grammar of the format is really extremely simple, which allows you to write a parser in any language in just an hour without even resorting to parser generators.
This grammar can be read literally: tree is an optional list of lines, and a line is a sequence of an optional indentation, an optional list of nodes, and a mandatory newline character. Well, and so on.
Language grammar.tree vs EBNF
Comparing grammar.tree with Extended Backus Naur Form you can see that the former is somewhat verbose but clear and concise, while the latter is compact , but for understanding it requires preliminary preparation, the expressive possibilities are still somewhat inferior, and its focus on a single-line representation looks somewhat awkward when using multi-line writing.
tree .is .optional .list_of line
line .is .sequence
.optional indent
.optional nodes
new_line
nodes .is .sequence
.optional .list_of struct
.optional data
.with_delimiter space
tree = {line};
line=[indent],
[ nodes ],
new_line;
nodes = data |
structure,
{ space , struct },
[ space , data ];
Language xml.tree vs XML
The xml.tree language is a way to represent an XML data model in tree format. Any kind of XML can be generated from it. Conversely, any XML can be converted to xml.tree.
! doctype html
html
meta @ charset \utf-8
link
@ href \web.css
@ rel \stylesheet
script @ src \web.js
body
h1 \Procter & Gamble
<!doctype html>
<html>
<meta charset="utf-8" />
<link href="web.css" rel="stylesheet" />
<script src="web.js"></script>
<body>
<h1>Procter & Gamble</div>
</body>
</html>
It would be nice to have such integration in the IDE that when opening any XML, you can see and edit its xml.tree representation, but everything would be saved back to XML. This would eliminate the need to break your eyes over ampersands and make working with XML as easy and simple as, for example, with markdown.
Language json.tree vs JSON
And json.tree is a language for describing the json model.
* user *
name \Jin
age 35
hobby /
\kendo π±βπ€
\dance πΊπ½
home \C:\users\jin\
{
"user": {
"name": "Jin",
age: 35
"hobby": [
"kendo π±βπ€",
"dance πΊπ½",
],
"home": "C:\\users\\jin\\"
}
}
We needed only 2 special characters - an asterisk to denote dictionaries and a slash to denote arrays.
json.tree extensions
The beauty of languages ββbased on formats like XML and Tree is that they are easy to extend while staying within the format. For example, both json and tree as formats fundamentally do not support comments. But, for example, comments are necessary in configs. How to be?
*
# \If disabled will be used platform specific delimiters
# \CRLN on windows and LN on others
unix_delimiters true
In tree, we easily extended the language to suit our needs by adding a special node type for comments.
{
"unix_delimiters#1": "If disabled will be used platform specific delimiters",
"unix_delimiters#2": "CRLN on windows and LN on others",
"unix_delimiters": true,
}
In JSON, the limitations of the model are affected, because of which you have to write crutches.
Language view.tree vs TypeScript
Language view.tree - used for component composition in framework $mol developed by me.
$my_details $mol_view
sub /
<= Pager $mol_paginator
value?val <=> page?val 0
This describes a component that owns another component and their properties are bidirectionally related to each other. You may notice that inside view.tree the json.tree language is also used to describe arrays, dictionaries, numbers and other JSON types.
From such a simple and concise code, a rather sprawling TypeScript class is generated. You can write it with your hands, but it's a chore and without a hierarchy it's not very clear.
class $my_details extends $mol_view {
sub() { return [ this.Pager() ] }
@ $mol_mem Pager() {
const Pager = new $mol_paginator
Pager.value = val => this.page( val )
return pager
}
@ $mol_mem page( val = 0 ) {
return value
}
}
API
Finally, there are various APIs for interacting with the format from different programming languages.
Format | Languages ββ | API |
---|---|---|
XML | XHTML, SVG, XSLT, ... | DOM, SAX, AST |
json | JSON Schema, json:api, ... | Native, AST |
YAML | yaml.org/type | Native, AST |
TOML | - | Native, AST |
tree | xml.tree, json.tree, ... | AST |
For XML, for example, there is a fairly flexible DOM, and there is a low-level SAX. The formats that replaced it mainly return dictionaries, arrays, and so on native to the language. True, the JSON data model is not well represented in compiled languages, where integers and floats are completely different types. And of course, for all languages ββthere is a representation in the form of an Abstract Syntax Tree. True, it is usually slow and inconvenient. We will make it fast and convenient, which will allow us not to fence the zoo of incompatible APIs.
JSON AST
Let's take a simple JSON file and put it in ASTExplorer.
{
"user": {}
}
{
"type" : "object",
"children" : [
{
"type" : "Property",
"key" : {
"type": "Identifier",
"value": "user"
}
"value": {
"type": "object",
"children": []
}
}
]
}
As you can see, the AST turned out to be large and complex. JSON is generally very poorly suited to describe AST. It is not very easy to work with it without special utilities.
AST Tree
Now let's take a slightly more complex tree file.
user
name \Jin
age 35
hobby
\kendo π±βπ€
\dance πΊπ½
\roleplay π
And look at his AST.
user
name \Jin
age 35
hobby
\kendo π±βπ€
\dance πΊπ½
\roleplay π
So, something is wrong. It's the same code. Ah, no, that's right, tree is its own AST.
Tree node properties
In the TypeScript implementation, each node has roughly the following interface.
interface $mol_tree2 {
type: string
value: string
kids: $mol_tree2[]
span: $mol_span
}
Span is a reference to a series of bytes in the original resource.
interface $mol_span {
uri: string
row: number
col: number
length: number
}
Derived Tree nodes
Each node has methods for creating new nodes based on it. These factories, when creating new nodes, push the span from the original node into them. This allows even after dozens of transformations to understand how it all began.
interface $mol_tree2 {
struct: ( type , kids )=> $mol_tree2
data: ( value , kids )=> $mol_tree2
list: ( kids )=> $mol_tree2
clone: ( kids )=> $mol_tree2
}
Error messages in Tree
For example, let's take the config, find the password in it, and if it doesn't work, we'll throw an exception, where it will be written in which place of which file the wrong password is written.
const config_path = './config.tree'
const config_text = fs.readFileSync( config_path )
const config = $mol_tree2.fromString( config_text , config_path )
// server auth
// login \root
// password \qwerty
const password = config.select( 'server' , 'auth' , 'password' , '' )
if( !auth( password.text() ) ) {
// AuthError: Wrong password
// \default
// ./config.tree#5:3-11
throw password.error( 'Wrong password' , AuthError )
}
Processing Tree
Or another example - we decided that "auth" is an unfortunate name and we need to replace it with "credentials". Therefore, we write a simple script for automatic refactoring:
// server credentials
// login \root
// password \qwerty
const new_config = config.list(
config.hack({
'auth' : ( auth , context )=> [
auth.struct( 'credentials' , auth.hack( context ) ),
] ,
})
)
fs.writeFileSync( config_path , new_config )
And in this way you can easily refactor any languages ββbased on the tree format without searching for a separate parser for each language and dealing with how it works with AST.
Support by editors
If you are using an editor for which there is no plugin yet, then this is a good opportunity to implement it. This will be easier to do than for any other language.
Language support
Again, I encourage those who are interested to implement support in their favorite language and try to put it to good use.
Results
XML | JSON | YAML | TOML | Tree | |
---|---|---|---|---|---|
Size | 195% | 140% | 125% | 110% | 100% |
Number of patterns | 90 | 30 | 210 | 90 | 10 |
Unambiguous syntax | β | β | β | β | β |
Readability | β | β | β | β | β |
No escaping needed | β | β | β | β | β |
Exact coordinates of nodes | β | β | β | β | β |
Streaming | β | β | β | β | β |
Extensible data model | β | β | β | β | β |
Widespread | β | β | β | β | β |
Ideas
And now let's dream up what else interesting things can be done using the tree format.
- Requests to the DBMS
- Domain description
- Logging
- Communication of console utilities
- LISP-like language
- Universal AST
sql.tree - queries to the DBMS
Remember those clumsy MongoDB queries? Let's try to write our SQL:
select
from $users
fetch
@name
@phone
@photo *
@uri
@width
@height
where or
and
@sex = female
@age > 16
@hobby ~ \\b(?:java|type)script\b
Parsing the query in this form is a breeze, unlike real SQL. Please note that there is a uniform syntax for logical operations and predicates "is equal to", "greater than" and even "matches the regular expression". By the way, the regular expression can also be described in the tree format, which will make it much more supported.
select
from $users
fetch *
where @hobby ~
word-edge
or
\java
\type
\script
word-edge
domain.tree - description of the domain
Since we are talking about databases. This is how I describe the domain model.
hyoo_api_person
descr \Live service user
inherit hyoo_api_entity
field
id
descr \Unique human readable identifier
example \person=jin
key unique
type text
edit author
avatar
descr \Links to avatars
type list hyoo_api_image
edit author
mail
descr \Attached emails
type set hyoo_api_mail
From such a formal description, a server API, ACL rules, a DBMS scheme and an admin panel are automatically generated to manage the whole thing.
Logs
A common practice is to output single-line messages to the logs. As long as they fit in the width of your terminal - everything is fine, but this is a rather rare situation. Much more often, messages still do not fit and begin to be transferred, turning the flow of messages into a real mess, which is difficult to read with your eyes, and even programmatically process them - pain and suffering.
log.tree - structured logs
But what if the logs are immediately displayed in a two-dimensional form, at the same time easily readable by both machines and humans?
193.34.12.132 - - [2011-10-20T12:46:08+04:00] GET /nin-jin/slides/edit/master/t
ree/readme.md HTTP/1.1 200 4435
193.34.12.132 - - [2011-10-20T12:46:09+04:00] GET /nin-jin/slides/edit/master/t
ree/readme.html HTTP/1.1 404 4435
access
ip \193.34.12.132
time \2011-10-20T12:46:08+04:00
method \GET
uri \/nin-jin/slides/edit/master/tree/readme.md
protocol \HTTP/1.1
response \200
size \4435
The lower code is clearer. Isn't it?
tree-tools - CLI tree processing utilities
You can write utilities that would allow you to simply and efficiently process such logs. For example, we will read the log, filter by the value of one of the fields, select from the messages only fields that are interesting to us and display them as a sign.
> cat access.log.tree | pick ip time method uri | table
\193.34.12.132 2011-10-20T12:46:08+04:00 GET /index.html
\193.34.12.132 2011-10-20T12:46:10+04:00 GET /index.css
\193.34.12.132 2011-10-20T12:46:20+04:00 GET /index.js
> cat access.log.tree | filter time >= 2019-09 | pick ip uri | table
\193.34.12.132 /index.html
\193.34.12.132 /index.css
\193.34.12.132 /index.js
I have a prototype of such utility which I sometimes use to view live dev server logs. It will be great if someone undertakes to implement a complete set of tools. And when there are tools, then software developers will be motivated to write logs not randomly, but in a structured way.
tree as a communication protocol
You can go further and not just write logs in tree format, but in principle promote the idea that the output of any program should be structured. Many utilities have flags for outputting a response in the form of JSON or XML, but reading such an output is stressful for a person - you have to reopen the output in visual representation tools in order to understand what is returned there and how to approach it. Just imagine a world where the output can be read and immediately somehow transformed without picking mana in search of the desired combination of keys for the next program.
> gitlog
commit
message \$mol_style: TS@3.9 compatibility
sha \b1a8f07c839604d0d34430a186246f0c1f71e628
date \2020-05-15T23:24:32+0300
author \nin-jin <sairi-na-tenshi@ya.ru>
commit
message \$mol_regexp: concurrent parse ability
sha \be1abfa50542728dd5c156517ea31f469e7fb4d4
date \2020-05-15T23:03:30+0300
author \nin-jin <nin-jin@ya.ru>
> git log | pick date message | table
\2020-05-15T23:24:32+0300 $mol_style: TS@3.9 compatibility
\2020-05-15T23:03:30+0300 $mol_regexp: concurrent parse ability
WAT
WebAssembly is a forward-thinking assembler that gets as close to the machine as possible without sacrificing portability. It has a text representation format based on Lisp s-expressions.
(func $fact (param $x i64) (result i64)
(if $x (result i64)
(i64.eqz
(local.get $x))
(then
(i64.const 1))
(else
(i64.mul
(local.get $x)
(call $fact
(i64.sub
(local.get $x)
(i64.const 1)))))))
It is difficult to perceive it no matter how you format it. Unfortunately, this is the kind of code you will see when disassembling in browser devtools.
wasm.tree - assembler without tinsel
I'm currently working on a bytecode compiler for a more descriptive wasm.tree description.
func
$fact
param $x i64
result i64
body switch
test i64.eqz local.get $x
then i64.const 1
else i64.mul
local.get $x
call $fact i64.sub
local.get $x
64.const 1
From this assembler, a list of bytecodes in the [bin.tree] language (https://github.com/nin-jin/tree.d/wiki/bin.tree) is generated, which is already distilled into a binary by an elementary function.
00
61
73
6d
01
00
00
00
.
.
.
When there is something more or less complete, I will try to push this syntax as WAT2.0. Who cares about the fate of WebAssembly - join the development.
jack.tree - LISP without brackets
In fact, writing in raw assembler is too verbose. Therefore, the next step is the implementation of a meta-language that allows you to extend the language by means of the same language itself. The core of such a language should turn out to be extremely minimalistic, and all idioms will be connected to it as third-party libraries written in the same language.
jack
import wasm
tree func $fact
> $x #8
< #8 switches
test is-zero $x
then #8 1
else mul
$x
$fact sub
$x
#8 1
Roughly speaking, a program in this language iteratively modifies its own AST in such a way that the output is a wasm binary. It may sound intimidating, but thanks to the fact that tree saves the coordinates of the sources, it is not difficult to trace the source of the error. In the repository, you can look at a scanty prototype.
Abolishing LLVM
You can go even further and generate not wasm bytecodes, but downright bytecodes of the target processor, simply by adding one more transformer to the pipeline.
compile pipelines:
jack.tree => wasm.tree =============> bin.tree
jack.tree => wasm.tree => arm.tree => bin.tree
any-dsl.tree => jack.tree => wasm.tree => arm.tree => bin.tree
At the same time, at any level, you can run additional transformers that can optimize the code using the information available at the corresponding levels of abstraction.
optimization middlewares:
jack.tree => jack.tree
wasm.tree => wasm.tree
arm.tree => arm.tree
At the same time, let me remind you that we do not lose touch with the original sources, which will allow us to display adequate messages. And any intermediate AST can always be dumped into text in a very visual form of the tree format.
Again, join the development, it can turn out to be a cool thing to replace LLVM.
One AST to rule them all
And finally, we come to the main idea of ββthis report. Tree is a perfect candidate for a universal AST binder. Just look at how long the TypeScript code goes from source to the resulting bundle when building on a typical project.
code =(P)=> loader =(P)=> compiler =(SP)=> bundler =(SP)=> terser =(S)=> bundle
P - Parse
S - Serialize
And each tool re-parses your sources into its own AST, processes it, serializes it, and passes it on. If we agree on a single AST format, then we can significantly simplify the implementation of utilities and reduce the overhead for code processing.
code =(P)=> loader =====> compiler ======> bundler ======> terser =(S)=> bundle
Even if some of the utilities will run in separate processes (which means intermediate serialization is inevitable), the tree format will allow you to transfer the AST as quickly as possible, due to the minimum overhead for parsing and serialization.
Sandbox
tree.hyoo.ru - a sandbox where you can drive various transformations. Here are some examples:
- view.tree β view.ts - translation of the component description into TypeScript code.
- view.tree β locale.json - export of reference texts for localization in the form of JSON from the component description.
- view.tree β view.dts - export TypeScript types with embedded sorsmaps from component descriptions.
- JSON β json.tree - translation of JSON into json.tree.
- xml.tree β XML - translation of xml.tree into XML
- XML β xml.tree - translation of XML into xml.tree.
- js.tree β JS - translation of JavaScript AST into JavaScript proper.
- wasm.tree β WASM - compilation of WASM AST into a WASM binary and checking its correctness. This thing is still very raw: only 3 types of sections are supported, you can't run it right there in the sandbox. But as soon as there is time, I will finish the specification.
- jack.tree β JS eval is a translation of a meta-language with JavaScript generation with built-in sorsmaps and immediately its execution.
- MarkedText β JS - translation of MarkedText into JavaScript code with embedded sorsmaps, which generates a DOM tree using the DOM API.
- grammar.tree check - grammar correctness check.tree syntax descriptions on the fly.
- span.tree imprint/reuse - stitching of sources and mapping in span.tree tree, its intermediate serialization into a string, followed by restoration of the original tree without loss of mapping.
- automate.tree (JS) is an example of writing your own transformation in JavaScript that converts a simple automation script into JavaScript code with built-in sorsmaps.
- automate.tree (jack) is the same, but using the jack.tree language.
Where to go, where to go
I hope I managed to infect you with ideas about a brighter future. But in order to bring it closer, we need to work on it together. I'm afraid I won't be able to handle all of this. So write, call and do not disappear.
- These slides: nin-jin/slides/tree
- Everything about Tree: nin-jin/tree.d
- My twitter: @_jin_nin_
Top comments (0)