Uman Shahzad

High-level Parsing Concepts

This post will teach you some general things you need to know and think about when parsing out structured content inside a buffer. It will particularly provide questions to ask and points to ponder when approaching any parsing problem, along with some real-world examples.

We will not talk about any particular algorithm, e.g. those used in compilers or other sophisticated programs with large and complex parsing steps. However, all of the concepts explained here will be useful in one way or another in any parsing context.

Whole vs. Chunks

One of the most important parameters in deciding a parsing strategy is how the input is received; is it one buffer that contains the entirety of the structure, or a partial buffer (a chunk) that gets more data over time (streaming)?

When dealing with the entirety of the structure in a single buffer at once, the algorithm is usually very simple to express. You can have a simple loop that goes from start to finish in the buffer, and keep state and context data relevant to the problem updated as the loop proceeds. There is usually no reason to backtrack in the buffer itself, since out-of-band state and context data should be kept correctly up-to-date regarding what has been parsed so far.
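
As a rough illustration, a whole-buffer pass often collapses into a single loop over the bytes plus a bit of out-of-band state. The toy field-counting function below (all names are made up) is a minimal sketch of that shape:

```c
#include <stddef.h>

/* Whole-buffer pass: count comma-separated fields in one loop.
 * All state lives outside the buffer (pos, field_count), so there
 * is never a need to backtrack. */
size_t count_fields(const char *buf, size_t len)
{
    size_t field_count = (len > 0) ? 1 : 0;
    for (size_t pos = 0; pos < len; pos++) {
        if (buf[pos] == ',')
            field_count++;
    }
    return field_count;
}
```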

When dealing with only chunks of the structure at a time, where chunk boundaries may occur at arbitrary points in the structure, parsing becomes significantly more difficult because the control flow is harder to express; a simple loop is rarely sufficient. However, by asking certain questions and coming up with the right solutions for your particular problem, the parse can be made a lot simpler (a sketch of one chunk-fed interface follows the list below):

  • What is the input source?
  • Is the input going to be retrieved in one big chunk, or in multiple fixed-size chunks, or in multiple variable-size chunks? And is the chunk size going to be sufficient for performing all parse steps, or will we need to buffer previous chunks until we have one bigger chunk that can be properly parsed?
  • What non-buffer data (i.e. context/state) is needed to help perform a later parse step given the current chunk?
  • How is the data encoded inside the buffer going to be output to the user after parsing?
  • What guarantees do we have about total size? Do we know in advance, or do we need to track how much has been input so far, in order to enforce limits?
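
One way to tame the chunked case is a "feed"-style interface, where the caller pushes whatever bytes it currently has and the parser owns all buffering and state between calls. The names and return codes below are hypothetical, just to show the shape:

```c
#include <stddef.h>

/* Hypothetical chunk-fed parser interface. The parser keeps its own
 * state and any partially-consumed input between calls to parser_feed. */
typedef struct parser parser_t;

typedef enum {
    PARSE_NEED_MORE,  /* chunk consumed, but no complete element yet        */
    PARSE_ELEMENT,    /* one element finished; retrieve it from the parser  */
    PARSE_ERROR       /* malformed input or an internal limit exceeded      */
} parse_status_t;

parser_t      *parser_new(void);
parse_status_t parser_feed(parser_t *p, const char *chunk, size_t len);
void           parser_free(parser_t *p);
```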

Examples

In each example, we ask ourselves relevant questions and answer them in order to come to a reasonable parsing strategy.

Stream Parsing JSON

Q: What is the input source?

Assume the input source to be a network socket using a streaming protocol like TCP.

Q: What is the chunk size, and can all parsing steps succeed on a chunk of that size, or will buffering be needed?

Data from the network socket arrives as a stream, and JSON elements vary in length, so no single chunk size will suffice for all possible entities, nor does a fixed size align well with the streaming input model of the socket. So, we will use whatever comes from the socket first, and if it doesn't contain enough data for one successful parse step, we will buffer that data and ask for more, appending to the buffer until we can successfully complete a parse step. Then we can deallocate the buffer and start with a new one for more data, until the end of the JSON representation.
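
A minimal sketch of that buffer-and-retry loop might look like the following, assuming a hypothetical try_parse_step() that returns how many bytes it consumed (0 when the buffered data is still incomplete); error handling is elided:

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

size_t try_parse_step(const char *buf, size_t len); /* hypothetical */

void parse_from_socket(int fd)
{
    char   *buf = NULL;
    size_t  used = 0;
    char    chunk[4096];
    ssize_t n;

    while ((n = recv(fd, chunk, sizeof chunk, 0)) > 0) {
        /* Append the new chunk to whatever is left over from before. */
        buf = realloc(buf, used + (size_t)n);
        memcpy(buf + used, chunk, (size_t)n);
        used += (size_t)n;

        /* Complete as many parse steps as the buffered data allows. */
        size_t consumed;
        while (used > 0 && (consumed = try_parse_step(buf, used)) > 0) {
            memmove(buf, buf + consumed, used - consumed);
            used -= consumed;
        }
    }
    free(buf);
}
```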

Q: What non-buffer data is needed?

We need several pieces of state information. For example:

  • Has the JSON document just started?
  • Has the JSON document just ended?
  • Have we just read in a key in an object, and are now expecting a value?
  • Have we just read in a value and a comma, and are now expecting another key (if in an object) or a value (if in an array)?
  • Has a new array just started?
  • Has a new object just started?
  • Has an array just ended?
  • Has an object just ended?

Additionally, we need to know how deep we are in the JSON hierarchy, and whether each parent layer is an array or an object. This matters because when we "close" an object or array, we need to know whether the enclosing scope is itself an object or an array, so we can verify that the scope is being closed with the right character and know whether to expect a key or a value next.

In other words, lots of data is needed outside of the buffer itself as we parse. We cannot ensure a fully correct parse otherwise.
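
For example, one plausible shape for that out-of-band state is a small struct plus a container stack (all names here are made up for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_DEPTH 200  /* mirrors the depth limit discussed below */

typedef enum { CTX_OBJECT, CTX_ARRAY } container_t;

/* Hypothetical out-of-band state for a streaming JSON parser. */
typedef struct {
    bool        doc_started;
    bool        doc_ended;
    bool        expecting_key;    /* just saw '{' or ',' inside an object          */
    bool        expecting_value;  /* just saw ':' in an object, or ',' / '[' in an array */
    uint32_t    depth;            /* how deep we are in the hierarchy              */
    container_t stack[MAX_DEPTH]; /* is each enclosing layer an object or an array? */
} json_state_t;
```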

Q: How is the data encoded inside the buffer going to be output to the user after parsing?

In the interest of making the fewest assumptions about the user's desired non-buffer data format, we will only output individual data types that have a 1-to-1 correspondence between JSON and the programming language used.

So, if we just completed the parse for a number, and it was in floating-point format, we can use one of f64_t or f128_t depending upon the size of the number. If it was an unsigned number, we will output a u64_t. If it was a negative number, then s64_t. If a string (whether key or value), then a str_t. And so on.

Objects and arrays get no such output values of their own; the user will have to check the current JSON parse state to figure out in what context (object or array) the data was found, in case they need to construct an array or object in their own format (which may be a hashtable, a linked list, or any other structure) in their application.
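
One way to hand such 1-to-1 values back to the user is a tagged union; below, standard C types stand in for the f64_t/u64_t/s64_t/str_t names used above (this is an assumed layout for illustration, not a fixed API):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values that map 1-to-1 between JSON and the host language.
 * Objects and arrays are reported only as parse events, never as values. */
typedef enum {
    VAL_FLOAT,     /* f64_t */
    VAL_UNSIGNED,  /* u64_t */
    VAL_SIGNED,    /* s64_t */
    VAL_STRING,    /* str_t */
    VAL_BOOL,
    VAL_NULL
} value_kind_t;

typedef struct {
    value_kind_t kind;
    union {
        double   f;
        uint64_t u;
        int64_t  s;
        struct { const char *ptr; size_t len; } str;
        bool     b;
    } as;
} json_value_t;
```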

Q: What guarantees do we have about total size?

When receiving from the network, especially if received from the open internet where anyone may send a message, malicious or otherwise, we cannot make many guarantees about the total input size.

Instead, we can set internal limits, such as:

  • A maximum string size of 1MB.
  • A maximum object/array hierarchy depth of 200.
  • A maximum total document size of 5MB.

There may be more limits, and all of these limits will be configurable. So a user trying to parse trusted JSON from disk can choose to remove the document size limit completely because they may be parsing gigabytes worth of JSON. Or a user who knows that some data coming from a web request will contain strings of sizes up to 5MB can change the string size limit and the total document size limit to compensate for that use case.
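
A configurable limits structure might look roughly like this (names and defaults are illustrative only):

```c
#include <stddef.h>

/* User-configurable parser limits; 0 means "no limit". */
typedef struct {
    size_t max_string_bytes;    /* default: 1 MB */
    size_t max_depth;           /* default: 200  */
    size_t max_document_bytes;  /* default: 5 MB */
} json_limits_t;

static const json_limits_t DEFAULT_LIMITS = {
    .max_string_bytes   = 1 * 1024 * 1024,
    .max_depth          = 200,
    .max_document_bytes = 5 * 1024 * 1024,
};

/* A caller parsing trusted multi-gigabyte JSON from disk might do:
 *     json_limits_t limits = DEFAULT_LIMITS;
 *     limits.max_document_bytes = 0;   // lift the total-size limit
 */
```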

On-Disk CSV File Parsing

Q: What is the input source?

The input source is a file on a local filesystem. The entirety of the file will be loaded before parsing starts.

Q: What is the chunk size, and can all parsing steps succeed on a chunk of that size, or will buffering be needed?

Because we have the entirety of the CSV data in memory, our "chunk" is really the entire file, so any parsing step can trivially succeed on it.

Q: What non-buffer data is needed?

We can keep track of several pieces of state information, some necessary for parsing correctness and others useful for debugging and error reporting. For example (a sketch of such a state struct follows the list):

  • How many total columns are expected from the header?
  • What column are we in right now?
  • What row are we in right now?
  • Are we inside of a double-quoted cell?
  • Has the header been parsed already?
  • Did a new row just start?
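
A state struct capturing these pieces might look like this (field names are made up):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-parse state for the whole-file CSV walk. */
typedef struct {
    size_t expected_columns;  /* derived from the header row  */
    size_t column;            /* current column (0-based)     */
    size_t row;               /* current row (0-based)        */
    bool   in_quoted_cell;    /* inside a double-quoted cell? */
    bool   header_parsed;
    bool   row_just_started;
} csv_state_t;
```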

Q: How is the data encoded inside the buffer going to be output to the user after parsing?

Because CSV only specifies data as strings, by default we will parse each cell as a buf_t. The user can then perform further parsing on the individual cell depending upon which column it was in. For example, it may contain a number or a boolean value which can be converted to an s32_t or a bool_t.
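
As a sketch of that second-stage conversion, assuming cells are handed over as a plain pointer/length pair (standing in for buf_t), a numeric cell might be converted roughly like this:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Convert one cell (pointer + length) into a 32-bit signed integer.
 * Returns false if the cell is not a well-formed number in range. */
bool cell_to_s32(const char *cell, size_t len, int32_t *out)
{
    char tmp[32];
    if (len == 0 || len >= sizeof tmp)
        return false;
    memcpy(tmp, cell, len);
    tmp[len] = '\0';

    errno = 0;
    char *end;
    long v = strtol(tmp, &end, 10);
    if (errno != 0 || end != tmp + len || v < INT32_MIN || v > INT32_MAX)
        return false;
    *out = (int32_t)v;
    return true;
}
```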

Q: What guarantees do we have about total size?

We know the exact size of the full CSV file because that information is available via the filesystem.
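
On POSIX systems, for example, that size can be read up front with stat() before any parsing begins:

```c
#include <sys/stat.h>

/* Returns the file's size in bytes, or -1 if it cannot be determined. */
long long file_size_bytes(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    return (long long)st.st_size;
}
```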

Structure Assumptions

The source of data may or may not be reliable. Getting data from the local filesystem that was written by an internal system poses an entirely different parsing problem than getting data from the network sent by an arbitrary host following some protocol defined in a public IETF RFC.

When receiving data for which we can make very strong guarantees regarding structure, parsing can generally avoid looking for error conditions and anomalous circumstances. But if the data comes from an untrusted arbitrary source, few or no assumptions can be made about what's being received.

If few assumptions can be made about the data, what to do depends on the particular problem. Generally, however, it is useful to put yourself in the shoes of an adversary who can feed arbitrary data into your parsing system, sometimes fuzzed and sometimes semi-structured.

When the consequences of a crash caused by an unhandled case in the parsing system are low, it is generally acceptable to allow the crash and fix such cases iteratively, especially if the parsing problem is very complex, such as parsing a programming language locally with a compiler.
