DEV Community

CoinMonks
CoinMonks

Posted on • Originally published at Medium on

Ethereum Under the Hood- Part 2 (RLP Encoding)-ver 0.3

Orignal published date:Apr 25, 2018 ·

Update: Grammar fixes

Part-2 of the article, if you have not gone through Part-1, please do. We will cover how Ethereum uses Recursive Length Prefix to encode data and state in this article.

This is a slightly longer read, so get yourself some coffee and be prepared to get confused, I did reading about RLP

The Problem:

Ethereum is a global computer and data to be processed by Ethereum has to be sent across the wire compactly and efficiently. Ethereum team chose Recursive Link Prefix decoding/encoding algorithm to achieve the above-stated goals. If you go through articles on Ethereum there are references to the “stack machines” and its usage in Ethereum, so let’s cover them for a little bit.

The Stack Machine:

Let’s briefly touch on what is a Stack.

Stack Machine

Source: Wiki

The Ethereum Virtual Machine(EVM) is a Stack-based machine which stores instructions and data in a stack-based using the LIFO(Last In First Out) method. Let’s take the example of a string “ cat ” and see how it’s pushed into a stack

['c','a','t'] 
so the PUSH operation will be:
**PUSH(‘c’)   
PUSH(‘a’)   
PUSH(‘t’)** When data has to be retrieved POP function is used in the order it was put in:
**POP('t')  
POP('a')  
POP('c')**  
Enter fullscreen mode Exit fullscreen mode

Ethereum Stack has certain restrictions. The Stack depth is maxed out at 1024 elements, and max word size per instruction is 256 bits or max of 32 bytes. I think of this as how broad the instruction set can be, as depicted in the diagram below.

Stack
Ethereum Stack Machine(1024 elements max height, 32 bytes max word size)

Stack Machines are ideal to have a mini-compiler implemented. Let’s take an expression

[**[0x12344**]**,[(a+b\*c/d**), **'x', 9]** ]
Enter fullscreen mode Exit fullscreen mode

A couple of things to consider, there are two items on the above array. A hex value and expression. Think of hex value (0x12344) as an Ethereum address and the expression as the state change information. Note that the max size of the word is 256 bits. Let’s consider [(a+b*c/d), ‘x’, 9]. The whole expression is less than 32 bytes. Why 32 bytes you ask? Here is a good stack exchangediscussion on that topic.

Check out this youtube video about Stack Machines

Byte Arrays and Items :

Let’s recap some concepts which we learned from the previous article. All data in Ethereum is serialized and deserialized as a byte-array. Put, an array of bytes is a byte-array, it’s worth repeating some examples of these byte-arrays.

RLP encoding function accepts an item or a collection of items

  • A string (i.e., byte array) is an item
  • A list of items is also an item

Examples of item and items

  • “Bitcoin”
  • [[] ]
  • [“Lion” ]
  • [[ “cat”, 123, ‘d’, ‘o’, ‘g’] ]

Now, what is Recursive Link Prefix (RLP)?

RLP is a set of rules to encode and decode structures and data. Part II. I focus on RLP encoding; we will cover RLP decoding in part II.II of the article in later posts. Let us recap what’s inside of an RLP. If you recall in the previous article we discussed “items” , all “items” are byte-arrays. RLP function accepts one parameter of type object.

A critical aspect of RLP is that it leaves encoding rules of specific structure to the top level layers, for example, if there is another structure, e.g. {“abc” }. RLP will consider this as a string element and will leave the encoding to the top level layers

_function_ encodeRLP**(rawData)**{ 
  .....,
 return encodedData #type bytearray;
}_function_ decodeRLP**(bytearray stateInfo)**{
  .....,
  return decodeData #raw;
}
Enter fullscreen mode Exit fullscreen mode

Let’s get into those RLP encoding rules for these objects :

Tip: I would suggest to open up an ASCII_ chart._

RLP encoding rules are defined as follows:

Rule 1: When encoding a payload with 1 byte and value in that byte falls within the following range [0x00, 0x7f], in other words within [0, 127] then encode the single byte as RLP itself

$ 120 to rlp
**_78_**   
$ 127 to rlp
**_7F_**  
$ "a" to rlp
**a**
Enter fullscreen mode Exit fullscreen mode

so encoding 120 into RLP results in 78 and 127 results in 7F, and for “a”, encoding results in “ a ” and the byte size is 1 and encoding rule#1 applies.

Rule 2: When encoding a string (in byte array) which falls between 0 and 55 bytes apply the following logic.

0x80+length(string),string

string "dog" to byte-array in hex is [0x64,0x6f,0x67] and size of the byte array is 3
Enter fullscreen mode Exit fullscreen mode

Let us encode item “ dog ” into RLP, the size aka length of byte array is 3, so encoding into RLP will result in

[0x80+3,0x64,0x6f,0x67] -> [0x83 , 0x64, 0x6f, 0x67]

e.g.: Another example would be “ hello world ” of RLP encoding. Size of “hello world” is 11 bytes.

$ "hello world" to rlp
 [**0x8B** 0x68 0x65 0x6C 0x6C 0x6F 0x20 0x77 0x6F 0x72 0x6C 0x64]
Enter fullscreen mode Exit fullscreen mode

Rule 3: When encoding a string (in byte array) greater than 55 bytes apply the following rule.

1)0xb7+length_in_bytes(byte_size(string)) 
2)length(string)
3)encoded string  
Join 1,2,3
Enter fullscreen mode Exit fullscreen mode

Let’s take an example string “Hello there, I am a very very long string, and I am going get encoded in RLP!”. The length of the string is 76, so rule 3 applies. Let’s get a couple of prereqs out of the way.

_a)len raw string = 76   
b)byte\_size(string) =76   
c)length\_in\_bytes(byte\_size(string) ) = 1_   
Enter fullscreen mode Exit fullscreen mode

Following the steps laid out by rule#3

  1. 0xb7 + 1
  2. 76
  3. “Hello there, I am a very very long string, and I am going get encoded in RLP!”
  4. (0xb7+1) , 76, “Hello there, I am a very very long string, and I am going get encoded in RLP!”

So the final RLP encoding in hex is:

**B84C** _48656C6C6F2074686572652C204920616D206120766572792076657279206C6F6E6720737472696E6720616E64204920616D20676F696E672067657420656E636F64656420696E20524C5021_
Enter fullscreen mode Exit fullscreen mode

Rule 4: When encoding a list and the encoded payload in the list is between 0–55 bytes apply the following encoding rule.

1)0xc0+length of (list)) 
2)Encoded string
Join 1,2
Enter fullscreen mode Exit fullscreen mode

Given an input to RLP encoding with [“dog”, “mouse”, “tigers”, 127]

We need to encode the payload within the list first.

  • “dog” : 0x83 , 0x64, 0x6F, 0x67
  • “mouse” : 0x85 ,0x6D, 0x6F ,0x75, 0x73, 0x65
  • “tigers” : 0x86 , 0x74 ,0x69 ,0x67 ,0x65 ,0x72 ,0x73
  • 127 : 0x7F

Now the list with the encoded payload looks like this:

[0x83 0x64 0x6F 0x67 0x85 0x6D 0x6F 0x75 0x73 0x65 0x86 0x74 0x69 0x67 0x65 0x72 0x73 0x7F] and the length of the list is 18(12 in hex)

so final output of this rule is:

  1. 0xc0+12 : 0xd2
  2. [0x83 0x64 0x6F 0x67 0x85 0x6D 0x6F 0x75 0x73 0x65 0x86 0x74 0x69 0x67 0x65 0x72 0x73 *0x7F *]
  3. Union of 1 and 2
  4. Final output : 0xd2 0x83 0x64 0x6F 0x67 0x85 0x6D 0x6F 0x75 0x73 0x65 0x86 0x74 0x69 0x67 0x65 0x72 0x73 0x7F

Rule 5 : When encoding a list and the encoded payload in the list is greater than 55 bytes apply the following encoding rule.

1)0xf7+length_in_bytes(item 2)
2)length(payload)
3)Encoded payload
Join 1,2,3
Enter fullscreen mode Exit fullscreen mode

I would let you do the rest, very similar to the rule above.

Finally a couple of notes

  • RLP encodes positive integers in Big Endian format,and discards the leading zeros, and integer value zero is the same as an empty byte array.
  • There are certain constants for empty lists, string, and integer 0
1) Empty List **[]** encoded to **0xC0**
2) Empty String **""** encoded to **0x80**
3) Integer **0** encoded to **0x80**
Enter fullscreen mode Exit fullscreen mode

Recap Recap Recap

My recommendation is to let this digest and come back to the article again. I am not 100% sure on some of the finer details either, but I did get some useful perspectives on the design decision and some fantastic help from the good folks at stack-exchange.

Let’s recap:

  • RLP is a set of rules to encode an item or a list of items.
  • RLP has a different set of rules based on the size of the payload.
  • Strings is a byte array.
  • Empty strings, lists have a predefined value.
  • RLP is used due to its capability to compact data and is simple.

Tip: Try reading the rationale on the design decisions on why the Ethereum team choose RLP. There are various implementations of RLP. I would recommend going through them. I did refer to implementation in Elixir and Ruby.

There are also other great medium posts on RLP which I entirely referred to go and will continue to do for future reference. Check out my implementation with some help of code snippets from girishramnani, and many thanks to @kvkalidindi for proofreading and valuable comments.

In the next article, Part 3 , we will cover RLP decoding with some reference to implementations.

References:


Top comments (0)