Afri Schoedon

Posted on Nov 29, 2017 • Edited on Dec 17, 2017

The Ethereum-blockchain size will not exceed 1TB anytime soon.

#ethereum #parity #blockchain #bitcoin

Before diving into this article, please read the two disclosures about my involvement (1,2) and the one on data accuracy (3) at the bottom of the article.

At least once a month, someone posts a chart on r/ethereum predicting that the Ethereum blockchain will soon exceed 1 TB in size. I want to take this chance to clear up some myths around the Ethereum blockchain size and explain why this chart is technically correct but does not show the full picture.

Let's have a look at this chart first. It shows the complete data directory size of an Ethereum node (red), Geth in this case, and a Bitcoin node (blue), probably Bitcoin-Core, plotted over time. While the Bitcoin graph rises slowly and seemingly linearly, the Ethereum graph looks like an exponentially growing curve.

On Blocks, Block-History, States, and State-History

Users accusing Ethereum of blockchain bloat are not far off with their assumptions. But it is actually not the chain that is bloated, but the Ethereum state. I want to examine some terminology from the Whitepaper before proceeding.

Block. A bundle of transactions which, after proper execution, update the state. Each transaction-bundling block gets a number, has some difficulty, and contains the most recent state.
State. The state is made up of all initialized Ethereum accounts. At the time of writing, there are around 12 million known accounts and contracts, growing at a rate of roughly 100k new accounts per day.
Block-History. A chain of all historical blocks, starting at the genesis block up to the latest best block, also known as the blockchain.
State-History. The state of each historical block makes up the state history. I will get into the details on this later.

If this already bores you, please read on anyway.

Understanding Pruning-Modes and Sync-Modes

In early 2016, the Go-Ethereum team introduced a so-called fast synchronization mode. Since then, running geth --fast has been very popular, especially after the spam attacks on Ethereum later the same year made a full synchronization painful. I'm writing these modes in italics because I will come back to an essential disambiguation at a later point in this article. Just keep them in mind for now.

The Parity team (formerly Ethcore) reacted to the on-chain spam by offering a warp synchronization mode at the end of 2016 to ease the chain synchronization for new users. Much the same as Geth's fast, parity --warp soon became the de-facto standard mode for users trying to synchronize the Ethereum chain. As of today, both clients have adopted these options as their defaults.

But what does it mean to fast-sync versus full-sync a Geth node? What does it actually mean to warp-sync a Parity node rather than no-warp-syncing it?

A full Geth node processes the entire blockchain and replays all transactions that ever happened. A fast Geth node downloads all transaction receipts in parallel with all blocks, plus the most recent state database. It switches to a full synchronization mode once done with that. Note that this results not only in a fast sync but also in a pruned state database, because the historical states are not available for blocks older than the best block minus 1024. That's not an issue, but before reading on, please keep in mind that Geth synchronization modes are also pruning modes.

Looking at Parity's configuration options, this gets more complex. In addition to the previously mentioned synchronization modes, Parity also offers separate pruning modes, namely fast and archive... Right, Geth's fast is a sync mode, as we learned, that also prunes; Parity's fast, however, is a pruning mode that is not tightly coupled to the sync mode. At this point, I have to admit, the terminology is confusing, and I might have lost you already. Let's draw something with pen and paper.
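To make the pruning window concrete, here is a minimal sketch that only assumes the 1024-block rule described above. The function name, the constant name, and the example block number are made up for illustration and are not taken from Geth.

```python
# Assumption: a pruned node keeps states for the best block and the
# 1024 blocks before it, as described in the paragraph above.
STATE_HISTORY = 1024


def state_available(block_number: int, best_block: int,
                    history: int = STATE_HISTORY) -> bool:
    """Return True if a pruned node would still hold the state of block_number."""
    if block_number < 0 or block_number > best_block:
        return False  # unknown or future block
    return block_number >= best_block - history


best = 4_600_000  # illustrative block height, not a measured value
print(state_available(best, best))          # state of the best block
print(state_available(best - 1024, best))   # oldest state still inside the window
print(state_available(best - 1025, best))   # just outside the window, pruned
```

The blocks themselves are not affected by this check; only the state attached to them is dropped.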

Geth's fast enables a quicker synchronization and database pruning. Geth full disables both. Parity warp, however, can be disabled without disabling the state-trie pruning! This is a significant sentence. Thus I bolded it. And I am not comparing Ethereum clients here, that's not my intention at least. I want to show you that it is possible to run a full-verifying Ethereum node with a small database. Parity just provides the proof-of-concept for this.

But why is this? Because as long as you have all historic blocks on your disk, you can compute any historical state from it by reprocessing the entire chain again. But in most use-cases, you don't need historical states at all! Therefore it is smart just to delete outdated entries from the state history and to reduce your required disk space by 95%.
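The idea that any historical state can be rebuilt from the blocks alone can be sketched with a toy model. This is not how a real client stores data: the state here is a plain dictionary of balances instead of a state trie, blocks only contain simple transfers, and all names (Block, apply_block, state_at) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = Dict[str, int]            # account -> balance (toy stand-in for the state)
Transfer = Tuple[str, str, int]   # (sender, recipient, amount)


@dataclass
class Block:
    number: int
    transfers: List[Transfer]


def apply_block(state: State, block: Block) -> State:
    """Execute all transfers of a block and return the resulting state."""
    new_state = dict(state)
    for sender, recipient, amount in block.transfers:
        if new_state.get(sender, 0) < amount:
            raise ValueError(f"insufficient balance in block {block.number}")
        new_state[sender] = new_state[sender] - amount
        new_state[recipient] = new_state.get(recipient, 0) + amount
    return new_state


def state_at(genesis: State, chain: List[Block], number: int) -> State:
    """Recompute the state after block `number` by replaying from genesis."""
    state = dict(genesis)
    for block in chain:
        if block.number > number:
            break
        state = apply_block(state, block)
    return state


genesis = {"alice": 100}
chain = [
    Block(1, [("alice", "bob", 30)]),
    Block(2, [("bob", "carol", 10)]),
    Block(3, [("alice", "carol", 5)]),
]

latest = state_at(genesis, chain, 3)  # the only state a pruned node must keep
old = state_at(genesis, chain, 1)     # rebuilt on demand from the block history
print(latest, old)
```

The trade-off is time for space: the pruned node stores one state instead of one per block, and pays for an old state with a replay when it is actually needed.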

So, what's the minimum Size of a full-verified Node?

Some tens of GB by just running parity --no-warp. Earlier this fall it was less than 20 GB, but the state is growing very fast. Currently, the raw historical block data containing the blocks and transactions is approximately 12-15 GB in size and the latest state around 1-2 GB.

But is this to be considered a full Ethereum node? Yes:

It runs a full blockchain synchronization starting at genesis.
It replays all transactions and executes all contracts.
It recomputes the state for each block.
It keeps all historical blocks on the disk.
It keeps the most recent states on the disk and prunes ancient states.

Something an Ethereum client never does is delete old blocks. This is a significant difference between Bitcoin and Ethereum, because pruning a Bitcoin node leaves no choice but to remove old blocks. With this context available, it's easier to understand why users often think a pruned Ethereum node is not a full node. But now, dear reader, you know the opposite is true. :)

And on top of this, even a warp-synced Parity node downloads the whole history of blocks after the initial synchronization, allowing it to serve the network as a full node once the ancient-block synchronization has completed.

The Full Picture: 9 Parity Configurations compared

Below is a screenshot of my nicely-colored spreadsheet trying to distinguish between node-security of different Parity operation modes.

The configurations 00 through 05 are all to be considered full nodes. Configuration 06 is a default-configuration warp-node which can be regarded as full once the ancient block download is finished. However, it does not replay all transactions; it only checks the Proof-of-Work of the historical blocks.

Configuration 07 is something users often ask for, but it should be highly discouraged in production use. This setting is comparable to a pruned Bitcoin node, as historical blocks are partially not available. This is not a full node anymore. Note how I added a separator above this configuration. You get the idea.

Configuration 08 is a light client, but that's worth another blog article. Thanks for scrolling this far down, here is your conclusion: An Ethereum full node does not require more than 20-30 GB disk space by default. :)

Noteworthy disclosures and bottom-line comments.

(1) I work for Parity. I'm comparing different Parity configurations not only because I sincerely know and understand them, but also because Parity allows users to configure pruning mode and synchronization mode separately.

(2) I hold some Bitcoin and some Ether. I hope this does not have any influence on the technical aspects I'm outlining in this article. Also, I'm trying not to become overly political about this.

(3) I have been running Parity in 36 different configurations over six weeks to gather the numbers. This is time- and resource-consuming, and still, it bears the issue that I cannot keep all configurations running at the same time. Therefore, the accuracy of the numbers presented in this article has to be taken with caution. I expect the results to differ by up to plus/minus 20% from other nodes running the same configuration. But you get the idea:

| ID | Pruning / DB Config | Verification    | Available History          | ETH        | ETC        | MSC        | EXP        | Parity CLI Options                         |
|====|=====================|=================|============================|============|============|============|============|============================================|
| 00 | archive +Fat +Trace | Full/No-Warp    | All Blocks + States        | 385     GB |  90     GB |  25     GB |   5.6   GB | --pruning archive --tracing on --fat-db on |
| 01 | archive +Trace      | Full/No-Warp    | All Blocks + States        | 334     GB |  90     GB |  21     GB |   5.8   GB | --pruning archive --tracing on             |
| 02 | archive             | Full/No-Warp    | All Blocks + States        | 326     GB |  91     GB |  30     GB |   5.5   GB | --pruning archive                          |
| 03 | fast +Fat +Trace    | Full/No-Warp    | All Blocks + Recent States |  37     GB |  13     GB |   3.5   GB |   1.3   GB | --tracing on --fat-db on                   |
| 04 | fast +Trace         | Full/No-Warp    | All Blocks + Recent States |  34     GB |  13     GB |   3.5   GB |   1.2   GB | --tracing on                               |
| 05 | fast                | Full/No-Warp    | All Blocks + Recent States |  26     GB |   9.7   GB |   3.0   GB |   1.1   GB | --no-warp                                  |
| 06 | fast +Warp          | PoW-Only/Warp   | All Blocks + Recent States |  25     GB |   9.6   GB |   2.6   GB |   0.96  GB |                                            |
| 07 | fast +Warp -Ancient | No-Ancient/Warp | Recent Blocks + States     |   5.3   GB |   2.9   GB |   0.19  GB |   0.13  GB | --no-ancient-blocks                        |
| 08 | light               | Headers/Light   | No Blocks + No State       |       5 MB |       3 MB |       4 MB |       5 MB | --light                                    |

Meta-data:

Version: Parity/v1.8.0-unstable-7940bf6ec-20170921/x86_64-linux-gnu/rustc1.19.0 from source w/ musicoin support
Ubuntu: 17.04 Kernel 4.10.0-35-generic / September 2017 / Lenovo Thinkpad X270, Core i7-7600U, 1TB SSD, 16GB RAM
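The saving of fast pruning over archive can be read off the table directly. A small sketch that uses only the numbers of configurations 02 (archive) and 05 (fast, --no-warp) from the table above; the variable and function names are made up for illustration.

```python
# Sizes in GB copied from the table: configuration 02 (archive) and 05 (fast, --no-warp).
archive_gb = {"ETH": 326, "ETC": 91, "MSC": 30, "EXP": 5.5}
pruned_gb = {"ETH": 26, "ETC": 9.7, "MSC": 3.0, "EXP": 1.1}


def saving_percent(archive: float, pruned: float) -> float:
    """Disk space saved by pruning, as a percentage of the archive size."""
    return 100.0 * (1.0 - pruned / archive)


savings = {
    chain: round(saving_percent(archive_gb[chain], pruned_gb[chain]), 1)
    for chain in archive_gb
}
for chain, pct in savings.items():
    print(f"{chain}: {pct}% smaller than archive")
```

For the ETH column this comes out at roughly 92%, and somewhat less for the smaller chains. Keep the plus/minus 20% uncertainty from disclosure (3) in mind when reading these ratios.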

Thanks for scrolling to the bottom. <3

Update: Thanks for featuring me on dev.to and Twitter. Users who enjoyed reading this article might also find the following reddit discussion interesting.

Fun fact: While publishing this article, the price of Bitcoin broke 10_000 USD and Ethereum 500 USD. I think I will add current market prices to my articles in future, just for fun.

Update: Thanks for rating this top-1 post, dear dev.to team <3 <3 <3

Update: Here is a more controversial discussion on Hackernews.

Top comments (21)

Jason C. McDonald • Nov 29 '17

I only understood about 10% of this, being totally uninformed about blockchain (my fault)...but I still have to <3 and applaud, because I know research effort when I see it! Great job.

Thomas Jay Rush • Nov 30 '17 • Edited

I run a Parity node with --tracing on and --pruning archive and have done so since June of 2016. Since the Byzantium hard fork (October 16), the chain data (in this admittedly extreme case) has grown more than 125 GB. If one wishes to do what I call a "deep, full, audit level accounting", one needs the traces. It's unclear if one needs the archive from this article. If you're running tracing and archive, the chain will blow past 1TB very soon.

Erik Jonsson Thorén • Dec 17 '17

Hi Afri, thanks for the write-up!

I have a question, you write:

The configuration 07 is something users often ask for but should be highly discouraged in production use. This setting is comparable to a pruned bitcoin node as historical blocks are partially not available.

Could you explain why this should be discouraged?

Afri Schoedon • Dec 17 '17

Because it does not hold historic blocks. And you can only verify the integrity of the chain, the transactions, the state, and balances, if you have access to all historic block data. That is available in Configurations 00-06, and partially in 08, but not in 07.

Kari Ilkkala • Dec 7 '17

Great article!

If I understood correctly, a full but pruned node would need all the blocks + the recent state of each account (and smart contract)?

Yesterday Dec 6th 2017 the number of Ethereum accounts grew to 13.4 million. Based on your article, the size of the chaindata of a pruned Ethereum node is now, depending on the mode, somewhere between 25 and 40 GB?

As more and more apps meant for global use rely on users to have cryptocurrencies and tokens, the number of Ethereum accounts can start to approach that of Internet users in general.

How big even the pruned node will grow if we have like 1 billion plus accounts? Proportional growth from 35 GB / 13.4M accounts node size would give 2.6 TB node size.

Not trying to invoke FUD, just asking?

Afri Schoedon • Dec 7 '17

Hey, thanks for reading it! Without running the numbers, I want to say: it's not impossible. I am just highlighting that we are far away from this and this is not happening soon.

There are a lot of proposals for scalability, I don't really have an overview, and also I do not feel technically qualified to discuss them currently. But regarding the state size, or let's say, state bloat, you might want to read about state-cleaning or dust-cleaning. There are proposals to just purge entries from the state which are provably non-recoverable accounts (i.e., with a balance smaller than the lowest possible transaction fee).

The good thing is that we still have time, in that regard, to discuss proposals and eventually implement them.

Craig • Feb 10 '18

Great article!

But I'm still confused by a few things.

What is the point of the archive node?

What happens when the archive node is projected to surpass 100 TB by early 2020?

If the fast full node is about 10% of an archived full node, again, what happens when say by 2020 that the archive node is over 100 TB?

And if the archive node is necessary to some degree to solidify the network, how does it affect the centralization aspect of things when say in 5 years that the archive node is over 5-10K TB?

Thanks in advance

Ali Askeri • Apr 2 '21

Just finished syncing a full node with parity on archive (ID 02 in your table) and I can confirm the current size of parity's db folder is 954GB with 14,751 items.

huyhoangk50 • Dec 16 '17

I want to develop a server that deploys smart contracts through web3j. May I ask: is my server considered a node of the Ethereum decentralized system? And which parameters does the server require?

Magento Chile • Mar 12 '18

Hi Rando,

Here is my data for you to update or contrast with the table. On a Digital Ocean Droplet with 2G RAM, 60G storage, 2 CPUs (approx. 48 hours synchronizing, IP in London).
12.03.2018 Parity Ethereum synchronized node (Fast --no-warp (cell 5)):

cd /root/.local/share/io.parity.ethereum/docker

[root@centos-s-2vcpu-2gb-lon1-01 docker]# du -sh chains
43G chains

[root@centos-s-2vcpu-2gb-lon1-01 docker]# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/vda1 62903276 47366608 15536668 76% /
devtmpfs 920968 0 920968 0% /dev
tmpfs 941688 0 941688 0% /dev/shm
tmpfs 941688 102368 839320 11% /run
tmpfs 941688 0 941688 0% /sys/fs/cgroup
tmpfs 188340 0 188340 0% /run/user/0

[root@centos-s-2vcpu-2gb-lon1-01 io.parity.ethereum]# du -sh docker
44G docker

Regards,

Boris Durán

suffic • Apr 5 '18

If by 'anytime soon' you mean, one quarter later...

Just finished syncing a full node with parity on archive (ID 02 in your table) and I can confirm the current size of parity's db folder is 954GB with 14,751 items.

mohinimraut • Aug 23 '18

When I restart the REST server, I face the following error:
Discovering types from business network definition ...
Connection fails: Error: Error trying to ping. Error: make sure the chaincode landregistry has been successfully instantiated and try again: getccdata composerchannel/landregistry responded with error: could not find chaincode with name 'landregistry'
It will be retried for the next request.
Exception: Error: Error trying to ping. Error: make sure the chaincode landregistry has been successfully instantiated and try again: getccdata composerchannel/landregistry responded with error: could not find chaincode with name 'landregistry'
Error: Error trying to ping. Error: make sure the chaincode landregistry has been successfully instantiated and try again: getccdata composerchannel/landregistry responded with error: could not find chaincode with name 'landregistry'
at _checkRuntimeVersions.then.catch (/usr/lib/node_modules/composer-rest-server/node_modules/composer-connector-hlfv1/lib/hlfconnection.js:806:34)

ch0235 • Apr 18 '18

Hello, I'd like to run the "traceReplayTransaction" API with a Parity client node. The CLI param I currently use is "--pruning=archive", but the synced data is too large, more than 1.0 TB. So I want to use the params "--tracing on --fat-db on" instead and redo the sync.

The question is whether I can still run the "traceReplayTransaction" API when using the params "--tracing on --fat-db on"?

If not, could you please tell me the best alternative params? Thank you!