Web3 Indexing: The Ultimate Guide (No Prior Knowledge Required)

#solidity #ethereum #web3 #smartcontract

Data engineering culture is not yet deeply ingrained in the Web3 developer community, and not every developer can readily say what indexing means in the context of Web3. In this article I would like to clarify the topic and talk about The Graph, which has become the de facto industry standard for DApp builders who need to access data on the blockchain.

Let’s start with indexing.

Indexing in databases is the process of creating a data structure that sorts and organizes the data in a database in such a way that search queries can be executed efficiently. By creating an index on a database table, the database server can more quickly search and retrieve the data that matches the criteria specified in a query. This helps to improve the performance of the database and reduces the time it takes to retrieve information.
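To see this effect in practice, here is a minimal sketch using Python's built-in sqlite3 module. The users table, its columns, and the index name idx_users_email are made up for illustration. The sketch asks SQLite how it plans to run the same lookup before and after an index is created.

```python
import sqlite3

# In-memory database with a hypothetical `users` table (illustration only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO users (email, city) VALUES (?, ?)",
    [(f"user{i}@example.com", f"city{i % 100}") for i in range(10_000)],
)


def query_plan(sql: str) -> str:
    """Return SQLite's own description of how it would execute `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable plan step
    return " | ".join(row[-1] for row in rows)


lookup = "SELECT id FROM users WHERE email = 'user4242@example.com'"

plan_before = query_plan(lookup)  # without an index: every row is scanned
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(lookup)   # with the index: a direct search

print(plan_before)
print(plan_after)
```

The exact wording of the plan depends on the SQLite version, but the first plan should describe a scan of the whole table and the second a search that uses idx_users_email.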

But what about indexing in blockchains? The most popular blockchain architecture is EVM (Ethereum Virtual Machine).

The Ethereum Virtual Machine (EVM) is a runtime environment that executes smart contracts on the Ethereum blockchain. It is a computer program that runs in every node on the Ethereum network. It is responsible for executing the code of smart contracts and also provides security features such as sandboxing and gas usage control. The EVM ensures that all participants on the Ethereum network can execute smart contracts in a consistent and secure way.

As you might know, data on the blockchain is stored as blocks with transactions inside. Also, you might know that there are two types of accounts:

Externally owned account — described by any ordinary wallet address.
Contract account — described by any deployed smart contract address.

If you send some ether from your account to another externally owned account, nothing else happens behind the scenes. But if you send a transaction with a payload to a smart contract address, you actually call a method of that smart contract, and this method may in turn create some “internal” transactions (calls from the contract to other addresses).

Okay, if any transaction can be found on the blockchain, why not transform all the data into a big constantly updating database which can be queried in SQL-like format?

The problem is that you can access the data of a smart contract only if you have a “key” to decipher it. Without this “key,” the data of smart contracts on the blockchain is actually a mess. This key is called ABI (Application Binary Interface).

ABI (Application Binary Interface) is a standard that defines the way a smart contract communicates with the outside world, including other smart contracts and user interfaces. It defines the data structure, function signatures, and argument types of a smart contract to enable correct and efficient communication between the contract and its users.

Any smart contract on-chain has an ABI. The problem is that you might not have the ABI for the smart contract you are interested in. Sometimes you can find an ABI file (which is actually a JSON file describing the functions and events of a smart contract, like an interface to communicate with):

on Etherscan (if the smart contract has been verified)
on GitHub (if the developers open-sourced the project)
in the standard itself (if the smart contract implements a standard interface like ERC-20, ERC-721, etc.)

Of course, if you are a developer of a smart contract, you have the ABI, because it is generated while compiling.
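As an illustration of what such a JSON file contains, here is a sketch of the ABI entry for the standard ERC-721 Transfer event, loaded in Python. The helper canonical_signature is a hypothetical name and a simplification: it does not expand tuple (struct) types. The string it builds is the one whose Keccak-256 hash identifies the event in the logs. That hash is not computed here because Keccak-256 is not part of Python's standard library (hashlib's sha3_256 uses different padding).

```python
import json

# ABI entry for the ERC-721 `Transfer` event, in the shape found in a compiled ABI JSON file
ERC721_TRANSFER_ABI = json.loads("""
{
  "anonymous": false,
  "name": "Transfer",
  "type": "event",
  "inputs": [
    {"indexed": true, "name": "from",    "type": "address"},
    {"indexed": true, "name": "to",      "type": "address"},
    {"indexed": true, "name": "tokenId", "type": "uint256"}
  ]
}
""")


def canonical_signature(abi_entry: dict) -> str:
    """Build `Name(type1,type2,...)` from an ABI entry (simplified: no tuple expansion)."""
    types = ",".join(inp["type"] for inp in abi_entry["inputs"])
    return f"{abi_entry['name']}({types})"


signature = canonical_signature(ERC721_TRANSFER_ABI)
indexed_names = [inp["name"] for inp in ERC721_TRANSFER_ABI["inputs"] if inp["indexed"]]

print(signature)
print(indexed_names)
```

Without this entry, a reader of the raw log would not know the event name, the argument names, or which arguments are stored as indexed topics.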

What it looks like from the developer’s side

But let’s not stop at the concept of ABI. What if we look at this topic from the smart contract developer's side? What is a smart contract? The answer is much easier than you might think. Here is a simple explanation for anybody who is familiar with object-oriented programming:

In a developer's code, a smart contract is a class with fields and methods (for EVM-compatible chains, smart contracts are usually written in Solidity). Once it has been deployed on-chain, the smart contract becomes an object of this class. From then on it lives its own life, allowing users to call its methods and change its internal fields.
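The analogy can be sketched in a few lines of Python. ToyNFT and its methods are hypothetical names, and the model ignores gas, signatures, and everything else the EVM does. It only shows the "class becomes a live object with state" idea.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNFT:
    """Hypothetical toy model of an NFT contract; not real EVM semantics."""

    # tokenId -> owner: plays the role of the contract's internal fields
    owners: dict = field(default_factory=dict)

    def mint(self, to: str, token_id: int) -> None:
        if token_id in self.owners:
            raise ValueError("token already minted")
        self.owners[token_id] = to

    def transfer(self, sender: str, to: str, token_id: int) -> None:
        # Only the current owner may move the token
        if self.owners.get(token_id) != sender:
            raise PermissionError("sender does not own this token")
        self.owners[token_id] = to

    def owner_of(self, token_id: int) -> str:
        return self.owners[token_id]


# "Deployment": the class becomes a long-lived object that users interact with
contract = ToyNFT()
contract.mint("alice", 1)
contract.transfer("alice", "bob", 1)
current_owner = contract.owner_of(1)
print(current_owner)
```

Note that the object only remembers the current owner. The fact that "alice" once held the token is gone unless something records it elsewhere.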

What is worth highlighting is that any method call that changes the state of a smart contract requires a transaction, and it is usually accompanied by an event that the developer emits right from the code. Let’s illustrate this with a function of an ERC-721 smart contract (ERC-721 is the common standard for non-fungible token collections like BoredApeYachtClub) which emits an event while transferring ownership of an NFT.

    /**
     * @dev Transfers `tokenId` from `from` to `to`.
     *  As opposed to {transferFrom}, this imposes no restrictions on msg.sender.
     *
     * Requirements:
     *
     * - `to` cannot be the zero address.
     * - `tokenId` token must be owned by `from`.
     *
     * Emits a {Transfer} event.
     */
    function _transfer(address from, address to, uint256 tokenId) internal virtual {
        address owner = ownerOf(tokenId);
        if (owner != from) {
            revert ERC721IncorrectOwner(from, tokenId, owner);
        }
        if (to == address(0)) {
            revert ERC721InvalidReceiver(address(0));
        }

        _beforeTokenTransfer(from, to, tokenId, 1);

        // Check that tokenId was not transferred by `_beforeTokenTransfer` hook
        owner = ownerOf(tokenId);
        if (owner != from) {
            revert ERC721IncorrectOwner(from, tokenId, owner);
        }

        // Clear approvals from the previous owner
        delete _tokenApprovals[tokenId];

        // Decrease balance with checked arithmetic, because an `ownerOf` override may
        // invalidate the assumption that `_balances[from] >= 1`.
        _balances[from] -= 1;

        unchecked {
            // `_balances[to]` could overflow in the conditions described in `_mint`. That would require
            // all 2**256 token ids to be minted, which in practice is impossible.
            _balances[to] += 1;
        }

        _owners[tokenId] = to;

        emit Transfer(from, to, tokenId);

        _afterTokenTransfer(from, to, tokenId, 1);
    }

What can we see here? To transfer an NFT from one address to another, the contract runs the function _transfer with these two addresses and the ID of the NFT (the function is internal, so users reach it through public functions such as transferFrom). In the code, you can see that some checks are carried out and then the balances of the users are changed. But the important thing here is that at the end of the function there is a line

emit Transfer(from, to, tokenId);

It means that these three values will be “emitted” and can be found in the logs of the blockchain. Saving the historical data you need this way is much more efficient, because writing it to contract storage is far more expensive than emitting a log.
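Here is a sketch of how such a log entry can be read back. It assumes the log has the shape returned by the eth_getLogs JSON-RPC method (topics, data, blockNumber, transactionHash). The addresses, hashes, and numbers in sample_log are made up, and decode_erc721_transfer is a hypothetical helper. For ERC-721 all three Transfer arguments are indexed, so they arrive as topics after the event signature hash.

```python
# Keccak-256 hash of "Transfer(address,address,uint256)"; it is always topics[0] for this event
TRANSFER_TOPIC0 = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"


def topic_to_address(topic: str) -> str:
    # An address is 20 bytes, left-padded with zeros to fill the 32-byte topic
    return "0x" + topic[-40:]


def decode_erc721_transfer(log: dict) -> dict:
    """Decode one raw ERC-721 Transfer log into named values."""
    topics = log["topics"]
    # ERC-20 Transfer shares topics[0] but has only 3 topics (its value sits in `data`)
    if len(topics) != 4 or topics[0] != TRANSFER_TOPIC0:
        raise ValueError("not an ERC-721 Transfer log")
    return {
        "from": topic_to_address(topics[1]),
        "to": topic_to_address(topics[2]),
        "token_id": int(topics[3], 16),
        "block_number": int(log["blockNumber"], 16),
        "tx_hash": log["transactionHash"],
    }


# Made-up sample log, for illustration only
sample_log = {
    "address": "0x" + "ab" * 20,
    "topics": [
        TRANSFER_TOPIC0,
        "0x" + "00" * 12 + "11" * 20,  # from
        "0x" + "00" * 12 + "22" * 20,  # to
        "0x" + format(7, "064x"),      # tokenId = 7
    ],
    "data": "0x",
    "blockNumber": "0x10",
    "transactionHash": "0x" + "cd" * 32,
}

decoded = decode_erc721_transfer(sample_log)
print(decoded)
```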

Now we’ve defined all the concepts needed to show what indexing is.

Any smart contract (being an object of some class) lives its life constantly being called by users and other smart contracts, changing its state and emitting events along the way. So we can define indexing as the process of collecting a smart contract's data during its lifetime (any of its internal variables, not only those emitted explicitly) and saving this data together with transaction hashes and block numbers, so that any detail about it can be found in the future.

This is crucial, because it is simply impossible to ask the blockchain directly for, say, the first transaction of wallet “A” with token “B” or the biggest transaction in smart contract “C” (or anything similar) if the smart contract doesn’t store this data explicitly (which, as we know, is very expensive).

That’s why we need indexing. Without it, the simple things that we can do in an SQL database become impossible on the blockchain.

In other words, “indexing” here is a synonym for smart contract data collection, because no indexing means no data access in Web3.
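Once the data has been collected into an ordinary database, those "impossible" questions become one-line queries. The sketch below uses Python's sqlite3 with a hypothetical transfers table and made-up rows. The wallet name "A" and token name "B" mirror the examples in the text.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical table that an indexer could fill from decoded events
db.execute(
    """
    CREATE TABLE transfers (
        block_number INTEGER, tx_hash TEXT, token TEXT,
        sender TEXT, recipient TEXT, amount INTEGER
    )
    """
)
db.executemany(
    "INSERT INTO transfers VALUES (?, ?, ?, ?, ?, ?)",
    [
        (100, "0xaa", "B", "A", "X", 5),
        (105, "0xbb", "B", "Y", "A", 50),
        (103, "0xcc", "Z", "A", "Y", 900),
        (110, "0xdd", "B", "X", "Y", 70),
    ],
)

# First transaction of wallet "A" with token "B"
first_a_with_b = db.execute(
    """
    SELECT tx_hash FROM transfers
    WHERE token = 'B' AND 'A' IN (sender, recipient)
    ORDER BY block_number
    LIMIT 1
    """
).fetchone()[0]

# Biggest transfer of token "B"
biggest_b = db.execute(
    "SELECT tx_hash, amount FROM transfers WHERE token = 'B' ORDER BY amount DESC LIMIT 1"
).fetchone()

print(first_a_with_b, biggest_b)
```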

How did developers do indexing in the past? They did it from scratch:

They wrote high-performance code in a fast programming language such as Go or Rust.
They set up a database to store the data.
They set up an API to make the data accessible from an application.
They spun up an archive blockchain node.
In the first stage, they went over the entire blockchain to find all the transactions related to a particular smart contract.
They processed these transactions, storing new entities and updating existing entities in the database.
When they reached the chain head, they had to switch to a more complex mode for processing new transactions, because each new block (or even a chain of blocks) can be rejected due to a chain reorganization.
If the chain was reorganized, they had to go back to the fork block and recalculate everything up to the new chain head.
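The last two steps are the tricky part, so here is a toy sketch of them. Block, Indexer, and sync are hypothetical names, and the "chain" is just a Python list standing in for what a node reports as canonical. A real indexer would talk to a node over RPC, persist to a database, and usually keep undo information instead of recomputing state from all stored blocks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    number: int
    hash: str
    parent_hash: str
    events: tuple = ()  # each event: (sender, recipient, token_id)


class Indexer:
    """Toy indexer: stores processed blocks and derives token owners from them."""

    def __init__(self):
        self.blocks = []  # processed blocks, oldest first

    def owners(self) -> dict:
        # Recompute state from stored blocks (simple, not efficient)
        state = {}
        for block in self.blocks:
            for _sender, recipient, token_id in block.events:
                state[token_id] = recipient
        return state

    def sync(self, chain: list) -> None:
        """`chain` is the node's current canonical chain; chain[i].number == i is assumed."""
        # Reorg handling: drop our newest blocks until our tip is on the canonical chain
        while self.blocks:
            tip = self.blocks[-1]
            if tip.number < len(chain) and chain[tip.number].hash == tip.hash:
                break  # the fork point (or an unchanged head) has been found
            self.blocks.pop()
        # Process everything from the fork point up to the new chain head
        for block in chain[len(self.blocks):]:
            self.blocks.append(block)


genesis = Block(0, "g", "")
a1 = Block(1, "a1", "g", ((None, "alice", 1),))      # mint token 1 to alice
a2 = Block(2, "a2", "a1", (("alice", "bob", 1),))    # later orphaned by a reorg
b2 = Block(2, "b2", "a1", (("alice", "carol", 1),))  # competing block at height 2
b3 = Block(3, "b3", "b2")

indexer = Indexer()
indexer.sync([genesis, a1, a2])
owners_before = indexer.owners()

indexer.sync([genesis, a1, b2, b3])  # the node now reports a different chain
owners_after = indexer.owners()
print(owners_before, owners_after)
```

In this run, block a2 is rolled back and replaced by b2 and b3, so the recorded owner of token 1 changes from "bob" to "carol".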

As you can see, this is hard not only to develop but also to maintain in real time, because any node glitch can require extra steps to restore data consistency. That’s actually the reason why The Graph appeared. The idea is simple: developers, along with end users, need easy access to smart contract data without all this hassle.

The Graph project has defined a paradigm called the “subgraph”: to extract smart contract data from the blockchain, you need to describe three things:

General parameters like what blockchain to use, what smart contract address to index, what events to handle, and from what start block to begin. These variables are defined in a so-called “manifest” file.
How to store the data. What tables should be created in a database to keep the data from a smart contract? The answer will be found in the “schema” file.
How to collect the data. Which variables should be saved from events, and what accompanying data (like transaction hash, block number, a result of other method calls, etc.) should be also collected, and how do they need to be put into the schemas we defined?

These three things can be elegantly defined in the three following files:

subgraph.yaml — manifest file
schema.graphql — schema description
mapping.ts — AssemblyScript file
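To show how the three parts fit together, here is a Python analogy. It is not The Graph's actual file format or API: real subgraphs use YAML for the manifest, GraphQL for the schema, and AssemblyScript for the mapping. Every name below (MANIFEST, Token, handle_transfer, dispatch) and every value is a placeholder.

```python
from dataclasses import dataclass

# 1. "Manifest": which chain, which contract, from which block, which events
MANIFEST = {
    "network": "mainnet",
    "address": "0x0000000000000000000000000000000000000000",  # placeholder address
    "start_block": 0,
    "event_handlers": {"Transfer": "handle_transfer"},
}


# 2. "Schema": the shape of the entity we want to store
@dataclass
class Token:
    id: str
    owner: str
    last_tx_hash: str
    last_block: int


STORE = {}  # stands in for the database table of Token entities


# 3. "Mapping": how one event turns into entity updates
def handle_transfer(event: dict) -> None:
    token_id = str(event["token_id"])
    STORE[token_id] = Token(
        id=token_id,
        owner=event["to"],
        last_tx_hash=event["tx_hash"],
        last_block=event["block_number"],
    )


HANDLERS = {"handle_transfer": handle_transfer}


def dispatch(event_name: str, event: dict) -> None:
    """Route an event to the handler named in the manifest; ignore unlisted events."""
    handler_name = MANIFEST["event_handlers"].get(event_name)
    if handler_name is not None:
        HANDLERS[handler_name](event)


dispatch("Transfer", {"to": "alice", "token_id": 7, "tx_hash": "0xaa", "block_number": 100})
dispatch("Transfer", {"to": "bob", "token_id": 7, "tx_hash": "0xbb", "block_number": 105})
dispatch("Approval", {"token_id": 7})  # not listed in the manifest, so it is skipped
print(STORE["7"])
```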

Thanks to this standard, it is extremely easy to describe the entire indexing process by following any of these tutorials:

And here is what it looks like then:

As you can see here, The Graph takes care of the indexing stuff. But you still need to run a graph-node (which is open-source software by The Graph). And here comes another paradigm shift.

In the past, developers ran their own blockchain nodes, and then they stopped and handed this hassle over to blockchain node providers. The Graph brought a similar architectural simplification: the hosted service. For a developer (the “user” here), it looks this way:

In this case, the user (or developer) doesn’t need to run their own indexer or graph node but still controls all the indexing logic and avoids vendor lock-in, because different providers use the same Graph description format (Chainstack is fully compatible with The Graph subgraph hosting, but it is worth checking this statement with your web3 infrastructure provider). This is a big deal because it helps developers speed up the development process and reduce operational maintenance costs.

What is also cool in this paradigm is that any time developers would like to make their application truly decentralized, they can seamlessly migrate to The Graph decentralized network using the same subgraphs.

What I left out of the narrative above:

As you may have noticed, The Graph uses GraphQL instead of a REST API. It allows users to run flexible queries against any tables they have created, combining and filtering them with ease. Here is a good video on how to master it.
The Graph has its own hosted service with a lot of ready-to-use subgraphs. It is free, but unfortunately it doesn’t meet production requirements (reliability, SLA, support), and syncing is much slower than with paid solutions. It can still be used for development. The tutorial on how to use these ready-to-use subgraphs with Python can be found here.
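For a feel of what a GraphQL query to a subgraph looks like from Python, here is a sketch that only builds the HTTP request and does not send it. The endpoint URL is a placeholder, and the transfers entity with its fields (from, to, tokenId, blockNumber, with to declared as Bytes) is an assumed schema, so adapt the names to the subgraph you actually query. The first, orderBy, orderDirection, and where arguments are the ones The Graph generates for entity collections.

```python
import json
import urllib.request

# Placeholder endpoint; replace with the query URL of a real subgraph
SUBGRAPH_URL = "https://example.com/subgraphs/name/my-nft-subgraph"

# Assumed schema: a `Transfer` entity exposed as the `transfers` collection
QUERY = """
query LatestTransfers($owner: Bytes!, $limit: Int!) {
  transfers(
    first: $limit
    orderBy: blockNumber
    orderDirection: desc
    where: { to: $owner }
  ) {
    id
    from
    to
    tokenId
    blockNumber
  }
}
"""


def build_request(owner: str, limit: int = 5) -> urllib.request.Request:
    """Build (but do not send) the POST request carrying the GraphQL query."""
    payload = json.dumps(
        {"query": QUERY, "variables": {"owner": owner, "limit": limit}}
    ).encode("utf-8")
    return urllib.request.Request(
        SUBGRAPH_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request("0x" + "22" * 20)
body = json.loads(request.data)
print(request.get_method(), request.full_url)
print(body["variables"])

# To actually run the query against a real endpoint:
#   with urllib.request.urlopen(request) as response:
#       result = json.load(response)
```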