Many blockchain projects don't survive long after reaching production. For most, a lack of proper software testing is one of the main reasons for their demise. It's estimated that over half a billion dollars' worth of cryptocurrency has been lost due to bad code in the last year alone. You have probably heard about The DAO's code loophole, which allowed attackers to drain 3.6 million ETH (worth $70 million at the time) from the Ethereum-based smart contract. Another notorious case was the Parity bug, which left over $150 million permanently frozen. Even Bitcoin itself is not immune to critical bugs. Late last year, a bug was discovered in the code that could have allowed malicious actors to artificially inflate Bitcoin's supply by spending the same input twice within a single transaction. Had the bug not been identified and addressed quickly, it could have had catastrophic effects on the network. This is just the tip of the iceberg: there are plenty of smaller incidents caused by inexperienced or inattentive developers that don't make the headlines.
What does this tell us? In development, things can go wrong fast and the outcome can be ugly. This is why software testing is so important for any project utilizing blockchain technology, such as blockchain platforms, blockchain applications or blockchain-based services.
In this article, we will discuss our experience and best practices with software testing while developing Lisk, a blockchain application platform. We will also show you how implementing automation testing improved our internal workflows and code reliability. Lastly, we will show you how you can get involved in testing our open source software. This is a long blog post, but we've broken it down into bite-sized pieces for you.
- Introduction to Blockchain and Lisk
- What is Software Testing?
- Testing blockchain applications adds new metrics to traditional software testing
- Why blockchain developers need to pay much more attention to details
- Automation testing can significantly cut down the release process
- Different types of automated tests
- Continuous Integration (CI) is best practice when it comes to automation testing
- Which CI platform to pick? Travis CI vs CircleCI vs Jenkins
- Software testing is not enough - introducing Quality Assurance
- How manual testing slowed down our software development process
- How we implemented Quality Assurance at Lightcurve
- The results of having a QA team in place
- Our blockchain network's QA testing process
- Get involved in our open source automation testing
- How to start contributing to our QA
- Which QA tools can we offer you?
Introduction to Blockchain and Lisk (feel free to skip this part if you're a Lisker)
You've probably heard about blockchain in the context of cryptocurrencies like Bitcoin, but what makes this new technology so special? The blockchain, a type of distributed ledger technology (DLT), is an open, distributed database that can record transactions among parties permanently and in an efficient, verifiable way. Transactions are packed into blocks, which are cryptographically signed and linked together to form the actual chain. Data stored in a blockchain can't be altered or tampered with, as all the records are immutable: once data is saved to the ledger, it stays there forever. The blockchain is also a decentralized network, meaning there is no central authority with control over it.
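Much of the immutability described above comes from hash linking: each block stores the hash of its predecessor, so rewriting an old block breaks every link after it. The following is a minimal Python sketch of just that idea; the `Block` class and its helpers are hypothetical and are not Lisk's implementation. It deliberately leaves out signatures, consensus and networking, which are what stop an attacker on a real network from simply recomputing every later hash.

```python
import hashlib
import json
from dataclasses import dataclass

GENESIS_PARENT = "0" * 64  # placeholder parent hash for the first block


@dataclass(frozen=True)
class Block:
    height: int
    transactions: tuple[str, ...]
    previous_hash: str

    def digest(self) -> str:
        # SHA-256 over a canonical JSON encoding; because previous_hash is
        # part of the payload, each block commits to the whole chain before it.
        payload = json.dumps(
            {
                "height": self.height,
                "transactions": list(self.transactions),
                "previous_hash": self.previous_hash,
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain: list[Block], transactions: list[str]) -> None:
    """Append a block that points at the hash of the current chain tip."""
    previous_hash = chain[-1].digest() if chain else GENESIS_PARENT
    chain.append(Block(len(chain), tuple(transactions), previous_hash))


def is_valid(chain: list[Block]) -> bool:
    """Check that every block references the hash of its predecessor."""
    return all(
        chain[i].previous_hash == chain[i - 1].digest() for i in range(1, len(chain))
    )


chain: list[Block] = []
append_block(chain, ["alice->bob:10"])
append_block(chain, ["bob->carol:4"])
print(is_valid(chain))  # True

# Tampering with an old block changes its hash and breaks the next link.
chain[0] = Block(0, ("alice->mallory:10",), GENESIS_PARENT)
print(is_valid(chain))  # False
```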
Software testing is not enough - introducing Quality Assurance
While software testing is very important, it belongs to a wider scope of Quality Assurance. What does this term mean? Quality Assurance (QA) is much more than just testing. It encompasses the entire software development process. Quality Assurance includes processes such as requirements definition, software design, coding, source code control, code reviews, software configuration management, testing, release management, and product integration.
Manual testing slowed down our software development process
It's common for tech startups to struggle with putting processes in place during their first years, and Lightcurve was no exception. We didn't have enough resources to dedicate to software testing, but we still had to run as many tests as possible to ensure the quality and reliability of every new software release. For example, testing a bug fix or a feature at the private network level required:
- Preparing the binaries (build from source)
- Spinning up the cloud infrastructure (multiple virtual machines, from 10 to 500)
- Deploying the software on all of the machines
- Performing actual test scenarios
- Gathering logs for further investigation
- Cleaning up the instances (destroying VMs)
- Analyzing the logs gathered in the process
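These manual steps map naturally onto a script. Below is a hypothetical Python sketch of one automated test cycle: `FakeCloud` stands in for a real cloud provider API, the region labels are invented, and scenarios are plain functions that return log lines. The key design choice is the `try`/`finally` block, which guarantees that the VMs are destroyed even when a scenario crashes, so a failed run never leaves instances behind.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import Callable

# Hypothetical region labels; a real setup would use the provider's region IDs.
REGIONS = ["us", "eu", "asia"]

# A scenario receives the deployed nodes and returns the log lines it produced.
Scenario = Callable[[dict[str, str]], list[str]]


@dataclass
class FakeCloud:
    """In-memory stand-in for a cloud provider API (hypothetical)."""
    live_vms: set[str] = field(default_factory=set)

    def create_vm(self, name: str, region: str) -> str:
        vm_id = f"{region}-{name}"
        self.live_vms.add(vm_id)
        return vm_id

    def destroy_vm(self, vm_id: str) -> None:
        self.live_vms.discard(vm_id)


def run_test_cycle(cloud: FakeCloud, artifact: str, node_count: int,
                   scenarios: dict[str, Scenario]) -> dict[str, list[str]]:
    """Provision, deploy, test and gather logs; VMs are destroyed even on failure."""
    regions = cycle(REGIONS)
    vms: list[str] = []
    logs: dict[str, list[str]] = {}
    try:
        # Spin up infrastructure, spreading nodes round-robin across regions.
        for i in range(node_count):
            vms.append(cloud.create_vm(f"node{i}", next(regions)))
        # "Deploy" the prebuilt artifact to every machine.
        nodes = {vm: artifact for vm in vms}
        # Run each scenario and keep its logs for later analysis.
        for name, scenario in scenarios.items():
            logs[name] = scenario(nodes)
    finally:
        # Clean up the instances whether or not a scenario raised.
        for vm in vms:
            cloud.destroy_vm(vm)
    return logs


def failed_scenarios(logs: dict[str, list[str]]) -> list[str]:
    """Analyze gathered logs: a scenario fails if any line starts with ERROR."""
    return [name for name, lines in logs.items()
            if any(line.startswith("ERROR") for line in lines)]


cloud = FakeCloud()
logs = run_test_cycle(
    cloud, "lisk-1.5.0-alpha.2-b430af6-Linux-x86_64.tar.gz", 10,
    {"sync": lambda nodes: [f"INFO {len(nodes)} nodes in sync"],
     "fork": lambda nodes: ["ERROR fork detected"]},
)
print(failed_scenarios(logs))  # ['fork']
print(cloud.live_vms)          # set()
```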
The majority of our tests were initially manual and therefore time-consuming. In many cases, software testing also required coordination with, and significant help from, our DevOps team. Because the effort and time required were so high, we were not able to test all protocol features and scenarios in a reasonable amount of time. As a result, improvements and new features for our product suite were delayed. However, I am happy to confirm that we no longer depend solely on manual testing. Four months ago, we established our own QA team within our network development team to cover all the missing parts related to software testing, implementation processes, automation testing and enforcing high-quality standards.
How we implemented Quality Assurance at Lightcurve
Now that we've established the different types of testing, let's look at how exactly QA is performed at Lightcurve and which processes we introduced to eliminate the risk of delivering unreliable code to production.
The results of having a QA team in place
Having a dedicated QA team in place improved the following areas:
- Designing test plans together with test scenarios. The QA team works closely with developers to identify the features being developed and then prepares well-thought-out test scenarios. This step is required before the actual release. In most cases, QA is also responsible for writing tests that cover these scenarios, executing them and evaluating the results.
- Automated Test Framework. We implemented various test scenarios that are executed automatically. Our automated tests include sanity testing, regression testing, network testing (block and transaction propagation, p2p communication, backward compatibility, etc.), and security and fault-tolerance network tests. These tests are part of our Continuous Integration (CI) and can also be executed by developers on demand.
- Jenkins and Ansible for Continuous Integration. At Lightcurve, we benefit from Jenkins' flexibility when executing multiple jobs in parallel, and we want full control over the entire workflow. We have automated the process of creating builds and spinning up test networks using cloud providers. To make our tests as close as possible to real-world scenarios, we deploy nodes in different regions (US, China, Europe, Asia, etc.). We also use Ansible as an orchestration tool, which lets us roll out the software and spin up those networks at the push of a button.
- NewRelic APM for Performance Testing. One of the main indicators of a blockchain project's vitality is the network's ongoing performance, which makes monitoring the performance of every release important. Our QA team uses NewRelic APM to determine whether performance has improved or degraded, and then gives feedback to the development team so that problems are fixed before release. To ensure the network behaves as expected under high transaction volumes, we run various types of stress tests (different transaction types, different workloads), monitoring metrics such as CPU and memory usage, I/O throughput and API response times. Another important thing to check for is memory leaks. When code needs memory for a particular task, the memory is allocated automatically (for example, when creating an object), and it should be released once it is no longer needed. Sometimes this doesn't happen, and the application keeps holding on to memory it no longer uses. Memory leaks cause the memory used by the application to grow slowly (sometimes very slowly) until it eventually consumes all available memory and the application crashes. To improve overall agility and code reliability in development, we're currently in the process of migrating to TypeScript across our product suite.
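The memory-leak symptom described above, usage creeping upward until the process crashes, can be flagged automatically from memory samples collected during a long stress test. The sketch below is a hypothetical Python helper and does not use any NewRelic API: it assumes samples have already been exported as (minute, megabytes) pairs and fits a least-squares line through them, so the periodic ups and downs of garbage collection largely average out while steady growth does not. The 5 MB/h threshold is an arbitrary placeholder.

```python
def memory_slope(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of memory (MB) over time (minutes), in MB/min."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples need distinct timestamps")
    return cov / var


def looks_like_leak(samples: list[tuple[float, float]],
                    max_mb_per_hour: float = 5.0) -> bool:
    """Flag a suspected leak when the fitted growth rate exceeds the allowed rate."""
    return memory_slope(samples) * 60 > max_mb_per_hour


# Four hours of samples every 5 minutes.
steady = [(t, 300 + (t % 10)) for t in range(0, 240, 5)]  # GC-like sawtooth
leaky = [(t, 300 + 0.5 * t) for t in range(0, 240, 5)]    # +0.5 MB/min = 30 MB/h
print(looks_like_leak(steady), looks_like_leak(leaky))     # False True
```

A trend line is used instead of comparing the first and last sample because a single garbage-collection spike near either end would otherwise dominate the result.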
Our blockchain network's QA testing process
In blockchain development, even minor releases go through several stages of testing before they reach the production network. In our case, we use the following types of networks:
- Devnet is a temporary, short-lived network that we create on a case-by-case basis to test new changes that are not yet part of a release.
- Alphanet is the network on which we test alpha versions of new releases. At this stage, we need a larger network that reproduces real-world conditions.
- Betanet is a public network on which we test beta releases. We only use it when there are very big changes in the codebase; in most cases, we skip this network.
- Testnet is a public network to which we push release candidates. Lisk's Testnet holds a huge set of historical data. You can check out our Testnet here.
- Mainnet is the public production network and contains the actual blockchain.
Our automation testing is configured so that developers can run tests on Devnet or Alphanet. The network size is configurable, ranging from 10 to 500 nodes. NewRelic APM monitoring is integrated with our software and enabled on each node. Once all the required tests have been executed and their results evaluated, we can decide whether to release a feature or fix to Testnet. After a reasonable amount of time (depending on the size and complexity of the release), we push it to production, otherwise known as Mainnet.
The above picture depicts the Jenkins CI pipeline flow and a test report. The Jenkins CI pipeline consists of multiple stages:
- Building Lisk Core software: In this stage, the Lisk Core software is built from a specific branch (development by default). A successful build produces a tarball with a unique hash in its name (e.g. lisk-1.5.0-alpha.2-b430af6-Linux-x86_64.tar.gz).
- Deploying the software to multiple machines: Once the software is built successfully, it is deployed to multiple nodes to replicate the network's behavior.
- Enabling delegates to forge: At this point, all the nodes have started and loaded the network's genesis block. Now we need to make the blockchain move, so in this step we enable forging, which lets delegates produce blocks.
- Executing protocol test scenarios: Once the network is moving, Lisk Protocol feature tests are executed against it. These include sanity, regression and new-feature tests, which ensure that all basic protocol-related scenarios work as intended.
- Managing network stress tests: To ensure the network stays reliable even under very high transaction loads, we run stress tests that send the maximum supported number of transactions. We expect the network to handle the load and accept all the transactions within the given block slots.
The pipeline is configured to run nightly, which helps the development team deliver each release on time and with proper quality. Using the QA automated framework, developers can also test features at the network level as they develop them. This gives developers fast feedback on any failures, backward compatibility issues, performance changes and so on.
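The pipeline stages and the network promotion path described in this section can be sketched as follows. This is a hypothetical Python illustration, not our Jenkins configuration: stages are stubbed callables run in fail-fast order, Devnet is left out of the promotion path because it hosts ad-hoc tests, and the stress check assumes a limit of 25 transactions per block (an illustrative value; check the configuration of the version you test). Under that assumption, 1,000 transactions need at least ceil(1000 / 25) = 40 block slots.

```python
import math
from typing import Callable

# Assumed per-block transaction limit, for illustration only.
MAX_TXS_PER_BLOCK = 25

# Release networks in promotion order (Devnet is for ad-hoc, non-release tests).
RELEASE_NETWORKS = ["alphanet", "betanet", "testnet", "mainnet"]

# A stage is a name plus a check that returns True on success.
Stage = tuple[str, Callable[[], bool]]


def slots_needed(tx_count: int, max_per_block: int = MAX_TXS_PER_BLOCK) -> int:
    """Fewest block slots that can hold tx_count transactions."""
    return math.ceil(tx_count / max_per_block)


def stress_test_passed(tx_count: int, slots_used: int, tolerance: int = 0) -> bool:
    """All transactions must be confirmed within the expected number of slots."""
    return slots_used <= slots_needed(tx_count) + tolerance


def run_pipeline(stages: list[Stage]) -> tuple[bool, list[str]]:
    """Run stages in order and stop at the first failure (fail fast)."""
    completed: list[str] = []
    for name, check in stages:
        if not check():
            return False, completed
        completed.append(name)
    return True, completed


def next_network(current: str, big_change: bool = False) -> str:
    """Next network for a release; Betanet is used only for very big changes."""
    path = [n for n in RELEASE_NETWORKS if big_change or n != "betanet"]
    position = path.index(current)  # raises ValueError if current is not on this path
    if position == len(path) - 1:
        raise ValueError("mainnet is the final network")
    return path[position + 1]


# Stubbed stages: each lambda stands in for real build/deploy/test work.
ok, completed = run_pipeline([
    ("build", lambda: True),
    ("deploy", lambda: True),
    ("enable_forging", lambda: True),
    ("protocol_tests", lambda: True),
    ("stress_test", lambda: stress_test_passed(1000, slots_used=40)),
])
print(ok, completed[-1])         # True stress_test
print(next_network("alphanet"))  # testnet
```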
Get involved in our open source automation testing
Lisk is developed in the spirit of the open source ethos. Therefore, we would like to encourage all developers to participate in ensuring the continued quality and security of our open source network with our QA tools.
How to start contributing to our QA
You can observe our quality assurance progress by following our public Jenkins interface. If you want to try the test suite yourself, however, you will need to set up your own node and network. To do so, read through Lisk's official documentation; in particular, follow the Lisk Core setup section to get the blockchain network up and running. Next, set up the QA tools by following the instructions in the Lisk Core QA repository.
Which QA tools can we offer you?
Now that you know how to set up your Lisk Core node, you can get involved with the following:
- QA cycle checklist template to cover all possible scenarios
- BDD feature scenarios and their step_definitions implementation
- Support and utility classes for testing
- Network configuration tools
- Stress test scenarios
If you are a developer and want to contribute to Lisk's Quality Assurance process, you can follow these contribution guidelines. You can then share your insights or join the discussion on Lisk.Chat's Network Channel.
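For developers new to BDD, the idea behind feature scenarios and step definitions is that each plain-language line of a scenario is matched against a registered step function. The Python sketch below is a hypothetical, framework-free illustration of that mapping; it is not code from the Lisk Core QA repository, whose step_definitions and tooling may differ, and the steps run against an in-memory stand-in rather than a real Lisk network. In practice, an established BDD tool such as Cucumber would replace the hand-rolled registry.

```python
import re
from typing import Callable

# Registry of (pattern, step function) pairs: a tiny stand-in for a BDD tool.
STEPS: list[tuple[re.Pattern, Callable[..., None]]] = []


def step(pattern: str):
    """Decorator that registers a step definition for lines matching pattern."""
    def register(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return register


def run_scenario(text: str, context: dict) -> None:
    """Execute each Given/When/Then/And line with the first matching step."""
    for raw in text.strip().splitlines():
        keyword, _, body = raw.strip().partition(" ")
        if keyword not in {"Given", "When", "Then", "And"}:
            continue  # skip Scenario titles and blank lines
        for pattern, func in STEPS:
            match = pattern.fullmatch(body)
            if match:
                func(context, *match.groups())
                break
        else:
            raise LookupError(f"no step definition for: {raw.strip()}")


# Hypothetical step definitions against an in-memory "network".
@step(r"a network of (\d+) nodes")
def given_network(ctx, count):
    ctx["nodes"] = int(count)
    ctx["pool"], ctx["chain"] = [], []


@step(r"I send (\d+) transfer transactions")
def send_transfers(ctx, count):
    ctx["pool"].extend(f"tx{i}" for i in range(int(count)))


@step(r"a block is forged")
def forge_block(ctx):
    ctx["chain"].extend(ctx["pool"])
    ctx["pool"].clear()


@step(r"(\d+) transactions should be confirmed")
def check_confirmed(ctx, count):
    assert len(ctx["chain"]) == int(count), ctx["chain"]


FEATURE = """
Scenario: transfers are confirmed
  Given a network of 10 nodes
  When I send 5 transfer transactions
  And a block is forged
  Then 5 transactions should be confirmed
"""

context: dict = {}
run_scenario(FEATURE, context)
print("scenario passed, chain length:", len(context["chain"]))
```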
From immutability to decentralization, blockchain development presents its own set of challenges. This makes software testing even more important for our industry than it already is for centralized applications. To complicate things further, software testing in itself is a whole universe of options. Introducing automation testing at Lightcurve, together with a professionalized QA department, significantly improved both our development speed and the quality of Lisk's codebase. When it comes to blockchains, however, community equals security. Use the above QA tools to get involved in the testing and contribute to our network's development starting today.