What is a payment system? This is a question I’m trying to answer before joining Agoda’s payment team as an associate software engineer this summer. This blog post will present my research and understanding of the high-level design of payment systems.
FYI: this blog post doesn’t dig deep into the properties of the payment system, such as idempotency, fault tolerance, or data consistency. The blog posts this post is based on already cover those topics in (much better) detail. If you are interested in learning more about payment systems, I recommend reading the blog posts in the credit section. Think of this post as a brief introduction to the high-level design of a payment system.
Simply put, a payment system is responsible for moving money from A to B. Of course, in practice it is not that simple. Because the payment system manages many transactions between customers and businesses, it needs to be scalable and, most importantly, reliable and fault-tolerant. Let's look at the flow of a typical payment system:
There are two types of payment systems. The first is used by businesses that act as intermediaries between customers and merchants, such as Amazon, Uber, and Agoda. It typically moves money between three (or more) stakeholders. The second type moves money directly from the user to the business. In this blog post, I will only cover the intermediary type, exploring both the pay-in and pay-out flows. However, as you will see later on, the two types are quite similar.
The pay-in flow is the initial step of the transaction: when the user clicks to pay for the product/service. To understand the flow, let's look at the components of a payment system:
The pay-in flow:
- A payment event is generated from user interaction and is sent to the payment service.
- The payment service logs the event and handles the request accordingly.
- The payment service sends payment orders to the payment executor. A single payment event can produce multiple orders, since one payment can cover multiple products/services.
- The payment executor stores the order in the database.
- The executor calls the PSP’s API to execute the transaction, which moves the money from the user’s account to the intermediary’s account.
- The wallet service updates the merchant’s balance information.
- After the merchant’s wallet is updated, the ledger service adds new entries into the ledger database for reconciliation.
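The seven steps above can be sketched as a single in-memory Python program. This is a minimal, hypothetical sketch. The names `PayInFlow`, `PaymentOrder`, and `call_psp` are my own. Every component is a plain dict or list rather than a real service and database, and the PSP call is stubbed out. Amounts are assumed to be integers in minor currency units (e.g. cents).

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class PaymentOrder:
    order_id: str
    user_id: str
    merchant_id: str
    amount: int  # minor currency units, e.g. cents


@dataclass
class PayInFlow:
    """Hypothetical in-memory stand-ins for each component of the pay-in flow."""
    event_log: list = field(default_factory=list)  # payment service's event store
    orders_db: dict = field(default_factory=dict)  # executor's order table: order_id -> status
    wallets: dict = field(default_factory=dict)    # wallet service: merchant_id -> balance
    ledger: list = field(default_factory=list)     # ledger service: (account, debit, credit)

    def handle_payment_event(self, event: dict) -> None:
        # Steps 1-2: the payment service receives and logs the event.
        self.event_log.append(event)
        # Step 3: one payment order per product/service in the event.
        for item in event["items"]:
            order = PaymentOrder(str(uuid.uuid4()), event["user_id"],
                                 item["merchant_id"], item["amount"])
            self.execute(order)

    def execute(self, order: PaymentOrder) -> None:
        # Step 4: the executor persists the order before calling out.
        self.orders_db[order.order_id] = "PENDING"
        # Step 5: the executor calls the PSP (stubbed here).
        if not self.call_psp(order):
            self.orders_db[order.order_id] = "FAILED"
            return
        self.orders_db[order.order_id] = "SUCCESS"
        # Step 6: the wallet service updates the merchant's balance.
        self.wallets[order.merchant_id] = self.wallets.get(order.merchant_id, 0) + order.amount
        # Step 7: the ledger service records a balanced pair of entries.
        self.ledger.append((f"buyer:{order.user_id}", order.amount, 0))
        self.ledger.append((f"merchant:{order.merchant_id}", 0, order.amount))

    def call_psp(self, order: PaymentOrder) -> bool:
        # Placeholder for a real PSP API call; always succeeds in this sketch.
        return True


flow = PayInFlow()
flow.handle_payment_event({
    "user_id": "u1",
    "items": [{"merchant_id": "hotel-42", "amount": 12000},
              {"merchant_id": "airline-7", "amount": 30000}],
})
print(flow.wallets)  # {'hotel-42': 12000, 'airline-7': 30000}
```

In a real system, each step lives in a separate service and talks to the others over the network. That separation is why the reliability properties discussed later matter.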
Payment Service
This service handles requests from users and stores the payment events generated after the user clicks “pay”. It then creates the payment orders needed to process the payment. For example, one payment (transaction) can contain multiple products/services, so several services have to work in unison to complete it.
Payment Executor
The executor receives payment orders from the payment service (typically through a message queue) and calls the PSP to process each payment.
Payment Service Provider (PSP)
This is a third-party API that moves money from account A to account B. In this case, it moves money from the user to the intermediary’s account (Amazon, Agoda, Uber, etc.). Examples of PSPs include Stripe, Adyen, and PayPal.
Card Schemes
These are the organizations that process the actual card transactions. Well-known examples of card schemes are Visa and Mastercard.
Wallet
This service keeps track of the balance of each merchant’s account.
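Ledger
This service keeps the financial record of every transaction. It uses the double-entry principle: every transaction is recorded as a debit to one account and an equal credit to another, so a transaction’s debits and credits always sum to the same amount. This keeps the records reliable and makes it possible to verify that money moved correctly between stakeholders. Check out Uber’s blog on their Job/Order-based system to see how they apply the double-entry principle.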
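To make the double-entry principle concrete, here is a minimal Python sketch of a ledger that refuses any transaction whose debits and credits don't match. The `Entry` and `Ledger` names, the account naming scheme, and the "credits minus debits" balance convention are assumptions for illustration. Real ledgers, including Uber's, are far more involved.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    account: str
    debit: int = 0   # minor currency units
    credit: int = 0


class Ledger:
    """Append-only ledger that only accepts balanced transactions."""

    def __init__(self) -> None:
        self.transactions: list[tuple[str, tuple[Entry, ...]]] = []

    def record(self, transaction_id: str, entries: list[Entry]) -> None:
        total_debit = sum(e.debit for e in entries)
        total_credit = sum(e.credit for e in entries)
        # Double-entry rule: reject anything that doesn't balance.
        if total_debit != total_credit:
            raise ValueError(
                f"unbalanced transaction {transaction_id}: "
                f"debits {total_debit} != credits {total_credit}"
            )
        self.transactions.append((transaction_id, tuple(entries)))

    def balance(self, account: str) -> int:
        # Simplified convention for this sketch: balance = credits - debits.
        return sum(
            e.credit - e.debit
            for _, entries in self.transactions
            for e in entries
            if e.account == account
        )


ledger = Ledger()
ledger.record("pay-in-1", [Entry("buyer:u1", debit=12000),
                           Entry("merchant:hotel-42", credit=12000)])
print(ledger.balance("merchant:hotel-42"))  # 12000
print(ledger.balance("buyer:u1"))           # -12000
```

Because every accepted transaction balances, the balances of all accounts always sum to zero. This invariant makes a missing or duplicated entry easy to detect.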
An important note is that businesses usually don’t handle the user’s credit card information themselves, because doing so comes with strict compliance rules (such as PCI DSS). Instead, businesses often process the payment through the PSP’s hosted payment page, where users enter their card details directly.
The pay-out flow is similar to the pay-in flow. It reuses some of the pay-in components, but instead of moving money between the user and the intermediary, it moves money from the intermediary to the merchant. It may also use a different third-party provider, such as Tipalti, to process these transactions. However, the pay-out flow goes through the same bookkeeping process as the pay-in flow, adding entries to the ledger database to uphold the double-entry principle.
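Below is a rough Python sketch of the pay-out side. It assumes the merchant's wallet balance is the amount the intermediary owes them. `PayOutFlow` and `call_payout_provider` are hypothetical names. The provider call is a stub standing in for a service such as Tipalti and does not model its actual API.

```python
from dataclasses import dataclass, field


@dataclass
class PayOutFlow:
    """Hypothetical in-memory sketch of the pay-out side."""
    wallets: dict                                  # merchant_id -> balance (minor units)
    ledger: list = field(default_factory=list)     # (account, debit, credit)
    transfers: list = field(default_factory=list)  # what the pay-out provider was asked to send

    def call_payout_provider(self, merchant_id: str, amount: int) -> bool:
        # Placeholder for a third-party pay-out provider's API; always succeeds here.
        self.transfers.append((merchant_id, amount))
        return True

    def pay_out(self, merchant_id: str, amount: int) -> bool:
        balance = self.wallets.get(merchant_id, 0)
        if amount > balance:
            raise ValueError(f"insufficient balance for {merchant_id}")
        if not self.call_payout_provider(merchant_id, amount):
            return False
        # Same bookkeeping as pay-in: update the wallet, then write a balanced pair of entries.
        self.wallets[merchant_id] = balance - amount
        self.ledger.append((f"merchant:{merchant_id}", amount, 0))
        self.ledger.append((f"merchant_bank:{merchant_id}", 0, amount))
        return True


flow = PayOutFlow(wallets={"hotel-42": 12000})
flow.pay_out("hotel-42", 10000)
print(flow.wallets["hotel-42"])  # 2000
```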
The payment system has to scale to millions of transactions. It also has to be reliable and secure, keeping errors during the payment process close to zero, because failures such as double charges or lost payments can seriously harm the business. What happens if a user clicks “pay” twice? What if the network connection fails in the middle of a payment? How do we ensure that the system always pays exactly as it is supposed to?
These are the properties of the payment system that the developers have to implement to create a reliable and secure system:
- Exactly-once delivery
- Asynchronous communication
- Data consistency
I will only give a brief explanation of the possible implementations of the properties. To learn more, you can take a look at the articles/blogs in the credit section.
To ensure exactly-once delivery, the system has to guarantee both at-least-once and at-most-once delivery:
- at-least-once ⇒ retry the payment transaction until it succeeds.
- at-most-once ⇒ achieve idempotency by labeling each request with a unique identifier (an idempotency key), so a repeated request is recognized and not processed twice.
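Combining the two guarantees might look like the following Python sketch. `FlakyPSP` is a hypothetical stub that deduplicates requests by idempotency key. On its first call, it charges the card but "loses" the response, which simulates a network error. The caller retries with the same key (at-least-once), and the key prevents a second charge (at-most-once). Real PSPs expose this in their own way; Stripe, for instance, accepts an `Idempotency-Key` request header.

```python
import uuid


class FlakyPSP:
    """Hypothetical PSP stub that honours idempotency keys.

    The first `lost_responses` calls perform the charge but then raise, as if the
    network dropped the response on its way back.
    """

    def __init__(self, lost_responses: int = 0):
        self.lost_responses = lost_responses
        self.charges = {}  # idempotency_key -> amount actually charged
        self.calls = 0

    def charge(self, idempotency_key: str, amount: int) -> str:
        self.calls += 1
        if idempotency_key in self.charges:
            # Duplicate request: replay the earlier result instead of charging again.
            return "SUCCESS"
        self.charges[idempotency_key] = amount
        if self.lost_responses > 0:
            self.lost_responses -= 1
            raise ConnectionError("response lost")
        return "SUCCESS"


def pay_with_retries(psp: FlakyPSP, idempotency_key: str, amount: int,
                     max_attempts: int = 3) -> str:
    """At-least-once: retry on network errors, always reusing the same key."""
    for attempt in range(1, max_attempts + 1):
        try:
            return psp.charge(idempotency_key, amount)
        except ConnectionError:
            if attempt == max_attempts:
                raise
    raise ValueError("max_attempts must be at least 1")


psp = FlakyPSP(lost_responses=1)
key = str(uuid.uuid4())  # generated once per payment order, not once per attempt
print(pay_with_retries(psp, key, 12000))  # SUCCESS, after one retry
print(pay_with_retries(psp, key, 12000))  # SUCCESS, e.g. the user clicked "pay" twice
print(len(psp.charges))                   # 1: the user was charged exactly once
```

The key must be created once per payment order and stored with it. If a fresh key were generated on each retry, the PSP could no longer recognize the duplicate.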
How does the system handle failed payments? For example, a network connection error can occur while the payment request is being sent. The system can handle transient failures by placing the failed request in a retry queue. Persistent failures are moved to a dead letter queue, where they can be inspected.
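A retry queue and dead letter queue can be sketched in a few lines of Python. This is a simplified, single-process illustration. The error classes, `process_with_dlq`, and the `max_attempts` limit are assumptions. A production retry queue would typically add delays (e.g. exponential backoff) between attempts, which this sketch omits.

```python
from collections import deque


class TransientError(Exception):
    """A failure worth retrying, e.g. a network timeout."""


class PermanentError(Exception):
    """A failure retrying won't fix, e.g. an invalid card number."""


def process_with_dlq(orders, handler, max_attempts=3):
    retry_queue = deque((order, 0) for order in orders)  # (order, attempts so far)
    succeeded, dead_letter_queue = [], []
    while retry_queue:
        order, attempts = retry_queue.popleft()
        try:
            handler(order)
            succeeded.append(order)
        except TransientError:
            attempts += 1
            if attempts >= max_attempts:
                dead_letter_queue.append(order)  # give up; park it for inspection
            else:
                retry_queue.append((order, attempts))  # try again later
        except PermanentError:
            dead_letter_queue.append(order)  # retrying won't help
    return succeeded, dead_letter_queue


failures_left = {"b": 1}  # "b" fails once, then succeeds


def handler(order):
    if order == "c":
        raise TransientError("PSP keeps timing out")
    if order == "d":
        raise PermanentError("invalid card number")
    if failures_left.get(order, 0) > 0:
        failures_left[order] -= 1
        raise TransientError("network error")


ok, dlq = process_with_dlq(["a", "b", "c", "d"], handler)
print(ok, dlq)  # ['a', 'b'] ['d', 'c']
```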
With asynchronous communication between services, the payment system can handle many requests concurrently. This can be achieved with a technology like Kafka, a distributed event streaming platform. Kafka passes messages between services so that one slow service does not become a bottleneck that blocks the others.
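Kafka itself needs a running cluster, so the Python sketch below uses an `asyncio.Queue` as a stand-in for a Kafka topic. It only illustrates the decoupling idea. The hypothetical `payment_service` publishes orders without waiting for them to be processed, while several `executor` consumers work through them concurrently.

```python
import asyncio


async def payment_service(queue: asyncio.Queue, orders: list) -> None:
    for order in orders:
        # Publish and move on: the producer never waits for an order to be processed.
        await queue.put(order)


async def executor(name: str, queue: asyncio.Queue, processed: list) -> None:
    while True:
        order = await queue.get()
        await asyncio.sleep(0.01)  # simulate a slow PSP call
        processed.append((name, order))
        queue.task_done()


async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()  # stand-in for a Kafka topic
    processed: list = []
    consumers = [asyncio.create_task(executor(f"executor-{i}", queue, processed))
                 for i in range(3)]
    await payment_service(queue, [f"order-{n}" for n in range(6)])
    await queue.join()  # wait until every published order has been handled
    for task in consumers:
        task.cancel()
    await asyncio.gather(*consumers, return_exceptions=True)
    return processed


processed = asyncio.run(main())
print(len(processed))  # 6
```

Kafka adds what this sketch lacks: messages are persisted, replicated across brokers, and can be replayed. Those properties make it suitable for payment events, not just fast.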
A payment system runs several internal components/services, plus external services like PSPs. How can we be sure that the system is operating correctly and that all data are consistent? A small mistake in the records of the ledger service, or of any other service, can be detrimental to the business. Therefore, all transactions, events, and data across the services must be consistent. One of the processes that maintains this consistency is reconciliation.
The system periodically runs a reconciliation process that compares the states/data across services to verify that all transactions were processed correctly. For example, a PSP sends a settlement file to the intermediary business (such as Amazon, Uber, or Agoda), which compares it against the ledger to ensure that every transaction was processed correctly.
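A minimal reconciliation check might look like this in Python. The settlement file format (a CSV with `order_id` and `amount` columns) is an assumption, since each PSP defines its own format. The ledger side is simplified to a dict mapping order ID to amount.

```python
import csv
import io


def reconcile(settlement_csv: str, ledger_amounts: dict) -> dict:
    """Compare a PSP settlement file against the ledger, keyed by order ID."""
    settled = {row["order_id"]: int(row["amount"])
               for row in csv.DictReader(io.StringIO(settlement_csv))}
    return {
        # The PSP moved money we have no record of.
        "missing_in_ledger": sorted(settled.keys() - ledger_amounts.keys()),
        # We recorded a payment the PSP did not settle.
        "missing_in_settlement": sorted(ledger_amounts.keys() - settled.keys()),
        # Both sides know the order but disagree on the amount.
        "amount_mismatch": sorted(k for k in settled.keys() & ledger_amounts.keys()
                                  if settled[k] != ledger_amounts[k]),
    }


settlement_file = "order_id,amount\no1,12000\no2,30000\no4,500\n"
ledger_amounts = {"o1": 12000, "o2": 29000, "o3": 8000}
print(reconcile(settlement_file, ledger_amounts))
# {'missing_in_ledger': ['o4'], 'missing_in_settlement': ['o3'], 'amount_mismatch': ['o2']}
```

In practice, each mismatch category is usually routed differently. Some can be fixed automatically, while others need a person to investigate.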
A payment system handles transactions between users and businesses, and it must be scalable, reliable, and fault-tolerant. As mentioned before, this blog post only provides a brief overview of what a payment system is. To learn more, I recommend reading both the Pragmatic Engineer and Uber blog posts in the credit section. Furthermore, there are many other relevant topics to consider when developing a payment system that this post does not cover, such as:
- Debugging tools
- Global payment