DEV Community

yactouat

protobuf-related: a GCP Pub/Sub intro/overview

Hey 👋 ! Today we are going to talk about Pub/Sub a lot but not much about how protocol buffers fit into it (as this will be the subject of another article in this series). Still, I want to keep this article in this protobuf-related series as these two subjects are closely related.

overview

Stream processing is the practice of taking action on a series of data at the time the data is created; Pub/Sub solutions are one way to implement it.

The Pub/Sub pattern, which stands for Publisher/Subscriber, consists of an architecture that aims at allowing services to communicate asynchronously with each other (i.e. without having to wait for one another's responses) using a messaging system.

The asynchronicity of this system lies mainly in the fact that publishers are sources of asynchronous events: these events can arise at any point in time, if they arise at all. In this pattern, the GCP plays a buffering role between the publishers and the subscribers, which react to these events as they are delivered.

Pub/Sub systems are also many-to-many messaging systems as there can be 1 to n publishing services and 0 to n subscribing services; a publisher service can publish to 1 to n topics and a given service can subscribe to 1 to n subscriptions.

Using this pattern with the Google Cloud Platform means that the GCP will act as a big mailbox that permits this asynchronous communication between services. This has the advantage of decoupling the sender and the receiver apps in a system.

For instance, if subscribing services are down for some time and go back up again, they can catch up with the stream of missed messages without the publisher apps having to care about it at all thanks to the big GCP mailbox.
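To make the mailbox idea a bit more concrete, here is a tiny in-memory sketch. It is only a toy: the `Mailbox` class and its method names are made up for illustration and have nothing to do with the real GCP client libraries; it just shows a publisher that never waits and a subscriber that catches up later.

```python
from collections import deque


class Mailbox:
    """Hypothetical stand-in for the buffering layer (not the real Pub/Sub API)."""

    def __init__(self):
        self._pending = deque()

    def publish(self, data):
        # The publisher returns immediately; it never waits for a subscriber.
        self._pending.append(data)

    def drain(self):
        # A subscriber that comes back up reads everything it missed.
        # This toy keeps FIFO order for simplicity.
        missed = list(self._pending)
        self._pending.clear()
        return missed


mailbox = Mailbox()
for event in ("order-1", "order-2", "order-3"):
    mailbox.publish(event)  # the subscriber is "down" during these calls

caught_up = mailbox.drain()  # the subscriber is back up
print(caught_up)
```

The publisher loop finishes without knowing whether anybody is listening, which is the decoupling described above.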

Google Pub/Sub is serverless: it does not require you to maintain or provision compute resources. The GCP is not the only platform that implements Pub/Sub (Azure and AWS offer it as well), but that's beyond the scope of this article.

the GCP Pub/Sub, a rough sketch

Google Pub/Sub architecture has the advantage of offering global replication, high availability, and auto scaling; it can be scaled across various dimensions:

  • the number of publishers
  • the number of subscribers
  • the number and size of messages sent

glossary

  • messages: pieces of data sent by publisher apps and persisted in the GCP Pub/Sub layer; every time a message is published to a topic, this generates a cloud event (google.pubsub.topic.publish, which you can use to trigger Cloud Functions for instance); messages can have attributes attached to them, which are arbitrary key/value pairs set by the publisher to give some metadata about the payload of the message itself; each message delivered to a subscriber comes with an ACK_ID that the subscriber uses to acknowledge it
  • publishers: services (or, if you prefer, apps) that send messages (or, if you prefer, data) to one to n topics; once they have published a message to a topic, publishers are not supposed to be aware of what happens next with this data; any application that can make an HTTPS call to the GCP, whether hosted on the GCP or not, can be a publisher app
  • subscriptions: messaging queues that are hosted on the GCP and that are each associated with a single topic, from which they receive all the messages; a subscription can have zero to many subscribers
  • topics: abstractions that act as an interface between publisher apps and the GCP Pub/Sub; they live in the cloud and they are used to organize messages; in the GCP, topics are the resources that feed subscriptions with the incoming messages they receive from one to many publishers
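If it helps, the glossary can be read as a small data model. The sketch below is my own simplification: the class names, field names and the sample payload are hypothetical, and the real service obviously does much more (persistence, replication, ACK_IDs and so on). It only shows a message carrying attributes and a topic feeding every attached subscription.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    data: bytes                                      # the payload itself
    attributes: dict = field(default_factory=dict)   # metadata set by the publisher


@dataclass
class Subscription:
    name: str
    queue: list = field(default_factory=list)        # messages waiting for subscribers


@dataclass
class Topic:
    name: str
    subscriptions: list = field(default_factory=list)

    def publish(self, message):
        # A topic feeds every subscription attached to it.
        for subscription in self.subscriptions:
            subscription.queue.append(message)


topic = Topic("orders")
billing = Subscription("billing-sub")
emails = Subscription("emails-sub")
topic.subscriptions += [billing, emails]

topic.publish(Message(b'{"order_id": 42}', {"origin": "web-shop"}))
print(len(billing.queue), len(emails.queue))
```

Note that the publisher only ever touches the topic; it has no reference to the subscriptions or to whoever reads from them.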

push and pull subscriptions

Subscriptions can be of 2 types:

  • pull subscriptions wait for a subscriber to pull messages and acknowledge them before removing them from the queue
  • push subscriptions invoke a webhook when a message comes in, sending the data to that defined endpoint; in this mode, the subscription keeps retrying to send the message to the subscriber(s) until an acknowledgement has been sent back, and only then removes the message from the queue

Pull subscriptions may be chosen if you want to control the throughput of the data you may receive from a subscription at the subscriber service level, with rate limiting for instance.

Push subscriptions are great if you need automatic fast and guaranteed delivery of incoming messages to the subscriber services.
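Here is a rough sketch of the push retry loop. Everything in it is an assumption made for illustration: the `push_until_acked` helper is hypothetical, the endpoint is a plain Python callable that returns an HTTP status code, and I treat any 2xx status as an acknowledgement. Unlike the behaviour described above, this toy gives up after `max_attempts` instead of retrying indefinitely.

```python
def push_until_acked(endpoint, message, max_attempts=5):
    """Call `endpoint` until it acknowledges; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        status = endpoint(message)
        if 200 <= status < 300:  # assumption: any 2xx counts as an acknowledgement
            return attempt
    return None  # still unacknowledged after max_attempts


# A fake webhook that fails twice before answering successfully.
responses = iter([500, 503, 204])


def flaky_webhook(message):
    return next(responses)


attempts = push_until_acked(flaky_webhook, {"data": "hello"})
print(attempts)
```

With a pull subscription the loop would live on the subscriber side instead, which is what lets the subscriber decide how fast it consumes messages.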

When Pub/Sub delivers a message to a subscriber, the subscriber needs to acknowledge receipt within a specific deadline, otherwise Pub/Sub will retry delivery; if the subscriber estimates that it needs more time to process the message, it can send Pub/Sub a request to modify the acknowledgment deadline.
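The deadline mechanics are easier to follow with a toy model and a fake clock. The `PullSubscription` class below is hypothetical and lives entirely in memory; times are plain numbers of seconds passed in by the caller, and I assume that a redelivered message gets a fresh ack id.

```python
import itertools


class PullSubscription:
    """Toy model of acknowledgement deadlines (not the real client library)."""

    def __init__(self, ack_deadline=10):
        self.ack_deadline = ack_deadline
        self._backlog = []               # messages waiting for (re)delivery
        self._leased = {}                # ack_id -> (data, lease expiry time)
        self._ids = itertools.count(1)

    def enqueue(self, data):
        self._backlog.append(data)

    def pull(self, now):
        # Messages whose deadline has passed go back to the backlog first.
        for ack_id, (data, expiry) in list(self._leased.items()):
            if now >= expiry:
                del self._leased[ack_id]
                self._backlog.append(data)
        delivered = []
        while self._backlog:
            data = self._backlog.pop(0)
            ack_id = f"ack-{next(self._ids)}"
            self._leased[ack_id] = (data, now + self.ack_deadline)
            delivered.append((ack_id, data))
        return delivered

    def modify_ack_deadline(self, ack_id, seconds, now):
        # The subscriber asks for more time to process the message.
        data, _ = self._leased[ack_id]
        self._leased[ack_id] = (data, now + seconds)

    def acknowledge(self, ack_id):
        self._leased.pop(ack_id, None)


sub = PullSubscription(ack_deadline=10)
sub.enqueue("resize image")

first = sub.pull(now=0)                                   # delivered, deadline at t=10
sub.modify_ack_deadline(first[0][0], seconds=60, now=5)   # deadline moved to t=65
early = sub.pull(now=20)                                  # nothing: still within the deadline
late = sub.pull(now=70)                                   # deadline missed: redelivered
sub.acknowledge(late[0][0])
final = sub.pull(now=200)                                 # acknowledged: gone for good
```

Without the `modify_ack_deadline` call, the pull at `now=20` would already have redelivered the message.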

THE important thing to understand about subscriptions and subscribers

When a subscriber app pulls a message from a subscription, or when a subscription pushes a message to a subscriber app, the subscriber app can acknowledge this message. Once a message is acknowledged, it is removed from the subscription queue.

This acknowledgement mechanism, which relies on the ACK_ID of the message, is what makes Pub/Sub a reliable messaging system: it guarantees at-least-once delivery from the perspective of a subscription, not from the perspective of an individual subscriber on that subscription.

This reliability is reinforced by the fact that messages published to GCP Pub/Sub topics are persisted in internal storage (a topic message store), replicated and sharded for you out of the box, and kept until they are delivered and acknowledged through the different subscriptions.

Acknowledging a message deletes it from the subscription queue and ONLY FROM THAT subscription queue. Also, if multiple subscribers share the same subscription, the first one that acknowledges a message prevents the others from ever receiving it, as an acknowledged message is removed from the queue.

Be mindful of that when you're making architectural decisions. It really depends on how your system is supposed to work, but you could for instance create one subscription for each subscriber app, so that every listening service separately receives all the messages published on a given topic.
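The difference between "one subscription shared by several subscribers" and "one subscription per subscriber" can be shown with two dictionaries. This is a deliberately naive sketch: the subscription names, the message id and the helper functions are all hypothetical.

```python
# Each subscription keeps its own copy of every message published to the topic.
subscriptions = {"billing": {}, "emails": {}}


def publish(message_id, data):
    for queue in subscriptions.values():
        queue[message_id] = data


def acknowledge(subscription, message_id):
    # Returns True only for the call that actually removed the message.
    return subscriptions[subscription].pop(message_id, None) is not None


publish("m1", "invoice-42")

# Two workers share the "billing" subscription: only the first ack finds the message.
worker_a_got_it = acknowledge("billing", "m1")
worker_b_got_it = acknowledge("billing", "m1")

# The "emails" subscription is untouched by what happened on "billing".
emails_still_pending = "m1" in subscriptions["emails"]
print(worker_a_got_it, worker_b_got_it, emails_still_pending)
```

If both workers must see every message, they each need their own subscription, just like "billing" and "emails" here.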

Again, this is certainly not the only option and you may organize that as you and your business see fit.

options provided by the GCP Pub/Sub infrastructure and some other things that are good to know

  • because topics and subscriptions are GCP named resources, they can be subject to Role Based Access Control (or RBAC)
  • data snapshots are cheaper than retaining acknowledged messages at subscription level
  • deleting a topic does not automatically delete its subscriptions
  • in GCP Pub/Sub, costs are associated with
    • message ingestion (incoming messages into topics)
    • message delivery (outgoing messages on subscriptions)
    • seeking features (snapshots and retaining acknowledged messages)
    • the size of each request, with a minimum billable size of 1KB per request
  • in GCP, it's not unusual that Pub/Sub messages are copied and persisted in many different places (once inside the topic and then across the subscription queues); this is something to be aware of regarding pricing of the Pub/Sub service provided by Google
  • the --retain-acked-messages flag, set when creating or updating a GCP Pub/Sub subscription, enables the retention of acknowledged messages for this sub. Be aware that this may incur additional costs; retaining acknowledged messages does not mean that the messages stay in the subscription queue after they have been acknowledged, it just means they're saved somewhere else for whatever usage you may make of this data. If you plan to replay acknowledged messages by seeking to a timestamp, that flag needs to be enabled for the given sub.
  • the seeking feature mainly allows you to:
    • alter the acknowledgment state of messages in bulk from the subscriber side
    • mark every message received before a given timestamp as acknowledged; this means that every message received after the timestamp is marked as unacknowledged and is going to be replayed
    • discard a bunch of messages by seeking to a timestamp in the future
  • when you create a push subscription to a given endpoint, Google will verify that the domain you're pushing to is indeed yours (using the Google Search Console for instance) to prevent the risk of having anybody slamming a URL with a huge number of Pub/Sub notifications
  • you can also use the GCP seeking feature to seek for a snapshot:
    • a snapshot allows you to capture the message acknowledgment state of a subscription at a given point in time
    • a snapshot allows you to retain all unacknowledged messages at the time of the snapshot
    • a snapshot also retains all messages published to the topic after the snapshot
    • snapshots should be created ahead of time as a precaution against subscriber failures; as their name implies, they represent the state of a subscription at a point in time and all messages starting from that point will be remembered
    • snapshots are deleted after 7 days or if the oldest unacknowledged message in the snapshot exceeds the message retention duration you have set up
  • you can connect a budget alert to a GCP Pub/Sub topic to keep track of your costs
  • you can directly publish and import/export Pub/Sub messages from the web UI of the GCP
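The seeking bullets above are easier to picture with a toy subscription that keeps its acknowledged messages around (as if --retain-acked-messages were enabled). This is a sketch under my own assumptions: the `Subscription` class, its methods and the integer timestamps are hypothetical, and the real feature works on server-side state, not on a Python list.

```python
class Subscription:
    """Toy subscription that retains acknowledged messages (hypothetical model)."""

    def __init__(self):
        self.messages = []  # each entry: {"data", "time", "acked"}

    def receive(self, data, publish_time):
        self.messages.append({"data": data, "time": publish_time, "acked": False})

    def ack_all(self):
        for message in self.messages:
            message["acked"] = True

    def unacked(self):
        return [m["data"] for m in self.messages if not m["acked"]]

    def seek_to_time(self, timestamp):
        # Everything published before the timestamp becomes acknowledged,
        # everything at or after it becomes unacknowledged (and is replayed).
        for message in self.messages:
            message["acked"] = message["time"] < timestamp

    def snapshot(self, now):
        # Remember which messages were still unacknowledged at time `now`.
        pending = {i for i, m in enumerate(self.messages) if not m["acked"]}
        return {"taken_at": now, "pending": pending}

    def seek_to_snapshot(self, snap):
        # Restore the captured state; messages published after the snapshot
        # are treated as unacknowledged too.
        for i, message in enumerate(self.messages):
            keep = i in snap["pending"] or message["time"] >= snap["taken_at"]
            message["acked"] = not keep


sub = Subscription()
sub.receive("a", publish_time=1)
sub.receive("b", publish_time=2)
sub.messages[0]["acked"] = True         # "a" was processed before the snapshot
snap = sub.snapshot(now=3)
sub.receive("c", publish_time=4)
sub.ack_all()                           # a buggy subscriber acks everything

sub.seek_to_time(4)
replayed_from_time = sub.unacked()      # ["c"]
sub.seek_to_time(100)                   # a timestamp in the future
after_future_seek = sub.unacked()       # []
sub.seek_to_snapshot(snap)
replayed_from_snapshot = sub.unacked()  # ["b", "c"]
```

Seeking to the snapshot brings back "b" (unacknowledged when the snapshot was taken) and "c" (published afterwards), but not "a".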

a few useful commands using gcloud CLI

  • create a data snapshot => gcloud pubsub snapshots create {snapshot_name} --subscription={sub_name} (without the {} of course)
  • create a subscription to a given topic => gcloud pubsub subscriptions create --topic={topic-name} --ack-deadline={int_seconds} {sub_name}
  • create a topic => gcloud pubsub topics create {topic-name}
  • delete a subscription => gcloud pubsub subscriptions delete {sub_name}
  • explicitly acknowledge a Pub/Sub message as a subscriber => gcloud pubsub subscriptions ack {sub_name} --ack-ids={ACK_ID}
  • list subscriptions => gcloud pubsub subscriptions list
  • list subscriptions scoped to a given topic => gcloud pubsub topics list-subscriptions {topic_name}
  • list topics => gcloud pubsub topics list
  • publish a message to a topic => gcloud pubsub topics publish {topic-name} --message "your {plain string|json|serialized protocol buffer} message here"
  • pull messages and acknowledge them in one go => gcloud pubsub subscriptions pull {sub_name} --auto-ack --project={project_name} --limit={int}
  • seek a subscription to a snapshot => gcloud pubsub subscriptions seek {sub_name} --snapshot={snapshot_name}; the command should output a snapshot id and a subscription id, which means that the acknowledgment state of the specified subscription has been reset to the one captured in the specified snapshot
  • seek a subscription to a timestamp (let's say 15 minutes ago, with a TS_FORMAT available in env) => gcloud pubsub subscriptions seek {sub_name} --time=$(date -u -d '-15min' +$TS_FORMAT); this has the effect of unacknowledging all messages received in the subscription queue after that timestamp, meaning that all the messages within the 15-minute time frame are replayed (to seek in the future just write '+15min' for instance)
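The timestamp in the last command relies on GNU date and on a TS_FORMAT variable that I did not spell out. As a possible alternative, here is a small Python sketch that builds the same command line. The `seek_time_command` helper and the sample subscription name are hypothetical, and the ISO 8601 UTC format is my assumption of what TS_FORMAT would produce; check `gcloud topic datetimes` for the formats your gcloud version accepts.

```python
from datetime import datetime, timedelta, timezone


def seek_time_command(sub_name, minutes_ago, now=None):
    """Build the argument list of a `gcloud pubsub subscriptions seek --time` call."""
    now = now or datetime.now(timezone.utc)
    target = (now - timedelta(minutes=minutes_ago)).astimezone(timezone.utc)
    stamp = target.strftime("%Y-%m-%dT%H:%M:%SZ")  # assumed timestamp format
    return ["gcloud", "pubsub", "subscriptions", "seek", sub_name, f"--time={stamp}"]


fixed_now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
command = seek_time_command("my-sub", minutes_ago=15, now=fixed_now)
print(" ".join(command))
```

Passing a negative `minutes_ago` gives a timestamp in the future, which matches the '+15min' variant. The returned list can be handed to `subprocess.run` if you want to actually execute it.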

All these operations can also be performed programmatically using one of the many GCP Pub/Sub client libraries in the language of your choice.

That's it for my Pub/Sub notes, hope they may prove to be useful to you !👋
