GitProtect Team for GitProtect

Posted on Jun 9, 2023 • Edited on Apr 8, 2024 • Originally published at gitprotect.io

GitHub Backup – Why Is It Important to Backup GitHub Metadata and Why You Do It Wrong

#devops #developers #github #backup

It’s a well-known fact that all great things consist of a huge number of small elements. That’s the way our world exists, and that’s the way everything created by humans works. The same can be said of IT projects. They are built from small, individual modules that let people create banking applications, provide video streaming, or do many other things to improve their work. It’s worth keeping that in mind.

When it comes to backup, it’s exactly the same. Do you think it is enough to back up only your database or source code? Nope… Nowadays, many projects are created by large groups of programmers. While developing and writing new lines of code, DevOps teams create not only the code itself, but also a lot of other data, so-called metadata.

What is Git metadata?

To be honest, metadata is a rather general term. The simplest definition tells us that it is “data that provides information about other data.” This is obviously true, but it is not enough. According to Wikipedia, there are many types of metadata. Let me list just a few of the most popular ones with some examples:

descriptive, which is used for identification: title, abstract, keywords
structural, which describes the types, versions, and relationships between elements
administrative, which is used for managing access: resource type, permissions
legal, which is information about copyrights, licensing, etc.

Of course, this is only a simplification, and these are not all the possible groups. For example, we also have metadata in databases, which includes information describing the names and sizes of tables, or the data types of the columns of a particular table. The topic is quite varied, and it definitely applies to repositories as well. So, what kind of metadata can we find there? Here are some examples:

wiki – documentation, separated from the code
issues and comments – ideas, solutions for common problems, etc.
pull requests – discussions and source code that isn’t merged yet
pipelines/actions – crucial for a proper CI/CD process
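To make the list above more concrete, here is a minimal sketch of where these kinds of metadata can be read from on GitHub. It only builds URLs and makes no network calls. The helper name `metadata_urls` and the `octo`/`demo` values are hypothetical, and a real export would also need authentication and pagination, which are left out here.

```python
# Sketch only: maps the metadata kinds listed above to GitHub REST API
# list endpoints. Helper and variable names are illustrative assumptions.
API_ROOT = "https://api.github.com"

# Path suffixes under /repos/{owner}/{repo}
METADATA_ENDPOINTS = {
    "issues": "issues?state=all",          # open and closed issues
    "issue_comments": "issues/comments",   # comments across all issues
    "pull_requests": "pulls?state=all",    # open, closed and merged PRs
    "workflows": "actions/workflows",      # GitHub Actions workflows
}


def metadata_urls(owner: str, repo: str) -> dict:
    """Return a name -> URL mapping for one repository's metadata."""
    base = f"{API_ROOT}/repos/{owner}/{repo}"
    urls = {name: f"{base}/{path}" for name, path in METADATA_ENDPOINTS.items()}
    # The wiki is a separate Git repository, so it is cloned rather than
    # fetched through the REST API.
    urls["wiki"] = f"https://github.com/{owner}/{repo}.wiki.git"
    return urls


if __name__ == "__main__":
    for name, url in metadata_urls("octo", "demo").items():
        print(f"{name}: {url}")
```

Note that every kind of metadata has its own source, which is part of why a plain copy of the source code leaves most of it behind.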

It’s always important to know what our projects really consist of “behind the scenes”. Why? Because we need to be prepared for all eventualities: a ransomware attack, a technical failure, or simply a decision to migrate to another service provider (e.g. from GitHub to Bitbucket or GitLab). Unfortunately, copying and restoring the source code and data alone is not enough to continue work smoothly and without interruption. Our teams need metadata to continue developing our product, even if not everyone realizes it.

How to back up GitHub repositories and metadata

In general, a backup is a copy of data taken and stored somewhere. What is important here is that the user can choose the storage: whether to keep data in the cloud, locally, or in a hybrid form (when both cloud and local instances are used). Why is that so important? Because that copy may be needed to restore the original data, and where it is stored decides whether it survives a failure. The 3-2-1 backup rule, for example, calls for 3 copies on 2 different storages, 1 of which is offsite. So to say, it’s your business continuity assurance.
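The 3-2-1 rule is easy to express as a check. The sketch below follows the wording used in this article (3 copies, 2 different storages, 1 offsite). The `BackupCopy` type and the storage labels are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackupCopy:
    storage: str   # label of the storage holding this copy (hypothetical)
    offsite: bool  # True if the copy is kept away from the primary site


def satisfies_3_2_1(copies: List[BackupCopy]) -> bool:
    """Check the rule: at least 3 copies, 2 storages, 1 offsite."""
    enough_copies = len(copies) >= 3
    distinct_storages = len({c.storage for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and distinct_storages and has_offsite


if __name__ == "__main__":
    plan = [
        BackupCopy("local-nas", offsite=False),
        BackupCopy("local-nas", offsite=False),
        BackupCopy("cloud-bucket", offsite=True),
    ]
    print(satisfies_3_2_1(plan))
```

A plan with three copies on a single local storage fails the check twice over: there is only one storage, and nothing is offsite.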

What other features should your backup plan include? Let’s see… First of all, it should be automated. What does that mean? For example, suppose you need to delegate somebody from your DevOps team to write backup scripts, run those scripts, and check the results. Can we call that automation? Nope… It’s time-consuming and distracts your employees from their core duties.

However, if you set up a backup plan and it runs every day without any manual intervention, that’s backup automation. For example, GitProtect.io even provides monitoring via Slack or email notifications, advanced audit logs, and more. Will it distract you and your team from your core duties? Nope… Your team can continue coding with peace of mind, knowing that their work is well protected.

What is more, a good backup plan should include such features as:

AES encryption – when data is encrypted at rest and in flight with your own encryption key,
long-term retention – when you are not limited to the “standard” 90 days that GitHub provides its users with, and can keep your data for as long as you need, even forever,
ransomware protection – when you have immutable, WORM-compliant storage which writes your copies once and reads them many times. Why is it important? Even if malware hits your storage, it cannot overwrite or encrypt the copies already written. So, your data will be safe and sound,
restore – when you can use point-in-time or granular restore, cross-over recovery to another Git hosting platform (e.g. from GitHub to GitLab), and restore to the same or a new repo/account, or to your local device,
Disaster Recovery Technology – when you have a comprehensive, step-by-step plan of the actions to take in the event of any failure: GitHub is down, your GitHub infrastructure is down, or your backup provider’s infrastructure is down.
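As an illustration of the retention item in the list above, the following sketch selects which backup copies fall outside a retention window. The function name and the convention that `None` means “keep forever” are assumptions made for this example, not the behaviour of any particular product.

```python
from datetime import date, timedelta
from typing import List, Optional


def expired_copies(copy_dates: List[date], today: date,
                   retention_days: Optional[int]) -> List[date]:
    """Return the dates of copies older than the retention window.

    retention_days=None is treated as unlimited retention (assumption),
    so nothing is ever reported as expired.
    """
    if retention_days is None:
        return []
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in copy_dates if d < cutoff)


if __name__ == "__main__":
    copies = [date(2024, 4, 1), date(2024, 1, 1), date(2023, 6, 9)]
    print(expired_copies(copies, today=date(2024, 4, 8), retention_days=90))
    print(expired_copies(copies, today=date(2024, 4, 8), retention_days=None))
```

With a 90-day window, the two oldest copies in the example are reported as expired. With unlimited retention, all three are kept.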

For example, GitProtect.io has a lot of PRO backup features for your GitHub backup that are meant to protect your environment and ensure business continuity. Moreover, if you want to learn more about GitHub backup, it’s worth paying attention to this post, GitHub backup to S3. There you will find lots of insights on how easily and quickly you can set up your own storage when creating a new backup plan for your repositories. This is a very convenient option which allows you to use the infrastructure you already have.

A comprehensive backup plan can help your organization minimize risks and problems. It supports an uninterrupted workflow because all types of metadata are protected: not only the wikis, issues and pull requests already mentioned above, but also webhooks, tags, milestones, releases, labels and much more.

When it comes to metadata backup, it is always important to understand which solution to choose: a third-party tool built to cover the metadata, or your own backup scripts. When we speak about backup scripts, though, we need to understand that it’s difficult to protect all the metadata with them. Moreover, they can be rather time-consuming to write and maintain.
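To show what a home-grown backup script typically starts with, here is a minimal sketch that builds the Git command for mirroring one repository. The function name, the `octo`/`demo` values and the directory layout are hypothetical. The commands themselves (`git clone --mirror` and `git remote update --prune`) are standard Git.

```python
from pathlib import Path
from typing import List


def mirror_command(owner: str, repo: str, dest_root: Path) -> List[str]:
    """Build the Git command that creates or refreshes a mirror clone."""
    target = dest_root / f"{repo}.git"
    if target.exists():
        # A mirror already exists: fetch new refs and drop deleted ones.
        return ["git", "-C", str(target), "remote", "update", "--prune"]
    url = f"https://github.com/{owner}/{repo}.git"
    # First run: a bare clone that copies all refs (branches, tags).
    return ["git", "clone", "--mirror", url, str(target)]


if __name__ == "__main__":
    cmd = mirror_command("octo", "demo", Path("/tmp/backups"))
    print(" ".join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

A mirror like this captures the Git history, but issues, pull request discussions, webhooks and similar metadata are not stored in the Git repository. Each of them would need its own export and, harder still, its own restore logic, which is where the script approach gets time-consuming.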

Why is it important?

Let me make a small analogy to archaeology. When I was writing this article, I was inspired by a story I read some time ago. Imagine an archaeological find, let’s say, an ancient Greek coin. We see the object, we know that it was created during a certain period of time, that it is made of a specific material, and that it depicts the image of some politician. But what next? This is relevant information, but it’s definitely hard to draw any conclusions based on it. And what does metadata have to do with it?

Now imagine that we know that the coin was excavated on the territory of modern Spain and that it wasn’t just one lone coin, but part of a whole purse full of coins from a completely different period of time and region. With this additional information, which the object itself doesn’t contain, we can learn much more. Now we can conclude that the coin was in use for a particular number of years, that the ancient Greeks traded with tribes from the Iberian Peninsula, and so on. Many more conclusions can be drawn from these additional pieces of information.

This is what our metadata is. The data itself, in isolation from its description and context, can mean little or even nothing. So it is extremely important to determine what additional details describe the data set. Depending on the context, this is either essential information or simply useful extra information that facilitates further work. It is much the same in IT projects. The example above was a contrived situation from the world of archaeology, but the world of IT (and probably any other one) cannot exist without metadata.

In conclusion, I would like to remind you that we need to be aware of the importance of metadata, and, thus, that backing it up really matters. To back up GitHub metadata we can use existing third-party tools, such as GitProtect.io on GitHub Marketplace. With that type of solution, we can be more confident that our projects are safe. It can back up GitHub repositories and metadata, and also restore whatever metadata is currently needed. Let’s use this to our advantage.
