Disaster Recovery¶

Use cases¶

Deployment and operations.
Incident management.

Current state¶

Doing regular backups is a crucial part of providing hosting services. To reduce the maintenance burden, we use managed services for Grove wherever possible.

These managed services offer automated backups and -- most of the time -- point-in-time restoration. Although these services are covered, we have no backup solution or strategy for the remaining resources, such as AWS S3 buckets, DigitalOcean Spaces, or volumes provisioned by Kubernetes. To recover from a disaster, we would need to:

reprovision the infrastructure we had before the disaster
restore backups for managed and unmanaged services
spin up previously running instances
reconfigure networking as necessary including DNS records

The disaster can have multiple levels of severity, influenced by several factors. This discovery prepares for the following scenarios:

Complete disaster

As the name suggests, we are in a bad situation. The datacenter or a major infrastructure component we rely on may have ceased to exist. It's also possible that someone destroyed the wrong environment and accidentally affected more resources than intended -- like AWS did in 2017 with S3.

We need to recover as fast as we can, but, depending on the affected component(s), it may be slow. Also, in this scenario, we probably need to reprovision the whole infrastructure from scratch.

Everyone should be involved in the recovery.

Partial disaster

We are a bit luckier: only some of our hosting services are unavailable. This may happen because we host in multiple regions, or because only some components of the infrastructure are affected -- as in the case of the OVH network replacements.

The recovery, depending on the affected component(s), is probably faster than in the case of a complete disaster.

Every firefighter should be involved in the recovery.

Need to restore

The infrastructure is up and running properly, but something went really wrong. We may have deleted a database table accidentally or destroyed a DigitalOcean Space.

In this scenario, we need to restore from a backup.

At least one firefighter should be involved in the recovery.
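
The three scenarios above can be summarised as a small lookup table. This is only an illustrative sketch: the names `Scenario`, `Response`, and `responders_for` are hypothetical and not part of Grove, and the values simply restate the descriptions above.

```python
from dataclasses import dataclass
from enum import Enum


class Scenario(Enum):
    COMPLETE_DISASTER = "complete disaster"
    PARTIAL_DISASTER = "partial disaster"
    NEED_TO_RESTORE = "need to restore"


@dataclass(frozen=True)
class Response:
    responders: str        # who should take part in the recovery
    reprovision_all: bool  # whether a from-scratch reprovision is expected


# Values restate the scenario descriptions; nothing here is measured data.
RESPONSES = {
    Scenario.COMPLETE_DISASTER: Response("everyone", True),
    Scenario.PARTIAL_DISASTER: Response("every firefighter", False),
    Scenario.NEED_TO_RESTORE: Response("at least one firefighter", False),
}


def responders_for(scenario: Scenario) -> str:
    """Return who should be involved for the given scenario."""
    return RESPONSES[scenario].responders
```

For example, `responders_for(Scenario.NEED_TO_RESTORE)` returns `"at least one firefighter"`.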

Backup & Restore¶

Infrastructure¶

The infrastructure has no components that are created manually. Since every component is managed by Terraform and the code is hosted in a VCS, we don't really need to take backups of the infrastructure or restore it from backups.

However, it may happen that we have to restore the infrastructure itself. This is not something we can do as a 1-click solution. We need to run the Terraform scripts again -- manually or through GitLab CI. This results in the creation of a completely new cluster. It is time-consuming, especially if there is a large number of instances to deploy, but we have no better option.
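
A manual re-run could look roughly like the sketch below. It assumes the plain `terraform` CLI is on the `PATH` and is invoked directly; the helper names, the plan file name `recovery.tfplan`, and the working directory are hypothetical, and the actual Grove tooling may wrap Terraform differently.

```python
import subprocess
from typing import List


def terraform_commands(plan_file: str = "recovery.tfplan") -> List[List[str]]:
    """Build the init/plan/apply sequence for a non-interactive run."""
    return [
        ["terraform", "init", "-input=false"],
        ["terraform", "plan", "-input=false", f"-out={plan_file}"],
        # Applying a saved plan file does not prompt for confirmation.
        ["terraform", "apply", "-input=false", plan_file],
    ]


def reprovision(workdir: str, dry_run: bool = True) -> List[List[str]]:
    """Print the commands (dry run) or execute them in `workdir`."""
    commands = terraform_commands()
    for command in commands:
        if dry_run:
            print("would run:", " ".join(command))
        else:
            # check=True stops the sequence at the first failing step.
            subprocess.run(command, cwd=workdir, check=True)
    return commands
```

Defaulting to a dry run keeps the sketch safe to try; passing `dry_run=False` would run the commands for real.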

Databases¶

The databases are backed up by the respective cloud provider. Depending on the provider -- either AWS or DO -- we can restore the automatically created backups using the provider's dedicated user interface.

One interesting aspect of restoring data is selecting what to restore. In the case of a lost database table, we don't want to restore the whole database cluster, but simply that specific table.

Since we cannot access the raw backup data, providers offer to restore the whole backup as a separate database cluster. This is handy for selective restores: after the restore has completed, we can dump the affected database or table from the restored cluster and load it into the live database cluster.
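
The dump-and-load step might be sketched as follows. This assumes a MySQL database and the standard `mysqldump` and `mysql` clients, which the text above does not specify; the host names, user, and helper names are hypothetical, and credentials are expected to come from a client option file rather than the command line.

```python
import subprocess
from typing import List, Tuple


def selective_restore_commands(
    restored_host: str, live_host: str, user: str, database: str, table: str
) -> Tuple[List[str], List[str]]:
    """Build the dump (restored cluster) and load (live cluster) commands."""
    dump = [
        "mysqldump",
        f"--host={restored_host}",
        f"--user={user}",
        "--single-transaction",  # consistent snapshot for InnoDB tables
        database,
        table,
    ]
    load = ["mysql", f"--host={live_host}", f"--user={user}", database]
    return dump, load


def run_selective_restore(dump: List[str], load: List[str]) -> None:
    """Pipe the dump straight into the live cluster."""
    # Note: mysqldump emits DROP TABLE statements by default, so the
    # table on the live cluster is replaced, not merged.
    dump_proc = subprocess.Popen(dump, stdout=subprocess.PIPE)
    subprocess.run(load, stdin=dump_proc.stdout, check=True)
    dump_proc.stdout.close()
    if dump_proc.wait() != 0:
        raise RuntimeError("mysqldump failed")
```

Because the load replaces the live table, it is probably worth reviewing the dump on the restored cluster before piping it anywhere.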

In the case of a full restoration, we only need to run a restore which will create a new cluster.

Celery queues¶

Depending on when the disaster happened, we may or may not be able to restore the content of the Celery queues. Since the broker for the Celery queues is Redis, an in-memory key-value store, the data is probably gone.

The worst-case scenario is that some grades are not generated in time or automatically, and manual intervention is needed.

Doing backups for Celery queues makes no sense.

Elasticsearch¶

Elasticsearch's data is persisted on a volume per instance at the time of writing. This may change in the future to a shared Elasticsearch approach, which would have one big persistent volume attached.

Static and media assets¶

Static assets and media assets uploaded by users are stored in AWS S3/DigitalOcean Spaces. The buckets can be versioned, which provides a "restore" functionality.

Although versioning is not a real backup solution, it would make no sense to set up standalone backups for these buckets. There is a chance that a bucket is lost -- as already happened in 2017 with AWS -- though we trust that the cloud providers we use are doing regular backups.

If we accidentally delete a bucket, we cannot restore it. This is something that could happen when we deprovision a Grove cluster.

The buckets could be backed up by setting up a scheduled job.
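
Such a scheduled job could be sketched as below. It assumes an S3-compatible client such as the one returned by `boto3.client("s3")`, which is passed in rather than created here; the function name and bucket names are hypothetical.

```python
from typing import Any


def copy_bucket(client: Any, source_bucket: str, backup_bucket: str) -> int:
    """Copy every object from `source_bucket` into `backup_bucket`.

    `client` is assumed to behave like a boto3 S3 client. Returns the
    number of objects copied.
    """
    copied = 0
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=source_bucket):
        # Pages of an empty bucket carry no "Contents" key.
        for obj in page.get("Contents", []):
            # copy_object is a single server-side copy and is limited to
            # objects of up to 5 GB; larger objects need a multipart copy.
            client.copy_object(
                Bucket=backup_bucket,
                Key=obj["Key"],
                CopySource={"Bucket": source_bucket, "Key": obj["Key"]},
            )
            copied += 1
    return copied
```

For the backup to survive the deprovisioning case mentioned above, the backup bucket would presumably need to live outside the Grove cluster being destroyed.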

OpenFaaS functions¶

OpenFaaS functions are additional resources in a Grove cluster. At the time of writing, they are only used for periodic build notifications.

Since the functions' source code is hosted on GitLab, it is not necessary to back up the code. The functions can be redeployed at any time.

Monitoring¶

Backing up the monitoring components is more or less negligible. The stack has two major components we may want to back up:

OpenSearch entries
Custom Grafana dashboards

The OpenSearch pod uses a persistent volume to store its data. This can be backed up and restored as necessary. As for Grafana, the custom dashboards are stored in an SQLite database which is not persisted on a volume. To back up custom dashboards, we would need to either persist them on a volume or export them programmatically.
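
A programmatic export could use the Grafana HTTP API, roughly as sketched below. It relies on the `/api/search` and `/api/dashboards/uid/<uid>` endpoints and bearer-token authentication; the function names, base URL, token, and output directory are hypothetical, and the fetch function is injectable so the logic can be tried without a running Grafana.

```python
import json
import urllib.request
from pathlib import Path
from typing import Any, Callable


def _http_get_json(url: str, token: str) -> Any:
    """GET `url` with a bearer token and decode the JSON response."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def export_dashboards(
    base_url: str,
    token: str,
    out_dir: str,
    get_json: Callable[[str, str], Any] = _http_get_json,
) -> int:
    """Write each dashboard's JSON model to `<out_dir>/<uid>.json`."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # type=dash-db restricts the search results to dashboards.
    found = get_json(f"{base_url}/api/search?type=dash-db", token)
    for item in found:
        detail = get_json(f"{base_url}/api/dashboards/uid/{item['uid']}", token)
        target = out / f"{item['uid']}.json"
        target.write_text(json.dumps(detail["dashboard"], indent=2))
    return len(found)
```

The exported files could then be committed to version control or stored alongside other backups.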

Kubernetes cluster and volumes¶

Creating a Kubernetes cluster backup is a difficult job. Therefore, we don't try to reinvent the wheel. To back up and restore Kubernetes cluster resources, we recommend using Velero, a tool built for this purpose.
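
To give an idea of what working with Velero involves, the sketch below builds the CLI invocations for an on-demand backup, a scheduled backup, and a restore. It assumes the `velero` CLI is installed and configured against the cluster; the helper names, backup names, namespace, and cron expression are hypothetical examples.

```python
from typing import List


def backup_command(name: str, namespace: str) -> List[str]:
    """On-demand backup of a single namespace, waiting for completion."""
    return [
        "velero", "backup", "create", name,
        "--include-namespaces", namespace,
        "--wait",
    ]


def schedule_command(name: str, cron: str, namespace: str) -> List[str]:
    """Recurring backup of a namespace using a cron expression."""
    return [
        "velero", "schedule", "create", name,
        f"--schedule={cron}",
        "--include-namespaces", namespace,
    ]


def restore_command(backup_name: str) -> List[str]:
    """Restore the resources captured in an existing backup."""
    return [
        "velero", "restore", "create",
        "--from-backup", backup_name,
        "--wait",
    ]
```

For example, `schedule_command("nightly", "0 3 * * *", "my-instance")` yields a command that would back up the `my-instance` namespace every night at 03:00. Whether volume data is included depends on how Velero is configured for the provider.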

Summary¶

To implement a proper disaster recovery strategy, we need to install an additional tool on the cluster: Velero. This tool may not be required by all users of Grove, so the new feature should be turned off by default.

The backups will be done by both provider-offered backup solutions for managed services and Velero.

Next steps¶

Create a user guide for enabling/disabling backups with Velero
Add Velero integration