Maintenance Pages¶

Use cases¶

Deployment and operations.
Incident management.

Current state¶

In a Grove cluster, the default Nginx error pages are shown while an instance is being created or is unavailable.

Using Ansible playbooks, we had the option to show the following maintenance pages:

Error page that is displayed when a server throws a 50x error.
Maintenance page to be displayed when a server is undergoing scheduled maintenance. This page can be edited on the live server to include a maintenance window if needed.
Maintenance page to be displayed when there is unscheduled maintenance.
Maintenance page to be displayed when a server for an instance is in the process of being provisioned.

These pages share the same static files, including styling. As the Ansible playbooks configured an Nginx server behind the scenes, we were able to use client-specific maintenance pages if needed.

Providing similar functionality in the current Grove setup requires modifying the Nginx ingress controller and serving static sites.

Achieving this functionality requires us to serve the static pages per instance. To serve the pages, we have numerous options, like:

Serving from a static site hosting platform.
Serving from an S3 compatible bucket.
Serving the sites in a pod, deployed in conjunction with the instance pods.

All options have their advantages and disadvantages. For pros and cons, see the "Pros and cons" section below.

Pros and cons¶

Serving from a static site hosting platform¶

Pros¶

Ability to turn on CDN features (in most cases)
Complex static pages
Served by client if needed

Cons¶

Depending on a 3rd party for showing maintenance pages
If the hosting service is not available, the maintenance page is not shown

Serving from an S3 compatible bucket¶

Pros¶

Ability to turn on CDN features (in most cases)
Served by client if needed

Cons¶

No option for complex static pages

Serving the sites in a pod, deployed in conjunction with the instance pods¶

Pros¶

Deploy the pod separately from the edX platform
Running on the Kubernetes cluster
Complex static pages

Cons¶

No option for CDN
Changing the content of the site requires a redeployment

Proposed solution¶

It is apparent from the above that serving maintenance pages for a Kubernetes cluster, which depends on continuously changing infrastructure where resources may not be available when we need them, is not a common, well-established practice.

Although every option has its tradeoffs, serving static pages from S3 has the lowest impact on cluster usage and requires very little maintenance. However, this option does not allow complex static pages, although it covers all of the current needs. Also, by serving maintenance pages from a bucket, we can use the same bucket for customers with the same design, and we only need to change maintenance page styling when it is truly necessary.

Since the maintenance pages should be available at multiple phases of an instance's lifecycle, we cannot completely automate the processes. We need to handle the following lifecycle scenarios:

services are being provisioned
already provisioned services
no services are available (no registered customer)
scheduled/unscheduled maintenance pages

Services are being provisioned or no services are available¶

At the time of writing, the ingress controller returns a default 404 page for unknown instances. An instance is known when the Caddy server is deployed and serving the desired traffic.

In the current ecosystem, we would have to change the main ingress controller every time a new instance is provisioned to avoid the 404 page and show a "provisioning in progress" page. Although the ingress controller is canary-deployed, changing its configuration every time increases the risk of failure. Also, those configuration updates would affect all incoming traffic, and hence all served instances.

Instead, we would serve a common page for these two scenarios: a page that says "The instance is being provisioned or not found", or similar.

Note: If we change our mind later, and we would like to serve different pages for these two states, we can still implement that with really low effort.

Already provisioned services¶

When the service has been provisioned at least once, we would deploy a patched version of the Caddy configuration provided by Tutor that points to an S3 bucket endpoint for 5xx pages. This way, we reuse existing resources (Caddy).
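As a rough illustration of what such a patch might contain, the sketch below renders a Caddyfile fragment that proxies error pages from a bucket. It is only a sketch under assumptions: the bucket endpoint, the page name 5xx.html and the function name are hypothetical, and how the fragment is injected into the Tutor-provided Caddyfile is left open. Note that Caddy's handle_errors block only covers errors raised by Caddy itself (for example a 502 when the upstream is unreachable); error responses returned by the upstream would need separate handling.

```python
from string import Template

# Caddyfile fragment. handle_errors, rewrite, reverse_proxy and the
# {err.status_code} placeholder are Caddy features; the bucket endpoint
# and page name are placeholders invented for this sketch.
ERROR_PAGE_SNIPPET = Template("""\
handle_errors {
    @server_error expression `{err.status_code} >= 500`
    handle @server_error {
        rewrite * /$page
        reverse_proxy $bucket_endpoint {
            header_up Host {upstream_hostport}
        }
    }
}
""")


def render_error_page_snippet(bucket_endpoint: str, page: str = "5xx.html") -> str:
    """Render the fragment for one bucket endpoint (hypothetical helper)."""
    if not bucket_endpoint.startswith("https://"):
        raise ValueError("bucket endpoint must be an https:// URL")
    return ERROR_PAGE_SNIPPET.substitute(
        bucket_endpoint=bucket_endpoint,
        page=page.lstrip("/"),
    )


if __name__ == "__main__":
    # Example endpoint only; replace with the real bucket website endpoint.
    print(render_error_page_snippet("https://maintenance-pages.example.com"))
```

Keeping the fragment in a template means the same patch can be reused for every instance that shares a bucket, with only the endpoint changing.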

Scheduled/unscheduled maintenance pages¶

These pages require manual intervention. Since the Caddy server of the platform installation may be affected by the maintenance, we cannot rely on changing that configuration. Instead, we need to change the main ingress controller's routing to serve a static page for the given site.

To do that, we would need to execute kubectl commands, applying resource configuration. These configuration files would depend on the currently active configuration, meaning we need to download that first. After downloading the configuration, we can add the route for the instance pointing to an S3 bucket and update the main ingress controller with the new config.
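The "add the route" step could look roughly like the sketch below. It assumes, purely for illustration, that the routing is expressed as a networking.k8s.io/v1 Ingress manifest loaded into a dictionary, and that a Service (named maintenance-pages here, a hypothetical name) fronts the S3 bucket. The actual resource type and names depend on how the main ingress controller is configured in Grove.

```python
import copy


def add_maintenance_route(ingress: dict, host: str, service_name: str, port: int = 80) -> dict:
    """Return a copy of `ingress` in which `host` is routed to `service_name`.

    The input manifest is never modified, so the downloaded configuration
    can be kept as a backup for restoring normal routing later.
    """
    # Guard against the failure mode of putting the whole cluster into
    # maintenance: an empty or wildcard host would match every instance.
    if not host or "*" in host:
        raise ValueError("refusing to route an empty or wildcard host")

    patched = copy.deepcopy(ingress)
    rule = {
        "host": host,
        "http": {
            "paths": [
                {
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {
                        "service": {"name": service_name, "port": {"number": port}}
                    },
                }
            ]
        },
    }

    rules = patched.setdefault("spec", {}).setdefault("rules", [])
    for index, existing in enumerate(rules):
        if existing.get("host") == host:
            rules[index] = rule  # replace only the rule for this instance
            break
    else:
        rules.append(rule)  # the host had no rule yet
    return patched
```

The downloaded manifest (for example the JSON output of kubectl get) would be passed in, and the returned copy written back with kubectl apply. Rules for all other hosts are left untouched.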

This work could and should be automated within the scope of the implementation ticket. The routing configuration is simple but error-prone, so reducing the chance of human error is worthwhile. If we accidentally change the routing configuration in an unexpected way, we could easily show the maintenance page for the whole cluster, not just for one instance.

Although the exact naming is up to the implementor, the script could be invoked as ./grove enable_maintenance {instance_name}, where instance_name is a single instance or a list of instances.
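A minimal sketch of the argument handling for such a command is shown below. The validation pattern (lowercase letters, digits and hyphens) is an assumption about how instances are named, and parse_instance_names is a hypothetical helper, not an existing Grove function.

```python
import argparse
import re

# Assumed naming rule for instances; adjust to the real convention.
INSTANCE_NAME = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$")


def parse_instance_names(argv: list[str]) -> list[str]:
    """Parse one or more instance names from the command line."""
    parser = argparse.ArgumentParser(prog="grove enable_maintenance")
    parser.add_argument(
        "instance_names",
        nargs="+",  # at least one instance must be given explicitly
        metavar="instance_name",
        help="instance(s) to put into maintenance mode",
    )
    args = parser.parse_args(argv)
    for name in args.instance_names:
        if not INSTANCE_NAME.match(name):
            # parser.error prints a usage message and exits with status 2
            parser.error(f"invalid instance name: {name!r}")
    return args.instance_names
```

Requiring at least one explicit name means the command can never fall back to "all instances", which matches the concern about accidentally showing the maintenance page for the whole cluster.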

Summary¶

In conclusion, we would:

change the main ingress controller to serve "The instance is being provisioned or not found" page for all 404 errors (meaning an instance is not found)
change the Caddy configuration for installations to serve 5xx errors with a custom page, pointing to an S3 bucket
set up documentation, a resource config template for the main ingress controller configuration, and a user guide for setting up maintenance pages