
Batch redeployment

Use cases

  1. Continuous delivery.
  2. Mass config update.
  3. Upgrades (not covered by this discovery).

Current state

Currently, we are:

  1. Generating our deployment pipelines in Grove with the generate-tutor-stage.rb script. Pipelines are generated whenever a commit is added to the default branch; instance deployments require manual confirmation.
  2. Hosting our customized Open edX branches (e.g. opencraft-release/koa.3) on GitHub.
  3. Using grove-images to build Open edX images for specific branches. We can either keep the build process there or integrate it with Grove. This discovery assumes the latter.
  4. Using Grove to deploy instances.

Proposed solution

  1. We will be running multiple releases, so we need to define branch-name prefixes for the instance redeployments.
  2. We want to automatically redeploy all instances that use a custom branch. We can use the GitHub push webhook to announce the changes: the webhook will be sent from each of our custom Open edX repositories when we push new commits to the branch. The webhook endpoint also needs to be secured (see the sketch below the diagram).
  3. The webhook will pass the following arguments that alter the deployment behavior:
       1. Prefixes for the pipeline generator. They would be applied, with when: on_success, to the instances that use the specified branch or have a specific prefix.
       2. Configurations to update.
```mermaid
graph TB
    subgraph Continuous Delivery
        GH[(GitHub Repository)] --> GH1(Send a webhook with:<br/> 1. Repository<br/> 2. Branch name)
    end
    subgraph Batch Redeployments
        GC[(Grove Console)] --> GC1(Send a webhook with:<br/> 1. Repository<br/> 2. Branch name prefix<br/> 3. Optional configurations to update)
    end
    GH1 --> G[(Grove)]
    GC1 --> G
    G --> G1(Build the base image)
    G1 --> T1(Disable Terraform state lock)
    T1 --> T2(Terraform init)
    T2 --> T3(Terraform plan)
    T3 --> T4(Terraform apply)
    T4 --> G2(Generate a pipeline for each specified instance)
    G2 --> I11(Update configurations)
    G2 --> I21(Update configurations)
    G2 --> I31(Update configurations)
    subgraph Instance 3
        I31 --> I32(Redeploy instances)
    end
    subgraph Instance 2
        I21 --> I22(Redeploy instances)
    end
    subgraph Instance 1
        I11 --> I12(Redeploy instances)
    end
```
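
As a minimal sketch of how the push webhook could be received, verified, and translated into a batch redeployment: the example below assumes a Flask endpoint, a shared secret configured on each repository, and a hypothetical trigger_batch_redeployment helper. None of these names are existing Grove code; they only illustrate the protocol (GitHub signs push payloads with an HMAC SHA-256 in the X-Hub-Signature-256 header).

```python
import hashlib
import hmac
import os

from flask import Flask, abort, request  # Flask is only an example framework choice

app = Flask(__name__)
WEBHOOK_SECRET = os.environ["GITHUB_WEBHOOK_SECRET"]  # shared secret configured on the GitHub webhooks


def signature_is_valid(payload: bytes, signature_header: str | None) -> bool:
    """Verify GitHub's X-Hub-Signature-256 header (HMAC SHA-256 of the raw body)."""
    expected = "sha256=" + hmac.new(WEBHOOK_SECRET.encode(), payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header or "")


def trigger_batch_redeployment(repository: str, branch: str) -> None:
    # Hypothetical helper: in Grove this would start the pipeline that builds
    # the base image and generates per-instance deployment pipelines.
    print(f"Redeploying instances tracking {repository}@{branch}")


@app.route("/webhooks/github", methods=["POST"])  # endpoint path is illustrative
def github_push():
    if not signature_is_valid(request.get_data(), request.headers.get("X-Hub-Signature-256")):
        abort(403)

    event = request.get_json()
    repository = event["repository"]["full_name"]       # e.g. our custom Open edX fork
    branch = event["ref"].removeprefix("refs/heads/")   # e.g. "opencraft-release/koa.3"
    trigger_batch_redeployment(repository=repository, branch=branch)
    return "", 204
```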

Next steps

Common image builder in Grove

We created grove-images for building the Tutor image. However, it uses tutor-openedx, which is deprecated, and since this functionality is already implemented in Grove, the repository is redundant.

Most instances will need a custom Open edX image, and we currently build those images with grove-images. It is a relatively simple job, so we should:

  1. Add an image-building step to the Grove CI. It will build a common vanilla image from the redeployed branch, and other instances will reuse it as a cache to optimize the build process.
  2. Optional extension - build an image with pre-defined XBlocks that we want to install for all instances.
  3. Archive the grove-images repository.

As we have added BuildKit support to Tutor, building common images will significantly speed up subsequent builds. If no changes are introduced to the image, the build time drops from 42 minutes to 4 minutes.
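
As a rough illustration of that caching step, here is a minimal sketch of how a CI job could reuse the common vanilla image as a BuildKit cache source. The registry path, image names, and branch tag are assumptions for the example, not existing Grove conventions:

```python
import os
import subprocess

# Assumed names; the actual registry layout and tags are up to Grove.
REGISTRY = "registry.example.com/grove"
BRANCH = "opencraft-release/koa.3"
TAG = BRANCH.replace("/", "-")
CACHE_IMAGE = f"{REGISTRY}/openedx-vanilla:{TAG}"           # common vanilla image built once per branch
INSTANCE_IMAGE = f"{REGISTRY}/openedx-some-instance:{TAG}"  # per-instance image


def run(*args: str, check: bool = True) -> None:
    """Run a command with BuildKit enabled, streaming output to the CI log."""
    subprocess.run(args, check=check, env={**os.environ, "DOCKER_BUILDKIT": "1"})


# Pull the shared vanilla image; tolerate a cache miss on the very first build.
run("docker", "pull", CACHE_IMAGE, check=False)

# Build the instance image, reusing the vanilla image's layers as a cache.
run(
    "docker", "build",
    "--cache-from", CACHE_IMAGE,
    "--build-arg", "BUILDKIT_INLINE_CACHE=1",  # embed cache metadata so later builds can reuse this image too
    "--tag", INSTANCE_IMAGE,
    ".",
)
run("docker", "push", INSTANCE_IMAGE)
```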

When the instance requires:

  1. Custom dependencies - they are installed as step 10/10 of the python-requirements part.
  2. A custom theme - compiling it is the last meaningful step of the build process; it takes around 3 minutes for our [simple-theme].
  3. Custom translations - they are compiled before the theme and should increase the build time by about a minute.

Parallelize the builds

We want to redeploy all instances as quickly as possible, so we need to optimize the redeployment process. If needed, there are several ways to parallelize these builds. We can:

  1. Run an AWS EKS cluster with GitLab Runners that use the Kubernetes executor. It would simplify the setup, but:
       1. We would need a separate cluster because we would be exposing the Docker socket, which is not suitable for production environments. This could significantly increase the costs.
       2. Docker-in-docker does not appear to be well-supported; it might require a deeper investigation.
  2. Set up AWS EC2 Autoscaling. It could be a bit harder to set up with Terraform than the previous option.
  3. Look for another solution for scaling GitLab Runners.

Cancelling deployments

There are some cases when we would like to cancel the scheduled/running deployment. For example:

  1. We have discovered a bug in the merged PR, so redeploying instances would break them.
  2. The introduced change requires a configuration update for all related instances. To perform such a redeployment, we will cancel the CD pipelines and then request the redeployment from the console, providing the optional configurations to update.

Therefore, we need a convenient way to cancel the deployments without doing it manually for each instance. As GitLab doesn't support cascading parent/child pipeline cancellation, we should use the following approach (see the sketch after this list):

  1. List pipeline jobs through the GitLab API.
  2. Cancel each of them.
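
A minimal sketch of that approach against the GitLab REST API, using plain requests (python-gitlab would work equally well). The project ID, pipeline ID, and token environment variable are placeholders, and pagination is omitted for brevity. It also walks the pipeline's bridge jobs into child pipelines, since GitLab does not cascade the cancellation for us:

```python
import os

import requests

GITLAB_URL = os.environ.get("GITLAB_URL", "https://gitlab.com")
API = f"{GITLAB_URL}/api/v4"
HEADERS = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}  # token with api scope


def cancel_pipeline_tree(project_id: int, pipeline_id: int) -> None:
    """Cancel a pipeline's jobs and, recursively, its child pipelines."""
    # 1. List the pipeline's jobs and cancel the ones that are not finished yet.
    jobs = requests.get(
        f"{API}/projects/{project_id}/pipelines/{pipeline_id}/jobs", headers=HEADERS
    ).json()
    for job in jobs:
        if job["status"] in ("created", "pending", "running"):
            requests.post(f"{API}/projects/{project_id}/jobs/{job['id']}/cancel", headers=HEADERS)

    # 2. Bridge jobs trigger child pipelines; recurse into them because GitLab
    #    does not cascade the cancellation from the parent.
    bridges = requests.get(
        f"{API}/projects/{project_id}/pipelines/{pipeline_id}/bridges", headers=HEADERS
    ).json()
    for bridge in bridges:
        downstream = bridge.get("downstream_pipeline")
        if downstream:
            cancel_pipeline_tree(downstream["project_id"], downstream["id"])

    # 3. Finally cancel the pipeline itself so it no longer shows as running.
    requests.post(f"{API}/projects/{project_id}/pipelines/{pipeline_id}/cancel", headers=HEADERS)


if __name__ == "__main__":
    # Example placeholders: cancel pipeline 123 in project 42.
    cancel_pipeline_tree(project_id=42, pipeline_id=123)
```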

We need to consider whether it would be more suitable to implement this in:

  1. The CI (used as a trigger) - this way we will keep the GitLab-specific logic in Grove.
  2. The console backend - this will simplify the CI, and potentially provide more flexibility.

Instance-specific configuration

It is going to be a part of BB-4779.