
Roadmap for Grove to adopt Harmony and beyond

Created at: 2023-11-26
Last update: 2023-11-27

Numerous Open edX providers and users need to deploy multiple instances of Open edX on Kubernetes. Currently, there is no standardized method for this, so each provider creates its own tools for management. Harmony seeks to address this gap by offering a straightforward, standardized approach that incorporates industry best practices and the lessons learned from previous experience.

OpenCraft is committed to leveraging community-provided tools and contributing back to the community. In line with this commitment, we are preparing to eliminate duplicated code in Grove that is also present in Harmony, contributing improvements if necessary.

Grove has undergone significant evolution and changes in direction over time. It was initially designed to establish a unified, Kubernetes-based hosting infrastructure for our P&T clients, who were previously using OpenStack VMs; Grove has since shifted focus. While the original objective of an enhanced hosting infrastructure has been realized, the landscape has changed – we no longer have P&T clients. Currently, we maintain three shared clusters – one for OpenCraft PR sandboxes, one for Axim PR sandboxes, and one serving a select group of clients. Other clients have their own dedicated Kubernetes clusters.

Our goals have changed, prompting us to consider how we can serve our clients better while maintaining high quality standards, meeting our SLAs, and staying cost-effective on our part.

Although this document's scope is broad, our intention is to outline a clear set of next steps.

Improvement areas

Grove is designed with the primary goal of establishing a production-ready Kubernetes cluster and instance deployment environment. This setup allows for centralized control through git commits, ensuring that the state of the cluster and its instances is defined by the git repository, making it the single source of truth for both configuration and change history.

The related plugin (tutor-contrib-grove) provides convenient methods for instances running on a Grove cluster to utilize auto-scaling, configure waffle flags, set up cron jobs, define LMS/CMS environment settings, and much more that is needed to operate Open edX instances in production.

Grove aims to fulfill its commitment by offering a straightforward way to achieve the tasks mentioned above. This ease of use is intended to encourage adoption within the community, whether the objective is to create a cluster for a single instance or to manage a cluster for multiple instances.

As mentioned in the preface, Grove has room for improvement in terms of ease of use. Also, as priorities shifted and multiple clusters were created for clients, the need for a central place to manage multiple clusters emerged. Although managing multiple clusters is not a common requirement, big providers in the community, such as OpenCraft, eduNEXT, and Raccoon Gang, need it as well.

During the past months, as we used Grove more widely, we identified multiple improvement areas that affect OpenCraft and the community alike:

  • The images we and Tutor build are not versioned, which makes rollbacks time-consuming (see the sketch after this list).
  • Setting up a cluster from scratch is time-consuming. Following the DigitalOcean setup guide, it takes about two hours even when all accounts (GitLab, DigitalOcean, etc.) already exist.
  • GitLab CI/CD is generally reliable, but various issues arise, occasionally multiple times within a week: agents disconnecting without apparent cause (resulting in failed deployments), workers picking up jobs with significant delays, and GitLab errors terminating jobs midway through execution, occasionally leaving the Terraform state stale as well.
  • Terraform's licensing has changed (it is no longer open source), and Grove relies on it heavily.
  • Pipelines are hard to get an overview of.
  • Grove does not integrate well with community-provided tooling (like Harmony).
  • Terraform state for the cluster and its instances is combined, hence – in theory – instances could destroy each other or accidentally apply infrastructure changes through untargeted “terraform apply” runs in the GitLab CI/CD pipelines.
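
As a minimal sketch of the image-versioning direction (not how Grove builds images today): the idea is to tag every build with the git commit it was built from, so a rollback is just a redeploy of a previous tag. The registry name below is hypothetical; the Tutor commands and the DOCKER_IMAGE_OPENEDX setting are standard Tutor.

    # Tag the openedx image with the current commit so every build is identifiable.
    GIT_SHA="$(git rev-parse --short HEAD)"

    # registry.example.com/openedx is a hypothetical registry/repository.
    tutor config save --set "DOCKER_IMAGE_OPENEDX=registry.example.com/openedx:${GIT_SHA}"
    tutor images build openedx
    tutor images push openedx

    # Rolling back then means pointing DOCKER_IMAGE_OPENEDX at an earlier tag and
    # redeploying, instead of rebuilding the image from an old commit.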

Out of these issues, the most troublesome are the CI/CD pipeline failures and GitLab's reliability. Also, as stated above, it is impossible to manage multiple clusters from one place, which makes operating multiple clusters very time-consuming. After all, Grove wasn't architected with many clusters in mind.

Keeping the above in mind, we should focus on two main areas to keep all stakeholders (OpenCraft, the community, and clients) satisfied: interoperating better with community tools and increasing the reliability of the ecosystem.

The former can be achieved relatively easily by utilizing the Harmony project and the existing third-party open-source Tutor plugins that evolved during Grove's development. The latter is harder, however, as we depend entirely on other service providers.

The resolutions for the issues above are split into the sections below.

Integrating with Harmony

Harmony and Grove have many common pieces, like setting up a cert issuer, deploying a monitoring stack, configuring a main ingress controller, or even configuring autoscaling.

To utilize the features provided by Harmony, we have to cut the common pieces out of Grove, together with those features that could be beneficial for the community. The first step is identifying which pieces are common and what should be upstreamed. Keep in mind that some features may be relevant only to OpenCraft; these should remain in Grove.

At the time of writing, Harmony supports the following features that could be removed from Grove, reducing the size of our Terraform scripts and the number of dependencies we have.

However, it does not contain the following crucial features/components at the Helm chart or Tutor plugin level.

Also, we are using some optional but useful tools that are not installed by Harmony (and the repository has no tracking issue for them), though the project might benefit from them.

In order to give back more to the community (and profit from shared maintenance), we could upstream the missing features/components and make a proposal for the optional ones.

With the introduction of Harmony, the Helm chart, the related resources, and their configuration will live in Grove's configuration, hence Harmony version changes can be controlled through the Grove repository. In other words, Grove controls which Harmony version runs in a cluster, keeping the single source of truth in Grove.

As Harmony will be a Helm chart (deploying and configuring other Helm charts as well), its integration into Grove will be straightforward, and we probably have an easier task compared to other providers.
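
For illustration, a minimal sketch of how Grove could pin and install Harmony, assuming the chart is published in a Helm repository and that Grove renders a per-cluster values file; the repository URL, chart name, version, and file paths below are assumptions, not the final integration.

    # Version pinned in the Grove repository so git stays the single source of truth.
    HARMONY_VERSION="0.1.0"  # placeholder; pin to the actual chart version

    # Assumed Helm repository location of the Harmony chart.
    helm repo add harmony https://openedx.github.io/openedx-k8s-harmony
    helm repo update

    # Hypothetical path to the values file rendered from Grove's cluster config.
    helm upgrade --install harmony harmony/harmony-chart \
        --namespace harmony --create-namespace \
        --version "${HARMONY_VERSION}" \
        --values ./clusters/my-cluster/harmony-values.yaml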

Refactoring Grove’s state management

The potential for trouble in state management arises from the shared nature of the state across both infrastructure and instances. While this approach offers certain advantages, it also introduces significant drawbacks, such as the risk of unintentional state manipulation affecting resources that were not intended for modification.

In order to resolve these issues, we must separate the state of the infrastructure from that of the instances. The more we can separate them, the less likely we are to face state-manipulation issues. The best approach would be to split the Terraform scripts so that the infrastructure and every instance have their own state.

Although this sounds good on paper, in reality it can cause complications: duplicating the Terraform scripts per instance would lead to code duplication that is hard to manage and maintain. A possible way forward, however, is to parameterize the instance Terraform scripts so they can be executed once per instance.

The benefit of doing so is a separate state for the infrastructure and for each instance, while keeping states and code responsibilities apart. This approach comes with drawbacks as well: modules split across multiple Terraform scripts cannot read each other's outputs by default. This can be handled with a little extra work, namely storing the output of the “terraform output” command in input files for the subsequent terraform plan and terraform apply commands. It also requires extra logic in the ./control/tf wrapper command so it knows whether we want to manage instance or infrastructure scripts.
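
A minimal sketch of that output-passing, assuming hypothetical infrastructure/ and instances/ directories and a hypothetical instance_name input variable; the real layout and the extra ./control/tf logic would differ.

    # Apply the infrastructure with its own, separate state.
    terraform -chdir=infrastructure init
    terraform -chdir=infrastructure apply

    # Export the infrastructure outputs as plain input variables for the instance
    # scripts (jq strips the type/sensitivity metadata from "terraform output -json").
    terraform -chdir=infrastructure output -json \
        | jq 'map_values(.value)' > instances/infra.auto.tfvars.json

    # Plan/apply a single instance against its own state, selected via a workspace.
    terraform -chdir=instances init
    terraform -chdir=instances workspace select -or-create=true my-instance
    terraform -chdir=instances apply -var "instance_name=my-instance"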

Resolving reliability issues

Probably the hardest part of the roadmap is improving reliability. The main concern lies in GitLab. Our initial assumption was that having a unified place for the source code, the Terraform states, and the cluster's CI and CD is a superb approach, as we don't have to handle many moving pieces.

At the beginning, when the cluster held only a few instances and the code base was updated relatively infrequently, this wasn't an issue at all; it served us really well with no noticeable problems. To be clear, it is still not an issue on those clusters that serve only a handful of instances.

On the other hand, if a cluster has to serve roughly five or more instances with a high update frequency (like a shared cluster with active development on the instances), the issues soon become visible. Instance states clash, jobs start to fail, and some MRs are not even merged because GitLab runs into internal issues and thinks the git state needs to be resolved manually. Simply clicking the "Merge" button on the MR helps, though it is far from automated.

Part of the issue lies in the many pipeline triggers we fire, though that shouldn't be a problem at this scale. Also, pipelines cannot easily depend on each other, if they can at all: we can define some pipeline dependencies and execution restrictions, but these pipeline definitions have limitations we cannot work around as things stand today. At this point, replacing the CI/CD pipelines would be beneficial, as GitLab's CI/CD infrastructure is clearly not designed for our use case.

An alternative, and probably more reliable, option is the Argo ecosystem. Some of its components are already used by other providers in the community (eduNEXT), and we have never heard a complaint from them. We have already discussed within OpenCraft (between Braden MacDonald and Gábor Boros) whether Argo would serve us better. Argo is designed for GitOps, so it should handle "triggers" based on repository changes far better than GitLab.

By introducing Argo Events (for triggering), Argo Workflows (for CI and other workflows), and Argo CD (for deployment), we could achieve a far more reliable and stable ecosystem, running in the given cluster if we want. This shift would mean that the cluster itself becomes responsible for deploying its Open edX instances. The approach reduces connectivity- and agent-related failures to zero and encapsulates all cluster-related operations in the affected cluster.
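
A minimal sketch of the in-cluster Argo CD part, using the official install manifest; the Grove repository URL and path layout below are hypothetical, and Argo Events and Workflows would be installed and wired up separately.

    # Install Argo CD into the cluster it will manage.
    kubectl create namespace argocd
    kubectl apply -n argocd \
        -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

    # Track an instance's definitions in the Grove repository (hypothetical URL/path)
    # and deploy them onto this same cluster, removing the GitLab agent from the path.
    argocd app create my-instance \
        --repo https://gitlab.example.com/opencraft/grove-cluster.git \
        --path instances/my-instance \
        --dest-server https://kubernetes.default.svc \
        --dest-namespace my-instance \
        --sync-policy automated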

This also opens the possibility of another feature that would resolve a long-standing issue: we would be able to manage multiple clusters' instance deployments from a single, centralized Argo CD installation. In fact, we could architect a solution using Argo CD's existing features that allows our clients to take control of their clusters if they want to part ways with us. This is a "no vendor lock-in" selling point for clients.
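
And a sketch of the centralized variant, again with hypothetical names: "argocd cluster add" registers a remote cluster from the current kubeconfig, after which Applications can target it by name.

    # From the central Argo CD installation, register a client's cluster using its
    # kubeconfig context (context and names below are hypothetical).
    argocd cluster add client-a-context --name client-a

    # Deploy an instance from the shared Grove repository onto that remote cluster.
    argocd app create client-a-lms \
        --repo https://gitlab.example.com/opencraft/grove-cluster.git \
        --path instances/client-a-lms \
        --dest-name client-a \
        --dest-namespace client-a-lms \
        --sync-policy automated

    # Handing a cluster over later mostly means re-pointing these Applications at the
    # client's own repository and Argo CD installation, supporting the no-lock-in goal.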

Summary

This document summarized the current issues with Grove and potential resolutions to them; however, not every item has the same priority. The action items are listed below in the proposed order of execution.

  • Remove the components already installed by Harmony; install and configure the Harmony Helm chart from Grove.
  • Upstream missing mandatory features to Harmony (listed above in the “Integrating with Harmony” section)
  • Upstream missing optional features to Harmony (listed above in the “Integrating with Harmony” section)
  • Refactor Grove’s state management of the infrastructure and instances
  • Introduce Argo CD for the clusters (within the clusters) to replace the CD part of the GitLab pipeline
  • Create a discovery and proof of concept for using Argo Events & Workflows as CI
  • Create a discovery and proof of concept for using Argo CD as a central management platform of Grove clusters

Of course, these steps could be broken down into multiple smaller tasks, but that is out of scope for this document. On day one (i.e. the next sprint), we could start working on replacing the current components with Harmony, which would be a big win.