
Troubleshooting

Tips on how to deal with some common errors.

Control characters are not allowed

Problem: This error can occur if you try to run ./tutor NAME k8s quickstart before terraform has finished applying your changes and creating your cluster.

error: error loading config file "/root/.kube/config": yaml: control characters are not allowed
Error: Command failed with status 1: kubectl apply --kustomize /workspace/env --wait --selector app.kubernetes.io/component=namespace

Solution: Remove the config file and let the wrapper script regenerate it:

cd my-cluster/control/
rm ../kubeconfig-private.yml
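
Once terraform has finished applying, re-running the original command should regenerate the file (a sketch, assuming the quickstart command from above):

./tutor NAME k8s quickstart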

Provisioning the first Open edX instance fails

Problem: This error can occur when you run ./tutor NAME config save --interactive while provisioning the first Open edX instance. The execution of get_kubeconfig_path errors out with a filesystem permissions error.

Solution: Bypass the calls to get_kubeconfig_path and write the output manually using:

cd my-cluster/control/
./tf output -raw kubeconfig > ../kubeconfig-private.yml
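
To confirm that the manually written kubeconfig works, a quick check (assuming the ./kubectl wrapper is available in the same directory):

./kubectl cluster-info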

Namespace "monitoring" not found

Problem: This error can occur if the initial creation of resources took too long to provision.

Solution: Run ./tf plan and ./tf apply again.
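
For example, from the control directory of your cluster repository (the directory name is illustrative):

cd my-cluster/control/
./tf plan
./tf apply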

Building an image on MacOS fails with permission errors

If you see this error on MacOS:

Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock

You can fix it by manually adjusting the permissions of the socket on the host VM:

docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock docker:dind chmod go+w /var/run/docker.sock
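
To confirm the change took effect, you can list the socket again; it should now show write permission for group and others. This check is optional and reuses the same dind image:

docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock docker:dind ls -l /var/run/docker.sock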

Cannot mount the Docker socket on MacOS

If you get an obscure error related to mount points and the docker socket, change your Docker settings to use the "osxfs (legacy)" file sharing mode.

Note: on MacOS, even if you are using the newer per-user Docker socket location (~/.docker/run/docker.sock), you must use /var/run/docker.sock as the mount source, because Docker for Mac detects that exact mount point and re-routes the mount to be served from the Docker Linux VM, not your MacOS host.
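
As an illustration, the invocation below works because the source path is the one Docker for Mac intercepts; using ~/.docker/run/docker.sock as the source would not. The docker version command is only a placeholder for whatever you run against the socket:

docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock docker:dind docker version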

Fetching images fails

Errors are of the form:

Error from server (BadRequest): container "lms" in pod "lms-job-20221010183155-jtfpn" is waiting to start: trying and failing to pull image

Or

Failed to pull image "registry.gitlab.com//grove-stage/openedx:latest": rpc error: code = Unknown desc = Error response from daemon: pull access denied for registry.gitlab.com/grove-stage/edxapp-stage/openedx, repository does not exist or may require 'docker login': denied: requested access to the resource is denied

Check that CI_REGISTRY_IMAGE is configured correctly and rebuild your infrastructure.
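
To see exactly which image reference the cluster is trying to pull, you can inspect the deployment; the namespace and deployment name below are examples, adjust them for your instance:

./kubectl get deployment lms -n openedx -o jsonpath='{.spec.template.spec.containers[0].image}'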

The node was low on resource: ephemeral-storage

The default AMI used in AWS clusters provides 20GB of storage per node. Not much data is stored on the nodes besides what is needed on the cluster. Disk space is mainly taken up by Docker images and logs.

When this error occurs, it's likely that Kubernetes hasn't had the opportunity to prune unused Docker images yet. Kubernetes' garbage collector checks every minute and cleans up these images. If your pod is stuck with this error message, delete the affected deployment and its mounted PersistentVolume, e.g. for the redis deployment:

./kubectl delete deployment -nopenedx redis
./kubectl get pvc -nopenedx redis
NAME    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS       AGE
redis   Bound    pvc-96ee4686-7434-4925-9751-5573cf52680d   1Gi        RWO            do-block-storage   67d
./kubectl delete pvc -nopenedx redis
./kubectl delete pv pvc-96ee4686-7434-4925-9751-5573cf52680d
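
To gauge how widespread the problem is, a couple of read-only checks can help (illustrative commands):

# list pods evicted due to resource pressure
./kubectl get pods -A | grep -i evicted
# check whether any node reports DiskPressure
./kubectl describe nodes | grep -i diskpressure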

If the error persists, you will need to increase the disk size on your node manually by following AWS's guide. Once done, you'll need to increase the size of the filesystem on the node.

To get a root shell on the node, you can make use of kubectl-plugins within a Grove shell:

./shell
git clone https://github.com/luksa/kubectl-plugins $HOME/kubectl-plugins
export PATH=$PATH:$HOME/kubectl-plugins
kubectl get nodes
kubectl ssh node [node-name]
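
Once you have a shell on the node, you can grow the partition and filesystem into the newly added space. This is only a sketch, assuming an ext4 root filesystem on an NVMe-backed EBS volume; device, partition, and filesystem names may differ on your AMI:

# grow partition 1 of the root volume, then resize the filesystem
sudo growpart /dev/nvme0n1 1
sudo resize2fs /dev/nvme0n1p1
# confirm the extra space is visible
df -h /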

For a permanent solution, you can change the AMI used for your nodes by updating the TF_VAR_ami_id variable in the repository's cluster.yml. Amazon provides a complete tutorial on accomplishing this.

Once your AMI is changed, your nodes will not switch to the new AMI automatically. You can force this change by recreating your nodes (replace the autoscaling_group_name and aws_region in the commands below):

./shell
aws autoscaling start-instance-refresh --auto-scaling-group-name={autoscaling_group_name} --region={aws_region}

To retrieve your autoscaling group name, run:

aws autoscaling describe-auto-scaling-groups --region={region} | jq ".AutoScalingGroups[] | .AutoScalingGroupName"
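
To follow the progress of the refresh, you can poll its status (same placeholders as above):

aws autoscaling describe-instance-refreshes --auto-scaling-group-name={autoscaling_group_name} --region={aws_region}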

Couldn't get current server API group list: the server has asked for the client to provide credentials

Problem: This error can occur if the kubeconfig-private.yml file is out of sync and you try to run ./kubectl commands.

E0904 12:42:41.259776 22 memcache.go:265] couldn't get current server API group list: the server has asked for the client to provide credentials

Solution: Remove the kubeconfig-private.yml file and regenerate it using ./tf init:

cd my-cluster/control/
rm ../kubeconfig-private.yml
./tf init
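
Afterwards, you can verify that the refreshed credentials work, for example with:

./kubectl get nodes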

Modify files inside an LMS or CMS pod and see changes without redeployment

Sometimes, to debug issues in live instances, we need to make changes to files inside the pod and see the changes without redeployment. This can be done in any LMS or CMS pod by following these steps:

  1. Install an editor in the pod to edit files; apt-get does not work due to permission issues. You can use pyvim, a pure Python Vim clone, which is available on GitHub. Run pip install pyvim to install it.
  2. Edit any file inside the pod, for example, /openedx/edx-platform/lms/envs/common.py, using pyvim.
  3. Restart the uwsgi process by running kill -HUP 1, then refresh your browser to see the changes live (see the combined sketch after this list).
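
Putting the steps together, a sketch of the whole sequence; the namespace and deployment name are examples, adjust them for your instance:

# open a shell in the LMS pod
./kubectl exec -it -n openedx deploy/lms -- bash

# inside the pod:
pip install pyvim
pyvim /openedx/edx-platform/lms/envs/common.py
kill -HUP 1   # reload uwsgi so the edit takes effect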

Warning

Any changes made to files in a pod will be lost in the next deployment. This method should only be used for debugging issues.

Rolling back helm charts

Problem: You might need to do this, for example, when ./tf apply times out while trying to upgrade a helm chart to a newer version. You might also see messages like Provider produced inconsistent result after apply or another operation (install/upgrade/rollback) is in progress. The cluster might be stuck in a broken state, so you would want to roll back the chart to recover it while you look for a solution.

Solution:

TLDR:

cd control/
./helm list -aA
./helm history -n <namespace> <name>
./helm rollback -n <namespace> <name> <revision number>

Navigate to the control/ directory of your Grove cluster repo. If you don't have a helm symlink, create it by running ln -s ../grove/wrapper-script.rb helm.

Then get a list of all installed releases with ./helm list -aA. Most of the time you won't be sure which namespace the broken or stuck release is in, so it's important to run the command with the -aA flag to list ALL installed releases (this is the step where many people get stuck).

Look for any releases whose status is not deployed and note their name and namespace. Checking the revision, updated date, and app version might also be helpful for debugging.

Before rolling back, check the history of the revisions to see which one is the last healthy one (this is important if you're doing it multiple times, and it might also give you a bit more information on where things went wrong). For that, run ./helm history -n <namespace> <name>.

Lastly, to roll back, run ./helm rollback -n <namespace> <name> <revision number>.

Unfortunately, helm gives almost no information about what went wrong, so you will have to figure it out yourself. Usually, the creation, modification, or deletion of some resource went wrong, so kubectl get/describe is your best friend. Look for resources that seem stuck and then try to figure out why. Sometimes (although very rarely) the nodes themselves can be stuck, so you might need to terminate them via the AWS/DO UI.
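
A few read-only commands that usually help narrow things down (namespace and names are placeholders):

./kubectl get pods -n <namespace>
./kubectl describe pod -n <namespace> <pod-name>
./kubectl get events -n <namespace> --sort-by=.lastTimestamp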