Troubleshooting¶
Tips on how to deal with some common errors.
Control characters are not allowed¶
Problem: This error can occur if you try to run ./tutor NAME k8s quickstart
before terraform has finished applying your changes and creating your cluster.
error: error loading config file "/root/.kube/config": yaml: control characters are not allowed
Error: Command failed with status 1: kubectl apply --kustomize /workspace/env --wait --selector app.kubernetes.io/component=namespace
Solution: Let the wrapper script regenerate the config file
cd my-cluster/control/
rm ../kubeconfig-private.yml
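The file is recreated the next time a wrapper command runs. For example, mirroring the kubeconfig regeneration steps later in this guide, you can trigger it right away with:
./tf init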
Provisioning the first Open edX instance fails¶
Problem: This error can occur when you run ./tutor NAME config save --interactive
while provisioning the first Open edX instance. The execution of get_kubeconfig_path
fails with a filesystem permissions error.
Solution: Bypass the calls to get_kubeconfig_path
and write the output manually using:
cd my-cluster/control/
./tf output -raw kubeconfig > ../kubeconfig-private.yml
Namespace "monitoring" not found¶
Problem: This error can occur if the initial creation of resources took too long to provision.
Solution: Run ./tf plan
and ./tf apply
again.
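For example, from the cluster's control directory:
cd my-cluster/control/
./tf plan
./tf apply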
Building an image on MacOS fails with permission errors¶
If you see this error on MacOS:
Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock
You can fix it by manually adjusting the permissions of the socket on the host VM:
docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock docker:dind chmod go+w /var/run/docker.sock
Cannot mount the Docker socket on MacOS¶
If you get an obscure error related to mount points and the docker socket, change your Docker settings to use the "osxfs (legacy)" file sharing mode.
Note: on MacOS, even if you are using the newer per-user Docker socket location (~/.docker/run/docker.sock), you must use /var/run/docker.sock as the mount source, because Docker for Mac detects that exact mount point and re-routes the mount to be served from the Docker Linux VM, not your MacOS host.
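As a quick sanity check (a sketch mirroring the command above; it assumes Docker for Mac is running), mounting that path and talking to the daemon from inside a container should just work:
docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock docker:dind docker ps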
Fetching images fails¶
Errors are of the form:
Error from server (BadRequest): container "lms" in pod "lms-job-20221010183155-jtfpn" is waiting to start: trying and failing to pull image
Or
Failed to pull image "registry.gitlab.com//grove-stage/openedx:latest": rpc error: code = Unknown desc = Error response from daemon: pull access denied for registry.gitlab.com/grove-stage/edxapp-stage/openedx, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
Check that the value CI_REGISTRY_IMAGE
is configured correctly and rebuild your infrastructure.
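To confirm which image reference the cluster is actually trying to pull, you can describe the failing pod, substituting the namespace and the pod name from the error message:
./kubectl describe pod -n <namespace> <pod-name> | grep -i image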
The node was low on resource: ephemeral-storage¶
The default AMI used in AWS clusters provides 20GB of storage per node. Not much data is stored on the nodes besides what is needed on the cluster. Disk space is mainly taken up by Docker images and logs.
When this error occurs, it's likely that Kubernetes hasn't had the opportunity to prune unused Docker images yet. Kubernetes' garbage collector checks every minute and cleans up these images. If your pod is stuck with this error message, delete the pod and the mounted PersistentVolume, e.g. for the redis deployment:
./kubectl delete deployment -nopenedx redis
./kubectl get pvc -nopenedx redis
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
redis Bound pvc-96ee4686-7434-4925-9751-5573cf52680d 1Gi RWO do-block-storage 67d
./kubectl delete pvc -nopenedx redis
./kubectl delete pv pvc-96ee4686-7434-4925-9751-5573cf52680d
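To check how much ephemeral storage the nodes report (and to confirm the cleanup helped), you can inspect them with read-only commands:
./kubectl get nodes
./kubectl describe node <node-name> | grep -iA5 ephemeral-storage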
If the error persists, you will need to increase the disk size on your node manually by following AWS's guide. Once done, you'll need to increase the size of the filesystem on the node.
To get a root shell on the node, you can make use of kubectl-plugins within a Grove shell:
./shell
git clone https://github.com/luksa/kubectl-plugins $HOME/kubectl-plugins
export PATH=$PATH:$HOME/kubectl-plugins
kubectl get nodes
kubectl ssh node [node-name]
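Once you have a shell on the node, growing the root partition and filesystem typically looks like the sketch below. It assumes an Amazon Linux-style AMI with an XFS root filesystem on /dev/nvme0n1p1; check lsblk and df -hT first and adjust the device and filesystem commands to match your node:
lsblk                           # identify the root device and partition
df -hT /                        # confirm the root filesystem type
sudo growpart /dev/nvme0n1 1    # grow partition 1 to use the new space
sudo xfs_growfs -d /            # grow an XFS root filesystem
# sudo resize2fs /dev/nvme0n1p1 # use this instead for an ext4 root filesystem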
For a permanent solution, you can change the AMI
used for your nodes by updating the TF_VAR_ami_id
variable in the repository's cluster.yml
. Amazon provides a complete tutorial on accomplishing this.
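For illustration only, the override in cluster.yml might look like the following; the AMI ID is hypothetical and the exact location of the variable depends on how your repository's cluster.yml is organised:
TF_VAR_ami_id: "ami-0123456789abcdef0"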
Once your AMI is changed, your nodes will not switch to the new AMI automatically. You can force this change by recreating your nodes (replace the autoscaling_group_name and aws_region in the commands below):
./shell
aws autoscaling start-instance-refresh --auto-scaling-group-name={autoscaling_group_name} --region={aws_region}
To retrieve your autoscaling group name, run:
aws autoscaling describe-auto-scaling-groups --region={region} | jq ".AutoScalingGroups[] | .AutoScalingGroupName"
Couldn't get current server API group list: the server has asked for the client to provide credentials¶
Problem: This error can occur if the kubeconfig-private.yml
is out of sync and you try to run ./kubectl
commands.
E0904 12:42:41.259776 22 memcache.go:265] couldn't get current server API group list: the server has asked for the client to provide credentials
Solution: Remove the kubeconfig-private.yml
file and regenerate it using ./tf init
cd my-cluster/control/
rm ../kubeconfig-private.yml
./tf init
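Once ./tf init has finished and the wrapper has written a fresh kubeconfig-private.yml, any read-only command should confirm the credentials work again, for example:
./kubectl get nodes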
Modify files inside LMS or CMS pod and see changes without redeployment¶
Sometimes, to debug issues in live instances, we need to make changes to files inside the pod and see the changes without redeployment. This can be done in any LMS or CMS pod by following these steps:
- Install an editor in the pod to edit files; apt-get does not work due to permission issues. You can use pyvim, a pure-Python Vim clone, which is available on GitHub. Run pip install pyvim to install it.
- Edit any file inside the pod, for example /openedx/edx-platform/lms/envs/common.py, using pyvim.
- Restart the uwsgi process by running kill -HUP 1, then refresh your browser to see the changes live (a combined example follows this list).
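Putting these steps together, a session might look like the following sketch. It assumes the Tutor-style openedx namespace used elsewhere in this guide and an LMS deployment named lms; adjust both to match your instance:
./kubectl exec -it -nopenedx deploy/lms -- bash
pip install pyvim
pyvim /openedx/edx-platform/lms/envs/common.py
kill -HUP 1    # reload uwsgi, then refresh the browser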
Warning
Any changes made to files in a pod will be lost in the next deployment. This method should only be used for debugging issues.
Rolling back helm charts¶
Problem: You might need to do this, for example, when ./tf apply times out while trying to upgrade a Helm chart to a newer version. You might also get messages like Provider produced inconsistent result after apply or another operation (install/upgrade/rollback) is in progress. The cluster might be stuck in a broken state, so you may want to roll the chart back to recover it while you look for a solution.
Solution:
TLDR:
cd control/
./helm list -aA
./helm history -n <namespace> <name>
./helm rollback -n <namespace> <name> <revision number>
Navigate to the control/ directory of your Grove cluster repo. If you don't have a helm symlink there, create it by running ln -s ../grove/wrapper-script.rb helm.
Then get a list of all installed releases with ./helm list -aA. Most of the time you won't be sure which namespace the broken/stuck release is in, so it is important to run the command with the -aA flag to list ALL installed releases (this is the step many people get stuck on).
Look for any releases whose status is not deployed and note their name and namespace. Checking the revision, updated date, and app version might also help with debugging.
Before rolling back, review the revision history to find the last healthy revision (this matters if you are rolling back multiple times, and it may also give you more information on where things went wrong). To do so, run ./helm history -n <namespace> <name>.
Lastly, to roll back, run ./helm rollback -n <namespace> <name> <revision number>.
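After the rollback completes, you can confirm the release is healthy again; the rollback itself appears as a new revision in the history:
./helm list -aA
./helm history -n <namespace> <name>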
Unfortunately, Helm gives almost no information about what went wrong, so you will have to figure it out yourself. Usually the creation, modification, or deletion of some resource failed, so kubectl get/describe is your best friend. Look for resources that appear stuck and try to figure out why. Sometimes (although very rarely) the nodes themselves can be stuck, so you might need to terminate them via the AWS/DO UI.
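For example, a few read-only commands that quickly surface stuck resources (substitute the namespace of the broken release):
./kubectl get pods -A | grep -v Running       # pods not in the Running state (Completed job pods will also show up)
./kubectl get events -n <namespace> --sort-by=.lastTimestamp
./kubectl describe pod -n <namespace> <pod-name>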