Troubleshooting

Tips on how to deal with some common errors.

Control characters are not allowed

Problem: This error can occur if you try to run ./tutor NAME k8s quickstart before Terraform has finished applying your changes and creating your cluster.

error: error loading config file "/root/.kube/config": yaml: control characters are not allowed
Error: Command failed with status 1: kubectl apply --kustomize /workspace/env --wait --selector app.kubernetes.io/component=namespace

Solution: Delete the stale file and let the wrapper script regenerate it on the next run:

cd my-cluster/control/
rm ../kubeconfig-private.yml
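
Then wait for Terraform to finish before retrying; the wrapper regenerates the kubeconfig on the way. A quick way to confirm Terraform is done is to check that a plan reports no pending changes:

./tf plan
./tutor NAME k8s quickstart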

Provisioning the first Open edX instance fails

Problem: This error can occur when you run ./tutor NAME config save --interactive while provisioning the first Open edX instance. The get_kubeconfig_path call errors out with a filesystem permissions error.

Solution: Bypass the calls to get_kubeconfig_path and write the output manually using:

cd my-cluster/control/
./tf output -raw kubeconfig > ../kubeconfig-private.yml
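
To confirm the manually written kubeconfig works, a quick sanity check with the ./kubectl wrapper:

./kubectl get nodes

If the node list comes back, the credentials are valid.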

Namespace "monitoring" not found

Problem: This error can occur if the initial provisioning of resources took too long.

Solution: Run ./tf plan and ./tf apply again, as shown below.
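
For example, from the control directory:

cd my-cluster/control/
./tf plan
./tf apply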

Building an image on macOS fails with permission errors

If you see this error on macOS:

Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock

You can fix it by manually adjusting the permissions of the socket on the host VM:

docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock docker:dind chmod go+w /var/run/docker.sock
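
To verify the fix took effect, you can list the socket's permissions the same way (reusing the docker:dind image from above; the group write bit should now be set):

docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock docker:dind ls -l /var/run/docker.sock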

Cannot mount the Docker socket on macOS

If you get an obscure error related to mount points and the Docker socket, change your Docker settings to use the "osxfs (legacy)" file sharing mode.

Note: on macOS, even if you are using the newer per-user Docker socket location (~/.docker/run/docker.sock), you must use /var/run/docker.sock as the mount source. Docker for Mac detects that exact mount point and re-routes the mount to be served from the Docker Linux VM, not your macOS host.
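
In other words, the host side of the -v flag must always be the legacy path, regardless of which socket your local docker CLI talks to. A hypothetical illustration:

# Works: Docker for Mac intercepts this exact host path
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock docker:dind docker version
# Does not work as intended: the per-user socket path is not intercepted
docker run --rm -v ~/.docker/run/docker.sock:/var/run/docker.sock docker:dind docker version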

Fetching images fails

Errors are of the form:

Error from server (BadRequest): container "lms" in pod "lms-job-20221010183155-jtfpn" is waiting to start: trying and failing to pull image

Or:

Failed to pull image "registry.gitlab.com//grove-stage/openedx:latest": rpc error: code = Unknown desc = Error response from daemon: pull access denied for registry.gitlab.com/grove-stage/edxapp-stage/openedx, repository does not exist or may require 'docker login': denied: requested access to the resource is denied

Check that the CI_REGISTRY_IMAGE variable is configured correctly and rebuild your infrastructure.
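
You can also inspect which image reference the cluster is actually trying to pull and compare it against your registry (a sketch; the label selector assumes the standard tutor labels and may differ in your deployment):

./kubectl describe pod -nopenedx -l app.kubernetes.io/name=lms | grep -i image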

The node was low on resource: ephemeral-storage

The default AMI used in AWS clusters provides 20GB of storage per node. Little data is stored on the nodes beyond what the cluster itself needs; disk space is mainly taken up by Docker images and logs.

When this error occurs, it is likely that Kubernetes has not yet had the opportunity to prune unused Docker images. Kubernetes' garbage collector checks every minute and cleans up these images. If your pod is stuck with this error message, delete the pod and the mounted PersistentVolume. For example, for the redis deployment:

./kubectl delete deployment -nopenedx redis
./kubectl get pvc -nopenedx redis
NAME    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS       AGE
redis   Bound    pvc-96ee4686-7434-4925-9751-5573cf52680d   1Gi        RWO            do-block-storage   67d
./kubectl delete pvc -nopenedx redis
./kubectl delete pv pvc-96ee4686-7434-4925-9751-5573cf52680d
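
Re-running ./tutor NAME k8s quickstart should recreate the deleted deployment and volume. To check whether the node is still under disk pressure afterwards (a sketch; substitute a name from the node list):

./kubectl get nodes
./kubectl describe node [node-name] | grep -i pressure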

If the error persists, you will need to increase the disk size on your node manually by following AWS's guide. Once done, you will also need to grow the filesystem on the node, as sketched below.

To get a root shell on the node, you can make use of kubectl-plugins from within a Grove shell:

./shell
git clone https://github.com/luksa/kubectl-plugins $HOME/kubectl-plugins
export PATH=$PATH:$HOME/kubectl-plugins
kubectl get nodes
kubectl ssh node [node-name]
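
Once on the node, grow the partition and then the filesystem. The exact commands depend on your AMI; the following is only a sketch, assuming an Amazon Linux node with an XFS root filesystem on /dev/xvda1 (check your actual layout with lsblk first):

lsblk
sudo growpart /dev/xvda 1
sudo xfs_growfs -d /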

For a permanent solution, you can change the AMI used for your nodes by updating the TF_VAR_ami_id variable in the repository's cluster.yml. Amazon provides a complete tutorial on accomplishing this.

Once your AMI is changed, your nodes will not switch to the new AMI automatically. You can force the change by recreating your nodes (replace autoscaling_group_name and aws_region in the commands below):

./shell
aws autoscaling start-instance-refresh --auto-scaling-group-name={autoscaling_group_name} --region={aws_region}

To retrieve your autoscaling group name, run:

aws autoscaling describe-auto-scaling-groups --region={aws_region} | jq ".AutoScalingGroups[] | .AutoScalingGroupName"
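
You can monitor the progress of the refresh with:

aws autoscaling describe-instance-refreshes --auto-scaling-group-name={autoscaling_group_name} --region={aws_region}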

Couldn't get current server API group list: the server has asked for the client to provide credentials

Problem: This error can occur if the kubeconfig-private.yml file is out of sync when you run ./kubectl commands.

E0904 12:42:41.259776 22 memcache.go:265] couldn't get current server API group list: the server has asked for the client to provide credentials

Solution: Remove the kubeconfig-private.yml file and regenerate it using ./tf init:

cd my-cluster/control/
rm ../kubeconfig-private.yml
./tf init
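
Afterwards, a quick check that the regenerated credentials work:

./kubectl get pods -nopenedx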