Troubleshooting¶
Tips on how to deal with some common errors.
Control characters are not allowed¶
Problem: This error can occur if you try to run `./tutor NAME k8s quickstart` before terraform has finished applying your changes and creating your cluster.

```
error: error loading config file "/root/.kube/config": yaml: control characters are not allowed
Error: Command failed with status 1: kubectl apply --kustomize /workspace/env --wait --selector app.kubernetes.io/component=namespace
```
Solution: Remove the stale config file and let the wrapper script regenerate it:

```shell
cd my-cluster/control/
rm ../kubeconfig-private.yml
```
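If you are unsure whether the file is actually corrupted, you can check it for the offending bytes first. A minimal sketch (the helper function is illustrative, and the path is the one used above):

```shell
# Sketch: flag raw control bytes (other than tab/newline/CR) that trip
# kubectl's YAML parser.
has_control_chars() {
  LC_ALL=C grep -q "$(printf '[\001-\010\013\014\016-\037]')" "$1"
}

if has_control_chars ../kubeconfig-private.yml; then
  echo "kubeconfig contains control characters - remove and regenerate it"
fi
```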
Provisioning the first Open edX instance fails¶
Problem: This error can occur when you run `./tutor NAME config save --interactive` while provisioning the first Open edX instance. The execution of `get_kubeconfig_path` will fail with an error related to filesystem permissions.
Solution: Bypass the calls to `get_kubeconfig_path` and write the output manually:

```shell
cd my-cluster/control/
./tf output -raw kubeconfig > ../kubeconfig-private.yml
```
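Before pointing `./kubectl` at the manually written file, it can be worth confirming it is non-empty and looks like a kubeconfig. A small sketch (the `check_kubeconfig` helper is illustrative; it assumes the generated file starts with the usual `apiVersion` line):

```shell
# Sketch: verify the manually written kubeconfig is non-empty and begins
# with a YAML apiVersion line, as a generated kubeconfig normally does.
check_kubeconfig() {
  [ -s "$1" ] && head -n 1 "$1" | grep -q 'apiVersion'
}

if check_kubeconfig ../kubeconfig-private.yml; then
  echo "kubeconfig looks sane"
fi
```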
Namespace "monitoring" not found¶
Problem: This error can occur if the initial creation of resources took too long to provision.

Solution: Run `./tf plan` and `./tf apply` again.
Building an image on macOS fails with permission errors¶
If you see this error on macOS:

```
Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock
```

You can fix it by manually adjusting the permissions of the socket on the host VM:

```shell
docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock docker:dind chmod go+w /var/run/docker.sock
```
Cannot mount the Docker socket on macOS¶
If you get an obscure error related to mount points and the Docker socket, change your Docker settings to use the "osxfs (legacy)" file sharing mode.

Note: on macOS, even if you are using the newer per-user Docker socket location (`~/.docker/run/docker.sock`), you must use `/var/run/docker.sock` as the mount source, because Docker for Mac detects that exact mount point and re-routes the mount to be served from the Docker Linux VM, not your macOS host.
Fetching images fails¶
Errors are of the form:

```
Error from server (BadRequest): container "lms" in pod "lms-job-20221010183155-jtfpn" is waiting to start: trying and failing to pull image
```

or:

```
Failed to pull image "registry.gitlab.com//grove-stage/openedx:latest": rpc error: code = Unknown desc = Error response from daemon: pull access denied for registry.gitlab.com/grove-stage/edxapp-stage/openedx, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
```
Check that the `CI_REGISTRY_IMAGE` variable is configured correctly and rebuild your infrastructure.
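A quick way to spot the misconfiguration is to compare the image reference from the error message against the expected registry prefix. A sketch (the helper function and both values below are illustrative, not part of the tooling):

```shell
# Sketch: check whether an image reference starts with the expected
# CI_REGISTRY_IMAGE prefix.
image_matches_registry() {
  case "$1" in
    "$2"/*|"$2":*) return 0 ;;
    *) return 1 ;;
  esac
}

expected="registry.gitlab.com/grove-stage/edxapp-stage"
failing="registry.gitlab.com//grove-stage/openedx:latest"
if ! image_matches_registry "$failing" "$expected"; then
  echo "image ref does not start with CI_REGISTRY_IMAGE - fix the CI variable"
fi
```

Note that the double slash in the failing reference above is itself a clue: an empty path segment like that typically means the variable was empty or malformed when the image name was composed.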
The node was low on resource: ephemeral-storage¶
The default AMI used in AWS clusters provides 20GB of storage per node. Not much data is stored on the nodes besides what is needed on the cluster. Disk space is mainly taken up by Docker images and logs.
When this error occurs, it's likely that Kubernetes hasn't had the opportunity to prune unused Docker images yet. Kubernetes' garbage collector checks every minute and clears up these images. If your pod is stuck with this error message, delete the pod and the mounted PersistentVolume, e.g. for the redis deployment:

```
./kubectl delete deployment -nopenedx redis
./kubectl get pvc -nopenedx redis
NAME    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS       AGE
redis   Bound    pvc-96ee4686-7434-4925-9751-5573cf52680d   1Gi        RWO            do-block-storage   67d
./kubectl delete pvc -nopenedx redis
./kubectl delete pv pvc-96ee4686-7434-4925-9751-5573cf52680d
```
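The PV name in the last command can be extracted from the `get pvc` table instead of copied by hand. A sketch (the sample row is hard-coded here for illustration; in practice you would pipe `./kubectl get pvc -nopenedx redis --no-headers` into the same awk expression):

```shell
# Sketch: pull the bound VOLUME name (third column) out of a `get pvc` row.
sample='redis   Bound    pvc-96ee4686-7434-4925-9751-5573cf52680d   1Gi   RWO   do-block-storage   67d'
volume=$(printf '%s\n' "$sample" | awk '{ print $3 }')
echo "$volume"
```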
If the error persists, you will need to increase the disk size on your node manually by following AWS's guide. Once done, you'll need to increase the size of the filesystem on the node.
To get a root shell on the node, you can make use of kubectl-plugins within a Grove shell:

```shell
./shell
git clone https://github.com/luksa/kubectl-plugins $HOME/kubectl-plugins
export PATH=$PATH:$HOME/kubectl-plugins
kubectl get nodes
kubectl ssh node [node-name]
```
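Once you have a root shell, the filesystem can usually be grown in place. A sketch, assuming the common case of an ext4 root filesystem on partition 1 of an NVMe volume; the device names are assumptions, so verify yours with `lsblk` and `df -T` before running anything:

```shell
# On the node, after the volume has been resized in AWS (device names assumed):
sudo growpart /dev/nvme0n1 1      # grow the partition to fill the resized volume
sudo resize2fs /dev/nvme0n1p1     # grow an ext4 filesystem (use xfs_growfs for XFS)
```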
For a permanent solution, you can change the AMI used for your nodes by updating the `TF_VAR_ami_id` variable in the repository's `cluster.yml`. Amazon provides a complete tutorial on accomplishing this.
Once your AMI is changed, your nodes will not switch to the new AMI automatically. You can force this change by recreating your nodes (replace the `autoscaling_group_name` and `aws_region` in the commands below):

```shell
./shell
aws autoscaling start-instance-refresh --auto-scaling-group-name={autoscaling_group_name} --region={aws_region}
```
To retrieve your autoscaling group name, run:

```shell
aws autoscaling describe-auto-scaling-groups --region={region} | jq ".AutoScalingGroups[] | .AutoScalingGroupName"
```
Couldn't get current server API group list: the server has asked for the client to provide credentials¶
Problem: This error can occur if the `kubeconfig-private.yml` is out of sync and you try to run `./kubectl` commands.

```
E0904 12:42:41.259776   22 memcache.go:265] couldn't get current server API group list: the server has asked for the client to provide credentials
```
Solution: Remove the `kubeconfig-private.yml` file and regenerate it using `./tf init`:

```shell
cd my-cluster/control/
rm ../kubeconfig-private.yml
./tf init
```