Cluster Monitoring¶

By default, Grove ships with Prometheus, Grafana, Alert Manager, and OpenSearch for monitoring.

Currently, none of these services are exposed via the ingress. Each can be reached by forwarding the relevant port. The commands below open the UI for each service; run them from within the control directory.

Prometheus: ./kubectl --namespace harmony port-forward --address 0.0.0.0 svc/prometheus 8001:9090
Alert Manager: ./kubectl --namespace harmony port-forward --address 0.0.0.0 svc/alertmanager 8001:9093
OpenSearch Dashboard: ./kubectl --namespace monitoring port-forward --address 0.0.0.0 deployments/opensearch-dashboard-opensearch-dashboards 8001:5601

After running any of the commands above, you can view the relevant UI in your browser at http://localhost:8001. Because every command binds local port 8001, run only one port-forward at a time.

Components¶

OpenSearch¶

Grove deploys OpenSearch and Fluent-bit instances as part of its monitoring stack to store logs generated by any pod on the cluster. Logs can be viewed by accessing the OpenSearch Dashboard, detailed below.

By default, the logs are persisted to a PVC of 8Gi. You can change this value by setting TF_VAR_opensearch_persistence_size to a size of your choosing. Note that this value cannot be changed for existing PVCs. Your options are:

Data loss: Delete your existing storage and create new PVs.
Backup and restore: Back up your data, delete the old PV, create the new one, then restore your data.
Resize the PV: Using your provider's UI, resize the PV. Then apply the new size through Terraform so that your Terraform code matches the actual volume.
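
As a minimal sketch, the variable might be set alongside your other TF_VAR_ values. The 20Gi figure is purely illustrative, and the assumption here is that the value takes a Kubernetes quantity string in the same form as the 8Gi default:

```yaml
# Illustrative only: requests a 20Gi volume instead of the 8Gi default.
# Assumes the value is a Kubernetes quantity string (e.g. 8Gi, 20Gi).
TF_VAR_opensearch_persistence_size: 20Gi
```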

OpenSearch Dashboard¶

Accessing the dashboard is possible by running the following command within the control directory.

./kubectl --namespace monitoring port-forward --address 0.0.0.0 deployments/opensearch-dashboard-opensearch-dashboards 8001:5601

The username is admin and the password can be retrieved with:

./tf output -raw opensearch_dashboard_admin_password

On the first run, you will need to create an Index Pattern for fluent-bit-logs. Once that is done, you will be able to view the logs on the Discover page.

Grafana¶

By default, Grafana is accessible at grafana.<cluster_domain>.

The username is admin and the password can be retrieved with:

./tf output -raw grafana_admin_password

The Kubernetes Resource Workload dashboard is loaded by default, and more dashboards are available via the sidebar's Browse item.

Alert Manager¶

Alert Manager is not configured to send any notifications by default. To change the configuration, set the TF_VAR_alertmanager_config variable in GitLab, or in your private.yml if working locally.

The provided value must be valid YAML in the format Alert Manager expects.

Shown below is an example of configuring email alerts:

TF_VAR_alertmanager_config: |
  receivers:
  - name: "null"
  - name: email
    email_configs:
    - to: 'receiver_mail_id@example.com'
      from: 'mail_id@example.com'
      smarthost: smtp.example.com:587
      auth_username: 'mail_id@example.com'
      auth_identity: 'mail_id@example.com'
      auth_password: 'password'

Default null route

Note that the "null" receiver is required. Because of the way values are merged in Helm, this receiver must exist; otherwise Alert Manager fails to load its configuration with an undefined receiver error. Example:

level=error ts=2020-10-23T12:08:02.428Z caller=coordinator.go:124 component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/config/alertmanager.yaml err="undefined receiver \"null\" used in route"

Visit this GitHub issue for more details.
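
The email example above only defines receivers. For alerts to reach the email receiver, Alert Manager also needs a route that references it. The following is a sketch, assuming that a route key supplied in TF_VAR_alertmanager_config is merged into the final configuration in the same way as receivers (this is not confirmed here); the grouping and timing values are illustrative:

```yaml
TF_VAR_alertmanager_config: |
  route:
    # Send alerts to the email receiver unless a more specific route matches.
    receiver: email
    group_by: ['alertname']
    group_wait: 30s
    repeat_interval: 4h
  receivers:
  # Keep the "null" receiver so the merged configuration still resolves it.
  - name: "null"
  - name: email
    email_configs:
    - to: 'receiver_mail_id@example.com'
      from: 'mail_id@example.com'
      smarthost: smtp.example.com:587
      auth_username: 'mail_id@example.com'
      auth_identity: 'mail_id@example.com'
      auth_password: 'password'
```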

Ingress¶

Ingress for the monitoring services is disabled by default. To enable it, set the Terraform variable TF_VAR_enable_monitoring_ingress to true in your CI/CD vars/cluster.yml and update your DNS to point to the cluster.

Let's Encrypt Email¶

Set the TF_VAR_lets_encrypt_notification_inbox variable to a valid email address to receive Let's Encrypt renewal notifications. Note that certificate generation will not work if this address isn't valid.

DNS¶

You will need a valid base domain to set up the monitoring services.

Assuming your base domain is monitoring.grove.dev (covered by a wildcard record for *.monitoring.grove.dev), ingresses will be created for:

prometheus.monitoring.grove.dev
grafana.monitoring.grove.dev
alert-manager.monitoring.grove.dev
opensearch-dashboards.monitoring.grove.dev

Access to the above is handled via the NGINX ingress controller. To set this up:

Set the variable TF_VAR_cluster_domain to your desired domain.
Obtain your controller's External IP with the command ./kubectl get services -nkube-system ingress-nginx-controller.
Create an A record for *.your-monitoring-domain.com pointing to the controller's External IP.

After applying the changes your services will be available as described above.
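
Taken together, the ingress-related variables described in this section might look like the sketch below. The email address is a placeholder, the domain reuses the example above, and both the file these lines live in and the boolean form of the first value are assumptions to adapt to your setup:

```yaml
# Enable ingress for the monitoring services.
TF_VAR_enable_monitoring_ingress: true
# Must be a valid address, otherwise certificate generation will not work.
TF_VAR_lets_encrypt_notification_inbox: ops@example.com
# Base domain; services are served from subdomains such as grafana.<domain>.
TF_VAR_cluster_domain: monitoring.grove.dev
```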

If certificates aren't generated, please check the Cert Manager documentation for troubleshooting steps.

Authentication¶

All services are protected with Basic Authentication to prevent unauthenticated access to your data. The credentials are the same for all services: the username is admin, and the password can be retrieved with ./tf output -raw monitoring_ingress_password.

Prometheus Alerts¶

Grove also ships with a default set of critical alerts for a cluster (see the critical-alerts.yaml file for the alert definitions). Additional alerts can be added through the Terraform variable TF_VAR_additional_prometheus_alerts, which extends the base set of critical alerts with custom alert rules or groups. The value must be a string of valid YAML, structured as a list of Prometheus alert groups that can be combined directly with the existing groups in critical-alerts.yaml. Each additional alert group must start with a hyphen (-) followed by the group attributes and a list of rules, conforming to the PrometheusRule spec.

Example:

  - name: "additional-alert-group"
    rules:
    - alert: "HighRequestLatency"
      expr: "histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[5m])) > 0.5"
      for: "10m"
      labels:
        severity: "warning"
      annotations:
        summary: "High HTTP request latency"
        description: "The 90th percentile request latency is above 0.5 seconds"

    - alert: "HighErrorRate"
      expr: "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) > 0.1"
      for: "10m"
      labels:
        severity: "warning"
      annotations:
        summary: "High HTTP error rate"
        description: "More than 10% of HTTP requests are failing"

The string must not contain document start markers (---) and must be indented correctly so that it is parsed as a continuation of the existing 'groups' list in the PrometheusRule resource. Ensure there is no extra whitespace (e.g., spaces, tabs) at the beginning or end of the YAML string.
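
To illustrate why the string must begin with a list item and omit the document start marker, here is a sketch of what the combined PrometheusRule resource could look like once an additional group is appended. The metadata name and the first group are hypothetical placeholders, not the contents of Grove's critical-alerts.yaml:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: critical-alerts            # hypothetical resource name
spec:
  groups:
  # Placeholder standing in for the groups defined in critical-alerts.yaml.
  - name: "existing-critical-group"
    rules:
    - alert: "ExampleTargetDown"
      expr: "up == 0"
      for: "5m"
      labels:
        severity: "critical"
  # The content of TF_VAR_additional_prometheus_alerts lands here, as another
  # item of the same groups list.
  - name: "additional-alert-group"
    rules:
    - alert: "HighRequestLatency"
      expr: "histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[5m])) > 0.5"
      for: "10m"
      labels:
        severity: "warning"
```

Seen this way, a stray --- or a misaligned hyphen in the variable would break the surrounding list rather than just the added group.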