EKS with Helm and the OpenTelemetry Collector

This section contains guidelines for troubleshooting and handling errors that you may encounter when collecting Amazon Elastic Kubernetes Service (EKS) metrics.

Problem: Permanent error - context deadline exceeded

The following error appears:

Permanent error: Post \"https://<<LISTENER-HOST>>:8053\": context deadline exceeded
This means that the POST request timed out.

Possible cause - Connectivity issue

A connectivity issue may be causing this error.

Suggested remedy

Check your shipper's connectivity as follows.

For macOS and Linux, use telnet to make sure your log shipper can connect to Logz.io listeners.

note

As of macOS High Sierra (10.13), telnet is not installed by default. You can install it with Homebrew by running brew install telnet.

Run this command from the environment you're shipping from, after adding the appropriate port number:

telnet listener.logz.io {port-number}

For Windows servers running Windows 8/Server 2012 and later, run the following command in PowerShell:

Test-NetConnection listener.logz.io -Port {port-number}

The port numbers are 8052 and 8053.
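A successful connection typically prints Connected to listener.logz.io in telnet, or TcpTestSucceeded : True in Test-NetConnection. If the command hangs or fails, check that your firewall and network rules allow outbound traffic to the listener on these ports.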

Possible cause - Service exposing the metrics needs more time

A service exposing the metrics may need more time to send the response to the OpenTelemetry collector.

Suggested remedy

Increase the OpenTelemetry collector timeout as follows.

In values.yaml, set the scrape timeout under config > receivers > prometheus > config > global:

config:
  receivers:
    prometheus:
      config:
        global:
          scrape_timeout: <<timeout time>>
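After editing values.yaml, re-apply the chart so the new timeout takes effect. A minimal sketch, assuming your release is named logzio-k8s-telemetry and was installed from the logzio-helm repository (adjust the release, chart, and namespace names to your setup):

helm upgrade logzio-k8s-telemetry logzio-helm/logzio-k8s-telemetry -f values.yaml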

Problem: Incorrect listener and/or token

You may be using an incorrect listener and/or token.

You will need to look in the logs of a pod whose name contains otel-collector.
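For example, assuming the chart is deployed in the monitoring namespace (adjust to your setup):

kubectl get pods -n monitoring | grep otel-collector
kubectl logs <otel-collector-pod-name> -n monitoring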

Possible cause - The token is not valid

In the logs, the error for an invalid token will be: "error": "Permanent error: remote write returned HTTP status 401 Unauthorized; err = <nil>: Shipping token is not valid".

Possible cause - The listener is not valid

For an invalid listener URL, the error will be: "error": "Permanent error: Post \"https://liener.logz.io:8053\": dial tcp: lookup <<provided listener>> on <<ip>>: no such host" (here, liener.logz.io stands in for the mistyped listener address).

Suggested remedy

Check that the listener and token of your account are correct. You can view them in the Manage tokens section.
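If you originally passed these values with --set flags, correct them and upgrade the release. A minimal sketch; the parameter names secrets.MetricsToken and secrets.ListenerHost are assumptions based on common logzio-helm conventions and may differ in your chart version, so confirm them against your chart's values.yaml:

helm upgrade logzio-k8s-telemetry logzio-helm/logzio-k8s-telemetry \
  --set secrets.MetricsToken=<<METRICS-SHIPPING-TOKEN>> \
  --set secrets.ListenerHost=https://listener.logz.io:8053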

Problem: Windows nodes error

Possible cause - Incorrect username and/or password for Windows nodes

You may be using an incorrect username and/or password for Windows nodes.

You will need to look in the logs of the windows-exporter-installer pod. The error will look like this:

INFO:paramiko.transport:Authentication (password) failed.
ERROR:root:SSH connection to node aksnpwin000002 failed, please check username and password.

Suggested remedy

Ensure the username and password to Windows nodes are correct.

Problem: Invalid helm chart version

Possible cause - The version of the helm chart is not up to date

The helm chart version that you are using may be out of date.

Suggested remedy

Update the helm chart by running:

helm repo update
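Note that helm repo update only refreshes your local chart index. To actually deploy the newer chart version, upgrade the release afterwards; a minimal sketch, assuming the logzio-k8s-telemetry release and logzio-helm repository names (adjust to your setup):

helm upgrade logzio-k8s-telemetry logzio-helm/logzio-k8s-telemetry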

Problem: The prometheusremotewrite exporter timeout

You don't see any metrics in the Logz.io app, or you only see some of your metrics, but the otel-collector pod logs show no errors. This might indicate this issue.

Possible cause - The timeout in the prometheusremotewrite exporter is too short

The timeout setting in the prometheusremotewrite exporter is too short.

Suggested remedy

Increase the timeout setting in the prometheusremotewrite exporter.

For example, if your timeout setting is 5s:

endpoint: ${LISTENER_URL}
timeout: 5s
external_labels:
  p8s_logzio_name: ${P8S_LOGZIO_NAME}
headers:
  Authorization: "Bearer ${METRICS_TOKEN}"

You can increase it to 20s:

endpoint: ${LISTENER_URL}
timeout: 20s
external_labels:
  p8s_logzio_name: ${P8S_LOGZIO_NAME}
headers:
  Authorization: "Bearer ${METRICS_TOKEN}"
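In the Helm chart, this exporter setting typically lives in values.yaml under the same config key used for the receiver timeout above; a sketch of the relevant fragment (assuming your chart exposes the collector configuration under config):

config:
  exporters:
    prometheusremotewrite:
      timeout: 20s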

Problem: Permanent error - log state shows as waiting

The log shows the following:

State: Waiting
  Reason: CrashLoopBackOff
Last State: Terminated
  Reason: OOMKilled
  Exit Code: 137
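This status appears in the output of kubectl describe for the failing pod, for example (the monitoring namespace is an assumption; use the namespace where the collector is deployed):

kubectl describe pod <otel-collector-pod-name> -n monitoring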

Possible cause

Insufficient memory allocated to the pod.

Suggested remedy

In values.yaml, increase the memory of the standaloneCollector resources by approximately 100Mi.

For example, if you are using 512Mi:

standaloneCollector:
  enabled: true

  containerLogs:
    enabled: false

  resources:
    limits:
      cpu: 256m
      memory: 512Mi

You can increase it as much as needed. In this example, it's 612Mi:

standaloneCollector:
  enabled: true

  containerLogs:
    enabled: false

  resources:
    limits:
      cpu: 256m
      memory: 612Mi

When running apps on Kubernetes

You need to make sure that the prometheus.io/scrape annotation is set to true:

prometheus.io/scrape: "true"
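For example, in a Deployment's pod template (the prometheus.io/port annotation and the 8080 value are illustrative; point them at wherever your app exposes metrics):

spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"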

Problem: You have reached your pull rate limit

In some cases (e.g., spot clusters) where pods or nodes are replaced frequently, you might reach the pull rate limit for images pulled from Docker Hub, with the following error:

You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: 
https://www.docker.com/increase-rate-limits

Suggested remedy

You can add the following --set flags to pull from alternative image repositories:

For the monitoring chart and the Telemetry Collector Kubernetes installation:

--set logzio-k8s-telemetry.image.repository=ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib \
--set logzio-k8s-telemetry.prometheus-pushgateway.image.repository=public.ecr.aws/logzio/prom-pushgateway

For the telemetry chart:

--set image.repository=ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib \
--set prometheus-pushgateway.image.repository=public.ecr.aws/logzio/prom-pushgateway

Problem: Pod failing in an endless crash loop

When a Kubernetes pod fails repeatedly, Kubernetes backs off restarting it, up to a maximum delay of 5 minutes, and reports a status of CrashLoopBackOff or Out Of Memory (OOMKilled).

Possible cause

The container in the pod has exceeded its memory limit, and the system cannot allocate additional memory.

Suggested remedy

Increase the pod's memory limit to 400Mi and the CPU limit to 400m.
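For example, in values.yaml, following the same structure as the standaloneCollector example above (adjust if your chart version nests these values differently):

standaloneCollector:
  enabled: true

  resources:
    limits:
      cpu: 400m
      memory: 400Mi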