January 13, 2020
Kubernetes has become the de facto standard for container orchestration. It is widely deployed in organizations of all sizes, especially those moving on-premises workloads to the cloud and to a microservices-based architecture. While raw Kubernetes is not easy to deploy and manage, cloud service providers such as AWS, Azure, and IBM Bluemix offer managed services that significantly ease the adoption of this technology. Specifically, AWS offers Elastic Kubernetes Service (EKS), which is nicely integrated with a variety of other AWS services, including compute, networking, and security.
Given this complexity, Kubernetes errors can be hard to diagnose and troubleshoot, even with a managed service. This document describes some common errors encountered when deploying AWS EKS and techniques for troubleshooting them. It does not cover errors associated with deploying containerized applications into Kubernetes; it focuses only on errors related to the infrastructure.
EKS consists of two subsystems: a control plane that is fully managed by AWS, and worker nodes that are provisioned by the customer as needed. The control plane runs Kubernetes components such as etcd (the backing store for cluster data) and the API server (which allows worker nodes and command-line tools to communicate with the control plane). Worker nodes are EC2 instances provisioned through an Auto-Scaling Group, which lets the customer decide how much capacity and elasticity is required.
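For illustration, a cluster with this layout can be created with eksctl; the cluster name, Kubernetes version, and node group sizing below are placeholder values, not settings from this article:
eksctl create cluster \
  --name demo-cluster \
  --version 1.14 \
  --nodegroup-name workers \
  --node-type t3.medium \
  --nodes 2 --nodes-min 2 --nodes-max 4
Behind the scenes, eksctl provisions the node group as an Auto-Scaling Group, matching the architecture described above.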
This section describes a few common error situations that you may encounter with EKS. Each of these errors has an underlying cause which can be recognized by the symptoms and error messages found in a variety of log files.
Worker nodes (i.e., EC2 instances) register themselves with the control plane on startup. To do this, each instance requires a number of additional packages and configurations. AWS provides AMIs for EKS that include these prerequisites (and also publishes the source code required for building custom AMIs). EC2 instances that are missing the required packages, or that include an incompatible version of them, will end up with a node status of NotReady. Comparing the version of the AMI with the version of your EKS cluster will show whether they are compatible. In the example below, the version of the AMI is incompatible with the version of the cluster.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-0-62.ec2.internal NotReady <none> 11m v1.14.7-eks-1861c5
ip-10-0-0-95.ec2.internal NotReady <none> 10m v1.14.7-eks-1861c5
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-04T04:48:03Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.10-eks-7f15cc", GitCommit:"7f15ccb4e58f112866f7ddcfebf563f199558488", GitTreeState:"clean", BuildDate:"2019-08-19T17:46:02Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
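A quick way to make this comparison is to query the cluster version from the EKS API and the kubelet version reported by each node; the cluster name below is a placeholder:
aws eks describe-cluster --name demo-cluster --query "cluster.version" --output text
kubectl get nodes -o wide
The kubelet version shown in the VERSION column should not be newer than the cluster version returned by the first command; in the example above, the nodes run 1.14 against a 1.11 control plane.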
Security groups must be correctly configured in order for worker nodes and the control plane to communicate with each other. When these security groups are misconfigured, Kubernetes will not be able to register worker nodes. In particular, the control plane security group must allow inbound HTTPS (port 443) from the worker node security group, and the worker node security group must allow inbound traffic from the control plane security group (port 443 and ports 1025-65535) as well as all traffic from other worker nodes.
This error manifests itself as a network failure and can be seen in several ways. In the following example (from kubectl get pods -n kube-system), pods in the kube-system namespace show CrashLoopBackOff or Pending status:
NAME READY STATUS RESTARTS AGE
aws-node-bbwpq 0/1 CrashLoopBackOff 12 51m
aws-node-nw7v8 0/1 CrashLoopBackOff 12 51m
coredns-7bcbfc4774-g8sz7 0/1 Pending 0 54m
coredns-7bcbfc4774-qnrcw 0/1 Pending 0 54m
kube-proxy-dnhr6 1/1 Running 0 51m
kube-proxy-j5gps 1/1 Running 0 51m
This error can also be diagnosed by examining the log files on the worker node (under /var/log), assuming you have SSH access to the node. To see the error message, view the log file corresponding to the aws-node container, as shown in the example below. The error message in this example is “Failed to communicate with K8S Server”.
{"log":"====== Installing AWS-CNI ======\n","stream":"stdout","time":"2019-10-03T18:42:42.807515657Z"}
{"log":"====== Starting amazon-k8s-agent ======\n","stream":"stdout","time":"2019-10-03T18:42:42.821888604Z"}
{"log":"ERROR: logging before flag.Parse: E1003 18:43:12.854890 9 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout)\n","stream":"stderr","time":"2019-10-03T18:43:12.855417379Z"}
{"log":"Failed to communicate with K8S Server. Please check instance security groups or http proxy setting","stream":"stdout","time":"2019-10-03T18:43:42.907376066Z"}
You can confirm this error by trying to connect to the Kubernetes API server from the EC2 instance using curl, as shown below (replace the IP address with the cluster IP of your EKS cluster). The request will time out if the security group configuration is blocking network connectivity.
curl -vk https://172.20.0.1:443/api
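If the curl request times out, review the inbound rules on the cluster (control plane) security group. As a sketch, assuming the security group IDs are placeholders you replace with your own, the following rule allows worker nodes to reach the API server on port 443:
aws ec2 authorize-security-group-ingress \
  --group-id <control-plane-sg-id> \
  --protocol tcp \
  --port 443 \
  --source-group <worker-node-sg-id>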
As mentioned previously, worker nodes require a number of additional Linux packages to be installed at startup in order to communicate with the control plane. This is accomplished via standard Linux package managers and package repositories. An Internet Gateway (or NAT Gateway) must be attached to the VPC so that EC2 instances can reach those package repositories. If this gateway is missing or incorrectly configured, worker nodes will not bootstrap correctly, and the cluster will not recognize them. This error can be seen in the system log of the EC2 instance, as shown below:
[ 44.600544] cloud-init[3835]: and yum doesn't have enough cached data to continue. At this point the only
[ 44.612407] cloud-init[3835]: safe thing yum can do is fail. There are a few ways to work "fix" this:
[ 44.622313] cloud-init[3835]: 1. Contact the upstream for the repository and get them to fix the problem.
[ 44.628873] cloud-init[3835]: 2. Reconfigure the baseurl/etc. for the repository, to point to a working
This error can also be caused by incorrectly configured outbound rules in the security group associated with worker node EC2 instances. If the security group does not allow outbound access, the instance will not be able to communicate with the package repository to install required packages.
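To check these prerequisites from the AWS CLI, you can confirm that an internet gateway is attached to the VPC and inspect the outbound rules of the worker node security group; the VPC and security group IDs below are placeholders:
aws ec2 describe-internet-gateways --filters Name=attachment.vpc-id,Values=<vpc-id>
aws ec2 describe-security-groups --group-ids <worker-node-sg-id> --query "SecurityGroups[0].IpPermissionsEgress"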
Worker nodes require a few AWS IAM permissions in order to access required resources during startup. One such permission is ecr:GetAuthorizationToken. If the instance profile attached to the worker node EC2 instances does not have this permission, the nodes will not be able to download the Docker container images required to run Kubernetes. In such situations, an error message similar to the listing below may be seen in /var/log/messages on the EC2 instances.
Oct 4 17:48:57 ip-10-0-0-9 kubelet: status code: 400, request id: 81a6c1bc-977d-48c7-9032-fdea26b8e7bd
Oct 4 17:48:57 ip-10-0-0-9 dockerd: time="2019-10-04T17:48:57.474787082Z" level=info msg="Attempting next endpoint for pull after error: Get https://602401143452.dkr.ecr.us-east-1.amazonaws.com/v2/eks/pause-amd64/manifests/3.1: no basic auth credentials"
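The ecr:GetAuthorizationToken permission is included in the AWS-managed AmazonEC2ContainerRegistryReadOnly policy, so one way to address this error is to attach that policy to the instance role used by the worker nodes; the role name below is a placeholder:
aws iam attach-role-policy \
  --role-name <worker-node-instance-role> \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly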
Availability of network resources must be taken into account when determining how many EC2 instances to provision as worker nodes and how large each instance should be. Kubernetes uses CNI (Container Networking Interface) to allocate network resources. The Amazon VPC CNI plugin for Kubernetes assigns a VPC IP address to each pod. As a result, the number of pods that can be deployed in the cluster is limited by the number of IP addresses available on the selected EC2 instance type. For example, the t3.medium instance type supports 3 network interfaces with up to 6 IP addresses each. When the number of pods exceeds the number of available IP addresses, the excess pods will remain in Pending or ContainerCreating status.
This error can be diagnosed in two ways (with a “failed to assign an IP address to container” message seen in both cases):
a. Examining pod events using kubectl CLI (kubectl describe pod … )
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 2m21s (x5 over 2m38s) default-scheduler 0/1 nodes are available: 1 Insufficient pods.
Normal Scheduled 2m19s default-scheduler Successfully assigned istio-system/servicegraph-849c995588-n2tjg to ip-10-0-0-39.ec2.internal
Warning FailedCreatePodSandBox 2m17s kubelet, ip-10-0-0-39.ec2.internal Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "1814151dac7d051c85eb667e1c1f6abd595fda434a7bcee975558b9c46ade728" network for pod "servicegraph-849c995588-n2tjg": NetworkPlugin cni failed to set up pod "servicegraph-849c995588-n2tjg_istio-system" network: add cmd: failed to assign an IP address to container
b. Examining /var/log/messages on EC2 instance
Oct 7 16:41:26 ip-10-0-0-61 kubelet: E1007 16:41:26.903309 4468 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to set up sandbox container "5a12123016eec119fb3c1fd6baa233344e650992cb11b23ebabc72890b4e39ca" network for pod "istio-policy-7d667689b7-xkjz6": NetworkPlugin cni failed to set up pod "istio-policy-7d667689b7-xkjz6_istio-system" network: add cmd: failed to assign an IP address to container
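The pod capacity of a node can be estimated from its instance type using the formula AWS applies in its EKS-optimized AMIs, number of ENIs × (IPv4 addresses per ENI − 1) + 2, and cross-checked against what the node actually reports; the node name below is taken from the example output above:
# t3.medium: 3 x (6 - 1) + 2 = 17 pods per node
kubectl get node ip-10-0-0-39.ec2.internal -o jsonpath='{.status.allocatable.pods}'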
Here are some general guidelines for troubleshooting and root cause analysis. These guidelines assume you have access to the kubectl CLI and, in some cases, also have access to the EC2 instances via SSH.
Authored By
Sonny Werghis
Principal Architecture Consultant, Levvel
Sonny Werghis is a Principal Architecture Consultant at Levvel where he advises clients on Payment technology. Previously, Sonny worked at IBM as a Product Manager and a Solution Architect focused on Cloud and Cognitive technology where he developed AI and Machine Learning-based business solutions for customers in various industries, including Finance, Government, Healthcare, and Transportation. Sonny is an Open Group Master Certified IT Architect and a certified Enterprise Architect.