Troubleshooting

This topic helps you troubleshoot issues with the installation of your Upwind components.

Kernel version 4.14

The standard Upwind Sensor relies on BPF CO-RE (Compile Once, Run Everywhere), which is not available on version 4.14 of the Linux kernel. To install the Upwind Sensor on instances running kernel version 4.14, use the "bcc" flavor of the sensor image.
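To find out whether a host falls into this case, a minimal sketch like the following can be run on the instance. The function name needs_bcc_flavor is hypothetical, and the check covers only the 4.14 series named above; it says nothing about other kernel versions.

```shell
# Return success (0) when the given kernel release string belongs to the
# 4.14 series, the only version this page says requires the "bcc" flavor.
needs_bcc_flavor() {
  local release="$1"
  case "$release" in
    4.14|4.14.*|4.14-*) return 0 ;;
    *) return 1 ;;
  esac
}

# Check the kernel of the current host.
if needs_bcc_flavor "$(uname -r)"; then
  echo "Kernel $(uname -r): use the bcc flavor of the sensor image"
else
  echo "Kernel $(uname -r): not a 4.14 kernel"
fi
```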

Terraform

Set the bcc boolean value to true for the image_sensor variable.

image_sensor = {
  bcc = true
}

CloudFormation

Update the SensorContainerImage parameter and add a -bcc suffix to the tag. For example:

registry.upwind.io/images/agent:0.116.1 -> registry.upwind.io/images/agent:0.116.1-bcc
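If the image reference is built in a script, the suffix can be added with a small helper such as the sketch below. The name bcc_image is hypothetical, and the helper assumes the reference already ends in a tag (it does not handle references without a tag or with a digest).

```shell
# Append "-bcc" to an image reference unless it already ends with it.
# Assumes the reference ends in a tag, as in the example above.
bcc_image() {
  local image="$1"
  case "$image" in
    *-bcc) printf '%s\n' "$image" ;;
    *)     printf '%s\n' "${image}-bcc" ;;
  esac
}

bcc_image "registry.upwind.io/images/agent:0.116.1"
# prints registry.upwind.io/images/agent:0.116.1-bcc
```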

Cluster Manager Egress Issues

The Cluster Manager must be deployed in a subnet that has egress capabilities. Ensure that the Cluster Manager is deployed into a private subnet whose outbound traffic is routed through a NAT gateway.
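One way to check a candidate subnet is to look at the default route of its route table. The sketch below assumes the subnet has an explicit route table association; a subnet that falls back to the VPC's main route table returns nothing here and needs to be checked separately. The function name subnet_nat_gateways is hypothetical.

```shell
# Print the NAT gateway ID(s) targeted by the default route (0.0.0.0/0) of
# the route table explicitly associated with the given subnet.
# Empty output means no such route was found for that association.
subnet_nat_gateways() {
  local subnet_id="$1"
  aws ec2 describe-route-tables \
    --filters "Name=association.subnet-id,Values=${subnet_id}" \
    --query "RouteTables[].Routes[?DestinationCidrBlock=='0.0.0.0/0'][].NatGatewayId" \
    --output text
}
```

Calling subnet_nat_gateways with the subnet ID should print an ID starting with nat- when the subnet routes through a NAT gateway.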

If the Cluster Manager was created with an incorrect network configuration, update the service with the correct subnets via Terraform or CloudFormation, or fix it manually with the AWS CLI:

aws ecs update-service \
  --service upwind-cluster-manager \
  --cluster <YOUR_ECR_CLUSTER_NAME> \
  --network-configuration file://updated-network-config.json

The contents of updated-network-config.json look like this:

{
  "awsvpcConfiguration": {
    "subnets": [
      "<specify new subnet ID>"
    ],
    "securityGroups": [],
    "assignPublicIp": "DISABLED"
  }
}

The same update can also be made using the shorthand syntax:

aws ecs update-service \
  --service upwind-cluster-manager \
  --cluster <YOUR_ECR_CLUSTER_NAME> \
  --network-configuration "awsvpcConfiguration={subnets=[<specify new subnet ID>]}"

After updating the network configuration, the task needs to be redeployed:

aws ecs update-service \
  --service upwind-cluster-manager \
  --cluster <YOUR_ECR_CLUSTER_NAME> \
  --force-new-deployment
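To confirm that the redeployment finished and picked up the new subnets, something along these lines may help. This is a sketch: the function name wait_for_cluster_manager is hypothetical, and the cluster name is passed as an argument instead of being written inline.

```shell
# Block until the Cluster Manager service reaches a steady state, then print
# the subnets the service is currently configured with.
wait_for_cluster_manager() {
  local cluster="$1"
  # Polls ECS until the service is stable; exits non-zero if it gives up.
  aws ecs wait services-stable \
    --cluster "$cluster" \
    --services upwind-cluster-manager || return 1
  aws ecs describe-services \
    --cluster "$cluster" \
    --services upwind-cluster-manager \
    --query 'services[0].networkConfiguration.awsvpcConfiguration.subnets' \
    --output text
}
```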

Cluster Manager Startup Logs

The Cluster Manager is configured to use awsvpc network mode and the network can take some time to be set up properly, even while the Cluster Manager process is starting up. Some errors may be logged during startup until the Docker network for the task is fully set up.

Cluster Manager or Sensors Fail To Start due to Missing Secret Immediately After Install

The Cluster Manager and sensor tasks need a secret that is created by a Lambda function, which is itself created as part of the CloudFormation stack or Terraform module. Immediately after installation, the ECS Service may be created before the Lambda function has run for the first time, in which case the Cluster Manager task fails to start. The ECS Service recreates the task a short time later, by which point the secret should exist.

If the Cluster Manager or sensor tasks persistently fail to start, check the status of the UpwindRegistryCredentials-<YOUR_ECR_CLUSTER_NAME> Lambda function and inspect its logs. Double-check that the credentials you provided to the Terraform module or CloudFormation stack are correct.
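The Lambda checks can be done from the AWS CLI, roughly as sketched below. Two assumptions are made here: the function logs to the default /aws/lambda/<function name> log group, and AWS CLI v2 is installed (aws logs tail is not available in v1). The function name registry_credentials_lambda_status is hypothetical.

```shell
# Show the state of the registry credentials Lambda function and its recent logs.
registry_credentials_lambda_status() {
  local cluster="$1"
  local fn="UpwindRegistryCredentials-${cluster}"
  # State and LastUpdateStatus indicate whether the function is usable.
  aws lambda get-function-configuration \
    --function-name "$fn" \
    --query '[State, LastUpdateStatus]' \
    --output text
  # Assumes the default log group name for Lambda functions.
  aws logs tail "/aws/lambda/${fn}" --since 1h
}
```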

Sensor to Cluster Manager Communication

The sensors that run on the host network must have network connectivity to the Cluster Manager that runs with awsvpc network mode. If the sensors are not able to communicate with the Cluster Manager, you may see errors like the following:

Error while dialing: dial tcp 172.29.218.199:8444: i/o timeout

By default, the Cluster Manager is deployed without any security group associated with it, so AWS assigns it the default security group of the VPC. If that default security group does not allow inbound traffic on ports 8082 and 8444, the sensors cannot communicate with the Cluster Manager. The security group for the Cluster Manager can be set with the ClusterManagerSecurityGroup parameter of the CloudFormation stack or the security_groups_cluster_manager variable of the Terraform module.

If you specify a security group, it must allow ingress from the hosts in the cluster on ports 8082 and 8444.
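To verify connectivity from a container host, a quick TCP probe of both ports can be run against the Cluster Manager's IP address. This is a sketch that assumes bash (for /dev/tcp) and the coreutils timeout command are available on the host; the function names are hypothetical.

```shell
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
check_port() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Probe both Cluster Manager ports named on this page.
check_cluster_manager_ports() {
  local host="$1" port status=0
  for port in 8082 8444; do
    if check_port "$host" "$port"; then
      echo "port ${port}: reachable"
    else
      echo "port ${port}: NOT reachable"
      status=1
    fi
  done
  return "$status"
}
```

For example, check_cluster_manager_ports 172.29.218.199 probes the address from the error message above. A port reported as not reachable points to a security group or routing problem between the host and the Cluster Manager.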

Destroying and Recreating Fails on Conflicting Secret Manager Secrets

When the sensor installation is removed, any secret it created is scheduled for deletion instead of being deleted right away. To reinstall the sensor before the scheduled deletion takes place, the secret must be forcefully removed.

Delete the sensor credentials secret immediately:

aws secretsmanager delete-secret \
  --secret-id upwind-sensor-credentials-<YOUR_ECR_CLUSTER_NAME> \
  --force-delete-without-recovery

Delete the registry credentials secret immediately:

aws secretsmanager delete-secret \
  --secret-id upwind-registry-credentials-<YOUR_ECR_CLUSTER_NAME> \
  --force-delete-without-recovery
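Before reinstalling, it can be useful to check whether either secret still exists or is still scheduled for deletion. The sketch below prints the DeletedDate field for both secret names used above; the function name secret_deletion_dates is hypothetical. A value of None means the secret exists and is not scheduled for deletion, and an error from the CLI typically means the secret is already gone.

```shell
# Print the scheduled-deletion date (if any) of both Upwind secrets.
secret_deletion_dates() {
  local cluster="$1" name
  for name in upwind-sensor-credentials upwind-registry-credentials; do
    printf '%s-%s: ' "$name" "$cluster"
    aws secretsmanager describe-secret \
      --secret-id "${name}-${cluster}" \
      --query 'DeletedDate' \
      --output text || echo "not found (or not accessible)"
  done
}
```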
