Troubleshooting

Cluster Manager Egress Issues

The cluster manager needs to be deployed in a subnet that has egress capabilities. Ensure that the cluster manager gets deployed into a private subnet that is connected to a NAT gateway.
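
One way to verify that the chosen subnet has egress through a NAT gateway is to inspect its route table with the AWS CLI. This is a sketch; <YOUR_SUBNET_ID> is a placeholder, and if the subnet relies on the VPC's main route table (no explicit association) you will need to look that table up instead:

aws ec2 describe-route-tables \
--filters Name=association.subnet-id,Values=<YOUR_SUBNET_ID> \
--query "RouteTables[].Routes[]" \
--output table

The default route (0.0.0.0/0) should target a NAT gateway (nat-...). If it targets an internet gateway (igw-...) the subnet is public, and if it is missing the subnet has no egress.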

If the cluster manager was created with an incorrect network configuration, the service can be updated with the correct subnets via Terraform or CloudFormation, or fixed manually via the AWS CLI:

aws ecs update-service \
--service upwind-cluster-manager \
--cluster <YOUR_ECS_CLUSTER_NAME> \
--network-configuration file://updated-network-config.json

Where the contents of the file look like this:

{
  "awsvpcConfiguration": {
    "subnets": [
      "<specify new subnet ID>"
    ],
    "securityGroups": [],
    "assignPublicIp": "DISABLED"
  }
}

We can also do this using the shorthand syntax:

aws ecs update-service \
--service upwind-cluster-manager \
--cluster <YOUR_ECS_CLUSTER_NAME> \
--network-configuration "awsvpcConfiguration={subnets=[<specify new subnet ID>]}"

After updating the network configuration, the task needs to be redeployed:

aws ecs update-service \
--service upwind-cluster-manager \
--cluster <YOUR_ECS_CLUSTER_NAME> \
--force-new-deployment
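
To confirm that the service picked up the new network configuration and the new deployment is rolling out, you can describe the service. A sketch; the --query expression only trims the output for readability:

aws ecs describe-services \
--cluster <YOUR_ECS_CLUSTER_NAME> \
--services upwind-cluster-manager \
--query "services[0].{status: status, network: networkConfiguration, rollout: deployments[].rolloutState}"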

Sensor to Cluster Manager Communication

We need to ensure that the sensors have network connectivity to the cluster manager. If the sensors are not able to communicate with the cluster manager, you may see errors like the following:

Error while dialing: dial tcp 172.29.218.199:8082: i/o timeout
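
A quick way to confirm the symptom is a plain TCP check against the cluster manager's IP and port, run from a host in the cluster or from an ECS Exec session into a sensor task (if enabled). This is a sketch; the address below is taken from the example error message above:

# Test TCP connectivity to the cluster manager on port 8082:
nc -vz 172.29.218.199 8082
# Without netcat, a plain bash check works too:
timeout 5 bash -c 'echo > /dev/tcp/172.29.218.199/8082' && echo reachable || echo unreachable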

By default, the cluster manager is deployed without any security group associated with it. This means that AWS assigns the VPC's default security group to it, and if the default security group does not allow inbound traffic on port 8082, the sensors will not be able to communicate with the cluster manager. The security group for the cluster manager can be set with the ClusterManagerSecurityGroup parameter of the CloudFormation stack or the security_groups_cluster_manager variable of the Terraform module.

If a security group is specified, it must allow ingress on port 8082 from hosts in the cluster.
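
For example, such an ingress rule can be added with the AWS CLI. This is a sketch: <CLUSTER_MANAGER_SG_ID> and <VPC_CIDR> are placeholders, and you can use --source-group instead of --cidr to reference the security group of the cluster hosts:

aws ec2 authorize-security-group-ingress \
--group-id <CLUSTER_MANAGER_SG_ID> \
--protocol tcp \
--port 8082 \
--cidr <VPC_CIDR>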

Cluster Manager Fails to Start due to Missing Secret Immediately After Install

The cluster manager needs a secret that is created by a Lambda function provisioned as part of the CloudFormation stack or Terraform module. Immediately after installation, the ECS service may be created before the Lambda has run for the first time, and the cluster manager task will fail to start. The ECS service will recreate the task a short time later, by which point the secret should have been created.

If the cluster manager tasks persistently fail to start, check the status of the UpwindRegistryCredentials-<YOUR_ECS_CLUSTER_NAME> Lambda function and inspect its logs. Double-check that the credentials you provided to the Terraform module or CloudFormation stack are correct.
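
One way to do this from the CLI, as a sketch: the /aws/lambda/... log group name is the Lambda default, and aws logs tail requires AWS CLI v2:

# Tail the most recent logs of the credentials Lambda:
aws logs tail /aws/lambda/UpwindRegistryCredentials-<YOUR_ECS_CLUSTER_NAME> --since 1h

# Check whether the registry credentials secret has been created yet:
aws secretsmanager describe-secret \
--secret-id upwind-registry-credentials-<YOUR_ECS_CLUSTER_NAME>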

Destroying and Recreating Fails on Conflicting Secrets Manager Secrets

When the sensor installation is removed and a secret has been created, the secret is scheduled for deletion rather than deleted right away. To reinstall the sensor before the scheduled deletion happens, the secret needs to be forcefully removed, as shown below.
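
You can check whether a secret is currently pending deletion with describe-secret; the DeletedDate field is populated while deletion is scheduled. A sketch, using the sensor credentials secret as an example:

aws secretsmanager describe-secret \
--secret-id upwind-sensor-credentials-<YOUR_ECS_CLUSTER_NAME> \
--query "{Name: Name, DeletedDate: DeletedDate}"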

  • Delete the sensor credentials secret immediately:
aws secretsmanager delete-secret \
--secret-id upwind-sensor-credentials-<YOUR_ECS_CLUSTER_NAME> \
--force-delete-without-recovery
  • Delete the registry credentials secret immediately:
aws secretsmanager delete-secret \
--secret-id upwind-registry-credentials-<YOUR_ECS_CLUSTER_NAME> \
--force-delete-without-recovery