Fixing AWS CodeDeploy issue where Auto Scaling Group is not attached to Target Group

César H. Vortmann
4 min readMay 8, 2019

AWS CodeDeploy Blue/green deployments can automatically copy the previous Auto Scaling Group configuration. One thing that we discovered the hard way was: after a deployment the newly created Auto Scaling Group, that had a Target Group attached, no longer has it.

Missing Target Group

It’s is a known issue that hasn’t been fixed yet. You could have noticed it incidentally or you could have noticed, as in our case, due a major outage.

How could that lead to an outage?

Target Groups are responsible for checking the health of instances and remove them from the Load Balancer, when they are unhealthy. An Auto Scaling Group configured to maintain a fixed number of instances takes care of launching new instances to replace unhealthy ones and terminate the unhealthy instances after the replacements report as healthy.

Since the Auto Scaling Group created by CodeDeploy is not attached to the existing Target Group used by the original Auto Scaling Group from which CodeDeploy copied information, the unhealthy instances will be removed from the Load Balancer by the Target Group but no new instances will be launched by the new Auto Scaling Group.

If your unhealthy instances can’t recover by themselves that means your application will be completely down. In our situation it turned out that the we were using a version of the Erlang VM that would suddenly crash and make our instances unhealthy.

How could we fix it?

We imagined that we could attach the Target Group into to newly created Auto Scaling Group as part of the deployment process. As seen in the documentation the last hook for a Blue/green deployment is the AfterAllowTraffic hook, so we decided to attach them in it.

We’ve created a simple bash script that uses the AWS Command Line Interface (CLI) to find the necessary information and then attach the Target Group to the Auto Scaling Group, again using CLI. You can see the Environment Variables available for hooks to fetch information in the official documentation.

To attach the Target Group to the Auto Scaling Group we need the Auto Scaling Group’s name and the Target Group’s Amazon Resource Name (ARN). We find the Target Group’s name using the $DEPLOYMENT_ID of the current deploy and then use that to find the Target Group’s ARN. We find the Auto Scaling Group’s name with the same call we used to find the Target Group’s name but with a different query. Finally we attach them using the attach-load-balancer-target-groups command.

It’s worth mentioning that to be able to run those commands your instances would need permission to the following actions:

  • codedeploy:GetDeployment
  • elasticloadbalancing:DescribeTargetGroups
  • autoscaling:AttachLoadBalancerTargetGroups

That fixes the problem of having the Auto Scaling Group not being attached to the Target Group after a deploy. But what happens when the Auto Scaling Group decides to launch a new instance?

Auto Scaling Group decides to launch a new instance

When the Auto Scaling Group decides to launch a new instance, a new deploy will be triggered for the that instance after it is started. But that deploy has a different Initiating Event. A normal Blue/green deployment would have a Initiating Event as user while a deploy created by an Auto Scaling Group action would have a Initiating Event as autoScaling.

Important to us, though, is that the information used by our naive script in the last section would be different. More specifically, since the deployment is not copying the Auto Scaling Group configuration this time, there’s no information about the Auto Scaling Group. So our command to find the autoscalingGroupName would return ”None”. That leads to a infinite loop of instances failing the deployment and the Auto Scaling Group_ launching new instances and trying to deploy to them again.

Since the Target Group would already been attached to the Auto Scaling Group when the script ran in the AfterAllowTraffic during the Blue/green deployment, we can change our script to only attach them if the autoscalingGroupName is different from ”None”.

Conclusion

Running this script as part of our deployment process helped us to mitigate an outage. Bear in mind that we see the use of the aforementioned bash script as a temporary fix while we wait for the issue to be fixed in the AWS CodeDeploy side.

After further testing and simulation of the case that caused the outage, we’re now confident that the replacement of unhealthy instances are working as expected. That brought us immense peace of mind.

Originally posted to dev.to

--

--