At Hadapt we provision a lot of EC2 instances for development and test purposes. This gives us some unhappily deep experience with error cases around EC2 provisioning (at least, in us-east, where we provision most of our nodes).
One problem we have seen is nodes transitioning into a running state but never becoming reachable over SSH. This can be detected programmatically (or on the EC2 dashboard) as a node that fails the reachability check. If you retry the SSH connection in a loop, it might look something like this:
ssh: connect to host 184.108.40.206 port 22: Connection timed out
repeated forever. A normally operating host, by contrast, transitions from timing out, to connection refused (IP address acquired, but SSH not yet started), to normal operation:
ssh: connect to host 220.127.116.11 port 22: Connection timed out
ssh: connect to host 18.104.22.168 port 22: Connection refused
Warning: Permanently added '22.214.171.124' (RSA) to the list of known hosts.
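The progression above can be watched for in a retry loop. What follows is a minimal sketch, not the tooling we actually run: the function names, the state labels ("timed-out", "refused"), the retry count, and the sleep interval are all made up for illustration. It relies only on the standard OpenSSH client options BatchMode, ConnectTimeout, and StrictHostKeyChecking.

```shell
#!/bin/sh
# classify_ssh_error MESSAGE
# Map an ssh client error message to a coarse state (labels are hypothetical).
classify_ssh_error() {
  case "$1" in
    *"Connection timed out"*) echo "timed-out" ;;  # host has no reachable IP yet
    *"Connection refused"*)   echo "refused" ;;    # IP acquired, sshd not started
    *)                        echo "unknown" ;;
  esac
}

# wait_for_ssh HOST [MAX_TRIES]
# Retry until ssh succeeds or MAX_TRIES (default 30, an arbitrary choice) is
# reached. On failure, prints the last observed state and returns 1.
wait_for_ssh() {
  host=$1
  max_tries=${2:-30}
  try=1
  state=unknown
  while [ "$try" -le "$max_tries" ]; do
    if err=$(ssh -o BatchMode=yes -o ConnectTimeout=10 \
                 -o StrictHostKeyChecking=no "$host" true 2>&1); then
      echo "ssh-ready"
      return 0
    fi
    state=$(classify_ssh_error "$err")
    echo "attempt $try: $state" >&2
    try=$((try + 1))
    sleep 5
  done
  echo "$state"
  return 1
}
```

A host that is still reporting "timed-out" when the loop gives up, without ever having reached "refused", is a candidate for the problem described here, and it is probably quicker to terminate and re-provision it than to keep waiting.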
We’ve learned, through working with EC2 support, that this can happen if the new instance misses the DHCP offer from the local DHCP server and thus never acquires its IP address.
The Amazon Linux AMI ships a configuration in
/etc/sysconfig/network-scripts/ifcfg-eth0 that avoids this problem.
Notably, it disables the use of NetworkManager (which may cause an
interface to be disabled) and configures
dhclient to be persistent
and retry until it gets a lease. It also disables IPv6.
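As an illustration, here is a minimal sketch of an ifcfg-eth0 with those three properties, assuming the standard RHEL-style initscripts keys NM_CONTROLLED, PERSISTENT_DHCLIENT, and IPV6INIT. The remaining lines are typical DHCP boilerplate and are assumptions, not a copy of the file shipped in the Amazon Linux AMI, so check them against your own image before relying on them.

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth0 (illustrative sketch, not the Amazon file)
DEVICE=eth0              # assumed: primary interface name
BOOTPROTO=dhcp           # assumed: obtain the address via DHCP
ONBOOT=yes               # assumed: bring the interface up at boot
NM_CONTROLLED=no         # keep NetworkManager from managing (and disabling) eth0
PERSISTENT_DHCLIENT=yes  # dhclient keeps retrying until it gets a lease
IPV6INIT=no              # disable IPv6 on this interface
```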
Neither the CentOS public AMIs nor the
cloud-init-enabled CentOS AMIs use this
configuration. If you are using those frequently and seeing SSH
timeout issues, you may wish to re-capture those images with this
networking configuration. If you are constructing your own images
based on CentOS (e.g., using Packer), it would probably be a good
idea to use this to configure network interfaces in your AMI.
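Before re-capturing an image, it may be worth checking whether it already has these settings. The sketch below is one hypothetical way to do that: check_ifcfg is a name invented here, and the match is deliberately simple (it expects unquoted KEY=value lines, so a file that writes NM_CONTROLLED="no" would be reported as missing the setting).

```shell
#!/bin/sh
# check_ifcfg FILE
# Print each of the three settings discussed above that FILE does not contain.
# Returns 0 if all are present, 1 otherwise. Assumes unquoted KEY=value lines.
check_ifcfg() {
  file=$1
  missing=0
  for setting in NM_CONTROLLED=no PERSISTENT_DHCLIENT=yes IPV6INIT=no; do
    if ! grep -Eq "^[[:space:]]*${setting}[[:space:]]*$" "$file"; then
      echo "missing: $setting"
      missing=1
    fi
  done
  return "$missing"
}
```

On a running instance this could be invoked as check_ifcfg /etc/sysconfig/network-scripts/ifcfg-eth0, for example as a late step in a Packer shell provisioner so that the build fails if the image would be captured without the settings.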