The golden rules of running infrastructure on AWS:
- Instances fail all the time, just reboot them.
- Design everything to fail, and handle it consistently.
- AWS has a large community where you can ask questions and find a solution to everything; failing that, refer to rules (1) and (2).
When we launched Interana, we decided to deploy our solution on AWS as a managed service so our customers could benefit from our pre-built orchestration. Interana's proprietary analytics database runs with a replication factor of 1. This is in contrast to transactional NoSQL stores, which tend to replicate data to achieve high availability (HA). Occasional reboots therefore do not seem so bad in an analytics-purposed system, and running without replicas brings cost and performance benefits when managing many terabytes of data.
When we first deployed on AWS, we observed a large number of node outages. We rebooted nodes over and over until we found ourselves facing a recurrence rate of 5 outages per 100 instances per day. This seemed excessively high. Left unattended, nodes would stay down for hours.
Something was up.
Indicative of a bigger problem on AWS
The instances would become unreachable, usually signaled by Amazon status check alerts:
“You are receiving this email because instance i-xxxxxx in region US - N. Virginia has failed an instance or a system status check for at least 5 period(s) of 60 seconds at "Tuesday 10 March, 2015 22:01:32 UTC". You can view status check details about this instance by navigating to the EC2 console”
Auditing our setup, we found we were using the stock Ubuntu 14.04 LTS AMI from AWS. At first, we thought that simply upgrading the Ubuntu packages to their latest versions might clear things up.
sudo apt-get update && sudo apt-get upgrade
Nothing changed. So we looked at the syslog and saw the following:
Mar 10 20:39:42 ip-10-0-0-56 dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 7 (xid=0x7f84a245)
Mar 10 20:39:49 ip-10-0-0-56 dhclient: No DHCPOFFERS received.
Mar 10 20:39:49 ip-10-0-0-56 dhclient: No working leases in persistent database - sleeping.
Ahh. A DHCP failure. Every 1800 seconds, the DHCP client asks the DHCP server to renew the lease on this host's address. When the DHCP server has problems, the lease cannot be renewed and the host becomes unreachable.
However, these incidents seemed isolated to a single day and to AWS congestion issues that resolved themselves. This was clearly not the root cause.
Unintended Consequences
Another problem was that instances would get “stuck” trying to reboot. The usual remedy is to stop and start the instance. However, the storage on our i2.xlarge machine type is largely ephemeral, so a stop/start leads to 100% data loss. This forces us to restore from disk, which takes many hours.
Even worse, a side-by-side restore (i.e. from an EBS snapshot) can take an entire day, as cold access to EBS runs at almost 1/10 of its peak throughput.
The most obvious approach is to replicate hot data, but i2.xlarge is the most expensive machine type in the AWS family, costing almost 3K per server per year. Running 100 TB of spares for a replication factor of 2 would be prohibitively expensive.
Monitoring Mashup
As we saw nodes continuing to go pear-shaped, we needed to add monitoring in a hurry and start tracking all the outages.
Amazon comes with CloudWatch built in. While that seems great, it doesn’t really go beyond simple metrics and outage notices. It also reports statistics from the Xen hypervisor’s point of view, not from inside the machine. There are pros and cons to this, but mainly, if the Xen hypervisor thinks everything is fine, it won’t provide any further insight when you have an outage.
Next we installed a port ping utility called nping. Nping is useful for checking ports without requiring ICMP, which is disabled by default on Amazon for security purposes.
sudo apt-get install nmap
nping -p 22 -c 1 <host>
I wrote a wrapper script that ran nping against all nodes in a loop, producing the following log, which is easy enough to parse from the Linux command line:
20150325T200634|10.0.0.84|import6|Starting Nping 0.6.40 ( http://nmap.org/nping ) at 2015-03-25 20:06 UTC
20150325T200634|10.0.0.84|import6|SENT (0.0013s) Starting TCP Handshake > 10.0.0.84:22
20150325T200634|10.0.0.84|import6|RECV (0.0143s) Handshake with 10.0.0.84:22 completed
20150325T200634|10.0.0.84|import6|
20150325T200634|10.0.0.84|import6|Max rtt: 13.017ms | Min rtt: 13.017ms | Avg rtt: 13.017ms
20150325T200634|10.0.0.84|import6|TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
20150325T200634|10.0.0.84|import6|Nping done: 1 IP address pinged in 0.01 seconds
20150325T200634|10.0.0.82|import4|
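The wrapper script itself is not shown in this post. As a rough sketch of how such a wrapper could produce the `timestamp|ip|name|line` format, the following assumes `nping` is on the PATH and uses hypothetical names throughout (`HOSTS`, `tag_lines`, `probe`, `sweep`); it is an illustration, not the script we actually ran.

```python
import subprocess
from datetime import datetime, timezone

# Hypothetical inventory (ip -> name); the real host list is not part of this post.
HOSTS = {"10.0.0.84": "import6", "10.0.0.82": "import4"}


def tag_lines(stamp, ip, name, output):
    """Prefix every line of nping output with 'timestamp|ip|name|'."""
    return ["{}|{}|{}|{}".format(stamp, ip, name, line)
            for line in output.splitlines()]


def probe(ip, port=22, run=subprocess.check_output):
    """Run one TCP-connect probe; `run` is injectable so it can be stubbed in tests."""
    cmd = ["nping", "--tcp-connect", "-p", str(port), "-c", "1", ip]
    try:
        return run(cmd, stderr=subprocess.STDOUT).decode()
    except subprocess.CalledProcessError as err:
        # Keep whatever nping printed even if it exited non-zero.
        return err.output.decode()


def sweep(hosts=HOSTS, run=subprocess.check_output):
    """Probe every host once and return the tagged log lines."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    lines = []
    for ip, name in sorted(hosts.items()):
        lines.extend(tag_lines(stamp, ip, name, probe(ip, run=run)))
    return lines
```

Calling `sweep()` in a loop and appending the returned lines to a file would yield a log in the same shape as the one above.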
Check for outages
cat ~/tmp/monitor_nping_all_20150325T200351.txt | grep -Pe "Failed: 1" | cut -d'|' -f2,3 | sort -n | uniq -c
     14 10.0.0.48|data0
     95 10.0.0.49|data1
     83 10.0.0.51|data2
     13 10.0.0.52|data
     47 10.0.0.53|data5
     31 10.0.0.54|data6
     11 10.0.0.55|data7
     21 10.0.0.56|data8
     46 10.0.0.57|data9
    108 10.0.0.58|data10
     26 10.0.0.59|data11
After collecting the data, we can also process it for outages using Python scripts:
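The same per-host failure count can be sketched in Python, which is handy once the log needs more processing than a pipeline comfortably allows. The function name `count_failures` is hypothetical, and the sketch assumes the pipe-delimited log format shown earlier, with one probe per nping run so that a failed probe appears as `Failed: 1`.

```python
from collections import Counter


def count_failures(lines):
    """Count 'Failed: 1' summary lines per (ip, name).

    Mirrors: grep "Failed: 1" | cut -d'|' -f2,3 | sort | uniq -c
    """
    failures = Counter()
    for line in lines:
        if "Failed: 1" not in line:
            continue
        parts = line.rstrip("\n").split("|")
        if len(parts) >= 3:
            # Field 2 is the IP address, field 3 is the host name.
            failures[(parts[1], parts[2])] += 1
    return failures


def report(failures):
    """Format the counts the way `uniq -c` would, sorted by host."""
    return ["{:7d} {}|{}".format(n, ip, name)
            for (ip, name), n in sorted(failures.items())]
```

Passing an open log file to `count_failures` and printing the lines from `report` gives a listing comparable to the shell output.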
Check for default response times
cat /monitor_nping_all_xxx.txt | cut -d'|' -f4 | tr -s ' ' | grep "pinged" | grep -woPe "\d+\.\d+" | python -c "\
import sys
from collections import defaultdict
# Histogram: response time in seconds -> number of probes
hist = defaultdict(int)
for line in sys.stdin:
    hist[float(line.strip())] += 1
for resp, count in sorted(hist.items()):
    print('{:0.2f} = {}'.format(resp, count))"
avg=0.01,max=1.06,min=0.00, count=2398
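The avg/max/min/count summary quoted with the command above comes from a script that is not reproduced here. A minimal sketch of how such a summary could be computed from the same log follows; the names `PINGED`, `summarize` and `format_summary` are hypothetical, and the regular expression assumes nping's closing line has the form "... pinged in 0.01 seconds".

```python
import re

# Matches the elapsed time in nping's final line, e.g. "... pinged in 0.01 seconds".
PINGED = re.compile(r"pinged in (\d+\.\d+) seconds")


def summarize(lines):
    """Return (avg, max, min, count) of the elapsed times found in `lines`."""
    times = [float(m.group(1)) for m in map(PINGED.search, lines) if m]
    if not times:
        return None  # no probe results in the input
    return sum(times) / len(times), max(times), min(times), len(times)


def format_summary(stats):
    """Render the tuple from summarize() as a one-line summary."""
    return "avg={:0.2f},max={:0.2f},min={:0.2f}, count={}".format(*stats)
```

For example, `format_summary(summarize(open(path)))` would print one summary line for a whole log file, provided the file contains at least one probe result.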
We also leveraged Datadog, a popular cloud-based monitoring stack. Once installed, the story became clear.
AWS monitoring would often alert us as much as an hour after our own monitoring had detected the initial outage. The Datadog graph of network bandwidth should be a set of nicely peaked lines with a small taper. Instead, we saw outages as soon as network bandwidth dropped from its peak. This problem wasn’t just a few nodes going out; the network cards were dropping out constantly as load increased.
Finally, we copied a large file to all nodes with SCP in a distributed fashion. That caused the network to drop immediately, so reproducing the problem became easy.
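A reproduction along these lines can be sketched as follows. This is only an illustration under assumptions: it pushes the file from a single machine to every node concurrently, which may differ from the distributed setup we used, and the names `scp_command` and `push_to_all`, the remote path and the worker count are all hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def scp_command(local_path, host, remote_path="/tmp/"):
    """Build the scp command for one host (paths are illustrative)."""
    return ["scp", "-q", local_path, "{}:{}".format(host, remote_path)]


def push_to_all(local_path, hosts, run=subprocess.call, workers=16):
    """Copy one large file to every host concurrently.

    `hosts` is a list of addresses; `run` is injectable so the copy can be
    stubbed out. Returns {host: exit_code}; non-zero means the copy failed.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = pool.map(lambda h: run(scp_command(local_path, h)), hosts)
        return dict(zip(hosts, codes))
```

Hosts whose exit code is non-zero, or whose copy stalls, are the ones worth cross-checking against the port probe log.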
Enhanced is the new “Working”
After some wrangling, the AWS support folks were able to monitor the Xen hypervisor and found that the network hardware never received the data, because it died in the OS/driver. While we never got the full picture, the problem was likely due to jumbo frame support, that is, Ethernet frames larger than the standard 1500-byte MTU. The fix was to turn on Enhanced Networking, which provides far better transfer rates (up to 50%) as well as lower latency. It uses SR-IOV, a feature that lets the multiple tenants on a host share a network device using DMA-style transfers.
After we finished installing this, the transfer graph looked happy:
Built-Tough, SR-IOV Tough.
After going through the install process and waiting on Canonical, Ubuntu’s maintainer, to provide a fix, we decided to build the AMIs ourselves and make them public for all to use. These are based on Ubuntu 14.04.1 LTS. They should be patched with security updates immediately after deployment, i.e.
sudo apt-get update && sudo apt-get upgrade
Region | AMI |
US-EAST-1 (Virginia) | ami-30ebc55a |
US-WEST-1 (N. California) | ami-6f04720f |
US-WEST-2 (Oregon) | ami-440bed24 |
EU-WEST-1 (Ireland) | ami-68fb4d1b |
EU-CENTRAL-1 (Frankfurt) | ami-4a899126 |
AP-NORTHEAST-1 (Tokyo) | ami-7f213611 |
Sometimes they come back
The driver uses Linux DKMS support, which requires the headers for the current kernel in order to compile the network driver. So when doing a dist-upgrade, do not forget to install the kernel headers, or your network will revert to the packaged VIF driver and enhanced networking will turn off. We had this happen a few times before we realized the driver was not installed. Note that the last command below is the one that really matters, because it shows which driver is in use.
Install headers for the current kernel, confirm they are present, and rebuild the initramfs
sudo apt-get install linux-headers-$(uname -r)
dpkg -l | grep "linux-headers-$(uname -r)"
sudo update-initramfs -c -k all
Check if driver is available
modinfo ixgbevf | grep -Pe "version:\s+2.16"
Check if driver is actually in use!
ethtool -i eth0 | grep -E '^(driver|version):'
Another problem we faced was segmentation offloading causing SSH connections to the host to time out on hosts without enhanced networking (i.e. those using the VIF driver). The following message appears in /var/log/syslog many times:
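To run this check across many hosts, the `ethtool -i` output can be parsed rather than eyeballed. The sketch below assumes the usual `key: value` layout of that output; the function names are hypothetical, and the 2.16 minimum simply mirrors the version matched in the commands above.

```python
import subprocess


def parse_ethtool_info(text):
    """Parse `ethtool -i` output ('key: value' per line) into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def uses_enhanced_networking(info, driver="ixgbevf", min_version=(2, 16)):
    """True if the interface is bound to `driver` at `min_version` or newer."""
    if info.get("driver") != driver:
        return False
    parts = info.get("version", "").split(".")[:2]
    try:
        return tuple(int(p) for p in parts) >= min_version
    except ValueError:
        return False  # missing or non-numeric version string


def check_interface(iface="eth0"):
    """Run ethtool against a local interface and apply the check."""
    out = subprocess.check_output(["ethtool", "-i", iface]).decode()
    return uses_enhanced_networking(parse_ethtool_info(out))
```

A host where `check_interface()` returns False has most likely fallen back to the VIF driver.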
xen_netfront: xennet: skb rides the rocket: 19 slots
We later found this bug filed against the Ubuntu source tree, along with a workaround: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1317811
Turn off scatter-gather, and with it segmentation offloading, on hosts using the VIF driver:
sudo ethtool -K eth0 sg off
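To see whether a host is affected, or whether the workaround has taken effect, the syslog can be scanned for the message quoted earlier. This is a small sketch; the names `ROCKET` and `rocket_events` are hypothetical, and the pattern is taken directly from that message.

```python
import re

# Pattern taken from the syslog message quoted above.
ROCKET = re.compile(r"xennet: skb rides the rocket: (\d+) slots")


def rocket_events(lines):
    """Return the slot count from every 'skb rides the rocket' line."""
    return [int(m.group(1)) for m in map(ROCKET.search, lines) if m]
```

Passing an open `/var/log/syslog` to `rocket_events` and taking the length of the result gives the number of occurrences; a count that keeps growing after the `ethtool` change suggests the workaround is not applied on that host.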
Conclusion
When building out an AMI, AWS provides you with a lot of tools and a good set of base AMIs to start with. However, always ensure that the hardware is validated before deploying to production. Also, when you find yourself restarting or rebooting machines, do as much root cause investigation as early as possible, before the problem becomes malignant and spreads to all your infrastructure. Use all the monitoring tools available to get a complete picture.