Technograzing: 2016

Sunday, June 19, 2016

Reboot all the things

The golden rules of running infrastructure on AWS:

1) Instances fail all the time, just reboot them.
2) Design everything to fail, and handle it consistently.
3) AWS has a large community where you can ask questions and find a solution to everything; failing that, refer to rules (1) and (2).

When we launched Interana, we decided to deploy our solution on AWS as a managed service so our customers could benefit from our pre-built orchestration. Interana's proprietary analytics database runs with a replication factor of 1. This is in contrast to transactional NoSQL stores, which tend to replicate data to achieve HA. So occasional reboots in themselves seem not so bad in an analytics-purposed system, and the cost and performance benefits are realized when managing many TBs of data.

Initially, when we deployed on AWS, we observed a large number of node outages. We rebooted nodes over and over again until we faced a recurrence rate of 5 outages per 100 instances per day. This seemed excessively high. Left unattended, nodes would stay down for hours.

Something was up.

Indicative of a bigger problem on AWS

The instances would become unreachable, usually signaled by Amazon status check alerts:

“You are receiving this email because instance i-xxxxxx in region US - N. Virginia has failed an instance or a system status check for at least 5 period(s) of 60 seconds at "Tuesday 10 March, 2015 22:01:32 UTC". You can view status check details about this instance by navigating to the EC2 console”

Auditing our setup, we found we were using the stock Ubuntu 14.04 LTS AMI from AWS. At first, we thought a simple upgrade of Ubuntu to the latest version might clear things up.

sudo apt-get update && sudo apt-get upgrade

Nothing changed. So we looked at the syslog and saw the following:

Mar 10 20:39:42 ip-10-0-0-56 dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 7 (xid=0x7f84a245)
Mar 10 20:39:49 ip-10-0-0-56 dhclient: No DHCPOFFERS received.
Mar 10 20:39:49 ip-10-0-0-56 dhclient: No working leases in persistent database - sleeping.

Ahh. A DHCP failure. Every 1800 seconds, the DHCP client on the host contacts the DHCP server to renew its lease. When there are problems reaching the DHCP server, the host loses its address and becomes unreachable.

However, these incidents seemed isolated to a single day and to AWS congestion issues that resolved themselves. This was clearly not the root cause.

Unintended Consequences

Another problem was that instances would get “stuck” trying to reboot. The usual answer is to stop/start the instance. However, the storage on our i2.xlarge machine type is largely ephemeral, so a stop/start leads to 100% data loss. This forces us to restore from disk, which takes many hours.

Even worse, a side-by-side restore (i.e. from an EBS snapshot) can take an entire day, as cold access to EBS runs at almost 1/10 of its peak throughput.

The most obvious approach is to replicate hot data, but i2.xlarge instances are the most expensive machines in the AWS family, costing almost 3K per server per year. Running 100 TB of spares for a replication factor of 2 would be prohibitively expensive.
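
To put a rough number on that, here is a back-of-the-envelope sketch. It assumes each i2.xlarge carries a single 800 GB instance-store SSD (an assumption, the post does not state it) and reuses the post's figure of about 3K per server per year; the function name `spare_cost` is made up for illustration.

```python
# Assumptions: 800 GB of instance-store SSD per i2.xlarge (not stated in the
# post) and ~3K per server per year (the post's own figure).
SSD_GB_PER_INSTANCE = 800
COST_PER_INSTANCE_PER_YEAR = 3000


def spare_cost(data_gb, ssd_gb=SSD_GB_PER_INSTANCE,
               cost=COST_PER_INSTANCE_PER_YEAR):
    """Return (extra instances, extra yearly cost) for one more full replica."""
    # Integer ceiling division: a partly filled instance still costs full price.
    instances = -(-data_gb // ssd_gb)
    return instances, instances * cost


if __name__ == "__main__":
    instances, yearly = spare_cost(100000)  # 100 TB expressed in GB
    print("{} extra instances, about {} per year".format(instances, yearly))
```

Under these assumptions, 100 TB of spares works out to 125 extra instances and roughly 375K per year.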

Monitoring Mashup

As we saw nodes continuing to go pear-shaped, we needed to add additional monitoring in a hurry and start tracking all the outages.

Amazon comes with CloudWatch built in. While that seems great, it doesn’t really go beyond simple metrics and outage notices. It also gives statistics from the Xen hypervisor’s point of view, not from what is happening on the machine. There are pros and cons to this, but mainly, if the Xen hypervisor thinks everything is fine, it won’t provide any further insight when you have an outage.

Next we installed a port ping utility called nping. Nping is useful for checking ports without requiring ICMP, which is disabled by default on Amazon for security purposes.

sudo apt-get install nmap
nping -p 22 -c 1 <target-host>

I wrote a wrapper script that ran nping against all nodes in a loop, producing the following log, which is easy enough to parse from the Linux command line:

20150325T200634|10.0.0.84|import6|Starting Nping 0.6.40 ( http://nmap.org/nping ) at 2015-03-25 20:06 UTC
20150325T200634|10.0.0.84|import6|SENT (0.0013s) Starting TCP Handshake > 10.0.0.84:22
20150325T200634|10.0.0.84|import6|RECV (0.0143s) Handshake with 10.0.0.84:22 completed
20150325T200634|10.0.0.84|import6|
20150325T200634|10.0.0.84|import6|Max rtt: 13.017ms | Min rtt: 13.017ms | Avg rtt: 13.017ms
20150325T200634|10.0.0.84|import6|TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
20150325T200634|10.0.0.84|import6|Nping done: 1 IP address pinged in 0.01 seconds
20150325T200634|10.0.0.82|import4|
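
The wrapper itself is not shown here, so the following is only a minimal sketch of how such a log could be produced. The host map and the function names (`HOSTS`, `tag_lines`, `probe`, `monitor_once`) are hypothetical; the only thing taken from the log above is the `timestamp|ip|name|` prefix on every nping output line.

```python
import subprocess
from datetime import datetime, timezone

# Hypothetical inventory (IP -> node name); the real host list is not shown.
HOSTS = {"10.0.0.84": "import6", "10.0.0.82": "import4"}


def tag_lines(output, ip, name, now=None):
    """Prefix every line of nping output with 'timestamp|ip|name|'."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%S")
    return ["{}|{}|{}|{}".format(stamp, ip, name, line)
            for line in output.splitlines()]


def probe(ip, port=22):
    """Run a single TCP-connect probe with nping and return its raw output."""
    result = subprocess.run(
        ["nping", "--tcp-connect", "-p", str(port), "-c", "1", ip],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        universal_newlines=True)
    return result.stdout


def monitor_once(hosts=HOSTS):
    """Probe every host once and print the tagged log lines."""
    for ip, name in hosts.items():
        for line in tag_lines(probe(ip), ip, name):
            print(line)


if __name__ == "__main__":
    monitor_once()
```

Running `monitor_once()` from cron or inside a `while True` loop with a sleep, with the output redirected to a file, would give a log in the same shape as the one above.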

Check for outages

cat ~/tmp/monitor_nping_all_20150325T200351.txt | grep -Pe "Failed: 1" | cut -d'|' -f2,3 | sort -n | uniq -c
14 10.0.0.48|data0
95 10.0.0.49|data1
83 10.0.0.51|data2
13 10.0.0.52|data
47 10.0.0.53|data5
31 10.0.0.54|data6
11 10.0.0.55|data7
21 10.0.0.56|data8
46 10.0.0.57|data9
108 10.0.0.58|data10
26 10.0.0.59|data11
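
The same count can be done in Python when the log needs further processing. This is a sketch that mirrors the grep/cut/sort/uniq pipeline above; `count_failures` is a name made up for illustration, and the failing sample line in the usage is constructed from the log format, not copied from a real outage.

```python
from collections import Counter


def count_failures(lines):
    """Count log lines per 'ip|name' whose nping summary reports 'Failed: 1'."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("|")
        # fields: timestamp, ip, name, then the nping text, which may itself
        # contain '|' (the summary line does), so rejoin the remainder.
        if len(fields) >= 4 and "Failed: 1" in "|".join(fields[3:]):
            counts[fields[1] + "|" + fields[2]] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "20150325T200634|10.0.0.49|data1|TCP connection attempts: 1 | "
        "Successful connections: 0 | Failed: 1 (100.00%)",
    ]
    for host, n in sorted(count_failures(sample).items()):
        print(n, host)
```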

Once the data is collected, we can also process it with Python scripts.
Check the distribution of response times

cat /monitor_nping_all_xxx.txt | cut -d'|' -f4 | tr -s ' ' | grep "pinged" | grep -woPe "\d+\.\d+" | python -c "\
import sys
from collections import defaultdict
# one response time (in seconds) per input line
allf = [float(line) for line in sys.stdin if line.strip()]
# histogram: response time -> number of probes
hist = defaultdict(int)
for valf in allf:
    hist[valf] += 1
for resp, count in sorted(hist.items()):
    print('{:0.2f} = {}'.format(resp, count))
# summary line
print('avg={:0.2f},max={:0.2f},min={:0.2f}, count={}'.format(sum(allf) / len(allf), max(allf), min(allf), len(allf)))"
avg=0.01,max=1.06,min=0.00, count=2398

We also leveraged Datadog, a popular cloud-based monitoring stack. Once installed, the story became clear.

Many times, AWS monitoring would only alert us an hour after our own monitoring had detected the initial outage. The graph above should be a set of nicely peaked lines with a small taper. Instead, we see outages as soon as network bandwidth drops from its peak. This problem wasn’t just a few nodes going out; the network cards were dropping out all the time as load increased.

Finally, we added a test that SCPs a large file to all nodes at the same time. That would cause the network to drop immediately, so reproducing the problem became easy.
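
A reproduction along these lines could look like the sketch below. It is not the original test: the function names, the `/tmp/` destination and the `BatchMode=yes` option are illustrative assumptions. It simply starts one `scp` per host in parallel and collects the exit codes.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def scp_command(local_path, host, remote_path="/tmp/"):
    """Build the scp argument list for one host (remote_path is hypothetical)."""
    return ["scp", "-o", "BatchMode=yes", local_path,
            "{}:{}".format(host, remote_path)]


def push_to_all(local_path, hosts, runner=subprocess.call):
    """Copy local_path to every host at once; return {host: exit code}."""
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        # pool.map preserves the order of hosts, so zip lines results up.
        codes = pool.map(lambda h: runner(scp_command(local_path, h)), hosts)
        return dict(zip(hosts, codes))


if __name__ == "__main__":
    # Hypothetical file name and host list.
    print(push_to_all("bigfile.bin", ["10.0.0.48", "10.0.0.49"]))
```

A non-zero exit code for a host is then a hint that its connection dropped during the transfer.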

Enhanced is the new “Working”

After some wrangling, the AWS support folks were able to monitor the Xen hypervisor and found that the network hardware never received the data, because it died in the OS/driver. While we never got the full picture, the problem was likely due to jumbo frame support, i.e. Ethernet frames with an MTU greater than 1500 bytes. The fix was to turn on Enhanced Networking. It gives the instance far better transfer rates (up to 50%) and lower latency, which also benefits traffic to EBS. It uses SR-IOV, a feature that lets the guests on a multi-tenant host share the network hardware through direct, DMA-style transfers.

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-ec2-config.html

After we finished installing this, the transfer graph looked happy:

Built-Tough, SR-IOV Tough.

After going through the install process and waiting on Canonical (Ubuntu’s publisher) to provide a fix, we decided to build the AMIs and make them public for all to use. These are based on Ubuntu 14.04.1 LTS. They should be patched with security updates immediately after deployment, i.e.

sudo apt-get update && sudo apt-get upgrade

Region	AMI
US-EAST-1 (N. Virginia)	ami-30ebc55a
US-WEST-1 (N. California)	ami-6f04720f
US-WEST-2 (Oregon)	ami-440bed24
EU-WEST-1 (Ireland)	ami-68fb4d1b
EU-CENTRAL-1 (Frankfurt)	ami-4a899126
AP-NORTHEAST-1 (Tokyo)	ami-7f213611

Sometimes they come back

The driver uses Linux DKMS support, which requires the headers for the current kernel in order to compile the network driver. So when doing a dist-upgrade, do not forget to install the kernel headers, or your network will revert to the packaged VIF driver and enhanced networking will turn off. We had this happen a few times before we realized the driver was not installed. Note that the last command below is the one that really matters, since it shows which driver is actually in use.

Install headers for the current kernel and check that they are present

sudo apt-get install linux-headers-$(uname -r)
dpkg -l | grep "linux-headers-$(uname -r)"
sudo update-initramfs -c -k all

Check if driver is available

modinfo ixgbevf | grep -Pe "version:\s+2.16"

Check if driver is actually in use!

ethtool -i eth0 | grep -E '(driver:|version: 2.16)'
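
For monitoring, the same check can be scripted. The sketch below assumes the usual `key: value` layout of `ethtool -i` output and treats `ixgbevf` as the enhanced-networking driver, as in the commands above; the function names are made up for illustration.

```python
import subprocess


def parse_ethtool_info(text):
    """Turn the 'key: value' lines printed by `ethtool -i` into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first ':' only, so values like bus addresses survive.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def uses_enhanced_networking(info, driver="ixgbevf"):
    """True when the interface is bound to the SR-IOV (ixgbevf) driver."""
    return info.get("driver") == driver


def check_interface(iface="eth0"):
    """Run `ethtool -i` on iface and report whether ixgbevf is in use."""
    out = subprocess.check_output(["ethtool", "-i", iface],
                                  universal_newlines=True)
    return uses_enhanced_networking(parse_ethtool_info(out))


if __name__ == "__main__":
    print("enhanced networking in use:", check_interface())
```

Running this after every kernel upgrade would catch a silent fallback to the VIF driver.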

Another problem we faced was segmentation offloading causing SSH connections to the host to time out on hosts without enhanced networking (i.e. using the VIF driver). The following message appears in /var/log/syslog many times:

xen_netfront: xennet: skb rides the rocket: 19 slots

We later found this bug filed against the Ubuntu source tree, along with a workaround: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1317811

Turn off scatter-gather, which also disables segmentation offloading, on hosts using the VIF driver:

sudo ethtool -K eth0 sg off

Conclusion

When building out an AMI, AWS provides you with a lot of tools and a good set of base AMIs to start with. However, always ensure that the hardware is validated before deploying to production. Also, when restarting or rebooting machines, do as much root cause investigation as you can early on, before the problem becomes malignant and spreads to all your infrastructure. Use all the monitoring tools available to get a complete picture.

Thursday, February 18, 2016

How to get an Honorary Roaming Architect Degree

Back in grade 6, we had reading club (Scholastic, remember that?). The purpose was to transform into a Roboreader who could finish the most books in a month. Then you got a cute little magnet that you could put on your fridge so your parents could brag to all the other parents (while your sibling got the same one for trying hard, and you made fun of him until he cried).

I also remember almost finishing the perfect season. We had math drills once a week: 10 math questions in one minute. I earned the nickname Terminator. In the last month, I had a brain fart and slipped on 2x3 = 8. We had learned powers that day, and a bit flipped somewhere in my head. Later, in college, there were Keener Bingo and the Unix Beirdos, which is when I realized that we engineers get bored and need labels to help cluster like-minded people.

The aftermath: I can't remember a single book I read (Charlotte's Web, was that a pig or a spider?), and after reaching the pinnacle of 12x12 I didn't want to go further for fear of lowering my average. (I again sat out the final softball game to preserve my .900 batting average 15 years later, sorry Joe.) I also routinely go to the Beirdos for help with Perl, and of course for those wonderful vi-shortcut sheets.

So back on topic, what do you need to do to earn this title? BTW, it's not Roman Architect, a person who designs stuff in Fortran claiming it's the top mathematical system in the world (they are lurking around, beware!). I suppose when I'm 60 I won't remember a single journey or an object from a class, but we have blogs now, right?!? Blogger, don't get bought out by Micro-AARP!

Honorary Doctorate In Travel'n & Architect'n

Must have done some architecting in 5 different countries. Note that off-shoring your design does count, and we do TRANSFER credits if the country has split due to civil unrest (Czechoslovakia), but it's limited to the last 50 years (sorry, California).
You must have been employed by 5 different companies. Your own ventures count, but aliases or domains of the same venture do not. Working = 1 year min. Revenue = optional.
Must have designed systems in at least 3 different sectors, i.e. banking, finance, military, social, media, etc. (no, Facebook, Twitter, and squared in are not sectors).
Must have attempted to learn 5 different languages (sign language does not count). An attempt means you got a reply; toll-free and automated services do not count.
Must have been the primary source for 1 data breach involving foreign data security practices. Note that fixing and reporting the breach is optional. Distribution of the breach is highly discouraged.
Must have taken 2 complete hacks which broke at least one of the following: patent, copyright, reverse-engineering, and obfuscated them into something you sold off as original work.
Must have recruited at least 3 foreign tech workers, ending up with a financial stipend. Off-shoring does not count (see above), and it's a bonus if it was during a recession.
Must have participated in 2 complete disasters of a project; one had to be at least government related and cost taxpayers money.
You've tried karaoke in at least 2 languages other than English. Note that singing to your boarding wards or eligible daughters does count.
You've done a blog in 2 languages other than English, only to give up once you realized they pay in centavos/per/click, not centos/per/click. Zut alors, c'est dommage!
Must be on two patent disclosures. Note the following are excluded: fonts/colors/fusion, but we accept them as long as your name is shown. Social widgets are accepted.

So there it is. We'll be providing the Honorary Doctorate In Travel'n & Architect'n pending underwriting from one of our out-sourced eastern sponsors.

Meet these and get a badge (sorry not magnet, but likely to be a widget).

How can you get validation? Resumes... hell no. Your blog, of course. After all, who else is going to provide you with an honorary degree? You're not Bill Cosby, after all.
Are some of these personal, and could they get you in hot water? Do what I do: wait until things blow over and we can all laugh about it. After all, your Roaming Architect degree is a lifetime adventure.

Tech: Microsoft, why all the hate?

It's amazing how much we've evolved in the 2000s, with the following buzzwords becoming part of the lingua franca: Open Source, Web APIs, Cloud Computing, Mobile Computing, Commerce, Social, Geo-location.

But one company and one word are all but forgotten and left for dead: Microsoft and Productivity.

The very first OS I worked on was MS-BASIC 5.1 and DOS Shell. I hacked up my first Visual Basic program based on the GORILLAS.BAS demo (Angry Birds, anyone?), later used in a Turbo Pascal hockey emulator (yes, all we Canadians luv hockey).

Now, before I begin my list, I have to say that by the time I was in university, we all used to hate Microsoft. It was simple: they were the biggest, baddest software company out there, with their seemingly infallible leader, Bill Gates (who, after his philanthropic career, seems all that much harder to hate). The software interns came back very cocky and half repossessed, they offered the crappiest pizza during the recruiting sessions, they notoriously had the titles for every lame job (SDETs are still around!), and worst of all, you had to mail them and beg for a Platform SDK DVD so they could keep tabs on you. There was so much hate that at my very first job we participated in creating a new advertisement for new hires, with something like this as the tag line:

<Want to write the next-generation enterPlayerNumberOne(name) function for Solitaire 2.0? Don't become a microserf, join us for a real challenge>

My first commercial product was written in Visual Studio C++ using their Document-View architecture. Thanks to the old one-big-UI-hook event pump, I prototyped it quickly (3 days), complete with a serial driver and a multitasking processing layer. It fell just as quickly (5 months later) due to the inability to decouple the UI from the model and the constant redesign of the UI. I cursed at having the idealistic Java just around the corner but never being able to leverage it, due to MS's continuous tampering and forcing everyone onto their Visual Studio MFC (multifunctional crap) APIs. Even then it was way faster, and how could I explain to my clients, "Oh, that's the Java that makes it slow and ugly, but underneath it's all OO!"?

I also recall another death struggle with the corporate powers: that old Microsoft freebie syndrome. SourceSafe, the "free and functional" Microsoft version control. Well, we did an old PC World-style shootout, and it finished last in every category. People gasped when it made code vanish into thin air. Nevertheless, I was singled out for forcing a $500/seat version control system on the company and not giving positive reviews to Microsoft's ugly daughter. This was quite often the line from upper management: it's from Microsoft and it's FREE, what else do you want!

Anyway, I digress. So here are the things we owe to them, in my time:

1) Office UI: Quite simply the first graphical office suite, and the most productive and integrated one, with OLE support

2) Windows User Experience Guidelines: Quite simply one of the most followed UI standards at the time.

3) Direct3D: Pretty much at the center of all 3D games for the last decade and a half (okay, it didn't power Wolfenstein)

4) Bill Gates: Before Steve Jobs, the single most compelling man in tech. His humble, giving, quiet and incorruptible nature made it easy to want him to represent us.

5) Windows 95: Simply the first UI that was adopted by the masses. Yes, it was slow, but until then the command line was out of reach for most people.

6) Windows NT: Well before Google, someone came up with the idea of housing big ole mainframes on commodity hardware with portable user space code. IMHO it pushed Unix into what Linux is today

7) Importance of the Internet: Though Microsoft has surprisingly done very poorly in every facet of mobile computing, Bill Gates predicted this in 1995

8) Productivity of the United States: From 1980 to the year 2000, the US held a great advantage due to accessible computing

Okay, so there is probably a list double this size of why people don't like them, but why kick someone while they're down?

Had it not been for the antitrust suit in the early 2000s, where would they be today?