An apparent software error in a networking device at Amazon Web Services’ Northern Virginia data center led to a brief service interruption on August 25 for several highly trafficked services, including Instagram, Vine, Airbnb, Flipboard and some Netflix services. The incident is the latest in a string of Amazon outages that have raised concerns about the reliability of cloud service providers.
The outage, which by varying accounts led to intermittent service for anywhere from 49 minutes to several hours, was due to the “partial failure of a networking device,” Amazon said after the incident. According to the BBC, Amazon was investigating a series of issues in its databases, the software that spreads queries across its servers and the code underlying core servers.
The incident came on the heels of an Amazon outage the previous week that took the Web giant’s own ecommerce site offline for about 25 minutes. Some estimates pegged the losses during that window as high as $1,100 in sales per second, according to ZDNet. A software error that leads to an outage could also fuel ongoing speculation about the uptime of cloud services, Businessweek noted, highlighting the words of Amazon engineer James Hamilton.
“Inside a single facility, there are simply too many ways to shoot one’s own foot,” Hamilton wrote on his personal blog.
With the myriad other challenges facing data center administrators, a software failure that takes out a critical piece of equipment is a risk to be avoided at all costs. Equipment and networking solution vendors should look to minimize errors in embedded software during development by using source code analysis tools such as static analysis software. By catching bugs before the software ships, manufacturers can help avoid service outages down the line.
Software news brought to you by Klocwork Inc., dedicated to helping software developers create better code with every keystroke.