Power grid near Google datacenter struck by lightning 4 times

By Kim Smiley

A small amount of data was permanently lost at a Google datacenter after lightning struck the nearby power grid four times on August 13, 2015. About five percent of the disks in Google’s Europe-west1-b cloud zone datacenter were affected by the lightning strikes, but nearly all of the data was eventually recovered; less than 0.000001% of the stored data could not be recovered.
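To give a sense of scale for that percentage, here is a minimal sketch of the arithmetic. It assumes the figure refers to a share of stored bytes, and the one-terabyte amount is purely illustrative rather than a number from Google; the names `LOSS_DENOMINATOR`, `max_bytes_lost` and `ONE_TERABYTE` are made up for this example.

```python
# Upper bound quoted in the article: less than 0.000001% of stored data was lost.
# 0.000001% = 0.000001 / 100 = 1 part in 100,000,000 (10**8).
LOSS_DENOMINATOR = 10**8


def max_bytes_lost(stored_bytes: int) -> int:
    """Upper bound on unrecoverable bytes for a given amount of stored data."""
    return stored_bytes // LOSS_DENOMINATOR


# A decimal terabyte, chosen only for illustration (not Google's figure).
ONE_TERABYTE = 10**12

print(max_bytes_lost(ONE_TERABYTE))  # 10000
```

Under that assumption, fewer than 10,000 bytes would be unrecoverable for every terabyte stored.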

A Cause Map, or visual root cause analysis, can be built to analyze this issue. The first step in the Cause Mapping process is to fill in an Outline with the basic background information such as the date, time and specific equipment involved. The bottom of the Outline has a spot to list the impacted goals to help define the scope of an issue. The impacted goals are then used to begin building the Cause Map. The impacted goals are listed in red boxes on the Cause Map and the impacts are the first cause boxes on the Cause Map. Why questions are then asked to add to the Cause Map and visually lay out the cause-and-effect relationships.

For this example, the customer service goal was impacted because some data was permanently lost. Why did this happen? Data was lost because datacenter equipment failed, this particular data was stored on a less stable system, and the data wasn’t duplicated in another location. Google has stated that the lost data was newly written data that was located on storage systems which were more susceptible to power failures. The datacenter equipment failed because the nearby power grid was struck by lightning four times and was damaged. Additionally, the automatic auxiliary power systems and backup battery were not able to prevent data loss after the lightning damage.

When more than one cause is required to produce an effect, all the causes are listed vertically and separated by an “and”. You can click on “Download PDF” above to see a high level Cause Map of this issue that shows how an “and” can be used to build a Cause Map. A more detailed Cause Map could be built to include all the technical details of exactly why the datacenter equipment failed. This would be useful to the engineers developing detailed solutions.
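The “and” relationship can also be sketched as a small data structure. The example below is an illustration only, not the published Cause Map: the `Cause` class and `render` function are hypothetical names, and the placement of each cause is one reading of the paragraphs above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cause:
    """One box on the map; its `causes` were all required (joined by AND)."""
    text: str
    causes: list[Cause] = field(default_factory=list)


def render(node: Cause, depth: int = 0) -> list[str]:
    """Return indented lines, writing AND between causes that share an effect."""
    lines = ["    " * depth + node.text]
    for i, cause in enumerate(node.causes):
        if i > 0:
            lines.append("    " * (depth + 1) + "AND")
        lines.extend(render(cause, depth + 1))
    return lines


# Placement of causes is an assumption based on the article's description.
cause_map = Cause(
    "Customer service goal impacted",
    [Cause("Some data permanently lost", [
        Cause("Datacenter equipment failed", [
            Cause("Nearby power grid struck by lightning four times"),
        ]),
        Cause("Data stored on a less stable system"),
        Cause("Data not duplicated in another location"),
        Cause("Auxiliary power and backup battery did not prevent data loss"),
    ])],
)

print("\n".join(render(cause_map)))
```

Reading the output from top to bottom corresponds to asking “why” at each level of indentation.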

The final step in the Cause Mapping process is to develop solutions to reduce the risk of a problem recurring in the future. For this example, Google has stated that they are upgrading the datacenter equipment so that it is more robust if a similar event occurs in the future. Google also stated that customers should back up essential data so that it is stored in another physical location to improve reliability.

Few of us probably design datacenter storage systems, but this incident is a good reminder of the importance of having a backup. If data is essential to you or your business, make sure there is a backup that is stored in a physically separate location from the original. As the “unsinkable” Titanic showed, it is always a good idea to include enough lifeboats or backups in a design in case something you didn’t expect goes wrong. Sometimes lightning strikes four times, so it’s best to be prepared.

Explosions raise concern over hazardous material storage

By ThinkReliability Staff

On August 12, a fire began at a storage warehouse in Tianjin, China. More than a thousand firefighters were sent in to fight the fire. About an hour after the firefighters went in, two huge explosions registered on the earthquake measurement scale (2.3 and 2.9, respectively). Follow-on explosions continued and at least 114 firefighters, workers and area residents have been reported dead so far, with 57 still missing (at this point, most are presumed dead).

Little is known for sure about what caused the initial fire and continuing explosions. What is known is that the fire, explosions and release of hazardous chemicals that were stored on site have caused significant impacts to the surrounding population and rescuers. These impacts can be used to develop cause-and-effect relationships to determine the causes that contributed to an event. It’s particularly important in an issue like this – where so many were adversely affected – to find effective solutions to reduce the risk of a similar incident recurring in the future.

Even with so much information unavailable, an initial root cause analysis can identify many issues that led to an adverse event. In this case, the cause of the initial fire is still unknown, but the site was licensed to handle calcium carbide, which releases flammable gases when exposed to water. If the chemical was present on site, the fire would have continued to spread when firefighters attempted to fight it using water. Contract firefighters, who are described as being young and inexperienced, have said that they weren’t adequately trained for the hazards they faced. Once the fire started, it likely ignited explosive chemicals, including the 800 tons of ammonium nitrate and 500 tons of potassium nitrate stored on site.

Damage to the site released those and other hazardous chemicals. More than 700 tons of sodium cyanide were reported to be stored at the site, though it was only permitted 10 tons at a time. Sodium cyanide is a particular problem for human safety. Says David Leggett, a chemical risk consultant, “Sodium cyanide is a very toxic chemical. It would take about a quarter of teaspoon to kill you. Another problem with sodium cyanide is that it can change into prussic acid, which is even more deadly.”

But cleaning up the mess is necessary, especially because there are residents living within 2,000 ft. of the site, despite regulations requiring hazardous sites to be at least 3,200 ft. away from residential areas. Developers who built an apartment building within the exclusion zone say they were told the site stored only common goods. Rain could make the situation worse, both by spreading the chemicals and because the released chemicals could react with water.

The military has taken over the response and cleanup. Major General Shi Luze, chief of the general staff of the military region, said, “After on-site inspection, we have found several hundred tons of cyanide material at two locations. If the blasts have ripped the barrels open, we neutralize it with hydrogen peroxide or other even better methods. If a large quantity is already mixed with other debris, which may be dangerous, we have built 1-meter-high walls around it to contain the material — in case of chemical reactions if it rains. If we find barrels that remain intact, we collect them and have police transport them to the owners.”

A team of hazardous materials experts has been sent in to neutralize and/or contain the chemicals, and the public is being kept out of the area to limit further impact to public safety. State media also said that efforts were under way to prevent rain from falling, presumably using the same strategies developed to ensure clear skies for the 2008 Summer Olympics. Whether those efforts worked hasn’t been said, but it did rain on August 18, nearly a week after the blast, leaving white foam that residents have said creates a burning or itchy sensation on contact.

View an initial Cause Map of the incident by clicking on “Download PDF” above.

Legionnaires’ Disease Outbreak Blamed on Contaminated Cooling Towers

By ThinkReliability Staff

An outbreak of Legionnaires’ disease has affected at least 115 and killed 12 in the South Bronx area of New York City. While Legionnaires’, a respiratory disease caused by breathing in vaporized Legionella bacteria, has struck the New York City area before, the magnitude of the current outbreak is catching the area by surprise. (Because the vaporization is required, drinking water is safe, as is home air conditioning.) It’s also galvanizing a call for actions to better regulate the causes of the outbreak.

It’s important when dealing with an outbreak that affects public health to fully analyze an issue to determine all the causes that contributed to the problem. In the case of the current Legionnaires’ outbreak, our analysis will be performed in the form of a Cause Map, or visual root cause analysis. We begin by capturing the basic information (what, when and where) about the issue in a problem outline. Because the issue unfolded over months, we will reference the timeline (to view the analysis including the timeline, click on “Download PDF”) to describe when the incident occurred. Some important differences to note – people with underlying medical conditions and smokers are at a higher risk from Legionnaires’, and Legionella bacteria are resistant to chlorine. Infection results from breathing in contaminated mist, which has been determined to have come from South Bronx area cooling towers (which are part of the air conditioning and heating systems of some large buildings).

Next we capture the impact to the goals. The safety goal is impacted due to the 12 deaths and the 115 people who have been infected. The customer service goal is impacted by the outbreak of Legionnaires’. The environmental and property goals are impacted because at least eleven cooling towers in the area have been found to be contaminated with Legionella. The issue is resulting in increased regulation, an impact to the regulatory goal, and in testing and disinfection, which is being performed by at least 350 workers and is an impact to the labor goal.

The analysis begins by asking “why” questions from one of the impacted goals. In this case, the deaths resulted from an outbreak of Legionnaires’ disease. The outbreak results from exposure to mist from one of the contaminated cooling towers. The design of some cooling towers allows exposure to the mist produced. It is common for water sources to contain Legionella (which again, is resistant to chlorine) but certain conditions allow the bacteria to “take root”: the damp warm environment found in cooling towers and insufficient cleaning/disinfection. The cost of cleaning is believed to be an issue – studies have found that impoverished areas, like the one affected by this outbreak, are more prone to these types of outbreaks. Additionally, there are insufficient regulations regarding cooling towers. The city does not regularly inspect cooling towers. According to the mayor and the city’s deputy commissioner for disease control, there just hasn’t been enough evidence to indicate that cooling towers are a potential source of Legionnaires’ outbreaks.

Evidence would indicate otherwise, however. A study that researched risk factors for Legionnaires’ in New York City from 2002-2011 specifically indicated that proximity to cooling towers was an environmental risk. A 2010 hearing on indoor air quality discussed Legionella after a failed resolution in 2000 to reduce outbreaks at area hospitals. New York City is no stranger to Legionnaires’; the first outbreak occurred in 1977, just after Legionnaires’ was identified. There have been two previous outbreaks of Legionnaires’ this year. Had there been a look at other outbreaks, such as the 2012 outbreak in Quebec City, cooling towers would have been identified as a definite risk factor.

For now, though the outbreak appears to be waning (no new cases have been reported since August 3), the city is playing catch-up. Though they are requiring all cooling towers to be disinfected by August 20 and plan to increase inspections, right now there isn’t even a list of all the cooling towers in the city. Echoing the frustrations of many, Bill Pearson, member of the committee that wrote standards to address the risk of Legionella in cooling towers, says “Hindsight is 20-20, but it’s not a new disease. And it’s not like we haven’t known about the risk of cooling towers, and it’s not like people in New York haven’t died of Legionnaires’ before.”

Ruben Diaz Jr., Bronx borough president, brings up a good point for the cities that may have Legionella risks from cooling towers: “Why, instead of doing a good job responding, don’t we do a good job proactively inspecting?” Let’s hope this outbreak will be a call for others to learn from these tragic deaths, and take a proactive approach to protecting their citizens from Legionnaires’ disease.

Unintended Consequences, Serendipity, and Prawns

By ThinkReliability Staff

The Diama dam in Senegal was installed to create a freshwater reservoir. Unfortunately, that very dam also led to an outbreak of schistosomiasis. This was an unintended consequence: a negative result from something meant to be positive. Schistosomiasis, which weakens the immune system and impairs the operation of organs, is transmitted by parasitic flatworms. These parasitic flatworms are hosted by snails. When the dam was installed, the snails’ main predators lost a migration route and died off. Keeping the saltwater out of the river allowed algae and plants that feed the snails to flourish. The five why analysis of the issue would go something like this: The safety goal is impacted. Why? Because of an outbreak of schistosomiasis. Why? Because of the increase in flatworms. Why? Because of the increase in snails. Why? Because of the lack of snail predators. Why? Because of the installation of the dam.
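The chain above can be written down as a simple ordered list, where each entry is explained by the one after it. This is only a sketch: `five_whys` and `explain` are hypothetical names, and the wording of each step is condensed from the paragraph above. A flat list like this can hold just one cause per effect.

```python
# Each entry is the effect of the entry that follows it.
five_whys = [
    "The safety goal is impacted",
    "An outbreak of schistosomiasis",
    "The increase in flatworms",
    "The increase in snails",
    "The lack of snail predators",
    "The installation of the dam",
]


def explain(chain: list[str]) -> list[str]:
    """Pair each effect with the single cause listed directly after it."""
    return [
        f"{effect}. Why? Because of: {cause}."
        for effect, cause in zip(chain, chain[1:])
    ]


for statement in explain(five_whys):
    print(statement)
```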

Clearly, there’s more to it. We can capture more details about this issue in a Cause Map, or visual form of root cause analysis. First, it’s important to capture the impact to the goals. In this case, the safety goal is impacted because of a serious risk to health and the environmental goal is impacted due to the spread of parasitic flatworms. The customer service goal (if we consider customers as all those who get water from the reservoir created by the dam) is impacted due to the outbreak of schistosomiasis.

Beginning with the safety goal, we can ask why questions. Instead of including just one effect, we include all effects to create a map of the cause-and-effect relationships. The serious risk to health is caused by the villagers suffering from schistosomiasis, which can cause serious health impacts. The villagers are infected with schistosomiasis and do not receive effective treatment. Not all those infected are receiving drugs due to cost and availability concerns. The drugs do not reverse the damage already done. And, most importantly, even those treated are quickly reinfected as they have little choice but to continue to use the contaminated water.

The outbreak of schistosomiasis is caused by the spread of parasitic flatworms, which carry the disease. The increase in flatworms is caused by the increased population of snails, which host the flatworms. The snail population increased after the installation of the dam killed off their predators and increased their food supply.

Many solutions to this issue were attempted and found to be less than desirable. Administering medication for treatment on its own wasn’t very effective, because (as described above) the villagers kept getting reinfected. The use of molluscicide killed off other animals in the reservoir as well. Introducing crayfish to eat the snails was derided by environmentalists as they were considered an invasive species. But they were on the right track. Now, a team is studying the reintroduction of the prawns which ate the snails. During the pilot study, the rates of schistosomiasis decreased. In addition, the prawns will serve as a valuable food source. This win-win solution is an example of serendipity and should actually return money to the community. Says Michael Hsieh, the project’s principal investigator and an assistant professor of urology, “The broad potential of this project is validation of a sustainable economic solution that not only addresses a major neglected tropical disease, but also holds the promise of breaking the poverty cycle in affected communities.”

Introducing animals to get rid of other animals can be problematic, as Macquarie Island discovered when they introduced cats to eat their exploding rodent population (who ate the native seabirds). (Click here to read more about Macquarie Island.) Further research is planned to ensure the project will continue to be a success. To learn more about the project, click here. Or, click “Download PDF” to view an overview of the Cause Map.