When You Call Yourself ThinkReliability…

By ThinkReliability Staff

While I was expounding on the 1989 Valdez oil spill, one of those ubiquitous internet fairies decided that I did not really need the network connection at my remote office.  Sadly, this meant that the attendees on my Webinar had to listen to me speaking without seeing the pretty diagrams I made for the occasion (after a short delay to switch audio mode).

Though I have all sorts of redundancies built into my Webinar presentations (seriously, I use a checklist every time), I had not prepared for the complete loss of network access, which is what happened during my March 20, 2014 Webinar.  I’m not going to use the term “root cause”, because I still had another plan . . . (yep, that failed, too).

For our mutual amusement (and because I get asked for this all the time), here is a Cause Map, or visual root cause analysis – the very method I was demonstrating during the failure – of what happened.

First we start with the what, when and where.  No who, because blame isn’t the point, though in this case I will provide full disclosure and clarify that I am, in fact, writing about myself.  The Webinar in question was presented on March 20, 2014 at 2:00 PM EST (although, to my great relief, the issues didn’t start until around 2:30 PM).  That little thorn in my side?  It was the loss of a network connection at the Wisconsin remote office (where I typically present from).  I was using Citrix Online’s GoToWebinar© program to present a root cause analysis case study of the Valdez oil spill online.

Next we capture the impact to the organization’s (in this case, ThinkReliability’s) goals.  Luckily, in the grand scheme of things, the impacted goals were pretty minor.  I annoyed a bunch of customers who didn’t get to see my slides, and I spent some time following up with those who were impacted, scheduling an additional Webinar, and writing this blog.

Next we start with the impacted goals and ask “Why” questions.  The customer service goal was impacted because of the interruption in the Webinar.  GoToWebinar© (like other online meeting programs) has two parts: audio and visual.  I temporarily lost audio because I was using the online option (VoIP), my default since I like my USB headset better than my wireless headset.  The other option is to dial in using the phone.  As soon as I figured out I had lost audio, I switched to the phone and was able to maintain the audio connection until the end of the Webinar (and after, for those lucky enough to hear me venting my frustration at my office assistant).

In addition to losing audio, I lost the visual screen-sharing portion of the Webinar.  Unlike audio, there’s only one option for this.  Screen sharing occurs through an online connection to GoToWebinar©.  Loss of that connection means there’s a problem with either the GoToWebinar© program or my network connection.  (I’ve had really good luck with GoToWebinar©; over the last 5 years I have used the program at least weekly, with only two connection problems attributed to Citrix.)  At this point I started running through my troubleshooting checklist.  I was able to reconnect to audio, so it seemed the problem was not with GoToWebinar©.  I immediately changed from my wired router connection to wireless, which didn’t help.  Meanwhile my office assistant checked the router and determined that it was not connected to the network.

You will quickly see that at this point I reached the end of my expertise.  I had my assistant restart the router, which didn’t work, at least not immediately.  At this point, my short-term connection attempts (“immediate solutions”) were over.  Router troubleshooting (beyond the restart) or a call to my internet provider were going to take far longer than I had on the Webinar.
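If the troubleshooting checklist described above were written down as decision logic, it might look something like the sketch below.  This is purely illustrative – the function and argument names are my own invention, not anything from GoToWebinar© – but it captures the inferences I was making in the moment.

```python
def diagnose(phone_audio_ok: bool, wireless_helps: bool,
             router_connected: bool) -> str:
    """Best-guess failure location from the Webinar troubleshooting checklist.

    phone_audio_ok:   did dialing in by phone restore audio?
    wireless_helps:   did switching from wired to wireless restore screen sharing?
    router_connected: does the router report a live network connection?
    """
    if not phone_audio_ok:
        # Can't even reach the service by phone; suspect the provider itself
        return "possible GoToWebinar outage"
    # Audio works, so the service is up; suspect the local connection
    if wireless_helps:
        return "wired link problem"
    if not router_connected:
        return "router/ISP outage"
    return "unknown local issue"


# The March 20 failure, as it played out: phone audio worked,
# wireless didn't help, and the router had lost its connection.
print(diagnose(phone_audio_ok=True, wireless_helps=False,
               router_connected=False))
```

In checklist form or code form, the key idea is the same: each successful check rules out one candidate cause and narrows the search.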

Normally there would have been one other possibility to save the day.  For online presentations, I typically have other staff members online to assist with questions and connection issues, who have access to the slides I’m presenting.  That presenter (and we have done this before) could take over the screen sharing while I continued the audio presentation.  However, the main office in Houston was unusually short-staffed last week (which is to say, most everyone was out visiting cool companies in exciting places).  And (yes, this was the salt in the wound) I had been out sick until just prior to the Webinar, so I didn’t do my usual coordination of ensuring I had someone online as my backup.

Because my careful plans failed me so completely, I scheduled another Webinar on the same topic.  (Click the graphic below to register.)  I’ll have another staff member (at another location) ready online to take over the presentation should I experience another catastrophic failure (or a power outage, which did not occur last week but would also result in complete network loss at my location).  Also, as was suggested by an affected attendee, I’ll send out the slides ahead of time.  That way, even if this exact series of unfortunate events should recur, at least everyone can look at the slides while I keep talking.

To view my comprehensive analysis of a presentation that didn’t quite go as planned, please click “Download PDF” above.  To view one of our presentations that will be “protected” by my new redundancy plans, please see our upcoming Webinar schedule.

Microsoft Withdrawing Support for Windows XP, Still Used by 95% of World’s 2.2 Million ATMs

By ThinkReliability Staff

On April 8, 2014, Microsoft will withdraw support for its XP operating system.  While this isn’t new news (Microsoft made the announcement in 2007), it’s quickly becoming an issue for the world’s automated teller machines (ATMs).  Of the 2.2 million ATMs in the world, 95% run Windows XP.  Of these, only about a third will be upgraded by the April 8th deadline.

The banks operating these machines then face a choice: upgrade to a newer operating system (which will have to be done eventually anyway), pay for extended support, or go it alone.  We can look at the potential consequences of each decision – and the reasons behind the choices – in a Cause Map, a visual form of root cause analysis.

First we look at the consequences, or the impacts to the goals.  The customer service goal is impacted by the potential exposure to security threats.  (According to Microsoft, it’s more than just potential.  Says Timothy Rains, Microsoft’s Director of Trustworthy Computing, “The probability of attackers using security updates for Windows 7, Windows 8, Windows Vista to attack Windows XP is about 100 per cent.”)  Required upgrades, estimated by security experts to cost each bank in the United Kingdom $100M (US), impact the production/schedule and property/equipment goals.  Lastly, if implemented, extended service/support contracts will impact the labor/time goal.  Though many banks have announced they will extend their contracts, the costs of such an extension are unclear and likely vary with particular circumstances.

As mentioned above, banks have a choice.  They can upgrade immediately, as will be required at some point anyway.  However, it’s estimated that most banks worldwide (about two-thirds) won’t make the deadline.  They will then continue to operate on XP, with or without an extended service/support contract.
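For a sense of scale, the figures quoted above imply roughly how many machines will still be running XP after the deadline.  This is back-of-the-envelope arithmetic only, and it assumes the two-thirds figure reported for banks applies evenly to their machines:

```python
# Figures from the article; the even-distribution assumption is mine.
total_atms = 2_200_000      # ATMs worldwide
xp_share = 0.95             # fraction running Windows XP
missed_deadline = 2 / 3     # fraction not upgraded by April 8, 2014

xp_atms = total_atms * xp_share
still_on_xp = xp_atms * missed_deadline

print(f"ATMs running XP: {xp_atms:,.0f}")            # about 2.09 million
print(f"Still on XP after deadline: {still_on_xp:,.0f}")  # about 1.39 million
```

In other words, something on the order of 1.4 million ATMs would be running an unsupported operating system on April 9th.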

Operating without an extended contract will create a high vulnerability to security risks – hackers and viruses.  It has been surmised that hackers will take security upgrades developed for other operating systems and reverse engineer them to find weaknesses in XP.  The downside of the extended contracts is the cost.

Given the risk of security issues with maintaining XP as an operating system, why haven’t more banks upgraded in the 7 years since Microsoft announced it would be withdrawing support?  There are multiple reasons.  First, because of the huge number of banks that still need to upgrade, experts available to assist with the upgrade are in short supply.  Second, many banks use proprietary software based on the operating system, so it’s not just the operating system that would need to be upgraded – so would many additional programs.

The many changes that banks have been dealing with as a result of the financial crisis may have also contributed to the delay.  (For more on the financial crisis, see our example page.)  Banks are having trouble implementing the many changes within the time periods specified.  Another potential cause is that banks may be trying to perform many upgrades together.  For example, some ATMs will move to a new operating system and begin accepting chip cards as part of the same upgrade.  (For more about the move towards chip cards, see our previous blog.)

Some banks are just concerned about such a substantial change.  “I ask these companies why they are using old software, they say ‘Come on, it works and we don’t want to touch that,'” says Jaime Blasco, a malware researcher for AlienVault.  The problem is, soon it won’t be working.

To view the Outline and Cause Map, please click “Download PDF” above.  Or click here to read more.

Cleaning up Fukushima Daiichi

By ThinkReliability Staff

The nuclear power plants at Fukushima Daiichi were damaged beyond repair during the earthquake and subsequent tsunami on March 11, 2011.  (Read more about the issues that resulted in the damage in our previous blog.)  The release of radioactivity resulting from these issues is ongoing and will end only after the plants have been decommissioned.  Decommissioning the nuclear power plants at Fukushima Daiichi will be a difficult and time-consuming process.  Both the process and the equipment being used are essentially being developed on the fly for this particular purpose.

Past nuclear incidents offer no help.  The Chernobyl reactor that exploded was entombed in concrete, not dismantled as is the plan for the reactors at Fukushima Daiichi.  The Three Mile Island reactor that overheated was defueled, but the pressure vessel and buildings in that case were not damaged, meaning the cleanup was of an entirely different magnitude.  Lake Barrett, the site director during the decommissioning process at Three Mile Island and a consultant on the Fukushima Daiichi cleanup, says that nothing like Fukushima has ever happened before.

An additional challenge?  Though the reactors have been shut down since March 2011, the radiation levels remain too high for human access (and will be for some time).  All access, including for inspection, has to be done by robot.

The decommissioning process involves 5 basic steps (though completing them will take decades).

First, an inspection of the site must be completed using robots.  These inspection robots aren’t your run-of-the-mill Roombas.  Because of the steel and concrete structures involved with nuclear power, wireless communication is difficult.  One survey robot got stuck in reactor 2 after its cable became entangled and damaged.  The next generation of survey robots unspools cable, takes up slack when it changes direction, and plugs itself in for a recharge.  This last feature is particularly important: not only can humans not access the reactor building, they can’t handle the robots after they’ve been in there.  The new robots should be able to perform about 100 missions before component failure – pretty impressive for a site where the hourly radiation dose can be the same as a cleanup worker’s annual limit (54 millisieverts an hour).
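To put that dose rate in perspective, here is a quick back-of-the-envelope calculation using only the figures quoted above (illustrative arithmetic, not a radiological analysis):

```python
# Figures from the article: the hourly dose inside the site can equal
# a cleanup worker's entire annual limit.
dose_rate_msv_per_hour = 54.0    # dose rate inside the reactor building
annual_limit_msv = 54.0          # cleanup worker's annual limit

hours_to_limit = annual_limit_msv / dose_rate_msv_per_hour
dose_per_minute = dose_rate_msv_per_hour / 60

print(f"Hours of exposure to reach the annual limit: {hours_to_limit:.1f}")  # 1.0
print(f"Dose accumulated per minute: {dose_per_minute:.2f} mSv")             # 0.90
```

A human worker would exhaust an entire year’s allowable dose in about an hour – which is why every one of those 100 robot missions replaces work no person could safely do.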

Second, internal surfaces will be decontaminated.  This requires even more robots, with different specialties.  One type of robot will clear a path for another, which will be outfitted with water and dry ice to be blasted at surfaces, removing the outer layer and the radiation with it.  The robots will then vacuum up and remove the radioactive sludge from the building.  The resulting sludge will have to be stored, though the plan for that storage is not yet clear.

Third, spent fuel rods will be removed, further reducing the radiation within the buildings.  A shielded cask is lowered with a crane-like machine, which then packs the fuel assemblies into the cask.  The cask is then removed and transported to a common pool for storage.  (The fuel assemblies must remain in water due to the decay heat still being produced.)

Fourth, radioactive water must be contained.  An ongoing issue with the Fukushima Daiichi reactors is the flow of groundwater through contaminated buildings.  (Read more about the issues with water contamination in a previous blog.)  First, the flow of groundwater must be stopped.  The current plan is to freeze soil to create a wall of ice and put in a series of pumps to reroute the water.  Then, the leaks in the pressure vessels must be found and fixed.  If the leaks can’t be fixed, the entire system may be blocked off with concrete.

Another challenge is what to do with the radioactive water being collected.  So far, over 1,000 tanks have been installed.  But these tanks have had problems with leaks.  Public sentiment is against releasing the water into the ocean, though the contamination is low and of a form that poses a “negligible threat”.  The alternative would be using evaporation to dispose of the water over years, as was done after Three Mile Island.

Finally, the remaining damaged nuclear material must be removed.  More mapping is required to determine the location of the melted fuel.  This fuel must then be broken up using long drills capable of withstanding the radiation that will still be present.  The debris will then be loaded into more shielded casks and taken to a storage facility, the location of which is yet to be determined.  The operator of the plant estimates this process will take at least 20 years.

To view the Process Map laid out visually, please click “Download PDF” above.  Or click here to read more.

Dangerous Combination: Propane Shortages and a Bitterly Cold Winter

By Kim Smiley

Propane shortages and skyrocketing prices in parts of the United States have made it difficult for some homeowners to affordably and consistently heat their homes this winter.  The brutally cold winter many regions are experiencing is also worsening both the causes and effects of the shortages.

A Cause Map can be built to help understand this issue.  A Cause Map is a visual format for performing a root cause analysis that intuitively lays out the contributing causes to show the cause-and-effect relationships.  To view a high-level Cause Map of this issue, click on “Download PDF” above.

Why have there been recent propane shortages in regions of the United States?  This question is particularly interesting given that propane production in the United States has increased 15 percent in the past year.  One of the reasons propane prices have dramatically increased is a spike in demand.  There was a larger-than-normal grain crop this fall, which was also wetter than usual.  Wet grain must be dried before storage to prevent spoiling, and propane is used in the process.  Local propane supplies were depleted in some areas because five times more propane was used to dry crops this year than last.  About 5 percent of homes in the United States depend on propane for heat, and the unusually frigid temperatures this winter have resulted in additional increases in propane demand.

In addition to the increase in demand, there have been issues replenishing local supplies of propane quickly enough to support the increased demand.  There have been some logistical problems transporting propane this winter.  The Cochin pipeline was out of service for repairs, limiting how quickly propane could be transported to areas experiencing shortages.  There were rail rerouting issues that impacted shipments from Canada.

Additionally, many are asking what role propane exports have played in the domestic shortages.  Propane exports have quadrupled in the last 3 years.  New extraction techniques and improved infrastructure have made exporting propane to foreign markets more lucrative, and companies have begun to ship more propane overseas.  As more propane is shipped to foreign markets, there is less available for use in the United States.

The propane shortages are an excellent example of supply and demand in action.  Increasing demand combined with decreasing supply results in higher prices.  Unfortunately, addressing the problem isn’t simple.  There are very complex logistical and economic issues that need to be addressed, but if people don’t have access to affordable heating, the situation can quickly become dangerous, or even deadly.  In the short term, lawmakers are taking a number of steps to get propane shipped to the impacted areas, but how the US chooses to deal with this issue in the long term is still being debated.