Stephen Goldsworth is systems engineering group manager for Tesco Controls Inc.
Supervisory control and data acquisition (SCADA) systems are integral to any modern water collection, treatment or distribution operation. SCADA systems may consist of a few local controllers and operator interfaces or may be far more complex configurations that include networking, radio telemetry, distributed computers and other elements. These systems invariably combine multiple hardware, software, networking and communications technologies, which can make them difficult to maintain in perfect working order.
Figure 1 indicates some root causes of unplanned outages in 2016 for data centers. Cyber attacks account for a notable percentage, but perhaps not as high a share as one might expect based on recent reports.
Whether there are isolated failures or a larger disaster strikes, every facility needs to develop and follow a comprehensive disaster recovery (DR) plan that includes backup and recovery strategies, as well as methods for testing and continuous evolution of the plan. DR plans also are useful for restoring operation after planned downtime for maintenance and upgrades.
Business Life Cycle Continuity
Without a comprehensive DR plan and proactive testing, SCADA system recovery will be slow, disorganized and expensive. The goal is to maintain business continuity so critical services can quickly be restored. Creating a DR plan is one step in the never-ending business continuity life cycle (Figure 2).
The following sections examine how this life cycle approach applies to SCADA systems, usually beginning with the “identify” step.
Identify Existing Conditions
Documentation: Accurate as-built documentation is an often-neglected aspect of automation projects. Sometimes the design information is not properly updated when construction and commissioning are completed, or modifications and upgrades are performed without permanent documentation updates. Weak documentation hinders daily operation and maintenance efforts, and thwarts quick recovery from more serious problems. Documentation includes any sort of specifications, drawings, databases or anything else used to design, build, configure and test the automation system.
Risk Assessment: As challenging as it can be to document the existing system, it can be even more difficult to assess the risks. Every asset must be evaluated as critical or non-critical by a multi-discipline team, with consideration as to whether equipment can be operated manually in the event of SCADA failure. Some risks are internal, like network cabling failures on site, while others are external, like loss of utility power. Single points of failure (SPOF) deserve specific attention—whether they are related to mechanical, electrical or automation elements—because SPOF issues cause the majority of outages.
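One way to screen the automation network for single points of failure is to remove each device in turn and check whether the critical assets can still be reached. The sketch below does this with a breadth-first search. The topology and device names are hypothetical, and a real assessment would also cover power, mechanical and electrical dependencies.

```python
from collections import deque


def reachable(links, start, removed=None):
    """Return the set of nodes reachable from `start`, skipping `removed`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, ()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def single_points_of_failure(links, source, critical):
    """List devices whose loss cuts `source` off from any critical asset."""
    spofs = []
    for candidate in links:
        if candidate == source or candidate in critical:
            continue
        still_reachable = reachable(links, source, removed=candidate)
        if not set(critical) <= still_reachable:
            spofs.append(candidate)
    return sorted(spofs)


# Hypothetical topology: undirected links are listed in both directions.
links = {
    "scada_server": ["switch_a", "switch_b"],
    "switch_a": ["scada_server", "radio_master"],
    "switch_b": ["scada_server", "radio_master"],
    "radio_master": ["switch_a", "switch_b", "plc_pump_station"],
    "plc_pump_station": ["radio_master"],
}
spofs = single_points_of_failure(links, "scada_server", ["plc_pump_station"])
print(spofs)
```

In this made-up layout the two switches back each other up, so only the radio master is flagged.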
Analyze Goals & Costs
Once the SCADA system is documented and failure risks are known, consequences can be evaluated. Outages cause a loss of revenue, service disruptions to customers and decreased productivity. Other possible impacts include violating laws or compliance regulations and the inability to meet service-level agreements.
Water and wastewater end users must identify recovery goals and costs. Obvious goals are to minimize the number and duration of disruptions, prevent damage and plan alternative methods of preserving operation. Installing redundancy provisions, developing a DR plan, training personnel, performing preventative maintenance and periodic testing all support smooth and rapid restoration of service; however, each comes with a cost. Comparing the cost of disruption and the cost of recovery is part of a business impact analysis, revealing the acceptable cost balance point (Figure 3).
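The cost comparison in a business impact analysis can be illustrated with a small calculation: for each candidate recovery strategy, add the yearly cost of providing it to the expected yearly cost of the disruption it leaves behind, then look for the lowest total. This is only a sketch. The strategy names, costs, outage frequency and hourly disruption cost are invented for illustration, and a real analysis would include factors such as regulatory penalties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    annual_cost: float      # yearly cost of providing the recovery capability
    recovery_hours: float   # expected time to restore service per outage


def total_annual_cost(strategy, outages_per_year, disruption_cost_per_hour):
    """Recovery cost plus the expected cost of the remaining downtime."""
    disruption = (
        outages_per_year * strategy.recovery_hours * disruption_cost_per_hour
    )
    return strategy.annual_cost + disruption


def best_strategy(strategies, outages_per_year, disruption_cost_per_hour):
    """Pick the strategy with the lowest combined yearly cost."""
    return min(
        strategies,
        key=lambda s: total_annual_cost(
            s, outages_per_year, disruption_cost_per_hour
        ),
    )


# Hypothetical figures, for illustration only.
strategies = [
    Strategy("cold spares only", 10_000, 48),
    Strategy("warm standby", 40_000, 8),
    Strategy("hot standby at second site", 150_000, 0.5),
]
choice = best_strategy(
    strategies, outages_per_year=2, disruption_cost_per_hour=2_000
)
choice_cost = total_annual_cost(choice, 2, 2_000)
print(choice.name, choice_cost)
```

With these assumed numbers the middle option has the lowest total, which mirrors the idea of a balance point between the two cost curves.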
Extensive redundancy provisions can significantly reduce the disruption time but can be expensive additions to a SCADA system. Costs include initial hardware and software expenses, ongoing licensing and support, and labor. Once the cost of disruption is understood, it is possible to design strategies to reduce downtime.
Design Redundant Strategies
While a basic failure could entail just one component going bad, a disaster-level event, such as a fire or flood, is much larger and could disable significant amounts of the SCADA system. The most resilient systems are designed with built-in redundancy in multiple areas, including additional geographic locations.
The primary production system on site can be backed up by an entire parallel secondary SCADA system, including computers, networks and power (Figure 4). In addition to operational backups, end user system administrators should create and archive periodic backups of all component configuration files, including software installation files, firmware, virtual machines, programmable logic controller (PLC) code, SCADA applications, network equipment configurations and historical data.
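Periodic archiving of configuration files can be as simple as bundling a folder into a timestamped archive. The following is a minimal sketch using only the Python standard library. The folder layout, the `scada-config-` file name prefix and the sample switch configuration are assumptions made for the example, not part of any vendor tool.

```python
import tarfile
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def archive_configs(source_dir, archive_dir, now=None):
    """Bundle `source_dir` into a timestamped .tar.gz under `archive_dir`."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    archive_path = archive_dir / f"scada-config-{stamp}.tar.gz"
    with tarfile.open(archive_path, "w:gz") as archive:
        # Store everything under a fixed top-level folder name.
        archive.add(str(source_dir), arcname="config")
    return archive_path


# Demonstration in a throwaway folder with a hypothetical switch config.
with tempfile.TemporaryDirectory() as workspace:
    source = Path(workspace) / "live"
    source.mkdir()
    (source / "switch01.cfg").write_text("hostname switch01\n")
    created = archive_configs(
        source,
        Path(workspace) / "archives",
        now=datetime(2024, 1, 15, 6, 0, tzinfo=timezone.utc),
    )
    with tarfile.open(created) as archive:
        archived_names = sorted(archive.getnames())
    created_name = created.name

print(created_name, archived_names)
```

Timestamped names keep earlier archives from being overwritten, so older known-good configurations remain available.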
No single architecture is superior; rather, many factors demand consideration when developing an appropriate architecture for a system. For example, a secondary system could be installed on the same site and in the same building, but it is far better to install the secondary system in a separate building some distance away if the appropriate security and environmental conditions are available.
Varying levels of redundant functionality are possible. “Hot” backups at secondary sites can take over automatically, whereas “warm” backups require some user intervention. “Cold” backups are another option, but they typically require more time and effort to bring online. Virtualized infrastructure can add resiliency, make management of geographically distributed systems easier and improve recovery strategies by backing up entire host computers as data.
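The difference between hot, warm and cold backups shows up in what happens once the primary system stops responding. The sketch below maps each level to a response after a missed heartbeat. The heartbeat mechanism, the 10-second timeout and the action wording are assumptions for illustration; actual SCADA platforms implement failover in their own ways.

```python
from enum import Enum


class StandbyMode(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"


def failover_action(mode, seconds_since_heartbeat, timeout_seconds=10.0):
    """Describe the response when the primary's heartbeat goes quiet."""
    if seconds_since_heartbeat <= timeout_seconds:
        return "primary healthy: no action"
    if mode is StandbyMode.HOT:
        return "promote secondary automatically"
    if mode is StandbyMode.WARM:
        return "alert operator to promote secondary"
    return "start restore procedure from backups"


for mode in StandbyMode:
    print(mode.value, "->", failover_action(mode, seconds_since_heartbeat=30))
```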
What Role Does the Cloud Play?
It is possible to host primary and secondary systems in the cloud. However, this method binds the end user to services they cannot directly control and introduces cybersecurity concerns, so this should be approached with caution. The cloud could be an excellent additional option for backup and data storage.
One of the obvious redundancy constraints is cost. Other variables factoring into architecture design include:
• Hardware/software platform age and functional capabilities;
• Process requirements;
• Site distribution and geography; and
• Networking and telemetry.
A full redundancy plan should address multiple components, including the SCADA servers, I/O servers, historians, alarm notification systems, PLCs and operator workstations. It should also take a high-availability approach to networking devices, including routers, switches, firewalls and RF telemetry infrastructure, and even to ISP services that are leveraged for remote access or communication between automation system components. The typical goal for networking is to provide alternate communication paths for critical components.
Back Up the Data
The settings of any configurable system devices must be backed up, and this can include field instruments, variable speed drives, controllers, PCs and network components. Software backups may include operating systems, drivers, applications, control programs, visualization configurations, historical data, alarm/event logs and more.
There are sometimes automated provisions for backups, but many must be handled manually. Virtualization can facilitate PC-based backups, and some SCADA vendors offer native backup functionality.
Storage media and locations are important. End users should strive to follow the software and data backup 3-2-1 rule:
• Create three copies (one primary and two backup);
• Store copies on at least two different media types (hard drive, tape, cloud, etc.); and
• Keep one of those copies off site.
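The 3-2-1 rule lends itself to a simple automated check against a list of known copies. The sketch below is one possible way to express it. The `Copy` record and the sample entries are hypothetical, and "one off site" is interpreted here as at least one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    label: str
    media: str       # for example "hard drive", "tape" or "cloud"
    off_site: bool


def check_3_2_1(copies):
    """Report which parts of the 3-2-1 rule a set of copies satisfies."""
    return {
        "three_copies": len(copies) >= 3,
        "two_media_types": len({c.media for c in copies}) >= 2,
        "one_off_site": any(c.off_site for c in copies),
    }


# Hypothetical set of copies for one data set.
plan = [
    Copy("historian primary", "hard drive", off_site=False),
    Copy("nightly backup", "tape", off_site=False),
    Copy("weekly backup", "cloud", off_site=True),
]
result = check_3_2_1(plan)
weak = check_3_2_1(plan[:2])  # drop the cloud copy to see the rule fail
print(result)
print(weak)
```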
Creating procedures dictating who performs backups, what is backed up, where it is backed up to and how often backups are performed is critical for a successful DR plan. The next part of the life cycle is documenting all this information.
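A backup procedure that records who, what, where and how often can also be checked automatically for missed backups. The following sketch assumes a simple task record with those four fields plus the time of the last run. All task names, owners, destinations and dates are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class BackupTask:
    what: str             # item being backed up
    who: str              # role responsible for the backup
    where: str            # destination of the backup
    interval: timedelta   # how often the backup is due
    last_run: datetime    # when it was last completed


def overdue(tasks, now):
    """Return the items whose last backup is older than their interval."""
    return [t.what for t in tasks if now - t.last_run > t.interval]


# Hypothetical schedule, checked at a fixed point in time.
now = datetime(2024, 3, 1, 8, 0)
tasks = [
    BackupTask("PLC code", "controls technician", "file server",
               timedelta(days=7), datetime(2024, 2, 20)),
    BackupTask("historian data", "system administrator", "tape library",
               timedelta(days=1), datetime(2024, 3, 1, 2, 0)),
    BackupTask("network switch configurations", "network administrator",
               "off-site storage", timedelta(days=30), datetime(2024, 1, 15)),
]
late = overdue(tasks, now)
print(late)
```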
Create & Execute the DR Plan
Once the conditions are identified, the risks are assessed and the strategies are determined, the DR plan can be created, resulting in materials to guide personnel on what to do during various failure types. Due to the wide variety of technologies and operations involved, it is important to define the roles and responsibilities necessary to execute the plan and the associated levels of authority. In basic terms, who is going to do what?
The DR plan must contain or point to all necessary system documentation needed for recovery, and it requires clear technical procedures to restore any failed SCADA elements. In addition, it should provide guidance regarding how to continue operating certain processes in a manual mode during a DR event. Other useful information may include supporting vendor contacts.
Beyond the actual recovery steps, a DR plan also must offer direction on how to periodically maintain SCADA hardware and software and how to test the redundancy and backup systems. This also includes procedures to verify the backup and restoration methods are valid.
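One basic verification procedure is to record a checksum for every file when a backup is made and compare it later, before the backup is needed. The sketch below uses SHA-256 from the Python standard library. The file names and contents are placeholders, and a matching checksum only shows that a file is readable and unchanged, so it complements a trial restore rather than replacing one.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large backups do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to `folder`) to its checksum."""
    folder = Path(folder)
    return {
        str(p.relative_to(folder)): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def verify(folder, manifest):
    """Return the files that are missing or no longer match the manifest."""
    current = build_manifest(folder)
    return sorted(
        name for name, digest in manifest.items()
        if current.get(name) != digest
    )


# Demonstration with placeholder files in a throwaway folder.
with tempfile.TemporaryDirectory() as workspace:
    backup = Path(workspace)
    (backup / "plc_program.bin").write_bytes(b"ladder logic export")
    (backup / "hmi_screens.cfg").write_bytes(b"screen definitions")
    manifest = build_manifest(backup)
    clean = verify(backup, manifest)
    (backup / "plc_program.bin").write_bytes(b"corrupted")
    damaged = verify(backup, manifest)

print(clean, damaged)
```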
Measure by Training, Testing & Maintaining
The last step of the business continuity life cycle is to measure the effectiveness of the SCADA DR plan, which is best done under controlled conditions rather than during a crisis. This helps ensure the team is ready in the event of a failure.
Make sure staff are trained on the DR plan, and then prove out both the training and the plan itself by executing test scenarios and attempted recoveries. Good test plans will use actual backup media to confirm they are viable and will proactively exercise redundant components, systems, and sites by triggering failovers from primary to backup elements. Redundancy systems and elements must be maintained just like any other mechanical or electrical equipment.
Close the cycle by updating the DR plan and other documentation as discrepancies are discovered or changes are made. Be aware of and capture subtle underlying changes like software, firmware or hardware upgrades. Recognize changes coming from any source, whether they are a major process area reconfiguration or a small utility system upgrade.
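Subtle changes such as firmware upgrades are easier to capture when the documented inventory is compared against what is actually installed. The sketch below compares two simple version listings. The asset names and version strings are hypothetical, and how the observed versions are collected is left out.

```python
def inventory_changes(documented, observed):
    """Map each changed asset to its (documented, observed) versions."""
    changes = {}
    for asset in sorted(set(documented) | set(observed)):
        before = documented.get(asset)   # None if not yet documented
        after = observed.get(asset)      # None if no longer present
        if before != after:
            changes[asset] = (before, after)
    return changes


# Hypothetical inventories, for illustration only.
documented = {
    "plc_01 firmware": "2.1",
    "scada_server os": "2019",
    "switch_03 firmware": "15.2",
}
observed = {
    "plc_01 firmware": "2.3",
    "scada_server os": "2019",
    "switch_03 firmware": "15.2",
    "radio_07 firmware": "1.0",
}
changes = inventory_changes(documented, observed)
print(changes)
```

Each entry in the result points to a place where the DR plan or its supporting documentation may need an update.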
Seek DR Plan Expertise From a System Integrator
Developing a DR plan can be daunting because of its scope, and a comprehensive plan for a SCADA system requires a large team to develop.
Because of this, end users should consider engaging a system integrator that understands the automation technologies, the processes being automated and proven best practices necessary for creating an effective DR plan. A trusted system integration partner with the right experience can help end users build a solid DR plan with the right balance of technical options addressing the cost impacts.