C3S Resilience Preventive Maintenance

The objective of Resilience Preventive Maintenance (Resilience PM) is to systematically check that a mission-critical system operates properly, and in particular that its failover mechanism does. The failover mechanism is essential to the resilience of the system, because continuity of operation depends on a successful failover from the primary to the secondary should a fault occur in the primary.

In Resilience PM, we introduce dynamic time series data points to widen the net and catch more of the potential ignition points that could lead to failure. This gives greater assurance that the fault-tolerant server is kept well away from any unexpected point of failure.

Resilience PM takes additional measurements to increase awareness of system performance and to confirm whether the necessary corrective action has been taken to restore components to their specification. On top of these, Resilience PM performs a separate time series analysis of all the measurements taken, draws additional alerts from the trending behaviour of the system, and mines a wider range of potential faults.
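The idea of drawing an alert from a trend, rather than from a single reading, can be illustrated with a minimal Python sketch. The source does not say which analysis method Resilience PM uses; the least-squares slope, the function names, and the sample temperature values and limits below are all assumptions chosen for illustration.

```python
from statistics import mean


def trend_slope(readings):
    """Least-squares slope of equally spaced readings (units per interval)."""
    n = len(readings)
    if n < 2:
        return 0.0
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(readings)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, readings))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def trend_alert(readings, limit, horizon):
    """True if the fitted upward trend would reach `limit` within `horizon` more intervals."""
    slope = trend_slope(readings)
    if slope <= 0:
        return False  # flat or falling trend: nothing to project
    projected = readings[-1] + slope * horizon
    return projected >= limit


# Hypothetical temperature readings (deg C), one per measurement interval.
temperatures = [40, 42, 44, 46]
print(trend_slope(temperatures))
print(trend_alert(temperatures, limit=55, horizon=6))
```

Every reading in this example is still below the hypothetical limit of 55, so a static threshold check would stay silent; the alert comes only from projecting the trend forward.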

Resilience PM is based on the C3S Resilience Framework. C3S developed the Resilience Framework to analyse and communicate the resilience of the system, primarily against system failure and cyber incidents. 

The Framework uses our proprietary, easy-to-understand Resilience Framework Notation to describe the resilience characteristics of the fault-tolerant server in seven layers of interdependent nodes. The proper operation of a node depends on the proper operation of the node in the layer immediately below it, to which it is linked by a path. A failure in any active node in the primary must be answered by a new working path established in the secondary.
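The dependency rule can be sketched in Python as a chain of nodes, each linked to the node in the layer below. This is an illustrative model only, not the Resilience Framework Notation itself: the `Node` class, the helper names, and the generic layer labels (`L1` to `L7`) are assumptions, since the source does not name the seven layers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    healthy: bool = True
    lower: Optional["Node"] = None  # node in the layer immediately below


def build_stack(side, layers=7):
    """Build a chain of `layers` nodes and return the top-level node."""
    node = None
    for level in range(1, layers + 1):
        node = Node(name=f"{side}-L{level}", lower=node)
    return node


def path_is_working(node):
    """A node operates properly only if it and every node below it are healthy."""
    while node is not None:
        if not node.healthy:
            return False
        node = node.lower
    return True


def first_fault(node):
    """Name of the first unhealthy node met walking down the path, or None."""
    while node is not None:
        if not node.healthy:
            return node.name
        node = node.lower
    return None


primary = build_stack("primary")
secondary = build_stack("secondary")
primary.lower.lower.healthy = False  # simulate a fault two layers below the top

print(path_is_working(primary), first_fault(primary))
print(path_is_working(secondary))
```

When the primary path is broken, the check on the secondary shows whether a complete working path is available to take over.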

The failover mechanism is designed to detect a failure in an active node, promote the designated nodes in the secondary to active nodes, and have them seamlessly resume all responsibilities of the faulty component. The failover mechanism is represented by the Horizontal Redundancy Channel symbol. The left start point of the symbol represents the location of the sensor; the right endpoint represents the node that will be promoted to an active node in this transformation process.
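The detect-and-promote behaviour can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual failover implementation: the `Component` and `RedundancyChannel` classes and their fields are hypothetical names that mirror the two ends of the symbol.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    healthy: bool = True
    active: bool = False


@dataclass
class RedundancyChannel:
    sensor_at: Component  # left start point: the monitored node
    promote: Component    # right endpoint: the node promoted on failure

    def check(self):
        """Promote the standby if the monitored active node has failed.

        Returns the node that is active afterwards, or None if neither
        node is able to serve.
        """
        if self.sensor_at.active and not self.sensor_at.healthy:
            self.sensor_at.active = False
            if self.promote.healthy:
                self.promote.active = True
        for node in (self.sensor_at, self.promote):
            if node.active:
                return node
        return None


primary = Component("primary", active=True)
secondary = Component("secondary")
channel = RedundancyChannel(sensor_at=primary, promote=secondary)

print(channel.check().name)  # primary is healthy, so it stays active
primary.healthy = False      # simulate a fault in the primary
print(channel.check().name)  # the secondary has been promoted
```

The sketch also makes the reason for testing the failover path explicit: if the secondary is itself unhealthy when the primary fails, `check()` returns `None` and continuity of operation is lost.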

The health and performance of each fault-tolerant server during the PM period are systematically analysed and reported using our proprietary Resilience Framework representation of the fault-tolerant server, as shown in the table below.

The table represents the framework, indicating the health status of each layer and, within each layer, of the components most important to the fault-tolerant design: the primary, the secondary and the failover mechanism. The health status falls into four broad categories:
•    Working: a healthy working system with no errors or failures recorded.
•    Verified: an explicit test has been performed to ascertain that the component is in good working order.
•    Error: an abnormality associated with the operation of the component that does not affect the continuity of its performance.
•    Failure: a serious fault that may require manual intervention to restore the component to normal working order.

Each status is represented by a circle of a different colour, as shown in the legend. The grey box in each layer indicates a measurable component in the fault-tolerant server design.
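The four statuses can be modelled as an ordered enumeration, which makes it straightforward to summarise a layer by its least favourable component. The ordering used here (Verified, then Working, then Error, then Failure) and the `layer_status` roll-up rule are assumptions made for this sketch; the source defines the four statuses but not how they are ranked or combined.

```python
from enum import IntEnum


class Status(IntEnum):
    # Assumed ordering: a higher value means a less favourable status.
    VERIFIED = 0  # explicitly tested and found in good working order
    WORKING = 1   # healthy, no errors or failures recorded
    ERROR = 2     # abnormality that does not interrupt operation
    FAILURE = 3   # serious fault, may need manual intervention


def layer_status(primary, secondary, failover):
    """Summarise a layer by the least favourable of its three components."""
    return max(primary, secondary, failover)


# Hypothetical readings for one layer of the framework table.
summary = layer_status(Status.VERIFIED, Status.ERROR, Status.WORKING)
print(summary.name)
```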

The half-yearly Resilience PM (RPM) Report will include:
•    Hardware Check (CPU, I/O, Hard Disk, Network Adapter Team and BMC)
•    Health Check (Temperature, Fan Speed)
•    System Trending Check (System error, System alerts)
•    System Rhythm Check (Temporal Differential Health and Resource Usage) 
•    System Resource Usage (Hard Disk, Memory, CPU)
•    Log Review (Application, System, Hardware)
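As one possible reading of the temporal differential in the System Rhythm Check, the sketch below compares resource usage between two consecutive PM periods and flags resources whose usage has shifted by more than a tolerance. The source does not define how the differential is computed; the function names, the percentage figures and the tolerance are hypothetical.

```python
def usage_differential(previous, current):
    """Change in usage (percentage points) per resource between two periods."""
    return {
        name: current[name] - previous[name]
        for name in current
        if name in previous
    }


def flag_changes(diff, tolerance):
    """Names of resources whose usage moved by more than `tolerance` points."""
    return sorted(name for name, delta in diff.items() if abs(delta) > tolerance)


# Hypothetical average usage (%) from the previous and current PM periods.
previous = {"disk": 61.0, "memory": 48.0, "cpu": 22.0}
current = {"disk": 74.0, "memory": 49.0, "cpu": 21.0}

diff = usage_differential(previous, current)
print(diff)
print(flag_changes(diff, tolerance=10.0))
```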