RESILIENCE COMPUTING

Spectrum of solutions for continuity of operations

 

EXCELLENCE IN COMPUTING

C3S started as a hardware resilience solution provider to mission-critical customers in Singapore and the region. Customers included government, regional stock exchanges, national utility providers, high-value manufacturers, just to name a few.


We offer a spectrum of fault tolerant solutions to provide our customers with continuity of their operations upon failure of a component in their information and communication technology (ICT) system that their operations are critically dependent on.

We take a lifecycle approach to resilience. 

  • Specification, Procurement and Implementation of a Resilience Computing Solution using our Resilience Framework

  • Resilience Preventive Maintenance to systematically check the proper operation of a mission-critical system.

 
Continuity upon Failure.png

RESILIENCE COMPUTING SOLUTIONS

We offer three fault tolerant solutions to meet a variety of resilience specifications and budgets. The chart below summarises the resilience behaviours of the three fault tolerant solutions that we offer:

  • Hardware Fault Tolerance

  • Software Fault Tolerance

  • High Availability.

 

ANALYSED USING OUR OPERATIONS RESILIENCE ANALYSIS FRAMEWORK

We use our Operations Resilience Analysis Framework to describe, assess and test the resilient characteristics of the three solutions. The resilient characteristics are measured and compared in the dimensions of:

  • Recovery Latency – the length of downtime when the system is not performing normally

  • Data Loss Vulnerability – the risk of data loss and data corruption during failover


Our Resilience Notation is used to represent and then analyse and communicate the resilience design and recovery behaviour during a failure of each resilient computing system. The Notation is used as a basis for walking through a time series simulation of the resilience design in great detail. 

 

HARDWARE FAULT TOLERANCE SOLUTION

RESILIENCE DESIGN

In the hardware fault tolerance solution:

  • Primary and secondary both process same transaction which is kept in sync by a lockstep mechanism

  • Sensors detect failure in: memory, hard disk, CPU, network card, power supply unit

  • Failover mechanism promotes secondary hardware

  • Secondary hardware continue processing seamlessly as nodes are kept in sync by lockstep mechanism

COMPARISON

Hardware Fault Tolerance compared with Software Fault Tolerance and High Availability solutions:

  • Recovery latency is zero

  • Data Loss Vulnerability is zero

 
RN HW FT.png

HARDWARE FAULT TOLERANCE

Resilience Notation

Resilience Notation explains failover process:

  1. During normal operation: Application running lockstep in both Primary and Secondary

  2. Failover Process: Primary stops running due to a fault. Secondary running in Lock Step is not affected.

  3. Secondary continues operating: Secondary running in lockstep is not affected by failure of Primary.

Result of Hardware Fault Tolerance Failover Process:

  1. Application is processed in Lock Step by both primary & secondary

  2. In the event of failure, output of the other system is used, resulting in:

    1. no data loss

    2. no recovery latency.

 

SOFTWARE FAULT TOLERANCE SOLUTION

RESILIENCE DESIGN

Software Fault Tolerance design uses a state-pointing process where data in storage and memory are being replicated and synchronized from the primary to the secondary. To do this no input and output can leave the primary server until any modified memory data has been mirrored to the secondary server.

When failures occur the secondary server will continue operation from the last state-point.

COMPARISON

Software Fault Tolerance compared with Hardware Fault Tolerance and High Availability solutions:

  • Recovery latency is low

  • Data Loss Vulnerability is low.

 
RN SW FT.png

SOFTWARE FAULT TOLERANCE

Resilience Notation

Resilience Notation explains failover process:

  1. During normal operation: The application is running in the Primary.  The Application, Data and Operating System status copied to Secondary.

  2. Failover Process: Primary stops running due to a fault

  3. Failover to Secondary. Secondary recover from latest status stored. Secondary resumes operation.

Result of Software fault Tolerance Failover Process:

  1. Secondary recovers with data in memory and disk

  2. Data in memory & not replicated to secondary is lost

  3. Application is not available while:

    1. Time detect failure

    2. Failover to secondary

    3. Recovery of application with data memory of secondary.

 

HIGH AVAILABILITY SOLUTION

RESILIENCE DESIGN

Resilience Design:
High Availability design uses a check-pointing process where data in storage in primary server is replicated to storage in the secondary server but not in-memory data. 

If a failure occurs, the data is rolled back to its last known state.
If failure is on the primary server, a restart is needed and the restart time is specific to the application.

COMPARISON

High Availability compared to Hardware Fault Tolerance and Software Fault Tolerance:

  • Recovery latency is high

  • Data Loss Vulnerability is high.

 
RN HA.png

HIGH AVAILABILITY SOLUTION

Resilience Notation

Resilience Notation explains failover process:

  1. During normal operations: The application running in the Primary. Data in storage of primary server copied to storage of secondary servier.

  2. Failover Process: Primary stops running due to fault.

  3. Failover to Secondary: Application is restarted in secondary server from latest data stored in storage of secondary server. Secondary resumes operation.


Software fault Tolerance Failover Process

  1. Secondary recovers with data in disk

  2. Data in memory & not replicated to secondary is lost

  3. Data in primary storage & not replicated to secondary storage is lost

  4. Application is down while:

    1. Time detect failure

    2. Failover to secondary

    3. Restart application with data memory of secondary.

 

RESILIENCE PREVENTIVE MAINTENANCE

The objective of the Resilience Preventive Maintenance is to systematically check the proper operation of a mission-critical system especially the proper operation of the failover mechanism. Proper operation of the failover mechanism is essential to the resilience of the mission-critical system as the continuity of operation of the mission-critical system depends on the successful failover from the primary to the secondary should a fault occur in the primary.

There are two types of maintenance commonly practised in a resilience system, Preventive Maintenance and Corrective Maintenance. Preventive Maintenance is the service before the fault by ensuring the system is maintained to its specifications.  While Corrective Maintenance is the action after a fault to restore the system back to its specifications. For the purpose of resilience, we focus on Preventive Maintenance as it has far more impact on the resilience of the system compared to Corrective Maintenance which merely restores a failed operation.

The purpose of maintenance is about restoring the specifications.  We are looking for specifications that can deteriorate over the use of the system, perhaps less static and can dynamically change according to the usage of the system. For example, the specifications of a tyre may be the size and the maximum air pressure it can take, but the operating specifications of the tyre are about the optimal air pressure to maintain and the depth of the rubber groove for water displacement and traction. 

Why RPM.png