BlockSim Example 1 - Reliability Analysis of a Storage Cluster System

This example is based on the example shown in Figure 8 of the article "Determining the Availability and Reliability of Storage Configurations" by Santosh Shetty, August 2002, as posted on Dell's website.

Example

Consider a "high-availability" cluster with a reliability block diagram (RBD), as shown next.
Figure 1: Storage Cluster System

Assume the following life distributions and parameters for the components. (Note that, unlike the original article, this example assumes that failed components are not repaired.)

  • Server: Exponential with mean = 45,753 hours
  • Switch: Exponential with mean = 255,358 hours
  • HBA: Exponential with mean = 252,550 hours
  • Controller: Exponential with mean = 68,961 hours
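
Because every component follows an exponential distribution specified by its mean, the reliability of a single component at time t is R(t) = exp(-t / mean). The following is a minimal Python sketch of that calculation, not BlockSim's own interface; the names MEAN_LIFE_HOURS and exponential_reliability are hypothetical, and the evaluation time of 8544 hours is simply the time used for the importance plots later in this example.

```python
import math

# Mean lives in hours, taken from the component list above.
MEAN_LIFE_HOURS = {
    "Server": 45_753,
    "Switch": 255_358,
    "HBA": 252_550,
    "Controller": 68_961,
}


def exponential_reliability(t_hours, mean_hours):
    """Return R(t) = exp(-t / mean) for an exponential life distribution."""
    return math.exp(-t_hours / mean_hours)


if __name__ == "__main__":
    t = 8544  # evaluation time in hours (assumption: same time as the plots)
    for name, mean in MEAN_LIFE_HOURS.items():
        print(f"{name}: R({t}) = {exponential_reliability(t, mean):.4f}")
```

With no repair, the failure rate of each component is constant and equal to 1 / mean, so the component with the shortest mean life (the server) has the lowest reliability at any given time.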

The objective of the analysis is to study the reliability of the system.

Analysis

Step 1: Create the RBD of the system in BlockSim, and then use the given information to configure the universal reliability definition (URD) of each block. For example, the following picture shows the Block Properties window of Server1. The inset shows the Model Wizard, which allows you to define the failure model of the block. The URDs of the other blocks can be configured in a similar manner.

Figure 2: Block Properties Window of Server1 and Model Wizard (inset)

Step 2: Once the URDs have been configured, analyze the diagram to obtain the system reliability equation, as shown next. In this equation, each R is the reliability function (1 - cdf) of the corresponding item. For example, RServer2 is the reliability function of Server 2.

Figure 3: System Reliability Equation of the Storage Cluster System
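
A system reliability equation of this kind is assembled from two rules: blocks in series multiply their reliabilities, and blocks in parallel multiply their unreliabilities. The Python sketch below illustrates those rules only. The layout it evaluates (a redundant pair of servers in series with a redundant pair of controllers) is a hypothetical assumption chosen for illustration; it is not the diagram in Figure 1 and does not reproduce the equation in Figure 3.

```python
import math


def series(*reliabilities):
    """All blocks must work: product of the block reliabilities."""
    result = 1.0
    for r in reliabilities:
        result *= r
    return result


def parallel(*reliabilities):
    """At least one block must work: 1 - product of the unreliabilities."""
    q = 1.0
    for r in reliabilities:
        q *= 1.0 - r
    return 1.0 - q


def r_exp(t, mean):
    """Exponential reliability function with the given mean life."""
    return math.exp(-t / mean)


def hypothetical_system_reliability(t):
    # Hypothetical layout (NOT Figure 1): two servers in parallel,
    # in series with two controllers in parallel.
    r_server = r_exp(t, 45_753)
    r_controller = r_exp(t, 68_961)
    return series(
        parallel(r_server, r_server),
        parallel(r_controller, r_controller),
    )


if __name__ == "__main__":
    print(hypothetical_system_reliability(8544))
```

Nesting series() and parallel() calls mirrors the way the diagram is drawn, which is why the resulting expression can be read back as an equation in the individual R terms.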

Step 3: Generate system-level plots to see more information about the system. The next two charts are component reliability importance plots at t = 8,544 hr (approximately 1 year). Both plots (a tableau area plot and a bar chart) illustrate the same concept: the higher the importance of a component, the greater its effect on system reliability.

Figure 4: Static Reliability Importance - Tableau Area Chart
Figure 5: Static Reliability Importance - Bar Chart

As you can see, the servers in this configuration are the most critical components, while the hubs are the least critical.
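
One common way to quantify this kind of ranking is the partial derivative of the system reliability with respect to a component's reliability (often called Birnbaum importance). The sketch below assumes that definition. For a system whose reliability is linear in each individual component reliability, the derivative equals the system reliability with the component perfect minus the system reliability with the component failed. The function names and the toy layout are hypothetical and are not the cluster in Figure 1.

```python
def birnbaum_importance(system_reliability, component_reliabilities):
    """Return {name: R_sys(component works) - R_sys(component failed)}."""
    importance = {}
    for name in component_reliabilities:
        works = {**component_reliabilities, name: 1.0}
        fails = {**component_reliabilities, name: 0.0}
        importance[name] = system_reliability(works) - system_reliability(fails)
    return importance


def hypothetical_system(r):
    # Hypothetical layout (NOT Figure 1): two servers in parallel,
    # in series with a single switch.
    servers = 1.0 - (1.0 - r["Server1"]) * (1.0 - r["Server2"])
    return servers * r["Switch"]


if __name__ == "__main__":
    # Illustrative reliability values, not results from this example.
    values = {"Server1": 0.8, "Server2": 0.8, "Switch": 0.9}
    for name, imp in birnbaum_importance(hypothetical_system, values).items():
        print(f"{name}: {imp:.3f}")
```

In this toy layout the single switch dominates because nothing backs it up. That is a property of the toy layout, not a statement about the storage cluster analyzed here.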

The following pictures show additional plots.

Figure 6: RI vs. Time Plot
Figure 7: System Reliability Plot
Figure 8: System Failure Rate Plot
Figure 9: System pdf Plot
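
The pdf and failure rate plots both derive from the reliability function: the pdf is f(t) = -dR/dt, and the failure rate is f(t) / R(t). The sketch below estimates both numerically with a central difference. It is an illustration under stated assumptions, not the method BlockSim uses; the function names are hypothetical, and the two-server parallel subsystem is a hypothetical example rather than the full cluster.

```python
import math


def pdf_and_failure_rate(reliability, t, dt=1.0):
    """Estimate f(t) = -dR/dt by central difference, and f(t) / R(t)."""
    f = (reliability(t - dt) - reliability(t + dt)) / (2.0 * dt)
    return f, f / reliability(t)


def r_server(t, mean=45_753):
    """Exponential reliability of one server."""
    return math.exp(-t / mean)


def r_parallel_servers(t):
    # Hypothetical subsystem (NOT Figure 1): two identical servers in parallel.
    return 1.0 - (1.0 - r_server(t)) ** 2


if __name__ == "__main__":
    t = 8544  # hours
    print("single server:", pdf_and_failure_rate(r_server, t))
    print("parallel pair:", pdf_and_failure_rate(r_parallel_servers, t))
```

A single exponential component has a constant failure rate of 1 / mean, whereas a redundant arrangement of such components has a failure rate that changes with time, which is why a system failure rate plot is informative even when every block is exponential.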

Step 4: Use BlockSim's Analytical Quick Calculation Pad (QCP) to obtain some of the most frequently requested reliability results. For example, the MTTF (mean time to failure) of the system is about 42,135 hours, as shown next.

Figure 10: Analytical QCP
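
For a non-repairable system, the MTTF is the integral of the system reliability function from 0 to infinity. The sketch below approximates that integral with the trapezoidal rule over a finite horizon. It does not reproduce the 42,135-hour result, which depends on the diagram in Figure 1; instead it checks the approach on two cases with known answers (a single exponential server, and a hypothetical parallel pair of identical servers, whose MTTF is 1.5 times the mean). All names are hypothetical.

```python
import math

SERVER_MEAN = 45_753  # hours, from the component list in this example


def mttf(reliability, upper, steps=200_000):
    """Approximate the integral of R(t) from 0 to upper (trapezoidal rule)."""
    h = upper / steps
    total = 0.5 * (reliability(0.0) + reliability(upper))
    for i in range(1, steps):
        total += reliability(i * h)
    return total * h


def r_single(t):
    """One exponential server."""
    return math.exp(-t / SERVER_MEAN)


def r_pair(t):
    # Hypothetical subsystem (NOT Figure 1): two identical servers in parallel.
    return 1.0 - (1.0 - r_single(t)) ** 2


if __name__ == "__main__":
    horizon = 20 * SERVER_MEAN  # assumption: the tail beyond this is negligible
    print("single server MTTF:", mttf(r_single, horizon))
    print("parallel pair MTTF:", mttf(r_pair, horizon))
```

The integration horizon must be long enough that the reliability has decayed to nearly zero; 20 mean lives is an assumption that suits these exponential cases and may need to be revisited for other distributions.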