## Visualising cost versus risk of loss

Estimating the long-term costs and risks of archiving digital content is not an easy task! Nearly five years ago Richard Wright from the BBC expressed this challenge in terms of what an archive might want [1]. He asked for a cost curve for the percentage of loss of archive content over 20 years, including the probability of loss, the size of the uncertainties, and a cost function to show how the probability of loss varies with an increase, or decrease, in investment. His request was for no more than the basic actuarial information needed for investment decisions. This article shows how we have approached the problem of providing quantified costs, risks and uncertainties for long-term storage of file-based assets using IT storage technology.

### Stochastic simulation

The interactive storage simulation tool allows a user to manipulate a storage model in order to observe the effects of changing the storage strategy on cost and on the risk of loss of assets. The tool can also be used to batch process a number of parameterised configurations in order to explore the space of possible storage strategies. Given these results, it is possible to compare directly the effect of, for example, keeping a greater number of replicas of each asset while scrubbing the files less frequently.

The storage simulation tool uses a stochastic simulation, which means that each time it is run using the same initial configuration the results may vary in terms of the number of files lost and the total running cost. By repeatedly running the tool using the same configuration, we can generate a probability distribution of asset loss (Fig. 1). Figure 1 was generated by sampling the model 1000 times, each time simulating 10 years of preservation activity, and indicates that we would expect to lose around 0.3% of the archive over 10 years (or around 75 assets for an archive of 25,000 assets).
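The repeated-sampling idea can be sketched in a few lines of Python. This is a minimal illustration and not the storage simulation tool itself: the failure model, the function names (`simulate_run`, `loss_distribution`) and every parameter value are assumptions made for the example.

```python
import random
from collections import Counter

def simulate_run(n_assets, copies, p_copy_fail, intervals, rng):
    """Count assets lost in one run of a deliberately simple toy model.

    Assumption: within each scrub interval every copy fails independently
    with probability p_copy_fail. A scrub repairs an asset if at least one
    copy survived; the asset is lost if all of its copies failed.
    """
    lost = 0
    for _ in range(n_assets):
        for _ in range(intervals):
            if all(rng.random() < p_copy_fail for _ in range(copies)):
                lost += 1
                break  # a lost asset cannot be lost again
    return lost

def loss_distribution(runs, seed=0, **config):
    """Repeat the simulation and return {assets_lost: estimated probability}."""
    rng = random.Random(seed)
    counts = Counter(simulate_run(rng=rng, **config) for _ in range(runs))
    return {lost: n / runs for lost, n in sorted(counts.items())}

# Hypothetical configuration, kept small so the sketch runs quickly.
dist = loss_distribution(runs=100, n_assets=200, copies=2,
                         p_copy_fail=0.02, intervals=10)
for lost, prob in dist.items():
    print(f"{lost} assets lost: estimated probability {prob:.2f}")
```

Each run with a different random stream gives a different loss count, and the normalised histogram over many runs is the empirical equivalent of the distribution in Fig. 1.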

The probability distribution function (PDF) in Figure 1 tells us the probability of losing *exactly* a given percentage of the archive. By cumulatively summing the probabilities up to a given percentage of archive loss, we can generate the corresponding cumulative distribution function (CDF). The CDF gives the probability of losing this amount of the archive *or less*. However, we are interested in the probability of losing *more than* a given amount of the archive. This can be found by taking the complement of the CDF (one minus the CDF), shown in Fig. 2. The probability of losing more than a very small amount of the archive is high, while the probability of losing more than a large amount is low.
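As a small worked sketch of this step, the function below turns a discrete PDF into the complement of its CDF. The name `exceedance` and the example probabilities are invented for illustration. Note that for discrete loss levels, one minus the CDF gives the probability of losing strictly more than each level.

```python
def exceedance(pdf):
    """Map each loss level x to P(loss > x), given {x: P(loss == x)}."""
    result = {}
    cumulative = 0.0
    for level in sorted(pdf):
        cumulative += pdf[level]                    # CDF: P(loss <= level)
        result[level] = max(0.0, 1.0 - cumulative)  # complement: P(loss > level)
    return result

# Invented probabilities for three loss levels (percent of archive lost).
example_pdf = {0.1: 0.5, 0.3: 0.3, 0.5: 0.2}
for level, prob in exceedance(example_pdf).items():
    print(f"P(loss > {level}%) is about {prob:.2f}")
```

The `max(0.0, ...)` guard only protects against tiny negative values caused by floating-point rounding when the probabilities sum to one.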

### Multi-dimensional visualisation in decision support

Using this approach we can investigate the effects of any number of simulation input parameters (independent variables) on the simulation outputs (dependent variables). However, the greater the number of variables, the higher the dimensionality of the results and the more difficult they are to visualise. Visualisation of the preservation parameter space is an important tool for archive decision makers as it allows them to understand the complexity of the data in a way that is intuitive given their domain expertise. Where there is uncertainty in the data, it is critical that the decision maker appreciates the level of risk they are accepting.

Figure 3 shows a two-dimensional representation of the multi-dimensional data produced by the process outlined above. The representation is based on a technique originally used to evaluate risk in fisheries management [2]. Figure 3 shows a single storage system where the number of additional copies of a file stored in that system and the frequency of integrity checking (scrubbing) have an impact on both the cost and the risk of file loss. The figure illustrates the risk and cost landscape for the loss of more than a specified percentage of the archive's assets (the acceptable maximum level of loss). The boundary between adjacent coloured bands represents configurations of equal cost. The white contour lines are lines of equal risk of loss. Each intersection of values from the X and Y axes represents a storage simulation that was actually executed (multiple times). The intervening values are interpolated.
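The article does not say which interpolation method fills in the values between simulated grid points. The sketch below assumes simple bilinear interpolation, purely to illustrate how a value between four executed simulations could be estimated. The function name `bilinear` and the risk values in the grid are hypothetical.

```python
from bisect import bisect_right

def bilinear(xs, ys, grid, x, y):
    """Interpolate grid[i][j], the value simulated at (xs[i], ys[j]), at (x, y).

    xs and ys must be sorted and contain at least two values each.
    Points outside the grid are extrapolated from the nearest cell.
    """
    # Find the grid cell containing (x, y), clamped to the last full cell.
    i = min(max(bisect_right(xs, x) - 1, 0), len(xs) - 2)
    j = min(max(bisect_right(ys, y) - 1, 0), len(ys) - 2)
    tx = (x - xs[i]) / (xs[i + 1] - xs[i])
    ty = (y - ys[j]) / (ys[j + 1] - ys[j])
    low = grid[i][j] * (1 - tx) + grid[i + 1][j] * tx
    high = grid[i][j + 1] * (1 - tx) + grid[i + 1][j + 1] * tx
    return low * (1 - ty) + high * ty

# Hypothetical simulated risk values: rows are additional copies,
# columns are scrubbing intervals in months.
copies = [1, 2, 3]
scrub_months = [6, 12]
risk = [[0.40, 0.60],
        [0.20, 0.30],
        [0.10, 0.15]]
print(bilinear(copies, scrub_months, risk, 2, 9))
```

Because the number of copies is an integer, values interpolated between copy counts only serve to smooth the picture; only the grid intersections correspond to configurations that can actually be deployed.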

### Balancing cost versus risk of loss

In most storage system configurations, the cost of storage increases as risk of loss decreases (as shown in Fig. 3). Increasing the number of copies reduces the risk of loss, as expected, but also increases the cost because more storage capacity is needed. Increasing the frequency of scrubbing also reduces the risk of loss, but again increases cost because of the additional data access and the equipment needed to compute checksums for a larger volume of data. Therefore, it is often not possible to have a low-cost, low-risk storage strategy. The visualisation shown here helps the decision maker to identify the configuration that is cost efficient given their acceptable maximum level of loss and appetite for risk (since the maximum level of loss is inevitably based on simulation results that contain uncertainty). By shifting the acceptable maximum level of loss, the costs remain unchanged but the risk 'landscape' alters considerably. Compare Figure 4, representing the probability of losing more than 0.1% of the archive over 10 years, to Figure 5, representing the probability of losing more than 3% over the same period.
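The reason the costs stay fixed while the risk landscape moves can be seen from a small sketch: the same set of simulated runs for one configuration is simply re-read against a different threshold. The function name `risk_of_exceeding` and the sample loss fractions are made up for the example.

```python
def risk_of_exceeding(loss_fractions, threshold):
    """Estimate P(loss > threshold) from the loss fractions of repeated runs."""
    exceeded = sum(1 for fraction in loss_fractions if fraction > threshold)
    return exceeded / len(loss_fractions)

# Invented loss fractions from four runs of one configuration.
# The configuration, and therefore its cost, is the same in both calls.
samples = [0.0005, 0.002, 0.004, 0.04]
print(risk_of_exceeding(samples, 0.001))  # acceptable maximum loss of 0.1%
print(risk_of_exceeding(samples, 0.03))   # acceptable maximum loss of 3%
```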

This type of visualisation helps the decision maker to identify the optimal storage strategy given their constraints. Firstly, given a fixed budget, it enables them to select the storage strategy with the lowest probability of asset loss. Figure 6 shows that, given a budget of 50 million euros, the strategy with the lowest probability of loss of more than 0.1% of the archive over 10 years is to keep 3 additional copies of each asset and to check the integrity of the files every 10 months. In this case, it is not cost efficient to increase the frequency of scrubbing, as it will cost more but is unlikely to deliver any benefit in terms of data safety. Similarly, given that we are willing to accept the risk of losing more than 0.1% of the archive over 10 years with a probability of 1 in 5, then the strategy with the lowest cost is to keep 3 additional copies of each asset and to check the integrity of the files every 12 months.
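Both selection rules amount to a filter followed by a minimum. The sketch below shows one way to express them over a table of simulated configurations. The function names, the dictionary keys, and all cost and risk figures are hypothetical and are not taken from Figure 6.

```python
def lowest_risk_within_budget(configs, budget):
    """Among configurations costing at most `budget`, pick the lowest risk.

    Ties on risk are broken by lower cost, so paying more for no
    improvement in safety is never preferred.
    """
    affordable = [c for c in configs if c["cost"] <= budget]
    return min(affordable, key=lambda c: (c["risk"], c["cost"]), default=None)

def lowest_cost_within_risk(configs, max_risk):
    """Among configurations with risk at most `max_risk`, pick the cheapest."""
    acceptable = [c for c in configs if c["risk"] <= max_risk]
    return min(acceptable, key=lambda c: (c["cost"], c["risk"]), default=None)

# Invented results: cost in millions, risk = P(loss > acceptable maximum).
configs = [
    {"copies": 2, "scrub_months": 12, "cost": 30.0, "risk": 0.60},
    {"copies": 3, "scrub_months": 12, "cost": 45.0, "risk": 0.20},
    {"copies": 3, "scrub_months": 6, "cost": 48.0, "risk": 0.20},
    {"copies": 4, "scrub_months": 12, "cost": 60.0, "risk": 0.05},
]
print(lowest_risk_within_budget(configs, budget=50.0))
print(lowest_cost_within_risk(configs, max_risk=0.20))
```

Both functions return `None` when no configuration satisfies the constraint, which signals that the budget or the risk appetite has to be revisited.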

Different storage strategies have different balances between cost and risk of loss, which is why a tool to allow the trade-offs to be analysed on a case-by-case basis is so important. An interactive version of this visualisation has been developed, which allows a user to change the acceptable maximum level of loss, budget threshold and risk appetite in order to evaluate the storage strategy that best balances cost versus risk of loss. An interactive demonstration of this tool is available here (N.B. use of this tool can cause a large volume of data to be downloaded by your browser).

### References

- [1] Wright, R. 2007. Structural requirements for digital audiovisual preservation: Tools and Trends. International Conference on Digital Preservation, Koninklijke Bibliotheek, The Hague, The Netherlands
- [2] Goodyear, C. P. 1993. Spawning stock biomass per recruit in fisheries management: foundation and current use. Canadian Special Publication of Fisheries and Aquatic Sciences 120: 67-81