Workplan


General description

1. Introduction

2. WP1: Conceptual Framework

3. WP2: Enabling Technologies Identification and Evaluation

4. WP3: Benchmark Definition, Experimentation and Validation

5. WP4: Consolidation


 General description

 

1. Introduction

 

To fulfil the final objectives (i.e., dependability benchmark concepts, specifications, guidelines and prototype tools), extensive experimental work is needed, both to identify important characteristics and to validate the various assumptions that have to be made when dealing with system dependability. To perform meaningful experiments, several enabling technologies related to benchmark conduct have to be either defined or extended. Experimentation relies mainly on workload and faultload selection and application: a workload is submitted to the system together with a faultload, the reaction of the system is observed, and measurements are performed on the system.

To facilitate project control and monitoring, the work is structured in workpackages in such a way that their sequence reflects the project progress. Four workpackages are planned: i) definition of the conceptual framework for system benchmarking, ii) identification and evaluation of the enabling technologies, iii) specific benchmark definition and application to pilot experiments in order to design, experiment and validate the benchmark prototypes and iv) consolidation of the conceptual framework with the experimental results.

 

WP1: Conceptual Framework

 

This workpackage defines the objectives, the properties and the measures to be evaluated by dependability benchmarks, and gives an overview of the utilisation of these benchmarks. It addresses several important aspects such as i) the need for an overall and global system viewpoint to correctly select the measures to be evaluated and interpret the results, ii) the end-user and the system developer perspectives, and iii) the build-up of the environment to conduct the benchmark. It will provide all the information deemed relevant to understand and use dependability benchmarks.

 

WP2: Enabling Technologies Identification and Evaluation

 

To put the conceptual framework into practice, enabling technologies have to be investigated; they will need to be refined and adapted in some respects. The measures defined in WP1 allow identification of the measurements to be performed on the target system as well as of the associated events. As events are tightly related to the system activity in the presence of faults, it is essential i) to check the representativeness of the faults to be injected, and ii) to define and generate meaningful workloads and faultloads under which the system will be exercised. Fault representativeness, together with workload and faultload selection, will constitute the core of our work on enabling technologies. The selected enabling technologies will be experimentally evaluated to provide feedback and validate the guidelines developed in WP1, using the target systems for which dependability benchmarks will be performed in WP3.

 

WP3: Benchmark Definition, Experimentation and Validation

 

The main concepts and techniques developed in WP1 and WP2 will be applied to selected system domains and specific application areas for which benchmark prototypes will be developed. Two families of operating systems, Windows and Linux, have been selected because of their widespread use in the European marketplace. For each family, we will develop benchmarks for the OS itself and for applications running on top of the OS. Two application areas will be considered for each family: an embedded application, and a database and web server application. The embedded application is a control and monitoring application, while the database applications are standard and widely accepted transaction-processing performance benchmarks implemented over an Oracle database management system (DBMS). The latter runs on top of Windows-NT and Linux, while the embedded application runs on top of the respective reduced versions of the OSs: Windows-CE and "embedded-Linux" (referred to as Linux-EB in this proposal). Benchmark validation aims at demonstrating the usefulness and effectiveness of the techniques and prototypes developed.

The pilots are central in assessing and supporting the results of the development of the enabling technologies and of the benchmark guidelines and prototypes that will be provided at the end of the project.

 

WP4: Consolidation

 

This workpackage addresses the consolidation of the whole set of results to issue recommendations on benchmarking and to finalise benchmark prototypes. The dissemination of the prototype tools will be emphasised during this period of time.

These workpackages are detailed in what follows. WP2 and WP3 are decomposed into tasks to facilitate their presentation and understanding. Figure 1 summarises the workpackage and task titles, and acronyms.

 

 

 

WP1    Conceptual Framework (CF)

WP2    Enabling Technologies Identification and Evaluation (ETIE)
       T21  Measurements
       T22  Fault Representativeness
       T23  Workload and Faultload Selection

WP3    Benchmark Definition, Experimentation and Validation (BDEV)
       T31  Benchmark Definition
       T32  Benchmark Experimentation
       T33  Benchmark Validation

WP4    Consolidation (CD)

Figure 1: WP subdivision into tasks

 

 2. WP1: Conceptual Framework

 

The conceptual framework will address the most relevant issues involved in dependability benchmarking. It is a fundamental part of the project, ensuring that the studies and experiments that follow abide by common guidelines. Building upon relevant advances in performance and dependability evaluation, WP1 will be decomposed into three main items related respectively to i) assessing the state of the art in performance benchmarking and dependability characterisation, ii) identifying and investigating the concepts for dependability benchmarking, and iii) setting up the foundations for the project development. In the sequel, we concentrate on the detailed description of the activities concerning the second item; these include: benchmark measures, objectives and utilisation, properties, and conduct.

Concerning benchmark measures, emphasis will be put on identifying and defining meaningful measures for the several users of a dependability benchmark (including system developers and end-users). Some qualitative or statistical measures can be obtained directly by measurements (by executing a benchmark on the target system) and some others can be derived from these results with the help of modelling.

The measures obtained by measurements characterise the reaction of the system to faults, given that faults are present. We will consider both classical dependability measures and measures adapted from the performance area. Examples of measures that can be obtained directly by measurements and processing are: error detection efficiency, error detection latency, time to diagnosis, failure modes, recovery factors, time to initialise or restart the system, and system response time or number of transactions in the presence of faults.
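
As an illustration only (the record format and field names below are hypothetical, not part of the benchmark specification), the following sketch shows how per-run observations of this kind could be aggregated into some of the measures listed above.

from dataclasses import dataclass
from statistics import mean
from typing import List, Optional

@dataclass
class RunRecord:
    """Observations from one benchmark run (hypothetical field names)."""
    fault_injected: bool
    error_detected: bool
    detection_latency_s: Optional[float]  # fault activation -> error detection, if detected
    failure_mode: Optional[str]           # e.g. "hang", "crash", "incorrect result", or None
    response_time_s: float                # mean transaction response time during the run

def direct_measures(runs: List[RunRecord]) -> dict:
    """Aggregate per-run observations into a few directly measured benchmark figures."""
    faulty = [r for r in runs if r.fault_injected]
    detected = [r for r in faulty if r.error_detected]
    latencies = [r.detection_latency_s for r in detected if r.detection_latency_s is not None]
    modes = [r.failure_mode for r in faulty if r.failure_mode is not None]
    return {
        "error_detection_efficiency": len(detected) / len(faulty) if faulty else None,
        "mean_detection_latency_s": mean(latencies) if latencies else None,
        "failure_mode_counts": {m: modes.count(m) for m in set(modes)},
        "mean_response_time_with_faults_s": mean(r.response_time_s for r in faulty) if faulty else None,
    }

# Illustrative use with two fabricated runs:
runs = [RunRecord(True, True, 0.4, None, 1.2),
        RunRecord(True, False, None, "hang", 5.0)]
print(direct_measures(runs))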

In addition to the reaction of the system to these faults, the second set of measures, derived via modelling, also accounts for other processes, such as the occurrence of faults and maintenance policies. When fed to probabilistic models, the statistical measurements allow the evaluation of probabilistic dependability measures like reliability, availability and safety. Our previous work shows the power of modelling for evaluating the dependability of complex real-life systems (see, e.g., [13, 14]). Additional relevant measures may be identified during the project. A central topic is the composability of measures. In particular, the classes of faults and their relation to the classes of failures, as well as the selection criteria for the faults, will be considered.
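
As a minimal sketch of how such measured quantities can feed a probabilistic model (a simple alternating-renewal illustration chosen for this text, not the modelling approach the project prescribes), steady-state availability can be derived from an assumed mean time to failure together with a detection coverage and restart/repair durations of the kind obtained by measurement.

def steady_state_availability(mttf_h: float,
                              coverage: float,
                              restart_h: float,
                              manual_repair_h: float) -> float:
    """Steady-state availability of a simple alternating renewal model (illustrative).

    mttf_h          : assumed mean time to failure (hours)
    coverage        : fraction of errors detected and handled automatically
                      (e.g. the error detection efficiency measured above)
    restart_h       : mean automatic restart time after a detected error
    manual_repair_h : mean manual repair time after an undetected error
    """
    mean_downtime = coverage * restart_h + (1.0 - coverage) * manual_repair_h
    return mttf_h / (mttf_h + mean_downtime)

# Illustrative numbers only: when manual repair is much slower than an automatic
# restart, the coverage figure dominates the resulting availability.
print(steady_state_availability(mttf_h=1000, coverage=0.95,
                                restart_h=0.1, manual_repair_h=8.0))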

Concerning benchmark objectives and utilisation, special attention will be devoted to distinguishing the commonalities and specificities attached to the various benchmarking attributes. Besides the issues concerning the types of dependability measures targeted, an important dimension is related to the point of view considered: an end-user usually has a perspective that differs from that of a system developer (or provider). In addition, the system boundaries can vary for different end-users and for systems composed of systems. Moreover, very different levels of, for example, accessibility, controllability, and knowledge of the system are to be expected. Thus, different kinds of users of dependability benchmarks have different needs and requirements. Benchmarks can be used, for instance, to:

  • Assess the dependability of a component or a system, requiring the benchmarks to produce just global aggregate numbers, but with comprehensive fault models.

  • Identify malfunctioning or less robust parts, requiring more attention and perhaps necessitating a change at the architectural level.

  • Tune a particular component to enhance its dependability (using wrapping), or tune a system architecture (by adding fault tolerance mechanisms or spare units, for example) to ensure an appropriate dependability level.

  • Compare, grade or rank the dependability of alternative or competitive solutions.

    All the above mentioned points of view will be explored and guidelines will be established for the most common cases.

    Several properties of dependability benchmarks are relevant. Examples of such properties are:

  • Portability, an important requirement but quite hard to attain due to the intricate nature of faults and their consequences.

  • Modularity, for ease of adaptability and expandability.

  • Ease-of-use and simplicity, for wide adoption.

  • Non-interference and non-damaging behaviour, for simple use.

  • Repeatability and reproducibility, for high confidence in the results.

    The extent to which each of these, and other, properties can be attained will be clarified. Most likely, it will not be possible to satisfy all of them, and a trade-off shall be made according to the measures to be evaluated.

    For conducting a benchmark, users need explicit guidelines. In the same way, guidelines are needed for understanding and correctly interpreting the results. A typical benchmark will consist of a series of runs applied to the target system(s). Each run will be conducted in several phases: i) initialisation, ii) workload submission without fault injection, iii) workload submission with fault injection (using the results of the previous phase as a reference), iv) observation of the target system reactions, v) integrity checking before the beginning of the subsequent run, etc. As already pointed out, it is anticipated that the individual benchmarks will have to be adapted to the specific target system(s) and also according to the users' requirements. These phases and their co-ordination have to be carefully designed and evaluated, and tools are needed that support and automate large parts of this development.

    One of the goals of this activity is, therefore, to develop a comprehensive and tractable model of "how to conduct benchmarking" that delivers i) guidelines for proven best practices to set up benchmarks and apply them efficiently, and ii) assistant tools for conducting the benchmark execution. The development of such a comprehensive supporting framework is highly desirable, both to serve as a unifying basis to carry out the various experiments within the project and to facilitate technology transfer and take-up initiatives at the end of the project. For example, a knowledge database can help specify the scope and objectives of an assessment, define success criteria or estimate time and costs. The database should be configurable and may identify several phases and iteration cycles for benchmark activities. There are several available tools that allow for a precise definition, design and evaluation of such a framework (e.g., see ).
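
    To make this phase structure concrete, the pseudo-code-style sketch below chains the phases of one run; the target object and all of its methods are hypothetical placeholders for benchmark-specific tooling, not project tools.

def benchmark_run(target, workload, fault):
    """One benchmark run; 'target', 'workload' and every method used here are
    hypothetical placeholders standing for benchmark-specific tooling."""
    target.initialise()                                  # i)   initialisation

    reference = target.run(workload)                     # ii)  workload without fault injection
                                                         #      (golden run used as reference)

    target.arm_injector(fault)                           # iii) workload with fault injection
    observed = target.run(workload)

    outcome = target.classify(reference, observed)       # iv)  observation of the reactions
                                                         #      (failure mode, latencies, ...)

    if not target.check_integrity():                     # v)   integrity check before the
        target.restore()                                 #      subsequent run

    return outcome

# A complete benchmark is then a series of such runs, one per faultload entry:
# results = [benchmark_run(target, workload, f) for f in faultload]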

     

     3. WP2: Enabling Technologies Identification and Evaluation

     

    In addition to a sound conceptual and experimental foundation, developing dependability benchmarks also requires appropriate enabling technologies so as to derive meaningful results.

    One major extension of dependability benchmarking with respect to classical performance benchmarking concerns the characterisation of the behaviour of a target system in the presence of a specific faultload (in addition to the workload). Thus, fault injection will play a central role in developing dependability benchmarks. However, the technologies classically used in fault injection experiments (mainly aimed at assessing the fault tolerance mechanisms of a particular system) are not directly usable for benchmarking purposes. Indeed, to be adopted and to provide meaningful quantitative results, benchmarking calls for a new set of easy-to-handle, yet reliable, technologies. We will address this fundamental issue by attempting to answer the following specific questions:

    1. What are the relevant measurements to perform on the target system in order to derive meaningful measures?
    2. How can the representativeness of the faults to be injected be assessed?
    3. What are the relevant workloads and faultloads to consider?

    All selected technologies will be combined and described in a consistent way at the end of the workpackage.

     

    T21: Measurements

     

    It is essential to identify the events to be observed as well as how to observe them and when to measure their effects. This will define the measurements to be performed on the target system. Typical measurement results will be, among others, failure modes (hang, crash, etc.), time to failure, or indications about possible weak points in the system design. Our experience in processing real software failure data [12] will be helpful.

    Both distributed and non-distributed systems will be targeted. A central question when considering a distributed system is how to make coherent distributed observations and how to co-ordinate them; for example, data in the various nodes should be checked for consistency.

    Specific sets of benchmarks will be defined, each tailored to the most relevant measures for the identified users (system developer and end-user). This will include the analysis of various usage profiles. Also, to make the benchmarks easily exploitable and usable, a graphical user interface will be developed, so that the results of the experiments can be visualised. For example, to achieve high coverage, the visual representation of execution and error propagation paths can be used to identify locations where test cases will effectively stimulate the system.
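
    One possible way to map raw per-run observations onto the failure-mode labels mentioned above is sketched below; the thresholds, exit-code convention and categories are illustrative assumptions, not values defined by the project.

def classify_outcome(exit_code, duration_s, timeout_s, output, reference_output):
    """Map the observable result of one run to a coarse failure mode (illustrative)."""
    if duration_s >= timeout_s:
        return "hang"                      # no completion before the watchdog timeout
    if exit_code is None or exit_code < 0:
        return "crash"                     # abnormal termination (e.g. killed by a signal)
    if output != reference_output:
        return "incorrect result"          # silent deviation w.r.t. the golden run
    return "correct"                       # fault tolerated or not activated

print(classify_outcome(exit_code=0, duration_s=12.3, timeout_s=60,
                       output="42", reference_output="42"))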

     

    T22: Fault Representativeness

     

    As many fault injection experiments have revealed a wide discrepancy among the behaviours caused by different candidate fault injection techniques (e.g., see ), fault representativeness is indeed a key issue when considering benchmarking. Apart from some recent efforts towards this end (e.g., ), the results currently available are still very limited. Accordingly, work is needed to identify and validate the nature of the erroneous behaviours that are induced by various classes of faults. For example, the characteristics of an error can be established by exploiting the execution trace of a faulty system.

    As it is expected that similar error patterns may often originate from various distinct causes (faults), we advocate that, for cost-efficiency in the case of benchmarking, fault injection should aim at directly producing such error patterns rather than focusing on their potential multiple causes.

    Elaborating on our previous experience (e.g., see ), we will carry out specific experiments using tools already developed by the partners of the consortium, encompassing physical fault injection, software-implemented fault injection (SWIFI) and software mutation. We will also build on previous relevant related work (e.g., see ). The aim is to investigate which types of faults should be injected. Specific aspects to deal with include the distribution of the observed error patterns and their multiplicity, dynamics and timing characteristics. As already pointed out when dealing with repeatability, instead of an exact matching between the many faults considered and the erroneous situations observed (which is indeed hardly achievable, and in fact not necessary in our context), we will rather base our investigation of the representativeness issue on a statistical/probabilistic footing.
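
    As a very simple illustration of the SWIFI principle (a single-bit corruption applied to a value crossing an interface), the sketch below flips one bit of an argument and records the resulting error pattern; it is not a description of the partners' injection tools.

import random

def flip_bit(value: int, bit: int) -> int:
    """Invert one bit of an integer value (SWIFI-style data corruption)."""
    return value ^ (1 << bit)

def faulty_call(func, arg: int, bit=None):
    """Call func with one bit of its integer argument corrupted and report the outcome."""
    bit = random.randrange(32) if bit is None else bit
    try:
        return {"bit": bit, "result": func(flip_bit(arg, bit)), "error_pattern": None}
    except Exception as exc:            # the raised exception is the observed error pattern
        return {"bit": bit, "result": None, "error_pattern": type(exc).__name__}

# Illustrative target: a table lookup that fails for out-of-range corrupted inputs.
table = list(range(16))
print(faulty_call(lambda i: table[i], arg=5, bit=30))   # typically reports an IndexError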

     

    T23: Workload and Faultload Selection

     

    Considering the workload, as a first basis we will use the types of workloads that conform to well-identified performance benchmarks. It is anticipated that adaptations or enhancements will concern the definition of workloads primarily aimed at selectively activating specific services of the target OSs. In particular, in that case we will also consider workloads that exercise both the nominal and the exceptional behaviour provided by the services. The main innovative work associated with this task will be devoted to investigating how to combine the faultload and the workload according to the objectives of the benchmark on a given target system: either evaluation of dependability measures, or testing of its behaviour in the presence of faults.

    Considering the faultload, it is well known that probabilistic selection of the faults to be injected is very much needed when fault injection experiments are meant to rate the behaviour of a target system in the presence of faults, for example to evaluate a coverage parameter (efficiency) of a fault tolerance mechanism, a fault tolerance property or dependability attributes. On the other hand, tailored and focused test cases are mandatory when the fault injection experiments are meant to reveal design flaws and weaknesses (e.g., in fault tolerance mechanisms). Still, no clear preference exists for either of these two alternatives when dealing with dependability benchmarking. Indeed, the set of faults used in a benchmark should ideally be selected such that each test case generates a unique and significant error pattern; this way, the test cases would exercise as many features of the target system as possible. However, such a level of involvement might impair the portability and ease-of-use properties that a benchmark should fulfil.

    In that context, one important practical issue is the level of synchronisation of the faultload with respect to the workload. Synchronisation allows specific test cases to be applied, but at the price of some necessary knowledge and detailed analysis of the target system(s). On the other hand, "random" injection is easier to implement, but at the expense and risk of slow convergence or possible bias of the results. Along the same lines, another dimension of the same problem concerns the impact on the results of clustering the faultload into classes that are likely to lead to distinct test cases. The question is: what is the right balance between a small set of elaborate and focused test cases (deterministic selection) and simple reliance on a statistical argument (probabilistic selection)?

    Elaborating on our previous work, we will specifically investigate and validate means (e.g., bias reduction techniques and confidence interval statistics) to support a sensible assessment and to identify objective compromises. We will also address the repeatability of experimental assessment. As deterministic repeatability cannot be achieved when dealing with complex systems, we will investigate to what extent statistical fault injection and a probabilistic interpretation of the results can offer rational means to compensate for this lack of repeatability.
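
    As an elementary example of the kind of confidence interval statistics referred to here (a standard normal-approximation interval for a proportion, not the specific techniques the project will finally retain), a coverage figure can be estimated from fault injection outcomes as follows.

from math import sqrt

def coverage_estimate(tolerated: int, injected: int, z: float = 1.96):
    """Point estimate and normal-approximation confidence interval for coverage.

    tolerated : number of injected faults handled correctly by the target
    injected  : total number of injected (and activated) faults
    z         : quantile of the standard normal law (1.96 ~ 95% confidence)
    """
    p = tolerated / injected
    half_width = z * sqrt(p * (1.0 - p) / injected)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Illustrative numbers: 4700 tolerated faults out of 5000 injections.
print(coverage_estimate(4700, 5000))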

    In order to support these assessments, we propose to investigate several techniques for identifying test cases with high coverage. These techniques will make use of methods, including approaches based on formal methods, to conduct pre-injection analysis of the target system.

     

     4. WP3: Benchmark Definition, Experimentation and Validation

     

    The definition, specification, and implementation of dependability benchmark prototypes for a carefully selected set of system domains and application areas will be based on both the concepts and the techniques developed in WP1 and WP2. To make this work comprehensive, different dependability benchmark prototypes will be defined and specified for two major application areas: embedded and transactional applications. Additionally, the benchmark prototypes will actually be developed for two different families of COTS operating systems (Windows and Linux), allowing a cross evaluation of the concepts (WP1) and the enabling technologies (WP2) in a true dependability benchmark context. The final goal of this comprehensive set of experiments is the validation of the dependability benchmark prototypes, in the sense of assuring that the benchmark results represent a practical and meaningful characterisation of the dependability properties of the target systems, both from the end-user's and the system developer's points of view.

    The proposed experimental set-up covers the cross-cutting issues involved in the characterisation of the large variety of today's computer systems: representative application areas (embedded and transactional applications) and widely used COTS operating systems (Windows and Linux) available in different versions (Windows-NT/Windows-CE and Linux/Linux-EB) for a large range of hardware platforms. The combination and cross exploitation of these multiple dimensions in the planned experiments represent a realistic portrait of the computer systems industry.

    This workpackage is organised in three major tasks, comprising the logical steps of benchmark definition, experimentation, and validation.

     

    T31: Benchmark Definition

     

    The definition and specification of dependability benchmark prototypes for embedded and transactional applications will bring together for the first time the concepts and techniques developed in previous workpackages. The specification of the prototypes must include all the details required to implement and use the benchmarks for different real systems. We anticipate that these specifications will include five major issues: i) global system view, ii) workload, iii) faultload, iv) measurements, and v) benchmark conduct.

    The first thing to be defined is the global view of the target system. This description specifies the key components of the system, assuming a typical system for each application area. This description must have enough detail to give meaning to the dependability attributes and allow the definition of other aspects of the benchmark such as the workload, faultload, and measurements, but cannot be regarded as a detailed description of a specific system.

    Representative workloads have been selected for each application area. A control and monitoring application has been selected to represent a typical embedded application, and the TPC-C benchmark workload from the Transaction Processing Performance Council (TPC) has been selected as a typical transactional application. It is worth noting that these workloads represent well the typical workload nature and the different degrees of criticality that can be found in each application area. In fact, the control and monitoring application comprises both non-critical and mission-critical functions, while the TPC-C workload is used to represent a comprehensive range of database applications, including highly demanding business-critical database applications. Additionally, the use of a workload from an established performance benchmark (TPC-C) will provide us with useful insights from this very successful benchmark.

    The faultload describes the set of faults that are going to be inserted in the target system, defined according to the fault representativeness and fault selection criteria established in WP2. The means (tools and techniques) that must be used to apply the faultload must also be described for a complete benchmark specification.
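
    To give an idea of what a machine-readable faultload description could look like, the fragment below is purely illustrative: the field names, fault classes and triggers are assumptions for this sketch, not the specification that will be produced in this task.

# Illustrative faultload fragment (not the DBench specification): each entry names
# the fault class, where it is applied, when it is triggered, and how it is injected.
faultload = [
    {"fault_class": "bit-flip",            # single-bit corruption of a kernel data structure
     "location": "os.kernel.memory",
     "trigger": {"type": "time", "after_s": 120},
     "injection_means": "SWIFI"},
    {"fault_class": "invalid-parameter",   # corrupted argument at a system-call boundary
     "location": "os.api.write",
     "trigger": {"type": "call_count", "every_n": 500},
     "injection_means": "API interception"},
    {"fault_class": "process-kill",        # abrupt termination of a DBMS server process
     "location": "dbms.server_process",
     "trigger": {"type": "workload_phase", "phase": "peak-load"},
     "injection_means": "OS signal"},
]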

    The sets of measurements to be performed on the target system and the corresponding measures are defined from WP1 and WP2. The benchmark results, representing both the end-user and the system developer’s perspectives, combine functional (performance) and dependability measures. While for the end-user’s perspective the quantitative characterisation of the performance of the global system in the presence of faults could be enough, the system developer will be interested in a more detailed characterisation of the dependability attributes of the system or system components.

    The benchmark conduct specifies all the details required to correctly implement and use the benchmark. This is very important to assure a uniform (standard) use of the benchmark and guarantee that the benchmark results are meaningful and can be used to compare alternative solutions from the dependability point of view.

     

    T32: Benchmark Experimentation

     

    The benchmarks defined for each application area will be developed for two different system families, resulting in two major benchmark experiments over COTS-based systems:

    • Dependability benchmark for embedded applications (over Windows CE and Linux EB).

    • Dependability benchmark for transactional applications (over Windows and Linux, with Oracle executing on top of each of them).

    Naturally, the need to perform these major benchmark experiments provides us with a comprehensive experimental environment that can be exploited in many directions, allowing a complete and detailed evaluation of concepts and techniques. This cross exploitation will be done within the same system family (to emphasise the differences and similarities between the two application areas) and between the two families (Windows and Linux) considered in the project. Some examples follow:

    • Evaluation of dependability features of the target systems using synthetic workloads. The idea is to benchmark the operating systems "alone" and to evaluate the dependability features of specific system components. The use of this information to accelerate the benchmarking process at the system level will be investigated;

    • Comparison between black-box and white-box benchmarking. Although the needed dependability benchmarks must be designed for systems for which the source code is not available (black-box benchmarking), it is very interesting to investigate specific features dependent on the knowledge of the source code;

    • Analysis of the differences and similarities between benchmarking the full versions of the OSs and the reduced versions designed for the embedded world. This is particularly interesting as it is a practical way to isolate the influence of specific components of the OS that are present in one version and absent in the other;

    • Analysis of the common dependability characteristics and differences of the two OS families;

    • Comparison of the measures characterising an embedded application and a transactional application;

    • Comparison of the benchmark results obtained for each benchmark (i.e., the benchmark for embedded and the benchmark for transactional applications) running on top of different OSs.


    T33: Benchmark Validation

     

    The most difficult part of the validation is to assure that the benchmark results do represent an accurate characterisation of the dependability properties of the tested systems and can be used to compare alternative solutions. In fact, this validation is tightly related to the evaluation of aspects such as fault representativeness that will be performed in WP2. Complementary approaches based on theoretical, modelling, and experimental techniques will be used to acquire confidence in the benchmark results.

    The benchmark validation also consists of assuring that the benchmark prototypes satisfy a set of properties required to obtain practical, useful, and meaningful dependability benchmarks. Examples of such properties are portability, scalability, usability, non-interference, non-damaging behaviour, and repeatability for high confidence in the results. The actual implementation of the planned benchmark prototypes in this rich and diverse experimental environment allows a sound evaluation and validation of these benchmark properties. In fact, the diversity of the systems and application areas assures a proper validation of properties such as portability, scalability, non-interference, and non-damaging behaviour. On the other hand, the large number of required experiments will allow the validation of result repeatability and benchmark usability.

    The validation of the benchmark prototypes is a pragmatic way of validating the concept of dependability benchmarking, as defined, specified, and implemented in the DBench project. In fact, the validation of the benchmark prototypes also represents the actual validation of the conceptual framework (WP1) and of the different enabling technologies (WP2) in an actual benchmarking environment.

    The benchmark validation carried out in this task will be consolidated through field validation in real application environments provided by the companies that constitute the Industrial Advisory Board.

     

     5. WP4: Consolidation

     

    This workpackage is the final step of the project, where all advances and developments made in the project are summed up, discussed and put into their final format. It will give the full description of the proposed Framework for Dependability Benchmarking, which will take into account all the methods, insights and tools produced or studied in the project.

    The workpackage is expected to clarify the benchmark objectives and utilisation, and the benchmark measures and properties. Significant advances are needed in all enabling technologies (conceptual and experimental foundations, fault representativeness, fault injection, measurements, experiment conduct). These advances are expected to be made possible by the experiments performed in WP2 and WP3, and by the detailed studies made in WP1. Some final experiments with the existing set-ups might still be performed at this stage to clarify points that may need it, most likely concerning issues such as fault representativeness and measurements.

    The practical outcomes of this workpackage are i) a set of guidelines and recommendations for the benchmark users and ii) a set of benchmark prototypes. The most relevant expected impacts are the contribution to the formulation of:

    • A set of dependability measures that are meaningful to system end-users and system developers.

    • A strategy for characterising and quantifying system dependability based on modelling and experimentation, depending on the target system nature (operating system, embedded or transactional system).

    • Methods for system dependability measurements: data collection, analysis and processing.

    • Recommendations for system developers on how to "grade" their systems and appropriately tune them to enhance their dependability, and for system end-users on how to characterise and, ultimately, select the most appropriate system according to their needs in terms of dependability.


    The final report will:

    • Present the framework for dependability benchmarking according to various dimensions:
        • The system nature and the application domain.
        • The perspective: end-user, system developer.
        • The objectives of the benchmark.

    • Summarise the results related to the pilot-experiments, in order to discuss their applicability and extension to other systems and to other application domains.

    • Give recommendations and guidelines for system benchmarking (concerning the methods and supporting tools) and make suggestions for new directions in which to carry on the efforts and progress made during the project, with a view to possible improvements.

    • Analyse the suitability of the benchmark prototype tools for industrialisation and propose further developments.


    An appendix will be produced containing the guides for using the benchmark prototype tools.

    Finally, a special effort will be devoted to the dissemination and exploitation of the results during this period. The prototype tools will be widely disseminated (through the web, whenever possible) for the community to become acquainted with the technology, and convinced of its usefulness.