What is Shadow Computing?


We propose a new computational model, called shadow computing, which provides goal-based adaptive resilience through dynamic execution to meet the requirements of complex applications in highly parallel, failure-prone environments. Adaptive resilience is the ability of the system to dynamically harness all available resources to achieve the highest level of quality of service (QoS) for a given application. Dynamic execution is the ability to change an application's QoS while it executes. For example, we have built systems that run processes at variable execution speeds using dynamic voltage and frequency scaling (DVFS), which changes the application's response time. The challenge is to maintain an application's QoS while minimizing the system resources used, in spite of system-level changes such as failures or the availability of additional resources. To achieve adaptive resilience, the shadow computing model associates a set of shadows with the main execution; these shadows are dynamically instantiated and adjusted to address the current state of the system and maintain the application's QoS requirements.
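As a rough illustration of how a shadow might be adjusted to maintain a response-time QoS, the sketch below computes the speed a shadow would need after the main execution fails in order to still finish by a target deadline. The function name, its parameters, and the normalization (a speed of 1.0 completes one unit of work per unit of time) are assumptions made for this example, not part of any published implementation.

```python
def required_shadow_speed(work, shadow_speed, failure_time, deadline):
    """Speed a shadow needs after the main fails to still meet `deadline`.

    Hypothetical model: the shadow has run at `shadow_speed` since time 0,
    the main fails at `failure_time`, and the shadow takes over from there.
    """
    if not 0 <= failure_time < deadline:
        raise ValueError("the failure must occur before the deadline")
    # Work the shadow has not yet completed at the moment of failure.
    remaining = work - shadow_speed * failure_time
    if remaining <= 0:
        return 0.0
    # Spread the remaining work over the time left until the deadline.
    return remaining / (deadline - failure_time)


if __name__ == "__main__":
    # 100 units of work, shadow at half speed, main fails at t = 40,
    # response-time target of 120 time units.
    print(required_shadow_speed(100, 0.5, 40, 120))
```

Under these assumptions, a result above the maximum speed the hardware supports would mean the deadline cannot be met with that initial shadow speed, which is why the initial speed has to be chosen with the QoS target in mind.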

Recent News

CLOSER 2014 Paper Accepted

Wed 22 January 2014

Our recent submission to CLOSER 2014 has been accepted for publication. In this paper we develop shadow computing for the cloud computing environment. We show that using shadow replication as a fault tolerance scheme for MapReduce applications can both maximize profit and reduce energy consumption.

As the demand for cloud computing continues to increase, cloud service providers face the daunting challenge of meeting the negotiated SLA, in terms of reliability and timely performance, while remaining cost-effective. This challenge is compounded by the increasing likelihood of failure in large-scale clouds and the rising cost of energy consumption. This paper proposes Shadow Replication, a novel profit-maximization resiliency model, which seamlessly addresses failure at scale while minimizing energy consumption. The basic tenet of the model is to associate a suite of shadow processes with the main process; the shadows execute concurrently with the main process, but initially at a much reduced execution speed, to overcome failures as they occur. Two computationally feasible schemes are proposed to achieve shadow replication. A performance evaluation framework is developed to analyze these schemes and compare their performance to traditional replication-based fault tolerance methods, focusing on the inherent tradeoff between fault tolerance, the specified SLA, and profit maximization. The results show that Shadow Replication leads to significant energy reduction and is better suited for compute-intensive execution models, where up to 30% more profit can be achieved.
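The basic tenet described above can be sketched with a toy model of one main process and one shadow. This is only an illustration under stated assumptions, not the evaluation framework from the paper: the function `shadow_run`, its parameters, and the cubic power model (power proportional to speed cubed, a common approximation for DVFS) are all assumptions introduced here.

```python
def shadow_run(work, slow, fast, failure_time=None, main_speed=1.0):
    """Return (completion_time, energy) for one main and one shadow.

    Assumptions (hypothetical): the shadow runs at `slow` until the main
    fails, then at `fast`; power drawn by a process is speed ** 3.
    """
    main_finish = work / main_speed
    if failure_time is None or failure_time >= main_finish:
        # No failure: the main completes and the shadow is terminated.
        energy = (main_speed ** 3 + slow ** 3) * main_finish
        return main_finish, energy
    # The main fails; the shadow speeds up and finishes the remaining work.
    done_by_shadow = slow * failure_time
    finish = failure_time + (work - done_by_shadow) / fast
    energy = (main_speed ** 3 + slow ** 3) * failure_time
    energy += fast ** 3 * (finish - failure_time)
    return finish, energy


if __name__ == "__main__":
    print(shadow_run(100, slow=0.5, fast=1.0))                   # no failure
    print(shadow_run(100, slow=0.5, fast=1.0, failure_time=40))  # main fails
    print(shadow_run(100, slow=1.0, fast=1.0))                   # full-speed replica
```

Setting `slow` equal to `main_speed` reproduces traditional replication in this toy model, which makes the tradeoff visible: a slower shadow uses less energy when nothing fails, at the cost of a later completion time when the main does fail.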

PDP 2014 Paper Accepted

Fri 15 November 2013

Our recent submission to PDP 2014 has been accepted for publication. This paper was a joint project between the University of Pittsburgh and Sandia National Laboratories. In this paper we develop an instance of shadow computing for high performance computing (HPC) and show that we can reduce power consumption while increasing application performance.

As HPC systems continue to grow to meet the requirements of tomorrow's exascale-class systems, two of the biggest challenges are power consumption and system resilience. On current systems, the dominant resilience technique is checkpoint/restart. It is believed, however, that this technique alone will not scale to the level necessary to support future systems. Therefore, alternative methods, such as process replication, have been suggested to augment checkpoint/restart. In this paper we address resilience and power together, in contrast to much of the prior work, which treats them independently. Using an analytical model that accounts for both power consumption and failures, we study the performance of checkpoint and replication-based techniques on current and future systems and use power measurements from current systems to validate our findings. Lastly, in an attempt to optimize power consumption for replication, we introduce a new protocol termed shadow replication, which not only reduces energy consumption but also produces faster response times than checkpoint/restart and traditional replication when operating under system power constraints.