Understanding Unsuccessful Executions in Big-Data Systems

Big-data applications are increasingly used in today’s large-scale data centers for a wide variety of purposes, such as solving scientific problems, running enterprise services, and computing data-intensive tasks. Due to the growing scale of these systems and the complexity of the applications they run, jobs in big-data systems experience unsuccessful terminations of different kinds. While a large body of existing studies sheds light on failures that occur in large-scale data centers, the current literature overlooks the characteristics and the performance impairment of a broader class of unsuccessful executions, which can arise from application failures, dependency violations, machine constraints, job kills, and task pre-emption. Deepening our understanding of this field is nonetheless of paramount importance, as unsuccessful executions can lower user satisfaction, impair reliability, and waste substantial resources. In this paper, we describe the problem of unsuccessful executions in big-data systems and highlight the critical importance of improving our knowledge of this subject. We review the existing literature in this field, discuss its limitations, and present our own contributions to the problem, along with our research plan for the future.