Large-scale Complex IT Systems

Added by Ian Sommerville Prof

Publication date 2011

Field Informatics Engineering

Language English

Authors Ian Sommerville - Dave Cliff - Radu Calinescu

Software Engineering Computers and Society

This paper explores the issues around the construction of large-scale complex systems which are built as systems of systems and suggests that there are fundamental reasons, derived from the inherent complexity in these systems, why our current software engineering methods and techniques cannot be scaled up to cope with the engineering challenges of constructing such systems. It then goes on to propose a research and education agenda for software engineering that identifies the major challenges and issues in the development of large-scale complex, software-intensive systems. Central to this is the notion that we cannot separate software from the socio-technical environment in which it is used.

Application Checkpoint and Power Study on Large Scale Systems

Yuping Fan 2021

Power efficiency is critical in high performance computing (HPC) systems. To achieve high power efficiency at the application level, it is vitally important to efficiently distribute the power used by application checkpoints. In this study, we analyze the relationship between application checkpoints and their power consumption. The observations could guide the design of power management.

Software Engineering Hardware Architecture

MicroHECL: High-Efficient Root Cause Localization in Large-Scale Microservice Systems

Dewei Liu, Chuan He, Xin Peng 2021

Availability issues of industrial microservice systems (e.g., a drop in successfully placed orders and processed transactions) directly affect the running of the business. These issues are usually caused by various types of service anomalies which propagate along service dependencies. Accurate and highly efficient root cause localization is thus a critical challenge for large-scale industrial microservice systems. Existing approaches use analysis techniques based on the service dependency graph to automatically locate root causes. However, these approaches are limited by their inaccurate detection of service anomalies and inefficient traversal of the service dependency graph. In this paper, we propose a highly efficient root cause localization approach for availability issues of microservice systems, called MicroHECL. Based on a dynamically constructed service call graph, MicroHECL analyzes possible anomaly propagation chains and ranks candidate root causes based on correlation analysis. We combine machine learning and statistical methods and design customized models for the detection of different types of service anomalies (i.e., performance, reliability, traffic). To improve efficiency, we adopt a pruning strategy to eliminate irrelevant service calls in anomaly propagation chain analysis. Experimental studies show that MicroHECL significantly outperforms two state-of-the-art baseline approaches in terms of both accuracy and efficiency. MicroHECL has been used in Alibaba and achieves a top-3 hit ratio of 68% with root cause localization time reduced from 30 minutes to 5 minutes.
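The abstract gives only the outline of the approach, so the following is a rough Python sketch of that outline rather than MicroHECL itself. It assumes the anomalous calls have already been flagged by some detector, treats any call without a flagged anomaly as irrelevant (a stand-in for the paper's pruning strategy), and ranks the services at the ends of the propagation chains by Pearson correlation with a business metric. The function names, the input shapes and the choice of Pearson correlation are all assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def localize(call_graph, anomalous_calls, metrics, entry, business_metric, top_k=3):
    """Walk anomaly propagation chains from `entry` and rank chain ends.

    call_graph:      dict mapping caller -> list of callees (hypothetical shape)
    anomalous_calls: set of (caller, callee) pairs flagged by a detector (assumed given)
    metrics:         dict mapping service -> metric time series
    business_metric: time series observed at the entry service
    """
    candidates = []
    visited = {entry}
    stack = [entry]
    while stack:
        service = stack.pop()
        # Pruning (simplified): follow only calls that show an anomaly.
        anomalous_callees = [
            callee for callee in call_graph.get(service, [])
            if (service, callee) in anomalous_calls
        ]
        if not anomalous_callees:
            # The chain stops here, so this service is a candidate root cause.
            candidates.append(service)
        for callee in anomalous_callees:
            if callee not in visited:
                visited.add(callee)
                stack.append(callee)
    # Rank candidates by how strongly their metric tracks the business metric.
    ranked = sorted(
        candidates,
        key=lambda s: abs(pearson(metrics[s], business_metric)),
        reverse=True,
    )
    return ranked[:top_k]
```

Calling `localize` with a small call graph returns at most `top_k` service names, with services reachable only through non-anomalous calls never visited.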

Software Engineering

Prototype of Fault Adaptive Embedded Software for Large-Scale Real-Time Systems

Derek Messie 2005

This paper describes a comprehensive prototype of large-scale fault adaptive embedded software developed for the proposed Fermilab BTeV high energy physics experiment. Lightweight self-optimizing agents embedded within Level 1 of the prototype are responsible for proactive and reactive monitoring and mitigation based on specified layers of competence. The agents are self-protecting, detecting cascading failures using a distributed approach. Adaptive, reconfigurable, and mobile objects for reliability are designed to be self-configuring to adapt automatically to dynamically changing environments. These objects provide a self-healing layer with the ability to discover, diagnose, and react to discontinuities in real-time processing. A generic modeling environment was developed to facilitate design and implementation of hardware resource specifications, application data flow, and failure mitigation strategies. Level 1 of the planned BTeV trigger system alone will consist of 2500 DSPs, so the number of components and intractable fault scenarios involved make it impossible to design an "expert system" that applies traditional centralized mitigative strategies based on rules capturing every possible system state. Instead, a distributed reactive approach is implemented using the tools and methodologies developed by the Real-Time Embedded Systems group.

Software Engineering

AID: Efficient Prediction of Aggregated Intensity of Dependency in Large-scale Cloud Systems

Tianyi Yang, Jiacheng Shen, Yuxin Su 2021

Service reliability is one of the key challenges that cloud providers have to deal with. In cloud systems, unplanned service failures may cause severe cascading impacts on their dependent services, deteriorating customer satisfaction. Predicting the cascading impacts accurately and efficiently is critical to the operation and maintenance of cloud systems. Existing approaches identify whether one service depends on another via distributed tracing, but no prior work has focused on quantifying the extent of the dependency between cloud services. In this paper, we survey the outages and the procedure for failure diagnosis in two cloud providers to motivate the definition of the intensity of dependency. We define the intensity of dependency between two services as how much the status of the callee service influences the caller service. Then we propose AID, the first approach to predict the intensity of dependencies between cloud services. AID first generates a set of candidate dependency pairs from the spans. AID then represents the status of each cloud service with a multivariate time series aggregated from the spans. With the representation of services, AID calculates the similarities between the statuses of the caller and the callee of each candidate pair. Finally, AID aggregates the similarities to produce a unified value as the intensity of the dependency. We evaluate AID on the data collected from an open-source microservice benchmark and a cloud system in production. The experimental results show that AID can efficiently and accurately predict the intensity of dependencies. We further demonstrate the usefulness of our method in a large-scale commercial cloud system.
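The four steps named in the abstract (candidate pairs, per-service status series, per-pair similarity, aggregation) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the span tuple layout, the three status metrics, the use of Pearson correlation as the similarity measure and the plain average as the aggregation are all choices made here, since the abstract does not specify them.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical span layout: (service, parent_service_or_None, time_bucket,
# duration_ms, is_error). Real tracing data carries far more fields.
METRICS = ("count", "duration", "errors")


def candidate_pairs(spans):
    """Step 1: each (caller, callee) relation seen in the spans is a candidate."""
    return sorted({(parent, service) for service, parent, *_ in spans
                   if parent is not None})


def status_series(spans, n_buckets):
    """Step 2: one multivariate time series per service, aggregated per bucket."""
    per_bucket = defaultdict(lambda: [[] for _ in range(n_buckets)])
    for service, _parent, bucket, duration, is_error in spans:
        per_bucket[service][bucket].append((duration, is_error))
    series = {}
    for service, buckets in per_bucket.items():
        series[service] = {
            "count": [float(len(b)) for b in buckets],
            "duration": [sum(d for d, _ in b) / len(b) if b else 0.0 for b in buckets],
            "errors": [sum(e for _, e in b) / len(b) if b else 0.0 for b in buckets],
        }
    return series


def pearson(xs, ys):
    """Assumed similarity measure; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def intensity_of_dependency(spans, n_buckets):
    """Steps 3 and 4: per-metric similarity, then a simple average in [0, 1]."""
    series = status_series(spans, n_buckets)
    scores = {}
    for caller, callee in candidate_pairs(spans):
        sims = [max(0.0, pearson(series[caller][m], series[callee][m]))
                for m in METRICS]
        scores[(caller, callee)] = sum(sims) / len(sims)
    return scores
```

With this sketch, a callee whose latency rises and falls together with its caller's gets a higher score than a callee whose status stays flat while the caller's changes.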

Software Engineering

Safety Case Templates for Autonomous Systems

Robin Bloomfield, Gareth Fletcher, Heidy Khlaaf 2021

This report documents safety assurance argument templates to support the deployment and operation of autonomous systems that include machine learning (ML) components. The document presents example safety argument templates covering: the development of safety requirements, hazard analysis, a safety monitor architecture for an autonomous system including at least one ML element, a component with ML and the adaptation and change of the system over time. The report also presents generic templates for argument defeaters and evidence confidence that can be used to strengthen, review, and adapt the templates as necessary. This report is made available to get feedback on the approach and on the templates. This work was sponsored by the UK Dstl under the R-cloud framework.

Software Engineering Computers and Society Machine Learning
