No Arabic abstract
Hundreds of millions of network cameras have been installed throughout the world. Each is capable of providing a vast amount of real-time data. Analyzing the massive data generated by these cameras requires significant computational resources and the demands may vary over time. Cloud computing shows the most promise to provide the needed resources on demand. In this article, we investigate how to allocate cloud resources when analyzing real-time data streams from network cameras. A resource manager considers many factors that affect its decisions, including the types of analysis, the number of data streams, and the locations of the cameras. The manager then selects the most cost-efficient types of cloud instances (e.g. CPU vs. GPGPU) to meet the computational demands for analyzing streams. We evaluate the effectiveness of our approach using Amazon Web Services. Experiments demonstrate more than 50% cost reduction for real workloads.
The LHCb experiment stores around $10^{11}$ collision events per year. A typical physics analysis deals with a final sample of up to $10^7$ events. Event preselection algorithms (lines) are used for data reduction. Since the data are stored in a format that requires sequential access, the lines are grouped into several output file streams, in order to increase the efficiency of user analysis jobs that read these data. The scheme efficiency heavily depends on the stream composition. By putting similar lines together and balancing the stream sizes it is possible to reduce the overhead. We present a method for finding an optimal stream composition. The method is applied to a part of the LHCb data (Turbo stream) on the stage where it is prepared for user physics analysis. This results in an expected improvement of 15% in the speed of user analysis jobs, and will be applied on data to be recorded in 2017.
With the advancement of technology, the data generated in our lives is getting faster and faster, and the amount of data that various applications need to process becomes extremely huge. Therefore, we need to put more effort into analyzing data and extracting valuable information. Cloud computing used to be a good technology to solve a large number of data analysis problems. However, in the era of the popularity of the Internet of Things (IoT), transmitting sensing data back to the cloud for centralized data analysis will consume a lot of wireless communication and network transmission costs. To solve the above problems, edge computing has become a promising solution. In this paper, we propose a new algorithm for processing probabilistic skyline queries over uncertain data streams in an edge computing environment. We use the concept of a second skyline set to filter data that is unlikely to be the result of the skyline. Besides, the edge server only sends the information needed to update the global analysis results on the cloud server, which will greatly reduce the amount of data transmitted over the network. The results show that our proposed method not only reduces the response time by more than 50% compared with the brute force method on two-dimensional data but also maintains the leading processing speed on high-dimensional data.
In Wolke et al. [1] we compare the efficiency of different resource allocation strategies experimentally. We focused on dynamic environments where virtual machines need to be allocated and deallocated to servers over time. In this companion paper, we describe the simulation framework and how to run simulations to replicate experiments or run new experiments within the framework.
Distributed dataflow systems enable data-parallel processing of large datasets on clusters. Public cloud providers offer a large variety and quantity of resources that can be used for such clusters. Yet, selecting appropriate cloud resources for dataflow jobs - that neither lead to bottlenecks nor to low resource utilization - is often challenging, even for expert users such as data engineers. We present C3O, a collaborative system for optimizing data processing cluster configurations in public clouds based on shared historical runtime data. The shared data is utilized for predicting the runtimes of data processing jobs on different possible cluster configurations, using specialized regression models. These models take the diverse execution contexts of different users into account and exhibit mean absolute errors below 3% in our experimental evaluation with 930 unique Spark jobs.
The availability of Infrastructure-as-a-Service (IaaS) computing clouds gives researchers access to a large set of new resources for running complex scientific applications. However, exploiting cloud resources for large numbers of jobs requires significant effort and expertise. In order to make it simple and transparent for researchers to deploy their applications, we have developed a virtual machine resource manager (Cloud Scheduler) for distributed compute clouds. Cloud Scheduler boots and manages the user-customized virtual machines in response to a users job submission. We describe the motivation and design of the Cloud Scheduler and present results on its use on both science and commercial clouds.