ﻻ يوجد ملخص باللغة العربية
It is important for big data systems to identify their performance bottleneck. However, the popular indicators such as resource utilizations, are often misleading and incomparable with each other. In this paper, a novel indicator framework which can directly compare the impact of different indicators with each other is proposed to identify and analyze the performance bottleneck efficiently. A methodology which can construct the indicator from the performance change with the CPU frequency scaling is described. Spark is used as an example of a big data system and two typical SQL benchmarks are used as the workloads to evaluate the proposed method. Experimental results show that the proposed method is accurate compared with the resource utilization method and easy to implement compared with some white-box method. Meanwhile, the analysis with our indicators lead to some interesting findings and valuable performance optimization suggestions for big data systems.
Next Generation Sequencing (NGS) technology has resulted in massive amounts of proteomics and genomics data. This data is of no use if it is not properly analyzed. ETL (Extraction, Transformation, Loading) is an important step in designing data analy
Incremental processing is widely-adopted in many applications, ranging from incremental view maintenance, stream computing, to recently emerging progressive data warehouse and intermittent query processing. Despite many algorithms developed on this t
Scientific discoveries are increasingly driven by analyzing large volumes of image data. Many new libraries and specialized database management systems (DBMSs) have emerged to support such tasks. It is unclear, however, how well these systems support
Cloud computing refers to maximizing efficiency by sharing computational and storage resources, while data-parallel systems exploit the resources available in the cloud to perform parallel transformations over large amounts of data. In the same line,
In Big data era, information integration often requires abundant data extracted from massive data sources. Due to a large number of data sources, data source selection plays a crucial role in information integration, since it is costly and even impos