No Arabic abstract
Big data is gaining overwhelming attention since the last decade. Almost all the fields of science and technology have experienced a considerable impact from it. The cloud computing paradigm has been targeted for big data processing and mining in a more efficient manner using the plethora of resources available from computing nodes to efficient storage. Cloud data mining introduces the concept of performing data mining and analytics of huge data in the cloud availing the cloud resources. But can we do better? Yes, of course! The main contribution of this chapter is the identification of four game-changing technologies for the acceleration of computing and analysis of data mining tasks in the cloud. Graphics Processing Units can be used to further accelerate the mining or analytic process, which is called GPU accelerated analytics. Further, Approximate Computing can also be introduced in big data analytics for bringing efficacy in the process by reducing time and energy and hence facilitating greenness in the entire computing process. Quantum Computing is a paradigm that is gaining pace in recent times which can also facilitate efficient and fast big data analytics in very little time. We have surveyed these three technologies and established their importance in big data mining with a holistic architecture by combining these three game-changers with the perspective of big data. We have also talked about another future technology, i.e., Neural Processing Units or Neural accelerators for researchers to explore the possibilities. A brief explanation of big data and cloud data mining concepts are also presented here.
Background: Metabolomics datasets are becoming increasingly large and complex, with multiple types of algorithms and workflows needed to process and analyse the data. A cloud infrastructure with portable software tools can provide much needed resources enabling faster processing of much larger datasets than would be possible at any individual lab. The PhenoMeNal project has developed such an infrastructure, allowing users to run analyses on local or commercial cloud platforms. We have examined the computational scaling behaviour of the PhenoMeNal platform using four different implementations across 1-1000 virtual CPUs using two common metabolomics tools. Results: Our results show that data which takes up to 4 days to process on a standard desktop computer can be processed in just 10 min on the largest cluster. Improved runtimes come at the cost of decreased efficiency, with all platforms falling below 80% efficiency above approximately 1/3 of the maximum number of vCPUs. An economic analysis revealed that running on large scale cloud platforms is cost effective compared to traditional desktop systems. Conclusions: Overall, cloud implementations of PhenoMeNal show excellent scalability for standard metabolomics computing tasks on a range of platforms, making them a compelling choice for research computing in metabolomics.
We present our experiences using cloud computing to support data-intensive analytics on satellite imagery for commercial applications. Drawing from our background in high-performance computing, we draw parallels between the early days of clustered computing systems and the current state of cloud computing and its potential to disrupt the HPC market. Using our own virtual file system layer on top of cloud remote object storage, we demonstrate aggregate read bandwidth of 230 gigabytes per second using 512 Google Compute Engine (GCE) nodes accessing a USA multi-region standard storage bucket. This figure is comparable to the best HPC storage systems in existence. We also present several of our application results, including the identification of field boundaries in Ukraine, and the generation of a global cloud-free base layer from Landsat imagery.
Recent advances in computer vision and neural networks have made it possible for more surveillance videos to be automatically searched and analyzed by algorithms rather than humans. This happened in parallel with advances in edge computing where videos are analyzed over hierarchical clusters that contain edge devices, close to the video source. However, the current video analysis pipeline has several disadvantages when dealing with such advances. For example, video encoders have been designed for a long time to please human viewers and be agnostic of the downstream analysis task (e.g., object detection). Moreover, most of the video analytics systems leverage 2-tier architecture where the encoded video is sent to either a remote cloud or a private edge server but does not efficiently leverage both of them. In response to these advances, we present SIEVE, a 3-tier video analytics system to reduce the latency and increase the throughput of analytics over video streams. In SIEVE, we present a novel technique to detect objects in compressed video streams. We refer to this technique as semantic video encoding because it allows video encoders to be aware of the semantics of the downstream task (e.g., object detection). Our results show that by leveraging semantic video encoding, we achieve close to 100% object detection accuracy with decompressing only 3.5% of the video frames which results in more than 100x speedup compared to classical approaches that decompress every video frame.
Cloud platform came into existence primarily to accelerate IT delivery and to promote innovation. To this point, it has performed largely well to the expectations of technologists, businesses and customers. The service aspect of this technology has paved the road for a faster set up of infrastructure and related goals for both startups and established organizations. This has further led to quicker delivery of many user-friendly applications to the market while proving to be a commercially viable option to companies with limited resources. On the technology front, the creation and adoption of this ecosystem has allowed easy collection of massive data from various sources at one place, where the place is sometimes referred as just the cloud. Efficient data mining can be performed on raw data to extract potentially useful information, which was not possible at this scale before. Targeted advertising is a common example that can help businesses. Despite these promising offerings, concerns around security and privacy of user information suppressed wider acceptance and an all-encompassing deployment of the cloud platform. In this paper, we discuss security and privacy concerns that occur due to data exchanging hands between a cloud servicer provider (CSP) and the primary cloud user - the data collector, from the content generator. We offer solutions that encompass technology, policy and sound management of the cloud service asserting that this approach has the potential to provide a holistic solution.
It is crucial to provide compatible treatment schemes for a disease according to various symptoms at different stages. However, most classification methods might be ineffective in accurately classifying a disease that holds the characteristics of multiple treatment stages, various symptoms, and multi-pathogenesis. Moreover, there are limited exchanges and cooperative actions in disease diagnoses and treatments between different departments and hospitals. Thus, when new diseases occur with atypical symptoms, inexperienced doctors might have difficulty in identifying them promptly and accurately. Therefore, to maximize the utilization of the advanced medical technology of developed hospitals and the rich medical knowledge of experienced doctors, a Disease Diagnosis and Treatment Recommendation System (DDTRS) is proposed in this paper. First, to effectively identify disease symptoms more accurately, a Density-Peaked Clustering Analysis (DPCA) algorithm is introduced for disease-symptom clustering. In addition, association analyses on Disease-Diagnosis (D-D) rules and Disease-Treatment (D-T) rules are conducted by the Apriori algorithm separately. The appropriate diagnosis and treatment schemes are recommended for patients and inexperienced doctors, even if they are in a limited therapeutic environment. Moreover, to reach the goals of high performance and low latency response, we implement a parallel solution for DDTRS using the Apache Spark cloud platform. Extensive experimental results demonstrate that the proposed DDTRS realizes disease-symptom clustering effectively and derives disease treatment recommendations intelligently and accurately.