The conventional approach to pre-processing data for compression is to apply transforms such as the Fourier, Karhunen-Loève, or wavelet transforms. One drawback of this approach is that it is independent of the use of the compressed data, which may induce significant optimality losses when measured in terms of final utility (rather than in terms of distortion). We therefore revisit this paradigm by tailoring the data pre-processing operation to the utility function of the decision-making entity that uses the compressed (and therefore noisy) data. More specifically, the utility function consists of an Lp-norm, which is very relevant in the area of smart grids. Both linear and non-linear use-oriented transforms are designed and compared with conventional data pre-processing techniques, showing that the impact of compression noise can be significantly reduced.
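To make the distortion-versus-utility distinction above concrete, the following minimal sketch (not the authors' code) compresses synthetic load profiles with a Karhunen-Loève (PCA) truncation and compares the usual MSE distortion with the loss seen through an Lp-norm utility; the synthetic profile model, the choice p = 3, and all identifiers are illustrative assumptions.

```python
import numpy as np

def klt_compress(X, k):
    """Compress rows of X by projecting onto the top-k KLT (PCA) directions."""
    Xc = X - X.mean(axis=0)
    # Right singular vectors of the centered data give the Karhunen-Loeve basis.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    coeffs = Xc @ Vt[:k].T                       # keep k transform coefficients per profile
    return coeffs @ Vt[:k] + X.mean(axis=0)      # reconstruct from the truncated basis

def lp_utility(x, p=3):
    """Illustrative Lp-norm utility of a load profile x (p = 3 is an assumption)."""
    return np.linalg.norm(x, ord=p)

rng = np.random.default_rng(0)
X = rng.gamma(shape=2.0, scale=1.5, size=(200, 48))   # synthetic 48-sample load profiles
X_hat = klt_compress(X, k=8)

distortion = np.mean((X - X_hat) ** 2)                           # classical MSE distortion
utility_loss = np.mean(np.abs([lp_utility(x) - lp_utility(y)     # end-use (Lp-norm) loss
                               for x, y in zip(X, X_hat)]))
print(f"MSE distortion: {distortion:.4f}, Lp-utility loss: {utility_loss:.4f}")
```

A use-oriented transform, as proposed in the abstract, would be designed to minimize the second quantity directly rather than the first.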
Modern smart distribution systems require the storage, transmission and processing of big data generated by sensors installed in electric meters. On the one hand, this data is essential for intelligent decision making in the smart grid; on the other hand, storing, transmitting and processing such a huge amount of data is itself a challenge. Present approaches to compressing this information rely only on traditional matrix decomposition techniques, benefiting from a low number of principal components to represent the entire data. This paper proposes a cascaded data compression technique that blends three different methods in order to achieve a high compression rate for efficient storage and transmission. In the first and second stages, two lossy data compression techniques are used, namely Singular Value Decomposition (SVD) and normalization; the third stage achieves further compression by using Sparsity Encoding (SE), a lossless compression technique that only offers appreciable benefits for sparse data sets. Our simulation results show that the combined use of the three techniques achieves a data compression ratio 15% higher than state-of-the-art SVD for small, sparse datasets and up to 28% higher for large, non-sparse datasets, with acceptable Mean Absolute Error (MAE).
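A minimal sketch of such a three-stage cascade is given below, assuming a fixed quantization step as the "normalization" stage and a nonzero-index scheme as the sparsity encoder, since the abstract specifies neither; the synthetic meter data and all identifiers are illustrative, not the paper's implementation.

```python
import numpy as np

def svd_stage(X, k):
    """Stage 1 (lossy): keep only the top-k singular triplets."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k]

def quantize_stage(M, step):
    """Stage 2 (lossy): normalize and round to a fixed step (step size is an assumption)."""
    return np.round(M / step).astype(np.int32)

def sparsity_encode(Q):
    """Stage 3 (lossless): store only nonzero entries and their flat indices."""
    idx = np.flatnonzero(Q)
    return Q.shape, idx, Q.ravel()[idx]

def sparsity_decode(shape, idx, vals, step):
    Q = np.zeros(np.prod(shape))
    Q[idx] = vals
    return Q.reshape(shape) * step

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 96))                        # synthetic smart-meter matrix
k, step = 10, 1e-3
U, s, Vt = svd_stage(X, k)
enc = [sparsity_encode(quantize_stage(M, step)) for M in (U, np.diag(s), Vt)]

Uq, Sq, Vq = (sparsity_decode(sh, i, v, step) for sh, i, v in enc)
X_hat = Uq @ Sq @ Vq
print("MAE:", np.mean(np.abs(X - X_hat)))             # reconstruction error after the cascade
```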
Data analytics and data science play a significant role in today's society. In the context of Smart Grids (SG), the collection of vast amounts of data has seen the emergence of a plethora of data analysis approaches. In this paper, we conduct a Systematic Mapping Study (SMS) aimed at gaining insights into different facets of SG data analysis: application sub-domains (e.g., power load control), aspects covered (e.g., forecasting), techniques used (e.g., clustering), tool support, research methods (e.g., experiments/simulations), and replicability/reproducibility of research. The final goal is to provide a view of the current status of research. Overall, we found that each sub-domain has its peculiarities in terms of the techniques, approaches and research methodologies applied. Simulations and experiments play a crucial role in many areas. The replicability of studies is limited mainly by the lack of provided implementations of the algorithms and, to a lesser extent, by the use of private datasets.
Smart grids are large and complex cyber-physical infrastructures that require real-time monitoring to ensure the security and reliability of the system. Monitoring the smart grid involves analyzing continuous data streams from various measurement devices deployed throughout the system, which are topologically distributed and structurally interrelated. In this paper, graph signal processing (GSP) is used to represent and analyze power grid measurement data. It is shown that GSP can enable various analyses of the power grid's structured data and the dynamics of its interconnected components. In particular, the effects of various cyber and physical stresses on the power grid are evaluated and discussed in both the vertex and the graph-frequency domains of the signals. Several GSP-based techniques for detecting and locating cyber and physical stresses are presented, and their performances are evaluated and compared. The presented study shows that GSP can be a promising approach for analyzing power grid data.
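The following toy sketch illustrates the kind of vertex- versus graph-frequency-domain analysis described above: a bus-level measurement signal is projected onto the graph Fourier basis (the eigenvectors of the graph Laplacian), and a localized anomaly shows up as extra energy at high graph frequencies. The 4-bus topology, the frequency threshold, and the signal values are assumptions, not the paper's setup.

```python
import numpy as np

def graph_fourier_basis(A):
    """Eigendecomposition of the graph Laplacian L = D - A gives the GFT basis."""
    L = np.diag(A.sum(axis=1)) - A
    eigvals, eigvecs = np.linalg.eigh(L)       # eigenvalues act as graph frequencies
    return eigvals, eigvecs

def gft(x, eigvecs):
    """Graph Fourier transform: project the vertex-domain signal onto the basis."""
    return eigvecs.T @ x

# Toy 4-bus ring topology; the signal holds one measurement per bus.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
freqs, basis = graph_fourier_basis(A)

x_normal = np.array([1.00, 1.01, 0.99, 1.00])   # smooth signal (low graph frequencies)
x_stress = np.array([1.00, 1.01, 0.99, 1.40])   # localized anomaly at one bus (assumed)

for name, x in [("normal", x_normal), ("stressed", x_stress)]:
    xh = gft(x, basis)
    hf_energy = np.sum(xh[freqs > freqs.mean()] ** 2)   # energy at high graph frequencies
    print(name, "high-frequency energy:", round(hf_energy, 4))
```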
The advancement of various research sectors such as the Internet of Things (IoT), machine learning, data mining, big data, and communication technology has shed some light on transforming an urban city, by integrating the aforementioned techniques, into what is commonly known as a Smart City. With the emergence of smart cities, a plethora of data sources has become available for a wide variety of applications. The common technique for handling multiple data sources is data fusion, which improves data output quality or extracts knowledge from the raw data. In order to cater to ever-growing, highly complicated applications, studies in smart cities have to utilize data from various sources and evaluate their performance based on multiple aspects. To this end, we introduce a multi-perspective classification of data fusion for evaluating smart city applications. Moreover, we apply the proposed multi-perspective classification to evaluate selected applications in each domain of the smart city. We conclude the paper by discussing potential future directions and challenges of data fusion integration.
Biomedical data are widely used in developing prediction models for identifying a specific tumor, drug discovery, and classification of human cancers. However, previous studies have usually focused on different classifiers and have overlooked the class imbalance problem in real-world biomedical datasets. There is a lack of studies evaluating data pre-processing techniques, such as resampling and feature selection, for imbalanced biomedical data learning. The relationship between data pre-processing techniques and data distributions has never been analysed in previous studies. This article focuses on reviewing and evaluating some popular and recently developed resampling and feature selection methods for class imbalance learning. We analyse the effectiveness of each technique from a data distribution perspective. Extensive experiments have been conducted with five classifiers, four performance measures, and eight learning techniques across twenty real-world datasets. Experimental results show that: (1) resampling and feature selection techniques exhibit better performance with the support vector machine (SVM) classifier, but perform poorly with the C4.5 decision tree and linear discriminant analysis classifiers; (2) for datasets with different distributions, techniques such as random undersampling and feature selection outperform other data pre-processing methods on the T location-scale distribution when using SVM and KNN (k-nearest neighbours) classifiers, while random oversampling outperforms other methods on the negative binomial distribution when using the random forest classifier at a lower imbalance ratio; (3) feature selection outperforms the other data pre-processing methods in most cases; thus, feature selection with an SVM classifier is the best choice for imbalanced biomedical data learning.
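As a hedged illustration of the kind of pipeline evaluated above (resampling, then feature selection, then an SVM), the sketch below combines a hand-rolled random undersampler with scikit-learn's SelectKBest and SVC on a synthetic imbalanced dataset; the dataset parameters, the choice k = 30, and the metric are assumptions, not the article's experimental protocol.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import balanced_accuracy_score

def random_undersample(X, y, rng):
    """Drop majority-class samples until both classes have the minority-class size."""
    idx_min = np.flatnonzero(y == 1)
    idx_maj = rng.choice(np.flatnonzero(y == 0), size=idx_min.size, replace=False)
    keep = np.concatenate([idx_min, idx_maj])
    return X[keep], y[keep]

rng = np.random.default_rng(42)
# Synthetic high-dimensional, imbalanced data standing in for a biomedical dataset.
X, y = make_classification(n_samples=2000, n_features=200, n_informative=20,
                           weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

X_tr, y_tr = random_undersample(X_tr, y_tr, rng)           # resampling step
selector = SelectKBest(f_classif, k=30).fit(X_tr, y_tr)    # feature selection step
clf = SVC().fit(selector.transform(X_tr), y_tr)            # SVM classifier

pred = clf.predict(selector.transform(X_te))
print("balanced accuracy:", round(balanced_accuracy_score(y_te, pred), 3))
```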