The origin of an herbal medicine affects its therapeutic effect, and an electronic nose system could potentially distinguish between origins. Because the odor fingerprints of herbal medicines from different origins can differ only slightly, discriminating between origins can be much harder than discriminating between categories. Better feature extraction methods are important for performing this task accurately, but systematic studies of different feature extraction methods are lacking. In this study, we classified the origins of three categories of herbal medicines using different feature extraction methods: manual feature extraction, mathematical transformation, and deep learning algorithms. Over 50 repeated experiments with bootstrapping, we compared the effectiveness of the extracted features using a two-layer neural network with and without dimensionality reduction (principal component analysis, linear discriminant analysis) as the three base classifiers. Compared with conventional aggregated features, the Fast Fourier Transform method and our novel approach (longitudinal-information-in-a-line) showed a significant accuracy improvement (p < 0.05) on all three base classifiers and all three herbal medicine categories. Two of the deep learning algorithms we applied also showed partially significant improvement: a one-dimensional convolutional neural network (1D-CNN) and a novel graph-pooling-based framework, multivariate time pooling (MTPool).
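As a rough illustration of Fast Fourier Transform feature extraction from a multi-sensor response, the following Python sketch keeps the magnitudes of the lowest-frequency coefficients of each sensor channel. The function name `fft_features`, the choice of magnitudes, the normalization by signal length, and the number of retained coefficients are all assumptions made for illustration and may differ from the configuration used in the study.

```python
import numpy as np

def fft_features(response, n_coeffs=8):
    """Hypothetical helper: FFT-magnitude features for one e-nose sample.

    response: array of shape (n_sensors, n_timesteps).
    Returns a flat vector with up to n_coeffs magnitudes per sensor.
    """
    response = np.asarray(response, dtype=float)
    n_steps = response.shape[1]
    # Real-input FFT along the time axis, scaled by the signal length
    # (the scaling is an illustrative assumption).
    spectrum = np.abs(np.fft.rfft(response, axis=1)) / n_steps
    # Keep only the lowest-frequency coefficients of each channel.
    return spectrum[:, :n_coeffs].ravel()
```

The resulting vector can be passed to any base classifier in place of aggregated features such as the maximum or mean response.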
In machine learning applications, the reliability of predictions is important for assisted decision-making and risk control. As an effective framework for quantifying prediction reliability, conformal prediction (CP) was developed with CPKNN (CP with kNN). However, conventional CPKNN suffers from high variance and bias and from long computation times as the feature dimensionality increases. To address these limitations, a new CP framework, conformal prediction with shrunken centroids (CPSC), is proposed. It regularizes the class centroids to attenuate irrelevant features and shrinks the sample space used for prediction and reliability quantification. To compare CPKNN and CPSC, we applied them to the classification of 12 categories of alternative herbal medicine with an electronic nose as a case study and assessed them in two tasks: 1) offline prediction, in which the training set was fixed and the accuracy on the testing set was evaluated; and 2) online prediction with data augmentation, in which each method filtered unlabeled data by prediction reliability to augment the training data and the final accuracy on the testing set was compared. The results show that CPSC significantly outperformed CPKNN in both tasks: 1) CPSC reached significantly higher accuracy at lower computational cost, and at the same credibility output CPSC generally achieved higher accuracy; 2) data augmentation with CPSC robustly produced a statistically significant improvement in prediction accuracy across different reliability thresholds, and the augmented data were more balanced across classes. CPSC provides higher prediction accuracy and better reliability quantification, and can therefore serve as a reliable aid in decision support.
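The general idea of combining shrunken class centroids with conformal p-values can be sketched in Python as follows. This is not the CPSC algorithm itself: the soft-thresholding rule (without the per-feature standardization of the classical nearest-shrunken-centroid method), the distance-ratio nonconformity score, the inductive (split) calibration scheme, and all function names are simplifying assumptions chosen for illustration.

```python
import numpy as np

def shrunken_centroids(X, y, delta):
    """Soft-threshold each class centroid toward the overall centroid."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    cents = []
    for c in classes:
        diff = X[y == c].mean(axis=0) - overall
        # Features whose class offset is below delta collapse to the overall mean.
        shrunk = np.sign(diff) * np.maximum(np.abs(diff) - delta, 0.0)
        cents.append(overall + shrunk)
    return classes, np.array(cents)

def nonconformity(x, label_idx, cents):
    """Distance to the candidate class centroid over distance to the nearest other one."""
    d = np.linalg.norm(cents - x, axis=1)
    other = np.delete(d, label_idx).min()
    return d[label_idx] / (other + 1e-12)

def conformal_p_values(x, cents, calib_scores):
    """One p-value per candidate label, from held-out calibration scores."""
    n = len(calib_scores)
    p = []
    for k in range(len(cents)):
        a = nonconformity(x, k, cents)
        p.append((np.sum(calib_scores >= a) + 1) / (n + 1))
    return np.array(p)
```

In this sketch the predicted label is the one with the largest p-value, and that largest p-value plays the role of the credibility of the prediction. `label_idx` is the position of the label in the sorted `classes` array.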
The electronic nose has been proven effective in alternative herbal medicine classification, but because of the nature of supervised learning, previous research relies heavily on labelled training data, which are time-consuming and labor-intensive to collect. To alleviate this critical dependency on training data in real-world applications, this study aims to improve classification accuracy via data augmentation strategies. The effectiveness of five data augmentation strategies under different degrees of training data inadequacy is investigated in two scenarios: the noise-free scenario, where different availabilities of unlabelled data were considered, and the noisy scenario, where different levels of Gaussian noise and translational shifts were added to represent sensor drift. The five augmentation strategies, namely noise-adding data augmentation, semi-supervised learning, classifier-based online learning, Inductive Conformal Prediction (ICP) online learning, and our novel ensemble ICP online learning, are tested and compared against a supervised learning baseline, with Linear Discriminant Analysis (LDA) and Support Vector Machine (SVM) as the classifiers. Our novel strategy, ensemble ICP online learning, outperforms the others by showing non-decreasing classification accuracy on all tasks and a significant improvement on most simulated tasks (25 out of 36 tasks, p <= 0.05). Furthermore, this study provides a systematic analysis of the different augmentation strategies. It shows that, in each task, at least one strategy significantly improved classification accuracy with LDA (p <= 0.05) and gave non-decreasing classification accuracy with SVM. In particular, our proposed strategy demonstrated both effectiveness and robustness in boosting the generalizability of the classification model, and it can be employed in other machine learning applications.
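Of the five strategies, noise-adding data augmentation is the simplest to illustrate. The Python sketch below appends copies of the labelled samples perturbed by zero-mean Gaussian noise; the function name `augment_with_noise`, the number of copies, and the noise level `sigma` are illustrative assumptions rather than settings taken from the study.

```python
import numpy as np

def augment_with_noise(X, y, n_copies=3, sigma=0.05, seed=0):
    """Hypothetical helper: append n_copies noisy replicas of (X, y)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_parts, y_parts = [X], [y]
    for _ in range(n_copies):
        # Each replica keeps its label; only the features are perturbed.
        X_parts.append(X + rng.normal(0.0, sigma, size=X.shape))
        y_parts.append(y)
    return np.vstack(X_parts), np.concatenate(y_parts)
```

The augmented set can then be used to fit a classifier such as LDA or SVM in place of the original labelled data.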
A number of recently emerging applications call for studying data streams, potentially infinite flows of information updated in real time. When multiple co-evolving data streams are observed, an important task is to determine how these streams depend on each other, accounting for dynamic dependence patterns without imposing any restrictive probabilistic law governing this dependence. In this paper we argue that flexible least squares (FLS), a penalized version of ordinary least squares that accommodates time-varying regression coefficients, can be deployed successfully in this context. Our motivating application is statistical arbitrage, an investment strategy that exploits patterns detected in financial data streams. We demonstrate that FLS is algebraically equivalent to the well-known Kalman filter equations, and we take advantage of this equivalence to gain a better understanding of FLS and to suggest a more efficient algorithm. Promising experimental results obtained from an FLS-based algorithmic trading system for the S&P 500 Futures Index are reported.
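The connection to the Kalman filter can be illustrated with a short Python sketch of a Kalman filter for a regression whose coefficients follow a random walk. The function name `kalman_tv_regression`, the unit observation-noise variance, the state-noise covariance `I / mu` standing in for the FLS smoothness penalty, and the diffuse prior `p0` are illustrative assumptions, not the algorithm derived in the paper.

```python
import numpy as np

def kalman_tv_regression(X, y, mu=100.0, p0=1e3):
    """Online estimates of time-varying coefficients beta_t in y_t = x_t . beta_t + noise.

    mu plays the role of a smoothness penalty: larger mu means slower-moving
    coefficients (assumed mapping: state-noise covariance = I / mu).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = X.shape
    beta = np.zeros(d)        # prior mean of the coefficients
    P = p0 * np.eye(d)        # prior covariance (diffuse)
    Q = np.eye(d) / mu        # random-walk state noise
    betas = np.empty((n, d))
    for t in range(n):
        x = X[t]
        P = P + Q                             # predict step
        s = x @ P @ x + 1.0                   # innovation variance (unit obs. noise)
        k = P @ x / s                         # Kalman gain
        beta = beta + k * (y[t] - x @ beta)   # update the mean
        P = P - np.outer(k, x @ P)            # update the covariance
        betas[t] = beta
    return betas
```

Each observation is processed once with a cost that depends only on the number of regressors, which is what makes a recursion of this kind attractive for streaming data.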
It is a basic question in biology and other fields to identify the characteristic properties that, on the one hand, are shared by structures from a particular realm, such as gene regulation, protein-protein interaction, neural networks, or food webs, and that, on the other hand, distinguish them from other structures. We introduce and apply a general method, based on the spectrum of the normalized graph Laplacian, that yields representations, the spectral plots, that allow us to find and visualize such properties systematically. We present such visualizations for a wide range of biological networks and compare them with those for networks derived from theoretical schemes. The differences that we find are quite striking and suggest that the search for universal properties of biological networks should be complemented by an understanding of more specific features of biological organization principles at different scales.
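A minimal Python sketch of the underlying computation is given below, assuming an undirected graph supplied as a symmetric adjacency matrix with no isolated nodes. It uses the symmetric normalized Laplacian I - D^{-1/2} A D^{-1/2}, whose eigenvalues lie in [0, 2]; the Gaussian smoothing used to turn the eigenvalues into a plottable density, its width `sigma`, and both function names are illustrative assumptions.

```python
import numpy as np

def normalized_laplacian_spectrum(A):
    """Eigenvalues of I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    if np.any(deg <= 0):
        raise ValueError("every node needs at least one edge")
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return np.linalg.eigvalsh(L)  # sorted in ascending order

def spectral_density(eigs, grid, sigma=0.05):
    """Smooth the eigenvalues with Gaussian kernels to obtain a density on grid."""
    diffs = grid[:, None] - eigs[None, :]
    dens = np.exp(-0.5 * (diffs / sigma) ** 2).sum(axis=1)
    return dens / (len(eigs) * sigma * np.sqrt(2.0 * np.pi))
```

Plotting `spectral_density(eigs, grid)` against `grid` for several networks on the same axes gives one simple way to compare their spectra visually.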
The exploration of epidemic dynamics on dynamically evolving (adaptive) networks poses nontrivial challenges to the modeler, such as the determination of a small number of informative statistics of the detailed network state (that is, a few good observables) that usefully summarize the overall (macroscopic, systems level) behavior. Trying to obtain reduced, small size, accurate models in terms of these few statistical observables - that is, coarse-graining the full network epidemic model to a small but useful macroscopic one - is even more daunting. Here we describe a data-based approach to solving the first challenge: the detection of a few informative collective observables of the detailed epidemic dynamics. This will be accomplished through Diffusion Maps, a recently developed data-mining technique. We illustrate the approach through simulations of a simple mathematical model of epidemics on a network: a model known to exhibit complex temporal dynamics. We will discuss potential extensions of the approach, as well as possible shortcomings.
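A basic Diffusion Maps computation can be sketched in Python as follows, assuming each row of `X` is a feature vector summarizing one snapshot of the network state. The Gaussian kernel, the bandwidth `eps`, the absence of density normalization, and the function name `diffusion_map` are illustrative assumptions; how network snapshots are compared in the study itself is not specified here.

```python
import numpy as np

def diffusion_map(X, eps, n_coords=2):
    """Return the leading nontrivial eigenvalues and diffusion coordinates."""
    X = np.asarray(X, dtype=float)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / eps)                  # Gaussian affinity between samples
    d = K.sum(axis=1)
    # Symmetric matrix with the same eigenvalues as the Markov matrix D^{-1} K.
    S = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]         # largest eigenvalue first
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / np.sqrt(d)[:, None]       # right eigenvectors of D^{-1} K
    # Skip the trivial pair (eigenvalue 1, constant eigenvector).
    lam = vals[1:n_coords + 1]
    return lam, psi[:, 1:n_coords + 1] * lam
```

Candidate observables can then be sought by checking which network statistics vary monotonically with the leading diffusion coordinates.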