Airborne gamma-ray surveys are useful for many applications, ranging from geology and mining to public health and nuclear security. In all these contexts, the ability to decompose a measured spectrum into a linear combination of background source terms can provide useful insights into the data and lead to improvements over techniques that use spectral energy windows. Multiple methods for the linear decomposition of spectra exist but are subject to various drawbacks, such as allowing negative photon fluxes or requiring detailed Monte Carlo modeling. We propose using Non-negative Matrix Factorization (NMF) as a data-driven approach to spectral decomposition. Using aerial surveys that include flights over water, we demonstrate that the mathematical approach of NMF finds physically relevant structure in aerial gamma-ray background, namely that measured spectra can be expressed as the sum of nearby terrestrial emission, distant terrestrial emission, and radon and cosmic emission. These NMF background components are compared to the background components obtained using Noise-Adjusted Singular Value Decomposition (NASVD), which contain negative photon fluxes and thus do not represent emission spectra in as straightforward a way. Finally, we comment on potential areas of research that are enabled by NMF decompositions, such as new approaches to spectral anomaly detection and data fusion.
The Baum-Welch algorithm, together with its derivatives and variations, has been the main technique for learning Hidden Markov Models (HMMs) from observational data. We present an HMM learning algorithm based on the non-negative matrix factorization (NMF) of higher-order Markovian statistics that is structurally different from Baum-Welch and its associated approaches. The described algorithm supports estimation of the number of recurrent states of an HMM and iterates the NMF algorithm to improve the learned HMM parameters. Numerical examples are provided as well.
Extracting genetic information from a full range of sequencing data is important for understanding diseases. We propose a novel method to effectively explore the landscape of genetic mutations and aggregate them to predict cancer type. We used multinomial logistic regression, nonsmooth non-negative matrix factorization (nsNMF), and support vector machine (SVM) to utilize the full range of sequencing data, aiming at better aggregating genetic mutations and improving their power in predicting cancer types. Specifically, we introduced a classifier to distinguish cancer types using somatic mutations obtained from whole-exome sequencing data. Mutations were identified from multiple cancers and scored using SIFT, PP2, and CADD, and grouped at the individual gene level. The nsNMF was then applied to reduce dimensionality and to obtain coefficient and basis matrices. A feature matrix was derived from the obtained matrices to train a classifier for cancer type classification with the SVM model. We have demonstrated that the classifier was able to distinguish the cancer types with reasonable accuracy. In five-fold cross-validations using mutation counts as features, the average prediction accuracy was 77.1% (SEM=0.1%), significantly outperforming baselines and outperforming models using mutation scores as features. Using the factor matrices derived from the nsNMF, we identified multiple genes and pathways that are significantly associated with each cancer type. This study presents a generic and complete pipeline to study the associations between somatic mutations and cancers. The discovered genes and pathways associated with each cancer type can lead to biological insights. The proposed method can be adapted to other studies for disease classification and pathway discovery.
In this paper we explore avenues for improving the reliability of dimensionality reduction methods such as Non-Negative Matrix Factorization (NMF) as interpretive exploratory data analysis tools. We first explore the difficulties of the optimization problem underlying NMF, showing for the first time that non-trivial NMF solutions always exist and that the optimization problem is actually convex, by using the theory of Completely Positive Factorization. We subsequently explore four novel approaches to finding globally-optimal NMF solutions using various ideas from convex optimization. We then develop a new method, isometric NMF (isoNMF), which preserves non-negativity while also providing an isometric embedding, simultaneously achieving two properties which are helpful for interpretation. Though it results in a more difficult optimization problem, we show experimentally that the resulting method is scalable and even achieves more compact spectra than standard NMF.
In the non-negative matrix factorization (NMF) problem, the input is an $m \times n$ matrix $M$ with non-negative entries and the goal is to factorize it as $M \approx AW$. The $m \times k$ matrix $A$ and the $k \times n$ matrix $W$ are both constrained to have non-negative entries. This is in contrast to singular value decomposition, where the matrices $A$ and $W$ can have negative entries but must satisfy the orthogonality constraint: the columns of $A$ are orthogonal and the rows of $W$ are also orthogonal. The orthogonal non-negative matrix factorization (ONMF) problem imposes both the non-negativity and the orthogonality constraints, and previous work showed that it leads to better performance than NMF on many clustering tasks. We give the first constant-factor approximation algorithm for ONMF when one or both of $A$ and $W$ are subject to the orthogonality constraint. We also show an interesting connection to the correlation clustering problem on bipartite graphs. Our experiments on synthetic and real-world data show that our algorithm achieves similar or smaller errors compared to previous ONMF algorithms while ensuring perfect orthogonality (many previous algorithms do not satisfy the hard orthogonality constraint).
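As a rough illustration of the plain NMF problem stated in this abstract (not the constant-factor ONMF algorithm it proposes), the sketch below factorizes a small non-negative matrix $M$ as $M \approx AW$ using the classical Lee-Seung multiplicative updates for the Frobenius-norm objective. The helper names (`matmul`, `transpose`, `frobenius_error`, `nmf`), the random initialization, and the iteration count are assumptions made for this example; no orthogonality constraint is enforced.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]

def frobenius_error(M, A, W):
    """Frobenius norm of the residual M - AW."""
    AW = matmul(A, W)
    return sum((m - p) ** 2 for rm, rp in zip(M, AW) for m, p in zip(rm, rp)) ** 0.5

def nmf(M, k, iters=200, seed=0, eps=1e-12):
    """Hypothetical helper: rank-k NMF via multiplicative updates.

    Returns non-negative A (m x k) and W (k x n) with M ~ AW.
    """
    rng = random.Random(seed)
    m, n = len(M), len(M[0])
    # Strictly positive random start so the updates never divide by zero.
    A = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    W = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        # W <- W * (A^T M) / (A^T A W), element-wise
        At = transpose(A)
        num, den = matmul(At, M), matmul(matmul(At, A), W)
        W = [[w * a / (b + eps) for w, a, b in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
        # A <- A * (M W^T) / (A W W^T), element-wise
        Wt = transpose(W)
        num, den = matmul(M, Wt), matmul(A, matmul(W, Wt))
        A = [[a * x / (y + eps) for a, x, y in zip(ra, rn, rd)]
             for ra, rn, rd in zip(A, num, den)]
    return A, W

if __name__ == "__main__":
    M = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 0.0, 1.0], [3.0, 2.0, 5.0]]
    A, W = nmf(M, k=2)
    print("residual:", frobenius_error(M, A, W))
```

Because every update multiplies a non-negative entry by a non-negative ratio, the factors stay non-negative throughout, which is the property that distinguishes NMF from singular value decomposition in the abstract above.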
Modeling the headway/spacing between two consecutive vehicles has many applications in traffic flow theory and transport practice. Most known approaches study only vehicles running on freeways. In this paper, we propose a model, based on random-matrix theory, to explain the spacing distribution of vehicles queuing in front of a signalized junction. We show that recently measured spacing data are well fitted by the spacing distribution of a Gaussian symplectic ensemble (GSE). These results are also compared with the spacing distribution observed for the car-parking problem. The reason that stationary vehicle queues and parked vehicles have different spacing distributions (GSE vs. Gaussian unitary ensemble, GUE) seems to lie in the difference in driving patterns.
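To make the GSE vs. GUE comparison concrete, the sketch below evaluates the standard Wigner-surmise approximations to the nearest-neighbor spacing distributions of the two ensembles, normalized to unit mean spacing. Whether the paper fits this closed form or another expression for the GSE spacing distribution is not stated in the abstract, so the use of the Wigner surmise here is an assumption, as are the function names and the integration grid.

```python
import math

def wigner_gue(s):
    """Wigner surmise for the GUE (beta = 2), unit mean spacing."""
    return (32.0 / math.pi ** 2) * s ** 2 * math.exp(-4.0 * s ** 2 / math.pi)

def wigner_gse(s):
    """Wigner surmise for the GSE (beta = 4), unit mean spacing."""
    coeff = 2.0 ** 18 / (3.0 ** 6 * math.pi ** 3)
    return coeff * s ** 4 * math.exp(-64.0 * s ** 2 / (9.0 * math.pi))

def integrate(f, a=0.0, b=8.0, steps=8000):
    """Composite trapezoid rule on [a, b]; the tails beyond s = 8 are negligible."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

if __name__ == "__main__":
    for name, p in (("GUE", wigner_gue), ("GSE", wigner_gse)):
        norm = integrate(p)                      # should be close to 1
        mean = integrate(lambda s: s * p(s))     # should be close to 1
        print(name, "norm:", round(norm, 4), "mean:", round(mean, 4))
```

For small spacings the GSE density grows as $s^4$ and the GUE density as $s^2$, so very short gaps are more strongly suppressed under the GSE.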