During the semiconductor manufacturing process, predicting the yield of the semiconductor is an important problem. Early detection of defective products in the manufacturing process can save enormous production costs. Data generated from the semiconductor manufacturing process are characterized by highly non-normal distributions, complicated missing-data patterns, and high missing rates, all of which complicate yield prediction. We propose the Dirichlet process naive Bayes model (DPNB), a classification method that combines Dirichlet process mixtures with the naive Bayes model. Because the DPNB is built on Dirichlet process mixtures and learns the joint distribution of all variables involved, it can handle highly non-normal data and make predictions for test data with any missing pattern. The DPNB also performs well at high missing rates because it uses all the information in the observed components. Experiments on various real datasets, including semiconductor manufacturing data, show that the DPNB outperforms MICE and MissForest at predicting missing values as the percentage of missing values increases.
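A minimal illustrative sketch of the joint-mixture idea behind the DPNB follows; it is not the authors' implementation. It approximates the Dirichlet process mixture with scikit-learn's truncated-DP BayesianGaussianMixture using diagonal covariances (mirroring the naive Bayes conditional-independence assumption within a component) and imputes entries under arbitrary missing patterns by their mixture-conditional expectation. The function name, component count, and synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def dp_naive_bayes_impute(X_train, X_missing, n_components=20, seed=0):
    """Fit a truncated-DP Gaussian mixture on complete rows, then impute rows of
    X_missing (np.nan marks missing entries) by their conditional expectation
    under the mixture, computed from the observed coordinates only."""
    gmm = BayesianGaussianMixture(
        n_components=n_components,
        covariance_type="diag",   # per-component independence, naive-Bayes style
        weight_concentration_prior_type="dirichlet_process",
        random_state=seed,
    ).fit(X_train)

    X_imp = X_missing.copy()
    log_w = np.log(np.clip(gmm.weights_, 1e-300, None))
    for i, row in enumerate(X_missing):
        obs = ~np.isnan(row)
        # component responsibilities given only the observed coordinates
        log_r = log_w.copy()
        for k in range(n_components):
            mu, var = gmm.means_[k][obs], gmm.covariances_[k][obs]
            log_r[k] += -0.5 * np.sum((row[obs] - mu) ** 2 / var + np.log(2 * np.pi * var))
        r = np.exp(log_r - log_r.max())
        r /= r.sum()
        # conditional expectation of the missing coordinates
        X_imp[i, ~obs] = r @ gmm.means_[:, ~obs]
    return X_imp

# synthetic usage: 30% of entries missing completely at random
rng = np.random.default_rng(0)
X = rng.standard_normal((600, 5)) @ rng.standard_normal((5, 5))
X_miss = X[:100].copy()
X_miss[rng.random(X_miss.shape) < 0.3] = np.nan
mask = np.isnan(X_miss)
X_hat = dp_naive_bayes_impute(X[100:], X_miss)
print("mean absolute imputation error:", np.abs(X_hat[mask] - X[:100][mask]).mean())
```

Because the responsibilities are computed from the observed coordinates alone, every observed component contributes information, which is the property the abstract highlights for high missing rates.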
Microorganisms play critical roles in human health and disease. It is well known that microbes live in diverse communities in which they interact synergistically or antagonistically. Thus for estimating microbial associations with clinical covariates, multivariate statistical models are preferred. Multivariate models allow one to estimate and exploit complex interdependencies among multiple taxa, yielding more powerful tests of exposure or treatment effects than application of taxon-specific univariate analyses. In addition, the analysis of microbial count data requires special attention because data commonly exhibit zero inflation. To meet these needs, we developed a Bayesian variable selection model for multivariate count data with excess zeros that incorporates information on the covariance structure of the outcomes (counts for multiple taxa), while estimating associations with the mean levels of these outcomes. Although there has been a great deal of effort in zero-inflated models for longitudinal data, little attention has been given to high-dimensional multivariate zero-inflated data modeled via a general correlation structure. Through simulation, we compared performance of the proposed method to that of existing univariate approaches, for both the binary and count parts of the model. When outcomes were correlated the proposed variable selection method maintained type I error while boosting the ability to identify true associations in the binary component of the model. For the count part of the model, in some scenarios the the univariate method had higher power than the multivariate approach. This higher power was at a cost of a highly inflated false discovery rate not observed with the proposed multivariate method. We applied the approach to oral microbiome data from the Pediatric HIV/AIDS Cohort Oral Health Study and identified five species (of 44) associated with HIV infection.
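For contrast, a minimal sketch of the kind of taxon-by-taxon (univariate) zero-inflated baseline the abstract compares against is shown below. It is not the proposed multivariate Bayesian variable-selection model; the simulated counts and the `hiv` covariate name are purely illustrative.

```python
import numpy as np
import pandas as pd
from statsmodels.discrete.count_model import ZeroInflatedPoisson

rng = np.random.default_rng(0)
n, n_taxa = 120, 5
design = pd.DataFrame({"const": 1.0, "hiv": rng.integers(0, 2, n)})  # hypothetical exposure
counts = rng.poisson(3.0, size=(n, n_taxa))
counts[rng.random((n, n_taxa)) < 0.4] = 0          # inject excess zeros

for j in range(n_taxa):
    # separate regressions for the zero-inflation (binary) and count parts
    fit = ZeroInflatedPoisson(counts[:, j], design, exog_infl=design).fit(maxiter=200, disp=False)
    print(f"taxon {j}: count-part HIV coef {fit.params['hiv']:+.2f}, p = {fit.pvalues['hiv']:.3f}")
```

Fitting each taxon separately ignores the between-taxon correlation that the proposed multivariate model exploits, which is the source of the inflated false discovery rate reported in the simulations.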
Additive manufacturing (AM) technology is being increasingly adopted in a wide variety of application areas due to its ability to rapidly produce, prototype, and customize designs. AM techniques afford significant opportunities in regard to nuclear materials, including an accelerated fabrication process and reduced cost. High-fidelity modeling and simulation (M&S) of AM processes is being developed in Idaho National Laboratory's (INL's) Multiphysics Object-Oriented Simulation Environment (MOOSE) to support AM process optimization and provide a fundamental understanding of the various physical interactions involved. In this paper, we employ Bayesian inverse uncertainty quantification (UQ) to quantify the input uncertainties in a MOOSE-based melt pool model for AM. Inverse UQ is the process of inversely quantifying the input uncertainties while keeping model predictions consistent with the measurement data. The inverse UQ process takes into account uncertainties from the model, code, and data while simultaneously characterizing the uncertain distributions of the input parameters, rather than merely providing best-fit point estimates. We employ measurement data on melt pool geometry (lengths and depths) to quantify the uncertainties in several melt pool model parameters. Simulation results using the posterior uncertainties show improved agreement with experimental data compared with those using the prior nominal values. The resulting parameter uncertainties can be used to replace expert opinion in future uncertainty, sensitivity, and validation studies.
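A minimal sketch of the generic inverse UQ workflow follows, under stated assumptions: a hypothetical quadratic surrogate stands in for the MOOSE melt pool model, the "measurements" are synthetic, and a plain random-walk Metropolis sampler draws from the posterior over two illustrative parameters. This is not the paper's actual model, parameters, or sampler.

```python
import numpy as np

rng = np.random.default_rng(1)

def surrogate(theta):
    """Hypothetical cheap surrogate returning (melt pool length, depth) in mm."""
    absorptivity, k_scale = theta
    length = 2.0 * absorptivity / k_scale
    depth = 0.8 * absorptivity / np.sqrt(k_scale)
    return np.array([length, depth])

y_obs = np.array([1.9, 0.75])            # synthetic measured geometry
sigma = np.array([0.05, 0.03])           # combined measurement + model uncertainty

def log_post(theta):
    if np.any(theta <= 0) or np.any(theta > 2):   # uniform prior box
        return -np.inf
    resid = (surrogate(theta) - y_obs) / sigma
    return -0.5 * np.sum(resid ** 2)

theta, samples = np.array([1.0, 1.0]), []
lp = log_post(theta)
for _ in range(20000):                    # random-walk Metropolis
    prop = theta + 0.05 * rng.standard_normal(2)
    lp_prop = log_post(prop)
    if np.log(rng.random()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    samples.append(theta)
samples = np.array(samples)[5000:]        # discard burn-in
print("posterior mean:", samples.mean(axis=0), "posterior sd:", samples.std(axis=0))
```

The posterior standard deviations are the quantity the abstract proposes to carry forward in place of expert opinion in later uncertainty and sensitivity studies.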
We develop a new methodology for spatial regression of aggregated outputs on multi-resolution covariates. Such problems often occur with spatial data, for example in crop yield prediction, where the output is spatially aggregated over an area and the covariates may be observed at multiple resolutions. Building upon previous work on aggregated output regression, we propose a regression framework that synthesises the effects of the covariates at different resolutions on the output and provides uncertainty estimation. We show that, for a crop yield prediction problem, our approach, which relies on variational inference, is more scalable than existing multi-resolution regression models. We also show that our framework yields good predictive performance compared to existing multi-resolution crop yield models, whilst also providing estimates of the underlying spatial effects.
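A minimal sketch of the aggregated-output idea follows, assuming a simplified linear model with a closed-form conjugate Gaussian posterior rather than the paper's variationally trained multi-resolution framework; all data and dimensions are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_regions, pixels_per_region, n_cov = 50, 30, 4

# pixel-level covariates (e.g. weather, soil proxies) grouped by region
X_pix = rng.standard_normal((n_regions, pixels_per_region, n_cov))
beta_true = np.array([1.5, -0.7, 0.3, 0.0])

# aggregation: each region's output averages the pixel-level linear effects
X_agg = X_pix.mean(axis=1)                                  # (n_regions, n_cov)
y = X_agg @ beta_true + 0.1 * rng.standard_normal(n_regions)

# conjugate Bayesian linear regression: prior beta ~ N(0, tau^2 I), noise sd sigma
tau2, sigma2 = 10.0, 0.1 ** 2
S_inv = X_agg.T @ X_agg / sigma2 + np.eye(n_cov) / tau2
S = np.linalg.inv(S_inv)
m = S @ X_agg.T @ y / sigma2
print("posterior mean:", m.round(2))
print("posterior sd:  ", np.sqrt(np.diag(S)).round(2))
```

The posterior covariance S is what supplies the uncertainty estimates over the covariate effects; the paper replaces this closed form with variational inference to stay scalable at multiple resolutions.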
Missing data are a common problem in experimental and observational physics. They can be caused by various sources: an instrument's saturation, contamination from an external event, or data loss. In particular, they can have a disastrous effect when one is seeking to characterize a colored-noise-dominated signal in Fourier space, since they create a spectral leakage that can artificially increase the noise. It is therefore important either to take them into account or to correct for them prior to, e.g., a least-squares fit of the signal to be characterized. In this paper, we present an application of the {\it inpainting} algorithm to mock MICROSCOPE data; {\it inpainting} is based on a sparsity assumption and has already been used in various astrophysical contexts; MICROSCOPE is a French Space Agency mission, whose launch is expected in 2016, that aims to test the Weak Equivalence Principle down to the $10^{-15}$ level. We then explore the dependence of {\it inpainting} on the number of gaps and the total fraction of missing values. We show that, in a worst-case scenario, after reconstructing missing values with {\it inpainting}, a least-squares fit may allow us to significantly measure a $1.1\times10^{-15}$ Equivalence Principle violation signal, which is sufficiently close to the MICROSCOPE requirements to implement {\it inpainting} in the official MICROSCOPE data processing and analysis pipeline. Together with the previously published KARMA method, {\it inpainting} will then allow us to independently characterize and cross-check an Equivalence Principle violation signal detection down to the $10^{-15}$ level.
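A minimal sketch of sparsity-based inpainting on a gapped time series follows, assuming sparsity in the DCT domain and an iterative hard-thresholding scheme with a decreasing threshold; it illustrates the general class of algorithm only and is not the MICROSCOPE pipeline's implementation.

```python
import numpy as np
from scipy.fft import dct, idct

def inpaint(y, mask, n_iter=100):
    """y: signal with arbitrary values in the gaps; mask: True where observed."""
    x = np.where(mask, y, 0.0)
    lam_max = np.abs(dct(x, norm="ortho")).max()
    for i in range(n_iter):
        lam = lam_max * (1 - (i + 1) / n_iter)   # linearly decreasing threshold
        c = dct(x, norm="ortho")
        c[np.abs(c) < lam] = 0.0                 # hard thresholding enforces sparsity
        x = idct(c, norm="ortho")
        x[mask] = y[mask]                        # re-impose the observed samples
    return x

# synthetic example: a sinusoid with 20% of samples missing at random
t = np.arange(2048)
signal = np.sin(2 * np.pi * t / 128.0)
mask = np.ones_like(t, dtype=bool)
mask[np.random.default_rng(0).choice(t.size, t.size // 5, replace=False)] = False
rec = inpaint(signal, mask)
print("rms error in the gaps:", np.sqrt(np.mean((rec[~mask] - signal[~mask]) ** 2)))
```

Filling the gaps before the least-squares fit is what suppresses the spectral leakage that would otherwise inflate the noise estimate.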
This work is motivated by Obepine, the French system for SARS-CoV-2 viral load monitoring in wastewater. The objective is to identify, from time series of noisy measurements, the underlying auto-regressive signals, in a context where the measurements contain numerous missing values, censoring, and outliers. We propose a method based on an auto-regressive model adapted to censored data with outliers. Inference and prediction are produced via a discretised smoother. The method is validated both on simulations and on real data from Obepine. It is used to denoise measurements from the quantification of the SARS-CoV-2 E gene in wastewater by RT-qPCR. The resulting smoothed signal shows a good correlation with other epidemiological indicators, and an estimate of the overall system noise is produced.
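A minimal sketch of smoothing a noisy AR(1) signal with missing observations follows, using a Kalman filter and a Rauch-Tung-Striebel smoother that simply skip the measurement update at missing time steps. The censoring and outlier handling of the proposed method are not treated here, and all data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
T, phi, q, r = 200, 0.95, 0.1, 0.5          # length, AR coefficient, state/obs variances

# simulate an AR(1) state and noisy observations with 30% missing
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi * x[t - 1] + np.sqrt(q) * rng.standard_normal()
y = x + np.sqrt(r) * rng.standard_normal(T)
y[rng.random(T) < 0.3] = np.nan

# Kalman filter
m_f, P_f, m_p, P_p = np.zeros(T), np.zeros(T), np.zeros(T), np.zeros(T)
m, P = 0.0, 1.0
for t in range(T):
    m, P = phi * m, phi ** 2 * P + q          # predict
    m_p[t], P_p[t] = m, P
    if not np.isnan(y[t]):                    # update only when observed
        K = P / (P + r)
        m, P = m + K * (y[t] - m), (1 - K) * P
    m_f[t], P_f[t] = m, P

# Rauch-Tung-Striebel smoother
m_s = m_f.copy()
for t in range(T - 2, -1, -1):
    G = P_f[t] * phi / P_p[t + 1]
    m_s[t] = m_f[t] + G * (m_s[t + 1] - m_p[t + 1])

print("rms error, raw obs vs smoothed:",
      np.sqrt(np.nanmean((y - x) ** 2)), np.sqrt(np.mean((m_s - x) ** 2)))
```

Extending this scheme to censored RT-qPCR values and outliers is precisely the contribution described in the abstract; the sketch only shows the underlying AR-plus-smoother machinery.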