
Data Integration with High Dimensionality

Published by: Dr. Xin Gao
Publication date: 2016
Research field: Mathematical Statistics
Paper language: English





We consider a problem of data integration. As a motivating example, consider determining which genes affect a disease. The genes, which we call predictor objects, can be measured in different experiments on the same individual. We address the question of finding which genes are predictors of the disease in any of the experiments. Our formulation is more general. In a given data set, there is a fixed number of responses for each individual, which may include a mix of discrete, binary and continuous variables. There is also a class of predictor objects, which may differ within a subject depending on how the predictor object is measured, i.e., depending on the experiment. The goal is to select which predictor objects affect any of the responses, where the number of such informative predictor objects or features tends to infinity as the sample size increases. There is a marginal likelihood for each way the predictor object is measured, i.e., for each experiment. We specify a pseudolikelihood combining the marginal likelihoods, and propose a pseudolikelihood information criterion. Under regularity conditions, we establish selection consistency for the pseudolikelihood information criterion with unbounded true model size, which includes a Bayesian information criterion with an appropriate penalty term as a special case. Simulations indicate that data integration improves upon, sometimes dramatically, using only one of the data sources.
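To make the selection idea concrete, the following is a minimal Python sketch of a pseudolikelihood information criterion: marginal log-likelihoods are summed across experiments, penalized by the number of selected predictor objects, and the subset minimizing the criterion is chosen. The marginal models (OLS and logistic fits), the exhaustive search, and the function names are illustrative assumptions rather than the paper's exact estimators; a penalty on the order of log(n) per selected object recovers a BIC-type criterion as a special case.

```python
# Hedged sketch of a pseudolikelihood information criterion (PIC) for selecting
# predictor objects across experiments. Marginal models and search strategy are
# illustrative assumptions, not the paper's estimators.
import itertools
import numpy as np
import statsmodels.api as sm

def marginal_loglik(y, X):
    """Fit one marginal model (one experiment) and return its maximized log-likelihood."""
    Xc = sm.add_constant(X)
    binary = set(np.unique(y)) <= {0, 1}
    fit = sm.Logit(y, Xc).fit(disp=0) if binary else sm.OLS(y, Xc).fit()
    return fit.llf

def pic(y_list, X_list, subset, penalty):
    """Pseudolikelihood IC: sum marginal log-likelihoods over experiments,
    penalized by the number of selected predictor objects."""
    logpl = sum(marginal_loglik(y, X[:, subset]) for y, X in zip(y_list, X_list))
    return -2.0 * logpl + penalty * len(subset)

def select_predictors(y_list, X_list, p, penalty, max_size=3):
    """Pick the candidate subset minimizing PIC (exhaustive search over small subsets)."""
    candidates = (list(s) for r in range(1, max_size + 1)
                  for s in itertools.combinations(range(p), r))
    return min(candidates, key=lambda s: pic(y_list, X_list, s, penalty))
```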


Read also

This article is concerned with the design and analysis of discrete time Feynman-Kac particle integration models with geometric interacting jump processes. We analyze two general types of model, corresponding to whether the reference process is in continuous or discrete time. For the former, we consider discrete generation particle models defined by arbitrarily fine time mesh approximations of the Feynman-Kac models with continuous time path integrals. For the latter, we assume that the discrete process is observed at integer times and we design new approximation models with geometric interacting jumps in terms of a sequence of intermediate time steps between the integers. In both situations, we provide non-asymptotic bias and variance theorems w.r.t. the time step and the size of the system, yielding what appear to be the first results of this type for this class of Feynman-Kac particle integration models. We also discuss uniform convergence estimates w.r.t. the time horizon. Our approach is based on an original semigroup analysis with first order decompositions of the fluctuation errors.
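For readers unfamiliar with the model class, here is a minimal sketch of a generic discrete-generation Feynman-Kac particle approximation (selection by the potential functions followed by mutation). It is a baseline illustration only; the geometric interacting-jump constructions and fine time-mesh approximations analyzed in the paper are not reproduced, and the particular mutation and potential functions are assumptions.

```python
# Generic discrete-generation Feynman-Kac particle approximation (sequential
# Monte Carlo with multinomial resampling). Baseline sketch of the model class;
# the paper's geometric interacting-jump schemes are not implemented here.
import numpy as np

def feynman_kac_particles(n_steps, n_particles, mutate, potential, rng=None):
    """Return particle estimates of the flow means and the log of the
    unnormalized constant, using selection/mutation steps."""
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(n_particles)          # initial particles
    log_norm = 0.0                                # log unnormalized constant estimate
    means = []
    for _ in range(n_steps):
        w = potential(x)                          # Feynman-Kac potentials G_n(x)
        log_norm += np.log(w.mean())
        idx = rng.choice(n_particles, size=n_particles, p=w / w.sum())  # selection
        x = mutate(x[idx], rng)                   # mutation step
        means.append(x.mean())
    return np.array(means), log_norm

# Example run: random-walk mutation with a Gaussian potential (both assumptions).
est, logZ = feynman_kac_particles(
    n_steps=50, n_particles=2000,
    mutate=lambda x, rng: x + 0.1 * rng.standard_normal(x.size),
    potential=lambda x: np.exp(-0.5 * x**2),
)
```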
Robust real-time monitoring of high-dimensional data streams has many important real-world applications such as industrial quality control, signal detection, and biosurveillance, but unfortunately it is highly non-trivial to develop efficient schemes due to two challenges: (1) the unknown sparse number or subset of affected data streams and (2) the uncertainty of model specification for high-dimensional data. In this article, motivated by the detection of smaller persistent changes in the presence of larger transient outliers, we develop a family of efficient real-time robust detection schemes for high-dimensional data streams by monitoring feature spaces such as PCA or wavelet coefficients when the feature coefficients follow Tukey-Huber gross error models with outliers. We propose to construct a new local detection statistic for each feature, called the $L_{\alpha}$-CUSUM statistic, that reduces the effect of outliers by using the Box-Cox transformation of the likelihood function, and then to raise a global alarm based upon the sum of the soft-thresholding transformation of these local $L_{\alpha}$-CUSUM statistics, so as to filter out unaffected features. In addition, we propose a new concept called the false alarm breakdown point to measure the robustness of online monitoring schemes, and characterize the breakdown point of our proposed schemes. Asymptotic analysis, extensive numerical simulations, and a case study of nonlinear profile monitoring are conducted to illustrate the robustness and usefulness of our proposed schemes.
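A hedged sketch of the monitoring recursion described above: each feature keeps a CUSUM-type statistic whose increment is a Box-Cox transformation of the likelihood ratio, and a global alarm is raised when the sum of soft-thresholded local statistics exceeds a limit. The Gaussian mean-shift likelihood ratio, the parameters delta and alpha, and the two thresholds are illustrative assumptions, not the paper's tuned values.

```python
# Sketch of an L_alpha-CUSUM style monitor: Box-Cox transformed likelihood-ratio
# increments per feature, soft-thresholded and summed into a global statistic.
import numpy as np

def l_alpha_increment(x, delta=1.0, alpha=0.2):
    """Box-Cox transform of the likelihood ratio for a mean shift 0 -> delta
    under a standard Gaussian model; alpha -> 0 recovers the usual log-LR."""
    lr = np.exp(delta * x - 0.5 * delta**2)
    return (lr**alpha - 1.0) / alpha

def monitor(stream, soft_threshold=1.0, alarm_threshold=10.0):
    """stream: array of shape (T, p). Returns the first alarm time, or None."""
    T, p = stream.shape
    w = np.zeros(p)                                   # local CUSUM-type statistics
    for t in range(T):
        w = np.maximum(0.0, w + l_alpha_increment(stream[t]))
        global_stat = np.maximum(w - soft_threshold, 0.0).sum()  # filter unaffected features
        if global_stat > alarm_threshold:
            return t
    return None
```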
Forecasting the open-high-low-close (OHLC) data contained in candlestick charts is of great practical importance, as exemplified by applications in finance. Typically, the inherent constraints in OHLC data pose a great challenge to its prediction; e.g., forecasting models may yield unrealistic values if these constraints are ignored. To address this, a novel transformation approach is proposed to relax these constraints, along with its explicit inverse transformation, which ensures that forecasting models yield meaningful open-high-low-close values. A flexible and efficient framework for forecasting OHLC data is also provided. As an example, the detailed procedure of modelling the OHLC data via the vector auto-regression (VAR) model and the vector error correction (VEC) model is given. The new approach has high practical utility on account of its flexibility, simple implementation, and straightforward interpretation. Extensive simulation studies are performed to assess the effectiveness and stability of the proposed approach. Three financial data sets, the Kweichow Moutai stock, the CSI 100 index, and the 50 ETF of the Chinese stock market, are employed to document the empirical effect of the proposed methodology.
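Below is one plausible constraint-relaxing transformation with an explicit inverse, sketched in Python in the spirit of the approach described above; the paper's exact transformation may differ. Any real-valued forecast of the four transformed coordinates maps back to prices satisfying 0 < low <= open, close <= high, so a VAR or VEC model fitted to the transformed series cannot produce infeasible OHLC values.

```python
# Illustrative OHLC transformation (an assumption, not necessarily the paper's):
# level, log-range, and logit positions of open and close within the bar range.
import numpy as np

def logit(u):
    return np.log(u / (1.0 - u))

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def ohlc_to_unconstrained(o, h, l, c, eps=1e-6):
    """Map one (open, high, low, close) bar with h > l to four unconstrained reals."""
    bar_range = h - l
    return np.array([
        np.log(l),                                              # price level
        np.log(bar_range),                                      # bar range
        logit(np.clip((o - l) / bar_range, eps, 1.0 - eps)),    # open's position in the range
        logit(np.clip((c - l) / bar_range, eps, 1.0 - eps)),    # close's position in the range
    ])

def unconstrained_to_ohlc(y):
    """Explicit inverse: always yields positive prices with low <= open, close <= high."""
    l, bar_range = np.exp(y[0]), np.exp(y[1])
    o = l + sigmoid(y[2]) * bar_range
    c = l + sigmoid(y[3]) * bar_range
    return o, l + bar_range, l, c                               # (open, high, low, close)
```

In such a workflow, each bar is transformed, a multivariate time-series model is fitted to the transformed coordinates, and forecasts are mapped back through the inverse.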
We propose a distributed bootstrap method for simultaneous inference on high-dimensional massive data that are stored and processed with many machines. The method produces an $\ell_\infty$-norm confidence region based on a communication-efficient de-biased lasso, and we propose an efficient cross-validation approach to tune the method at every iteration. We theoretically prove a lower bound on the number of communication rounds $\tau_{\min}$ that warrants the statistical accuracy and efficiency. Furthermore, $\tau_{\min}$ only increases logarithmically with the number of workers and intrinsic dimensionality, while nearly invariant to the nominal dimensionality. We test our theory by extensive simulation studies, and a variable screening task on a semi-synthetic dataset based on the US Airline On-time Performance dataset. The code to reproduce the numerical results is available at GitHub: https://github.com/skchao74/Distributed-bootstrap.
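As a rough illustration of how an $\ell_\infty$-norm simultaneous confidence region can be formed with a multiplier bootstrap over worker-level estimates, consider the sketch below. The plain per-worker estimates stand in for the communication-efficient de-biased lasso of the paper, which is not reproduced here, and the function name and interface are assumptions.

```python
# Multiplier-bootstrap sketch for a sup-norm (ell_infty) simultaneous confidence
# region around the average of K worker-level estimates. Simplified stand-in for
# the distributed de-biased-lasso procedure; not the paper's algorithm.
import numpy as np

def linf_confidence_region(local_estimates, level=0.95, n_boot=2000, rng=None):
    """local_estimates: (K, p) array of per-worker estimates.
    Returns (center, radius) defining {b : ||b - center||_inf <= radius}."""
    rng = rng or np.random.default_rng(0)
    K, p = local_estimates.shape
    center = local_estimates.mean(axis=0)
    devs = local_estimates - center                  # worker-level deviations
    sup_norms = np.empty(n_boot)
    for b in range(n_boot):
        g = rng.standard_normal(K)                   # Gaussian multipliers
        sup_norms[b] = np.abs(g @ devs / K).max()    # bootstrap sup-norm of the average
    return center, np.quantile(sup_norms, level)
```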
Let $X_{k}=(x_{k1}, \cdots, x_{kp})$, $k=1,\cdots,n$, be a random sample of size $n$ from a $p$-dimensional population. For a fixed integer $m\geq 2$, consider the hypercubic random tensor $\mathbf{T}$ of $m$-th order and rank $n$ given by
$$\mathbf{T}=\sum_{k=1}^{n}\underbrace{X_{k}\otimes\cdots\otimes X_{k}}_{m\ \text{times}}=\Big(\sum_{k=1}^{n} x_{ki_{1}}x_{ki_{2}}\cdots x_{ki_{m}}\Big)_{1\leq i_{1},\cdots,i_{m}\leq p}.$$
Let $W_n$ be the largest off-diagonal entry of $\mathbf{T}$. We derive the asymptotic distribution of $W_n$ under a suitable normalization for two cases: the ultra-high-dimension case with $p\to\infty$ and $\log p=o(n^{\beta})$, and the high-dimension case with $p\to\infty$ and $p=O(n^{\alpha})$, where $\alpha,\beta>0$. The normalizing constant of $W_n$ depends on $m$, and the limiting distribution of $W_n$ is a Gumbel-type distribution involving the parameter $m$.
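For small $p$ and $m$, the statistic $W_n$ can be computed directly by brute force, as in the sketch below. Reading "off-diagonal" as index tuples that are not all equal is an assumption about the definition, and the Gumbel-type limit is a theoretical result the code does not attempt to verify.

```python
# Brute-force computation of W_n, the largest off-diagonal entry of the m-th
# order tensor T = sum_k X_k^{tensor m}, feasible only for small p and m.
import itertools
import numpy as np

def largest_offdiagonal_entry(X, m):
    """X: (n, p) sample matrix. Returns max over index tuples (i_1,...,i_m) that
    are not all equal of sum_k x_{k,i_1} * ... * x_{k,i_m}."""
    n, p = X.shape
    best = -np.inf
    for idx in itertools.product(range(p), repeat=m):
        if len(set(idx)) == 1:
            continue                                 # skip diagonal entries i_1 = ... = i_m
        entry = np.prod(X[:, idx], axis=1).sum()     # sum over k of the index products
        best = max(best, entry)
    return best

rng = np.random.default_rng(0)
W_n = largest_offdiagonal_entry(rng.standard_normal((50, 5)), m=3)
```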