
Experimental Design Issues in Big Data. The Question of Bias

Added by Elena Pesce
Publication date: 2017
Language: English





Data can be collected in scientific studies via a controlled experiment or by passive observation. Big data is often collected passively, e.g. from social media. In studies of causation, great efforts are made to guard against bias and against hidden confounders or feedback, which can destroy the identification of causation by corrupting or omitting counterfactuals (controls). Various solutions to these problems are discussed, including randomization.
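As a toy illustration of that last point — with a simulated outcome, a known treatment effect, and a hidden confounder, all numbers invented for the sketch — randomized assignment recovers the effect that passive observation distorts:

```python
import random

random.seed(0)
TRUE_EFFECT = 2.0  # treatment effect we hope to recover (assumed for the simulation)

def outcome(treated, confounder):
    # The outcome depends on the treatment and on a hidden confounder
    return TRUE_EFFECT * treated + 3.0 * confounder + random.gauss(0, 0.5)

def diff_in_means(data):
    treated = [y for w, y in data if w == 1]
    control = [y for w, y in data if w == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

n = 20000

# Passive observation: the hidden confounder also drives self-selection
# into treatment, so treated and control groups are not comparable
observational = []
for _ in range(n):
    u = random.random()                  # hidden confounder
    w = 1 if random.random() < u else 0  # exposure correlated with u
    observational.append((w, outcome(w, u)))

# Controlled experiment: treatment assigned by a fair coin flip,
# which breaks the link between exposure and the confounder
randomized = []
for _ in range(n):
    u = random.random()
    w = random.randint(0, 1)
    randomized.append((w, outcome(w, u)))

print(diff_in_means(observational))  # biased upward by the confounder
print(diff_in_means(randomized))     # close to TRUE_EFFECT
```

The naive difference in means is a valid estimate only under randomization; under passive observation it absorbs the confounder's contribution.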



Related Research

We develop an analytical framework to study experimental design in two-sided marketplaces. Many of these experiments exhibit interference, where an intervention applied to one market participant influences the behavior of another participant. This interference leads to biased estimates of the treatment effect of the intervention. We develop a stochastic market model and associated mean field limit to capture dynamics in such experiments, and use our model to investigate how the performance of different designs and estimators is affected by marketplace interference effects. Platforms typically use two common experimental designs: demand-side (customer) randomization (CR) and supply-side (listing) randomization (LR), along with their associated estimators. We show that good experimental design depends on market balance: in highly demand-constrained markets, CR is unbiased, while LR is biased; conversely, in highly supply-constrained markets, LR is unbiased, while CR is biased. We also introduce and study a novel experimental design based on two-sided randomization (TSR) where both customers and listings are randomized to treatment and control. We show that appropriate choices of TSR designs can be unbiased in both extremes of market balance, while yielding relatively low bias in intermediate regimes of market balance.
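The assignment logic of a two-sided randomization (TSR) design can be sketched as follows. The 50/50 split on each side and the rule that the intervention fires only when both the customer and the listing are treated are illustrative choices, not details taken from the paper:

```python
import random

random.seed(1)

customers = [f"c{i}" for i in range(6)]
listings = [f"l{j}" for j in range(6)]

# TSR: each side of the market is randomized independently
cust_arm = {c: random.randint(0, 1) for c in customers}
list_arm = {l: random.randint(0, 1) for l in listings}

def intervention_active(c, l):
    # Illustrative activation rule: the new treatment applies to a
    # transaction only when both participants are in the treatment arm
    return cust_arm[c] == 1 and list_arm[l] == 1

active = [(c, l) for c in customers for l in listings if intervention_active(c, l)]
```

Under demand-side (CR) or supply-side (LR) randomization only one of the two dictionaries would be randomized; TSR randomizes both, which is what lets suitable estimators stay unbiased in both extremes of market balance.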
This paper examines the use of big data in healthcare. Big data, in terms of architecture and implementation, has already begun to support, and will continue to support, growth in the field of healthcare. The main aspects of this study are the general importance of big data in healthcare, the improvements it can help bring to the field, and the considerable downsides it has for healthcare that still need improvement or extensive research. We believe institutions and individuals still have a long way to go in understanding the hidden truths of big data. We highlight the ways in which big data can be confidently relied upon and, on the other hand, the weighty problems it raises and their expected solutions.
Lu Lin, Jun Lu (2019)
The strategy of divide-and-combine (DC) has been widely used in the area of big data. Bias correction is crucial in the DC procedure for validly aggregating locally biased estimators, especially when the number of data batches is large. This paper establishes a race-DC through a residual-adjustment composition estimate (race). The race-DC applies to various types of biased estimators, including but not limited to the Lasso, Ridge, and principal component estimators in linear regression, and the least squares estimator in nonlinear regression. The resulting global estimator is strictly unbiased under a linear model, achieves accelerated bias reduction in nonlinear models, and can attain theoretical optimality when the number of data batches is large. Moreover, the race-DC is computationally simple because it is a least squares estimator in a pro forma linear regression. Detailed simulation studies demonstrate that the resulting global estimator is significantly bias-corrected, and its behavior is comparable with the oracle estimator and much better than that of the competitors.
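The race residual adjustment itself is paper-specific and not reproduced here, but the baseline divide-and-combine mechanics it builds on can be sketched: split the data into batches, compute per-batch summaries, and aggregate. For plain least squares, summing the batches' sufficient statistics recovers the full-data estimator exactly (batch sizes, coefficients, and noise level below are invented for the simulation):

```python
import random

random.seed(2)
beta = (1.5, -0.7)  # true (intercept, slope) for the simulated data

def make_batch(n):
    rows = []
    for _ in range(n):
        x = random.uniform(-1, 1)
        y = beta[0] + beta[1] * x + random.gauss(0, 0.3)
        rows.append((x, y))
    return rows

def suff_stats(rows):
    # Sufficient statistics for least squares with design [1, x]:
    # the entries of X'X (n, sum x, sum x^2) and X'y (sum y, sum xy)
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sxx = sum(x * x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxy = sum(x * y for x, y in rows)
    return n, sx, sxx, sy, sxy

def solve(stats):
    # Solve the 2x2 normal equations [n sx; sx sxx] b = [sy; sxy]
    n, sx, sxx, sy, sxy = stats
    det = n * sxx - sx * sx
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1

batches = [make_batch(500) for _ in range(10)]

# Divide-and-combine: sum sufficient statistics across batches
totals = tuple(sum(s) for s in zip(*(suff_stats(b) for b in batches)))
b0, b1 = solve(totals)

# Full-data fit for comparison
f0, f1 = solve(suff_stats([row for b in batches for row in b]))
```

For biased local estimators such as Lasso or Ridge, naive averaging of the batch estimates would not cancel the bias — that gap is exactly what the paper's residual-adjustment composition is designed to close.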
The hematopoietic system has a highly regulated and complex structure in which cells are organized to successfully create and maintain new blood cells. Feedback regulation is crucial to tightly control this system, but the specific mechanisms by which control is exerted are not completely understood. In this work, we aim to uncover the underlying mechanisms in hematopoiesis by conducting perturbation experiments, where animal subjects are exposed to an external agent in order to observe the system response and evolution. Developing a proper experimental design for these studies is an extremely challenging task. To address this issue, we have developed a novel Bayesian framework for optimal design of perturbation experiments. We model the numbers of hematopoietic stem and progenitor cells in mice that are exposed to a low dose of radiation. We use a differential equations model that accounts for feedback and feedforward regulation. A significant obstacle is that the experimental data are not longitudinal; rather, each data point corresponds to a different animal. This model is embedded in a hierarchical framework with latent variables that capture unobserved cellular population levels. We select the optimum design based on the amount of information gain, measured by the Kullback-Leibler divergence between the probability distributions before and after observing the data. We evaluate our approach using synthetic and experimental data. We show that a proper design can lead to better estimates of model parameters even with relatively few subjects. Additionally, we demonstrate that the model parameters show a wide range of sensitivities to design options. Our method should allow scientists to find the optimal design by focusing on their specific parameters of interest and provide insight into hematopoiesis. Our approach can be extended to more complex models where latent components are used.
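The design criterion described above — expected Kullback-Leibler divergence between posterior and prior — has a closed form in a toy conjugate-Gaussian setting, which makes the ranking of designs easy to sketch. This is a simplified stand-in for the paper's ODE-based hierarchical model; the prior variance, candidate designs, and noise levels below are invented for illustration:

```python
import math

prior_var = 4.0  # assumed prior variance on the parameter of interest

def expected_info_gain(n, noise_var):
    # For a Gaussian prior N(0, prior_var) and n i.i.d. Gaussian
    # observations with variance noise_var, the expected KL divergence
    # between posterior and prior (the mutual information between the
    # parameter and the data) is 0.5 * log(1 + n * prior_var / noise_var)
    return 0.5 * math.log(1.0 + n * prior_var / noise_var)

# Hypothetical designs: (number of animals, measurement noise variance)
designs = {
    "coarse": (10, 4.0),   # many animals, noisy assay
    "precise": (4, 0.5),   # few animals, precise assay
    "tiny": (2, 2.0),      # minimal study
}

gains = {name: expected_info_gain(n, v) for name, (n, v) in designs.items()}
best = max(gains, key=gains.get)  # design with the largest expected gain
```

The same select-by-expected-KL logic carries over to the paper's setting, except that there the posterior is not available in closed form and the gain must be estimated by simulation.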
To fast approximate maximum likelihood estimators with massive data, this paper studies the Optimal Subsampling Method under the A-optimality Criterion (OSMAC) for generalized linear models. The consistency and asymptotic normality of the estimator from a general subsampling algorithm are established, and optimal subsampling probabilities under the A- and L-optimality criteria are derived. Furthermore, using Frobenius norm matrix concentration inequalities, finite sample properties of the subsample estimator based on optimal subsampling probabilities are also derived. Since the optimal subsampling probabilities depend on the full data estimate, an adaptive two-step algorithm is developed. Asymptotic normality and optimality of the estimator from this adaptive algorithm are established. The proposed methods are illustrated and evaluated through numerical experiments on simulated and real datasets.
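A minimal sketch of the subsample-and-reweight idea, reduced from generalized linear models to estimating a simple mean. OSMAC's two-step pilot stage and its A-/L-optimality derivations are not reproduced; the |x|-proportional probabilities below are the variance-minimizing choice for this toy Hansen-Hurwitz estimator, playing the role of the paper's optimality-driven probabilities:

```python
import random

random.seed(3)
N = 100_000
data = [random.gauss(5.0, 2.0) for _ in range(N)]  # "full data" (simulated)
full_mean = sum(data) / N

r = 2000  # subsample size

# Baseline: uniform subsampling with replacement
uniform_idx = random.choices(range(N), k=r)
uniform_mean = sum(data[i] for i in uniform_idx) / r

# Non-uniform subsampling: probabilities proportional to |x_i|
# minimize the variance of the Hansen-Hurwitz estimator of a mean
scores = [abs(x) + 1e-12 for x in data]
total = sum(scores)
probs = [s / total for s in scores]
weighted_idx = random.choices(range(N), weights=scores, k=r)

# Inverse-probability weighting keeps the estimator unbiased for full_mean
sub_mean = sum(data[i] / (N * probs[i]) for i in weighted_idx) / r
```

The pattern mirrors the paper's algorithm: sample a small subset with carefully chosen probabilities, then reweight by their inverses so the subsample estimator remains consistent for the full-data target.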