The root-cause diagnosis of product quality defects in multistage manufacturing processes often requires the joint identification of crucial stages and process variables. To meet this requirement, this paper proposes a novel penalized matrix regression methodology for two-dimensional variable selection. The method regresses a scalar response variable against a matrix-based predictor using a generalized linear model. The unknown regression coefficient matrix is decomposed as a product of two factor matrices, and the rows of the first factor matrix and the columns of the second are simultaneously penalized to induce sparsity. To estimate the parameters, we develop a block coordinate proximal descent (BCPD) optimization algorithm, which cyclically solves two convex sub-optimization problems. We prove that the BCPD algorithm always converges to a critical point from any initialization, and that each sub-optimization problem has a closed-form solution if the response variable follows a distribution whose (negative) log-likelihood function has a Lipschitz continuous gradient. A simulation study and a dataset from a real-world application are used to validate the effectiveness of the proposed method.
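Closed-form sub-problem solutions of this kind typically take the shape of a proximal-gradient step under a group penalty, whose proximal operator is row-wise soft-thresholding. The sketch below is illustrative only, assuming a loss with Lipschitz gradient constant `Lc` and small hypothetical dimensions; it is not the paper's implementation:

```python
import numpy as np

def prox_row_group_lasso(U, tau):
    """Row-wise group soft-thresholding: shrinks each row's l2 norm by tau
    and zeroes any row whose norm is below tau (illustrative sketch)."""
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return scale * U

# One proximal-gradient step on a factor matrix U (hypothetical setup):
# grad is the loss gradient at the current U; lam is the penalty weight.
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))
grad = rng.normal(size=(5, 3))
Lc = 2.0       # assumed Lipschitz constant of the loss gradient
lam = 0.5      # assumed row-sparsity penalty weight
U_new = prox_row_group_lasso(U - grad / Lc, lam / Lc)
```

Rows whose updated norm falls below the threshold are set exactly to zero, which is what makes the penalty select (or drop) entire stages at once.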
Our work was motivated by a recent study on birth defects in infants born to pregnant women exposed to a certain medication for treating chronic diseases. Outcomes such as birth defects are rare events in the general population, which often translates to very small numbers of events in the unexposed group. As drug safety studies in pregnancy are typically observational in nature, we control for confounding in this rare-events setting using propensity scores (PS). Using our empirical data, we noticed that the estimated odds ratio for birth defects due to exposure varied drastically depending on the specific approach used. The commonly used approaches with PS are matching, stratification, inverse probability weighting (IPW), and regression adjustment. The extremely rare events setting renders matching and stratification infeasible. In addition, the PS itself may be formed via different approaches to selecting confounders from a relatively long list of potential confounders. We carried out simulation experiments to compare different combinations of approaches: IPW or regression adjustment, with 1) including all potential confounders without selection, 2) selection based on univariate association between the candidate variable and the outcome, or 3) selection based on change in effects (CIE). The simulations showed that IPW without selection leads to extremely large variances in the estimated odds ratio, which helps to explain the empirical data analysis results that we had observed. The simulations also showed that IPW with selection based on univariate association with the outcome is preferred over IPW with CIE. Regression adjustment yields small variances in the estimated odds ratio regardless of the selection method used.
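To illustrate the IPW estimator at the center of this comparison, the toy simulation below uses a single confounder and, for simplicity, treats the true propensity score as known (in the paper's setting the PS is estimated and confounder selection matters). All numeric values here are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)                       # single confounder (assumed)
ps = 1 / (1 + np.exp(-0.5 * x))              # true propensity score, known here
a = rng.binomial(1, ps)                      # exposure indicator
# Rare outcome: very negative baseline log-odds; true exposure log-OR = 0.7
logit_y = -5.0 + 0.7 * a + 0.8 * x
y = rng.binomial(1, 1 / (1 + np.exp(-logit_y)))

# Inverse probability weights: 1/ps for exposed, 1/(1-ps) for unexposed
w = a / ps + (1 - a) / (1 - ps)
p1 = np.average(y[a == 1], weights=w[a == 1])   # weighted risk if exposed
p0 = np.average(y[a == 0], weights=w[a == 0])   # weighted risk if unexposed
or_ipw = (p1 / (1 - p1)) / (p0 / (1 - p0))      # IPW-adjusted odds ratio
```

With a large sample the estimate lands near exp(0.7) ≈ 2; shrinking `n` toward realistic rare-event counts makes the variance inflation that the abstract describes easy to reproduce.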
Online platforms collect rich information about participants and then share some of this information back with them to improve market outcomes. In this paper, we study the following information disclosure problem in two-sided markets: if a platform wants to maximize revenue, which sellers should the platform allow to participate, and how much of its available information about participating sellers' quality should the platform share with buyers? We study this information disclosure problem in the context of two distinct two-sided market models: one in which the platform chooses prices and the sellers choose quantities (similar to ride-sharing), and one in which the sellers choose prices (similar to e-commerce). Our main results provide conditions under which simple information structures commonly observed in practice, such as banning certain sellers from the platform while not distinguishing between participating sellers, maximize the platform's revenue. An important innovation in our analysis is to transform the platform's information disclosure problem into a constrained price discrimination problem. We leverage this transformation to obtain our structural results.
We compare two major approaches to variable selection in clustering: model selection and regularization. Based on previous results, we select the method of Maugis et al. (2009b), which modified the method of Raftery and Dean (2006), as a current state-of-the-art model selection method. We select the method of Witten and Tibshirani (2010) as a current state-of-the-art regularization method. We compared the methods by simulation in terms of their accuracy in both classification and variable selection. In the first simulation experiment, all the variables were conditionally independent given cluster membership. We found that variable selection (of either kind) yielded substantial gains in classification accuracy when the clusters were well separated, but few gains when the clusters were close together. We found that the two variable selection methods had comparable classification accuracy, but that the model selection approach had substantially better accuracy in selecting variables. In our second simulation experiment, there were correlations among the variables given the cluster memberships. We found that the model selection approach was substantially more accurate in terms of both classification and variable selection than the regularization approach, and that both gave more accurate classifications than $K$-means without variable selection.
When testing for a disease such as COVID-19, the standard method is individual testing: we take a sample from each individual and test these samples separately. An alternative is pooled testing (or group testing), where samples are mixed together in different pools, and those pooled samples are tested. When the prevalence of the disease is low and the accuracy of the test is fairly high, pooled testing strategies can be more efficient than individual testing. In this chapter, we discuss the mathematics of pooled testing and its uses during pandemics, in particular the COVID-19 pandemic. We analyse some one- and two-stage pooling strategies under perfect and imperfect tests, and consider the practical issues in the application of such protocols.
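As a concrete illustration of the efficiency gain, consider the classical two-stage (Dorfman) scheme with a perfect test: a group of size k costs one pooled test, plus k individual retests whenever the pool is positive, which happens with probability 1 - (1-p)^k at prevalence p. The short calculation below (prevalence value assumed for illustration) finds the pool size minimizing the expected number of tests per person:

```python
import numpy as np

def dorfman_tests_per_person(p, k):
    """Expected tests per person under two-stage Dorfman pooling with a
    perfect test: 1/k for the pooled test, plus one retest per person
    whenever the pool is positive (probability 1 - (1-p)^k)."""
    return 1.0 / k + 1.0 - (1.0 - p) ** k

p = 0.01                               # assumed prevalence of 1%
ks = np.arange(2, 51)                  # candidate pool sizes
costs = dorfman_tests_per_person(p, ks)
best_k = int(ks[np.argmin(costs)])     # optimal pool size at this prevalence
# At 1% prevalence the optimum is k = 11, needing about 0.196 tests per
# person, i.e. roughly a fivefold saving over individual testing.
```

As the prevalence rises, the optimal pool shrinks and the saving evaporates, which matches the chapter's point that pooling pays off only when prevalence is low.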
Competing risk analysis considers event times due to multiple causes, or more than one type of event. Commonly used regression models for such data include 1) the cause-specific hazards model, which focuses on modeling one type of event while acknowledging other event types simultaneously; and 2) the subdistribution hazards model, which links the covariate effects directly to the cumulative incidence function. Their use, and in particular their statistical properties, in the presence of high-dimensional predictors are largely unexplored. Motivated by an analysis using the linked SEER-Medicare database for predicting cancer versus non-cancer mortality in patients with prostate cancer, we study the accuracy of prediction and variable selection of existing statistical learning methods under both models using extensive simulation experiments, including different approaches to choosing penalty parameters in each method. We then apply the optimal approaches to the analysis of the SEER-Medicare data.