Penetrance, which plays a key role in genetic research, is defined as the proportion of individuals carrying the genetic variants (i.e., the genotype) that cause a particular trait who also show clinical symptoms of the trait (i.e., the phenotype). We propose a Bayesian semiparametric approach to estimate the cancer-specific age-at-onset penetrance in the presence of the competing risk of multiple cancers. We employ a Bayesian semiparametric competing risk model to model the duration until individuals in a high-risk group develop different cancers, and accommodate family data using family-wise likelihoods. We tackle the ascertainment bias arising when family data are collected through probands in a high-risk population in which disease cases are more likely to be observed. We apply the proposed method to a cohort of 186 families with Li-Fraumeni syndrome identified through probands with sarcoma treated at MD Anderson Cancer Center from 1944 to 1982.
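The abstract above gives no formulas, so the following is only a sketch of how a cancer-specific age-at-onset penetrance is commonly written in a competing-risk setting. The symbols $T$ (age at first cancer onset), $D$ (cancer type), $G$ (genotype), and $\lambda_k$ (cause-specific hazard) are assumed notation, not taken from the paper.

```latex
% Assumed notation: T = age at onset, D in {1,...,K} = cancer type, G = genotype.
% Cancer-specific penetrance as a cumulative incidence function:
q_k(t \mid G) \;=\; \Pr(T \le t,\ D = k \mid G)
             \;=\; \int_0^t \lambda_k(u \mid G)\, S(u \mid G)\, du ,
\qquad
S(u \mid G) \;=\; \exp\!\Big(-\sum_{k=1}^{K} \int_0^u \lambda_k(s \mid G)\, ds\Big).
```

Under this formulation, the penetrance for cancer $k$ depends on the hazards of all $K$ cancers through the overall survival term $S$, which is why the competing cancers cannot be analyzed one at a time.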
To implement disease-specific interventions in young age groups, policy makers in low- and middle-income countries require timely and accurate estimates of age- and cause-specific child mortality. High-quality data are not available in settings where these interventions are most needed, but there is a push to create sample registration systems that collect detailed mortality information. Current methods for estimating mortality from these data employ multistage frameworks that lack rigorous statistical justification, estimate all-cause and cause-specific mortality separately, and are not sufficiently adaptable to capture important features of the data. We propose a flexible Bayesian modeling framework to estimate age- and cause-specific child mortality from sample registration data. We provide a theoretical justification for the framework, explore its properties via simulation, and use it to estimate mortality trends using data from the Maternal and Child Health Surveillance System in China.
Entity resolution identifies and removes duplicate entities in large, noisy databases and has grown in both usage and new developments as a result of increased data availability. Nevertheless, entity resolution has tradeoffs regarding assumptions of the data generation process, error rates, and computational scalability that make it a difficult task for real applications. In this paper, we focus on a related problem of unique entity estimation, which is the task of estimating the number of unique entities and associated standard errors in a data set with duplicate entities. Unique entity estimation shares many fundamental challenges of entity resolution, namely, that the computational cost of all-to-all entity comparisons is intractable for large databases. To circumvent this computational barrier, we propose an efficient (near-linear time) estimation algorithm based on locality sensitive hashing. Our estimator, under realistic assumptions, is unbiased and has provably low variance compared to existing random-sampling-based approaches. In addition, we empirically show its superiority over the state-of-the-art estimators on three real applications. The motivation for our work is to derive an accurate estimate of the documented, identifiable deaths in the ongoing Syrian conflict. Our methodology, when applied to the Syrian data set, provides an estimate of $191,874 \pm 1772$ documented, identifiable deaths, which is very close to the Human Rights Data Analysis Group (HRDAG) estimate of 191,369. Our work provides an example of the challenges and efforts involved in solving a real, noisy, challenging problem where modeling assumptions may not hold.
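To illustrate why locality sensitive hashing avoids all-to-all comparisons, here is a minimal sketch using MinHash with banding. It is not the estimator proposed in the abstract above: it simply counts clusters among candidate pairs that share a hash bucket, and every function name, parameter, and threshold is a hypothetical choice made for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=2):
    """Return the set of character k-grams (assumes len(text) >= k)."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def stable_hash(seed, token):
    """Deterministic seeded hash (built-in hash() is salted per process)."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def minhash_signature(tokens, num_hashes):
    """For each seeded hash function, keep the minimum value over the set."""
    return tuple(
        min(stable_hash(seed, t) for t in tokens) for seed in range(num_hashes)
    )


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def count_unique(records, bands=8, rows=2, threshold=0.5):
    """Count clusters of near-duplicates, comparing only LSH candidate pairs."""
    sets = [shingles(r) for r in records]
    sigs = [minhash_signature(s, bands * rows) for s in sets]

    # Banding: records that agree on any band of the signature share a bucket.
    buckets = defaultdict(list)
    for idx, sig in enumerate(sigs):
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)

    # Union-find over candidate pairs that pass an exact Jaccard check.
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for members in buckets.values():
        for i, j in combinations(members, 2):
            if jaccard(sets[i], sets[j]) >= threshold:
                parent[find(i)] = find(j)

    return len({find(i) for i in range(len(records))})
```

Only records that collide in at least one band are ever compared, so the number of similarity evaluations depends on bucket sizes rather than on the square of the database size. Calling `count_unique(["john smith", "john smith", "maria garcia"])` merges the two identical records and leaves the third on its own.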
Identifying individuals who are at high risk of cancer due to inherited germline mutations is critical for effective implementation of personalized prevention strategies. Most existing models to identify these individuals focus on specific syndromes by including family and personal history for a small number of cancers. Recent evidence from multi-gene panel testing has shown that many syndromes once thought to be distinct are overlapping, motivating the development of models that incorporate family history information on several cancers and predict mutations for more comprehensive panels of genes. One such class of models is Mendelian risk prediction models, which use family history information and Mendelian laws of inheritance to estimate the probability of carrying genetic mutations, as well as future risk of developing associated cancers. To flexibly model the complexity of many cancer-mutation associations, we present a new software tool called PanelPRO, an R package that extends the previously developed BayesMendel R package to user-selected lists of susceptibility genes and associated cancers. The model identifies individuals at an increased risk of carrying cancer susceptibility gene mutations and predicts future risk of developing hereditary cancers associated with those genes. Additional functionalities adjust for prophylactic interventions, known genetic testing results, and risk modifiers such as race and ancestry. The package comes with a customizable database with default parameter values estimated from published studies. The PanelPRO package is open-source and provides a fast and flexible back-end for multi-gene, multi-cancer risk modeling with pedigree data. The software enables the identification of high-risk individuals, which will have an impact on personalized prevention strategies for cancer and individualized decision making about genetic testing.
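As a rough illustration of how a Mendelian risk prediction model combines family history with inheritance laws, the carrier probability can be written with Bayes' rule. The notation is assumed rather than taken from the abstract above: $g_0$ is the genotype of the individual being counseled, $g_1,\dots,g_n$ are the relatives' genotypes, and $h_0,\dots,h_n$ are the observed cancer histories, taken to be conditionally independent given the genotypes.

```latex
% Assumed notation: g_0 = counselee genotype, g_1..g_n = relatives' genotypes,
% h_0..h_n = cancer histories, conditionally independent given genotypes.
L(g_0) \;=\; \Pr(h_0 \mid g_0) \sum_{g_1,\dots,g_n}
             \Pr(g_1,\dots,g_n \mid g_0) \prod_{i=1}^{n} \Pr(h_i \mid g_i),
\qquad
\Pr(g_0 \mid h_0,\dots,h_n) \;=\;
    \frac{\Pr(g_0)\, L(g_0)}{\sum_{g_0'} \Pr(g_0')\, L(g_0')} .
```

Here $\Pr(g_0)$ is the population prevalence of the genotype, $\Pr(g_1,\dots,g_n \mid g_0)$ follows from Mendelian transmission, and $\Pr(h_i \mid g_i)$ is determined by the penetrance.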
Lung cancer is among the most common cancers in the United States, in terms of incidence and mortality. In 2009, it is estimated that more than 150,000 deaths will result from lung cancer alone. Genetic information is an extremely valuable data source in characterizing the personal nature of cancer. Over the past several years, investigators have conducted numerous association studies in which intensive genetic data are collected on relatively few patients compared to the number of gene predictors, with one scientific goal being to identify genetic features associated with cancer recurrence or survival. In this note, we propose high-dimensional survival analysis through a new application of boosting, a powerful tool in machine learning. Our approach is based on an accelerated lifetime model and minimizing the sum of pairwise differences in residuals. We apply our method to a recent microarray study of lung adenocarcinoma and find that our ensemble is composed of 19 genes, while a proportional hazards (PH) ensemble is composed of nine genes, a proper subset of the 19-gene panel. In one of our simulation scenarios, we demonstrate that PH boosting in a misspecified model tends to underfit and ignore moderately sized covariate effects, on average. Diagnostic analyses suggest that the PH assumption is not satisfied in the microarray data, which may explain, in part, the discrepancy in the sets of active coefficients. Our simulation studies and comparative data analyses demonstrate how statistical learning by PH models alone is insufficient.
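The abstract above does not state the loss function explicitly. One common way to formalize "the sum of pairwise differences in residuals" for an accelerated lifetime model is a Gehan-type loss, sketched below as an assumption. The function name and arguments are hypothetical, and the boosting procedure itself is not shown.

```python
def gehan_loss(log_times, events, predictions):
    """Gehan-type loss for an accelerated lifetime model (assumed form).

    log_times   : observed log survival or censoring times
    events      : 1 if the event was observed, 0 if censored
    predictions : fitted values on the log-time scale
    """
    n = len(log_times)
    residuals = [y - f for y, f in zip(log_times, predictions)]
    total = 0.0
    for i in range(n):
        if not events[i]:
            continue  # censored subjects contribute only as comparators
        for j in range(n):
            # Penalize pairs where an observed residual falls below another.
            total += max(residuals[j] - residuals[i], 0.0)
    return total / (n * n)
```

The loss is convex and piecewise linear in the predictions, so a boosting procedure can work with its (sub)gradient at each iteration. Censored observations enter only through the inner index, which is how this form accommodates right censoring.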
Cancer development is a multistep process often starting with a single cell in which a number of epigenetic and genetic alterations have accumulated, thus transforming it into a tumor cell. The progeny of such a single benign tumor cell expands in the tissue and can at some point progress to malignant tumor cells until a detectable tumor is formed. The dynamics from the early phase of a single cell to a detectable tumor with billions of tumor cells are complex and still not fully resolved, not even for the well-known prototype of multistage carcinogenesis, the adenoma-adenocarcinoma sequence of colorectal cancer. Mathematical models of such carcinogenesis are frequently tested and calibrated based on reported age-specific incidence rates of cancer, but they usually require calibration of four or more parameters due to the wide range of processes these models aim to reflect. We present a cell-based model, which focuses on the competition between wild-type and tumor cells in colonic crypts, with which we are able to reproduce epidemiological incidence rates of colon cancer. Additionally, the fraction of cancerous tumors with precancerous lesions predicted by the model agrees with clinical estimates. The match between model and reported data suggests that the fate of tumor development is dominated by the early phase of tumor growth and progression, long before a tumor becomes detectable. Due to the focus on the early phase of tumor development, the model has only a single fit parameter, the replacement rate of stem cells in the crypt. We find this rate to be consistent with recent experimental estimates.