
Unique Entity Estimation with Application to the Syrian Conflict

Posted by Rebecca Steorts
Publication date: 2017
Paper language: English





Entity resolution identifies and removes duplicate entities in large, noisy databases and has grown in both usage and new developments as a result of increased data availability. Nevertheless, entity resolution involves tradeoffs regarding assumptions about the data generation process, error rates, and computational scalability that make it a difficult task for real applications. In this paper, we focus on the related problem of unique entity estimation, which is the task of estimating the number of unique entities, with associated standard errors, in a data set containing duplicate entities. Unique entity estimation shares many fundamental challenges of entity resolution, namely, that the computational cost of all-to-all entity comparisons is intractable for large databases. To circumvent this computational barrier, we propose an efficient (near-linear time) estimation algorithm based on locality sensitive hashing. Our estimator, under realistic assumptions, is unbiased and has provably low variance compared to existing random-sampling-based approaches. In addition, we empirically show its superiority over state-of-the-art estimators on three real applications. The motivation for our work is to derive an accurate estimate of the documented, identifiable deaths in the ongoing Syrian conflict. Our methodology, when applied to the Syrian data set, provides an estimate of $191,874 \pm 1772$ documented, identifiable deaths, which is very close to the Human Rights Data Analysis Group (HRDAG) estimate of 191,369. Our work provides an example of the challenges and efforts involved in solving a real, noisy, challenging problem where modeling assumptions may not hold.
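The computational idea behind the LSH step is easy to sketch: min-hash each record's shingle set, band the signatures so that similar records collide in a bucket, and treat connected components of the resulting candidate-pair graph as entities. The Python sketch below illustrates that pipeline under simplifying assumptions; all function names and parameter values are illustrative, and it omits the paper's sampling-based correction, unbiasedness argument, and variance analysis.

```python
from collections import defaultdict

def shingles(record, k=3):
    """Character k-shingles of a record string."""
    record = record.lower()
    return {record[i:i + k] for i in range(max(1, len(record) - k + 1))}

def minhash_signature(shingle_set, n_hashes=32):
    """One min-hash value per seeded hash function (simple illustrative scheme)."""
    return [min(hash((seed, s)) for s in shingle_set) for seed in range(n_hashes)]

def lsh_candidate_pairs(records, n_hashes=32, bands=16):
    """Band the signatures; records colliding in any band become candidate pairs."""
    rows = n_hashes // bands
    buckets = defaultdict(list)
    for i, rec in enumerate(records):
        sig = minhash_signature(shingles(rec), n_hashes)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(i)
    pairs = set()
    for members in buckets.values():
        pairs.update((members[i], members[j])
                     for i in range(len(members))
                     for j in range(i + 1, len(members)))
    return pairs

def unique_entity_estimate(n_records, pairs):
    """Union-find over candidate pairs; connected components stand in for entities."""
    parent = list(range(n_records))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for i, j in pairs:
        parent[find(i)] = find(j)
    return len({find(i) for i in range(n_records)})

records = ["Jon Smith", "John Smith", "Maria Garcia", "M. Garcia", "Maria Garcia"]
# Output is probabilistic: near-duplicates collide only with high probability.
print(unique_entity_estimate(len(records), lsh_candidate_pairs(records)))
```

With $b$ bands of $r$ rows each, two records of Jaccard similarity $s$ collide in at least one band with probability $1 - (1 - s^r)^b$, which is what keeps the number of pairs actually compared near-linear in the number of records.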


Read also

Entity resolution seeks to merge databases so as to remove duplicate entries when unique identifiers are typically unknown. We review modern blocking approaches for entity resolution, focusing on those based upon locality sensitive hashing (LSH). First, we introduce $k$-means locality sensitive hashing (KLSH), which is based upon the information retrieval literature and clusters similar records into blocks using a vector-space representation and projections. Second, we introduce a subquadratic variant of LSH to the literature, known as Densified One Permutation Hashing (DOPH). Third, we propose a weighted variant of DOPH. We illustrate each method on an application to a subset of the ongoing Syrian conflict, giving a discussion of each method.
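A rough sketch of the KLSH idea (an assumed pipeline for illustration, not the authors' code): embed each record as a character n-gram TF-IDF vector, then cluster with k-means so each cluster serves as a block. Parameter values below are placeholders, and the sketch leaves out the random projections that the full method uses for speed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def klsh_blocks(records, n_blocks=10):
    """Cluster TF-IDF embeddings of record strings; each cluster is one block."""
    X = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(records)
    labels = KMeans(n_clusters=n_blocks, n_init=10, random_state=0).fit_predict(X)
    blocks = {}
    for record_id, label in enumerate(labels):
        blocks.setdefault(label, []).append(record_id)
    return blocks
```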
We develop methodology for the estimation of the functional mean and the functional principal components when the functions form a spatial process. The data consist of curves $X(\mathbf{s}_k; t)$, $t \in [0,T]$, observed at spatial locations $\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_N$. We propose several methods, and evaluate them by means of a simulation study. Next, we develop a significance test for the correlation of two such functional spatial fields. After validating the finite sample performance of this test by means of a simulation study, we apply it to determine if there is correlation between long-term trends in the so-called critical ionospheric frequency and decadal changes in the direction of the internal magnetic field of the Earth. The test provides conclusive evidence for correlation, thus solving a long-standing space physics conjecture. This conclusion is not apparent if the spatial dependence of the curves is neglected.
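For intuition only, the naive versions of these estimators, which ignore exactly the spatial dependence the paper addresses, are the pointwise sample mean and the eigendecomposition of the empirical covariance of curves on a common grid. A minimal numpy sketch, names mine:

```python
import numpy as np

def naive_functional_mean_fpcs(curves, n_components=3):
    """curves: (N, T) array, row k a curve X(s_k; t) on a shared time grid.
    Returns the pointwise mean and the leading empirical eigenfunctions.
    Treats curves as i.i.d., i.e., ignores dependence across locations s_k."""
    mu = curves.mean(axis=0)
    centered = curves - mu
    cov = centered.T @ centered / len(curves)   # empirical covariance operator
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalue order
    idx = np.argsort(eigvals)[::-1][:n_components]
    return mu, eigvals[idx], eigvecs[:, idx]
```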
Penetrance, which plays a key role in genetic research, is defined as the proportion of individuals carrying the genetic variants (i.e., genotype) that cause a particular trait who also have clinical symptoms of the trait (i.e., phenotype). We propose a Bayesian semiparametric approach to estimate the cancer-specific age-at-onset penetrance in the presence of the competing risk of multiple cancers. We employ a Bayesian semiparametric competing risk model to model the duration until individuals in a high-risk group develop different cancers, and accommodate family data using family-wise likelihoods. We tackle the ascertainment bias arising when family data are collected through probands in a high-risk population in which disease cases are more likely to be observed. We apply the proposed method to a cohort of 186 families with Li-Fraumeni syndrome identified through probands with sarcoma treated at MD Anderson Cancer Center from 1944 to 1982.
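In standard competing-risks notation (mine; the paper's parameterization may differ), the cancer-specific age-at-onset penetrance for cancer type $c$ given genotype $g$ is a cumulative incidence function, and ascertainment through probands is handled by conditioning the family likelihood on the event that caused the family to be sampled:

```latex
% Cause-specific hazards \lambda_c; penetrance as cumulative incidence
% (standard competing-risks form, notation assumed, not the paper's own):
F_c(t \mid g) = \Pr(T \le t,\ \text{cause} = c \mid g)
  = \int_0^t \lambda_c(u \mid g)\,
    \exp\!\Big(-\sum_{c'} \int_0^u \lambda_{c'}(v \mid g)\, dv\Big)\, du,
\qquad
L_{\text{adjusted}} = \frac{L_{\text{family}}}{\Pr(\text{proband is affected})}.
```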
In this article, we propose a new probability distribution named the power Maxwell distribution (PMaD). It is another extension of the Maxwell distribution (MaD), providing more flexibility for analyzing data with a non-monotone failure rate. Different statistical properties such as reliability characteristics, moments, quantiles, mean deviation, the generating function, conditional moments, stochastic ordering, the residual lifetime function, and various entropy measures have been derived. The estimation of the parameters of the proposed probability distribution is addressed by the maximum likelihood and Bayes estimation methods. The Bayes estimates are obtained under a gamma prior using the squared error loss function. Lastly, a real-life application of the proposed distribution is illustrated through different lifetime data.
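The generic power-transform construction behind such extensions is easy to state (a sketch; the paper's exact parameterization may differ). If $X$ follows a Maxwell distribution, then $Y = X^{1/\beta}$ has the density given by the change-of-variables formula:

```latex
% Maxwell density (one common parameterization) and its power transform:
f_X(x;\theta) = \frac{4}{\sqrt{\pi}}\,\theta^{3/2} x^{2} e^{-\theta x^{2}},
\quad x > 0,
\qquad
Y = X^{1/\beta} \;\Longrightarrow\;
f_Y(y;\theta,\beta) = \frac{4}{\sqrt{\pi}}\,\theta^{3/2}\,\beta\,
  y^{3\beta - 1} e^{-\theta y^{2\beta}}, \quad y > 0.
% \beta \neq 1 bends the hazard, which is what permits non-monotone failure rates.
```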
Entity resolution (ER; also known as record linkage or de-duplication) is the process of merging noisy databases, often in the absence of unique identifiers. A major advancement in ER methodology has been the application of Bayesian generative models, which provide a natural framework for inferring latent entities with rigorous quantification of uncertainty. Despite these advantages, existing models are severely limited in practice, as standard inference algorithms scale quadratically in the number of records. While scaling can be managed by fitting the model on separate blocks of the data, such a naive approach may induce significant error in the posterior. In this paper, we propose a principled model for scalable Bayesian ER, called distributed Bayesian linkage or d-blink, which jointly performs blocking and ER without compromising posterior correctness. Our approach relies on several key ideas, including: (i) an auxiliary variable representation that induces a partition of the entities and records into blocks; (ii) a method for constructing well-balanced blocks based on k-d trees; (iii) a distributed partially-collapsed Gibbs sampler with improved mixing; and (iv) fast algorithms for performing Gibbs updates. Empirical studies on six data sets, including a case study on the 2010 Decennial Census, demonstrate the scalability and effectiveness of our approach.
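Of the four ideas, the k-d-tree block construction is the simplest to sketch: recursively split records (here reduced to numeric feature vectors) at the median of cycling coordinates, so the resulting blocks stay well balanced. The sketch below is illustrative only and omits the auxiliary-variable representation and the Gibbs sampler:

```python
import numpy as np

def kd_blocks(points, max_block=500, depth=0):
    """Median splits on cycling coordinates yield balanced blocks (k-d-tree idea)."""
    if len(points) <= max_block:
        return [points]
    axis = depth % points.shape[1]
    cut = np.median(points[:, axis])
    left, right = points[points[:, axis] <= cut], points[points[:, axis] > cut]
    if len(left) == 0 or len(right) == 0:   # guard against degenerate splits
        return [points]
    return (kd_blocks(left, max_block, depth + 1)
            + kd_blocks(right, max_block, depth + 1))
```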
