
Can Common Crawl reliably track persistent identifier (PID) use over time?

Published by Henry S Thompson
Publication date: 2018
Research field: Informatics Engineering
Language: English





We report here on the results of two studies using two and four monthly web crawls respectively from the Common Crawl (CC) initiative between 2014 and 2017, whose initial goal was to provide empirical evidence for the changing patterns of use of so-called persistent identifiers. This paper focusses on the tooling needed for dealing with CC data, and the problems we found with it. The first study is based on over $10^{12}$ URIs from over $5 \times 10^9$ pages crawled in April 2014 and April 2017; the second study adds a further $3 \times 10^9$ pages from the April 2015 and April 2016 crawls. We conclude with suggestions on specific actions needed to enable studies based on CC to give reliable longitudinal information.
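As a rough illustration of the kind of tally such a study involves, the following sketch counts PID-style URIs per crawl. It is not the tooling described in the paper: the resolver host list `PID_HOSTS`, the function names, and the `(crawl_label, uri)` input shape are all assumptions made for this example, and real CC data would first have to be extracted from the crawl archives.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumed set of resolver hosts treated as PID infrastructure.
# This list is illustrative only; it is not taken from the paper.
PID_HOSTS = {"doi.org", "dx.doi.org", "hdl.handle.net", "purl.org", "n2t.net"}


def is_pid_uri(uri: str) -> bool:
    """Return True if the URI's host is one of the assumed PID resolvers."""
    try:
        host = urlsplit(uri).hostname
    except ValueError:
        # Malformed URIs (e.g. broken IPv6 brackets) are simply not counted.
        return False
    return host is not None and host in PID_HOSTS


def pid_counts_by_crawl(records):
    """Count PID-style URIs per crawl label from (crawl_label, uri) pairs."""
    counts = Counter()
    for crawl, uri in records:
        if is_pid_uri(uri):
            counts[crawl] += 1
    return counts


# Made-up sample input, for illustration only.
sample = [
    ("2014-04", "http://dx.doi.org/10.1000/182"),
    ("2014-04", "http://example.com/a"),
    ("2017-04", "https://doi.org/10.1000/182"),
    ("2017-04", "http://hdl.handle.net/1234/5"),
]
counts = pid_counts_by_crawl(sample)
```

Matching on the host alone is a deliberate simplification; a longitudinal comparison would also need to normalise for differences in crawl size and coverage between years, which is the kind of reliability problem the abstract raises.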


Read also

Event collections are frequently built by crawling the live web on the basis of seed URIs nominated by human experts. Focused web crawling is a technique where the crawler is guided by reference content pertaining to the event. Given the dynamic nature of the web and the pace with which topics evolve, the timing of the crawl is a concern for both approaches. We investigate the feasibility of performing focused crawls on the archived web. By utilizing the Memento infrastructure, we obtain resources from 22 web archives that contribute to building event collections. We create collections on four events and compare the relevance of their resources to collections built from crawling the live web as well as from a manually curated collection. Our results show that focused crawling on the archived web can be done and indeed results in highly relevant collections, especially for events that happened further in the past.
We quantify the extent to which references to papers in scholarly literature use persistent HTTP URIs that leverage the Digital Object Identifier infrastructure. We find a significant number of references that do not, speculate why authors would use brittle URIs when persistent ones are available, and propose an approach to alleviate the problem.
We give the first dimension-efficient algorithms for learning Rectified Linear Units (ReLUs), which are functions of the form $\mathbf{x} \mapsto \max(0, \mathbf{w} \cdot \mathbf{x})$ with $\mathbf{w} \in \mathbb{S}^{n-1}$. Our algorithm works in the challenging Reliable Agnostic learning model of Kalai, Kanade, and Mansour (2009) where the learner is given access to a distribution $\mathcal{D}$ on labeled examples but the labeling may be arbitrary. We construct a hypothesis that simultaneously minimizes the false-positive rate and the loss on inputs given positive labels by $\mathcal{D}$, for any convex, bounded, and Lipschitz loss function. The algorithm runs in polynomial time (in $n$) with respect to any distribution on $\mathbb{S}^{n-1}$ (the unit sphere in $n$ dimensions) and for any error parameter $\epsilon = \Omega(1/\log n)$ (this yields a PTAS for a question raised by F. Bach on the complexity of maximizing ReLUs). These results are in contrast to known efficient algorithms for reliably learning linear threshold functions, where $\epsilon$ must be $\Omega(1)$ and strong assumptions are required on the marginal distribution. We can compose our results to obtain the first set of efficient algorithms for learning constant-depth networks of ReLUs. Our techniques combine kernel methods and polynomial approximations with a dual-loss approach to convex programming. As a byproduct we obtain a number of applications including the first set of efficient algorithms for convex piecewise-linear fitting and the first efficient algorithms for noisy polynomial reconstruction of low-weight polynomials on the unit sphere.
The centroid energy of the Fe K$\alpha$ line has been used to identify the progenitors of supernova remnants (SNRs). These investigations generally considered the energy of the centroid derived from the spectrum of the entire remnant. Here we use XMM-Newton data to investigate the Fe K$\alpha$ centroid in 6 SNRs: 3C 397, N132D, W49B, DEM L71, 1E 0102.2-7219, and Kes 73. In Kes 73 and 1E 0102.2-7219, we fail to detect any Fe K$\alpha$ emission. We report a tentative first detection of Fe K$\alpha$ emission in SNR DEM L71, with a centroid energy consistent with its Type Ia designation. In the remaining remnants, the spatial and spectral sensitivity is sufficient to investigate spatial variations of the Fe K$\alpha$ centroid. We find in N132D and W49B that the centroids in different regions are consistent with that derived from the overall spectrum, although not necessarily with the remnant type identified via other means. However, in SNR 3C 397, we find statistically significant variation in the centroid of up to 100 eV, aligning with the variation in the density structure around the remnant. These variations span the intermediate space between centroid energies signifying core-collapse and Type Ia remnants. Shifting the dividing line downwards by 50 eV can place all the centroids in the CC region, but contradicts the remnant type obtained via other means. Our results show that caution must be used when employing the Fe K$\alpha$ centroid of the entire remnant as the sole diagnostic for typing a remnant.
In this paper, we review points to note when using Web map images provided by Web map services, from the viewpoint of the copyright act. The copyright act aims to contribute to the creation of culture by protecting the rights of authors and others and by promoting fair exploitation of cultural products. Therefore, anyone can use copyrighted materials within the limitations set by the copyright act. Web map images, including maps, aerial photos, and satellite images, are copyrighted materials, so they can be used within those limits. However, the range of permitted uses of Web map images under the copyright act is not wide. In addition, it has been pointed out that the copyright act has not kept pace with the digitalization of copyrighted materials. A revision of the copyright act that addresses the digitalization of copyrighted works is expected.