
Can Common Crawl reliably track persistent identifier (PID) use over time?

Added by Henry S Thompson
Publication date: 2018
Language: English





We report here on the results of two studies, using two and four monthly web crawls respectively from the Common Crawl (CC) initiative between 2014 and 2017, whose initial goal was to provide empirical evidence for the changing patterns of use of so-called persistent identifiers. This paper focuses on the tooling needed for dealing with CC data and the problems we found with it. The first study is based on over $10^{12}$ URIs from over $5 \times 10^9$ pages crawled in April 2014 and April 2017; the second study adds a further $3 \times 10^9$ pages from the April 2015 and April 2016 crawls. We conclude with suggestions on specific actions needed to enable studies based on CC to give reliable longitudinal information.
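The kind of link extraction the study depends on can be approximated with off-the-shelf tooling. Below is a minimal sketch, assuming the warcio package, a locally downloaded Common Crawl WAT metadata file (the file name is a placeholder), and the published WAT JSON layout; it tallies outbound links whose hosts belong to a small, illustrative set of PID resolvers, not the study's actual host list.

```python
# Minimal sketch: tally persistent-identifier hosts in a Common Crawl WAT file.
# Assumes the warcio package; the host set and file name are illustrative.
import json
from collections import Counter
from urllib.parse import urlparse

from warcio.archiveiterator import ArchiveIterator

PID_HOSTS = {"doi.org", "dx.doi.org", "hdl.handle.net", "purl.org", "w3id.org"}

def tally_pid_links(wat_path):
    """Count links to known PID resolvers in one Common Crawl WAT file."""
    counts = Counter()
    with open(wat_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "metadata":
                continue
            meta = json.loads(record.content_stream().read())
            # Walk the standard WAT envelope down to the page's outbound links.
            links = (meta.get("Envelope", {})
                         .get("Payload-Metadata", {})
                         .get("HTTP-Response-Metadata", {})
                         .get("HTML-Metadata", {})
                         .get("Links", []))
            for link in links:
                host = urlparse(link.get("url", "")).netloc.lower()
                if host in PID_HOSTS:
                    counts[host] += 1
    return counts

if __name__ == "__main__":
    # File name is a placeholder for a downloaded WAT segment.
    print(tally_pid_links("example.warc.wat.gz"))
```

In practice a study at the scale described above would shard this loop over the thousands of WAT segments in a monthly crawl rather than run it on a single file.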



Related research

Event collections are frequently built by crawling the live web on the basis of seed URIs nominated by human experts. Focused web crawling is a technique where the crawler is guided by reference content pertaining to the event. Given the dynamic nature of the web and the pace with which topics evolve, the timing of the crawl is a concern for both approaches. We investigate the feasibility of performing focused crawls on the archived web. By utilizing the Memento infrastructure, we obtain resources from 22 web archives that contribute to building event collections. We create collections on four events and compare the relevance of their resources to collections built from crawling the live web as well as from a manually curated collection. Our results show that focused crawling on the archived web can be done and indeed results in highly relevant collections, especially for events that happened further in the past.
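The datetime negotiation that underlies this approach is defined by the Memento protocol (RFC 7089) and can be exercised in a few lines. A minimal sketch, assuming the requests package and the public Time Travel aggregator's TimeGate endpoint; the seed URI and event date are placeholders:

```python
# Minimal sketch: fetch the archived snapshot of a seed URI closest to an
# event date via a Memento TimeGate (RFC 7089). Error handling omitted.
import requests

TIMEGATE = "http://timetravel.mementoweb.org/timegate/"

def closest_memento(uri, event_datetime):
    # Accept-Datetime asks the TimeGate for the memento nearest this instant.
    headers = {"Accept-Datetime": event_datetime}
    resp = requests.get(TIMEGATE + uri, headers=headers, allow_redirects=True)
    resp.raise_for_status()
    return resp.url, resp.headers.get("Memento-Datetime")

if __name__ == "__main__":
    # Placeholder seed URI and event datetime.
    url, dt = closest_memento("http://www.example.com/",
                              "Thu, 01 May 2014 00:00:00 GMT")
    print(dt, url)
```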
We quantify the extent to which references to papers in scholarly literature use persistent HTTP URIs that leverage the Digital Object Identifier infrastructure. We find a significant number of references that do not, speculate why authors would use brittle URIs when persistent ones are available, and propose an approach to alleviate the problem.
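The distinction the study draws can be checked mechanically. A minimal sketch using only the standard library; the host set covers only the canonical DOI resolvers, and the example reference URIs are hypothetical placeholders, not data from the study:

```python
# Minimal sketch: classify a reference URI as a DOI-based persistent link
# or a plain (potentially brittle) publisher link.
from urllib.parse import urlparse

DOI_HOSTS = {"doi.org", "dx.doi.org"}

def is_persistent_doi_uri(uri):
    parsed = urlparse(uri)
    return parsed.scheme in ("http", "https") and parsed.netloc.lower() in DOI_HOSTS

refs = [
    "https://doi.org/10.1234/example-doi",          # hypothetical DOI link
    "http://www.example-publisher.com/article/123",  # hypothetical brittle link
]
for r in refs:
    print(is_persistent_doi_uri(r), r)
```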
We give the first dimension-efficient algorithms for learning Rectified Linear Units (ReLUs), which are functions of the form $\mathbf{x} \mapsto \max(0, \mathbf{w} \cdot \mathbf{x})$ with $\mathbf{w} \in \mathbb{S}^{n-1}$. Our algorithm works in the challenging Reliable Agnostic learning model of Kalai, Kanade, and Mansour (2009), where the learner is given access to a distribution $\mathcal{D}$ on labeled examples but the labeling may be arbitrary. We construct a hypothesis that simultaneously minimizes the false-positive rate and the loss on inputs given positive labels by $\mathcal{D}$, for any convex, bounded, and Lipschitz loss function. The algorithm runs in polynomial time (in $n$) with respect to any distribution on $\mathbb{S}^{n-1}$ (the unit sphere in $n$ dimensions) and for any error parameter $\epsilon = \Omega(1/\log n)$ (this yields a PTAS for a question raised by F. Bach on the complexity of maximizing ReLUs). These results are in contrast to known efficient algorithms for reliably learning linear threshold functions, where $\epsilon$ must be $\Omega(1)$ and strong assumptions are required on the marginal distribution. We can compose our results to obtain the first set of efficient algorithms for learning constant-depth networks of ReLUs. Our techniques combine kernel methods and polynomial approximations with a dual-loss approach to convex programming. As a byproduct we obtain a number of applications, including the first set of efficient algorithms for convex piecewise-linear fitting and the first efficient algorithms for noisy polynomial reconstruction of low-weight polynomials on the unit sphere.
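The paper's reliable-agnostic algorithm rests on kernel methods and convex programming and is beyond a short sketch. For orientation on the hypothesis class only, the following fits a single ReLU in the realizable case by projected subgradient descent on squared loss; all constants are illustrative, and this is explicitly not the authors' method:

```python
# Minimal sketch: fit x -> max(0, w.x) with w on the unit sphere, realizable
# data, plain subgradient descent. NOT the paper's reliable-agnostic algorithm.
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 2000
w_true = rng.normal(size=n)
w_true /= np.linalg.norm(w_true)        # target on the unit sphere S^{n-1}
X = rng.normal(size=(m, n))
y = np.maximum(0.0, X @ w_true)         # noiseless ReLU labels

w = rng.normal(size=n)
w /= np.linalg.norm(w)
for _ in range(500):
    pred = np.maximum(0.0, X @ w)
    # Subgradient of mean squared loss; the indicator gates inactive examples.
    grad = X.T @ ((pred - y) * (X @ w > 0)) / m
    w -= 0.5 * grad
    w /= np.linalg.norm(w)              # project back onto the sphere

print("alignment with w_true:", float(w @ w_true))
```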
The centroid energy of the Fe K$\alpha$ line has been used to identify the progenitors of supernova remnants (SNRs). These investigations generally considered the energy of the centroid derived from the spectrum of the entire remnant. Here we use {\it XMM-Newton} data to investigate the Fe K$\alpha$ centroid in 6 SNRs: 3C~397, N132D, W49B, DEM L71, 1E 0102.2-7219, and Kes 73. In Kes 73 and 1E 0102.2-7219, we fail to detect any Fe K$\alpha$ emission. We report a tentative first detection of Fe K$\alpha$ emission in SNR DEM L71, with a centroid energy consistent with its Type Ia designation. In the remaining remnants, the spatial and spectral sensitivity is sufficient to investigate spatial variations of the Fe K$\alpha$ centroid. We find in N132D and W49B that the centroids in different regions are consistent with that derived from the overall spectrum, although not necessarily with the remnant type identified via other means. However, in SNR 3C~397, we find statistically significant variation in the centroid of up to 100 eV, aligning with the variation in the density structure around the remnant. These variations span the intermediate space between centroid energies signifying core-collapse and Type Ia remnants. Shifting the dividing line downwards by 50 eV can place all the centroids in the CC region, but contradicts the remnant type obtained via other means. Our results show that caution must be used when employing the Fe K$\alpha$ centroid of the entire remnant as the sole diagnostic for typing a remnant.
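The core measurement here, a line centroid and its uncertainty extracted from a spectrum, can be illustrated with a simple fit. A minimal sketch on synthetic data, assuming numpy and scipy; the line parameters are invented and this is not the {\it XMM-Newton} analysis pipeline:

```python
# Minimal sketch: estimate an Fe K-alpha centroid by fitting a Gaussian plus
# constant continuum to a synthetic spectrum. Illustrative values throughout.
import numpy as np
from scipy.optimize import curve_fit

def line_model(e, amp, centroid, sigma, cont):
    return amp * np.exp(-0.5 * ((e - centroid) / sigma) ** 2) + cont

# Synthetic spectrum: a line near 6.55 keV on a flat continuum, with noise.
energy = np.linspace(6.2, 6.9, 200)
rng = np.random.default_rng(0)
counts = line_model(energy, 120.0, 6.55, 0.04, 10.0) + rng.normal(0, 3, energy.size)

popt, pcov = curve_fit(line_model, energy, counts, p0=[100.0, 6.6, 0.05, 5.0])
err = np.sqrt(np.diag(pcov))
print(f"centroid = {popt[1]:.3f} +/- {err[1]:.3f} keV")
```

A 100 eV shift of the kind reported for 3C~397 would show up here as a change in the fitted centroid far larger than its statistical error.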
In this paper, we review the considerations for using Web map images provided by Web map services from the viewpoint of copyright law. The copyright act aims to contribute to the creation of culture by protecting the rights of authors and others while promoting fair exploitation of cultural products. Anyone may therefore use copyrighted materials within the limitations the copyright act establishes. Web map images, including maps, aerial photos, and satellite images, are copyrighted materials, so they may be used only within those limits. However, the range of uses of Web map images permitted under the copyright act is narrow. Moreover, the copyright act has not kept pace with the digitalization of copyrighted materials, and revision of the act to reflect digital works is expected.
