We present ARCHANGEL: a decentralised platform for ensuring the long-term integrity of digital documents stored within public archives. Document integrity is fundamental to public trust in archives. Yet currently that trust is built upon institutional reputation: trust at face value in a centralised authority, such as a national government archive or university. ARCHANGEL proposes a shift to a technological underscoring of that trust, using distributed ledger technology (DLT) to cryptographically guarantee the provenance and immutability, and so the integrity, of archived documents. We describe the ARCHANGEL architecture and report on a prototype of that architecture built over the Ethereum infrastructure. We report an early evaluation of ARCHANGEL and feedback from stakeholders in the research data archives space.
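As a rough illustration of the core integrity mechanism this abstract describes, the Python sketch below hashes a document and later checks it against a ledger record. The ledger here is a plain dictionary standing in for an append-only distributed ledger; the prototype's actual Ethereum smart contract is not reproduced.

```python
import hashlib

# Stand-in for an append-only distributed ledger (hypothetical; the real
# prototype records digests via an Ethereum smart contract).
ledger = {}

def register(doc_id, path):
    """Hash a document and record its digest under the archive identifier."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    ledger[doc_id] = digest  # on a real ledger this write is immutable
    return digest

def verify(doc_id, path):
    """Re-hash the held document and compare against the ledger record."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return ledger.get(doc_id) == digest
```

Because only digests are recorded, the archive can prove a document is unchanged without publishing its contents.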
We present ARCHANGEL: a novel distributed-ledger-based system for assuring the long-term integrity of digital video archives. First, we describe a novel deep network architecture for computing compact temporal content hashes (TCHs) from audio-visual streams with durations of minutes or hours. Our TCHs are sensitive to accidental or malicious content modification (tampering) but invariant to the codec used to encode the video. This invariance is necessary because of the curatorial requirement for archives to format-shift video over time to ensure future accessibility. Second, we describe how the TCHs (and the models used to derive them) are secured via a proof-of-authority blockchain distributed across multiple independent archives. We report on the efficacy of ARCHANGEL within the context of a trial deployment in which the national government archives of the United Kingdom, Estonia and Norway participated.
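The sketch below illustrates, under stated assumptions, how binary temporal content hashes might be compared by Hamming distance to flag tampering. The embedding source, hash dimensions, and threshold are all placeholders; the paper's deep network is not reproduced here.

```python
import numpy as np

def binarize(embeddings):
    """Collapse real-valued per-segment embeddings into a binary hash."""
    return (embeddings > np.median(embeddings, axis=0)).astype(np.uint8)

def hamming(a, b):
    """Fraction of differing bits between two equal-shape binary hashes."""
    return float(np.mean(a != b))

# Placeholder embeddings: in a system like ARCHANGEL these would come from
# the codec-invariant deep network, one row per temporal segment.
stored = binarize(np.random.rand(120, 64))     # TCH recorded at ingest
retrieved = binarize(np.random.rand(120, 64))  # TCH of the file served today
is_tampered = hamming(stored, retrieved) > 0.25  # threshold is illustrative
```

A codec-invariant embedding would keep this distance near zero across format shifts, while content edits would push it past the threshold.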
Event collections are frequently built by crawling the live web on the basis of seed URIs nominated by human experts. Focused web crawling is a technique in which the crawler is guided by reference content pertaining to the event. Given the dynamic nature of the web and the pace at which topics evolve, the timing of the crawl is a concern for both approaches. We investigate the feasibility of performing focused crawls on the archived web. Utilizing the Memento infrastructure, we obtain resources from 22 web archives that contribute to building event collections. We create collections for four events and compare the relevance of their resources to collections built by crawling the live web, as well as to a manually curated collection. Our results show that focused crawling on the archived web is feasible and indeed yields highly relevant collections, especially for events that happened further in the past.
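As a hedged sketch of a single crawl step, the code below asks the Memento Time Travel aggregator for the memento of a URI closest to a target datetime and scores the retrieved page against the event's reference terms. The exact response shape and the crude term-overlap relevance score are assumptions, not the authors' pipeline.

```python
import requests

# Assumed endpoint template for the Memento Time Travel aggregator.
AGGREGATOR = "http://timetravel.mementoweb.org/api/json/{dt}/{uri}"

def closest_memento(uri, dt):
    """Return the URI of the memento closest to datetime dt, if any."""
    resp = requests.get(AGGREGATOR.format(dt=dt, uri=uri), timeout=30)
    if resp.ok:
        return resp.json()["mementos"]["closest"]["uri"][0]
    return None

def relevance(text, reference_terms):
    """Crude lexical overlap between a page and the event's reference content."""
    words = set(text.lower().split())
    return len(words & reference_terms) / max(len(reference_terms), 1)

memento = closest_memento("http://example.com/", "20150417")
if memento:
    page = requests.get(memento, timeout=30).text
    score = relevance(page, {"earthquake", "nepal", "relief"})
    # a focused crawler would enqueue this page's outlinks only when
    # score clears a relevance threshold
```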
Knowledge of how science is consumed in public domains is essential for a deeper understanding of the role of science in human society. While science is heavily supported by public funding, common depictions suggest that scientific research remains an isolated, ivory-tower activity, with weak connectivity to public use, little relationship between the quality of research and its public use, and little correspondence between the funding of science and its public use. This paper introduces a measurement framework to examine public-good features of science, allowing us to study public uses of science, the public funding of science, and how use and funding relate. Specifically, we integrate five large-scale datasets that link scientific publications from all scientific fields to their upstream funding support and downstream public uses across three public domains: government documents, the news media, and marketplace invention. We find that the public uses of science are extremely diverse, with different public domains drawing distinctively across scientific fields. Yet amidst these differences, we find key forms of alignment in the interface between science and society. First, despite concerns that the public does not engage with high-quality science, we find universal alignment, in each scientific field and public domain, between what the public consumes and what is highly impactful within science. Second, despite the myriad factors underpinning the public funding of science, the resulting allocation across fields presents a striking alignment with the fields' collective public use. Overall, public uses of science present a rich landscape of specialized consumption, yet collectively science and society interface with remarkable, quantifiable alignment between scientific use, public use, and funding.
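A toy example of the kind of alignment measurement such a framework supports: comparing the distribution of funding across fields with the fields' collective public use. All field shares below are hypothetical, and Pearson correlation is just one simple choice of alignment statistic.

```python
import numpy as np

fields = ["medicine", "physics", "computer science", "ecology"]
funding_share = np.array([0.40, 0.25, 0.20, 0.15])     # hypothetical shares
public_use_share = np.array([0.45, 0.20, 0.22, 0.13])  # hypothetical shares

# Pearson correlation as one simple way to quantify funding/use alignment.
alignment = np.corrcoef(funding_share, public_use_share)[0, 1]
print(f"field-level funding/use alignment: {alignment:.2f}")
```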
Over the past few decades, the amount of scientific articles and technical literature has increased exponentially in size. Consequently, there is a great need for systems that can ingest these documents at scale and make the contained knowledge discoverable. Unfortunately, both the format of these documents (e.g. the PDF format or bitmap images) and the presentation of the data (e.g. complex tables) make the extraction of qualitative and quantitative data extremely challenging. In this paper, we present a modular, cloud-based platform to ingest documents at scale. This platform, called the Corpus Conversion Service (CCS), implements a pipeline which allows users to parse and annotate documents (i.e. collect ground-truth), train machine-learning classification algorithms and ultimately convert any type of PDF or bitmap document to a structured content representation format. We show that each of the modules is scalable due to an asynchronous microservice architecture and can therefore handle massive amounts of documents. Furthermore, we show that our capability to gather ground-truth is accelerated by machine-learning algorithms by at least one order of magnitude. This allows us both to gather large amounts of ground-truth in very little time and to obtain very good precision/recall metrics in the range of 99% with regard to content conversion to structured output. The CCS platform is currently deployed on IBM internal infrastructure and serves more than 250 active users for knowledge-engineering project engagements.
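To make the asynchronous microservice idea concrete, here is a minimal, hypothetical queue-driven pipeline in Python; the stage names and behaviour are placeholders and do not reflect the CCS API. Each stage scales independently because stages communicate only through queues.

```python
import asyncio

async def stage(name, inbox, outbox):
    """A microservice stand-in: consume a document, emit a processed one."""
    while True:
        doc = await inbox.get()
        result = f"{doc} -> {name}"  # placeholder for parse/annotate/convert
        if outbox is not None:
            await outbox.put(result)
        inbox.task_done()

async def main(documents):
    parse_q, convert_q = asyncio.Queue(), asyncio.Queue()
    workers = [
        asyncio.create_task(stage("parsed", parse_q, convert_q)),
        asyncio.create_task(stage("converted", convert_q, None)),
    ]
    for doc in documents:
        await parse_q.put(doc)
    await parse_q.join()    # wait until every document has been parsed
    await convert_q.join()  # ... and converted
    for w in workers:
        w.cancel()

asyncio.run(main(["paper1.pdf", "scan2.png"]))
```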
This paper reviews the major methods and theories regarding the preservation of new media artifacts such as videogames, and argues for the importance of collecting, and coming to a better understanding of, videogame artifacts of creation, which will help build a more detailed picture of the essential qualities of these culturally significant works. We also review the major videogame collections in the United States, Europe and Japan to give an idea of the current state of videogame archives, and argue for fuller, more comprehensive coverage of these materials in institutional repositories.