A Distributed Process Infrastructure for a Distributed Data Structure

Added by Marko A. Rodriguez

Publication date 2008

Field: Informatics Engineering

Language: English

Authors Marko A. Rodriguez

Artificial Intelligence; Digital Libraries

Abstract

The Resource Description Framework (RDF) is continuing to grow outside the bounds of its initial function as a metadata framework and into the domain of general-purpose data modeling. This expansion has been facilitated by the continued increase in the capacity and speed of RDF database repositories known as triple-stores. High-end RDF triple-stores can hold and process on the order of 10 billion triples. In an effort to provide a seamless integration of the data contained in RDF repositories, the Linked Data community is providing specifications for linking RDF data sets into a universal distributed graph that can be traversed by both man and machine. While the seamless integration of RDF data sets is important, at the scale of the data sets that currently exist, and the scale to which they will ultimately grow, the download-and-index philosophy of the World Wide Web will not so easily map over to the Semantic Web. This essay discusses the importance of adding a distributed RDF process infrastructure to the current distributed RDF data structure.
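The contrast the abstract draws can be made concrete with a small sketch. The abstract does not specify any API, so everything below is an assumption made for illustration: `TripleStore` is a hypothetical in-memory stand-in for a remote RDF repository, `dump` models the download-and-index approach, and `run` models shipping a process to the data. The example triples are invented.

```python
from typing import Callable, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class TripleStore:
    """Hypothetical stand-in for a remote RDF repository (not a real library)."""

    def __init__(self, triples: Iterable[Triple]) -> None:
        self._triples: List[Triple] = list(triples)

    def dump(self) -> List[Triple]:
        # Download model: every triple is copied to the caller.
        return list(self._triples)

    def run(self, process: Callable[[List[Triple]], int]) -> int:
        # Process model: the computation executes next to the data
        # and only its result is returned to the caller.
        return process(self._triples)


def count_knows(triples: List[Triple]) -> int:
    """Count the foaf:knows edges in a collection of triples."""
    return sum(1 for _, predicate, _ in triples if predicate == "foaf:knows")


# Two invented data sets that together form one distributed graph.
stores = [
    TripleStore([
        ("ex:marko", "foaf:knows", "ex:josh"),
        ("ex:marko", "foaf:name", "Marko"),
        ("ex:josh", "foaf:name", "Josh"),
    ]),
    TripleStore([
        ("ex:josh", "foaf:knows", "ex:peter"),
        ("ex:peter", "foaf:name", "Peter"),
    ]),
]

# Download and index: pull all triples to one place, then compute.
downloaded = [t for store in stores for t in store.dump()]
central_result = count_knows(downloaded)
items_moved_by_download = len(downloaded)

# Distributed process: send the function to each store, merge partial results.
partials = [store.run(count_knows) for store in stores]
distributed_result = sum(partials)
items_moved_by_process = len(partials)
```

Both routes give the same answer, but the first moves every triple while the second moves one number per store. This difference is the scaling concern the abstract raises. A real infrastructure would also have to handle traversals that cross store boundaries, which this sketch does not attempt.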

ds-array: A Distributed Data Structure for Large Scale Machine Learning

Javier Alvarez Cid-Fuentes, Pol Alvarez, Salvi Solà (2021)

Machine learning has proved to be a useful tool for extracting knowledge from scientific data in numerous research fields, including astrophysics, genomics, and molecular dynamics. Often, data sets from these research areas need to be processed in distributed platforms due to their magnitude. This can be done using one of the various distributed machine learning libraries available. One of these libraries is dislib, a distributed machine learning library for Python especially designed to process large-scale data sets on HPC clusters, which makes dislib an ideal candidate for analyzing scientific data. However, dislib's main distributed data structure, called Dataset, has some limitations, including poor performance in certain operations and low flexibility and usability. In this paper, we propose a novel distributed data structure for dislib, called ds-array, that addresses dislib's main limitations in data management. Ds-arrays simplify distributed data management in dislib by exposing a NumPy-like API, provide more flexibility, and reduce the computational complexity of some operations. This results in performance improvements of up to two orders of magnitude over Datasets, while also greatly improving scalability and usability.

Distributed Parallel and Cluster Computing

Archer: A Community Distributed Computing Infrastructure for Computer Architecture Research and Education

Renato Figueiredo, P. Oscar Boykin, Jose A. B. Fortes (2008)

This paper introduces Archer, a community-based computing resource for computer architecture research and education. The Archer infrastructure integrates virtualization and batch scheduling middleware to deliver high-throughput computing resources aggregated from resources distributed across wide-area networks and owned by different participating entities in a seamless manner. The paper discusses the motivations leading to the design of Archer, describes its core middleware components, and presents an analysis of the functionality and performance of a prototype wide-area deployment running a representative computer architecture simulation workload.

Hardware Architecture

Multi-Scale Process Modelling and Distributed Computation for Spatial Data

Andrew Zammit-Mangion, Jonathan Rougier (2019)

Recent years have seen a huge development in spatial modelling and prediction methodology, driven by the increased availability of remote-sensing data and the reduced cost of distributed-processing technology. It is well known that modelling and prediction using infinite-dimensional process models is not possible with large data sets, and that both approximate models and, often, approximate-inference methods, are needed. The problem of fitting simple global spatial models to large data sets has been solved through the likes of multi-resolution approximations and nearest-neighbour techniques. Here we tackle the next challenge, that of fitting complex, nonstationary, multi-scale models to large data sets. We propose doing this through the use of superpositions of spatial processes with increasing spatial scale and increasing degrees of nonstationarity. Computation is facilitated through the use of Gaussian Markov random fields and parallel Markov chain Monte Carlo based on graph colouring. The resulting model allows for both distributed computing and distributed data. Importantly, it provides opportunities for genuine model and data scalability and yet is still able to borrow strength across large spatial scales. We illustrate a two-scale version on a data set of sea-surface temperature containing on the order of one million observations, and compare our approach to state-of-the-art spatial modelling and prediction methods.

Computation Applications

Epistemic Protocols for Distributed Gossiping

Krzysztof R. Apt (2016)

Gossip protocols aim at arriving, by means of point-to-point or group communications, at a situation in which all the agents know each other's secrets. We consider distributed gossip protocols which are expressed by means of epistemic logic. We provide an operational semantics of such protocols and set up an appropriate framework to argue about their correctness. Then we analyze specific protocols for complete graphs and for directed rings.

Artificial Intelligence; Distributed Parallel and Cluster Computing; Logic in Computer Science

HetSeq: Distributed GPU Training on Heterogeneous Infrastructure

Yifan Ding, Nicholas Botzer, Tim Weninger (2020)

Modern deep learning systems like PyTorch and TensorFlow are able to train enormous models with billions (or trillions) of parameters on a distributed infrastructure. These systems require that the internal nodes have the same memory capacity and compute performance. Unfortunately, most organizations, especially universities, have a piecemeal approach to purchasing computer systems, resulting in a heterogeneous infrastructure that cannot be used to compute large models. The present work describes HetSeq, a software package adapted from the popular PyTorch package that provides the capability to train large neural network models on heterogeneous infrastructure. Experiments with transformer translation and a BERT language model show that HetSeq scales over heterogeneous systems. HetSeq can be easily extended to other models like image classification. The package, with supporting documentation, is publicly available at https://github.com/yifding/hetseq.

Distributed Parallel and Cluster Computing; Machine Learning
