HEPCloud is rapidly becoming the primary system for provisioning compute resources for all Fermilab-affiliated experiments. To reliably meet the peak demands of the next generation of High Energy Physics experiments, Fermilab must plan to elastically expand its computational capabilities to cover the forecasted need. Commercial cloud and allocation-based High Performance Computing (HPC) resources both have explicit and implicit costs that must be considered when deciding when to provision these resources and at what scale. To support such provisioning in a manner consistent with organizational business rules and budget constraints, we have developed a modular intelligent decision support system (IDSS) to aid in the automatic provisioning of resources spanning multiple cloud providers, multiple HPC centers, and grid computing federations. In this paper, we discuss the goals and architecture of the HEPCloud Facility, the architecture of the IDSS, and our early experience in using the IDSS for automated facility expansion at both Fermilab and Brookhaven National Laboratory.
Efficiently generating an accurate, well-structured overview report (ORPT) over thousands of related documents is challenging. A well-structured ORPT consists of sections at multiple levels (e.g., sections and subsections). None of the existing multi-document summarization (MDS) algorithms is directed toward this task. To overcome this obstacle, we present NDORGS (Numerous Documents Overview Report Generation Scheme), which integrates text filtering, keyword scoring, single-document summarization (SDS), topic modeling, MDS, and title generation to generate a coherent, well-structured ORPT. We then devise a multi-criteria evaluation method using techniques of text mining and multi-attribute decision making on a combination of human judgments, running time, information coverage, and topic diversity. We evaluate ORPTs generated by NDORGS on two large corpora of documents, where one is classified and the other unclassified. We show that, using Saaty's pairwise comparison 9-point scale and under TOPSIS, the ORPTs generated on SDSs with the length of 20% of the original documents are the best overall on both datasets.
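The TOPSIS ranking step mentioned above can be illustrated with a minimal sketch. It assumes the textbook form of TOPSIS (vector normalization, Euclidean distances to the ideal and anti-ideal points); the function name `topsis`, the criteria ordering, the weights, and the sample values are all hypothetical and are not taken from the paper. In practice the weights might be derived from Saaty-scale pairwise comparisons, which this sketch does not implement.

```python
import math


def topsis(matrix, weights, benefit):
    """Score alternatives (rows) over criteria (columns) with textbook TOPSIS.

    matrix  : list of rows, one row of criterion values per alternative
    weights : one weight per criterion (rescaled here to sum to 1)
    benefit : per criterion, True if larger is better, False if it is a cost
    Returns one closeness coefficient in [0, 1] per alternative.
    """
    n_crit = len(weights)
    total = sum(weights)
    w = [x / total for x in weights]

    # Vector-normalize each column, then apply the criterion weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_crit)]
    v = [[w[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]

    # Ideal and anti-ideal value for every criterion.
    cols = list(zip(*v))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]

    scores = []
    for row in v:
        d_pos = math.sqrt(sum((a - b) ** 2 for a, b in zip(row, ideal)))
        d_neg = math.sqrt(sum((a - b) ** 2 for a, b in zip(row, anti)))
        denom = d_pos + d_neg
        scores.append(d_neg / denom if denom else 0.5)
    return scores


# Made-up illustration: columns are (human judgment, running time,
# coverage, diversity); running time is a cost criterion.
example = [
    [7.0, 120.0, 0.80, 0.60],
    [8.0, 300.0, 0.85, 0.65],
    [6.0, 60.0, 0.70, 0.55],
]
print(topsis(example, [0.4, 0.2, 0.2, 0.2], [True, False, True, True]))
```

The alternative with the largest closeness coefficient is ranked best; ties and degenerate columns (all zeros) are handled only in the simplest possible way here.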
The power of networks manifests itself in a highly non-linear amplification of a number of effects, and their weakness in the propagation of cascading failures. The potential systemic risk effects can be either exacerbated or mitigated, depending on the resilience characteristics of the network. The goal of this paper is to study some characteristics of network amplification and resilience. We simulate random Erdős-Rényi networks and measure amplification by varying node capacity, transaction volume, and expected failure rates. We discover that network throughput scales almost quadratically with respect to the node capacity and that the effects of excessive network load and of random, irreparable node faults are equivalent and almost perfectly anticorrelated. This knowledge can be used by capacity planners to determine reliability requirements that maximize the optimal operational region.
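As a rough illustration of the simulation setup, the sketch below generates an Erdős-Rényi G(n, p) graph and removes nodes at random to mimic irreparable faults. It is only an assumed starting point: the helper names `erdos_renyi`, `edge_count`, and `fail_nodes`, and all parameter values, are hypothetical, and the paper's capacity and transaction-volume model is not reproduced.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, seed=None):
    """Return a G(n, p) graph as an adjacency dict {node: set(neighbours)}."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    # Each of the n*(n-1)/2 possible edges is included independently
    # with probability p.
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def edge_count(adj):
    """Number of undirected edges (each edge is stored twice)."""
    return sum(len(nbrs) for nbrs in adj.values()) // 2


def fail_nodes(adj, q, seed=None):
    """Drop each node independently with probability q (no repair)."""
    rng = random.Random(seed)
    alive = {u for u in adj if rng.random() >= q}
    return {u: adj[u] & alive for u in alive}


g = erdos_renyi(50, 0.1, seed=42)
damaged = fail_nodes(g, 0.2, seed=7)
print(len(g), edge_count(g), len(damaged), edge_count(damaged))
```

Sweeping `q` (fault rate) and a separately modeled load parameter over such graphs is one way the equivalence claimed above could be explored.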
Modern data stores achieve scalability by partitioning data into shards and fault-tolerance by replicating each shard across several servers. A key component of such systems is a Transaction Certification Service (TCS), which atomically commits a transaction spanning multiple shards. Existing TCS protocols require 2f+1 crash-stop replicas per shard to tolerate f failures. In this paper, we present atomic commit protocols that require only f+1 replicas and reconfigure the system upon failures using an external reconfiguration service. We furthermore rigorously prove that these protocols correctly implement a recently proposed TCS specification. We present protocols in two different models: the standard asynchronous message-passing model and a model with Remote Direct Memory Access (RDMA), which allows a machine to access the memory of another machine over the network without involving the latter's CPU. Our protocols are inspired by the recent FARM system for RDMA-based transaction processing. Our work codifies the core ideas of FARM as distributed TCS protocols, rigorously proves them correct, and highlights the trade-offs required by the use of RDMA.
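The replica savings implied by moving from 2f+1 to f+1 replicas per shard can be made concrete with a small calculation. The function name `replicas_needed` and the flag `external_reconfig` are hypothetical, and the example cluster size is made up; the sketch only counts replicas and says nothing about the protocols themselves.

```python
def replicas_needed(f, shards, external_reconfig):
    """Total replicas required to tolerate f crash failures per shard.

    external_reconfig=False : 2f+1 replicas per shard
    external_reconfig=True  : f+1 replicas per shard, relying on an
                              external reconfiguration service after failures
    """
    per_shard = f + 1 if external_reconfig else 2 * f + 1
    return per_shard * shards


# Hypothetical deployment: 100 shards, tolerating f = 2 failures each.
print(replicas_needed(2, 100, external_reconfig=False))
print(replicas_needed(2, 100, external_reconfig=True))
```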
Abundant examples of complex transaction-oriented networks (TONs) can be found in a variety of disciplines, including information and communication technology, finance, commodity trading, and real estate. A transaction in a TON is executed as a sequence of subtransactions associated with the network nodes, and is committed if every subtransaction is committed. A subtransaction incurs a two-fold overhead on the host node: the fixed transient operational cost and the cost of long-term management (e.g., archiving and support) that potentially grows exponentially with the transaction length. If the overall cost exceeds the node capacity, the node fails, and all subtransactions incident to the node, along with their parent distributed transactions, are aborted. A TON's resilience can be measured in terms of either external workloads or intrinsic node fault rates that cause the TON to partially or fully choke. We demonstrate that under certain conditions, these two measures are equivalent. We further show that the exponential growth of the long-term management costs can be mitigated by adjusting the effective operational cost; in other words, the future maintenance costs can be absorbed into the transient operational costs.
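The cost and failure rules described above can be sketched as a single-pass evaluation. The exact cost function is an assumption: here the long-term management term is taken to be `exp(growth_rate * length) - 1`, and the names `subtransaction_cost`, `evaluate`, `fixed_cost`, and `growth_rate` are hypothetical. The sketch does not model time, retries, or cascading effects beyond one round of node failures.

```python
import math


def subtransaction_cost(fixed_cost, length, growth_rate):
    """Fixed transient cost plus an assumed exponential long-term term."""
    return fixed_cost + math.exp(growth_rate * length) - 1.0


def evaluate(transactions, capacity, fixed_cost, growth_rate):
    """Return (indices of committed transactions, set of failed nodes).

    transactions : list of node sequences, one subtransaction per node visit
    capacity     : dict mapping node -> maximum total cost it can absorb
    """
    load = {}
    for path in transactions:
        cost = subtransaction_cost(fixed_cost, len(path), growth_rate)
        for node in path:
            load[node] = load.get(node, 0.0) + cost

    # A node fails when its accumulated cost exceeds its capacity.
    failed = {n for n, total in load.items() if total > capacity[n]}

    # A transaction commits only if none of its subtransactions sits on a
    # failed node.
    committed = [i for i, path in enumerate(transactions)
                 if not any(n in failed for n in path)]
    return committed, failed


txs = [["a", "b"], ["b", "c"], ["c"]]
caps = {"a": 10.0, "b": 3.0, "c": 10.0}
print(evaluate(txs, caps, fixed_cost=1.0, growth_rate=0.5))
```

Raising `fixed_cost` while lowering `growth_rate` is one way to experiment with the idea of absorbing future maintenance costs into the transient operational cost.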