System noise can negatively impact the performance of HPC systems, and the interconnection network is one of the main factors contributing to this problem. To mitigate this effect, adaptive routing sends packets on non-minimal paths if they are less congested. However, while this may mitigate interference caused by congestion, it also generates more traffic because packets traverse additional hops, which in turn causes congestion for other applications and for the application itself. In this paper, we first describe how to estimate network noise. By following these guidelines, we show how noise can be reduced by using routing algorithms that select minimal paths with a higher probability. We exploit this knowledge to design an algorithm that changes the probability of selecting minimal paths according to the application characteristics. We validate our solution on microbenchmarks and real-world applications on two systems relying on a Dragonfly interconnection network, showing noise reduction and performance improvement.
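The idea of making minimal paths more likely can be illustrated with a small sketch. This is not the algorithm from the paper: the function names, the additive `bias` parameter, and the congestion values are all hypothetical, chosen only to show how a single knob can shift the share of packets that stay on minimal paths.

```python
def choose_path(minimal_congestion, nonminimal_congestion, bias):
    """Return "minimal" or "nonminimal" for one packet.

    Hypothetical rule: the non-minimal estimate is penalized by `bias`,
    so a larger bias keeps more packets on the minimal path.
    """
    if minimal_congestion <= nonminimal_congestion + bias:
        return "minimal"
    return "nonminimal"


def minimal_fraction(samples, bias):
    """Fraction of (minimal, nonminimal) congestion samples routed minimally."""
    picks = [choose_path(m, n, bias) for m, n in samples]
    return picks.count("minimal") / len(picks)


# Made-up congestion estimates: (minimal path, non-minimal path).
samples = [(1, 2), (5, 2), (10, 2), (3, 3)]

for bias in (0, 4, 10):
    print(f"bias={bias:>2}  minimal fraction={minimal_fraction(samples, bias):.2f}")
```

An application-aware scheme in the spirit of the abstract would adjust `bias` at run time from application characteristics; how that is done is specific to the paper and is not reproduced here.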
As one of the most popular southbound protocols of software-defined networking (SDN), OpenFlow decouples network control from the forwarding devices, offering flexible and scalable functionality for networks. This flexibility comes at a cost, however, because it incurs penalties in packet processing speed. Understanding the performance of OpenFlow switches and controllers is therefore important for deployment. In this paper, we model the packet processing time of OpenFlow switches and controllers, and we analyze mainly how the probability of packet-in messages affects their performance. Our results show that OpenFlow networks do incur a performance penalty, but that the penalty is small when the probability of packet-in messages is low. A network designer can use this model to approximate the performance of her deployments.
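The dependence on the packet-in probability can be illustrated with a simple expected-value calculation. This is a deliberately simplified sketch, not the model developed in the paper: it assumes every packet pays a fixed switch processing time and, with probability `p_packet_in`, an additional fixed controller round trip. The function name and the timing values are hypothetical.

```python
def expected_processing_time(t_switch, t_controller, p_packet_in):
    """Expected per-packet processing time in seconds.

    Assumption: every packet costs t_switch, and a fraction p_packet_in
    of packets additionally costs t_controller (queueing is ignored).
    """
    if not 0.0 <= p_packet_in <= 1.0:
        raise ValueError("p_packet_in must be a probability in [0, 1]")
    return t_switch + p_packet_in * t_controller


# Made-up timings: 10 microseconds in the switch, 1 millisecond via the controller.
for p in (0.0, 0.01, 0.1, 0.5):
    t = expected_processing_time(10e-6, 1e-3, p)
    print(f"p={p:<5} expected time = {t * 1e6:.1f} us")
```

Even in this crude form, the penalty term scales linearly with the packet-in probability, which is consistent with the observation that the penalty is small when that probability is low.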
The interrelation between the topological design of a network and an efficient algorithm running on it is important for applications to communication and transportation systems. In this paper, we propose a design principle for reliable store-carry-forward routing based on autonomously moving message-ferries on a fractal-like network, which consists of a self-similar tiling of equilateral triangles. As a collective adaptive mechanism, the routing is realized by a relay of cyclic message-ferries, each corresponding to one cycle in a concatenation of triangle cycles, and it exploits favorable properties of the network structure. The routing can recover from local accidents in the hierarchical network structure. Moreover, the design principle is supported theoretically by a method for calculating the optimal service rates of message-ferries, derived from a tandem queue model for stochastic processes on a chain of edges in the network. These results, obtained by combining complex network science and computer science, will be useful for developing resilient network systems.
The recently proposed RCube network is a cube-based, server-centric data center network (DCN) that includes two types of heterogeneous servers, called core and edge servers. Remarkably, it uses the edge servers as backups to deal with server failures and thus achieve high availability. This paper first points out that RCube is a suitable candidate topology of DCNs for edge computing. Three transmission types arise among core and edge servers, depending on the applications' demand for computation and instant response. We then employ protection routing to analyze the transmission failure of RCube DCNs. Unlike traditional protection routing, which tolerates only a single link or node failure, we use a multi-protection routing scheme to improve fault-tolerance capability. To configure a protection routing in a network, according to Tapolcai's suggestion, we need to construct two completely independent spanning trees (CISTs). A logic graph of RCube, denoted by $L$-$RCube(n,m,k)$, is a network with a recursive structure. Each basic building element consists of $n$ core servers and $m$ edge servers, and the order $k$ is the number of recursions applied in the structure. In this paper, we provide algorithms to construct $\min\{n,\lfloor (n+m)/2 \rfloor\}$ CISTs in $L$-$RCube(n,m,k)$ for $n+m\geqslant 4$ and $n>1$. By combining multiple CISTs, we can configure the desired multi-protection routing. In our simulation, we configure up to 10 protection routings for RCube DCNs. As far as we know, past research has developed at most three protection routings in other network structures. Finally, we summarize some crucial observations about the transmission efficiency of DCNs with heterogeneous edge-core servers from the simulation results.
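The stated bound on the number of CISTs, min{n, floor((n+m)/2)} under the conditions n + m >= 4 and n > 1, is easy to evaluate for concrete parameters. The helper below is only an arithmetic sketch of that formula; the function name `cist_count` is hypothetical, and the code does not construct the trees themselves.

```python
def cist_count(n, m):
    """Number of CISTs given by min{n, floor((n + m) / 2)}.

    n: core servers per basic building element
    m: edge servers per basic building element
    The formula is stated only for n + m >= 4 and n > 1.
    """
    if n <= 1 or n + m < 4:
        raise ValueError("formula is stated only for n > 1 and n + m >= 4")
    return min(n, (n + m) // 2)


# A few illustrative parameter choices.
for n, m in [(2, 2), (3, 1), (4, 6), (10, 10)]:
    print(f"n={n:>2}, m={m:>2}: {cist_count(n, m)} CISTs")
```

Note that the order $k$ does not appear in the formula, so the count depends only on the composition of a basic building element.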
Distributed digital infrastructures for computation and analytics are now evolving towards an interconnected ecosystem allowing complex applications to be executed from IoT Edge devices to the HPC Cloud (aka the Computing Continuum, the Digital Continuum, or the Transcontinuum). Understanding end-to-end performance in such a complex continuum is challenging. It boils down to reconciling many, typically contradictory, application requirements and constraints with low-level infrastructure design choices. One important challenge is to accurately reproduce relevant behaviors of a given application workflow and representative settings of the physical infrastructure underlying this complex continuum. We introduce a rigorous methodology for such a process and validate it through E2Clab. It is the first platform to support the complete experimental cycle across the Computing Continuum: deployment, analysis, optimization. Preliminary results with real-life use cases show that E2Clab allows one to understand and improve performance by correlating it with the parameter settings, the resource usage, and the specifics of the underlying infrastructure.
Principal component analysis (PCA) is not only a fundamental dimension reduction method but also a widely used network anomaly detection technique. Traditionally, PCA is performed in a centralized manner, which scales poorly to large distributed systems because of the network bandwidth required to gather the distributed state at a fusion center. Consequently, several recent works have proposed distributed PCA algorithms that aim to reduce the communication overhead incurred by PCA without losing its inferential power. This paper evaluates the tradeoff between communication cost and solution quality of two distributed PCA algorithms on a real domain name system (DNS) query dataset from a large network. We also apply the distributed PCA algorithms to network anomaly detection and demonstrate that the detection accuracy of both distributed PCA-based methods degrades only slightly while achieving significant savings in communication bandwidth.
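For orientation, the centralized baseline that distributed variants try to approximate can be sketched as a residual-subspace detector. This is a generic illustration, not either of the two distributed algorithms evaluated in the paper. The data are synthetic, and the helper names, the choice of `k`, and the max-residual threshold are assumptions made for the example.

```python
import numpy as np


def fit_pca(X, k):
    """Return the column mean and the top-k principal directions (d x k)."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k].T


def residual_scores(X, mean, components):
    """Squared distance of each row from the principal subspace."""
    centered = X - mean
    projected = centered @ components @ components.T
    return np.sum((centered - projected) ** 2, axis=1)


rng = np.random.default_rng(0)

# Synthetic "normal" traffic: 200 samples in 5-D lying near a 2-D subspace.
basis = rng.normal(size=(2, 5))
normal = rng.normal(size=(200, 2)) @ basis + 0.01 * rng.normal(size=(200, 5))

mean, comps = fit_pca(normal, k=2)
# Assumed threshold: the largest residual seen on the training data.
threshold = residual_scores(normal, mean, comps).max()

# Synthetic anomaly: a point displaced orthogonally to the learned subspace.
v = rng.normal(size=5)
v -= comps @ (comps.T @ v)
v /= np.linalg.norm(v)
outlier = mean + 10.0 * v

score = residual_scores(outlier[None, :], mean, comps)[0]
print("anomalous" if score > threshold else "normal")
```

In this centralized form every monitor must ship its full measurement rows to one place before `fit_pca` runs, which is the bandwidth cost the distributed algorithms aim to reduce.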