Unexpected increases in demand, and flash crowds above all, are the bane of every web application: they may cause intolerable delays or even service unavailability. Proper quality-of-service policies must guarantee rapid reactivity and responsiveness even in such critical situations. Previous solutions fail to meet common performance requirements when the system faces sudden and unpredictable surges of traffic, since they often rely on key parameters whose proper setting requires laborious manual tuning, which prevents fast adaptation of the control policies. We contribute an original Self-* Overload Control (SOC) policy, which lets the system self-configure a dynamic constraint on the rate of admitted sessions so as to respect service level agreements while maximizing resource utilization. Our policy requires neither prior information on the incoming traffic nor manual configuration of key parameters. We ran extensive simulations under a wide range of operating conditions, showing that SOC rapidly adapts to time-varying traffic and self-optimizes resource utilization; it admits as many new sessions as possible in observance of the agreements, even under intense workload variations. A comparison with previously proposed approaches highlights more stable behavior and better performance.
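For intuition, the core idea of tying an admission cap to an SLA target can be illustrated with a minimal Python sketch. The class name, measurement interval, and the back-off/probing rule below are illustrative assumptions, not the SOC policy itself, which self-configures without hand-set constants.

    # Illustrative sketch: adapt a cap on admitted new sessions from observed
    # latency versus an SLA target. The adjustment rule is hypothetical and is
    # not the SOC policy described in the abstract.
    class AdmissionController:
        def __init__(self, sla_latency_ms, initial_rate=10.0):
            self.sla = sla_latency_ms
            self.rate = initial_rate   # admitted new sessions per interval
            self.admitted = 0

        def admit(self):
            # Return True if a new session may start in the current interval.
            if self.admitted < self.rate:
                self.admitted += 1
                return True
            return False

        def end_of_interval(self, observed_latency_ms):
            # Re-tune the admission cap once per measurement interval.
            if observed_latency_ms > self.sla:
                # Overload: back off in proportion to the SLA violation.
                self.rate = max(1.0, self.rate * self.sla / observed_latency_ms)
            else:
                # Headroom: probe for extra capacity.
                self.rate *= 1.1
            self.admitted = 0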
By introducing programmability, automated verification, and innovative debugging tools, Software-Defined Networks (SDNs) are poised to meet the increasingly stringent dependability requirements of today's communication networks. However, the design of fault-tolerant SDNs remains an open challenge. This paper considers the design of dependable SDNs through the lens of self-stabilization, a very strong notion of fault-tolerance. In particular, we develop algorithms for an in-band and distributed control plane for SDNs, called Renaissance, which tolerate a wide range of (concurrent) controller, link, and communication failures. Our self-stabilizing algorithms ensure that after the occurrence of an arbitrary combination of failures, (i) every non-faulty SDN controller can reach any switch (or any other controller) in the network within a bounded communication delay (in the presence of a bounded number of concurrent failures) and (ii) every switch is managed by at least one controller (as long as at least one controller is not faulty). We evaluate Renaissance through a rigorous worst-case analysis as well as a prototype implementation (based on OVS and Floodlight), and we report on our experiments using Mininet.
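The flavour of guarantee (ii) can be conveyed with a toy Python sketch: a switch whose manager is lost, or whose lease has expired, adopts any live controller it can still reach. The lease mechanism and names below are assumptions for illustration only and do not reproduce the Renaissance algorithms.

    import time

    LEASE_SECONDS = 3.0  # illustrative lease length

    class Switch:
        def __init__(self, controllers):
            # controllers: mapping from controller name to a liveness predicate
            self.controllers = controllers
            self.manager = None
            self.lease_expiry = 0.0

        def stabilize_step(self, now=None):
            now = time.monotonic() if now is None else now
            if self.manager is not None and now < self.lease_expiry:
                return self.manager            # invariant already holds
            # Manager lost or lease expired: adopt any reachable live controller.
            self.manager = None
            for name, is_alive in self.controllers.items():
                if is_alive():
                    self.manager = name
                    self.lease_expiry = now + LEASE_SECONDS
                    break
            return self.manager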
Distributed storage systems provide reliable access to data through redundancy spread over individually unreliable nodes. Application scenarios include data centers, peer-to-peer storage systems, and storage in wireless networks. Storing data using an erasure code, in fragments spread across nodes, requires less redundancy than simple replication for the same level of reliability. However, since fragments must be periodically replaced as nodes fail, a key question is how to generate encoded fragments in a distributed way while transferring as little data as possible across the network. For an erasure coded system, a common practice to repair from a node failure is for a new node to download subsets of data stored at a number of surviving nodes, reconstruct a lost coded block using the downloaded data, and store it at the new node. We show that this procedure is sub-optimal. We introduce the notion of regenerating codes, which allow a new node to download functions of the stored data from the surviving nodes. We show that regenerating codes can significantly reduce the repair bandwidth. Further, we show that there is a fundamental tradeoff between storage and repair bandwidth which we theoretically characterize using flow arguments on an appropriately constructed graph. By invoking constructive results in network coding, we introduce regenerating codes that can achieve any point in this optimal tradeoff.
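In the notation conventionally used in the regenerating-codes literature (a file of size $B$ stored on $n$ nodes, each holding $\alpha$, a replacement node downloading $\beta$ from each of $d$ surviving nodes, any $k$ nodes sufficing to recover the file), the storage-repair bandwidth tradeoff referred to above is commonly stated via the cut-set bound

    \sum_{i=0}^{k-1} \min\{\alpha, (d-i)\beta\} \;\ge\; B,

with the two extreme points of the tradeoff, minimum-storage and minimum-bandwidth regeneration, given by

    \alpha_{\mathrm{MSR}} = \frac{B}{k}, \quad \gamma_{\mathrm{MSR}} = d\beta_{\mathrm{MSR}} = \frac{B}{k}\cdot\frac{d}{d-k+1}, \qquad
    \alpha_{\mathrm{MBR}} = \gamma_{\mathrm{MBR}} = \frac{2Bd}{k(2d-k+1)}.

The symbols here are the conventional ones rather than taken verbatim from this abstract.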
Geo-distributed private chains and databases have raised the performance requirements for consistency models. However, with millisecond network latency between nodes, the widely used leader-based SMR models cause frequent retransmission of logs because the leader cannot learn the logs' replication status in time, which results in the leader incurring high network and computing costs. To address this problem, we propose a Leader Confirmation based Replication (LCR) model. First, we demonstrate the efficacy of the approach by designing the Future-Log Replication model, in which the followers are responsible for replicating non-transactional logs; it reduces the leader's network load by means of a signal log. Second, we design a Generation Re-replication strategy, which ensures the security and consistency of future-logs when the number of nodes changes. Finally, we implement LCR-Raft and design experiments. The results show that in single-millisecond network latency environments, LCR-Raft provides 1.5x higher TPS, reduces the response time of transactional data by 40%-60%, and cuts network traffic by 20%-30%, at an acceptable network traffic and CPU cost on the followers. In addition, LCR offers high portability since it changes neither the number of leaders nor the election process.
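The general idea of relieving the leader of full-log fan-out can be illustrated with a highly simplified Python sketch, in which the leader ships one full copy to a relaying follower and only a small signal to the rest. The message flow, class names, and single-relay choice are illustrative assumptions, not the LCR-Raft protocol.

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class LogEntry:
        index: int
        payload: bytes

    class Follower:
        def __init__(self, name: str):
            self.name = name
            self.log: Dict[int, LogEntry] = {}

        def receive_full(self, entry: LogEntry):
            self.log[entry.index] = entry

        def receive_signal(self, index: int, relay: "Follower"):
            # Pull the log body from the relaying follower, not from the leader.
            self.log[index] = relay.log[index]

    class Leader:
        def __init__(self, followers: List[Follower]):
            self.followers = followers
            self.next_index = 0

        def append(self, payload: bytes):
            entry = LogEntry(self.next_index, payload)
            self.next_index += 1
            relay, others = self.followers[0], self.followers[1:]
            relay.receive_full(entry)                 # one full copy leaves the leader
            for f in others:
                f.receive_signal(entry.index, relay)  # only a small signal per peer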
Controlling wireless multihop networks while simultaneously meeting the end-to-end mean delay requirements of different flows is a challenging problem. Additionally, distributed computation of the control parameters adds to the complexity. Building on the notion of discrete review from fluid control of networks, a distributed algorithm is proposed for the control of multihop wireless networks with interference constraints. The algorithm meets end-to-end mean delay requirements by solving an optimization problem at review instants. The optimization incorporates the delay requirements as weights in the objective being maximized; the weights are dynamic and vary with queue-length information. The optimization is carried out in a distributed manner using an incremental gradient ascent algorithm. The stability of the network under the proposed policy is studied analytically, and the policy is shown to be throughput optimal.
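A toy sketch of the review-instant update conveys the idea of queue-dependent weights: flows whose estimated mean delay exceeds their target receive larger weights, and rates are nudged by a gradient step and then projected onto a capacity budget. The weight rule, delay proxy, and normalisation below are illustrative assumptions, not the paper's optimization.

    # Illustrative review-instant update: heavier weights for flows lagging their
    # delay target, then a small projected gradient step on the service rates.
    def review_update(queues, delay_targets, rates, arrival_rates, step=0.1):
        new_rates = []
        for q, d_target, r, lam in zip(queues, delay_targets, rates, arrival_rates):
            est_delay = q / lam if lam > 0 else 0.0   # rough Little's-law proxy
            weight = max(est_delay / d_target, 1e-3)  # grows when the flow lags
            new_rates.append(max(0.0, r + step * weight * q))
        # Project onto a unit capacity/interference budget.
        total = sum(new_rates)
        return [r / total for r in new_rates] if total > 0 else new_rates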
We consider a multicast scheme recently proposed for a wireless downlink in [1]. It was shown earlier that power control can significantly improve its performance; however, for this system, obtaining the optimal power control is intractable because of a very large state space. Therefore, in this paper we use deep reinforcement learning, approximating the Q-function via a deep neural network. We show that optimal power control can be learnt for reasonably large systems via this approach. The average power constraint is enforced via a Lagrange multiplier, which is also learnt. Finally, we demonstrate that a slight modification of the learning algorithm allows the optimal control to track time-varying system statistics.
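The Lagrangian treatment of the average power constraint can be sketched in a few lines of Python: the per-step reward is penalised by a multiplier on transmit power, and the multiplier is driven by dual ascent toward the power budget. The update rules and constants are illustrative assumptions; the Q-network update itself is omitted.

    # Sketch of the constrained-RL idea: penalise reward with a Lagrange
    # multiplier on power and adapt the multiplier toward the power budget.
    def lagrangian_reward(throughput, power, lam):
        return throughput - lam * power

    def dual_ascent_step(lam, avg_power, power_budget, lr=0.01):
        # Raise lam when the running average power exceeds the budget.
        return max(0.0, lam + lr * (avg_power - power_budget))

    # Usage inside a training loop (environment feedback is a placeholder):
    lam, avg_power = 0.0, 0.0
    for step in range(1, 1001):
        throughput, power = 1.0, 0.5
        r = lagrangian_reward(throughput, power, lam)
        avg_power += (power - avg_power) / step
        lam = dual_ascent_step(lam, avg_power, power_budget=0.4)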