In recent years, many techniques have been developed to improve the performance and efficiency of data center networks. While these techniques provide high accuracy, they are often designed using heuristics that leverage domain-specific properties of the workload or hardware. In this vision paper, we argue that many data center networking techniques with diverse goals, e.g., routing, topology augmentation, and energy savings, actually share design and architectural similarities. We present a design for developing general intermediate representations of network topologies using deep learning that are amenable to solving classes of data center problems. We develop a framework, DeepConfig, that simplifies the process of configuring and training deep learning agents that use the intermediate representation to learn different tasks. To illustrate the strength of our approach, we configured, implemented, and evaluated a DeepConfig-Agent that tackles the data center topology augmentation problem. Our initial results are promising: DeepConfig performs comparably to the optimal solution.
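As a concrete illustration of the kind of agent interface the abstract describes, the sketch below pairs a learned encoding of a topology with a policy head that scores candidate links to add. It is a minimal, hypothetical PyTorch example, not the DeepConfig implementation; the encoder, action space, and dimensions are all illustrative assumptions.

```python
# Illustrative sketch only (not the authors' implementation): an agent that
# encodes a data center topology into an intermediate representation and
# selects a candidate link to add (topology augmentation).
import torch
import torch.nn as nn

class TopologyEncoder(nn.Module):
    """Encodes an adjacency matrix into a fixed-size embedding (toy encoder)."""
    def __init__(self, num_nodes: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_nodes * num_nodes, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )

    def forward(self, adj: torch.Tensor) -> torch.Tensor:
        return self.net(adj.flatten(start_dim=1))

class AugmentationPolicy(nn.Module):
    """Scores every candidate link; the agent adds the highest-scoring one."""
    def __init__(self, num_nodes: int, hidden: int = 128):
        super().__init__()
        self.encoder = TopologyEncoder(num_nodes, hidden)
        self.head = nn.Linear(hidden, num_nodes * num_nodes)

    def forward(self, adj: torch.Tensor):
        logits = self.head(self.encoder(adj))
        return torch.distributions.Categorical(logits=logits)

# Usage: sample an augmentation action for a random 8-switch topology.
policy = AugmentationPolicy(num_nodes=8)
adj = (torch.rand(1, 8, 8) > 0.7).float()
action = policy(adj).sample()          # index of the link to add
print(action.item())
```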
Today, network devices share buffer across priority queues to avoid drops during transient congestion. While cost-effective most of the time, this sharing can cause undesired interference among seemingly independent traffic. As a result, low-priority traffic can cause increased packet loss for high-priority traffic. Similarly, long flows can prevent the buffer from absorbing incoming bursts even if they do not share the same queue. The cause of this perhaps unintuitive outcome is that today's buffer sharing techniques are unable to guarantee isolation across (priority) queues without statically allocating buffer space. To address this issue, we designed FB, a novel buffer sharing scheme that offers strict isolation guarantees to high-priority traffic without sacrificing link utilization. Thus, FB outperforms conventional buffer sharing algorithms in absorbing bursts while achieving on-par throughput. We show that FB is practical and runs at line rate on existing hardware (Barefoot Tofino). Significantly, FB's operations can be approximated in non-programmable devices.
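For intuition about the per-packet admission decision a shared-buffer switch makes, the sketch below implements a Dynamic-Threshold-style check with headroom reserved for high-priority queues. This is an assumption-laden illustration of the problem setting, not FB's actual algorithm, which the abstract does not specify.

```python
# Toy shared-buffer admission check (Dynamic-Threshold style with a reserved
# headroom for high-priority traffic). NOT the FB algorithm; illustrative only.
from dataclasses import dataclass, field

@dataclass
class SharedBuffer:
    total: int                      # total buffer size (in cells/packets)
    alpha: float = 1.0              # dynamic-threshold scaling factor
    hi_reserve: int = 0             # cells reserved for high-priority traffic
    queues: dict = field(default_factory=dict)   # queue id -> occupancy

    def occupancy(self) -> int:
        return sum(self.queues.values())

    def admit(self, qid: str, high_priority: bool) -> bool:
        """Return True if one more packet may enter queue `qid`."""
        free = self.total - self.occupancy()
        if not high_priority:
            # Low-priority traffic may not dip into the reserved headroom.
            free -= self.hi_reserve
        threshold = self.alpha * max(free, 0)
        if self.queues.get(qid, 0) < threshold:
            self.queues[qid] = self.queues.get(qid, 0) + 1
            return True
        return False

buf = SharedBuffer(total=1000, alpha=1.0, hi_reserve=200)
print(buf.admit("q0", high_priority=False))   # True while the buffer is empty
```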
Network device syslogs are ubiquitous and abundant in modern data centers, with most large data centers producing millions of messages per day. Yet, the operational information reflected in syslogs and their implications for diagnosis or management tasks are poorly understood. Prevalent approaches to understanding syslogs focus on simple correlation and abnormality detection and are often limited to detection, providing little insight toward diagnosis and resolution. Toward improving data center operations, we propose and implement Log-Prophet, a system that applies a toolbox of statistical techniques and domain-specific models to mine detailed diagnoses. Log-Prophet infers causal relationships between syslog lines and constructs succinct but valuable problem graphs, summarizing root causes and their locality, including cascading problems. We validate Log-Prophet using problem tickets and through operator interviews. To demonstrate the strength of Log-Prophet, we perform an initial longitudinal study of a large online service provider's data center. Our study demonstrates that Log-Prophet significantly reduces the number of alerts while highlighting interesting operational issues.
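The abstract does not detail Log-Prophet's statistical toolbox; the sketch below illustrates just one plausible building block, linking syslog message types that repeatedly co-occur within a short time window into candidate edges of a problem graph. The thresholds, field names, and method are hypothetical.

```python
# Hypothetical helper: build candidate "problem graph" edges from syslog
# co-occurrence. This is an illustration of the general idea, not Log-Prophet.
from collections import Counter

def candidate_problem_graph(events, window=60.0, min_support=5):
    """events: list of (timestamp_seconds, device, message_type) tuples."""
    events = sorted(events)
    pair_counts = Counter()
    for i, (t_i, dev_i, msg_i) in enumerate(events):
        for t_j, dev_j, msg_j in events[i + 1:]:
            if t_j - t_i > window:
                break                       # later events are outside the window
            if (dev_i, msg_i) != (dev_j, msg_j):
                pair_counts[((dev_i, msg_i), (dev_j, msg_j))] += 1
    # Keep only frequently co-occurring pairs as candidate causal edges.
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}
```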
Traffic Engineering (TE) is a basic building block of the Internet. In this paper, we analyze whether modern Machine Learning (ML) methods are ready to be used for TE optimization. We address this open question through a comparative analysis between the state of the art in ML and the state of the art in TE. To this end, we first present a novel distributed system for TE that leverages the latest advancements in ML. Our system implements a novel architecture that combines Multi-Agent Reinforcement Learning (MARL) and Graph Neural Networks (GNN) to minimize network congestion. In our evaluation, we compare our MARL+GNN system with DEFO, a network optimizer based on Constraint Programming that represents the state of the art in TE. Our experimental results show that the proposed MARL+GNN solution achieves equivalent performance to DEFO in a wide variety of network scenarios including three real-world network topologies. At the same time, we show that MARL+GNN can achieve significant reductions in execution time (from the scale of minutes with DEFO to a few seconds with our solution).
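To make the MARL+GNN idea more concrete, the following toy PyTorch snippet runs a few message-passing steps over a network graph and emits one score per directed link, the kind of signal a TE agent could turn into routing weights. It is an illustrative sketch under simplified assumptions, not the authors' architecture.

```python
# Toy GNN that scores links of a network graph; illustrative only.
import torch
import torch.nn as nn

class LinkScoringGNN(nn.Module):
    def __init__(self, node_dim=16, hidden=32):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * node_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, node_dim))
        self.update = nn.GRUCell(node_dim, node_dim)
        self.link_score = nn.Sequential(nn.Linear(2 * node_dim, hidden),
                                        nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, h, edges, steps=3):
        # h: (N, node_dim) node features; edges: list of (src, dst) indices.
        src = torch.tensor([e[0] for e in edges])
        dst = torch.tensor([e[1] for e in edges])
        for _ in range(steps):
            m = self.msg(torch.cat([h[src], h[dst]], dim=-1))
            agg = torch.zeros_like(h).index_add_(0, dst, m)  # sum messages per node
            h = self.update(agg, h)
        return self.link_score(torch.cat([h[src], h[dst]], dim=-1)).squeeze(-1)

gnn = LinkScoringGNN()
h0 = torch.randn(4, 16)                                # 4-node toy topology
scores = gnn(h0, edges=[(0, 1), (1, 2), (2, 3), (3, 0)])
print(scores.shape)                                    # one score per directed link
```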
Network management often relies on machine learning to make predictions about performance and security from network traffic. Often, the representation of the traffic is as important as the choice of the model. The features that the model relies on, and the representation of those features, ultimately determine model accuracy, as well as where and whether the model can be deployed in practice. Thus, the design and evaluation of these models requires understanding not only model accuracy but also the systems costs associated with deploying the model in an operational network. Toward this goal, this paper develops a new framework and system that enables a joint evaluation of both the conventional notions of machine learning performance (e.g., model accuracy) and the systems-level costs of different representations of network traffic. We highlight these two dimensions for two practical network management tasks, video streaming quality inference and malware detection, to demonstrate the importance of exploring different representations to find the appropriate operating point. We demonstrate the benefit of exploring a range of representations of network traffic and present Traffic Refinery, a proof-of-concept implementation that both monitors network traffic at 10 Gbps and transforms traffic in real time to produce a variety of feature representations for machine learning. Traffic Refinery both highlights this design space and makes it possible to explore different representations for learning, balancing systems costs related to feature extraction and model training against model accuracy.
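The core idea, that the same traffic can be represented at very different fidelities and collection costs, can be illustrated with a toy example: from one packet stream we derive cheap per-flow counters and a heavier per-flow time series. The names and record formats are illustrative assumptions, not Traffic Refinery's API.

```python
# Two representations of the same packet stream with different costs; toy code.
from collections import defaultdict

def per_flow_counters(packets):
    """packets: iterable of (timestamp, flow_key, size_bytes). Cheap features."""
    stats = defaultdict(lambda: {"pkts": 0, "bytes": 0})
    for ts, flow, size in packets:
        stats[flow]["pkts"] += 1
        stats[flow]["bytes"] += size
    return dict(stats)

def packet_time_series(packets, bin_seconds=1.0):
    """Richer (and costlier) representation: per-flow byte counts per time bin."""
    series = defaultdict(lambda: defaultdict(int))
    for ts, flow, size in packets:
        series[flow][int(ts // bin_seconds)] += size
    return {flow: dict(bins) for flow, bins in series.items()}

pkts = [(0.1, "10.0.0.1->1.2.3.4:443", 1500), (0.4, "10.0.0.1->1.2.3.4:443", 900)]
print(per_flow_counters(pkts))
print(packet_time_series(pkts))
```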
Deep reinforcement learning (RL) methods have significant potential for dialogue policy optimisation. However, they suffer from poor performance in the early stages of learning. This is especially problematic for on-line learning with real users. Two approaches are introduced to tackle this problem. Firstly, to speed up the learning process, two sample-efficient neural network algorithms are presented: trust region actor-critic with experience replay (TRACER) and episodic natural actor-critic with experience replay (eNACER). For TRACER, the trust region helps to control the learning step size and avoid catastrophic model changes. For eNACER, the natural gradient identifies the steepest ascent direction in policy space to speed up convergence. Both models employ off-policy learning with experience replay to improve sample efficiency. Secondly, to mitigate the cold-start issue, a corpus of demonstration data is utilised to pre-train the models prior to on-line reinforcement learning. Combining these two approaches, we demonstrate a practical way to learn deep RL-based dialogue policies and show their effectiveness in a task-oriented information-seeking domain.
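The two ingredients the abstract emphasises, experience replay and actor-critic learning, can be sketched as follows. This is a generic advantage actor-critic update drawing minibatches from a replay buffer; it omits the trust region and the off-policy importance corrections that TRACER adds, so it is illustrative only.

```python
# Generic actor-critic with experience replay; simplified, not TRACER/eNACER.
import random
from collections import deque
import torch
import torch.nn as nn

replay = deque(maxlen=10_000)   # stores (state, action, reward, next_state, done)

class ActorCritic(nn.Module):
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.policy = nn.Linear(hidden, num_actions)   # actor head
        self.value = nn.Linear(hidden, 1)              # critic head

    def forward(self, s):
        h = self.body(s)
        return torch.distributions.Categorical(logits=self.policy(h)), self.value(h)

def update(model, optimizer, batch_size=32, gamma=0.99):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    dist, v = model(s.float())
    with torch.no_grad():
        _, v2 = model(s2.float())
        target = r.float() + gamma * v2.squeeze(-1) * (1 - done.float())
    advantage = target - v.squeeze(-1)
    # Policy gradient on the detached advantage plus a squared critic error.
    loss = -(dist.log_prob(a) * advantage.detach()).mean() + advantage.pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

model = ActorCritic(state_dim=4, num_actions=3)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
```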