No Arabic abstract
Network device syslogs are ubiquitous and abundant in modern data centers with most large data centers producing millions of messages per day. Yet, the operational information reflected in syslogs and their implications on diagnosis or management tasks are poorly understood. Prevalent approaches to understanding syslogs focus on simple correlation and abnormality detection and are often limited to detection providing little insight towards diagnosis and resolution. Towards improving data center operations, we propose and implement Log-Prophet, a system that applies a toolbox of statistical techniques and domain-specific models to mine detailed diagnoses. Log-Prophet infers causal relationships between syslog lines and constructs succinct but valuable problem graphs, summarizing root causes and their locality, including cascading problems. We validate Log-Prophet using problem tickets and through operator interviews. To demonstrate the strength of Log-Prophet, we perform an initial longitudinal study of a large online service providers data center. Our study demonstrates that Log-Prophet significantly reduces the number of alerts while highlighting interesting operational issues.
Modern semantic parsers suffer from two principal limitations. First, training requires expensive collection of utterance-program pairs. Second, semantic parsers fail to generalize at test time to new compositions/structures that have not been observed during training. Recent research has shown that automatic generation of synthetic utterance-program pairs can alleviate the first problem, but its potential for the second has thus far been under-explored. In this work, we investigate automatic generation of synthetic utterance-program pairs for improving compositional generalization in semantic parsing. Given a small training set of annotated examples and an infinite pool of synthetic examples, we select a subset of synthetic examples that are structurally-diverse and use them to improve compositional generalization. We evaluate our approach on a new split of the schema2QA dataset, and show that it leads to dramatic improvements in compositional generalization as well as moderate improvements in the traditional i.i.d setup. Moreover, structurally-diverse sampling achieves these improvements with as few as 5K examples, compared to 1M examples when sampling uniformly at random -- a 200x improvement in data efficiency.
Detection of malicious behavior is a fundamental problem in security. One of the major challenges in using detection systems in practice is in dealing with an overwhelming number of alerts that are triggered by normal behavior (the so-called false positives), obscuring alerts resulting from actual malicious activity. While numerous methods for reducing the scope of this issue have been proposed, ultimately one must still decide how to prioritize which alerts to investigate, and most existing prioritization methods are heuristic, for example, based on suspiciousness or priority scores. We introduce a novel approach for computing a policy for prioritizing alerts using adversarial reinforcement learning. Our approach assumes that the attackers know the full state of the detection system and dynamically choose an optimal attack as a function of this state, as well as of the alert prioritization policy. The first step of our approach is to capture the interaction between the defender and attacker in a game theoretic model. To tackle the computational complexity of solving this game to obtain a dynamic stochastic alert prioritization policy, we propose an adversarial reinforcement learning framework. In this framework, we use neural reinforcement learning to compute best response policies for both the defender and the adversary to an arbitrary stochastic policy of the other. We then use these in a double-oracle framework to obtain an approximate equilibrium of the game, which in turn yields a robust stochastic policy for the defender. Extensive experiments using case studies in fraud and intrusion detection demonstrate that our approach is effective in creating robust alert prioritization policies.
Recent results in coupled or temporal graphical models offer schemes for estimating the relationship structure between features when the data come from related (but distinct) longitudinal sources. A novel application of these ideas is for analyzing group-level differences, i.e., in identifying if trends of estimated objects (e.g., covariance or precision matrices) are different across disparate conditions (e.g., gender or disease). Often, poor effect sizes make detecting the differential signal over the full set of features difficult: for example, dependencies between only a subset of features may manifest differently across groups. In this work, we first give a parametric model for estimating trends in the space of SPD matrices as a function of one or more covariates. We then generalize scan statistics to graph structures, to search over distinct subsets of features (graph partitions) whose temporal dependency structure may show statistically significant group-wise differences. We theoretically analyze the Family Wise Error Rate (FWER) and bounds on Type 1 and Type 2 error. On a cohort of individuals with risk factors for Alzheimers disease (but otherwise cognitively healthy), we find scientifically interesting group differences where the default analysis, i.e., models estimated on the full graph, do not survive reasonable significance thresholds.
Today, network devices share buffer across priority queues to avoid drops during transient congestion. While cost-effective most of the time, this sharing can cause undesired interference among seemingly independent traffic. As a result, low-priority traffic can cause increased packet loss to high-priority traffic. Similarly, long flows can prevent the buffer from absorbing incoming bursts even if they do not share the same queue. The cause of this perhaps unintuitive outcome is that todays buffer sharing techniques are unable to guarantee isolation across (priority) queues without statically allocating buffer space. To address this issue, we designed FB, a novel buffer sharing scheme that offers strict isolation guarantees to high-priority traffic without sacrificing link utilizations. Thus, FB outperforms conventional buffer sharing algorithms in absorbing bursts while achieving on-par throughput. We show that FB is practical and runs at line-rate on existing hardware (Barefoot Tofino). Significantly, FBs operations can be approximated in non-programmable devices.
In recent years, many techniques have been developed to improve the performance and efficiency of data center networks. While these techniques provide high accuracy, they are often designed using heuristics that leverage domain-specific properties of the workload or hardware. In this vision paper, we argue that many data center networking techniques, e.g., routing, topology augmentation, energy savings, with diverse goals actually share design and architectural similarity. We present a design for developing general intermediate representations of network topologies using deep learning that is amenable to solving classes of data center problems. We develop a framework, DeepConfig, that simplifies the processing of configuring and training deep learning agents that use the intermediate representation to learns different tasks. To illustrate the strength of our approach, we configured, implemented, and evaluated a DeepConfig-Agent that tackles the data center topology augmentation problem. Our initial results are promising --- DeepConfig performs comparably to the optimal.