No Arabic abstract
Efficient virtual machine (VM) management can dramatically reduce energy consumption in data centers. Existing VM management algorithms fall into two categories based on whether the VMs resource demands are assumed to be static or dynamic. The former category fails to maximize the resource utilization as they cannot adapt to the dynamic nature of VMs resource demands. Most approaches in the latter category are heuristical and lack theoretical performance guarantees. In this work, we formulate dynamic VM management as a large-scale Markov Decision Process (MDP) problem and derive an optimal solution. Our analysis of real-world data traces supports our choice of the modeling approach. However, solving the large-scale MDP problem suffers from the curse of dimensionality. Therefore, we further exploit the special structure of the problem and propose an approximate MDP-based dynamic VM management method, called MadVM. We prove the convergence of MadVM and analyze the bound of its approximation error. Moreover, MadVM can be implemented in a distributed system, which should suit the needs of real data centers. Extensive simulations based on two real-world workload traces show that MadVM achieves significant performance gains over two existing baseline approaches in power consumption, resource shortage and the number of VM migrations. Specifically, the more intensely the resource demands fluctuate, the more MadVM outperforms.
Because failures in distribution systems caused by extreme weather events directly result in consumers outages, this paper proposes a state-based decision-making model with the objective of mitigating loss of load to improve the distribution system resilience throughout the unfolding events. The sequentially uncertain system states, e.g., feeder line on/off states, driven by the unfolding events are modeled as Markov states, and the probabilities from one Markov state to another Markov state throughout the unfolding events are determined by the component failure caused by the unfolding events. A recursive optimization model based on Markov decision processes (MDP) is developed to make state-based actions, i.e., system reconfiguration, at each decision time. To overcome the curse of dimensionality caused by enormous states and actions, an approximate dynamic programming (ADP) approach based on post-decision states and iteration is used to solve the proposed MDP-based model. IEEE 33-bus system and IEEE 123-bus system are used to validate the proposed model.
Several applications of Reinforcement Learning suffer from instability due to high variance. This is especially prevalent in high dimensional domains. Regularization is a commonly used technique in machine learning to reduce variance, at the cost of introducing some bias. Most existing regularization techniques focus on spatial (perceptual) regularization. Yet in reinforcement learning, due to the nature of the Bellman equation, there is an opportunity to also exploit temporal regularization based on smoothness in value estimates over trajectories. This paper explores a class of methods for temporal regularization. We formally characterize the bias induced by this technique using Markov chain concepts. We illustrate the various characteristics of temporal regularization via a sequence of simple discrete and continuous MDPs, and show that the technique provides improvement even in high-dimensional Atari games.
We study controllable text summarization which allows users to gain control on a particular attribute (e.g., length limit) of the generated summaries. In this work, we propose a novel training framework based on Constrained Markov Decision Process (CMDP), which conveniently includes a reward function along with a set of constraints, to facilitate better summarization control. The reward function encourages the generation to resemble the human-written reference, while the constraints are used to explicitly prevent the generated summaries from violating user-imposed requirements. Our framework can be applied to control important attributes of summarization, including length, covered entities, and abstractiveness, as we devise specific constraints for each of these aspects. Extensive experiments on popular benchmarks show that our CMDP framework helps generate informative summaries while complying with a given attributes requirement.
In this paper, a rather general online problem called dynamic resource allocation with capacity constraints (DRACC) is introduced and studied in the realm of posted price mechanisms. This problem subsumes several applications of stateful pricing, including but not limited to posted prices for online job scheduling and matching over a dynamic bipartite graph. As the existing online learning techniques do not yield vanishing-regret mechanisms for this problem, we develop a novel online learning framework defined over deterministic Markov decision processes with dynamic state transition and reward functions. We then prove that if the Markov decision process is guaranteed to admit an oracle that can simulate any given policy from any initial state with bounded loss -- a condition that is satisfied in the DRACC problem -- then the online learning problem can be solved with vanishing regret. Our proof technique is based on a reduction to online learning with switching cost, in which an online decision maker incurs an extra cost every time she switches from one arm to another. We formally demonstrate this connection and further show how DRACC can be used in our proposed applications of stateful pricing.
Studies on massive open online courses (MOOCs) users discuss the existence of typical profiles and their impact on the learning process of the students. However defining the typical behaviors as well as classifying the users accordingly is a difficult task. In this paper we suggest two methods to model MOOC users behaviour given their log data. We mold their behavior into a Markov Decision Process framework. We associate the users intentions with the MDP reward and argue that this allows us to classify them.