Motivated by emerging applications such as live-streaming e-commerce, promotions and recommendations, we introduce a general class of multi-armed bandit problems that have the following two features: (i) the decision maker can pull and collect rewards from at most $K$ out of $N$ different arms in each time period; (ii) the expected reward of an arm immediately drops after it is pulled, and then nonparametrically recovers as the idle time increases. With the objective of maximizing expected cumulative rewards over $T$ time periods, we propose, construct and prove performance guarantees for a class of Purely Periodic Policies. For the offline problem when all model parameters are known, our proposed policy obtains an approximation ratio on the order of $1-\mathcal{O}(1/\sqrt{K})$, which is asymptotically optimal as $K$ grows to infinity. For the online problem when the model parameters are unknown and need to be learned, we design an Upper Confidence Bound (UCB) based policy that achieves approximately $\widetilde{\mathcal{O}}(N\sqrt{T})$ regret against the offline benchmark. Our framework and policy design may have the potential to be adapted to other offline planning and online learning applications with non-stationary and recovering rewards.
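As a concrete illustration of this setting (not the paper's Purely Periodic Policy or its UCB analysis), the sketch below simulates $N$ arms of which at most $K$ may be pulled per period, with each arm's expected reward nondecreasing in its idle time, and runs a naive UCB-style index over (arm, idle-time) pairs. The capped idle times, recovery curves, noise level, and index are illustrative assumptions.

    # Minimal sketch of the recovering-rewards bandit setting (illustrative only;
    # not the paper's policy). Assumptions: capped idle times, random nondecreasing
    # recovery curves, Gaussian reward noise, naive per-(arm, idle-time) UCB index.
    import numpy as np

    rng = np.random.default_rng(0)
    N, K, T = 10, 3, 2000
    d_max = 8                                      # idle times are capped at d_max

    # Assumed recovery curves: R[i, d] = expected reward of arm i after d idle periods.
    R = np.sort(rng.uniform(0.0, 1.0, size=(N, d_max + 1)), axis=1)

    counts = np.ones((N, d_max + 1))               # observations per (arm, idle time)
    means = np.zeros((N, d_max + 1))               # empirical means per (arm, idle time)
    idle = np.full(N, d_max)                       # current idle time of each arm
    total = 0.0

    for t in range(1, T + 1):
        idx = means[np.arange(N), idle] + np.sqrt(2.0 * np.log(t) / counts[np.arange(N), idle])
        chosen = set(np.argsort(idx)[-K:])         # pull the K arms with the largest index
        for i in range(N):
            if i in chosen:
                d = idle[i]
                reward = R[i, d] + 0.1 * rng.standard_normal()
                counts[i, d] += 1
                means[i, d] += (reward - means[i, d]) / counts[i, d]
                total += reward
                idle[i] = 0                        # the reward drops right after a pull ...
            else:
                idle[i] = min(idle[i] + 1, d_max)  # ... and recovers as the idle time grows

    print(f"average per-period reward: {total / T:.3f}")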
How much credit (or blame) should an action taken in a state get for a future reward? This is the fundamental temporal credit assignment problem in Reinforcement Learning (RL). One of the earliest and still most widely used heuristics is to assign this credit based on a scalar coefficient $\lambda$ (treated as a hyperparameter) raised to the power of the time interval between the state-action and the reward. In this empirical paper, we explore heuristics based on more general pairwise weightings that are functions of the state in which the action was taken, the state at the time of the reward, as well as the time interval between the two. Of course it isn't clear what these pairwise weight functions should be, and because they are too complex to be treated as hyperparameters, we develop a metagradient procedure for learning these weight functions during the usual RL training of a policy. Our empirical work shows that it is often possible to learn these pairwise weight functions during learning of the policy to achieve better performance than competing approaches.
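As a minimal illustration of the pairwise-weighting idea (leaving out the metagradient machinery that actually learns the weights), the sketch below computes per-timestep credited returns under an arbitrary weight function of the two states and their time gap; the fixed exponential weight shown simply recovers the usual $\lambda$-style discounting and is purely a placeholder.

    # Illustrative sketch of pairwise credit weights (the paper learns the weight
    # function with metagradients; here it is a hand-picked placeholder).
    import numpy as np

    def weighted_returns(states, rewards, weight_fn):
        """G[t] = sum over k >= 0 of weight_fn(states[t], states[t+k], k) * rewards[t+k]."""
        T = len(rewards)
        G = np.zeros(T)
        for t in range(T):
            for k in range(T - t):
                G[t] += weight_fn(states[t], states[t + k], k) * rewards[t + k]
        return G

    # A fixed exponential weight reduces to the classic lambda-discounted credit.
    lam = 0.9
    exp_weight = lambda s_act, s_rew, gap: lam ** gap

    states = np.arange(6)                               # placeholder observations
    rewards = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 1.0])
    print(weighted_returns(states, rewards, exp_weight))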
In this work, we study auxiliary prediction tasks defined by temporal-difference networks (TD networks); these networks are a language for expressing a rich space of general value function (GVF) prediction targets that may be learned efficiently with TD. Through analysis in an illustrative domain we show the benefits, for learning state representations, of exploiting the full richness of TD networks, including both action-conditional predictions and temporally deep predictions. Our main (and perhaps surprising) result is that deep action-conditional TD networks with random structures that create random prediction-questions about random features yield state representations that are competitive with state-of-the-art hand-crafted value prediction and pixel control auxiliary tasks in both Atari games and DeepMind Lab tasks. We also show through stop-gradient experiments that learning the state representations solely via these unsupervised random TD network prediction tasks yields agents that outperform the end-to-end-trained actor-critic baseline.
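A heavily simplified sketch of a single "random prediction question" in the spirit described above: a general value function that predicts the discounted sum of a random feature of the state, learned with tabular TD(0). The random-jump dynamics, the tabular representation, and the omission of action conditioning and temporal depth are all simplifying assumptions of this illustration.

    # One random GVF prediction question learned with TD(0) (grossly simplified;
    # the paper uses deep networks, action-conditional and temporally deep
    # questions, and many of them as auxiliary tasks).
    import numpy as np

    rng = np.random.default_rng(1)
    n_states, gamma, lr = 5, 0.9, 0.05
    w = rng.standard_normal(n_states)      # random feature: cumulant c(s) = w[s]
    v = np.zeros(n_states)                 # tabular answer to the prediction question

    s = 0
    for _ in range(20000):
        s_next = rng.integers(n_states)    # placeholder dynamics: uniform jumps
        target = w[s_next] + gamma * v[s_next]
        v[s] += lr * (target - v[s])
        s = s_next

    print(np.round(v, 2))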
Yufeng Zheng, Zeyu Zheng (2020)
We propose a new framework named DS-WGAN that integrates the doubly stochastic (DS) structure and the Wasserstein generative adversarial networks (WGAN) to model, estimate, and simulate a wide class of arrival processes with general non-stationary and random arrival rates. Regarding statistical properties, we prove consistency and a convergence rate for the estimator obtained from the DS-WGAN framework under a non-parametric smoothness condition. Regarding computational efficiency and tractability, we address a challenge in gradient evaluation and model estimation that arises from the discontinuity in the simulator. We then show that the DS-WGAN framework can conveniently facilitate what-if simulation and predictive simulation for future scenarios that are different from the history. Numerical experiments with synthetic and real data sets are implemented to demonstrate the performance of DS-WGAN, measured from both a statistical perspective and an operational performance evaluation perspective. Numerical experiments suggest that, in terms of performance, successful model estimation for DS-WGAN requires only a moderate amount of representative data, which can be appealing in many contexts of operational management.
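To make the doubly stochastic structure concrete (independently of the WGAN estimation machinery), the sketch below first draws a random day-level arrival-rate path and then draws per-interval arrival counts from Poisson distributions with those rates; the particular rate model is an illustrative assumption. Variance-to-mean ratios exceeding one are the overdispersion signature that distinguishes such processes from a plain Poisson process.

    # Minimal sketch of a doubly stochastic arrival process (illustrative; not the
    # DS-WGAN estimation procedure): random rates first, then Poisson counts.
    import numpy as np

    rng = np.random.default_rng(0)
    n_intervals, n_days = 24, 1000               # e.g., hourly intervals over many days

    # Assumed random rate model: a deterministic daily shape scaled by a random
    # day-level "busyness" factor, which is what makes the process doubly stochastic.
    base_shape = 50 + 30 * np.sin(np.linspace(0, 2 * np.pi, n_intervals))
    busyness = rng.gamma(shape=4.0, scale=0.25, size=(n_days, 1))   # mean 1, random per day
    rates = busyness * base_shape                                   # shape (n_days, n_intervals)

    counts = rng.poisson(rates)                  # arrival counts per (day, interval)

    # For a plain Poisson process the variance would equal the mean.
    print("mean counts per interval:", counts.mean(axis=0).round(1))
    print("variance / mean ratio   :", (counts.var(axis=0) / counts.mean(axis=0)).round(2))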
The objective of a reinforcement learning agent is to behave so as to maximise the sum of a suitable scalar function of state: the reward. These rewards are typically given and immutable. In this paper, we instead consider the proposition that the re ward function itself can be a good locus of learned knowledge. To investigate this, we propose a scalable meta-gradient framework for learning useful intrinsic reward functions across multiple lifetimes of experience. Through several proof-of-concept experiments, we show that it is feasible to learn and capture knowledge about long-term exploration and exploitation into a reward function. Furthermore, we show that unlike policy transfer methods that capture how the agent should behave, the learned reward functions can generalise to other kinds of agents and to changes in the dynamics of the environment by capturing what the agent should strive to do.
This paper introduces a new asymptotic regime for simplifying stochastic models having non-stationary effects, such as those that arise in the presence of time-of-day effects. This regime describes an operating environment within which the arrival process to a service system has an arrival intensity that is fluctuating rapidly. We show that such a service system is well approximated by the corresponding model in which the arrival process is Poisson with a constant arrival rate. In addition to the basic weak convergence theorem, we also establish a first-order correction for the distribution of the cumulative number of arrivals over $[0,t]$, as well as the number-in-system process for an infinite-server queue fed by an arrival process having a rapidly changing arrival rate. This new asymptotic regime provides a second regime within which non-stationary stochastic models can be reasonably approximated by a process with stationary dynamics, thereby complementing the previously studied setting within which rates vary slowly in time.
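One way to see which constant rate emerges (in notation of our own choosing, assuming the intensity has a long-run average): if the intensity is $\lambda(ct)$ with $c \to \infty$ modeling rapid fluctuation, and $\bar\lambda = \lim_{u\to\infty} \frac{1}{u}\int_0^u \lambda(s)\,ds$, then the expected number of arrivals in $[0,t]$ satisfies $\int_0^t \lambda(cs)\,ds = \frac{1}{c}\int_0^{ct} \lambda(u)\,du \to t\bar\lambda$, which is exactly the mean of a Poisson process with constant rate $\bar\lambda$.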
Zeyu Zheng, Harsha Honnappa (2018)
This paper is concerned with the development of rigorous approximations to various expectations associated with Markov chains and processes having non-stationary transition probabilities. Such non-stationary models arise naturally in contexts in which time-of-day effects or seasonality effects need to be incorporated. Our approximations are valid asymptotically in regimes in which the transition probabilities change slowly over time. Specifically, we develop approximations for the expected infinite horizon discounted reward, the expected reward to the hitting time of a set, the expected reward associated with the state occupied by the chain at time $n$, and the expected cumulative reward over an interval $[0,n]$. In each case, the approximation involves a linear system of equations identical in form to that which one would need to solve to compute the corresponding quantity for a Markov model having stationary transition probabilities. In that sense, the theory provides an approximation no harder to compute than in the traditional stationary context. While most of the theory is developed for finite state Markov chains, we also provide generalizations to continuous state Markov chains, and finite state Markov jump processes in continuous time. In the latter context, one of our approximations coincides with the uniform acceleration asymptotic due to Massey and Whitt (1998).
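For reference, the stationary-case computation whose form the approximations above mirror: with transition matrix $P$, one-step reward vector $r$, and discount factor $\alpha$, the expected infinite-horizon discounted reward $v$ solves the linear system $v = r + \alpha P v$. The chain, rewards, and discount below are illustrative assumptions.

    # Expected infinite-horizon discounted reward of a stationary finite-state
    # Markov chain, via the linear system v = r + alpha * P v (illustrative data).
    import numpy as np

    P = np.array([[0.9, 0.1, 0.0],     # transition probabilities of a 3-state chain
                  [0.2, 0.7, 0.1],
                  [0.0, 0.3, 0.7]])
    r = np.array([1.0, 0.0, 5.0])      # one-step reward in each state
    alpha = 0.95                       # discount factor

    v = np.linalg.solve(np.eye(3) - alpha * P, r)   # v = (I - alpha P)^{-1} r
    print(np.round(v, 2))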
In many sequential decision making tasks, it is challenging to design reward functions that help an RL agent efficiently learn behavior that is considered good by the agent designer. A number of different formulations of the reward-design problem, or close variants thereof, have been proposed in the literature. In this paper we build on the Optimal Rewards Framework of Singh et al. that defines the optimal intrinsic reward function as one that when used by an RL agent achieves behavior that optimizes the task-specifying or extrinsic reward function. Previous work in this framework has shown how good intrinsic reward functions can be learned for lookahead search based planning agents. Whether it is possible to learn intrinsic reward functions for learning agents remains an open problem. In this paper we derive a novel algorithm for learning intrinsic rewards for policy-gradient based learning agents. We compare the performance of an augmented agent that uses our algorithm to provide additive intrinsic rewards to an A2C-based policy learner (for Atari games) and a PPO-based policy learner (for Mujoco domains) with a baseline agent that uses the same policy learners but with only extrinsic rewards. Our results show improved performance on most but not all of the domains.
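The toy sketch below illustrates only the two-level structure (an inner policy-gradient learner trained on extrinsic plus additive intrinsic reward, and an outer loop adapting the intrinsic reward so that the updated policy's extrinsic return improves); the bandit environment, per-arm bonuses, and the finite-difference stand-in for the meta-gradient are illustrative assumptions rather than the paper's algorithm.

    # Toy sketch of learning additive intrinsic rewards for a policy-gradient
    # learner (illustrative; a finite-difference outer update stands in for the
    # paper's meta-gradient derivation, and the environment is a 2-armed bandit).
    import numpy as np

    rng = np.random.default_rng(0)
    p_success = np.array([0.3, 0.6])             # extrinsic success probability of each arm

    def extrinsic_return(theta, n=2000):
        probs = np.exp(theta) / np.exp(theta).sum()
        arms = rng.choice(2, size=n, p=probs)
        return (rng.random(n) < p_success[arms]).mean()

    def inner_update(theta, eta, steps=200, lr=0.1):
        """REINFORCE on the mixed reward r_ex + eta[arm] (eta = intrinsic bonus per arm)."""
        theta = theta.copy()
        for _ in range(steps):
            probs = np.exp(theta) / np.exp(theta).sum()
            a = rng.choice(2, p=probs)
            r = float(rng.random() < p_success[a]) + eta[a]
            grad_logp = -probs
            grad_logp[a] += 1.0
            theta += lr * r * grad_logp
        return theta

    theta, eta = np.zeros(2), np.zeros(2)
    for _ in range(20):
        theta = inner_update(theta, eta)
        for i in range(2):                       # finite-difference stand-in for the meta-gradient
            e = np.zeros(2)
            e[i] = 0.05
            gain = (extrinsic_return(inner_update(theta, eta + e))
                    - extrinsic_return(inner_update(theta, eta - e)))
            eta[i] += 0.5 * gain

    print("intrinsic bonuses:", np.round(eta, 2),
          " extrinsic return:", round(extrinsic_return(theta), 3))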
Econophysics and econometrics agree that there is a correlation between volume and volatility in a time series. Using empirical data and their distributions, we further investigate this correlation and discover new ways that volatility and volume interact, particularly when the levels of both are high. We find that the distribution of the volume-conditional volatility is well fit by a power-law function with an exponential cutoff. We find that the volume-conditional volatility distribution scales with volume, and that this scaling collapses the distributions onto a single curve. We exploit the characteristics of the volume-volatility scatter plot to find a strong correlation between logarithmic volume and a quantity we define as local maximum volatility (LMV), which indicates the largest volatility observed in a given range of trading volumes. This finding supports our empirical analysis showing that volume is an excellent predictor of the maximum value of volatility for both same-day and near-future time periods. We also use a joint conditional probability that includes both volatility and volume to demonstrate that invoking both allows us to better predict the largest next-day volatility than invoking either one alone.
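For concreteness, the sketch below fits the reported functional form, a power law with an exponential cutoff $f(x) = C\,x^{-\alpha} e^{-x/x_c}$, to a synthetic histogram; the synthetic data and parameter values are assumptions for demonstration only.

    # Fit a power law with exponential cutoff to synthetic histogram values
    # (illustrative; real use would fit the empirical volume-conditional
    # volatility distribution instead).
    import numpy as np
    from scipy.optimize import curve_fit

    def power_law_cutoff(x, c, alpha, x_c):
        return c * x ** (-alpha) * np.exp(-x / x_c)

    rng = np.random.default_rng(0)
    x = np.linspace(0.5, 10.0, 40)
    y = power_law_cutoff(x, 1.0, 1.5, 3.0) * (1 + 0.05 * rng.standard_normal(x.size))

    params, _ = curve_fit(power_law_cutoff, x, y, p0=(1.0, 1.0, 1.0))
    print("fitted (C, alpha, x_c):", np.round(params, 2))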
In a highly interdependent economic world, the nature of relationships between financial entities is becoming an increasingly important area of study. Recently, many studies have shown the usefulness of minimal spanning trees (MST) in extracting interactions between financial entities. Here, we propose a modified MST network whose metric distance is defined in terms of cross-correlation coefficient absolute values, enabling the connections between anticorrelated entities to manifest properly. We investigate 69 daily time series, comprising three types of financial assets: 28 stock market indicators, 21 currency futures, and 20 commodity futures. We show that though the resulting MST network evolves over time, the financial assets of similar type tend to have connections that are stable over time. In addition, we find a characteristic time lag between the volatility time series of the stock market indicators and those of the EU CO2 emission allowance (EUA) and crude oil futures (WTI). This time lag is given by the peak of the cross-correlation function of the volatility time series of EUA (or WTI) with that of the stock market indicators, and it differs markedly from zero (by more than 20 days), showing that the volatility of stock market indicators today can predict the volatility of EU emission allowances and of crude oil in the near future.
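A minimal sketch of the modified-metric idea on synthetic data: using absolute correlation values in the distance lets strongly anticorrelated assets sit close together in the MST. The specific distance $d_{ij} = \sqrt{2(1-|\rho_{ij}|)}$ (the common correlation metric with $\rho$ replaced by $|\rho|$) and the synthetic return series are assumptions of this illustration.

    # Build an MST whose metric uses absolute correlations, so anticorrelated
    # assets can also end up adjacent (illustrative synthetic data).
    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree

    rng = np.random.default_rng(0)
    n_assets, n_days = 8, 500
    returns = rng.standard_normal((n_days, n_assets))
    returns[:, 1] = -returns[:, 0] + 0.1 * rng.standard_normal(n_days)   # anticorrelated pair

    rho = np.corrcoef(returns, rowvar=False)             # cross-correlation matrix
    dist = np.sqrt(2.0 * (1.0 - np.abs(rho)))            # |rho| keeps anticorrelations close
    np.fill_diagonal(dist, 0.0)

    mst = minimum_spanning_tree(dist)                     # sparse matrix of tree edges
    for i, j in zip(*mst.nonzero()):
        print(f"edge {i} -- {j}  (|rho| = {abs(rho[i, j]):.2f})")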