
We study the use of policy gradient algorithms to optimize over a class of generalized Thompson sampling policies. Our central insight is to view the posterior parameter sampled by Thompson sampling as a kind of pseudo-action. Policy gradient methods can then be tractably applied to search over a class of sampling policies, which determine a probability distribution over pseudo-actions (i.e., sampled parameters) as a function of observed data. We also propose and compare policy gradient estimators that are specialized to Bayesian bandit problems. Numerical experiments demonstrate that direct policy search on top of Thompson sampling automatically corrects for some of the algorithm's known shortcomings and offers meaningful improvements even in long-horizon problems where standard Thompson sampling is extremely effective.
We consider a finite-horizon multi-armed bandit (MAB) problem in a Bayesian setting, for which we propose an information relaxation sampling framework. With this framework, we define an intuitive family of control policies that includes Thompson sampling (TS) and the Bayesian optimal policy as endpoints. Analogous to TS, which at each decision epoch pulls the arm that is best with respect to randomly sampled parameters, our algorithms sample entire future reward realizations and take the corresponding best action. However, this is done in the presence of penalties that seek to compensate for the availability of future information. We develop several novel policies and performance bounds for MAB problems that trade off improved performance against increased computational complexity between the two endpoints. Our policies can be viewed as natural generalizations of TS that simultaneously incorporate knowledge of the time horizon and explicitly consider the exploration-exploitation trade-off. We prove associated structural results on performance bounds and suboptimality gaps. Numerical experiments suggest that this new class of policies performs well, particularly in settings where the finite time horizon introduces significant exploration-exploitation tension into the problem. Finally, inspired by the finite-horizon Gittins index, we propose an index policy that builds on our framework and outperforms state-of-the-art algorithms in our numerical experiments.
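As a point of reference for the TS endpoint described above, the following is a minimal sketch of standard Thompson sampling, not the information relaxation policies proposed in the paper. It assumes Bernoulli rewards with independent Beta(1, 1) priors on each arm; the function name `thompson_sampling` and its parameters are hypothetical and chosen only for illustration.

```python
import random


def thompson_sampling(true_means, horizon, seed=0):
    """Run Beta-Bernoulli Thompson sampling (illustrative assumption:
    Bernoulli rewards, uniform Beta(1, 1) prior on every arm)."""
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1.0] * k  # posterior parameter: 1 + observed successes per arm
    beta = [1.0] * k   # posterior parameter: 1 + observed failures per arm
    pulls = [0] * k
    total_reward = 0

    for _ in range(horizon):
        # Sample one mean-reward parameter per arm from its posterior.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        # Pull the arm that is best with respect to the sampled parameters.
        arm = max(range(k), key=lambda i: samples[i])
        # Simulate a Bernoulli reward from the (unknown to the policy) true mean.
        reward = 1 if rng.random() < true_means[arm] else 0
        # Conjugate posterior update for the pulled arm only.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
        total_reward += reward

    return pulls, total_reward
```

Note that this baseline never uses `horizon` in its decisions, only as the loop length; the abstract's point is that the proposed policies incorporate knowledge of the time horizon, which plain TS ignores.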
The composition of natural liquidity has been changing over time. An analysis of intraday volumes for the S&P500 constituent stocks illustrates that (i) volume surprises, i.e., deviations from their respective forecasts, are correlated across stocks, and (ii) this correlation increases during the last few hours of the trading session. These observations could be attributed, in part, to the prevalence of portfolio trading activity that is implicit in the growth of ETF, passive, and systematic investment strategies, and to the increased trading intensity of such strategies towards the end of the trading session, e.g., due to execution of mutual fund inflows/outflows that are benchmarked to the closing price on each day. In this paper, we investigate the consequences of such portfolio liquidity on price impact and portfolio execution. We derive a linear cross-asset market impact from a stylized model that explicitly captures the fact that a certain fraction of natural liquidity providers only trade portfolios of stocks whenever they choose to execute. We find that due to cross-impact and its intraday variation, it is optimal for a risk-neutral, cost-minimizing liquidator to execute a portfolio of orders in a coupled manner, as opposed to a separable VWAP-like execution that is often assumed. The optimal schedule couples the execution of the various orders so as to be able to take advantage of increased portfolio liquidity towards the end of the day. A worst-case analysis shows that the potential cost reduction from this optimized execution schedule over the separable approach can be as high as 6% for plausible model parameters. Finally, we discuss how to estimate cross-sectional price impact from a dataset of realized portfolio transaction records by exploiting the low-rank structure of the coefficient matrix suggested by our analysis.