Value decomposition methods are among the approaches to the Dec-POMDP problem that have recently achieved good results. However, most of them require the global state during training, which is infeasible in scenarios where the global state cannot be obtained. We therefore propose a novel value decomposition framework, State Inference for value DEcomposition (SIDE), which eliminates the need to know the true state by simultaneously solving the two problems of optimal control and state inference. SIDE can be extended to any value decomposition method, as well as to other types of multi-agent algorithms in the Dec-POMDP setting. Based on the performance of different algorithms on StarCraft II micromanagement tasks, we verify that SIDE can construct, from past local observations, a current state that benefits the reinforcement learning process.
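As a hedged illustration of what "value decomposition" means in general, the sketch below uses the simplest additive form, in which the joint value is the sum of per-agent utilities. This is an assumption chosen for illustration only: it is not the SIDE framework and does not model state inference. The function names and the toy utility tables are hypothetical.

```python
from itertools import product

# Assumption: additive decomposition, Q_tot(a_1..a_n) = sum_i Q_i(a_i).
# Each agent's utilities are a plain list indexed by that agent's action.

def q_total(per_agent_q, joint_action):
    """Joint value of a joint action under the additive decomposition."""
    return sum(q[a] for q, a in zip(per_agent_q, joint_action))

def decentralized_greedy(per_agent_q):
    """Each agent independently picks the action maximizing its own utility."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in per_agent_q)

def centralized_greedy(per_agent_q):
    """Brute-force search over all joint actions for the best joint value."""
    action_spaces = [range(len(q)) for q in per_agent_q]
    return max(product(*action_spaces),
               key=lambda joint: q_total(per_agent_q, joint))

# Toy utilities for two agents (hypothetical numbers).
toy_q = [[0.1, 0.9, 0.3], [0.5, 0.2]]
print(decentralized_greedy(toy_q), centralized_greedy(toy_q))
```

Under this additive assumption the per-agent greedy choices coincide with the joint maximizer, which is the property that lets agents act on local information at execution time.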
Neural dialogue models have been widely adopted in various chatbot applications because of their good performance in simulating and generalizing human conversations. However, these models have a dark side: due to the vulnerability of neural networks, a neural dialogue model can be manipulated by users into saying what they want, which raises concerns about the security of practical chatbot services. In this work, we investigate whether we can craft inputs that lead a well-trained black-box neural dialogue model to generate targeted outputs. We formulate this as a reinforcement learning (RL) problem and train a Reverse Dialogue Generator that efficiently finds such inputs for targeted outputs. Experiments conducted on a representative neural dialogue model show that our proposed model is able to discover such inputs in a considerable portion of cases. Overall, our work reveals this weakness of neural dialogue models and may prompt further research on developing corresponding solutions to avoid it.
We consider the issue of strategic behaviour in various peer-assessment tasks, including peer grading of exams or homework and peer review in hiring or promotions. When a peer-assessment task is competitive (e.g., when students are graded on a curve), agents may be incentivized to misreport evaluations in order to improve their own final standing. Our focus is on designing methods for detecting such manipulations. Specifically, we consider a setting in which agents evaluate a subset of their peers and output rankings that are later aggregated to form a final ordering. In this paper, we investigate a statistical framework for this problem and design a principled test for detecting strategic behaviour. We prove that our test has strong false alarm guarantees and evaluate its detection ability in practical settings. To this end, we design and execute an experiment that elicits strategic behaviour from subjects, and we release a dataset of patterns of strategic behaviour that may be of independent interest. We then use the collected data to conduct a series of real and semi-synthetic evaluations that demonstrate the strong detection power of our test.
Learning is an inherently continuous phenomenon. When humans learn a new task, there is no explicit distinction between training and inference: we keep learning about a task while performing it. What we learn and how we learn it vary across different stages of learning. Learning how to learn and adapt is a key property that enables us to generalize effortlessly to new settings. This is in contrast with conventional settings in machine learning, where a trained model is frozen during inference. In this paper we study the problem of learning to learn at both training and test time in the context of visual navigation. A fundamental challenge in navigation is generalization to unseen scenes. We propose a self-adaptive visual navigation method (SAVN) which learns to adapt to new environments without any explicit supervision. Our solution is a meta-reinforcement learning approach where an agent learns a self-supervised interaction loss that encourages effective navigation. Our experiments, performed in the AI2-THOR framework, show major improvements in both success rate and SPL for visual navigation in novel scenes. Our code and data are available at: https://github.com/allenai/savn .
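The abstract reports SPL without defining it. As a minimal sketch, assuming the commonly used definition of Success weighted by Path Length, SPL = (1/N) Σ S_i · l_i / max(p_i, l_i), where S_i is a binary success indicator, l_i the shortest-path length, and p_i the length of the path the agent actually took, the metric could be computed as follows. The function name and argument names are hypothetical and are not taken from the SAVN code base.

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes:        iterable of 0/1 (or bool) success indicators S_i
    shortest_lengths: shortest-path length l_i for each episode
    path_lengths:     length p_i of the path the agent actually took
    """
    successes = list(successes)
    shortest_lengths = list(shortest_lengths)
    path_lengths = list(path_lengths)
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same number of episodes")
    if not successes:
        raise ValueError("at least one episode is required")

    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        denom = max(p, l)
        # Guard against a degenerate zero-length episode.
        ratio = (l / denom) if denom > 0 else 0.0
        total += float(s) * ratio
    return total / len(successes)

# Three episodes: optimal success, failure, success with a path twice as long.
print(spl([1, 0, 1], [10, 10, 10], [10, 20, 20]))
```

A failed episode contributes zero regardless of path length, and a successful one is discounted by how much longer the agent's path was than the shortest path.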
A common practice in many auctions is to offer bidders an opportunity to improve their bids, known as a Best and Final Offer (BAFO) stage. This final bid can depend on new information provided about either the asset or the competitors. This paper examines the effects of new information regarding competitors, seeking to determine what information the auctioneer should provide, assuming the set of allowable bids is discrete. The rational strategy profile that maximizes the auctioneer's revenue is the one in which each bidder makes the highest possible bid that is lower than his valuation of the item. This strategy profile is an equilibrium for a large enough number of bidders, regardless of the information released. We compare the number of bidders needed for this profile to be an equilibrium under different information settings. We find that it becomes an equilibrium with fewer bidders when less additional information about the competition is made available to the bidders. It follows that when the number of bidders is a priori unknown, the auctioneer gains some advantage by not revealing information.
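The revenue-maximizing profile described above, in which each bidder places the highest allowable bid strictly below his valuation, can be sketched as follows. This is an illustrative sketch only: the bid grid, the valuations, and the function names are hypothetical, and the code does not check whether the profile is an equilibrium.

```python
def highest_bid_below(valuation, allowable_bids):
    """Highest allowable bid strictly lower than the valuation.

    Returns None when no allowable bid is below the valuation.
    """
    candidates = [b for b in allowable_bids if b < valuation]
    return max(candidates) if candidates else None

def strategy_profile(valuations, allowable_bids):
    """Bid chosen by each bidder under the profile described in the text."""
    return [highest_bid_below(v, allowable_bids) for v in valuations]

# Hypothetical discrete bid grid and bidder valuations.
bids = [0, 2, 4, 6, 8]
print(strategy_profile([7.5, 6, 3], bids))
```

Note that a bidder whose valuation equals an allowable bid steps down to the next lower one, because the bid must be strictly below the valuation.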