Large scale analysis of source code, and in particular scientific source code, holds the promise of better understanding the data science process, identifying analytical best practices, and providing insights to the builders of scientific toolkits. However, large corpora have remained unanalyzed in depth, as descriptive labels are absent and require expert domain knowledge to generate. We propose a novel weakly supervised transformer-based architecture for computing joint representations of code from both abstract syntax trees and surrounding natural language comments. We then evaluate the model on a new classification task for labeling computational notebook cells as stages in the data analysis process from data import to wrangling, exploration, modeling, and evaluation. We show that our model, leveraging only easily available weak supervision, achieves a 38% increase in accuracy over expert-supplied heuristics and outperforms a suite of baselines. Our model enables us to examine a set of 118,000 Jupyter Notebooks to uncover common data analysis patterns. Focusing on notebooks with relationships to academic articles, we conduct the largest-ever study of scientific code and find that notebook composition correlates with the citation count of corresponding papers.
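To make the model's inputs concrete, here is a minimal sketch of pulling AST node types and comment text out of each code cell of a notebook with the Python standard library. This is an illustration only, not the authors' pipeline: the file name notebook.ipynb and the particular feature choices are assumptions, and the transformer, the weak supervision, and the stage labels (import, wrangle, explore, model, evaluate) are not shown.

```python
import ast
import json
import tokenize
from io import StringIO

def cell_features(source: str):
    """Return (AST node types, comment strings) for one code cell."""
    try:
        node_types = [type(n).__name__ for n in ast.walk(ast.parse(source))]
    except SyntaxError:
        node_types = []  # cells with magics or partial code may not parse
    comments = []
    try:
        for tok in tokenize.generate_tokens(StringIO(source).readline):
            if tok.type == tokenize.COMMENT:
                comments.append(tok.string.lstrip("#").strip())
    except (tokenize.TokenizeError, IndentationError):
        pass
    return node_types, comments

# notebook.ipynb is a placeholder path; .ipynb files are plain JSON.
with open("notebook.ipynb") as f:
    nb = json.load(f)

for cell in nb.get("cells", []):
    if cell.get("cell_type") == "code":
        src = "".join(cell.get("source", []))
        print(cell_features(src))
```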
High-dimensional data analysis for exploration and discovery includes three fundamental tasks: dimensionality reduction, clustering, and visualization. When the three associated tasks are done separately, as is often the case thus far, inconsistencies …
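For reference, a minimal sketch of the conventional separate pipeline this abstract contrasts against, using scikit-learn with illustrative parameter choices (the random data, component counts, and cluster count are assumptions, not taken from the paper):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))  # stand-in for a high-dimensional dataset

# The three tasks run independently, each unaware of the others' objectives.
X_reduced = PCA(n_components=10).fit_transform(X)                      # dimensionality reduction
labels = KMeans(n_clusters=5, n_init=10).fit_predict(X_reduced)        # clustering
X_2d = TSNE(n_components=2, perplexity=30.0).fit_transform(X_reduced)  # visualization

# Inconsistencies can follow: clusters found in the 10-D space need not
# appear as separated groups in the 2-D embedding used for plotting.
```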
Reinforcement learning (RL) is a powerful framework for learning to take actions to solve tasks. However, in many settings, an agent must winnow down the inconceivably large space of all possible tasks to the single task that it is currently being asked …
This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning. First, we show through a comprehensive empirical study that multi-stage architectures with sparse self-attentions …
Recent work learns contextual representations of source code by reconstructing tokens from their context. For downstream semantic understanding tasks like summarizing code in English, these representations should ideally capture program functionality. …
Neural population activity is theorized to reflect an underlying dynamical structure. This structure can be accurately captured using state space models with explicit dynamics, such as those based on recurrent neural networks (RNNs). However, …
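To make "state space models with explicit dynamics" based on RNNs concrete, a minimal generative sketch in PyTorch (the GRU cell, latent size, and Poisson spike-count observations are illustrative assumptions, not the architecture discussed in this abstract):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNStateSpace(nn.Module):
    """Latent state evolves through an RNN; observed spike counts are
    Poisson with rates read out from the latent state."""
    def __init__(self, latent_dim: int = 16, n_neurons: int = 100):
        super().__init__()
        # Autonomous dynamics: the cell is driven by a constant dummy input.
        self.cell = nn.GRUCell(input_size=1, hidden_size=latent_dim)
        self.readout = nn.Linear(latent_dim, n_neurons)

    def forward(self, T: int, batch: int = 1):
        h = torch.zeros(batch, self.cell.hidden_size)
        dummy = torch.zeros(batch, 1)
        rates = []
        for _ in range(T):
            h = self.cell(dummy, h)                  # explicit latent dynamics
            rates.append(F.softplus(self.readout(h)))
        rates = torch.stack(rates, dim=1)            # (batch, T, n_neurons)
        spikes = torch.poisson(rates)                # simulated population activity
        return spikes, rates

model = RNNStateSpace()
spikes, rates = model(T=50)
```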