Beyond the Tip of the Iceberg: Assessing Coherence of Text Classifiers

Added by Shane Storks

Publication date: 2021

Fields: Informatics Engineering

Language: English

Authors: Shane Storks, Joyce Chai

Computation and Language

As large-scale, pre-trained language models achieve human-level and superhuman accuracy on existing language understanding tasks, statistical bias in benchmark data and probing studies have recently called into question their true capabilities. For a more informative evaluation than accuracy on text classification tasks can offer, we propose evaluating systems through a novel measure of prediction coherence. We apply our framework to two existing language understanding benchmarks with different properties to demonstrate its versatility. Our experimental results show that this evaluation framework, although simple in ideas and implementation, is a quick, effective, and versatile measure to provide insight into the coherence of machines' predictions.

Star formation in the early universe: beyond the tip of the iceberg

427 - N. R. Tanvir, A. J. Levan, A. S. Fruchter 2012

We present late-time Hubble Space Telescope imaging of the fields of six Swift GRBs lying at 5.0<z<9.5. Our data includes very deep observations of the field of the most distant spectroscopically confirmed burst, GRB 090423, at z=8.2. Using the precise positions afforded by their afterglows we can place stringent limits on the luminosities of their host galaxies. In one case, that of GRB 060522 at z=5.11, there is a marginal excess of flux close to the GRB position which may be a detection of a host at a magnitude J(AB)=28.5. None of the others are significantly detected, meaning that all the hosts lie below Lstar at their respective redshifts, with star formation rates SFR<4Mo/yr in all cases. Indeed, stacking the five fields with WFC3-IR data we conclude a mean SFR<0.17Mo/yr per galaxy. These results support the proposition that the bulk of star formation, and hence integrated UV luminosity, at high redshifts arises in galaxies below the detection limits of deep-field observations. Making the reasonable assumption that GRB rate is proportional to UV luminosity at early times allows us to compare our limits with expectations based on galaxy luminosity functions derived from the Hubble Ultra-Deep Field (HUDF) and other deep fields. We infer that a luminosity function which is evolving rapidly towards steeper faint-end slope (alpha) and decreasing characteristic luminosity (Lstar), as suggested by some other studies, is consistent with our observations, whereas a non-evolving LF shape is ruled out at >90% confidence. Although it is not yet possible to make stronger statements, in the future, with larger samples and a fuller understanding of the conditions required for GRB production, studies like this hold great potential for probing the nature of star formation, the shape of the galaxy luminosity function, and the supply of ionizing photons in the early universe.

Cosmology and Nongalactic Astrophysics

Dust Reddened Quasars in FIRST and UKIDSS: Beyond the Tip of the Iceberg

505 - Eilat Glikman, Tanya Urrutia, Mark Lacy 2013

We present the results of a pilot survey to find dust-reddened quasars by matching the FIRST radio catalog to the UKIDSS near-infrared survey, and using optical data from SDSS to select objects with very red colors. The deep K-band limit provided by UKIDSS allows for finding more heavily-reddened quasars at higher redshifts as compared with previous work using FIRST and 2MASS. We selected 87 candidates with K<=17.0 from the UKIDSS Large Area Survey (LAS) First Data Release (DR1) which covers 190 deg². These candidates reach up to ~1.5 magnitudes below the 2MASS limit and obey the color criteria developed to identify dust-reddened quasars. We have obtained 61 spectroscopic observations in the optical and/or near-infrared as well as classifications in the literature and have identified 14 reddened quasars with E(B-V)>0.1, including three at z>2. We study the infrared properties of the sample using photometry from the WISE Observatory and find that infrared colors improve the efficiency of red quasar selection, removing many contaminants in an infrared-to-optical color-selected sample alone. The highest-redshift quasars (z > 2) are only moderately reddened, with E(B-V) ~ 0.2-0.3. We find that the surface density of red quasars rises sharply with faintness, comprising up to 17% of blue quasars at the same apparent K-band flux limit. We estimate that to reach more heavily reddened quasars (i.e., E(B-V) > 0.5) at z>2 and a depth of K=17 we would need to survey at least ~2.5 times more area.

Cosmology and Nongalactic Astrophysics

On the Lack of Robust Interpretability of Neural Text Classifiers

117 - Muhammad Bilal Zafar, Michele Donini, Dylan Slack 2021

With the ever-increasing complexity of neural language models, practitioners have turned to methods for understanding the predictions of these models. One of the most well-adopted approaches for model interpretability is feature-based interpretability, i.e., ranking the features in terms of their impact on model predictions. Several prior studies have focused on assessing the fidelity of feature-based interpretability methods, i.e., measuring the impact of dropping the top-ranked features on the model output. However, relatively little work has been conducted on quantifying the robustness of interpretations. In this work, we assess the robustness of interpretations of neural text classifiers, specifically, those based on pretrained Transformer encoders, using two randomization tests. The first compares the interpretations of two models that are identical except for their initializations. The second measures whether the interpretations differ between a model with trained parameters and a model with random parameters. Both tests show surprising deviations from expected behavior, raising questions about the extent of insights that practitioners may draw from interpretations.

Computation and Language Machine Learning

Learning Variational Word Masks to Improve the Interpretability of Neural Text Classifiers

112 - Hanjie Chen, Yangfeng Ji 2020

To build an interpretable neural text classifier, most of the prior work has focused on designing inherently interpretable models or finding faithful explanations. A new line of work on improving model interpretability has just started, and many existing methods require either prior information or human annotations as additional inputs in training. To address this limitation, we propose the variational word mask (VMASK) method to automatically learn task-specific important words and reduce irrelevant information on classification, which ultimately improves the interpretability of model predictions. The proposed method is evaluated with three neural text classifiers (CNN, LSTM, and BERT) on seven benchmark text classification datasets. Experiments show the effectiveness of VMASK in improving both model prediction accuracy and interpretability.

Computation and Language Machine Learning

Attacking Text Classifiers via Sentence Rewriting Sampler

105 - Lei Xu, Kalyan Veeramachaneni 2021

Most adversarial attack methods on text classification can change the classifier's prediction by synonym substitution. We propose the adversarial sentence rewriting sampler (ASRS), which rewrites the whole sentence to generate more similar and higher-quality adversarial examples. Our method achieves a better attack success rate on 4 out of 7 datasets, as well as significantly better sentence quality on all 7 datasets. ASRS is an indispensable supplement to the existing attack methods, because classifiers cannot resist the attack from ASRS unless they are trained on adversarial examples found by ASRS.

Computation and Language