
Beyond the Tip of the Iceberg: Assessing Coherence of Text Classifiers

 Added by Shane Storks
 Publication date 2021
 Research language: English





As large-scale, pre-trained language models achieve human-level and superhuman accuracy on existing language understanding tasks, statistical bias in benchmark data and probing studies have recently called into question their true capabilities. For a more informative evaluation than accuracy on text classification tasks can offer, we propose evaluating systems through a novel measure of prediction coherence. We apply our framework to two existing language understanding benchmarks with different properties to demonstrate its versatility. Our experimental results show that this evaluation framework, although simple in concept and implementation, is a quick, effective, and versatile measure that provides insight into the coherence of machine predictions.




We present late-time Hubble Space Telescope imaging of the fields of six Swift GRBs lying at 5.0<z<9.5. Our data include very deep observations of the field of the most distant spectroscopically confirmed burst, GRB 090423, at z=8.2. Using the precise positions afforded by their afterglows, we can place stringent limits on the luminosities of their host galaxies. In one case, that of GRB 060522 at z=5.11, there is a marginal excess of flux close to the GRB position which may be a detection of a host at a magnitude J(AB)=28.5. None of the others are significantly detected, meaning that all the hosts lie below Lstar at their respective redshifts, with star formation rates SFR<4 M⊙/yr in all cases. Indeed, stacking the five fields with WFC3-IR data, we conclude a mean SFR<0.17 M⊙/yr per galaxy. These results support the proposition that the bulk of star formation, and hence integrated UV luminosity, at high redshifts arises in galaxies below the detection limits of deep-field observations. Making the reasonable assumption that GRB rate is proportional to UV luminosity at early times allows us to compare our limits with expectations based on galaxy luminosity functions derived from the Hubble Ultra-Deep Field (HUDF) and other deep fields. We infer that a luminosity function which is evolving rapidly towards steeper faint-end slope (alpha) and decreasing characteristic luminosity (Lstar), as suggested by some other studies, is consistent with our observations, whereas a non-evolving LF shape is ruled out at >90% confidence. Although it is not yet possible to make stronger statements, in the future, with larger samples and a fuller understanding of the conditions required for GRB production, studies like this hold great potential for probing the nature of star formation, the shape of the galaxy luminosity function, and the supply of ionizing photons in the early universe.
We present the results of a pilot survey to find dust-reddened quasars by matching the FIRST radio catalog to the UKIDSS near-infrared survey, and using optical data from SDSS to select objects with very red colors. The deep K-band limit provided by UKIDSS allows for finding more heavily reddened quasars at higher redshifts as compared with previous work using FIRST and 2MASS. We selected 87 candidates with K<=17.0 from the UKIDSS Large Area Survey (LAS) First Data Release (DR1), which covers 190 deg^2. These candidates reach up to ~1.5 magnitudes below the 2MASS limit and obey the color criteria developed to identify dust-reddened quasars. We have obtained 61 spectroscopic observations in the optical and/or near-infrared as well as classifications in the literature and have identified 14 reddened quasars with E(B-V)>0.1, including three at z>2. We study the infrared properties of the sample using photometry from the WISE Observatory and find that infrared colors improve the efficiency of red quasar selection, removing many contaminants in an infrared-to-optical color-selected sample alone. The highest-redshift quasars (z > 2) are only moderately reddened, with E(B-V) ~ 0.2-0.3. We find that the surface density of red quasars rises sharply with faintness, comprising up to 17% of blue quasars at the same apparent K-band flux limit. We estimate that to reach more heavily reddened quasars (i.e., E(B-V) > 0.5) at z>2 and a depth of K=17 we would need to survey at least ~2.5 times more area.
With the ever-increasing complexity of neural language models, practitioners have turned to methods for understanding the predictions of these models. One of the most widely adopted approaches to model interpretability is feature-based interpretability, i.e., ranking the features in terms of their impact on model predictions. Several prior studies have focused on assessing the fidelity of feature-based interpretability methods, i.e., measuring the impact of dropping the top-ranked features on the model output. However, relatively little work has been conducted on quantifying the robustness of interpretations. In this work, we assess the robustness of interpretations of neural text classifiers, specifically, those based on pretrained Transformer encoders, using two randomization tests. The first compares the interpretations of two models that are identical except for their initializations. The second measures whether the interpretations differ between a model with trained parameters and a model with random parameters. Both tests show surprising deviations from expected behavior, raising questions about the extent of insights that practitioners may draw from interpretations.
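As a rough illustration of the first randomization test described above, one way to compare the interpretations of two models is to correlate their feature rankings for the same input. The sketch below is not the authors' procedure: the choice of Spearman rank correlation, the function names `average_ranks` and `spearman`, and the example attribution scores are all assumptions made for illustration. It expects the per-token attribution scores to have been computed already by some interpretability method.

```python
from typing import List, Sequence


def average_ranks(scores: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied scores share the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation between two attribution vectors (hypothetical helper)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length vectors with at least 2 entries")
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (var_a * var_b) ** 0.5


# Made-up attribution scores for the same four tokens from two models
# that differ only in their random initialization (illustrative values).
model_a_scores = [0.70, 0.10, 0.15, 0.05]
model_b_scores = [0.60, 0.05, 0.25, 0.10]
agreement = spearman(model_a_scores, model_b_scores)
```

Under this assumed metric, a value near 1 would suggest the two models rank features similarly, while a value near 0 would suggest the interpretation depends heavily on initialization.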
Hanjie Chen, Yangfeng Ji 2020
To build an interpretable neural text classifier, most of the prior work has focused on designing inherently interpretable models or finding faithful explanations. A new line of work on improving model interpretability has just started, and many existing methods require either prior information or human annotations as additional inputs in training. To address this limitation, we propose the variational word mask (VMASK) method to automatically learn task-specific important words and reduce irrelevant information on classification, which ultimately improves the interpretability of model predictions. The proposed method is evaluated with three neural text classifiers (CNN, LSTM, and BERT) on seven benchmark text classification datasets. Experiments show the effectiveness of VMASK in improving both model prediction accuracy and interpretability.
Most adversarial attack methods on text classification can change the classifier's prediction by synonym substitution. We propose the adversarial sentence rewriting sampler (ASRS), which rewrites the whole sentence to generate more similar and higher-quality adversarial examples. Our method achieves a better attack success rate on 4 out of 7 datasets, as well as significantly better sentence quality on all 7 datasets. ASRS is an indispensable supplement to the existing attack methods, because classifiers cannot resist the attack from ASRS unless they are trained on adversarial examples found by ASRS.