No Classification without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World


Added by Shreya Shankar

Publication date 2017

Fields: Mathematical Statistics

Language: English

Authors Shreya Shankar - Yoni Halpern - Eric Breck

Machine Learning


Modern machine learning systems such as image classifiers rely heavily on large-scale data sets for training. Such data sets are costly to create, so in practice a small number of freely available, open source data sets are widely used. We suggest that examining the geo-diversity of open data sets is critical before adopting a data set for use cases in the developing world. We analyze two large, publicly available image data sets to assess geo-diversity and find that these data sets appear to exhibit an observable amerocentric and eurocentric representation bias. Further, we analyze classifiers trained on these data sets to assess the impact of these training distributions and find strong differences in the relative performance on images from different locales. These results emphasize the need to ensure geo-representation when constructing data sets for use in the developing world.
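The two kinds of analysis mentioned in the abstract (how images are distributed across countries, and how classifier accuracy differs by locale) can be illustrated with a minimal sketch. This is not the authors' code: the function names `country_shares` and `accuracy_by_locale`, the record layout, and the toy inputs are all assumptions made up for illustration, and the numbers carry no empirical meaning.

```python
from collections import Counter, defaultdict


def country_shares(countries):
    """Return each country's fraction of the data set, most frequent first."""
    counts = Counter(countries)
    total = sum(counts.values())
    return {country: n / total for country, n in counts.most_common()}


def accuracy_by_locale(records):
    """Compute accuracy per locale.

    records: iterable of (locale, true_label, predicted_label) tuples.
    """
    hits = defaultdict(int)
    seen = defaultdict(int)
    for locale, truth, predicted in records:
        seen[locale] += 1
        hits[locale] += int(truth == predicted)
    return {locale: hits[locale] / seen[locale] for locale in seen}


# Toy, made-up inputs (hypothetical; not taken from the paper).
toy_countries = ["US", "US", "US", "GB", "IN"]
toy_records = [
    ("US", "groom", "groom"),
    ("US", "groom", "groom"),
    ("IN", "groom", "person"),
    ("IN", "groom", "groom"),
]

shares = country_shares(toy_countries)
per_locale = accuracy_by_locale(toy_records)
print(shares)
print(per_locale)
```

On a real data set, the country list would have to come from whatever location metadata is available for each image, and a gap between locales in the second dictionary would be the kind of relative performance difference the abstract describes.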


Assessing Image Quality Issues for Real-World Problems

Tai-Yin Chiu, Yinan Zhao, Danna Gurari (2020)

We introduce a new large-scale dataset that links the assessment of image quality issues to two practical vision tasks: image captioning and visual question answering. First, we identify for 39,181 images taken by people who are blind whether each is of sufficient quality to recognize the content as well as what quality flaws are observed from six options. These labels serve as a critical foundation for us to make the following contributions: (1) a new problem and algorithms for deciding whether an image is of insufficient quality to recognize the content and so not captionable, (2) a new problem and algorithms for deciding which of six quality flaws an image contains, (3) a new problem and algorithms for deciding whether a visual question is unanswerable due to unrecognizable content versus the content of interest being missing from the field of view, and (4) a novel application of more efficiently creating a large-scale image captioning dataset by automatically deciding whether an image is of insufficient quality and so should not be captioned. We publicly share our datasets and code to facilitate future extensions of this work: https://vizwiz.org.

Computer Vision and Pattern Recognition

Neurology-as-a-Service for the Developing World

Tejas Dharamsi, Payel Das, Tejaswini Pedapati (2017)

Electroencephalography (EEG) is an extensively used and well-studied technique in the field of medical diagnostics and treatment for brain disorders, including epilepsy, migraines, and tumors. The analysis and interpretation of EEGs require physicians to have specialized training, which is not common even among most doctors in the developed world, let alone the developing world where physician shortages plague society. This problem can be addressed by teleEEG that uses remote EEG analysis by experts or by local computer processing of EEGs. However, both of these options are prohibitively expensive and the second option requires abundant computing resources and infrastructure, which is another concern in developing countries where there are resource constraints on capital and computing infrastructure. In this work, we present a cloud-based deep neural network approach to provide decision support for non-specialist physicians in EEG analysis and interpretation. Named 'neurology-as-a-service', the approach requires almost no manual intervention in feature engineering and in the selection of an optimal architecture and hyperparameters of the neural network. In this study, we deploy a pipeline that includes moving EEG data to the cloud and getting optimal models for various classification tasks. Our initial prototype has been tested only in developed world environments to date, but our intention is to test it in developing world environments in future work. We demonstrate the performance of our proposed approach using the BCI2000 EEG MMI dataset, on which our service attains 63.4% accuracy for the task of classifying real vs. imaginary activity performed by the subject, which is significantly higher than what is obtained with a shallow approach such as support vector machines.

Machine Learning

Proceedings of NIPS 2017 Workshop on Machine Learning for the Developing World

Maria De-Arteaga, William Herlands (2017)

This is the Proceedings of NIPS 2017 Workshop on Machine Learning for the Developing World, held in Long Beach, California, USA on December 8, 2017.

Machine Learning

Analysis Issues for Large CMB Data Sets

K.M. Gorski, E. Hivon, B.D. Wandelt (1998)

Multi-frequency, high resolution, full sky measurements of the anisotropy in both temperature and polarisation of the cosmic microwave background radiation are the goals of the satellite missions MAP (NASA) and Planck (ESA). The ultimate data products of these missions - multiple microwave sky maps, each of which will have to comprise more than 10^6 pixels in order to render the angular resolution of the instruments - will present serious challenges to those involved in the analysis and scientific exploitation of the results of both surveys. Some considerations of the relevant aspects of the mathematical structure of future CMB data sets are presented in this contribution. For better on-screen rendition of the figures, see http://www.tac.dk/~healpix or http://www.mpa-garching.mpg.de/~cosmo/contributions.html

$(1 + \varepsilon)$-class Classification: an Anomaly Detection Method for Highly Imbalanced or Incomplete Data Sets

Maxim Borisyak, Artem Ryzhikov, Andrey Ustyuzhanin (2019)

Anomaly detection is not an easy problem since the distribution of anomalous samples is unknown a priori. We explore a novel method that offers a trade-off between one-class and two-class approaches, and leads to better performance on anomaly detection problems with small or non-representative anomalous samples. The method is evaluated using several data sets and compared to a set of conventional one-class and two-class approaches.

Machine Learning