Annotation is the labeling of data by human effort. Annotation is critical to modern machine learning, and Bloomberg has developed years of experience of annotation at scale. This report captures a wealth of wisdom for applied annotation projects, collected from more than 30 experienced annotation project managers in Bloombergs Global Data department.
We present an overview of best practices for publishing data in astronomy and astrophysics journals. These recommendations are intended as a reference for authors to help prepare and publish data in a way that will better represent and support science results, enable better data sharing, improve reproducibility, and enhance the reusability of data. Observance of these guidelines will also help to streamline the extraction, preservation, integration and cross-linking of valuable data from astrophysics literature into major astronomical databases, and consequently facilitate new modes of science discovery that will better exploit the vast quantities of panchromatic and multi-dimensional data associated with the literature. We encourage authors, journal editors, referees, and publishers to implement the best practices reviewed here, as well as related recommendations from international astronomical organizations such as the International Astronomical Union (IAU) and International Virtual Observatory Alliance (IVOA) for publication of nomenclature, data, and metadata. A convenient Checklist of Recommendations for Publishing Data in Literature is included for authors to consult before the submission of the final version of their journal articles and associated data files. We recommend that publishers of journals in astronomy and astrophysics incorporate a link to this document in their Instructions to Authors.
We introduce Seshat, a new, simple and open-source software to efficiently manage annotations of speech corpora. The Seshat software allows users to easily customise and manage annotations of large audio corpora while ensuring compliance with the formatting and naming conventions of the annotated output files. In addition, it includes procedures for checking the content of annotations following specific rules that can be implemented in personalised parsers. Finally, we propose a double-annotation mode, for which Seshat computes automatically an associated inter-annotator agreement with the $gamma$ measure taking into account the categorisation and segmentation discrepancies.
Alchemical free energy calculations are a useful tool for predicting free energy differences associated with the transfer of molecules from one environment to another. The hallmark of these methods is the use of bridging potential energy functions representing emph{alchemical} intermediate states that cannot exist as real chemical species. The data collected from these bridging alchemical thermodynamic states allows the efficient computation of transfer free energies (or differences in transfer free energies) with orders of magnitude less simulation time than simulating the transfer process directly. While these methods are highly flexible, care must be taken in avoiding common pitfalls to ensure that computed free energy differences can be robust and reproducible for the chosen force field, and that appropriate corrections are included to permit direct comparison with experimental data. In this paper, we review current best practices for several popular application domains of alchemical free energy calculations, including relative and absolute small molecule binding free energy calculations to biomolecular targets.
The causes behind complications in laser-assisted tattoo removal are currently not well understood, and in the literature relating to tattoo removal the emphasis on removal treatment is on removal technologies and tools, not best parameters involved in the treatment process. Additionally, the very challenge of determining best practices is difficult given the complexity of interactions between factors that may correlate to these complications. In this paper we apply a battery of classical statistical methods and techniques to identify features that may be closely correlated to causes of complication during the tattoo removal process, and report quantitative evidence for potential best practices. We develop elementary statistical descriptions of tattoo data collected by the largest gang rehabilitation and reentry organization in the world, Homeboy Industries; perform parametric and nonparametric tests of significance; and finally, produce a statistical model explaining treatment parameter interactions, as well as develop a ranking system for treatment parameters utilizing bootstrapping and gradient boosting.
Much of biomedical and healthcare data is encoded in discrete, symbolic form such as text and medical codes. There is a wealth of expert-curated biomedical domain knowledge stored in knowledge bases and ontologies, but the lack of reliable methods for learning knowledge representation has limited their usefulness in machine learning applications. While text-based representation learning has significantly improved in recent years through advances in natural language processing, attempts to learn biomedical concept embeddings so far have been lacking. A recent family of models called knowledge graph embeddings have shown promising results on general domain knowledge graphs, and we explore their capabilities in the biomedical domain. We train several state-of-the-art knowledge graph embedding models on the SNOMED-CT knowledge graph, provide a benchmark with comparison to existing methods and in-depth discussion on best practices, and make a case for the importance of leveraging the multi-relational nature of knowledge graphs for learning biomedical knowledge representation. The embeddings, code, and materials will be made available to the communitY.