This paper describes an open data set of 3,053 energy meters from 1,636 non-residential buildings, spanning two full years (2016 and 2017) at an hourly frequency (17,544 measurements per meter, for approximately 53.6 million measurements in total). The meters were collected from 19 sites across North America and Europe, with one or more meters per building measuring whole-building electricity, heating and cooling water, steam, and solar energy, as well as water and irrigation. A subset of these data was used in the Great Energy Predictor III (GEPIII) competition hosted by ASHRAE in October-December 2019. GEPIII was a machine learning competition for long-term prediction with an application to measurement and verification. This paper describes the process of data collection, cleaning, and convergence of the time-series meter data, the metadata about the buildings, and complementary weather data. The data set can be used for further prediction benchmarking and prototyping as well as anomaly detection, energy analysis, and building type classification.
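To make the intended use of such a release concrete, the following is a minimal sketch of how hourly meter readings of this kind could be loaded and joined with building metadata for benchmarking or anomaly detection. The file and column names (meter_readings.csv, building_metadata.csv, meter_id, meter_type, and so on) are placeholders for illustration, not the data set's published schema.

```python
# Minimal sketch with hypothetical file and column names: load hourly readings for
# 2016-2017, pivot to one column per meter, and join building metadata.
import pandas as pd

readings = pd.read_csv("meter_readings.csv", parse_dates=["timestamp"])  # timestamp, meter_id, value
metadata = pd.read_csv("building_metadata.csv")                          # meter_id, building_id, site_id, meter_type, ...

# One row per hour (17,544 for 2016-2017), one column per meter.
wide = readings.pivot_table(index="timestamp", columns="meter_id", values="value")

# Example: an hourly total across all electricity meters.
elec_ids = metadata.loc[metadata["meter_type"] == "electricity", "meter_id"]
total_electricity = wide[elec_ids.tolist()].sum(axis=1)
print(wide.shape, total_electricity.head())
```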
Data competitions rely on real-time leaderboards to rank competitor entries and stimulate algorithm improvement. While such competitions have become popular and prevalent, particularly in supervised learning formats, their implementations by the host are highly variable. Without careful planning, a supervised learning competition is vulnerable to overfitting, where the winning solutions are so closely tuned to the particular set of provided data that they cannot generalize to the underlying problem of interest to the host. Based on our experience, this paper outlines important considerations for strategically designing relevant and informative data sets that maximize the learning outcome of hosting a competition. It also describes a post-competition analysis that enables robust and efficient assessment of the strengths and weaknesses of solutions from different competitors, as well as greater understanding of which regions of the input space are well solved. The post-competition analysis, which complements the leaderboard, uses exploratory data analysis and generalized linear models (GLMs). The GLMs not only expand the range of results we can explore but also provide more detailed analyses of individual sub-questions, including similarities and differences between algorithms across different types of scenarios, universally easy or hard regions of the input space, and different learning objectives. When coupled with a strategically planned data generation approach, these methods provide richer and more informative summaries that enhance the interpretation of results beyond the rankings on the leaderboard. The methods are illustrated with a recently completed competition to evaluate algorithms capable of detecting, identifying, and locating radioactive materials in an urban environment.
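As an illustration of a GLM-based post-competition analysis, the sketch below fits a binomial GLM relating per-run detection success to the competing algorithm and to scenario factors. The data file and column names (competition_runs.csv, detected, source_type, shielding, standoff) are hypothetical stand-ins for whatever scenario descriptors a host records; the point is only to show how factor-level effects complement a leaderboard ranking.

```python
# Illustrative sketch (hypothetical column names): model per-run detection success
# as a function of algorithm and scenario factors with a binomial GLM.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

runs = pd.read_csv("competition_runs.csv")  # detected (0/1), algorithm, source_type, shielding, standoff

model = smf.glm(
    "detected ~ C(algorithm) + C(source_type) + C(shielding) + standoff",
    data=runs,
    family=sm.families.Binomial(),
).fit()
print(model.summary())  # which scenarios are universally easy or hard, and for which algorithms
```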
We present the second public data release of the Dark Energy Survey, DES DR2, based on optical/near-infrared imaging by the Dark Energy Camera mounted on the 4-m Blanco telescope at Cerro Tololo Inter-American Observatory in Chile. DES DR2 consists of reduced single-epoch and coadded images, a source catalog derived from the coadded images, and associated data products assembled from six years of DES science operations. This release includes data from the DES wide-area survey covering ~5000 deg² of the southern Galactic cap in five broad photometric bands, grizY. DES DR2 has a median delivered point-spread function full width at half maximum of g = 1.11, r = 0.95, i = 0.88, z = 0.83, and Y = 0.90 arcsec; photometric uniformity with a standard deviation of <3 mmag with respect to the Gaia DR2 G band; a photometric accuracy of ~10 mmag; and a median internal astrometric precision of ~27 mas. The median coadded catalog depth for a 1.95 arcsec diameter aperture at S/N = 10 is g = 24.7, r = 24.4, i = 23.8, z = 23.1, and Y = 21.7 mag. DES DR2 includes ~691 million distinct astronomical objects detected in 10,169 coadded image tiles of size 0.534 deg² produced from 76,217 single-epoch images. After a basic quality selection, benchmark galaxy and stellar samples contain 543 million and 145 million objects, respectively. These data are accessible through several interfaces, including interactive image visualization tools, web-based query clients, image cutout servers, and Jupyter notebooks. DES DR2 constitutes the largest photometric data set to date at the achieved depth and photometric precision.
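As a hedged illustration of how benchmark samples like those above might be reproduced from a catalog export, the sketch below applies a simple quality cut and a star/galaxy split. The column names (flags, mag_i, extended_class) and thresholds are placeholders for illustration, not the official DR2 schema or selection.

```python
# Hedged sketch (placeholder column names, not the official DR2 schema): apply a
# basic quality cut and split a downloaded catalog tile into galaxy and stellar samples.
import pandas as pd

cat = pd.read_csv("des_dr2_tile.csv")  # hypothetical export from a query client

good = cat[(cat["flags"] == 0) & (cat["mag_i"] < 23.8)]  # brighter than the quoted i-band S/N=10 depth
galaxies = good[good["extended_class"] >= 2]             # extended sources
stars = good[good["extended_class"] <= 1]                # point-like sources
print(len(galaxies), len(stars))
```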
The recent advent of smart meters has led to large micro-level datasets. For the first time, the electricity consumption at individual sites is available on a near real-time basis. Efficient management of energy resources, electric utilities, and transmission grids can be greatly facilitated by harnessing the potential of these data. The aim of this study is to generate probability density estimates for the consumption recorded by individual smart meters. Such estimates can assist decision making by helping consumers identify and minimize their excess electricity usage, especially during peak times. For suppliers, these estimates can be used to devise innovative time-of-use pricing strategies aimed at their target consumers. We consider methods based on conditional kernel density (CKD) estimation with the incorporation of a decay parameter. The methods capture the seasonality in consumption and enable a nonparametric estimation of its conditional density. Using eight months of half-hourly data for one thousand meters, we evaluate point and density forecasts for lead times ranging from one half-hour up to a week ahead. We find that the kernel-based methods outperform a simple benchmark method that does not account for seasonality, and compare well with an exponential smoothing method that we use as a sophisticated benchmark. To gauge the financial impact, we use density estimates of consumption to derive prediction intervals of electricity cost for different time-of-use tariffs. We show that a simple strategy of switching between tariffs, based on a comparison of cost densities, delivers significant cost savings for the great majority of consumers.
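The sketch below illustrates the core idea of conditional kernel density estimation with a decay parameter, not the authors' implementation: consumption observations from the same half-hourly period of the week are pooled, down-weighted exponentially in age, and combined into a weighted Gaussian kernel density. The function name, bandwidth, and decay value are illustrative choices.

```python
# Minimal sketch of CKD with exponential decay weighting (synthetic data, illustrative parameters).
import numpy as np

def conditional_density(y_hist, periods, target_period, grid, bandwidth=0.05, decay=0.99):
    """Weighted Gaussian KDE of half-hourly consumption for one period of the week.

    y_hist  : past consumption values, oldest first
    periods : period-of-week index (0..335) for each observation in y_hist
    """
    mask = periods == target_period
    y = y_hist[mask]
    positions = np.arange(len(y_hist))[mask]
    age = len(y_hist) - 1 - positions                  # 0 for the most recent observation
    w = decay ** age
    w = w / w.sum()
    # Weighted sum of Gaussian kernels centred on past observations, evaluated on the grid.
    k = np.exp(-0.5 * ((grid[:, None] - y[None, :]) / bandwidth) ** 2)
    return (k * w[None, :]).sum(axis=1) / (bandwidth * np.sqrt(2 * np.pi))

# Example: density estimate of consumption (kWh per half hour) for one chosen period of the week.
rng = np.random.default_rng(1)
periods = np.tile(np.arange(336), 8 * 4)               # roughly eight months of half-hourly slots
y_hist = rng.gamma(2.0, 0.15, size=periods.size)       # synthetic consumption series
grid = np.linspace(0, 2, 200)
density = conditional_density(y_hist, periods, target_period=36, grid=grid)
```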
Electrochemical energy storage is central to modern society, from consumer electronics to electrified transportation and the power grid. It is no longer just a convenience but a critical enabler of the transition to a resilient, low-carbon economy. The large, pluralistic battery research and development community serving these needs has evolved into diverse specialties spanning materials discovery, battery chemistry, design innovation, scale-up, manufacturing, and deployment. Despite the maturity and impact of battery science and technology, the data and software practices among these disparate groups are far behind the state of the art in other fields (e.g., drug discovery), which have enjoyed significant increases in the rate of innovation. The consequences, incremental performance gains and lost research productivity, retard innovation and societal progress. Examples span every field of battery research, from the slow and iterative nature of materials discovery, to the repeated and time-consuming performance testing of cells, to the mitigation of degradation and failures. The fundamental issue is that modern data science methods require large amounts of data, and the battery community lacks the scalable, standardized data hubs required for immediate use of these approaches. The lack of uniform data practices is a central barrier to solving this scale problem. In this perspective, we identify the data- and software-sharing gaps and propose the unifying principles and tools needed to build a robust community of data hubs, which provide flexible sharing formats to address diverse needs. The Battery Data Genome is offered as a data-centric initiative that will enable the transformative acceleration of battery science and technology and will ultimately serve as a catalyst to revolutionize our approach to innovation.
The Roma people, living throughout Europe, are a diverse population linked by the Romani language and culture. Previous linguistic and genetic studies have suggested that the Roma migrated into Europe from South Asia about 1000-1500 years ago. Genetic inferences about Roma history have mostly focused on the Y chromosome and mitochondrial DNA. To explore what additional information can be learned from genome-wide data, we analyzed data from six Roma groups that we genotyped at hundreds of thousands of single nucleotide polymorphisms (SNPs). We estimate that the Roma harbor about 80% West Eurasian ancestry, deriving from a combination of European and South Asian sources, and that the admixture of South Asian and European ancestry occurred about 850 years ago. We provide evidence that Eastern Europe was a major source of the European ancestry and that northwest India was a major source of the South Asian ancestry in the Roma. By computing allele sharing as a measure of linkage disequilibrium, we estimate that the migration of the Roma out of the Indian subcontinent was accompanied by a severe founder event, which we hypothesize was followed by a major demographic expansion once the population arrived in Europe.
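To make the dating logic concrete, the sketch below uses synthetic numbers (not the study's data) to show how an exponential decay of allele sharing with genetic distance yields the number of generations since admixture, which a generation time of roughly 29 years converts into a date of the order quoted above.

```python
# Illustrative sketch only: admixture LD (allele sharing) decays roughly as exp(-n*d)
# with genetic distance d in Morgans, where n is the number of generations since admixture.
import numpy as np
from scipy.optimize import curve_fit

def decay(d, amp, n, c):
    return amp * np.exp(-n * d) + c

d = np.linspace(0.001, 0.3, 60)                        # genetic distance in Morgans
true_n = 850 / 29                                      # ~29 generations, synthetic ground truth
sharing = decay(d, 0.02, true_n, 0.001) + np.random.default_rng(0).normal(0, 5e-4, d.size)

(amp, n_hat, c), _ = curve_fit(decay, d, sharing, p0=(0.01, 20, 0))
print(f"estimated {n_hat:.1f} generations ~ {n_hat * 29:.0f} years since admixture")
```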