
Real-Time Data Mining of Massive Data Streams from Synoptic Sky Surveys

Published by George Djorgovski
Publication date: 2016
Language: English





The nature of scientific and technological data collection is evolving rapidly: data volumes and rates grow exponentially, with increasing complexity and information content, and there has been a transition from static data sets to data streams that must be analyzed in real time. Interesting or anomalous phenomena must be quickly characterized and followed up with additional measurements via optimal deployment of limited assets. Modern astronomy presents a variety of such phenomena in the form of transient events in digital synoptic sky surveys, including cosmic explosions (supernovae, gamma-ray bursts), relativistic phenomena (black hole formation, jets), potentially hazardous asteroids, etc. We have been developing a set of machine learning tools to detect, classify, and plan a response to transient events for astronomy applications, using the Catalina Real-time Transient Survey (CRTS) as a scientific and methodological testbed. The ability to respond rapidly to the potentially most interesting events is a key bottleneck that limits the scientific returns from current and anticipated synoptic sky surveys. Similar challenges arise in other contexts, from environmental monitoring using sensor networks to autonomous spacecraft systems. Given the exponential growth of data rates and the time-critical nature of the response, we need a fully automated and robust approach. We describe the results obtained to date and possible future developments.




Read also

Yupeng Fu, Chinmay Soman 2021
Uber's business is highly real-time in nature. Petabytes of data are continuously collected every day from end users such as Uber drivers, riders, restaurants, and eaters. There is a lot of valuable information to process, and many decisions must be made in seconds for use cases such as customer incentives, fraud detection, and machine learning model prediction. In addition, there is an increasing need to expose this ability to different user categories, including engineers, data scientists, executives, and operations personnel, which adds to the complexity. In this paper, we present the overall architecture of the real-time data infrastructure and identify three scaling challenges that we need to continuously address for each component in the architecture. At Uber, we rely heavily on open-source technologies for the key areas of the infrastructure. On top of that open-source software, we add significant improvements and customizations to make the open-source solutions fit Uber's environment and to bridge the gaps in meeting Uber's unique scale and requirements. We then highlight several important use cases and show their real-time solutions and tradeoffs. Finally, we reflect on the lessons we learned as we built, operated, and scaled these systems.
Robust real-time monitoring of high-dimensional data streams has many important real-world applications, such as industrial quality control, signal detection, and biosurveillance. Unfortunately, it is highly non-trivial to develop efficient schemes because of two challenges: (1) the unknown, sparse number or subset of affected data streams and (2) the uncertainty of model specification for high-dimensional data. In this article, motivated by the detection of smaller persistent changes in the presence of larger transient outliers, we develop a family of efficient real-time robust detection schemes for high-dimensional data streams through monitoring feature spaces, such as PCA or wavelet coefficients, when the feature coefficients are from Tukey-Huber's gross error models with outliers. We propose to construct a new local detection statistic for each feature, called the $L_{\alpha}$-CUSUM statistic, that can reduce the effect of outliers by using the Box-Cox transformation of the likelihood function, and then raise a global alarm based upon the sum of the soft-thresholding transformation of these local $L_{\alpha}$-CUSUM statistics so as to filter out unaffected features. In addition, we propose a new concept called the false alarm breakdown point to measure the robustness of online monitoring schemes, and we characterize the breakdown point of our proposed schemes. Asymptotic analysis, extensive numerical simulations, and a case study of nonlinear profile monitoring are conducted to illustrate the robustness and usefulness of our proposed schemes.
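The two-level structure described in this abstract (a robust local statistic per feature, then a soft-thresholded sum for the global alarm) can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything specific here is an assumption made for illustration: the increment (f1(x)^alpha - f0(x)^alpha) / alpha is one plausible Box-Cox form, chosen because it reduces to the classical log-likelihood-ratio CUSUM increment as alpha tends to 0; the unit-variance Gaussian pre- and post-change models with means `mu0` and `mu1`, the threshold `b`, and the `alarm_level` are all hypothetical choices, not the authors' settings.

```python
import math

def normal_logpdf(x, mean):
    """Log-density of a unit-variance Gaussian (assumed model for this sketch)."""
    return -0.5 * (x - mean) ** 2 - 0.5 * math.log(2.0 * math.pi)

def box_cox_increment(x, alpha, mu0=0.0, mu1=1.0):
    """Assumed local increment: (f1(x)^alpha - f0(x)^alpha) / alpha.

    For alpha == 0 it falls back to the classical log-likelihood ratio,
    which is the limit of the expression above as alpha -> 0.
    """
    l0 = normal_logpdf(x, mu0)
    l1 = normal_logpdf(x, mu1)
    if alpha == 0.0:
        return l1 - l0
    # f^alpha computed as exp(alpha * log f); underflows harmlessly to 0.0
    return (math.exp(alpha * l1) - math.exp(alpha * l0)) / alpha

def soft_threshold(w, b):
    """Keep only the part of a local statistic that exceeds b."""
    return max(w - b, 0.0)

def monitor(stream, alpha=0.2, b=0.5, alarm_level=10.0):
    """Return the 1-based time of the first global alarm, or None.

    stream: iterable of rows, one value per monitored feature.
    """
    stats = None
    for t, row in enumerate(stream, start=1):
        if stats is None:
            stats = [0.0] * len(row)
        # Local CUSUM-style recursion, one statistic per feature
        stats = [max(w + box_cox_increment(x, alpha), 0.0)
                 for w, x in zip(stats, row)]
        # Global statistic: sum of soft-thresholded local statistics
        global_stat = sum(soft_threshold(w, b) for w in stats)
        if global_stat > alarm_level:
            return t
    return None

if __name__ == "__main__":
    # One huge transient outlier in feature 0, then in-control data
    outlier_stream = [[1000.0, 0.0, 0.0]] + [[0.0, 0.0, 0.0]] * 20
    print("classical (alpha=0):", monitor(outlier_stream, alpha=0.0))
    print("robust (alpha=0.2): ", monitor(outlier_stream, alpha=0.2))
```

Under these assumptions the increment is bounded for any alpha > 0, so a single extreme observation cannot drive a local statistic over the alarm level, whereas the alpha = 0 version reacts to it immediately. A persistent small shift in one feature still accumulates step by step until the soft-thresholded sum crosses `alarm_level`.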
A massive amount of the data generated today on platforms such as social networks, telecommunication networks, and the internet in general can be represented as graph streams. Activity in a network's underlying graph generates a sequence of edges in the form of a stream; for example, a social network may generate a graph stream based on the interactions (edges) between different users (nodes) over time. While many graph mining algorithms have already been developed for analyzing relatively small graphs, graphs that begin to approach the size of real-world networks stress the limitations of such methods due to their dynamic nature and the substantial number of nodes and connections involved. In this paper we present GraphZip, a scalable method for mining interesting patterns in graph streams. GraphZip is inspired by the Lempel-Ziv (LZ) class of compression algorithms, and uses a novel dictionary-based compression approach in conjunction with the minimum description length principle to discover maximally-compressing patterns in a graph stream. We experimentally show that GraphZip is able to retrieve complex and insightful patterns from large real-world graphs and artificially-generated graphs with ground truth patterns. Additionally, our results demonstrate that GraphZip is both highly efficient and highly effective compared to existing state-of-the-art methods for mining graph streams.
Joshua S. Bloom 2009
We are proposing to conduct a multicolor, synoptic infrared (IR) imaging survey of the Northern sky with a new, dedicated 6.5-meter telescope at San Pedro Martir (SPM) Observatory. This initiative is being developed in partnership with astronomy institutions in Mexico and the University of California. The 4-year, dedicated survey, planned to begin in 2017, will reach more than 100 times deeper than 2MASS. The Synoptic All-Sky Infrared (SASIR) Survey will reveal the missing sample of faint red dwarf stars in the local solar neighborhood, and the unprecedented sensitivity over such a wide field will result in the discovery of thousands of z ~ 7 quasars (and reaching to z > 10), allowing detailed study (in concert with JWST and Giant Segmented Mirror Telescopes) of the timing and the origin(s) of reionization. As a time-domain survey, SASIR will reveal the dynamic infrared universe, opening new phase space for discovery. Synoptic observations of over 10^6 supernovae and variable stars will provide better distance measures than optical studies alone. SASIR also provides significant synergy with other major Astro2010 facilities, improving the overall scientific return of community investments. Compared to optical-only measurements, IR colors vastly improve photometric redshifts to z ~ 4, enhancing dark energy and dark matter surveys based on weak lensing and baryon oscillations. The wide field and ToO capabilities will enable a connection of the gravitational wave and neutrino universe, with events otherwise poorly localized on the sky, to transient electromagnetic phenomena.