ترغب بنشر مسار تعليمي؟ اضغط هنا

The MADlib Analytics Library or MAD Skills, the SQL

67   0   0.0 ( 0 )
 نشر من قبل Joseph Hellerstein
 تاريخ النشر 2012
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

MADlib is a free, open source library of in-database analytic methods. It provides an evolving suite of SQL-based algorithms for machine learning, data mining and statistics that run at scale within a database engine, with no need for data import/export to other tools. The goal is for MADlib to eventually serve a role for scalable database systems that is similar to the CRAN library for R: a community repository of statistical methods, this time written with scale and parallelism in mind. In this paper we introduce the MADlib project, including the background that led to its beginnings, and the motivation for its open source nature. We provide an overview of the librarys architecture and design patterns, and provide a description of various statistical methods in that context. We include performance and speedup results of a core design pattern from one of those methods over the Greenplum parallel DBMS on a modest-sized test cluster. We then report on two initial efforts at incorporating academic research into MADlib, which is one of the projects goals. MADlib is freely available at http://madlib.net, and the project is open for contributions of both new methods, and ports to additional database platforms.

قيم البحث

اقرأ أيضاً

We propose the client-side AES256 encryption for a cloud SQL DB. A column ciphertext is deterministic or probabilistic. We trust the cloud DBMS for security of its run-time values, e.g., through a moving target defense. The client may send AES key(s) with the query. These serve the on-the-fly decryption of selected ciphertext into plaintext for query evaluation. The DBMS clears the key(s) and the plaintext at the query end at latest. It may deliver ciphertext to decryption enabled clients or plaintext otherwise, e.g., to browsers/navigators. The scheme functionally offers to a cloud DBMS capabilities of a plaintext SQL DBMS. AES processing overhead appears negligible for a modern CPU, e.g., a popular Intel I5. The determin-istic encryption may have no storage overhead. The probabilistic one doubles the DB storage. The scheme seems the first generally practical for an outsourced encrypted SQL DB. An implementation sufficient to practice with appears easy. An existing cloud SQL DBMS with UDF support should do.
In this project we are presenting a grammar which unify the design and development of spatial databases. In order to make it, we combine nominal and spatial information, the former is represented by the relational model and latter by a modification o f the same model. The modification lets to represent spatial data structures (as Quadtrees, Octrees, etc.) in a integrated way. This grammar is important because with it we can create tools to build systems that combine spatial-nominal characteristics such as Geographical Information Systems (GIS), Hypermedia Systems, Computed Aided Design Systems (CAD), and so on
The objective of this work was to utilize BigBench [1] as a Big Data benchmark and evaluate and compare two processing engines: MapReduce [2] and Spark [3]. MapReduce is the established engine for processing data on Hadoop. Spark is a popular alterna tive engine that promises faster processing times than the established MapReduce engine. BigBench was chosen for this comparison because it is the first end-to-end analytics Big Data benchmark and it is currently under public review as TPCx-BB [4]. One of our goals was to evaluate the benchmark by performing various scalability tests and validate that it is able to stress test the processing engines. First, we analyzed the steps necessary to execute the available MapReduce implementation of BigBench [1] on Spark. Then, all the 30 BigBench queries were executed on MapReduce/Hive with different scale factors in order to see how the performance changes with the increase of the data size. Next, the group of HiveQL queries were executed on Spark SQL and compared with their respective Hive runtimes. This report gives a detailed overview on how to setup an experimental Hadoop cluster and execute BigBench on both Hive and Spark SQL. It provides the absolute times for all experiments preformed for different scale factors as well as query results which can be used to validate correct benchmark execution. Additionally, multiple issues and workarounds were encountered and solved during our work. An evaluation of the resource utilization (CPU, memory, disk and network usage) of a subset of representative BigBench queries is presented to illustrate the behavior of the different query groups on both processing engines. Last but not least it is important to mention that larger parts of this report are taken from the master thesis of Max-Georg Beer, entitled Evaluation of BigBench on Apache Spark Compared to MapReduce [5].
In recent years, there has been a substantial amount of work on large-scale data analytics using Hadoop-based platforms running on large clusters of commodity machines. A less-explored topic is how those data, dominated by application logs, are colle cted and structured to begin with. In this paper, we present Twitters production logging infrastructure and its evolution from application-specific logging to a unified client events log format, where messages are captured in common, well-formatted, flexible Thrift messages. Since most analytics tasks consider the user session as the basic unit of analysis, we pre-materialize session sequences, which are compact summaries that can answer a large class of common queries quickly. The development of this infrastructure has streamlined log collection and data analysis, thereby improving our ability to rapidly experiment and iterate on various aspects of the service.
Geo-replication poses an inherent trade-off between low latency, high availability and strong consistency. While NoSQL databases favor low latency and high availability, relaxing consistency, more recent cloud databases favor strong consistency and e ase of programming, while still providing high scalability. In this paper, we present Antidote SQL, a database system that allows application developers to relax SQL consistency when possible. Unlike NoSQL databases, our approach enforces primary key, foreign key and check SQL constraints even under relaxed consistency, which is sufficient for guaranteeing the correctness of many applications. To this end, we defined concurrency semantics for SQL constraints under relaxed consistency and show how to implement such semantics efficiently. For applications that require strict SQL consistency, Antidote SQL provides support for such semantics at the cost of requiring coordination among replicas.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا