VHT: Vertical Hoeffding Tree

Published by Gianmarco De Francisci Morales
Publication date: 2016
Research field: Informatics Engineering
Research language: English




IoT Big Data requires new machine learning methods that can scale to large volumes of data arriving at high speed. Decision trees are popular machine learning models because they are very effective, yet easy to interpret and visualize. The literature offers distributed algorithms for learning decision trees, as well as streaming algorithms, but no algorithm that combines both features. In this paper we present the Vertical Hoeffding Tree (VHT), the first distributed streaming algorithm for learning decision trees. It features a novel way of distributing decision trees via vertical parallelism. The algorithm is implemented on top of Apache SAMOA, a platform for mining distributed data streams, and is thus able to run on real-world clusters. We run several experiments to study the accuracy and throughput of our new VHT algorithm, as well as its ability to scale while keeping its superior performance with respect to non-distributed decision trees.
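As background not spelled out in the abstract, a Hoeffding tree typically splits a leaf once the observed gap between the two best split candidates exceeds the Hoeffding bound, and vertical parallelism can be read as dividing the attributes among workers. The following is a minimal sketch under those assumptions; the function names and the round-robin attribute assignment are hypothetical illustrations, not the paper's or Apache SAMOA's actual implementation.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Standard Hoeffding bound: with probability 1 - delta, the true mean of a
    variable with the given range lies within this epsilon of the mean
    observed over n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split when the best candidate beats the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


def partition_attributes(num_attributes: int, num_workers: int) -> list[list[int]]:
    """Hypothetical round-robin assignment of attribute indices to workers,
    illustrating the idea of vertical (attribute-wise) parallelism."""
    return [list(range(w, num_attributes, num_workers))
            for w in range(num_workers)]


if __name__ == "__main__":
    print(hoeffding_bound(1.0, 1e-7, 1000))
    print(should_split(0.30, 0.15, 1.0, 1e-7, 1000))
    print(partition_attributes(5, 2))
```

In this sketch each worker would keep statistics only for its own attribute subset, so the per-instance work shrinks as workers are added; how the per-worker results are combined is left out because the abstract does not describe it.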



Read also

Tables are widely used with various structures to organize and present data. Recent attempts at table understanding mainly focus on relational tables, yet overlook other common table structures. In this paper, we propose TUTA, a unified pre-training architecture for understanding generally structured tables. Noticing that understanding a table requires spatial, hierarchical, and semantic information, we enhance transformers with three novel structure-aware mechanisms. First, we devise a unified tree-based structure, called a bi-dimensional coordinate tree, to describe both the spatial and hierarchical information of generally structured tables. Building on this, we propose tree-based attention and position embedding to better capture the spatial and hierarchical information. Moreover, we devise three progressive pre-training objectives to enable representations at the token, cell, and table levels. We pre-train TUTA on a wide range of unlabeled web and spreadsheet tables and fine-tune it on two critical tasks in the field of table structure understanding: cell type classification and table type classification. Experiments show that TUTA is highly effective, achieving state-of-the-art results on five widely studied datasets.
Machine learning (ML) is an important part of modern data science applications. Data scientists today have to manage the end-to-end ML life cycle, which includes both model training and model serving; the latter is essential, as it makes their work available to end users. Systems for model serving require high performance, low cost, and ease of management. Cloud providers already offer model serving options, including managed services and self-rented servers. Recently, serverless computing, whose advantages include high elasticity and a fine-grained cost model, has brought another possibility for model serving. In this paper, we study the viability of serverless as a mainstream model serving platform for data science applications. We conduct a comprehensive evaluation of the performance and cost of serverless against other model serving systems on two clouds: Amazon Web Services (AWS) and Google Cloud Platform (GCP). We find that serverless outperforms many cloud-based alternatives with respect to cost and performance. More interestingly, under some circumstances, it can even outperform GPU-based systems in both average latency and cost. These results differ from previous works' claim that serverless is not suitable for model serving, and run contrary to the conventional wisdom that GPU-based systems are better for ML workloads than CPU-based systems. Other findings include a large gap in cold start time between AWS and GCP serverless functions, and the low sensitivity of serverless to changes in workloads or models. Our evaluation results indicate that serverless is a viable option for model serving. Finally, we present several practical recommendations for data scientists on how to use serverless for scalable and cost-effective model serving.
Recently, deep learning has been an area of intense research. However, as a compute-intensive task, deep learning relies heavily on GPU memory, which is usually expensive and scarce. Although extensive work has been proposed for dynamic GPU memory management, it is hard to apply to systems with multiple dynamic workloads, such as in-database machine learning systems. In this paper, we demonstrate TENSILE, a method of managing GPU memory at tensor granularity to reduce the GPU memory peak while taking multiple dynamic workloads into consideration. As far as we know, TENSILE is the first method designed to manage the GPU memory usage of multiple workloads. We implement TENSILE on a deep learning framework we built ourselves and evaluate its performance. The experimental results show that TENSILE can save more GPU memory with less extra time overhead than prior works in both single and multiple dynamic workload scenarios.
Federated learning (FL) is an emerging paradigm that enables multiple organizations to jointly train a model without revealing their private data to each other. This paper studies vertical federated learning, which tackles the scenarios where (i) collaborating organizations own data of the same set of users but with disjoint features, and (ii) only one organization holds the labels. We propose Pivot, a novel solution for privacy-preserving vertical decision tree training and prediction, ensuring that no intermediate information is disclosed other than what the clients have agreed to release (i.e., the final tree model and the prediction output). Pivot does not rely on any trusted third party and provides protection against a semi-honest adversary that may compromise $m-1$ out of $m$ clients. We further identify two privacy leakages when the trained decision tree model is released in plaintext and propose an enhanced protocol to mitigate them. The proposed solution can also be extended to tree ensemble models, e.g., random forest (RF) and gradient boosting decision tree (GBDT), by treating single decision trees as building blocks. Theoretical and experimental analyses suggest that Pivot is efficient for the privacy achieved.
Giovanni Peccati, 2009
It is proved that each Hoeffding space associated with a random permutation (or, equivalently, with extractions without replacement from a finite population) carries an irreducible representation of the symmetric group, equivalent to a two-block Specht module.
