ﻻ يوجد ملخص باللغة العربية
The ability to measure similarity between documents enables intelligent summarization and analysis of large corpora. Past distances between documents suffer from either an inability to incorporate semantic similarities between words or from scalability issues. As an alternative, we introduce hierarchical optimal transport as a meta-distance between documents, where documents are modeled as distributions over topics, which themselves are modeled as distributions over words. We then solve an optimal transport problem on the smaller topic space to compute a similarity score. We give conditions on the topics under which this construction defines a distance, and we relate it to the word movers distance. We evaluate our technique for k-NN classification and show better interpretability and scalability with comparable performance to current methods at a fraction of the cost.
Learning generic representations with deep networks requires massive training samples and significant computer resources. To learn a new specific task, an important issue is to transfer the generic teachers representation to a student network. In thi
Traditional multi-view learning methods often rely on two assumptions: ($i$) the samples in different views are well-aligned, and ($ii$) their representations in latent space obey the same distribution. Unfortunately, these two assumptions may be que
News articles usually contain knowledge entities such as celebrities or organizations. Important entities in articles carry key messages and help to understand the content in a more direct way. An industrial news recommender system contains various k
We propose SelfDoc, a task-agnostic pre-training framework for document image understanding. Because documents are multimodal and are intended for sequential reading, our framework exploits the positional, textual, and visual information of every sem
Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what is the relationship that makes two documents alike. In this paper, we mo