
Automatic Text Summarization Approaches to Speed up Topic Model Learning Process

Published by: Juan-Manuel Torres-Moreno
Publication date: 2017
Research field: Informatics engineering
Paper language: English





The number of documents available on the Internet grows every day. Consequently, processing this amount of information effectively and expressively has become a major concern for companies and scientists. Methods that represent a textual document by a topic representation are widely used in Information Retrieval (IR) to process big data such as Wikipedia articles. One of the main difficulties in applying topic models to huge data collections is the computational cost (CPU time and memory) required for model estimation. To address this issue, we propose building topic spaces from summarized documents. In this paper, we present a study of topic space representation in the context of big data. The behavior of the topic space representation is analyzed across different languages. Experiments show that topic spaces estimated from text summaries are as relevant as those estimated from the complete documents. The real advantage of this approach is the gain in processing time: we show that processing time can be drastically reduced using summarized documents (by more than 60% in general). Finally, this study points out the differences between thematic representations of documents depending on the targeted language, such as English or Latin languages.
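To make the pipeline concrete, here is a minimal sketch of the idea: estimate one topic space from full documents and one from summaries, and compare estimation time. It assumes the gensim library and uses a crude lead-sentence summarizer as a stand-in for the paper's actual summarization system; the corpus is a toy placeholder.

```python
# Minimal sketch: build a topic space (LDA) from summaries instead of
# full documents and compare estimation time. Assumes gensim; the
# lead-sentence summarizer is a stand-in for a real summarizer.
import time
from gensim import corpora, models

def lead_summary(doc, n_sentences=2):
    """Crude extractive summary: keep only the first n sentences."""
    return ". ".join(doc.split(". ")[:n_sentences])

def build_topic_space(docs, num_topics=2):
    texts = [d.lower().replace(".", "").split() for d in docs]
    dictionary = corpora.Dictionary(texts)
    bow = [dictionary.doc2bow(t) for t in texts]
    return models.LdaModel(bow, num_topics=num_topics, id2word=dictionary)

docs = [  # toy corpus standing in for, e.g., Wikipedia articles
    "Topic models represent documents in a latent space. They are used in IR. Training them is costly.",
    "Summaries keep the salient content of a document. They are much shorter. They can speed up estimation.",
]
for variant, corpus in [("full", docs),
                        ("summary", [lead_summary(d) for d in docs])]:
    t0 = time.time()
    build_topic_space(corpus)
    print(variant, "estimation took %.2fs" % (time.time() - t0))
```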


Read also

Damir Korenčić (2020)
Topic models are widely used unsupervised models capable of learning topics - weighted lists of words and documents - from large collections of text documents. When topic models are used for discovery of topics in text collections, a question that arises naturally is how well the model-induced topics correspond to topics of interest to the analyst. In this paper we revisit and extend a so far neglected approach to topic model evaluation based on measuring topic coverage - computationally matching model topics with a set of reference topics that models are expected to uncover. The approach is well suited for analyzing model performance in topic discovery and for large-scale analysis of both topic models and measures of model quality. We propose new measures of coverage and evaluate, in a series of experiments, different types of topic models on two distinct text domains for which interest in topic discovery exists. The experiments include evaluation of model quality, analysis of coverage of distinct topic categories, and analysis of the relationship between coverage and other methods of topic model evaluation. The paper contributes a new supervised measure of coverage and the first unsupervised measure of coverage. The supervised measure achieves topic matching accuracy close to human agreement. The unsupervised measure correlates highly with the supervised one (Spearman's $\rho \geq 0.95$). Other contributions include insights into both topic models and different methods of model evaluation, and the datasets and code for facilitating future research on topic coverage.
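As an illustration of the unsupervised side of this idea, the sketch below counts a reference topic as covered when some model topic's top words overlap it strongly enough. The Jaccard similarity and the threshold are illustrative assumptions, not the paper's exact measures.

```python
# Illustrative coverage sketch: a reference topic is "covered" if at
# least one model topic matches it above a similarity threshold.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def coverage(model_topics, reference_topics, threshold=0.3):
    """Both arguments are lists of top-word lists, one per topic."""
    covered = sum(
        1 for ref in reference_topics
        if max(jaccard(ref, m) for m in model_topics) >= threshold
    )
    return covered / len(reference_topics)

# Example: one of the two reference topics is recovered by the model.
model = [["bank", "loan", "credit"], ["game", "team", "score"]]
reference = [["loan", "bank", "interest"], ["election", "vote", "party"]]
print(coverage(model, reference))  # 0.5
```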
We introduce a new approach for abstractive text summarization, Topic-Guided Abstractive Summarization, which calibrates long-range dependencies from topic-level features with globally salient content. The idea is to incorporate neural topic modeling with a Transformer-based sequence-to-sequence (seq2seq) model in a joint learning framework. This design can learn and preserve the global semantics of the document, which provides additional contextual guidance for capturing its important ideas, thereby enhancing summary generation. We conduct extensive experiments on two datasets and the results show that our proposed model outperforms many extractive and abstractive systems in terms of both ROUGE measurements and human evaluation. Our code is available at: https://github.com/chz816/tas.
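A minimal sketch of the fusion idea follows: a document-level topic distribution from a neural topic model is projected into the embedding space and injected into the decoder input, so generation is conditioned on the document's global semantics. Layer names and sizes are illustrative assumptions (PyTorch), not the authors' implementation.

```python
# Sketch: inject a document topic vector into every decoder position
# so the seq2seq model is guided by global semantics.
import torch
import torch.nn as nn

class TopicGuidedDecoderInput(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, num_topics=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.topic_proj = nn.Linear(num_topics, emb_dim)

    def forward(self, token_ids, topic_dist):
        # token_ids: (batch, seq_len); topic_dist: (batch, num_topics)
        tok = self.embed(token_ids)                     # (B, T, E)
        top = self.topic_proj(topic_dist).unsqueeze(1)  # (B, 1, E)
        return tok + top  # topic signal added at every position
```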
Ziqiang Cao, Furu Wei, Wenjie Li (2017)
Unlike extractive summarization, abstractive summarization has to fuse different parts of the source text, which inclines it to create fake facts. Our preliminary study reveals that nearly 30% of the outputs from a state-of-the-art neural summarization system suffer from this problem. While previous abstractive summarization approaches usually focus on improving informativeness, we argue that faithfulness is also a vital prerequisite for a practical abstractive summarization system. To avoid generating fake facts in a summary, we leverage open information extraction and dependency parsing technologies to extract actual fact descriptions from the source text. A dual-attention sequence-to-sequence framework is then proposed to force generation to be conditioned on both the source text and the extracted fact descriptions. Experiments on the Gigaword benchmark dataset demonstrate that our model can greatly reduce fake summaries, by 80%. Notably, the fact descriptions also bring a significant improvement in informativeness, since they often condense the meaning of the source text.
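A schematic sketch of dual attention: the decoder state attends separately over the source encoding and over the encoding of the extracted fact descriptions, and the two context vectors are merged with a learned gate. Shapes, names, and the gating choice are assumptions for illustration; the paper's exact architecture is not reproduced.

```python
# Sketch: dual attention over source and fact encodings, gated fusion.
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    def __init__(self, hidden=512):
        super().__init__()
        self.src_attn = nn.MultiheadAttention(hidden, num_heads=1, batch_first=True)
        self.fact_attn = nn.MultiheadAttention(hidden, num_heads=1, batch_first=True)
        self.gate = nn.Linear(2 * hidden, 1)

    def forward(self, dec_state, src_enc, fact_enc):
        # dec_state: (B, 1, H); src_enc: (B, Ts, H); fact_enc: (B, Tf, H)
        c_src, _ = self.src_attn(dec_state, src_enc, src_enc)
        c_fact, _ = self.fact_attn(dec_state, fact_enc, fact_enc)
        g = torch.sigmoid(self.gate(torch.cat([c_src, c_fact], dim=-1)))
        return g * c_src + (1 - g) * c_fact  # fused context vector
```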
When video collections become huge, exploring both within and across videos efficiently is challenging. Video summarization is one way to tackle this issue. Traditional summarization approaches limit the effectiveness of video exploration because they generate only one fixed video summary for a given input video, independent of the user's information need. In this work, we introduce a method that takes a text-based query as input and generates a corresponding video summary. We do so by modeling video summarization as a supervised learning problem and propose an end-to-end deep-learning-based method for query-controllable video summarization that generates a query-dependent video summary. Our proposed method consists of a video summary controller, a video summary generator, and a video summary output module. To foster research on query-controllable video summarization and to conduct our experiments, we introduce a dataset that contains frame-based relevance score labels. Our experimental results show that the text-based query helps control the video summary and improves model performance. Our code and dataset: https://github.com/Jhhuangkay/Query-controllable-Video-Summarization.
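One way to read the supervised formulation is as query-conditioned frame scoring: fuse a query embedding with per-frame features, predict a relevance score per frame, and keep the top-scoring frames as the summary. The sketch below follows that reading; the dimensions and the additive fusion are assumptions, not the paper's modules.

```python
# Sketch: predict per-frame relevance conditioned on a text query.
import torch
import torch.nn as nn

class QueryFrameScorer(nn.Module):
    def __init__(self, frame_dim=2048, query_dim=300, hidden=256):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hidden)
        self.query_proj = nn.Linear(query_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, frames, query):
        # frames: (B, T, frame_dim); query: (B, query_dim)
        h = torch.tanh(self.frame_proj(frames) +
                       self.query_proj(query).unsqueeze(1))
        return self.score(h).squeeze(-1)  # (B, T) relevance scores

# Train against frame-level relevance labels (e.g. with MSE), then
# take the top-k frames as the query-dependent summary.
```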
Text classification tends to be difficult when data are deficient or when it is required to adapt to unseen classes. In such challenging scenarios, recent studies have often used meta-learning to simulate the few-shot task, thereby neglecting implicit common linguistic features across tasks. This paper addresses such problems using meta-learning and unsupervised language models. Our approach is based on the insight that good generalization from a few examples relies on both a generic model initialization and an effective strategy for adapting this model to newly arising tasks. We show that our approach is not only simple but also produces state-of-the-art performance on a well-studied sentiment classification dataset. This further suggests that pretraining could be a promising solution for few-shot learning of many other NLP tasks. The code and the dataset to replicate the experiments are available at https://github.com/zxlzr/FewShotNLP.
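The "generic initialization plus adaptation strategy" insight can be sketched as a simple inner loop: starting from pretrained weights, take a few gradient steps on the handful of labelled examples of a new task. This is a schematic illustration of the idea, not the authors' training procedure.

```python
# Sketch: few-shot adaptation from a generic (pretrained) initialization.
import copy
import torch

def adapt_to_task(model, support_x, support_y, loss_fn, steps=5, lr=1e-3):
    task_model = copy.deepcopy(model)  # keep the generic init intact
    opt = torch.optim.SGD(task_model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(task_model(support_x), support_y)
        loss.backward()
        opt.step()
    return task_model  # task-specific model after few-shot adaptation
```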