Character-Based Text Classification using Top Down Semantic Model for Sentence Representation

95 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Zhenzhou Wu

تاريخ النشر 2017

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Zhenzhou Wu - Xin Zheng - Daniel Dahlmeier

الحساب واللغة التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Despite the success of deep learning on many fronts especially image and speech, its application in text classification often is still not as good as a simple linear SVM on n-gram TF-IDF representation especially for smaller datasets. Deep learning tends to emphasize on sentence level semantics when learning a representation with models like recurrent neural network or recursive neural network, however from the success of TF-IDF representation, it seems a bag-of-words type of representation has its strength. Taking advantage of both representions, we present a model known as TDSM (Top Down Semantic Model) for extracting a sentence representation that considers both the word-level semantics by linearly combining the words with attention weights and the sentence-level semantics with BiLSTM and use it on text classification. We apply the model on characters and our results show that our model is better than all the other character-based and word-based convolutional neural network models by cite{zhang15} across seven different datasets with only 1% of their parameters. We also demonstrate that this model beats traditional linear models on TF-IDF vectors on small and polished datasets like news article in which typically deep learning models surrender.

قيم البحث

93 - Chi Zhang , Shagan Sah , Thang Nguyen 2018

This paper introduces a sentence to vector encoding framework suitable for advanced natural language processing. Our latent representation is shown to encode sentences with common semantic information with similar vector representations. The vector r epresentation is extracted from an encoder-decoder model which is trained on sentence paraphrase pairs. We demonstrate the application of the sentence representations for two different tasks -- sentence paraphrasing and paragraph summarization, making it attractive for commonly used recurrent frameworks that process text. Experimental results help gain insight how vector representations are suitable for advanced language embedding.

الحساب واللغة الذكاء الاصطناعي

Universal Language Model Fine-tuning for Text Classification

103 - Jeremy Howard , Sebastian Ruder 2018

Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch. We propose Universal Language Model Fine-tuning (ULMFiT), an effective transfer lear ning method that can be applied to any task in NLP, and introduce techniques that are key for fine-tuning a language model. Our method significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100x more data. We open-source our pretrained models and code.

الحساب واللغة التعلم الآلي التعلم الالي

Neural Attentive Bag-of-Entities Model for Text Classification

157 - Ikuya Yamada , Hiroyuki Shindo 2019

This study proposes a Neural Attentive Bag-of-Entities model, which is a neural network model that performs text classification using entities in a knowledge base. Entities provide unambiguous and relevant semantic signals that are beneficial for cap turing semantics in texts. We combine simple high-recall entity detection based on a dictionary, to detect entities in a document, with a novel neural attention mechanism that enables the model to focus on a small number of unambiguous and relevant entities. We tested the effectiveness of our model using two standard text classification datasets (i.e., the 20 Newsgroups and R8 datasets) and a popular factoid question answering dataset based on a trivia quiz game. As a result, our model achieved state-of-the-art results on all datasets. The source code of the proposed model is available online at https://github.com/wikipedia2vec/wikipedia2vec.

الحساب واللغة التعلم الآلي

Text Classification Using Label Names Only: A Language Model Self-Training Approach

113 - Yu Meng , Yunyi Zhang , Jiaxin Huang 2020

Current text classification methods typically require a good number of human-labeled documents as training data, which can be costly and difficult to obtain in real applications. Humans can perform classification without seeing any labeled examples b ut only based on a small set of words describing the categories to be classified. In this paper, we explore the potential of only using the label name of each class to train classification models on unlabeled data, without using any labeled documents. We use pre-trained neural language models both as general linguistic knowledge sources for category understanding and as representation learning models for document classification. Our method (1) associates semantically related words with the label names, (2) finds category-indicative words and trains the model to predict their implied categories, and (3) generalizes the model via self-training. We show that our model achieves around 90% accuracy on four benchmark datasets including topic and sentiment classification without using any labeled documents but learning from unlabeled data supervised by at most 3 words (1 in most cases) per class as the label name.

الحساب واللغة التعلم الآلي

Text Classification for Azerbaijani Language Using Machine Learning and Embedding

114 - Umid Suleymanov , Behnam Kiani Kalejahi , Elkhan Amrahov 2019

Text classification systems will help to solve the text clustering problem in the Azerbaijani language. There are some text-classification applications for foreign languages, but we tried to build a newly developed system to solve this problem for th e Azerbaijani language. Firstly, we tried to find out potential practice areas. The system will be useful in a lot of areas. It will be mostly used in news feed categorization. News websites can automatically categorize news into classes such as sports, business, education, science, etc. The system is also used in sentiment analysis for product reviews. For example, the company shares a photo of a new product on Facebook and the company receives a thousand comments for new products. The systems classify the comments into categories like positive or negative. The system can also be applied in recommended systems, spam filtering, etc. Various machine learning techniques such as Naive Bayes, SVM, Decision Trees have been devised to solve the text classification problem in Azerbaijani language.

الحساب واللغة التعلم الآلي التعلم الالي