Text classification is one of the most critical areas in machine learning and artificial intelligence research. It has been widely adopted in business applications such as conversational intelligence systems, news article categorization, sentiment analysis, emotion detection, and many other recommendation systems in daily life. One problem with supervised text classification models is that model performance depends heavily on the quality of data labeling, which is typically done by humans. In this study, we propose a new network community detection-based approach to automatically label and classify text data into multiclass value spaces. Specifically, we build networks with sentences as the network nodes and pairwise cosine similarities between the Term Frequency-Inverse Document Frequency (TF-IDF) vector representations of the sentences as the network link weights. We use the Louvain method to detect the communities in the sentence networks. We train and test Support Vector Machine and Random Forest models on both the human-labeled data and the data labeled by network community detection. Results showed that models trained on the community detection labels outperformed models trained on the human-labeled data by 2.68-3.75% in classification accuracy. Our method may support the development of more accurate conversational intelligence and other text classification systems.
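A minimal sketch of the pipeline this abstract describes, using scikit-learn and networkx: TF-IDF sentence vectors, a similarity-weighted sentence network, Louvain communities as automatic labels, and an SVM trained on those labels. The toy sentences, the nonzero-similarity link rule, and all hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import SVC

sentences = ["order a pizza", "book a pizza delivery",
             "what is the weather", "weather forecast today"]

X = TfidfVectorizer().fit_transform(sentences)   # TF-IDF sentence vectors
S = cosine_similarity(X)                         # pairwise cosine similarities

# Nodes are sentences; link weights are the pairwise cosine similarities.
G = nx.Graph()
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        if S[i, j] > 0:                          # assumed: keep nonzero links
            G.add_edge(i, j, weight=S[i, j])

# Louvain community detection supplies the class labels automatically.
communities = nx.community.louvain_communities(G, weight="weight", seed=0)
labels = [0] * len(sentences)
for c, members in enumerate(communities):
    for node in members:
        labels[node] = c

clf = SVC().fit(X, labels)                       # train on the detected labels
```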
We propose a new approach to text classification problems where learning with partial labels is beneficial. Instead of giving each training sample a set of candidate labels, we assign negative-oriented labels to ambiguous training examples when they are unlikely to fall into certain classes. We construct new maximum likelihood estimators with a self-correction property and prove that, under some conditions, our estimators converge faster. We also discuss the advantages of applying one of our estimators to a fully supervised learning problem. The proposed method has potential applicability in many areas, such as crowdsourcing, natural language processing, and medical image analysis.
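One plausible form such a likelihood could take, shown only as a hedged sketch (the paper's exact estimator is not given in the abstract): positively labeled examples contribute the usual log-probability of their class, while a negatively labeled example contributes the log-probability of avoiding the classes it is unlikely to belong to.

```latex
% Sketch, not the paper's estimator: P = positively labeled examples,
% N = negatively labeled examples, N_j = classes example x_j is
% unlikely to fall into, p_theta = the model's class probabilities.
\ell(\theta) \;=\; \sum_{i \in \mathcal{P}} \log p_\theta(y_i \mid x_i)
\;+\; \sum_{j \in \mathcal{N}} \log\Bigl(1 - \sum_{k \in N_j} p_\theta(k \mid x_j)\Bigr)
```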
Service manual documents are crucial to engineering companies, as they provide guidelines and knowledge to service engineers. However, retrieving specific knowledge from these documents has become inconvenient and inefficient for service engineers due to the complexity of the resources. In this research, we propose an automated knowledge mining and document classification system with novel multi-model transfer learning approaches. In particular, the classification performance of the system is improved with three effective techniques: fine-tuning, pruning, and a multi-model method. The fine-tuning technique optimizes a pre-trained BERT model by adding a feed-forward neural network layer, and the pruning technique retrains the BERT model with new data. The multi-model method initializes and trains multiple BERT models to overcome the randomness of data ordering during the fine-tuning process: in the first training iteration, multiple BERT models are trained simultaneously; the best model is then selected for two further training iterations, and the training of the other BERT models is terminated. The performance of the proposed system has been evaluated against two strong baseline methods, BERT and BERT-CNN. Experimental results on the widely used Corpus of Linguistic Acceptability (CoLA) dataset show that the proposed techniques outperform these baselines in terms of accuracy and MCC score.
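A hedged sketch of the multi-model selection step using Hugging Face transformers: several BERT classifiers are fine-tuned for one iteration, the best one continues for two more, and the rest are discarded. The toy single-batch loaders, the three seeds, and the learning rate are illustrative assumptions, not the paper's exact procedure.

```python
import torch
from torch.optim import AdamW
from transformers import BertForSequenceClassification, BertTokenizer

# Toy single-batch "loaders" so the sketch is self-contained; real use
# would stream tokenized CoLA batches from a DataLoader.
tok = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tok(["the cat sat", "sat the cat"], padding=True, return_tensors="pt")
batch = {**enc, "labels": torch.tensor([1, 0])}
train_loader = val_loader = [batch]

def train_epoch(model, loader, opt):
    model.train()
    for b in loader:
        loss = model(**b).loss
        loss.backward()
        opt.step()
        opt.zero_grad()

def accuracy(model, loader):
    model.eval()
    hits = total = 0
    with torch.no_grad():
        for b in loader:
            pred = model(**b).logits.argmax(dim=-1)
            hits += (pred == b["labels"]).sum().item()
            total += pred.numel()
    return hits / total

candidates = []
for seed in (0, 1, 2):                        # assumed: three candidate models
    torch.manual_seed(seed)                   # vary initialization/ordering
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)    # feed-forward head on pooled output
    opt = AdamW(model.parameters(), lr=2e-5)
    train_epoch(model, train_loader, opt)     # first training iteration
    candidates.append((accuracy(model, val_loader), model, opt))

# Keep the best candidate; the others are terminated.
_, best, opt = max(candidates, key=lambda c: c[0])
for _ in range(2):                            # two further iterations
    train_epoch(best, train_loader, opt)
```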
The study of community structure has been a hot research topic in recent years. But, while the concept has been successfully applied in several areas, it lacks a general and precise definition. The hierarchical structure and heterogeneity of complex networks make it difficult to unify the idea of a community and its evaluation. The global functional known as modularity is probably the most widely used technique in this area; nevertheless, its limits have been studied in depth. Local techniques such as those by Lancichinetti et al. and Palla et al. arose as an answer to the resolution limit and degeneracies of modularity. Here we start from the algorithm by Lancichinetti et al. and propose a unique growth process for a fitness function that, while remaining local, finds a community partition covering the whole network, updating the scale parameter dynamically. We test the quality of our results on a set of benchmarks of heterogeneous graphs. We discuss alternative measures for evaluating community structure and, in light of them, infer possible explanations for the better performance of local methods compared to global ones in these cases.
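For reference, the local fitness function of Lancichinetti et al. that this growth process starts from has the following form; how the abstract's method updates the scale parameter dynamically is not specified here, so the formula is given only as the known starting point.

```latex
% Local fitness of a community G (Lancichinetti et al.): k_in^G and
% k_out^G are the total internal and external degrees of G, and alpha
% is the scale (resolution) parameter the proposed method updates
% dynamically.
f_G = \frac{k_{\mathrm{in}}^{G}}{\left(k_{\mathrm{in}}^{G} + k_{\mathrm{out}}^{G}\right)^{\alpha}}
```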
Process model extraction (PME) is a recently emerged interdisciplinary field between natural language processing (NLP) and business process management (BPM) that aims to extract process models from textual descriptions. Previous process extractors depend heavily on manual features and ignore potential relations between clues at different text granularities. In this paper, we formalize the PME task as a multi-grained text classification problem and propose a hierarchical neural network to effectively model and extract multi-grained information without manually defined procedural features. Under this structure, we propose a coarse-to-fine (grained) learning mechanism that trains the multi-grained tasks in coarse-to-fine order, so that high-level knowledge is shared with the low-level tasks. To evaluate our approach, we construct two multi-grained datasets from two different domains and conduct extensive experiments along different dimensions. The experimental results demonstrate that our approach outperforms state-of-the-art methods with statistical significance, and further investigations confirm its effectiveness.
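A hedged sketch of the coarse-to-fine idea: a shared encoder feeds a coarse-grained head and a fine-grained head, and the coarse task is optimized first so the shared representation carries its high-level knowledge into the fine task. The toy encoder, dimensions, label counts, and random data are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MultiGrained(nn.Module):
    def __init__(self, vocab=1000, dim=64, n_coarse=3, n_fine=7):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)      # toy shared encoder
        self.coarse_head = nn.Linear(dim, n_coarse)
        self.fine_head = nn.Linear(dim, n_fine)

    def forward(self, tokens):
        h = self.embed(tokens).mean(dim=1)         # shared representation
        return self.coarse_head(h), self.fine_head(h)

model = MultiGrained()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

tokens = torch.randint(0, 1000, (8, 12))           # toy batch of token ids
y_coarse = torch.randint(0, 3, (8,))               # coarse-grained labels
y_fine = torch.randint(0, 7, (8,))                 # fine-grained labels

# Coarse-to-fine order: train the coarse-grained task first, then let
# the fine-grained task reuse (and keep updating) the shared encoder.
for _ in range(3):                                 # coarse-grained phase
    coarse_logits, _ = model(tokens)
    ce(coarse_logits, y_coarse).backward()
    opt.step(); opt.zero_grad()
for _ in range(3):                                 # fine-grained phase
    _, fine_logits = model(tokens)
    ce(fine_logits, y_fine).backward()
    opt.step(); opt.zero_grad()
```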
Multi-document summarization has reached a bottleneck due to the lack of sufficient training data and of diverse document categories; text classification can make up for these deficiencies. In this paper, we propose a novel summarization system called TCSum, which leverages plentiful text classification data to improve the performance of multi-document summarization. TCSum projects documents onto distributed representations that act as a bridge between text classification and summarization, and it uses the classification results to produce summaries of different styles. Extensive experiments on DUC generic multi-document summarization datasets show that TCSum achieves state-of-the-art performance without using any hand-crafted features and can capture variations in summary style across different text categories.
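A hedged sketch of the shared-representation idea: one document representation feeds both a classifier and a sentence scorer, and the predicted category selects a style-specific scoring vector. The sizes, the dot-product scoring form, and the random inputs are illustrative assumptions, not TCSum's exact model.

```python
import torch
import torch.nn as nn

class TCSumSketch(nn.Module):
    def __init__(self, dim=128, n_categories=5):
        super().__init__()
        self.classifier = nn.Linear(dim, n_categories)
        # one scoring vector per category, i.e. per summary style
        self.style_vectors = nn.Parameter(torch.randn(n_categories, dim))

    def forward(self, doc_vec, sent_vecs):
        logits = self.classifier(doc_vec)       # classification side of the bridge
        category = logits.argmax(dim=-1)        # predicted text category
        style = self.style_vectors[category]    # style vector for that category
        scores = sent_vecs @ style              # salience score per sentence
        return logits, scores

model = TCSumSketch()
doc_vec = torch.randn(128)        # assumed: distributed document representation
sent_vecs = torch.randn(10, 128)  # assumed: 10 candidate sentence vectors
logits, scores = model(doc_vec, sent_vecs)
summary_idx = scores.topk(3).indices  # extract top-3 sentences as the summary
```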