No Arabic abstract
Social media communications are becoming increasingly prevalent; some useful, some false, whether unwittingly or maliciously. An increasing number of rumours daily flood the social networks. Determining their veracity in an autonomous way is a very active and challenging field of research, with a variety of methods proposed. However, most of the models rely on determining the constituent messages stance towards the rumour, a feature known as the wisdom of the crowd. Although several supervised machine-learning approaches have been proposed to tackle the message stance classification problem, these have numerous shortcomings. In this paper we argue that semi-supervised learning is more effective than supervised models and use two graph-based methods to demonstrate it. This is not only in terms of classification accuracy, but equally important, in terms of speed and scalability. We use the Label Propagation and Label Spreading algorithms and run experiments on a dataset of 72 rumours and hundreds of thousands messages collected from Twitter. We compare our results on two available datasets to the state-of-the-art to demonstrate our algorithms performance regarding accuracy, speed and scalability for real-time applications.
We consider the task of learning a classifier from the feature space $mathcal{X}$ to the set of classes $mathcal{Y} = {0, 1}$, when the features can be partitioned into class-conditionally independent feature sets $mathcal{X}_1$ and $mathcal{X}_2$. We show the surprising fact that the class-conditional independence can be used to represent the original learning task in terms of 1) learning a classifier from $mathcal{X}_2$ to $mathcal{X}_1$ and 2) learning the class-conditional distribution of the feature set $mathcal{X}_1$. This fact can be exploited for semi-supervised learning because the former task can be accomplished purely from unlabeled samples. We present experimental evaluation of the idea in two real world applications.
Network traffic classification, a task to classify network traffic and identify its type, is the most fundamental step to improve network services and manage modern networks. Classical machine learning and deep learning method have developed well in the field of network traffic classification. However, there are still two major challenges. One is how to protect the privacy of users traffic data, and the other is that it is difficult to obtain labeled data in reality. In this paper, we propose a novel approach using federated semi-supervised learning for network traffic classification. In our approach, the federated servers and several clients work together to train a global classification model. Among them, unlabeled data is used on the client, and labeled data is used on the server. Moreover, we use two traffic subflow sampling methods: simple sampling and incremental sampling for data preprocessing. The experimental results in the QUIC dataset show that the accuracy of our federated semi-supervised approach can reach 91.08% and 97.81% when using the simple sampling method and incremental sampling method respectively. The experimental results also show that the accuracy gap between our method and the centralized training method is minimal, and it can effectively protect users privacy and does not require a large amount of labeled data.
Most segmentation methods in child brain MRI are supervised and are based on global intensity distributions of major brain structures. The successful implementation of a supervised approach depends on availability of an age-appropriate probabilistic brain atlas. For the study of early normal brain development, the construction of such a brain atlas remains a significant challenge. Moreover, using global intensity statistics leads to inaccurate detection of major brain tissue classes due to substantial intensity variations of MR signal within the constituent parts of early developing brain. In order to overcome these methodological limitations we develop a local, semi-supervised framework. It is based on Kernel Fisher Discriminant Analysis (KFDA) for pattern recognition, combined with an objective structural similarity index (SSIM) for perceptual image quality assessment. The proposed method performs optimal brain partitioning into subdomains having different average intensity values followed by SSIM-guided computation of separating surfaces between the constituent brain parts. The classified image subdomains are then stitched slice by slice via simulated annealing to form a global image of the classified brain. In this paper, we consider classification into major tissue classes (white matter and grey matter) and the cerebrospinal fluid and illustrate the proposed framework on examples of brain templates for ages 8 to 11 months and ages 44 to 60 months. We show that our method improves detection of the tissue classes by its comparison to state-of-the-art classification techniques known as Partial Volume Estimation.
Seeding then expanding is a commonly used scheme to discover overlapping communities in a network. Most seeding methods are either too complex to scale to large networks or too simple to select high-quality seeds, and the non-principled functions used by most expanding methods lead to poor performance when applied to diverse networks. This paper proposes a new method that transforms a network into a corpus where each edge is treated as a document, and all nodes of the network are treated as terms of the corpus. An effective seeding method is also proposed that selects seeds as a training set, then a principled expanding method based on semi-supervised learning is applied to classify edges. We compare our new algorithm with four other community detection algorithms on a wide range of synthetic and empirical networks. Experimental results show that the new algorithm can significantly improve clustering performance in most cases. Furthermore, the time complexity of the new algorithm is linear to the number of edges, and this low complexity makes the new algorithm scalable to large networks.
Recently proposed consistency-based Semi-Supervised Learning (SSL) methods such as the $Pi$-model, temporal ensembling, the mean teacher, or the virtual adversarial training, have advanced the state of the art in several SSL tasks. These methods can typically reach performances that are comparable to their fully supervised counterparts while using only a fraction of labelled examples. Despite these methodological advances, the understanding of these methods is still relatively limited. In this text, we analyse (variations of) the $Pi$-model in settings where analytically tractable results can be obtained. We establish links with Manifold Tangent Classifiers and demonstrate that the quality of the perturbations is key to obtaining reasonable SSL performances. Importantly, we propose a simple extension of the Hidden Manifold Model that naturally incorporates data-augmentation schemes and offers a framework for understanding and experimenting with SSL methods.