No Arabic abstract
Monitoring network traffic data to detect any hidden patterns of anomalies is a challenging and time-consuming task that requires high computing resources. To this end, an appropriate summarization technique is of great importance, where it can be a substitute for the original data. However, the summarized data is under the threat of removing anomalies. Therefore, it is vital to create a summary that can reflect the same pattern as the original data. Therefore, in this paper, we propose an INtelligent Summarization approach for IDENTifying hidden anomalies, called INSIDENT. The proposed approach guarantees to keep the original data distribution in summarized data. Our approach is a clustering-based algorithm that dynamically maps original feature space to a new feature space by locally weighting features in each cluster. Therefore, in new feature space, similar samples are closer, and consequently, outliers are more detectable. Besides, selecting representatives based on cluster size keeps the same distribution as the original data in summarized data. INSIDENT can be used both as the preprocess approach before performing anomaly detection algorithms and anomaly detection algorithm. The experimental results on benchmark datasets prove a summary of the data can be a substitute for original data in the anomaly detection task.
Inertial Measurement Unit (IMU) sensors are present in everyday devices such as smartphones and fitness watches. As a result, the array of health-related research and applications that tap onto this data has been growing, but little attention has been devoted to the prediction of an individuals heart rate (HR) from IMU data, when undergoing a physical activity. Would that be even possible? If so, this could be used to design personalized sets of aerobic exercises, for instance. In this work, we show that it is viable to obtain accurate HR predictions from IMU data using Recurrent Neural Networks, provided only access to HR and IMU data from a short-lived, previously executed activity. We propose a novel method for initializing an RNNs hidden state vectors, using a specialized network that attempts to extract an embedding of the physical conditioning (PCE) of a subject. We show that using a discriminator in the training phase to help the model learn whether two PCEs belong to the same individual further reduces the prediction error. We evaluate the proposed model when predicting the HR of 23 subjects performing a variety of physical activities from IMU data available in public datasets (PAMAP2, PPG-DaLiA). For comparison, we use as baselines the only model specifically proposed for this task and an adapted state-of-the-art model for Human Activity Recognition (HAR), a closely related task. Our method, PCE-LSTM, yields over 10% lower mean absolute error. We demonstrate empirically that this error reduction is in part due to the use of the PCE. Last, we use the two datasets (PPG-DaLiA, WESAD) to show that PCE-LSTM can also be successfully applied when photoplethysmography (PPG) sensors are available, outperforming the state-of-the-art deep learning baselines by more than 30%.
We consider the following general hidden hubs model: an $n times n$ random matrix $A$ with a subset $S$ of $k$ special rows (hubs): entries in rows outside $S$ are generated from the probability distribution $p_0 sim N(0,sigma_0^2)$; for each row in $S$, some $k$ of its entries are generated from $p_1 sim N(0,sigma_1^2)$, $sigma_1>sigma_0$, and the rest of the entries from $p_0$. The problem is to identify the high-degree hubs efficiently. This model includes and significantly generalizes the planted Gaussian Submatrix Model, where the special entries are all in a $k times k$ submatrix. There are two well-known barriers: if $kgeq csqrt{nln n}$, just the row sums are sufficient to find $S$ in the general model. For the submatrix problem, this can be improved by a $sqrt{ln n}$ factor to $k ge csqrt{n}$ by spectral methods or combinatorial methods. In the variant with $p_0=pm 1$ (with probability $1/2$ each) and $p_1equiv 1$, neither barrier has been broken. We give a polynomial-time algorithm to identify all the hidden hubs with high probability for $k ge n^{0.5-delta}$ for some $delta >0$, when $sigma_1^2>2sigma_0^2$. The algorithm extends to the setting where planted entries might have different variances each at least as large as $sigma_1^2$. We also show a nearly matching lower bound: for $sigma_1^2 le 2sigma_0^2$, there is no polynomial-time Statistical Query algorithm for distinguishing between a matrix whose entries are all from $N(0,sigma_0^2)$ and a matrix with $k=n^{0.5-delta}$ hidden hubs for any $delta >0$. The lower bound as well as the algorithm are related to whether the chi-squared distance of the two distributions diverges. At the critical value $sigma_1^2=2sigma_0^2$, we show that the general hidden hubs problem can be solved for $kgeq csqrt n(ln n)^{1/4}$, improving on the naive row sum-based method.
The identification and quantification of markers in medical images is critical for diagnosis, prognosis and management of patients in clinical practice. Supervised- or weakly supervised training enables the detection of findings that are known a priori. It does not scale well, and a priori definition limits the vocabulary of markers to known entities reducing the accuracy of diagnosis and prognosis. Here, we propose the identification of anomalies in large-scale medical imaging data using healthy examples as a reference. We detect and categorize candidates for anomaly findings untypical for the observed data. A deep convolutional autoencoder is trained on healthy retinal images. The learned model generates a new feature representation, and the distribution of healthy retinal patches is estimated by a one-class support vector machine. Results demonstrate that we can identify pathologic regions in images without using expert annotations. A subsequent clustering categorizes findings into clinically meaningful classes. In addition the learned features outperform standard embedding approaches in a classification task.
Generating useful network summaries is a challenging and important problem with several applications like sensemaking, visualization, and compression. However, most of the current work in this space do not take human feedback into account while generating summaries. Consider an intelligence analysis scenario, where the analyst is exploring a similarity network between documents. The analyst can express her agreement/disagreement with the visualization of the network summary via iterative feedback, e.g. closing or moving documents (nodes) together. How can we use this feedback to improve the network summary quality? In this paper, we present NetReAct, a novel interactive network summarization algorithm which supports the visualization of networks induced by text corpora to perform sensemaking. NetReAct incorporates human feedback with reinforcement learning to summarize and visualize document networks. Using scenarios from two datasets, we show how NetReAct is successful in generating high-quality summaries and visualizations that reveal hidden patterns better than other non-trivial baselines.
The darknet markets are notorious black markets in cyberspace, which involve selling or brokering drugs, weapons, stolen credit cards, and other illicit goods. To combat illicit transactions in the cyberspace, it is important to analyze the behaviors of participants in darknet markets. Currently, many studies focus on studying the behavior of vendors. However, there is no much work on analyzing buyers. The key challenge is that the buyers are anonymized in darknet markets. For most of the darknet markets, We only observe the first and last digits of a buyers ID, such as ``a**b. To tackle this challenge, we propose a hidden buyer identification model, called UNMIX, which can group the transactions from one hidden buyer into one cluster given a transaction sequence from an anonymized ID. UNMIX is able to model the temporal dynamics information as well as the product, comment, and vendor information associated with each transaction. As a result, the transactions with similar patterns in terms of time and content group together as the subsequence from one hidden buyer. Experiments on the data collected from three real-world darknet markets demonstrate the effectiveness of our approach measured by various clustering metrics. Case studies on real transaction sequences explicitly show that our approach can group transactions with similar patterns into the same clusters.