No Arabic abstract
Spreadsheet table detection is the task of detecting all tables on a given sheet and locating their respective ranges. Automatic table detection is a key enabling technique and an initial step in spreadsheet data intelligence. However, the detection task is challenged by the diversity of table structures and table layouts on the spreadsheet. Considering the analogy between a cell matrix as spreadsheet and a pixel matrix as image, and encouraged by the successful application of Convolutional Neural Networks (CNN) in computer vision, we have developed TableSense, a novel end-to-end framework for spreadsheet table detection. First, we devise an effective cell featurization scheme to better leverage the rich information in each cell; second, we develop an enhanced convolutional neural network model for table detection to meet the domain-specific requirement on precise table boundary detection; third, we propose an effective uncertainty metric to guide an active learning based smart sampling algorithm, which enables the efficient build-up of a training dataset with 22,176 tables on 10,220 sheets with broad coverage of diverse table structures and layouts. Our evaluation shows that TableSense is highly effective with 91.3% recall and 86.5% precision in EoB-2 metric, a significant improvement over both the current detection algorithm that are used in commodity spreadsheet tools and state-of-the-art convolutional neural networks in computer vision.
Customers make a lot of reviews on online shopping websites every day, e.g., Amazon and Taobao. Reviews affect the buying decisions of customers, meanwhile, attract lots of spammers aiming at misleading buyers. Xianyu, the largest second-hand goods app in China, suffering from spam reviews. The anti-spam system of Xianyu faces two major challenges: scalability of the data and adversarial actions taken by spammers. In this paper, we present our technical solutions to address these challenges. We propose a large-scale anti-spam method based on graph convolutional networks (GCN) for detecting spam advertisements at Xianyu, named GCN-based Anti-Spam (GAS) model. In this model, a heterogeneous graph and a homogeneous graph are integrated to capture the local context and global context of a comment. Offline experiments show that the proposed method is superior to our baseline model in which the information of reviews, features of users and items being reviewed are utilized. Furthermore, we deploy our system to process million-scale data daily at Xianyu. The online performance also demonstrates the effectiveness of the proposed method.
Information extraction from semi-structured webpages provides valuable long-tailed facts for augmenting knowledge graph. Relational Web tables are a critical component containing additional entities and attributes of rich and diverse knowledge. However, extracting knowledge from relational tables is challenging because of sparse contextual information. Existing work linearize table cells and heavily rely on modifying deep language models such as BERT which only captures related cells information in the same table. In this work, we propose a novel relational table representation learning approach considering both the intra- and inter-table contextual information. On one hand, the proposed Table Convolutional Network model employs the attention mechanism to adaptively focus on the most informative intra-table cells of the same row or column; and, on the other hand, it aggregates inter-table contextual information from various types of implicit connections between cells across different tables. Specifically, we propose three novel aggregation modules for (i) cells of the same value, (ii) cells of the same schema position, and (iii) cells linked to the same page topic. We further devise a supervised multi-task training objective for jointly predicting column type and pairwise column relation, as well as a table cell recovery objective for pre-training. Experiments on real Web table datasets demonstrate our method can outperform competitive baselines by +4.8% of F1 for column type prediction and by +4.1% of F1 for pairwise column relation prediction.
We devise an autoencoder based strategy to facilitate anomaly detection for boosted jets, employing Graph Neural Networks (GNNs) to do so. To overcome known limitations of GNN autoencoders, we design a symmetric decoder capable of simultaneously reconstructing edge features and node features. Focusing on latent space based discriminators, we find that such setups provide a promising avenue to isolate new physics and competing SM signatures from sensitivity-limiting QCD jet contributions. We demonstrate the flexibility and broad applicability of this approach using examples of $W$ bosons, top quarks, and exotic hadronically-decaying exotic scalar bosons.
We develop an algorithm which exceeds the performance of board certified cardiologists in detecting a wide range of heart arrhythmias from electrocardiograms recorded with a single-lead wearable monitor. We build a dataset with more than 500 times the number of unique patients than previously studied corpora. On this dataset, we train a 34-layer convolutional neural network which maps a sequence of ECG samples to a sequence of rhythm classes. Committees of board-certified cardiologists annotate a gold standard test set on which we compare the performance of our model to that of 6 other individual cardiologists. We exceed the average cardiologist performance in both recall (sensitivity) and precision (positive predictive value).
We develop a reliable, fully automatic method for the detection of coronal holes, that provides consistent full-disk segmentation maps over the full solar cycle and can perform in real-time. We use a convolutional neural network to identify the boundaries of coronal holes from the seven EUV channels of the Atmospheric Imaging Assembly (AIA) as well as from line-of-sight magnetograms from the Helioseismic and Magnetic Imager (HMI) onboard the Solar Dynamics Observatory (SDO). For our primary model (Coronal Hole RecOgnition Neural Network Over multi-Spectral-data; CHRONNOS) we use a progressively growing network approach that allows for efficient training, provides detailed segmentation maps and takes relations across the full solar-disk into account. We provide a thorough evaluation for performance, reliability and consistency by comparing the model results to an independent manually curated test set. Our model shows good agreement to the manual labels with an intersection-over-union (IoU) of 0.63. From the total of 261 coronal holes with an area $>1.5cdot10^{10}$ km$^2$ identified during the time range 11/2010 - 12/2016, 98.1% were correctly detected by our model. The evaluation over almost the full solar cycle no. 24 shows that our model provides reliable coronal hole detections, independent of the level of solar activity. From the direct comparison over short time scales of days to weeks, we find that our model exceeds human performance in terms of consistency and reliability. In addition, we train our model to identify coronal holes from each channel separately and show that the neural network provides the best performance with the combined channel information, but that coronal hole segmentation maps can be also obtained solely from line-of-sight magnetograms.