Identifying players in sports videos by recognizing their jersey numbers is a challenging computer vision task. We design and implement a multi-task learning network for jersey number recognition. To train the network, two output label representations are used: (1) holistic, which treats the entire jersey number as a single class, and (2) digit-wise, which treats the two digits of a jersey number as two separate classes. The proposed network learns both holistic and digit-wise representations through a multi-task loss function, and the optimal weights assigned to the holistic and digit-wise losses are determined through an ablation study. Experimental results demonstrate that the proposed multi-task learning network outperforms the constituent holistic and digit-wise single-task learning networks.
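A minimal sketch of the weighted multi-task loss described above, assuming PyTorch. The class counts and default loss weights are illustrative assumptions, not the paper's tuned values; the paper selects the weights through an ablation study.

```python
import torch
import torch.nn as nn

class JerseyNumberLoss(nn.Module):
    """Combined holistic + digit-wise loss (illustrative weights)."""
    def __init__(self, w_holistic=1.0, w_digit=1.0):
        super().__init__()
        self.w_h = w_holistic
        self.w_d = w_digit
        self.ce = nn.CrossEntropyLoss()

    def forward(self, holistic_logits, digit1_logits, digit2_logits,
                holistic_label, digit1_label, digit2_label):
        # Holistic head: one class per whole jersey number (e.g., 0-99).
        loss_h = self.ce(holistic_logits, holistic_label)
        # Digit-wise heads: one class per digit position; single-digit
        # numbers would need an extra "empty" digit class (assumption).
        loss_d = (self.ce(digit1_logits, digit1_label)
                  + self.ce(digit2_logits, digit2_label))
        return self.w_h * loss_h + self.w_d * loss_d
```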
Puck localization is an important problem in ice hockey video analytics, useful for analyzing the game, determining play location, and assessing puck possession. The problem is challenging due to the small size of the puck, severe motion blur caused by high puck velocity, and occlusions by players and boards. In this paper, we introduce and implement a network for puck localization in broadcast hockey video. The network leverages expert NHL play-by-play annotations and uses temporal context to locate the puck. Player locations are incorporated through an attention mechanism that encodes player positions as a Gaussian-based spatial heatmap. Since events on the rink and puck location are related, we also perform event recognition by augmenting the puck localization network with an event recognition head and training the network through multi-task learning. Experimental results demonstrate that the network localizes the puck with an AUC of $73.1\%$ on the test set. Puck location can be inferred in 720p broadcast videos at $5$ frames per second. It is also demonstrated that multi-task learning with puck location improves event recognition accuracy.
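A minimal sketch of encoding player positions as a Gaussian-based spatial heatmap, the input to the attention mechanism described above. The heatmap resolution and the Gaussian width `sigma` are illustrative assumptions.

```python
import numpy as np

def player_heatmap(player_xy, height, width, sigma=8.0):
    """Render a Gaussian bump at each (x, y) player position in pixels."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for px, py in player_xy:
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)  # keep the strongest bump per pixel
    return heatmap
```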
In this work, an automatic and simple framework for hockey ice-rink localization from broadcast videos is introduced. First, the video is split into shots by hierarchically partitioning the frames and thresholding based on their histograms. To localize the frames on the ice-rink model, a ResNet18-based regressor is implemented and trained, which regresses four control points on the model frame by frame. Frame-by-frame regression, however, causes the projection to jitter across the video. To overcome this, in the inference phase, the trajectories of the control points on the ice-rink model are smoothed, over all consecutive frames of a given video-shot, by convolving a Hann window with the predicted coordinates. Finally, the smoothed homography matrix is computed by applying the direct linear transform (DLT) to the four pairs of corresponding points. A hockey dataset is gathered for training and testing the regressor. The results show that this simple and complete procedure successfully localizes the hockey ice rink and eliminates jittering without degrading the accuracy of the homography estimation.
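A minimal sketch of the inference-time smoothing step described above. Here `model_points` holds the regressed control points on the rink model for every frame of one video-shot and `frame_points` are the four fixed image locations they correspond to; the window length is an illustrative assumption, and OpenCV's exact four-point solver stands in for the DLT step.

```python
import numpy as np
import cv2

def smoothed_homographies(model_points, frame_points, win=15):
    """model_points: (T, 4, 2) predicted points; frame_points: (4, 2)."""
    window = np.hanning(win)
    window /= window.sum()  # normalize so smoothing preserves scale
    smoothed = np.empty_like(model_points)
    for i in range(4):        # each control point
        for j in range(2):    # x and y coordinates
            smoothed[:, i, j] = np.convolve(model_points[:, i, j],
                                            window, mode="same")
    # One frame-to-model homography per frame from the four smoothed pairs.
    return [cv2.getPerspectiveTransform(frame_points.astype(np.float32),
                                        smoothed[t].astype(np.float32))
            for t in range(model_points.shape[0])]
```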
Learning to predict multiple attributes of a pedestrian is a multi-task learning problem. To share feature representations between two individual task networks, conventional methods such as the Cross-Stitch and Sluice networks learn a linear combination of features or feature subspaces. However, a linear combination cannot capture complex interdependencies between channels, and spatial information exchange is largely overlooked. In this paper, we propose a novel Co-Attentive Sharing (CAS) module which extracts discriminative channels and spatial regions for more effective feature sharing in multi-task learning. The module consists of three branches, which leverage different channels for inter-task feature fusion, attention generation, and task-specific feature enhancement, respectively. Experiments on two pedestrian attribute recognition datasets show that our module outperforms conventional sharing units and achieves results superior to state-of-the-art approaches on multiple metrics.
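A heavily simplified sketch in the spirit of the idea above: features from two task branches are fused through learned channel and spatial attention rather than a plain linear combination. This does not reproduce the exact three-branch CAS design; all layer shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentiveShare(nn.Module):
    """Channel- and spatially-attended feature sharing between two tasks."""
    def __init__(self, channels):
        super().__init__()
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2 * channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, feat_a, feat_b):
        joint = torch.cat([feat_a, feat_b], dim=1)
        c = self.channel_att(joint)          # (N, C, 1, 1) channel weights
        s = self.spatial_att(joint)          # (N, 1, H, W) spatial weights
        shared = c * s * (feat_a + feat_b)   # attended shared feature
        # Each task keeps its own feature plus the attended shared part.
        return feat_a + shared, feat_b + shared
```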
Analyzing human affect is vital for human-computer interaction systems. Most existing methods are developed for restricted scenarios and are not practical for in-the-wild settings. The Affective Behavior Analysis in-the-wild (ABAW) 2021 Competition provides a benchmark for this in-the-wild problem. In this paper, we introduce a multi-modal, multi-task learning method that uses both visual and audio information. We train the model with both AU and expression annotations and apply a sequence model to capture temporal associations between video frames. We achieve an AU score of 0.712 and an expression score of 0.477 on the validation set, demonstrating the effectiveness of our approach.
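A minimal sketch of the multi-modal, multi-task setup described above: per-frame visual and audio features are fused, passed through a sequence model, and supervised by both AU and expression heads. The feature dimensions and the choice of a GRU as the sequence model are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiModalAffect(nn.Module):
    """Fused visual+audio features -> sequence model -> two task heads."""
    def __init__(self, vis_dim=512, aud_dim=128, hidden=256,
                 num_aus=12, num_expressions=7):
        super().__init__()
        self.rnn = nn.GRU(vis_dim + aud_dim, hidden, batch_first=True)
        self.au_head = nn.Linear(hidden, num_aus)            # multi-label AUs
        self.expr_head = nn.Linear(hidden, num_expressions)  # one expression

    def forward(self, vis_feats, aud_feats):
        # vis_feats: (N, T, vis_dim); aud_feats: (N, T, aud_dim)
        x, _ = self.rnn(torch.cat([vis_feats, aud_feats], dim=-1))
        return self.au_head(x), self.expr_head(x)
```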
This paper presents a neural network based method, Multi-Task Affect Net (MTANet), submitted to the Affective Behavior Analysis in-the-Wild Challenge at FG2020. MTANet is a multi-task network built on SE-ResNet modules. Through multi-task learning, the network simultaneously estimates and recognizes three quantified affective models: valence and arousal, action units (AUs), and seven basic emotions. MTANet achieves Concordance Correlation Coefficient (CCC) rates of 0.28 and 0.34 for valence and arousal, respectively, and F1-scores of 0.427 and 0.32 for AU detection and categorical emotion classification, respectively.
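For reference, the CCC metric reported above can be computed as follows. This is the standard definition of the metric, not code from the paper.

```python
import numpy as np

def ccc(pred, target):
    """Concordance Correlation Coefficient between two 1-D sequences."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    mx, my = pred.mean(), target.mean()
    vx, vy = pred.var(), target.var()
    cov = ((pred - mx) * (target - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```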