
Novel tile segmentation scheme for omnidirectional video

Added by Jisheng Li
Publication date: 2021
Language: English





Regular omnidirectional video encoding techniques use map projection to flatten a scene from a sphere onto one or several 2D shapes. Common projection methods, including equirectangular and cubic projection, require varying degrees of interpolation and create a large number of non-information-carrying pixels, which waste bitrate. In this paper, we propose a tile-based omnidirectional video segmentation scheme that saves up to 28% of pixel area and 20% of BD-rate on average compared to the traditional equirectangular-projection-based approach.
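As a rough illustration of why projection-induced redundancy matters, the sketch below (an illustrative approximation, not code from the paper) estimates how many equirectangular-projection (ERP) pixels carry duplicated information because every latitude row is stretched to the full image width:

# Illustrative sketch, not from the paper: ERP gives every latitude row the same
# number of columns, but a latitude circle's true circumference shrinks by
# cos(latitude), so rows near the poles carry heavily duplicated information.
import numpy as np

def erp_redundant_fraction(height: int = 960, width: int = 1920) -> float:
    """Fraction of ERP pixels beyond what equator-density sampling would need."""
    # Latitude of each row centre, from near +pi/2 (north pole) to near -pi/2 (south pole).
    lat = (0.5 - (np.arange(height) + 0.5) / height) * np.pi
    needed = width * np.cos(lat)          # columns actually needed per row
    redundant = (width - needed).sum()    # oversampled pixels across all rows
    return redundant / (height * width)

print(f"redundant pixel fraction: {erp_redundant_fraction():.2%}")  # roughly 36%

Under this naive model roughly 36% of ERP pixels are redundant; the 28% pixel-area saving reported above comes from the paper's specific tiling scheme, which can only reclaim part of that bound.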

Related research


Dongxu Li, Xin Yu, Chenchen Xu (2020)
Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation. It requires models to recognize isolated sign words from videos. However, annotating WSLR data needs expert knowledge, thus limiting WSLR dataset acquisition. On the contrary, there are abundant subtitled sign news videos on the internet. Since these videos have no word-level annotation and exhibit a large domain gap from isolated signs, they cannot be directly used for training WSLR models. We observe that despite the existence of a large domain gap, isolated and news signs share the same visual concepts, such as hand gestures and body movements. Motivated by this observation, we propose a novel method that learns domain-invariant visual concepts and fertilizes WSLR models by transferring knowledge of subtitled news signs to them. To this end, we extract news signs using a base WSLR model, and then design a classifier jointly trained on news and isolated signs to coarsely align the features of the two domains. In order to learn domain-invariant features within each class and suppress domain-specific features, our method further resorts to an external memory to store the class centroids of the aligned news signs. We then design a temporal attention mechanism based on the learnt descriptor to improve recognition performance. Experimental results on standard WSLR datasets show that our method outperforms previous state-of-the-art methods significantly. We also demonstrate the effectiveness of our method on automatically localizing signs from sign news, achieving 28.1 AP@0.5.
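A rough sketch of the two ingredients named above, an external memory of per-class centroids and a descriptor-based temporal attention, might look like the following. The function names, the momentum update, and the use of cosine similarity are assumptions for illustration, not the authors' implementation:

import torch
import torch.nn.functional as F

def update_memory(memory, feats, labels, momentum=0.9):
    """memory: (num_classes, D) class centroids; feats: (N, D); labels: (N,) class ids."""
    # Assumed exponential-moving-average update of each class centroid.
    for c in labels.unique():
        centroid = feats[labels == c].mean(dim=0)
        memory[c] = momentum * memory[c] + (1 - momentum) * centroid
    return memory

def temporal_attention(frame_feats, class_descriptor):
    """frame_feats: (T, D) per-frame features; weight frames by similarity to the descriptor."""
    sims = F.cosine_similarity(frame_feats, class_descriptor.unsqueeze(0), dim=1)  # (T,)
    weights = torch.softmax(sims, dim=0)
    return (weights.unsqueeze(1) * frame_feats).sum(dim=0)  # attended video-level feature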
We present TDNet, a temporally distributed network designed for fast and accurate video semantic segmentation. We observe that features extracted from a certain high-level layer of a deep CNN can be approximated by composing features extracted from several shallower sub-networks. Leveraging the inherent temporal continuity in videos, we distribute these sub-networks over sequential frames. Therefore, at each time step, we only need to perform a lightweight computation to extract a sub-features group from a single sub-network. The full features used for segmentation are then recomposed by application of a novel attention propagation module that compensates for geometry deformation between frames. A grouped knowledge distillation loss is also introduced to further improve the representation power at both full and sub-feature levels. Experiments on Cityscapes, CamVid, and NYUD-v2 demonstrate that our method achieves state-of-the-art accuracy with significantly faster speed and lower latency.
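A minimal sketch of the temporally distributed idea (a toy stand-in, not the authors' TDNet code): each frame in a sliding window is processed by a different shallow sub-network, and the sub-features are recomposed into one full feature map for the current frame. The real method recomposes features with an attention propagation module that compensates for motion; a plain 1x1 fusion is used here only to keep the example short.

import torch
import torch.nn as nn

class TemporallyDistributedBackbone(nn.Module):
    """Toy stand-in: several shallow sub-networks jointly approximate one deep backbone."""

    def __init__(self, in_ch=3, sub_ch=32, num_subnets=4):
        super().__init__()
        self.subnets = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, sub_ch, 3, stride=2, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(sub_ch, sub_ch, 3, stride=2, padding=1),
                nn.ReLU(inplace=True),
            )
            for _ in range(num_subnets)
        )
        self.fuse = nn.Conv2d(sub_ch * num_subnets, sub_ch * num_subnets, 1)

    def forward(self, frames):
        # frames: list of the last num_subnets frames (oldest first), each (B, 3, H, W).
        # At each new time step only one sub-feature has to be recomputed; the rest can
        # be cached from previous frames. Shown uncached here for clarity.
        feats = [net(f) for net, f in zip(self.subnets, frames)]
        return self.fuse(torch.cat(feats, dim=1))

frames = [torch.randn(1, 3, 256, 512) for _ in range(4)]
print(TemporallyDistributedBackbone()(frames).shape)  # torch.Size([1, 128, 64, 128])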
Omnidirectional applications are immersive and highly interactive, and can improve the efficiency of remote collaborative work among factory workers. The transmission of omnidirectional video (OV) is the most important step in implementing virtual remote collaboration. Compared with ordinary video transmission, OV transmission requires more bandwidth, which remains a heavy burden even under 5G networks. Tile-based schemes can reduce bandwidth consumption; however, they neither obtain the field-of-view (FOV) area accurately nor easily support real-time OV streaming. In this paper, we propose an edge-assisted viewport adaptive scheme (EVAS-OV) to reduce bandwidth consumption during real-time OV transmission. First, EVAS-OV uses a Gated Recurrent Unit (GRU) model to predict users' viewports. Then, users are divided into multicast clusters, further reducing the consumption of computing resources. EVAS-OV reprojects OV frames to obtain users' FOV areas accurately at the pixel level and adopts a redundancy strategy to reduce the impact of viewport prediction errors. All computing tasks are offloaded to edge servers to reduce transmission delay and improve bandwidth utilization. Experimental results show that EVAS-OV can save more than 60% of bandwidth compared with a non-viewport-adaptive scheme. Compared with a two-layer viewport-adaptive scheme, EVAS-OV still saves 30% of bandwidth.
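The GRU-based viewport prediction step could be sketched as follows. The layer sizes, input representation (yaw, pitch) and class name are assumptions for illustration, not EVAS-OV's released code:

import torch
import torch.nn as nn

class ViewportPredictor(nn.Module):
    """Predict the next viewport orientation (yaw, pitch) from a short history."""

    def __init__(self, hidden=64):
        super().__init__()
        self.gru = nn.GRU(input_size=2, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)

    def forward(self, history):
        # history: (batch, seq_len, 2) past head orientations in radians.
        out, _ = self.gru(history)
        return self.head(out[:, -1])  # prediction from the last hidden state

model = ViewportPredictor()
past = torch.randn(8, 30, 2)          # 8 users, 30 sampled orientations each
next_viewport = model(past)           # (8, 2): predicted (yaw, pitch), used to select FOV tiles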
Peidong Liu, Zibin He, Xiyu Yan (2021)
Compared with tedious per-pixel mask annotation, it is much easier to annotate data by clicks, which cost only a few seconds per image. However, using clicks to learn a video semantic segmentation model has not been explored before. In this work, we propose an effective weakly supervised video semantic segmentation pipeline with click annotations, called WeClick, which saves laborious annotation effort by segmenting an instance of a semantic class with only a single click. Since detailed semantic information is not captured by clicks, directly training with click labels leads to poor segmentation predictions. To mitigate this problem, we design a novel memory flow knowledge distillation strategy that exploits temporal information (named memory flow) in abundant unlabeled video frames by distilling the neighboring predictions to the target frame via estimated motion. Moreover, we adopt vanilla knowledge distillation for model compression. In this way, WeClick learns compact video semantic segmentation models from low-cost click annotations during training, yet yields real-time, accurate models at inference. Experimental results on Cityscapes and CamVid show that WeClick outperforms state-of-the-art methods, improves performance over the baseline by 10.24% mIoU, and achieves real-time execution.
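The core of such a motion-guided distillation can be sketched as follows: warp the prediction of a neighbouring frame onto the target frame with an estimated flow field, then penalise disagreement with a KL divergence. This is an assumed, generic formulation for illustration, not WeClick's implementation:

import torch
import torch.nn.functional as F

def warp_with_flow(logits, flow):
    """Warp (B, C, H, W) logits with a backward flow (B, 2, H, W) given in pixels."""
    b, _, h, w = logits.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=logits.device),
        torch.arange(w, device=logits.device),
        indexing="ij",
    )
    # Build normalised sampling coordinates in [-1, 1], as expected by grid_sample.
    gx = 2.0 * (xs.float().unsqueeze(0) + flow[:, 0]) / (w - 1) - 1.0
    gy = 2.0 * (ys.float().unsqueeze(0) + flow[:, 1]) / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                       # (B, H, W, 2)
    return F.grid_sample(logits, grid, align_corners=True)

def memory_flow_distill_loss(target_logits, neighbor_logits, flow, T=1.0):
    """KL divergence between the target-frame prediction and the motion-warped neighbour."""
    warped = warp_with_flow(neighbor_logits, flow)
    p_teacher = F.softmax(warped / T, dim=1)
    log_p_student = F.log_softmax(target_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T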
Vision-based sign language recognition aims at helping deaf people to communicate with others. However, most existing sign language datasets are limited to a small number of words. Due to the limited vocabulary size, models learned from those datasets cannot be applied in practice. In this paper, we introduce a new large-scale Word-Level American Sign Language (WLASL) video dataset, containing more than 2000 words performed by over 100 signers. This dataset will be made publicly available to the research community. To our knowledge, it is by far the largest public ASL dataset facilitating word-level sign recognition research. Based on this new large-scale dataset, we are able to experiment with several deep learning methods for word-level sign recognition and evaluate their performance in large-scale scenarios. Specifically, we implement and compare two different models, i.e., (i) a holistic visual appearance-based approach and (ii) a 2D human pose-based approach. Both models are valuable baselines that will benefit the community for method benchmarking. Moreover, we propose a novel pose-based temporal graph convolution network (Pose-TGCN) that models spatial and temporal dependencies in human pose trajectories simultaneously, which further boosts the performance of the pose-based method. Our results show that pose-based and appearance-based models achieve comparable performance of up to 66% top-10 accuracy on 2,000 words/glosses, demonstrating the validity and challenges of our dataset. Our dataset and baseline deep models are available at https://dxli94.github.io/WLASL/.
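A spatial-temporal graph convolution over pose keypoints, of the kind Pose-TGCN builds on, can be sketched in a single block: a graph convolution mixes joints along skeleton edges, then a temporal convolution mixes information across frames. The block structure, kernel sizes, and placeholder adjacency below are assumptions for illustration, not the released Pose-TGCN code:

import torch
import torch.nn as nn

class STGraphConvBlock(nn.Module):
    """One spatial-temporal graph convolution block over pose keypoints."""

    def __init__(self, in_ch, out_ch, adjacency, t_kernel=9):
        super().__init__()
        self.register_buffer("A", adjacency)           # (V, V) skeleton graph
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.temporal = nn.Conv2d(out_ch, out_ch, kernel_size=(t_kernel, 1),
                                  padding=(t_kernel // 2, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (batch, channels, frames, joints)
        x = self.spatial(x)                              # mix channels per joint
        x = torch.einsum("bctv,vw->bctw", x, self.A)     # propagate along skeleton edges
        return self.relu(self.temporal(x))               # mix information across frames

A = torch.eye(13)                                       # placeholder adjacency (no edges)
block = STGraphConvBlock(2, 64, A)
out = block(torch.randn(4, 2, 64, 13))                  # -> torch.Size([4, 64, 64, 13])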
