
Hacking VMAF and VMAF NEG: vulnerability to different preprocessing methods

Published by: Anastasia Antsiferova
Publication date: 2021
Research field: Informatics Engineering
Language: English

Video-quality measurement plays a critical role in the development of video-processing applications. In this paper, we show how video preprocessing can artificially increase the popular quality metric VMAF and its tuning-resistant version, VMAF NEG. We propose a pipeline that tunes processing-algorithm parameters to increase VMAF by up to 218.8%. A subjective comparison revealed that for most preprocessing methods, a video's visual quality drops or stays unchanged. We also show that some preprocessing methods can increase VMAF NEG scores by up to 23.6%.
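
As a rough illustration of the kind of measurement loop such a pipeline relies on (not the authors' actual implementation), the sketch below applies a single sharpening step as an example preprocessing method and compares VMAF before and after. It assumes an ffmpeg build with libvmaf, hypothetical file names ref.mp4 and dist.mp4, and parses the pooled score from ffmpeg's log output, which is brittle across versions.

```python
# Minimal sketch, assuming ffmpeg is built with libvmaf and that
# ref.mp4 / dist.mp4 are placeholder reference and distorted videos.
import re
import subprocess

def vmaf_score(distorted: str, reference: str) -> float:
    """Run ffmpeg's libvmaf filter and parse the pooled VMAF score from its log."""
    cmd = ["ffmpeg", "-i", distorted, "-i", reference,
           "-lavfi", "libvmaf", "-f", "null", "-"]
    log = subprocess.run(cmd, capture_output=True, text=True).stderr
    match = re.search(r"VMAF score[:=]\s*([0-9.]+)", log)
    if match is None:
        raise RuntimeError("could not parse VMAF score from ffmpeg output")
    return float(match.group(1))

def sharpen(src: str, dst: str, amount: float) -> None:
    """Example preprocessing step: unsharp masking with a tunable strength."""
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vf", f"unsharp=5:5:{amount}", dst],
                   check=True)

if __name__ == "__main__":
    sharpen("dist.mp4", "dist_sharpened.mp4", amount=1.0)
    print("baseline VMAF:", vmaf_score("dist.mp4", "ref.mp4"))
    print("after sharpening:", vmaf_score("dist_sharpened.mp4", "ref.mp4"))
```

A parameter-tuning pipeline of the kind described in the abstract would wrap such a measurement in a search over the preprocessing parameters (here, `amount`).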


Read also

VMAF is a machine learning based video quality assessment method, originally designed for streaming applications, which combines multiple quality metrics and video features through SVM regression. It offers higher correlation with subjective opinions compared to many conventional quality assessment methods. In this paper we propose enhancements to VMAF through the integration of new video features and alternative quality metrics (selected from a diverse pool) alongside multiple model combination. The proposed combination approach enables training on multiple databases with varying content and distortion characteristics. Our enhanced VMAF method has been evaluated on eight HD video databases, and consistently outperforms the original VMAF model (0.6.1) and other benchmark quality metrics, exhibiting higher correlation with subjective ground truth data.
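
To make the fusion step concrete, here is a minimal, hypothetical sketch of SVM regression over a handful of per-video quality features using scikit-learn. The feature names, values, and subjective scores are placeholders, not data or model parameters from the paper.

```python
# Illustrative only: fuse several quality features into one predicted score
# with SVR, in the spirit of a VMAF-style learned combination.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Rows: videos; columns: placeholder features (e.g. fidelity, detail, motion, extra).
features = np.array([
    [0.92, 0.88, 14.2, 0.75],
    [0.61, 0.70, 30.1, 0.40],
    [0.45, 0.52, 22.7, 0.31],
    [0.83, 0.79, 10.5, 0.66],
])
mos = np.array([82.0, 55.0, 41.0, 74.0])  # placeholder subjective scores

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.5))
model.fit(features, mos)
print(model.predict(features[:1]))  # predicted quality for the first video
```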
JPEG is one of the most widely used image formats, but in some ways remains surprisingly unoptimized, perhaps because some natural optimizations would go outside the standard that defines JPEG. We show how to improve JPEG compression in a standard-compliant, backward-compatible manner, by finding improved default quantization tables. We describe a simulated annealing technique that has allowed us to find several quantization tables that perform better than the industry standard, in terms of both compressed size and image fidelity. Specifically, we derive tables that reduce the FSIM error by over 10% while improving compression by over 20% at quality level 95 in our tests; we also provide similar results for other quality levels. While we acknowledge our approach can in some images lead to visible artifacts under large magnification, we believe use of these quantization tables, or additional tables that could be found using our methodology, would significantly reduce JPEG file sizes with improved overall image quality.
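
The following is a compact, hypothetical sketch of the general idea: annealing a JPEG luminance quantization table with Pillow and scoring candidates by compressed size plus a pixel-wise error term used here as a crude stand-in for FSIM. It assumes a test image named test.png and is not the paper's implementation.

```python
# Simulated annealing over a JPEG quantization table (illustrative sketch).
import io
import math
import random

import numpy as np
from PIL import Image

img = Image.open("test.png").convert("L")  # hypothetical grayscale test image
ref = np.asarray(img, dtype=np.float64)

def encode(table):
    """Encode the image as JPEG using `table` as the quantization table."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", qtables=[table])
    buf.seek(0)
    return buf

def cost(table, size_weight=1e-4):
    """Lower is better: MSE (stand-in for FSIM error) plus a file-size penalty."""
    buf = encode(table)
    nbytes = len(buf.getvalue())
    decoded = np.asarray(Image.open(buf).convert("L"), dtype=np.float64)
    return float(np.mean((decoded - ref) ** 2)) + size_weight * nbytes

# Anneal: perturb one table entry at a time, always accept improvements,
# and accept worse tables with probability exp(-delta / T).
current = [16] * 64
cur_cost = best_cost = cost(current)
best = current[:]
temperature = 50.0
for _ in range(200):
    cand = current[:]
    i = random.randrange(64)
    cand[i] = min(255, max(1, cand[i] + random.choice([-4, -2, 2, 4])))
    c = cost(cand)
    if c < cur_cost or random.random() < math.exp((cur_cost - c) / temperature):
        current, cur_cost = cand, c
        if c < best_cost:
            best, best_cost = cand[:], c
    temperature *= 0.97

print("best cost found:", round(best_cost, 3))
```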
As a key component of talking face generation, lip movements generation determines the naturalness and coherence of the generated talking face video. Prior literature mainly focuses on speech-to-lip generation while there is a paucity in text-to-lip (T2L) generation. T2L is a challenging task and existing end-to-end works depend on the attention mechanism and autoregressive (AR) decoding manner. However, the AR decoding manner generates the current lip frame conditioned on frames generated previously, which inherently hinders the inference speed, and also has a detrimental effect on the quality of generated lip frames due to error propagation. This encourages the research of parallel T2L generation. In this work, we propose a novel parallel decoding model for high-speed and high-quality text-to-lip generation (HH-T2L). Specifically, we predict the duration of the encoded linguistic features and model the target lip frames conditioned on the encoded linguistic features with their duration in a non-autoregressive manner. Furthermore, we incorporate the structural similarity index loss and adversarial learning to improve perceptual quality of generated lip frames and alleviate the blurry prediction problem. Extensive experiments conducted on GRID and TCD-TIMIT datasets show that 1) HH-T2L generates lip movements with competitive quality compared with the state-of-the-art AR T2L model DualLip and exceeds the baseline AR model TransformerT2L by a notable margin, benefiting from the mitigation of the error propagation problem; and 2) exhibits distinct superiority in inference speed (an average speedup of 19$\times$ over DualLip on TCD-TIMIT).
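
A toy sketch of the duration-driven expansion step described above (often called a length regulator in non-autoregressive models): phoneme-level encodings are repeated according to predicted durations so that all target lip frames can be decoded in parallel. The encoder, duration predictor, and decoder are omitted, and all values below are illustrative rather than from the paper.

```python
# Expand phoneme-level encodings to frame level using predicted durations.
import numpy as np

np.random.seed(0)
phoneme_encodings = np.random.randn(5, 16)       # 5 encoded phonemes, 16-dim each
predicted_durations = np.array([3, 2, 4, 1, 5])  # assumed output of a duration predictor

# Each phoneme encoding is repeated for its duration, yielding one conditioning
# vector per target lip frame; a non-AR decoder can then produce all frames at once.
frame_conditions = np.repeat(phoneme_encodings, predicted_durations, axis=0)
print(frame_conditions.shape)  # (15, 16) -> 15 lip frames decodable in parallel
```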
This paper introduces a spike camera with a distinct video capture scheme and proposes two methods of decoding the spike stream for texture reconstruction. The spike camera captures light and accumulates the converted luminance intensity at each pixel. A spike is fired when the accumulated intensity exceeds the dispatch threshold. The spike stream generated by the camera indicates the luminance variation. Analyzing the patterns of the spike stream makes it possible to reconstruct the picture at any moment, which enables the playback of high-speed movement.
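
The following simplified sketch, with illustrative parameters rather than the camera's real ones, simulates the integrate-and-fire capture and one plausible decoding strategy: estimating per-pixel luminance from the inter-spike interval. It is meant only to make the accumulate-and-fire description above concrete.

```python
# Toy integrate-and-fire capture and inter-spike-interval decoding (illustrative).
import numpy as np

rng = np.random.default_rng(0)
threshold = 255.0                               # assumed dispatch threshold
luminance = rng.uniform(10, 200, size=(4, 4))   # ground-truth scene intensity
steps = 200

# Capture: each pixel accumulates intensity and fires a spike when the
# accumulator crosses the threshold, producing a binary stream of shape (T, H, W).
acc = np.zeros_like(luminance)
spikes = np.zeros((steps, *luminance.shape), dtype=bool)
for t in range(steps):
    acc += luminance
    fired = acc >= threshold
    spikes[t] = fired
    acc[fired] -= threshold

# Decode: mean inter-spike interval per pixel -> reconstructed intensity.
counts = spikes.sum(axis=0)
mean_interval = steps / np.maximum(counts, 1)
reconstruction = threshold / mean_interval
print(np.abs(reconstruction - luminance).max())  # small error for a static scene
```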
This paper targets at learning to score the figure skating sports videos. To address this task, we propose a deep architecture that includes two complementary components, i.e., Self-Attentive LSTM and Multi-scale Convolutional Skip LSTM. These two components can efficiently learn the local and global sequential information in each video. Furthermore, we present a large-scale figure skating sports video dataset -- FisV dataset. This dataset includes 500 figure skating videos with an average length of 2 minutes and 50 seconds. Each video is annotated with two scores from nine different referees, i.e., Total Element Score (TES) and Total Program Component Score (PCS). Our proposed model is validated on FisV and MIT-skate datasets. The experimental results show the effectiveness of our models in learning to score the figure skating videos.
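
As a bare-bones illustration of the self-attentive aggregation idea, the snippet below pools per-clip features into a single video-level descriptor that a score regressor could consume. The LSTM backbone and learned weights are replaced with random stand-ins; this is not the paper's model.

```python
# Self-attentive pooling of per-clip features into a video-level descriptor.
import numpy as np

rng = np.random.default_rng(0)
clip_features = rng.standard_normal((120, 256))  # 120 clips, 256-dim features each
w = rng.standard_normal(256)                     # stand-in for learned attention weights

scores = clip_features @ w                       # one relevance score per clip
attn = np.exp(scores - scores.max())
attn /= attn.sum()                               # softmax over clips
video_descriptor = attn @ clip_features          # weighted pooling -> (256,)
print(video_descriptor.shape)
```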
