Steganography comprises the mechanics of hiding data in a host medium that may be publicly available. While previous works focused on unimodal setups (e.g., hiding images in images, or hiding audio in audio), PixInWav targets the multimodal case of hiding images in audio. To this end, we propose a novel residual architecture operating on top of short-time discrete cosine transform (STDCT) audio spectrograms. Among our results, we find that the residual audio steganography setup we propose allows the hidden image to be encoded independently of the host audio without compromising quality. Accordingly, while previous works require both host and hidden signals to hide a signal, PixInWav can encode images offline -- which can later be hidden, in a residual fashion, into any audio signal. Finally, we test our scheme in a lab setting to transmit images over airwaves from a loudspeaker to a microphone, verifying our theoretical insights and obtaining promising results.
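A minimal sketch of the residual idea described above, assuming a framewise type-II DCT as the STDCT and a placeholder image encoder; the function encode_image, the frame parameters, and the residual weight beta are illustrative assumptions, not the actual PixInWav architecture:

import numpy as np
from scipy.fft import dct, idct

def stdct(signal, frame_len=1024, hop=512):
    # Short-time DCT: frame the signal and apply a type-II DCT per frame.
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.stack([dct(f, type=2, norm='ortho') for f in frames])

def istdct(spec, frame_len=1024, hop=512):
    # Inverse short-time DCT via overlap-add (approximate at the edges).
    out = np.zeros(hop * (len(spec) - 1) + frame_len)
    for k, row in enumerate(spec):
        out[k * hop:k * hop + frame_len] += 0.5 * idct(row, type=2, norm='ortho')
    return out

def hide(host_audio, image, encode_image, beta=0.01):
    # Residual hiding: the image is encoded independently of the host,
    # then added as a small perturbation of the host spectrogram.
    spec = stdct(host_audio)
    residual = encode_image(image, spec.shape)   # hypothetical image encoder
    return istdct(spec + beta * residual)        # residual embedding

# Toy usage with a random "encoder" standing in for the learned network.
host = np.random.randn(16000)
img = np.random.rand(64, 64)
container = hide(host, img, lambda im, shape: np.resize(im, shape))

Because the residual is computed from the image alone, the same encoded image can be added to any host spectrogram, which is the property the abstract highlights.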
The widespread application of audio communication technologies has accelerated the flow of audio data across the Internet, making it a popular carrier for covert communication. In this paper, we present a cross-modal steganography method for hiding image content in audio carriers while preserving the perceptual fidelity of the cover audio. In our framework, two multi-stage networks are designed: the first network encodes the decreasing multi-level residual errors into different audio subsequences with the corresponding stage sub-networks, while the second network decodes the residual errors from the modified carrier with the corresponding stage sub-networks to produce the final revealed results. The multi-stage design of the proposed framework not only makes control of the payload capacity more flexible, but also makes hiding easier because the residual errors become progressively sparser. Qualitative experiments suggest that modifications to the carrier are unnoticeable by human listeners and that the decoded images are highly intelligible.
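The multi-stage residual scheme can be illustrated with the following hedged sketch; Stage, its stand-in convolutions, and the carrier segmentation are hypothetical placeholders for the paper's sub-networks, and the point is only that each stage hides the error left over by the previous stages, so later stages see progressively sparser residuals:

import torch
import torch.nn as nn

class Stage(nn.Module):
    # Stand-in for one hiding/revealing stage; the real sub-networks are deeper.
    def __init__(self):
        super().__init__()
        self.hide = nn.Conv1d(2, 1, kernel_size=3, padding=1)
        self.reveal = nn.Conv1d(1, 1, kernel_size=3, padding=1)

    def forward(self, carrier_seg, residual):
        stego = self.hide(torch.cat([carrier_seg, residual], dim=1))
        return stego, self.reveal(stego)

def multi_stage_hide(carrier_segments, secret, stages):
    # Stage k hides the residual error remaining after stages 1..k-1,
    # so the data to hide becomes progressively sparser.
    residual = secret
    stego_segments, recovered = [], torch.zeros_like(secret)
    for seg, stage in zip(carrier_segments, stages):
        stego, revealed = stage(seg, residual)
        stego_segments.append(stego)
        recovered = recovered + revealed
        residual = secret - recovered
    return stego_segments, recovered

# Toy usage: a flattened secret image hidden across three carrier segments.
secret = torch.rand(1, 1, 4096)
segments = [torch.randn(1, 1, 4096) for _ in range(3)]
stegos, approx = multi_stage_hide(segments, secret, [Stage() for _ in range(3)])

Dropping later stages simply lowers the payload (and the reconstruction fidelity), which is one way the capacity control mentioned above could be made flexible.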
As an important component of multimedia analysis tasks, audio classification aims to discriminate between different audio signal types and has received intensive attention due to its wide applications. Generally speaking, the raw signal can be transformed into various representations (such as the Short-Time Fourier Transform and Mel Frequency Cepstral Coefficients), and the information carried by different representations can be complementary. Ensembling models trained on different representations can greatly boost classification performance; however, making inference with a large number of models is cumbersome and computationally expensive. In this paper, we propose a novel end-to-end collaborative learning framework for the audio classification task. The framework takes multiple representations as input to train the models in parallel. The complementary information provided by different representations is shared through knowledge distillation. Consequently, the performance of each model can be significantly improved without increasing the computational overhead at inference. Extensive experimental results demonstrate that the proposed approach improves classification performance and achieves state-of-the-art results on both acoustic scene classification and general audio tagging tasks.
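A hedged sketch of the shared-knowledge objective, assuming a standard mutual knowledge-distillation loss between the per-representation models; the temperature T, the weight alpha, and the pairwise KL formulation are illustrative assumptions rather than the paper's exact loss:

import torch
import torch.nn.functional as F

def collaborative_loss(logits_per_model, labels, T=2.0, alpha=0.5):
    # Supervised cross-entropy for every model plus pairwise distillation,
    # so complementary information is shared without an inference-time ensemble.
    ce = sum(F.cross_entropy(z, labels) for z in logits_per_model)
    kd = 0.0
    for i, zi in enumerate(logits_per_model):
        for j, zj in enumerate(logits_per_model):
            if i != j:
                kd = kd + T * T * F.kl_div(F.log_softmax(zi / T, dim=1),
                                           F.softmax(zj / T, dim=1).detach(),
                                           reduction='batchmean')
    return ce + alpha * kd

# Toy usage: two "models" (e.g., STFT-based and MFCC-based) on a 10-class task.
logits = [torch.randn(8, 10, requires_grad=True) for _ in range(2)]
loss = collaborative_loss(logits, torch.randint(0, 10, (8,)))

At test time only one of the models needs to be kept, which is how the ensemble-level accuracy can be approached without the ensemble-level inference cost.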
The frequent exchange of multimedia information in the present era creates an increasing demand for copyright protection. In this work, we propose a novel audio zero-watermarking technique based on the graph Fourier transform for enhancing robustness with respect to copyright protection. In this approach, a combined shift operator is used to construct the graph signal, upon which graph Fourier analysis is performed. The selected maximum-absolute graph Fourier coefficients, which characterize each audio segment, are then encoded into a binary feature sequence using the K-means algorithm. Finally, the resulting binary feature sequence is XOR-ed with the watermark binary sequence to realize the embedding of the zero-watermark. Experimental studies show that the proposed approach resists common and synchronization attacks more effectively than existing state-of-the-art methods.
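The pipeline can be sketched roughly as follows under simplifying assumptions; the cyclic-adjacency stand-in for the combined shift operator, the segment length, and the K-means binarization details are placeholders, not the paper's construction:

import numpy as np
from sklearn.cluster import KMeans

def zero_watermark(audio, watermark_bits, seg_len=256):
    # One feature per watermark bit: the maximum absolute GFT coefficient
    # of the corresponding audio segment.
    features = []
    for i in range(len(watermark_bits)):
        seg = audio[i * seg_len:(i + 1) * seg_len]
        A = np.roll(np.eye(seg_len), 1, axis=1)   # cyclic adjacency (stand-in)
        S = 0.5 * (A + A.T)                       # simplified shift operator
        _, U = np.linalg.eigh(S)                  # GFT basis = eigenvectors of S
        features.append(np.max(np.abs(U.T @ seg)))
    # Binarize the feature sequence with 2-class K-means
    # (the cluster with the larger centroid maps to bit 1).
    km = KMeans(n_clusters=2, n_init=10).fit(np.array(features).reshape(-1, 1))
    bits = (km.labels_ == np.argmax(km.cluster_centers_)).astype(np.uint8)
    # XOR with the watermark: the zero-watermark is stored, the audio is untouched.
    return bits ^ np.asarray(watermark_bits, dtype=np.uint8)

# Toy usage on random audio and a 32-bit watermark.
audio = np.random.randn(32 * 256)
zw = zero_watermark(audio, np.random.randint(0, 2, 32))

Because the host audio is never modified (the watermark is bound to it only through the XOR-ed feature sequence), fidelity is preserved by construction and robustness depends entirely on how stable the extracted features are under attack.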
Steganography represents the art of unobtrusively concealing a secret message within some cover data. The key scope of this work is visual steganography techniques that hide a full-sized color image or video within another. A majority of existing works are devoted to the image case, where both secret and cover data are images. We empirically validate that image steganography models do not naturally extend to the video case (i.e., hiding a video within another video), mainly because they completely ignore the temporal redundancy between consecutive video frames. Our work proposes a novel solution to the problem of video steganography. The technical contributions are two-fold. First, the residual between two consecutive frames tends to zero at most pixels, and hiding such highly sparse data is significantly easier than hiding the original frames. Motivated by this fact, we propose to explicitly consider inter-frame residuals rather than blindly applying an image steganography model to every video frame. Specifically, our model contains two branches, one of which is specially designed for hiding the inter-frame difference in a cover video frame, while the other hides the original secret frame. A simple thresholding method determines which branch a secret video frame should choose. When revealing the concealed secret video, two decoders are devised, revealing the difference or the frame, respectively. Second, we develop the model based on deep convolutional neural networks, which is the first of its kind in the video steganography literature. In experiments, comprehensive evaluations are conducted to compare our model with both the classic least significant bit (LSB) method and pure image steganography models. All results strongly suggest that the proposed model enjoys advantages over previous methods. We also carefully investigate the key factors behind the success of our deep video steganography model.
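A hedged sketch of the branch-selection logic, where hide_residual, hide_frame, and the threshold tau stand in for the paper's two hiding branches and its thresholding rule:

import torch

def hide_video(cover_frames, secret_frames, hide_residual, hide_frame, tau=0.05):
    # Per secret frame, hide either the sparse inter-frame difference or the
    # full frame, depending on how large the residual is.
    stego, prev = [], None
    for cover, secret in zip(cover_frames, secret_frames):
        diff = secret - prev if prev is not None else secret
        if prev is not None and diff.abs().mean() < tau:
            stego.append(hide_residual(cover, diff))   # residual branch
        else:
            stego.append(hide_frame(cover, secret))    # full-frame branch
        prev = secret
    return stego

# Toy usage with identity-like "networks" standing in for the two branches.
frames = [torch.rand(3, 64, 64) for _ in range(4)]
stego = hide_video(frames, frames, lambda c, d: c + 0.01 * d, lambda c, s: c + 0.01 * s)

The revealing side mirrors this logic: a difference decoder reconstructs each frame by adding the revealed residual to the previously revealed frame, while a frame decoder recovers key frames directly.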
This paper proposes a new steganographic scheme relying on the principle of cover-source switching, the key idea being that the embedding should switch from one cover-source to another. The proposed implementation, called Natural Steganography, considers the sensor noise naturally present in raw images and uses the principle that, by adding a specific noise, the steganographic embedding mimics a change of ISO sensitivity. The embedding methodology consists of 1) perturbing the image in the raw domain, 2) modeling the perturbation in the processed domain, and 3) embedding the payload in the processed domain. We show that this methodology is easily tractable whenever the processing pipeline is known and enables the embedding of large yet undetectable payloads. We also show that previously used heuristics, such as synchronization of embedding changes or detectability after rescaling, can be explained by operations such as color demosaicing and down-scaling kernels, respectively.
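A minimal sketch of step 1, assuming a simplified heteroscedastic sensor-noise model (variance a*x + b) whose parameters below are illustrative rather than measured ISO characteristics; the payload coding in the processed domain is omitted:

import numpy as np

def embed_iso_mimic(raw, a_low=0.1, b_low=2.0, a_high=0.2, b_high=5.0, rng=None):
    # Perturb a RAW image so its noise statistics mimic a higher ISO setting:
    # the added variance fills the gap between the two heteroscedastic models.
    rng = np.random.default_rng() if rng is None else rng
    var_extra = (a_high - a_low) * raw + (b_high - b_low)
    noise = rng.normal(0.0, np.sqrt(np.maximum(var_extra, 0.0)))
    return raw + noise

# Toy usage on a synthetic 12-bit RAW image.
raw = np.random.default_rng(0).uniform(0, 4095, size=(256, 256))
stego_raw = embed_iso_mimic(raw)

The design choice is that the stego image should be statistically indistinguishable from a genuine capture at the higher ISO setting, which is why detectability hinges on how faithfully the added noise, once propagated through demosaicing and rescaling, matches the modeled sensor noise.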