ترغب بنشر مسار تعليمي؟ اضغط هنا

Harnessing Perceptual Adversarial Patches for Crowd Counting

113   0   0.0 ( 0 )
 نشر من قبل Liu Shunchang
 تاريخ النشر 2021
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

Crowd counting, which is significantly important for estimating the number of people in safety-critical scenes, has been shown to be vulnerable to adversarial examples in the physical world (e.g., adversarial patches). Though harmful, adversarial examples are also valuable for assessing and better understanding model robustness. However, existing adversarial example generation methods in crowd counting scenarios lack strong transferability among different black-box models. Motivated by the fact that transferability is positively correlated to the model-invariant characteristics, this paper proposes the Perceptual Adversarial Patch (PAP) generation framework to learn the shared perceptual features between models by exploiting both the model scale perception and position perception. Specifically, PAP exploits differentiable interpolation and density attention to help learn the invariance between models during training, leading to better transferability. In addition, we surprisingly found that our adversarial patches could also be utilized to benefit the performance of vanilla models for alleviating several challenges including cross datasets and complex backgrounds. Extensive experiments under both digital and physical world scenarios demonstrate the effectiveness of our PAP.


قيم البحث

اقرأ أيضاً

State-of-the-art methods for counting people in crowded scenes rely on deep networks to estimate crowd density. While effective, deep learning approaches are vulnerable to adversarial attacks, which, in a crowd-counting context, can lead to serious s ecurity issues. However, attack and defense mechanisms have been virtually unexplored in regression tasks, let alone for crowd density estimation. In this paper, we investigate the effectiveness of existing attack strategies on crowd-counting networks, and introduce a simple yet effective pixel-wise detection mechanism. It builds on the intuition that, when attacking a multitask network, in our case estimating crowd density and scene depth, both outputs will be perturbed, and thus the second one can be used for detection purposes. We will demonstrate that this significantly outperforms heuristic and uncertainty-based strategies.
In crowd counting, each training image contains multiple people, where each person is annotated by a dot. Existing crowd counting methods need to use a Gaussian to smooth each annotated dot or to estimate the likelihood of every pixel given the annot ated point. In this paper, we show that imposing Gaussians to annotations hurts generalization performance. Instead, we propose to use Distribution Matching for crowd COUNTing (DM-Count). In DM-Count, we use Optimal Transport (OT) to measure the similarity between the normalized predicted density map and the normalized ground truth density map. To stabilize OT computation, we include a Total Variation loss in our model. We show that the generalization error bound of DM-Count is tighter than that of the Gaussian smoothed methods. In terms of Mean Absolute Error, DM-Count outperforms the previous state-of-the-art methods by a large margin on two large-scale counting datasets, UCF-QNRF and NWPU, and achieves the state-of-the-art results on the ShanghaiTech and UCF-CC50 datasets. DM-Count reduced the error of the state-of-the-art published result by approximately 16%. Code is available at https://github.com/cvlab-stonybrook/DM-Count.
Traditional crowd counting approaches usually use Gaussian assumption to generate pseudo density ground truth, which suffers from problems like inaccurate estimation of the Gaussian kernel sizes. In this paper, we propose a new measure-based counting approach to regress the predicted density maps to the scattered point-annotated ground truth directly. First, crowd counting is formulated as a measure matching problem. Second, we derive a semi-balanced form of Sinkhorn divergence, based on which a Sinkhorn counting loss is designed for measure matching. Third, we propose a self-supervised mechanism by devising a Sinkhorn scale consistency loss to resist scale changes. Finally, an efficient optimization method is provided to minimize the overall loss function. Extensive experiments on four challenging crowd counting datasets namely ShanghaiTech, UCF-QNRF, JHU++, and NWPU have validated the proposed method.
State-of-the-art methods for counting people in crowded scenes rely on deep networks to estimate crowd density. They typically use the same filters over the whole image or over large image patches. Only then do they estimate local scale to compensate for perspective distortion. This is typically achieved by training an auxiliary classifier to select, for predefined image patches, the best kernel size among a limited set of choices. As such, these methods are not end-to-end trainable and restricted in the scope of context they can leverage. In this paper, we introduce an end-to-end trainable deep architecture that combines features obtained using multiple receptive field sizes and learns the importance of each such feature at each image location. In other words, our approach adaptively encodes the scale of the contextual information required to accurately predict crowd density. This yields an algorithm that outperforms state-of-the-art crowd counting methods, especially when perspective effects are strong.
In this paper, we propose a novel perspective-guided convolution (PGC) for convolutional neural network (CNN) based crowd counting (i.e. PGCNet), which aims to overcome the dramatic intra-scene scale variations of people due to the perspective effect . While most state-of-the-arts adopt multi-scale or multi-column architectures to address such issue, they generally fail in modeling continuous scale variations since only discrete representative scales are considered. PGCNet, on the other hand, utilizes perspective information to guide the spatially variant smoothing of feature maps before feeding them to the successive convolutions. An effective perspective estimation branch is also introduced to PGCNet, which can be trained in either supervised setting or weakly-supervised setting when the branch has been pre-trained. Our PGCNet is single-column with moderate increase in computation, and extensive experimental results on four benchmark datasets show the improvements of our method against the state-of-the-arts. Additionally, we also introduce Crowd Surveillance, a large scale dataset for crowd counting that contains 13,000+ high-resolution images with challenging scenarios.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا