No Arabic abstract
Deep Neural Network (DNN) trained object detectors are widely deployed in many mission-critical systems for real time video analytics at the edge, such as autonomous driving and video surveillance. A common performance requirement in these mission-critical edge services is the near real-time latency of online object detection on edge devices. However, even with well-trained DNN object detectors, the online detection quality at edge may deteriorate for a number of reasons, such as limited capacity to run DNN object detection models on heterogeneous edge devices, and detection quality degradation due to random frame dropping when the detection processing rate is significantly slower than the incoming video frame rate. This paper addresses these problems by exploiting multi-model multi-device detection parallelism for fast object detection in edge systems with heterogeneous edge devices. First, we analyze the performance bottleneck of running a well-trained DNN model at edge for real time online object detection. We use the offline detection as a reference model, and examine the root cause by analyzing the mismatch among the incoming video streaming rate, video processing rate for object detection, and output rate for real time detection visualization of video streaming. Second, we study performance optimizations by exploiting multi-model detection parallelism. We show that the model-parallel detection approach can effectively speed up the FPS detection processing rate, minimizing the FPS disparity with the incoming video frame rate on heterogeneous edge devices. We evaluate the proposed approach using SSD300 and YOLOv3 on benchmark videos of different video stream rates. The results show that exploiting multi-model detection parallelism can speed up the online object detection processing rate and deliver near real-time object detection performance for efficient video analytics at edge.
Video cameras are pervasively deployed in city scale for public good or community safety (i.e. traffic monitoring or suspected person tracking). However, analyzing large scale video feeds in real time is data intensive and poses severe challenges to network and computation systems today. We present CrossRoI, a resource-efficient system that enables real time video analytics at scale via harnessing the videos content associations and redundancy across a fleet of cameras. CrossRoI exploits the intrinsic physical correlations of cross-camera viewing fields to drastically reduce the communication and computation costs. CrossRoI removes the repentant appearances of same objects in multiple cameras without harming comprehensive coverage of the scene. CrossRoI operates in two phases - an offline phase to establish cross-camera correlations, and an efficient online phase for real time video inference. Experiments on real-world video feeds show that CrossRoI achieves 42% - 65% reduction for network overhead and 25% - 34% reduction for response delay in real time video analytics applications with more than 99% query accuracy, when compared to baseline methods. If integrated with SotA frame filtering systems, the performance gains of CrossRoI reach 50% - 80% (network overhead) and 33% - 61% (end-to-end delay).
Recent advances in computer vision and neural networks have made it possible for more surveillance videos to be automatically searched and analyzed by algorithms rather than humans. This happened in parallel with advances in edge computing where videos are analyzed over hierarchical clusters that contain edge devices, close to the video source. However, the current video analysis pipeline has several disadvantages when dealing with such advances. For example, video encoders have been designed for a long time to please human viewers and be agnostic of the downstream analysis task (e.g., object detection). Moreover, most of the video analytics systems leverage 2-tier architecture where the encoded video is sent to either a remote cloud or a private edge server but does not efficiently leverage both of them. In response to these advances, we present SIEVE, a 3-tier video analytics system to reduce the latency and increase the throughput of analytics over video streams. In SIEVE, we present a novel technique to detect objects in compressed video streams. We refer to this technique as semantic video encoding because it allows video encoders to be aware of the semantics of the downstream task (e.g., object detection). Our results show that by leveraging semantic video encoding, we achieve close to 100% object detection accuracy with decompressing only 3.5% of the video frames which results in more than 100x speedup compared to classical approaches that decompress every video frame.
Edge video analytics is becoming the solution to many safety and management tasks. Its wide deployment, however, must first address the tension between inference accuracy and resource (compute/network) cost. This has led to the development of video analytics pipelines (VAPs), which reduce resource cost by combining DNN compression/speedup techniques with video processing heuristics. Our measurement study on existing VAPs, however, shows that todays methods for evaluating VAPs are incomplete, often producing premature conclusions or ambiguous results. This is because each VAPs performance varies substantially across videos and time (even under the same scenario) and is sensitive to different subsets of video content characteristics. We argue that accurate VAP evaluation must first characterize the complex interaction between VAPs and video characteristics, which we refer to as VAP performance clarity. We design and implement Yoda, the first VAP benchmark to achieve performance clarity. Using primitive-based profiling and a carefully curated benchmark video set, Yoda builds a performance clarity profile for each VAP to precisely define its accuracy/cost tradeoff and its relationship with video characteristics. We show that Yoda substantially improves VAP evaluations by (1) providing a comprehensive, transparent assessment of VAP performance and its dependencies on video characteristics; (2) explicitly identifying fine-grained VAP behaviors that were previously hidden by large performance variance; and (3) revealing strengths/weaknesses among different VAPs and new design opportunities.
The proliferation of camera-enabled devices and large video repositories has led to a diverse set of video analytics applications. These applications rely on video pipelines, represented as DAGs of operations, to transform videos, process extracted metadata, and answer questions like, Is this intersection congested? The latency and resource efficiency of pipelines can be optimized using configurable knobs for each operation (e.g., sampling rate, batch size, or type of hardware used). However, determining efficient configurations is challenging because (a) the configuration search space is exponentially large, and (b) the optimal configuration depends on users desired latency and cost targets, (c) input video contents may exercise different paths in the DAG and produce a variable amount intermediate results. Existing video analytics and processing systems leave it to the users to manually configure operations and select hardware resources. We present Llama: a heterogeneous and serverless framework for auto-tuning video pipelines. Given an end-to-end latency target, Llama optimizes for cost efficiency by (a) calculating a latency target for each operation invocation, and (b) dynamically running a cost-based optimizer to assign configurations across heterogeneous hardware that best meet the calculated per-invocation latency target. This makes the problem of auto-tuning large video pipelines tractable and allows us to handle input-dependent behavior, conditional branches in the DAG, and execution variability. We describe the algorithms in Llama and evaluate it on a cloud platform using serverless CPU and GPU resources. We show that compared to state-of-the-art cluster and serverless video analytics and processing systems, Llama achieves 7.8x lower latency and 16x cost reduction on average.
Low-cost cameras enable powerful analytics. An unexploited opportunity is that most captured videos remain cold without being queried. For efficiency, we advocate for these cameras to be zero streaming: capturing videos to local storage and communicating with the cloud only when analytics is requested. How to query zero-streaming cameras efficiently? Our response is a camera/cloud runtime system called DIVA. It addresses two key challenges: to best use limited camera resource during video capture; to rapidly explore massive videos during query execution. DIVA contributes two unconventional techniques. (1) When capturing videos, a camera builds sparse yet accurate landmark frames, from which it learns reliable knowledge for accelerating future queries. (2) When executing a query, a camera processes frames in multiple passes with increasingly more expensive operators. As such, DIVA presents and keeps refining inexact query results throughout the querys execution. On diverse queries over 15 videos lasting 720 hours in total, DIVA runs at more than 100x video realtime and outperforms competitive alternative designs. To our knowledge, DIVA is the first system for querying large videos stored on low-cost remote cameras.