No Arabic abstract
The geometry of binocular projection is analyzed, with reference to the primate visual system. In particular, the effects of coordinated eye movements on the retinal images are investigated. An appropriate oculomotor parameterization is defined, and is shown to complement the classical version and vergence angles. The midline horopter is identified, and subsequently used to construct the epipolar geometry of the system. It is shown that the Essential matrix can be obtained by combining the epipoles with the projection of the midline horopter. A local model of the scene is adopted, in which depth is measured relative to a plane containing the fixation point. The binocular disparity field is given a symmetric parameterization, in which the unknown scene-depths determine the location of corresponding image-features. The resulting Cyclopean depth-map can be combined with the estimated oculomotor parameters, to produce a local representation of the scene. The recovery of visual direction and depth from retinal images is discussed, with reference to the relevant psychophysical and neurophysiological literature.
Unmanned vehicles often need to locate targets with high precision during work. In the unmanned material handling workshop, the unmanned vehicle needs to perform high-precision pose estimation of the workpiece to accurately grasp the workpiece. In this context, this paper proposes a high-precision unmanned vehicle target positioning system based on binocular vision. The system uses a region-based stereo matching algorithm to obtain a disparity map, and uses the RANSAC algorithm to extract position and posture features, which achives the estimation of the position and attitude of a six-degree-of-freedom cylindrical workpiece. In order to verify the effect of the system, this paper collects the accuracy and calculation time of the output results of the cylinder in different poses. The experimental data shows that the position accuracy of the system is 0.61~1.17mm and the angular accuracy is 1.95~5.13{deg}, which can achieve better high-precision positioning effect.
Localizing stereo boundaries and predicting nearby disparities are difficult because stereo boundaries induce occluded regions where matching cues are absent. Most modern computer vision algorithms treat occlusions secondarily (e.g., via left-right consistency checks after matching) or rely on high-level cues to improve nearby disparities (e.g., via deep networks and large training sets). They ignore the geometry of stereo occlusions, which dictates that the spatial extent of occlusion must equal the amplitude of the disparity jump that causes it. This paper introduces an energy and level-set optimizer that improves boundaries by encoding occlusion geometry. Our model applies to two-layer, figure-ground scenes, and it can be implemented cooperatively using messages that pass predominantly between parents and children in an undecimated hierarchy of multi-scale image patches. In a small collection of figure-ground scenes curated from Middlebury and Falling Things stereo datasets, our model provides more accurate boundaries than previous occlusion-handling stereo techniques. This suggests new directions for creating cooperative stereo systems that incorporate occlusion cues in a human-like manner.
Most of the few-shot learning methods learn to transfer knowledge from datasets with abundant labeled data (i.e., the base set). From the perspective of class space on base set, existing methods either focus on utilizing all classes under a global view by normal pretraining, or pay more attention to adopt an episodic manner to train meta-tasks within few classes in a local view. However, the interaction of the two views is rarely explored. As the two views capture complementary information, we naturally think of the compatibility of them for achieving further performance gains. Inspired by the mutual learning paradigm and binocular parallax, we propose a unified framework, namely Binocular Mutual Learning (BML), which achieves the compatibility of the global view and the local view through both intra-view and cross-view modeling. Concretely, the global view learns in the whole class space to capture rich inter-class relationships. Meanwhile, the local view learns in the local class space within each episode, focusing on matching positive pairs correctly. In addition, cross-view mutual interaction further promotes the collaborative learning and the implicit exploration of useful knowledge from each other. During meta-test, binocular embeddings are aggregated together to support decision-making, which greatly improve the accuracy of classification. Extensive experiments conducted on multiple benchmarks including cross-domain validation confirm the effectiveness of our method.
Image feature extraction and matching is a fundamental but computation intensive task in machine vision. This paper proposes a novel FPGA-based embedded system to accelerate feature extraction and matching. It implements SURF feature point detection and BRIEF feature descriptor construction and matching. For binocular stereo vision, feature matching includes both tracking matching and stereo matching, which simultaneously provide feature point correspondences and parallax information. Our system is evaluated on a ZYNQ XC7Z045 FPGA. The result demonstrates that it can process binocular video data at a high frame rate (640$times$480 @ 162fps). Moreover, an extensive test proves our system has robustness for image compression, blurring and illumination.
This work presents dense stereo reconstruction using high-resolution images for infrastructure inspections. The state-of-the-art stereo reconstruction methods, both learning and non-learning ones, consume too much computational resource on high-resolution data. Recent learning-based methods achieve top ranks on most benchmarks. However, they suffer from the generalization issue due to lack of task-specific training data. We propose to use a less resource demanding non-learning method, guided by a learning-based model, to handle high-resolution images and achieve accurate stereo reconstruction. The deep-learning model produces an initial disparity prediction with uncertainty for each pixel of the down-sampled stereo image pair. The uncertainty serves as a self-measurement of its generalization ability and the per-pixel searching range around the initially predicted disparity. The downstream process performs a modified version of the Semi-Global Block Matching method with the up-sampled per-pixel searching range. The proposed deep-learning assisted method is evaluated on the Middlebury dataset and high-resolution stereo images collected by our customized binocular stereo camera. The combination of learning and non-learning methods achieves better performance on 12 out of 15 cases of the Middlebury dataset. In our infrastructure inspection experiments, the average 3D reconstruction error is less than 0.004m.