Crowd-sourcing has become a promising way to build a feature-based indoor positioning system with lower labour and time costs. It can make full use of the widely deployed infrastructure as well as the built-in sensors on mobile devices. One of the key challenges is to generate the reference feature map (RFM), a database used for localization, by aligning crowd-sourced trajectories according to associations embodied in the data. To facilitate data fusion using crowd-sourced inertial sensors and radio signals, this paper proposes an approach to adaptively mine geometric information. This information is essential for generating spatial associations between trajectories when employing graph-based optimization methods. The core idea is to estimate the functional relationship that maps the similarity/dissimilarity between radio signals to the physical space, based on the relative positions obtained from inertial sensors and their associated radio signals. As a result, the approach is adaptable to different modalities of data and can be implemented in a self-supervised way. We verify the generality of the proposed approach through comprehensive experimental analysis: i) qualitatively comparing the estimation of geometric mapping models and the alignment of crowd-sourced trajectories; ii) quantitatively evaluating the positioning performance. In a multi-storey shopping mall covering more than 10,000 $\mathrm{m}^2$, 68% of the positioning errors are less than 4.7 $\mathrm{m}$ using the crowd-sourced RFM, which is on a par with a manually collected RFM.
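The core idea can be illustrated with a minimal sketch. The abstract does not state the dissimilarity measure or the form of the mapping, so the Euclidean distance between RSS vectors, the linear model, the function names, and the toy data below are all assumptions made for illustration; the actual mapping need not be linear. Training pairs are taken only from within a single trajectory, where the inertial sensors provide relative positions, so no manual labels are required.

```python
import math
from itertools import combinations


def signal_dissimilarity(a, b):
    """Euclidean distance between two RSS vectors (assumed measure, dBm)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def training_pairs(positions, fingerprints):
    """Build (dissimilarity, distance) pairs from ONE trajectory.

    Relative positions from inertial sensors are only consistent within a
    trajectory, so pairs are never formed across trajectories.
    """
    pairs = []
    for i, j in combinations(range(len(positions)), 2):
        d_sig = signal_dissimilarity(fingerprints[i], fingerprints[j])
        d_phy = math.dist(positions[i], positions[j])
        pairs.append((d_sig, d_phy))
    return pairs


def fit_linear(pairs):
    """Least-squares fit of distance = slope * dissimilarity + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy trajectory (hypothetical data): a straight walk with RSS readings
# from two access points, sampled every metre.
positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
fingerprints = [(-40.0, -70.0), (-43.0, -66.0), (-46.0, -62.0), (-49.0, -58.0)]

slope, intercept = fit_linear(training_pairs(positions, fingerprints))


def predict_distance(fp_a, fp_b):
    """Map the dissimilarity of two fingerprints to an estimated distance (m)."""
    return slope * signal_dissimilarity(fp_a, fp_b) + intercept
```

Once fitted, such a mapping could be applied to fingerprints from different trajectories, and the estimated distances used as spatial associations between them in a graph-based optimization.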