This paper proposes a physics-informed super-resolution (SR) model based on a convolutional neural network and applies it to near-surface temperature in urban areas with a scaling factor of 4. The SR model incorporates a skip connection, a channel attention mechanism, and separate feature extractors for the inputs of temperature, building height, downward shortwave radiation, and horizontal velocity. We train the SR model on pairs of low-resolution (LR) and high-resolution (HR) images from building-resolving large-eddy simulations (LESs) of one city, and confirm its generalization capability with LESs of another city. The estimated HR temperature fields are more accurate than those obtained from bicubic interpolation and from an image SR model that takes only temperature as input. Apart from the temperature itself, building height is the most important input for reconstructing the HR temperature, and it enables the SR model to reduce temperature errors near building boundaries. An analysis of the attention weights indicates that the importance of building height increases as the downward shortwave radiation becomes larger. The contrast between sunlit and shaded areas strengthens as solar radiation increases, which may affect the temperature distribution. The short inference time suggests that the proposed physics-informed SR model, combined with an LR building-resolving LES model, has the potential to enable real-time HR forecasts in metropolitan areas.
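As a rough illustration of the channel attention idea mentioned above, the following NumPy sketch reweights feature channels gathered from four separate input branches. It assumes a squeeze-and-excitation style formulation (global average pooling, a bottleneck, and a sigmoid), which the abstract does not specify. The function name, branch names, channel counts, and random weights are all hypothetical and do not represent the authors' architecture.

```python
import numpy as np

def channel_attention(features, w1, w2):
    """Squeeze-and-excitation style channel attention (assumed formulation).

    features: array of shape (C, H, W)
    w1: bottleneck weights of shape (C // r, C)
    w2: expansion weights of shape (C, C // r)
    Returns the reweighted features and the per-channel attention weights.
    """
    squeezed = features.mean(axis=(1, 2))            # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)          # ReLU bottleneck -> (C // r,)
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid -> (C,), each in (0, 1)
    return features * weights[:, None, None], weights

rng = np.random.default_rng(0)

# Hypothetical outputs of four separate feature extractors,
# two channels each on an 8 x 8 low-resolution grid.
branch_names = ("temperature", "building_height", "shortwave", "velocity")
branches = {name: rng.standard_normal((2, 8, 8)) for name in branch_names}

# Concatenate branch features along the channel axis -> (8, 8, 8).
features = np.concatenate([branches[n] for n in branch_names], axis=0)

# Random stand-ins for learned weights (reduction ratio r = 2).
w1 = rng.standard_normal((4, 8)) * 0.1
w2 = rng.standard_normal((8, 4)) * 0.1

reweighted, weights = channel_attention(features, w1, w2)
```

In a trained model of this kind, inspecting `weights` for the channels that originate from each branch is one way attention weights could be related to input importance, as in the building-height analysis described above.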