In the next decade, we will see a considerable need for NLP models for situated settings where diversity of situations and also different modalities including eye-movements should be taken into account in order to grasp the intention of the user. However, language comprehension in situated settings can not be handled in isolation, where different multimodal cues are inherently present and essential parts of the situations. In this research proposal, we aim to quantify the influence of each modality in interaction with various referential complexities. We propose to encode the referential complexity of the situated settings in the embeddings during pre-training to implicitly guide the model to the most plausible situation-specific deviations. We summarize the challenges of intention extraction and propose a methodological approach to investigate a situation-specific feature adaptation to improve crossmodal mapping and meaning recovery from noisy communication settings.