Complex applications running on multicore processors show a rich performance phenomenology. The growing number of cores per ccNUMA domain complicates performance analysis of memory-bound code since system noise, load imbalance, or task-based programming models can lead to thread desynchronization. Hence, the simplifying assumption that all cores execute the same loop can not be upheld. Motivated by observations on plain and modifi
تحميل البحث