
Performance evaluation of explicit finite difference algorithms with varying amounts of computational and memory intensity

Posted by: Satya Pramod Jammy
Publication date: 2016
Research field: Informatics Engineering
Paper language: English





Future architectures designed to deliver exascale performance motivate the need for novel algorithmic changes in order to fully exploit their capabilities. In this paper, the performance of several numerical algorithms, characterised by varying degrees of memory and computational intensity, is evaluated in the context of finite difference methods for fluid dynamics problems. It is shown that, by storing some of the evaluated derivatives as single thread- or process-local variables in memory, or recomputing the derivatives on-the-fly, a speed-up of ~2 can be obtained compared to traditional algorithms that store all derivatives in global arrays.
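The algorithmic variants being compared can be illustrated with a toy 1D advection update (a sketch under assumed names and a stand-in equation, not the paper's code): the baseline writes every derivative to a global work array and reads it back in a second pass, while the alternative keeps each derivative in a thread-local scalar and recomputes it inside the update loop, trading arithmetic for memory traffic.

```c
#define N 1024

/* Baseline algorithm: every derivative is written to a global work array in a
 * first pass and read back in a second pass, maximising memory traffic. */
static void step_stored(const double *u, double *dudx, double *unew,
                        double c, double dx, double dt)
{
    for (int i = 1; i < N - 1; ++i)                 /* pass 1: store derivatives */
        dudx[i] = (u[i + 1] - u[i - 1]) / (2.0 * dx);
    for (int i = 1; i < N - 1; ++i)                 /* pass 2: read them back    */
        unew[i] = u[i] - c * dt * dudx[i];
}

/* Alternative algorithm: the derivative is held in a thread-local scalar and
 * recomputed on the fly, avoiding the extra global store and load per point. */
static void step_recompute(const double *u, double *unew,
                           double c, double dx, double dt)
{
    for (int i = 1; i < N - 1; ++i) {
        double dudx = (u[i + 1] - u[i - 1]) / (2.0 * dx);   /* local temporary */
        unew[i] = u[i] - c * dt * dudx;
    }
}
```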




Read also

We introduce a new model of computation: the online LOCAL model (OLOCAL). In this model, the adversary reveals the nodes of the input graph one by one, in the same way as in classical online algorithms, but for each new node the algorithm can also inspect its radius-$T$ neighborhood before choosing the output; instead of looking ahead in time, we have the power of looking around in space. It is natural to compare OLOCAL with the LOCAL model of distributed computing, in which all nodes make decisions simultaneously in parallel based on their radius-$T$ neighborhoods.
In the unit-cost comparison model, a black box takes as input two items and outputs the result of the comparison. Problems like sorting and searching have been studied in this model, and it has been generalized to include the concept of priced information, where different pairs of items (say database records) have different comparison costs. These comparison costs can be arbitrary (in which case no algorithm can be close to optimal (Charikar et al. STOC 2000)), structured (for example, the comparison cost may depend on the length of the databases (Gupta et al. FOCS 2001)), or stochastic (Angelov et al. LATIN 2008). Motivated by the database setting where the cost depends on the sizes of the items, we consider the problems of sorting and batched predecessor where two non-uniform sets of items $A$ and $B$ are given as input. (1) In the RAM setting, we consider the scenario where both sets have $n$ keys each. The cost to compare two items in $A$ is $a$, to compare an item of $A$ to an item of $B$ is $b$, and to compare two items in $B$ is $c$. We give upper and lower bounds for the case $a \le b \le c$. Notice that the case $b=1, a=c=\infty$ is the famous ``nuts and bolts'' problem. (2) In the Disk-Access Model (DAM), where transferring elements between disk and internal memory is the main bottleneck, we consider the scenario where elements in $B$ are larger than elements in $A$. The larger items take more I/Os to be brought into memory, consume more space in internal memory, and are required in their entirety for comparisons. We first give output-sensitive lower and upper bounds on the batched predecessor problem, and use these to derive bounds on the complexity of sorting in the two models. Our bounds are tight in most cases, and require novel generalizations of the classical lower bound techniques in external memory to accommodate the non-uniformity of keys.
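To make the priced-comparison setting concrete, the sketch below tags each item with its set and charges $a$, $b$, or $c$ per comparison while an off-the-shelf sort runs on $A \cup B$. The specific cost values and the qsort baseline are illustrative assumptions, not the paper's algorithms, which aim to do much better than a cost-oblivious sort.

```c
/* Cost-model sketch only: tally comparison costs by type while sorting. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { double key; int from_b; } Item;   /* from_b: 0 = set A, 1 = set B */

static const double COST[2][2] = { { 1.0, 2.0 },   /* a = 1: A vs A, b = 2: A vs B */
                                   { 2.0, 4.0 } }; /* c = 4: B vs B (a <= b <= c)  */
static double total_cost;

static int cmp(const void *p, const void *q)
{
    const Item *x = p, *y = q;
    total_cost += COST[x->from_b][y->from_b];       /* charge by comparison type */
    return (x->key > y->key) - (x->key < y->key);
}

int main(void)
{
    Item items[] = { {3.0, 0}, {1.0, 1}, {2.0, 0}, {4.0, 1} };
    qsort(items, 4, sizeof items[0], cmp);
    printf("total comparison cost: %.1f\n", total_cost);
    return 0;
}
```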
The spatial join is a popular operation in spatial database systems and its evaluation is a well-studied problem. As main memories become bigger and faster and commodity hardware supports parallel processing, there is a need to revamp classic join algorithms which have been designed for I/O-bound processing. In view of this, we study the in-memory and parallel evaluation of spatial joins, by re-designing a classic partitioning-based algorithm to consider alternative approaches for space partitioning. Our study shows that, compared to a straightforward implementation of the algorithm, our tuning can improve performance significantly. We also show how to select appropriate partitioning parameters based on data statistics, in order to tune the algorithm for the given join inputs. Our parallel implementation scales gracefully with the number of threads reducing the cost of the join to at most one second even for join inputs with tens of millions of rectangles.
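A minimal sketch of the partitioning-based approach discussed above, assuming coordinates in the unit square, a fixed uniform grid, and untuned tile sizes (all illustrative choices rather than the paper's tuned implementation): rectangles are replicated into every tile they overlap, each tile is joined independently (and could be handled by its own thread), and the classical reference-point test suppresses duplicates caused by replication.

```c
#include <stdio.h>

typedef struct { double xlo, ylo, xhi, yhi; int id; } Rect;

#define GRID 4            /* GRID x GRID uniform tiles over the unit square      */
#define CELL_CAP 64       /* illustrative fixed capacity; overflow checks omitted */

static int overlap(const Rect *a, const Rect *b)
{
    return a->xlo <= b->xhi && b->xlo <= a->xhi &&
           a->ylo <= b->yhi && b->ylo <= a->yhi;
}

/* Report a pair only in the tile containing the lower-left corner of the
 * intersection, so replicated rectangles do not yield duplicate results. */
static int is_reference_cell(const Rect *a, const Rect *b, int cx, int cy)
{
    double rx = a->xlo > b->xlo ? a->xlo : b->xlo;
    double ry = a->ylo > b->ylo ? a->ylo : b->ylo;
    int rcx = (int)(rx * GRID), rcy = (int)(ry * GRID);
    if (rcx >= GRID) rcx = GRID - 1;
    if (rcy >= GRID) rcy = GRID - 1;
    return rcx == cx && rcy == cy;
}

/* Single-use sketch: partition both inputs, then join tile by tile. */
static void spatial_join(const Rect *r, int nr, const Rect *s, int ns)
{
    static const Rect *cr[GRID][GRID][CELL_CAP], *cs[GRID][GRID][CELL_CAP];
    static int ncr[GRID][GRID], ncs[GRID][GRID];

    /* Partition: replicate each rectangle into all tiles it overlaps. */
    for (int i = 0; i < nr; ++i)
        for (int cx = (int)(r[i].xlo * GRID); cx <= (int)(r[i].xhi * GRID) && cx < GRID; ++cx)
            for (int cy = (int)(r[i].ylo * GRID); cy <= (int)(r[i].yhi * GRID) && cy < GRID; ++cy)
                cr[cx][cy][ncr[cx][cy]++] = &r[i];
    for (int i = 0; i < ns; ++i)
        for (int cx = (int)(s[i].xlo * GRID); cx <= (int)(s[i].xhi * GRID) && cx < GRID; ++cx)
            for (int cy = (int)(s[i].ylo * GRID); cy <= (int)(s[i].yhi * GRID) && cy < GRID; ++cy)
                cs[cx][cy][ncs[cx][cy]++] = &s[i];

    /* Join: each tile is independent and could be processed by its own thread. */
    for (int cx = 0; cx < GRID; ++cx)
        for (int cy = 0; cy < GRID; ++cy)
            for (int i = 0; i < ncr[cx][cy]; ++i)
                for (int j = 0; j < ncs[cx][cy]; ++j)
                    if (overlap(cr[cx][cy][i], cs[cx][cy][j]) &&
                        is_reference_cell(cr[cx][cy][i], cs[cx][cy][j], cx, cy))
                        printf("(%d, %d)\n", cr[cx][cy][i]->id, cs[cx][cy][j]->id);
}
```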
Simulations of physical phenomena are essential to the expedient design of precision components in aerospace and other high-tech industries. These phenomena are often described by mathematical models involving partial differential equations (PDEs) without exact solutions. Modern design problems require simulations with a level of resolution that is difficult to achieve in a reasonable amount of time even in effectively parallelized solvers. Though the scale of the problem relative to available computing power is the greatest impediment to accelerating these applications, significant performance gains can be achieved through careful attention to the details of memory accesses. Parallelized PDE solvers are subject to a trade-off in memory management: store the solution for each timestep in abundant, global memory with high access costs or in a limited, private memory with low access costs that must be passed between nodes. The GPU implementation of swept time-space decomposition presented here mitigates this dilemma by using private (shared) memory, avoiding internode communication, and overwriting unnecessary values. It shows significant improvement in the execution time of the PDE solvers in one dimension, achieving speedups of 6-2x for large and small problem sizes, respectively, compared to naive GPU implementations.
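The memory trade-off described above hinges on being able to take several sub-steps inside fast private memory before touching global data. The sketch below is a rough plain-C illustration of that idea for a 1D three-point stencil (not the paper's CUDA implementation; the tile width, step count, and averaging stencil are assumptions): each local step shrinks the band of valid points by one on each side, so a tile of width W yields W - 2S finished points after S steps with no neighbour communication.

```c
#include <string.h>

#define W 32          /* tile width held in private/shared memory            */
#define S 4           /* sub-steps taken before writing back (needs 2*S < W) */

/* Advance one tile of a 1D diffusion-like update S steps without neighbours. */
static void sweep_tile(const double *global_u, double *global_out, int tile_start)
{
    double buf[2][W];
    memcpy(buf[0], global_u + tile_start, W * sizeof(double));   /* load once */

    for (int s = 1; s <= S; ++s) {
        const double *in = buf[(s - 1) % 2];
        double *out = buf[s % 2];
        /* Only points whose neighbours are still valid can be updated. */
        for (int i = s; i < W - s; ++i)
            out[i] = 0.25 * in[i - 1] + 0.5 * in[i] + 0.25 * in[i + 1];
    }

    /* Write back only the (W - 2*S) points that are valid after S steps. */
    memcpy(global_out + tile_start + S, &buf[S % 2][S], (W - 2 * S) * sizeof(double));
}
```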
Suppose we sequentially put $n$ balls into $n$ bins. If we put each ball into a random bin then the heaviest bin will contain $\sim\log n/\log\log n$ balls with high probability. However, Azar, Broder, Karlin and Upfal [SIAM J. Comput. 29 (1999) 180--200] showed that if each time we choose two bins at random and put the ball in the least loaded bin among the two, then the heaviest bin will contain only $\sim\log\log n$ balls with high probability. How much memory do we need to implement this scheme? We need roughly $\log\log\log n$ bits per bin, and $n\log\log\log n$ bits in total. Let us assume now that we have a limited amount of memory. For each ball, we are given two random bins and we have to put the ball into one of them. Our goal is to minimize the load of the heaviest bin. We prove that if we have $n^{1-\delta}$ bits then the heaviest bin will contain at least $\Omega(\delta\log n/\log\log n)$ balls with high probability. The bound is tight in the communication complexity model.
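The two-choice effect that motivates the memory question is easy to check empirically. The sketch below illustrates only the baseline phenomenon (it does not model the paper's memory-limited setting; the generator and problem size are arbitrary choices): it places $n$ balls with one random choice and with the better of two random choices, then prints the resulting maximum loads.

```c
#include <stdio.h>

#define NBALLS 1000000
#define NBINS  1000000

/* Small xorshift64 generator, used to avoid relying on a large RAND_MAX. */
static unsigned long long rng = 88172645463325252ULL;
static unsigned long long next_rand(void)
{
    rng ^= rng << 13;
    rng ^= rng >> 7;
    rng ^= rng << 17;
    return rng;
}

/* Throw NBALLS balls into NBINS bins, choosing the lighter of `choices`
 * random candidate bins for each ball; return the maximum load. */
static int max_load(int choices)
{
    static int load[NBINS];
    for (int i = 0; i < NBINS; ++i) load[i] = 0;

    for (int i = 0; i < NBALLS; ++i) {
        int best = (int)(next_rand() % NBINS);
        for (int c = 1; c < choices; ++c) {
            int other = (int)(next_rand() % NBINS);
            if (load[other] < load[best]) best = other;   /* pick the lighter bin */
        }
        ++load[best];
    }

    int m = 0;
    for (int i = 0; i < NBINS; ++i)
        if (load[i] > m) m = load[i];
    return m;
}

int main(void)
{
    printf("one choice : max load %d\n", max_load(1));  /* ~ log n / log log n */
    printf("two choices: max load %d\n", max_load(2));  /* ~ log log n         */
    return 0;
}
```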