PaSh: Light-touch Data-Parallel Shell Processing


Abstract in English

This paper presents {scshape PaSh}, a system for parallelizing POSIX shell scripts. Given a script, {scshape PaSh} converts it to a dataflow graph, performs a series of semantics-preserving program transformations that expose parallelism, and then converts the dataflow graph back into a script -- one that adds POSIX constructs to explicitly guide parallelism coupled with {scshape PaSh}-provided {scshape Unix}-aware runtime primitives for addressing performance- and correctness-related issues. A lightweight annotation language allows command developers to express key parallelizability properties about their commands. An accompanying parallelizability study of POSIX and GNU commands -- two large and commonly used groups -- guides the annotation language and optimized aggregator library that {scshape PaSh} uses. Finally, {scshape PaSh}s {scshape PaSh}s extensive evaluation over 44 unmodified {scshape Unix} scripts shows significant speedups ($0.89$--$61.1times$, avg: $6.7times$) stemming from the combination of its program transformations and runtime primitives.

Download