skip to main content

An order-aware dataflow model for parallel Unix pipelines

Published:19 August 2021Publication History
Skip Abstract Section

Abstract

We present a dataflow model for modelling parallel Unix shell pipelines. To accurately capture the semantics of complex Unix pipelines, the dataflow model is order-aware, i.e., the order in which a node in the dataflow graph consumes inputs from different edges plays a central role in the semantics of the computation and therefore in the resulting parallelization. We use this model to capture the semantics of transformations that exploit data parallelism available in Unix shell computations and prove their correctness. We additionally formalize the translations from the Unix shell to the dataflow model and from the dataflow model back to a parallel shell script. We implement our model and transformations as the compiler and optimization passes of a system parallelizing shell pipelines, and use it to evaluate the speedup achieved on 47 pipelines.

Skip Supplemental Material Section

Supplemental Material

icfp21main-p20-p-video.mp4

mp4

21 MB

3473570.mp4

mp4

26.3 MB

References

  1. 2014. A Catalog of Stream Processing Optimizations. ACM Comput. Surv., 46, 4 (2014), Article 46, March, 34 pages. issn:0360-0300 https://doi.org/10.1145/2528412 Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Arvind Arasu, Shivnath Babu, and Jennifer Widom. 2006. The CQL Continuous Query Language: Semantic Foundations and Query Execution. The VLDB Journal, 15, 2 (2006), June, 121–142. issn:1066-8888 https://doi.org/10.1007/s00778-004-0147-z Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. K. Arvind, E. David Culler, Robert Iannucci, Vinod Kathail, Keshav Pingali, and Robert Thomas. 1984. The tagged token dataflow architecture. Technical report, MIT Laboratory for Computer Science.Google ScholarGoogle Scholar
  4. K. Arvind and Rishiyur S. Nikhil. 1990. Executing a Program on the MIT Tagged-Token Dataflow Architecture. IEEE Trans. Comput., 39, 3 (1990), March, 300–318. issn:0018-9340 https://doi.org/10.1109/12.48862 Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Amnon Barak and Oren La’adan. 1998. The MOSIX Multicomputer Operating System for High Performance Cluster Computing. Future Gener. Comput. Syst., 13, 4–5 (1998), March, 361–372. issn:0167-739X https://doi.org/10.1016/S0167-739X(97)00037-X Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Jon Bentley. 1985. Programming Pearls: A Spelling Checker. Commun. ACM, 28, 5 (1985), May, 456–462. issn:0001-0782 https://doi.org/10.1145/3532.315102 Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Jon Bentley, Don Knuth, and Doug McIlroy. 1986. Programming Pearls: A Literate Program. Commun. ACM, 29, 6 (1986), June, 471–483. issn:0001-0782 https://doi.org/10.1145/5948.315654 Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Gérard Berry and Georges Gonthier. 1992. The ESTEREL Synchronous Programming Language: Design, Semantics, Implementation. Sci. Comput. Program., 19, 2 (1992), Nov., 87–152. issn:0167-6423 https://doi.org/10.1016/0167-6423(92)90005-V Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Pawan Bhandari. 2020. Solutions to unixgame.io. https://git.io/Jf2dn Accessed: 2020-04-14.Google ScholarGoogle Scholar
  10. Timothy Bourke and Marc Pouzet. 2013. ZéLus: A Synchronous Language with ODEs. In Proceedings of the 16th International Conference on Hybrid Systems: Computation and Control (HSCC ’13). Association for Computing Machinery, New York, NY, USA. 113–118. isbn:9781450315678 https://doi.org/10.1145/2461328.2461348 Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM, 51, 1 (2008), Jan., 107–113. issn:0001-0782 https://doi.org/10.1145/1327452.1327492 Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Jack B. Dennis. 1974. First Version of a Data Flow Procedure Language. In Programming Symposium, B. Robinet (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg. 362–376. isbn:978-3-540-37819-8 https://doi.org/10.1007/3-540-06859-7_145 Google ScholarGoogle ScholarCross RefCross Ref
  13. Tom Duff. 1990. Rc—a Shell for Plan 9 and UNIX. 283–296. isbn:0030475295Google ScholarGoogle Scholar
  14. Jeff Epstein, Andrew P. Black, and Simon Peyton-Jones. 2011. Towards Haskell in the Cloud. In Proceedings of the 4th ACM Symposium on Haskell (Haskell ’11). ACM, New York, NY, USA. 118–129. isbn:978-1-4503-0860-1 https://doi.org/10.1145/2034675.2034690 Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Azadeh Farzan and Victor Nicolet. 2017. Synthesis of Divide and Conquer Parallelism for Loops. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2017). Association for Computing Machinery, New York, NY, USA. 540–555. isbn:9781450349888 https://doi.org/10.1145/3062341.3062355 Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Azadeh Farzan and Victor Nicolet. 2019. Modular Divide-and-Conquer Parallelization of Nested Loops. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2019). Association for Computing Machinery, New York, NY, USA. 610–624. isbn:9781450367127 https://doi.org/10.1145/3314221.3314612 Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Michael I Gordon, William Thies, and Saman Amarasinghe. 2006. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. ACM SIGPLAN Notices, 41, 11 (2006), 151–162.Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Michael Greenberg. 2018. The POSIX shell is an interactive DSL for concurrency.Google ScholarGoogle Scholar
  19. Michael Greenberg and Austin J. Blatt. 2020. Executable Formal Semantics for the POSIX Shell: Smoosh: the Symbolic, Mechanized, Observable, Operational Shell. Proc. ACM Program. Lang., 4, POPL (2020), Article 43, Jan., 31 pages. https://doi.org/10.1145/3371111 Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Michael Greenberg, Konstantinos Kallas, and Nikos Vasilakis. 2021. The Future of the Shell: Unix and Beyond. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS ’21). Association for Computing Machinery, New York, NY, USA. 240–241. isbn:9781450384384 https://doi.org/10.1145/3458336.3465296 Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Michael Greenberg, Konstantinos Kallas, and Nikos Vasilakis. 2021. Unix Shell Programming: The Next 50 Years. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS ’21). Association for Computing Machinery, New York, NY, USA. 104–111. isbn:9781450384384 https://doi.org/10.1145/3458336.3465294 Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Shivam Handa, Konstantinos Kallas, Nikos Vasilakis, and Martin Rinard. 2021. An Order-aware Dataflow Model for Parallel Unix Pipelines (Artifact). https://doi.org/10.5281/zenodo.4776838 Google ScholarGoogle ScholarCross RefCross Ref
  23. Michael Hart. 1971. Project Gutenberg. https://www.gutenberg.org/Google ScholarGoogle Scholar
  24. Dan Jurafsky. 2017. Unix for Poets. https://web.stanford.edu/class/cs124/lec/124-2018-UnixForPoets.pdfGoogle ScholarGoogle Scholar
  25. Gilles Kahn. 1974. The Semantics of a Simple Language for Parallel Programming. Information Processing, 74 (1974), 471–475.Google ScholarGoogle Scholar
  26. Gilles Kahn and David B. MacQueen. 1977. Coroutines and Networks of Parallel Processes. Information Processing, 77 (1977), 993–998.Google ScholarGoogle Scholar
  27. Konstantinos Kallas, Filip Niksic, Caleb Stanford, and Rajeev Alur. 2020. DiffStream: Differential Output Testing for Stream Processing Programs. Proc. ACM Program. Lang., 4, OOPSLA (2020), Article 153, Nov., 29 pages. https://doi.org/10.1145/3428221 Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Richard M. Karp and Raymond E. Miller. 1966. Properties of a Model for Parallel Computations: Determinacy, Termination, Queueing. SIAM J. Appl. Math., 14, 6 (1966), 1390–1411. https://doi.org/10.1137/0114108 Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Charles Edwin Killian, James W. Anderson, Ryan Braud, Ranjit Jhala, and Amin M. Vahdat. 2007. Mace: Language Support for Building Distributed Systems. In Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’07). ACM, New York, NY, USA. 179–188. isbn:978-1-59593-633-2 https://doi.org/10.1145/1250734.1250755 Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Nokia Bell Labs. 2019. The Unix Game—Solve puzzles using Unix pipes. https://unixgame.io/unix50 Accessed: 2020-03-05.Google ScholarGoogle Scholar
  31. Paul Le Guernic, Albert Benveniste, Patricia Bournai, and Thierry Gautier. 1986. Signal–A data flow-oriented language for signal processing. IEEE transactions on acoustics, speech, and signal processing, 34, 2 (1986), 362–374.Google ScholarGoogle ScholarCross RefCross Ref
  32. Edward Ashford Lee and David G. Messerschmitt. 1987. Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing. IEEE Trans. Comput., 36, 1 (1987), Jan., 24–35. issn:0018-9340 https://doi.org/10.1109/TC.1987.5009446 Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Edward A. Lee and David G. Messerschmitt. 1987. Synchronous data flow. Proc. IEEE, 75, 9 (1987), 1235–1245.Google ScholarGoogle ScholarCross RefCross Ref
  34. Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, and Peter A. Tucker. 2005. Semantics and Evaluation Techniques for Window Aggregates in Data Streams. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data (SIGMOD ’05). Association for Computing Machinery, New York, NY, USA. 311–322. isbn:1595930604 https://doi.org/10.1145/1066157.1066193 Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Konstantinos Mamouras. 2020. Semantic Foundations for Deterministic Dataflow and Stream Processing. In European Symposium on Programming. 394–427.Google ScholarGoogle Scholar
  36. Konstantinos Mamouras, Mukund Raghothaman, Rajeev Alur, Zachary G. Ives, and Sanjeev Khanna. 2017. StreamQRE: Modular Specification and Efficient Evaluation of Quantitative Queries over Streaming Data. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2017). Association for Computing Machinery, New York, NY, USA. 693–708. isbn:9781450349888 https://doi.org/10.1145/3062341.3062369 Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Konstantinos Mamouras, Caleb Stanford, Rajeev Alur, Zachary G. Ives, and Val Tannen. 2019. Data-Trace Types for Distributed Stream Processing Systems. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2019). ACM, New York, NY, USA. 670–685. isbn:978-1-4503-6712-7 https://doi.org/10.1145/3314221.3314580 Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Florence Maraninchi and Yann RéMond. 2001. Argos: An Automaton-Based Synchronous Language. Comput. Lang., 27, 1–3 (2001), April, 61–92. issn:0096-0551 https://doi.org/10.1016/S0096-0551(01)00016-9 Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Chris McDonald and Trevor I. Dix. 1988. Support for Graphs of Processes in a Command Interpreter. Softw. Pract. Exper., 18, 10 (1988), Oct., 1011–1016. issn:0038-0644 https://doi.org/10.1002/spe.4380181007 Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Malcolm D McIlroy, Elliot N Pinson, and Berkley A Tague. 1978. UNIX Time-Sharing System: Foreword. Bell System Technical Journal, 57, 6 (1978), 1899–1904.Google ScholarGoogle ScholarCross RefCross Ref
  41. Sape J. Mullender, Guido van Rossum, Andrew S. Tanenbaum, Robbert van Renesse, and Hans van Staveren. 1990. Amoeba: A Distributed Operating System for the 1990s. Computer, 23, 5 (1990), May, 44–53. issn:0018-9162 https://doi.org/10.1109/2.53354 Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A Timely Dataflow System. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP ’13). ACM, New York, NY, USA. 439–455. isbn:978-1-4503-2388-8 https://doi.org/10.1145/2517349.2522738 Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. John K Ousterhout, Andrew R. Cherenson, Fred Douglis, Michael N. Nelson, and Brent B. Welch. 1988. The Sprite network operating system. Computer, 21, 2 (1988), 23–36. http://www.research.ibm.com/people/f/fdouglis/papers/sprite.pdfGoogle ScholarGoogle ScholarDigital LibraryDigital Library
  44. Rob Pike, Dave Presotto, Ken Thompson, and Howard Trickey. 1990. Plan 9 from Bell Labs. In Proceedings of the summer 1990 UKUUG Conference. 1–9. http://css.csail.mit.edu/6.824/2014/papers/plan9.pdfGoogle ScholarGoogle Scholar
  45. Deepti Raghavan, Sadjad Fouladi, Philip Levis, and Matei Zaharia. 2020. POSH: A Data-Aware Shell. In 2020 USENIX Annual Technical Conference (USENIX ATC 20). USENIX Association, 617–631. isbn:978-1-939133-14-4 https://www.usenix.org/conference/atc20/presentation/raghavanGoogle ScholarGoogle Scholar
  46. Christophe Ratel, Nicolas Halbwachs, and Pascal Raymond. 1991. Programming and Verifying Critical Systems by Means of the Synchronous Data-Flow Language LUSTRE. 112–119. isbn:0897914554 https://doi.org/10.1145/125083.123062 Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Radu Rugina and Martin Rinard. 1999. Automatic Parallelization of Divide and Conquer Algorithms. In Proceedings of the Seventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’99). Association for Computing Machinery, New York, NY, USA. 72–83. isbn:1581131003 https://doi.org/10.1145/301104.301111 Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Scott Schneider, Martin Hirzel, Bugra Gedik, and Kun-Lung Wu. 2015. Safe Data Parallelism for General Streaming. IEEE Trans. Comput., 64, 2 (2015), Feb., 504–517. issn:0018-9340 https://doi.org/10.1109/TC.2013.221 Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Peter Sewell, James J. Leifer, Keith Wansbrough, Francesco Zappa Nardelli, Mair Allen-Williams, Pierre Habouzit, and Viktor Vafeiadis. 2005. Acute: High-level Programming Language Design for Distributed Computation. In Proceedings of the Tenth ACM SIGPLAN International Conference on Functional Programming (ICFP ’05). ACM, New York, NY, USA. 15–26. isbn:1-59593-064-7 https://doi.org/10.1145/1086365.1086370 Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Calvin Smith and Aws Albarghouthi. 2016. MapReduce Program Synthesis. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’16). Association for Computing Machinery, New York, NY, USA. 326–340. isbn:9781450342612 https://doi.org/10.1145/2908080.2908102 Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Diomidis Spinellis and Marios Fragkoulis. 2017. Extending Unix Pipelines to DAGs. IEEE Trans. Comput., 66, 9 (2017), 1547–1561.Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Ole Tange. 2011. GNU Parallel—The Command-Line Power Tool. ;login: The USENIX Magazine, 36, 1 (2011), Feb, 42–47. https://doi.org/10.5281/zenodo.16303 Google ScholarGoogle ScholarCross RefCross Ref
  53. Ole Tange. 2020. Differences Between GNU parallel and Alternatives. https://www.gnu.org/software/parallel/parallel_alternatives.htmlGoogle ScholarGoogle Scholar
  54. Dave Taylor and Brandon Perry. 2016. Wicked Cool Shell Scripts: 101 Scripts for Linux, OS X, and UNIX Systems (2nd ed.). No Starch Press, USA. isbn:1593276028Google ScholarGoogle Scholar
  55. William Thies, Michal Karczmarek, and Saman P. Amarasinghe. 2002. StreamIt: A Language for Streaming Applications. In Proceedings of the 11th International Conference on Compiler Construction (CC ’02). Springer-Verlag, Berlin, Heidelberg. 179–196. isbn:3540433694Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. Eleftheria Tsaliki and Diomidis Spinellis. 2021. The real statistics of buses in Athens. https://insidestory.gr/article/noymera-leoforeia-athinas?token=0MFVISB8N6 Published by the Inside Story news organization.Google ScholarGoogle Scholar
  57. Nikos Vasilakis, Konstantinos Kallas, Konstantinos Mamouras, Achilles Benetopoulos, and Lazar Cvetković. 2021. PaSh: Light-Touch Data-Parallel Shell Processing. Association for Computing Machinery, New York, NY, USA. 49–66. isbn:9781450383349 https://doi.org/10.1145/3447786.3456228 Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Nikos Vasilakis, Ben Karel, and Jonathan M. Smith. 2015. From Lone Dwarfs to Giant Superclusters: Rethinking Operating System Abstractions for the Cloud. In Proceedings of the 15th USENIX Conference on Hot Topics in Operating Systems (HOTOS’15). USENIX Association, Berkeley, CA, USA. 15–15. http://dl.acm.org/citation.cfm?id=2831090.2831105Google ScholarGoogle Scholar
  59. Robert Virding, Claes Wikström, Mike Williams, and Joe Armstrong. 1996. Concurrent Programming in ERLANG (2nd Ed.). Prentice Hall International (UK) Ltd., GBR. isbn:013508301XGoogle ScholarGoogle Scholar
  60. W. Gentzsch (Sun Microsystems). 2001. Sun Grid Engine: Towards Creating a Compute Power Grid. In Proceedings of the 1st International Symposium on Cluster Computing and the Grid (CCGRID ’01). IEEE Computer Society, USA. 35. isbn:0769510108Google ScholarGoogle Scholar
  61. Edward Walker, Weijia Xu, and Vinoth Chandar. 2009. Composing and Executing Parallel Data-Flow Graphs with Shell Pipes. In Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science (WORKS ’09). Association for Computing Machinery, New York, NY, USA. Article 11, 10 pages. isbn:9781605587172 https://doi.org/10.1145/1645164.1645175 Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. Ian Watson and John Gurd. 1979. A prototype data flow computer with token labelling. In 1979 International Workshop on Managing Requirements Knowledge (MARK). 623–628.Google ScholarGoogle ScholarCross RefCross Ref
  63. Andy B Yoo, Morris A Jette, and Mark Grondona. 2003. Slurm: Simple linux utility for resource management. In Workshop on Job Scheduling Strategies for Parallel Processing. 44–60.Google ScholarGoogle ScholarCross RefCross Ref
  64. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI’12). USENIX Association, Berkeley, CA, USA. 2–2. http://dl.acm.org/citation.cfm?id=2228298.2228301Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. Zhao Zhang, Daniel S. Katz, Timothy G. Armstrong, Justin M. Wozniak, and Ian Foster. 2013. Parallelizing the Execution of Sequential Scripts. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC ’13). Association for Computing Machinery, New York, NY, USA. Article 31, 12 pages. isbn:9781450323789 https://doi.org/10.1145/2503210.2503222 Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. An order-aware dataflow model for parallel Unix pipelines

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image Proceedings of the ACM on Programming Languages
          Proceedings of the ACM on Programming Languages  Volume 5, Issue ICFP
          August 2021
          1011 pages
          EISSN:2475-1421
          DOI:10.1145/3482883
          Issue’s Table of Contents

          Copyright © 2021 Owner/Author

          Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 19 August 2021
          Published in pacmpl Volume 5, Issue ICFP

          Check for updates

          Qualifiers

          • research-article

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader