4

Asynchronous Work and Pipelines

Not all work that we want to do with computers involves serving a request in near-real-time and responding to a user. Sometimes we need to do asynchronous tasks like:

  • Periodic work, such as exporting data nightly or computing monthly reports
  • Work scheduled for later, such as sending reminders to users
  • Long-running work, such as running a build or a set of tests
  • Continuous work, such as computing running statistics over incoming data

Not every batch process is large. A small one can simply run on a single machine as a scheduled cron job.

MapReduce

However, not all computations fit on one machine. One way of running large batch computations across a fleet of computers is the MapReduce paradigm. Read this article, which describes how MapReduce works.

  • How does MapReduce help us to scale big computations?

Read this book chapter about Data Processing Pipelines.

  • Give two reasons why data processing pipelines can be fragile

Optionally, you can follow this short tutorial to implement a distributed word count application, and run it locally on a Glow cluster. You will get hands-on experience with MapReduce.
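Before starting the tutorial, it may help to see the shape of a word count written as separate map, shuffle, and reduce steps. The sketch below runs in a single Python process and does not use Glow. The function names (map_phase, shuffle_phase, reduce_phase) are illustrative assumptions, not part of any library. In a real cluster, each map call and each reduce call could run on a different machine, which is where the scaling comes from.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key so each key reaches one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document is mapped independently, so this step could be parallelised.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    # Each key is reduced independently, so this step could be parallelised too.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: three small "documents" standing in for large files.
documents = ["the quick brown fox", "the lazy dog", "the fox"]
counts = word_count(documents)
print(counts)
```

Notice that neither map_phase nor reduce_phase needs to see the whole dataset at once. Only the shuffle step has to move data between workers.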

Queues

Queues are a frequently seen component of large software systems that handle potentially heavyweight or long-running requests. A queue can act as a buffer, smoothing out spikes of load so that the system can deal with work when it has the resources to do so. Read about the Queue-Based Load-Leveling Pattern.

  • How can results of tasks be communicated back to users in a queue-based system?
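The buffering idea can be tried out with a minimal in-process sketch. It uses Python's standard queue and threading modules rather than a real distributed queue, and the "work" (doubling a number) is a made-up stand-in for a heavyweight task. A burst of 100 tasks arrives all at once, and a single worker then drains the backlog at its own pace.

```python
import queue
import threading

tasks = queue.Queue()  # the buffer sitting between producers and the worker
processed = []


def worker():
    """Take tasks off the queue one at a time until the sentinel arrives."""
    while True:
        task = tasks.get()
        if task is None:  # sentinel value: no more work is coming
            tasks.task_done()
            break
        processed.append(task * 2)  # stand-in for slow, heavyweight work
        tasks.task_done()


# A spike of load: 100 requests are enqueued before any worker is running.
for i in range(100):
    tasks.put(i)
backlog_at_peak = tasks.qsize()

# The worker starts later and works through the backlog in arrival order.
thread = threading.Thread(target=worker)
thread.start()
tasks.put(None)
tasks.join()  # block until every queued item has been marked done
thread.join()

print(backlog_at_peak, len(processed))
```

The producer never waits for the worker here: it only pays the cost of putting a task on the queue. The sketch deliberately leaves out how results get back to whoever submitted the task, which is the question above.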

Kafka is a commonly used open-source distributed queue. Read Apache Kafka in a Nutshell.

  • What are the components of the Kafka architecture?
  • How are topics different from partitions?

Project work for this section