Elastic MapReduce (EMR)

Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.

Using these frameworks and related open-source projects, you can process data for analytics purposes and business intelligence workloads. EMR, in other words, is the place where Platform pipelines run.

A word you will often come across when working as a data engineer is cluster. At a high level, a computer cluster is a group of two or more computers, or nodes, that run in parallel to achieve a common goal. This allows workloads consisting of a high number of individual, parallelizable tasks to be distributed among the nodes in the cluster.
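As a rough local analogy, the sketch below splits a job into independent tasks and hands them to a pool of workers. The threads only stand in for cluster nodes, and the function names and the word-count task are made up for illustration. This is not how EMR itself schedules work.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """One independent task: count the words in a single chunk of text."""
    return len(chunk.split())


def total_words(chunks, workers=4):
    """Distribute the chunks among workers, then combine the partial results."""
    # Each worker plays the role of a node (an assumption made for illustration).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_words, chunks))
```

Because no chunk depends on another, adding workers lets more chunks be processed at the same time, which is the property a cluster relies on.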

A list of all active clusters

Of all these clusters, we only use a few. The filter offers four options:

  • All Clusters: the whole list of clusters within a time range.

  • Active Clusters: the list of active, currently running clusters.

  • Terminated Clusters: the list of terminated clusters, whether they were terminated by us or by AWS.

  • Failed Clusters: the list of clusters that have run into issues.
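The same filters can be approximated from code. The sketch below assumes a boto3 EMR client is passed in (for example one created with boto3.client("emr")). The mapping from filter names to cluster states follows the AWS CLI shortcuts --active, --terminated and --failed; that the console filter groups states the same way is an assumption. FILTER_STATES and list_clusters_for_filter are hypothetical names.

```python
# Filter name -> EMR cluster states. The grouping is an assumption based on
# the AWS CLI shortcuts; an empty list means "no state filter".
FILTER_STATES = {
    "All Clusters": [],
    "Active Clusters": ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING", "TERMINATING"],
    "Terminated Clusters": ["TERMINATED"],
    "Failed Clusters": ["TERMINATED_WITH_ERRORS"],
}


def list_clusters_for_filter(emr_client, filter_name):
    """Return (id, name, state) tuples for the clusters matching a filter."""
    kwargs = {}
    states = FILTER_STATES[filter_name]
    if states:
        kwargs["ClusterStates"] = states
    clusters = []
    while True:
        page = emr_client.list_clusters(**kwargs)
        for cluster in page.get("Clusters", []):
            clusters.append((cluster["Id"], cluster["Name"], cluster["Status"]["State"]))
        marker = page.get("Marker")
        if not marker:
            return clusters
        kwargs["Marker"] = marker  # ask for the next page of results
```

Calling list_clusters_for_filter(emr, "Active Clusters") should return roughly what the Active Clusters view shows, though without the time range that the console applies.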

With Active Clusters selected, we get this view:

A list of active clusters

The cluster we care about most is Spark Production Cluster. That is the cluster that is connected to the Platform. All clusters have the same user interface, and if we click one of them, we get the following view:

Most frequented tabs

It is also good to know that Presto Production Cluster is responsible for dashboard query performance. The more computing power this cluster has, the faster query results show up in dashboards.

The last one that we need to mention is Spark Development Cluster. This cluster is connected to the development Platform site. Every job that we run from dev-prime can be tracked within this cluster.

An important part of EMR that we need to elaborate on is steps.

Steps

When we run a pipeline in Platform, a step is created containing information about the pipeline. A step can be in one of three states:

  • Pending: the step is in queue.

  • Running: the step is actually being processed.

  • Completed / Failed: the step has finished running and has completed or failed.
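To see how many steps on a cluster are in each of these states, a sketch along the following lines could be used. It assumes a boto3 EMR client is passed in, and summarize_steps and DOC_STATE are hypothetical names. The EMR API also reports a few states not listed above (CANCEL_PENDING, CANCELLED, INTERRUPTED), which are counted under "Other" here.

```python
# EMR API step state -> the state names used on this page.
DOC_STATE = {
    "PENDING": "Pending",
    "RUNNING": "Running",
    "COMPLETED": "Completed / Failed",
    "FAILED": "Completed / Failed",
}


def summarize_steps(emr_client, cluster_id):
    """Count the steps of one cluster, grouped by the states described above."""
    counts = {"Pending": 0, "Running": 0, "Completed / Failed": 0, "Other": 0}
    kwargs = {"ClusterId": cluster_id}
    while True:
        page = emr_client.list_steps(**kwargs)
        for step in page.get("Steps", []):
            label = DOC_STATE.get(step["Status"]["State"], "Other")
            counts[label] += 1
        marker = page.get("Marker")
        if not marker:
            return counts
        kwargs["Marker"] = marker  # ask for the next page of results
```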

It is worth noting that only one step runs at a time. EMR steps work as a first-in, first-out (FIFO) queue, meaning that the first job that goes in is the first job that will be completed.
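The ordering can be illustrated with a small simulation. It does not talk to EMR; it only models a queue in which steps run one at a time in submission order, and simulate_step_queue is a hypothetical name. Every step is assumed to succeed, so the final state is always Completed.

```python
from collections import deque


def simulate_step_queue(step_names):
    """Return the (step, state) events produced when steps run one at a time."""
    queue = deque()
    events = []
    for name in step_names:
        queue.append(name)  # a newly submitted step waits at the back of the queue
        events.append((name, "Pending"))
    while queue:
        name = queue.popleft()  # the oldest submission leaves the queue first
        events.append((name, "Running"))
        events.append((name, "Completed"))
    return events
```

With two submitted steps, the second one stays Pending until the first has gone through Running and Completed.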

Describing a failed, successful and pending task

To cancel a step that is still in a Pending state, use the Cancel step button. If you want to cancel a pipeline that is currently Running, you have to follow this guide.
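The Pending case can also be sketched in code. The example assumes a boto3 EMR client is passed in and uses its describe_step and cancel_steps calls. cancel_step_if_pending is a hypothetical helper, and it deliberately does nothing for a Running step, since that case is covered by the guide.

```python
def cancel_step_if_pending(emr_client, cluster_id, step_id):
    """Cancel one step only if it is still PENDING.

    Returns True if a cancel request was sent, False otherwise.
    """
    step = emr_client.describe_step(ClusterId=cluster_id, StepId=step_id)["Step"]
    if step["Status"]["State"] != "PENDING":
        # Running steps are left alone on purpose; follow the guide for those.
        return False
    emr_client.cancel_steps(ClusterId=cluster_id, StepIds=[step_id])
    return True
```

A return value of True only means that the request was sent; the step state can be checked again afterwards to confirm the cancellation.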
