Glue and Athena

There is a way to directly query data from our data lake that resides on S3. To do this, three AWS services are needed:

  1. AWS S3: is an object storage service that offers industry-leading scalability, data availability, security, and performance.

  2. AWS Glue: AWS Glue is a server-less data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.

  3. AWS Athena: Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.

Glue

We want to start off by finding the S3 folder on our data-lake. Let us assume we want to use this path: s3://prime-data-lake/production/prime/vdh/a_data_primer_platform/results/your_name/maths_students_that_passed/ .

Careful, we need to use the S3 URi path (the one above), not the one from AWS, meaning this is not correct: https://s3.console.aws.amazon.com/s3/buckets/prime-data-lake?prefix=production/client/data/ - WRONG!

Having copied the correct URi, open the AWS Glue service on your AWS console and navigate to Crawlers, as in the image:

We need to add a crawler which crawls the S3 data and put results as a table on Athena, let's create a crawler by clicking Create Crawler button.

  • Add information about your crawler - the name of the crawler, we usually skip description.

  • Click on add a data source

  • Add a data store - we specify our path at the Include Path section.

  • We can still use the Add a data source button if we want to add multiple sources, in this case we will just click next in order to keep it simple.

  • Choose an IAM Role - we chose AWSGlueServiceRole-Prime.

  • Configure the crawler's output - specify database (use default for now) and add a prefix, the table name is appended to your prefix.

  • Configure output wrap up.

And click Create crawler, then you can see your crawler in the list. You can start it by clicking Run Crawler.

Athena

Having done all of this, you can finally navigate to AWS Athena service and use the database you have set earlier.

You can click the 3 dots on the right of the table, and then "preview table" to see the first 10 rows of your table

The result will be something like this:

You may get an error saying that you need to have a location for your query results, in that case you can use this location: s3://aws-athena-query-results-346515021782-eu-central-1/

It is worth mentioning that there is a cost applied to every service we request. Make sure to remove the tables you have created after using them.

Please feel free to explore more about Athena and SQL in general since it is of vital importance to know some basic queries and the general SQL logic is also very useful in building pipelines.

Last updated

Was this helpful?