Pipeline Debugging
Execution Status
We have executed our two versions of the pipeline: the un-versioned one and the versioned one. We keep using the word version because versioning is a very important part of our routine, and we want the idea to stick in your head. So, version, version, version. That'll do it!
First, let's check in the Activity Status (right bar) whether both versions of our pipeline finished executing successfully. We can do this from the platform, or we can do it using AWS. We will now navigate to AWS to check the execution status there as well.
Once you're signed in to AWS (we assume you already have an account there as part of the onboarding process), use the search bar to navigate to the EMR service. The window that opens shows a table with several rows of data. We will get to the details later (in the AWS section - add link here), so for now we only show you what you need to know. At the top of the table, change the filter from All Clusters to Active Clusters. The table should now contain at least 5 rows of data; click on the Spark Production Cluster.
In this new window you can see a lot of tabs; navigate to the Steps tab. It also contains a table, and somewhere in it you will find your executions. In the Filter field at the top of the table, type Onboarding, and the table should now show only the rows containing your executions. You can identify them by your version's name. It should look something like this:

From here we can track the status of the steps. When we wrote this documentation we weren't patient enough to wait for both pipelines to finish, so the screenshot was taken before everything completed. As you can see, the first version completed successfully (success status) with a run time of 46 seconds, while the second version has not finished yet and is still queued for execution. If that's the case for you too, you will unfortunately have to refresh the table 3 times per second until your pipeline finishes.
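If you would rather not keep refreshing the console by hand, the sketch below shows one way to poll the step status with boto3. The cluster name and the Onboarding filter come from the steps above; the region and the exact step names are assumptions, so adjust them to your setup.

```python
# A minimal sketch for polling EMR step status with boto3 instead of
# refreshing the console. The cluster name and the "Onboarding" prefix
# come from the walkthrough above; the region is an assumption.
import boto3

emr = boto3.client("emr", region_name="eu-west-1")  # assumed region

# Find the active cluster named "Spark Production Cluster".
clusters = emr.list_clusters(ClusterStates=["RUNNING", "WAITING"])["Clusters"]
cluster_id = next(
    c["Id"] for c in clusters if c["Name"] == "Spark Production Cluster"
)

# List its steps and keep the ones belonging to your onboarding pipelines.
steps = emr.list_steps(ClusterId=cluster_id)["Steps"]
for step in steps:
    if "Onboarding" in step["Name"]:
        print(step["Name"], "->", step["Status"]["State"])
```

Running this prints one line per matching step, so you can simply re-run it until your second version shows up as completed.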
Debugging a pipeline execution
In the unfortunate case that a pipeline fails, we need to know what happened. That is what the log files, on the right of the screen, are for. If the status of a step (row) is Failed, click the stderr button on the right of the table for the step that failed. This opens a new browser tab with insights into where the pipeline failed. We will get to error handling later on, but since there might be issues with your first pipeline, it is good for you to know how to check what happened. Also, in case of any issues, feel free to contact your mentor; a failure might not be your fault, so never hesitate to ask.
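If you prefer to inspect a failure from code, the hedged sketch below uses boto3's describe_step call, which for failed steps usually includes the failure reason, message, and the location of the log file. The cluster and step IDs are placeholders; take the real ones from the console or from the listing sketch above.

```python
# A sketch of inspecting a failed EMR step with boto3. The IDs below are
# placeholders - take the real ones from the console or the listing above.
import boto3

emr = boto3.client("emr", region_name="eu-west-1")  # assumed region

cluster_id = "j-XXXXXXXXXXXXX"  # placeholder cluster ID
step_id = "s-XXXXXXXXXXXXX"     # placeholder step ID

status = emr.describe_step(ClusterId=cluster_id, StepId=step_id)["Step"]["Status"]
print("State:", status["State"])

# For failed steps, EMR usually attaches failure details, including the
# path to the stderr log file in the cluster's log bucket.
details = status.get("FailureDetails", {})
print("Reason:  ", details.get("Reason"))
print("Message: ", details.get("Message"))
print("Log file:", details.get("LogFile"))
```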
Find your files
If the executions finished successfully, our data has been written (saved) to our storage. Again using the AWS search bar, navigate to the S3 service. There you can see a list of all PRIME buckets. To find the files we've written, enter the prime-data-lake bucket. It might take some time to fetch everything there; if it takes too long, type prod in the search bar and press Enter to get there faster. After that, choose prime, then vdh, and finally a_data_primer_platform.
There you can find the input file we used in the pipeline (student_grades.csv). In the results folder, using the path you set in the exporter, find the outputs of your pipelines.
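If you would rather list these files from code than click through the console, the sketch below uses boto3 to list what's under the prefix described above. The results prefix depends on the path you set in your exporter, so the one here is just a placeholder.

```python
# A sketch of listing the files under the data lake prefix with boto3.
# The bucket name comes from the walkthrough above; the prefix is a
# placeholder - substitute the path you set in your exporter.
import boto3

s3 = boto3.client("s3")

prefix = "prime/vdh/a_data_primer_platform/"  # placeholder prefix
resp = s3.list_objects_v2(Bucket="prime-data-lake", Prefix=prefix)

for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"], "bytes")
```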
There are many ways to read your output, like:
using an importer and its Data Overview feature on VDH
using S3 Query
using Python if you have your Jupyter notebook set up (see the sketch after this list)
You can use whichever you want for this example (or maybe all of them); it's up to you.
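For the Python option, a minimal sketch could look like the one below, assuming your output was written as CSV and your notebook has boto3 and pandas installed. The object key is a placeholder for whatever path your exporter actually wrote to.

```python
# A minimal sketch for reading a CSV output from S3 in a Jupyter notebook.
# The key is a placeholder - use the results path your exporter wrote to.
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")
obj = s3.get_object(
    Bucket="prime-data-lake",
    Key="prime/vdh/a_data_primer_platform/results/your_output.csv",  # placeholder
)

df = pd.read_csv(io.BytesIO(obj["Body"].read()))
df.head()
```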