Your first pipeline

After understanding the VDH interface, you are ready to create your very first pipeline!

Example: Student Grading

Let's start using these modules. For our first pipeline we are going to:

  1. Extract the data from a CSV file containing student grading information:

    • Click on the Tools Icon, and navigate to the Connectors tab. There, from the importers subcategory, drag and drop a CSV module.

    • Click on the dropped block and paste the following path there: s3a://prime-data-lake/production/prime/vdh/a_data_primer_platform/input/student_grades.csv. This is the path of the file on our storage system (we will get to that later). If you click on the Data Overview tab, you will see that everything is in one column, so go back to 'Configure CSV Schema' and change the delimiter from ; to ,. This will fix the data representation.

  2. Rename the columns of the file:

    • On the Processors tab of VDH Modules, pick the Rename module, and drop it on the main screen

    • Connect the output point of the CSV module to the input point of the rename module

    • Then rename the following columns:

      • name -> student_name

      • surname -> student_surname

      • grade -> student_grade

    • Check the Data Overview again on the Rename module to see the changes you've made. Do this after every change you make to track your progress.

  3. Include only the Mathematics students

    • From the VDH Modules, drag the Filter module and connect it to the Rename module

    • In the Column field, choose subject from the dropdown list

    • Leave Equals as the Condition

    • Write Mathematics as the Value

  4. Remove the students that did not pass

    • Drag another Filter module and connect it to the first one

    • Choose student_grade as the Column

    • Choose Greater Than as the Condition

    • Write 5 as the Value

  5. Save the data of the students that successfully passed the exam

    • Drag a Parquet Exporter from the VDH Modules

    • Set path to: s3a://prime-data-lake/production/prime/vdh/a_data_primer_platform/results/your_name/maths_students_that_passed
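The five steps above can be sketched as plain Python. This is a minimal local illustration, not what VDH runs: it uses the standard csv module on an in-memory string instead of the S3 path, and the input column names (name, surname, grade, subject) are assumed from the rename step.

```python
import csv
import io

# Rename mapping from step 2 of the tutorial.
RENAMES = {"name": "student_name", "surname": "student_surname", "grade": "student_grade"}

def run_pipeline(csv_text):
    # 1. Extract: parse the CSV with ',' as the delimiter.
    rows = list(csv.DictReader(io.StringIO(csv_text), delimiter=","))
    # 2. Rename the columns.
    rows = [{RENAMES.get(k, k): v for k, v in row.items()} for row in rows]
    # 3. Filter: subject Equals Mathematics.
    rows = [r for r in rows if r["subject"] == "Mathematics"]
    # 4. Filter: student_grade Greater Than 5.
    rows = [r for r in rows if float(r["student_grade"]) > 5]
    # 5. Load: here we just return the rows; VDH would write Parquet to S3.
    return rows

sample = (
    "name,surname,grade,subject\n"
    "Ada,Lovelace,9,Mathematics\n"
    "Alan,Turing,4,Mathematics\n"
    "Grace,Hopper,8,Physics\n"
)
print(run_pipeline(sample))
# → [{'student_name': 'Ada', 'student_surname': 'Lovelace', 'student_grade': '9', 'subject': 'Mathematics'}]
```

Only Ada survives both filters: Alan fails the grade check and Grace is not a Mathematics student, which mirrors what you should see in the Data Overview of the second Filter module.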

At this point, your pipeline should look something like this.

A simple transformer of a pipeline

Importing and Exporting Restrictions

There are some restrictions that will make your pipeline fail with a Forbidden Error when it reads from or writes to a non-permitted location. Every environment is isolated, meaning that you can read from and write to a single organization only. Within the environment, there are further restrictions: you can read from the client's raw S3 data (for example, if you are using the Xenos environment, you can read data from the prime-xenos S3 bucket), but you cannot write to it. You can only write within:

s3://.../production/client/vdh/

In this specific case, since the client is Xenos, the exact path would be:

s3://prime-data-lake/production/xenos/vdh/
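The write restriction amounts to a prefix check on the output path. The sketch below is a hypothetical illustration of that rule; the function name and exact bucket layout are assumptions, and VDH enforces this internally.

```python
# Hypothetical sketch of the write restriction: writes are allowed only under
# production/<client>/vdh/ in the shared data-lake bucket.
def is_write_allowed(path: str, client: str) -> bool:
    prefix = f"s3://prime-data-lake/production/{client}/vdh/"
    return path.startswith(prefix)

# Writing under the client's vdh/ prefix is allowed...
print(is_write_allowed("s3://prime-data-lake/production/xenos/vdh/out/", "xenos"))   # → True
# ...but the client's raw bucket is read-only, so a write there is forbidden.
print(is_write_allowed("s3://prime-xenos/raw/transactions.csv", "xenos"))            # → False
```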

Now that we know our way around, press the Run Once button to execute the pipeline that we have created, and while it executes, move on to the next section.

Versioning

Other than the environment variables, which we briefly explained above and which provide a means of passing parameters to a pipeline, there is another powerful feature of VDH that enables us to automate and schedule pipelines to process daily or weekly batches without any manual changes. This is done via versioning, which lets us read files based on their version. We use date-formatted versions in importer and exporter paths. Say that a client delivers transactional data weekly. Over two consecutive weeks, the files arrive under these names:

s3a://prime-data-lake/production/prime/vdh/a_data_primer_platform/2022_05_16/transactions.csv

s3a://prime-data-lake/production/prime/vdh/a_data_primer_platform/2022_05_23/transactions.csv

To avoid opening every pipeline each week and changing all the importer paths to the current date, we use the following naming convention for this importer:

s3a://prime-data-lake/production/prime/vdh/a_data_primer_platform/version/transactions.csv

After doing that, we need to specify the version format, since instead of a 2022_05_16 directory, another file might be delivered in a 20220516 directory. To create the version format, check the following image and the steps below to apply versioning to your pipeline.
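To see why the format matters, here is the same delivery date rendered in the two directory styles mentioned above. The Java-style patterns (yyyy_MM_dd, yyyyMMdd) are mapped to Python strftime patterns purely for illustration; that mapping is my assumption, not part of VDH.

```python
from datetime import date

delivery = date(2022, 5, 16)

# yyyy_MM_dd style, as in the paths shown earlier.
print(delivery.strftime("%Y_%m_%d"))  # → 2022_05_16
# yyyyMMdd style, as another client delivery might use.
print(delivery.strftime("%Y%m%d"))    # → 20220516
```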

Enabling versioning for a pipeline

To apply versioning to our pipeline, go back to VDH (make sure you are still within your project). Find your pipeline and clone the first version that you created. Give it the same name with the suffix - versioned. What we want to do now is version the output data, so open the new pipeline version and follow these steps on the exporter.

  1. Click on the exporter module

  2. Change the path to: s3a://prime-data-lake/production/prime/vdh/a_data_primer_platform/results/your_name/version/maths_students_that_passed

  3. Click on the edit button (no. 1 from the image above)

  4. On the Version Timeline Format (no. 3) change the underscores to dashes. It should be yyyy-MM-dd.

  5. For the purpose of this example, we will write to the day before, so set the value to -1 Days (no. 4).

  6. Without changing the Replacement Tag (no. 2) value, click on the Save Changes button.
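The substitution configured in the steps above can be sketched as follows. This is an assumption about how the resolution could work, not VDH's actual implementation; the function name is hypothetical, and the Java-style yyyy-MM-dd pattern from step 4 is expressed as the strftime pattern %Y-%m-%d.

```python
from datetime import date, timedelta

def resolve_versioned_path(path, fmt="%Y-%m-%d", offset_days=-1, tag="version"):
    # Compute the versioned date (the default -1 Days means yesterday, as in
    # step 5) and substitute it for the replacement tag in the exporter path.
    version = (date.today() + timedelta(days=offset_days)).strftime(fmt)
    return path.replace(f"/{tag}/", f"/{version}/")

path = ("s3a://prime-data-lake/production/prime/vdh/a_data_primer_platform/"
        "results/your_name/version/maths_students_that_passed")
print(resolve_versioned_path(path))
```

Run today, the tag resolves to yesterday's date; run tomorrow, it resolves to today's, which is exactly why rerunning the versioned pipeline produces a fresh output path each day.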

You can now see that the versioned path differs from the regular path, as version has been replaced with yesterday's date. Running this version tomorrow will produce a different output path. It is worth noting that pipelines can be scheduled in Platform through the Scheduler service. We discuss this service in depth in the Scheduler section of A Data Primer: Engineer.

Run this version as well, and we'll proceed with the next sections to check your results.
