Pipeline Construction
Click on the version of the pipeline we created in the previous section, and let's build your first pipeline together, explaining each feature step by step.
Pipe
A pipeline consists of multiple sub-processes, which we will refer to as pipes (or, in some cases, VDH Modules). For now, we will define a pipeline as an ordered collection of pipes. Each pipe has at least one input dataset, a specific transformation applied to it, and an output dataset.
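As a mental model only (VDH pipes are configured visually in the UI rather than written as code, and all names below are made up), the structure described above can be sketched like this:

```python
from dataclasses import dataclass
from typing import List

# Mental model only - VDH pipes are built in the UI, not written as code.
@dataclass
class Pipe:
    inputs: List[str]   # at least one input dataset
    transform: str      # the transformation this pipe applies (e.g. "Filter", "Join")
    output: str         # the output dataset

# A pipeline is an ordered collection of pipes.
pipeline: List[Pipe] = [
    Pipe(inputs=["raw_orders"], transform="Filter", output="recent_orders"),
    Pipe(inputs=["recent_orders", "customers"], transform="Join", output="orders_with_customers"),
]
```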
Right Bar
For a pipeline to execute, it needs at least one complete end-to-end dataflow. So let's start building our first pipes. To do that, we will first explain the right bar, seen in the image below:

Below you can find a brief explanation of the options that each of these icons leads to.
1. Tools Icon
The Tools icon shows the full list of all VDH Modules. As seen in the image below, clicking on the Tools icon reveals every available module.

We divide our VDH Modules into three main categories and several subcategories. As can be seen in the image, each module contains extra information regarding its usage. To use a module in the pipeline, either drag and drop it onto the main screen, or click on it and it will automatically appear at a location on your screen.
Connectors: define the entry and exit points of the dataflow. This category has two subcategories:
Importers: represent the inputs of an end-to-end dataflow. VDH supports various data sources, but the most commonly used ones are Parquet, CSV, Parquet - Data Lake, and JSON.
Exporters: represent the output of an end-to-end dataflow. The most commonly used are Parquet and Parquet - Data Lake, due to their performance benefits. It is always recommended to use Parquet as an exporting module because of its capabilities: Parquet is a columnar storage format designed to support complex data processing, and compared to a format like CSV it offers compelling benefits in terms of cost, efficiency, and flexibility.
Processors: contains the transformation pipes (a conceptual sketch of how a few of them map to Spark operations follows this list). This category has ten subcategories:
Filters: contains one filtering module.
Transformers: contains general transformers of the dataset.
Strings: contains transformers applicable to columns of string data type.
Joins: contains transformers that combine two datasets using different techniques.
Numbers: contains transformers applicable to columns of numeric data types.
Geo Spatial: contains one module for calculating the distance between two geo points.
Web: contains crawlers and scrapers of web pages (used mostly for Scraping).
Aggregators: contains Group By and Window Functions.
Date and Time: contains transformers applicable to columns of date/timestamp data type.
Others: contains Note, which is not a transformer; rather, it serves as one of the pipeline documentation features.
Analytics: also contains transformation pipes, but these are mostly used in our ML Solutions to extract statistics from the input dataset. This category has three subcategories:
Data Quality Assurance: contains transformers that pass the data through data quality checks and return statistical results of these checks.
Content Validators: these are mostly used before starting the data transformation process, to check whether the input data contains valid content.
Analytics: contains transformers that calculate statistical functions on the dataset columns.
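To make these categories more tangible, here is a rough sketch of what one complete dataflow could look like if it were written directly in Spark code instead of being assembled from VDH modules. The file paths and column names are invented for illustration, and the actual modules may behave differently:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("vdh-analogy").getOrCreate()

# Importers: read the input datasets (hypothetical paths).
orders = spark.read.csv("/data/orders.csv", header=True, inferSchema=True)
customers = spark.read.parquet("/data/customers")

# Processors: Filters, Joins and Aggregators roughly correspond to steps like these.
recent = orders.filter(F.col("order_year") >= 2023)                              # Filters
joined = recent.join(customers, on="customer_id", how="inner")                   # Joins
summary = joined.groupBy("country").agg(F.sum("amount").alias("total_amount"))   # Aggregators (Group By)

# Exporter: writing Parquet is recommended - its columnar layout is cheaper to store
# and faster to scan than a row-based format like CSV.
summary.write.mode("overwrite").parquet("/data/country_totals")
```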
2. Activity Log
The Activity Log offers live monitoring of your currently running Run Once or Scheduled pipeline. You can read more about Run Once and Scheduled jobs in the Scheduler section. The Activity tab also allows you to revert the pipeline to an older version.
If you click on Activity Log > Activity, you will find a window that looks something like this:

This window shows every save event, i.e. every time someone hit the save button on this pipeline, along with the user who made the save.
To the right of every save event, you can see two small icons:
The view icon - shows you the exact version (state) of the pipeline that was saved, meaning you can see the state of the pipeline right after that save. This is very useful, as it lets you easily track changes and get back to older versions.
The compare icon - lets you visually inspect the exact differences between two versions; if you click it, the differences between the versions are highlighted in VDH. We will get to test this cool feature on the next page 😉
3. Settings Icon
The Settings icon allows us to overwrite Spark configurations or set environment variables for our pipeline. An environment variable will replace the matching value anywhere in the pipeline with the one that we set in this section. For example, if we write: {name: "MyPipeline"}
then anywhere in the pipeline, the value name will be replaced with MyPipeline. This feature is handy when we want to run one pipeline multiple times with different parameters, for example, different years.
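Conceptually, the substitution behaves like the sketch below. This is not VDH's actual implementation, and the recipe structure and variable names are invented purely for illustration:

```python
# Hypothetical, simplified pipeline recipe - the real VDH recipe schema will differ.
recipe = {
    "importer": {"format": "parquet", "path": "input_path"},
    "filter":   {"column": "order_year", "equals": "year"},
}

def substitute(node, variables):
    """Recursively replace any value that exactly matches an environment-variable name."""
    if isinstance(node, dict):
        return {key: substitute(value, variables) for key, value in node.items()}
    if isinstance(node, list):
        return [substitute(value, variables) for value in node]
    if isinstance(node, str) and node in variables:
        return variables[node]
    return node

# Running the "same" pipeline twice with different parameters:
print(substitute(recipe, {"input_path": "/data/sales", "year": "2023"}))
print(substitute(recipe, {"input_path": "/data/sales", "year": "2024"}))
```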
The Top Bar
A brief explanation of what every button does:

Add Stage - creates new stages within the same pipeline. This lets us keep multiple pipelines in one place whenever they depend on each other to achieve a single purpose.
Search - can be used to search for anything inside the pipeline. You can search for the name of a column and it will show you every processor that uses that column; you can also search by the names of the tools, and so on.
Run Once - runs the pipeline. The status can be tracked through the Activity Log, which is explained in the Activity Log section. A pipeline runs from left to right, meaning the first stage in your user interface is executed first, then the second stage, and so on. The arrow next to the Run Once button shows the available AWS clusters on which you can choose to run the pipeline; this is explained in more detail in the AWS section. For now, all you need to know is that the Run Once button uses the default "Spark Production Cluster". Once you edit a pipeline, you join that pipeline's subscriber group, meaning you will be notified whenever a pipeline run fails or succeeds. You can disable this option in your account settings.
Stage Enabler / Disabler - enables or disables a stage. A disabled stage will be skipped when the pipeline runs.
Cursor Enabler / Disabler - when enabled, you can select items with your cursor; when disabled, you can move through the pipeline with your cursor.
Commit, Pull, and Push buttons - similar in philosophy to Git versioning.
Commit - saves all the changes that you made to your pipeline.
Pull - enables you to import changes from another pipeline by simply adding the Pipeline URL.
Push - saves the changes while also incrementing the version, and gives you the option to leave a note explaining the changes. For Scheduled pipelines, only the latest pushed version is taken into consideration: if you only commit your changes, they will be taken into account when you hit the "Run Once" button, but not in scheduled runs.
Undo / Redo Buttons - undo or redo your action.
Three Dots → Import / Export Pipeline - makes it possible to export a pipeline and import it into another project or environment using JSON configurations. When exporting, you save the file to your local computer; you can then import that file into another project.
Three Dots → Import / Export Stage - makes it possible to export a stage and import it into another pipeline using JSON configurations. The same procedure as with pipelines applies.
Three Dots → Pipeline Recipe - the representation of the pipeline in JSON (a quick way to inspect an exported file is sketched below).
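If you want to sanity-check an exported file before importing it into another project or environment, a small script like the one below can help. The file name is hypothetical, and the exact recipe schema is internal to VDH:

```python
import json

# Hypothetical file name - use whatever name you gave the exported pipeline.
with open("my_pipeline_export.json") as f:
    recipe = json.load(f)   # fails loudly if the export is not valid JSON

# Peek at the structure without assuming anything about the schema.
print("Top-level structure of the exported recipe:", list(recipe))
```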