Platform

Search inside a pipeline

There are many connectors and processors available in VDH. We usually get familiar with the most used ones and know where to find them. However, there is also a very helpful feature: right-click anywhere within the pipeline and search for the connector/processor you need.

Apart from that, we can also search for importers/exporters, columns, and processes in the pipeline's search box and jump directly to what we are looking for (instead of checking every stage of the pipeline).

Compare Schema through the Combine Node

Whenever we need to quickly check whether two files contain the same columns, we can drag in a Combine processor and check in the Column overview whether the number of columns of the first file equals that of the Combine node. If the same column appears in two different data types, it will be listed twice in the Combine's column overview. So, if “File 1” and “File 2” have 27 columns each but there are 28 columns on the “Combine” node, there is a column with a different name or data type.
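Outside VDH, the same check can be reproduced in a few lines. Below is a minimal sketch using pyarrow; the file names are hypothetical placeholders, and the check assumes both files have the same column count, as in the example above.

```python
import pyarrow.parquet as pq

schema_1 = pq.read_schema("file_1.parquet")
schema_2 = pq.read_schema("file_2.parquet")

# Mimic the Combine node: take the union of (name, data type) pairs.
# A column with the same name but a different type counts twice.
combined = (
    set(zip(schema_1.names, map(str, schema_1.types)))
    | set(zip(schema_2.names, map(str, schema_2.types)))
)

if len(combined) > len(schema_1.names):
    print("Schemas differ: a column name or data type does not match.")
else:
    print("Both files expose the same columns and data types.")
```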

Compare columns through the Select Node

Another way of quickly comparing the columns of two importers that read the same path (whenever we want to check whether columns are missing) is to select all columns from the first importer, then copy that Select node and link the copy to the second importer. For example, with two identical paths, the copied Select node may flag a column that no longer exists because the file has been updated in the meantime.
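The same comparison can also be sketched in code as a simple set difference; the paths below are hypothetical examples.

```python
import pyarrow.parquet as pq

cols_1 = set(pq.read_schema("import_path_old.parquet").names)
cols_2 = set(pq.read_schema("import_path_new.parquet").names)

missing = cols_1 - cols_2  # columns the updated file no longer has
added = cols_2 - cols_1    # columns that are new in the updated file

print("Missing after the update:", sorted(missing))
print("Newly added:", sorted(added))
```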

Sampled path

When processing files that contain huge amounts of data, loading them puts a heavy load on the importer, so it can take a long time until all columns are loaded in the Parquet importer. As a solution, we can add “part-00000*” to the sampled path; this loads the schema from only the first partition of the file.
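The idea can be illustrated outside VDH as well. Assuming a Spark-style partitioned dataset (files named part-00000-*, part-00001-*, and so on), reading the schema of just the first partition is cheap and still lists every column; the dataset path below is a hypothetical example.

```python
import glob
import pyarrow.parquet as pq

# Match only the first partition instead of scanning the whole dataset.
first_part = glob.glob("my_dataset/part-00000*")[0]  # hypothetical path

# The schema of one partition already lists all columns of the dataset.
print(pq.read_schema(first_part))
```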

Sanitize Option

In case the client sends files with columns containing spaces, we have a handy feature in the Rename Node that automatically detects columns that need to be renamed. That way, we save time and prevent errors (for example, if there were a space at the end of a column name, the file could fail to load).
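For intuition, here is roughly what such a sanitize step does; the exact rules VDH applies are an assumption here, and this pandas sketch simply strips surrounding whitespace and replaces inner spaces with underscores.

```python
import pandas as pd

# Column names as a client might deliver them, with stray spaces.
df = pd.DataFrame(columns=["customer id ", " order date", "amount"])

# Strip surrounding whitespace and replace inner spaces with underscores.
df.columns = [col.strip().replace(" ", "_") for col in df.columns]
print(list(df.columns))  # ['customer_id', 'order_date', 'amount']
```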

Copying the S3 path of Data Lake exporter

If a pipeline has only a Data Lake exporter, it is a little complicated to find out where in S3 the file is being saved. Follow the steps below for an easy way to locate it:

After you have run the pipeline, click the three-dot menu and open the Pipeline Recipe.

In the Search tab (CTRL+F), search for the name of the Data Lake's versioned path; the matching entry in the recipe contains the S3 path, which you can select and copy.

The last step is to look the path up in the S3 bucket.
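If you prefer doing that last step from code rather than the S3 console, a minimal boto3 sketch could look like the following; the bucket name and prefix are hypothetical and should come from the S3 path copied out of the recipe.

```python
import boto3

s3 = boto3.client("s3")

# Bucket and prefix are placeholders for the path found in the recipe.
response = s3.list_objects_v2(Bucket="my-data-lake", Prefix="exports/my_pipeline/")

for obj in response.get("Contents", []):
    print(obj["Key"], obj["LastModified"])
```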

Replace Commas before you export to CSV

When saving data into CSV files where the comma is used as the delimiter, it is important to first replace the commas inside the input columns with some other character, so that the output columns do not shift and break the file's structure.
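A minimal pandas sketch of this workaround is shown below; the column values and the replacement character (a semicolon) are illustrative choices. Note that most CSV writers can also quote fields that contain the delimiter, but replacing the commas up front is the safer option when the downstream reader does not handle quoting.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Doe, John"], "city": ["Berlin"]})

# Replace commas in every text column before writing the CSV,
# so the delimiter count stays consistent across all rows.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.replace(",", ";", regex=False)

df.to_csv("output.csv", index=False)
```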
