Common Use Cases

Processing JSON Data

We can process JSON data through VDH as well. Assume we have the following use case:

Use Case

Usually, when we want to hit a REST API endpoint, there are credentials to provide. We will not worry about that here; we assume we already have the data ready, and it looks like this (REST API endpoints very often wrap their response in a result key):

"result":[{"availability":"UNAVAILABLE","bundles":[1,2,3]},{"availability":"AVAILABLE","bundles":[1]},{"availability":"PROBABLY","bundles":[100]}]}

Solution

Let's load this into VDH. Since the JSON importer does one round of auto-parsing, the Data Overview will look like this:

First JSON look

Right now, this row is a string. To process it, we have to parse it as JSON. Parsing converts one string of data into a different, structured data type. In our case, we parse this row into an array of objects, meaning the row corresponds to this data type:

array<struct<availability:string,bundles:array<int>>>

Knowing this, we can create a new column expression in VDH, we can name it whatever we want and put the following code:

from_json(result, 'array<struct<availability:string,bundles:array<int>>>')

Here, from_json does the parsing: result is the string column we want to parse, and the second parameter is the type we are converting to. Having done this, the Data Overview looks the same, but the two columns are not structurally the same, since one holds a string and the other a parsed JSON structure.

Parsing text to JSON
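
If you want to see what this expression does outside of VDH, here is a minimal PySpark sketch; the Spark session, the hard-coded sample string and the column names result and json_result are illustrative assumptions only, in VDH you simply add the expression as a new column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("vdh-json-example").getOrCreate()

# The array that sits under the result key, held as one string column named result,
# which is roughly what the importer's auto-parsing leaves us with.
raw = '[{"availability":"UNAVAILABLE","bundles":[1,2,3]},{"availability":"AVAILABLE","bundles":[1]},{"availability":"PROBABLY","bundles":[100]}]'
df = spark.createDataFrame([(raw,)], ["result"])

# Parse the string column into an array of structs, exactly as in the VDH expression.
parsed = df.selectExpr(
    "result",
    "from_json(result, 'array<struct<availability:string,bundles:array<int>>>') AS json_result",
)
parsed.printSchema()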

Now that we have a parsed JSON array, we can explode it to turn each object into its own row.

explode(json_result)

We can now work with each row individually; we could not do this while the value was still a string, since square brackets are then recognized as string characters, not as arrays.

Exploding array into column exploded_row
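
Continuing the PySpark sketch from above (again, only an illustration of the same expression), explode gives one row per array element:

# One row per element of json_result; the alias exploded_row mirrors the column name used in VDH.
exploded = parsed.selectExpr("explode(json_result) AS exploded_row")
exploded.show(truncate=False)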

Now let's say we want to create one column that holds the availability value and another that holds the product of the values inside bundles. For the first one, we can use: exploded_row.availability

And for the second one:

aggregate(exploded_row.bundles, 1, (acc, x) -> acc * x)

This is a simple reduce function, which is exactly what we need here. The first parameter is the array we want to operate on, the second parameter holds the start value, and the lambda multiplies the accumulator by each element. The Data Overview then looks like this:

Extracting values from JSON and using aggregations
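
Continuing the same PySpark sketch, the two new columns described above would look like this (the output column names are made up for illustration):

# Pull the availability field out of the struct and reduce the bundles array to a single product.
final = exploded.selectExpr(
    "exploded_row.availability AS availability",
    "aggregate(exploded_row.bundles, 1, (acc, x) -> acc * x) AS bundles_product",
)
final.show()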

We can now say that we have successfully parsed the JSON data and converted it into a format that we can handle more easily.

Source Sampling

We have integrated a feature into VDH to make our lives easier. We call this feature sampling, and as the name suggests, we use it to sample data. It is exactly this part of VDH that gives us insights in the Column Overview, Data Overview and Column Statistics while we are developing a pipeline. It lets you debug each step, ensuring that you will not run into unexpected basic errors.

But why is sampling needed here? When you created your first pipeline, the source had only 5 rows, split into 4 columns, so the Data Overview could show you the full data along the way. That was only possible because the amount of data was really small and the processing we did was simple.

Suppose we have to process the transactions file we mentioned in the previous section, which holds a huge amount of data. Without sampling the data before processing, every step would read the whole source, and that would take A LOT of time. So for each VDH module that you add, you would waste a lot of time just waiting while configuring and testing it.

That is why we first sample the data: we process only a part of the whole dataset, which makes it easier to develop and test at a faster pace. You can look at it as a faster way of simulating the full run while testing the module configurations. Sampling works like this: on whichever module you open Data Overview, a sample is extracted from the importer and processed through all processor modules up to the module in which you opened Data Overview.

As good and helpful a feature as sampling is, there are some things we need to consider while constructing pipelines, because a simple change can do unexpected things in the background. Suppose you have removed columns from the importer schema, but now you want to change the path of the file. If you change it on the Configure tab (as you probably would), you would think you only changed the path. In reality, that change on the Configure tab UI triggers an API call that runs some processing in the background, and suddenly all the columns of the source are included again, so you would have to re-exclude them.

However, there is a solution. If you don't want to drop columns multiple times, you can change the path of the source without triggering an API call. And how do we do that? The answer is magic. And by magic, we mean our JSON Config tab, which we have come to love more and more day after day. It allows us to do really cool stuff, and one of those things is changing the path without triggering an API call. So we go to that tab, change the value of the path key, save the pipeline, and continue living "happily".
