Common Use Cases

Besides what we have already mentioned, there is still a lot to cover. Here, we list a few use cases that we commonly face.

Data Source Location

Sometimes we might want to find the Dashboard data and query it ourselves, for example using Athena.

In order to do that, we first locate the Data Lake exporter pipeline. In our case, the URL of the pipeline looks like this:

platform.goprime.io/5e64f1ab8132210001462924/vetl/62695777e73d8f1d879284fa/627d0f7eccee4a0115de81c0/62947d69fbac8163fe541a85

The environment we are working in is the Prime environment (as you can see from the homepage). The final S3 path where our data resides is:

s3a://prime-data-lake/production/prime/vdh/processed/627d0f7eccee4a0115de81c0/62947d69fbac8163fe541a85/your_name_dashboard_2022_05_30

If we compare the two side by side, we can see that:

Side-by-side comparison of the links

We can navigate to the Data Lake exporter data through the pipeline ID and version ID.
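As a small illustration of that mapping, the sketch below extracts the pipeline and version IDs from the URL and builds the corresponding S3 prefix in Python. It only demonstrates the idea; the bucket and path layout are taken from the example above and may differ per environment:

# Sketch: derive the S3 prefix of a Data Lake export from the pipeline URL.
pipeline_url = "platform.goprime.io/5e64f1ab8132210001462924/vetl/62695777e73d8f1d879284fa/627d0f7eccee4a0115de81c0/62947d69fbac8163fe541a85"

parts = pipeline_url.split("/")
pipeline_id, version_id = parts[-2], parts[-1]

# The export folder (here your_name_dashboard_2022_05_30) sits under this prefix.
s3_prefix = (
    "s3a://prime-data-lake/production/prime/vdh/processed/"
    f"{pipeline_id}/{version_id}/"
)
print(s3_prefix)  # point Athena (or any other query engine) at this location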

Data Splits and Partitions

Dashboards may load fast or slow depending on the structure of the data. There are two main factors that have a direct impact on platform performance:

  1. The output dataset having too many partitions.

  2. On-the-fly calculations.

We will talk about the second point later; for now, let us elaborate on the first one.

In order for the pipeline to complete as fast as possible, our processing engine, Spark, partitions the data, meaning it distributes the load across its processing cores. The main problem is that, unless the user specifies otherwise, Spark will most likely generate hundreds of partitions. What do partitions look like?

Partition overview
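For context, a layout like the one above is produced by a Spark write. Below is a minimal PySpark sketch of such an export; the dataframe, split count, and output path are assumptions for illustration only:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("export-sketch").getOrCreate()

# A made-up dataframe standing in for the pipeline output.
df = spark.range(1_000_000).withColumnRenamed("id", "order_id")

# Requesting 5 splits repartitions the data into 5 part files; Spark also
# writes a success marker once the job finishes, which is not a data partition.
df.repartition(5).write.mode("overwrite").parquet(
    "s3a://prime-data-lake/production/prime/vdh/processed/<pipelineId>/<versionId>/example_export"
)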

In the screenshot above, Spark has generated 5 partitions because we requested it to (we specified 5 splits). The first object is not a partition; it only indicates whether the pipeline has finished. Every time you request a dashboard, Presto, which is our query engine, has to combine all of these files to give you a visual display in the dashboards. In this case, 5 partitions are easy to handle (depending on their size), but with more of them, the overhead grows in proportion to the number of partitions. Let us assume we have the following dataset, which Presto runs queries against to provide Widgets with the required content.

S3 partition overview with sizes

We have roughly 500 partitions (501 objects, counting the success marker) with a size of 2.5 MB each.

Calculating the total size, we get about 500 * 2.5 MB = 1250 MB. As we can tell, the data size is quite big, but another problem with this dataset is that it is split into 500 parts. For Presto to collect all of these partitions, it needs time and processing power.

The ideal layout for a data source is N partitions of roughly 128 MB each. Since our export has many partitions with small sizes, a refactoring is needed. We can either:

  1. Reduce it to around 10-11 partitions of roughly 120 MB each.

  2. Partition the data by a column that is frequently used for filtering.

The conclusion is that, once you know the output dataset size, you should set the number of splits to:

datasetSize (in MB) / 128
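As a worked example in Python, using the dataset discussed above and the 128 MB guideline (the numbers are approximate):

import math

dataset_size_mb = 500 * 2.5           # ~1250 MB, the export discussed above
target_partition_mb = 128             # ideal partition size from the guideline

splits = max(1, math.ceil(dataset_size_mb / target_partition_mb))
print(splits)                         # 10 splits of roughly 125 MB each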

Repartitioning is the process of forcing Spark, which would otherwise auto-generate a large number of partitions (around 500 by default), to use fewer of them. We can do this in two ways:

  1. Specify a number in the splits field of the UI while creating the Data Lake exporter - good when we already know the output size.

  2. Set optimize to true in the JSON Configuration of the Data Lake exporter - good when we do not know the output size; the application decides the optimal number of splits for us:

"optimize": { "enabled": true, "size": 64, "tolerance": 5, "set": false }
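The exact behaviour of optimize is up to the application, but conceptually it picks a split count from the observed output size. The sketch below is purely illustrative and assumes that size is the target partition size in MB; the real tolerance handling is not modelled:

import math

def estimate_splits(output_size_mb: float, target_mb: int = 64) -> int:
    # Illustrative only: choose the number of splits so that each partition
    # lands close to the target size used by the optimize setting.
    return max(1, math.ceil(output_size_mb / target_mb))

print(estimate_splits(1250.0))        # 20 splits of roughly 62.5 MB each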

Partition by is the database process in which very large tables are divided into multiple smaller parts. This process is a double-edged sword:

  1. If used correctly, it immensely increases performance because it enables Spark to skip a full scan of all the data.

  2. If used incorrectly, it generates dozens of small partitions inside dozens of objects, making the overhead even worse.

It is ideal to partition by columns that are frequently used in filtering. For example, if a dashboard displays data per year, then using that column as the partition-by column speeds up the process considerably.

An example of a data source partitioned by a column looks like this:

Partitioning by data set

This partitioning was generated through an export connector by adding the following setting:

partitionBy: ['columnToPartitionBy']
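In plain PySpark terms, the setting above roughly corresponds to the following sketch; the input and output paths, dataframe, and filter value are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("partition-by-sketch").getOrCreate()
df = spark.read.parquet("s3a://prime-data-lake/<path-to>/unpartitioned_export")

# partitionBy creates one folder per distinct value of the column,
# e.g. columnToPartitionBy=2021/, columnToPartitionBy=2022/, ...
df.write.mode("overwrite").partitionBy("columnToPartitionBy").parquet(
    "s3a://prime-data-lake/<path-to>/partitioned_export"
)

# A query engine can then skip whole folders when the column is used in a
# filter, instead of scanning the full dataset.
filtered = spark.read.parquet("s3a://prime-data-lake/<path-to>/partitioned_export") \
    .filter(col("columnToPartitionBy") == "2022")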

If we want to partition by a column through a Data Lake exporter, we can specify it in the user interface. An example of that looks like this:

Partitioning in user interface

Both settings mentioned above work for partition by as well. Keep in mind that the overhead also affects partitioned data: a partitioned-by object that holds hundreds of small partitions will still run slowly. Setting optimize to true for this pipeline generates the following results:

Results generated by setting the optimize flag to true

And if we visit one of the objects, we see:

One partition within a partitioned by dataset

This is a perfect example: the category is small and still holds only one partition, which means there is no overhead when Presto scans the full dataset.

Pipeline vs. On The Fly Calculations

The calculations we perform have a big effect on dashboard performance. Let us assume the following use case: we have a product_group_code and a product_group_name, the contents of which are not of particular interest here, and we need to display a concatenation of the two in the dashboard. We have two options to make this possible. Your Own Field is an option provided by Dashboards when you use filters. It allows you to simulate a new column on your table that is calculated while the Widgets are being loaded, also known as an On The Fly calculation. That option looks like this:

Creating your own field

Since we want to concatenate two columns, we can use the following expression:

concat(cast(product_group_code as string), '-', product_group_name)

With this action, we are forcing the dashboard to evaluate our hand-written expression after the initial data from S3 has been read by Presto. If the platform does not hit the cache, this calculation is repeated over and over again. The other option is to create this column while building the Visual Data Hub pipeline. Since this can be done through many modules, we will use the one that works the same way as Your Own Field, namely Column Expression. Here is that module, created right before the data export:

A concatenating module
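For reference, the Column Expression module used here is roughly equivalent to the following PySpark transformation; the dataframe and the new column name are assumptions:

from pyspark.sql.functions import concat_ws, col

# Pre-compute the concatenation in the pipeline, before the data export,
# so Presto reads it straight from S3 instead of evaluating it per query.
df = df.withColumn(
    "product_group",
    concat_ws("-", col("product_group_code").cast("string"), col("product_group_name")),
)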

With this action, we spare the Dashboard from doing additional calculations once the data has been scanned from S3. Concatenation is a cheap calculation, so it is not time consuming; the same might not be true for filters and aggregations. Having gone through the explanation, you can see that doing calculations on the fly is a heavy process. We want to avoid these calculations whenever possible and stick to pipeline calculations. A dashboard load is like running a small pipeline right when the button is clicked, whereas if the calculations are already done in the pipeline, the results are readily accessible to Presto through S3. As an analogy, think of how fast Data Overview would show data if everything were already in the imported data and no calculations were done after importing.
