Importers and Exporters

Glob Patterns

Globs are the patterns you use when running commands such as ls src/*.js, and you will also see them in config files such as a .gitignore (for example, .cache/*). The glob module finds all the path names matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order. In Visual Data Hub, we use globs (the wildcard symbol) in almost every pipeline. Let us assume we have four CSV files to read data from. The CSV files are named:

  1. data_first

  2. data_second

  3. data_third

  4. data_fourth

Each of them has a single column named name and a single row whose value combines the file name suffix with User, meaning the CSV file data_first has one row which contains firstUser. We will go through some examples from VDH below.

  1. Reading everything inside the directory (double wildcards filter within a directory recursively).

  2. Reading data_first and data_fourth, but not data_second and data_third.

  3. Reading everything but data_fourth.

  4. Reading everything that starts with data_f and is followed by exactly four characters.

  5. Reading everything that starts with data_f or data_s.

  6. Reading data_second, data_first and data_fourth.
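
The patterns below show one way to express each of these examples, using Python's glob module (the same Unix-shell rules mentioned above). The directory name data/ is made up for illustration, the files are assumed to be named exactly as listed (without an extension), and the exact glob dialect accepted by a VDH importer path may differ slightly.

    from glob import glob

    # 1. Everything inside the directory, recursively (double wildcard)
    glob("data/**", recursive=True)

    # 2. data_first and data_fourth, but not data_second and data_third
    glob("data/data_f*")

    # 3. Everything but data_fourth (the only name ending in "h")
    glob("data/data_*[!h]")

    # 4. Everything that starts with data_f followed by exactly four characters
    glob("data/data_f????")

    # 5. Everything that starts with data_f or data_s
    glob("data/data_[fs]*")

    # 6. data_second, data_first and data_fourth (everything except data_third)
    glob("data/data_[!t]*")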

Data Source Schema

A database schema is the structure of the database, described in a formal language supported by the database management system (DBMS). In the Platform, you can have a look at a data source schema by clicking the Importer → Column Overview.

Column overview of a data source

When importing a file or folder, be that Parquet, CSV, JSON, etc., there are two options:

  1. Infer schema - let the application decide the data types of all the columns; the application takes a sample of rows and tries to assign a type based on it. Note that because this happens at the sample level, it can sometimes cause trouble.

  2. Custom schema - you can set the types to what you expect the values to be by clicking the Setting icon in a row. This can also cause trouble, since incompatible values will become null.

Use Case 1 - Inferring Schema Disadvantage

Suppose we want to read a comma-separated values (CSV) file with this content:

Code
"1-Mars"
25
5
3

If we do not specify the schema and let the application (in this case Platform) infer it, we get the following result:

#   Column   Type   Allow Nulls?
1   Code     Text   No

This means that every record is represented as a string, so numeric calculations are not possible.

Since most of the values are numbers, we would like this column to be numeric rather than a string, but the application (Platform in this case) will keep setting the type to string. To get this column as an integer type, it is better to set the schema yourself.

In the Platform, we can do this by changing the schema type (as in the first image → Setting Button), while in code, you can create your own schema structure.

Adding a schema column
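
For illustration, a custom schema built in code could look like the following PySpark sketch (the file name sample.csv is an assumption; in the Platform itself the same thing is done through the Setting button mentioned above):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, IntegerType

    spark = SparkSession.builder.getOrCreate()

    # Declare the Code column as an integer instead of letting it be inferred.
    schema = StructType([StructField("Code", IntegerType(), nullable=True)])

    # Values that cannot be parsed as integers (e.g. "1-Mars") become null,
    # matching the custom-schema result shown below.
    df = spark.read.csv("sample.csv", header=True, schema=schema)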

The results with a Custom Schema (set by us):

#   Code
1   -
2   23
3   3
4   5
5   -

The results with an Inferred Schema (set by Platform):

#   Code
1   1 - Mars
2   23
3   3
4   5
5   -

We can perform numerical calculations on the custom-schema result, since its Code column holds integers; the inferred-schema result holds strings, so no numerical operations are available.

In the Platform, if we want the application to infer the schema, we have to remove it from the JSON Configuration block of the module, as explained in the Module JSON Config guide.
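
In the JSON Configuration, this simply means leaving the schema attribute empty:

    "schema": []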

Use Case 2 - Generalizing Data Types

Suppose we want to read a file that contains a column which is supposed to hold a decimal number. Suppose also that the file is sometimes uploaded with this column as a Double type (25.15) and sometimes as a String type (25,15).

In order to process this file even if the schema changes, we might want to read the column as a string (because we can convert a Double to a String, but not vice versa in our case), check whether the value contains a comma or a dot, then cast it to the right format and keep processing it further.

To do this, we want to set the type to String, then use the following expression:

cast(replace(input,',','.') as double)

If the string contains a comma, it is replaced with a dot (which makes the value castable to a Double); if it does not, the value moves forward unchanged. This way, regardless of the input, the processing completes properly.
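
As a rough sketch of the same idea in code (a PySpark example; the column name input mirrors the expression above, and the sample values are made up):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # The column is read as a string; values may arrive as "25.15" or "25,15".
    df = spark.createDataFrame([("25.15",), ("25,15",)], ["input"])

    # Normalize the decimal separator, then cast to double.
    df = df.withColumn("input", F.expr("cast(replace(input, ',', '.') as double)"))

    df.show()  # both rows now hold the double value 25.15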

Use Case 3 - Preselecting Columns on Importers

Suppose that we have a transactions file with 162 columns, containing data for 28'436'519 transactions (this is not real data; we tried to choose digits that make it look like it is). These columns are there for different reasons and are used in different pipelines. In one of the pipelines, we want to retrieve from this file the total revenue per receipt. So, having multiple transactions per receipt, we want to sum up the revenue. This information can be found in 2 of the 162 columns: receipt_code and revenue_per_transaction.

What do we do with the other 160 columns? One way is to read the file and then, using the Select module of VDH, drop all the other columns and continue processing just the two we need. This improves performance, but it is not the optimal solution: we still have processing redundancy in our pipeline, because we are reading 160 unnecessary columns, each with ~28 million rows. We solve this problem by setting a static schema on the importer, which will be the same on each execution. This is done via the Column Overview tab of the importer. As in the image below, VDH has two buttons for this purpose.

Column overview edit and delete buttons

Using the delete button (trash can - or can it?), we can remove the unneeded columns upfront and tell our processing engine exactly what we want to read. Since Parquet is a columnar storage format, the unnecessary columns are never read, which greatly improves performance.
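
For comparison, here is roughly what the trimmed importer amounts to in PySpark (the path transactions/ is made up; the column names follow the example above):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Parquet is columnar, so only the selected columns are read from storage.
    transactions = (spark.read.parquet("transactions/")
                         .select("receipt_code", "revenue_per_transaction"))

    # Total revenue per receipt.
    revenue_per_receipt = (transactions.groupBy("receipt_code")
                                       .agg(F.sum("revenue_per_transaction")
                                             .alias("total_revenue")))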

Keeping this in mind will make your life easier and our life easier, and we will all get to spend less money and time. And money.

Ensuring Data Append

Often, when creating pipelines, we have to combine a number of data sources to arrive at the desired result. For example, say we are processing weekly data and somewhere in the process we want to merge specific files.

Since the input does not depend on us (it usually depends on the client), the schema might eventually change, meaning that a column we expect to be a string in the Parquet importer arrives as an integer and thus cannot be parsed.

The best approach is to handle the schema change first. Because we cannot hard refresh the importer every week, what we can do is remove the schema from the JSON Config, so that it is parsed dynamically every time a new week comes in:

"schema": []

Then, we want to cast EVERY selected column (assuming we have a Select right after the importer, which is always recommended) to a specific type, even if we expect it to be parsed to that type anyway. For example, if we know we get a column called product_name, we still cast it to a string, even though it should always be a string, for consistency. By doing this, we transition from raw to cleaned files and can ensure that the cleaned files always have the same schema. That is the point at which we start using the wildcard operator, as sketched below.
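
A minimal sketch of that raw-to-cleaned step in PySpark (the paths and every column name except product_name are assumptions for illustration):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Read the raw weekly file with a dynamically parsed schema.
    raw = spark.read.parquet("raw/week_01")

    # Select right after the importer, then cast every column explicitly so the
    # cleaned output always ends up with the same schema.
    cleaned = raw.select(
        F.col("product_name").cast("string"),
        F.col("quantity").cast("int"),
        F.col("revenue").cast("double"),
    )

    cleaned.write.mode("overwrite").parquet("cleaned/week_01")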

The wildcard operator itself combines data sources, which is why we cannot use it on raw data sources: we cannot combine data whose schemas differ. For example, say we have several data sources whose schemas are the same so far but might eventually change, and we need them combined into one for later use.

Weekly data uploads

In order to process this data as one, we can:

  1. Use the wildcard operator on all these files.

  2. Do a select module.

  3. Do casting modules.

  4. Export to parquet.

By doing this, we skip producing a cleaned file, which results in less stored data. The problem with this approach is that if the input schema changes even once, the wildcard combination will break. By a schema change, we mean either an unexpected new column being added or a column's type changing. The first case results in a bad combination, while the second results in schema merging errors.

CSV merging is position-based, while Parquet has a smarter, schema-aware merge. For example, if we combine two CSV tables whose columns are not in the same order, the combination will be scrambled, unlike with Parquet.

On the other hand, we could clean those data sources individually every week, thus always having data with the same schema. This allows us to use the wildcard operator freely without worrying about errors. That means, for each data source:

  1. Do a select module.

  2. Do casting modules for all the columns.

  3. Export to parquet.

  4. Combine.

As an example, this way of working is implemented in Xenos' Standardized Output pipelines.
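
Once each weekly source has been cleaned to the same schema, the wildcard combination is safe. A minimal sketch, assuming the cleaned outputs are written under paths like cleaned/week_01 as in the previous snippet:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Every cleaned week shares the same schema, so the wildcard read cannot
    # run into schema merging errors.
    all_weeks = spark.read.parquet("cleaned/week_*")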

Plain Values Dictionary

What is Plain Values Dictionary?

Plain Values Dictionary is a rare Spark error that arises when the column schema read (cached) by Visual Data Hub is not the same as that of the original file.

Let us assume we have one Parquet file that contains two columns, columnA and columnB. If we load this through VDH and everything goes fine, then the Importer → JSON Config will have an attribute that looks like this:

"schema": [{"column": "columnA","type": "INTEGER"},{"column": "columnB","type": "STRING"}]

From this point, two potential use cases might arise:

  1. If the file is updated with one more column, let’s say columnC (meaning our Parquet file now has three columns: columnA, columnB and columnC), and the importer in the Platform is not hard reloaded, then the exported data will still have only two columns. This means that the schema is loaded from our importer module (the cache) and not directly from the file.

  2. If a Parquet column’s type is updated, for example columnA from INTEGER to STRING, and the importer in the Platform is not hard reloaded, then the Plain Values Dictionary error will pop up when running the pipeline.

Overcoming the issue

The solution is to update the importer module’s schema to match the original file, either by:

  • Hard-refreshing the importer - this makes another request to clear the cache and fetch the schema again.

  • Removing the schema from the JSON Configuration module - this automatically loads everything the original dataset schema contains:

    "schema": []

The latter option is especially useful when you are working with versioned data and you know that the schema might eventually change.
