Onboarding
When a new client is about to start working with us, the first thing PRIME does is create a dedicated S3 bucket for them, named prime-clientName, into which they upload all of their data in an unstructured manner, as in the image below:

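For illustration, here is a minimal sketch of how such a bucket could be created with boto3. The client name and the AWS region are assumptions made up for the example; only the prime-clientName naming convention comes from this article.

```python
import boto3

# Hypothetical client name; the real bucket follows the prime-<clientName> convention.
client_name = "acme"
bucket_name = f"prime-{client_name}"

# The region is an assumption for this sketch.
region = "eu-west-1"
s3 = boto3.client("s3", region_name=region)

# Create the client's dedicated bucket. Outside us-east-1,
# the region must be passed as a LocationConstraint.
s3.create_bucket(
    Bucket=bucket_name,
    CreateBucketConfiguration={"LocationConstraint": region},
)
```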
By unstructured, we mean that the format, the content, and the data structure are not defined in any way; we call this data raw client data. This data usually comes in CSV (comma-separated values) format, which is why the second thing we do is move it as is (sometimes with small modifications, depending on the client) to our production data lake. To do this, we create a new client environment on the platform along with a project called Standardized Output, and inside this project we create a subproject called Raw. In Raw we create the pipelines that migrate the data from the client environment to our data lake.

The way we do this may differ from client to client, but a clear line always separates transaction data from mapper files (a line which Data Analysts usually notice easily). When creating the pipelines, it is a good habit to use as many wildcards as possible; wildcards avoid hard-coded values and minimize manual work, leading to cleaner and more compact pipelines. When this migration happens, the destination file format must be Parquet. Parquet is a columnar data format that is much faster than row-oriented formats such as CSV, because it is optimized for working with complex data in bulk and supports several efficient compression codecs (such as Snappy). In technical terms, raw client data would be:
s3://prime-client/data.csv

and raw data would be:
s3://prime-data-lake/production/client/vdh/standardized_output/raw/data

The .parquet postfix is not specified by the user, because Spark handles it for us. This convention is in place at the time of writing and might change soon!
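As a minimal sketch of such a migration pipeline in PySpark: the application name, the read options, and the overwrite mode are assumptions for this example; the S3 paths are the ones shown above. Note how the CSV path uses a wildcard and how the Parquet path carries no .parquet postfix.

```python
from pyspark.sql import SparkSession

# App name is an assumption for this sketch.
spark = SparkSession.builder.appName("raw-migration").getOrCreate()

# Wildcards avoid hard-coding file names: every CSV in the
# client bucket is picked up by this one read.
raw_client_data = spark.read.csv(
    "s3://prime-client/*.csv",
    header=True,        # assumption: the client files carry a header row
    inferSchema=True,   # assumption: no schema is enforced on raw data
)

# Spark writes .parquet part files under this prefix itself,
# so the postfix is never spelled out by hand.
raw_client_data.write.mode("overwrite").parquet(
    "s3://prime-data-lake/production/client/vdh/standardized_output/raw/data",
    compression="snappy",
)
```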