What to look for

NULLs

Always check whether a file has null values in columns that are necessary for downstream processing or insights. It sometimes happens that data uploaded by our clients contains null values in fields such as id, price, or revenue. This should not be allowed, as we cannot derive complete insights from incomplete data, and null values can also cause errors in later calculations.
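A minimal pandas sketch of such a check (the file name and the required columns are only examples) could look like this:

```python
import pandas as pd

# Hypothetical upload; the file name and column names are only examples.
df = pd.read_csv("client_upload.csv")

required_columns = ["id", "price", "revenue"]

# Count null values per required column.
null_counts = df[required_columns].isnull().sum()
print(null_counts)

# Fail fast if any required column contains nulls.
if null_counts.any():
    raise ValueError(f"Null values found in required columns:\n{null_counts[null_counts > 0]}")
```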

We also sometimes create null values ourselves while running pipelines. In those cases we should always check how they were created and whether we can avoid them. Null values are commonly created during left joins, and sometimes that is the expected result. If we cannot avoid null values (or we are sure that we should keep them), we should be aware of the problems they may cause further down the line.
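To make the left-join case concrete, here is a small hypothetical example of how those nulls appear and how to spot them:

```python
import pandas as pd

# Hypothetical example: sales rows joined to a product dimension.
sales = pd.DataFrame({"product_id": [1, 2, 3], "revenue": [10.0, 20.0, 30.0]})
products = pd.DataFrame({"product_id": [1, 2], "category": ["food", "drinks"]})

# A left join keeps every sales row; products missing from the
# dimension end up with a null category.
joined = sales.merge(products, on="product_id", how="left")

# Check how many nulls the join introduced and decide whether that is
# expected or a sign of missing reference data.
print(joined["category"].isnull().sum())  # 1 (product_id 3 has no category)
```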

Duplicates

Duplicate rows are another error we may encounter in our day-to-day work. They are highly impactful because they can duplicate important figures like revenue, and then we show double the revenue on our dashboards (no, this is not how we are supposed to help our clients “grow” 😛).

Duplicate data also often appears during joins, so we should always be careful about the files we use and the results we produce. We have a “drop duplicates” node in our VDH, but we should always try not to cause duplicates in the first place; the node should function as a safeguard, not as the fix.
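The sketch below (purely illustrative data) shows how a duplicated join key fans out rows and inflates revenue, and how exact duplicates can be detected and dropped:

```python
import pandas as pd

# Hypothetical example: the product dimension contains a duplicated key.
sales = pd.DataFrame({"product_id": [1, 2], "revenue": [100.0, 50.0]})
products = pd.DataFrame({"product_id": [1, 1, 2], "category": ["food", "food", "drinks"]})

joined = sales.merge(products, on="product_id", how="left")

# The duplicated key fans out the sales row and inflates total revenue.
print(len(sales), "->", len(joined))                          # 2 -> 3
print(sales["revenue"].sum(), "->", joined["revenue"].sum())  # 150.0 -> 250.0

# Detect exact duplicate rows and, only if truly needed, remove them.
print(joined.duplicated().sum())  # 1
deduplicated = joined.drop_duplicates()
```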

Number of Rows

Depending on the transformations that we use, the number of rows may stay the same or it may change. A Data Analyst should be able to explain any difference in the number of rows before and after a transformation. With experience, you will gain intuition about whether a change is expected or whether something is wrong in the transformation.
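A simple habit is to record the row count before and after each transformation and to investigate any difference, roughly like this (the data is only an example):

```python
import pandas as pd

# Hypothetical frames; in practice these come from earlier pipeline steps.
df = pd.DataFrame({"product_id": [1, 2, 3], "revenue": [10.0, 20.0, 30.0]})
products = pd.DataFrame({"product_id": [1, 1, 2], "category": ["food", "food", "drinks"]})

rows_before = len(df)
transformed = df.merge(products, on="product_id", how="left")
rows_after = len(transformed)

# A plain left join on a unique key should keep the row count; here the
# duplicated product_id fans out the rows, which we need to explain.
if rows_after != rows_before:
    print(f"Row count changed: {rows_before} -> {rows_after}")
```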

Value checks (outliers, negative values)

A lot of issues can be found by checking the values in specific columns.

Expected Values

The simplest form of this check is to compute the totals and compare them with past periods. This alerts us to big changes in the data and is a good indicator that extra checks are needed.
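As a rough illustration, comparing monthly totals could look like this (the data and the 50% threshold are arbitrary examples):

```python
import pandas as pd

# Hypothetical daily revenue; in practice this comes from the pipeline output.
daily = pd.DataFrame({
    "date": pd.to_datetime(["2024-05-01", "2024-05-02", "2024-06-01", "2024-06-02"]),
    "revenue": [100.0, 120.0, 230.0, 240.0],
})

# Monthly totals and the change versus the previous period.
monthly = daily.groupby(daily["date"].dt.to_period("M"))["revenue"].sum()
change = monthly.pct_change().iloc[-1]

# A big swing is a signal to continue with extra checks, not necessarily an error.
if abs(change) > 0.5:
    print(f"Revenue changed by {change:.0%} versus the previous period - investigate.")
```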

There are also columns that should only contain specific values, and we can perform checks on those. Another simple check is to look for negative values in product prices (unless our client pays their customers to take stuff away from the store 😛).
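A negative-price check can be as small as this (column names are only examples):

```python
import pandas as pd

# Hypothetical product data.
df = pd.DataFrame({"product_id": [1, 2, 3], "price": [2.5, -1.0, 4.0]})

# Prices should never be negative; flag any rows that violate this.
negative_prices = df[df["price"] < 0]
if not negative_prices.empty:
    print("Found rows with negative prices:")
    print(negative_prices)
```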

We can also check the values in columns that hold categorical data: if we see something like “January” in a column called “Day of the Week”, we know something is wrong with our data.
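For categorical columns, the check boils down to comparing the observed values against an allowed set, for example:

```python
import pandas as pd

# Hypothetical column; the allowed set is just an example.
df = pd.DataFrame({"day_of_week": ["Monday", "Tuesday", "January"]})

allowed = {"Monday", "Tuesday", "Wednesday", "Thursday",
           "Friday", "Saturday", "Sunday"}

# Any value outside the allowed set points to a problem in the data.
unexpected = set(df["day_of_week"].unique()) - allowed
if unexpected:
    print(f"Unexpected values in day_of_week: {unexpected}")
```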

Outliers

Outliers are data points in our dataset whose values differ greatly from most other data points. Since we work with a lot of products from each retailer, and it sometimes makes sense to process data at the product level, we encounter outliers quite often. However, we still have to make sure that those data points are real outliers and not the result of a mistake in our data processing.

There are many different ways to find outliers. A very easy one is to plot the data in a scatter chart and look at the points that sit far away from the rest. There are also specific formulas that help us find the most extreme values in our dataset.
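One commonly used formula is the interquartile range (IQR) rule; the section does not prescribe a specific method, so treat this only as an illustrative sketch:

```python
import pandas as pd

# Hypothetical revenue values with one extreme point.
values = pd.Series([10, 12, 11, 13, 12, 14, 11, 300], name="revenue")

# IQR rule: flag points more than 1.5 * IQR below Q1 or above Q3.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# The 300 is flagged; check whether it is a real outlier or a processing mistake.
print(outliers)
```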
