With Tableau Prep Builder you can easily clean your data. In many organisations it’s usually the data engineers and data scientists who take care of data preparation; analysts interact with the data only after it has been cleaned and prepared. Thanks to Tableau Prep, data cleaning has become quicker and easier. In a Clean step, we can change the data type of each field by selecting the data type icon in the corner and reassigning the correct role, and we can also assign geographic roles as they will be used in Tableau Desktop. Much as we’d like our data to arrive in pristine condition, the reality is that data files are not always well-formatted, and they often need some work before they’re ready for analysis.
The following is a guest post by Tableau Public author Nicolás Iguchi.
'I have nothing to offer but blood, toil, tears, and sweat.'
How many times have you looked for that silver-bullet tool to solve all the data-quality issues we encounter in almost every data set? I’ve already lost count, and despite searching for years, I have never found one.
I stopped looking for it when I finally understood that data preparation is like the work a gardener has to do before the flowers bloom. It’s necessary to prepare the ground, plant the seeds, add nutrients, remove weeds, ensure that the plant receives the correct amount of water and sun, and so on. All these tasks require a lot of time and effort, and no one raves about them once spring arrives, but they are critical to getting the most beautiful flowers.
Ask any gardener about the demands of preparing a garden.
The same applies to data vizzes. We need to give our best effort to the data-preparation process to create a meaningful story, even if it consumes our time.
Here are a few data-cleaning tips that I believe can help you as well:
1. Get Involved With Your Data
Before thinking about how to communicate your story, take some time to understand the nature of your data and the business rules that support it. It’s not only about looking at the header line and the first rows; it’s about getting deeply involved in the content of the data, trying to understand what it represents and how. Some things to review at this stage are:
- Data types: Do data types correspond to the values that are actually stored? If they don’t, how can you fix this gap?
- Values each column can take: Are they within the expected range (e.g., are temperatures between -15 °C and 60 °C)? If they aren’t, what might these out-of-range values express?
- Empty and NULL values: If these exist, what may they represent for each case? An empty value could have a very important meaning in a data set.
- Initial patterns in the data: Is there any superficial pattern you can spot?
This process will help identify the types of data-cleaning tasks your data set needs.
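If you prefer to do this first review outside of Tableau, the same checks can be scripted. Here is a minimal Python sketch (the column values and the expected temperature range are hypothetical) that reports empties, non-numeric entries, and out-of-range values for one column:

```python
def profile(values, numeric_range=None):
    """Summarize one column: empty values, non-numeric values, and
    (optionally) how many numbers fall outside an expected range."""
    empties = sum(1 for v in values if v.strip() == "")
    numeric, non_numeric = [], 0
    for v in values:
        if v.strip() == "":
            continue  # already counted as empty
        try:
            numeric.append(float(v))
        except ValueError:
            non_numeric += 1
    report = {"empty": empties, "non_numeric": non_numeric}
    if numeric:
        report["min"], report["max"] = min(numeric), max(numeric)
        if numeric_range:
            lo, hi = numeric_range
            report["out_of_range"] = sum(1 for x in numeric if not lo <= x <= hi)
    return report

# Hypothetical temperature readings, expected between -15 and 60 °C.
values = ["21.5", "19.0", "", "999", "n/a", "-3.2"]
print(profile(values, numeric_range=(-15, 60)))
```

A report like this immediately surfaces the questions from the checklist above: the empty value, the "n/a" that breaks the data type, and the 999 that sits outside the expected range.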
2. Never, Ever Trust Your Data at First Sight
In general, the first rows of any data set are pretty and well-formatted, which makes us unconsciously think we won’t have to deal with data-quality issues. Cheated by the 50-row preview that every tool provides, we move forward and start thinking about the best way to tell the story, only to realize that after those 50 rows there are so many strange cases that any kind of analysis becomes impossible.
This may happen even if the data set is generated by a reputable organization. In the end, every data source is created by humans, or by machines developed by humans, and can include errors or unanticipated situations.
Besides, there may be differences in the way we understand the data, leading to wrong assumptions (units of measure are a common case: we assume a value is expressed in our usual unit, but that’s not always so).
Always be suspicious of the quality of the data you are analyzing, and double-check the assumptions you may have. This attitude will help you detect more issues in advance, saving a lot of time in the data-preparation process. You will also develop detective skills you never imagined you had.
3. Avoid Manually Cleaning Your Data
How many times have you reviewed a data set and thought: “I can fix this manually in a couple of minutes”? It’s very tempting to just open your file in a spreadsheet and clean your data by hand. It may seem like the easiest way to perform this tedious task.
But deep inside, we all know that as humans we tend to make mistakes when performing repetitive tasks. It may also happen that, after getting the first insights, we find inconsistencies that require us to start the cleaning process over. If we did the cleaning manually for every iteration, that “couple of minutes” could easily turn into several hours of work.
Whenever possible, use a software tool to clean your data. Tableau Data Interpreter is an excellent aid in the cleaning process, and you can also use groupings and calculated fields to help you.
Tableau Data Interpreter is your friend!
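The same idea applies if you clean your data with a script instead of a spreadsheet: put the cleanup in a function so every iteration is one call, not another round of manual edits. A small sketch in Python (the field names and bad-value markers are hypothetical):

```python
def clean_row(row):
    """One repeatable cleaning pass: trim whitespace, normalize the
    city's capitalization, and map known bad values to None.
    The 'city' and 'revenue' field names are hypothetical."""
    cleaned = {k: v.strip() for k, v in row.items()}
    cleaned["city"] = cleaned["city"].title()
    if cleaned["revenue"] in ("", "N/A", "-"):
        cleaned["revenue"] = None
    return cleaned

rows = [
    {"city": "  new york ", "revenue": "100"},
    {"city": "BOSTON", "revenue": "N/A"},
]
print([clean_row(r) for r in rows])
```

When a later iteration reveals a new inconsistency, you add one rule to `clean_row` and re-run, instead of repeating the whole manual pass.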
4. Standardize Your Data
Standards can feel like a constraint on creativity, but in data preparation they accelerate our understanding of the data we are reviewing, especially when dealing with multiple data sets.
Change the column headers to names that make more sense to you, and follow the same naming convention across all the data sets you use. If the same column appears in several data sets, use the same name for it, so that you remember they are related.
Without standards, it’s not so obvious that these data sets are equivalent.
Create standards for your data values as well. That way, you can get the data you need by taking only a few fields and applying simple filters. Here are some things to consider when you standardize your data set:
- In Dimensions, group elements that we know are equal but machines can’t detect as such (for example, we know that NYC and New York refer to the same city, but many software tools can’t make that connection). Base your analysis on these new standard dimensions.
- Create new Dimension fields to group those values that need to be analyzed together, letting you filter and focus on them without applying complex rules every time.
- Keep only one field per measure; if you ever need only a subset of it, you can always filter the data set in the canvas.
- Use the same unit for all your measures, so that you can aggregate them without worrying about conversions.
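Outside of Tableau, the first and last of these standards can be sketched in a few lines of Python. The alias table and the Fahrenheit-to-Celsius conversion below are hypothetical examples of a standard dimension and a single-unit measure:

```python
# Hypothetical alias table: map every known spelling to one standard value.
CITY_ALIASES = {"NYC": "New York", "N.Y.C.": "New York", "new york": "New York"}

def standard_city(value):
    """Return the standard spelling for a city, falling back to Title Case."""
    v = value.strip()
    return CITY_ALIASES.get(v, CITY_ALIASES.get(v.upper(), v.title()))

def to_celsius(value, unit):
    """Keep one unit for the measure: convert Fahrenheit readings to Celsius."""
    return (value - 32) * 5 / 9 if unit == "F" else value

print(standard_city("NYC"), standard_city("new york"))
print(to_celsius(212, "F"))
```

This plays the same role as a grouping in Tableau: once every spelling maps to one standard value and every reading is in one unit, filters and aggregations stop needing special cases.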
5. Iterate Your Data-Cleaning Process
Iteration is the best (and maybe the only) way to reach perfection.
Instead of trying to fix all your data sources in a row, focus on those issues that block you from performing any kind of analysis. Once the initial cleaning is done, start filling Tableau’s canvas with your first ideas. Based on these insights, you’ll be able to determine which data quality issues need to be corrected first. With every iteration, you’ll find new data-quality issues that you can then tackle in the next iteration.
When is it best to stop iterating? When you feel comfortable with the story.
These tips are not hard-and-fast rules that must be strictly followed, and I can’t guarantee that you’ll be able to master every data set with them, but I’m pretty sure you’ll at least find the data-cleaning process more manageable, and you may even start enjoying it as well!
Nicolas is a Business Intelligence Designer at Globant’s Big Data Studio and Business Intelligence Teacher Assistant at UTN University in Buenos Aires, Argentina. You can contact him on Twitter @ntigc.
Tableau recently promoted its Maestro beta program to a licensed data cleansing and modification tool called Prep. Tableau Prep comes bundled with Tableau Desktop in the new Creator license. The purpose of bundling the two products is to shorten the time it takes to get clean data into Tableau (and thereby the time to analysis) by simplifying the data cleansing process.
A quick tip: with Tableau, sometimes the shininess of a new feature (or, in this case, a whole new tool) can obscure the actual need for the functionality. If your dataset is already clean, or you have simple cleanup needs (like renaming fields or changing data types), the legacy Tableau Data Source screen is still the fastest way to proceed. That said, Tableau Prep does simplify much of the manual cleansing that the legacy Data Source screen requires.
Tableau Prep’s Top Features:
If you’ve used Tableau Desktop in the past and wanted to clean your data, you either cleaned it in your source system (the original database or an Excel spreadsheet, for example) or used a combination of calculated fields and groups directly in Tableau. This is why Tableau Prep exists: it lets you handle data cleanup directly and makes it very simple to see which fields are affected.
Providing an Overview of Your Data Prior to Analyzing It
Before you even begin analyzing your data, Tableau Prep provides you with an overview of all of your fields as well as data histograms so you can view the frequency of the values in each field. By simply sorting any of these fields you can see which values are most frequent or infrequent within your dataset. A count of the total distinct records in each field is provided as well.
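The same overview can be approximated in code if you want to sanity-check a file before opening it anywhere. A minimal Python sketch, using a hypothetical column of values, that builds the value histogram and the distinct count Prep shows per field:

```python
from collections import Counter

# Hypothetical column of values from a data set.
colors = ["red", "blue", "red", "green", "red", "blue"]

freq = Counter(colors)   # value -> frequency, i.e. the field's histogram
distinct = len(freq)     # count of distinct values in the field

print(freq.most_common(), distinct)
```

Sorting the counter’s items is the scripted equivalent of sorting a field in Prep to see its most and least frequent values.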
Documenting Your Cleanup Process
If you have used Tableau’s legacy data preparation screen, you know it was up to you to document your process in a tool outside of Tableau. Prep removes this grunt work and provides a clean, intuitive flow diagram of each step in your process. Not only does this let you keep track of what you have done, it also lets others pick up your Prep flow and quickly understand it.
By saving a flow in Prep, you will be able to use the same process for data cleanup in the future. While this still requires manual execution, running the flow only requires the click of a button.
Tableau Prep’s Limitations:
Tableau Prep is not an ETL tool that you can automate to run without your interaction. It will not replace your current ETL process or stored procedures. For now, the only way to execute a Tableau Prep flow is directly in the tool itself.
You cannot set a schedule for executing your Prep flows. Unfortunately, this manual execution process means that it is not viable for reports that require regular data updates.
Prep allows you to export the cleansed dataset as an extract or publish it directly to Tableau Server. However, if you want to use the data outside of Tableau, the only current export option is a CSV file.
While there are no published data volume limitations, large datasets are not handled well by Prep. When connecting to a dataset of 10M records, the data field histograms did not populate. Simply converting two fields from integers to dates and attempting to run the flow caused Prep to fail with the message “An error has occurred while running the flow.” Until Tableau provides additional guidance, Prep should primarily be used for smaller datasets.
As it exists now, there are limited use cases for using Tableau Prep without using Tableau Desktop. There are more mature and feature-rich data preparation tools out there that allow you to automate the data clean-up process, export the data however you would like, and schedule that export as well.
However, Tableau Prep is a great improvement over the data preparation capabilities that previously existed in Tableau Desktop. By reducing the amount of manual work required in cleaning up data prior to importing it into Tableau, as well as having Prep bundled directly with Desktop, users will be able to focus more time on data analysis and less time on data clean-up.