I keep seeing the claim that 80% of a data scientist's work is data cleansing. I'm not sure whether that number is accurate, but for me it reinforces the need for good quality data if you want good quality results.
If you’re not familiar with the term, data science is a set of techniques for extracting meaningful information from data sets using methods from mathematics and computer science, particularly statistics, machine learning, classification and data analytics.
You may say that this has been around for over 20 years, and it has, but with the advent of Big Data and open source software for data analysis, data science is taking shape with a staggering number of new job opportunities for those wanting to learn.
Machine learning has driven much of the recent excitement in technology circles: a computer is no longer programmed to perform a specific task better than a human, but programmed to learn (now that is exciting).
Software that learns is trained in specific domains and as soon as the training is complete and satisfactory, the software can be used to process tasks within that domain.
Character recognition, for example, has been around for some time. Handwritten numbers and letters are digitised into numbers and passed to machine learning algorithms, which learn how to recognise them. The trained software is then ready to recognise new characters. A classic application is the recognition of postcodes on envelopes in the mail sorting systems used by postal authorities.
There are so many applications for data science, from image recognition, gaming, speech recognition and fraud detection to self-driving cars and robotics, to name a few.
A discipline closer to home is marketing. With digital marketing, the amount of data we hold, across traditional channels like direct marketing and newer channels like social media, has grown to large volumes.
To make sense of that data, data science techniques are ideally suited to answering the big marketing questions: how is each prospect interacting with our multi-channel campaigns, how well are our products performing, which customers should be targeted with our new products, and so on.
The same questions as always, but now with a much wider dataset to review and new techniques to analyse it with.
Data cleansing (aka. data cleaning, data scrubbing, data preparation, data munging) has been around for decades, but takes on more significance in the new world of data science.
If we are using data science to create visualisations, then it’s important the data is clean (i.e. fit for purpose), otherwise the visuals will be misleading.
For example, if we want to see sales across different countries and plan to chart them by country name, then we need to ensure the country names are correct and consistent.
In this example, you can see we need to clean the data before it can be used for data visualisation:
Country names must be standardised.
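As a minimal sketch of that standardisation step (the records and name variants here are hypothetical, purely for illustration), a lookup table can map every spelling of a country to one canonical name before the chart is built:

```python
# Hypothetical sales records: the same two countries appear
# under six different spellings (illustrative data only).
sales = [
    {"country": "USA", "amount": 120},
    {"country": "U.S.A.", "amount": 95},
    {"country": "United States", "amount": 40},
    {"country": "UK", "amount": 80},
    {"country": "U.K.", "amount": 60},
    {"country": "United Kingdom", "amount": 30},
]

# Map each variant to a single canonical name.
CANONICAL = {
    "USA": "United States",
    "U.S.A.": "United States",
    "United States": "United States",
    "UK": "United Kingdom",
    "U.K.": "United Kingdom",
    "United Kingdom": "United Kingdom",
}

def standardise(records):
    """Return records with country names replaced by canonical values."""
    return [{**r, "country": CANONICAL[r["country"]]} for r in records]

cleaned = standardise(sales)

# Aggregate sales per canonical country, ready for charting.
totals = {}
for r in cleaned:
    totals[r["country"]] = totals.get(r["country"], 0) + r["amount"]

print(totals)  # {'United States': 255, 'United Kingdom': 170}
```

Without the mapping, a chart would show six bars for what are really two countries; with it, the totals collapse to the correct pair.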
This simple scenario is very common, so you can imagine that, with the full range of data errors one can have, applying the techniques of data science requires a lot of data cleansing.
If we use the above data in a machine learning environment to predict future sales from past performance, the same problem occurs: we need to clean the data to get good predictions.
Machine learning uses statistical techniques, so labels like country names are converted to numbers before the machine can learn from them. In the above example there are only two distinct countries, not the six different values we see, so we want to assign two numbers, not six.
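As a minimal sketch of that encoding step (the labels here are hypothetical), a simple label encoding assigns each distinct value an integer:

```python
# Cleaned categorical column: only two distinct country labels remain
# (illustrative values only -- any categorical column works the same way).
countries = ["United States", "United Kingdom", "United States",
             "United Kingdom", "United States", "United Kingdom"]

# Build a stable label -> integer mapping (simple label encoding).
labels = {name: idx for idx, name in enumerate(sorted(set(countries)))}
encoded = [labels[c] for c in countries]

print(labels)   # {'United Kingdom': 0, 'United States': 1}
print(encoded)  # [1, 0, 1, 0, 1, 0]
```

Run the same encoding over the uncleaned column and its six spelling variants would each receive their own code, so the model would be trained as if there were six different countries.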
So, for machine learning, data cleansing is very important; otherwise the software is trained on incorrect data.
From my experience, the statement that 80% of a data scientist’s time is spent on data cleaning (under whatever name) is not far from the truth.
Data has always required cleaning and reformatting, and that won’t go away. Whether the application is image recognition or marketing analytics, data will always need to be cleaned.