Dirty, Dirty Data: Does Your Data Have a Credibility Challenge?

Dirty data is a massive problem, and it affects most companies. If you are in the US, you have access to vast amounts of public/open data, as well as commercial databases, to use in your analytics.

It looks amazing when you examine all the types of data available. For example, data.gov alone hosts 85,000 databases covering areas such as healthcare, consumer data, hospital discharges, home health patient outcomes, hospice data, DSH data, AHRQ data and more. A further 1,330 healthcare databases from dozens of sources are available at healthdata.gov, including hospital discharges by county, VA hospital compare, Medicare spending by patient, patient hospital experience, home health patient outcomes, DSH data and AHRQ data. You can also access data on 50 million Medicare patients, and plenty more besides. It sounds like a data paradise compared with what is available in Japan, Europe and the rest of the world.
 
Big is NOT Always Best
 
However, big is not always best. These databases are by no means comprehensive, and usually the main use you can get from combining all this big data is forecasting, for which it is robust and well suited. For other analyses, though, the gaps and missing data points in key areas can lead to very misleading conclusions, depending on what you are analyzing.
 
We know a lot of this data is full of gaps, but we also know there is a lot of dirty data amongst it all: data that has been keyed in inaccurately and gets lost in the sea of data. What about the rest of the world? What data do we have there? We have sales and market share data, sometimes promotional spend data and, of course, market research data, along with a few other databases. No data paradise at all. Worse, even this small data can be dirty. The sales and market share data is typically taken from a sample and extrapolated. The promotional spend databases that companies buy are almost always erroneous. I know this because every time a client gives us one to use they say, “It is inaccurate for my brand, so we need to change those figures.” If it is inaccurate for one brand, chances are it is inaccurate for all.
 
Data quality is an industry-wide, global issue in Pharma. I remember speaking with someone at Bayer in the US who recognized the problem was so severe that they split their analytics team into a data team, whose job it was to source and clean data, and a reporting team. A big source of data errors is data about the data, i.e. metadata. A famous example is the NASA Mars Climate Orbiter disaster, where the $125 million spacecraft was lost because one group of engineers used metric units and another used imperial units for a key operation.
 
Everyone will be familiar with the situation where you are preparing a presentation for senior management and something looks off with the numbers. You go and verify them and find there is indeed an error. Most people routinely come across this issue and either correct it or fail to pick it up at all. A few studies have found that analytics teams spend around half of their time finding and fixing data errors.
 
Let’s consider what happens when errors creep in. In Pharma sales and marketing, they can mean wasting resources on the wrong targets, or inaccurate strategy and tactics that reduce revenue and profit. It is a huge problem in Pharma. What about healthcare in general? There it can be an even bigger problem: an incorrect result from a pathology test could kill a patient. The costs are huge in every possible way. Nonetheless, the solution is simple. Every team needs to ensure that the people working with the data are aware of potential issues, understand where errors could occur, and know enough about the data they are using to find and correct any issues they uncover.
 
If your data is unreliable, you will make inaccurate decisions and senior management will stop believing what you say. Everyone will fall back on gut feel and intuition, and will be more likely to reject counter-intuitive implications that arise from strong data and analyses. We all know the phrase ‘garbage in, garbage out’, which describes exactly this problem of poor-quality data. Correcting the issue, however, is not as insurmountable a task as you may imagine, and the solution is not a technological one; it is a communication one. Errors often occur because the users of the data do not fully understand it and make incorrect assumptions about it, or do not understand it well enough to notice glaring errors.
 
Eularis are scrupulous about data quality, as our reputation is on the line if we deliver bad results, so the data we use for our machine learning and other analytics is as clean as possible. Eularis have a large team who do this with every piece of data we collect. It adds significant cost, but it must be done for us to get good results for our clients.
 

Data has two critical moments in its life: when it is created and when it is used. Errors need to be fixed at the moment of creation, not at the moment of use. Most studies of data errors find that the majority are not picked up until the data is used and makes no sense. Fixing the data should not only be about fixing others’ errors; it should equally be about ensuring that those creating the data are error-checking as they go along, so that the root cause of errors is found and eradicated immediately. Interestingly, when we work on big data projects in the US and contact the originators of the data about issues we find, we hear something similar each time: “We didn’t think anyone used the data so we didn’t spend much time on it.” It is very important that the companies responsible for compiling the data are aware that it is being used and that errors are an issue they need to address. This is the best way to correct the problem: at the source. Obviously, this applies to the big, open datasets. Data that is bought or collected must be checked as it comes in.

Companies often launch huge efforts to clean their existing data. This is good; however, a more efficient way is to get the data cleaned at the source, identifying and eliminating the errors in the first place and saving a lot of time, money and effort. Of course, it still needs to be checked when it reaches you, but doing this will reduce the clean-up effort dramatically. So, by having a data quality team in place and disciplined methodologies for checking data, as well as speaking with the originators and fully understanding where every piece of data came from, many of these problems can be dealt with early, before they creep into multiple analyses and before it becomes more difficult to trace the source of the errors.
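As a simple illustration of checking data as it comes in, the sketch below shows how a few basic validation rules might be automated. It is a hypothetical example, not Eularis’s actual tooling: it assumes Python with pandas, an incoming file called incoming_sales_extract.csv and illustrative column names, and it flags missing columns, missing values, duplicates and implausible figures before the data reaches any analysis.

```python
import pandas as pd

# Hypothetical validation rules for an incoming monthly sales extract.
# The column names and file name are illustrative assumptions, not a real schema.
REQUIRED_COLUMNS = {"brand", "region", "month", "units_sold", "net_sales"}


def validate_sales_extract(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable data quality issues (empty if none found)."""
    issues = []

    # 1. Structural check: all expected columns are present.
    missing_cols = REQUIRED_COLUMNS - set(df.columns)
    if missing_cols:
        issues.append(f"Missing columns: {sorted(missing_cols)}")
        return issues  # No point running further checks without the columns.

    # 2. Completeness: flag missing values in key fields.
    for col in sorted(REQUIRED_COLUMNS):
        n_missing = int(df[col].isna().sum())
        if n_missing:
            issues.append(f"{n_missing} missing values in '{col}'")

    # 3. Duplicates: the same brand/region/month should not appear twice.
    n_dupes = int(df.duplicated(subset=["brand", "region", "month"]).sum())
    if n_dupes:
        issues.append(f"{n_dupes} duplicate brand/region/month rows")

    # 4. Plausibility: negative units or sales are almost certainly keying errors.
    for col in ["units_sold", "net_sales"]:
        n_negative = int((df[col] < 0).sum())
        if n_negative:
            issues.append(f"{n_negative} negative values in '{col}'")

    return issues


if __name__ == "__main__":
    df = pd.read_csv("incoming_sales_extract.csv")  # hypothetical file name
    problems = validate_sales_extract(df)
    if problems:
        print("Data rejected - raise these issues with the data originator:")
        for problem in problems:
            print(" -", problem)
    else:
        print("Basic checks passed - data can proceed to analysis.")
```

Checks like these do not replace speaking with the data originators, but they catch obvious keying and extraction errors the moment a file arrives rather than months later in a management presentation.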
 
What We Do
 

Eularis have a data quality control manual that must be adhered to when checking all the data we receive. When we collect market research data, agencies are not thrilled with all the monitoring and checking, but it has to be done to ensure quality results. Eularis also ensure that the data creation team has a copy of our data checklists so they can make certain everything is error-free before we get the data. Naturally, we review each data set one more time. We still usually find errors to correct despite the initial pass by the data creators, but far less time is needed now than the nine months of man hours we used to spend checking data per project before we got the data creators involved. It is still a time-consuming task, but not in the same league as before.
For more information on anything in this article, please contact the author – Dr Andree Bates – at Eularis: www.eularis.com.
