The Problem of Data Integrity and How You Can Get Above Water
If you have ever dealt with data that's collected or organized by humans, you're probably familiar with the biggest pain of being a data analyst: most of your time is spent wrangling data rather than analyzing it. Getting your data into the right format is the issue of import.
The Problem of Data Integrity
"If you have an internal dataset into which humans enter information, it's going to be dirty. And continue to get dirtier," says Matt Schuchardt, director of business development and innovation at HIMSS Analytics.
Dirty data has many contributing factors. Sometimes data is dirty because several different people are entering it into a system. Sometimes the cause is the limits of your CRM or whichever tracking system you're using. Imagine, for instance, that you have excellent data on two separate companies, and your CRM doesn't automatically update their records when those two companies suddenly merge. Likewise, your CRM might not know about other details, such as the number of hospitals in a health system, so your data becomes less and less connected to reality.
One of the biggest sources of dirty data is inconsistent naming conventions. Anytime multiple individuals participate in data entry, there is room for error. Schuchardt gives a simple example:
"Consider the name: University of Chicago Medical Center. Think of all the ways you could possibly abbreviate that. And then think of 25 ways you haven't thought of," he says.
"A rational, reasonable way of connecting the data to the name is a challenge. Do you use structured fields, therefore making it harder for people to enter unstructured information? And if so, as you increase user experience friction, will you gain the volume of data you need?"
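One common way to tame free-text names without forcing structured fields on the people entering them is to normalize each entry and then fuzzy-match it against a curated list. The sketch below uses only Python's standard library. The canonical list, the abbreviation table, and the 0.8 cutoff are all assumptions made for illustration, not anything HIMSS Analytics describes using.

```python
import difflib
import re

# Hypothetical canonical list; in practice this would come from a curated reference table.
CANONICAL_NAMES = [
    "University of Chicago Medical Center",
    "Medical Alliance of the Upper Valley",
]

# Hypothetical abbreviation expansions, applied token by token.
ABBREVIATIONS = {
    "univ": "university",
    "u": "university",
    "med": "medical",
    "ctr": "center",
    "cntr": "center",
}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def match_name(raw: str, cutoff: float = 0.8):
    """Return the closest canonical name, or None if nothing clears the cutoff."""
    normalized_to_canonical = {normalize(c): c for c in CANONICAL_NAMES}
    hits = difflib.get_close_matches(
        normalize(raw), list(normalized_to_canonical), n=1, cutoff=cutoff
    )
    return normalized_to_canonical[hits[0]] if hits else None


print(match_name("Univ. of Chicago Med Ctr"))
print(match_name("Acme Widgets"))
```

The first call resolves to the canonical medical center name, and the second returns None. A lookup like this only covers the abbreviations someone thought to list, which is exactly Schuchardt's point about the 25 ways you haven't thought of: unmatched entries still need a human to review them.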
You can use common entry tips and tricks, of course, such as ensuring that ZIP or postal codes can only be entered as a five-character string. Again, that might reduce your data volume. You can try list matching by row, perhaps, but there's one looming truth that data analysts can't escape:
"Computers are dumb," Schuchardt says. "What you think of as the letter 'n', the computer just sees as bytes. The symbol 'n' that you see might not match up with what the computer is seeing, making sorting exceptionally difficult."
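To make that concrete, here is a small sketch of two characters that look like "n" (or "ñ") on screen but are stored differently, along with one way to fold them together before sorting or matching. The helper name canonical_text is hypothetical, and NFKC normalization plus case folding is just one reasonable choice, not a universal fix.

```python
import unicodedata


def canonical_text(value: str) -> str:
    """Fold compatibility characters and letter case so look-alike strings compare equal."""
    return unicodedata.normalize("NFKC", value).casefold().strip()


ascii_n = "n"
fullwidth_n = "\uff4e"        # FULLWIDTH LATIN SMALL LETTER N, renders much like "n"
composed_enye = "\u00f1"      # "ñ" stored as a single code point
decomposed_enye = "n\u0303"   # "n" followed by a combining tilde, renders as "ñ"

# The raw strings differ even though they look alike on screen.
print(ascii_n == fullwidth_n)
print(composed_enye == decomposed_enye)

# After normalization they compare equal.
print(canonical_text(ascii_n) == canonical_text(fullwidth_n))
print(canonical_text(composed_enye) == canonical_text(decomposed_enye))
```

Without a step like this, the look-alike spellings sort and match as different values.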
Mismatches like these keep a lot of good data from being analyzed. There will be a ton of volume, but little ability to distinguish between signal and noise. So what can you do to spend less time as a data janitor and more as an analyst?
Get Out of the Data Dumpster
Every analyst faces this challenge, because the fun part of data analysis is, well, analysis. The "work" part is getting the data together correctly.
Some analysts have greater challenges than others. Using off-the-shelf business intelligence software, for example, means the analyst might spend a lot of time reordering data by hand or in spreadsheets. That's not so fun. Other analysts, who are more familiar with SQL or Python, might be able to manipulate the data more easily. But, as you know if you're anything like us, you'd much rather see what cool things you can learn than spend your time organizing data.
If you're really fluent with coding, you can automate steps that were traditionally manual, essentially forcing the data into the shape you need. That's a fast way to get out of the data dumpster, but it requires a specific skill set that takes a long time to gain.
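As a rough illustration of what that automation can look like, the sketch below trims stray whitespace and checks the five-character ZIP rule mentioned earlier. The field names and sample rows are made up for the example. It flags bad rows instead of dropping them, on the assumption that you would rather review questionable records than lose volume.

```python
import re

# Exactly five ASCII digits, per the five-character rule discussed above.
ZIP_PATTERN = re.compile(r"[0-9]{5}")


def clean_record(record: dict) -> dict:
    """Trim whitespace on every field and flag ZIP codes that are not five digits."""
    cleaned = {key: value.strip() for key, value in record.items()}
    # Keep the row but flag it, so a strict rule does not silently shrink the dataset.
    cleaned["zip_valid"] = bool(ZIP_PATTERN.fullmatch(cleaned.get("zip", "")))
    return cleaned


# Made-up sample rows; all values are assumed to be strings.
rows = [
    {"name": "  Example Hospital ", "zip": "12345"},
    {"name": "Example Clinic", "zip": " 1234"},
]

cleaned_rows = [clean_record(row) for row in rows]
for row in cleaned_rows:
    print(row)
```

The same function can run over every new export, which is what turns a one-off manual cleanup into a repeatable step.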
Connecting aspects of your data through these kinds of manipulation needs to be done regularly and consistently. What if, though, you could get away from all of that and pull insights immediately? That can only happen if someone else has already done the cleaning and connecting for you.
This rings especially true in the healthcare vertical. There's a lot of institutional knowledge required to get behind the data and make it work for analysis, says Schuchardt. For example, knowing that a system once called Upper Valley Healthcare Group is now called Medical Alliance of the Upper Valley, or that one health system purchased its neighboring health system, is important to the janitorial role. Similarly, knowing that a university's medical center is really quite different from a typical university center, and in fact shares many qualities with larger health systems, would allow you to categorize that hospital appropriately by factors other than name or size.
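In code, that institutional knowledge often ends up as a lookup table that someone has to maintain. Here is a minimal sketch, assuming a hand-curated dictionary of former names. The table and the resolve_name helper are hypothetical; only the Upper Valley rename comes from the example above.

```python
# Hypothetical alias table: former or acquired names mapped to the current name.
CURRENT_NAME = {
    "Upper Valley Healthcare Group": "Medical Alliance of the Upper Valley",
}


def resolve_name(name: str) -> str:
    """Follow recorded renames until reaching a name with no newer alias."""
    seen = set()
    while name in CURRENT_NAME:
        if name in seen:
            # Guard against a bad table where two names point at each other.
            raise ValueError(f"cycle in alias table at {name!r}")
        seen.add(name)
        name = CURRENT_NAME[name]
    return name


print(resolve_name("Upper Valley Healthcare Group"))
print(resolve_name("Some Other Hospital"))
```

Names that are not in the table pass through unchanged. The code is the easy part; knowing which renames and acquisitions belong in the table is the hard part.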
The best course of action? Consider finding a partner to drive your analysis efforts from the data backend.
Says Schuchardt, "At a very high level, having someone deliver data that is in a usable format for any use case, is a great leap forward to where you actually want to be working.
"Data is dirty. Always. Comparing data to data is even dirtier than that. There will always be some effort involved, but having some help, having someone who understands, is going to help you ask the right questions, get the data in the right format to accomplish analysis and lighten the burden of you knowing all the industry-specific things you'd otherwise have to know."