Healthcare data is just messy. That’s the simplest way to put it. It doesn’t live in one place, it doesn’t follow one format, and most of the systems it comes from weren’t built to work together.
Many teams still approach it like any other data problem. They set up pipelines, connect sources like EHRs or claims systems, and get things running. And to be fair, that part usually works. Data moves, dashboards get built, things look like they’re on track.
The problems show up later, usually when someone starts asking basic questions, like why two reports don’t match or how a number was calculated. That’s when you realize no one has a clean answer. It’s not because people aren’t capable. It’s because once data starts moving across multiple systems, it becomes hard to track unless you’ve been very intentional about it from the start. Most teams aren’t, at least not early on.
Getting visibility into how healthcare data moves
Healthcare makes this harder because data is always moving. Between providers, labs, insurers, and internal systems, there are too many touchpoints. Add distributed pipelines on top of that, and you lose visibility pretty quickly.
People talk about lineage, audit trails, all of that. It’s important, but in reality, it often gets pushed to later stages. By then, everything is already complicated and harder to untangle.
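To make lineage less abstract, here is a minimal Python sketch of recording an audit entry every time a pipeline step runs. The step name, the record fields, and the helper functions are all hypothetical; a real pipeline would likely lean on whatever lineage tooling its platform already provides.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(rows):
    """Stable hash of a batch so a later audit can tell whether the data changed."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_step(name, func, rows, lineage):
    """Run one transformation and append an audit entry describing it."""
    result = func(rows)
    lineage.append({
        "step": name,
        "ran_at": datetime.now(timezone.utc).isoformat(),
        "rows_in": len(rows),
        "rows_out": len(result),
        "input_hash": fingerprint(rows),
        "output_hash": fingerprint(result),
    })
    return result


def drop_missing_patient_id(rows):
    """Illustrative transformation: discard rows without a patient identifier."""
    return [r for r in rows if r.get("patient_id")]


# Hypothetical sample batch, not real data.
lineage = []
raw = [{"patient_id": "p1", "lab": "a1c"}, {"patient_id": None, "lab": "a1c"}]
clean = run_step("drop_missing_patient_id", drop_missing_patient_id, raw, lineage)
```

Even a log this small answers the question that usually comes up later: which step changed the row count, and when.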
Access control is another one that sounds easy, but rarely is. Not everyone should see everything, but figuring out who should see what takes effort. When that’s not clear, people either get blocked or they work around the system. Both happen more than you’d expect.
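One way to make "who should see what" explicit is to write it down as data rather than leave it implied. The sketch below is a simplified Python illustration; the role names and field names are assumptions made up for the example, not a recommended policy.

```python
# Hypothetical roles and the fields each one is allowed to read.
ROLE_FIELDS = {
    "analyst": {"age_band", "diagnosis_code"},
    "care_team": {"patient_id", "age_band", "diagnosis_code", "notes"},
}


def visible_record(record, role):
    """Return only the fields the role may see; unknown roles see nothing."""
    allowed = ROLE_FIELDS.get(role, set())
    return {key: value for key, value in record.items() if key in allowed}


# Hypothetical sample record, not real data.
record = {
    "patient_id": "p1",
    "age_band": "40-49",
    "diagnosis_code": "E11",
    "notes": "follow up in 3 months",
}
analyst_view = visible_record(record, "analyst")
```

Defaulting unknown roles to an empty view is a deliberate choice here: when the mapping is unclear, the safer failure is someone getting blocked and asking, not someone seeing too much.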
Definitions are where things quietly go wrong. The same field can mean slightly different things depending on the source system. If that isn’t clearly documented, you end up comparing things that look the same but aren’t.
I saw this on a project where multiple pipelines were feeding into reporting. From the outside, everything looked fine. Data was flowing, dashboards were live, no obvious errors.
But once we started digging into it, it became clear no one fully understood how the data was structured. We had to map it out ourselves just to get a handle on it. And once we did, we found inconsistencies everywhere. Same fields, different meanings, which led to incorrect metrics showing up in dashboards people were actually using.
Nothing broke. That’s what made it tricky. It was just wrong in ways that weren’t obvious.
We fixed it by aligning definitions and adding validation checks before data reached reporting. It wasn’t a complex fix, but it should have been done much earlier.
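As a rough illustration of that kind of check, and not the code from that project, here is a minimal Python sketch. The field names, units, and source names are hypothetical; the idea is simply that each incoming field is compared against one agreed definition before anything reaches reporting.

```python
from dataclasses import dataclass

# Hypothetical shared definitions: one agreed unit per field.
CANONICAL_FIELDS = {
    "length_of_stay": "days",
    "claim_amount": "usd",
}


@dataclass(frozen=True)
class SourceField:
    source: str  # illustrative source system name, e.g. "ehr" or "claims"
    name: str
    unit: str


def find_definition_mismatches(fields):
    """Return human-readable problems; an empty list means every field aligns."""
    problems = []
    for f in fields:
        expected = CANONICAL_FIELDS.get(f.name)
        if expected is None:
            problems.append(f"{f.source}.{f.name}: no agreed definition")
        elif f.unit != expected:
            problems.append(
                f"{f.source}.{f.name}: unit {f.unit!r}, expected {expected!r}"
            )
    return problems


def validate_before_reporting(fields):
    """Stop the load into reporting if any field disagrees with the shared definition."""
    problems = find_definition_mismatches(fields)
    if problems:
        raise ValueError("; ".join(problems))


# Same field name, different meaning depending on the source.
incoming = [
    SourceField("ehr", "length_of_stay", "days"),
    SourceField("claims", "length_of_stay", "hours"),
]
```

Calling `validate_before_reporting(incoming)` raises an error here instead of letting a quietly wrong metric through, which is the point: the failure becomes visible before the dashboard does.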
Working within healthcare regulations
A lot of people think regulations are the hardest part of healthcare data systems. They are not. The harder part, in my opinion, is discipline. You can have all the compliance rules in place, but if your pipelines aren’t well structured, those rules don’t really help.
There’s also more AI being layered into these systems now. That adds another layer of risk. If the data underneath isn’t solid, the models just carry those issues forward. Sometimes they make them harder to spot.
I’ve seen similar issues in other industries, too. In financial systems, we built monitoring that could catch problems in real time across different parts of the pipeline. It helped a lot, mostly because we finally had visibility into what was actually happening. That’s really the core problem in most of these setups. Lack of visibility. Small issues go unnoticed until they start affecting outputs.
Where healthcare data systems tend to struggle
Healthcare systems struggle most when governance is treated as something to deal with later. Teams focus on getting pipelines running, and structure becomes an afterthought. By the time they come back to it, things are already complicated.
Data from different systems gets stitched together without anyone fully aligning how each field is defined. Over time, those small mismatches turn into bigger issues.
The teams that do this well don’t necessarily use different tools. They just spend more time upfront making sure things are clear. They document, they standardize, and they monitor early instead of reacting later.
Supporting analytics without losing control
There’s a lot of pressure right now for healthcare teams to do more with their data. Improving outcomes, reducing costs, and supporting broader public health work all depend on being able to actually use the data in a meaningful way. Distributed systems make that possible. You can connect different datasets, run large-scale analysis, and start getting insights that weren’t accessible before. That part is exciting, and it’s where most teams focus.
The harder part is making sure the system underneath all of that is reliable. If the data isn’t consistent, or people don’t trust where it came from, the analysis doesn’t hold up. It might look right, but that doesn’t mean it is. What I’ve seen work is when teams stop treating governance and analytics as two separate things. Instead of building pipelines first and worrying about control later, they design both together. It doesn’t have to be overly complex, just clear enough that people understand the data and how it’s being used.
It does take more effort upfront. There’s no way around that. But over time, those systems tend to be more stable and easier to work with. You spend less time fixing issues and more time actually using the data.
This isn’t unique to healthcare, but it shows up more clearly here because the stakes are higher. When the data is tied to real decisions, you notice pretty quickly if something isn’t right. It’s not exciting work, but it makes everything else easier. And in healthcare, where the data actually matters, that tradeoff is worth it.
Aniket Abhishek Soni is a senior data engineer and researcher with more than six years of experience designing and leading large-scale data pipelines, cloud platforms, and AI-enabled solutions across highly regulated industries, including healthcare, financial services, and climate research. His work focuses on making enterprise data systems more reliable, governable, and performant to support advanced analytics and applied artificial intelligence in real-world environments.
This post appears through the MedCity Influencers program. Anyone can publish their perspective on business and innovation in healthcare on MedCity News through MedCity Influencers.
