
Why are 5% of Services Occurring on January 1, 1960?


New Blog Series: Inconsistent Data Sets

How do you deal with inconsistent data?

Every experienced analyst knows that about 90% of the job is getting your data in shape for analysis. That 90% is not only time-consuming but also more drudgery than enlightenment. This is especially true if you need to merge multiple data sets whose file formats are not identical. A classic example is the rich data from CMS’ Hospital Compare. Every quarter, CMS releases updated results, and too often those updates are formatted differently than their predecessors: changed field names, new data formats, altered data structures, and so on. How can a health care analyst spin Hospital Compare gold from the mess of data file straw?

For those who don’t know, Hospital Compare offers extensive quality and other data for dozens of measures, thousands of hospitals, and multiple years. The challenge is how to merge it all when file formats may vary from year to year.

In the next 6 posts, Hannah Sieber, Software Engineer at FHC, will discuss the challenges and a very real solution that speeds production time, reduces analyst angst, and prevents human errors.

Although Hannah’s example is Hospital Compare, FHC’s solution will work for all manner of merged data sets: hospital discharge data merged across several states; all-payer claims data (APCD) across multiple years or states; population health data for one provider organization from multiple insurance plans; and many more.

I hope you enjoy the series. If you need help with data tasks like these, please reach out. We’d be glad to help.

Claims data passed as an extract to a vendor, state agency, or at-risk provider has usually been transferred over half a dozen times.

It began its journey as billing data sent from the provider to the payer. Then it was transferred to the adjudication system (frequently this data is transferred back and forth a few times before it is finalized, but that’s another blog post). Adjudicated data becomes paid data via the payment system. Within the payer organization, this data passes through some kind of financial or general ledger accounting system. Usually the data is then transferred to a ‘warehouse’ for analysis. Lastly, the data is extracted and transferred for a specific purpose. During these transactions, data is passed from one data processing system to another, and in healthcare some of these systems are legacy systems built on older platforms. Date fields (I’m using ‘date’ here as shorthand for both date and date-time data) are one area where this process can lead to problems.

Think about it this way. Dates are usually stored as a numeric value indicating the number of units past a particular day. Importantly, a numeric date of 0 indicates not the absence of an amount (as it would in a standard numeric field) but an actual day. Which day depends on the data storage language being used:

MS SQL Server – 1900-01-01 00:00:00.000

MySQL – 0000-00-00 00:00:00.000

Excel – 1900-01-00

SAS – 1960-01-01
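To make these epoch differences concrete, here is a minimal Python sketch (the dictionary of epochs and the Excel approximation are mine for illustration, not from any particular system’s documentation) that decodes the same stored day count against several of the systems above. MySQL’s 0000-00-00 is omitted because it is not a representable calendar date.

```python
from datetime import date, timedelta

# "Day zero" for a few common systems. Excel's serial 0 is a nominal
# 1900-01-00, so 1899-12-31 is used here as an approximation.
EPOCHS = {
    "MS SQL Server": date(1900, 1, 1),
    "Excel (approx.)": date(1899, 12, 31),
    "SAS": date(1960, 1, 1),
}

def decode_serial(days: int, system: str) -> date:
    """Convert a stored day count into a calendar date for a given system."""
    return EPOCHS[system] + timedelta(days=days)

# The same stored value, 0, decodes to a different "real" day in each system.
for system in EPOCHS:
    print(f"{system:>16}: serial 0 -> {decode_serial(0, system)}")
```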

The SAS date is an important one because many legacy warehouses, and especially data extraction code, run on SAS. A null date can often get converted to 0 as data gets passed back and forth, and what was missing becomes an actual date. So quality check processes that only look for “null” in the date field can miss these important omissions. We’ll explore this more in future posts.
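As a sketch of what a more defensive check might look like (the field names and the sentinel list here are hypothetical, not drawn from a real extract), the idea is to flag both true NULLs and the epoch dates that a zero value can decode to:

```python
from datetime import date

# Hypothetical sentinel values a zero/"null" date can surface as after
# passing through SAS, SQL Server, or Excel based systems.
SUSPECT_EPOCH_DATES = {date(1960, 1, 1), date(1900, 1, 1), date(1899, 12, 31)}

def flag_suspect_dates(records, field):
    """Return records whose date field is missing OR equal to a known epoch.

    A check that tests only for None would pass the 1960-01-01 rows,
    even though they almost certainly started life as missing values.
    """
    return [
        r for r in records
        if r.get(field) is None or r.get(field) in SUSPECT_EPOCH_DATES
    ]

claims = [
    {"claim_id": "A1", "admit_date": date(2021, 3, 14)},
    {"claim_id": "A2", "admit_date": None},              # true NULL
    {"claim_id": "A3", "admit_date": date(1960, 1, 1)},  # SAS zero date
]
print([r["claim_id"] for r in flag_suspect_dates(claims, "admit_date")])
# -> ['A2', 'A3']
```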

This happens with many fields, but most frequently with admit date, discharge date, and adjudication date, as shown in this dataset. This is a problem for trending data and particularly for identifying average length of stay.
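To see why a stray 1960-01-01 matters for length of stay, here is a small illustration with made-up records; a single zero admit date is enough to swamp the average:

```python
from datetime import date

# Hypothetical discharge records; the second row carries a SAS zero
# admit date that slipped past a NULL-only quality check.
stays = [
    {"admit": date(2021, 3, 1), "discharge": date(2021, 3, 5)},
    {"admit": date(1960, 1, 1), "discharge": date(2021, 3, 5)},
]

los_days = [(s["discharge"] - s["admit"]).days for s in stays]
print(los_days)                       # [4, 22344]
print(sum(los_days) / len(los_days))  # 11174.0 -- the average is wildly inflated
```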

Dr. Olmsted has a Ph.D. in Economics and has been working with healthcare data for over twenty years for companies including IHCIS (now part of Optum), RTI International, and Health Dialog.
