Lead with Nothing

New Blog Series: Inconsistent Data Sets

How do you deal with inconsistent data?

Every experienced analyst knows that about 90% of the job is getting your data in shape for analysis. This 90% is not only time-consuming but also more drudgery than enlightenment. This is especially true if you need to merge multiple data sets whose file formats are not identical. A classic example is the rich data from CMS’ Hospital Compare. Every quarter, CMS releases updated results, and too often those updates are formatted differently from their predecessors: changed field names, new data formats, altered data structures, and so on. How can a health care analyst spin Hospital Compare gold from the mess of data file straw?

For those who don’t know, Hospital Compare offers extensive quality and other data for dozens of measures, thousands of hospitals, and multiple years. The challenge is how to merge it all when the file format may vary from year to year.

In the next 6 posts, Hannah Sieber, Software Engineer at FHC, will discuss the challenges and a very real solution that speeds production time, reduces analyst angst, and prevents human errors.

Although Hannah’s example is Hospital Compare, FHC’s solution will work for all manner of merged data sets: hospital discharge data merged across several states; all-payer claims data (APCD) across multiple years or states; population health data for one provider organization from multiple insurance plans; and many more.

I hope you enjoy the series. If you need help with data tasks like these, please reach out. We’d be glad to help.

Medical claims extracts will frequently drop the leading zeros from codes as shown in this dataset. It is important to add leading zeros back to your codes before using crosswalks and code sets.

Revenue center codes should always be zero-padded to 4 digits (‘0000’). HCPCS codes can be zero-filled to 5 digits (‘00000’). DRG and bill type codes should be zero-filled to 3 digits (‘000’). Working with ICD codes is trickier (see “International Classification of Disease, Period!”).
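
As a rough illustration of the padding rules above, the following Python sketch restores leading zeros with the standard `str.zfill` method. The field names (`revenue_center`, `hcpcs`, `drg`, `bill_type`) and the helper functions are hypothetical and chosen only for this example; the widths come from the paragraph above. ICD codes are deliberately left out because they need separate handling.

```python
# Target widths come from the padding rules in the text.
# The field names are hypothetical; rename them to match your extract.
PAD_WIDTHS = {
    "revenue_center": 4,
    "hcpcs": 5,
    "drg": 3,
    "bill_type": 3,
}


def restore_leading_zeros(value, width):
    """Left-pad a code with zeros to a fixed width, leaving blanks untouched."""
    text = str(value).strip()
    if not text:
        return text
    return text.zfill(width)


def pad_record(record):
    """Return a copy of a claims record with the known code fields zero-padded."""
    padded = dict(record)
    for field, width in PAD_WIDTHS.items():
        if padded.get(field) is not None:
            padded[field] = restore_leading_zeros(padded[field], width)
    return padded


example = {"revenue_center": 450, "hcpcs": "99213", "drg": 39, "bill_type": "131"}
print(pad_record(example))
```

This sketch assumes the codes arrive as integers or strings. If a spreadsheet import has turned them into floating-point values (for example `450.0`), they would need to be converted back to whole numbers before padding.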

When an extract has dropped leading zeros, it is critical that it retains the ‘.’ (and vice versa). Consider our dataset above; a copy of the original dataset was converted to Excel without maintaining the leading zeros (as frequently happens when importing from .csv files). The first row shows an ICD-9-CM diagnosis code value of 320. Since both the leading zeros and the ‘.’ have been dropped, it is impossible to know whether this value indicates ‘003.20’ (Localized salmonella infection, unspecified) or ‘032.0’ (Faucial diphtheria).
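
To make the ambiguity concrete, here is a minimal Python sketch that lists every numeric code format a stripped value such as 320 could have come from. It assumes the usual ICD-9-CM diagnosis layout of three digits before the ‘.’ and zero to two digits after it, and it ignores the alphanumeric V and E codes. The function name `candidate_icd9_codes` is hypothetical, and the function does not check whether each candidate is an actual code in the code set.

```python
def candidate_icd9_codes(stripped):
    """List the numeric ICD-9-CM diagnosis formats that could have produced `stripped`.

    `stripped` is a value that has lost both its leading zeros and its '.'.
    Assumes three digits before the '.' and zero to two digits after it.
    """
    if not stripped.isdigit() or len(stripped) > 5:
        raise ValueError(f"not a stripped numeric ICD-9-CM value: {stripped!r}")

    candidates = []
    # The original code had between 3 and 5 digits in total.
    for total_digits in range(max(len(stripped), 3), 6):
        digits = stripped.zfill(total_digits)  # put back the assumed leading zeros
        if total_digits == 3:
            candidates.append(digits)
        else:
            candidates.append(digits[:3] + "." + digits[3:])
    return candidates


print(candidate_icd9_codes("320"))
```

More than one candidate means the original code cannot be recovered from the stripped value alone.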

All extracts you work with should include either the ‘.’ or all leading zeros for ICD codes (with the exception of ICD-10-PCS). If not, go back to your extract provider and request a new copy.

Dr. Olmsted has a Ph.D. in Economics and has been working with healthcare data for over twenty years for companies including IHCIS (now part of Optum), RTI International, and Health Dialog.
