De-identified Data is Dangerous

John Ulett, VP/CIO CentraState Medical Center


Most software contracts ask for the right to use de-identified data, and I won’t ever agree to that again! Every contract I’ve reviewed in the last 10 years has requested the right to use the data provided for other purposes, as long as it is de-identified. Privacy is maintained by removing fields that directly identify the patient. Turns out, that isn’t as easy as I thought.

My eyes were opened when I attended In: Confidence, a Privacy Conference in NYC hosted by Privitar. Privitar is a privacy software vendor out of the UK. The conference tagline was “Explore the future of safe and powerful data.” I’ve always known data was powerful; it was the “safe” aspect that changed for me.

My opinion started to change last year when I participated on an NJHIMSS conference panel with David Reis, CIO at Hackensack Meridian Health. He said, “Today’s de-identified data is tomorrow’s identified data.” I learned at the privacy conference just how true that could be.

Carnegie Mellon University did a study, Simple Demographics Often Identify People Uniquely, using data from the 1990 census. It demonstrated that just three fields (gender, date of birth, and 5-digit ZIP code) uniquely identify 87% of the individuals in the United States. Think about it! How often does someone in your ZIP code share your birthdate and gender? Turns out, just 13% of the time.
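To make the idea of "uniquely identify" concrete, here is a minimal Python sketch that counts how many records in a data set are the only ones with their combination of gender, date of birth, and ZIP code. The function name, field names, and toy records are all made up for illustration; this is not the method used in the Carnegie Mellon study.

```python
from collections import Counter

def unique_fraction(records, keys=("gender", "dob", "zip")):
    """Return the share of records whose key combination appears exactly once."""
    if not records:
        return 0.0
    # Count how many records share each (gender, dob, zip) combination.
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    # A record is "unique" when no other record has the same combination.
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)

# Toy, invented records (hypothetical field names and values).
toy_records = [
    {"gender": "F", "dob": "1961-03-14", "zip": "07728"},
    {"gender": "F", "dob": "1961-03-14", "zip": "07728"},
    {"gender": "M", "dob": "1975-11-02", "zip": "07728"},
    {"gender": "F", "dob": "1988-07-30", "zip": "07726"},
]

print(unique_fraction(toy_records))  # 0.5: two of the four records stand alone
```

Running the same kind of count against a vendor's "de-identified" extract would be one rough way to see how many patients still stand alone on those three fields.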

You can easily purchase a database with name, address (including ZIP code), gender and birthdate for any or all portions of the US. Combine that with poorly “de-identified” health data, and Mr. Reis is proven correct: the health data becomes re-identified.
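The combination described above is often called a linkage attack. The following is a rough sketch of how it could work, assuming a purchased list with a name field and a health extract that still carries gender, date of birth, and ZIP code. Every name, field, and record here is hypothetical.

```python
from collections import defaultdict

def reidentify(health_rows, public_rows, keys=("gender", "dob", "zip")):
    """Attach a name to each health row that matches exactly one public row."""
    # Index the purchased list by the three shared fields.
    index = defaultdict(list)
    for p in public_rows:
        index[tuple(p[k] for k in keys)].append(p["name"])

    matches = []
    for h in health_rows:
        names = index.get(tuple(h[k] for k in keys), [])
        # Only an unambiguous match (one candidate) counts as re-identified.
        if len(names) == 1:
            matches.append((names[0], h["diagnosis"]))
    return matches

# Invented purchased list (hypothetical people).
public_rows = [
    {"name": "Pat Example", "gender": "F", "dob": "1988-07-30", "zip": "07726"},
    {"name": "Sam Sample", "gender": "M", "dob": "1975-11-02", "zip": "07728"},
    {"name": "Lee Demo", "gender": "M", "dob": "1975-11-02", "zip": "07728"},
]

# Invented "de-identified" health extract: no names, but the three fields remain.
health_rows = [
    {"gender": "F", "dob": "1988-07-30", "zip": "07726", "diagnosis": "asthma"},
    {"gender": "M", "dob": "1975-11-02", "zip": "07728", "diagnosis": "diabetes"},
]

print(reidentify(health_rows, public_rows))  # [('Pat Example', 'asthma')]
```

The second health row stays anonymous only because two people on the purchased list share its three fields, which is exactly the 13% case.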

Do we know the algorithms our vendors use to de-identify the data? The answer is no. Is there a simple way to make de-identification safer? Yes! I learned that introducing a little “noise” into the data makes it much harder to re-identify.

By noise, Privitar meant techniques like shifting the date of birth forward or back by days, months or even a year or more. Would the outcome of most data use be affected if the patient were a month or a year younger or older? Probably not. Is the data less identifiable? Definitely yes!
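As an illustration of the date-shifting idea, here is a minimal Python sketch that moves a date of birth by a random number of days in either direction. It is an assumption about how such noise might be added, not Privitar's actual technique, and the function name and the 365-day default are invented for the example.

```python
import random
from datetime import date, timedelta

def shift_dob(dob, max_days=365, rng=random):
    """Return dob moved forward or back by a random offset of up to max_days."""
    # Pick a whole number of days between -max_days and +max_days, inclusive.
    offset = rng.randint(-max_days, max_days)
    return dob + timedelta(days=offset)

# Hypothetical patient birthdate; a seeded generator keeps the demo repeatable.
original = date(1961, 3, 14)
noisy = shift_dob(original, max_days=365, rng=random.Random(42))

# The shifted date stays within a year of the real one.
print(abs((noisy - original).days) <= 365)  # True
```

A contract could ask the vendor to apply something like this before any secondary use, with the size of the shift chosen so the analysis still holds.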

The patient data entrusted to us during the care process is powerful. With it, we diagnose and provide treatment. It is up to us to make it safe and make sure it can’t re-identify a patient. One approach is to contractually require vendors to make the data “noisy” when they de-identify it. Or safer yet, don’t let them use it at all!