Why real-world data is messy

I was describing a real-world data experience from my previous role to someone, and they thought it was a pretty cool story.

This is that story.

—

It's March 2020, and the entire world is in upheaval due to a strange new virus.

We get a request at work. Some data is coming in from a hospital in Wuhan! Do we want to take a crack at it?

We say, “please send it our way asap,” obviously.

Imagine conducting research in an active hospital during the chaotic beginnings of a new pandemic.

Now imagine the who, where, what, and how of recording that data.

Who? Whoever can.
Where? Excel.
What? Numbers, IDs, dates, etc.
How? Manual data entry.

You’re getting the idea.

Politely put, it’s a bit of a 💩.
❌ Different people manually enter data into one of a dozen worksheets
❌ Typos are rampant, including in fields needed to join worksheets together
❌ Dates are whatever format each person prefers — sometimes, this includes Chinese characters
❌ Other unexpected characters prevent standard tools from even ingesting the data
❌ Information is color-coded

Some data cleaning just has to be done manually, with care to repeat the same steps the next time an updated Excel is emailed over. (This happens multiple times.)

Any questions take a while to resolve, if at all. Time zones, language barriers, and the whole pandemic thing. There’s no redoing anything on their end.

Doctors at the hospital are trying out two drugs, so the most urgent question is “which is better?”

No placebo arm, and handful of patients are somehow taking both. How do we define “better”, anyway?

Time to recovery is an obvious one.

We can start counting at different times, and all make sense:
• First symptoms noticed?
• Felt sick enough to go to the hospital?
• First positive test result?

There's ambiguity. At least one patient just leaves the hospital after a long stay without ever testing negative.

With fewer than 100 patients in this ad-hoc study, though, we can't simply exclude everyone with non-ideal data. We define things multiple ways and look at them all.

There are other ways to measure “better” as well. But this is a busy hospital and not carefully planned research. One patient is measured on days 2, 4, 6, and 8, another on 3, 4, 5, and 10. It's hard to make a 🍎 to 🍎 comparison.

Moral of the story: When reality is messy, data tends to be, too!

This is why only a fraction of real data science work resembles activities taught in a course or school. 📈

—

If you’ve made it this far, I hope you enjoyed the read.

To my fellow data scientists — may all of your data be squeaky clean. 🥰

P.S. In the end, we found that one drug led to improved time to recovery and lower inflammation. We published this as a peer-reviewed article in May 2020, and it has over 68,000 views.