What is Health Big Data?


Scientists are figuring out how to turn massive amounts of data on health and disease into brand new insights about human biology.

By Belinda Smith

Our digital universe is massive, and ballooning exponentially. The total amount of data produced in human history to the year 2013 was 4.4 trillion gigabytes, and by the time 2020 rolls around, that figure will have reached 44 trillion gigabytes.

If that much information were contained in a string of USB flash drives, each storing 128 gigabytes, it would reach the Moon and back almost 25 times.
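The comparison can be sanity-checked with a few lines of arithmetic. The drive length and Earth–Moon distance below are assumptions, not figures from the article:

```python
# Back-of-the-envelope check of the USB flash drive comparison.
# Assumed: a typical USB stick is about 5.6 cm long, and the average
# Earth-Moon distance is roughly 384,400 km.

TOTAL_DATA_GB = 44e12        # 44 trillion gigabytes (projected for 2020)
DRIVE_CAPACITY_GB = 128      # capacity of each hypothetical drive
DRIVE_LENGTH_M = 0.056       # assumed physical length of one drive
MOON_DISTANCE_KM = 384_400   # assumed average Earth-Moon distance

num_drives = TOTAL_DATA_GB / DRIVE_CAPACITY_GB
chain_length_km = num_drives * DRIVE_LENGTH_M / 1000
round_trips = chain_length_km / (2 * MOON_DISTANCE_KM)

print(f"{num_drives:.2e} drives, {chain_length_km:.2e} km, "
      f"{round_trips:.1f} Moon round trips")
# ~3.4e11 drives -> about 25 round trips
```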

Can we put all that information to good use? In the case of health, the answer is most definitely yes.

Data that’s smart – not just big

Health big data isn’t just about the huge amounts of information collected by hospitals, such as admission records and test results. The steps tracked by your Fitbit and iPhone also come into play. And while such data sets are rooted in healthcare, extracting usable insights from them can be an incredibly complicated feat.

“It’s science, but there’s a lot of art in this,” says Professor Sallie Pearson, head of the Medicines Policy Research Unit at the UNSW Centre for Big Data Research in Health. “Records are generally collected for payment or administration, not research.”

That means to glean meaningful information from health-related big data, a data scientist must first define a sensible research question – something that can inform clinical advice or policy. For example, does a patient with a body mass index (BMI) that classes them as ‘overweight’ respond better to cancer drug A or B?

The next step, says Andrew Blance, one of the Centre’s biostatisticians, is to design a suitable study. How that’s conducted, he explains, “depends on the policy or research question you’re asking”.

Common analyses include: classification (such as deciding if a tumour is cancerous); clustering (does a combination of traits result in a disease?); and association rules (if X happens, then Y follows).
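As a toy illustration of the third style of analysis, an association rule can be read as a conditional probability: among records containing trait X, how often does trait Y also appear? The patient records below are invented for illustration:

```python
# Toy association-rule check: how often does Y follow when X is present?
# The records and trait names are invented, not real patient data.

records = [
    {"smoker", "high_bp"},
    {"smoker", "high_bp", "diabetes"},
    {"smoker"},
    {"high_bp"},
    {"smoker", "high_bp"},
]

def confidence(records, x, y):
    """P(Y | X): fraction of records containing X that also contain Y."""
    with_x = [r for r in records if x in r]
    if not with_x:
        return 0.0
    return sum(1 for r in with_x if y in r) / len(with_x)

print(confidence(records, "smoker", "high_bp"))  # 3 of 4 smokers -> 0.75
```

Real analyses also weigh how often X occurs at all (the rule’s “support”), so that rules based on a handful of records aren’t over-trusted.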

So, we’ve established our research question. Now, we need to get our hands on the data to answer it.

Mining the best bits

Australian researchers can apply to access a wide array of state and Federal health data, says Professor Louisa Jorm, foundation director of the Centre. This includes hospital inpatient records, Medicare claims, cancer registry data, assisted reproduction (such as IVF) data, and the 45 and Up Study, which tracks more than 250,000 ageing Australians.

For privacy reasons, any identifying information, such as patient names, is scrubbed before the Centre receives it. But how do researchers know if that information is any good? After all, when data is manually entered into a database, human error can always creep in.

To sniff out and eliminate statistical anomalies called outliers, which can indicate errors, these data sets are ‘cleaned’. This involves machine learning algorithms scanning through a data set and picking out anything that looks a bit odd, such as a patient’s blood pressure that’s recorded as 1020/80. There’s almost certainly an extra zero in there before the 2, so that data point is likely to be removed.
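The simplest version of this check is a rule-based plausibility filter. Real pipelines may use machine learning as described above; the bounds and readings below are assumptions for illustration:

```python
# Minimal sketch of data cleaning: flag systolic blood pressure
# readings that fall outside an assumed plausible physiological range.

PLAUSIBLE_SYSTOLIC = (60, 250)  # assumed plausibility bounds, in mmHg

readings = [120, 135, 1020, 118, 90]  # invented data; 1020 has an extra zero

def flag_outliers(values, low, high):
    """Split readings into (kept, flagged) using the plausibility range."""
    kept = [v for v in values if low <= v <= high]
    flagged = [v for v in values if not (low <= v <= high)]
    return kept, flagged

kept, flagged = flag_outliers(readings, *PLAUSIBLE_SYSTOLIC)
print(kept)     # [120, 135, 118, 90]
print(flagged)  # [1020]
```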

Once the data is cleaned of outliers, analysis can begin.

If researchers find that yes, overweight patients really do respond better to cancer drug A than B, they can publish that result and call it quits, right? Not quite – one data set does not make a convincing result.

The same analysis must then be run on a host of different data sets – from different countries, or covering larger groups of people – to be sure the association stands. Only then can the result be published and policies changed to reflect it. If a doctor has an overweight patient, they know that cancer drug A is likely to work better than the other. If the patient loses weight, then they might switch to drug B.

New insights into our own biology

In 2015, Pearson and her team analysed medical records from Australia’s Pharmaceutical Benefits Scheme and found that many patients had reduced their statin use, or stopped taking the drugs altogether, after ABC’s Catalyst program cast doubt on the effectiveness of these heart medications.

And earlier this year, researchers from the Centre analysed data from Australian and New Zealand women undergoing ovarian stimulation treatment (an early step of IVF). They found that women who started the treatment before the age of 30 had a 43.7% chance of giving birth after a single treatment cycle – better odds than might be expected.

But that’s still not the end of the story, Blance says.

After answering one question, data can be mined again to look for even more patterns. And if promising patterns do show up, these can form the basis of another health big data study.

The way to squeeze the most useful information from such de-identified (or scrubbed) data sets, Jorm says, is to link them – the more, the merrier.

“One thing that’s pretty clear is it’s all well to have these silos of data, but to get the most out of them, you have to connect them,” she says. “Look at the whole life of a person.”

Historically, the cross-jurisdictional divide has proved a roadblock for health data researchers in Australia, because simply accessing the information can be the hardest part. To give you an idea, state governments look after hospital data, but the Commonwealth holds medicine-dispensing data – so a study linking the two must negotiate access with both levels of government.

Things are looking up, though. In recent years, de-identified records and information have been allocated unique numbers to bridge data sets and create larger, richer ones.
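In spirit, such a linkage key lets records about the same person be joined across silos without revealing who they are. A minimal sketch, in which the record layouts and key values are invented:

```python
# Sketch of linking de-identified data sets via a shared linkage key.
# Both tables carry the same anonymous key instead of names; all the
# fields and values below are invented for illustration.

hospital = {
    "K001": {"admission": "2016-03-02", "diagnosis": "type 2 diabetes"},
    "K002": {"admission": "2016-05-11", "diagnosis": "hypertension"},
}
dispensing = {
    "K001": {"medicine": "metformin", "dispensed": "2016-03-09"},
    "K003": {"medicine": "atorvastatin", "dispensed": "2016-04-20"},
}

# Inner join: keep only the keys present in both data sets.
linked = {
    key: {**hospital[key], **dispensing[key]}
    for key in hospital.keys() & dispensing.keys()
}
print(linked)  # only the person behind K001 appears in both data sets
```

Real linkage is harder than this: without a perfect shared key, records must be matched probabilistically on imperfect fields, which is a research area in its own right.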

Not only will linked data sets now give health data scientists a look at what works (and what doesn’t) in healthcare, so too will electronic medical records, which are currently being rolled out across the country.

The quantified self

Electronic medical records are far more detailed than hospital admission data. They can contain medical imaging records, pathology results, and notes written by clinicians. They can also be updated in real time.

At this stage, exactly how electronic medical records can most effectively be harnessed by health data scientists isn’t completely clear – but it’s something the Centre will be working on.

For example, “How do you extract useful information from large volumes of free text or radiology images, which are massive files?” says Jorm. But while there are many technical challenges, there are also massive gains to be achieved.

Another rich source of information will be from the ‘quantified self’, such as self-generated fitness tracker data that individuals upload themselves or genomic data from DNA-sequencing companies.

“Integration of this data with traditional data is a big challenge, but an extraordinary opportunity,” says Jorm. “We’re also increasingly exploring ‘found’ data. Are there ways we could potentially use internet-derived data such as tweets or Facebook posts?”

While millennials have a more liberal attitude to data sharing than earlier generations, training for health data scientists remains stuck in the 20th century. There’s a greater need for data and digital literacy in medicine, Jorm says: “Medical schools have lagged behind. There’s minimal training in statistics and no training in computer science.”

With global data production roughly doubling every two years, we’ll need more health data scientists to make sense of it all.

The boom in health big data is only just beginning.

Date Published
Tuesday, 19 September 2017
