DeepThink News

Big Data and Clinical Genomics

Mar 27, 2019

Progress Is Being Made Toward Using Big Data for Genomics-Guided Precision Medicine

Precision medicine springs from a paradox. On the one hand, researchers in the field seek to characterize ever smaller populations of patients—to the level of a single person (the so-called n-of-one). In that sense, “precision medicine is almost the antithesis of big data,” said Shawn Sweeney, director of the American Association for Cancer Research (AACR) Project GENIE (Genomics Evidence Neoplasia Information Exchange) Coordinating Center.

On the other hand, achieving the desired level of precision will require a medical system that can learn from the experiences and genomic backgrounds of many, many people. This will require robust, reliable, and standardized analyses of deep, broad datasets that include genomic and other molecular data as well as clinical data about patients’ journeys and outcomes.

Companies aiming to provide pharmacogenomic precision medicine (the right drug to the right person at the right time) are already making significant progress toward gathering and integrating genomic and real-world clinical data to help inform prescribing decisions by pharmacies and clinicians (see sidebar). But it’s in precision oncology that most of the action is happening, though significant challenges still loom. “Although we need it, we don’t really have big data in healthcare yet,” Sweeney said. “A lot of us are in this space to make the data big enough so we can enable automation to happen later on.”

In addition to building deeper datasets, companies striving to provide precision oncology decision support need to integrate genomic and real-world datasets, standardize and democratize the analyses of such datasets, and create a system that can handle ever-increasing types of genomic testing as the field expands.

Despite the data challenges, some companies are already demonstrating the value of integrating genomic and real-world data. These projects include a “Patients Like This” app (CancerLinQ); the use of real-world data to generate natural history studies of breast cancer patients (Project GENIE); the recapitulation of known findings from real-world data (Foundation Medicine); and the use of machine learning (ML) on an aggregated dataset of colon cancer patients (Jintel and Intermountain Healthcare).

Ultimately, all of the companies in this space share the dream of a learning healthcare system built on big data that is constantly growing and incorporating new information. In cancer, each patient’s journey would be recorded into the health record in such a way that it would help inform precision treatments for future patients. “That’s the full learning system and we’re in the early, early stages of that,” said Chris Cournoyer, strategic advisor and former CEO of N-of-One.

Building Deep Datasets

In the field of genomics-guided precision medicine, useful data is found in two big pots: (1) genomic and other molecular datasets, and (2) clinical “real-world” datasets derived from electronic health records (EHRs). The companies and organizations working in this space face a host of challenges.

For example, CancerLinQ, a not-for-profit subsidiary of the American Society of Clinical Oncology (ASCO), is good at extracting clinical data from a variety of EHRs but struggles to obtain computable genomics data. “Genomic testing labs don’t have to share their data in electronic format,” said Wendy Rubinstein, M.D., Ph.D., division director, clinical data management and curation at CancerLinQ. “Largely, they view the electronic format as proprietary.” This means CancerLinQ must go through several more steps to make the genomic data computable, which can sometimes compromise quality.

AACR’s Project GENIE faces the opposite challenge. It recently released a publicly available genomic dataset consisting of next-generation sequencing data for about 60,000 tumors from patients at 19 different institutions. Many of the institutions generate their own genomic data (in computable form); for institutions that send their tumor samples to outside labs (such as Foundation Medicine), Project GENIE has requested and received the genomic data electronically.

When it comes to clinical data, however, Project GENIE has a much smaller amount, and what they have has been primarily retrieved through manual methods. They dig into EHRs of individual patients to gather the clinical data needed to answer specific questions, link the data to the appropriate genomic profile, and de-identify the data prior to use. For specific cohorts of interest, for example, they collected a very detailed dataset—meaning it included any fields that could be remotely relevant to the questions at hand. Going forward, they’ve found a way to define a slightly less detailed and more pragmatic dataset to collect. “You have to wrestle down what data you need to answer the questions you want to ask,” Sweeney said. The pragmatic dataset will include each patient’s status at the time of genetic testing, treatment history and outcomes, pan-cancer data, and certain unique values for specific cancers. “This will provide a fairly complete picture of each patient’s journey with cancer going forward,” Sweeney said.
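As a rough illustration only (the field names below are hypothetical, not GENIE’s actual schema), the kind of pragmatic clinical record described above might be sketched like this:

```python
# A minimal sketch of a "pragmatic" clinical record of the kind described above.
# Field names are hypothetical illustrations, not Project GENIE's actual data elements.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PragmaticClinicalRecord:
    patient_id: str                        # de-identified key linking to the genomic profile
    status_at_testing: str                 # patient status at the time of genetic testing
    treatments: list[str] = field(default_factory=list)          # treatment history
    outcomes: list[str] = field(default_factory=list)            # outcome per treatment
    pan_cancer_fields: dict = field(default_factory=dict)        # fields collected for every cancer type
    cancer_specific_fields: dict = field(default_factory=dict)   # values unique to a specific cancer
    survival_months: Optional[float] = None
```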

When Foundation Medicine and Flatiron Health (both owned by Roche) set out to connect their genomic and clinical data, they faced yet a different conundrum: how to integrate privacy-protected genomic data held by Foundation Medicine with privacy-protected clinical information held by Flatiron Health. First, they needed to make the connection between Jean Doe with a particular date of birth in one dataset and Jean Doe in the other dataset. For this, they used a third-party entity, which “tokenized” the identity information into a scrambled 32-character key that cannot be reverse engineered. “If both Flatiron and Foundation Medicine have seen the same case, both will have the same key,” said Gaurav Singal, chief data officer at Foundation Medicine.
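The third party’s actual algorithm is not described here, but as a minimal sketch of how such tokenization can work (the key scheme and function below are assumptions), a keyed one-way hash can turn the same normalized identifiers into the same 32-character token at both organizations:

```python
# Illustrative sketch of privacy-preserving tokenization for record linkage.
# Not the actual vendor algorithm; the shared key and normalization are assumptions.
import hashlib

SHARED_SECRET = b"key-held-only-by-the-tokenization-service"  # hypothetical

def tokenize(full_name: str, date_of_birth: str) -> str:
    """Return a 32-character token for a (name, date of birth) pair."""
    # Normalize so 'Jean Doe' / 'jean doe ' hash to the same value at both organizations.
    normalized = f"{full_name.strip().lower()}|{date_of_birth.strip()}".encode("utf-8")
    # blake2b with a 16-byte digest yields 32 hex characters; the secret key makes it
    # infeasible to reverse the token or brute-force it from public name/DOB lists.
    return hashlib.blake2b(normalized, digest_size=16, key=SHARED_SECRET).hexdigest()

# If both organizations have seen the same case, both end up with the same key
# and the de-identified records can be joined on it.
assert tokenize("Jean Doe", "1960-01-15") == tokenize("jean doe ", "1960-01-15")
```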

While companies are making progress toward building deep datasets, there remains the problem of structuring, standardizing, and democratizing the data—as well as its analysis—so that healthcare providers can benefit from the insights it contains. “There’s a big data normalization problem,” Cournoyer said. “Big data sits in silos across any hospital system.” Even when healthcare providers own and have access to both genomic data and patient treatment and outcomes data, they aren’t integrating basic information such as stage of cancer or prior treatments into the decision support workflow along with the genetic report. “These are things that can fundamentally change the interpretation and the targeted therapy choice for that patient,” Cournoyer said.

Data silos are not the only problem. Too much data is unstructured, residing in pathology reports, lab reports, and physicians’ text notes. And there is a lack of standardization across EHRs. To help address that problem, ASCO has launched a project called mCODE (minimal Common Oncology Data Elements) to identify the minimal standard data that oncology EHRs should include and that every EHR vendor should support. The goal is to create a common vocabulary so that at least that part of every EHR will be interoperable with other EHRs.
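Purely as an illustration of the “common vocabulary” idea (the field names below are hypothetical and are not the actual mCODE elements), mapping two vendors’ records onto shared element names might look like this:

```python
# Illustrative only: two EHR vendors storing the same facts under different field names,
# normalized onto a hypothetical set of shared element names so the records are comparable.
VENDOR_A = {"dx": "C50.911", "tumor_stage": "Stage IIA", "ecog": "1"}
VENDOR_B = {"diagnosis_code": "C50.911", "stg": "IIA", "performance_status": "ECOG 1"}

def to_common_elements(record: dict) -> dict:
    """Map vendor-specific fields onto hypothetical common element names."""
    return {
        "primary_cancer_condition": record.get("dx") or record.get("diagnosis_code"),
        "stage_group": (record.get("tumor_stage") or record.get("stg", "")).replace("Stage ", ""),
        "ecog_performance_status": (record.get("ecog") or record.get("performance_status", "")).lstrip("ECOG "),
    }

# Once normalized, records from different EHRs describe the same patient the same way.
assert to_common_elements(VENDOR_A) == to_common_elements(VENDOR_B)
```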

An additional challenge going forward will be the growth of different kinds of genetic and molecular testing of tumors and cancer patients. For example, today a patient might have one genetic test to identify late-stage treatment options. In the future, more patients will undergo early screening and early detection testing, and cancer patients will have not just one genetic test but potentially multiple tests over time to track tumor changes, residual disease, and treatment response. “So the datasets are going to grow exponentially,” said Sean Scott, head of business development for Qiagen (which now owns N-of-One). Moreover, the types of assays and samples will likely expand as well, including test types with little standardization, Scott added.

Learning from Real-World Data: Demonstrated Value

Remarkably, despite the data challenges, several companies are already demonstrating that analyzing integrated genomic and clinical datasets can yield valuable insights. CancerLinQ has created an app (currently under revision) called “Patients Like This.” Clinicians often face a patient who has a complicated treatment history, or for whom the guidelines don’t quite apply. Patients Like This allows clinicians to filter EHR data by age, cancer stage, comorbidities, or prior treatment to find patients similar to the one in front of them. They can then look at how those similar patients have been treated and what their outcomes have been. Currently, it is difficult to do this well, but “clinicians express a desire to look at the data in this way,” Rubinstein said.
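As a rough sketch of the “patients like this one” query (the table and column names are hypothetical, not CancerLinQ’s schema), filtering a de-identified cohort might look like this:

```python
# A minimal sketch of finding "patients like this one" in a flat, de-identified table.
# Column names and values are hypothetical illustrations.
import pandas as pd

cohort = pd.DataFrame({
    "age":             [62, 58, 71, 64],
    "cancer_stage":    ["IV", "IV", "III", "IV"],
    "comorbidities":   [{"diabetes"}, set(), {"diabetes", "ckd"}, {"diabetes"}],
    "prior_treatment": ["carboplatin", "cisplatin", "carboplatin", "carboplatin"],
    "outcome":         ["partial response", "progression", "stable disease", "partial response"],
})

# The patient currently in front of the clinician.
index_patient = {"age": 63, "cancer_stage": "IV", "comorbidity": "diabetes",
                 "prior_treatment": "carboplatin"}

similar = cohort[
    cohort["age"].between(index_patient["age"] - 5, index_patient["age"] + 5)
    & (cohort["cancer_stage"] == index_patient["cancer_stage"])
    & cohort["comorbidities"].apply(lambda c: index_patient["comorbidity"] in c)
    & (cohort["prior_treatment"] == index_patient["prior_treatment"])
]
print(similar[["prior_treatment", "outcome"]])  # how similar patients were treated and fared
```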

Project GENIE conducted two breast cancer natural history studies using GENIE’s unified genomic database. The projects identified cohorts of metastatic breast cancer patients who had particularly rare variants, then collected treatment history and outcomes data from their medical records. The goal: to determine whether patients with these variants do better or worse on certain treatments than patients who don’t have them. “We can now answer that question,” said Sweeney. Studies such as this can provide a synthetic control arm for single-arm clinical trials, which are becoming commonplace for the treatment of rare cancers. “If you have 50 patients on the single-arm trial, it is necessary to know if the treatment actually helped by comparison to a natural history cohort,” Sweeney said. “That’s one way of using real-world clinical data.”
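As an illustration of the synthetic-control idea (the numbers and the use of the lifelines library are assumptions, not GENIE’s actual analysis), one could compare time-to-event outcomes between a single-arm trial and a variant-matched natural-history cohort:

```python
# Sketch: compare outcomes of a 50-patient single-arm trial against a variant-matched
# real-world natural-history cohort serving as a synthetic control arm.
# The data below are synthetic; lifelines is one common choice for this comparison.
import numpy as np
from lifelines.statistics import logrank_test

rng = np.random.default_rng(0)
trial_months   = rng.exponential(scale=18.0, size=50)   # hypothetical time-to-event, trial arm
control_months = rng.exponential(scale=12.0, size=200)  # hypothetical natural-history cohort
trial_events   = np.ones_like(trial_months)             # 1 = event observed (no censoring here)
control_events = np.ones_like(control_months)

result = logrank_test(trial_months, control_months,
                      event_observed_A=trial_events, event_observed_B=control_events)
print(result.p_value)  # a small p-value suggests outcomes differ from the natural-history cohort
```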

Foundation Medicine’s success story goes further. They set out to see whether the clinical-genomic database they had built could recapitulate the results of the last dozen clinical research studies of non-small cell lung cancer. Because the Foundation Medicine/Flatiron cohort consists of a broader, more representative population than the clinical research cohorts, being able to recapitulate those results would be validating. “And if not, then that’s interesting as well,” Singal said. In fact, the study, using 2,000 patients from the Foundation Medicine non-small cell lung cancer dataset, recapitulated every one of the known findings. “This was a landmark moment for us,” Singal said. “It established that indeed a real-world dataset created from rigorous genomics and rigorous data extraction can be of the same scientific merit as the approaches we are used to.” Since the end of the study, the NSCLC dataset has grown to 6,000 patients. “Now we can ask the next level of questions,” Singal noted.

Jintel Health recently announced a collaboration with Intermountain Healthcare’s Precision Medicine Program. Jintel used a large cancer dataset (real-world data) and a literature-based knowledge base to train a variety of ML models for prognosis prediction and treatment ranking. Intermountain will apply these models to its clinical and genomic data with the aim of better understanding which targeted therapies work for specific groups of patients.

The collaboration springs from earlier work in which Intermountain showed that genomics-based precision oncology improved survival and quality of life for 44 late-stage metastatic cancer patients at a lower cost than traditional practices. To expand and speed up that work, Intermountain and Jintel collaborated on a pilot project to aggregate clinical and genomic data for a cohort of colon cancer patients from different sources within the Intermountain Healthcare system. They used a subset of that data to further train Jintel’s models. When applied to a held-out test set of patients (to validate the model), the trained model identified the most effective treatment for each patient.
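As a rough sketch of that train-then-validate pattern (synthetic data and a generic scikit-learn classifier stand in for Jintel’s actual models), ranking candidate treatments by predicted response might look like this:

```python
# Sketch of training a response-prediction model on clinical + genomic features plus the
# treatment given, then ranking candidate treatments for a held-out patient.
# All data are synthetic placeholders; this is not Jintel's actual modeling approach.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
clinical_genomic = rng.normal(size=(n, 6))            # stand-in clinical/genomic features
treatment = rng.integers(0, 3, size=n)                # three hypothetical candidate therapies
X = np.column_stack([clinical_genomic, treatment])
y = rng.integers(0, 2, size=n)                        # 1 = responded (synthetic labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

def rank_treatments(patient_features: np.ndarray) -> list[tuple[int, float]]:
    """Score each candidate treatment for one patient and sort by predicted response."""
    rows = [np.append(patient_features, t) for t in range(3)]
    probs = model.predict_proba(np.vstack(rows))[:, 1]
    return sorted(zip(range(3), probs), key=lambda kv: -kv[1])

print(rank_treatments(X_test[0, :6]))  # e.g., [(2, 0.61), (0, 0.47), (1, 0.40)]
```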

In the future, said Ping Zhang, CEO of Jintel, “We want to look at a large population and provide the tools for physicians to not only query but also do some simulation studies with real-world datasets … to enable novel discoveries and help physicians make more informed decisions in real time.”
