Accelerating Treatment Decisions and Drug Discovery Using a Clinical Genomics Data Platform

`01` The Challenge

Streamlining Genomic Data Management and Analysis to Unlock Insights and Drive Innovation in Personalized Genetic Testing and Diagnostics

Natera, Inc., a global leader in cell-free DNA testing, is on a mission to make personalized genetic testing and diagnostics an integral part of standard care, helping to inform earlier, more targeted interventions that lead to longer and healthier lives for millions of people.

Natera Clinical Genomics Data Platform

Natera’s long-term business strategy is centered on improving its existing genetic tests and developing novel ones, and leveraging its vast genomic and clinical data to enable more accurate, personalized diagnoses and care decisions. By simplifying access to paired DNA variants and clinical data, Natera seeks to collaborate with partners in biopharma, biotech, and life sciences, to streamline clinical trials, accelerate drug discovery, and improve time-to-market for new tests and therapies. This approach positions Natera as a leader in personalized genetic testing and diagnostics, while also helping to democratize and commercialize anonymized data, to support ongoing R&D efforts in the industry.

Natera’s flagship product, a personalized molecular residual disease (MRD) test, Signatera, has already enabled Natera to start transforming how various cancers are treated and managed, through its tumor-informed approach:

Personalized Design. Custom-built for each patient based on their tumor’s unique mutation signature
Ultrasensitive Detection. Utilizes circulating tumor DNA (ctDNA) for highly accurate disease monitoring
Longitudinal Monitoring. Enables ongoing assessment with just a blood sample after initial design

With over 200,000 patients tested and extensive clinical validation, this test established Natera as a trusted leader in MRD testing.

Despite the success of its MRD test, however, Natera was looking for more efficient ways to prepare, manage, and leverage its vast, continuously growing genomic and clinical datasets on its existing comprehensive genomics database (CGDB) solution.

Some of Natera’s Challenges Included:

Data Silos. Disconnected data sources limited the ability of Natera’s technical and non-technical staff to query, analyze, and collaborate on data (e.g., highly sensitive circulating tumor DNA) across different testing modalities.
Data Handling Inefficiencies. Reliance on manual and semi-manual processes extended data processing time (e.g., for personalized test designs based on whole exome sequencing), increased operational costs, and raised the risk of errors.
Performance Issues. Constantly increasing data volumes, including those from ctDNA detection, affected the speed of data analysis and reporting, which is crucial for timely MRD surveillance.
Lack of Standardized Workflows. Inconsistent workflows across different testing modalities created pipeline inefficiencies (e.g. for longitudinal monitoring). Natera’s data scientists, researchers, clinicians, and oncologists had to spend more time “moving” and “connecting” data instead of focusing on the data itself and its insights.
Infrastructure Overhead. The platform’s infrastructure was large and expensive to maintain, lacked scalability to accommodate more continuously growing datasets, and struggled to process whole exome sequencing (WES) data for personalized assay designs.
Data Accessibility. Real-world clinico-genomic data from 200,000 patients was difficult to access quickly and at scale. This caused bottlenecks in test development and in clinical decision-making on test results, and made it difficult for Natera’s biopharma partners to quickly access the datasets they need for their drug discovery and drug development initiatives.
Commercialization Barriers. The platform’s limited data collaboration capabilities created obstacles for Natera to quickly improve and democratize its unique datasets that could attract revenue streams from biopharma, biotech, and life sciences companies.
Time-to-Market Delays. Slower data processing and analysis delayed insights for internal (data scientists, researchers) and external (biopharma partners, scientists) users, while affecting test development and improvement over the long run.
Scalability Limitations. Natera’s existing CGDB could not keep up with growing, petabyte-scale volumes of complex and increasingly diverse, real-world genomic and clinical data, especially across different testing modalities, such as primary oncology tumor tissue samples, matched normal tissue, and longitudinal blood samples.

To address these challenges and capitalize on emerging opportunities, Natera’s leaders recognized the need for a modern, cloud-based CGDB solution. They expected their new Clinical Genomics Data Platform to:

Integrate disconnected datasets, automate workflows, and standardize processes to streamline variant annotation and data management.
Enable faster querying and visualization of real-world genomic and clinical data, empowering researchers and clinicians with near real-time insights.
Support large-scale genomic data processing and analysis, while reducing infrastructure costs, optimizing usage of staff resources, and improving data ownership across the organization.
Unlock new opportunities to monetize Natera’s unique anonymized datasets, creating new revenue streams from partnerships with biopharma, biotech, and life sciences companies.
Reduce the time-to-insight for personalized treatment decisions (including on their MRD test) and drug discovery and development, driving faster time-to-market for innovations in the industry.

Recognizing the complexity of developing such a platform, Natera joined forces with Provectus, an AWS Premier Tier Services Partner with AWS competencies in Data & Analytics and Migration.

This Partnership Aimed to Tackle Several Key Objectives:

Develop a comprehensive, forward-looking data and analytics strategy for Natera’s new Clinical Genomics Data Platform
Migrate vast genomic and clinical datasets to AWS, with the ability to store, manage, and run analyses on data more cost-efficiently
Develop and implement a unified data warehouse, enabling both internal and external users to efficiently query, visualize, and collaborate on data
Develop user-friendly, customized self-service data products and dashboards to support cohort identification, cohort analysis, and unified data analytics tasks
Leverage AWS services, including AWS HealthOmics and Amazon QuickSight, for scalable and more operations-driven genomics workflows

Through collaboration with Provectus, Natera sought to overcome its current challenges and unlock new business opportunities. Their new Clinical Genomics Data Platform would address immediate data management and analysis needs, while positioning Natera as a partner of choice for clinico-genomic data analysis in early-stage cancer within the Healthcare & Life Sciences industry.

To bring this vision to life, Natera and Provectus set out on a collaborative effort to transform Natera’s data-driven drug discovery and research operations, and to advance personalized genetic testing and diagnostics, and precision oncology.

`02` The Approach

Bridging Genetics, Data, and Cloud on a Clinical Genomics Data Platform Built with AWS HealthOmics

Provectus started by developing a comprehensive data and analytics strategy for Natera, to deliver a highly scalable and cost-effective CGDB solution, the Clinical Genomics Data Platform, powered by AWS services.

The solution was implemented in three major phases:

Phase 1: Migration Readiness Assessment and Prototype

The partnership began with a Migration Readiness Assessment (MRA), during which Provectus developed a Prototype of the platform using AWS HealthOmics and Amazon QuickSight.

The Prototype, delivered in a matter of weeks, showcased that it is possible to:

Seamlessly execute data engineering workflows to streamline and accelerate Natera’s data- and analytics-driven operations
Efficiently store, retrieve, organize, and share transformed genomic and clinical data in HealthOmics storage
Quickly build business intelligence dashboards using Amazon QuickSight, to enable non-technical users to leverage data in near real-time

This initial phase was crucial in demonstrating the potential of the envisioned platform, and set the stage for full-scale development.

Phase 2: Platform Development and Implementation

The Clinical Genomics Data Platform was designed and built as Natera’s foundational solution for storing, processing, analyzing, visualizing, and collaborating on large-scale genomic and clinical data. The platform was developed and implemented in a matter of months.

A strategic partnership with AWS significantly enhanced the development of the Clinical Genomics Data Platform, ensuring alignment with Natera’s long-term goals. This collaboration empowered the project team to fully harness AWS’s cutting-edge services, including privileged access to beta features.

Key Capabilities of the Delivered Clinical Genomics Data Platform Include:

Large-Scale Genomic Data Handling. By leveraging AWS HealthOmics, the platform enables Natera to efficiently store, retrieve, organize, and share various genomic data formats, including metadata. AWS services also help implement advanced cost and performance optimization strategies for genomic data management.
Efficient Data Querying and Processing. The platform uses Amazon Athena to run queries on complex real-world data without any engineering heavy lifting. It uses Apache Iceberg as the open table format for its data lake, enabling interactive analysis with standard SQL commands. These services help implement data partitioning and indexing for more efficient querying.
Complex Analytical Workloads Management. Amazon Redshift (Spectrum) is used to query and integrate genomic and clinical datasets. Custom queries and stored procedures can be used to uncover clinical-genomic correlations. For example, the platform can identify genetic variants associated with specific diseases or conditions, which can be targeted to improve treatment outcomes.
Data Visualization and Insight Generation. The platform is also powered by Amazon QuickSight to enable non-technical users, such as clinical researchers and drug developers, to create interactive self-service dashboards and visualizations for collaboration.
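
The kind of clinical-genomic correlation query described above can be sketched in a few lines. This is a minimal illustration only: an in-memory SQLite database stands in for the platform's query engines, and the table names, column names, and rows are made up for the example rather than taken from Natera's schema.

```python
import sqlite3

# Local stand-in for the platform's SQL engines; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE variants (patient_id TEXT, gene TEXT, variant_type TEXT);
    CREATE TABLE clinical (patient_id TEXT, tumor_type TEXT);
""")
conn.executemany("INSERT INTO variants VALUES (?, ?, ?)", [
    ("P1", "TP53", "SNP"),
    ("P2", "TP53", "SNP"),
    ("P2", "KRAS", "INDEL"),
    ("P3", "KRAS", "SNP"),
])
conn.executemany("INSERT INTO clinical VALUES (?, ?)", [
    ("P1", "lung"),
    ("P2", "lung"),
    ("P3", "bladder"),
])

# Join variant data to clinical data and count distinct patients
# carrying a variant in each gene, per tumor type.
QUERY = """
    SELECT c.tumor_type, v.gene, COUNT(DISTINCT v.patient_id) AS patients
    FROM variants AS v
    JOIN clinical AS c ON c.patient_id = v.patient_id
    GROUP BY c.tumor_type, v.gene
    ORDER BY c.tumor_type, v.gene
"""
rows = conn.execute(QUERY).fetchall()
for tumor_type, gene, patients in rows:
    print(f"{tumor_type:8s} {gene:5s} {patients}")
```

The same join-and-aggregate shape would carry over to a partitioned data lake table, where filtering on the partition column limits how much data each query scans.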

In addition to these core functionalities, the Clinical Genomics Data Platform incorporated several specialized components, which were realized as follows:

The Variant Call Format (VCF) Pipeline was partially migrated from Natera’s CGDB to AWS to reduce infrastructure costs and improve data ownership. This ensured a reliable VCF annotation workflow while minimizing data latency. Compliance with regulatory requirements was maintained throughout the migration process.
The Variant Data Store (VDS) was designed, built, and delivered to contain somatic WES samples linked to tests available in Natera’s existing solution. It supported efficient querying and linkage across clinical and result data stores, facilitating cohort selection and ensuring robust data governance to meet regulatory standards.
The Integrated Data Warehouse was developed to harmonize real-world variant, clinical, and report data in a single repository. It ensured consistent data models, schemas, and dictionaries across all data types and implemented data governance practices to maintain high data quality and consistency.
The Cohort Builder provides customizable functionality for identifying, building, and analyzing cohorts. It serves as a streamlined, self-service UI that unlocks insights in seconds, allowing users to analyze SNP/INDEL prevalence, track patient journeys, and monitor changes in test results for any cohort or the entire population with a few clicks. From a technology standpoint, the tool allows Natera to significantly minimize data duplication and reduce storage costs.
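
To make the Cohort Builder's core operations concrete, here is a small sketch of cohort selection, SNP/INDEL prevalence, and detection of changed test results. It is not the tool's implementation: the `Patient` record, the function names, and the sample data are all hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    """Hypothetical record; the real platform's data model is not described in the source."""
    patient_id: str
    tumor_type: str
    variant_types: frozenset  # e.g. {"SNP", "INDEL"}
    results: tuple            # test results in chronological order


def build_cohort(patients, tumor_type):
    """Select all patients with the given tumor type."""
    return [p for p in patients if p.tumor_type == tumor_type]


def prevalence(cohort, variant_type):
    """Fraction of the cohort carrying at least one variant of this type."""
    if not cohort:
        return 0.0
    carriers = sum(1 for p in cohort if variant_type in p.variant_types)
    return carriers / len(cohort)


def changed_results(cohort):
    """IDs of patients whose latest test result differs from their first."""
    return [
        p.patient_id
        for p in cohort
        if len(p.results) >= 2 and p.results[0] != p.results[-1]
    ]


patients = [
    Patient("P1", "lung", frozenset({"SNP"}), ("positive", "negative")),
    Patient("P2", "lung", frozenset({"SNP", "INDEL"}), ("positive", "positive")),
    Patient("P3", "bladder", frozenset({"INDEL"}), ("negative",)),
    Patient("P4", "lung", frozenset(), ("negative", "positive")),
]

lung = build_cohort(patients, "lung")
snp_prevalence = prevalence(lung, "SNP")
indel_prevalence = prevalence(lung, "INDEL")
changed = changed_results(lung)
```

Because a cohort here is just a filtered view over shared records, nothing is copied per cohort, which is one plausible way a tool like this could keep data duplication low.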

Together, these components enhance Natera’s genomic data management and analysis capabilities, enabling seamless, end-to-end data flow from raw genomic processing to cohort analysis and reporting.

Implementation for MRD Testing

Natera’s molecular residual disease testing solution was the first product to benefit from the Clinical Genomics Data Platform.

This implementation demonstrated the platform’s ability to:

Handle complex and personalized, petabyte-scale genomic data
Process and analyze tumor-specific mutations
Generate personalized, tumor-informed testing panels
Track longitudinal changes in ctDNA levels

The Clinical Genomics Data Platform has proven its potential to revolutionize cancer treatment and management long-term by enabling Natera to handle personalized genetic testing and diagnostics, achieving results faster and on a larger scale.

Phase 3: Cost and Performance Optimizations, and Data Pipeline Migration

In genomic data processing, performance and cost efficiency are key. Provectus implemented strategic optimizations for Natera’s Clinical Genomics Data Platform to improve processing speeds, enhance scalability, and reduce operational expenses.

To optimize performance and achieve a 300x faster annotation speed, parallel processing of genomic data was implemented, allowing for simultaneous data tasks that cut down processing time. Database queries and indexing were optimized to ensure rapid data retrieval.
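
The batch-and-parallelize pattern behind this kind of speedup can be sketched as follows. This is an illustration under stated assumptions, not the platform's pipeline: the annotation step is a toy that labels variants by allele length, the function names are hypothetical, and a thread pool stands in for whatever workflow engine actually distributes the work.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def annotate_batch(batch):
    """Toy annotation: label each variant as SNP or INDEL by allele length."""
    annotated = []
    for variant in batch:
        is_snp = len(variant["ref"]) == 1 and len(variant["alt"]) == 1
        annotated.append({**variant, "type": "SNP" if is_snp else "INDEL"})
    return annotated


def annotate_all(variants, batch_size=2, workers=4):
    """Split variants into batches and annotate the batches concurrently."""
    batches = list(chunked(variants, batch_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input batches.
        results = list(pool.map(annotate_batch, batches))
    return [row for batch in results for row in batch]


variants = [
    {"pos": 101, "ref": "A", "alt": "G"},
    {"pos": 202, "ref": "AT", "alt": "A"},
    {"pos": 303, "ref": "C", "alt": "T"},
    {"pos": 404, "ref": "G", "alt": "GCC"},
    {"pos": 505, "ref": "T", "alt": "C"},
]
annotated = annotate_all(variants)
```

Because each batch is independent, the same structure scales out by swapping the thread pool for separate processes or separate workflow tasks; the batch size then becomes the main tuning knob.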

For cost optimization, a 2x reduction in data processing and management costs was achieved by implementing auto-scaling, which dynamically adjusted compute resources based on demand. A range of AWS and open-source services was used for continuous cost analysis and optimization, while data lifecycle management policies were established to move infrequently accessed data into cold storage, reducing ongoing storage expenses.
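
The two cost levers described here, demand-based scaling and moving cold data to cheaper storage, reduce to simple rules. The sketch below is illustrative only: the function names, the per-worker capacity, the worker limits, and the 90-day threshold are assumptions for the example, not values reported for Natera's platform.

```python
import math


def desired_workers(pending_samples, samples_per_worker, min_workers=0, max_workers=100):
    """Size the compute fleet to the current backlog, within fixed bounds."""
    if samples_per_worker <= 0:
        raise ValueError("samples_per_worker must be positive")
    needed = math.ceil(pending_samples / samples_per_worker)
    return max(min_workers, min(max_workers, needed))


def storage_tier(days_since_last_access, cold_after_days=90):
    """Pick a storage tier from how recently the data was accessed."""
    return "cold" if days_since_last_access >= cold_after_days else "hot"


# An empty queue scales to zero; a large backlog is capped at max_workers.
idle = desired_workers(0, samples_per_worker=500)
busy = desired_workers(30_000, samples_per_worker=500)
capped = desired_workers(30_000, samples_per_worker=100)
```

Scaling to zero when the queue is empty is where most of the savings in a rule like this come from, since idle capacity is not billed.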

Following the successful deployment of the Clinical Genomics Data Platform, Natera was ready to expand its capabilities by migrating additional data pipelines. These efforts include integrating pipelines for variant calling and annotation on whole exome sequenced germline and ctDNA samples, clinical metadata approved for research, and genetically derived ethnicity information.

These optimizations enabled Natera to process larger volumes of genomic data with increased speed and cost-efficiency. By replacing legacy tools and eliminating third-party software licensing fees, the platform saved Natera tens of millions of dollars, while increasing performance efficiency by 300x, empowering teams with self-service analytics, and improving their capacity to provide timely and affordable genetic testing and diagnostic services. Additionally, it empowered Natera’s biopharma partners to harness rich genomic insights, accelerating their innovation efforts in drug discovery and development.

`03` The Results

Transforming Genomic Data Processing, Management, and Analysis to Accelerate Treatment Decisions and Drug Discovery

The Clinical Genomics Data Platform has significantly improved Natera’s capabilities for genomic data processing and analysis.

Note: The initial successful migration of the VCF annotation workflow from Natera’s CGDB to AWS HealthOmics was a significant achievement that increased batch processing capacity from 45 to 30,000 samples (processed in six hours), all while maintaining 100% integrity of the migrated data.

Driving Efficiencies for Natera

The Clinical Genomics Data Platform enables Natera to improve its in-house operations, accelerate research, enhance clinical decision-making, optimize resource use, and improve data accessibility for faster, more informed analysis.

The Clinical Genomics Data Platform enables Natera to:

Accelerate Research and Development. Faster data processing allows Natera’s data science, engineering, and research teams to iterate quickly on test development, gradually improving their ability to identify and validate new biomarkers.
Enhance Clinical Decision Support. Streamlined analysis of unified, properly structured datasets allows for quicker turnaround of test results. The improved accessibility, accuracy, and depth of genomic insights, derived from the paired clinical data, support more informed and personalized decisions about care.
Optimize Utilization of Resources. The modernized platform automates complex data processing and management tasks, reducing the burden on Natera’s data science teams and allowing them to focus on dataset enrichment, MRD test improvement, and cancer research support.
Democratize Data Across the Organization. The platform’s new features, including the Cohort Builder, allow researchers to quickly access and utilize clean, structured variant, clinical, and report data with just a few clicks, streamlining their workflows.

Driving Efficiencies for Natera’s Clients and Partners

The Clinical Genomics Data Platform is a powerful resource for Natera’s clients and partners, from hospitals and clinics using MRD tests to pharmaceutical companies leveraging genomic insights to fuel their innovation efforts.

Biopharmaceutical companies can leverage Natera’s modernized platform to identify patient segments with high unmet medical needs, optimize clinical trial designs, and accelerate drug discovery and development. It is a source of high-quality, easily accessible paired genomic and clinical data that can lower R&D costs and speed up the delivery of new treatments to the market.

The paired clinical and genomic data, available on the Clinical Genomics Data Platform, is also useful to healthcare providers such as oncologists. This data helps physicians better understand how patients respond in the real world and can be used for research, investigator-initiated clinical trial design, and to inform treatment decisions for their patients.

Natera can also rely on this unique, clean, and structured data as real-world evidence to support regulatory approvals and demonstrate the clinical utility of its own tests. This strengthens its case for insurance reimbursement, boosting Natera’s revenue from test sales.

“With the Clinical Genomics Data Platform, Natera has the tools it needs to democratize and commercialize its data, packaging and selling anonymized datasets to stakeholders in the biomedical field, including biotechnology firms, pharmaceutical companies, life sciences organizations, and research institutions. Thus, the platform serves as a tool that directly supports Natera’s strategic vision: to unlock the potential of its data and AI assets – an opportunity that is worth tens of billions of dollars across the life sciences sector.”

The data commoditization trend reflects a shift in the biomedical field, where high-quality, large-scale genomic and clinical datasets are becoming essential to power R&D initiatives driving innovation industry-wide. The platform’s ability to efficiently process, annotate, and organize complex real-world data makes it particularly attractive to drug development companies and scientific institutions looking to uncover new insights into disease mechanisms, identify novel therapeutic targets, and design more effective clinical trials.

World’s Largest Early Cancer Patient Dataset

One example is a unique dataset prepared on the Clinical Genomics Data Platform that combines extensive molecular data, including whole exome sequencing, with detailed clinical information and longitudinal follow-up for early-stage cancer patients.

What sets this dataset apart is its focus on early-stage cancers and the inclusion of Natera’s proprietary MRD test results, providing a holistic view of cancer progression and treatment response. The dataset includes different tumor types (lung cancer, renal cancer, bladder cancer, etc.), genetic variants, and treatment outcomes, offering researchers insights into the early stages of cancer development and the effectiveness of interventions.

By leveraging the platform’s querying capabilities and the Cohort Builder tool, researchers can perform complex analyses and generate hypotheses in seconds instead of hours or even days. This rapid, self-service access to such a dataset accelerates the pace of cancer research, enabling faster identification of novel biomarkers, more precise patient stratification for clinical trials, and the development of highly targeted therapies.

The Clinical Genomics Data Platform has become Natera’s foundational solution for large-scale processing, management, analysis, and commercialization of petabytes of real-world genomic and clinical data. By delivering 300x faster performance, enabling self-service analytics and insights, and eliminating costly third-party dependencies, the platform has already generated tens of millions of dollars in operational savings. Just as importantly, it helped Natera to start realizing its long-term vision to drive innovation in cancer detection, improve patient outcomes, and unlock new revenue streams through data and AI monetization in a market worth tens of billions.

This transformation positions Natera as a leader driving innovation in personalized genetic testing and diagnostics. It enables Natera to accelerate scientific discoveries in genomics and, ultimately, contribute to enhanced cancer detection efficiency, affordability, and precision, ensuring longer, healthier lives for millions.
