From its earliest days, collaboration has been at the heart of genomics. Between 1990 and 2003, the Human Genome Project brought together hundreds of researchers from 20 institutions in six countries and prioritised rapid pre-publication data release. Since then, the field has only grown more interconnected. Between 1996 and 2015, the number of authors on any given scientific research paper, whether biological or physical, increased by ~40%(1). For bioinformatics, that number is closer to 400%(2). Successive genomics sequencing projects, such as the international HapMap project and the 1000 Genomes Project, carried on the values and traditions of the Human Genome Project, growing continuously more collaborative and expansive. In February 2020, the Pan Cancer Project published the most comprehensive study of whole cancer genomes to date – the product of more than 1,300 international researchers working together to analyse more than 2,600 tumour sequences.
Despite these exemplary efforts in the research space, each project has had to chart its own course for data sharing and collaboration. Broadly speaking, data is still siloed by type, disease, country, institution, and sector. Analysis methods are non-standardised and rarely scalable. Achieving interoperability, which would help us overcome these challenges, is hindered by differing approaches to regulation, consent, and data standardisation. All of this leads to a striking gap between research and healthcare, the latter of which is becoming an increasingly prominent player in genomics. If we don’t act efficiently and effectively as a global community now, we will face an overwhelming mass of fragmented data from which few researchers can learn and even fewer patients can benefit.
The Global Alliance for Genomics and Health
The Global Alliance for Genomics and Health (GA4GH) works to facilitate responsible data sharing, enable research, and improve health outcomes by developing interoperable standards for research and clinical genomics. All GA4GH standards build on the Framework for Responsible Sharing of Genomic and Health-Related Data.(3) Founded on the human rights to benefit from scientific advancement and to be recognised for one’s contributions to science(4), this international guidance document aims to support responsible genomic data sharing in order to equitably promote the health and wellbeing of individuals, families, and communities worldwide. It provides a model for international cooperation, collaboration, and good governance, as well as benchmarks for accountability. The Framework has been translated into 14 languages and used to inform local data sharing approaches around the globe, including those of the World Economic Forum(5), the Academy of Science of South Africa(6), DNA.Land(7), Health Data Research UK(8), and the Horizon-2020 CORBEL project(9).
While GA4GH aims to enable international data sharing across the translational continuum, from research to healthcare to industry, the path to this end is often complicated by differing approaches to data sharing in disparate jurisdictions. Some of our members take a data commons approach, creating trusted, controlled repositories of multiple datasets. Others follow a hub and spoke model, where common data elements, structures, and access/use rules facilitate interoperability. Others aim to link together distributed networks of secure datasets to enable query and analysis on hidden data. Still others focus on sharing only high-level genomic knowledge rather than the raw data themselves. GA4GH must support each of these approaches with standards and policies that can be adapted to each individual use case.
Whenever centralisation is not technically or legally possible – which is often – we promote technology-enabled federation. Groups of autonomous organisations and datasets, connected by centralised control mechanisms, enable researchers to move their analyses to the data rather than downloading a copy of a dataset to their local machines. This approach requires broad, reciprocal data access methods that respect disparate national processes and patient consents.
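To make the idea of moving analyses to data concrete, the following is a minimal Python sketch under simplifying assumptions, not a GA4GH API or standard. The `Site` class, the `allele_counts` analysis, and the example genotypes are all hypothetical. Each site runs the analysis on its own records and returns only aggregate counts, which a coordinator then combines.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Site:
    """A hypothetical autonomous data holder; individual records stay here."""
    name: str
    genotypes: List[int]  # alt-allele count per participant (0, 1 or 2) at one variant

    def run(self, analysis: Callable[[List[int]], Dict[str, int]]) -> Dict[str, int]:
        # The analysis is executed locally; only its aggregate result is returned.
        return analysis(self.genotypes)


def allele_counts(genotypes: List[int]) -> Dict[str, int]:
    """Illustrative analysis: summarise one site's data as two aggregate counts."""
    return {"alt_alleles": sum(genotypes), "total_alleles": 2 * len(genotypes)}


def federated_allele_frequency(sites: List[Site]) -> float:
    """Send the analysis to every site and combine the aggregates centrally."""
    results = [site.run(allele_counts) for site in sites]
    alt = sum(r["alt_alleles"] for r in results)
    total = sum(r["total_alleles"] for r in results)
    return alt / total if total else 0.0


# Made-up example data for three hypothetical sites.
sites = [
    Site("site_a", [0, 1, 2, 0]),
    Site("site_b", [1, 1, 0]),
    Site("site_c", [0, 0, 0, 2, 1]),
]
freq = federated_allele_frequency(sites)
print(f"Pooled alt-allele frequency: {freq:.3f}")
```

In this sketch the coordinator never sees participant-level genotypes, only per-site counts. A real federation would also need authentication, consent checks, and safeguards against re-identification from small aggregates, none of which are modelled here.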
Federated approaches to data access and analysis are becoming vital to unlocking the full potential of genomic data as it scales and shifts from a primarily research activity to one dominated by healthcare. In 2012, the clinical community generated only 1% of sequencing data. In 2017, that number had grown to 20%, with 120,000 genomes sequenced for rare disease and cancer diagnosis(10). By 2025, we expect the pendulum to swing even further, with more than 80% of total sequencing coming from healthcare. Standards that support interoperability and federation will enable secondary use of these data for research and will enable the emergence of a virtual cohort of more than 60 million genomes. Such a resource will support the discovery of more reliable patterns in health and disease, increase cohort diversity, deliver more statistical significance in analyses, match similar patients at disparate ends of the globe, leading to increased rare disease diagnoses, provide stronger variant interpretations, and provide more informed clinical decision support.
GA4GH technical standards, including APIs, data models, and standard schemas, have been implemented by over 40 leading genomics institutions, including ELIXIR, Genomics England, the NIH All of Us Research Program, TOPMed, Australian Genomics, Illumina, Google, and Amazon Web Services. Nearly every genomics institution in the world uses the standard GA4GH file formats to store its genomic read and variant data, including more than 4 petabytes of compressed genomics data stored in 1.5 million GA4GH CRAM files around the globe. Standard licensing agreements ensure anyone can contribute to, access, and implement GA4GH standards.
This work will support a learning health system in which healthcare can access the methods and skills of the research community, and, reciprocally, researchers can leverage healthcare data to make discoveries. The genomics community is shifting from a paradigm of data sharing by downloading, as popularised in the era of the Human Genome Project, to one of data visiting, in which researchers bring their algorithms to very large research and clinical cohorts. The result will be an internationally federated ecosystem of responsible data sharing that supports both the human right to benefit from scientific advances and the right of researchers to be recognised for their work. As a global community of researchers and clinicians, we have a responsibility to bridge this gap in order to enable collaboration at scale, so we can learn faster and better, together.
1 The Economist. “Why research papers have so many authors.” The Economist, 2016. Accessed 24 February 2020.
2 Song, M., Yang, C. C., & Tang, X. (2013). “Detecting evolution of bioinformatics with a content and co-authorship analysis.” SpringerPlus, 2(1), 186. doi:10.1186/2193-1801-2-186.
3 Knoppers, Bartha Maria (2014). “Framework for Responsible Sharing of Genomic and Health-Related Data.” The HUGO Journal, doi:10.1186/s11568-014-0003-1.
4 UN General Assembly. “Universal Declaration of Human Rights.” United Nations, 217 (III) A, 1948, Paris, art. 1, http://www.un.org/en/universal-declaration-human-rights/. Accessed 24 Feb. 2020.
5 Federated Data Systems: Balancing Innovation and Trust in the Use of Sensitive Data. World Economic Forum, 2019.
6 “Human Genetics and Genomics in South Africa: Ethical, Legal and Social Implications.” Academy of Science of South Africa, Nov. 2018, doi:10.17159/assaf.2018/0033.
7 “Terms of Consent.” DNA.LAND, DNA.LAND, dna.land/consent.
8 Digital Innovation Hub Programme Prospectus Appendix: Principles for Participation. Health Data Research UK, 2019, http://www.hdruk.ac.uk/wp-content/uploads/2019/07/Digital-Innovation-Hub-Programme-Prospectus-Appendix-Principles-for-Participation.pdf.
9 Ohmann, Christian, et al. “Sharing and Reuse of Individual Participant Data from Clinical Trials: Principles and Recommendations.” BMJ Open, vol. 7, no. 12, 2017, doi:10.1136/bmjopen-2017-018647.
10 Birney, Ewan, et al. “Genomics in Healthcare: GA4GH Looks to 2022.” BioRxiv, 2017, doi:10.1101/203554.