data science

Michael F. Huerta of the National Library of Medicine at the U.S. National Institutes of Health explores how discovery and health benefit from the intersection of data science and open science

The promise of biomedical science to save and improve lives has never been realised so quickly and spectacularly as it is today. And yet, the nearly universal digitisation of research and healthcare is starting to unlock the power of data and more open paradigms, making possible faster progress and changes in the very nature of discovery. The National Library of Medicine (NLM) lives at the intersection of these forces and is poised to catalyse this transformation.

NLM is an institute of the National Institutes of Health (NIH) with a research and training focus on biomedical information science, informatics and data science. NLM is at the forefront of innovation in computational biology, computing in context, extracting insight from electronic health records and using artificial intelligence (AI) and data science approaches to answer key biomedical questions. This scientific leadership is reflected in NLM’s support of extramural research across the country, as well as a robust intramural research programme and internal information engineering efforts aimed at innovating and improving its products and processes.

NLM is also the world’s largest biomedical library, creating and hosting major resources, tools and services for biomedical literature, data, standards and more. Every day, NLM sends over 1,000 terabytes of data to nearly five million users and receives over 100 terabytes from more than 3,000 users. As a library, NLM has fostered and advanced open science and scholarship by making digital research objects – whether a digital literature citation, dataset, or data standard definition – findable, accessible, interoperable and reusable (i.e., FAIR), as well as attributable and sustainable. Resources like GenBank, PubMed, MedlinePlus, PubChem and PubMed Central make data and information findable and accessible, and their implementation makes data and information reusable, attributable and sustainable.

NLM facilitates interoperability of digital data by promoting, developing and hosting a range of standards products, such as terminologies like UMLS and LOINC, as well as standards platforms such as the NIH Common Data Elements (CDE) Repository and the Value Set Authority Center. NLM also shares its standards expertise, acting as the coordinating body of the Department of Health and Human Services for clinical terminology and of NIH through its leadership of the NIH Clinical CDE Task Force.

The recently released NLM Strategic Plan envisions NLM building on its experience to become a platform for biomedical discovery and data-powered health by achieving three goals.

The first is to provide tools for data-driven research, which includes enhancing innovation by expanding NLM’s biomedical informatics and data science research activities. NLM will also work to connect resources, tools and services as the basis of a sustainable, open and trusted digital ecosystem for biomedical and health information, scholarship and science. Emphasis will be placed on ensuring that digital research objects, such as scientific papers, datasets, models and analytic pipelines, are FAIR and appropriately associated with each other – while respecting privacy and confidentiality.

In pursuit of the second goal, to reach more people in more ways, NLM will optimise users’ experience with – and use of – its resources, tools and services, to better serve its many different users. The final goal is to expand NLM’s training support and activities to: (1) produce experts who will develop next-generation innovations in biomedical informatics and data science; (2) make sure that biomedical scientists are adept in the use of these advanced approaches; (3) instil an understanding of the opportunities, limits and requirements of data science across the entire biomedical workforce; and (4) ensure the public is data-ready to make the best use of health information in the 21st century.

Maximising the scientific opportunities that sit at the intersection of data science and open science will require addressing some basic issues, but solutions do not seem far off. Non-traditional practices needed for open science, such as sharing data, must be encouraged; this might be addressed by strategically aligning incentives across the biomedical enterprise. And, as digital research objects and their links to each other multiply, at-scale curation solutions will be needed; this might be addressed by using AI to infer the nature of an object based on its location in the network of interconnected objects, with provenance tracked using blockchain.

Finally, the sustainability of an open digital ecosystem is crucial; this will be helped by making sure that decisions about investing in the ecosystem are based on empirical evidence about the value of those investments to the science.

Achieving the goals of the NLM Strategic Plan, especially in the context of the NIH Strategic Plan for Data Science, will usher in a new era of data-driven research and data-powered health – one that is certain to offer more hope to more people more rapidly.


Michael F. Huerta

Coordinator of Data & Open Science Initiatives

Associate Director for Program Development

National Library of Medicine

National Institutes of Health

Tel: +1 301 827 6451


