Goran Nenadic, from The University of Manchester and The Alan Turing Institute, argues for using patient information stored in routinely collected healthcare free-text data
Much health and social care research relies on data collected specifically for a particular study, with relatively small sample sizes and a relatively high cost of data collection. However, in many countries, national health and social care services routinely collect longitudinal data to support patient care. Records of patients’ encounters with such services are increasingly available digitally on a large scale, providing key information about the concerns, treatments and outcomes of care. Such data has been used not only to support clinical practice but also, increasingly, as a source of research data to improve our understanding of disease patterns and determinants of health and well-being.
The healthcare data space is also increasingly heterogeneous, including well-structured variables (e.g. from diagnostic and laboratory tests, measurements from monitoring equipment, or coded data) and semi-structured and unstructured data (e.g. images or clinical letters). Structured data, while undoubtedly important, cannot capture all the necessary clinical details through numerical measurements, predefined pull-down menus, clinical codes and checklists.
Health and social care systems have, therefore, relied on natural language, which remains the main means of communication with patients and carers, and between professionals. Such communication includes clinical notes, letters, reports and observations produced by professionals and, increasingly, patient-generated comments on their experience, adverse reactions or outcomes, which are available on social media or in patient diaries.
It is estimated that 85% of actionable health information is stored in free-text narrative, which typically records key contextual information, including detailed symptom profiles and personalised treatment plans, as well as patients’ risk factors, experience and impact on quality of life. This is particularly true in some areas (e.g. mental health and social care), where free-text narrative is the principal means of recording key patient information. However, the text in medical records is often stripped out before records are made available for research purposes because of privacy concerns and, thus, this rich source of data often remains untapped.
Several case studies have shown that there is great value in extracting information from clinical free-text narrative (e.g. in mental health, rheumatology, cancer studies, GP records). While free-text narratives are routinely collected, their automated processing is still not common, mainly because unstructured free text is difficult to analyse: its terminology is often complex, context-dependent and idiosyncratic.
However, the ability to process such data for an individual patient and to aggregate it across a specific patient population is a key enabler for a range of artificial intelligence (AI) applications in healthcare, including personalised medicine and clinical decision support: if we could unearth detailed information from very large real-world datasets, we could find out which treatments work best for which patients.
Over the last 30 years, there has been significant work on developing automated computerised methods for large-scale processing of free text, known as text analytics. Such methods process free-text documents and automatically identify, link and extract mentions of key clinical variables and their values, including, for example, diagnoses, problems, symptoms, affected anatomical locations, diagnostic or therapeutic procedures performed, drug/medication prescriptions, adverse drug events, family/social history, and behaviour and quality of life indicators.
For example, text-analytics methods can extract specific mentions and details of a radiotherapy treatment from a clinical note, or mentions of social factors (e.g. smoking, alcohol consumption) or emotional concerns in a patient’s discharge summary. Similarly, methods have been developed to extract clinical information that may only be available in free-text format, such as symptom severity in mental health, detailed prescription information or results of specific instruments (e.g. Mini-Mental State Examination). This machine reading through text analytics transforms unstructured free text into a structured representation (clinical variables and values), which can then be used to help clinicians navigate individual records as well as for further actionable analytics on integrated health data. However, there have been many barriers to text-analytics methods being translated and widely used in practice.
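As a rough illustration of what turning free text into clinical variables and values can look like, the sketch below uses simple hand-written rules to pull a Mini-Mental State Examination score and a smoking status out of a short note. The patterns, function names and the sample note are assumptions made up for this example; they are not taken from any system described here, and practical text-analytics tools generally have to handle negation, abbreviations, misspellings and context far more carefully than a few regular expressions can.

```python
import re
from typing import Optional

# Hypothetical pattern for illustration: an MMSE mention followed, within a
# few non-digit characters, by a score written as "<number>/30".
MMSE_PATTERN = re.compile(
    r"\b(?:MMSE|Mini[- ]Mental State Examination)\b\D{0,20}?(\d{1,2})\s*/\s*30",
    re.IGNORECASE,
)

# Hypothetical smoking cues, checked in order so that "non-smoker" and
# "ex-smoker" are labelled before the generic "smoker" rule can fire.
SMOKING_PATTERNS = [
    ("never", re.compile(r"\b(?:never smoked|non-?smoker|denies smoking)\b", re.I)),
    ("former", re.compile(r"\b(?:ex-?smoker|former smoker|quit smoking|stopped smoking)\b", re.I)),
    ("current", re.compile(r"\b(?:current smoker|smokes|smoker)\b", re.I)),
]


def extract_mmse(text: str) -> Optional[int]:
    """Return the first MMSE score found in the text, or None."""
    match = MMSE_PATTERN.search(text)
    if match is None:
        return None
    score = int(match.group(1))
    # MMSE scores range from 0 to 30; discard anything outside that range.
    return score if 0 <= score <= 30 else None


def extract_smoking_status(text: str) -> Optional[str]:
    """Return 'never', 'former' or 'current' for the first rule that matches."""
    for label, pattern in SMOKING_PATTERNS:
        if pattern.search(text):
            return label
    return None


def extract_variables(text: str) -> dict:
    """Map a free-text note to a small structured record."""
    return {
        "mmse_score": extract_mmse(text),
        "smoking_status": extract_smoking_status(text),
    }


if __name__ == "__main__":
    note = "Ex-smoker, quit smoking in 2010. MMSE score 24/30 on admission."
    print(extract_variables(note))
```

The output is a small dictionary of variables and values, which is the kind of structured record that could then be combined with coded data for further analysis.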
Recognising the need for, and the challenges in, processing free-text data in health and social care, the UK’s Engineering and Physical Sciences Research Council (EPSRC) has established the Healthcare Text Analytics Network (Healtex). The network brings together experts from academia, the National Health Service (NHS), regulators, industry and patient communities to identify key barriers in processing free-text data, scope future research directions, and disseminate best practice and successful outcomes where large-scale text processing has been instrumental in healthcare research. It started its activities in 2016 and has facilitated engagement with the wider stakeholder community via workshops, conferences and feasibility funding, aiming to coordinate efforts and provide open, reusable and privacy-aware analytics solutions.
While much is known about public attitudes to the use of health data in general, the network specifically commissioned a Citizens’ Jury on whether, and in what circumstances, healthcare free-text data should be used for research. A jury composed of members of the public in the UK explored the opportunities and potential trade-offs between privacy and the public good, and offered broad support for the use of free-text health data for health-related research by academic and NHS organisations. One of the key issues highlighted is that free-text data often contains very detailed and sensitive personal information, possibly including a range of third-party identifiers. This requires thorough de-identification before the narrative can be used by researchers.
While finding and masking personally identifiable information is a complex task, there have been several efforts to provide automated methods to remove such information from clinical letters and notes on a large scale, with accuracy comparable to that of human efforts. We note that de-identified free-text data is mainly needed for the training and validation of text analytics methods, whereas, when used in practice, free-text data would be transformed by text analytics software into structured data before leaving the trusted environments in hospitals, removing the need for de-identification. The network has begun making de-identified free-text data available through data donations and by generating synthetic data that can be used for training text analytics software.
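To give a sense of what masking identifiers involves, the following minimal sketch replaces a few easily recognised identifier types with placeholders. Everything in it (the patterns, the placeholder labels, the `mask_identifiers` function and the sample sentence) is a hypothetical example rather than a method used by the network, and it is far from sufficient for real de-identification: names without titles, addresses, relatives, rare conditions and many other identifying details would pass straight through.

```python
import re

# Hypothetical, deliberately simple patterns; applied in the order listed.
MASKS = [
    # Email addresses such as jane.smith@example.org
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    # Ten-digit numbers grouped 3-3-4, the usual layout of an NHS number
    ("[NHS_NUMBER]", re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b")),
    # Numeric dates such as 03/02/2019
    ("[DATE]", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    # A title followed by one or two capitalised words, e.g. "Dr Jane Smith"
    ("[NAME]", re.compile(r"\b(?:Mr|Mrs|Ms|Miss|Dr)\.? [A-Z][a-z]+(?: [A-Z][a-z]+)?")),
]


def mask_identifiers(text: str) -> str:
    """Replace every match of each pattern with its placeholder."""
    for placeholder, pattern in MASKS:
        text = pattern.sub(placeholder, text)
    return text


if __name__ == "__main__":
    letter = "Dr Jane Smith saw Mrs Patel on 03/02/2019. NHS number 943 476 5919."
    print(mask_identifiers(letter))
```

Keeping the placeholders typed (a name versus a date) rather than deleting the text outright preserves the sentence structure, which matters if the masked letters are later used to train or validate text analytics software.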
In addition to free-text data privacy and governance, the network operates a number of other working groups that aim to provide translational benefits to the clinical community. These groups focus on topics ranging from processing mental health clinical letters and mining drug prescriptions, through automated analysis of radiology reports, automated clinical coding of diagnoses and analysis of health-related social media and patient feedback data, to the analysis of veterinary reports.
The Healtex network has become a hub for a multi-disciplinary community of clinicians and computer scientists, as well as regulators and information custodians, patients and carers, industry, charities and many more. They collaborate on finding ways to use routinely collected free text to support clinical practice (e.g. summarisation of clinical notes and letters to support efficient browsing and aggregation of a patient’s history) and clinical, social care and epidemiological research, as well as to enable patients and carers to search posts on social media efficiently.
Given that both research and practice are likely to have a growing need for effective text analytics, Healtex is determined to collaborate with other organisations, both nationally and internationally (e.g. with The Alan Turing Institute for data science and AI, and Health Data Research UK), to provide a forum that can revolutionise healthcare and improve patient outcomes by unlocking the evidence contained in free-text data that is currently largely untapped.
More information about Healtex and how to join the network is available at http://healtex.org
Please note: This is a commercial profile
Professor of Computer Science
The University of Manchester
Tel: +44 (0)161 275 6289