Dhouha Grissa, Ditlev Nytoft Rasmussen, Aleksander Krag, Søren Brunak, Lars Juhl Jensen
Alcohol-related liver disease (ALD) is the cause of more than half of all liver-related deaths. Sustained excess drinking causes fatty liver and alcohol-related steatohepatitis, which may progress to alcoholic liver fibrosis (ALF) and eventually to alcohol-related liver cirrhosis (ALC). Unfortunately, it is difficult to identify patients with early-stage ALD, as they are largely asymptomatic. Consequently, the majority of ALD patients are diagnosed only once ALD has reached decompensated cirrhosis, a symptomatic phase marked by the development of complications such as bleeding and ascites. The main goal of this study is to discover relevant upstream diagnoses that help to explain the development of ALD, and to highlight meaningful downstream diagnoses that represent its progression to liver failure. Here, we use data from the Danish health registries covering the entire population of Denmark over nineteen years (1996–2014) to examine whether it is possible to identify patients likely to develop ALF or ALC based on their past medical history. To this end, we explore a knowledge discovery approach that uses high-dimensional statistical and machine learning techniques to extract and analyze data from the Danish National Patient Registry.
Alcohol-related liver disease (ALD) is caused by alcohol overuse [1–3]. It is the third leading preventable cause of death in the world and one of the major chronic liver diseases worldwide [4, 5]. Mortality and morbidity among patients with ALD are very high: ALD accounts for approximately 500,000 deaths annually worldwide, representing a huge financial and healthcare burden on society. According to the World Health Organization (WHO) Global Information System on Alcohol and Health (GISAH), the harms related to alcohol resulted in some 3 million deaths (5.3% of all deaths) worldwide in 2016, with more than 2 million deaths in the USA. Unfortunately, our knowledge of ALD is still limited by a lack of large population studies.
Materials and methods
In this section, we present the knowledge discovery approach of Fig 1, built by integrating data extraction, statistical and classification techniques. It is divided into six main steps: (i) preprocessing and extraction of the National Patient Registry (NPR) data; (ii) a matched case–control study based on the stratification of patients with ALD and on cohorts extracted by random sampling of NPR data in a time-dependent manner; (iii) extraction of upstream and downstream data from cases and controls; (iv) a discrimination analysis based on statistical methods; (v) a prediction study based on machine-learning techniques; and (vi) evaluation and interpretation of the results.
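As an illustration of step (ii), the following minimal sketch draws random controls for each case from a pool of non-ALD patients. The matching stratum (sex and birth year), the record fields `id`, `sex` and `birth_year`, the function name `match_controls` and the number of controls per case are all assumptions made for this example; the matching criteria actually used in the study are not specified in this section, and the time-dependent aspect of the sampling is not modelled here.

```python
import random
from collections import defaultdict


def match_controls(cases, candidates, n_controls=2, seed=0):
    """Draw up to n_controls controls per case from the same stratum.

    cases and candidates are lists of dicts with the hypothetical keys
    'id', 'sex' and 'birth_year'. The stratum (sex, birth_year) is an
    assumption for illustration, not the study's stated matching rule.
    """
    rng = random.Random(seed)

    # Group candidate controls by stratum so each lookup is cheap.
    pool = defaultdict(list)
    for person in candidates:
        pool[(person["sex"], person["birth_year"])].append(person["id"])

    matched = {}
    for case in cases:
        stratum = pool[(case["sex"], case["birth_year"])]
        k = min(n_controls, len(stratum))
        chosen = rng.sample(stratum, k)
        # Sample without replacement across cases: a control is used once.
        for pid in chosen:
            stratum.remove(pid)
        matched[case["id"]] = chosen
    return matched


# Toy, entirely fictitious records.
cases = [
    {"id": "c1", "sex": "M", "birth_year": 1950},
    {"id": "c2", "sex": "F", "birth_year": 1960},
]
candidates = [
    {"id": "n1", "sex": "M", "birth_year": 1950},
    {"id": "n2", "sex": "M", "birth_year": 1950},
    {"id": "n3", "sex": "M", "birth_year": 1950},
    {"id": "n4", "sex": "F", "birth_year": 1960},
    {"id": "n5", "sex": "M", "birth_year": 1970},
]
matched = match_controls(cases, candidates)
```

In this toy run, case `c1` receives two of the three candidates in its stratum and case `c2` receives the single available candidate; a time-dependent design would additionally restrict each case's candidates to patients still under follow-up at that case's index date.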
Results & discussion
Groups of patients with ALD
From the preprocessed registry data (NPRpp), we derive two main subcohorts: (i) a first subcohort of 33,391 ALD patients; and (ii) a second subcohort of the remaining 6,010,942 non-ALD patients with 63,416,907 clinical encounters. Looking at the individual diagnoses of the patients in the ALD subcohort, we find six main ICD-10 codes. Each code is part of the block “Diseases of liver” (K70-K77) and represents a type or stage of ALD. ALC is the most common form of ALD in the NPR data: 23,271 of the 33,391 patients with ALD were diagnosed with ALC. By contrast, we find only 499 patients with ALF (also called alcoholic fibrosis and sclerosis of liver) among the ALD patients. For the remaining ICD-10 codes of ALD, we find 5,959 patients with alcoholic fatty liver, 4,275 patients with alcoholic hepatitis, 5,546 patients with alcoholic hepatic failure and 5,823 patients with unspecified alcoholic liver disease (the same patient can be counted under multiple codes).
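Because a patient can carry several ALD codes, the per-code counts are counts of distinct patients per code and do not sum to the cohort size. The sketch below shows this kind of tally on fictitious `(patient_id, code)` pairs and then checks the arithmetic on the figures reported above. The K70 subcodes and labels are taken from the standard ICD-10 classification rather than from this text, and the encounter layout and function name are assumptions for illustration.

```python
from collections import defaultdict

# ICD-10 K70 subcodes (standard ICD-10 labels; the exact code format
# stored in the registry is an assumption here).
ALD_CODES = {
    "K70.0": "Alcoholic fatty liver",
    "K70.1": "Alcoholic hepatitis",
    "K70.2": "Alcoholic fibrosis and sclerosis of liver",
    "K70.3": "Alcoholic cirrhosis of liver",
    "K70.4": "Alcoholic hepatic failure",
    "K70.9": "Alcoholic liver disease, unspecified",
}


def patients_per_code(encounters):
    """Count distinct patients per ALD code from (patient_id, code) pairs."""
    by_code = defaultdict(set)
    for patient_id, code in encounters:
        if code in ALD_CODES:
            by_code[code].add(patient_id)
    return {code: len(ids) for code, ids in by_code.items()}


# Fictitious encounters: p1 has two ALD codes, p2 has a repeated code,
# p3 has no ALD code at all.
toy = [
    ("p1", "K70.0"),
    ("p1", "K70.3"),
    ("p2", "K70.3"),
    ("p2", "K70.3"),
    ("p3", "I10"),
]
counts = patients_per_code(toy)
distinct = len({pid for pid, code in toy if code in ALD_CODES})

# Per-code patient counts as reported in the text.
reported = {
    "Alcoholic cirrhosis (ALC)": 23271,
    "Alcoholic fibrosis (ALF)": 499,
    "Alcoholic fatty liver": 5959,
    "Alcoholic hepatitis": 4275,
    "Alcoholic hepatic failure": 5546,
    "Alcoholic liver disease, unspecified": 5823,
}
total_assignments = sum(reported.values())  # 45373
excess = total_assignments - 33391          # 11982
```

The reported per-code counts add up to 45,373, which exceeds the 33,391 ALD patients by 11,982; this excess reflects patients who are counted under more than one code.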
The authors thank Maja Thiele (Odense University Hospital) for constructive comments on the manuscript.
Citation: Grissa D, Nytoft Rasmussen D, Krag A, Brunak S, Juhl Jensen L (2020) Alcoholic liver disease: A registry view on comorbidities and disease prediction. PLoS Comput Biol 16(9): e1008244. https://doi.org/10.1371/journal.pcbi.1008244
Editor: Andrey Rzhetsky, University of Chicago, UNITED STATES
Received: April 24, 2020; Accepted: August 13, 2020; Published: September 22, 2020
Copyright: © 2020 Grissa et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The study was approved by the Danish Data Protection Agency [SUND-2017-57] and the Danish Health Authority [FSEID-00003092]. For privacy reasons, we cannot share the registry data about individual patients, which is the input for the analysis. To nonetheless make the analysis as transparent as possible, we provide summary statistics of the entire cohort in the subsection ‘Study design and population’ and present the full aggregate results coming out of the analysis (i.e. not only those passing statistical significance). We have furthermore created artificial data for five fictitious patients, to give the reader as clear a view of the nature of the data as we can, given the legal constraints on patient-sensitive data.
Funding: This work was supported by the European Union’s Horizon 2020 research and innovation programme; and the Novo Nordisk Foundation [NNF14CC0001]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.