A practical approach for colorectal cancer diagnosis based on machine learning

Nguyen Hai Minh, Tran Quang Quy, Ngo Duc Tam, Tran Manh Tuan, Le Hoang Son

Abstract

In this paper, we present the results of applying machine learning models to build a colorectal cancer diagnosis system. The methodology encompasses six key steps: collecting raw data from Electronic Medical Records (EMRs), revising feature attributes with expert input, data preprocessing, model adaptation, training machine learning models (CART, Random Forest, and XGBoost), and evaluating the results.

Introduction

Colorectal cancer, which has the third-highest diagnosis rate, is the second most dangerous cancer in Viet Nam. It constitutes a substantial public health challenge, particularly among men. Originating from malignant cells within the rectum, a segment of the large intestine, colorectal cancer progresses through distinct stages and is often asymptomatic during its early phases.

Materials

Electronic Medical Record (EMR) data are a crucial component of managing patient information and providing efficient healthcare in a hospital. Using EMR data in machine learning models to support physicians in diagnosis is therefore an important and meaningful problem. This study does not involve any human or animal participation.

Methods

The methods build on the data collected and preprocessed in Section 2 and on the analysis in Section 1 of previous studies of machine learning models, including CART, Random Forest, and XGBoost.
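To make the tree-based models named above more concrete, the following is a minimal sketch of the Gini impurity criterion that CART-style classification trees commonly use to choose a split on a single numeric feature. It is an illustration under stated assumptions, not the authors' implementation: the helper names `gini` and `best_split` and the toy data are hypothetical and are not taken from the study's EMR dataset.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best binary split on one feature.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns (None, gini(labels)) when no split lowers the impurity.
    """
    distinct = sorted(set(values))
    best_threshold, best_score = None, gini(labels)
    n = len(labels)
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Impurity of the split: child impurities weighted by child size.
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical toy feature (not study data): one numeric value per patient,
# with label 1 = positive and 0 = negative.
feature = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
target = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(feature, target)
print(threshold, impurity)
```

A full CART tree applies this search recursively over all features. Random Forest and XGBoost both build ensembles of decision trees, the former by bagging and the latter by gradient boosting with its own gain-based split criterion rather than Gini impurity.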

Results and Discussion

For data processing, analysis, and visualization, the following Python libraries were used: pandas, scikit-learn (sklearn), XlsxWriter, math, Matplotlib, and pyvi. All experiments were run on an Asus laptop with an Intel Core i5-10300H processor, 8GB RAM, and the Ubuntu 20.04 operating system.

Conclusions

In this paper, a real dataset from a hospital was collected and preprocessed. In addition, a novel method is proposed: a model that combines unified academic algorithms to support colorectal cancer prediction.

Citation: Hai Minh N, Quy TQ, Tam ND, Tuan TM, Son LH (2025) A practical approach for colorectal cancer diagnosis based on machine learning. PLoS One 20(4): e0321009. https://doi.org/10.1371/journal.pone.0321009

Editor: Jie Zhang, Newcastle University, UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND

Received: October 17, 2024; Accepted: February 27, 2025; Published: April 29, 2025

Copyright: © 2025 Hai Minh et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper.

Funding: This research was sponsored by the Ministry of Education and Training project, “Research on Application of Machine Learning Model in Analysis Electronic Medical Records Gastrointestinal Disease”, B2022-TNA-24.

Competing interests: The authors have declared that no competing interests exist.