
PLUS: Predicting cancer metastasis potential based on positive and unlabeled learning

Junyi Zhou, Xiaoyu Lu, Wennan Chang, Changlin Wan, Xiongbin Lu, Chi Zhang, Sha Cao

Abstract
Metastatic cancer accounts for over 90% of all cancer deaths, and evaluations of metastasis potential are vital for minimizing the metastasis-associated mortality and achieving optimal clinical decision-making. Computational assessment of metastasis potential based on large-scale transcriptomic cancer data is challenging because metastasis events are not always clinically detectable. The under-diagnosis of metastasis events results in biased classification labels, and classification tools using biased labels may lead to inaccurate estimations of metastasis potential. This issue is further complicated by the unknown metastasis prevalence at the population level, the small number of confirmed metastasis cases, and the high dimensionality of the candidate molecular features. Our proposed algorithm, called Positive and unlabeled Learning from Unbalanced cases and Sparse structures (PLUS), is the first to use a positive and unlabeled learning framework to account for the under-detection of metastasis events in building a classifier.

Introduction
Metastatic cancer is responsible for over 90% of all cancer deaths [1,2]. Compared with well-confined primary tumors, metastatic cancer remains incurable because of its systemic nature and the resistance of disseminated tumor cells to existing therapeutic agents [3,4]. Hence, for a substantial number of cancer patients, effective treatment is largely dependent on an understanding of and capacity to interdict metastasis. Cancer metastasis is a multistep process by which cancer cells disperse from a primary site and progressively colonize distant organs.

Results
Problem formulation and methods overview

Diagnoses of metastatic cancer are often confirmed by detecting tumor masses at a distant site, or effusions, on clinical examination or by imaging [23]. Unfortunately, there is currently no panel of basic tests that can aid in revealing metastatic tumor events. Hence, many patients who are not diagnosed with metastatic tumors may in fact have developed metastasis that could not be detected at an early phase because of weak symptoms (Fig 1A). Take the cancer patients enrolled in the TCGA project as an example. Among patients initially diagnosed as non-metastatic (M0), a large portion have a good prognosis and do not develop metastasis (M0: NP-Alive in Fig 1B). However, according to their follow-up data, a significant portion of these patients do develop metastasis (M0: P-Alive in Fig 1B) or die (M0: Deceased in Fig 1B).
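The labeling problem described above can be illustrated with a small simulation. The sketch below is not the PLUS algorithm; it only shows, under assumed numbers, why treating undiagnosed patients as true negatives biases the labels. The prevalence (0.3), the detection rate (0.5), the cohort size, and the function name `simulate_pu_labels` are all hypothetical choices made for illustration and do not come from the paper or from TCGA.

```python
import random


def simulate_pu_labels(n, prevalence, detect_rate, seed=0):
    """Simulate true metastasis status and the observed (diagnosed) label.

    All parameter values passed in are illustrative assumptions.
    """
    rng = random.Random(seed)
    # True status: 1 = patient has developed metastasis, 0 = has not.
    true_status = [1 if rng.random() < prevalence else 0 for _ in range(n)]
    # Observed label: a metastatic patient is labeled positive only if the
    # event is clinically detected; a non-metastatic patient is never
    # labeled positive. Label 0 therefore means "unlabeled", not "negative".
    observed = [
        1 if (y == 1 and rng.random() < detect_rate) else 0
        for y in true_status
    ]
    return true_status, observed


true_status, observed = simulate_pu_labels(
    n=10000, prevalence=0.3, detect_rate=0.5
)

true_rate = sum(true_status) / len(true_status)
observed_rate = sum(observed) / len(observed)
# Metastatic patients hidden inside the unlabeled group.
hidden = sum(1 for y, s in zip(true_status, observed) if y == 1 and s == 0)

print(f"true prevalence:      {true_rate:.3f}")
print(f"diagnosed prevalence: {observed_rate:.3f}")
print(f"undetected positives among unlabeled patients: {hidden}")
```

In this setup every labeled positive is a true positive, while the unlabeled group mixes true negatives with undetected positives. A classifier trained with the unlabeled group as the negative class would learn from contaminated negatives, which is the situation a positive and unlabeled learning framework is meant to address.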

Discussion and conclusions
Metastasis is the major cause of cancer-related deaths, and evaluations of metastasis risk are essential for tailored treatment of cancer patients. Existing computational tools for predicting cancer metastasis potential fall into two categories: 1) methods that build a classifier using the clinical metastasis diagnoses as responses and 2) methods that evaluate the behavior of gene features found to be significantly associated with metastasis-related survival outcomes. Many such predictors exist, even for the same cancer type; however, their selected gene features rarely overlap, and metastasis predictor genes are even less consistent across different cancer types.

Citation: Zhou J, Lu X, Chang W, Wan C, Lu X, Zhang C, et al. (2022) PLUS: Predicting cancer metastasis potential based on positive and unlabeled learning. PLoS Comput Biol 18(3): e1009956. https://doi.org/10.1371/journal.pcbi.1009956

Academic Editor: Jie Liu, University of Michigan, UNITED STATES

Received: June 17, 2021; Accepted: February 23, 2022; Published: March 29, 2022

Copyright: © 2022 Zhou et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The datasets analysed in this study are available from the TCGA GDC repository and from the Gene Expression Omnibus (GEO) repository under the following accession numbers: GSE103322, GSE75688. PLUS is an R program available at https://github.com/xiaoyulu95/PLUS.

Competing interests: The authors have declared that no competing interests exist.