Putative biomarkers for predicting tumor sample purity based on gene expression data


We identified a ten-gene set (CSF2RB, RHOH, C1S, CCDC69, CCL22, CYTIP, POU2AF1, FGR, CCL21, and IL7R) that may serve as a biomarker for predicting tumor purity, regardless of tumor type, from RNA-seq gene expression data. We applied a supervised machine learning method, XGBoost, to data from 33 TCGA tumor types to predict tumor purity. Final predictions are made with an ensemble of 1,000 XGBoost models.
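The idea of combining many models into one prediction can be illustrated with a minimal Python sketch. It rests on several assumptions: the description above does not state how the 1,000 member predictions are combined, so a plain mean is assumed here; the member models are toy stand-ins with arbitrary weights, not trained XGBoost models; and the names `make_toy_model` and `ensemble_predict` are hypothetical.

```python
import random
from statistics import mean

# The ten genes named in the description above.
GENES = ["CSF2RB", "RHOH", "C1S", "CCDC69", "CCL22",
         "CYTIP", "POU2AF1", "FGR", "CCL21", "IL7R"]


def make_toy_model(seed):
    """Return a stand-in regressor (NOT XGBoost) with arbitrary weights."""
    rng = random.Random(seed)
    weights = {gene: rng.uniform(-0.1, 0.0) for gene in GENES}
    bias = rng.uniform(0.8, 1.0)

    def predict(sample):
        raw = bias + sum(weights[gene] * sample[gene] for gene in GENES)
        # Purity is a fraction, so clamp the toy output to [0, 1].
        return min(1.0, max(0.0, raw))

    return predict


def ensemble_predict(models, sample):
    """Assumed aggregation rule: average the member predictions."""
    return mean(model(sample) for model in models)


# 1,000 members, matching the ensemble size given in the description.
models = [make_toy_model(seed) for seed in range(1000)]

# A made-up expression profile, for illustration only.
sample = {gene: 1.0 for gene in GENES}
purity = ensemble_predict(models, sample)
```

Averaging over many members is a common way to reduce the variance of any single model's prediction; whether this tool uses a mean, a median, or another rule is not specified here.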

Input data should be a gene expression matrix with one row per sample and one column per gene. Sample identifiers should be provided in a column named "id". Here is a sample dataset. The output file contains two columns: the sample identifier and the predicted tumor purity.
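The expected layout can be checked before upload with a short Python sketch. It assumes a comma-separated file, which the description does not specify. The helper names `read_expression_matrix` and `write_predictions` and the output column name "purity" are hypothetical, and the values written are placeholders, not model output.

```python
import csv
import io


def read_expression_matrix(handle):
    """Read a samples-by-genes table that has an "id" column."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or "id" not in reader.fieldnames:
        raise ValueError('input must contain a sample identifier column named "id"')
    # Every column other than "id" is treated as a gene.
    genes = [name for name in reader.fieldnames if name != "id"]
    samples = []
    for row in reader:
        values = {gene: float(row[gene]) for gene in genes}
        samples.append((row["id"], values))
    return genes, samples


def write_predictions(handle, predictions):
    """Write two columns: sample identifier and predicted purity."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "purity"])  # "purity" is an assumed header name
    for sample_id, purity in predictions:
        writer.writerow([sample_id, purity])


# Tiny made-up input with two samples and two of the ten genes.
example = io.StringIO("id,CSF2RB,RHOH\nsample_1,5.2,3.1\nsample_2,0.4,0.9\n")
genes, samples = read_expression_matrix(example)

# Placeholder predictions (NOT real model output), only to show the output layout.
out = io.StringIO()
write_predictions(out, [(sample_id, 0.5) for sample_id, _ in samples])
output_text = out.getvalue()
```

A file that lacks the "id" column raises a `ValueError` in this sketch, which makes the most likely formatting mistake visible before submission.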




Authors: Ethan Xu and Yuanyuan Li