Machine learning-based feature gene screening of pancreatic cancer

Affiliations:

1. Department of General Surgery, Xiangya Hospital, Central South University, Changsha 410008, China; 2. Department of Pharmacy, Xiangya Hospital, Central South University, Changsha 410008, China; 3. National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha 410008, China

About the author:

Wei Wei (魏伟) is an attending physician at Xiangya Hospital, Central South University, whose research focuses on basic and clinical studies of pancreatic diseases.

Funding:

Supported by the Natural Science Foundation of Hunan Province (2019JJ40489).

    Abstract:

    Background and Aims: Pancreatic cancer is difficult to treat, and more than 90% of patients die within one year of diagnosis. Genes that are differentially expressed between pancreatic cancer tissue and normal tissue (differentially expressed genes, DEGs) may be closely associated with the development and progression of the disease. This study used machine learning methods to screen DEGs in pancreatic cancer, so as to provide a basis for studying its pathogenesis.

    Methods: Pancreatic cancer gene expression profiles were selected from the public Gene Expression Omnibus (GEO) database. The microarray data of the different groups were normalized and tested for differential expression with the linear-model package limma in R to obtain the DEGs, which were then further screened by a correlation-based feature selection method. Based on the resulting hub DEGs, AdaBoost and Bagging algorithms were each used to construct a pancreatic cancer prediction model. GO functional analysis and KEGG pathway enrichment analysis of the hub DEGs were performed on the DAVID website, the protein-protein interaction (PPI) network of the hub DEGs was analyzed with the STRING website and Cytoscape software, and survival analysis of the prognosis-related hub DEGs was performed on the GEPIA website.

    Results: Feature screening yielded 18 key DEGs. A prediction model built with the AdaBoost algorithm on the feature subset of these 18 DEGs reached a prediction accuracy of 92.3%. GO and KEGG analysis of the DEGs suggested that CDK1, CCNA2 and CCNB1 play an indirect role in the formation and development of pancreatic cancer. Survival analysis showed that the expression levels of CDK1 (P=0.0008), CCNB1 (P=0.012), CSK2 (P=0.023) and CKS1B (P=0.0013) were correlated with overall survival (OS): the higher the expression, the shorter the OS.

    Conclusion: Machine learning methods can screen feature genes of pancreatic cancer well, which has some significance for the diagnosis and treatment of pancreatic cancer and for related drug development.
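
The boosting step named in the abstract can be illustrated with a minimal AdaBoost sketch. This is not the authors' implementation: the weak learner (a single-gene threshold rule, i.e. a decision stump), the number of rounds, and the synthetic 18-gene data are all assumptions made for illustration, and the accuracy it prints has no relation to the 92.3% reported above. Bagging is not covered here.

```python
import math
import random


def make_toy_data(n_per_class=40, n_genes=18, seed=0):
    """Synthetic 'expression' matrix (invented values, not GEO data).
    Tumor samples (+1) are shifted upward on the first three genes."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (-1, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            if label == 1:
                for g in range(3):
                    row[g] += 2.0
            X.append(row)
            y.append(label)
    return X, y


def stump_predict(row, gene, threshold, sign):
    """Decision stump: predict `sign` if the gene exceeds the threshold."""
    return sign if row[gene] > threshold else -sign


def fit_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for gene in range(len(X[0])):
        for threshold in sorted(set(row[gene] for row in X)):
            for sign in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, gene, threshold, sign) != yi)
                if best is None or err < best[0]:
                    best = (err, gene, threshold, sign)
    return best


def adaboost_fit(X, y, n_rounds=10):
    """Discrete AdaBoost: reweight samples so later stumps focus on errors."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(n_rounds):
        err, gene, threshold, sign = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by 0
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        model.append((alpha, gene, threshold, sign))
        w = [wi * math.exp(-alpha * yi * stump_predict(row, gene, threshold, sign))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def adaboost_predict(model, row):
    """Weighted vote of all stumps."""
    score = sum(a * stump_predict(row, g, t, s) for a, g, t, s in model)
    return 1 if score >= 0 else -1


def accuracy(model, X, y):
    return sum(adaboost_predict(model, r) == yi for r, yi in zip(X, y)) / len(y)


if __name__ == "__main__":
    X_train, y_train = make_toy_data(seed=0)
    X_test, y_test = make_toy_data(seed=1)
    model = adaboost_fit(X_train, y_train, n_rounds=10)
    print("train accuracy:", accuracy(model, X_train, y_train))
    print("test accuracy:", accuracy(model, X_test, y_test))
```

Evaluating on a separately generated test set, as the demo does, is the point to carry over to real data: accuracy measured on the samples used for feature selection and training would be optimistic.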

    Table 1 Prediction results of different weak classifiers for pancreatic cancer
    Fig.1 Volcano plot of DEGs (blue dots: down-regulated DEGs that meet the threshold; red dots: up-regulated DEGs that meet the threshold; gray dots: genes that do not meet the threshold)
    Fig.2 GO functional enrichment analysis of DEGs
    Fig.3 KEGG functional enrichment analysis of DEGs
    Fig.4 Protein-protein interaction network of DEGs
    Fig.5 Relationship between hub gene expression and survival of pancreatic cancer patients
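
The color coding described for Fig.1 amounts to a three-way classification of each gene. The sketch below shows one way to express it. The thresholds are not stated on this page, so the cutoffs used here (|log2 fold change| of at least 1 and adjusted P below 0.05) are assumed common defaults, and the function names are hypothetical.

```python
import math

# Colors follow the Fig.1 caption; the category names are illustrative.
COLORS = {"up": "red", "down": "blue", "ns": "gray"}


def classify_gene(log2_fc, adj_p, fc_cutoff=1.0, p_cutoff=0.05):
    """Label a gene as up-regulated, down-regulated, or not significant.
    fc_cutoff and p_cutoff are assumed values, not taken from the paper."""
    if adj_p < p_cutoff and log2_fc >= fc_cutoff:
        return "up"
    if adj_p < p_cutoff and log2_fc <= -fc_cutoff:
        return "down"
    return "ns"


def volcano_point(log2_fc, adj_p):
    """Coordinates of a gene on a volcano plot: (log2 FC, -log10 adjusted P)."""
    return (log2_fc, -math.log10(adj_p))


if __name__ == "__main__":
    # Invented example values for three hypothetical genes.
    for name, fc, p in [("geneA", 2.3, 0.0004), ("geneB", -1.7, 0.01), ("geneC", 0.4, 0.3)]:
        label = classify_gene(fc, p)
        print(name, label, COLORS[label], volcano_point(fc, p))
```

A gene must pass both cutoffs to be colored, so a large fold change with a non-significant P value still plots as gray.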
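Cite this article

Wei Wei, Ou Zhenglin, Dou Xiaolin, Zhang Shuai, Tang Ling. Machine learning-based feature gene screening of pancreatic cancer[J]. China Journal of General Surgery (中国普通外科杂志), 2022, 31(9): 1203-1209.
DOI: 10.7659/j.issn.1005-6947.2022.09.009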

History
  • Received: 2022-06-26
  • Last revised: 2022-08-25
  • Published online: 2022-09-30