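Identification of hub genes in pancreatic adenocarcinoma and construction of a support vector machine diagnostic classifier based on bioinformatics approaches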

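Funding: Key Research and Development Program of Gansu Province (17YF1FA128); Talent Innovation and Entrepreneurship Fund of Lanzhou City, Gansu Province (2017-RC-37).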


    Abstract:

    Background and Aims: Pancreatic cancer is a common malignant tumor of the digestive tract, and its main pathological type is pancreatic adenocarcinoma (PAAD). Because early diagnosis is difficult and effective treatments are lacking, the prognosis of PAAD is extremely poor, so identifying new diagnostic and therapeutic targets is of great importance. This study used bioinformatics analysis to screen hub genes related to the diagnosis and prognosis of PAAD and constructed a support vector machine (SVM) model to classify PAAD and normal pancreatic samples, with the aim of providing a basis for research on the diagnosis, treatment, and mechanisms of PAAD.
    Methods: Three microarray datasets (GSE28735, GSE62165, GSE62452) were downloaded from the Gene Expression Omnibus (GEO) database. Differentially expressed genes (DEGs) between PAAD tissue and normal pancreatic tissue were screened with the R package Limma. GO and KEGG pathway enrichment analyses of the DEGs were performed with the STRING database. A protein-protein interaction (PPI) network of the DEGs was then built with STRING and visualized in Cytoscape, and key subnetworks were identified with the MCODE plug-in. The R package survival was used to screen prognosis-related key nodes in the PPI network and the key subnetworks, and these key nodes were uploaded to Metascape for functional enrichment analysis. The recursive feature elimination (RFE) algorithm in the R package caret was used to select the optimal feature genes among the key nodes, and the differential expression of these genes was verified in the GEPIA database. An SVM classifier based on the optimal feature genes was constructed with the R package e1071 and validated in the 3 microarray datasets with the R package pROC. In the TCGA database, the R package survminer was used to select, among the optimal feature genes, those related to the prognosis of PAAD as the hub genes.
    Results: A total of 257 DEGs were identified, comprising 168 up-regulated and 89 down-regulated genes. GO analysis showed that the DEGs were mainly enriched in GO terms such as extracellular matrix organization, cell adhesion, and serine-type peptidase activity. KEGG analysis showed that the DEGs were mainly enriched in protein digestion and absorption, pancreatic secretion, focal adhesion, and the PI3K-Akt signaling pathway. Survival analysis identified 14 key nodes that were associated with prognosis in both GSE28735 and GSE62452 (all P<0.05); these genes play a role in tumor invasion and tumorigenesis. RFE selected 8 optimal feature genes: LAMA3, FN1, ITGA3, MET, PLAU, CENPF, MMP14, and OAS2. Validation in the GEPIA database showed that all 8 were significantly up-regulated in PAAD tissues (all P<0.01). The AUCs of the ROC curves of the SVM model built on these genes in the 3 microarray datasets were 0.898, 1.000, and 0.905, respectively. Validation in the TCGA database showed that up-regulation of LAMA3, ITGA3, MET, PLAU, CENPF, and OAS2 was associated with poor prognosis of PAAD (all P<0.05).
    Conclusion: The hub genes LAMA3, ITGA3, MET, PLAU, CENPF, and OAS2 may be new targets for the diagnosis and treatment of PAAD. The SVM model based on the 8 optimal feature genes can effectively distinguish PAAD from normal pancreatic samples.
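
    As an illustration of what an AUC like those reported in the Results measures, the following minimal sketch computes it from classifier scores as the probability that a randomly chosen tumor sample scores higher than a randomly chosen normal sample. The study itself used the R package pROC; this Python function, its name `roc_auc`, and the toy labels and scores are assumptions made for illustration only and are not the authors' code or data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for tumor (positive), 0 for normal (negative).
    scores: classifier outputs where higher means "more likely tumor".
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical decision scores for 4 tumor and 4 normal samples
# (illustrative values, not taken from GSE28735, GSE62165, or GSE62452).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.35, 0.7, 0.4, 0.3, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.4f}")
```

    An AUC of 1.000 corresponds to scores that separate the two groups completely, while 0.5 corresponds to chance-level ranking.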

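Cite this article

ZHANG Bo, XU Tao, XU Hao, XIA Yu, ZHOU Wence. Identification of hub genes in pancreatic adenocarcinoma and construction of a support vector machine diagnostic classifier based on bioinformatics approaches[J]. China Journal of General Surgery (中国普通外科杂志), 2021, 30(3): 276-285.
DOI: 10.7659/j.issn.1005-6947.2021.03.005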

History
  • Received: 2020-10-29
  • Last revised: 2021-03-25
  • Published online: 2021-03-25