Construction and validation of a prognostic model for pancreatic ductal adenocarcinoma based on machine learning algorithm

Affiliations:

1. Department of Ultrasound Medicine, Mianyang Traditional Chinese Medicine Hospital, Mianyang, Sichuan 621000, China; 2. Department of General Surgery, Mianyang Traditional Chinese Medicine Hospital, Mianyang, Sichuan 621000, China; 3. College of Medical Technology, Chengdu University of Traditional Chinese Medicine, Chengdu, Sichuan 611137, China

About the author:

Zhang Yeguang (张业光), chief physician at Mianyang Traditional Chinese Medicine Hospital, Sichuan; research focus: ultrasound medicine (general surgery).

Funding:

Mianyang Municipal Health Commission Fund, Sichuan Province (202309); Mianyang Traditional Chinese Medicine Hospital Fund, Sichuan Province (MYSZYYYKT2023117).
Abstract:

Background and Aims: Pancreatic ductal adenocarcinoma (PDAC) is the most common pathological type of pancreatic cancer. Its long-term prognosis is poor, and individualized prognostic assessment tools are lacking. This study used large-sample real-world data from the SEER database and machine learning algorithms to construct a prognostic nomogram for PDAC patients, with the aim of providing precise, individualized prognostic evaluation to inform clinical decision-making.

Methods: Clinicopathological and prognostic data of patients with pathologically confirmed PDAC diagnosed from 2000 to 2018 were extracted from the SEER database according to the inclusion and exclusion criteria. Patients were randomly divided into a training set and a validation set at a 7:3 ratio. In the training set, univariate and multivariate Cox proportional hazards models, LASSO regression, and a random survival forest model were each used to screen for independent prognostic factors, and nomograms were built to predict 6-, 12-, and 36-month cancer-specific survival (CSS) and overall survival (OS). The model was then validated and assessed in both the training and validation sets using the concordance index (C-index), receiver operating characteristic (ROC) curves, calibration curves, survival curves, and decision curve analysis.

Results: A total of 4,237 patients were included, 2,965 in the training set and 1,272 in the validation set, with balanced and comparable baseline characteristics. The median follow-up time was 18 (9-36) months in the training set and 18 (9-37) months in the validation set. The multivariate Cox model showed that age, T stage, N stage, M stage, differentiation, surgery, systemic therapy, and chemotherapy were independent factors for OS (all P<0.05), and that age, T stage, N stage, M stage, differentiation, surgery, and chemotherapy were independent factors for CSS (all P<0.05). The LASSO regression model showed that age, differentiation, T stage, N stage, M stage, chemotherapy, surgery, extent of lymph node dissection, radiotherapy, and systemic therapy were associated with OS, while differentiation, T stage, N stage, M stage, chemotherapy, surgery, extent of lymph node dissection, radiotherapy, and systemic therapy were associated with CSS. In the random survival forest model, the five variables with the highest importance scores for OS were systemic therapy, differentiation, N stage, chemotherapy, and T stage; for CSS they were systemic therapy, differentiation, N stage, chemotherapy, and AJCC stage. Based on the results of the multivariate Cox, LASSO, and random survival forest models, together with clinical importance, seven clinical features were finally selected to build the models predicting OS and CSS at 6, 12, and 36 months: age, T stage, N stage, M stage, differentiation, surgery, and chemotherapy. For OS, the C-index was 0.692 (95% CI=0.681-0.704) in the training set and 0.680 (95% CI=0.664-0.698) in the validation set; for CSS, it was 0.696 (95% CI=0.684-0.707) and 0.680 (95% CI=0.662-0.698), respectively. ROC curves indicated good predictive value, and the calibration curves all lay close to the ideal 45° reference line.

Conclusion: Age, TNM stage, differentiation, surgery, and chemotherapy are independent prognostic factors for PDAC patients. The prediction model built on these variables shows relatively high discrimination and accuracy, and can help clinicians develop precise, individualized treatment and follow-up plans for PDAC patients.
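The C-index reported in the abstract is the share of comparable patient pairs in which the patient with the higher predicted risk also has the shorter observed survival. The following is a minimal sketch of Harrell's C-index in plain Python, not the authors' code. The function name `harrell_c_index` and the toy data are hypothetical, and pairs with tied survival times are simply skipped, which is a simplification of the full definition.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Simplified Harrell's C-index (hypothetical helper, for illustration).

    times:       follow-up time per patient (e.g. months)
    events:      1 if death was observed, 0 if the patient was censored
    risk_scores: model output, higher = worse predicted prognosis
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Put the patient with the shorter follow-up first.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # A pair is comparable only if the shorter time is an observed event.
        # Assumption: pairs with tied times are skipped for simplicity.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # higher risk died earlier: concordant
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data, unrelated to the SEER cohort.
toy_times = [5, 8, 12, 20, 30]
toy_events = [1, 1, 0, 1, 0]
toy_risk = [0.9, 0.4, 0.7, 0.5, 0.1]
print(harrell_c_index(toy_times, toy_events, toy_risk))  # 0.75
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the reported values of about 0.68 to 0.70 should be read.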

Table 1 Baseline characteristics of patients with PDAC in training set and validation set [n (%)]
Table 2 Baseline characteristics of patients with PDAC in training set and validation set [n (%)] (continued)
Table 3 Univariate Cox regression analysis for CSS and OS in PDAC patients
Table 4 Univariate Cox regression analysis for CSS and OS in PDAC patients (continued)
Fig.1 Case screening process
Fig.2 Forest plot of multivariate Cox regression analysis results. A: OS; B: CSS
Fig.3 Feature selection based on LASSO regression. A: LASSO regression coefficients vs. Log(λ) curve (OS); B: C-index from 10-fold cross-validation vs. Log(λ) curve (OS); C: LASSO regression coefficients vs. Log(λ) curve (CSS); D: C-index from 10-fold cross-validation vs. Log(λ) curve (CSS)
Fig.4 Variable importance from the random survival forest model. A: OS; B: CSS
Fig.5 Nomogram for predicting the prognosis of PDAC patients at 6, 12, and 36 months. A: OS; B: CSS
Fig.6 ROC curves validating the predictive ability of the model at 6, 12, and 36 months in the training and validation sets. A: OS; B: CSS
Fig.7 Calibration curves for OS and CSS of PDAC patients at 6, 12, and 36 months. A: Training set; B: Validation set
Fig.8 Comparison of DCA for the nomogram and TNM staging predicting OS and CSS at 6, 12, and 36 months in the training and validation sets. A: OS for the training set; B: OS for the validation set; C: CSS for the training set; D: CSS for the validation set
Fig.9 Survival curves for patients with different risk levels. A: OS for the training set; B: OS for the validation set; C: CSS for the training set; D: CSS for the validation set
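
Survival curves such as those in Fig.9 are conventionally drawn with the Kaplan-Meier (product-limit) estimator, computed separately for each risk group. The sketch below illustrates that estimator in plain Python, assuming the patients have already been assigned to risk groups. The source does not state how the groups were defined, so the function name `kaplan_meier`, the group labels, and the toy data are all hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival probability).

    times:  follow-up time per patient (e.g. months)
    events: 1 if death was observed, 0 if the patient was censored
    Only time points with at least one observed death are returned.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        leaving = sum(1 for ti in times if ti == t)  # deaths + censored at t
        if deaths:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Toy groups, unrelated to the SEER cohort (hypothetical labels).
high_risk = kaplan_meier([2, 4, 6, 9, 12], [1, 1, 1, 0, 1])
low_risk = kaplan_meier([6, 12, 18, 36, 40], [0, 1, 0, 1, 0])
for label, curve in (("high risk", high_risk), ("low risk", low_risk)):
    print(label, [(t, round(s, 3)) for t, s in curve])
```

Censored patients leave the at-risk count without lowering the curve, which is why the toy low-risk curve drops only at months 12 and 36.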
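Cite this article

Zhang Yeguang, Zhao Pan, Zhang Hui, Huang Zhenghong, Huang Kun (张业光, 赵攀, 章慧, 黄正红, 黄坤). Construction and validation of a prognostic model for pancreatic ductal adenocarcinoma based on machine learning algorithm [J]. China Journal of General Surgery (中国普通外科杂志), 2024, 33(9): 1459-1472.
DOI: 10.7659/j.issn.1005-6947.2024.09.013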

History
  • Received: 2023-12-28
  • Last revised: 2024-03-14
  • Published online: 2024-10-12