Petroleum Science, 2026, Issue 6: 3017-3036. DOI: https://doi.org/10.1016/j.petsci.2026.01.042
A data augmentation method for lacustrine shale lithofacies classification based on a conditional diffusion probabilistic model: A case study from the Dongying Depression, Bohai Bay Basin, China Open Access
Article information
Authors: Gui-Ang Li, Cheng-Yan Lin, Chun-Mei Dong, Li-Hua Ren, Peng-Jie Ma, Yu-Qi Wu, Guo-Yin Zhang, Xin-Yu Du, Zi-Ru Zhao
Affiliations:
Submitted:
Citation: Li, G.A., Lin, C.Y., Dong, C.M., et al., 2026. A data augmentation method for lacustrine shale lithofacies classification based on a conditional diffusion probabilistic model: A case study from the Dongying Depression, Bohai Bay Basin, China. Petrol. Sci. 23 (6), 3017–3036. https://doi.org/10.1016/j.petsci.2026.01.042.
Abstract
Accurate identification of lithofacies is critical for shale hydrocarbon exploration and development. Although machine learning (ML) is one of the most effective approaches for predicting shale lithofacies, inherent geological heterogeneity results in scarce training samples and severely imbalanced class distributions, causing conventional ML methods to overfit and lose accuracy. To address these challenges, we introduced a conditional diffusion probabilistic model (CDPM) and developed a comprehensive data augmentation framework for shale lithofacies prediction. Applying this framework to the Upper Fourth Member of the Shahejie Formation (Es4s) in the Dongying Depression, we generated 3,600 class-balanced augmented samples from 926 core-calibrated samples and eight conventional well log curves from well NY1, achieving an 878.3% increase in samples of the rare organic-rich fissile calcareous mudstone (L1) lithofacies. To ensure the reliability of the augmented data, we conducted a comprehensive quality assessment, which showed that the augmented data effectively retained logging characteristics and petrophysical relationships, with a Fréchet Inception Distance (FID) of 22.9 and a maximum mean discrepancy (MMD) of 0.078. Using this high-quality augmented dataset, we evaluated the impact of data augmentation on lithofacies classification performance across random forest (RF), support vector machine (SVM), and gradient boosted decision tree (GBDT) algorithms. The results showed substantial improvements, with average accuracy and F1 score increases of 13.6% and 16.5%, respectively, and a 33.1% improvement in L1 recall. To further validate the practical applicability of the approach, blind-well validation on independent wells from different structural positions demonstrated robust generalization capability, achieving significant improvement over traditional ML methods.
This study pioneers a conditional diffusion model for predicting shale lithofacies, providing a novel framework for characterising lacustrine shale oil reservoirs and predicting sweet spots.
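As an illustration of the kind of quality check the abstract refers to, the following is a minimal sketch of a maximum mean discrepancy (MMD) estimate between real and generated feature vectors. It is not the authors' implementation: the RBF kernel, the `gamma` bandwidth, the biased squared-MMD estimator, and the synthetic eight-feature arrays (standing in for eight log curves) are all assumptions made for the example, and the abstract does not state which kernel or estimator produced the reported value.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """Pairwise RBF kernel exp(-gamma * ||a_i - b_j||^2) between rows of a and b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def mmd2_biased(x, y, gamma):
    """Biased estimate of squared MMD between samples x and y (assumed RBF kernel)."""
    k_xx = rbf_kernel(x, x, gamma)
    k_yy = rbf_kernel(y, y, gamma)
    k_xy = rbf_kernel(x, y, gamma)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()


# Synthetic stand-ins (hypothetical data, not from the paper):
# 200 samples with 8 features each, mimicking eight standardized log curves.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 8))
similar = rng.normal(size=(200, 8))            # drawn from the same distribution
shifted = rng.normal(loc=2.0, size=(200, 8))   # drawn from a shifted distribution

gamma = 1.0 / real.shape[1]  # assumed bandwidth heuristic: 1 / number of features
mmd_similar = mmd2_biased(real, similar, gamma)
mmd_shifted = mmd2_biased(real, shifted, gamma)
print(f"same distribution:    {mmd_similar:.4f}")
print(f"shifted distribution: {mmd_shifted:.4f}")
```

A lower value indicates that the two sample sets are harder to tell apart under the chosen kernel, so generated samples that preserve the log-curve distributions should score closer to the "same distribution" case than to the shifted one. The absolute scale depends on the kernel and bandwidth, so values from this sketch are not comparable to the figure reported in the abstract.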
Keywords
Lacustrine shale; Lithofacies prediction; Conditional diffusion probabilistic model; Data augmentation; Class imbalance