Abstract
This paper explores the potential of contrastive language-image pretraining (CLIP) in subsurface geoscience, both as a standalone tool for multimodal semantic search and within the broader context of vision-language model training. It focuses on the role of CLIP in developing subsurface-specific vision-language models and presents a novel workflow for the stochastic generation of meaningful subsurface image-text data sets, adaptable to any vision-language training. Using this workflow, synthetic geologic and seismic image-caption data sets are constructed to conduct training experiments with CLIP. A comparison of training strategies for subsurface classification and semantic search reveals that CLIP models using domain-specific pretrained encoders deliver superior accuracy compared with traditional CLIP pretraining using off-the-shelf encoders. Additionally, we introduce GeoGen, a machine learning tool designed to generate realistic geologic and seismic models from textual descriptions, in which domain-trained CLIP models play a pivotal role. An evaluation of GeoGen's performance emphasizes the importance of selecting an appropriate CLIP training strategy to enhance vision-language fusion and improve model quality.