Velocity model inversion is one of the most challenging tasks in seismic exploration, and an accurate velocity model is essential for high-resolution seismic imaging. Recently, velocity inversion methods based on deep learning (DL), particularly convolutional neural networks (CNNs), have attracted considerable attention from the seismic exploration community. These methods aim to estimate the velocity model directly from raw seismograms using a well-trained network. Although CNN-based velocity inversion methods have demonstrated remarkable performance in terms of intelligence and automation, their inversion performance is often constrained by a limited ability to capture long-range dependence. Specifically, when convolving raw seismic data with small kernels (e.g., 1 × 1, 3 × 3, 5 × 5, and 7 × 7), CNN-based methods extract only local features and neglect the weak spatial correlation between different local features that reflect the same interface. This correlation could help a CNN form a global view of the seismic data and improve DL-based inversion performance. Furthermore, the time-varying properties of seismic data pose a challenge to the weight sharing of CNNs. Here, we develop a new DL framework based on a transformer, called the seismic velocity inversion transformer (SVIT), to address the problem of velocity inversion. SVIT uses a self-attention mechanism to capture the long-range dependence of seismic data, rather than stacking multiple convolutional layers as in CNNs. Thus, SVIT can provide more informative long-range features for building velocity models. The validity and reliability of the proposed method are demonstrated through numerical experiments using synthetic models.
Compared with the conventional full-waveform inversion method and an existing CNN-based velocity inversion method, SVIT shows greater consistency with the target in terms of velocity values, subsurface structures, and geologic interfaces, and is expected to provide a new DL-based solution to inversion problems.
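The contrast between a small convolution kernel and self-attention can be illustrated with a minimal sketch. It is not the SVIT architecture. It is a toy single-head scaled dot-product self-attention written with the Python standard library, under the assumption that the seismic data have already been split into a short sequence of feature vectors ("tokens"). The function names, the random projection weights, the token count, and the smoothing kernel are all hypothetical choices made for illustration. The sketch perturbs the last token and checks how much the output at the first token changes under each operator.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative only)."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
              for q_row in q]
    weights = [softmax(r) for r in scores]  # every token attends to every token
    return matmul(weights, v), weights


def conv1d_same(tokens, kernel):
    """Zero-padded 1D convolution along the token axis, one shared kernel."""
    n, d, half = len(tokens), len(tokens[0]), len(kernel) // 2
    out = []
    for i in range(n):
        row = [0.0] * d
        for offset, w in enumerate(kernel):
            j = i + offset - half
            if 0 <= j < n:  # only neighbours inside the kernel contribute
                for c in range(d):
                    row[c] += w * tokens[j][c]
        out.append(row)
    return out


random.seed(0)
n_tokens, dim = 8, 4  # assumed toy sizes
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
w_q, w_k, w_v = ([[random.gauss(0, 0.5) for _ in range(dim)] for _ in range(dim)]
                 for _ in range(3))

# Perturb only the last token, far from the first one.
perturbed = [row[:] for row in tokens]
perturbed[-1] = [v + 1.0 for v in perturbed[-1]]

attn_out, attn_w = self_attention(tokens, w_q, w_k, w_v)
attn_out_p, _ = self_attention(perturbed, w_q, w_k, w_v)
conv_out = conv1d_same(tokens, [0.25, 0.5, 0.25])
conv_out_p = conv1d_same(perturbed, [0.25, 0.5, 0.25])

# Change in the output at the first token caused by the distant perturbation.
attn_change = max(abs(a - b) for a, b in zip(attn_out[0], attn_out_p[0]))
conv_change = max(abs(a - b) for a, b in zip(conv_out[0], conv_out_p[0]))
print("attention change at token 0:", attn_change)
print("3-tap convolution change at token 0:", conv_change)
```

In this toy setting, a single 3-tap convolution leaves the first token unaffected by a change seven positions away, whereas one attention layer already couples the two. A CNN would need several stacked layers or larger kernels to reach the same span.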