Computational cost is a major factor that inhibits the practical application of 3D depth migration. We have developed a fast parallel scheme to accelerate 3D wave-equation depth migration on a parallel computing device, namely graphics processing units (GPUs). The third-order optimized generalized-screen propagator is used to take advantage of the device's built-in fast Fourier transform (FFT) library. The propagator is coded as a sequence of kernels that are called from the host for each frequency component. Moving the wavefield extrapolation for each depth level to the GPU makes it possible to handle a large 3D velocity model, but by itself this scheme yields only a limited speedup over the CPU implementation because of the low-bandwidth data transfer between host and device. We achieve further speedup by minimizing this transfer: the 3D velocity model and imaged data are kept resident in device memory, and their memory demand is halved by storing them in integer arrays instead of float arrays. Incorporating a 2D tapered function, the time-shift propagator, and the scaling of the inverse Fourier transform into a single compact kernel reduces the computation time further. Three-dimensional impulse responses and synthetic-data examples demonstrate that the GPU-based Fourier migration is typically 25 to 40 times faster than the CPU-based implementation. It enables us to image complex media using 3D depth migration with little concern for computational cost, and the macrovelocity model can be built in a much shorter turnaround time.
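
The abstract does not give the details of the integer compression, but the memory-halving claim is consistent with quantizing a 32-bit float model to 16-bit integers plus a small dequantization rule applied on the fly. The sketch below illustrates that idea on the host side with NumPy; the function names, the uniform min/max quantization scheme, and the synthetic velocity range are illustrative assumptions, not the authors' implementation (which operates in GPU device memory).

```python
import numpy as np

def compress_velocity(v):
    """Quantize a float32 model to uint16 plus (vmin, scale) metadata.

    Assumed scheme (not from the paper): uniform quantization over the
    model's value range. Storing 16-bit integers instead of 32-bit floats
    halves the memory footprint, as the abstract describes.
    """
    vmin, vmax = float(v.min()), float(v.max())
    scale = (vmax - vmin) / 65535.0 if vmax > vmin else 1.0
    q = np.round((v - vmin) / scale).astype(np.uint16)
    return q, vmin, scale

def decompress_velocity(q, vmin, scale):
    """Reconstruct an approximate float32 model on the fly."""
    return q.astype(np.float32) * scale + vmin

# Hypothetical 3D model with velocities between 1500 and 5500 m/s.
rng = np.random.default_rng(0)
v = rng.uniform(1500.0, 5500.0, size=(16, 16, 16)).astype(np.float32)

q, vmin, scale = compress_velocity(v)
assert q.nbytes * 2 == v.nbytes  # integer storage uses half the memory

# Quantization error is bounded by half a quantization step, which is
# negligible relative to seismic velocities of thousands of m/s.
err = np.abs(decompress_velocity(q, vmin, scale) - v).max()
```

For a roughly 4000 m/s value range, the quantization step is about 0.06 m/s, so the accuracy loss is far below the uncertainty of any migration velocity model, which is presumably why the trade is attractive on memory-constrained GPUs.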