In this paper, we explore estimating 3D human pose directly from monocular image data. Current state-of-the-art approaches employ a volumetric representation to predict a per-voxel likelihood for each human joint, but this network output is memory-intensive, making it difficult to deploy on mobile devices. To reduce the output dimension, we decompose the volumetric representation into 2D depth-aware heatmaps and per-joint depth estimation. We propose learning depth-aware 2D heatmaps via associative embeddings, which recover the correspondence between each 2D joint location and its depth. Our approach achieves a favorable trade-off between complexity and accuracy. We conduct extensive experiments on the popular Human3.6M benchmark and advance the state-of-the-art accuracy for 3D human pose estimation in the wild.
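To make the decomposition concrete, the following is a minimal decoding sketch, assuming the network emits per-joint 2D heatmaps together with spatially aligned per-joint depth maps (this output layout and the function names are illustrative assumptions; the abstract does not specify the exact formulation, which ties the two outputs via associative embeddings).

```python
import numpy as np

def decode_pose(heatmaps: np.ndarray, depth_maps: np.ndarray) -> np.ndarray:
    """Decode a 3D pose from decomposed network outputs (illustrative sketch).

    heatmaps:   (K, H, W) per-joint 2D likelihoods.
    depth_maps: (K, H, W) per-joint depth predictions, assumed to be
                spatially aligned with `heatmaps`.
    Returns:    (K, 3) array of (u, v, z) joint coordinates.
    """
    K, H, W = heatmaps.shape
    pose = np.zeros((K, 3), dtype=np.float32)
    for k in range(K):
        # 2D joint location = mode of the k-th heatmap.
        v, u = np.unravel_index(np.argmax(heatmaps[k]), (H, W))
        # Joint depth = value of the aligned depth map at that location.
        pose[k] = (u, v, depth_maps[k, v, u])
    return pose

# Rough memory comparison: a volumetric head with D depth bins outputs
# K * D * H * W values, whereas the decomposed head outputs 2 * K * H * W,
# a D/2-fold reduction in output size.
```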