3D object detection from monocular images has proven to be an enormously challenging task, with the performance of leading systems not yet achieving even 10% of that of LiDAR-based counterparts. One explanation for this performance gap is that existing systems are entirely at the mercy of the perspective image-based representation, in which the appearance and scale of objects vary drastically with depth and meaningful distances are difficult to infer. In this work we argue that the ability to reason about the world in 3D is an essential element of the 3D object detection task. To this end, we introduce the orthographic feature transform, which maps image-based features into an orthographic 3D space, enabling us to reason holistically about the spatial configuration of the scene. We apply this transformation as part of an end-to-end deep learning architecture and demonstrate our approach on the KITTI 3D object benchmark.
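To make the idea of mapping image-based features into an orthographic space concrete, the sketch below shows a highly simplified version of such a transform: each cell of a ground-plane grid is projected into the image through the camera intrinsics, and the feature vector at the projected pixel is copied into that cell. This is an illustrative assumption-laden toy, not the paper's actual method (which pools features over the full image-plane footprint of each voxel); the function name, grid ranges, and nearest-neighbour sampling are all hypothetical choices.

```python
import numpy as np

def orthographic_feature_transform(feat, K, grid_res=0.5,
                                   x_range=(-10.0, 10.0),
                                   z_range=(1.0, 21.0), y=1.0):
    """Toy sketch of an image-to-orthographic feature mapping.

    feat : (H, W, C) image feature map.
    K    : (3, 3) camera intrinsic matrix.
    Each ground-plane cell (x, z) at fixed height y is projected into
    the image, and the feature at the nearest pixel is copied in.
    Cells projecting outside the image are left as zeros.
    """
    H, W, C = feat.shape
    xs = np.arange(*x_range, grid_res)
    zs = np.arange(*z_range, grid_res)
    ortho = np.zeros((len(zs), len(xs), C), dtype=feat.dtype)
    for i, z in enumerate(zs):
        for j, x in enumerate(xs):
            p = K @ np.array([x, y, z])       # project 3D point to image
            u, v = p[0] / p[2], p[1] / p[2]   # perspective divide
            ui, vi = int(round(u)), int(round(v))
            if 0 <= ui < W and 0 <= vi < H:
                ortho[i, j] = feat[vi, ui]    # nearest-neighbour sample
    return ortho
```

Even this toy version exposes the key property motivating the approach: in the output grid, one cell always covers the same metric extent regardless of depth, so object scale no longer varies with distance as it does in the perspective image.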