BALab seminars — Analytical database execution engines, vectorization, and Amazon Redshift applications

Abstract

Query execution engines for analytics are continuously adapting to the underlying hardware in order to maximize performance. Wider SIMD registers and more complex SIMD instruction sets are emerging in mainstream CPUs and new processor designs such as the many-core Intel Xeon Phi CPUs that rely on SIMD vectorization to achieve high performance per core while packing a greater number of smaller cores per chip. In the database literature, using SIMD to optimize stand-alone operators with key–rid pairs is common, yet the state-of-the-art query engines rely on compilation of tightly coupled operators where hand-optimized individual operators become impractical. In this article, we extend a state-of-the-art analytical query engine design by combining code generation and operator pipelining with SIMD vectorization, and show that the SIMD speedup is diminished when execution is dominated by random memory accesses. To better utilize the hardware features, we introduce VIP, an analytical query engine designed and built bottom-up from pre- compiled column-oriented data-parallel sub-operators and implemented entirely in SIMD. In our evaluation using synthetic and TPC-H queries on a many-core CPU we show that VIP outperforms hand-optimized query-specific code without incurring the runtime compilation overhead, and highlight the efficiency of VIP at utilizing the hardware features of many-core CPUs.

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse solution that makes it simple and cost-effective to efficiently analyze large volumes of data using existing business intelligence tools. Since launching in February 2013, it has been Amazon Web Service's (AWS) fastest growing service, with many thousands of customers and many petabytes of data under management. Amazon Redshift's pace of adoption has been a surprise to many participants in the data warehousing community. While Amazon Redshift was priced disruptively at launch, available for as little as $1000/TB/year, there are many open-source data warehousing technologies and many commercial data warehousing engines that provide free editions for development or under some usage limit. While Amazon Redshift provides a modern MPP, columnar, scale-out architecture, so too do many other data warehousing engines. And, while Amazon Redshift is available in the AWS cloud, one can build data warehouses using EC2 instances and the database engine of one's choice with either local or network-attached storage.

Orestis Polychroniou is a senior software engineer at Amazon Web Services working on the core query execution performance of Amazon Redshift. Before joining Amazon he did his PhD with Kenneth A. Ross at Columbia University on modern databases with his PhD thesis focusing on analytical query execution optimizations optimized for all layers of modern hardware. His research has ~500 citations.