Presto! It’s not only an incantation to excite your audience after a magic trick, but also a name being used more and more when discussing how to churn through big data. While there are many deployments of Presto in the wild, the technology — a distributed SQL query engine that supports all kinds of data sources — remains unfamiliar to many developers and data analysts who could benefit from using it.
In this article, I’ll be discussing Presto: what it is, where it came from, how it is different from other data warehousing solutions, and why you should consider it for your big data solutions.
Presto vs. Hive
Presto originated at Facebook back in 2012. Open-sourced in 2013 and managed by the Presto Foundation (part of the Linux Foundation), Presto has experienced a steady rise in popularity over the years. Today, several companies have built a business model around Presto, such as Ahana, with PrestoDB-based ad hoc analytics offerings.
Presto was built as a means to provide end-users access to enormous data sets to perform ad hoc analysis. Before Presto, Facebook would use Hive (also built by Facebook and then donated to the Apache Software Foundation) in order to perform this kind of analysis. As Facebook’s data sets grew, Hive was found to be insufficiently interactive (read: too slow). This was largely because the foundation of Hive is MapReduce, which, at the time, required intermediate data sets to be persisted to HDFS. That meant a lot of I/O to disk for data that was ultimately thrown away.
Presto takes a different approach to executing those queries to save time. Instead of keeping intermediate data on HDFS, Presto allows you to pull the data into memory and perform operations on the data there instead of persisting all of the intermediate data sets to disk. If that sounds familiar, you may have heard of Apache Spark (or any number of other technologies out there) that have the same basic concept to effectively replace MapReduce-based technologies. Using Presto, I’ll keep the data where it lives (in Hadoop or, as we’ll see, anywhere) and perform the executions in-memory across our distributed system, shuffling data between servers as needed. I avoid touching any disk, ultimately speeding up query execution time.
How Presto works
Different from a traditional data warehouse, Presto is referred to as a SQL query execution engine. Data warehouses control how data is written, where that data resides, and how it is read. Once you get data into your warehouse, it can prove difficult to get it back out. Presto takes another approach by decoupling data storage from processing, while providing support for the same ANSI SQL query language you are used to.
At its core, Presto executes queries over data sets that are provided by plug-ins, specifically Connectors. A Connector provides a means for Presto to read (and even write) data to an external data system. The Hive Connector is one of the standard connectors, using the same metadata you would use to interact with HDFS or Amazon S3. Because of this connectivity, Presto is a drop-in replacement for organizations using Hive today. It is able to read data from the same schemas and tables using the same data formats — ORC, Avro, Parquet, JSON, and more. In addition to the Hive connector, you’ll find connectors for Cassandra, Elasticsearch, Kafka, MySQL, MongoDB, PostgreSQL, and many others. Connectors are being contributed to Presto all the time, giving Presto the potential to be able to access data anywhere it lives.
The advantage of this decoupled storage model is that Presto is able to provide a single federated view of all of your data — no matter where it resides. This ramps up the capabilities of ad hoc querying to levels it has never reached before, while also providing interactive query times over your large data sets (as long as you have the infrastructure to back it up, on-premises or cloud).
Let’s take a look at how Presto is deployed and how it goes about executing your queries. Presto is written in Java, and therefore requires a JDK or JRE to be able to start. Presto is deployed as two main services, a single Coordinator and many Workers. The Coordinator service is effectively the brain of the operation, receiving query requests from clients, parsing the query, building an execution plan, and then scheduling work to be done across many Worker services. Each Worker processes a part of the overall query in parallel, and you can add Worker services to your Presto deployment to fit your demand. Each data source is configured as a catalog, and you can query as many catalogs as you want in each query.
Presto is accessed through a JDBC driver and integrates with practically any tool that can connect to databases using JDBC. The Presto command line interface, or CLI, is often the starting point when beginning to explore Presto. Either way, the client connects to the Coordinator to issue a SQL query. That query is parsed and validated by the Coordinator, and built into a query execution plan. This plan details how a query is going to be executed by the Presto workers. The query plan (typically) begins with one or more table scans in order to pull data out of your external data stores. There are then a series of operators to perform projections, filters, joins, group bys, orders, and all kinds of other operations. The plan ends with the final result set being delivered to the client via the Coordinator. These query plans are vital to understanding how Presto executes your queries, as well as being able to dissect query performance and find any potential bottlenecks.
Presto query example
Let’s take a look at a query and corresponding query plan. I’ll use a TPC-H query, a common benchmarking tool used for SQL databases. In short, TPC-H defines a standard set of tables and queries in order to test SQL language completeness as well as a means to benchmark various databases. The data is designed for business use cases, containing sales orders of items that can be provided by a large number of supplies. Presto provides a TPC-H Connector that generates data on the fly — a very useful tool when checking out Presto.
SELECT
SUM(l.extendedprice*l.discount) AS revenue
FROM lineitem l
WHERE
l.shipdate >= DATE '1994-01-01'
AND l.shipdate < DATE '1994-01-01' + INTERVAL '1' YEAR
AND l.discount BETWEEN .06 - 0.01 AND .06 + 0.01
AND l.quantity < 24;
This is query number six, known as the Forecasting Revenue Change Query. Quoting the TPC-H documentation, “this query quantifies the amount of revenue increase that would have resulted from eliminating certain company-wide discounts in a given percentage range in a given year.”
Presto breaks a query into one or more stages, also called fragments, and each stage contains multiple operators. An operator is a particular function of the plan that is executed, either a scan, a filter, a join, or an exchange. Exchanges often break up the stages. An exchange is the part of the plan where data is sent across the network to other workers in the Presto cluster. This is how Presto manages to provide its scalability and performance — by splitting a query into multiple smaller operations that can be performed in parallel and allow data to be redistributed across the cluster to perform joins, group-bys, and ordering of data sets. Let’s look at the distributed query plan for this query. Note that query plans are read from the bottom up.
Fragment 0 [SINGLE]
- Output[revenue] => [sum:double]
revenue := sum
- Aggregate(FINAL) => [sum:double]
sum := "presto.default.sum"((sum_4))
- LocalExchange[SINGLE] () => [sum_4:double]
- RemoteSource[1] => [sum_4:double]
Fragment 1
- Aggregate(PARTIAL) => [sum_4:double]
sum_4 := "presto.default.sum"((expr))
- ScanFilterProject[table = TableHandle {connectorId='tpch', connectorHandle='lineitem:sf1.0', layout='Optional[lineitem:sf1.0]'}, grouped = false, filterPredicate = ((discount BETWEEN (DOUBLE 0.05) AND (DOUBLE 0.07)) AND ((quantity) < (DOUBLE 24.0))) AND (((shipdate) >= (DATE 1994-01-01)) AND ((shipdate) < (DATE 1995-01-01)))] => [expr:double]
expr := (extendedprice) * (discount)
extendedprice := tpch:extendedprice
discount := tpch:discount
shipdate := tpch:shipdate
quantity := tpch:quantity
This plan has two fragments containing several operators. Fragment 1 contains two operators. The ScanFilterProject scans data, selects the necessary columns (called projecting) needed to satisfy the predicates, and calculates the revenue lost due to the discount for each line item. Then a partial Aggregate operator calculates the partial sum. Fragment 0 contains a LocalExchange operator that receives the partial sums from Fragment 1, and then the final aggregate to calculate the final sum. The sum is then output to the client.
When executing the query, Presto scans data from the external data source in parallel, calculates the partial sum for each split, and then ships the result of that partial sum to a single worker so it can perform the final aggregation. Running this query, I get about $123,141,078.23 in lost revenue due to the discounts.
revenue
----------------------
1.2314107822830005E8
As queries grow more complex, such as joins and group-by operators, the query plans can get very long and complicated. With that said, queries break down into a series of operators that can be executed in parallel against data that is held in memory for the lifetime of the query.
As your data set grows, you can grow your Presto cluster in order to maintain the same expected runtimes. This performance, combined with the flexibility to query virtually any data source, can help empower your business to get more value from your data than ever before — all while keeping the data where it is and avoiding expensive transfers and engineering time to consolidate your data into one place for analysis. Presto!
Ashish Tadose is co-founder and principal software engineer at Ahana. Passionate about distributed systems, Ashish joined Ahana from WalmartLabs, where as principal engineer he built a multicloud data acceleration service powered by Presto while leading and architecting other products related to data discovery, federated query engines, and data governance. Previously, Ashish was a senior data architect at PubMatic where he designed and delivered a large-scale adtech data platform for reporting, analytics, and machine learning. Earlier in his career, he was a data engineer at VeriSign. Ashish is also an Apache committer and contributor to open source projects.
—
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to [email protected].