The Data Standard
“You can think of Presto almost like a database, but it’s more of an abstraction layer for a large group of databases” – Josh Odmark, CTO & Founder of Pandio.
Josh is an engineer who works across many different technologies, so it’s interesting to hear what he has to say about Presto and how he uses it on public museum data. He joins us to talk about how he uses Presto, the query engine from Facebook, and gives us a quick demo of his approach and the benefits he finds most valuable.
Presto is a distributed SQL query engine. It’s an open-source technology that originated at Facebook and can be adapted to anyone’s needs. It takes a different approach to querying data: traditionally, data is copied or moved into a warehouse before it can be queried, but Presto lets you query it in place.
Presto does need access to a data source, such as a flat file or a database. To demonstrate, Odmark downloaded data from the Museum of Modern Art’s open-access database. No special preparation is needed: simply download the file and open it.
The file does need a bit of ETL, but that’s not an issue, since Presto lets users run SQL directly against this kind of dataset, which opens up many options. The second dataset he used came from The Metropolitan Museum of Art. Both files are typical CSVs.
Using a one-click Presto install for AWS, you can have Presto running within AWS almost instantly. This allows users to run SQL commands against the different datasets that have been uploaded to AWS. Presto’s simplicity is a thing of beauty because users only have to set up the schema and point it to the desired files.
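To make “set up the schema and point it to the desired files” concrete, here is a minimal sketch that builds the kind of CREATE TABLE statement Presto’s Hive connector accepts for a CSV stored in S3. It is not the statement from the demo: the catalog, schema, table, column, and bucket names are all hypothetical, and header-row handling is left out.

```python
# Sketch only: builds a Presto CREATE TABLE statement for CSV files in S3.
# The catalog, schema, table, column, and bucket names are hypothetical.

def csv_table_ddl(table, columns, location, catalog="hive", schema="default"):
    """Return DDL that registers an existing directory of CSV files as a table."""
    # With the Hive connector's CSV format, columns are declared as VARCHAR;
    # any casting to numbers or dates happens later, in the queries themselves.
    cols = ",\n  ".join(f'"{name}" VARCHAR' for name in columns)
    return (
        f"CREATE TABLE {catalog}.{schema}.{table} (\n  {cols}\n)\n"
        f"WITH (format = 'CSV', external_location = '{location}')"
    )

moma_ddl = csv_table_ddl(
    "moma_artworks",
    ["title", "artist", "date"],
    "s3://example-bucket/moma/",
)
print(moma_ddl)
```

The data itself never moves: the statement only records where the files live and how to read them, which is why the setup stays so small.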
With the SHOW TABLES command, the user can easily see all of the added datasets listed as tables. These tables have the traditional tabular structure, with columns and data types. Josh used Presto to set up a table view of the datasets from these two New York museums.
Even though the two museums structure their data differently, Presto takes the raw files, presents them in a similar manner, and runs SQL queries against them. Odmark showed this with an example in which he ran a SQL query against both museum datasets, joining the two tables on artist names.
Because the museums store and structure their data differently, certain conditionals need to be added to the query, but this is what Presto is about. It gives you full SQL capabilities, making it easy to do the light ETL work that your queries need.
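The sketch below illustrates the kind of join described above, with conditionals that tidy up the artist names before comparing them. It uses Python’s built-in sqlite3 module as a local stand-in for Presto so that it runs anywhere; the table names, column names, and rows are invented for illustration and are not the museums’ real schemas or the query from the demo.

```python
import sqlite3

# SQLite stands in for Presto here; the tables and rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE moma (title TEXT, artist TEXT);
CREATE TABLE met (object_name TEXT, artist_display_name TEXT);
INSERT INTO moma VALUES ('Work A', 'Ada Example'), ('Work B', 'Bo Sample');
INSERT INTO met VALUES ('Piece X', ' ada example '),
                       ('Piece Y', 'Cy Placeholder'),
                       ('Piece Z', '');
""")

# The two sources name and format the artist column differently, so the
# join condition normalizes case and whitespace, and blank names are skipped.
JOIN_SQL = """
SELECT m.artist, m.title AS moma_title, t.object_name AS met_title
FROM moma AS m
JOIN met AS t
  ON lower(trim(m.artist)) = lower(trim(t.artist_display_name))
WHERE trim(m.artist) <> ''
ORDER BY m.artist, m.title, t.object_name
"""
rows = conn.execute(JOIN_SQL).fetchall()
print(rows)  # [('Ada Example', 'Work A', 'Piece X')]
```

The `lower` and `trim` functions are also available in Presto, so the join condition carries over with little change.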
Presto takes a couple of seconds to run a query against two datasets around 300MB in size. It goes through all the rows and columns to find the artist names and can return several types of results:
- Which artists have work displayed in both museums;
- Which artworks by a given artist are in one museum, and which pieces by the same artist are in the other.
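The two result types in the list above could be expressed roughly as follows. This is a sketch on invented rows, again using sqlite3 as a stand-in for Presto; the set operations it relies on (INTERSECT and UNION ALL) are standard SQL that Presto also supports.

```python
import sqlite3

# SQLite stands in for Presto; table contents are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE moma (title TEXT, artist TEXT);
CREATE TABLE met (title TEXT, artist TEXT);
INSERT INTO moma VALUES ('Work A', 'Ada Example'), ('Work B', 'Bo Sample');
INSERT INTO met VALUES ('Piece X', 'Ada Example'), ('Piece Y', 'Cy Placeholder');
""")

# Result type 1: artists that appear in both museums.
shared = [row[0] for row in conn.execute(
    "SELECT artist FROM moma INTERSECT SELECT artist FROM met ORDER BY artist"
)]

# Result type 2: for those shared artists, which piece sits in which museum.
PIECES_SQL = """
SELECT artist, museum, title FROM (
    SELECT artist, 'MoMA' AS museum, title FROM moma
    UNION ALL
    SELECT artist, 'Met' AS museum, title FROM met
) AS pieces
WHERE artist IN (SELECT artist FROM moma INTERSECT SELECT artist FROM met)
ORDER BY artist, museum, title
"""
pieces = conn.execute(PIECES_SQL).fetchall()

print(shared)
print(pieces)
```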
Presto has a query planner that adjusts the query that has been submitted. This allows the same query to run effectively against different clusters of data. Despite the complex processing going on in the background, execution is really fast.
Even though he used two files to simulate databases, Presto lets users add multiple databases and files, and it works in the same manner. Presto can connect to many different types of data sources and also join the data within them.
It can query data no matter where it is stored, including services like Cassandra or Hive. It works both with proprietary data stores and relational databases. Simply put, it has the unique ability to combine different data from various sources using queries.
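In Presto, each data source is exposed as a catalog, and tables are addressed as catalog.schema.table, so a single query can join tables that live in different systems. The sketch below is only a loose analogy: it attaches a second in-memory SQLite database to the first and joins across the two with qualified names. The database, table, and column names are hypothetical.

```python
import sqlite3

# The main connection plays the role of one data source...
conn = sqlite3.connect(":memory:")
# ...and an attached, separate in-memory database plays the role of another.
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")

conn.executescript("""
CREATE TABLE main.artists (artist_id INTEGER, name TEXT);
CREATE TABLE warehouse.artworks (artist_id INTEGER, title TEXT);
INSERT INTO main.artists VALUES (1, 'Ada Example'), (2, 'Bo Sample');
INSERT INTO warehouse.artworks VALUES (1, 'Work A'), (1, 'Work C'), (2, 'Work B');
""")

# One query, two separate databases, joined through qualified table names.
counts = conn.execute("""
SELECT a.name, count(*) AS n_works
FROM main.artists AS a
JOIN warehouse.artworks AS w ON w.artist_id = a.artist_id
GROUP BY a.name
ORDER BY a.name
""").fetchall()
print(counts)  # [('Ada Example', 2), ('Bo Sample', 1)]
```

In a real Presto deployment, the qualifiers would name configured catalogs (for example, one backed by Cassandra and one by Hive) rather than attached SQLite databases.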
This opens up many opportunities for organizations to quickly run essential analytics that give valuable answers. At the same time, it’s designed for analysts, developers, and engineers who need quick solutions.
Results come back in anywhere from a couple of seconds to a couple of minutes, depending on the size and number of the datasets. It’s the best of both worlds when it comes to analytics platforms.
Presto is both quick and free. Even though it already offers some significant benefits, we can expect to see even greater things in the future. As more people use this open-source platform, we will likely see new upgrades and changes that perfect its functionalities and expand its potential.
Make sure to check out the full podcast with Josh Odmark and Darren Kaplan at The Data Standard website.
Meet The Host
Co-Founder & Board Member of HiQ Labs
Darren Kaplan is a two-time founder and was recognized as one of the Top 20 Data Science Influencers in 2020. Darren is the co-creator of The Data Standard, the premier networking user community for data scientists, data engineers, and cybersecurity enthusiasts.
Meet The Guest
Chief Technology Officer & Founder at Pandio
Joshua Odmark is a seasoned full-stack engineer who has worked in product management, project management, AI, machine learning, data science, and DevOps. Odmark is the creator of iSell, iPhotographs, Furnished.com, Websitez.com LLC, Local Data Exchange, and Pandio.io.