Josh Odmark, Pandio.com CTO and Co-Founder, talks about Presto by Facebook
The Data Standard
Episode Summary
“You can think of Presto almost like a database, but it’s more of an abstraction layer for a large group of databases” – Josh Odmark, CTO & Founder of Pandio.
In this episode of The Data Standard, Darren Kaplan is joined by the CTO & Founder of Pandio, Josh Odmark. Josh is an expert full-stack engineer who has worked with PHP, Python, Ruby, JavaScript, and SQL.
Because he has worked with so many different technologies, it's interesting to hear what Josh has to say about Presto and how he uses it to query public museum data. Josh joins us to talk about how he uses Presto by Facebook, gives a quick demo of his approach, and explains the benefits he finds most valuable.
Presto is a distributed SQL query engine. It’s an open-source Facebook technology and can be adjusted to anyone’s needs. It gives users a completely different approach to querying data. Traditionally, data is copied or moved into some warehouse, but Presto doesn’t do that; it lets you query data in place.
Presto does need access to the underlying flat file or database. To demonstrate, Odmark downloaded the Museum of Modern Art's collection data from its open-access database. No special preparation is needed: simply download the file and open it.
The file does need a bit of ETL before it is clean, but that's not an issue, since Presto lets users run SQL directly against data sets like this one, which opens up many options. The second data set he used came from The Metropolitan Museum of Art. Both files are typical CSVs.
Using the AWS S3 Presto one-click install, you can run Presto within AWS right away. This allows users to run SQL commands against the different datasets that have been added to AWS. Presto's simplicity is a thing of beauty: users only have to set up the schema and point it at the desired files.
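To make "set up the schema and point it to the desired files" a little more concrete, here is a minimal sketch, not the setup used in the episode. It assumes the files are exposed through Presto's Hive connector, whose `format` and `external_location` table properties describe an external CSV table. The catalog, schema, table name, header columns, and bucket path are all hypothetical. Every column is declared VARCHAR, which is what the Hive connector expects for CSV tables.

```python
import csv
import io


def build_external_table_ddl(table, header_line, location):
    """Build a CREATE TABLE statement from a CSV header row.

    Every column is declared VARCHAR because the raw CSV is plain text;
    casts to numbers or dates can happen later, inside queries.
    """
    columns = next(csv.reader(io.StringIO(header_line)))
    # Turn header labels such as "Artist Display Name" into SQL identifiers.
    idents = [c.strip().lower().replace(" ", "_") for c in columns]
    column_sql = ",\n".join(f"  {name} VARCHAR" for name in idents)
    return (
        f"CREATE TABLE {table} (\n{column_sql}\n)\n"
        f"WITH (format = 'CSV', external_location = '{location}')"
    )


# Hypothetical table name, header, and bucket path, for illustration only.
ddl = build_external_table_ddl(
    "hive.museums.moma_artworks",
    "Title,Artist,Medium",
    "s3://example-bucket/moma/",
)
print(ddl)
```

The generated statement would then be run from any Presto client. Nothing is copied or loaded: the table definition only records where the files live and how to read them.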
With the `SHOW TABLES` command, the user can see all of the added datasets listed as tables. Each table has a traditional tabular structure, with named columns and data types. Josh used Presto to set up table views of the datasets from these two New York museums.
Even though the two museums structure their data differently, Presto takes the raw files, exposes them as tables in the same way, and runs SQL queries against them. Odmark showed this with an example in which he ran a single SQL query against both museum datasets, joining the two tables on artist name.
Presto runs the query effectively despite the differences in how each museum stores its data. Some extra conditions have to be added to the query to reconcile the two formats, but this is what Presto is about: it gives you full SQL capabilities, so the light ETL needed to make the data queryable can be done in the query itself.
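The kind of join described above can be sketched as follows. This is an illustration under stated assumptions, not the query from the demo: Python's built-in `sqlite3` module stands in for a Presto cluster so the example runs anywhere, the table and column names are hypothetical, and the rows are made-up placeholders rather than real museum records. The `lower(trim(...))` normalization in the join condition is one example of the extra conditions needed when two sources format names differently.

```python
import sqlite3

# In-memory SQLite database standing in for the two museum tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE moma (title TEXT, artist TEXT);
    CREATE TABLE met (title TEXT, artist_display_name TEXT);
""")
# Placeholder rows; note the inconsistent casing and whitespace.
conn.executemany("INSERT INTO moma VALUES (?, ?)", [
    ("Piece 1", "Ada Example"),
    ("Piece 2", "Ben Sample"),
])
conn.executemany("INSERT INTO met VALUES (?, ?)", [
    ("Piece 3", "  ada example "),
    ("Piece 4", "Cy Placeholder"),
])

# Join on a normalized artist name so formatting differences do not
# hide matches; skip rows with an empty artist field.
shared_artists = [row[0] for row in conn.execute("""
    SELECT DISTINCT moma.artist
    FROM moma
    JOIN met
      ON lower(trim(moma.artist)) = lower(trim(met.artist_display_name))
    WHERE trim(moma.artist) <> ''
    ORDER BY moma.artist
""")]
print(shared_artists)
```

Both `lower` and `trim` are also available in Presto, so the join condition itself should carry over; only the connection code is specific to SQLite.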
Presto takes a couple of seconds to run a query against two datasets around 300MB in size. It scans all the rows to match artist names and can return several kinds of results:
- Which artists have work in both museums;
- Which pieces by a given artist are in one museum, and which pieces by the same artist are in the other.
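The second kind of result in the list above can be sketched as a single query that lists, for each shared artist, which pieces sit in which museum. As before, this is a hedged illustration: `sqlite3` stands in for Presto, the table and column names are hypothetical, and the rows are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE moma (title TEXT, artist TEXT);
    CREATE TABLE met (title TEXT, artist TEXT);
    INSERT INTO moma VALUES ('Piece 1', 'Ada Example'), ('Piece 2', 'Ben Sample');
    INSERT INTO met VALUES ('Piece 3', 'Ada Example'), ('Piece 4', 'Cy Placeholder');
""")

# For artists present in both tables, list each piece with its museum.
works_by_museum = conn.execute("""
    SELECT artist, 'MoMA' AS museum, title FROM moma
    WHERE artist IN (SELECT artist FROM met)
    UNION ALL
    SELECT artist, 'The Met' AS museum, title FROM met
    WHERE artist IN (SELECT artist FROM moma)
    ORDER BY artist, museum
""").fetchall()

# The first kind of result (artists in both museums) falls out of the same rows.
artists_in_both = sorted({artist for artist, _, _ in works_by_museum})

for artist, museum, title in works_by_museum:
    print(f"{artist}: {title} ({museum})")
```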
Meet The Host
Darren Kaplan
Co-Founder & Board Member of HiQ Labs
Darren Kaplan is a two-time founder and was recognized as one of the Top 20 Data Science Influencers in 2020. Darren is the co-creator of The Data Standard, the premier networking community for data science, data engineering, and cybersecurity enthusiasts.
Meet The Guest
Josh Odmark
Chief Technology Officer & Founder at Pandio
Joshua Odmark is a seasoned full-stack engineer who has worked in product management, project management, AI, machine learning, data science, and DevOps. Odmark is the creator of iSell, iPhotographs, Furnished.com, Websitez.com LLC, Local Data Exchange, and Pandio.io.