The Real Python Podcast

Wes McKinney on Improving the Data Stack & Composable Systems

2024-02-23

How do you avoid the bottlenecks of data processing systems? Is it possible to build tools that decouple storage and computation? This week on the show, creator of the pandas library Wes McKinney is here to discuss Apache Arrow, composable data systems, and community collaboration.

Wes briefly describes the humble beginnings of the pandas project in 2008 and its move to open source in 2009. Since then, he’s been thinking about improvements across the data processing ecosystem.

Wes collaborated with members of the broader data science community to build the in-memory analytics infrastructure of Apache Arrow. Arrow avoids the bottlenecks of repeated data serialization and format conversion. He shares examples of Arrow’s use across the spectrum in tools like Polars and DuckDB.
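To make the serialization point concrete, here is a toy sketch using only the Python standard library. It is not Arrow's API: the `array` column and the `memoryview` hand-off are stand-ins chosen for illustration, and `pickle` plays the role of a serialize-and-copy step between two tools. The idea it shows is the one described above: sharing a contiguous column buffer lets a consumer read the data without a serialize, copy, and deserialize round trip.

```python
import pickle
from array import array

# One column of 64-bit floats stored contiguously in memory.
# (Illustrative stand-in for a columnar buffer; not Arrow's format.)
prices = array("d", [10.0, 12.5, 9.75, 11.0])

# Hand-off 1: serialize and deserialize. The consumer receives an
# independent copy, paying for encoding, copying, and decoding.
copied = pickle.loads(pickle.dumps(prices))

# Hand-off 2: share the buffer. memoryview exposes the same memory
# through the buffer protocol, so no bytes are copied.
shared = memoryview(prices)

# Updating the producer's column shows which consumer shares memory.
prices[0] = 99.0
copy_sees_update = copied[0] == 99.0
view_sees_update = shared[0] == 99.0

# Slicing a memoryview is also copy-free; only the sum reads the values.
total = sum(shared[1:])

print(copy_sees_update, view_sees_update, total)
```

The copied column no longer tracks the producer, while the shared view does. Arrow applies the same principle with a standardized columnar layout, so that different tools can agree on what the shared bytes mean.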

Wes advocates moving from vertically integrated tools toward composable data systems. We discuss his work on Ibis, a portable dataframe API for data manipulation and exploration in Python. Ibis supports multiple backends by decoupling the API from the execution engine.
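The decoupling of API and execution engine can be sketched in a few lines. The following is a hypothetical, minimal illustration and not Ibis's actual API: the names `SumWhere`, `PythonBackend`, and `SQLiteBackend` are invented for this example. A single backend-neutral query description is executed by two different engines, one in pure Python and one that compiles the query to SQL for SQLite.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class SumWhere:
    """Backend-neutral query: sum one column over rows passing a filter."""
    table: str
    filter_column: str
    threshold: float
    sum_column: str


class PythonBackend:
    """Hypothetical engine that evaluates the query over lists of dicts."""

    def __init__(self, tables):
        self.tables = tables

    def execute(self, query):
        rows = self.tables[query.table]
        return sum(
            row[query.sum_column]
            for row in rows
            if row[query.filter_column] > query.threshold
        )


class SQLiteBackend:
    """Hypothetical engine that compiles the same query to SQL."""

    def __init__(self, tables):
        self.conn = sqlite3.connect(":memory:")
        for name, rows in tables.items():
            columns = list(rows[0])
            column_sql = ", ".join(f'"{c}"' for c in columns)
            placeholders = ", ".join("?" for _ in columns)
            self.conn.execute(f'CREATE TABLE "{name}" ({column_sql})')
            self.conn.executemany(
                f'INSERT INTO "{name}" VALUES ({placeholders})',
                [tuple(row[c] for c in columns) for row in rows],
            )

    def execute(self, query):
        sql = (
            f'SELECT SUM("{query.sum_column}") FROM "{query.table}" '
            f'WHERE "{query.filter_column}" > ?'
        )
        return self.conn.execute(sql, (query.threshold,)).fetchone()[0]


# Made-up sample data for the illustration.
tables = {
    "trips": [
        {"distance": 1.5, "fare": 7},
        {"distance": 4.0, "fare": 15},
        {"distance": 9.0, "fare": 30},
    ]
}

# The query is written once and never mentions an engine.
query = SumWhere("trips", "distance", 2.0, "fare")

results = {
    type(backend).__name__: backend.execute(query)
    for backend in (PythonBackend(tables), SQLiteBackend(tables))
}
print(results)
```

Because the query object carries no engine-specific detail, swapping the backend changes where the work runs but not the code that describes it. A real portable dataframe library offers a far richer expression language, but the separation is the same in spirit.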

This week’s episode is brought to you by Posit Connect.

Course Spotlight: Unleashing the Power of the Console With Rich

Rich is a powerful library for creating text-based user interfaces (TUIs) in Python. It enhances code readability by pretty-printing complex data structures and adds visual appeal with colored text, tables, animations, and more.

Topics:

00:00:00 – Introduction
00:02:26 – Dealing with limitations in early data science
00:04:53 – Making pandas open source
00:07:10 – Making changes to an existing platform
00:12:34 – Decoupling storage and computation
00:23:04 – Sponsor: Posit Connect
00:23:54 – Apache Arrow solving multiple issues
00:27:40 – DuckDB efficient analytic SQL database
00:30:24 – Polars dataframe library
00:31:04 – pandas 2.0 adding Arrow
00:35:56 – Video Course Spotlight
00:37:20 – Apache Software Foundation background
00:41:29 – Shifting from developer to organizer and collaborator
00:45:56 – Creating a portable query layer with Ibis
00:55:34 – Casualties of the language wars
00:57:57 – What’s your role at Posit?
01:01:23 – What are you excited about in the world of Python?
01:04:52 – What do you want to learn next?
01:06:21 – How can people follow your work online?
01:08:20 – Thanks and goodbye

Show Links:

Wes McKinney - Personal Website
Wes McKinney - The Road to Composable Data Systems: Thoughts on the Last 15 Years and the Future
Wes McKinney - Leveling Up the Data Stack: Thoughts on the Last 15 Years - YouTube
Apache Hadoop
Cloudera - The hybrid data company
Wes McKinney - Apache Arrow and the “10 Things I Hate About pandas”
Voltron Data - The Leading Designer and Builder of Enterprise Data Systems
Apache Arrow
DuckDB - An in-process SQL OLAP database management system
DuckDB-Wasm - Efficient Analytical SQL in the Browser
Polars - Dataframes for the new era
pandas 2.2.0 documentation
Episode #167: Exploring pandas 2.0 & Targets for Apache Arrow – The Real Python Podcast
ASF - Welcome to The Apache Software Foundation!
Ursa Labs Blog
Ibis - The Portable Python dataframe Library
Python dataframe interchange protocol
Hadley Wickham
Rust Programming Language
italki - Best language learning app with certificated tutors
Wes McKinney - LinkedIn
Wes McKinney (@wesmckinn) - X
Posit - The Open-Source Data Science Company