Say No to Pandas

10 min read

A head-to-head benchmark comparing pandas, polars, and duckdb. See why modern DataFrame libraries deliver 5-20x speedups through lazy evaluation, predicate pushdown, and automatic parallelization.

It’s Time to Say Goodbye to Pandas

Pandas revolutionized data analysis in Python. It made working with tabular data intuitive and accessible. But pandas was designed in 2008 for a different era—single-threaded execution, eager evaluation, and datasets that fit comfortably in memory.

Today’s data is bigger. Today’s machines have more cores. And pandas can’t take advantage of either.

The problems with pandas:

  • Single-threaded: Your 8-core machine sits mostly idle
  • Eager evaluation: Every operation executes immediately, even if you’ll filter 90% away later
  • No query optimization: Operations run in the order you write them, not the optimal order
  • Memory inefficient: Loads everything into RAM, often making multiple copies
  • Weak types: The dreaded object dtype, nullable integer awkwardness
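The object-dtype trap is easy to demonstrate. This toy snippet (not part of the benchmark) shows pandas falling back to generic Python objects and silently upcasting integers:

```python
import pandas as pd

# Strings land in the catch-all 'object' dtype, so every operation
# falls back to a Python-level loop instead of a vectorized kernel
s = pd.Series(["1.50", "2.75", "3.00"])
print(s.dtype)  # object

# A single missing value silently upcasts an integer column to float
s2 = pd.Series([1, 2, None])
print(s2.dtype)  # float64
```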

Enter Polars and DuckDB—modern DataFrame libraries built for modern hardware. They offer:

  • Lazy evaluation: Build a query plan, let the optimizer figure out the best execution
  • Predicate pushdown: Filter data at the source before loading into memory
  • Projection pushdown: Only read the columns you actually need
  • Automatic parallelization: Use all your CPU cores by default
  • Strong type systems: Know exactly what you’re working with

In this post, we’ll benchmark all three libraries on the Iowa Liquor Sales dataset—a real-world transactional dataset with millions of rows. Every operation is timed, so you can see exactly how much performance you’re leaving on the table with pandas.


Setup and Timing Infrastructure

First, let’s set up our environment and create a consistent timing mechanism. Every benchmark will use this same wrapper so results are directly comparable.

import time
import pandas as pd
import polars as pl
import duckdb
from contextlib import contextmanager
 
# Configure your data path here
DATA_PATH = "/Users/bswr/smithy/portfolio/data/artifacts/iowa_liquor_sales_monthly/"
 
# Timing results stored for final comparison
BENCHMARKS = {}
 
@contextmanager
def benchmark(name: str, library: str):
    """Context manager to time operations consistently."""
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    if name not in BENCHMARKS:
        BENCHMARKS[name] = {}
    BENCHMARKS[name][library] = elapsed
    print(f"{library}: {elapsed:.4f}s")

Type System Comparison

One of the most frustrating aspects of pandas is its loose type system. Strings become object, nullable integers require special dtypes, and type inference can silently produce unexpected results.

Let’s see how each library handles schema inference on our parquet file:

%%time
# Pandas: Notice the 'object' dtype for strings
print("=== Pandas Schema ===")
df_temp = pd.read_parquet(DATA_PATH)
print(df_temp.dtypes)
print(f"\nShape: {df_temp.shape}")
del df_temp
=== Pandas Schema ===
invoice_and_item_number     object
date                        object
store_number                 int64
store_name                  object
address                     object
city                        object
zip_code                    object
store_location              object
county_number              float64
county                      object
category                   float64
category_name               object
vendor_number              float64
vendor_name                 object
item_number                float64
item_description            object
pack                         int64
bottle_volume_ml             int64
state_bottle_cost           object
state_bottle_retail         object
bottles_sold                 int64
sale_dollars                object
volume_sold_liters          object
volume_sold_gallons         object
dtype: object

Shape: (32629654, 24)
CPU times: user 33 s, sys: 8.24 s, total: 41.2 s
Wall time: 31.5 s
%%time
# Polars: Explicit types - String for strings, Date for dates, Decimals preserved
print("=== Polars Schema ===")
print(pl.scan_parquet(DATA_PATH).collect_schema())
=== Polars Schema ===
Schema({'invoice_and_item_number': String, 'date': Date, 'store_number': Int64, 'store_name': String, 'address': String, 'city': String, 'zip_code': String, 'store_location': String, 'county_number': Int64, 'county': String, 'category': Int64, 'category_name': String, 'vendor_number': Int64, 'vendor_name': String, 'item_number': Int64, 'item_description': String, 'pack': Int64, 'bottle_volume_ml': Int64, 'state_bottle_cost': Decimal(precision=38, scale=9), 'state_bottle_retail': Decimal(precision=38, scale=9), 'bottles_sold': Int64, 'sale_dollars': Decimal(precision=38, scale=9), 'volume_sold_liters': Decimal(precision=38, scale=9), 'volume_sold_gallons': Decimal(precision=38, scale=9)})
CPU times: user 2.51 ms, sys: 8.79 ms, total: 11.3 ms
Wall time: 28.6 ms
%%time
# DuckDB: SQL-style types with DESCRIBE
# Note: use a read_parquet() glob here - a bare directory string with '*'
# is treated as a table name and raises a CatalogException
print("=== DuckDB Schema ===")
print(duckdb.sql(f"DESCRIBE SELECT * FROM read_parquet('{DATA_PATH}*.parquet')"))

Benchmark #1: Full Data Read

Our first benchmark is simple: read the entire parquet file into memory. This establishes a baseline and shows how each library handles raw I/O.

print("=== Benchmark #1: Full Data Read ===\n")
 
with benchmark("full_read", "pandas"):
    df_pandas = pd.read_parquet(DATA_PATH)
 
with benchmark("full_read", "polars"):
    df_polars = pl.read_parquet(DATA_PATH)
 
with benchmark("full_read", "duckdb"):
    df_duckdb = duckdb.read_parquet(DATA_PATH + "*").df()
 
print(f"\nRows loaded: {len(df_pandas):,}")
=== Benchmark #1: Full Data Read ===

pandas: 27.9592s
polars: 2.6359s
duckdb: 59.7854s

Rows loaded: 32,629,654

Lazy Evaluation vs Eager Execution

Here’s where modern libraries shine. Pandas executes every operation immediately—even if you’re about to throw away 90% of the data. Polars and DuckDB build a query plan first, then optimize before executing.

Watch how “scanning” a parquet file is instant—no data is loaded until you call .collect() or execute the query:

# Polars lazy scan - builds query plan only (instant!)
start = time.perf_counter()
lf_polars = pl.scan_parquet(DATA_PATH)
print(f"Polars scan_parquet: {time.perf_counter() - start:.6f}s (no data loaded!)")
print(f"Type: {type(lf_polars).__name__}")
 
# DuckDB relation - also lazy
start = time.perf_counter()
rel_duckdb = duckdb.read_parquet(DATA_PATH+"*")
print(f"DuckDB read_parquet: {time.perf_counter() - start:.6f}s (lazy relation)")
print(f"Type: {type(rel_duckdb).__name__}")
Polars scan_parquet: 0.000509s (no data loaded!)
Type: LazyFrame
DuckDB read_parquet: 0.007091s (lazy relation)
Type: DuckDBPyRelation
# Polars shows you the query plan before execution
query = (
    pl.scan_parquet(DATA_PATH)
    .filter(pl.col("category_name") == "VODKA")
    .select(["store_name", "sale_dollars"])
    .group_by("store_name")
    .agg(pl.col("sale_dollars").sum())
)
 
print("=== Polars Query Plan ===")
print(query.explain())
=== Polars Query Plan ===
AGGREGATE
	[col("sale_dollars").sum()] BY [col("store_name")] FROM
  simple π 2/3 ["store_name", "sale_dollars"]
    Parquet SCAN [/Users/bswr/smithy/portfolio/data/artifacts/iowa_liquor_sales_monthly/iowa_liquor_sales_2012-01.parquet, ... 165 other sources]
    PROJECT 3/24 COLUMNS
    SELECTION: [(col("category_name")) == (String(VODKA))]

Benchmark #2: Predicate Pushdown (Filtered Read)

Predicate pushdown means pushing filter conditions down to the data source. Instead of loading all data and then filtering, the filter is applied during the read—so only matching rows ever enter memory.

Parquet files store data in row groups with statistics (min/max values). Smart libraries skip entire row groups that can’t match your filter. Pandas doesn’t do this.

print("=== Benchmark #2: Filtered Read (Predicate Pushdown) ===\n")
 
# Pandas: Must load ALL data, then filter
with benchmark("filter_category", "pandas"):
    df = pd.read_parquet(DATA_PATH)
    result_pandas = df[df["category_name"] == "VODKA"]
    del df
 
# Polars: Pushes filter to parquet reader
with benchmark("filter_category", "polars"):
    result_polars = (
        pl.scan_parquet(DATA_PATH)
        .filter(pl.col("category_name") == "VODKA")
        .collect()
    )
 
# DuckDB: Same pushdown via SQL
with benchmark("filter_category", "duckdb"):
    result_duckdb = duckdb.sql(
        f"SELECT * FROM read_parquet('{DATA_PATH}*.parquet') WHERE category_name = 'VODKA'"
    ).df()
 
print(f"\nFiltered rows: {len(result_pandas):,} (VODKA only)")
=== Benchmark #2: Filtered Read (Predicate Pushdown) ===

Benchmark #3: Projection Pushdown (Column Selection)

Projection pushdown means only reading the columns you need. Parquet is a columnar format—columns are stored separately on disk. If you only need 2 columns out of 20, why read the other 18?

Polars and DuckDB automatically detect which columns your query uses and only read those from disk.

print("=== Benchmark #3: Column Selection (Projection Pushdown) ===\n")
 
# Pandas: the 'columns' param lets pyarrow skip unneeded columns - pandas' best case
with benchmark("select_columns", "pandas"):
    result_pandas = pd.read_parquet(DATA_PATH, columns=["store_name", "sale_dollars"])
 
# Polars: Only reads 2 columns from disk
with benchmark("select_columns", "polars"):
    result_polars = (
        pl.scan_parquet(DATA_PATH)
        .select(["store_name", "sale_dollars"])
        .collect()
    )
 
# DuckDB: Same projection pushdown
with benchmark("select_columns", "duckdb"):
    result_duckdb = duckdb.sql(
        f"SELECT store_name, sale_dollars FROM read_parquet('{DATA_PATH}*.parquet')"
    ).df()
 
print(f"\nColumns selected: {list(result_pandas.columns)}")

Benchmark #4: Expressions vs Apply

The apply() function is pandas’ escape hatch for row-wise operations. It’s also a massive performance trap—it drops down to Python loops, bypassing all vectorization.

Even when pandas users write “proper” vectorized code, Polars and DuckDB still win because their expression engines are compiled and optimized.

print("=== Benchmark #4: Row Calculations ===\n")
 
# First, let's see how bad apply() really is (on a sample to not wait forever)
sample_size = 100_000
df_sample = df_pandas.head(sample_size).copy()
 
print(f"Testing on {sample_size:,} rows first (apply is SLOW)...\n")
 
with benchmark("row_calc_apply", "pandas"):
    df_sample["margin"] = df_sample.apply(
        lambda row: row["sale_dollars"] - row["state_bottle_cost"], axis=1
    )
print("Now vectorized operations on FULL dataset:\n")
 
# Pandas vectorized (the right way)
with benchmark("row_calc_vectorized", "pandas"):
    df_pandas["margin"] = df_pandas["sale_dollars"] - df_pandas["state_bottle_cost"]
 
# Polars expression
with benchmark("row_calc_vectorized", "polars"):
    df_polars = df_polars.with_columns(
        (pl.col("sale_dollars") - pl.col("state_bottle_cost")).alias("margin")
    )
 
# DuckDB SQL expression
with benchmark("row_calc_vectorized", "duckdb"):
    result = duckdb.sql("""
        SELECT *, sale_dollars - state_bottle_cost AS margin 
        FROM df_polars
    """).df()

Benchmark #5: Window Functions (Parallel Column Operations)

Window functions compute aggregates over groups while keeping all rows. In pandas, you use .groupby().transform(). In Polars, you use .over().

The key difference: Polars executes multiple window operations in parallel—each column computation runs on a separate thread.

print("=== Benchmark #5: Window Functions ===\n")
 
# Pandas: Sequential window operations
with benchmark("window_ops", "pandas"):
    df_temp = df_pandas.copy()
    df_temp["store_total"] = df_temp.groupby("store_name")["sale_dollars"].transform("sum")
    df_temp["category_avg"] = df_temp.groupby("category_name")["bottles_sold"].transform("mean")
    del df_temp
 
# Polars: Parallel window operations with .over()
with benchmark("window_ops", "polars"):
    result_polars = df_polars.with_columns([
        pl.col("sale_dollars").sum().over("store_name").alias("store_total"),
        pl.col("bottles_sold").mean().over("category_name").alias("category_avg"),
    ])
 
# DuckDB: Window functions via SQL
with benchmark("window_ops", "duckdb"):
    result_duckdb = duckdb.sql("""
        SELECT *,
            SUM(sale_dollars) OVER (PARTITION BY store_name) as store_total,
            AVG(bottles_sold) OVER (PARTITION BY category_name) as category_avg
        FROM df_polars
    """).df()

Benchmark #6: Complex Pipeline (Method Chaining)

Real data work involves multi-step transformations. Let’s compare a typical pipeline: extract year from date, group by year and category, aggregate sales, and sort.

Notice how Polars method chaining reads naturally top-to-bottom, while pandas often requires intermediate variables or awkward patterns like .assign().

print("=== Benchmark #6: Complex Pipeline ===\n")
 
# Pandas: Method chaining gets awkward
with benchmark("complex_pipeline", "pandas"):
    result_pandas = (
        df_pandas
        .assign(year=pd.to_datetime(df_pandas["date"]).dt.year)  # date is object dtype in pandas, so convert first
        .groupby(["year", "category_name"])
        .agg({"sale_dollars": "sum"})
        .reset_index()
        .sort_values("sale_dollars", ascending=False)
    )
 
# Polars: Natural chaining with lazy evaluation
with benchmark("complex_pipeline", "polars"):
    result_polars = (
        df_polars.lazy()
        .with_columns(pl.col("date").dt.year().alias("year"))
        .group_by(["year", "category_name"])
        .agg(pl.col("sale_dollars").sum())
        .sort("sale_dollars", descending=True)
        .collect()
    )
 
# DuckDB: SQL is naturally declarative
with benchmark("complex_pipeline", "duckdb"):
    result_duckdb = duckdb.sql("""
        SELECT 
            YEAR(date) as year, 
            category_name, 
            SUM(sale_dollars) as sale_dollars
        FROM df_polars 
        GROUP BY 1, 2 
        ORDER BY sale_dollars DESC
    """).df()
 
print(f"\nTop 5 results:")
print(result_polars.head())

Benchmark #7: Group-By Aggregation

Group-by operations are the bread and butter of data analysis. Here we compute multiple aggregations per group—total sales, average bottles sold, and transaction count—grouped by store and category.

print("=== Benchmark #7: Group-By Aggregation ===\n")
 
# Pandas
with benchmark("groupby_agg", "pandas"):
    result_pandas = (
        df_pandas
        .groupby(["store_name", "category_name"])
        .agg(
            total_sales=("sale_dollars", "sum"),
            avg_bottles=("bottles_sold", "mean"),
            txn_count=("invoice_and_item_number", "count")
        )
        .reset_index()
        .sort_values("total_sales", ascending=False)
    )
 
# Polars
with benchmark("groupby_agg", "polars"):
    result_polars = (
        df_polars.lazy()
        .group_by(["store_name", "category_name"])
        .agg([
            pl.col("sale_dollars").sum().alias("total_sales"),
            pl.col("bottles_sold").mean().alias("avg_bottles"),
            pl.col("invoice_and_item_number").count().alias("txn_count"),
        ])
        .sort("total_sales", descending=True)
        .collect()
    )
 
# DuckDB
with benchmark("groupby_agg", "duckdb"):
    result_duckdb = duckdb.sql("""
        SELECT 
            store_name, 
            category_name,
            SUM(sale_dollars) as total_sales,
            AVG(bottles_sold) as avg_bottles,
            COUNT(invoice_and_item_number) as txn_count
        FROM df_polars 
        GROUP BY 1, 2 
        ORDER BY total_sales DESC
    """).df()
 
print(f"\nUnique store-category combinations: {len(result_pandas):,}")

Results Summary

Let’s compile all our benchmark results into a single comparison table, including speedup ratios relative to pandas:

print("=== BENCHMARK RESULTS ===\n")
 
results = []
for op, times in BENCHMARKS.items():
    row = {"operation": op}
    row["pandas (s)"] = f"{times.get('pandas', 0):.4f}"
    row["polars (s)"] = f"{times.get('polars', 0):.4f}"
    row["duckdb (s)"] = f"{times.get('duckdb', 0):.4f}"
    
    if "pandas" in times:
        pandas_time = times["pandas"]
        polars_time = times.get("polars", pandas_time)
        duckdb_time = times.get("duckdb", pandas_time)
        row["polars speedup"] = f"{pandas_time / polars_time:.1f}x"
        row["duckdb speedup"] = f"{pandas_time / duckdb_time:.1f}x"
    results.append(row)
 
summary_df = pd.DataFrame(results)
print(summary_df.to_markdown(index=False))

Feature Comparison

Beyond raw speed, here’s how the three libraries compare on key features:

| Feature | Pandas | Polars | DuckDB |
|---|---|---|---|
| Type system | Weak (object dtype) | Strong (explicit types) | Strong (SQL types) |
| Lazy evaluation | No | Yes | Yes |
| Predicate pushdown | No | Yes | Yes |
| Projection pushdown | Partial | Yes | Yes |
| Parallel execution | No | Yes (automatic) | Yes (automatic) |
| Method chaining | Awkward | Natural | N/A (SQL) |
| Memory efficiency | Poor | Excellent | Excellent |
| Query optimization | No | Yes | Yes |

Conclusion

The benchmarks speak for themselves. On most operations, Polars and DuckDB outperform pandas by 5-20x. Not because they're magic, but because they're designed for modern hardware and modern data sizes. The one exception is DuckDB's full read, where converting 32 million rows back into a pandas DataFrame dominates the runtime; the win comes from staying inside the engine, not round-tripping through pandas.

When to Use What

Use Polars when:

  • Building data pipelines in Python
  • You want a DataFrame API (similar to pandas but better)
  • You need method chaining and lazy evaluation
  • Your data fits in memory (or can be streamed)

Use DuckDB when:

  • You prefer SQL or need SQL compatibility
  • You’re doing ad-hoc analysis
  • You want to query parquet/CSV files directly without loading
  • You need to integrate with existing SQL workflows

Use Pandas when:

  • Working with legacy codebases
  • Data is small (<100MB) and simplicity matters more than speed
  • You need a specific pandas-only library integration

The Migration Path

The good news: migrating is easier than you think.

  1. Start with new projects - Use Polars or DuckDB for new work
  2. Identify bottlenecks - Profile your pandas code, find the slow parts
  3. Convert hot paths - Replace slow pandas operations with Polars/DuckDB
  4. Polars has pandas interop - df.to_pandas() and pl.from_pandas() make gradual migration painless


Stop leaving performance on the table. Your data deserves better than pandas.