Introduction to Polars

Polars is a fast dataframe library designed to process large datasets with a lower memory footprint than pandas. It’s a good fit when performance and scalability are crucial. It is a newer library than pandas, however, so documentation may be sparse.

PolarsReader and PLLazyFrameReader Classes

Gluestick now supports two additional reader classes specifically designed for Polars:
  1. PolarsReader: Reads sync output into Polars DataFrames
  2. PLLazyFrameReader: Reads sync output into Polars LazyFrames
Both classes inherit from the base Reader class. For more information, see the Polars docs.

PLLazyFrameReader

A LazyFrame is an abstraction of a dataframe that streams data from your sync output, applies the relevant transformations, and then writes to your export format without ever loading the entire dataset into memory.
import gluestick as gs
import polars as pl

reader = gs.PLLazyFrameReader()

TENANT_ID = "TENANT_123"

for stream in reader.input_files.keys():
    lf = reader.get(stream, catalog_types=True)
    lf = lf.with_columns(
        pl.lit(TENANT_ID).alias("tenant_id")
    )
    gs.to_export(lf, stream, "./etl-output", keys=["id"])

PolarsReader

While LazyFrames are more efficient, certain operations can be trickier to implement. For small-data operations, you may be better off using the standard Polars DataFrame with PolarsReader:
import gluestick as gs
import polars as pl

reader = gs.PolarsReader()

TENANT_ID = "TENANT_123"

for stream in reader.input_files.keys():
    df = reader.get(stream, catalog_types=True)
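    # Polars DataFrames do not support pandas-style column assignment;
    # add the column with with_columns instead.
    df = df.with_columns(
        pl.lit(TENANT_ID).alias("tenant_id")
    )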
    gs.to_export(df, stream, "./etl-output", keys=["id"])