Reader

A class for reading ETL files from the sync-output directory. This is the recommended way to read ETL files, as it provides consistent error handling and metadata access.

Import

from gluestick.reader import Reader

Basic Usage

# Initialize with default directories
reader = Reader()  # Uses ROOT_DIR/sync-output

# Get available streams
print(reader)  # Shows list of available streams

Key Methods

get(stream, default=None, catalog_types=False, **kwargs)

Reads the selected file into a pandas DataFrame.

# Basic reading
df = reader.get("users")

# Using catalog types
df = reader.get("users", catalog_types=True)
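The default argument in the signature above is not demonstrated here; assuming it behaves like dict.get and is returned when the stream is absent, a fallback might look like this:

import pandas as pd

# Hedged sketch: fall back to an empty DataFrame when the "users"
# stream is missing (assumes `default` is returned in that case)
df = reader.get("users", default=pd.DataFrame())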

get_metadata(stream)

Retrieves metadata from the stream's parquet file.

metadata = reader.get_metadata("users")
# Returns dict of metadata key-value pairs

get_pk(stream)

Gets primary key(s) from parquet file metadata.

primary_keys = reader.get_pk("users")
# Returns list of primary key column names

read_csv_folder

Legacy function for reading multiple CSV files. Consider using the Reader class instead.

Usage

from gluestick.etl_utils import read_csv_folder

# Basic usage
entity_data = read_csv_folder(INPUT_DIR)

Parameters

  • path (str): Directory containing CSV files or path to single CSV file
  • converters (dict): Entity-specific column converters
  • index_cols (dict): Entity-specific index columns
  • ignore (list): Files to ignore

Returns

Dictionary of pandas DataFrames, keyed by entity name
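
The optional parameters are not exercised in the basic example above; a hedged sketch, with hypothetical entity names and filenames, might look like this:

from gluestick.etl_utils import read_csv_folder

# Hypothetical entities: parse the invoices "id" column as strings,
# index users by "id", and skip a file we don't want loaded
entity_data = read_csv_folder(
    INPUT_DIR,
    converters={"invoices": {"id": str}},
    index_cols={"users": "id"},
    ignore=["lookup.csv"],
)
users_df = entity_data["users"]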

read_parquet_folder

Legacy function for reading multiple parquet files. Consider using the Reader class instead.

Usage

from gluestick.etl_utils import read_parquet_folder

# Basic usage
entity_data = read_parquet_folder(INPUT_DIR)

Parameters

  • path (str): Directory containing parquet files or path to single file
  • ignore (list): Files to ignore

Returns

Dictionary of pandas DataFrames, keyed by entity name
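
As with read_csv_folder, individual files can be skipped via ignore; the filename below is hypothetical:

from gluestick.etl_utils import read_parquet_folder

# Skip a parquet file we don't want loaded
entity_data = read_parquet_folder(INPUT_DIR, ignore=["audit_log.parquet"])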

get_catalog_schema

Retrieves a stream's schema from the Singer catalog file.

Usage

from gluestick.singer import get_catalog_schema, to_singer

# Get schema for specific stream
schema = get_catalog_schema("users")

# Use schema with Singer export
to_singer(df, "users", OUTPUT_DIR, schema=schema)

Returns

Dictionary containing the stream’s schema definition

Notes

  • Requires catalog.json in root directory
  • Raises exception if stream not found in catalog
  • Filters schema to include only type and properties
  • Ensures array types have items dictionary
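
Since a missing stream raises (per the notes above) and the exact exception type is not documented here, a defensive sketch catches broadly:

from gluestick.singer import get_catalog_schema

try:
    schema = get_catalog_schema("users")
except Exception:
    schema = None  # "users" not present in catalog.json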

Common Patterns

Iterating through multiple streams

etl.py
import gluestick as gs

# iterate through each stream in the input directory
reader = gs.Reader()
for key in eval(str(reader)):  # str(reader) is the list of available streams

    # define a dataframe and apply transformations
    input_df = reader.get(key, catalog_types=True)
    input_df["tenant"] = TENANT_ID

    # define the primary keys (assuming the data is Parquet)
    key_properties = reader.get_pk(key)

    # write the data out to the output directory
    gs.to_export(input_df, key, OUTPUT_DIR, keys=key_properties)
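
The loop above assumes every stream is Parquet; if some arrive as CSV, get_pk may not find file metadata. Since that failure mode is not documented here, a hedged variant falls back to an empty key list:

    # inside the loop, replacing the get_pk call above
    try:
        key_properties = reader.get_pk(key)
    except Exception:
        key_properties = []  # no parquet metadata available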