GitHub Repository
Access the complete source code, contribute, or report issues through our GitHub repository. Star us to show support!
Getting started with gluestick
Key Features
- Robust ETL utilities for data processing
- Singer protocol integration
- Advanced JSON and object handling
- Snapshot management for incremental loads
- Production-ready error handling
File Reading Functions
Reader
Recommended: a class for reading sync-output data. Provides methods to read directories, get file metadata, and extract primary keys from Parquet files.
read_csv_folder
Reads multiple CSV files from a directory, organizing them by entity type based on filename. Supports custom converters and index columns per entity.
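To illustrate the idea, here is a minimal sketch of folder-based CSV reading (an assumed implementation, not gluestick's actual code; the `<entity>.csv` filename convention is also an assumption):

```python
# Sketch: read every CSV in a directory into a dict of DataFrames keyed
# by entity name, where the entity name is derived from the filename.
import os
import tempfile
import pandas as pd

def read_csv_folder_sketch(path):
    entities = {}
    for filename in os.listdir(path):
        if not filename.endswith(".csv"):
            continue
        # Assume files are named "<entity>.csv" (or "<entity>-<suffix>.csv")
        entity = filename[: -len(".csv")].split("-")[0]
        entities[entity] = pd.read_csv(os.path.join(path, filename))
    return entities

# Usage: build a throwaway sync-output directory with one "invoices" file
with tempfile.TemporaryDirectory() as folder:
    pd.DataFrame({"id": [1, 2], "total": [9.5, 12.0]}).to_csv(
        os.path.join(folder, "invoices.csv"), index=False
    )
    frames = read_csv_folder_sketch(folder)
    print(sorted(frames))  # ['invoices']
```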
read_parquet_folder
Similar to read_csv_folder but for Parquet files. Automatically organizes files by entity type and supports ignoring specific files.
get_catalog_schema
Retrieves a DataFrame schema from a Singer catalog.
Snapshot Management
snapshot_records
Manages data snapshots by updating existing snapshots or creating new ones. Supports type coercion and handles both CSV and Parquet formats.
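The core pattern can be sketched as follows (an assumption-laden stand-in for snapshot_records, using CSV for portability; the snapshot filename and `pk` parameter are illustrative, not gluestick's actual signature):

```python
# Sketch: merge incoming records into any existing snapshot, with new
# rows overwriting old ones that share the same primary key.
import os
import tempfile
import pandas as pd

def snapshot_records_sketch(stream_data, stream, snapshot_dir, pk="id"):
    path = os.path.join(snapshot_dir, f"{stream}.snapshot.csv")
    if os.path.exists(path):
        existing = pd.read_csv(path)
        merged = pd.concat([existing, stream_data], ignore_index=True)
        # Keep the most recent record for each primary key
        merged = merged.drop_duplicates(subset=[pk], keep="last")
    else:
        merged = stream_data
    merged.to_csv(path, index=False)
    return merged

with tempfile.TemporaryDirectory() as folder:
    first = pd.DataFrame({"id": [1, 2], "status": ["open", "open"]})
    snapshot_records_sketch(first, "tickets", folder)
    # Second sync: id 2 changed, id 3 is new
    second = pd.DataFrame({"id": [2, 3], "status": ["closed", "open"]})
    result = snapshot_records_sketch(second, "tickets", folder)
    print(result.sort_values("id").to_dict("records"))
```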
read_snapshots
Reads snapshot data for a specific stream from either Parquet or CSV format. Supports additional pandas read options.
drop_redundant
Removes duplicate rows based on content hashing. Maintains state of processed data and supports update tracking.
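A rough sketch of hash-based deduplication (the hashing scheme and state handling here are assumptions, not gluestick's internals):

```python
# Sketch: hash each row's content and keep only rows whose hash has not
# been seen in a previous run; return the updated state alongside.
import hashlib
import pandas as pd

def row_hash(row):
    # Stable hash of the row's values in column order
    payload = "|".join(str(v) for v in row)
    return hashlib.sha256(payload.encode()).hexdigest()

def drop_redundant_sketch(df, seen_hashes):
    hashes = df.apply(row_hash, axis=1)
    fresh = df[~hashes.isin(seen_hashes)]
    return fresh, seen_hashes | set(hashes)

state = set()
run1 = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
fresh1, state = drop_redundant_sketch(run1, state)
print(len(fresh1))  # 2 — everything is new on the first run

run2 = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
fresh2, state = drop_redundant_sketch(run2, state)
print(fresh2["id"].tolist())  # [3] — only the unseen row survives
```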
JSON & Object Handling
json_tuple_to_cols
Converts JSON tuple columns into separate columns based on key-value pairs.
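A pandas-based sketch of the effect (gluestick's actual parsing and column naming may differ; the `custom_fields` data is illustrative):

```python
# Sketch: parse a column of stringified JSON objects and spread each
# key-value pair into its own column, prefixed by the source column name.
import json
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "custom_fields": ['{"tier": "gold"}', '{"tier": "silver"}'],
})

parsed = df["custom_fields"].apply(json.loads)
cols = pd.json_normalize(parsed).add_prefix("custom_fields.")
result = pd.concat([df.drop(columns=["custom_fields"]), cols], axis=1)
print(result)
```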
explode_json_to_rows
Explodes array of objects into multiple rows with columns for each object key.
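The behavior can be approximated in plain pandas like this (a sketch only; gluestick's output column naming may differ):

```python
# Sketch: each array element becomes its own row, and the object keys
# become columns prefixed with the source column name.
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2],
    "lines": [
        [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}],
        [{"sku": "C", "qty": 5}],
    ],
})

exploded = df.explode("lines", ignore_index=True)
# Expand each object's keys into their own columns
keys = pd.json_normalize(exploded["lines"]).add_prefix("lines.")
result = pd.concat([exploded.drop(columns=["lines"]), keys], axis=1)
print(result)
```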
explode_json_to_cols
Converts JSON array columns into separate columns, with one column per array value.
compress_rows_to_col
Compresses exploded rows back into a single column with array data.
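An approximate inverse of the explode operation, sketched in pandas (column names and grouping key are assumptions):

```python
# Sketch: group by the parent key and gather each group's fields back
# into a single column holding a list of objects.
import pandas as pd

rows = pd.DataFrame({
    "order_id": [1, 1, 2],
    "sku": ["A", "B", "C"],
    "qty": [2, 1, 5],
})

compressed = (
    rows.groupby("order_id")[["sku", "qty"]]
    .apply(lambda g: g.to_dict("records"))
    .reset_index(name="lines")
)
print(compressed)
```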
array_to_dict_reducer
Converts arrays into dictionaries using specified key-value properties.
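A sketch of the reducer pattern (the `key_prop`/`value_prop` parameter names mirror the description but are assumptions about the actual signature):

```python
# Sketch: collapse a list of {key, value} objects into one flat
# dictionary via functools.reduce.
from functools import reduce

def array_to_dict_reducer_sketch(key_prop, value_prop):
    def reducer(acc, item):
        acc[item[key_prop]] = item[value_prop]
        return acc
    return reducer

properties = [
    {"name": "color", "value": "red"},
    {"name": "size", "value": "XL"},
]
flat = reduce(array_to_dict_reducer_sketch("name", "value"), properties, {})
print(flat)  # {'color': 'red', 'size': 'XL'}
```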
clean_obj_null_values
Replaces JSON null values with Python None in stringified objects so they can be safely parsed for further processing.
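A naive sketch of the idea (an assumed implementation; a plain string replace like this would also touch literal "null" substrings inside values, which a production version must guard against):

```python
# Sketch: swap JSON's null for Python's None so the string can be
# evaluated as a Python literal.
import ast

def clean_obj_null_values_sketch(obj_str):
    return obj_str.replace("null", "None")

raw = '{"id": 1, "closed_at": null}'
parsed = ast.literal_eval(clean_obj_null_values_sketch(raw))
print(parsed)  # {'id': 1, 'closed_at': None}
```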
Data Transformation
clean_convert
Recursively removes None values from lists and dictionaries. Handles nested structures and datetime conversions.
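In the spirit of the description, a recursive cleaner might look like this (the exact rules gluestick applies are assumptions):

```python
# Sketch: strip None entries from nested lists/dicts and serialize
# datetimes to ISO strings along the way.
from datetime import datetime

def clean_convert_sketch(obj):
    if isinstance(obj, dict):
        return {k: clean_convert_sketch(v) for k, v in obj.items() if v is not None}
    if isinstance(obj, list):
        return [clean_convert_sketch(v) for v in obj if v is not None]
    if isinstance(obj, datetime):
        return obj.isoformat()
    return obj

record = {
    "id": 7,
    "tags": ["etl", None, "singer"],
    "meta": {"updated_at": datetime(2024, 1, 2, 3, 4, 5), "note": None},
}
print(clean_convert_sketch(record))
```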
map_fields
Maps row values according to a specified mapping dictionary. Supports nested structures and conditional mapping.
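A sketch of field mapping with nested lookups (the mapping format shown is an assumption, not gluestick's actual schema):

```python
# Sketch: each output field names the input path it is drawn from, with
# dotted paths reaching into nested dictionaries.
def map_fields_sketch(row, mapping):
    def lookup(path):
        value = row
        for part in path.split("."):
            if not isinstance(value, dict):
                return None
            value = value.get(part)
        return value

    return {out_field: lookup(in_path) for out_field, in_path in mapping.items()}

row = {"id": 1, "contact": {"email": "a@b.co", "phone": None}}
mapping = {"customer_id": "id", "email": "contact.email"}
print(map_fields_sketch(row, mapping))  # {'customer_id': 1, 'email': 'a@b.co'}
```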
rename
Renames DataFrame columns according to a JSON mapping, with support for type conversion.
localize_datetime
Localizes datetime columns to UTC timezone. Handles both naive and timezone-aware timestamps.
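The naive-vs-aware handling can be sketched in pandas like this (an assumed implementation; gluestick's signature may differ):

```python
# Sketch: naive timestamps are localized to UTC, timezone-aware ones
# are converted to UTC.
import pandas as pd

def localize_datetime_sketch(df, column):
    series = pd.to_datetime(df[column])
    if series.dt.tz is None:
        df[column] = series.dt.tz_localize("UTC")  # naive -> assume UTC
    else:
        df[column] = series.dt.tz_convert("UTC")   # aware -> convert
    return df

df = pd.DataFrame({"created_at": ["2024-01-01 12:00:00", "2024-06-01 08:30:00"]})
df = localize_datetime_sketch(df, "created_at")
print(df["created_at"].dt.tz)  # UTC
```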
deep_convert_datetimes
Recursively converts all nested datetime values to ISO 8601 format.
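A possible traversal, sketched under the assumption that both dates and datetimes are converted:

```python
# Sketch: walk dicts and lists recursively, turning every date/datetime
# into its ISO 8601 string form.
from datetime import date, datetime

def deep_convert_datetimes_sketch(value):
    if isinstance(value, dict):
        return {k: deep_convert_datetimes_sketch(v) for k, v in value.items()}
    if isinstance(value, list):
        return [deep_convert_datetimes_sketch(v) for v in value]
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    return value

payload = {"events": [{"at": datetime(2024, 5, 1, 9, 0)}], "day": date(2024, 5, 1)}
print(deep_convert_datetimes_sketch(payload))
```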
exception
Standardized error handling with file logging. Creates consistent error reporting across ETL pipelines.
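One way such a handler could be structured (the function name, log format, and re-raise behavior are assumptions, not gluestick's actual implementation):

```python
# Sketch: log the failure with context to a file, then re-raise so the
# pipeline still fails loudly.
import logging
import tempfile

def handle_exception_sketch(exc, step, log_path):
    logger = logging.getLogger("etl")
    logger.propagate = False
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    try:
        logger.error("step=%s error=%s", step, exc)
    finally:
        logger.removeHandler(handler)
        handler.close()
    raise exc

with tempfile.NamedTemporaryFile(suffix=".log", delete=False) as tmp:
    log_file = tmp.name

try:
    handle_exception_sketch(ValueError("bad record"), "transform", log_file)
except ValueError:
    pass
print(open(log_file).read().strip())
```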