Introduction to Gluestick
A Python library for efficient ETL processes, optimized for hotglue
Gluestick is an open-source ETL toolkit developed and maintained by hotglue, optimized for use in hotglue pre-processing scripts.
The code is available on GitHub and is free to use under the MIT license.
GitHub Repository
Access the complete source code, contribute, or report issues through our GitHub repository.
Key Features
- Robust ETL utilities for data processing
- Singer protocol integration
- Advanced JSON and object handling
- Snapshot management for incremental loads
- Production-ready error handling
File Reading Functions
Reader
RECOMMENDED: Class for reading sync-output data. Provides methods to read directories, get file metadata, and extract primary keys from Parquet files.
read_csv_folder
Reads multiple CSV files from a directory, organizing them by entity type based on filename. Supports custom converters and index columns per entity.
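The pattern can be sketched with plain pandas: load every CSV in a directory into a dict keyed by entity name (derived from the filename). `load_csv_folder` here is a hypothetical stand-in for illustration, not gluestick's exact signature.

```python
import os
import tempfile

import pandas as pd

def load_csv_folder(path):
    """Hypothetical stand-in: load every CSV in `path` into a dict
    keyed by entity name (the filename without extension)."""
    frames = {}
    for name in os.listdir(path):
        if name.endswith(".csv"):
            entity = os.path.splitext(name)[0]
            frames[entity] = pd.read_csv(os.path.join(path, name))
    return frames

# Example: two entity files in a sync-output directory
with tempfile.TemporaryDirectory() as folder:
    pd.DataFrame({"id": [1, 2]}).to_csv(os.path.join(folder, "contacts.csv"), index=False)
    pd.DataFrame({"id": [3]}).to_csv(os.path.join(folder, "invoices.csv"), index=False)
    data = load_csv_folder(folder)

print(sorted(data))  # → ['contacts', 'invoices']
```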
read_parquet_folder
Similar to read_csv_folder but for Parquet files. Automatically organizes files by entity type and supports ignoring specific files.
get_catalog_schema
Retrieves DataFrame schema from Singer catalog.
Snapshot Management
snapshot_records
Manages data snapshots by updating existing snapshots or creating new ones. Supports type coercion and handles both CSV and Parquet formats.
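The core of a snapshot update can be sketched as overlaying new records on the existing snapshot and keeping the newest row per primary key. `update_snapshot` is an illustrative helper under those assumptions, not the library's implementation.

```python
import pandas as pd

def update_snapshot(snapshot, records, key):
    """Hypothetical sketch: overlay new `records` on an existing
    `snapshot`, keeping the newest row for each primary key."""
    combined = pd.concat([snapshot, records], ignore_index=True)
    return combined.drop_duplicates(subset=key, keep="last").reset_index(drop=True)

snapshot = pd.DataFrame({"id": [1, 2], "status": ["old", "old"]})
records = pd.DataFrame({"id": [2, 3], "status": ["new", "new"]})
merged = update_snapshot(snapshot, records, "id")
print(merged["status"].tolist())  # → ['old', 'new', 'new']
```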
read_snapshots
Reads snapshot data for a specific stream from either Parquet or CSV format. Supports additional pandas read options.
drop_redundant
Removes duplicate rows based on content hashing. Maintains state of processed data and supports update tracking.
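The content-hashing idea can be sketched like this: hash each row's values so previously seen rows can be dropped. This is an illustration of the technique, not gluestick's code.

```python
import hashlib

import pandas as pd

def row_hashes(df):
    """Hash each row's content so unchanged rows can be recognized
    across runs (a sketch of the hashing idea, not the library code)."""
    return df.astype(str).apply(
        lambda row: hashlib.sha256("|".join(row).encode()).hexdigest(), axis=1
    )

df = pd.DataFrame({"id": [1, 1, 2], "value": ["a", "a", "b"]})
deduped = df[~row_hashes(df).duplicated()]
print(len(deduped))  # → 2
```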
JSON & Object Handling
json_tuple_to_cols
Converts JSON tuple columns into separate columns based on key-value pairs.
explode_json_to_rows
Explodes array of objects into multiple rows with columns for each object key.
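The operation can be sketched with pandas directly: explode the array column into one row per object, then promote the object keys to columns. The column names here are illustrative, not the library's exact signature.

```python
import pandas as pd

# Sketch: explode an array-of-objects column into one row per object,
# with each object key becoming a column.
df = pd.DataFrame({
    "order_id": [1, 2],
    "lines": [
        [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}],
        [{"sku": "C", "qty": 5}],
    ],
})

exploded = df.explode("lines", ignore_index=True)
flat = pd.concat(
    [exploded.drop(columns="lines"), pd.json_normalize(exploded["lines"])],
    axis=1,
)
print(flat["sku"].tolist())  # → ['A', 'B', 'C']
```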
explode_json_to_cols
Converts JSON array columns into separate columns, with one column per array value.
compress_rows_to_col
Compresses exploded rows back into a single column with array data.
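A sketch of the inverse of the explode above: collapse one-row-per-object detail rows back into a single array column per key. The grouping and column names are assumptions for illustration.

```python
import pandas as pd

# Sketch: pack detail columns back into a list-of-objects column,
# one row per order_id (illustrative, not the library's signature).
flat = pd.DataFrame({
    "order_id": [1, 1, 2],
    "sku": ["A", "B", "C"],
    "qty": [2, 1, 5],
})

packed = (
    flat.assign(line=flat[["sku", "qty"]].to_dict("records"))
    .groupby("order_id")["line"]
    .agg(list)
    .reset_index(name="lines")
)
print(packed["lines"].iloc[0])  # → [{'sku': 'A', 'qty': 2}, {'sku': 'B', 'qty': 1}]
```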
array_to_dict_reducer
Converts arrays into dictionaries using specified key-value properties.
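The reduction can be sketched in plain Python: fold a list of objects into one dict, using one property as the key and another as the value. `array_to_dict` is a hypothetical helper mirroring the described behavior.

```python
from functools import reduce

def array_to_dict(items, key_prop, value_prop):
    """Sketch: reduce a list of objects into a single dict using
    one property as the key and another as the value (hypothetical
    helper, mirroring the described behavior)."""
    return reduce(
        lambda acc, item: {**acc, item[key_prop]: item[value_prop]}, items, {}
    )

attrs = [
    {"name": "color", "value": "red"},
    {"name": "size", "value": "XL"},
]
print(array_to_dict(attrs, "name", "value"))  # → {'color': 'red', 'size': 'XL'}
```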
clean_obj_null_values
Replaces JSON null tokens with Python None in stringified objects so they can be parsed for further processing.
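The idea can be sketched with a naive string replacement: rewrite JSON-style `null` tokens as Python `None` so a stringified object can be parsed with `ast.literal_eval`. `clean_null_tokens` is a hypothetical helper name; a bare `replace` like this would also touch the substring "null" inside string values, so it is only a sketch.

```python
import ast

def clean_null_tokens(obj_str):
    """Sketch: rewrite JSON-style null tokens as Python None so a
    stringified object can be parsed with ast.literal_eval
    (naive illustration -- would also rewrite 'null' inside strings)."""
    return obj_str.replace("null", "None") if isinstance(obj_str, str) else obj_str

raw = '{"id": 1, "email": null}'
parsed = ast.literal_eval(clean_null_tokens(raw))
print(parsed["email"])  # → None
```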
Data Transformation
clean_convert
Recursively cleans None values from lists and dictionaries. Handles nested structures and datetime conversions.
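The recursion can be sketched as follows: walk nested dicts and lists, drop None entries, and render datetimes as ISO strings. This mirrors the description above, not the library's exact behavior.

```python
from datetime import datetime

def clean_convert(value):
    """Sketch of recursively dropping None entries and converting
    datetimes to ISO strings in nested lists/dicts (illustrative,
    based on the description above)."""
    if isinstance(value, dict):
        return {k: clean_convert(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [clean_convert(v) for v in value if v is not None]
    if isinstance(value, datetime):
        return value.isoformat()
    return value

record = {"id": 1, "note": None, "tags": ["a", None], "ts": datetime(2024, 1, 1)}
print(clean_convert(record))  # → {'id': 1, 'tags': ['a'], 'ts': '2024-01-01T00:00:00'}
```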
map_fields
Maps row values according to a specified mapping dictionary. Supports nested structures and conditional mapping.
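A sketch of the mapping pattern, assuming a mapping of output field to a dotted path into the row (the mapping shape here is a hypothetical illustration, not gluestick's documented format):

```python
def map_fields(row, mapping):
    """Sketch: build an output record by pulling (possibly nested)
    values out of `row` according to a mapping of
    output_field -> dotted path (hypothetical mapping shape)."""
    out = {}
    for target, path in mapping.items():
        value = row
        for part in path.split("."):
            value = value.get(part) if isinstance(value, dict) else None
        if value is not None:
            out[target] = value
    return out

row = {"name": "Acme", "address": {"city": "Berlin", "zip": "10115"}}
mapping = {"company": "name", "city": "address.city"}
print(map_fields(row, mapping))  # → {'company': 'Acme', 'city': 'Berlin'}
```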
rename
Renames DataFrame columns using JSON format with type conversion support.
localize_datetime
Localizes datetime columns to UTC timezone. Handles both naive and timezone-aware timestamps.
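The naive-vs-aware handling can be sketched with pandas: localize naive timestamps to UTC, convert timezone-aware ones. `to_utc` is an illustrative helper, not the library's signature.

```python
import pandas as pd

def to_utc(series):
    """Sketch: naive timestamps are localized to UTC,
    tz-aware ones are converted to UTC."""
    series = pd.to_datetime(series)
    if series.dt.tz is None:
        return series.dt.tz_localize("UTC")
    return series.dt.tz_convert("UTC")

naive = pd.Series(["2024-01-01 12:00:00"])
aware = pd.Series([pd.Timestamp("2024-01-01 14:00:00+02:00")])
print(to_utc(naive).iloc[0].isoformat())  # → 2024-01-01T12:00:00+00:00
print(to_utc(aware).iloc[0].isoformat())  # → 2024-01-01T12:00:00+00:00
```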
deep_convert_datetimes
Transforms all nested datetimes to ISO format recursively.
exception
Standardized error handling with file logging. Creates consistent error reporting across ETL pipelines.
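The pattern can be sketched as a wrapper that logs any failure to a file before re-raising, so every pipeline step reports errors the same way. `run_step` and the log location are assumptions for illustration, not gluestick's API.

```python
import logging
import os
import tempfile

# Sketch: standardized error handling with file logging around an ETL
# step (illustrative pattern; `run_step` and the log path are assumed).
log_path = os.path.join(tempfile.mkdtemp(), "etl_errors.log")
logging.basicConfig(filename=log_path, level=logging.ERROR, force=True)

def run_step(step, *args):
    """Run one pipeline step, logging any failure before re-raising."""
    try:
        return step(*args)
    except Exception:
        logging.exception("ETL step '%s' failed", step.__name__)
        raise

def parse_amount(raw):
    return float(raw)

try:
    run_step(parse_amount, "not-a-number")
except ValueError:
    pass

print("parse_amount" in open(log_path).read())  # → True
```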