Introduction to Gluestick
A Python library for efficient ETL processes, optimized for hotglue
Gluestick is an open-source ETL toolkit developed and maintained by hotglue, optimized for use in hotglue pre-processing scripts.
The code is available on GitHub and is free to use under the MIT license.
GitHub Repository
Access the complete source code, contribute, or report issues through our GitHub repository.
Key Features
- Robust ETL utilities for data processing
- Singer protocol integration
- Advanced JSON and object handling
- Snapshot management for incremental loads
- Production-ready error handling
File Reading Functions
Reader
RECOMMENDED: Class for reading sync-output data. Provides methods to read directories, get file metadata, and extract primary keys from Parquet files.
read_csv_folder
Reads multiple CSV files from a directory, organizing them by entity type based on filename. Supports custom converters and index columns per entity.
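The pattern can be sketched with plain pandas: load every CSV in a directory into a dict keyed by entity name (derived from the filename). `load_csv_folder` here is a hypothetical stand-in for illustration, not gluestick's exact signature.

```python
import os
import tempfile

import pandas as pd

def load_csv_folder(path):
    """Hypothetical stand-in: load every CSV in `path` into a dict
    keyed by entity name (the filename without extension)."""
    frames = {}
    for name in os.listdir(path):
        if name.endswith(".csv"):
            entity = os.path.splitext(name)[0]
            frames[entity] = pd.read_csv(os.path.join(path, name))
    return frames

# Example: two entity files in a sync-output directory
with tempfile.TemporaryDirectory() as folder:
    pd.DataFrame({"id": [1, 2]}).to_csv(os.path.join(folder, "contacts.csv"), index=False)
    pd.DataFrame({"id": [3]}).to_csv(os.path.join(folder, "invoices.csv"), index=False)
    data = load_csv_folder(folder)

print(sorted(data))  # → ['contacts', 'invoices']
```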
read_parquet_folder
Similar to read_csv_folder but for Parquet files. Automatically organizes files by entity type and supports ignoring specific files.
get_catalog_schema
Retrieves DataFrame schema from Singer catalog.
Snapshot Management
snapshot_records
Manages data snapshots by updating existing snapshots or creating new ones. Supports type coercion and handles both CSV and Parquet formats.
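The core of a snapshot update can be sketched as overlaying new records on the existing snapshot and keeping the newest row per primary key. `update_snapshot` is an illustrative helper under those assumptions, not the library's implementation.

```python
import pandas as pd

def update_snapshot(snapshot, records, key):
    """Hypothetical sketch: overlay new `records` on an existing
    `snapshot`, keeping the newest row for each primary key."""
    combined = pd.concat([snapshot, records], ignore_index=True)
    return combined.drop_duplicates(subset=key, keep="last").reset_index(drop=True)

snapshot = pd.DataFrame({"id": [1, 2], "status": ["old", "old"]})
records = pd.DataFrame({"id": [2, 3], "status": ["new", "new"]})
merged = update_snapshot(snapshot, records, "id")
print(merged["status"].tolist())  # → ['old', 'new', 'new']
```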
read_snapshots
Reads snapshot data for a specific stream from either Parquet or CSV format. Supports additional pandas read options.
drop_redundant
Removes duplicate rows based on content hashing. Maintains state of processed data and supports update tracking.
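The content-hashing idea can be sketched like this: hash each row's values so previously seen rows can be dropped. This is an illustration of the technique, not gluestick's code.

```python
import hashlib

import pandas as pd

def row_hashes(df):
    """Hash each row's content so unchanged rows can be recognized
    across runs (a sketch of the hashing idea, not the library code)."""
    return df.astype(str).apply(
        lambda row: hashlib.sha256("|".join(row).encode()).hexdigest(), axis=1
    )

df = pd.DataFrame({"id": [1, 1, 2], "value": ["a", "a", "b"]})
deduped = df[~row_hashes(df).duplicated()]
print(len(deduped))  # → 2
```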
JSON & Object Handling
json_tuple_to_cols
Converts JSON tuple columns into separate columns based on key-value pairs.
explode_json_to_rows
Explodes array of objects into multiple rows with columns for each object key.
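The operation can be sketched with pandas directly: explode the array column into one row per object, then promote the object keys to columns. The column names here are illustrative, not the library's exact signature.

```python
import pandas as pd

# Sketch: explode an array-of-objects column into one row per object,
# with each object key becoming a column.
df = pd.DataFrame({
    "order_id": [1, 2],
    "lines": [
        [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}],
        [{"sku": "C", "qty": 5}],
    ],
})

exploded = df.explode("lines", ignore_index=True)
flat = pd.concat(
    [exploded.drop(columns="lines"), pd.json_normalize(exploded["lines"])],
    axis=1,
)
print(flat["sku"].tolist())  # → ['A', 'B', 'C']
```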
explode_json_to_cols
Converts JSON array columns into separate columns, with one column per array value.
compress_rows_to_col
Compresses exploded rows back into a single column with array data.
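A sketch of the inverse of the explode above: collapse one-row-per-object detail rows back into a single array column per key. The grouping and column names are assumptions for illustration.

```python
import pandas as pd

# Sketch: pack detail columns back into a list-of-objects column,
# one row per order_id (illustrative, not the library's signature).
flat = pd.DataFrame({
    "order_id": [1, 1, 2],
    "sku": ["A", "B", "C"],
    "qty": [2, 1, 5],
})

packed = (
    flat.assign(line=flat[["sku", "qty"]].to_dict("records"))
    .groupby("order_id")["line"]
    .agg(list)
    .reset_index(name="lines")
)
print(packed["lines"].iloc[0])  # → [{'sku': 'A', 'qty': 2}, {'sku': 'B', 'qty': 1}]
```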
array_to_dict_reducer
Converts arrays into dictionaries using specified key-value properties.
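The reduction can be sketched in plain Python: fold a list of objects into one dict, using one property as the key and another as the value. `array_to_dict` is a hypothetical helper mirroring the described behavior.

```python
from functools import reduce

def array_to_dict(items, key_prop, value_prop):
    """Sketch: reduce a list of objects into a single dict using
    one property as the key and another as the value (hypothetical
    helper, mirroring the described behavior)."""
    return reduce(
        lambda acc, item: {**acc, item[key_prop]: item[value_prop]}, items, {}
    )

attrs = [
    {"name": "color", "value": "red"},
    {"name": "size", "value": "XL"},
]
print(array_to_dict(attrs, "name", "value"))  # → {'color': 'red', 'size': 'XL'}
```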
clean_obj_null_values
Replaces JSON null tokens with Python None in stringified objects so they can be parsed for further processing.
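The idea can be sketched with a naive string replacement: rewrite JSON-style `null` tokens as Python `None` so a stringified object can be parsed with `ast.literal_eval`. `clean_null_tokens` is a hypothetical helper name; a bare `replace` like this would also touch the substring "null" inside string values, so it is only a sketch.

```python
import ast

def clean_null_tokens(obj_str):
    """Sketch: rewrite JSON-style null tokens as Python None so a
    stringified object can be parsed with ast.literal_eval
    (naive illustration -- would also rewrite 'null' inside strings)."""
    return obj_str.replace("null", "None") if isinstance(obj_str, str) else obj_str

raw = '{"id": 1, "email": null}'
parsed = ast.literal_eval(clean_null_tokens(raw))
print(parsed["email"])  # → None
```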
Data Transformation
clean_convert
Recursively cleans None values from lists and dictionaries. Handles nested structures and datetime conversions.
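The recursion can be sketched as follows: walk nested dicts and lists, drop None entries, and render datetimes as ISO strings. This mirrors the description above, not the library's exact behavior.

```python
from datetime import datetime

def clean_convert(value):
    """Sketch of recursively dropping None entries and converting
    datetimes to ISO strings in nested lists/dicts (illustrative,
    based on the description above)."""
    if isinstance(value, dict):
        return {k: clean_convert(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [clean_convert(v) for v in value if v is not None]
    if isinstance(value, datetime):
        return value.isoformat()
    return value

record = {"id": 1, "note": None, "tags": ["a", None], "ts": datetime(2024, 1, 1)}
print(clean_convert(record))  # → {'id': 1, 'tags': ['a'], 'ts': '2024-01-01T00:00:00'}
```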
map_fields
Maps row values according to a specified mapping dictionary. Supports nested structures and conditional mapping.
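A sketch of the mapping pattern, assuming a mapping of output field to a dotted path into the row (the mapping shape here is a hypothetical illustration, not gluestick's documented format):

```python
def map_fields(row, mapping):
    """Sketch: build an output record by pulling (possibly nested)
    values out of `row` according to a mapping of
    output_field -> dotted path (hypothetical mapping shape)."""
    out = {}
    for target, path in mapping.items():
        value = row
        for part in path.split("."):
            value = value.get(part) if isinstance(value, dict) else None
        if value is not None:
            out[target] = value
    return out

row = {"name": "Acme", "address": {"city": "Berlin", "zip": "10115"}}
mapping = {"company": "name", "city": "address.city"}
print(map_fields(row, mapping))  # → {'company': 'Acme', 'city': 'Berlin'}
```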
rename
Renames DataFrame columns using JSON format with type conversion support.
localize_datetime
Localizes datetime columns to UTC timezone. Handles both naive and timezone-aware timestamps.
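The naive-vs-aware handling can be sketched with pandas: localize naive timestamps to UTC, convert timezone-aware ones. `to_utc` is an illustrative helper, not the library's signature.

```python
import pandas as pd

def to_utc(series):
    """Sketch: naive timestamps are localized to UTC,
    tz-aware ones are converted to UTC."""
    series = pd.to_datetime(series)
    if series.dt.tz is None:
        return series.dt.tz_localize("UTC")
    return series.dt.tz_convert("UTC")

naive = pd.Series(["2024-01-01 12:00:00"])
aware = pd.Series([pd.Timestamp("2024-01-01 14:00:00+02:00")])
print(to_utc(naive).iloc[0].isoformat())  # → 2024-01-01T12:00:00+00:00
print(to_utc(aware).iloc[0].isoformat())  # → 2024-01-01T12:00:00+00:00
```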
deep_convert_datetimes
Transforms all nested datetimes to ISO format recursively.
exception
Standardized error handling with file logging. Creates consistent error reporting across ETL pipelines.
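The pattern can be sketched as a wrapper that logs any failure to a file before re-raising, so every pipeline step reports errors the same way. `run_step` and the log location are assumptions for illustration, not gluestick's API.

```python
import logging
import os
import tempfile

# Sketch: standardized error handling with file logging around an ETL
# step (illustrative pattern; `run_step` and the log path are assumed).
log_path = os.path.join(tempfile.mkdtemp(), "etl_errors.log")
logging.basicConfig(filename=log_path, level=logging.ERROR, force=True)

def run_step(step, *args):
    """Run one pipeline step, logging any failure before re-raising."""
    try:
        return step(*args)
    except Exception:
        logging.exception("ETL step '%s' failed", step.__name__)
        raise

def parse_amount(raw):
    return float(raw)

try:
    run_step(parse_amount, "not-a-number")
except ValueError:
    pass

print("parse_amount" in open(log_path).read())  # → True
```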