Gluestick is an open source ETL toolkit developed and maintained by hotglue. It is optimized for use in hotglue pre-processing scripts. The code is available on GitHub and is free to use under the MIT license.

GitHub Repository

Access the complete source code, contribute, or report issues through our GitHub repository. Star us to show support!

Getting started with gluestick

# Install from PyPI
pip install gluestick

# Import utilities
import gluestick as gs

Key Features

  • Robust ETL utilities for data processing
  • Singer protocol integration
  • Advanced JSON and object handling
  • Snapshot management for incremental loads
  • Production-ready error handling

File Reading Functions

Reader

Recommended: a class for reading sync-output data. Provides methods to read directories, inspect file metadata, and extract primary keys from Parquet files.

read_csv_folder

Reads multiple CSV files from a directory, organizing them by entity type based on filename. Supports custom converters and index columns per entity.
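The organize-by-filename behavior can be sketched in plain pandas. This is an illustrative reimplementation, not gluestick's actual code; it assumes filenames like `invoices.csv` (or `invoices-<timestamp>.csv`) identify the entity.

```python
import os
import tempfile

import pandas as pd

def read_csv_folder_sketch(path):
    """Read every CSV in `path`, keyed by entity name taken from the filename."""
    entities = {}
    for name in os.listdir(path):
        if not name.endswith(".csv"):
            continue
        # "invoices-20240101.csv" -> entity "invoices"
        entity = name[:-4].split("-")[0]
        entities[entity] = pd.read_csv(os.path.join(path, name))
    return entities

# Usage: two files become two entries keyed by entity name.
tmp = tempfile.mkdtemp()
pd.DataFrame({"id": [1]}).to_csv(os.path.join(tmp, "invoices.csv"), index=False)
pd.DataFrame({"id": [2]}).to_csv(os.path.join(tmp, "contacts.csv"), index=False)
data = read_csv_folder_sketch(tmp)
```

The real function additionally supports per-entity converters and index columns.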

read_parquet_folder

Similar to read_csv_folder but for Parquet files. Automatically organizes files by entity type and supports ignoring specific files.

get_catalog_schema

Retrieves the DataFrame schema from a Singer catalog.

Snapshot Management

snapshot_records

Manages data snapshots by updating existing snapshots or creating new ones. Supports type coercion and handles both CSV and Parquet formats.
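The core merge step behind snapshotting can be sketched as: concatenate new records onto the stored snapshot and keep the latest row per primary key. This is a minimal sketch; the real `snapshot_records` also handles CSV/Parquet persistence and type coercion.

```python
import pandas as pd

def snapshot_merge_sketch(snapshot, new_records, key):
    """Merge new records into a snapshot; new rows overwrite old ones by key."""
    if snapshot is None:
        return new_records.copy()
    combined = pd.concat([snapshot, new_records], ignore_index=True)
    # keep="last" so incoming records win over the stored snapshot
    return combined.drop_duplicates(subset=[key], keep="last").reset_index(drop=True)

prev = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
new = pd.DataFrame({"id": [2, 3], "name": ["b2", "c"]})
snap = snapshot_merge_sketch(prev, new, "id")
```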

read_snapshots

Reads snapshot data for a specific stream from either Parquet or CSV format. Supports additional pandas read options.

drop_redundant

Removes duplicate rows based on content hashing. Maintains state of processed data and supports update tracking.
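Content-hash deduplication can be sketched with pandas' row hashing: hash each row, drop rows whose hash was already seen, and carry the seen-hash set forward as state. This is an assumption-laden sketch of the idea, not gluestick's implementation.

```python
import pandas as pd
from pandas.util import hash_pandas_object

def drop_redundant_sketch(df, seen_hashes):
    """Return rows not seen before, plus the updated hash state."""
    hashes = hash_pandas_object(df, index=False)  # one uint64 per row's content
    fresh = df[~hashes.isin(seen_hashes)]
    return fresh, seen_hashes | set(hashes)

df = pd.DataFrame({"id": [1, 2], "v": ["a", "b"]})
fresh, state = drop_redundant_sketch(df, set())   # first run: everything is new
again, state = drop_redundant_sketch(df, state)   # second run: all redundant
```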

JSON & Object Handling

json_tuple_to_cols

Converts JSON tuple columns into separate columns based on key-value pairs.

explode_json_to_rows

Explodes array of objects into multiple rows with columns for each object key.
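The transformation can be approximated in plain pandas with `explode` plus `json_normalize`. Column names and the `lines.` prefix here are hypothetical, chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1],
    "lines": [[{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]],
})

# One row per array element, then one column per object key.
exploded = df.explode("lines").reset_index(drop=True)
keys = pd.json_normalize(exploded["lines"].tolist()).add_prefix("lines.")
result = pd.concat([exploded.drop(columns="lines"), keys], axis=1)
```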

explode_json_to_cols

Converts JSON array columns into separate columns, with one column per array value.

compress_rows_to_col

Compresses exploded rows back into a single column with array data.

array_to_dict_reducer

Converts arrays into dictionaries using specified key-value properties.
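The reducer pattern pairs with `functools.reduce`: given the names of the key and value properties, fold an array of objects into one flat dict. A sketch, assuming `"name"`/`"value"` as the hypothetical key and value properties.

```python
from functools import reduce

def array_to_dict_reducer_sketch(key_prop, value_prop):
    """Return a reducer that accumulates {item[key_prop]: item[value_prop]}."""
    def reducer(acc, item):
        acc[item[key_prop]] = item[value_prop]
        return acc
    return reducer

attrs = [{"name": "color", "value": "red"}, {"name": "size", "value": "L"}]
result = reduce(array_to_dict_reducer_sketch("name", "value"), attrs, {})
```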

clean_obj_null_values

Replaces null values with None in stringified objects for further processing.

Data Transformation

clean_convert

Recursively cleans None values from lists and dictionaries. Handles nested structures and datetime conversions.

map_fields

Maps row values according to a specified mapping dictionary. Supports nested structures and conditional mapping.
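In its simplest form, field mapping pulls each target column's value from a source key in the row. The mapping shape below is hypothetical; the real function also supports nested structures and conditional mapping.

```python
# {"target_field": "source_field"} mapping applied to one row (illustrative shape).
mapping = {"Name": "first_name", "Email": "email"}
row = {"first_name": "Ada", "email": "ada@example.com", "extra": "dropped"}

mapped = {target: row.get(source) for target, source in mapping.items()}
```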

rename

Renames DataFrame columns using a JSON mapping, with optional type conversion.

localize_datetime

Localizes datetime columns to UTC timezone. Handles both naive and timezone-aware timestamps.
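The naive-vs-aware distinction maps onto pandas' `tz_localize` and `tz_convert`, sketched here on a single Series.

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2024-01-01 12:00:00"]))

# Naive timestamps get localized to UTC; tz-aware ones get converted instead.
utc = s.dt.tz_localize("UTC") if s.dt.tz is None else s.dt.tz_convert("UTC")
```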

deep_convert_datetimes

Transforms all nested datetimes to ISO format recursively.
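The recursion pattern can be sketched as: walk dicts and lists, and replace any `datetime` leaf with its ISO-8601 string. An illustrative reimplementation, not gluestick's code.

```python
from datetime import datetime

def deep_convert_sketch(value):
    """Recursively turn datetimes into ISO-8601 strings."""
    if isinstance(value, dict):
        return {k: deep_convert_sketch(v) for k, v in value.items()}
    if isinstance(value, list):
        return [deep_convert_sketch(v) for v in value]
    if isinstance(value, datetime):
        return value.isoformat()
    return value

payload = {"created": datetime(2024, 1, 1, 12, 0),
           "items": [{"at": datetime(2024, 1, 2)}]}
converted = deep_convert_sketch(payload)
```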

Data Export & Error Handling

exception

Standardized error handling with file logging. Creates consistent error reporting across ETL pipelines.

to_export

Exports data to various formats (Singer, Parquet, JSON, JSONL, CSV). Supports schema validation, object stringification, and custom formatting.

to_singer

Exports DataFrame to Singer format with schema validation and type handling.
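The Singer output shape is line-delimited JSON: a SCHEMA message describing the stream, followed by RECORD messages. The sketch below hand-builds two such messages to show the target format; the real `to_singer` infers the schema from the DataFrame.

```python
import json

schema_msg = {
    "type": "SCHEMA",
    "stream": "contacts",
    "schema": {"properties": {"id": {"type": "integer"}}},
    "key_properties": ["id"],
}
record_msg = {"type": "RECORD", "stream": "contacts", "record": {"id": 1}}

# One JSON object per line, schema first.
lines = [json.dumps(schema_msg), json.dumps(record_msg)]
```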